
A 24-billion-parameter text-to-image model! Playground v3 Released: Graphic Design Capabilities Surpass Humans

Editor: LRS

Playground Research has launched a new-generation text-to-image model, PGv3, with 24 billion parameters. Built around a deeply integrated large language model, it even surpasses human designers at graphic design and at following text prompt instructions, while supporting accurate RGB color control and multilingual understanding.

Since last year, text-to-image generation models have made tremendous progress, and model architectures have gradually shifted from traditional UNet-based designs to Transformer-based ones.

Playground Research recently published a paper detailing the team's newest DiT-based diffusion model, Playground v3 (PGv3 for short), which scales the number of model parameters to 24 billion, achieves state-of-the-art performance across multiple test benchmarks, and is better at graphic design.


Link to paper: https://arxiv.org/abs/2409.10695

Data link: https://huggingface.co/datasets/playgroundai/CapsBench

Unlike traditional text-to-image generation models that rely on pre-trained language models such as T5 or CLIP text encoders, PGv3 is fully integrated with large language models (LLMs) and is based on a new Deep-Fusion architecture that leverages the knowledge of decoder-only large language models for text-to-image generation tasks.

In addition, in order to improve the quality of image descriptions, the researchers developed an in-house captioner capable of generating descriptions with different levels of detail, enriching the diversity of text structures, and introduced a new benchmark, CapsBench, to evaluate the performance of detailed image descriptions.

Experimental results show that PGv3 performs well in text prompt following, complex reasoning, and text rendering accuracy. User preference studies show that PGv3 has superior graphic design capability in common design applications such as stickers, posters, and logos, along with precise RGB color control and multilingual understanding.

PGv3 model architecture

Playground v3 (PGv3) is a latent diffusion model (LDM) trained with the EDM formulation. Like DALL-E 3, Imagen 2, and Stable Diffusion 3, PGv3 is designed for text-to-image (t2i) generation tasks.

PGv3 is fully integrated with a large language model (Llama3-8B) to enhance its ability to comprehend and follow prompts.

Text encoders

Each layer of a Transformer captures a different representation, containing different amounts of word-level and sentence-level information. Standard practice is to use the last-layer output of a T5 or CLIP text encoder, or to additionally combine the output of the penultimate layer. However, the researchers found that picking the best layer for conditioning a text-to-image model is cumbersome, especially with decoder-only large language models, whose internal representations are more complex.


The researchers argue that the continuous flow of information through every layer of the LLM is the key to its generative ability: the knowledge in an LLM spans all of its layers rather than being encapsulated in any single layer's output. PGv3 is therefore designed to replicate all of the LLM's Transformer blocks and take the hidden-embedding output of each corresponding LLM layer as conditioning.

This approach takes full advantage of the LLM's complete "thought process" and encourages the image model to mimic the LLM's reasoning and generation process, so that generated images follow prompts more faithfully and consistently.
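To make the idea concrete, here is a minimal sketch of extracting per-layer hidden states from a decoder-only LLM so that each image-side block could be conditioned on its corresponding LLM layer. It assumes the Hugging Face transformers API and is an illustration, not the PGv3 implementation.

```python
# Minimal sketch: expose the hidden states of every layer of a frozen decoder-only LLM.
# Assumes the Hugging Face `transformers` API; this is not the PGv3 implementation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Llama3-8B is a gated checkpoint; any decoder-only causal LM works for this illustration.
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

prompt = "a red bicycle leaning against a brick wall at sunset"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = llm(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per Transformer layer,
# each of shape (batch, seq_len, hidden_dim). Drop the embedding entry.
per_layer_text = out.hidden_states[1:]
print(len(per_layer_text), per_layer_text[0].shape)  # 32 layers for an 8B Llama3 model

# Conceptually, image-model block i would attend to per_layer_text[i]
# instead of a single last-layer text embedding.
```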

Model structure

PGv3 adopts a DiT-style architecture. Each Transformer block in the image model is configured identically to the corresponding block in the language model (Llama3-8B): a single attention layer followed by a single feed-forward layer, with the same hidden dimension, number of attention heads, and per-head dimension. Only the image-model portion is trained.

During diffusion sampling, the language model part only needs to be run once to generate all intermediate hidden embeddings.
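In pseudocode, the sampling loop then looks roughly like the sketch below; the denoising update and the image-model interface are placeholders, not PGv3 code.

```python
# Sketch: the LLM forward pass happens once per prompt; its per-layer hidden states
# are cached and reused at every denoising step. All names are illustrative.
import torch

def sample(image_model, llm, tokenizer, prompt, num_steps=50, latent_shape=(1, 16, 64, 64)):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # One text pass: cache one hidden-state tensor per LLM layer.
        text_states = llm(**inputs, output_hidden_states=True).hidden_states[1:]

    latents = torch.randn(latent_shape)
    for step in range(num_steps):
        # The LLM is NOT re-run here; each image block reads its cached layer.
        noise_pred = image_model(latents, step=step, text_states=text_states)
        latents = latents - noise_pred / num_steps  # placeholder update, not a real EDM step
    return latents
```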

Unlike most traditional CNN-based diffusion models, which compute self-attention over image features and cross-attention between image and text features separately, PGv3's Transformer performs a single joint attention operation, extracting relevant features from a combined pool of image and text keys and values (see the sketch after this list); this reduces computational cost and inference time. Several additional design choices proved useful for performance:

1. U-Net skip connections between Transformer blocks.

2. Token downsampling in the middle layers: at layer 32, the sequence length of the image keys and values is reduced by a factor of four, making the whole network resemble a traditional convolutional U-Net with a single downsampling stage. This slightly speeds up training and inference with no performance degradation.

3. Position embeddings: the same Rotary Position Embedding (RoPE) as in Llama3. Since images are two-dimensional features, the researchers explored 2D variants of RoPE:

The "interpolating-PE" method interpolates the position ID in the middle regardless of the sequence length, keeping the start and end position IDs fixed, but this method is severely overfitting at the training resolution and cannot generalize to unseen aspect ratios.

In contrast, the "expand-PE" method increased the position ID proportionally to the length of the sequence, performed well without any tricks or normalization, and showed no signs of resolution overfitting.

New VAE

The variational autoencoder (VAE) of the latent diffusion model (LDM) is important for determining the fine-grained image quality upper limit of the model.

The researchers increased the number of latent channels in the VAE from 4 to 16, enhancing its ability to reproduce fine details such as small faces and text. In addition to training at 256×256 resolution, training was extended to 512×512, further improving reconstruction quality.
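As an illustration of what widening the latent space means in practice, here is a sketch using the generic AutoencoderKL from diffusers (not the PGv3 VAE); the block configuration below is an assumption chosen only to give the usual 8x spatial downsampling.

```python
# Illustration only: a 16-channel latent VAE built with diffusers' generic AutoencoderKL.
# This is not the PGv3 VAE; block sizes are assumptions for demonstration.
import torch
from diffusers import AutoencoderKL

vae_16ch = AutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D",) * 4,
    up_block_types=("UpDecoderBlock2D",) * 4,
    block_out_channels=(128, 256, 512, 512),
    latent_channels=16,   # widened from the common 4-channel latent space
    sample_size=512,
)

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    latents = vae_16ch.encode(x).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 16, 64, 64]) with 8x spatial downsampling
```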

CapsBench captioning benchmark

Image description evaluation is a complex problem, and current evaluation metrics fall mainly into two categories:

1. Reference-based metrics, such as BLEU, CIDEr, METEOR, and SPICE, compute similarity against a ground-truth caption or set of captions as the quality measure; the resulting score is constrained by the format of the references;

2. Reference-free metrics, such as CLIPScore, InfoMetIC, and TIGEr, use semantic vectors of the reference image (or of multiple image regions) to score the candidate caption; their drawback is that for dense images with long, detailed captions, a single semantic vector is not representative because too many concepts are involved.

A newer approach is question-based metrics: questions are generated from reference captions and then used to evaluate candidate captions, which also helps in comprehensively evaluating text-to-image models.

Inspired by DSG and DPG-bench, the researchers proposed an inverted image description evaluation method to generate "yes-no" question and answer pairs across 17 image categories: generic, image type, text, color, position, relation, relative position, entity, entity size, entity shape, counting, emotion, blur, image artifacts, proper nouns (world knowledge), color palette, and color grading.

During the evaluation process, the language model is used to answer questions based on candidate descriptions only, with answer options of "yes", "no", and "not applicable".

CapsBench contains 200 images and 2471 questions, with an average of 12 questions per image, covering movie scenes, cartoon scenes, movie posters, invitations, advertisements, leisure photography, street photography, landscape photography, and interior photography.
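A hedged sketch of how such question-based scoring could be computed: an LLM answers each question from the candidate caption alone, and the score is its agreement with the reference answers. The `ask_llm` helper below is hypothetical, not part of CapsBench.

```python
# Sketch of question-based caption evaluation in the spirit of CapsBench.
# `ask_llm` is a hypothetical helper that returns "yes", "no", or "n/a".
from typing import Callable

def score_caption(
    caption: str,
    qa_pairs: list[tuple[str, str]],   # (question, reference answer: "yes"/"no"/"n/a")
    ask_llm: Callable[[str], str],
) -> float:
    correct = 0
    for question, reference in qa_pairs:
        prompt = (
            "Answer only 'yes', 'no', or 'n/a' using the description below.\n"
            f"Description: {caption}\nQuestion: {question}"
        )
        if ask_llm(prompt).strip().lower() == reference:
            correct += 1
    return correct / len(qa_pairs)

# Example with a trivial stand-in for the LLM answerer:
qa = [("Is there a red bicycle?", "yes"), ("Is the scene indoors?", "no")]
print(score_caption("A red bicycle leans against a brick wall outside.",
                    qa, ask_llm=lambda p: "yes" if "bicycle" in p else "no"))
```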

Experimental results

[Figure: side-by-side comparison of images generated by Ideogram-2, PGv3, and Flux-pro]

The researchers compared Ideogram-2 (top left), PGv3 (top right), and Flux-pro (bottom left), and when viewed as thumbnails, the images of the three models looked similar with little qualitative difference.

When you zoom in to examine details and textures, clear differences emerge: the skin textures generated by Flux-pro are too smooth, resembling a 3D render rather than a photograph; Ideogram-2 produces more realistic skin texture but follows long prompts poorly, losing key details.

In contrast, PGv3 excels at following prompts and generating realistic images, and it shows a cinematic quality that is noticeably better than the other models.

Prompt following


In the paper's comparison figure, colored text marks specific details that a model failed to capture, and PGv3 consistently follows these details. Its advantage becomes especially evident as test prompts grow longer and more detailed; the researchers attribute this improvement to a model structure that deeply integrates a large language model (LLM) and an advanced vision-language model (VLM) captioning system.

Text rendering


PGv3 can generate images across a wide variety of categories, including posters, logos, memes, book covers, and presentation slides. It can also reproduce memes with custom text and, thanks to its strong prompt following and text rendering, create new memes with arbitrary characters and compositions.

RGB color control


PGv3 goes beyond standard color palettes, offering exceptionally fine color control in generated content. With its strong prompt following and specialized training, it lets users specify the color of each object or region in an image with exact RGB values, making it well suited to professional design scenarios that demand precise color matching.
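For illustration, precise colors would typically be passed in the prompt text itself; the template below is an assumed format for demonstration, not documented PGv3 prompt syntax.

```python
# Illustration of prompting with exact RGB values. The template is an assumption,
# not documented PGv3 prompt syntax.
def color_prompt(subject: str, colors: dict[str, tuple[int, int, int]]) -> str:
    color_clauses = ", ".join(
        f"the {name} in RGB({r}, {g}, {b})" for name, (r, g, b) in colors.items()
    )
    return f"{subject}, with {color_clauses}"

print(color_prompt(
    "a minimalist coffee-shop logo on a plain background",
    {"background": (245, 239, 230), "lettering": (61, 41, 28), "cup icon": (197, 123, 87)},
))
```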

Multilingual ability


Thanks to the innate ability of language models to understand multiple languages and build good representations of related words across them, PGv3 can naturally interpret prompts in various languages; this multilingual capability emerged with only a small number of multilingual text-image pairs in the dataset (on the order of tens of thousands of images).
