
A 24-billion-parameter text-to-image model! Playground v3 Released: Graphic Design Capabilities Surpass Humans

Editor: LRS

Playground Research has launched a new-generation text-to-image model, PGv3, with 24 billion parameters. Built around a deeply integrated large language model, it even surpasses human designers at graphic design and at following text prompt instructions, while supporting accurate RGB color control and multilingual understanding.

Since last year, text-to-image generation models have made tremendous progress, and model architectures have gradually shifted from traditional UNet-based designs to Transformer-based ones.

Playground Research recently published a paper detailing the team's newest DiT-based diffusion model, Playground v3 (PGv3 for short), which scales the number of model parameters to 24 billion, achieves state-of-the-art performance across multiple test benchmarks, and is better at graphic design.


Link to paper: https://arxiv.org/abs/2409.10695

Data link: https://huggingface.co/datasets/playgroundai/CapsBench

Unlike traditional text-to-image generation models that rely on pre-trained language models such as T5 or CLIP text encoders, PGv3 is fully integrated with large language models (LLMs) and is based on a new Deep-Fusion architecture that leverages the knowledge of decoder-only large language models for text-to-image generation tasks.

In addition, in order to improve the quality of image descriptions, the researchers developed an in-house captioner capable of generating descriptions with different levels of detail, enriching the diversity of text structures, and introduced a new benchmark, CapsBench, to evaluate the performance of detailed image descriptions.

Experimental results show that PGv3 performs well in text prompt following, complex reasoning, and text rendering accuracy. User preference studies show that PGv3 has superior graphic design capability in common design applications such as stickers, posters, and logos, along with precise RGB color control and multilingual understanding.

PGv3 model architecture

Playground v3 (PGv3) is a latent diffusion model (LDM) trained with the EDM formulation. Like DALL-E 3, Imagen 2, and Stable Diffusion 3, PGv3 is designed for text-to-image (t2i) generation tasks.

PGv3 is fully integrated with a large language model (Llama3-8B) to enhance its ability to comprehend and follow prompts.

Text encoders

Each layer of a Transformer captures a different representation, containing different amounts of word-level and sentence-level information. Standard practice is to use the last-layer output of a T5 or CLIP text encoder, or to additionally combine the output of the penultimate layer. However, the researchers found that picking the best layer for conditioning a text-to-image model is cumbersome, especially with decoder-only large language models, whose internal representations are more complex.


The researchers argue that the continuous flow of information through every layer of the LLM is the key to its generative ability: the knowledge in an LLM spans all of its layers rather than being encapsulated in any single layer's output. PGv3 is therefore designed to replicate all of the LLM's Transformer blocks and take the hidden-embedding output of each corresponding LLM layer as conditioning.

This approach takes full advantage of the LLM's complete "thought process" and encourages the image model to mimic the LLM's reasoning and generation process, so that generated images follow prompts more faithfully and consistently.
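To make the idea concrete, here is a minimal sketch of extracting per-layer hidden states from a decoder-only LLM so that each image-side block could be conditioned on its corresponding LLM layer. It assumes the Hugging Face transformers API and is an illustration, not the PGv3 implementation.

```python
# Minimal sketch: expose the hidden states of every layer of a frozen decoder-only LLM.
# Assumes the Hugging Face `transformers` API; this is not the PGv3 implementation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Llama3-8B is a gated checkpoint; any decoder-only causal LM works for this illustration.
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

prompt = "a red bicycle leaning against a brick wall at sunset"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = llm(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per Transformer layer,
# each of shape (batch, seq_len, hidden_dim). Drop the embedding entry.
per_layer_text = out.hidden_states[1:]
print(len(per_layer_text), per_layer_text[0].shape)  # 32 layers for an 8B Llama3 model

# Conceptually, image-model block i would attend to per_layer_text[i]
# instead of a single last-layer text embedding.
```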

Model structure

PGv3 adopts a DiT-style architecture. Each Transformer block in the image model is configured identically to the corresponding block in the language model (Llama3-8B): a single attention layer followed by a single feed-forward layer, with the same hidden dimension, number of attention heads, and per-head dimension. Only the image-model portion is trained.

During diffusion sampling, the language model part only needs to be run once to generate all intermediate hidden embeddings.
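In pseudocode, the sampling loop then looks roughly like the sketch below; the denoising update and the image-model interface are placeholders, not PGv3 code.

```python
# Sketch: the LLM forward pass happens once per prompt; its per-layer hidden states
# are cached and reused at every denoising step. All names are illustrative.
import torch

def sample(image_model, llm, tokenizer, prompt, num_steps=50, latent_shape=(1, 16, 64, 64)):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # One text pass: cache one hidden-state tensor per LLM layer.
        text_states = llm(**inputs, output_hidden_states=True).hidden_states[1:]

    latents = torch.randn(latent_shape)
    for step in range(num_steps):
        # The LLM is NOT re-run here; each image block reads its cached layer.
        noise_pred = image_model(latents, step=step, text_states=text_states)
        latents = latents - noise_pred / num_steps  # placeholder update, not a real EDM step
    return latents
```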

Unlike most traditional CNN-based diffusion models, which compute self-attention over image features and cross-attention between image and text features separately, PGv3's Transformer performs a single joint attention operation, extracting relevant features from a combined pool of image and text keys and values (see the sketch after this list); this reduces computational cost and inference time. Several additional design choices proved useful for performance:

1. U-Net skip connections between Transformer blocks.

2. Token downsampling in the middle layers: at layer 32, the sequence length of the image keys and values is reduced by a factor of four, making the whole network resemble a traditional convolutional U-Net with a single downsampling stage. This slightly speeds up training and inference with no performance degradation.

3. Position embeddings: the same Rotary Position Embedding (RoPE) as in Llama3. Since images are two-dimensional features, the researchers explored 2D variants of RoPE:

The "interpolating-PE" method interpolates the position ID in the middle regardless of the sequence length, keeping the start and end position IDs fixed, but this method is severely overfitting at the training resolution and cannot generalize to unseen aspect ratios.

In contrast, the "expand-PE" method increased the position ID proportionally to the length of the sequence, performed well without any tricks or normalization, and showed no signs of resolution overfitting.

New VAE

The variational autoencoder (VAE) of the latent diffusion model (LDM) is important for determining the fine-grained image quality upper limit of the model.

The researchers increased the number of latent channels in the VAE from 4 to 16, enhancing its ability to reproduce fine details such as small faces and text. In addition to training at 256×256 resolution, training was extended to 512×512, further improving reconstruction quality.
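As an illustration of what widening the latent space means in practice, here is a sketch using the generic AutoencoderKL from diffusers (not the PGv3 VAE); the block configuration below is an assumption chosen only to give the usual 8x spatial downsampling.

```python
# Illustration only: a 16-channel latent VAE built with diffusers' generic AutoencoderKL.
# This is not the PGv3 VAE; block sizes are assumptions for demonstration.
import torch
from diffusers import AutoencoderKL

vae_16ch = AutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D",) * 4,
    up_block_types=("UpDecoderBlock2D",) * 4,
    block_out_channels=(128, 256, 512, 512),
    latent_channels=16,   # widened from the common 4-channel latent space
    sample_size=512,
)

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    latents = vae_16ch.encode(x).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 16, 64, 64]) with 8x spatial downsampling
```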

CapsBench captioning benchmark

Image description evaluation is a complex problem, and current evaluation metrics fall mainly into two categories:

1. Reference-based metrics, such as BLEU, CIDEr, METEOR, and SPICE, compute similarity against a ground-truth caption or set of captions as the quality measure; the resulting score is constrained by the format of the references;

2. Reference-free metrics, such as CLIPScore, InfoMetIC, and TIGEr, use semantic vectors of the reference image (or of multiple image regions) to score the candidate caption; their drawback is that for dense images with long, detailed captions, a single semantic vector is not representative because too many concepts are involved.

A newer approach is question-based metrics: questions are generated from reference captions and then used to evaluate candidate captions, which also helps in comprehensively evaluating text-to-image models.

Inspired by DSG and DPG-bench, the researchers proposed an inverted image description evaluation method to generate "yes-no" question and answer pairs across 17 image categories: generic, image type, text, color, position, relation, relative position, entity, entity size, entity shape, counting, emotion, blur, image artifacts, proper nouns (world knowledge), color palette, and color grading.

During the evaluation process, the language model is used to answer questions based on candidate descriptions only, with answer options of "yes", "no", and "not applicable".

CapsBench contains 200 images and 2471 questions, with an average of 12 questions per image, covering movie scenes, cartoon scenes, movie posters, invitations, advertisements, leisure photography, street photography, landscape photography, and interior photography.
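A hedged sketch of how such question-based scoring could be computed: an LLM answers each question from the candidate caption alone, and the score is its agreement with the reference answers. The `ask_llm` helper below is hypothetical, not part of CapsBench.

```python
# Sketch of question-based caption evaluation in the spirit of CapsBench.
# `ask_llm` is a hypothetical helper that returns "yes", "no", or "n/a".
from typing import Callable

def score_caption(
    caption: str,
    qa_pairs: list[tuple[str, str]],   # (question, reference answer: "yes"/"no"/"n/a")
    ask_llm: Callable[[str], str],
) -> float:
    correct = 0
    for question, reference in qa_pairs:
        prompt = (
            "Answer only 'yes', 'no', or 'n/a' using the description below.\n"
            f"Description: {caption}\nQuestion: {question}"
        )
        if ask_llm(prompt).strip().lower() == reference:
            correct += 1
    return correct / len(qa_pairs)

# Example with a trivial stand-in for the LLM answerer:
qa = [("Is there a red bicycle?", "yes"), ("Is the scene indoors?", "no")]
print(score_caption("A red bicycle leans against a brick wall outside.",
                    qa, ask_llm=lambda p: "yes" if "bicycle" in p else "no"))
```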

Experimental results

[Figure: side-by-side comparison of images generated by Ideogram-2, PGv3, and Flux-pro]

The researchers compared Ideogram-2 (top left), PGv3 (top right), and Flux-pro (bottom left), and when viewed as thumbnails, the images of the three models looked similar with little qualitative difference.

When you zoom in to examine details and textures, clear differences emerge: the skin textures generated by Flux-pro are too smooth, resembling a 3D render rather than a photograph; Ideogram-2 produces more realistic skin texture but follows long prompts poorly, losing key details.

In contrast, PGv3 excels at following prompts and generating realistic images, and it shows a cinematic quality that is noticeably better than the other models.

Prompt following


In the paper's comparison figure, colored text marks specific details that a model failed to capture, and PGv3 consistently follows these details. Its advantage becomes especially evident as test prompts grow longer and more detailed; the researchers attribute this improvement to a model structure that deeply integrates a large language model (LLM) and an advanced vision-language model (VLM) captioning system.

Text rendering


PGv3 can generate images across a wide variety of categories, including posters, logos, memes, book covers, and presentation slides. It can also reproduce memes with custom text and, thanks to its strong prompt following and text rendering, create new memes with arbitrary characters and compositions.

RGB color control


PGv3 goes beyond standard color palettes, offering exceptionally fine color control in generated content. With its strong prompt following and specialized training, it lets users specify the color of each object or region in an image with exact RGB values, making it well suited to professional design scenarios that demand precise color matching.
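For illustration, precise colors would typically be passed in the prompt text itself; the template below is an assumed format for demonstration, not documented PGv3 prompt syntax.

```python
# Illustration of prompting with exact RGB values. The template is an assumption,
# not documented PGv3 prompt syntax.
def color_prompt(subject: str, colors: dict[str, tuple[int, int, int]]) -> str:
    color_clauses = ", ".join(
        f"the {name} in RGB({r}, {g}, {b})" for name, (r, g, b) in colors.items()
    )
    return f"{subject}, with {color_clauses}"

print(color_prompt(
    "a minimalist coffee-shop logo on a plain background",
    {"background": (245, 239, 230), "lettering": (61, 41, 28), "cup icon": (197, 123, 87)},
))
```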

Multilingual ability


Thanks to the innate ability of language models to understand multiple languages and build good representations of related words across them, PGv3 can naturally interpret prompts in various languages; this multilingual capability emerged with only a small number of multilingual text-image pairs in the dataset (on the order of tens of thousands of images).
