
Concise and vivid: an illustrated look at how the "old painter" DALL-E 2 works

Selected from AssemblyAI

Written by Ryan O'Connor

Compiled by Machine Heart

Editor: Egg Sauce

How does the stunning DALL-E 2 work?

In early April 2022, OpenAI unveiled its groundbreaking model DALL-E 2, setting a new benchmark for image generation and manipulation. By simply entering a short text prompt, DALL-E 2 can generate entirely new images that combine distinct and unrelated objects in a semantically plausible way. For example, the prompt "a bowl of soup that is a portal to another dimension as digital art" generates the image below.

[Image: DALL-E 2's output for the prompt "a bowl of soup that is a portal to another dimension as digital art"]

DALL-E 2 can even modify existing images, create variants of an image that retain its distinctive features, and interpolate between two input images. DALL-E 2's impressive results have left many wondering exactly how such a powerful model works.

In this article, we'll take a closer look at how DALL-E 2 creates such stunning images. Plenty of background information is provided, and the explanations span several levels of depth, so this article is suitable for readers with varying levels of machine learning experience.

Overall, the highlights of DALL-E 2 are as follows:

1. First, DALL-E 2 demonstrates the power of diffusion models in deep learning: both the prior and the image generation sub-models in DALL-E 2 are diffusion-based. Although they have only become popular in the past few years, diffusion models have already proven their worth, and those who follow deep learning research should expect to see more of them in the future.

2. Second, it demonstrates the need for, and power of, natural language as a means of training state-of-the-art (SOTA) deep learning models. This insight does not originate with DALL-E 2, but it is important to recognize that DALL-E 2's power stems from the enormous amount of paired natural language/image data available on the Internet. Using such data removes the high cost and associated bottleneck of manually labeling datasets, while its noisy, uncurated nature also reflects the robustness deep learning models must have to real-world data.

3. Finally, DALL-E 2 reaffirms the status of Transformers as supreme for models trained on web-scale datasets, given their impressive parallelizability.

How DALL-E 2 works: a bird's-eye view

Before diving into the details of how DALL-E 2 works, let's orient ourselves with a general view of how it generates images. While DALL-E 2 can perform a variety of tasks, including the image manipulation and interpolation mentioned above, this article focuses on the image generation task.


At the highest level, DALL-E 2's workflow is very simple:

1. First, the text prompt is fed into a text encoder that has been trained to map prompts to a representation space;

2. Next, a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding;

3. Finally, an image decoder model stochastically generates an image that is a visual manifestation of this semantic information.

From a bird's-eye view, that's all it is. Of course, there are a lot of interesting implementation details, which we'll discuss below.

Detailed description

Now it's time to dive into each of these steps separately. Let's start with how DALL-E 2 learns to link related text and visual concepts.

Step 1: Link text and visual semantics

After entering the prompt "a teddy bear riding a skateboard in Times Square", DALL-E 2 outputs the following image:

[Image: DALL-E 2's output for the teddy bear prompt above]

How does DALL-E 2 know how a textual concept like "teddy bear" should be embodied in visual space? The link between textual semantics and their visual representations in DALL-E 2 is learned by another OpenAI model called CLIP.

CLIP is trained on hundreds of millions of images and their associated captions, learning how well a given piece of text relates to an image. That is, CLIP is not trying to predict the caption of a given image, but rather to learn how related any given caption is to that image. This contrastive, rather than predictive, objective enables CLIP to learn the connections between textual and visual representations of the same abstract object. The entire DALL-E 2 model hinges on CLIP's ability to learn semantics from natural language, so let's see how CLIP is trained in order to understand its inner workings.

CLIP training

The basic principle behind training CLIP is simple:

1. First, all images and their associated captions are passed through their respective encoders, mapping all objects into an m-dimensional space.

2. Then, the cosine similarity of each (image, text) pair is computed.

3. The training objective is to simultaneously maximize the cosine similarity of the N correct image/caption pairs and minimize the cosine similarity of the N² - N incorrect image/caption pairs.

The training process is visualized as follows:

[Figure: visualization of the CLIP training process]
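To make the contrastive objective above concrete, here is a minimal PyTorch sketch. The toy linear encoders, embedding dimension, and fixed temperature are illustrative assumptions; the real CLIP uses Transformer-based encoders and a learned temperature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for CLIP's encoders; the real model uses far larger image and
# text encoders that project into the same shared space.
m = 64                                        # shared embedding dimension
image_encoder = nn.Linear(3 * 32 * 32, m)     # flattens a 32x32 RGB image
text_encoder = nn.Linear(100, m)              # e.g. a bag-of-tokens vector

def clip_loss(images, captions, temperature=0.07):
    # 1. Map images and captions into the shared m-dimensional space.
    img_emb = F.normalize(image_encoder(images.flatten(1)), dim=-1)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)

    # 2. Cosine similarity of every (image, caption) pair: an N x N matrix.
    logits = img_emb @ txt_emb.t() / temperature

    # 3. Maximize similarity on the diagonal (the N correct pairs) while
    #    minimizing it for the N^2 - N mismatched pairs, in both directions.
    targets = torch.arange(len(images))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with a dummy batch of N = 8 image/caption pairs.
loss = clip_loss(torch.randn(8, 3, 32, 32), torch.randn(8, 100))
```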

CLIP is important to DALL-E 2 because it ultimately determines how semantically related a natural language snippet is to a visual concept, which is critical for text-conditional image generation.

Step 2: Generate images from visual semantics

After training, the CLIP model is frozen, and DALL-E 2 moves on to its next task: learning to reverse the image encoding mapping that CLIP just learned. CLIP learns a representation space in which it is easy to determine the relatedness of textual and visual encodings, but our interest lies in image generation, so we must learn how to exploit this representation space to accomplish that task.

In particular, OpenAI performs this image generation using a modified version of its previous model, GLIDE (https://arxiv.org/abs/2112.10741). The GLIDE model learns to invert the image encoding process in order to stochastically decode CLIP image embeddings.

[Figure: GLIDE decodes a CLIP image embedding back into an image]

As shown in the figure above, it is important to note that the goal is not to build an autoencoder that exactly reconstructs an image given its embedding, but rather to generate an image that maintains the salient features of the original image given its embedding. To perform this image generation, GLIDE uses a diffusion model.

What is a diffusion model?

Diffusion models are a thermodynamics-inspired invention that has gained significant popularity in recent years. Diffusion models learn to generate data by reversing a gradual noising process. As shown in the figure below, the noising process is viewed as a parameterized Markov chain that gradually adds noise to an image until it is destroyed, eventually (asymptotically) yielding pure Gaussian noise. The diffusion model learns to walk backwards along this chain, gradually removing the noise over a series of timesteps to reverse the process.

[Figure: the gradual noising process and its learned reversal in a diffusion model]

After training, the diffusion model can then effectively be "cut in half" and used to generate an image by randomly sampling Gaussian noise and then denoising it until a realistic image emerges. Some readers may notice that this technique is reminiscent of generating data with autoencoders, and diffusion models and autoencoders are in fact related.
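Below is a minimal numerical sketch of the two halves described above: the fixed forward chain that gradually corrupts an image with Gaussian noise, and the learned reverse chain that starts from pure noise and denoises step by step. The linear noise schedule and the placeholder `denoise_model` are assumptions for illustration, not DALL-E 2's actual configuration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def forward_noise(x0, t):
    """Forward (fixed) process: jump straight to step t by mixing the clean
    image with Gaussian noise according to the closed-form marginal."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise

@torch.no_grad()
def sample(denoise_model, shape):
    """Reverse (learned) process: start from pure Gaussian noise and let the
    model walk back down the chain one timestep at a time."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        predicted_noise = denoise_model(x, t)            # epsilon prediction
        alpha, a_bar = 1 - betas[t], alphas_bar[t]
        x = (x - betas[t] / (1 - a_bar).sqrt() * predicted_noise) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # keep it stochastic
    return x
```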

GLIDE training

Although GLIDE was not the first diffusion model, its important contribution was in modifying diffusion models to allow text-conditional image generation. In particular, notice that diffusion models start from randomly sampled Gaussian noise, so it is at first unclear how to steer this process to produce a specific image. If you train a diffusion model on a dataset of faces, it will reliably generate realistic images of faces; but what if you want to generate a face with a specific feature, such as brown eyes or blond hair?

GLIDE extends the core concept of diffusion models by augmenting the training process with additional textual information, ultimately producing text-conditional images. Let's take a look at the training process for GLIDE:
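Conceptually, each GLIDE training step looks like a standard diffusion training step with the caption supplied as extra conditioning. The sketch below is a simplified illustration under assumed placeholders (`unet`, `text_encoder`, and the noise schedule are stand-ins, not GLIDE's actual components).

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed noise schedule, as before
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def glide_training_step(unet, text_encoder, x0, caption_tokens):
    """One text-conditional diffusion training step (simplified sketch).

    `unet(x_t, t, text_context)` is assumed to predict the noise added to x0;
    `text_encoder` turns caption tokens into a conditioning sequence.
    """
    t = torch.randint(0, T, (x0.shape[0],))                 # random timestep per image
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # noised images

    text_context = text_encoder(caption_tokens)             # the extra text information
    predicted_noise = unet(x_t, t, text_context)            # text-conditional prediction
    return F.mse_loss(predicted_noise, noise)               # standard epsilon objective
```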

Here are some examples of images generated using GLIDE. The authors note that GLIDE performs better than DALL-E in terms of photorealism and caption similarity.


Examples of images generated by GLIDE

DALL-E 2 uses a modified GLIDE model that incorporates projected CLIP text embeddings in two ways: first, by adding them to GLIDE's existing timestep embedding, and second, by creating four additional context tokens that are concatenated to the output sequence of the GLIDE text encoder.
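The sketch below illustrates those two injection points in isolation. The dimensions and projection layers are assumptions for illustration; the real modified GLIDE embeds them inside a full text-conditional diffusion decoder.

```python
import torch
import torch.nn as nn

d_model = 512                                    # assumed model width
clip_dim = 768                                   # assumed CLIP text embedding size

project = nn.Linear(clip_dim, d_model)           # projects the CLIP text embedding
to_context = nn.Linear(clip_dim, 4 * d_model)    # produces four extra tokens

def condition_glide(timestep_emb, glide_text_tokens, clip_text_emb):
    """Inject the projected CLIP text embedding in the two ways described:
    (1) add it to the existing timestep embedding, and
    (2) concatenate four extra context tokens to the GLIDE text-encoder output.
    """
    projected = project(clip_text_emb)                        # (B, d_model)
    timestep_emb = timestep_emb + projected                   # way 1

    extra = to_context(clip_text_emb).view(-1, 4, d_model)    # (B, 4, d_model)
    context = torch.cat([glide_text_tokens, extra], dim=1)    # way 2
    return timestep_emb, context

# Dummy usage: batch of 2, GLIDE text-encoder output of length 77.
t_emb, ctx = condition_glide(torch.randn(2, d_model),
                             torch.randn(2, 77, d_model),
                             torch.randn(2, clip_dim))
```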

GLIDE is important to DALL-E 2 because it allowed the authors to easily port GLIDE's text-conditional, photorealistic image generation capabilities to DALL-E 2 by instead conditioning on image encodings in the representation space. As a result, DALL-E 2's modified GLIDE learns to generate semantically consistent images conditioned on CLIP image encodings. It is also important to note that the reverse diffusion process is stochastic, so variations are easily obtained by passing the same image encoding vector through the modified GLIDE model multiple times.

Step 3: Map from text semantics to the appropriate visual semantics

While the modified GLIDE model successfully generates images that reflect the semantics captured by an image encoding, how do we actually find these encoded representations? In other words, how do we inject the text-conditioning information from the prompt into the image generation process?

Recall that, in addition to the image encoder, CLIP also learns a text encoder. DALL-E 2 uses another model, which the authors call the prior, to map from the text encoding of an image caption to the image encoding of its corresponding image. The DALL-E 2 authors experimented with both autoregressive and diffusion models for the prior, but found that they delivered comparable performance. Given the much greater computational efficiency of the diffusion model, it was chosen as the prior for DALL-E 2.


The prior maps from a text encoding to its corresponding image encoding

Training the prior

The diffusion prior in DALL-E 2 is a decoder-only Transformer. Using a causal attention mask, it operates on an ordered sequence of:

1. The tokenized text/caption.

2. The CLIP text encodings of these tokens.

3. An encoding of the diffusion timestep.

4. The noised image passed through the CLIP image encoder.

5. A final encoding whose output from the Transformer is used to predict the unnoised CLIP image encoding.
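As a rough illustration of that ordered sequence, here is a minimal sketch of how the inputs might be assembled and fed to a causally masked Transformer. All module names, widths, and the use of already-embedded caption tokens are assumptions, not the actual DALL-E 2 prior.

```python
import torch
import torch.nn as nn

d = 512                                     # assumed width of the prior
prior = nn.TransformerEncoder(              # "decoder-only" here = causal self-attention
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=2)
final_token = nn.Parameter(torch.randn(1, 1, d))   # learned "final encoding"

def predict_image_embedding(text_tokens, clip_text_emb, t_emb, noised_img_emb):
    """Run the prior on the ordered sequence and read the prediction of the
    unnoised CLIP image embedding from the final token's output."""
    B = text_tokens.shape[0]
    seq = torch.cat([
        text_tokens,                        # 1. tokenized caption (already embedded)
        clip_text_emb.unsqueeze(1),         # 2. CLIP text encoding
        t_emb.unsqueeze(1),                 # 3. diffusion timestep encoding
        noised_img_emb.unsqueeze(1),        # 4. noised CLIP image encoding
        final_token.expand(B, -1, -1),      # 5. final encoding
    ], dim=1)
    causal = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
    out = prior(seq, mask=causal)           # causal attention over the sequence
    return out[:, -1]                       # predicted clean CLIP image embedding

pred = predict_image_embedding(torch.randn(2, 77, d), torch.randn(2, d),
                               torch.randn(2, d), torch.randn(2, d))
```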

Putting it all together

At this point, we have all of DALL-E 2's functional components and need only chain them together to generate text-conditional images:

1. First, the CLIP text encoder maps the image description into the representation space.

2. Then, the diffusion prior maps from the CLIP text encoding to a corresponding CLIP image encoding.

3. Finally, the modified GLIDE generation model maps from the representation space into image space via reverse diffusion, generating one of many possible images that convey the semantic information of the input description.


A high-level overview of the DALL-E 2 image generation process
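In code, the whole pipeline amounts to chaining three trained components. The function below is purely schematic; `clip_text_encoder`, `diffusion_prior`, and `glide_decoder` are placeholders standing in for the components described above.

```python
def generate_image(prompt, clip_text_encoder, diffusion_prior, glide_decoder):
    """Schematic DALL-E 2 generation pipeline (placeholder components)."""
    # 1. Map the caption into CLIP's shared representation space.
    text_embedding = clip_text_encoder(prompt)

    # 2. The prior maps the text encoding to a corresponding image encoding.
    image_embedding = diffusion_prior(text_embedding)

    # 3. The modified GLIDE decoder maps from representation space to image
    #    space via reverse diffusion; rerunning this step yields new variants
    #    because the reverse-diffusion process is stochastic.
    return glide_decoder(image_embedding)
```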

References

1. Deep Unsupervised Learning using Nonequilibrium Thermodynamics (https://arxiv.org/abs/1503.03585)

2. Generative Modeling by Estimating Gradients of the Data Distribution (https://arxiv.org/abs/1907.05600)

3. Hierarchical Text-Conditional Image Generation with CLIP Latents (https://arxiv.org/pdf/2204.06125.pdf)

4. Diffusion Models Beat GANs on Image Synthesis (https://arxiv.org/abs/2105.05233)

5. Denoising Diffusion Probabilistic Models (https://arxiv.org/pdf/2006.11239.pdf)

6. Learning Transferable Visual Models From Natural Language Supervision (https://arxiv.org/pdf/2103.00020.pdf)

7. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (https://arxiv.org/pdf/2112.10741.pdf)
