
The new multimodal king ascends the throne! OpenAI releases DALL·E 2, generating images that "hit wherever you point"


Reporting by XinZhiyuan

Editor: LRS

【New Zhiyuan Introduction】The big artist gets an upgrade! OpenAI recently released DALL·E 2: the resolution is 4 times higher, the generation is more accurate, and its repertoire is broader. Beyond generating images from text, it can also re-create and edit existing ones!

In January 2021, OpenAI made a big move with the DALL-E model, which finally joined natural language and images hand in hand: enter a piece of text, no matter how outrageous, and it can generate a picture!

For example, the classic "avocado-shaped armchair" and the novel creature "A Giraffe Tortoise".


It looked magical enough at the time, right?

A year later, OpenAI combined it with another multimodal model, CLIP, to release a second version: DALL·E 2!

Compared to the previous generation, DALL·E 2 is even more magical, jumping straight from rough sketches to ultra-high-definition images: the resolution has increased fourfold, from 256x256 to 1024x1024, and the generated images are more accurate!

For example, the caption "a painting of a fox sitting in a field at sunrise in the style of Claude Monet" shows the difference between the two at a glance.


DALL·E 2's range of generation is also much broader. Want "an astronaut riding a horse in a photorealistic style"? Done! In the mountains, in outer space, on the grass: all kinds of scenes are readily available.


An astronaut + riding a horse + as a pencil drawing


There is also a bowl of soup that looks like a monster, knitted out of wool! (A bowl of soup + that looks like a monster + knitted out of wool)


DALL·E 2 can also edit existing images from natural-language captions, taking shadows, reflections, and textures into account when adding and removing elements.

For example, a puppy added to a painting looks perfectly natural.


You can also give DALL·E 2 a painting as the input and let it re-create the picture in its own way.


In addition to the official examples, some netizens have posted images they generated while trying out DALL·E 2, such as a panda skateboarding.


Application link: https://labs.openai.com/waitlist

Currently DALL·E 2 is still a research preview and does not yet offer a public API. In keeping with OpenAI's consistently strict standards and high ethical requirements for itself, the development and deployment of the model are naturally restricted to prevent it from being abused.

Although DALL·E 2 can draw anything you can imagine, OpenAI still limits the model's functionality in three main ways:

1. Prevent harmful content from being generated

OpenAI restricts DALL·E 2's ability to produce this type of content and minimizes its exposure to these concepts. Technical measures are also used to prevent the generation of hyper-realistic photos of faces, especially those of public figures.


2. Prevention of abuse

The content policy states that users are not allowed to generate violent, adult, or political content. If the filter detects a text prompt or uploaded image that may violate the policy, the system will not return a generated image. Automated and human monitoring systems are also in place to prevent abuse.

3. Learning-based phased deployment

OpenAI has been working with external experts and is granting preview access to DALL·E 2 to a group of trusted users, who can help the developers understand the capabilities and limitations of the technology. The team plans to iteratively improve its safety systems based on what it learns, and to invite more people into the preview over time.

How was the master artist made?

DALL·E 2 builds on CLIP, whereas, as OpenAI research scientist Prafulla Dhariwal explained, "DALL·E 1 simply took the GPT-3 approach from language and applied it to generating images: compressing the image into a series of words and learning to predict what comes next."


Paper address: https://cdn.openai.com/papers/dall-e-2.pdf

The training data consists of pairs (x, y), where x is an image and y is its caption. Given an image x, z_i and z_t denote the corresponding CLIP image embedding and text embedding.

Contrastive models like CLIP have been shown to learn very robust image representations, capable of capturing both semantics and style.


To use these representations for image generation, the researchers proposed a two-stage model: a prior that generates a CLIP image embedding for a given text caption, and a decoder that generates an image conditioned on that image embedding.

The prior P(z_i | y) generates a CLIP image embedding z_i conditioned on the caption y.

The decoder P(x | z_i, y) generates an image x conditioned on the image embedding z_i (and optionally on the caption y).

The decoder lets the model invert images given their CLIP image embeddings, while the prior lets it learn a generative model of the image embeddings themselves. Stacking the two components yields a generative model of images given captions: P(x | y) = P(x | z_i, y) P(z_i | y).
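To make the two-stage factorization concrete, here is a minimal sketch of the sampling flow. The Prior and Decoder classes below are hypothetical stand-ins that return random arrays of plausible shapes; they only illustrate the data flow, not OpenAI's actual models, and the embedding size and base resolution are assumptions.

```python
# Minimal sketch of the two-stage sampling P(x|y) = P(x|z_i, y) P(z_i|y).
# Prior and Decoder are toy placeholders, not OpenAI's models.
import numpy as np

EMBED_DIM = 768      # assumed CLIP embedding size
IMAGE_SIZE = 64      # assumed base decoder resolution (later upsampled)

class Prior:
    """Stand-in for P(z_i | y): caption embedding -> sampled CLIP image embedding."""
    def sample(self, z_t: np.ndarray) -> np.ndarray:
        rng = np.random.default_rng()
        return z_t + 0.1 * rng.standard_normal(EMBED_DIM)   # placeholder sampling

class Decoder:
    """Stand-in for P(x | z_i, y): image embedding -> decoded image."""
    def sample(self, z_i: np.ndarray) -> np.ndarray:
        rng = np.random.default_rng()
        return rng.random((IMAGE_SIZE, IMAGE_SIZE, 3))       # placeholder image

def generate(z_t: np.ndarray, prior: Prior, decoder: Decoder) -> np.ndarray:
    """Sample x ~ P(x|y): first z_i ~ P(z_i|y), then x ~ P(x|z_i, y)."""
    z_i = prior.sample(z_t)    # stage 1: text embedding -> image embedding
    return decoder.sample(z_i) # stage 2: image embedding -> pixels

caption_embedding = np.random.default_rng(0).standard_normal(EMBED_DIM)  # pretend CLIP text embedding
image = generate(caption_embedding, Prior(), Decoder())
print(image.shape)  # (64, 64, 3)
```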


Because images are generated by inverting the CLIP image encoder, the new model's image generation stack is also known as unCLIP.

In the overall unCLIP architecture, the part above the dotted line depicts the CLIP training process, through which the model learns a joint representation space for text and images. The part below the dotted line depicts the text-to-image generation process: the CLIP text embedding is first fed into an autoregressive or diffusion prior to produce an image embedding, which is then used to condition a diffusion decoder that produces the final image. Note that the CLIP model's parameters are frozen during the training of both the prior and the decoder.
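The freezing step described above can be sketched in a few lines of PyTorch. Everything here is a toy placeholder under stated assumptions: the linear layers stand in for the frozen CLIP encoders, SimplePrior stands in for the autoregressive or diffusion prior, and the regression loss is only illustrative.

```python
# Sketch: CLIP stays frozen while only the prior (analogously, the decoder) is trained.
import torch
import torch.nn as nn

clip_text_encoder = nn.Linear(512, 512)    # placeholder for the frozen CLIP text encoder
clip_image_encoder = nn.Linear(512, 512)   # placeholder for the frozen CLIP image encoder
for module in (clip_text_encoder, clip_image_encoder):
    for p in module.parameters():
        p.requires_grad = False            # CLIP parameters are never updated

class SimplePrior(nn.Module):
    """Toy prior: maps a CLIP text embedding to a predicted CLIP image embedding."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        return self.net(z_t)

prior = SimplePrior()
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)  # only the prior is optimized

# One illustrative training step on random stand-in (text, image) feature pairs.
text_feats, image_feats = torch.randn(8, 512), torch.randn(8, 512)
z_t = clip_text_encoder(text_feats)        # frozen encoders provide conditions and targets
z_i = clip_image_encoder(image_feats)
loss = nn.functional.mse_loss(prior(z_t), z_i)  # simple regression loss as a stand-in
loss.backward()
optimizer.step()
print(float(loss))
```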


Another distinctive use of the new decoder is to explore CLIP's latent space by directly visualizing what the CLIP image encoder "sees".

For example, by encoding an image with CLIP and then decoding its image embedding with the diffusion decoder, we can obtain variations of the image. These variations reveal what information the CLIP image embedding captures (what stays the same across samples) and what information it discards (what varies across samples).
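The variation procedure itself is simple: encode once, decode many times with different seeds. In this sketch, encode_image and decode_embedding are hypothetical stand-ins for the CLIP encoder and the diffusion decoder; they only manipulate random arrays to show the flow.

```python
# Sketch of "variations": one CLIP encoding, several decodings with different seeds.
import numpy as np

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for the CLIP image encoder (image -> 768-d embedding)."""
    rng = np.random.default_rng(abs(hash(image.tobytes())) % (2**32))
    return rng.standard_normal(768)

def decode_embedding(z_i: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for the diffusion decoder; a different seed gives a different sample."""
    rng = np.random.default_rng(seed)
    return rng.random((64, 64, 3))

source = np.random.default_rng(0).random((64, 64, 3))
z_i = encode_image(source)                                       # encode once
variations = [decode_embedding(z_i, seed) for seed in range(4)]  # decode with 4 seeds
# What stays constant across `variations` is what z_i captures; what differs is what it discards.
```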


Guided by the same idea, it is also possible to interpolate between CLIP embeddings to blend information from two images, for example producing a continuous transition in embedding space between a picture of the night sky and a picture of a dog.
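A common way to interpolate between normalized embeddings is spherical interpolation (slerp). The sketch below is self-contained; random vectors stand in for the real CLIP image embeddings of the two source pictures.

```python
# Spherical interpolation (slerp) between two embeddings.
import numpy as np

def slerp(z1: np.ndarray, z2: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between z1 and z2 along the great circle connecting them."""
    z1n, z2n = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)
    omega = np.arccos(np.clip(np.dot(z1n, z2n), -1.0, 1.0))  # angle between the embeddings
    if np.isclose(omega, 0.0):
        return (1 - t) * z1 + t * z2                          # (nearly) parallel vectors
    return (np.sin((1 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)

rng = np.random.default_rng(0)
z_night_sky, z_dog = rng.standard_normal(768), rng.standard_normal(768)  # stand-in embeddings
blends = [slerp(z_night_sky, z_dog, t) for t in np.linspace(0.0, 1.0, 5)]
# Decoding each blended embedding with the diffusion decoder would yield the
# continuous transition between the two pictures described above.
```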


This capability also provides some robustness against typographic attacks, in which text overlaid on an object causes the CLIP model to predict the object described by the text rather than the object depicted in the image. For example, an apple with "iPod" written on it can be misclassified as an iPod.


In the new model, the decoder still generates pictures of an apple with high probability, and even though the relative predicted probability of the "iPod" caption is very high, the model never produces pictures of an iPod. Another possibility is to probe the structure of the CLIP latent space itself.

The researchers also took the CLIP image embeddings of a small number of source images, reconstructed them with an increasing number of PCA dimensions, and then visualized the reconstructed embeddings with the decoder using DDIM sampling on a fixed seed. This makes it possible to see what semantic information the different dimensions encode.


It can be observed that the early PCA dimensions preserve coarse-grained semantic information, such as the types of objects in the scene, while the later PCA dimensions encode finer-grained details such as the shapes and specific forms of the objects. For example, in the first scene, the early dimensions seem to encode the presence of food and containers, while the later dimensions encode more specific things like tomatoes and bottles.
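The PCA probe can be sketched as follows. Random vectors stand in for real CLIP image embeddings, and the number of retained components is an assumption; in the paper, each reconstructed embedding would then be rendered with the decoder on a fixed DDIM seed.

```python
# Sketch of the PCA probe: reconstruct a CLIP embedding from increasingly many components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 768))   # stand-in CLIP image embeddings
source = embeddings[0]

pca = PCA(n_components=320).fit(embeddings)     # assumed number of retained dimensions

def reconstruct(z: np.ndarray, n_dims: int) -> np.ndarray:
    """Project z onto the first n_dims principal components and map back."""
    coords = pca.transform(z[None, :])[0]
    coords[n_dims:] = 0.0                        # drop everything beyond the first n_dims
    return pca.inverse_transform(coords[None, :])[0]

for n_dims in (8, 32, 128, 320):
    z_rec = reconstruct(source, n_dims)
    err = np.linalg.norm(source - z_rec)
    print(f"{n_dims:3d} dims kept -> reconstruction error {err:.2f}")
```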

For comparisons on MS-COCO, it has become standard practice to report FID on the MS-COCO validation set as the evaluation metric.

Like GLIDE and DALL·E, unCLIP is not trained directly on the MS-COCO training set, yet it still generalizes zero-shot to the MS-COCO validation set.

The experimental results show that, compared with other zero-shot models, unCLIP achieves a new best zero-shot FID of 10.39 when sampling with the diffusion prior.
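For reference, FID measures the distance between the Inception-feature distributions of real and generated images: FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^(1/2)). A minimal implementation of this formula is sketched below; the random arrays stand in for actual Inception-v3 activations of validation and generated images.

```python
# Minimal FID computation from two sets of features (stand-ins for Inception activations).
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    covmean = covmean.real                                  # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.standard_normal((2048, 64))                      # stand-in features of real images
generated = real + 0.1 * rng.standard_normal((2048, 64))    # stand-in features of generated images
print(f"FID: {fid(real, generated):.2f}")
```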


A qualitative comparison of unCLIP with various recent text-conditional image generation models on several MS-COCO captions shows that, like the other methods, unCLIP produces realistic scenes that match the text prompts.


Overall, generating images from image representations in DALL·E 2 significantly increases image diversity with minimal loss in fidelity and caption similarity.

The decoder proposed in the paper, conditioned on image representations, can also produce variations of an image that preserve its semantics and style while varying the non-essential details absent from the image representation.

Comparative experiments with autoregressive and diffusion models show that the diffusion models are computationally more efficient and produce higher-quality samples.

Resources:

https://openai.com/dall-e-2/
