ByteDance's latest text-to-image AI: not a single captioned image in its training set?!

No text-image pair data at all, yet the AI still learns to read text and generate pictures?

ByteDance's latest text2image model does just that.

Experimental results show that its images are more realistic than VQGAN-CLIP's, and its generalization ability in particular is far better than that of many models trained on large amounts of text-image pair data.

Huh? Without any text annotations, how does the AI know what each picture represents?

How exactly was this model trained?

Generating images from text, without training on text

First, the authors say they chose this approach because collecting a large dataset of images paired with text is expensive.

Once the need for text-image pair data is removed, powerful and general text2image generators can be trained directly on large text-free image datasets such as ImageNet.

ByteDance's implementation of this idea is called CLIP-GEN. How exactly does it work?

There are three major steps.

First, for an image without a text label, CLIP's image encoder is used to extract the image's embedding in the joint language-vision embedding space.
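
To make this first step concrete, here is a minimal sketch using the open-source CLIP package (github.com/openai/CLIP); the exact encoder and checkpoint ByteDance used are not specified here, so treat those details as assumptions:

```python
# Sketch of step 1, assuming the open-source CLIP package
# (https://github.com/openai/CLIP); the paper's exact encoder may differ.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# No caption is needed: the image alone is mapped into CLIP's
# joint language-vision embedding space.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_emb = model.encode_image(image)                         # [1, 512]
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)  # unit-normalized
```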

Next, the image is converted into a series of discrete tokens in the VQGAN codebook space.

That is, the image is represented in the same form as natural language, which makes subsequent processing with a Transformer convenient.

The VQGAN model that acts as the image tokenizer can itself be trained on the unlabeled image dataset at hand.
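
The core of this tokenization is a nearest-neighbor lookup against the VQGAN codebook. The sketch below illustrates only that quantization step, with made-up shapes and codebook size; the real VQGAN also has a convolutional encoder/decoder and is trained with perceptual and adversarial losses, which are omitted:

```python
# Illustrative sketch of step 2: vector quantization against a learned codebook,
# as in VQGAN. The 16x16 grid, 256-dim codes, and 1024-entry codebook are
# assumptions for illustration only.
import torch

codebook = torch.randn(1024, 256)   # [num_codes, code_dim]; learned in practice
z = torch.randn(1, 16, 16, 256)     # encoder output for one image

# Each spatial position is replaced by the id of its nearest codebook entry,
# turning the image into a "sentence" of discrete tokens.
flat = z.reshape(-1, 256)                               # [256, code_dim]
dists = torch.cdist(flat, codebook)                     # [256, num_codes]
token_ids = dists.argmin(dim=-1).reshape(1, 16 * 16)    # [1, 256] token ids

# The quantized vectors (what the decoder reconstructs from) are the selected codes.
z_q = codebook[token_ids]                               # [1, 256, code_dim]
```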

Finally, an autoregressive Transformer is trained to map from the unified language-vision representation (the CLIP embedding) to the image tokens.
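
One way to set this up, sketched below, is to project the CLIP embedding into a prefix token and train a causal Transformer to predict the image tokens that follow it. The prefix-conditioning scheme and all sizes here are illustrative assumptions, not the paper's exact architecture:

```python
# Schematic sketch of step 3: an autoregressive Transformer models
# p(image tokens | CLIP embedding). Prefix conditioning and all sizes here
# are illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class ConditionalImageTokenTransformer(nn.Module):
    def __init__(self, vocab_size=1024, clip_dim=512, d_model=512,
                 n_layers=6, n_heads=8, seq_len=256):
        super().__init__()
        self.cond_proj = nn.Linear(clip_dim, d_model)     # CLIP embedding -> prefix token
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # image-token embeddings
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_emb, image_tokens):
        # Sequence = [CLIP prefix, token_0, ..., token_{T-1}]; position i predicts token i.
        prefix = self.cond_proj(clip_emb).unsqueeze(1)              # [B, 1, D]
        x = torch.cat([prefix, self.tok_emb(image_tokens)], dim=1)  # [B, T+1, D]
        x = x + self.pos_emb[:, : x.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(x, mask=mask)                               # causal self-attention
        return self.head(h)                                         # [B, T+1, vocab_size]

# Training step on unlabeled images: the "text" side never appears.
model = ConditionalImageTokenTransformer()
clip_emb = torch.randn(2, 512)               # stand-in for step-1 image embeddings
tokens = torch.randint(0, 1024, (2, 256))    # stand-in for step-2 VQGAN tokens
logits = model(clip_emb, tokens)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1024), tokens.reshape(-1))
```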

After such training, given a text description, the Transformer can generate the corresponding image tokens from the text embedding extracted by CLIP's text encoder. Because CLIP maps text and images into the same joint embedding space, the text embedding can stand in for the image embeddings the Transformer was conditioned on during training.
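
At inference time, the only change is the conditioning source: the CLIP text encoder produces the embedding instead of the image encoder. Continuing the illustrative model from the previous sketch (prompt, sizes, and sampling choices are again assumptions):

```python
# Inference sketch: text replaces the image as the conditioning signal.
# `model` is the ConditionalImageTokenTransformer from the previous sketch.
import torch
import clip

clip_model, _ = clip.load("ViT-B/32", device="cpu")
text = clip.tokenize(["a flying penguin"])
with torch.no_grad():
    text_emb = clip_model.encode_text(text).float()               # [1, 512], same space as image embeddings
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    tokens = torch.zeros(1, 0, dtype=torch.long)  # start with an empty token sequence
    for _ in range(256):                          # fill a 16x16 grid, one token at a time
        logits = model(text_emb, tokens)          # [1, len+1, vocab_size]
        probs = logits[:, -1].softmax(dim=-1)     # distribution over the next image token
        next_tok = torch.multinomial(probs, 1)    # sample it
        tokens = torch.cat([tokens, next_tok], dim=1)

# `tokens` is now a full grid of codebook ids; the VQGAN decoder (not shown)
# maps them back to pixels.
```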

So how well does this text-to-image generator, trained with no text data at any stage, actually perform?

Performance comparable to Tsinghua's CogView

The authors trained and evaluated CLIP-GEN on the ImageNet and MS-COCO datasets, respectively.

First, samples are generated from six text descriptions in the MS-COCO validation set.

The results of CLIP-GEN are compared below with text2image models trained on text-image pairs:

Among them, VQGAN-CLIP's results are relatively unrealistic and show severe shape distortion.

CogView from Tsinghua claims to beat DALL-E; in this experiment it can indeed produce good image structure, but it almost always has problems with texture details.

DF-GAN can produce reasonable images with rich detail, but it is also prone to local artifacts.

The authors argue that CLIP-GEN's images are more detailed and of higher quality than those of these comparison models; for example, it renders the "reflection in water" required in the second text well (though it does not quite grasp the numerical concept in "three stuffed bears").

The quantitative experimental results basically support this conclusion:

CLIP-GEN achieved the best FID-0 and FID-1 scores; its CapS score (a measure of semantic similarity between the input text and the generated image) was much higher than those of the other models, and only about 4% lower than CogView's.

In addition, the authors found that CLIP-GEN's generalization ability appears to be strong.

For a set of unconventional text descriptions such as "a flying penguin", "a dog smoking a cigar", and "a lemon with a face and hair", CLIP-GEN can basically produce the corresponding images, while the other models struggle to understand them.

About the authors

The five authors of this model are all from ByteDance.

The first author is Wang Zihao.

The corresponding author, Yi Zili, earned his bachelor's degree from Nanjing University and his Ph.D. from Memorial University of Newfoundland in Canada. He is currently an artificial intelligence expert at ByteDance (mainly researching multimodality, super-resolution, and face effects), and previously worked at Huawei.
