
Understand in one go how AI painting generates images!

Author: Everyone is a Product Manager

Why can we type a single sentence and have an AI produce a painting? How exactly does AI painting generate an image? In this article, the author breaks the AI painting process down into five core questions; understand these five, and you will understand how AI painting works. Let's take a look.

A few words up front

Some time ago, while chatting with my manager about AIGC, the topic of how AI painting works came up. I had always known only that AI "paints by removing noise", but many of the details were fuzzy. I was curious: why can I type a single sentence, have the AI understand it, and get back a genuinely decent painting?

So I spent the weekend digging through materials and asked friends in academia to recommend some papers. I now have a rough understanding of how AI painting works, and I'd like to share it with you.

PS: This article explains the principles of AI painting in plain language, with hand-drawn illustrations and flowcharts as schematics. Many algorithmic details are omitted, so it is best suited for general AIGC enthusiasts who want a conceptual understanding.

I break the whole AI painting process down into five core questions. Understand these five, and AI painting becomes clear:

  1. I only typed one sentence; how does the AI know what that sentence describes?
  2. Where does the noise image mentioned in the AI painting process come from?
  3. Even with a noise image, how is the "mosaic" removed from it little by little?
  4. How does the AI know which "mosaic" is unwanted, so that what remains matches the prompt?
  5. Even once there is a final result, why does repainting give a different image?

Don't worry. Let's first look at the overall drawing process of AI painting, and it will become clear what these five questions are asking.

An overview first: the drawing process of AI painting

AI painting has developed at breakneck speed. The best-known example is last year's "Théâtre D'opéra Spatial" ("Space Opera Theater"), which took first prize in the Colorado State Fair's art competition and caused quite a shock at the time.

That is because only a few years earlier, AI painting looked like this (in 2012, Andrew Ng and his team used 16,000 processors and a huge pile of images, running for 3 days, just to produce a blurry picture of a cat...):


Now look at what AI painting can do today (an ordinary person types a sentence and gets a finished work within seconds):

(Figure source: https://liblib.ai/)

As you can see, the images are high-resolution, finely detailed and strikingly realistic; some can even rival a photographer's work.

So how did AI get this good, able to paint such polished work from a single sentence and a few parameters?

Conclusion first. The principle of AI painting is:

Remove the mosaic, and you can see the picture clearly.

In fact, years ago some adult websites had a similar "de-mosaic" technology, but that was 1-to-1 restoration. AI painting is essentially 1-to-N restoration: the core idea is to erase the mosaic bit by bit until the underlying image "shows through", completing what we call "AI painting".

(Figure: Designed by Liunn)

Let's first look at how AI painting is used. In virtually all software and models, the first step is to have the user enter keywords describing the painting, i.e. the Prompt.

Take the Diffusion Model illustration below. The rightmost image is a normal picture; moving from right to left it gets blurrier and blurrier, until you can no longer tell what it is. This is the algorithm stacking noise onto the image.

You can think of it as repeatedly mosaicking the picture. This is the famous "diffusion" process.

(Figure source: https://mp.weixin.qq.com/s/ZtsFscz7lbwq_An_1mo0gg)

To use an analogy: think of posting a photo to your Moments and wanting to hide something in it, so you use the "Edit" function to keep smearing over certain areas until the original content can no longer be seen.

Each noising step depends only on the previous state, so the process is essentially a Markov chain (simply put, a memoryless random process; Google it for details).

Reverse the process and go from left to right instead, and you get a procedure that removes noise step by step, making the picture gradually clearer.

In other words, the mosaic on your Moments photo fades away bit by bit. That reverse process is the principle of the Diffusion Model.
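To make the forward half concrete, here is a minimal sketch in plain Python/NumPy (the linear beta schedule and step count are illustrative assumptions): each step mixes the previous image with a little Gaussian noise, and each step depends only on the one before it, which is exactly the Markov-chain property mentioned above.

```python
import numpy as np

def forward_diffusion(x0, num_steps=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Gradually add Gaussian noise to an image x0 (values roughly in [0, 1]).

    Returns the noisy images x_1 ... x_T. Each x_t depends only on x_{t-1},
    i.e. the process is a Markov chain.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, num_steps)  # noise schedule
    x = x0
    trajectory = []
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta * I)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

image = np.full((64, 64, 3), 0.5)         # a toy "photo"
steps = forward_diffusion(image)
print(steps[-1].mean(), steps[-1].std())  # after enough steps: roughly mean 0, std 1
```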

OK, now that the overall process and principle are clear, let's go through the five core questions one by one.

The first question: how the text is understood

How does the AI know what the text you typed is describing?

By the principle above, the picture is produced by erasing mosaic from a noisy image bit by bit. But how does the text I wrote get matched to a particular "mosaicked" picture in the first place?

The most common way to use AI painting is to type a sentence (the Prompt) into the model or software. You can describe the subject, background, characters, style, parameters and so on, hit send, and get back an image.

For example, "a chicken in suspenders playing ball" gives something like this:

(Figure: Designed by Liunn)

Under the hood, AI painting is also driven by a large model, in this case an image model.

In the earliest approaches, text control meant having the model generate a pile of candidate images and letting a classifier pick the one that best matched the text. This works reasonably well; its one fatal flaw is that it collapses once the data volume gets large enough (imagine using Excel to process tens of billions of rows: the burden is enormous).

So what was needed was something that could be trained on huge amounts of image data and still process it quickly and efficiently. That something is CLIP, released by OpenAI in 2021 (OpenCLIP, linked at the end of the article, is its open-source reimplementation).

CLIP's working principle can be loosely summarized as: a crawler plus text-image information pairs.

First, let's look at CLIP's data collection and database.

One of CLIP's biggest highlights is the sheer amount of data it uses, forming a huge training database.

Every image collected for CLIP comes with a label or description (in practice, CLIP was trained on images scraped from the web together with their "alt" text).

(Figure source: Jay Alammar's blog, https://jalammar.github.io/illustrated-stable-diffusion/)

This information is then encoded into a 768-dimensional vector (think of it as describing the image from 768 different perspectives).

A multidimensional database is then built from these vectors, in which every dimension interacts with the others.

Similar items end up close together in this space. CLIP keeps collecting data this way, eventually building a dataset of roughly 400-500 million text-image pairs.

(Figure: Designed by Liunn)

Second, let's look at CLIP's text-image matching ability.

OK, we have the database. How do the images in it get matched to the text you enter? There are two steps:

Step 01: how the model acquires the ability to match text and images.

First look at the figure below, the original schematic of the algorithm. It doesn't matter if you can't follow it; I redrew a simplified version underneath.

(Figure source: https://github.com/openai/CLIP)

Now look at the following diagram of how CLIP learns the association between text and images.

(Figure: Designed by Liunn, adapted from https://jalammar.github.io/illustrated-stable-diffusion/)

This is a simplified picture of the algorithm. The essence is to train CLIP on a huge amount of data to associate and recognize images and text, constantly correcting it against the ground-truth pairings, until it can accurately match keywords to image feature vectors.

Step 02: how text-image matching is actually performed.

OK, let's see how CLIP does text-image matching in practice.

When we start drawing, we enter a text description (the Prompt). The CLIP text encoder turns the Prompt into a 768-dimensional feature vector, and a similarity matrix is computed between the encoded text features and the encoded image features.

Then, under the constraint of maximizing the diagonal elements (matching pairs) while minimizing the off-diagonal elements (mismatched pairs), the encoders are optimized again and again, until the text encoder and the image encoder produce strongly aligned semantics.
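Here is a minimal PyTorch-style sketch of that "maximize the diagonal, minimize the off-diagonal" idea, i.e. the symmetric contrastive loss that CLIP-style models are trained with (the 768-dimensional embeddings follow the text above; the temperature value is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: (batch, 768) outputs of the two encoders.
    Matching image/text pairs sit at the same row index."""
    # Normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.shape[0])

    # Cross-entropy pushes diagonal similarities up and off-diagonal ones down,
    # in both the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy example with random "embeddings":
img = torch.randn(8, 768)
txt = torch.randn(8, 768)
print(clip_contrastive_loss(img, txt).item())
```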

(Figure: Designed by Liunn)

Finally, once the most similar dimension descriptions have been found, all these image features are fused together to build the full set of image feature vectors for the picture to be generated.

At this point, the paragraph you typed has been converted into all the feature vectors needed to generate this image, which is what we mean by the AI "understanding what kind of picture you want to draw".

This leap alone counts as a "one small step" moment for the AI world.

With this innovation, CLIP essentially closes the gap between text and images, building a bridge between the two, and the manual labeling methods the image-processing industry used to rely on are no longer needed.

The second question: where the original noise image comes from

As mentioned above, AI painting erases the "mosaic" bit by bit. So where does the so-called "mosaic" image, i.e. the noise image, come from?

The noise image is produced by the diffusion model. Hold on to that term, "diffusion model".

Before talking about diffusion models, we need one more concept. The process of AI generating pictures belongs to a branch of artificial intelligence called generative models.

A generative model's job is to generate images: feed it a large number of real pictures, let the AI keep understanding, recognizing and learning from them, and then have it generate pictures on its own based on what it has learned.

Among generative models there is something called the autoencoder, which has two parts: an encoder and a decoder.

The encoder can compress a relatively large amount of data into a much smaller amount, on the premise that this small amount can still represent the original large amount;

The decoder can, under the right conditions, restore that small amount of data back to the original large amount.

Here is where it gets interesting:

Could you hand it a small amount of data directly and see what kind of large data it randomly expands it into?

(Figure: Designed by Liunn)

The answer is yes, but the results are mediocre.

So the plain autoencoder is not enough. What then? Scientists invented another thing, the VAE (Variational Auto-Encoder).

What the VAE mainly does is regularize that small amount of data (the latent code) so that it follows a Gaussian distribution.

That way, you can nudge the latent code along that distribution and the picture changes accordingly. But there is a problem: this leans heavily on probability, and the assumed distribution is an idealized one. So what next?
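For intuition, here is a minimal VAE sketch in PyTorch (the layer sizes and 16-dimensional latent are toy values chosen for illustration): the encoder predicts a Gaussian over the latent code, we sample from it with the reparameterization trick, and the decoder reconstructs the image. New images come from decoding latents drawn straight from that Gaussian.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """A toy VAE: 784-pixel image <-> 16-dimensional Gaussian latent."""
    def __init__(self, image_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of the latent Gaussian
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, image_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

vae = TinyVAE()
# "Generation": decode a latent drawn from the standard Gaussian prior.
new_image = vae.decoder(torch.randn(1, 16))
print(new_image.shape)  # torch.Size([1, 784])
```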

So scientists wondered: what if we built two AIs, one responsible for generating and one responsible for judging whether the result looks generated or real, i.e. two AIs checking each other? That is the GAN, and the generative adversarial network was born.

The GAN generates images on one side and inspects them on the other. For example, when some details are not generated as required, the discriminator notices and the generator keeps strengthening that part, iterating an enormous number of times until the discriminator is satisfied, at which point an AI image has been generated.

But then a new problem appears.

A GAN is simply too busy: it has to be the athlete on one side and the referee on the other. That consumes a lot of compute, and it is also error-prone and unstable. So what to do? Could the AI be less convoluted and finish the job in one pipeline?

The answer is yes, and this is where we cross from that earlier generation of generative models into the era of diffusion models.

Back to the diffusion model.

The diffusion model was first proposed in a 2015 paper by researchers from Stanford and Berkeley, in which noise drawn from a normal distribution is gradually added to the image; the 2020 DDPM paper refined this noising process, and follow-up work replaced the schedule with a cosine-based one. (Links to the original 2015 and 2020 papers are at the end of the article if you want to read them.)

Gradually diffusing the original image forward according to this schedule is like dismantling a finished jigsaw puzzle piece by piece until it is completely scrambled.
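As a concrete illustration, here is the cosine noise schedule from the "Improved DDPM" follow-up work (a sketch; the offset s = 0.008 is the value reported in that paper). It specifies how much of the original signal, alpha_bar(t), survives after t noising steps:

```python
import numpy as np

def cosine_alpha_bar(num_steps=1000, s=0.008):
    """Cosine schedule: fraction of the original signal remaining
    after each of num_steps noising steps."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalize so alpha_bar(0) = 1

alpha_bar = cosine_alpha_bar()
print(alpha_bar[0], alpha_bar[500], alpha_bar[-1])  # 1.0 -> ~0.5 -> ~0.0
```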

(Figure: Designed by Liunn)

At this point the second question is answered, and the inputs for AI painting are basically ready.

The third question: how the model removes noise

The AI has turned the text into feature vectors and obtained a noise image. But how is the "mosaic" removed from the noise image bit by bit?

It removes the mosaic in two steps:

Step 1: reduce the dimensionality of the data to improve computational efficiency;

Step 2: design a denoising network that identifies the unwanted noise and removes it precisely.

Step 1 first. Remember the autoencoder mentioned above?

The image feature vectors and the noise image are fed in together for denoising, i.e. the mosaic-removal process.

But there is a problem: a 512×512 RGB image means 512×512×3 = 786,432 numbers to process, which is far too much computation.

So before this data enters the denoising network, it is compressed into the latent space, shrinking it to 64×64×4 = 16,384 numbers (if you have used Stable Diffusion, you may have noticed that the image size slider cannot go below 64px; this is why).
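A quick back-of-the-envelope check of that compression (the 8x spatial downsampling and 4 latent channels follow the numbers quoted above; treat them as illustrative):

```python
# Pixel space: a 512x512 RGB image.
pixel_values = 512 * 512 * 3                   # 786,432 numbers
# Latent space: downsampled 8x in each spatial dimension, 4 channels.
latent_values = (512 // 8) * (512 // 8) * 4    # 64 * 64 * 4 = 16,384 numbers

print(pixel_values, latent_values, pixel_values / latent_values)  # 786432 16384 48.0
```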

(Figure: Designed by Liunn, adapted from https://jalammar.github.io/illustrated-stable-diffusion/)

This brings the whole text-to-image task down to the level of a consumer-grade GPU (compute is still a bottleneck, of course; you don't have a spare A100 either, right? If you do, message me!).

The barrier to running this stuff drops, and the efficiency of computation and setup improves enormously.

Now step 2: designing the denoising network.

With dimensionality reduction out of the way, let's continue: how does the AI gradually remove noise to generate a new image? How does the denoising network turn a noise image into a new picture?

(Figure: Designed by Liunn)

Regarding what to predict when denoising, the 2020 DDPM paper discusses three options:

  1. The mean of the noise: predict the mean of the noise at each time step.
  2. The original image: predict the original image directly, in one shot.
  3. The noise in the image: predict the noise contained in the current image directly, to obtain an image with less noise.

Most models today use the third option.

So how is this denoising network designed?

This is mainly the job of the U-Net (a convolutional neural network originally built for image segmentation) inside the denoiser.

(Figure source: https://jalammar.github.io/illustrated-stable-diffusion/)

U-Net is a funnel-shaped network similar to an encoder-decoder (left side of the figure above). The difference is that U-Net adds skip connections between the encoding and decoding layers at the same level (imagine two buildings with a footbridge between every pair of matching floors).

The benefit is that, while processing the image, information at the same resolution can be passed quickly and easily between the encoding and decoding paths.

So how does it work?

As we said earlier, following DDPM, basically all models today directly predict the noise in the image in order to obtain a less noisy picture.

The same goes for U-Net.

U-Net takes the full set of image feature vectors described in the first section, samples part of them from the set, and uses them to identify the unwanted noise.

It then subtracts the currently predicted noise from the current noisy image (the real computation is more involved than a plain subtraction), obtaining an image with slightly less noise than before. It takes this image, repeats the process, samples feature vectors again, predicts the noise again, and subtracts again, each round feeding the next.

This repeats until the image is clear, with no noise left, or until no more unwanted noise can be identified, and a suitable image finally comes out.
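Here is a heavily simplified sketch of that reverse loop (the `unet` and the text embedding are stand-ins; a real DDPM or Stable Diffusion sampler also rescales each update by the noise schedule and mixes in a little fresh noise per step):

```python
import torch

@torch.no_grad()
def denoise_loop(unet, text_embedding, steps=50, latent_shape=(1, 4, 64, 64)):
    """Start from pure noise in latent space and repeatedly remove the
    noise predicted by the U-Net, conditioned on the text embedding."""
    latent = torch.randn(latent_shape)  # the initial "mosaic"
    for t in reversed(range(steps)):
        predicted_noise = unet(latent, t, text_embedding)
        # Simplified update: peel off a fraction of the predicted noise.
        # (Real samplers weight this by the noise schedule at step t.)
        latent = latent - predicted_noise / steps
    return latent  # a VAE decoder then maps the latent back to pixel space
```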

(Figure: Designed by Liunn, adapted from https://jalammar.github.io/illustrated-stable-diffusion/)

Some readers will have noticed that a sampling method is involved here as well.

Each sampling step can use the same sampler or a different one. Different sampling methods extract feature vectors of different dimensions, characteristics and scales, and they really do affect the final output (this is one of the factors behind the limited controllability of AI painting).

Finally, remember the dimensionality reduction mentioned just now?

Dimensionality reduction exists to cut the amount of computation and speed things up; after the reduction we are working in a latent space. Once the image has been fully denoised, it is restored by the image decompressor, i.e. the VAE decoder, back into pixel space (think of photos stored in iCloud on an iPhone: you see the thumbnail first, and only when you tap to view the full image does it download from the cloud and turn high-definition).

That, in brief, is how the denoising network removes noise.

The fourth question: which noise should be removed

How does the AI remove exactly the mosaic I described, instead of drawing a "cat" when I wrote "dog"?

In other words, how does the U-Net model learn which noise should be removed? This comes down to model training.

Before explaining model training, a few concepts:

  • Training set: a data collection used to let the AI keep learning and correcting its mistakes so it can improve; think of the basketball coach taking you to practice on the training court.
  • Reinforcement learning: when the AI gets something wrong, tell it it's wrong; when it gets it right, tell it it's right. Think of the basketball coach constantly correcting your shooting form so you train faster and get stronger.
  • Test set: after training on the training set for a while, a dataset used to check how good the AI has become; think of organizing a friendly match after half a year of basketball practice.

Let's first look at how U-Net's training set is built, in four main steps:

  1. Randomly select a photo from the image dataset;
  2. Generate noise of different intensities, arranged in order;
  3. Randomly select a noise intensity;
  4. Add that noise to the photo.
(Figure: Designed by Liunn, adapted from https://jalammar.github.io/illustrated-stable-diffusion/)

Now let's look at how U-Net uses it.

U-Net's training set is a large library of images overlaid with random noise, which you can think of as a big pile of mosaicked pictures (the basketball training court). The AI keeps pulling images from this library, tries to erase the noise on its own, and then compares its result with the original picture to see how big the gap is.


It measures that gap, pulls another image from the library, and tries to erase the noise again (the "reinforcement learning" in the analogy above), looping countless times. The end result is that no matter which image is picked at random, even from a brand-new noisy image library (the test set), the image the AI produces after erasing the noise is very close to the original (similar in style, not necessarily identical, which is also part of why the AI's output differs every time).
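Taken together, the four-step construction above plus this compare-and-correct loop is essentially the DDPM training objective. A minimal sketch (the `unet`, the batch of images and the noise schedule are placeholders):

```python
import torch
import torch.nn.functional as F

def training_step(unet, images, alpha_bar):
    """One DDPM-style training step.

    images:    (batch, 4, 64, 64) latents of real pictures
    alpha_bar: (T,) cumulative signal fractions from the noise schedule
    """
    batch = images.shape[0]
    # Steps 1-3: pick a random noise intensity (timestep) for each image.
    t = torch.randint(0, alpha_bar.shape[0], (batch,))
    noise = torch.randn_like(images)
    a = alpha_bar[t].view(batch, 1, 1, 1)

    # Step 4: add that noise to the picture.
    noisy = a.sqrt() * images + (1 - a).sqrt() * noise

    # The network tries to "erase" the noise by predicting it;
    # the loss is the gap between its guess and the true noise.
    predicted = unet(noisy, t)
    return F.mse_loss(predicted, noise)
```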

Once it passes that test, the model is ready (it can go live).

That is how U-Net learns to identify and remove the unwanted noise.

The fifth question (stability control): how to control the output

Anyone who plays with AI painting regularly will find that the most frustrating thing about today's large models is their instability.

So if you want at least some control over the result, are there good ways to get it?

Here are four approaches for reference.

First: adjust the Prompt (i.e. change the description; essentially, change the CLIP features of the image).

By entering different descriptions, or changing parts of the Prompt, you guide the model step by step toward different outputs. In essence you are changing the set of image feature vectors matched via CLIP, so the output keeps being adjusted and refined. (There are some semi-mystical tricks here, e.g. naming or labeling parts of the Prompt can also buy you stability; in essence this marks out part of the Prompt's structure and makes it easier for the algorithm to pick up...)
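For example, with the open-source diffusers library the whole pipeline above collapses into a few lines, and changing nothing but the prompt string (with the random seed fixed) is the simplest form of control. A sketch; the model name is one commonly used checkpoint, not something prescribed by this article:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Same seed, different prompt -> different text embedding, different image.
generator = torch.Generator("cuda").manual_seed(42)
image_a = pipe("a chicken in suspenders playing basketball, photo",
               generator=generator).images[0]

generator = torch.Generator("cuda").manual_seed(42)
image_b = pipe("a chicken in suspenders playing basketball, watercolor",
               generator=generator).images[0]

image_a.save("photo.png")
image_b.save("watercolor.png")
```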

Second: image-to-image (commonly known as img2img; essentially, adding noise to a reference image).

Most mainstream AI painting software and models now support img2img: you upload a picture, and another picture is generated that follows the outline or rough style of yours.

In essence, several layers of noise are superimposed on the image you upload, this noised image replaces the random starting noise, and the AI denoises from there; the rest of the pipeline is unchanged, which is why the final style and structure resemble the original.

It is worth mentioning that many WebUIs also let you choose how similar to the original image to stay. On the algorithm side this simply asks how many layers of noise to superimpose: the less noise added, the more similar the result is to the original, and vice versa (though it is still a matter of probability; occasionally a heavily-noised run comes out closer to the original than a lightly-noised one).
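Conceptually, that "similarity" slider maps to how far along the forward-diffusion schedule the reference image is pushed before denoising begins. A rough sketch reusing the pieces above (the names and the strength value are placeholders):

```python
import torch

def img2img_start(reference_latent, alpha_bar, strength=0.6):
    """Noise a reference image's latent part-way; denoising then starts from there.

    strength in [0, 1]: near 0 keeps the reference almost untouched (output close
    to the original), near 1 noises it almost to pure noise (output barely related).
    """
    T = alpha_bar.shape[0]
    t_start = int(strength * (T - 1))   # how many noising steps to apply
    a = alpha_bar[t_start]
    noise = torch.randn_like(reference_latent)
    noisy_latent = a.sqrt() * reference_latent + (1 - a).sqrt() * noise
    return noisy_latent, t_start        # denoise from step t_start down to 0
```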

Third: plugins (assisted control through third-party plugins and tools; essentially, trained auxiliary models).

Take the most typical and classic one, ControlNet: you can steer the generated result with almost any condition or reference, essentially "point at an effect and get that effect".

In essence, you can think of it as training an auxiliary model on conditioning images so that generation follows the effect you want.

It makes a complete copy of the denoising model and runs the two in parallel, one doing normal denoising and one doing condition-guided denoising, and finally merges them to achieve stable control.
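Very roughly, the merge can be pictured as adding the conditional copy's contribution onto the original model's prediction. This is a conceptual sketch only; the real ControlNet copies the U-Net's encoder blocks and injects their feature maps into the decoder through zero-initialized convolutions rather than summing final outputs:

```python
import copy
import torch.nn as nn

class ControlledDenoiser(nn.Module):
    """Conceptual sketch: frozen original U-Net + trainable conditional copy."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet                         # original model, kept frozen
        self.control_copy = copy.deepcopy(unet)  # trainable copy that sees the condition
        for p in self.unet.parameters():
            p.requires_grad = False

    def forward(self, latent, t, text_emb, condition):
        base = self.unet(latent, t, text_emb)
        # The copy receives the extra condition (e.g. an edge map) with its input.
        control = self.control_copy(latent + condition, t, text_emb)
        return base + control                    # merge the two branches
```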


Fourth: train your own model (train it separately on a large amount of data; essentially, Finetune).

This needs little explanation: you gather a lot of pictures, build your own image library, keep training the large model to recognize them, and in the end a word or two is enough for the model to recognize and generate similar images, giving you a fine-tuned small model of your own.

Note: fine-tuning requires care about scope and intensity, and you should set up proper evaluation metrics on a test set. Otherwise, training too long will overfit the small sample and lose the large model's generalization, which can do more harm than good. (There are remedies, such as replaying data so the large model relearns, regularizing the model, or running a parallel model; details omitted.)

Congratulations. If you have read this far, you should now have a good grasp of the ins and outs of AI painting. Because the algorithms have been flattened into plain language, many details were omitted. Consider this a starting point: if anything is missing or off, you are welcome to reach out so we can learn from each other.

Now for the promised bonus, which I believe die-hard AIGC fans will like.

Bonus: 7 commonly used text-to-image datasets

COCO (COCO Captions)

COCO Captions is a captioning dataset of images from everyday scenes, aimed at scene understanding, with image descriptions written by human annotators. The dataset contains 330K image-text pairs.

Dataset download link: https://cocodataset.org/

Visual Genome

Visual Genome is a large-scale image semantic understanding dataset released by Fei-Fei Li's group in 2016, including images and question-answering data. Its annotations are dense and semantically diverse. The dataset contains 5M image-text pairs.

Dataset download link: http://visualgenome.org/

Conceptual Captions (CC)

Conceptual Captions (CC) is a multimodal dataset without human annotation, containing image URLs and captions. The captions are filtered from the alt-text attributes of web pages. Depending on size, CC comes in two versions: CC3M (about 3.3 million image-text pairs) and CC12M (about 12 million image-text pairs).

Dataset download link: https://ai.google.com/research/ConceptualCaptions/

YFCC100M

YFCC100M is an image database released in 2014, based on Yahoo Flickr. It consists of 100 million media items generated between 2004 and 2014, including 99.2 million photos and 0.8 million videos. The YFCC100M dataset itself is a text document built on top of this database, where each row is the metadata of one photo or video.

Dataset download link: http://projects.dfki.uni-kl.de/yfcc100m/

ALT200M

ALT200M is a large-scale image-text dataset built by a Microsoft team to study scaling trends in captioning tasks. It contains 200M image-text pairs, with descriptions filtered from the alt-text attributes of web pages. (Private dataset, no download link.)

LAION-400M

LAION-400M collected text and images from web pages crawled by Common Crawl between 2014 and 2021, then used CLIP to filter out pairs whose image and text embeddings had a similarity below 0.3, ultimately retaining 400 million image-text pairs. However, LAION-400M contains a large number of disturbing images, which is a real problem for text-to-image work: many people have used the dataset to generate pornographic images with harmful effects. As a result, larger and cleaner datasets came into demand.

Dataset download link: https://laion.ai/blog/laion-400-open-dataset/

LAION-5B

LAION-5B is the largest known open-source multimodal dataset. It collects text and images via Common Crawl, then uses CLIP to filter out pairs with image-text embedding similarity below 0.28, finally retaining about 5 billion image-text pairs. The dataset contains 2.32 billion English descriptions, 2.26 billion in 100+ other languages, and 1.27 billion in unidentified languages.

Dataset download link: https://laion.ai/blog/laion-5b/

Finally, a few closing thoughts:

The progress of AIGC technology rests on breakthroughs in data, compute, algorithms and more.

But I think the most important factor is: open source.

Open source stands for openness, transparency, sharing, progressing together, and the hope of co-creation.

That includes CLIP mentioned above (OpenAI shared the model weights). Admittedly, some core national technologies cannot be open-sourced, but open-sourcing AI really does give researchers, scientists, scholars and even amateur enthusiasts the greatest possible amount of information and transparency.

On that foundation, rapid, healthy and diverse derivative development is enormously beneficial to the long-term, sustainable and healthy growth of the whole AI ecosystem.

Sharing is learning. In the new era of AI, sharing and transparency will always be the main theme.

Let's move past the mindset of sitting on a good idea and building behind closed doors, and create the AIGC environment and atmosphere together.

That way, when you board the Boeing, you don't have to worry about whether you are sitting in the front or the back, because you yourself are already speeding forward...

Some references and CLIP source material:

  • OpenCLIP on GitHub: https://github.com/mlfoundations/open_clip
  • The 2015 diffusion model paper: "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", https://arxiv.org/abs/1503.03585
  • The 2020 DDPM paper: "Denoising Diffusion Probabilistic Models", https://arxiv.org/abs/2006.11239
  • "High-Resolution Image Synthesis with Latent Diffusion Models": https://arxiv.org/abs/2112.10752
  • "Hierarchical Text-Conditional Image Generation with CLIP Latents": https://arxiv.org/pdf/2204.06125.pdf
  • "Adding Conditional Control to Text-to-Image Diffusion Models": https://arxiv.org/abs/2302.05543
  • Original source of the 7 datasets above: Integer Intelligence AI Research Institute, "Creating Art from Text: How AI Image Generators' Datasets Are Built"
  • Some drawing ideas referenced from the Amazon Web Services developer article "Generative AI New World | Entering the Field of Text-to-Image"
  • Some ideas referenced from the Tencent Cloud developer article "[Plain-Language Science] Understand the Principles of AI Painting from Zero in 10 Minutes"
  • [Popular science video] How did your text become a picture? https://v.douyin.com/iemGnE9L/
  • Plus blog posts from various bloggers.

Columnist

Nan Shen, WeChat official account: Sonic Nan Shen; columnist at Everyone is a Product Manager. Senior AI product manager at a major tech company and explorer of AIGC business models, with a long-term focus on opportunities in the AI industry; skilled at AI + industry solution design and at spotting AIGC trends and traffic.

This article was originally published on Everyone is a Product Manager. Reproduction without permission is prohibited.

The title image is from Unsplash, under the CC0 license.

The views in this article are the author's own. Everyone is a Product Manager provides only an information storage service.
