
Hima Practice: 5 minutes to understand how AI draws

Author: Flash Gene

Preface

As terms like "AI-generated images" and "AI covers" keep showing up in our feeds, in Moments, and even in PRDs, you may be curious about how AI actually draws what we ask for. This article covers only the main principles and workflow, not the details of day-to-day use, and is aimed at readers with no prior experience in AI image generation.

Quickly understand the differences between mainstream AI mapping tools

At present, the mainstream AI image generation tools are Stable Diffusion (hereinafter referred to as SD), Midjourney, and Dalle3.

Example 1:

我:"I have a dream when a was young"

Dalle3: "At that time, you longed for fantasy, you longed for the sky, and your dreams were full of colorful worlds, I think it was like this..."

Then it starts drawing:

[Image: Dalle3's generated result]

我:"I have a dream when a was young"

SD: "What the hell are you?"

Then it starts drawing:

[Image: SD's generated result]

As you can see, Dalle3 is not picky about input quality; it can even fill in abstract or fuzzy input on its own and end up with a decent image. SD, by contrast, has much higher requirements on the input: if you don't tell SD what to draw, it simply gives up.

Example 2:

我:"给你一个奇幻风格模型, 再给你一个写实人物Lora,再给你亿点点提示词In Casey Baugh's evocative style, art of a beautiful young girl cyborg with long brown hair, futuristic, scifi, intricate, elegant, highly detailed, majestic, Baugh's brushwork infuses the painting with a unique combination of realism and abstraction, greg rutkowski, surreal gold filigree, broken glass, (masterpiece, sidelighting, finely detailed beautiful eyes: 1.2), hdr, realistic painting, natural skin, textured skin, closed mouth, crystal eyes, butterfly filigree, chest armor, eye makeup, robot joints, long hair moved by the wind, window facing to another world, Baugh's distinctive style captures the essence of the girl's enigmatic nature, inviting viewers to explore the depths of her soul, award winning art,严格按照我的要求画一张图"

Dalle3: "No, I can't handle custom models, I can't handle Lora"

Then it starts drawing:

[Image: Dalle3's generated result]

我:"给你一个奇幻风格模型, 再给你一个写实人物Lora,再给你亿点点提示词In Casey Baugh's evocative style, art of a beautiful young girl cyborg with long brown hair, futuristic, scifi, intricate, elegant, highly detailed, majestic, Baugh's brushwork infuses the painting with a unique combination of realism and abstraction, greg rutkowski, surreal gold filigree, broken glass, (masterpiece, sidelighting, finely detailed beautiful eyes: 1.2), hdr, realistic painting, natural skin, textured skin, closed mouth, crystal eyes, butterfly filigree, chest armor, eye makeup, robot joints, long hair moved by the wind, window facing to another world, Baugh's distinctive style captures the essence of the girl's enigmatic nature, inviting viewers to explore the depths of her soul, award winning art,严格按照我的要求画一张图"

SD: "Understood!"

Then it starts drawing:

[Image: SD's generated result]

As you can see, Dalle3 does not support custom base models or Lora, while SD offers strong controllability and customization: in the example you can swap in a custom base model, specify a particular character Lora, and even dictate the expression and pose of the characters in the image.

Through the above examples, we can see that Dalle3 and SD differ clearly in the following aspects:

Tool   | Semantic understanding | Ease of getting started | Image control    | Cost
Dalle3 | Excellent              | Easy                    | Weak             | Paid
SD     | Limited                | Somewhat difficult      | Extremely strong | Free

In our creator ecosystem, the input for the AI cover scenario (generating covers from track titles or album titles) is mostly abstract, and there is no strong demand for fine-grained image control, so Dalle3, with its excellent semantic understanding, is the better fit for our scenario; our current AI covers are generated with Dalle3.

Understanding the image generation process

The example at the beginning gives a general impression of what AI image generation looks like; now for the main content of this article: how an AI image is actually generated. Since SD is open source and more customizable, this article uses SD as the example. It skips some obscure steps and principles that don't affect understanding (mainly because I don't understand them either), and focuses on making the process easy to grasp.

Basic Principles

Diffusion model

Diffusion models are a broad class of models designed to generate new data similar to the data they were trained on. For SD, that means generating new images similar to the training images.

For example:

Take a cat image as the training image. By continuously adding noise to it, we eventually obtain a pure-noise image, and this noising process is what the noise predictor learns.

[Image: a cat image gradually turning into noise]

When we need to generate a cat image, we can do the following:

1. Generate a completely random noise image as the original image.

2. Ask the noise predictor what noise is needed to produce the cat image (during training it learned how a cat image turns into random noise as noise is added, so it can predict the noise needed to reverse that process)

3. Subtract this predicted noise from the current image

4. Keep repeating, and you will get a cat picture.

[Image: random noise gradually denoised into a cat image]
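To make the loop above concrete, here is a minimal, purely illustrative Python sketch of the denoising idea. It is a toy under stated assumptions: `noise_predictor` is a placeholder standing in for the trained network (a U-Net in SD), and the fixed step size is a simplification of what real samplers such as DDPM or DDIM do with a proper noise schedule.

```python
import torch

def noise_predictor(noisy_image: torch.Tensor, step: int) -> torch.Tensor:
    # Placeholder: a trained model would predict the noise contained in the
    # image at this step. Returning zeros just lets the loop run end to end.
    return torch.zeros_like(noisy_image)

num_steps = 50
image = torch.randn(1, 3, 64, 64)                     # step 1: start from pure random noise

for step in reversed(range(num_steps)):               # step 4: repeat many times
    predicted_noise = noise_predictor(image, step)    # step 2: predict the noise
    image = image - predicted_noise / num_steps       # step 3: subtract a bit of it

# With a real, trained predictor, the noise would gradually turn into a cat image.
```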

The amount of computation required for diffusion in pixel space is hard to handle on an ordinary computer, so SD's diffusion no longer runs in pixel space but in a low-dimensional latent space. The direct consequence is that even the aging laptops in our hands can run AI image generation locally.

Word vectors (word embeddings)

To make the next part easier to follow, let's first introduce a concept: word vectors. Simply put, a word vector model converts natural-language text into vectors in a high-dimensional space, and the relationships between them can then be computed with cosine similarity or Euclidean distance.

For example, if we train a word vector model on the text of a novel (assuming it is long enough), then given the words "yellow croaker noodles", "Cadillac", and "pork rib rice cake", it will find that "yellow croaker noodles" and "pork rib rice cake" are closer to each other. During training we feed it enough text containing these words; it vectorizes each word and then, based on the positions and frequencies in which the words appear, works out how the words relate to one another in some sense. Although it doesn't know what the three words mean in human language, it does know their "relationships" to each other in vector space. So a word vector model can compute similarity.
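As a toy illustration of "closer in vector space", here is a small sketch with invented 4-dimensional vectors; real word embeddings have hundreds of dimensions learned from text, so the numbers below are purely hypothetical.

```python
import numpy as np

# Hypothetical embeddings; the numbers are invented for illustration only.
embeddings = {
    "yellow croaker noodles": np.array([0.9, 0.1, 0.8, 0.2]),
    "pork rib rice cake":     np.array([0.8, 0.2, 0.9, 0.1]),
    "Cadillac":               np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["yellow croaker noodles"],
                        embeddings["pork rib rice cake"]))   # high: the two dishes are "close"
print(cosine_similarity(embeddings["yellow croaker noodles"],
                        embeddings["Cadillac"]))             # low: the car is "far away"
```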

Another example: Liu Huan trained a word vector model on his diary and Weibo posts (assuming there are enough of them), so that when he types "I want", the model can predict that the next word is very likely "a holiday". So a word vector model can also predict probabilities.

In short, although a word vector model has no idea what the input text means in the human world, it can use vectors to determine what that content is in its own world, and even compute the relationships between the pieces of content.

CLIP

The discussion of word vectors above is really there to make CLIP easier to understand. A word vector model handles text-to-text matching (e.g. knowing that "dog" and "canine" probably refer to the same thing), while the CLIP model handles text-to-image matching (e.g. knowing that the word "dog" corresponds to a picture of a dog).

SD's CLIP contains a text encoder and an image encoder. The text encoder is responsible for splitting the text we enter into tokens and then vectorizing each token (embedding). The image encoder has already seen a huge number of samples during training, so the text vectors can be matched against images by cosine distance. The process looks like this:

[Image: CLIP matching text vectors against image vectors]
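If you want to try this matching yourself, below is a minimal sketch using the open-source CLIP weights via the Hugging Face transformers library; this is an illustrative assumption rather than SD's internal code (SD bundles its own CLIP text encoder), and cat.jpg is a hypothetical local file.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher score means CLIP thinks the text and the image match better.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```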

Now let's go back to image generation and look at what happens when we type "a cat" as the prompt:

1. The prompt is split up (CLIP text encoder)

2. The split words are converted into tokens (CLIP text encoder)

3. The tokens are vectorized (CLIP text encoder)

4. Text and images are matched in vector space (CLIP text encoder + CLIP image encoder)

5. The noise needed to generate the target image is predicted (diffusion model)

6. The predicted noise is repeatedly subtracted from a random-noise image (diffusion model)

7. A cat image finally emerges (diffusion model)

This is how SD generates images from text. SD also supports generating images from images; interested readers can look into that process on their own, as we won't expand on it here.
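For readers who want to run the whole text-to-image chain end to end, here is a minimal sketch with the Hugging Face diffusers library. The article itself does not prescribe any particular tooling, and the checkpoint id below is just one commonly used SD 1.5 model, so treat this as an assumption for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an SD 1.5 checkpoint; the CLIP text encoder, U-Net and VAE come bundled.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt goes through CLIP, then the diffusion loop denoises in latent space.
image = pipe("a cat", num_inference_steps=30).images[0]
image.save("cat.png")
```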

Image control

Being able to generate images is only the first step. In real scenarios, even after giving all the prompts we can, the result may still differ from what we expect. At that point we have to rely on the randomness of image generation and regenerate to try our luck, the so-called "card drawing" (gacha pulling).

So how do we reduce the need to gamble on regeneration? By strengthening control over image generation. In SD, image control is mainly achieved through prompts, the master model, Lora models, and ControlNet.

1. Prompts and reverse prompts

In SD's text-to-image process, prompts are the cheapest way to exert control, for example defining the subject, scene, and camera angle of the image through text.

The prompt tells SD what should appear in the generated image; if you want to tell SD what to avoid in the image, you use a reverse prompt (negative prompt).

For example:

Prompt: master piece,high qulity,a beautifull girl,black long straigh hair,pretty face,moonlight

Reverse prompt: nsfw,sexy

The prompt above asks for a girl with long straight black hair under moonlight, and the reverse prompt asks SD to avoid anything too embarrassing. The result:

[Image: generated result for the prompt above]

The result basically meets the requirements, but this is the style Liu Huan likes to look at rather than the one we want, so we add another reverse prompt term to avoid bare legs:

Prompt: master piece,high qulity,a beautifull girl,black long straigh hair,pretty face,moonlight

Reverse prompt: nsfw,sexy,leg skin

Generating again, SD now tries to avoid showing leg skin:

[Image: generated result with the extra reverse prompt]
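In code terms, if you happen to use the diffusers library, the reverse prompt maps to the `negative_prompt` argument; the sketch below is written under that assumption (SD WebUI exposes the same idea as a separate negative prompt text box).

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="masterpiece, high quality, a beautiful girl, "
           "black long straight hair, pretty face, moonlight",
    negative_prompt="nsfw, sexy, leg skin",   # things to keep out of the image
    num_inference_steps=30,
).images[0]
image.save("girl_moonlight.png")
```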

2. Master model

As mentioned earlier, one of the core mechanisms behind SD image generation is the diffusion model. The official large models are trained on a huge number of images at great expense; they are big and general, but it is a bit difficult to get them to generate images in a specific style. So the community trains or merges new models on top of these official base models, producing master models in all kinds of styles. With the same prompt, the output under different models clearly carries the style of the corresponding model.
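In tooling terms, swapping the master model is just loading a different checkpoint. Below is a hedged sketch with diffusers; `anime_style_model.safetensors` is a hypothetical file name standing in for whatever community checkpoint you downloaded.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical community checkpoint file; any SD-compatible style model works here.
pipe = StableDiffusionPipeline.from_single_file(
    "anime_style_model.safetensors", torch_dtype=torch.float16
).to("cuda")

# Same prompt, different master model -> the output inherits the model's style.
image = pipe("a girl, black long straight hair, moonlight").images[0]
image.save("girl_styled.png")
```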

Take the earlier prompt "girl, long straight black hair, moonlight" as an example. Under a photorealistic-style model it looks like this:

[Image: photorealistic-style result]

Under an anime-style model:

[Image: anime-style result]

Under a Guoman (Chinese animation) style model:

[Image: Guoman-style result]

3. Lora model

Training an SD master model requires a huge number of image samples and a lot of compute. If you only need to customize elements such as a character or an art style, you can do it with a corresponding Lora model: training a Lora consumes far fewer resources than training a master model, and its effect on characters, art style, and similar elements is immediately visible.
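If you use diffusers, applying a Lora on top of a master model is roughly a one-line addition; the sketch below is an assumption for illustration, and `character_lora.safetensors` is a hypothetical file name.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical character Lora file; the base model stays the same, while the
# Lora nudges generated characters toward the person/style it was trained on.
pipe.load_lora_weights("character_lora.safetensors")

image = pipe("a girl, black long straight hair, moonlight").images[0]
image.save("girl_with_lora.png")
```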

Or take the "girl, black long straight, moonlight" as an example as an example, using iu's character Lora model:

[Image: result with IU's character Lora]

Using the Lora model of Yae Miko:

[Image: result with the Yae Miko Lora]

4. ControlNet

Lora is mainly used to control which character appears, while the character's expression and pose are controlled with ControlNet, which can steer the output by referencing an image or by sketching a stick figure on the canvas.
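A minimal pose-control sketch with diffusers and an openpose ControlNet is shown below; it is an assumption for illustration (SD WebUI users would use the ControlNet extension instead), and `reference_pose.png` is a hypothetical, already-preprocessed pose image.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Openpose ControlNet: conditions generation on a skeleton/pose image.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = Image.open("reference_pose.png")       # hypothetical openpose skeleton image

image = pipe(
    "a girl, black long straight hair, moonlight",
    image=pose,                               # the pose the output should follow
).images[0]
image.save("girl_posed.png")
```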

Or take the prompt word "girl, black and straight, moonlight" as an example, we specify a pose for the screenshot of the top dance bear dance:

[Image: reference pose screenshot]

With ControlNet's preprocessor set to openpose, the generated result leans toward this pose:

[Image: result constrained by the openpose reference]

When we draw a stick figure lying sideways on the canvas as the reference pose:

[Image: hand-drawn sideways stick figure]

The generated image will stay as close as possible to this sideways stick figure:

[Image: result matching the hand-drawn pose]

These are the commonly used image control methods in SD. There are other knobs that affect image quality and results, such as the VAE, the sampler, and the number of iterations; interested readers can dig deeper, but we won't expand on them here. Now let's briefly summarize the AI image generation process:

1. Provide the input (a prompt in the text-to-image scenario, a reference image in the image-to-image scenario)

2. SD's CLIP processes the input text, converts it into vectors through a series of steps, and determines what kind of image to produce by computing similarity

3. The master model, Lora, and ControlNet steer the output (VAE, sampler, step count, random seed, etc. are omitted here)

4. Iterate continuously to generate the final image

Conclusion

With a simple prompt, the tool can generate an image as required; with a little control you can specify a character's expression and pose, and you can even train your own character model. Since it draws faster and better than we do, let's embrace AI, in the spirit of "if you can't beat them, join them"!

Source-WeChat public account: Himalaya technical team

Source: https://mp.weixin.qq.com/s/k8OEDIvvXb-dY1C3nR1YrQ