
Post-GPT Book: Starting from GPT-3, Continuing the Vast Family Genealogy of the Transformer

Machine Heart analyst network

Author: Wang Zijia

Editor: H4O

This article takes you through the Transformer family.

Recently, large language models have taken over everyone's feeds, and there has been no shortage of articles about what these models can do and what commercial value they hold. However, as a humble researcher who has been immersed in artificial intelligence for many years, I am more concerned with the technical rationale behind this arms race and with how these models are engineered to benefit humanity. Rather than asking how these models can be monetized, I want to explore the reasons behind this phenomenon and what we researchers can still do before AI replaces humans, ideally so that we can "be replaced by AI and retire with honor".

Three years ago, when GPT-3 caused an uproar in the technology world, I tried to dissect the vast family behind GPT in the manner of a history book. I combed through the technical context behind GPT chronologically (Figure 1) and tried to explain the technical rationale for GPT's success. This year, ChatGPT, GPT-3's youngest son, appears even smarter and can chat with people, which has brought the latest advances in natural language processing to a much wider audience. At this historic moment, as AI historians, we should perhaps pause to review what has happened in recent years. The first article started from GPT-3, so this series is really a record of the post-GPT era (hence "post-GPT book"). While exploring the changes in the GPT family, I realized that most of the stories are about the Transformer, hence the title of this article: the Transformer family.


Figure 1.  GPT Old Family Tree

Review of the past

Before getting into the Transformer family, let's follow Figure 1 and look back at what happened earlier. Starting with Word Embedding [1,2], vectors (strings of numbers) came to carry the semantics of text in a peculiar but effective way. Figure 2 illustrates this representation: meaning expressed with numbers (king - man + woman = queen). On this basis, the huge NLP (natural language processing) family was founded.


Figure 2.  Word2Vec illustration (King - Man + Woman = Queen)
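To make the "king - man + woman = queen" arithmetic concrete, here is a minimal sketch with invented toy vectors; real Word2Vec embeddings are learned from large corpora and have hundreds of dimensions, so the numbers below are purely illustrative.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented purely for illustration --
# real Word2Vec embeddings are learned from text and are much larger.
vectors = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.6, 0.2, 0.1]),
    "woman": np.array([0.7, 0.3, 0.8]),
    "queen": np.array([0.9, 1.0, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # -> "queen" with these toy numbers
```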

After this, Word Embedding's eldest son, ELMo [3], discovered the importance of context. Consider the following two sentences:

"Oh! You bought my favorite pizza and I love you to death! ”

"Ah, I love you to death! Did you rub my favorite pizza to the floor? ”

"I love you to death" means something obviously different. ELMo solved this problem by "given a model a string of words, and then let the model predict the next word and the previous word (context)".

At the same time, a distant cousin of Word Embedding noticed another problem: when people understand a sentence, they only focus on some of its words. This is obvious when we read in our native language, where many typos slip past us because we are not attending to them while grasping the passage. So he proposed the Attention mechanism [4]. The Attention mechanism of that era, however, was still very early and could not work on its own; it could only be attached to sequence models such as RNNs and LSTMs. Figure 3 shows how the attention mechanism is bound to an RNN, and why attention by itself could not work alone. Briefly, an NLP model works like this: first we have a sentence, say "I love you China", which is five characters and becomes x_1-x_5 in Figure 3; then each character becomes the word embedding (a string of numbers) just mentioned, i.e. h_1-h_5 in Figure 3; finally they become the output, for example "I love China" (a translation task), i.e. x_1'-x_3' in Figure 3. The part of Figure 3 not yet explained is the attention mechanism, the A in the figure, which assigns a weight to each h so that we know which words matter more when producing the current output word. For details, please refer to the article I wrote at the beginning (

Starting with word2vec, let's talk about GPT's huge family genealogy

). As you can see, the numerical representation is the foundation of the entire task, which is why the Attention mechanism could not work alone.


Figure 3.  Source: Attention for RNN Seq2Seq Models (1.25x speed recommended) - YouTube
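As a minimal sketch of what the A in Figure 3 computes, the snippet below scores a set of encoder states h_1-h_5 (which, in that setup, still have to come from an RNN) against the current decoder state and forms a weighted context vector; all sizes and values are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                          # hidden size (illustrative)
H = rng.normal(size=(5, d))    # h_1..h_5: encoder states for "I love you China" (5 tokens)
s = rng.normal(size=(d,))      # current decoder state while producing x'_t

# Dot-product attention: score each encoder state against the decoder state,
# normalise into weights (the "A" in Figure 3), then take a weighted sum.
scores = H @ s                 # one score per source position
weights = softmax(scores)      # how much each source word matters right now
context = weights @ H          # context vector fed to the decoder at this step

print(np.round(weights, 3), context.shape)
```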

At this point, the Transformer, a proud direct descendant of the royal family, did not approve of this way of depending on others. In the paper "Attention is all you need" [5] he proposed his own way of standing alone: add one word to "attention mechanism" and make it the "self-attention mechanism", using the attention mechanism alone to generate that string of numbers. A traditional-Chinese-medicine prescription can illustrate this change. The original Attention mechanism only decided the dose of each ingredient, but the medicine itself was still dispensed by the RNN or LSTM, so the prescription we wrote was of course constrained by what drugs the pharmacy (RNN, LSTM) stocked. What the Transformer did was take back the right to dispense the medicine (add the value matrix) and change the way the prescription is written (add the key and query matrices). The Source can be seen as the medicine cabinet of a Chinese pharmacy, where each drawer has an address Key (the drug name) and a Value (the medicine itself). Given a Query, the goal is to retrieve the corresponding Value from the cabinet, i.e., the Attention value. The reason this is soft addressing is that we do not pull out just one drug: we may take something out of every Key address, with the amount determined by the similarity between the Query and each Key, and then compute a weighted sum of the Values to obtain the final Value (a full dose of Chinese medicine), which is the Attention value. This is why many researchers regard the Attention mechanism as a special case of soft addressing, which is quite reasonable [6].
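Sticking with the prescription metaphor, here is a minimal numpy sketch of single-head, unmasked scaled dot-product self-attention: queries are matched against keys by similarity, and the values are weighted and summed, with no RNN or LSTM anywhere in the loop. The matrices are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8             # 5 tokens, illustrative sizes
X = rng.normal(size=(n, d_model))      # token embeddings, no RNN anywhere

# The three "prescription" matrices from the metaphor above:
W_q = rng.normal(size=(d_model, d_k))  # query: what each token is looking for
W_k = rng.normal(size=(d_model, d_k))  # key:   the label on each medicine drawer
W_v = rng.normal(size=(d_model, d_k))  # value: the medicine inside the drawer

Q, K, V = X @ W_q, X @ W_k, X @ W_v
attn = softmax(Q @ K.T / np.sqrt(d_k))  # soft addressing: every query against every key
out = attn @ V                          # weighted sum of values -> new representation per token
print(attn.shape, out.shape)            # (5, 5) (5, 8)
```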

Since then, Transformer has officially begun to lead the family to prosperity.

Transformer succeeded

In fact, Figure 1 already shows that the Transformer is the most prosperous heir of the grandfather's family, which also confirms that the title "Attention is all you need" was indeed justified. Although I have just described the self-attention mechanism he proposed, the previous article (

Starting with word2vec, let's talk about GPT's huge family genealogy

) has already covered the evolution of the Transformer in detail, so here is only a quick review of the Transformer architecture for new readers.

To put it simply, we can think of the Transformer as an "actor". For this "actor", the encoder is like the actor's memory, responsible for converting the lines into an intermediate representation (abstracted into something in the mind that we cannot quite name, i.e., the actor's understanding), and the decoder is like the actor's performance, responsible for turning that mental understanding into what appears on screen. The most important part, the self-attention mechanism, acts as the actor's concentration: it automatically adjusts the attention paid to different positions, so that all the lines are understood better and the performance is more natural and fluent in different situations.

More specifically, we can think of the Transformer as a large "language processing factory". In this factory, each worker (encoder) is responsible for processing one position (say, one word) in the input sequence, processing and transforming it, and then passing it on to the next worker (encoder). Each worker has a detailed statement of work (the self-attention mechanism) that describes in detail how to process the input at the current position and how to relate it to all the other positions. In this factory, every worker can work on their own task at the same time, so the whole factory can process large amounts of input data efficiently.

As soon as the Transformer appeared, he won the throne without suspense thanks to his great strength and his two competing sons (BERT and GPT). BERT (Bidirectional Encoder Representations from Transformers) [11] inherited the Encoder part of the Transformer and won the first half of the race, but lost to GPT in versatility due to its limitations. The honest GPT (Generative Pre-trained Transformer) [7-10] inherited the Decoder part, learned from scratch in earnest, learned the way humans communicate, and finally staged a comeback in the second half.

Of course, the Transformer's ambition obviously does not stop there: "Attention is all you need" does not refer only to the NLP field. Before getting into the feud between GPT and BERT, let's first see what their old father has been doing.

New genealogy - princes abound

"Father, times have changed. Our family will go to true glory because of my efforts. ”

——Transformer

Having understood the mechanics of the Transformer, let's see how far the Transformer family has grown under his vigorous development (the new family tree). As the "actor" example above suggests, the Transformer represents a way of learning that follows human logic, so it can process not only words but also images. Figure 4 summarizes the Transformer family's strong background. Besides letting GPT and BERT keep breaking new ground in the original NLP (natural language processing) field, the Transformer also began to enter computer vision. His younger sons (ViT and others proposed by Google) also shine in this space. In 2021, the Vision Transformer saw a big explosion, and a large number of Vision Transformer-based works swept through computer vision tasks. Naturally, since this is one family, its members keep communicating with one another, and thus CLIP, which connects text and images (AI painting), was born. At the end of 2022, Stable Diffusion made its splash shortly before ChatGPT. Moreover, CLIP opened new doors to multimodality for the Transformer family: beyond words and images, can words also make music or draw pictures? Multimodal and multitask Transformers have been emerging one after another. In short, each field now has its own prince; the Transformer, self-made in NLP, has through hard work become the "King of Zhou" who enfeoffs the princes.

With so many princes, it should be a prosperous age.


 Figure 4.  The growing family tree of the Transformer family

Vision Transformer [12]

Before talking about GPT, let's first talk about the Transformer's first bold attempt - sending the younger son to muscle into the CV field. Let's take a look at this younger son's life:

His father, Transformer, was born in a 2017 paper called Attention is All You Need.

In 2020, Google proposed the Vision Transformer (ViT) architecture, which can process images directly without convolutional layers (CNNs). The paper title is as blunt as ever: "An image is worth 16x16 words". As shown in Figure 5, the basic idea is to divide the input image into a series of small patches; each patch is treated the way a word used to be treated when processing text. These patches are then converted into vectors, just like tokens in an ordinary Transformer. Where in natural language processing (NLP) the Transformer's attention mechanism tries to capture the relationships between different words in a text, in computer vision (CV) ViT tries to capture the relationships between different parts of an image.


Figure 5.  How ViT processes an image (source: Are Transformers better than CNN's at Image Recognition? | by Arjun Sarkar | Towards Data Science)
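A minimal sketch of the patching step described above: the image is cut into 16x16 patches, each patch is flattened and projected into a token, and a class token plus position embeddings are added before feeding a standard Transformer encoder. Sizes follow the paper, but the weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 224; C = 3; P = 16; d_model = 768       # sizes from the ViT paper; weights are random
img = rng.normal(size=(H, W, C))

# "An image is worth 16x16 words": cut the image into P x P patches,
# flatten each patch, and project it to d_model -- one "word" per patch.
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * C))           # (196, 768): 14 * 14 patches
W_embed = rng.normal(size=(P * P * C, d_model))  # learned projection in a real ViT
cls_token = rng.normal(size=(1, d_model))
pos_embed = rng.normal(size=(patches.shape[0] + 1, d_model))

tokens = np.vstack([cls_token, patches @ W_embed]) + pos_embed
print(tokens.shape)   # (197, 768) -- ready for a standard Transformer encoder
```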

Since then, various Transformer-based models have emerged and have achieved results surpassing CNNs on the corresponding tasks. So what are the Transformer's advantages? Let's go back to the movie example and look at the difference between the Transformer and a CNN:

Imagine you are a director shooting a film. You need to position the actors and put the different elements in the right places: put the actors against the right background, use the right light, and make the whole picture look harmonious and beautiful. A CNN is like a professional photographer who shoots each frame pixel by pixel, first extracting low-level features such as edges and textures, then combining these features into higher-level ones such as faces and actions, and finally obtaining a frame. As the film progresses, the CNN repeats this process until the entire film is shot.

ViT, by contrast, is like an art director who sees the whole picture at once, taking the background, light, color and so on into account and assigning the right position and angle to each actor to create a perfect frame. ViT then aggregates this information into vectors and processes them with a multilayer perceptron to obtain a frame. As the film progresses, ViT repeats this process until the entire film is created.

Going back to the image processing task, suppose we have a 224x224-pixel picture of a cat and we want to classify it with a neural network. A traditional convolutional neural network would use multiple convolutional and pooling layers to gradually shrink the image, eventually obtaining a small feature vector that is classified by fully connected layers. The problem with this approach is that during convolution and pooling we gradually lose information in the image, because we cannot consider the relationships between all pixels at once. Moreover, because the convolution and pooling layers are applied in a fixed order, we cannot exchange information globally. In contrast, if we process this image with a Transformer and self-attention, we can treat the entire image directly as a sequence and run self-attention over it. This approach does not throw away relationships between pixels and allows global information to interact.

In addition, since the self-attention computation is parallelizable, we can process the entire image at once, which greatly speeds things up. For example, suppose we have the sentence "I like to eat ice cream", which contains 6 words. Using a model based on the self-attention mechanism to understand this sentence, the Transformer can:

Keep the total computational complexity of each layer manageable: in a self-attention model we only need to compute attention weights between each word and all the other words, so the per-layer cost scales with the square of the input length times the hidden size, rather than with the square of the hidden size as in a recurrent layer. For short inputs like these 6 words, a self-attention layer is therefore cheap.

Maximize the amount of computation that can be parallelized: a self-attention model computes the attention weights between each word and all the other words simultaneously, so the computation is highly parallelizable, speeding up both training and inference.

However, ViT requires large-scale datasets and high-resolution images to reach its full potential, so while Vision Transformers excel in CV, CNNs are still more widely used and researched in computer vision and have advantages in tasks such as object detection and segmentation.

But that's okay, you did well enough; your father's intention in entering CV was never to replace the CNN - he had more ambitious goals.

The basis of this goal is the "in addition" I mentioned earlier.

The rising star - CLIP [13]

As I said earlier, the Transformer has a more ambitious goal, namely "big models" - super, super large models. Beyond what I described in the previous article, its better access to global information, manageable computational complexity, and much better parallelism became the foundation for supporting large models.

In 2021, alongside the great strides of the Vision Transformer, and while GPT was busy preparing GPT-3.5, the tireless model worker Transformer led a new wave: connecting text and images. That wave also fired the first shot of the "big model" campaign outside the NLP field. Here, the Transformer's weakness in visual tasks turned into an advantage: "ViT requires large-scale datasets and high-resolution images to reach its full potential" can be restated as "ViT can handle large-scale datasets and high-resolution images."

As per the old rule, let's start with what CLIP is.

The full name of CLIP is Contrastive Language-Image Pre-Training, and obviously its basic idea is contrastive learning from the traditional CV field. When we learn something new, we read different books and articles and absorb a lot of information. However, we do not simply memorize all the words and sentences of every book or article; instead, we try to find the similarities and differences between these pieces of information. For example, we may notice that different books describe the same topic in different words and emphasize different key concepts, yet the concepts they describe are essentially the same. This way of finding similarities and differences is one of the basic ideas of contrastive learning. We can think of each book or article as a different sample, while books or articles on the same topic can be seen as different instances of the same class. In contrastive learning, we train a model to learn how to distinguish between these different classes of samples, thereby learning their similarities and differences.

Next, a slightly more academic example. Suppose you want to train a model to recognize car brands. You could use a set of labeled car images, each with a brand label such as "Mercedes", "BMW" or "Audi". In traditional supervised learning, you feed the images together with the brand labels into the model and let it learn how to predict the correct label.

In contrastive learning, however, you can train the model with unlabeled images. Suppose you have a set of unlabeled car images; you can divide them into two groups: positive samples and negative samples. Positive samples are images of the same brand from different angles, while negative samples are images of different brands. Next, contrastive learning trains the model to pull positive samples of the same brand closer together and push negative samples of different brands apart. In this way, the model learns to extract brand-specific features from images without being explicitly told the brand label of each image.

Obviously, this is a self-supervised learning model, and CLIP is a similar self-supervised learning model, except that its goal is to connect language and images so that computers can understand the relationship between text and images.

Imagine that you are learning a vocabulary where each word has its definition and corresponding image. For each word and its corresponding image, you can think of them as a pair. Your task is to find out how these words and images relate to each other, i.e. which words match which images and which don't.

As shown in Figure 6, for a contrastive learning algorithm these word-image pairs are the so-called "anchor" and "positive" samples. The "anchor" is the object we want to learn about, while a "positive" is a sample that matches the anchor. The opposite is a "negative", i.e., a sample that does not match the anchor.

In contrastive learning, we pair the anchor with a positive and try to pull them together, and we also pair the anchor with a negative and try to push them apart. This process can be understood as seeking the similarity between the anchor and the positive while suppressing any similarity between the anchor and the negative.


Figure 6. Illustration of contrastive learning [14]. The anchor is the original image; positives are generally crops or rotations of the original image, or known images of the same category; negatives can be crudely defined as unknown images (which may even turn out to be of the same category) or known images of different categories.

To achieve this, CLIP first pre-trains on a large collection of images and texts, and then uses the pre-trained model for downstream tasks such as classification, retrieval, and generation. The CLIP model adopts a new self-supervised learning approach: it processes text and images at the same time and is trained to learn how to relate them. It uses a Transformer-based text encoder and an image encoder (a CNN or a ViT), maps both into a shared embedding space with a small set of learnable projection parameters, and then computes the similarity between image and text embeddings. CLIP learns to associate images and text through a contrastive objective that maximizes the agreement between image-text pairs that appear together in the data and minimizes the agreement between randomly paired images and texts.
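A minimal sketch of that symmetric contrastive objective, assuming the image and text embeddings have already been produced by the two encoders; the batch size, dimension and temperature below are illustrative, not CLIP's actual training configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched image/text pairs.
    image_emb, text_emb: (batch, dim) outputs of the two encoders (assumed given)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))               # diagonal pairs are the positives
    loss_i = F.cross_entropy(logits, targets)         # image -> matching text
    loss_t = F.cross_entropy(logits.t(), targets)     # text  -> matching image
    return (loss_i + loss_t) / 2

# Toy batch of 4 already-encoded pairs, just to show the shapes.
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
print(clip_contrastive_loss(img, txt))
```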


Figure 7. CLIP illustration [13]. Compared with Figure 6, it can be simply understood that the positives and negatives here are texts.

For example, if we want to use CLIP to decide whether an image shows a "red beach", we can feed in this text description together with the image, and CLIP produces a pair of vectors representing their relationship. If the distance between the two vectors is small, the image is likely a "red beach", and vice versa. With this approach, CLIP can perform tasks such as image classification and image search.

Going back to its full name, the last word in CLIP is Pre-training, so at heart it is still a pre-trained model, but it can be used for all kinds of downstream tasks that involve matching images and text, such as image classification, zero-shot learning, and image caption generation. For example, CLIP can classify images into categories described by natural-language labels such as "a picture of a dog" or "a landscape". CLIP can also be used to generate captions for images by conditioning a language model on the image features extracted by CLIP. In addition, CLIP can be used to generate images from text by conditioning a generative model on the text features extracted by CLIP.
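As a usage illustration, here is a hedged zero-shot classification sketch using OpenAI's open-source clip package; the image path and the label set are placeholders, and the prompt wording is just one reasonable choice.

```python
# Minimal zero-shot classification with OpenAI's open-source `clip` package
# (pip install git+https://github.com/openai/CLIP.git); "cat.jpg" is a placeholder.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a dog", "a photo of a cat", "a landscape"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```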

DALL-E & Stable Diffusion

With the help of CLIP, a new prince arose, known as AIGC (AI-generated content). In fact, ChatGPT is essentially also a kind of AIGC, but in this section we mainly talk about AI painting. Let's look at the history of the small family of AI painting:

2021.01, OpenAI released DALL-E [15] (AI painting software), which adapts GPT-3 so that it generates images instead of text (an image Transformer network).

Almost simultaneously (2021.01), OpenAI released CLIP [13]

2021.12, the CompVis group at LMU Munich and Runway published the latent diffusion paper [17] behind Stable Diffusion; Stability AI released Stable Diffusion itself in 2022.08 and continues to iterate new versions. It uses a frozen CLIP text encoder to condition the model on text prompts. Stable Diffusion breaks image generation down into a run-time "diffusion" process: starting from pure noise, it gradually removes the noise until none is left, bringing the image ever closer to the provided text description.

2022.04, DALL-E 2 [16] was released. It can create realistic images and artwork from natural-language descriptions. DALL-E 2 uses a two-part model consisting of a prior and a decoder. The prior is a GPT-style model that generates a CLIP image embedding from a text prompt. The decoder is a diffusion model that generates an image from that CLIP embedding. DALL-E 2 can also outpaint, inpaint, and produce variations of existing images.

The eldest brother CLIP connected images and text, and his twin brother DALL-E followed up by proposing the text-to-image task. To improve this task, the distant cousin Stable Diffusion improved the way images are generated, and finally DALL-E 2 pulled it all together, combining the strengths of GPT-3, CLIP and stable diffusion into its own AI painting system.

For the original DALL-E, let's say you're a painter and DALL-E is your toolbox. In this metaphor, there are two main tools in the toolbox: one is the brush and the other is the color palette.

The brush is DALL-E's decoder, which converts a given text description into an image. The color palette is DALL-E's encoder, which can convert an arbitrary text description into a feature vector.

When you receive a text description, you first use the palette to generate a feature vector. Then you pick up the brush and use that feature vector to paint an image matching the description. When you need more detail you use a finer brush, and vice versa.

Unlike a painter, DALL-E uses neural networks instead of brushes and palettes. This neural network uses a structure called an image Transformer network. When generating an image, DALL-E uses the GPT-3 model mentioned earlier to produce a CLIP image embedding corresponding to the text description. DALL-E then uses a beam search algorithm to generate a series of candidate images that match the input text description and feeds them into a decoder to produce the final image. The embedding vector is trained with a technique called contrastive learning, which embeds similar images and texts into nearby regions of the space so that they are easier to combine. Note that DALL-E does not directly include CLIP here, but it uses CLIP's text and image embeddings to train the Transformer and the VAE.

As for the beam search used during image generation, it is essentially a greedy search algorithm that finds a near-optimal sequence within a limited set of candidates. The basic idea of beam search is that each time the current sequence is extended, only the k most probable candidates are kept (k is called the beam width) and the other, low-probability candidates are discarded. This shrinks the search space and improves efficiency and accuracy. The specific steps for generating an image with beam search in DALL-E are as follows (a small sketch follows the list):

Encode the input text description as a vector and use it as the initial input to the Transformer model.

Starting from a special start symbol, generate the image sequence element by element. Each time an element is generated, the Transformer model predicts the probability distribution of the next element, and the k most probable candidates are selected as extensions of the current sequence.

For each extended sequence, compute its cumulative probability, keep the k sequences with the highest probability, and discard the rest.

Repeat steps 2 and 3 until a special ending symbol is generated or the maximum length limit is reached.

Return the sequence with the highest probability as the final generated image.
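Here is a minimal, generic beam search sketch over discrete tokens that follows the steps above; the "model" is a fixed dummy distribution, and in DALL-E the tokens would be image codes rather than these illustrative integers.

```python
import numpy as np

def beam_search(step_probs, k=3, max_len=5, start=0, end=1):
    """Generic beam search over discrete tokens.
    step_probs(seq) -> probability distribution over the next token.
    In DALL-E the tokens would be image codes; here they are plain integers,
    purely to illustrate the procedure listed above."""
    beams = [([start], 0.0)]                       # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == end:                     # finished sequences are kept as-is
                candidates.append((seq, logp))
                continue
            probs = step_probs(seq)
            for tok in np.argsort(probs)[-k:]:     # expand only the k best continuations
                candidates.append((seq + [int(tok)], logp + float(np.log(probs[tok]))))
        # keep the k highest-scoring sequences, discard the rest
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return max(beams, key=lambda c: c[1])[0]

# Dummy "model": a fixed random distribution over a vocabulary of 10 tokens.
rng = np.random.default_rng(0)
table = rng.dirichlet(np.ones(10))
print(beam_search(lambda seq: table, k=3))
```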

So how does Stable Diffusion paint the same picture? When we want to paint a work of art, we usually need a good composition and some specific elements to build from. Stable diffusion is exactly such an image generation method: it splits image generation into two parts, a diffusion process and a reconstruction process. The diffusion process can be imagined as mixing together a pile of scattered brushes, paints and boards, slowly adding more and more elements to the canvas. During this process we do not know what the final picture will look like, nor can we pin down the final position of each element; we can only keep adding and adjusting elements until the whole painting is complete. The input text description is like our general description of the work to be painted, and it steers the generation so that the text and the resulting image match, as if we kept modifying and tweaking elements until they better fit the picture we want. Ultimately, the generated image closely matches the text description, presenting the artwork we imagined.

As shown in Figure 8, the diffusion model here is a generative model that learns the data distribution by gradually adding noise to the data and then learning to reverse the process and recover the original data. Stable Diffusion uses a pre-trained variational autoencoder (VAE) to encode images into low-dimensional latent vectors, and a diffusion model equipped with attention layers to generate images from those latents. Stable Diffusion also uses a frozen CLIP text encoder to condition the diffusion model, turning text prompts into text embeddings.


Figure 8. The Stable Diffusion process. Following the upper arrow, noise is added to a picture again and again until it becomes pure noise; following the lower arrow, the noise is gradually removed and the original picture is rebuilt. (Source: From DALL・E to Stable Diffusion: how do text-to-image generation models work? | Tryolabs)
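A minimal DDPM-style sketch of the two arrows in Figure 8, on a toy 1-D "image": the forward function jumps to any noise level, and the reverse loop denoises step by step. The noise predictor is a trivial placeholder here; in Stable Diffusion it would be a U-Net conditioned on the CLIP text embedding, and the schedule values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)            # noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = rng.normal(size=(64,))                   # stand-in for an image (or a latent)

# Forward ("upper arrow" in Figure 8): jump straight to any noise level t.
def q_sample(x0, t, eps):
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Reverse ("lower arrow"): start from pure noise and denoise step by step.
# `eps_model` stands in for the conditioned noise-prediction network.
def p_sample_loop(eps_model, shape):
    x = rng.normal(size=shape)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

print(q_sample(x0, 500, rng.normal(size=x0.shape)).std())
print(p_sample_loop(lambda x, t: np.zeros_like(x), x0.shape).shape)
```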

It is worth noting that the diffusion-based sampling process in Stable Diffusion is stochastic, so each generated image is different even for the same text description. This randomness makes the generated images more diverse, while also increasing the uncertainty of the algorithm. To make the results more stable, Stable Diffusion uses techniques such as a carefully scheduled noise level during diffusion and multiple reconstruction passes to further improve image quality.

Stable Diffusion is a big step forward from DALL-E:

Resolution: Stable diffusion can generate images up to 1024×1024 pixels, while DALL-E currently only produces images of 256×256 pixels.

Speed: Stable diffusion requires multiple iterations to produce an image, so it is slower. DALL-E generates images in one go, so it is faster.

Flexibility: Stable diffusion can outpaint, inpaint, and vary existing images, while DALL-E can only generate images from text prompts.

Authenticity: Stable diffusion can produce more realistic and detailed images, especially under complex and abstract descriptions. DALL-E may produce images that do not conform to the laws of physics or common sense.

This is why DALL-E 2 also incorporated a diffusion model into its design.

The lurking powerhouse - GPT-3.5 [18] & InstructGPT [19]

While the other princes were busy with their reforms, GPT had also been working in silence. As mentioned at the beginning, GPT-3 already had strong capabilities when it was first released, but the way it had to be used was not exactly friendly to non-technical people, so the waves it made stayed within the technical community, and those not-so-enthusiastic waves kept dissipating because of its usage fees.

The Transformer was very dissatisfied. GPT thought it over: time for reform!

The first to answer the call for reform was GPT-3.5:

"I'm stupid and can't think of any good way to reform, so let's lay a solid foundation first."

So GPT-3.5 built on GPT-3 and adopted training data called Text+Code: on top of the text data, some programming-code data was added. Simply put, a larger dataset was used. This allows the model to better understand and generate code, increasing its diversity and creativity. Text+Code is text- and code-based training data collected and curated by OpenAI from the web. It consists of two parts: text and code. Text is content described in natural language, such as articles, comments and conversations. Code is content written in programming languages such as Python, Java and HTML.

Text+Code training data lets the model understand and generate code better, improving its diversity and creativity. For example, in a programming task the model can generate the corresponding code from a text description, and the code has high correctness and readability. In a content generation task the model can generate the corresponding text from a code description, and the text is consistent and interesting. Text+Code training data also makes the model better at handling multilingual, multimodal and multi-domain data and tasks. For example, in a language translation task the model can translate accurately and fluently according to the correspondence between different languages; in an image generation task the model can generate the corresponding image from a text or code description, and the image has high clarity and fidelity.

The second to answer the call was InstructGPT, which identified a new problem:

"To get in touch with humans, we need to listen to them more effectively."

And so the famous new outside helper arrived: the RLHF training strategy. RLHF is a training strategy based on reinforcement learning; its full name is Reinforcement Learning from Human Feedback. Its core idea is to give the model instructions during training and to hand out rewards or penalties based on the model's outputs. This makes the model follow instructions better and improves its controllability and credibility. In fact, GPT-3.5 also used human feedback, so what changed once reinforcement learning was added?

GPT-3.5's human feedback was used directly to fine-tune the model's parameters, whereas InstructGPT's RLHF is used to train a reward model, and this reward model then guides the model's behavior.

GPT-3.5's human feedback was based on evaluations of single outputs, whereas InstructGPT's RLHF is based on comparisons between multiple outputs.

GPT-3.5's human feedback happened only once, whereas InstructGPT's RLHF can run for multiple iterations, continually collecting new comparison data, training new reward models, and optimizing new policies.

In other words: less human input, but greater benefit to the model.


Figure 9. RLHF process (Source: GPT-4 (openai.com))

As shown in Figure 9, the RLHF training strategy has two phases: pre-training and fine-tuning. In the pre-training phase, the model performs unsupervised learning on the same kind of dataset as GPT-3 to learn the basics and regularities of language. In the fine-tuning phase, the model performs reinforcement learning on human-annotated data to learn how to generate appropriate outputs given instructions.

The manually labeled data consists of two parts: instructions and feedback. Instructions are tasks described in natural language, such as "write a poem about spring" or "tell me a joke about dogs". Feedback is a numerical rating, such as "1" for poor and "5" for excellent. Feedback is given by human annotators based on the model's outputs and reflects their quality and plausibility.

During the fine-tuning phase, the model is trained with an actor-critic style reinforcement learning algorithm, which consists of two parts: the Actor and the Critic. The Actor is a generator that produces outputs from instructions. The Critic is an evaluator that estimates the reward value of an output based on the feedback. The Actor and the Critic cooperate and compete with each other, constantly updating their parameters to increase the reward.

The RLHF training strategy can make the model better follow the instructions and improve the controllability and confidence of the model. For example, in a writing task, the model can generate text of different styles and themes according to instructions, and the text has high coherence and logic. In dialogue tasks, the model can generate replies with different emotions and tones according to instructions, and replies are highly relevant and polite.
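A minimal sketch of the comparison-based reward modeling described above: human labelers rank pairs of responses, and the reward model is trained so the preferred response scores higher. The pooled-embedding input and all sizes are illustrative stand-ins rather than InstructGPT's actual setup; the learned reward then guides policy optimization (e.g., with PPO, an actor-critic method).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in for the reward model in Figure 9: in InstructGPT this is a
    language model with a scalar head; here it scores a pooled embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, pooled):          # pooled: (batch, dim) response representations
        return self.score(pooled).squeeze(-1)

def pairwise_loss(r_chosen, r_rejected):
    # Labelers only say "this answer is better than that one"; the loss
    # pushes the reward of the chosen answer above the rejected one.
    return -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()

rm = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = pairwise_loss(rm(chosen), rm(rejected))
loss.backward()
print(float(loss))
```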

Finally, after the reforms and accumulation of its predecessors, the nimbler youngest son of the GPT family, ChatGPT, felt that the time had come. Building on InstructGPT, it launched a dialogue mode much closer to the way humans communicate, immediately setting off a huge wave in human society (hundreds of millions of users), and it was free. After several years of dormancy, the GPT family finally became the most favored branch of the Transformer family, winning the battle for succession outright and becoming the crown prince.

Meanwhile, for ChatGPT, being crown prince is not the end of the story; ChatGPT inherited the Transformer's huge ambition:

"The current situation is too chaotic, the powerful dynasty does not need so many princes, it is time to unify them."

Unification of the princes – the era of the large model

GPT-4: "This era is the era of big models, I said." (bushi)

ChatGPT is now based on GPT-4. GPT-4, presumably wary of how quickly competitors respond, keeps most of its technical details closed. But its capabilities already reveal the GPT family's ambition to unify the princes: in addition to text dialogue, GPT-4 has also gained the ability to work with images. Having learned from its years of dormancy that big models are justice, the GPT family wants to extend this truth to every field.

If we dig into the confidence behind this truth, it probably comes down to how large models are trained. GPT-3 is one of the largest language models to date: it has 175 billion parameters, 100 times more than its predecessor GPT-2 and 10 times more than the previous largest NLP model of its kind, and it can be regarded as the pioneer of large pre-trained models.

So, let's first look at how GPT-3's model architecture and training methods achieve such scale and performance:

Distributed training: GPT-3 uses distributed training, i.e., the model and the data are spread across multiple compute nodes that coordinate and synchronize through communication protocols. This exploits the compute resources and memory of many nodes, speeds up training, and supports larger models and datasets.

GPT-3 uses about 2,000 GPU nodes for distributed training, each with multiple GPUs and each GPU having the same amount of memory.

GPT-3 uses two methods of distributed training: data parallelism and model parallelism.

Data parallelism means splitting the data into subsets, with each node processing one subset; the model's parameters are updated on each node and then synchronized across all nodes.

Model parallelism means splitting the model into parts, with each node holding one part; each node computes the outputs and gradients of its part, which are then passed between nodes.

GPT-3 uses a hybrid of data parallelism and model parallelism: data parallelism within each node and model parallelism across nodes. This makes full use of GPU compute and communication bandwidth while reducing communication overhead and memory footprint (a minimal data-parallel sketch follows).
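The sketch below shows the data-parallel half of this idea with PyTorch's DistributedDataParallel; it is not GPT-3's actual training code (which is not public), the model is a trivial placeholder, and it has to be launched with a tool such as torchrun so that one process runs per GPU.

```python
# Launch with e.g.: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)  # placeholder for a Transformer block
    model = DDP(model, device_ids=[rank])           # gradients are all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                             # each rank works on its own shard of data
        x = torch.randn(32, 1024, device=rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                             # DDP synchronises gradients here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```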

Activation checkpointing: GPT-3 uses a technique called activation checkpointing, which saves the activations of only some layers, rather than all of them, during the forward pass. This saves memory, because activations take up most of the GPU memory. During the backward pass, if the activations of a layer are needed, they are recomputed instead of being read from memory. Some compute time is thus traded for memory, making larger models and batch sizes possible.
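A minimal sketch of this recompute-instead-of-store idea using PyTorch's built-in torch.utils.checkpoint; the block and sizes are placeholders, not GPT-3's architecture.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Recompute-instead-of-store: activations inside `block` are not kept during the
# forward pass and are recomputed during backward, trading compute for memory.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)   # checkpointed forward
y.sum().backward()                              # block's forward is re-run here
print(x.grad.shape)
```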

Sparse attention mechanism: GPT-3 uses a sparse attention mechanism, which attends only to part of the input sequence rather than every word when computing self-attention. This reduces computation and memory, because the cost of full self-attention grows with the square of the input length. GPT-3 uses a sparse attention pattern based on local windows and global blocks: the input sequence is divided into blocks, each block attends only to a few neighboring blocks, and each block also attends to some randomly selected global blocks. This lets the model capture both local and global information while reducing computational complexity and memory usage.
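To make the "local windows plus global blocks" idea tangible, here is an illustrative sketch that only builds the sparse attention mask; the pattern follows the description above rather than GPT-3's exact configuration.

```python
import numpy as np

def sparse_attention_mask(n_tokens, block=4, window=1, n_global=1, seed=0):
    """Block-sparse mask: each block attends to itself, `window` neighbouring
    blocks on each side, and a few randomly chosen global blocks.
    (Illustrative pattern, not GPT-3's exact configuration.)"""
    rng = np.random.default_rng(seed)
    n_blocks = n_tokens // block
    allow = np.zeros((n_blocks, n_blocks), dtype=bool)
    for i in range(n_blocks):
        lo, hi = max(0, i - window), min(n_blocks, i + window + 1)
        allow[i, lo:hi] = True                                              # local window
        allow[i, rng.choice(n_blocks, size=n_global, replace=False)] = True  # global blocks
    # Expand the block-level pattern to a token-level attention mask.
    return np.repeat(np.repeat(allow, block, axis=0), block, axis=1)

mask = sparse_attention_mask(n_tokens=16)
print(mask.shape, mask.mean())   # far fewer than 16*16 allowed positions
```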

Seeing this, ChatGPT frowned slightly, seemingly a little dissatisfied with GPT-3's scheme: "This is not enough."

"Big models are indeed the trend of the day, but you shouldn't blindly pursue scale just to compete. Before training a large model, we need to consider more details and technical challenges to ensure that it can operate stably, efficiently, and produce useful results. "

"First of all, choosing the appropriate training hyperparameters and model initialization is critical. The choice of hyperparameters such as learning rate, batch size, and number of iterations has a significant impact on the convergence speed, stability, and performance of the model. The model initialization determines the weight value before the training starts, which affects the quality of the final result. These parameters need to be carefully tuned based on empirical experiments or theoretical analysis to ensure optimal performance of the model. ”

"Second, in order to achieve high throughput and avoid bottlenecks, we need to optimize various aspects of the training process, such as hardware configuration, network bandwidth, data loading speed, and model architecture. Optimizing these steps can significantly improve the speed and efficiency of model processing. For example, using faster storage devices or data formats can reduce data load times; Using larger batch sizes or gradient accumulation can reduce communication overhead; Using simpler or sparser models can reduce calculation time, etc. ”

"Finally, training large models can encounter various instability and failure situations, such as numerical errors, overfitting, hardware failures, data quality issues, and so on. To avoid or recover from these issues, we need to closely monitor the behavior and performance of the model, and use debugging tools and techniques to identify and fix any bugs or defects. In addition, we can also use various security measures and safeguard mechanisms, such as cropping, regularization, discarding, noise injection, data filtering, data enhancement, etc., to improve the robustness and reliability of the model. ”

"In this day and age, big models are indeed important, but the mere pursuit of scale does not allow models to produce useful results. Only after careful training and optimization can large models truly realize their potential and bring more value to humans. ”

The prince was right.

The fallen strong prince - BERT

Finally, a starved camel is still bigger than a horse. Although BERT has recently been overshadowed by GPT, he is, after all, a formerly powerful prince, and even under GPT's unstoppable advance BERT still holds his own fiefdom. Among natural language processing models, BERT (Bidirectional Encoder Representations from Transformers) was once extremely popular because it performed very well on many tasks. When it was first released, it was almost unbeatable, even more successful than GPT. That is because BERT's design pursues different goals and has different advantages than GPT's.

BERT's goal is to push contextual modeling to a whole new level in order to better support downstream tasks such as text classification and question answering. It does this by training a bidirectional Transformer encoder, which looks at both the left and the right side of every position in the input sequence and thus produces a better representation of the context, so BERT models context better and improves performance on downstream tasks.

However, over time, the emergence of GPT family models allowed GPT-3 to surpass BERT in several tasks. One possible reason is that the GPT series of models are designed to focus more on generative tasks, such as text generation and dialogue systems, while BERT focuses more on classification and question answering tasks. In addition, GPT series models are trained with larger parameters and more data, which also enables them to perform better on a wider range of tasks.

Of course, BERT is still a very useful model, especially for tasks that require classifying text or answering questions. The GPT family of models is better suited for generative tasks such as text generation and dialogue systems. In general, both models have their own unique advantages and limitations, and we need to choose the right model according to the needs of the specific task. 

The menacing Segment Anything Model (SAM) [20]

As mentioned earlier, while the elder brother GPT was quietly grinding away, the model worker Transformer made plenty of waves in the CV field (ViT) and the multimodal field (CLIP). In the end, though, these became experience fodder, handed by the old father Transformer to the favored prince GPT, and finally contributed to the so-called unification under GPT-4.

ViT and CLIP, with Transformer blood flowing in their veins, were of course very unhappy: "Are kings and nobles born to their rank? The elder brother learned from us, so we can also learn from him."

"However, he is too strong in the NLP field, and we need to find a new battlefield."

And so SAM was born. On its official website, it is described like this:

Segment Anything Model (SAM): a new AI model from Meta AI that can "cut out" any object, in any image, with a single click

In simple terms, we can think of SAM as an efficient "image editing master" that can accurately identify and segment the objects in an image from various input prompts. For example, when we click a point in an image with the mouse, SAM, like an experienced painter, automatically cuts out the object at that point; when we enter the word "cat", SAM, like a clever detective, automatically finds and cuts out all the cats in the image; when we give SAM an object-detection box, SAM, like a skilled surgeon, precisely cuts out the object inside the box. SAM's zero-shot generalization capability makes it a true "universal editing master": whether it is common objects like cars, trees and buildings, or rare ones like dinosaurs, aliens and magic wands, SAM can identify and cut them out with ease. This power comes from its advanced model design and enormous dataset. I selected four examples of very complex scenes from the original paper (Figure 10) to illustrate what SAM can do.


Figure 10. Examples of SAM's results. Every colored region in the picture can be selected and extracted, which is like having an efficient PS master (image editing master).

To put it simply, in the past, whenever someone came to us with a request, we had to ask helplessly: wait a minute, what kind of data can you provide? Now that is no longer necessary, at least in the CV field, which brings the technology much closer to how non-technical people imagine AI.

To achieve the power described above, let's look at how ViT and CLIP plotted together, loudly:

ViT: "Although I used to do mainly image classification tasks, my architecture is also suitable for image segmentation. Because I'm using the Transformer architecture to break up images into a series of chunks and process them in parallel, if I integrate my strengths, SAM can inherit my strengths of parallel processing and global attention to achieve efficient image segmentation. ”

CLIP: "Okay, then I'll take my joint training approach with me, and based on this idea, SAM can also handle different types of input prompts (question prompts and visual cues). ”

Thus the model architecture of SAM took shape (Figure 11): ViT serves as the image encoder, and the CLIP approach is used to encode the prompt information. The idea is good - how to implement it? Learn from the big brother, of course!

"We want to use pre-trained language models for image segmentation tasks, just like we use prompts to let language models generate or predict text. With CLIP, our prompts can be rich, which can be points, boxes, masks, and texts that tell the language model what to segment in the image. Our goal is to give any hint and get a valid split mask. A valid mask means that even if the prompt is ambiguous (e.g. a shirt or a person), the output should be a reasonable mask for one of the objects. It's like Big Brother GPT (Language Model) can also give a coherent response to an ambiguous prompt. We chose this task because it allows us to pre-train the language model in a natural way and migrate zero samples to different segmentation tasks through prompts. ”


Figure 11. SAM model architecture
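As a usage illustration of the "one click" prompting described above, here is a hedged sketch using Meta's open-source segment_anything package; the image path and checkpoint path are placeholders, and the click coordinates are arbitrary.

```python
# Minimal point-prompt sketch with Meta's `segment_anything` package
# (pip install segment-anything); "photo.jpg" and "sam_vit_h.pth" are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # ViT image encoder inside
predictor = SamPredictor(sam)
predictor.set_image(image)                     # image embedding computed once

# Prompt: a single foreground click, like the "one click" described above.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),       # (x, y) of the click
    point_labels=np.array([1]),                # 1 = foreground point
    multimask_output=True,                     # return several candidate masks
)
print(masks.shape, scores)                     # (3, H, W) boolean masks with confidence scores
```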

As for the results, the capabilities described above confirm the viability of the idea. However, it must be said that although SAM no longer requires retraining models, it still has limitations, much like ChatGPT when it was first launched. In the Limitations section of the paper, the authors clearly point out some of SAM's shortcomings, such as weaknesses in fine detail, connectivity, and boundaries, as well as challenges in interactive segmentation, real-time use, text prompts, and semantic and panoptic segmentation, while also acknowledging the advantages of some domain-specific tools.

For example, I ran two simple tests in the demo: one was lesion detection in medical imaging, where the lesions were too small to be detected; the other was portrait cutout, where the result looked fine at first glance, but the hair was still not very natural and the cut marks were visible on closer inspection.

Of course, this is a good start after all; these little brothers are just getting started and are still working hard, so what more could you ask for? As for how this contest for the throne ends, let's wait and see!

Summary

The vast Transformer family is clearly more than this article can cover, and looking at Transformer-based results we can see continued innovation in this area: the Vision Transformer (ViT) demonstrates the successful application of the Transformer to computer vision, processing image pixel data directly without manual feature engineering; DALL-E and CLIP apply the Transformer to image generation and image classification, demonstrating its strength in visual-semantic understanding; Stable Diffusion proposes a controllable diffusion process that models a probability distribution and can be applied to image generation and editing. Together, these results reveal the broad application prospects of the Transformer model, and we have to admit that one day in the future it may really be that "attention is all you need".

In conclusion, we can see the vitality of continuous innovation in the field of artificial intelligence from these achievements. Whether it is GPT or BERT, or Vision Transformer, DALL-E, CLIP, Stable diffusion, etc., these achievements represent the latest progress in the field of artificial intelligence.

At present, the situation around ChatGPT is roughly like this:

The top students had a solid semester; when they open the book, they can recall the smile on the teacher's face when each knowledge point was explained in class, and they have even started planning next semester's study plan.

The pseudo-top students show up to class every day and occupy the front row, but when they open the textbook they are lost, and they join the slackers in "a book a day, a semester in a week" cramming. The only difference is that their textbooks are not brand new and they still remember a little of the content, so it is not entirely new knowledge to them.

As for the real slackers ...

"Knowledge comes, knowledge comes, knowledge comes from all directions"

In fact, I think that whether you are a pseudo-top student or a slacker, you should stay calm before the final exam: see what was covered this semester, borrow notes from the top students, or even choose to postpone the exam. For top students, moving fast is natural. For pseudo-top students and slackers, moving fast is harmful.

In the competition in the field of artificial intelligence, continuous innovation is crucial. Therefore, as researchers, we should keep a close eye on the latest developments in this field and remain humble and open-minded to promote continuous progress in the field of artificial intelligence.

Bibliography

[1] Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781 (2013).

[2] Mikolov, Tomas (2013). "Distributed representations of words and phrases and their compositionality". Advances in neural information processing systems.

[3] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, & Luke Zettlemoyer. (2018). Deep contextualized word representations.

[4] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

[5] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[6] attention mechanism and self-attention (transformer). Accessed at: https://blog.csdn.net/Enjoy_endless/article/details/88679989

[7] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).

[8] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.

[9] Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

[10] GPT-4 (openai.com)

[11] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).

[12] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[13] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.

[14] Zheng, Laura, Yu Shen, and Ming C. Lin. "Exploring Contrastive Learning with Attention for Self-Driving Generalization."

[15] Reddy, Mr D. Murahari, et al. "Dall-e: Creating images from text." UGC Care Group I Journal 8.14 (2021): 71-75.

[16] Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).

[17] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[18] Chen, Xuanting, et al. "How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks." arXiv preprint arXiv:2303.00293 (2023).

[19] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.

[20] Kirillov, Alexander, et al. "Segment Anything." arXiv preprint arXiv:2304.02643 (2023).

Analyst Introduction:

The author of this article, Wang Zijia, is an artificial intelligence scientist in the Office of the Chief Technology Officer at Dell Technologies. He studied in the United Kingdom, and his main research directions are computer vision, 3D reconstruction, and AIGC, focusing on the exploration and innovation of new technologies in these fields; he has made many attempts and innovations in data privacy protection enabled by new AI technologies and in applying AIGC technology to data management. He joined Dell Technologies in 2019 and has since published 5 papers and 139 patents in related fields.

The Machine Heart Global Analyst Network is a global AI expertise-sharing network initiated by Machine Heart. Over the past four years, hundreds of AI students, scholars, engineering experts, and business experts from around the world have shared their research ideas, engineering experience, and industry insights with the global AI community through online talks, column articles, knowledge-base construction, report publication, evaluations, and project consulting, achieving growth in ability, accumulation of experience, and career development in the spare time outside their studies and work.
