
This is how ChatGPT, the model all over everyone's social feeds, was refined: exploding training data and three generations of model evolution

Preface

The GPT series is OpenAI's line of pre-trained language models. GPT stands for Generative Pre-trained Transformer; as the name implies, the goal of GPT is to obtain a general-purpose text model by applying pre-training techniques to a Transformer backbone. Papers have been published on the text pre-training models GPT-1, GPT-2, and GPT-3, and on the image pre-training model iGPT. The as-yet-unreleased GPT-4 is rumored to be a multimodal model. The recently popular ChatGPT and its sister model InstructGPT, announced earlier in the year, are pre-release models on the road to GPT-4, sometimes referred to as GPT-3.5. ChatGPT and InstructGPT are identical in model structure and training method: both use instruction learning and reinforcement learning from human feedback (RLHF) to guide training, and they differ only in how the data is collected. So to understand ChatGPT, we must first read about InstructGPT.

1. Background

Before introducing ChatGPT/InstructGPT, let's introduce the underlying algorithms they rely on.

1.1 GPT series

The three generations of text pre-trained models, GPT-1, GPT-2, and GPT-3, all use the Transformer as their core structure (Figure 1). They differ in hyperparameters such as the number of layers and the word-vector length; the specifics are shown in Table 1.

Figure 1: Model structure of the GPT series (where Trm is a Transformer structure)

Table 1: Release time, parameter count, and training data volume of each GPT generation

GPT-1 was born a few months before BERT. Both use the Transformer as their core structure; the difference is that GPT-1 builds its pre-training task with left-to-right generation and then obtains a general pre-trained model that, like BERT, can be fine-tuned on downstream tasks. GPT-1 achieved SOTA results on 9 NLP tasks at the time, but the model size and data volume it used were relatively small, which motivated GPT-2.

Compared with GPT-1, GPT-2 does not substantially change the model structure; it simply uses more parameters and more training data (Table 1). The most important idea in GPT-2 is that all supervised tasks are a subset of the unsupervised language model, an idea that was also a precursor to prompt learning. GPT-2 caused quite a sensation at its release: the news it generated was convincing enough to deceive most humans and pass for the real thing. It was even dubbed "the most dangerous weapon in AI" at the time, and many news portals banned the use of GPT-2-generated news.

When GPT-3 was proposed, beyond its results being far superior to GPT-2, its 175 billion parameters drew even more discussion. In addition to completing common NLP tasks, researchers unexpectedly found that GPT-3 also performs well at writing code in SQL, JavaScript, and other languages and at simple mathematical operations. GPT-3 uses in-context learning, a form of meta-learning; the core idea of meta-learning is to find a suitable initialization from a small amount of data so that the model can fit quickly on a limited dataset and still achieve good results.
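To make in-context learning concrete, here is a minimal sketch of a few-shot prompt in the style used with GPT-3; the commented-out API call is purely illustrative and not a specific OpenAI interface.

```python
# A minimal sketch of in-context (few-shot) learning: the "training examples"
# live entirely inside the prompt, and the model is expected to continue the
# pattern without any parameter update.
few_shot_prompt = """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# Hypothetical call to a text-completion endpoint (illustrative only):
# completion = complete(model="gpt-3", prompt=few_shot_prompt, max_tokens=5)
# Expected continuation: " fromage"
print(few_shot_prompt)
```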

From the above analysis, we can see that from a performance perspective, GPT has two goals:

1. Improve the performance of the model on common NLP tasks;

2. Improve the generalization ability of the model on atypical NLP tasks (such as code writing and mathematical operations).

In addition, ever since pre-trained models appeared, a much-criticized problem has been their bias. Because pre-trained models are trained on massive data with an extremely large number of parameters, they are black boxes compared to expert systems fully controlled by hand-written rules. No one can guarantee that a pre-trained model will not generate dangerous content containing racism, sexism, and the like, because its tens of gigabytes or even tens of terabytes of training data almost certainly contain similar samples. This is the motivation for InstructGPT and ChatGPT, whose optimization goals are summarized as 3H:

* Helpful;

* Honest;

* Harmless.

OpenAI's GPT series models are not open source, but OpenAI provides a trial website for them, and readers with access can try it out for themselves.

1.2 Instruct Learning and Prompt Learning

Instruction learning is an idea proposed by Quoc V. Le's team at Google in the 2021 paper "Finetuned Language Models Are Zero-Shot Learners." Both instruction learning and prompt learning aim to elicit the knowledge already contained in the language model. The difference is that a prompt stimulates the model's completion ability, for example generating the second half of a sentence from the first half or filling in the blanks of a cloze task, whereas an instruction stimulates the model's understanding ability by giving a more explicit directive so that the model takes the correct action. We can understand these two learning styles through the following examples:

Prompt learning: I bought this necklace for my girlfriend and she liked it very much; this necklace is really ____.

Instruction learning: Judge the sentiment of this sentence: I bought this necklace for my girlfriend and she liked it very much. Options: A = good; B = average; C = poor.
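As a small illustration of the difference in input format, the two examples above can be written as plain strings; nothing here is model-specific, and the templates are purely illustrative.

```python
review = "I bought this necklace for my girlfriend and she liked it very much."

# Prompt learning: frame the task as text completion and let the model fill in the blank.
prompt_input = f"{review} This necklace is really"

# Instruction learning: state the task explicitly and constrain the answer space.
instruction_input = (
    "Judge the sentiment of this sentence: "
    f"{review} Options: A = good; B = average; C = poor."
)
print(prompt_input)
print(instruction_input)
```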

The advantage of instruction learning is that, after fine-tuning on multiple tasks, it can also do zero-shot on other tasks, whereas prompt learning is tied to a single task and generalizes less well. We can understand fine-tuning, prompt learning, and instruction learning through Figure 2.

Figure 2: Similarities and differences among model fine-tuning, prompt learning, and instruction learning

1.3 Reinforcement Learning from Human Feedback (RLHF)

A trained model by itself is not very controllable; the model can be seen as a fit to the distribution of the training set. Fed back through the generative model, the distribution of the training data then becomes the most important factor affecting the quality of the generated content. Sometimes we want the model not only to be shaped by the training data but also to be artificially controllable, so as to guarantee the usefulness, truthfulness, and harmlessness of the generated content. The paper repeatedly mentions the problem of alignment, which we can understand as aligning the model's output with the output humans prefer; what humans prefer is not only fluent and grammatically correct text, but also useful, truthful, and harmless content.

We know that reinforcement learning guides model training through a reward mechanism, and the reward can be regarded as the counterpart of the loss function in traditional training. Rewards are more flexible and diverse than loss functions (AlphaGo's reward is simply whether the game is won or lost), but the cost is that rewards are not differentiable and thus cannot be used directly for backpropagation. The idea of reinforcement learning is to approximate the loss function by sampling a large number of rewards, thereby training the model. Similarly, human feedback is also not differentiable, so we can likewise use human feedback as the reward in reinforcement learning; this is how reinforcement learning from human feedback came about.

RLHF can be traced back to the 2017 paper "Deep Reinforcement Learning from Human Preferences" by OpenAI and DeepMind, which used human annotations as feedback to improve reinforcement learning performance on simulated robotics and Atari games.

Figure 3: Fundamentals of reinforcement learning from human feedback

InstructGPT/ChatGPT also uses a classic reinforcement learning algorithm: Proximal Policy Optimization (PPO), proposed by OpenAI. PPO is a variant of the policy gradient algorithm. Policy gradient methods are very sensitive to the step size, yet a suitable step size is hard to choose, and if the old and new policies differ too much during training, learning suffers. PPO proposes a new objective function that allows small-batch updates over multiple training steps, which solves the problem of choosing the step size in policy gradient methods. TRPO tackles the same issue, but the PPO algorithm is easier to solve than TRPO.
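As a reference for what PPO's objective looks like in practice, here is a minimal sketch of the clipped surrogate loss; the function and tensor names are illustrative, and this is not OpenAI's implementation.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO, returned as a loss to minimize."""
    # Probability ratio between the new and old policies.
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping keeps the new policy close to the old one, which is PPO's
    # answer to the step-size problem described above.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```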

2. Interpretation of the principle of InstructGPT/ChatGPT

With these basics, it becomes much easier to understand InstructGPT and ChatGPT. In short, InstructGPT/ChatGPT both use the network structure of GPT-3, build training samples through instruction learning to train a reward model (RM) that predicts the quality of generated content, and finally use the scores from this reward model to guide the training of the reinforcement learning model. The training process of InstructGPT/ChatGPT is shown in Figure 4.

Figure 4: InstructGPT training flow: (1) supervised fine-tuning (SFT); (2) reward model (RM) training; (3) reinforcement learning against the reward model via PPO.

From Figure 4 we can see that the training of InstructGPT/ChatGPT is divided into 3 steps, and that in steps 2 and 3 the reward model and the PPO-tuned SFT model can be iteratively optimized (a pseudocode sketch of this loop follows the list below):

1. Perform supervised fine-tuning of GPT-3 on the collected SFT dataset (Supervised Fine-Tuning, SFT);

2. Collect manually labeled comparison data and train a reward model (Reward Model, RM);

3. Use the RM as the optimization objective of reinforcement learning and fine-tune the SFT model with the PPO algorithm.
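The sketch below summarizes the three steps as pseudocode; every function here is a hypothetical placeholder rather than a real API, with the actual training jobs omitted.

```python
# Hypothetical placeholder implementations so the sketch is self-contained;
# each real step is, of course, a full training job.
def supervised_finetune(model, sft_data):
    return model  # fine-tune `model` on prompt-response pairs (omitted)

def train_reward_model(sft_model, comparison_data):
    return sft_model  # train a scalar-reward head on ranked pairs (omitted)

def ppo_finetune(sft_model, reward_model, prompts):
    return sft_model  # PPO against the reward model (omitted)

def train_instructgpt(gpt3, sft_data, comparison_data, ppo_prompts):
    # Step 1: supervised fine-tuning on human-written demonstrations.
    sft_model = supervised_finetune(gpt3, sft_data)
    # Step 2: reward model trained on human rankings of model outputs.
    reward_model = train_reward_model(sft_model, comparison_data)
    # Step 3: optimize the SFT model against the reward model with PPO.
    # Steps 2 and 3 can be iterated with fresh comparisons from the new policy.
    return ppo_finetune(sft_model, reward_model, ppo_prompts)
```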

Following Figure 4, we will cover dataset collection and model training for InstructGPT/ChatGPT in turn.

2.1 Dataset Collection

As shown in Figure 4, the training of InstructGPT/ChatGPT is divided into 3 steps, each requiring slightly different data; we introduce them separately below.

2.1.1 SFT dataset

The SFT dataset is used to train the supervised model in step 1, that is, to fine-tune GPT-3 on the newly collected data using GPT-3's own training method. Because GPT-3 is a prompt-based generative model, the SFT dataset consists of prompt-response pairs. Part of the SFT data comes from users of OpenAI's Playground, and part comes from the 40 labelers OpenAI hired and trained for this work. In this dataset, the labelers' job is to write instructions themselves, and the instructions are required to cover the following three types (an example of the resulting sample format is sketched after the list):

* Plain tasks: the labeler writes an arbitrary task, while ensuring the diversity of tasks;

* Few-shot tasks: the labeler writes an instruction together with multiple query-response pairs for that instruction;

* User-based: use cases are taken from the API interface, and the labeler writes instructions based on those use cases.
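For illustration, one SFT sample might look like the following; the field names and content are hypothetical, since OpenAI has not published the raw data format.

```python
# A hypothetical prompt-response pair as used for supervised fine-tuning.
sft_sample = {
    "prompt": "Explain the moon landing to a 6 year old in a few sentences.",
    "response": (
        "People built a big rocket and flew to the moon. Two astronauts "
        "walked on it, picked up some rocks, and came back to tell everyone "
        "what they saw."
    ),
}
# During SFT the model is trained with the ordinary language-modeling loss to
# produce `response` when conditioned on `prompt`.
print(sft_sample["prompt"])
```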

2.1.2 RM dataset

The RM dataset is used to train the reward model in step 2. We also need to set a reward objective for InstructGPT/ChatGPT training. This reward objective does not have to be differentiable, but it must reflect, as comprehensively and truthfully as possible, the content we want the model to generate. Naturally, we can provide this reward through human labeling: generated content involving bias gets a lower score, which encourages the model not to generate content humans dislike. InstructGPT/ChatGPT's approach is to first have the model generate a batch of candidate texts and then have labelers rank these candidates by quality; a sketch of how such comparison data could be assembled follows.
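Here is a minimal sketch of how a labeler's ranking of K responses could be turned into pairwise comparison data; the texts and field names are illustrative.

```python
from itertools import combinations

prompt = "Write a short thank-you note to a colleague."
# The labeler's ranking, best first (illustrative placeholder texts).
ranked_responses = ["best response ...", "second-best response ...", "worst response ..."]

# Each ordered pair (better, worse) becomes one comparison sample.
comparison_samples = [
    {"prompt": prompt, "chosen": better, "rejected": worse}
    for better, worse in combinations(ranked_responses, 2)
]
# For K ranked responses this yields C(K, 2) pairs; with 4 <= K <= 9 as in the
# paper, that is 6 to 36 comparisons per prompt.
print(len(comparison_samples))  # 3 pairs for K = 3
```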

2.1.3 PPO dataset

InstructGPT's PPO data is not labeled; it comes from users of the GPT-3 API. It covers different kinds of generation tasks submitted by different users, the largest shares being open-ended generation (45.6%), QA (12.4%), brainstorming (11.2%), dialogue (8.4%), and so on.

2.1.4 Data Analysis

Because InstructGPT/ChatGPT are fine-tuned from GPT-3 and manual labeling is involved, the total amount of data is not large. Table 2 shows the sources of the three datasets and their sizes.

Table 2: Data distribution for InstructGPT

The distribution of the data is discussed in more detail in Appendix A of the paper; here I list a few points that may affect the model's effectiveness:

* More than 96% of the data is in English; the other 20 languages, such as Chinese, French, and Spanish, together account for less than 4%. This means InstructGPT/ChatGPT can generate other languages, but the quality is likely far below English;

* There are 9 types of prompts in total, and the vast majority are generation tasks, which may leave some task types uncovered;

* The 40 outsourced labelers come from the U.S. and Southeast Asia. InstructGPT/ChatGPT aims to train a pre-trained model with correct values, but those values are the combined values of these 40 labelers, and this rather narrow distribution may still produce discrimination and bias problems that people in other regions care more about.

In addition, the ChatGPT blog states that ChatGPT and InstructGPT are trained in the same way, differing only in how the data was collected, but no further details on the data collection are given. Considering that ChatGPT is used only in the dialogue domain, I would guess that ChatGPT's data collection differs in two ways: 1. a higher proportion of dialogue tasks; 2. prompts converted into a Q&A format. Of course, this is only speculation; a more accurate description will have to wait until ChatGPT's paper, source code, and other details are released.

2.2 Training tasks

We just noted that InstructGPT/ChatGPT is trained in three steps. These three steps involve three models: SFT, RM, and PPO, which we describe in detail below.

2.2.1 Supervised fine-tuning (SFT)

This step of training is consistent with GPT-3's own training, and the authors found that letting the model slightly overfit helps the next two training steps.

2.2.2 Reward Model (RM)

Because the RM training data consists of labelers' rankings of generated results, the RM can be viewed as a regression model. Structurally, the RM is the SFT-trained model with its last embedding layer removed; its input is a prompt and a response, and its output is a scalar reward value. Specifically, for each prompt, InstructGPT/ChatGPT randomly generates K outputs (4 ≤ K ≤ 9) and shows them to each labeler in pairs, i.e. C(K, 2) pairs per prompt, and the labeler chooses the better output of each pair. At training time, InstructGPT/ChatGPT treats all C(K, 2) response pairs of one prompt as a single batch. This per-prompt batching is less prone to overfitting than the traditional per-sample batching, because each prompt enters the model only once.

The loss function of the reward model is Equation (1). The goal of this loss is to maximize the gap between the reward of the response the labeler prefers and the reward of the response the labeler rejects:

$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, E_{(x, y_w, y_l) \sim D} \left[ \log \left( \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right] \tag{1}$$

where $r_\theta(x, y)$ is the reward assigned to prompt $x$ and response $y$ by the reward model with parameters $\theta$, $y_w$ is the response the labeler prefers, $y_l$ is the response the labeler does not prefer, and $D$ is the entire training dataset.
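Below is a minimal sketch of this pairwise loss in code, assuming a hypothetical `reward_model` callable that maps a prompt and a batch of responses to scalar rewards.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking loss of Equation (1) for one prompt's C(K, 2) pairs."""
    r_chosen = reward_model(prompt, chosen)      # r_theta(x, y_w), shape (num_pairs,)
    r_rejected = reward_model(prompt, rejected)  # r_theta(x, y_l), shape (num_pairs,)
    # -log(sigmoid(r_w - r_l)) == softplus(-(r_w - r_l)); the mean over the
    # pairs implements the 1 / C(K, 2) factor.
    return F.softplus(-(r_chosen - r_rejected)).mean()
```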

2.2.3 Reinforcement Learning Model (PPO)

Reinforcement learning and pre-trained models have been the two hottest AI directions of the past two years, and many researchers used to argue that reinforcement learning is not well suited to pre-trained models, because it is hard to build a reward mechanism from a model's output. InstructGPT/ChatGPT counterintuitively does exactly this, and introducing reinforcement learning into pre-trained language models by combining it with human annotation is the biggest innovation of this algorithm.

As shown in Table 2, the PPO training set comes entirely from the API. The reward model obtained in step 2 guides the continued training of the SFT model. Reinforcement learning is often difficult to train, and InstructGPT/ChatGPT encountered two problems during training:

Problem 1: as the model is updated, the data produced by the reinforcement learning model drifts further and further from the data the reward model was trained on. The authors' solution is to add a KL penalty term to the objective to ensure that the output of the PPO model does not deviate too far from the output of the SFT model, i.e. the KL term inside the first part of Equation (2).

Problem 2: training with the PPO objective alone significantly degrades the model's performance on general NLP tasks. The authors' solution is to add a general language-modeling objective to the training objective, which the paper calls PPO-ptx; this is the second term of Equation (2).

In summary, the training objective of PPO is Equation (2):

$$\mathrm{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right] + \gamma \, E_{x \sim D_{pretrain}} \left[ \log \pi_\phi^{RL}(x) \right] \tag{2}$$

where $\pi_\phi^{RL}$ is the policy being learned, $\pi^{SFT}$ is the supervised fine-tuned model, $\beta$ weights the KL penalty from Problem 1, and $\gamma$ weights the pre-training (ptx) term from Problem 2.
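Below is a minimal sketch of Equation (2) written as a loss to minimize; all tensors and coefficients are illustrative placeholders rather than the exact quantities used in the paper.

```python
def ppo_ptx_loss(reward, logp_rl, logp_sft, pretrain_logp, beta, gamma):
    """PPO-ptx objective of Equation (2), negated so an optimizer can minimize it."""
    # First term: RM reward minus a KL penalty that keeps the RL policy close
    # to the SFT model (Problem 1).
    rl_term = reward - beta * (logp_rl - logp_sft)
    # Second term: ordinary language-modeling objective on pre-training data,
    # which counteracts the drop on general NLP tasks (Problem 2, PPO-ptx).
    ptx_term = gamma * pretrain_logp
    return -(rl_term.mean() + ptx_term.mean())
```

This sketch hides the actual PPO machinery (clipping, value function, advantage estimation) and only shows how the reward, the KL penalty, and the pre-training term are combined.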

3. InstructGPT/ChatGPT performance analysis

It is undeniable that InstructGPT/ChatGPT works very well; in particular, the introduction of human annotation greatly improves the model's "values", its correctness, and the truthfulness of its behavior. So, based purely on InstructGPT/ChatGPT's technical scheme and training method, what improvements can we expect it to bring?

3.1 Advantages

InstructGPT/ChatGPT is more truthful than GPT-3: this is easy to understand. GPT-3 itself already has very strong generalization and generation ability, and on top of that InstructGPT/ChatGPT brings in different labelers to write prompts and rank the generated results, and is further fine-tuned from GPT-3, so when training the reward model, more truthful data receives higher rewards. The authors also compared against GPT-3 on the TruthfulQA dataset, and the experiments show that even the small 1.3-billion-parameter PPO-ptx model outperforms GPT-3.

InstructGPT/ChatGPT is slightly more harmless than GPT-3: the reasoning is the same as above. However, the authors found that InstructGPT did not improve significantly on discrimination and bias datasets. This is because GPT-3 itself is already a very strong model and the probability of it generating harmful, discriminatory, or biased samples is already low; the data collected and annotated by only 40 labelers is likely not enough to optimize the model further in these respects, so the improvement is small or undetectable.

InstructGPT/ChatGPT has strong coding ability: first of all, GPT-3 itself has strong coding ability, and the GPT-3-based API has accumulated a large amount of code. Some of OpenAI's internal employees also took part in data collection. With a large amount of code-related data and human annotation, it is not surprising that the resulting InstructGPT/ChatGPT codes very well.

3.2 Disadvantages

InstructGPT/ChatGPT reduces the model's effectiveness on general NLP tasks: we discussed this in the PPO training section; although modifying the objective alleviates it, the problem has not been completely solved.

InstructGPT/ChatGPT sometimes gives absurd outputs: although it uses human feedback, the amount of human labor is limited. The strongest influence on the model is still the supervised language-modeling task, in which humans only play a corrective role. So it is likely that the limited correction data, or the misleading nature of the supervised task (which considers only the model's output, not what humans actually want), causes it to generate untruthful content. It is like a student: even with a teacher's guidance, there is no guarantee the student learns every point.

The model is very sensitive to instructions: this can also be attributed to insufficient labeled data, because the instruction is the only clue for the model's output; if the number and variety of instructions seen in training are insufficient, the model may exhibit this problem.

The model over-interprets simple concepts: this may be because labelers tend to give higher rewards to longer outputs when comparing generated content.

Harmful instructions may get harmful responses: for example, InstructGPT/ChatGPT will give an action plan in response to a user's "plan to destroy humanity" prompt (Figure 5). This happens because InstructGPT/ChatGPT assumes that the instructions written by the labelers are reasonable and reflect correct values, and it does not make finer judgments about the instructions users provide, so the model will respond to almost any input. Although the reward model may later assign such output a low reward, when generating text the model must weigh not only its "values" but also how well the generated content matches the instruction, so it can still sometimes produce outputs with problematic values.

Figure 5: ChatGPT's plan to destroy humanity.

3.3 Future work

We have analyzed the technical solution of InstructGPT/ChatGPT and its problems, so we can also see possible directions for optimizing InstructGPT/ChatGPT.

Cost reduction and efficiency of manual annotation: InstructGPT/ChatGPT employed a team of 40 labelers, but judging from the model's performance this team is not enough. Finding ways for humans to provide more effective feedback, and combining human performance and model performance organically and cleverly, is very important.

Generalization and error correction with respect to instructions: since the instruction is the only clue for the model's output, the model depends on it heavily, and improving the model's ability to generalize over instructions and to correct faulty instructions is very important work for improving the user experience. It would not only broaden the model's application scenarios but also make it more "intelligent".

Avoiding performance degradation on general-purpose tasks: this may require a more reasonable way of using human feedback, or a more advanced model structure. As discussed, many of InstructGPT/ChatGPT's problems could be solved by providing more labeler-annotated data, but that would make the performance degradation on general NLP tasks more serious, so a solution is needed that balances the 3H quality of the generated results against performance on general NLP tasks.

3.4 Answers to hot questions about InstructGPT/ChatGPT

Will the emergence of ChatGPT cost low-level programmers their jobs? Judging from ChatGPT's principles and the generated content circulating online, much of the code ChatGPT produces runs correctly. But a programmer's job is not only writing code; more important is finding solutions to problems. So ChatGPT will not replace programmers, especially senior programmers. On the contrary, like many of today's code-generation tools, it will become a very useful tool for programmers writing code.

Stack Overflow announced a temporary rule: ChatGPT is banned. ChatGPT is essentially a text generation model; compared with generating code, it is better at generating text that merely looks real. Code or solutions produced by a text generation model are not guaranteed to run or to solve the problem, yet the plausible-looking text will confuse many people searching for answers to the same problem. To maintain the quality of the forum, it makes sense for Stack Overflow to ban ChatGPT.

The chatbot ChatGPT was induced to write a "plan to destroy humanity" and even provide code; what should AI development pay attention to? ChatGPT's "plan to destroy humanity" is content forcibly fitted from massive data under an instruction it should not have followed. Although the content looks realistic and reads fluently, it only shows that ChatGPT has very strong generation ability; it does not mean that ChatGPT has any intention of destroying humanity, because it is only a text generation model, not a decision-making model.

4. Summary

Like many algorithms when they first appear, ChatGPT has attracted wide attention in industry, and prompted people to reflect on AI, thanks to its helpful, truthful, and harmless results. But once we examine its algorithmic principles, we find it is not as terrifying as the industry hype suggests; on the contrary, we can learn many valuable things from its technical solution. The most important contribution of InstructGPT/ChatGPT to the AI community is the clever combination of reinforcement learning and pre-trained models, along with the use of human feedback to improve the model's helpfulness, truthfulness, and harmlessness. ChatGPT has also further raised the cost of large models: where the competition used to be only about data volume and model scale, it now also includes the cost of hiring outsourced annotators, which makes things even more prohibitive for individual practitioners.
