
From Microsoft Build 2023: How Large Language Models Are Trained

Author: Data Learning (DataLearner)

At this year's Microsoft Build 2023 conference, Andrej Karpathy, a researcher from OpenAI, gave a presentation on May 24 explaining in detail how ChatGPT is trained, covering the whole process of training a GPT that can hold conversations with users and the technologies involved. The talk is highly informative, and this article is based on it. This article is from the official DataLearner blog: From Microsoft Build 2023: How Large Language Models Are Trained and How Language Models Become ChatGPT - State of GPT | DataLearner official website

Andrej Karpathy, a student of Fei-Fei Li, was a founding member of OpenAI and worked there for about two years, then spent five years at Tesla before returning to OpenAI in 2023. It is fair to say he is a top talent in the field of artificial intelligence!

The presentation runs for more than 40 minutes and covers two aspects. The first is how a large language model such as GPT is trained, including the training process and the related technologies. Although it is a high-level explanation, it is very useful for understanding how ChatGPT is trained. The most valuable points of this part include:

  1. The steps and techniques involved in the end-to-end training process of ChatGPT
  2. Large models can be divided into base models, SFT models and RLHF models, which correspond to different training stages, have different characteristics, and suit different tasks
  3. The capabilities produced at different stages differ, as do the training costs; pre-training of the base model consumes by far the most resources
  4. Some explanations are also given, including why an RLHF model is still needed when the base model can already be supervised fine-tuned

Currently, users who are not registered on the official website cannot watch the video and can only watch it on YouTube. After the Build conference, an official recording and materials should be released. Registered users can watch it directly on the official website and download the subtitles.

  • State of GPT overview
  • Tokenization of text in the pre-training stage
  • Inputs and targets for the pre-training phase
  • OpenAI's base model GPT-3 compared with LLaMA
  • Supervised fine-tuning
  • Reward model
  • Reinforcement learning
  • Why is RLHF still needed on top of the SFT model?
  • The RLHF model is not always better than the base model

State of GPT overview

The theme of the talk is State of GPT. The whole presentation uses GPT-3 and Meta AI's LLaMA as examples; although it does not touch on GPT-4 itself, the core ideas should be the same. The full talk revolves around the following figure:

[Figure: the four stages of training a GPT assistant - pre-training, supervised fine-tuning, reward modeling, reinforcement learning]

This figure gives a good overview of the full training pipeline of large language models, which mainly consists of four stages:

  • Pre-training stage: train a base model on raw data; the result is a base model that can already be deployed and used
  • Supervised fine-tuning stage: continue training the base model on high-quality data; the result is an SFT model, which can also be deployed and used
  • Reward modeling stage: the first half of the famous RLHF behind ChatGPT, i.e. training a reward model for alignment; the model produced at this stage is generally not recommended for deployment
  • Reinforcement learning stage: continue with reinforcement learning on top of the reward model from stage three; the result is a "ChatGPT-like" model that can hold a dialogue with users

Each of these four stages involves its own training techniques, datasets, outputs and hardware consumption, all laid out very clearly. This article also details the content of these four stages.

Pre-training phase

The main purpose of the pre-training phase is to train a basic language model. The core is a transformer architecture trained on a large amount of unlabeled data to predict the next token; this is also the most time-consuming and compute-intensive part of the entire pipeline.

According to Andrej Karpathy, this phase accounts for 99% of the total training compute time and FLOPS (floating-point operations). The main datasets and techniques used in this phase are as follows:

What the LLM pre-training phase uses and produces:

  • Dataset: raw web text containing trillions of words; low quality but very large quantity
  • Algorithm: language modeling, i.e. predicting the next token
  • Hardware: thousands of GPUs
  • Training time: a few months
  • Resulting model: the base model, which can be deployed and used
  • Example models: GPT, LLaMA, PaLM

Next, Andrej Karpathy introduces several aspects of the pre-training phase in detail, and we will follow the same outline below.

Tokenization of text

At this stage, an important step in processing the dataset is tokenization. The raw training data of a large language model is all text, which cannot be fed to the model directly; using a tokenizer to turn the text into integers helps the model learn a representation and the context of each token, and can be reused across different datasets. The figure below shows an example of OpenAI's tokenization:

[Figure: OpenAI tokenizer example - a piece of text split into tokens]

Tokenization converts raw text into integer values (OpenAI provides an open-source implementation of byte pair encoding: https://www.datalearner.com/blog/1051671195543180), which is the input the transformer model can accept. Generally speaking, the original text is first split into tokens, and each token is then mapped to its corresponding integer value.

As shown in the following figure:

[Figure: mapping tokens to their integer ids]
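
To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library with the GPT-2 BPE vocabulary (the library and encoding choice are ours for illustration; the talk only shows the concept):

```python
# Minimal tokenization sketch using OpenAI's open-source tiktoken library
# (GPT-2 BPE encoding, chosen here purely for illustration).
import tiktoken

enc = tiktoken.get_encoding("gpt2")     # BPE with a vocabulary of 50,257 tokens

text = "Large language models are trained on tokens."
token_ids = enc.encode(text)            # text -> list of integer token ids
print(token_ids)                        # the integers the transformer actually consumes
print(enc.decode(token_ids))            # ids -> back to the original text
print(enc.n_vocab)                      # 50257; <|endoftext|> has id 50256
```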

Inputs and targets for the pre-training phase

After tokenization, the transformer model can be trained on the dataset. In general, the model input can be summarized as a tensor of shape (B, T), where B is the batch size and T is the maximum context length. The context length is usually a few thousand tokens; there are now much longer contexts of up to hundreds of thousands of tokens, but most models are still in the thousands.

In the actual code, the training sequences are laid out as rows, with documents separated by a special token such as <|endoftext|>, which marks the end of a document. As shown in the following figure:

[Figure: token ids packed into (B, T) batches, with <|endoftext|> (50256) separating documents]

The example here uses a maximum input length of T=10. Although the slide is partly truncated, you can still see that the table in the lower right corner is the model input: large amounts of text are converted to integers and concatenated row by row, with <|endoftext|> inserted between documents (the red 50256 in the figure), telling the model that a document has ended. The algorithm then takes windows of length T from this stream to form a batch. Note that a batch does not need to contain complete documents or sentences!
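
The packing and batching described above can be sketched roughly as follows (a simplified illustration with hypothetical helper names, not OpenAI's actual training code):

```python
# Simplified sketch of packing tokenized documents into (B, T) training batches,
# with <|endoftext|> (id 50256) marking document boundaries.
import torch

ENDOFTEXT = 50256
B, T = 4, 10   # batch size and context length, matching the small example on the slide

def pack_documents(token_docs):
    """Concatenate tokenized documents into one long stream separated by <|endoftext|>."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(ENDOFTEXT)
    return torch.tensor(stream, dtype=torch.long)

def get_batch(stream):
    """Sample B windows of length T; targets are the same windows shifted by one token."""
    starts = torch.randint(0, len(stream) - T - 1, (B,)).tolist()
    x = torch.stack([stream[s : s + T] for s in starts])          # inputs  (B, T)
    y = torch.stack([stream[s + 1 : s + T + 1] for s in starts])  # targets (B, T)
    return x, y
```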

The language model is trained to make predictions on this data: take the integer at any position and use the up-to-T tokens before it as input to predict what that position is. This involves the concept of vocabulary size, i.e. the total number of distinct tokens in the dataset. For example, the vocabulary here might contain 50,257 tokens, so each prediction picks one value out of 50,257 to represent the next token. This number is fixed and unambiguous for a given model.
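
The training objective itself is just next-token classification over this fixed vocabulary. A toy sketch of the loss computation (random tensors stand in for the transformer's real output and targets):

```python
# Toy sketch of the pre-training loss: at every position the model outputs a
# distribution over the whole vocabulary and is scored with cross-entropy
# against the true next token.
import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 10, 50257
logits = torch.randn(B, T, vocab_size, requires_grad=True)  # would come from the transformer
targets = torch.randint(0, vocab_size, (B, T))              # would be the shifted token ids

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()   # one optimizer step on the transformer's parameters would follow
```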

At this point, we can train a basic language model. Such a model has strong representation capabilities, even though its only objective is to generate the next token. Nevertheless, these models already demonstrate great capabilities on downstream tasks.

In the GPT-1 era, downstream supervised datasets were collected for fine-tuning; by the time of GPT-2, OpenAI found that the model could be adapted to downstream tasks with simple prompts alone.

OpenAI's base model GPT-3 compared with LLaMA

The two most representative models of this stage are GPT-3 and LLaMA (GPT-4's details have not been disclosed). Andrej Karpathy compared GPT-3 with LLaMA and acknowledged that the LLaMA-65B model, despite having fewer parameters, is more powerful than GPT-3. The reason is that LLaMA was trained on many more tokens. The comparison between the two is as follows:

  • GPT-3: 175 billion parameters, trained for a few months on roughly 1,000-10,000 V100 GPUs, using about 300 billion training tokens
  • LLaMA-65B: 65 billion parameters, trained for 21 days on 2,048 A100 GPUs, using about 1.4 trillion training tokens

However, it should be noted that the base model produced at this stage is not an assistant model. Fine-tuning it or doing prompt engineering can get the desired result, but it is far from perfect: the base model does not really answer questions, it just completes documents.

Here Andrej Karpathy also explained that OpenAI's GPT-3 base model can be accessed through the davinci API, but the GPT-4 base model is no longer exposed; what you access is actually the assistant model. There is currently no way to access the GPT-4 base model.

If you want a model with powerful dialogue abilities like ChatGPT's, further training is required.

Supervised fine-tuning

The supervised fine-tuning stage continues training the base model from the pre-training stage, now with supervision. This stage requires collecting high-quality Q&A data, but it does not need (and cannot reach) the data volume of stage one. In general, hiring contractors to collect 10,000 to 100,000 prompt-response pairs is sufficient.

The algorithm does not change at this stage: you simply continue training with the same method as pre-training, but on this high-quality prompt-response dataset. The result is an SFT (supervised fine-tuning) model, which is again a model that can be deployed and used. Vicuna-13B, recently released by the community, is a model of this kind (https://www.datalearner.com/ai-models/pretrained-models/Vicuna-13B).
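
Because the objective is unchanged, building an SFT training example can be sketched roughly as follows (an illustrative simplification with hypothetical helper names; masking the prompt tokens out of the loss is a common but optional choice, not something the talk prescribes):

```python
# Sketch of building one SFT training example: next-token prediction as before,
# but on a curated prompt-response pair, optionally ignoring the prompt tokens
# in the loss so that only the response is learned.
import torch
import torch.nn.functional as F

IGNORE = -100   # label value that F.cross_entropy ignores by default

def build_sft_example(prompt_ids, response_ids):
    """Return (input_ids, labels), shifted by one for next-token prediction."""
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([IGNORE] * len(prompt_ids) + response_ids)
    return input_ids[:-1], labels[1:]

def sft_loss(logits, labels):
    """logits: (seq_len, vocab_size) produced by the model being fine-tuned."""
    return F.cross_entropy(logits, labels, ignore_index=IGNORE)
```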

Training an SFT model on top of the base model usually takes only 1-100 GPUs and a few days.

The following figure shows examples of SFT data from the community's open-source OASST1 dataset and from OpenAI's InstructGPT:

[Figure: example SFT prompt-response data from OASST1 and InstructGPT]

As can be seen, OpenAI's dataset appears to also contain ratings of the responses, including truthfulness, helpfulness and harmfulness, which are used in the subsequent training stages.

At this point, we already have a model that can produce decent responses. However, Andrej Karpathy went on to explain the next two important stages, which together make up RLHF. They are still important for improving the model and are quite different from the SFT stage.

Reward model

RLHF stands for Reinforcement Learning from Human Feedback. It can be divided into two stages: first training a reward model, then doing reinforcement learning.

When training the reward model, the main work is to turn the previously collected SFT-style data into a comparison dataset.

In the following figure, the same prompt is used to let the model generate three different replies:

[Figure: three model completions generated for the same prompt]

Next, the replies are handed over to humans, who compare their quality and rank them in order. A dataset is then formed in the following format:

[Figure: comparison data - the same prompt (blue), different completions (yellow), and the reward token (green)]

The blue part is the prompt, which is the same across the different rows. The yellow part is the model's completion of the prompt. The green part is a special token at which the reward is read out. The language model then continues to be trained, but it is only asked to predict at this green position, so that it learns to judge which reply to the same prompt is better.

Of course, OpenAI itself also has some deterministic ground truth to ensure that the final rankings are consistent with reality (pairwise comparisons can otherwise produce strange results that conflict with the facts, which also needs to be taken into account).

The dataset used to train the reward model generally contains 100,000 to 1 million comparisons, annotated by contractors: small in quantity but high in quality. The algorithm is a binary-classification-style objective that predicts rewards consistent with the human preference ordering. The resulting reward model is generally not deployed as a model for end users.
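
The talk does not spell out the exact formula, but an InstructGPT-style pairwise objective looks roughly like this: the reward model emits one scalar per completion, and the loss pushes the preferred completion's reward above the rejected one's.

```python
# Sketch of a pairwise ("binary-classification-like") reward-model loss:
# maximize sigmoid(r_preferred - r_rejected), i.e. minimize its negative log.
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_preferred, r_rejected):
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy usage: scalar rewards the model assigned to two completions of the same prompt.
r_good = torch.tensor([1.3, 0.2], requires_grad=True)
r_bad = torch.tensor([0.4, -0.5], requires_grad=True)
loss = reward_ranking_loss(r_good, r_bad)
loss.backward()
```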

Training the reward model usually takes about the same resources as the SFT model: 1-100 GPUs and a few days. Once the reward model is trained, the reinforcement learning phase can begin.

Reinforcement learning

In the reinforcement learning phase, we use both the SFT model and the reward model trained above. The data looks as follows:

[Figure: RL stage data - prompts (blue), completions sampled from the policy (yellow), rewards from the reward model (green)]

As you can see, the setup is similar to the reward model stage. The policy is initialized from the SFT model, and the green part is now predicted by the reward model, while the yellow part is what is being trained. Here the model learns which completions are good and which are not: if the reward is positive, the tokens generated in the yellow part become more likely in the future; if the reward is negative, the model will generate such tokens less often in future generations.

The reward model itself stays frozen after stage three, i.e. the green part no longer changes. The goal of training is to improve the yellow completions the model generates. Repeating this process teaches the model how to generate completions that score well under the reward model.
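
The actual systems use PPO with additional terms (such as a KL penalty against the SFT model), but the core idea can be sketched as a reward-weighted likelihood update on the tokens the policy generated:

```python
# Highly simplified sketch of the RL update: weight the log-probabilities of the
# sampled (yellow) tokens by the scalar reward from the frozen reward model, so
# well-rewarded completions become more likely and poorly-rewarded ones less so.
# Real implementations use PPO with clipping and a KL penalty; this omits both.
import torch

def rl_step(token_logprobs, reward):
    """token_logprobs: log-probs the policy assigned to the tokens it generated
    for one prompt; reward: scalar produced by the frozen reward model."""
    loss = -(reward * token_logprobs.sum())
    loss.backward()   # an optimizer step on the policy's parameters would follow

# Toy usage with placeholder values standing in for real model outputs.
logprobs = torch.randn(12, requires_grad=True)   # 12 generated tokens
rl_step(logprobs, reward=0.8)
```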

The dataset used in this phase is similar to before: 10,000-100,000 prompts, high quality and small in number. The objective is to use reinforcement learning to generate tokens that maximize the reward. The resulting model is the final ChatGPT-style model. It also usually takes 1-100 GPUs and a few days of training.

Why is RLHF still needed on top of the SFT model?

The simple answer is that it works better. According to the InstructGPT paper, in human evaluations people prefer the RLHF-trained model over the SFT model.

As for the deeper reason why RLHF works better, Andrej Karpathy says there is currently no explanation everyone agrees on. However, he offered one possible explanation: it is easier to judge than to generate. For example, if you were asked to collect data to train a model to write poetry, you would have to come up with a description and then write the poem matching that description, which is not easy. But if the model generates several poems for a description and you only have to judge which poem is better, the work is comparatively simple and clear.

Andrej Karpathy frames this from the perspective of dataset collection and labeling: the former corresponds to collecting training data for the SFT model, while the latter is essentially the RLHF process.

The RLHF model is not always better than the base model

However, in practice the RLHF model is not better than the base model in every case. Although the RLHF model follows human intent better, it loses some entropy. This means the base model can produce more diverse outputs, while the RLHF model produces more "peaky" outputs (the output distribution of the RLHF model is more peaked, with a few high-frequency tokens or sequences receiving most of the probability mass).
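
A toy illustration of what "losing entropy" means for the next-token distribution (the numbers here are made up purely for illustration):

```python
# Toy entropy comparison: a flatter, base-model-like next-token distribution
# versus a peaked ("peaky"), RLHF-like one. Higher entropy -> more diverse samples.
import torch

def entropy(p):
    return -(p * p.log()).sum()

flat = torch.full((5,), 0.2)                            # five equally likely tokens
peaky = torch.tensor([0.90, 0.04, 0.03, 0.02, 0.01])    # one dominant token

print(entropy(flat))    # ~1.61 nats: sampling yields varied continuations
print(entropy(peaky))   # ~0.45 nats: sampling almost always picks the same token
```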

[Figure: base model vs RLHF model - diversity (entropy) of the outputs]

This means that if you want the model to generate diverse results, for example when given n examples of something and asked to generate more like them, the RLHF model is not a good fit. Karpathy gave the example of generating more Pokémon-like names from a list of existing ones, a case where the base model is clearly the better choice.

The figure below shows the anonymous pairwise rating results of large models published by LMSYS (given a question, the answers of two randomly chosen models are shown anonymously, ordinary users judge which is better, and the models are then ranked). The results show that the top three models are all RLHF models.

[Figure: LMSYS leaderboard - the top-ranked models are all RLHF models]
