
Technical principles of large language models

Author: NineData

In today's era, work and life are inseparable from data access, and the data storage and querying behind almost every platform rely on databases. SQL, the standard language for querying and processing data, has a long history: it was first proposed by IBM in the early 1970s during research on the relational data model, and it subsequently developed into the widely adopted standard interface for accessing databases.

Today's development of large language models gives us an opportunity to revisit this layer of standards: how people can access databases in a more natural way, and how data can be returned in a more direct and flexible form. For historical reasons, reaching a conclusion from database analysis requires the full path of "analyst + reporting front end + data back end + SQL + data storage", and this usage paradigm will be challenged in the future. Beyond the advantages of natural language itself, capabilities such as in-context learning, transfer learning, and text summarization also have a lot of room to play. With these thoughts in mind, we need to understand the development behind large language models and their technical principles.

1. The development of large language models

As a proven and feasible direction, the "large" in large language model is reflected in broad training datasets, a large number of parameters and layers, and a large amount of computation; its value lies in versatility and better generalization ability. Compared with traditional language models trained for specific domains, it applies to a much wider range of scenarios. This article draws on the relevant Google and OpenAI papers and some authors' supplementary material, combined with my own understanding, and tries to explain the technical development and main implementations in language everyone can understand.

1.1 Proposal of the Transformer model

Before the Transformer was proposed, the dominant models in natural language processing were recurrent neural networks (RNNs), which used recurrence and convolutions for language sequence transduction. In 2017, the Google Brain team published a paper at NeurIPS, a top conference in artificial intelligence, titled "Attention Is All You Need", which for the first time proposed a new, simple network architecture, the Transformer, based entirely on the attention mechanism and dispensing with recurrence and convolutions altogether.

Recurrent models typically compute along the symbol positions of the input and output sequences to predict subsequent values. This inherently sequential nature precludes parallelization within training examples, because memory constraints limit batching across examples. The attention mechanism, by contrast, allows dependencies to be modeled regardless of their distance in the input or output sequence.

The Transformer eschews recurrent network architectures and relies entirely on attention mechanisms to draw global dependencies between inputs and outputs. After only twelve hours of training on eight P100 GPUs, the Transformer reached a new state of the art in translation quality, demonstrating excellent parallelism, and it became the foundation of the most advanced large language models (LLMs) that followed.

To summarize, there are two core breakthroughs:

  1. It breaks through the limitation on learning long-distance dependencies in text: it abandons the recurrent network architecture and relies entirely on attention to draw global dependencies between input and output. The number of operations required to relate signals from two arbitrary input or output positions previously grew linearly or logarithmically with distance; in the Transformer it is reduced to a constant, with the multi-head attention mechanism compensating to preserve accuracy.
  2. Training can be highly parallelized, which is important for exploiting hardware dividends and iterating models quickly.

The figure below shows the Transformer model from the paper, which uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, shown in the left half (encoder) and right half (decoder) of the figure respectively; the relevant technical details are highlighted later.

Figure: The Transformer model

Based on this work, OpenAI developed GPT (Generative Pre-Training), a generative pre-trained model. The diagram below (adapted from an online source, with minor changes) summarizes its development; the relevant details are expanded later.

Figure: The development of GPT

1.2 The potential of generative pre-training: GPT-1

In 2018, OpenAI published the paper "Improving Language Understanding by Generative Pre-training."

The model has two stages: the first is unsupervised pre-training, which learns a high-capacity language model with a Transformer on a massive text corpus; the second is fine-tuning the parameters on labeled data. The resulting general, task-agnostic model outperformed discriminatively trained models, achieving better results on 9 of the 12 datasets evaluated in the paper. GPT-1 uses a 12-layer Transformer decoder; each Transformer layer is a multi-head self-attention block, followed by a fully connected layer that produces the output probability distribution.

For OpenAI, I think this work is the core factor that set them on this route, with several key findings:

1. It proves that training a general model has great potential value. Labeled data for specific tasks is hard to obtain, which had kept model quality from improving continuously; unsupervised Transformer pre-training followed by fine-tuning on a small amount of labeled data achieved better results.

2. The paper experimented with increasing the number of Transformer layers transferred, and found that going from 2 to 12 layers brought steady gains, with the transferred layers contributing up to around 9% improvement in total. Combined with the Transformer's inherent parallelism, this clearly has great potential on GPUs.

3. The paper found that adding language modeling as an auxiliary objective during fine-tuning improves the generalization of the supervised model and accelerates convergence, and that larger datasets benefit more from this auxiliary objective.

Figure: GPT-1 shows the initial potential of generative pre-training

Although the abstract highlights the model's advantages on specific tasks when labeled data is scarce, the three findings above had a significant impact on OpenAI's subsequent technical route. GPT-1 still suffered from problems such as forgetting and repetition when generating long texts, and still fell short of domain-specific models in many respects.

1.3 Generalization ability breakthrough: GPT-2

In 2019, OpenAI published its next development, a paper titled "Language Models are Unsupervised Multitask Learners". The focus is on larger models, broader datasets, and better generalization. GPT-1 is a 12-layer Transformer and the largest BERT has 24 layers, while GPT-2 is a 48-layer Transformer with a total of 1.5 billion parameters; its training set, called WebText, was built by extracting and deduplicating text from 45 million links, yielding 8 million documents totaling 40 GB of text.

The paper argues that training existing systems on single-task, single-domain datasets is the main reason for their lack of generalization, so GPT-2 is trained on a much broader dataset in a multitask fashion: each task must ensure that its loss function converges, and the different tasks share the main Transformer parameters.

The resulting model requires no parameter or architecture changes for downstream tasks. In the zero-shot setting it achieved the best results in the industry on 7 of 8 datasets, a very powerful show of generalization, and it also performed remarkably well on machine translation. It was after GPT-2 came out that GPT began to attract attention.

1.4 Larger parameters, larger datasets: GPT-3

To perform well in specific domains, earlier models still needed thousands of annotated samples for fine-tuning, which greatly limits their versatility; humans, by contrast, can answer a question correctly simply from the preceding context (in-context). GPT-3 tested this in-context learning ability by scaling the parameters up to 175 billion and measuring performance without any fine-tuning. As the parameter count grows, accuracy is evaluated in three settings: zero-shot (no examples), one-shot (a single example), and few-shot (a small number of examples, as many as fit in the context window). As the number of parameters increases, the gap between few-shot and zero-shot keeps widening, indicating that larger models generalize more strongly from examples.
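To make the three settings concrete, the sketch below assembles the kind of prompt each setting feeds to the model. The task wording and the English-French pairs are illustrative placeholders in the style of the paper's translation example, and build_prompt is a hypothetical helper, not part of any API.

```python
# Illustrative sketch of the zero-shot / one-shot / few-shot prompting settings.
# The task text and example pairs below are placeholders for illustration only.

def build_prompt(task_description, examples, query):
    """Assemble a prompt: task description, optional solved examples, then the query."""
    lines = [task_description]
    for source, target in examples:      # zero-shot: no pairs; one-shot: one pair; few-shot: several
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")          # the model is asked to continue from here
    return "\n".join(lines)

zero_shot = build_prompt("Translate English to French:", [], "cheese")
one_shot = build_prompt("Translate English to French:",
                        [("sea otter", "loutre de mer")], "cheese")
few_shot = build_prompt("Translate English to French:",
                        [("sea otter", "loutre de mer"),
                         ("peppermint", "menthe poivrée"),
                         ("plush giraffe", "girafe en peluche")], "cheese")
print(few_shot)
```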

Figure: The zero-shot, one-shot, and few-shot settings

Figure: The effect of model size and number of examples on accuracy

The paper validated a range of model sizes: n_params is the total number of trainable parameters, n_layers is the number of layers, d_model is the model dimension (the feed-forward layer is four times d_model), d_head is the dimension of each attention head, and all tests use a context window of 2048 tokens.

Figure: The model sizes validated in the paper

GPT-3 builds on GPT-2's pursuit of unsupervised, zero-shot learning and instead pursues few-shot performance in an unsupervised setting. It adopts a 96-layer multi-head Transformer, increases the context window to 2048 tokens, and is trained on a much larger corpus of 45 TB of text data, achieving excellent performance on many NLP datasets. Much of the GPT-3 work addresses engineering problems, such as handling data contamination, reducing cross-node network traffic during GPU parallelism, and load balancing.

The paper tested more than two dozen scenarios. GPT-3 achieved strong performance on many NLP datasets, including translation, question answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a new word in a sentence, or performing three-digit arithmetic. The paper also shows that in the few-shot setting GPT-3 can generate news articles that human evaluators find difficult to distinguish from articles written by people.

1.5 The breakout of ChatGPT: GPT-3.5

In March 2022, OpenAI published another paper, "Training language models to follow instructions with human feedback", which aligns language models with user intent on a wide variety of tasks through fine-tuning with human feedback, and launched the InstructGPT model. InstructGPT is a round of enhancement and optimization on top of GPT-3, so it is also known as GPT-3.5. Although GPT-3.5 still makes some simple mistakes, the paper shows that fine-tuning with human feedback is a promising direction.

The paper presents a way of fine-tuning with human feedback so that language models better follow user intent across a wide range of tasks. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, the authors collect a dataset of labeled demonstrations of the desired model behavior and fine-tune GPT-3 with supervised learning. They then collect human rankings of model outputs and use reinforcement learning from human feedback (RLHF) to further fine-tune the supervised model. The InstructGPT model has 1.3B parameters while GPT-3 has 175B, more than 100 times as many, yet InstructGPT's outputs are preferred over GPT-3's.

For training, OpenAI first hired 40 contractors to label data, collecting a set of human-written answers to prompts submitted to OpenAI, together with some labeler-written prompts, as the baseline for supervised learning. Then, on a larger set of prompts, model outputs are compared and the differences are manually ranked in order to train a reward model that predicts which outputs humans prefer. Finally, PPO is used to maximize this reward model's score and fine-tune the supervised model. The specific technical details are expanded later. The paper notes that whatever values the model exhibits reflect the values of the labelers more than those of the broader population.

Recognizing the intent behind a human instruction is a very important capability. ChatGPT adopts the same model structure as InstructGPT, is specially optimized for chat, and was opened to the public for testing in order to collect more effective labeling data. Reinforcement learning from human feedback (RLHF) is the most important feature distinguishing ChatGPT from other generative models; it helps the model minimize harmful, untruthful, and biased outputs and communicate more naturally. In addition, to better support multi-turn dialogue, ChatGPT introduces a stack-based context management mechanism that helps it track and manage contextual information across turns, so that it can generate coherent and natural replies over multiple rounds of conversation.

1.6 Current Technical Limitations

  1. In specialized domains, GPT cannot generate suitable responses without training on the relevant corpus.
  2. Credibility: answers lack specific sources.
  3. Timeliness: the underlying training data of a large model is historical, and the cost of retraining is very high.
  4. Stephen Wolfram's computational knowledge engine and the Wolfram computational language offer a chance to turn natural language into computational symbols and address this through computation.
  5. The training method has an inherent problem: when answering, the trained model picks what it considers the best answer, but that answer may still be wrong. The model is essentially a black box whose internal logic has not been decomposed, so harmful or misleading output to users cannot be ruled out. If the trained model is tuned to be more cautious, it may refuse to answer (to avoid false positives on the prompt); sometimes the model fails to respond to a particular phrasing, yet with a slight tweak to the question it answers correctly.

2. Main technical details

Google's paper is relatively terse; Jay Alammar's illustrated explanation of the Transformer (recommended by Liu Yan) is partially quoted here. The goal is to extract the main technical details and explain them clearly in plain language.

From a mathematical or machine-learning perspective, a language model models the probability distribution over word sequences: given the words already produced (which can be represented as vectors), it predicts the probability distribution of the next word, sentence, or even larger span of language. A GPT-style generative pre-trained model likewise generates each word of its response according to these corpus-derived probabilities, and ChatGPT additionally uses reinforcement learning from human feedback (RLHF) to intervene in training and achieve better results.
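To make this concrete, here is a minimal sketch of autoregressive generation. A real LLM would compute the next-token distribution with a Transformer; toy_next_token_distribution below simply fabricates one for illustration, so the vocabulary and probabilities are made-up placeholders.

```python
import numpy as np

# Minimal illustration of autoregressive generation: the "model" maps the tokens seen
# so far to a probability distribution over the vocabulary, and the next token is sampled.
vocab = ["I", "am", "a", "student", "."]

def toy_next_token_distribution(context):
    # A real LLM computes this with a Transformer; here we fake it for illustration.
    counts = np.ones(len(vocab))
    if context and context[-1] == "a":
        counts[vocab.index("student")] = 10.0   # make "student" likely after "a"
    return counts / counts.sum()

def generate(context, steps, rng=np.random.default_rng(0)):
    tokens = list(context)
    for _ in range(steps):
        p = toy_next_token_distribution(tokens)
        tokens.append(rng.choice(vocab, p=p))   # sample the next token from the distribution
    return tokens

print(generate(["I", "am", "a"], steps=2))
```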

2.1 What is a Transformer?

This article focuses on the core structure and technical points of Transformer, skipping the training optimization part.

Encoder-decoder structure

The Transformer is essentially an Encoder-Decoder architecture, consisting of an encoding component and a decoding component. In a machine translation task, for example, a sentence in one language is taken as input and a sentence in another language is produced as output. The encoding and decoding components can each have many layers: Google's original paper used 6, GPT-1 used 12, and GPT-3 uses 96.

Figure: The Encoder-Decoder architecture

Each encoder consists of two sublayers: a Self-Attention layer and a position-wise Feed-Forward Network (FFN). Every encoder has the same structure but its own weight parameters. Input to an encoder flows into the Self-Attention layer first, which lets the encoder use information from the other words in the sentence when encoding a specific word (intuitively: when we translate a word, we attend not only to the word itself but also to the context provided by the other words).

The decoder contains the same two layers as the encoder, but between them sits an additional encoder-decoder attention layer (Encoder-Decoder Attention), which helps the decoder focus on the relevant parts of the input sentence.

Figure: Encoder-Decoder Attention in the decoder
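A minimal structural sketch of the sublayer composition just described. The sublayer functions are placeholders, and the residual connections and layer normalization used in the real Transformer are only hinted at by the additions; this shows the data flow, not a training-ready implementation.

```python
import numpy as np

# Data-flow sketch only: `self_attention`, `cross_attention` and `feed_forward` stand in
# for the real sublayers; the `x + ...` additions hint at the residual connections
# (the real model also applies layer normalization around each sublayer).

def encoder_layer(x, self_attention, feed_forward):
    x = x + self_attention(q=x, k=x, v=x)        # attend over the input sentence itself
    return x + feed_forward(x)                   # position-wise feed-forward network

def decoder_layer(y, encoder_output, self_attention, cross_attention, feed_forward):
    y = y + self_attention(q=y, k=y, v=y)        # attend over the tokens generated so far
    y = y + cross_attention(q=y, k=encoder_output, v=encoder_output)  # Encoder-Decoder Attention
    return y + feed_forward(y)

# Tiny smoke test with dummy sublayers, just to show how the shapes flow through.
dummy_attn = lambda q, k, v: np.zeros_like(q)
dummy_ffn = lambda h: np.zeros_like(h)
x = np.random.default_rng(0).standard_normal((4, 8))   # 4 tokens, toy model dimension 8
print(decoder_layer(x, x, dummy_attn, dummy_attn, dummy_ffn).shape)   # (4, 8)
```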

  • How the encoder processes text

As in ordinary NLP tasks, the text is first converted into vectors: an embedding algorithm turns each word into a word vector. In the Transformer paper the word embedding dimension is 512, and every encoder receives a list of 512-dimensional vectors. Embedding happens only in the bottom encoder; the other encoders receive the output of the encoder below them. The length of this list is a hyperparameter we can set; essentially it is the length of the longest sentence in the training dataset. After the input sequence is embedded, each word flows through the encoder's two sublayers and then upward from encoder to encoder.

Figure: How the encoder processes text
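A toy sketch of the embedding step described above; the two-word vocabulary and the randomly initialised table are placeholders, whereas in a real model the embedding matrix is learned during training.

```python
import numpy as np

# Toy embedding step: each token id indexes a row of an embedding matrix.
d_model = 512
vocab = {"Thinking": 0, "Machines": 1}
embedding_table = np.random.default_rng(0).standard_normal((len(vocab), d_model))  # learned in reality

tokens = ["Thinking", "Machines"]
x = embedding_table[[vocab[t] for t in tokens]]   # shape (2, 512): one vector per token
print(x.shape)
```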

Self-Attention principle

As mentioned earlier, the Transformer's self-attention mechanism breaks through the limit on attention distance in text, so it is critical. Consider this sentence first:

The animal didn't cross the street because it was too tired           

What does "it" refer to in this sentence: the animal, the street, or something else? This is easy for people but not so simple for a model. Self-attention solves this and points "it" at the animal: after weighting, we get an attention pattern like the one in the figure below, where "The animal" receives the most attention.

Figure: Self-attention weights for the word "it"

In self-attention, each word has three different vectors: a Query vector (Q), a Key vector (K), and a Value vector (V), each of length 64. They are obtained by multiplying the embedding vector X by three different weight matrices W^Q, W^K, W^V, each of dimension 512×64.

The concepts of Query, Key, and Value come from information retrieval systems; take a simple search as an example. When you search for a product on an e-commerce platform ("a thin red down jacket for young women in winter"), what you type into the search box is the Query; the search engine then matches it against Keys (the product's type, color, description, and so on) and returns the matching content (the Values) according to the similarity between Query and Key.

Q, K, and V play a similar role in self-attention. In matrix computation, the dot product is one way to measure the similarity of two matrices, so QK^T is used in Equation 1 to compute similarity. The output is then a weighted combination of the values, where the weights are the similarities between the queries and the keys.
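For reference, the "Equation 1" mentioned here is the scaled dot-product attention from the original paper, in which the dot products are divided by the square root of d_k (d_k = 64 here) before the softmax to keep the gradients stable:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$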

Multi-head attention mechanism

Multi-head attention strengthens self-attention in two ways. First, it expands the positions the model attends to, letting it focus on multiple different positions at the same time. Second, it gives the attention layer multiple "representation subspaces": the paper uses 8 attention heads, so there are 8 different sets of Q/K/V matrices, and each input word vector is projected into 8 representation subspaces for the computation.

Concretely, after the word vectors for "Thinking Machines" pass through the bottom encoder, self-attention is computed 8 times with different weight matrices, producing 8 different Z matrices (Z0-Z7). These 8 matrices are then concatenated and multiplied by a weight matrix W^O to obtain the final matrix Z, which combines the information from all attention heads. This matrix is then fed into the FFN layer.

Figure: The concatenated heads are projected and fed into the FFN layer
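The whole computation can be sketched in a few lines of NumPy, assuming random placeholder weight matrices and the paper's dimensions (d_model = 512, 8 heads of size 64). This is an illustrative sketch, not a training-ready implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads=8, d_model=512, seed=0):
    """Toy multi-head self-attention; weight matrices are random placeholders."""
    rng = np.random.default_rng(seed)
    d_k = d_model // n_heads                              # 64 per head, as in the paper
    W_q = rng.standard_normal((n_heads, d_model, d_k))
    W_k = rng.standard_normal((n_heads, d_model, d_k))
    W_v = rng.standard_normal((n_heads, d_model, d_k))
    W_o = rng.standard_normal((n_heads * d_k, d_model))

    heads = []
    for h in range(n_heads):
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]      # (seq_len, d_k) each
        scores = softmax(Q @ K.T / np.sqrt(d_k))          # similarity between every pair of positions
        heads.append(scores @ V)                          # weighted sum of values: one Z per head
    Z = np.concatenate(heads, axis=-1) @ W_o              # concatenate the 8 Z matrices, project with W^O
    return Z                                              # (seq_len, d_model), fed into the FFN layer

X = np.random.default_rng(1).standard_normal((2, 512))    # e.g. the two tokens "Thinking Machines"
print(multi_head_self_attention(X).shape)                 # (2, 512)
```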

Looking again at the earlier example: with multi-head attention, which words does "it" attend to? The 8 colors at the top represent the 8 attention heads. One head attends most strongly to "the animal", while another focuses on "tired"; in a sense, the model's representation of "it" incorporates the representations of both "animal" and "tired".

Figure: Multi-head attention weights for "it"

  • The multi-head mechanism therefore computes attention from multiple angles and then combines the results, which strengthens the model's understanding of the full sentence context.
  • In the decoder, each Transformer block has one more attention sublayer than the encoder: encoder-decoder attention. In encoder-decoder attention, Q comes from the decoder's previous output, while K and V come from the encoder's output. These vectors are used in every decoder's Encoder-Decoder Attention layer to help the decoder focus on the appropriate positions of the input sequence. For example, when translating "I am a student", the decoder generates one word per round; after generating "a", that word is added to the decoder's input (the source of Q) for the next round, and the decoder combines this with the encoder's K and V to generate "student", as sketched below.
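A sketch of that decoding loop, with decoder_step standing in for the real model (which would run the decoder stack and attend to the encoder's K and V internally); the canned translation is only there to show the shape of the loop.

```python
# Sketch of the decoding loop described above: the decoder emits one token per round,
# and each emitted token is appended to the decoder input for the next round, while
# K and V always come from the encoder's output.

def greedy_decode(encoder_output, decoder_step, start_token="<bos>", end_token="<eos>", max_len=20):
    generated = [start_token]
    for _ in range(max_len):
        next_token = decoder_step(generated, encoder_output)   # attends to encoder K/V internally
        if next_token == end_token:
            break
        generated.append(next_token)                           # fed back as input in the next round
    return generated[1:]

# Toy stand-in that "translates" a fixed sentence word by word, to show the loop shape only.
canned = ["I", "am", "a", "student", "<eos>"]
fake_step = lambda generated, enc_out: canned[len(generated) - 1]
print(greedy_decode(encoder_output=None, decoder_step=fake_step))   # ['I', 'am', 'a', 'student']
```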

2.2 How does ChatGPT improve training results?

Behind ChatGPT is a new training paradigm for generative large language models (LLMs): RLHF (Reinforcement Learning from Human Feedback), which optimizes the language model with reinforcement learning driven by human feedback. In the context of RLHF training, the TAMER framework (Training an Agent Manually via Evaluative Reinforcement) is also worth noting.

RLHF is a complex approach involving multiple models and different training stages; here we break it down into three steps:

  • Pre-train a language model (LM);
  • Aggregate question-answering data and train a Reward Model (RM);
  • Fine-tune the LM with reinforcement learning (RL).

The large language model trained in the GPT-3 fashion predicts the most probable next word from a probability distribution; it does not care about factual or logical accuracy and has no so-called consciousness, so it sometimes writes nonsense with a straight face. RLHF uses human feedback on generated text as a performance measure, or further as a reward to optimize the model, so that a language model trained on a general text corpus can be aligned with complex human values. The specific steps are as follows:

First, a language model is trained with the classic pre-training objective. For this step, OpenAI used a smaller version of GPT-3 in its first popular RLHF model, InstructGPT. Training then proceeds with the following steps:

  1. Train a supervised policy language model

GPT-3 by itself cannot recognize the different intents implied by human instructions, and it is hard to judge whether its generated content is high quality. To solve this, the training process randomly selects questions from the dataset and has annotators write high-quality answers, which yields a set of manually written prompts with corresponding answers. These manually labeled data are then used to fine-tune the GPT-3.5 model, producing the SFT model (Supervised Fine-Tuning).
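At its core, the SFT step is ordinary supervised next-token training on the (prompt, answer) pairs. The sketch below shows the masked cross-entropy idea with random placeholder logits; restricting the loss to the answer tokens is a common implementation choice rather than something stated above.

```python
import numpy as np

# Minimal sketch of the supervised fine-tuning (SFT) objective: next-token cross-entropy
# on human-written (prompt, answer) pairs. The logits are random placeholders for what
# the model would produce; the loss mask keeps only the answer tokens.

def masked_next_token_loss(logits, token_ids, loss_mask):
    """logits: (seq_len, vocab); token_ids/loss_mask: (seq_len,). Position t predicts token t+1."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))   # log-softmax
    targets = token_ids[1:]                                 # shifted targets
    per_token = -log_probs[np.arange(len(targets)), targets]
    mask = loss_mask[1:]                                    # 1 for answer tokens, 0 for prompt tokens
    return (per_token * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
vocab_size, seq = 100, 8
token_ids = rng.integers(0, vocab_size, size=seq)           # prompt (first 5 tokens) + answer (last 3)
loss_mask = np.array([0, 0, 0, 0, 0, 1, 1, 1])
logits = rng.standard_normal((seq, vocab_size))
print(masked_next_token_loss(logits, token_ids, loss_mask))
```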

  2. Train a reward model

Training method: using the model from the first stage, questions are randomly selected and several different answers are generated for each; annotators then rank the answers from best to worst, somewhat like a teacher grading. This human preference signal feeds into the next round of reinforcement learning: a reward model is trained to predict which outputs humans prefer.

Training the RM is where RLHF starts to diverge from the old paradigm. The model receives a sequence of text and returns a scalar reward that numerically reflects human preference. It can be built end to end with an LM, or as a modular system (for example, ranking outputs and converting the rankings into rewards). This scalar reward is what allows seamless integration with existing reinforcement learning algorithms.

In terms of model choice, the RM can be another fine-tuned LM, or an LM trained from scratch on preference data. Anthropic, for example, proposes a special pre-training method, replacing the usual fine-tuning after general pre-training with preference model pretraining (PMP), on the grounds that it uses sample data more efficiently; the jury is still out on which kind of RM is better.
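However the RM is built, the InstructGPT-style training objective is a pairwise ranking loss over the human-ranked answers: for a prompt x with a preferred answer y_w and a less-preferred answer y_l, the reward model r_θ is trained (in simplified form, omitting the normalization over all compared pairs) to minimize

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$

where σ is the sigmoid; the loss pushes the preferred answer's score above the other's.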

  3. Fine-tune with Proximal Policy Optimization (PPO)

The policy is optimized against the reward model using PPO: the reward model's output is used as a scalar reward, and the PPO algorithm fine-tunes the supervised policy to maximize that reward.

Training method: the core purpose of PPO here is to turn online human feedback into offline learning, letting the machine score itself. Questions are randomly selected from the dataset, the PPO model generates several answers, and the reward model trained in the previous stage scores their quality separately. The scores are fed back to produce a policy gradient, and the PPO model's parameters are updated by reinforcement learning.

Finally, steps 2 and 3 can be iterated in cycles, and the model can be continuously improved.

More on the PPO algorithm:

For a long time, training LMs with reinforcement learning was considered impossible for both engineering and algorithmic reasons. What several organizations have now made work is fine-tuning some or all of the parameters of the initial LM with a policy-gradient RL algorithm, Proximal Policy Optimization (PPO). PPO has been around for a relatively long time and there is plenty of material on its principles, which makes it a favorable choice in RLHF.

The fine-tuning task is formulated as an RL problem. The policy is an LM that accepts a prompt and returns a sequence of text (or a probability distribution over text). The action space of this policy is all the tokens in the LM's vocabulary (typically on the order of 50k), and the observation space is the set of possible input token sequences (the vocabulary size raised to the power of the input length, which is very large). The reward function combines the preference model with a constraint on policy shift.

The reward in PPO fine-tuning is computed as follows: a prompt x is fed to both the initial LM and the current fine-tuned LM, producing outputs y1 and y2 respectively; the current policy's text is passed to the RM to obtain a scalar reward rθ; the two models' outputs are then compared and a penalty term for their divergence is computed, which penalizes the RL policy for drifting too far from the initial model in each training batch and keeps the output reasonable and coherent.
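Written out, the combined reward used in each update is commonly formulated (as in the RLHF write-up cited in the bibliography) as the RM score minus a scaled penalty on the divergence between the fine-tuned policy and the initial model:

$$r = r_\theta(x, y) \;-\; \lambda\, D_{\mathrm{KL}}\big(\pi_{\mathrm{RL}}(y \mid x)\,\big\|\,\pi_{\mathrm{init}}(y \mid x)\big)$$

where λ controls how strongly the policy is kept close to the original model.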

Figure: The reward function used in PPO fine-tuning

In summary, ChatGPT first trains an SFT supervised policy model on manually written prompts and answers, then has the model produce several answers to randomly selected questions, which humans rank in order to train a reward model, and finally strengthens the effect with PPO reinforcement training. As a result, ChatGPT can better understand the intent of an instruction and produce output that follows the command and aligns with its trainers' values.

Finally, as a verified and feasible direction, the "large" in large language model is reflected in broad datasets, a large number of parameters and layers, and a large amount of computation, and its value lies in versatility and a wide range of application scenarios. The development of large language models owes much to the model's good parallel scalability; as data volume and computation grow, the main challenges are engineering and tuning. Besides GPT, LLaMA, PaLM, and others, there is plenty of corresponding research in China: much of the underlying technology has existed for a while, the pace of domestic catch-up has been fast recently, and we expect the GPT-3.5 level to be reached in about half a year. NineData is also very optimistic about this direction and has applied large language models to SQL development on the NineData platform, supporting querying and changing data directly through natural language, and providing database Q&A, SQL optimization suggestions, and other capabilities, with more valuable functions to come. You are welcome to log in and try it: https://www.ninedata.cloud

About the author:

Chen Changcheng (Tianyu), Vice President of Jiuzhang Arithmetic Technology and former senior technical expert at Alibaba Cloud, has worked in the database field for 15 years, leading the evolution of Alibaba's database infrastructure (from IOE to distributed, multi-region active-active, containerization, and storage-compute separation) and the construction of its cloud-native database tooling system.


Bibliography:

Google Brain: "Attention Is All You Need"

OpenAI: “Improving Language Understanding by Generative Pre-training”

OpenAI: “Language Models are Unsupervised Multitask Learners”

OpenAI: "Language Models are Few-Shot Learners"

OpenAI: “Training language models to follow instructions with human feedback”

Luke Cheng:https://github.com/huggingface/blog/blob/main/zh/rlhf.md

Jay Alammar: http://jalammar.github.io/illustrated-transformer/
