
10 important concepts that explain in detail how large language models (LLMs) work

Author: AI Observation Room

In my free time I compiled 10 important concepts of large language models, in the hope that they help us understand large language models more deeply.

Large language models are a truly exciting field that will see rapid innovation. This article can help you understand how they work.

At the same time, we should note that large language models learn language very differently from humans: they lack the social and perceptual context that human language learners use to infer the relationship between utterances and the speaker's mental state, and they are trained very differently from how humans think. But these differences may also be opportunities to improve large language models or to invent new learning algorithms.

Generative artificial intelligence (GenAI), and ChatGPT in particular, has grabbed everyone's attention. Transformer-based large language models (LLMs), trained on large-scale unlabeled data, demonstrate the ability to generalize to many different tasks. To understand why large language models are so powerful, this post dives into how they work.


Formally, a decoder-only language model is simply the conditional probability of the next token given the preceding context, p(xi | x1 ··· xi−1). Such a formulation is an instance of a Markov process and has been studied in many use cases. This simple setup also allows us to generate tokens one at a time in an autoregressive manner.
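To make the autoregressive loop concrete, here is a minimal sketch that samples one token at a time from the conditional distribution p(xi | x1 ··· xi−1). The function next_token_distribution and the tiny vocabulary are hypothetical stand-ins for an actual trained model, used only for illustration.

```python
import random

def next_token_distribution(context):
    # Hypothetical stand-in for a trained model: returns a probability for every
    # token in the vocabulary given the tokens generated so far.
    vocab = ["the", "cat", "sat", "<eos>"]
    return {token: 1.0 / len(vocab) for token in vocab}

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)           # p(x_i | x_1 ... x_{i-1})
        candidates, weights = zip(*probs.items())
        next_token = random.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":                         # stop at end-of-sequence
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"]))
```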

Before we dive in, I must point out the limitations of this formulation for reaching artificial general intelligence (AGI). Thinking is a non-linear process, but our communication instrument, the mouth, can only speak linearly, so language presents itself as a linear sequence of words. Modeling language as a Markov process is therefore a logical starting point, but I doubt that this formulation can fully capture the thought process (or AGI). On the other hand, thinking and language are interrelated, and a sufficiently powerful language model may still exhibit some kind of thinking ability, as GPT-4 demonstrates. Below, we look at the scientific innovations that make LLMs appear intelligent.

Transformer

There are many ways to model/represent the conditional distribution p(xi | x1 ··· xi−1). In large language models, we estimate this conditional distribution with a neural network architecture called the Transformer.

In fact, neural networks, especially recurrent neural networks (RNNs) of all kinds, were used for language modeling long before the Transformer. An RNN processes tokens sequentially, maintaining a state vector that contains a representation of the data seen before the current token. To process the n-th token, the model combines the state representing the sentence up to token n−1 with the information of the new token to create a new state representing the sentence up to token n. Theoretically, if the state keeps encoding contextual information at every step, information from one token can propagate arbitrarily far along the sequence. Unfortunately, the vanishing gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about earlier tokens. The dependence of each token's computation on the results of previous tokens' computations also makes parallel computation on modern GPU hardware difficult.

These problems were solved by the self-attention mechanism in the Transformer (Attention Is All You Need). The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between inputs and outputs. The attention layer has access to all previous states and weighs them according to a learned measure of relevance, providing relevant information even about far-away tokens. Importantly, Transformers use the attention mechanism without an RNN, processing all tokens simultaneously and computing attention weights between them in successive layers. Since the attention mechanism only uses information about other tokens from lower layers, all tokens can be computed in parallel, which speeds up training.


The input text is parsed into tokens by a byte-pair-encoding tokenizer, and each token is converted into an embedding vector. Positional information about the token is then added to the embedding. The Transformer building block is the scaled dot-product attention unit. When a sentence is passed through the Transformer model, attention weights between all tokens are computed simultaneously. The attention unit produces, for each token in the context, an embedding that contains information about the token itself together with a weighted combination of the other relevant tokens, each weighted by its attention weight.

For each attention unit, the Transformer model learns three weight matrices: the query weights WQ, the key weights WK, and the value weights WV. For each token i, the input word embedding is multiplied by each of the three weight matrices to produce a query vector qi, a key vector ki, and a value vector vi. The attention weight is the dot product of qi and kj, scaled by the square root of the key-vector dimension and normalized by a softmax. The output of the attention unit for token i is the weighted sum of the value vectors of all tokens, weighted by the attention of token i to each token j. The attention computation for all tokens can be expressed as one large matrix calculation:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V
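As a minimal NumPy sketch of the scaled dot-product attention described above; the shapes and random weight initialization are illustrative assumptions, not any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; W_q/W_k/W_v: (d_model, d_k) weights."""
    Q = X @ W_q                      # query vectors q_i
    K = X @ W_k                      # key vectors k_i
    V = X @ W_v                      # value vectors v_i
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot products q_i · k_j, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over j
    return weights @ V               # weighted sum of value vectors

seq_len, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(scaled_dot_product_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```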

A set of (WQ, WK, WV) matrices is called an attention head, and each layer of the Transformer has multiple attention heads. With multiple attention heads, the model can compute different notions of relevance between tokens. The computations of the attention heads can be performed in parallel, and their outputs are concatenated and projected back into the input dimension by a matrix WO.

In the encoder, a fully connected multilayer perceptron (MLP) follows the attention mechanism. The MLP block further processes each output encoding individually. In an encoder-decoder setting (for example, for translation), an additional attention mechanism is inserted in the decoder between self-attention and the MLP to draw relevant information from the encodings produced by the encoder. In a decoder-only architecture, this is not necessary. Whether encoder-decoder or decoder-only, the decoder must not predict an output using the current or future outputs, so the output sequence is partially masked to prevent this reverse flow of information and to allow autoregressive text generation. To generate tokens one by one, a softmax layer follows the last decoder block to produce output probabilities over the vocabulary.
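A short sketch of the causal (look-ahead) mask that blocks attention to future positions in the decoder, using the same assumed shapes as the previous example; the random scores stand in for raw query-key dot products.

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))  # raw q·k scores

# Positions j > i (future tokens) get -inf so the softmax assigns them zero weight.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # lower-triangular attention pattern
```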

Supervised fine-tuning

The decoder-only GPT (Improving Language Understanding by Generative Pre-Training) is essentially an unsupervised (or self-supervised) pre-training algorithm that maximizes the following likelihood:

L1(U) = Σi log P(ui | ui−k, ..., ui−1; Θ)

where k is the size of the context window. Although the architecture is task-agnostic, GPT shows that large gains on natural language inference, question answering, semantic similarity, and text classification can be achieved by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.

After pre-training the model with the objective above, we can adapt the parameters to a supervised target task. Given a labeled dataset C, where each instance consists of a sequence of input tokens x1, ..., xm together with a label y, the input is passed through the pre-trained model to obtain the final Transformer block's activation h_l^m, which is then fed into an added linear output layer with parameters Wy to predict y:

P(y | x1, ..., xm) = softmax(h_l^m · Wy)

Accordingly, we have the following objective function:

L2(C) = Σ(x, y) log P(y | x1, ..., xm)

In addition, language modeling is helpful as an auxiliary objective, as it improves the generalization of the supervised model and accelerates convergence. That is, we optimize the following objective:

L3(C) = L2(C) + λ · L1(C)
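As an illustration only, here is a minimal sketch of combining the two terms during fine-tuning. The methods classification_nll and language_modeling_nll are hypothetical stand-ins for the two negative log-likelihood terms; they are not part of any real API.

```python
def combined_fine_tuning_loss(batch, model, lam=0.5):
    # L2(C): negative log-likelihood of the label y given the input tokens.
    # L1(C): auxiliary language-modeling loss on the same token sequences.
    # Both methods are hypothetical stand-ins for an actual model implementation.
    l2 = model.classification_nll(batch.tokens, batch.labels)
    l1 = model.language_modeling_nll(batch.tokens)
    return l2 + lam * l1   # L3(C) = L2(C) + λ · L1(C)
```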

As mentioned above, text classification can be fine-tuned directly. Other tasks, such as question answering or text entailment, have structured inputs, such as ordered sentence pairs or triples of document, question, and answer. Because the pre-trained model is trained on contiguous sequences of text, some modifications are needed to apply it to these tasks.


Text entailment: concatenate the token sequences of the premise p and the hypothesis h, with a delimiter token ($) in between.

Similarity: the two sentences being compared have no inherent order. The input therefore contains both possible sentence orderings (with a delimiter in between); each is processed independently to produce two sequence representations, which are added element-wise before the linear output layer.

Question answering and commonsense reasoning: each example has a context document z, a question q, and a set of possible answers {ak}. GPT concatenates the document context and the question with each possible answer, adding a delimiter in between to obtain [z; q; $; ak]. Each sequence is processed independently and then normalized by a softmax layer to produce an output distribution over the possible answers.
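For illustration, a rough sketch of how such inputs might be assembled as plain strings; the actual GPT implementation operates on token sequences with a learned delimiter embedding, so the "$" string below is only a placeholder.

```python
def entailment_input(premise, hypothesis, delimiter="$"):
    # Premise and hypothesis concatenated with a delimiter token in between.
    return f"{premise} {delimiter} {hypothesis}"

def similarity_inputs(sentence_a, sentence_b, delimiter="$"):
    # Both orderings are processed independently; their representations are
    # added element-wise before the linear output layer.
    return [f"{sentence_a} {delimiter} {sentence_b}",
            f"{sentence_b} {delimiter} {sentence_a}"]

def qa_inputs(context, question, answers, delimiter="$"):
    # One sequence [z; q; $; a_k] per candidate answer; a softmax over the
    # per-sequence scores gives the answer distribution.
    return [f"{context} {question} {delimiter} {answer}" for answer in answers]
```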

Zero-shot transfer (aka meta-learning)

While GPT shows that supervised fine-tuning works well on task-specific datasets, achieving strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Interestingly, GPT-2 (Language Models are Unsupervised Multitask Learners) shows that language models begin to learn multiple tasks without any explicit supervision, conditioned on documents and questions (a.k.a. prompts).

Learning to perform a single task can be expressed in a probabilistic framework as estimating the conditional distribution p(output | input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed; that is, it should model p(output | input, task). Previously, such task conditioning was typically implemented at the architecture level or at the algorithm level. But language provides a flexible way to specify tasks, inputs, and outputs all as sequences of symbols. For example, a translation training example can be written as the sequence (translate to French, English text, French text). In particular, GPT-2 is conditioned on a context of example pairs in the format "english sentence = french sentence"; after a final prompt "english sentence =", the model is sampled with greedy decoding and the first generated sentence is used as the translation.

Similarly, to induce summarization behavior, GPT-2 appends TL;DR: after the article and generates 100 tokens with top-k random sampling with k = 2, which reduces repetition and encourages more abstractive summaries than greedy decoding. Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer).
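To make the prompt formats concrete, here is a small sketch that assembles the translation and summarization prompts described above; the exact demonstration pairs and sentence are invented for illustration.

```python
def translation_prompt(examples, english_sentence):
    # Demonstration pairs in the "english sentence = french sentence" format,
    # followed by the sentence to translate; the model continues after "=".
    lines = [f"{en} = {fr}" for en, fr in examples]
    lines.append(f"{english_sentence} =")
    return "\n".join(lines)

def summarization_prompt(article):
    # GPT-2 style: append "TL;DR:" after the article to induce a summary.
    return f"{article}\nTL;DR:"

print(translation_prompt([("good morning", "bonjour")], "thank you"))
```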

Note that zero-shot transfer is different from the zero-shot learning discussed in the next section. In zero-shot transfer, "zero-shot" means that no gradient updates are performed, but it usually involves providing inference-time demonstrations to the model (such as the translation example above), so it is not truly learning from zero examples.

I find an interesting connection between this meta-learning approach and Montague semantics, a theory of natural language semantics and its relationship to syntax. In 1970, Montague argued:

There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians; indeed, I consider it possible to comprehend the syntax and semantics of both kinds of language within a single natural and mathematically precise theory.

Philosophically, both zero-shot transfer and Montague semantics treat natural language as a programming language. Large language models capture tasks through embedding vectors in a black-box way, and we don't know exactly how this works. In contrast, the most important feature of Montague semantics is its adherence to the principle of compositionality, i.e., the meaning of the whole is a function of the meanings of its parts and of how they are syntactically combined. This could be one direction for improving large language models.

In-context learning

GPT-3 (Language Models are Few-Shot Learners) shows that scaling up language models greatly improves task-agnostic few-shot performance. GPT-3 further distinguishes between "zero-shot", "one-shot", and "few-shot" settings, depending on how many demonstrations are provided at inference time:

  1. "Low-sample learning" or contextual learning, we allow as many demo context windows (typically 10 to 100) to fit the model;
  2. "One-time learning", where we only allow one presentation;
  3. "Zero" learning, where no demonstration is allowed and only natural language instructions are provided to the model.

For few-shot learning, GPT-3 evaluates each example in the evaluation set by conditioning on K examples randomly sampled from the task's training set, separated by 1 or 2 newlines depending on the task. K can be any value from 0 up to the maximum the model's context window allows, which is nctx = 2048 tokens for all models and typically fits 10 to 100 examples. Larger values of K are usually, but not always, better.
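A minimal sketch of this conditioning step, assuming the task's training set is a list of (input_text, target_text) pairs; the function name and separator are illustrative choices, not GPT-3's actual evaluation code.

```python
import random

def build_few_shot_prompt(train_set, eval_input, k=10, separator="\n\n"):
    """Condition on K randomly sampled demonstrations, then the example to complete.

    train_set: list of (input_text, target_text) pairs for the task.
    """
    demos = random.sample(train_set, k=min(k, len(train_set)))
    blocks = [f"{x}\n{y}" for x, y in demos]  # K solved examples from the training set
    blocks.append(eval_input)                 # the context-only example to be completed
    return separator.join(blocks)
```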

For some tasks, GPT-3 uses a natural language prompt in addition to (or, when K = 0, instead of) demonstrations. For tasks that involve choosing the correct completion from several options (multiple choice), the prompt contains K examples of context plus correct completion, followed by one example of context only, and evaluation compares the model's likelihood of each completion.

For tasks that involve binary classification, GPT-3 gives the two options semantically meaningful names (such as "True" or "False" rather than 0 or 1) and then treats the task as multiple choice.

For tasks with free-form completion, GPT-3 uses beam search. The model is scored with an F1 similarity score, BLEU, or exact match, depending on the standard for the dataset at hand.

Model size is important

The capacity of a language model is critical to the success of task-agnostic learning, and increasing it improves performance across tasks in a log-linear fashion. GPT-2 was created as a direct scale-up of GPT-1, with ten times the parameters and ten times the dataset size. Yet it can perform downstream tasks in a zero-shot transfer setting, without any parameter or architecture modification.

GPT-3 uses the same model and architecture as GPT-2, except that alternating dense and locally banded sparse attention patterns are used in the layers of the Transformer.


On TriviaQA, GPT-3's performance grows steadily with model size, suggesting that the language model continues to absorb knowledge as capacity increases. One-shot and few-shot performance improve significantly over zero-shot behavior.


Data quality is important

Although less discussed, data quality also matters. Datasets for language models have expanded rapidly. For example, the CommonCrawl dataset contains nearly a trillion words, enough to train the largest models without ever repeating the same sequence. However, unfiltered or lightly filtered versions of CommonCrawl tend to be of lower quality than more curated datasets.

GPT-2 therefore created a new web scrape that emphasizes document quality by scraping all outbound links from Reddit that received at least 3 karma, which serves as a heuristic indicator of whether other users found the link interesting, educational, or just funny. The final dataset contains slightly over 8 million documents, totaling 40 GB of text after de-duplication and some heuristic cleaning.

In addition, GPT-3 takes three steps to improve the average quality of its dataset: (1) filtering CommonCrawl by similarity to a range of high-quality reference corpora; (2) fuzzy document-level deduplication within and across datasets, to prevent redundancy and to preserve the integrity of the held-out validation set as an accurate measure of overfitting; and (3) adding known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Similarly, GLaM (GLaM: Efficient Scaling of Language Models with Mixture-of-Experts) developed a text quality classifier to produce a high-quality web corpus from a much larger raw corpus. The classifier is trained to discriminate between a collection of curated text (Wikipedia, books, and a few selected websites) and other web pages. GLaM uses this classifier to estimate the content quality of a web page and then samples pages according to their scores using a Pareto distribution. This allows some lower-quality web pages to be included, preventing systematic bias from the classifier.
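Purely as an illustration of the general idea, the sketch below samples pages with probability increasing in their classifier score while keeping nonzero mass on lower-scoring pages; it does not reproduce GLaM's actual Pareto-based sampling rule, and the function name and weighting are assumptions.

```python
import numpy as np

def sample_pages(quality_scores, n_samples, seed=0):
    # Illustrative only: higher-scoring pages are more likely to be kept, but
    # lower-scoring pages retain some probability so the classifier's biases
    # are not applied deterministically.
    rng = np.random.default_rng(seed)
    scores = np.asarray(quality_scores, dtype=float)
    probs = (scores + 1e-3) / (scores + 1e-3).sum()
    return rng.choice(len(scores), size=n_samples, replace=False, p=probs)

print(sample_pages([0.9, 0.1, 0.5, 0.05], n_samples=2))
```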


GLaM also sets the mixture weights based on the performance of each data component in a smaller model, and prevents small sources such as Wikipedia from being over-sampled.

Chain of thought

As noted earlier, predicting the next token is not the same as a thought process. Interestingly, some of the reasoning and arithmetic abilities of large language models can be unlocked by prompting (Chain-of-Thought Prompting Elicits Reasoning in Large Language Models). A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output. If demonstrations of chain-of-thought reasoning are provided as exemplars in a few-shot prompt, a sufficiently large language model can generate its own chains of thought: ⟨input, chain of thought, output⟩. But we don't know exactly why or how this works.
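For illustration, a one-shot chain-of-thought prompt in the style of the paper's examples; the exemplar below is a commonly cited one and the exact wording is illustrative rather than quoted.

```python
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

new_question = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\nA:"
)

# ⟨input, chain of thought, output⟩ demonstration followed by the new input;
# a sufficiently large model tends to continue with its own reasoning steps.
prompt = cot_exemplar + new_question
print(prompt)
```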


Reinforcement Learning from Human Feedback (RLHF)

The language-modeling objective used by LLMs (predicting the next token) is different from the objective of "following the user's instructions helpfully and safely". We therefore say that the language-modeling objective is misaligned.

InstructGPT (Training language models to follow instructions with human feedback) aligns language models with user intent on a wide range of tasks by fine-tuning with reinforcement learning from human feedback (RLHF). The technique uses human preferences as a reward signal to fine-tune the model.

Step 1: Collect demonstration data and train a supervised policy. Labelers provide demonstrations of the desired behavior on the input prompt distribution. A pre-trained GPT-3 model is then fine-tuned on this data with supervised learning.

Step 2: Collect comparison data and train a reward model. A dataset of comparisons between model outputs is collected, in which labelers indicate which output they prefer for a given input. A reward model is then trained to predict the human-preferred output.

Step 3: Optimize a policy against the reward model with PPO. The output of the reward model is used as a scalar reward, and the supervised policy is fine-tuned to optimize this reward with the PPO (Proximal Policy Optimization) algorithm.

Steps 2 and 3 can be iterated continuously: more comparison data is collected on the current best policy, which is used to train a new reward model and then a new policy.
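A minimal sketch of the pairwise preference loss typically used for the reward model in step 2: the reward of the human-preferred completion should exceed that of the rejected one, so the loss is the negative log-sigmoid of the reward margin. The tensor values below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    # The reward model should score the human-preferred completion higher than
    # the rejected one for the same prompt: loss = -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards for two prompts (values are illustrative only).
chosen = torch.tensor([1.2, 0.4])
rejected = torch.tensor([0.3, 0.9])
print(reward_model_loss(chosen, rejected))
```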


Instruction fine-tuning

While the supervised fine-tuning introduced in GPT-1 focuses on task-specific tuning, T5 trains with a maximum-likelihood objective (using "teacher forcing") regardless of the task. Essentially, T5 leverages the same intuition as zero-shot transfer: NLP tasks can be described by natural language instructions, such as "Is the sentiment of this movie review positive or negative?" or "Translate 'how are you' into Chinese." To specify which task the model should perform, T5 adds a task-specific (text) prefix to the original input sequence before feeding it to the model. In addition, FLAN (Finetuned Language Models Are Zero-Shot Learners) explores instruction fine-tuning, with a particular focus on scaling the number of tasks, scaling model size, and fine-tuning on chain-of-thought data.

For each dataset, FLAN hand-wrote ten unique templates that describe the dataset's task using natural language instructions. While most of the ten templates describe the original task, for diversity FLAN also includes up to three templates per dataset that "turn the task around" (for example, for sentiment classification, a template that asks the model to generate a movie review). A pre-trained language model is then instruction-tuned on the mixture of all datasets, with each example formatted by a randomly selected instruction template for its dataset.
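A hedged sketch of this template-based formatting; the templates below are invented examples for a sentiment-classification dataset, not FLAN's actual templates.

```python
import random

# Invented templates for a sentiment-classification dataset (not FLAN's actual ones).
TEMPLATES = [
    "Is the sentiment of this movie review positive or negative?\n{review}",
    "{review}\nDid the reviewer like the movie? Answer yes or no.",
    "Write a movie review that expresses a {label} sentiment.",  # "turned around" task
]

def format_example(review, label):
    # Each training example is rendered with a randomly chosen template.
    template = random.choice(TEMPLATES)
    return template.format(review=review, label=label)

print(format_example("A moving, beautifully shot film.", "positive"))
```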


So-called prompt engineering is essentially a kind of reverse engineering, that is, working out how the training data for instruction fine-tuning and in-context learning was prepared.

Retrieval-Augmented Generation (RAG)

Because of cost and time, the training data of large language models used in production tends to lag in freshness. To address this, we can use large language models with retrieval-augmented generation (RAG). In this use case, we don't want the large language model to generate text based only on its training data; rather, we want it to somehow incorporate additional external data. With RAG, large language models can also answer (private) domain-specific questions. For this reason, RAG is also called "open-book" question answering. LLM + RAG can be an alternative to the classic search engine; in other words, it acts as information retrieval with hallucinations.

Currently, the retrieval part of RAG is typically implemented as a k-nearest-neighbor (similarity) search over a vector database that contains vector embeddings of the external text data. For example, DPR (Dense Passage Retrieval for Open-Domain Question Answering) formulates encoder training as a metric learning problem. However, we should note that information retrieval is usually based on relevance, which is not the same as similarity. I expect more improvements in this area in the future.
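A minimal sketch of the retrieve-then-generate flow: embed the query, find the k most similar passages by cosine similarity, and prepend them to the prompt. The embed function is a hypothetical stand-in for a trained text encoder (such as a DPR-style model), and the prompt wording is an assumption.

```python
import numpy as np

def embed(text):
    # Hypothetical stand-in for a trained text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def top_k_passages(query, passages, k=3):
    # k-nearest-neighbor search by cosine similarity over passage embeddings.
    q = embed(query)
    sims = []
    for passage in passages:
        p = embed(passage)
        sims.append(float(q @ p) / (np.linalg.norm(q) * np.linalg.norm(p)))
    best = np.argsort(sims)[::-1][:k]
    return [passages[i] for i in best]

def build_rag_prompt(query, passages, k=3):
    context = "\n".join(top_k_passages(query, passages, k))
    return (f"Answer the question using the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```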
