
OpenAI co-founder personally explains GPT, so that even technical novices can understand the strongest AI

Author: Heart of the Machine Pro

Heart of the Machine report

Heart of the Machine Editorial Office

At the recent Microsoft developer conference, Microsoft Build 2023, OpenAI co-founder Andrej Karpathy gave a talk titled "State of GPT", in which he first gave an intuitive introduction to the stages of GPT's training process, then showed how to use GPT to complete tasks with concrete examples, and finally offered some very practical usage advice. Heart of the Machine has compiled the talk in detail for readers.

Video address: https://youtu.be/bZQun8Y4L2A

How to train GPT?

First, let's take a general look at the training process of GPT large models. Remember, this is new territory, and change is fast. The process is like this now, and it may not be the same when new technologies emerge in the future.


It can be seen that the training process of GPT can be roughly divided into four stages: pre-training, supervised fine-tuning, reward modeling, and reinforcement learning.

These four stages proceed sequentially. Each stage has its own dataset and its own algorithm for training the neural network. The third row of the slide shows the resulting model, and there are some remarks at the bottom.

Of all the stages, pre-training requires by far the most computation; it can be said that 99% of the training compute time and floating-point operations are concentrated in this stage. This stage processes very large Internet-scale datasets, and it can take a supercomputer made up of thousands of GPUs months of work. The other three stages are all fine-tuning stages and require far fewer GPUs and much less training time.

Below we will explain the entire training process of GPT in stages.

Pre-training phase

The goal of the pre-training phase is to get a base model.

Step one: data collection. This stage requires a huge amount of data; here is an example of such a data mixture, from Meta's LLaMA model:


It can be seen that LLaMA's pre-training data mixes several different types of datasets at different scales, the largest of which are CommonCrawl, crawled from the Internet, and C4, which is built on CommonCrawl, in addition to GitHub, Wikipedia, and other datasets.

After collecting this data, it needs to be preprocessed, a step known as "tokenization". Simply put, this is a translation process that converts the raw text into sequences of integers, which are the native representation GPT actually works with.


This conversion from text to tokens and integers is lossless, and there are several algorithms that perform it. For example, as shown in the figure above, we can use a technique called byte pair encoding, which works by iteratively merging frequent chunks of text into tokens. What is ultimately fed into the Transformer is these integer sequences.
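For intuition, here is a minimal sketch of what tokenization looks like in code, assuming the open-source tiktoken package (which implements GPT-2's byte pair encoding); it is an illustration, not the exact pipeline used in training.

```python
# Minimal tokenization sketch using the tiktoken package (assumed installed).
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's byte pair encoding
text = "Hello world, this is tokenization."
tokens = enc.encode(text)                      # text -> a sequence of integers
print(tokens)                                  # a short list of token ids
assert enc.decode(tokens) == text              # the conversion is lossless
```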

Let's look at two example models, GPT-3 and LLaMA, and some of the main hyperparameters to consider during the pre-training phase. Karpathy said that since OpenAI has not yet released the corresponding information about GPT-4, he used GPT-3's figures in the presentation.


As you can see, the vocabulary size is usually on the order of tens of thousands of tokens. Context lengths are usually around 2,000 or 4,000, though some models now reach 100,000. The context length determines the maximum number of integers GPT looks at when predicting the next integer in the sequence.

For the number of parameters, you can see that GPT-3 has 175 billion and LLaMA has 65 billion, yet LLaMA actually performs much better than GPT-3. Why? Because LLaMA was trained on far more tokens, 1.4 trillion, while GPT-3 was trained on only about 300 billion. Therefore, when evaluating a model, it is not enough to look at the parameter count alone.

The table in the middle of the figure above shows some hyperparameters that need to be set in the Transformer neural network, such as the number of heads, dimension size, learning rate, number of layers, and so on.

Below are some training hyperparameters. For example, to train the 65-billion-parameter LLaMA model, Meta used 2,000 GPUs for about 21 days, at a cost of roughly $5 million. This gives a sense of the order of magnitude of the cost of the pre-training stage.

Let's see what happens during the actual pre-training process. Roughly speaking, the tokens are first packed into batches. These batches form an array of size B×T that is fed into the Transformer, where B is the batch size, i.e. the number of independent sequences stacked as rows, and T is the maximum context length. An example is given in the following figure.


In the example in the figure, the context length T is only 10, but in a real model T can be 2,000, 4,000, or more. That is, a single row of data in a real model can be very long, such as an entire document. We can pack many documents into each row and separate them with the special token <|endoftext|>, which tells the Transformer where a new document begins. For example, the 4 documents in the figure are converted into the 4×10 array at the bottom.
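As a rough illustration (not the actual training code), packing tokenized documents into a B×T array might look like the sketch below; the token ids are made up, and 50256 is the id of <|endoftext|> in the GPT-2 vocabulary.

```python
# Toy sketch: pack several tokenized documents into a B x T batch,
# separated by the <|endoftext|> token (id 50256 in the GPT-2 vocabulary).
import numpy as np

EOT = 50256
docs = [[312, 5, 78], [901, 14, 2, 66, 3], [7, 7, 7], [42, 42]]   # made-up token ids

stream = []
for doc in docs:
    stream.extend(doc + [EOT])        # concatenate documents with a separator

B, T = 4, 10                          # batch size x context length, as in the figure
needed = B * T
stream = (stream * (needed // len(stream) + 1))[:needed]          # crude fill for the toy example
batch = np.array(stream, dtype=np.int64).reshape(B, T)
print(batch.shape)                    # (4, 10)
```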

Now, these numbers need to be fed into the Transformer. Here we look at just one of the cells (green), but in fact every cell goes through the same process.


This green cell sees all the tokens that precede it, i.e. the tokens in all the yellow cells. We feed all of this preceding context into the Transformer neural network, and the Transformer must predict the next token of the sequence, which is the red token in the figure.

To make accurate predictions, the neural network needs to adjust its tens of billions of parameters. After each adjustment, the network's predicted distribution for each cell's next token changes. For example, if the vocabulary contains 50,257 tokens, then the network outputs that many numbers, giving a probability distribution over the possible values of the next token.

In the example in the diagram, the correct next token is 513, so it can be used as the supervision signal to update the Transformer's weights. The same is done for every cell in parallel. We keep feeding in new batches of data so that the Transformer learns to correctly predict the next token of a sequence.
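In code, this objective boils down to a cross-entropy loss between the model's predicted distribution and the actual next token at every position. A minimal PyTorch sketch (with random stand-ins for the model's logits) might look like this:

```python
# Sketch of the next-token prediction loss; the logits are a random stand-in
# for the output of a real Transformer over a B x T batch.
import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 10, 50257
tokens = torch.randint(0, vocab_size, (B, T + 1))             # toy token ids

inputs, targets = tokens[:, :-1], tokens[:, 1:]               # target = input shifted by one position
logits = torch.randn(B, T, vocab_size, requires_grad=True)    # stand-in for model(inputs)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # in real training, gradients flow into the model's parameters
```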

Let's look at a more specific example. This is a small GPT that The New York Times trained on the works of Shakespeare. Shown here are a short passage from Shakespeare and the progress of training GPT on it.


At first, when GPT is initialized, the weights are completely random, so its output is also completely random. As training goes on and GPT keeps iterating, the samples produced by the model become more and more coherent. Eventually, you can see that the Transformer has learned something about words and knows where to put spaces.

In the actual pre-training process, quantitative metrics are used to track how performance changes as the model iterates. In general, researchers monitor the loss function: a low loss indicates that the Transformer assigns a higher probability to the correct next integer in the sequence.


Pre-training is essentially language modeling, and it can take around a month of training. After that, GPT has learned a very powerful general-purpose representation of language, which we can then fine-tune very efficiently for specific downstream tasks.


For example, suppose the downstream task is sentiment classification. In the past, you might have collected a large set of samples labeled "positive" or "negative" and trained an NLP model on them. The new approach skips that: you take a large pre-trained language model and, with only a small number of examples, fine-tune it for your specific task very efficiently.

This is very useful for practical applications. So why do pre-trained large language models (LLMs) only need simple fine-tuning to use? This is because the process of language modeling itself already covers a large number of tasks - in order to predict the next token, the model must understand the structure of the text and the different concepts contained in it.

This is GPT-1.

Now look at GPT-2. With GPT-2 it was observed that prompting can be very effective at getting these models to perform tasks, even without fine-tuning. Since these language models are trained to complete documents, the user can induce the model to perform a specific task simply by constructing an appropriate fake document. An example is given below.


Here the goal is to answer questions about an article. You simply append a few question-answer pairs to the end of the article (this is called a few-shot prompt) and then pose your question; since the Transformer's objective is to complete the document, it effectively answers the question. This example uses a prompt to trick the base model into believing it is completing a document, and in doing so it performs the question-answering task.

Karpathy argues that prompting as an alternative to fine-tuning heralded a new era for large language models: the base model by itself became sufficient for many different kinds of tasks.

As a result, the research frontier shifted toward the evolution of base models. Major research institutions and companies are building their own large base models. However, not all of these models are publicly available; for example, OpenAI has not released the GPT-4 base model. The GPT-4 model we call through the API is not actually a base model but an assistant model.


The GPT-3 base model is available through the DaVinci API, and the GPT-2 base model is public as well; users can even find its weights on GitHub: https://github.com/openai/gpt-2. Overall, though, the most open base models are Meta's LLaMA series, although that series is not licensed for commercial use.

Now it's important to point out that a base model is not the same as an assistant model. Base models do not answer user questions; they only complete documents. So if you say to a base model, "Write a poem about bread and cheese," you probably won't get one: it will just treat your request as a document and try to complete it.


However, you can coax the base model into writing poems with an appropriate prompt, as shown on the right side of the figure above.

Of course, you can also coax the model into acting as an assistant. To do this, you create a specific few-shot prompt that looks like a document recording an exchange between a human and an assistant, as shown in the figure below. Then you simply append your question at the end of the document, and the base model can, to some extent, turn into a useful assistant and give a reasonable answer. But this process is not very reliable, and the results are not great in practice.
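A hypothetical few-shot "fake document" prompt of this kind might look like the sketch below; the format and the base_model call are illustrative, not OpenAI's actual prompt.

```python
# Illustrative few-shot prompt that makes a base model imitate an assistant.
# The document format and `base_model.complete` are hypothetical.
few_shot_prompt = """The following is a conversation between a human and a helpful AI assistant.

Human: What is the capital of France?
Assistant: The capital of France is Paris.

Human: How many legs does a spider have?
Assistant: A spider has eight legs.

Human: Write a poem about bread and cheese.
Assistant:"""

# completion = base_model.complete(few_shot_prompt)   # the base model keeps "completing the document"
```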


Therefore, in order to create a true GPT assistant, another method is needed, namely supervised fine tuning (SFT).

Supervised fine-tuning phase

During the supervised fine-tuning phase, a small but high-quality dataset is collected. OpenAI's approach is to have humans write data consisting of prompts and ideal responses. Quite a lot of this data is needed, generally tens of thousands of examples.

Then, language modeling is continued on this data. The algorithm stays the same; only the training dataset changes: from a large volume of low-quality Internet documents to a small amount of high-quality "prompt-response" data.
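One common way to set this up (a sketch under assumptions, not necessarily OpenAI's pipeline) is to concatenate the prompt and the ideal response and mask the loss so that only the response tokens are supervised:

```python
# Sketch of preparing one SFT example with loss masking; token ids are made up.
import torch

IGNORE_INDEX = -100                      # positions with this label are ignored by cross_entropy

prompt_ids   = [101, 42, 7, 13]          # tokenized prompt
response_ids = [88, 19, 55, 102]         # tokenized ideal (human-written) response

input_ids = torch.tensor(prompt_ids + response_ids)
labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + response_ids)
# Training then uses the same language-modeling loss as pre-training,
# but only the response positions contribute to it.
```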

After this training process is completed, an SFT model is obtained. These models can be deployed as assistants, and they already work to a certain degree.

Let's look at an example. This is the data written by a human contractor, with a prompt in it, and then the human writes the ideal response.


The ideal response naturally cannot be written arbitrarily; it must follow a number of rules (as in the figure on the right above), which impose formatting requirements and ensure that the answers given are helpful, truthful, and harmless.

Next, reinforcement learning based on human feedback (RLHF) is required, which includes a reward modeling phase and a reinforcement learning phase.

Reward modeling phase

At this stage, data collection turns into comparisons. Here is an example: for the same prompt, the assistant is asked to write a program or function that checks whether a given string is a palindrome. The already-trained SFT model is used to generate multiple completions, three of which are shown here. Humans are then asked to rank these completions.


This is not easy work; ranking the completions for a single prompt can take a human hours. Now, assuming the ranking is done, a binary-classification-like operation is performed on all possible pairs of these completions.

As shown in the figure below, the approach is as follows: the prompts are laid out in rows. The three rows here share the same prompt, but the completions (the yellow tokens, from the SFT model) differ. A special reward readout token is then appended at the end. By supervising the Transformer only at that green token position, the Transformer learns to predict a reward that judges how good the completion of the prompt is.


This essentially has the Transformer guess the quality of each completion. Once it has guessed the quality of each completion, the developer uses the ground truth, the human rankings, to force the quality scores of some completions to be higher than others, so that the model's reward predictions become consistent with the human-provided rankings. This is done with a loss function.
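A commonly used form of this loss (a sketch of the idea, not necessarily OpenAI's exact implementation) is the pairwise objective -log sigmoid(r_better - r_worse):

```python
# Sketch of a pairwise reward-modeling loss. The rewards are stand-ins for the
# scalars read out at the special reward token position for two completions.
import torch
import torch.nn.functional as F

reward_better = torch.tensor([1.3, 0.2], requires_grad=True)   # human-preferred completions
reward_worse  = torch.tensor([0.7, 0.9], requires_grad=True)   # less-preferred completions

# Low loss when the model already ranks each pair the same way humans did.
loss = -F.logsigmoid(reward_better - reward_worse).mean()
loss.backward()
```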

With the reward model, GPT is still not a useful assistant, but the reward model is useful for later reinforcement learning stages, because the reward model can evaluate the quality of any completed result of any given prompt.

Reinforcement learning phase

In the reinforcement learning stage, reinforcement learning is run on a large number of prompts, with the reward model scoring the completions.

Here, taking one prompt as an example, the completions from the SFT model (yellow) are arranged in rows, and a reward token (green) is appended after each. These rewards come from the reward model, which is now held fixed.


Now the same language modeling loss function is used, except that training happens only on the yellow tokens, and the language modeling objective is weighted by the reward indicated by the reward model.

For example, in the first row, the reward model scores the completion quite highly. Therefore, all the tokens the model sampled in that row are reinforced, i.e. they will have a higher probability in the future. In contrast, the reward model dislikes the second completion and gives it a negative score, so every token in that row becomes less likely in the future.

Repeating this over many prompts and many batches of data yields a policy for generating the yellow tokens, one whose completions score highly according to the reward model.
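OpenAI's actual algorithm is PPO; as a much-simplified sketch of the idea described above, one can weight the log-probabilities of each sampled completion by its fixed reward:

```python
# Simplified reward-weighted sketch (not PPO): token log-probabilities of each
# sampled completion are scaled by the fixed reward from the reward model.
import torch

log_probs = torch.randn(3, 8, requires_grad=True)   # log p(token) for 3 completions x 8 tokens (stand-in)
rewards = torch.tensor([1.2, -0.8, 0.4])            # one fixed reward per completion

# Minimizing this makes tokens in high-reward completions more likely,
# and tokens in negatively rewarded completions less likely.
loss = -(log_probs.sum(dim=1) * rewards).mean()
loss.backward()
```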

This is the training process for RLHF. The resulting model can be deployed into an application.

ChatGPT is an RLHF model, while other models may be SFT models, such as Claude.

So why does OpenAI use RLHF? The simple reason, Karpathy says, is that RLHF makes the model perform better. According to earlier experiments by OpenAI, RLHF models trained with the PPO (Proximal Policy Optimization) algorithm are better overall: when outputs are shown to humans, humans generally prefer the tokens from the RLHF model over those from the SFT model or from the base model prompted to act as an assistant.


So why does RLHF make models better? At present, the AI research community has not found a theory that is widely recognized, but Karpathy still gave his own insights. He thinks this may have something to do with the asymmetry between the computational difficulty of comparison and generation.

To illustrate with an example: suppose we want a model to write a haiku about paper clips. If you're a contractor collecting training data for an SFT model, how would you write a good haiku about paper clips? You may not be a good haiku poet. But if you're given a few haiku, you can tell which of them is better. That is, judging which sample is better is a much simpler task than creating a good sample, so this asymmetry may make comparison a better way to leverage human judgment to build better models.


On the other hand, RLHF does not always improve on the base model. In some respects, RLHF models lose entropy, meaning their outputs are more monotonous and less varied, while the base model has higher entropy and produces more diverse outputs.


For example, tasks of the form "generate more things like these n examples" may be better suited to the base model. The example task here is generating more Pokémon names: the user provides 7 Pokémon names, then lets the base model complete the document. The base model generates a large number of names, all fictional, since these Pokémon don't actually exist. Karpathy believes such tasks get better results from the base model, because its higher entropy yields outputs that resemble the given examples but are more diverse and interesting.


There are now quite a few assistant models available for users. A team at Berkeley is ranking many of them and assigning Elo scores. The best model at the moment is, of course, GPT-4, followed by Claude and GPT-3.5. Some models release their weights publicly, such as Vicuna and Koala. In this list, the top three are all RLHF models, and the other models are essentially SFT models.


That's how the model is trained. Now let's turn to how the GPT assistant model can be applied to real-world problems.

How to use GPT?

Here's a practical example of how best to use GPT. Let's say you're writing an article that ends with the phrase, "California's population is 53 times that of Alaska." But now you don't know the population data of these two states, and you need an intelligent assistant to help you.

How would a human accomplish this task? Roughly speaking, a human would likely go through a chain of thought like the one in the figure below: first, to compare the two populations, you need the population data, so you use a search tool and find the population figures for California and Alaska, say on Wikipedia; next you obviously need to do a division, perhaps with a calculator; that gives a factor of 53; then your brain might do a quick sanity check: 53 times feels reasonable, given that California is the most populous US state.


Once the information is in hand, the creative writing begins. You might first write "California has 53x times greater", then decide that doesn't read well, delete it, think about a better phrasing, and finally arrive at a sentence you're happy with.

Simply put, in order to write such a sentence, you will experience a lot of monologue-style thinking inside. So what does GPT experience when generating such a sentence?

GPT processes token sequences. Whether reading or generating, it proceeds step by step, one token at a time, and each token receives the same amount of computation. These Transformers have many layers, on the order of 80, but then again, 80 is not all that many. The Transformer does its best to imitate writing through this process, but its thought process is very different from a human's.


That is, unlike humans, GPT has no inner monologue; it just looks at each token and spends the same amount of computation on every token, nothing more. These models are like token simulators: they don't know what they know or don't know, and just imitate by writing the next token; they don't reflect or ask themselves whether a result is reasonable; and if they write something wrong, they don't go back and fix it. They simply sample tokens one after another.

Even so, Karpathy argues that GPT has certain cognitive advantages, such as very broad factual knowledge spanning many domains, because its tens of billions of parameters can store a great many facts. It also has a large and, in its own way, perfect working memory: whatever fits in the Transformer's context window is accessible through the self-attention mechanism, so GPT can recall anything placed in its context window losslessly.

Karpathy said that when humans use GPT through prompts, two different cognitive architectures, the human brain and the LLM, are essentially working together.

Reasoning with GPT

Let's look at one use case where the Transformer actually does quite well in practice: reasoning.

With only a single token, of course, you can't expect the Transformer to reason out much. Reasoning needs to be spread over more tokens. For example, you can't ask the Transformer a very complex question and expect it to find the answer in a single token; it needs to "think" through tokens.


An example is given on the right side of the image above. You can see, in the output, the "thinking" the Transformer does to answer the question. If you provide some examples (above), the Transformer imitates that template, and the results look quite good. You can also guide the Transformer toward similar output by saying "Let's think step by step", which prompts it to show its work. Because the reasoning is then spread across more tokens, it spends less computation per token, reasons more slowly, and is more likely to arrive at the correct answer.
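As a concrete illustration, a step-by-step prompt sent through the openai Python package (v1-style client) might look like the sketch below; the model name, question, and exact wording are illustrative, and a valid API key is assumed.

```python
# Sketch of "Let's think step by step" prompting with the openai v1 client.
# Assumes OPENAI_API_KEY is set; model name and question are illustrative.
from openai import OpenAI

client = OpenAI()
question = "A farmer has 17 sheep. All but 9 run away. How many are left?"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": question + "\n\nLet's think step by step."}],
)
print(response.choices[0].message.content)
```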

Let's look at another example. As shown in the figure below, just as a human sometimes writes a poor word, the Transformer may pick a bad token; but unlike a human, who can stop and revise, the Transformer keeps generating on top of the mistake, carries the error all the way through, and ends up with the wrong answer.


But just as a human can start over if the writing goes badly, the Transformer can sample multiple times, and we can then use some process to pick the best of those samples. This is called self-consistency.
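A rough sketch of self-consistency under assumptions (ask_model and extract_answer are hypothetical helpers): sample several reasoning paths at a nonzero temperature and keep the majority answer.

```python
# Sketch of self-consistency: sample multiple completions, take the majority answer.
from collections import Counter

def self_consistent_answer(question, ask_model, extract_answer, n_samples=5):
    answers = []
    for _ in range(n_samples):
        completion = ask_model(question, temperature=0.8)  # independent sampled reasoning path
        answers.append(extract_answer(completion))         # pull out the final answer
    return Counter(answers).most_common(1)[0][0]           # most frequent answer wins
```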

Interestingly, by asking the model to reflect, you can see that it actually knows when it has made a mistake. For example, if you ask GPT-4 for a poem that rhymes and it generates one that doesn't, you can simply ask it, "Did you complete the task?" It will recognize that it did not and will redo the task for you.


But if you don't give that prompt, it doesn't know it's wrong. It doesn't look back on its own, after all, it's just a token simulator. You have to make it look back via prompt.

Karpathy says that, by analogy, AI models can be divided into two kinds of systems: System 1 and System 2. System 1 is fast and automatic, corresponding to a large language model simply sampling tokens. System 2 is slower and deliberates, thinking things over and planning.


Many people are now designing prompts to make LLMs exhibit thought processes more like a human's. Figure (d) on the left above shows the Tree of Thought proposed in a recent paper, which maintains multiple completions for any given prompt, scores them, and keeps the ones that score well.

To do this, you can't just use a single prompt; you need to combine multiple prompts with Python glue code. This essentially maintains multiple prompts and runs a tree search algorithm to decide which prompts to expand. It is a symbiosis of Python glue code and individual prompts.

Karpathy draws a comparison to AlphaGo here. AlphaGo plays Go by choosing the next move, and its policy was initially trained to imitate humans. But on top of that policy, it also runs a Monte Carlo tree search: it considers a large number of possibilities, evaluates them, and keeps only the ones that work well. Tree of Thought is a bit like AlphaGo's thought process when playing Go, except that it operates on text.

Beyond Tree of Thought, many more people are experimenting with getting LLMs to do tasks more complex than simple question answering, and much of that work looks like Python glue code chaining many prompts together.


Two examples are given in the figure above. The paper on the right proposes ReAct, in which the researchers structure the answer to a prompt as a sequence of thought, action, and observation steps; in the action steps, the model can also invoke tools. It is like a thought process for answering queries.
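A rough sketch of such a loop, under assumptions (call_llm and the tools dictionary are hypothetical, and the line format is illustrative rather than the paper's exact prompt):

```python
# Sketch of a ReAct-style loop: the model emits Thought / Action lines, the glue
# code runs the requested tool and appends an Observation line, until a final answer.
def parse_action(step):
    # Expects a line like "Action: search[population of Alaska]".
    line = next(l for l in step.splitlines() if l.startswith("Action:"))
    name, _, rest = line[len("Action:"):].strip().partition("[")
    return name, rest.rstrip("]")

def react_loop(question, call_llm, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)              # e.g. "Thought: ...\nAction: search[...]"
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            name, arg = parse_action(step)
            observation = tools[name](arg)       # run a tool: search engine, calculator, ...
            transcript += f"Observation: {observation}\n"
    return None
```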

The one on the left is AutoGPT. This project has received a fair amount of hype recently, but it is genuinely interesting research. AutoGPT keeps a list of tasks and recursively breaks them down. It doesn't work very well yet, and Karpathy doesn't recommend it for practical applications, but he says it is instructive from a research perspective.

These are some of the research efforts trying to give LLMs something like System 2 thinking.

Karpathy went on to describe another interesting quirk of LLMs: psychologically speaking, they don't want to succeed, they only want to imitate. If you want the right answer, you have to ask for it explicitly. This is because the training data of the Transformer is not always correct; it also contains plenty of low-quality data.

For example, for a physics problem, the dataset may contain a wrong answer from a student alongside a correct answer from an expert. The Transformer doesn't know which one to imitate, nor does it "want" to imitate the better one; after all, its training objective is language modeling, not telling right from wrong. So when you use or test the model, if you want the right answer, you need to say so explicitly.


For example, in the paper shown above, the researchers tried a variety of prompts and found that, for the same problem, different prompts yielded different accuracies. As you can see, if the prompt explicitly asks the model to reason step by step and to give the correct result, accuracy is higher, because then the Transformer does not have to spread probability over low-quality solutions.

So if you want the right answer, say so! For example, add "you are an expert in this field" or "assume you have an IQ of 120" to the prompt. But don't overdo it: asking the model to assume an IQ of 400 may push the request outside the training data distribution, or into parts of the distribution associated with science fiction, and the model may start role-playing a sci-fi character.

Let LLM use tools / plugins

Using the right tools for specific problems can often do more with less. The same is true for LLM. Depending on the task, we may want LLM to use tools such as calculators, code interpreters, search engines, etc.


But first, keep in mind that a Transformer may not know by default what it cannot do. You may even need to tell it explicitly in the prompt: "You are not good at mental arithmetic; when you need to do arithmetic on large numbers, use this calculator, and here is how to use it." You have to explicitly ask it to use a given tool, because the model itself does not know what it is or isn't good at.

Retrieval is an important tool that can greatly improve an LLM's performance. Because an LLM only has what it has memorized in its weights, a search engine, which specializes in retrieval, is a powerful complement. Practice has also shown that an LLM with access to search tools is far more useful.


As mentioned earlier, the Transformer's context window is its working memory. If information relevant to the current task can be loaded into that working memory, the model performs better, because it can read all of it at once. Retrieval-augmented generation is indeed a topic many people are interested in. The bottom of the figure above shows LlamaIndex, which has data connectors to many different types of data sources; it can index all sorts of data and make it available to LLMs.

The common practice is to take the relevant documents, split them into chunks of text, embed each chunk to obtain embedding vectors that represent the data, and store those vectors. At query time, we query the stored vectors to fetch the chunks relevant to the current task, add them to the prompt, and have the LLM generate from there. This approach works well in practice.
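A minimal retrieval-augmented-generation sketch of that flow, with embed and call_llm as hypothetical stand-ins for an embedding model and an LLM call, might look like this:

```python
# Sketch of retrieval-augmented generation: chunk, embed, retrieve by cosine
# similarity, and prepend the retrieved chunks to the prompt.
import numpy as np

def chunk(text, size=500):
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents, embed):
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = np.array([embed(c) for c in chunks])      # one embedding vector per chunk
    return chunks, vectors

def answer(query, chunks, vectors, embed, call_llm, k=3):
    q = np.array(embed(query))
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    top = [chunks[i] for i in np.argsort(-sims)[:k]]    # k most relevant chunks
    prompt = "Context:\n" + "\n---\n".join(top) + f"\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)
```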

This is similar to how humans accomplish tasks. People can also do things based on their own memories alone, but if they can retrieve information related to the task, it will naturally be easier to do things. Transformers have a large memory, but they can also benefit from retrieval.

Adding constraints in the prompt

Setting constraints in the prompt can force an LLM to output results in a specific template. The figure below shows Microsoft's Guidance tool, which helps users make better use of LLMs (available at https://github.com/microsoft/guidance). In the example given here, the LLM's output is guaranteed to be valid JSON. This works because the tool constrains where the model is allowed to emit tokens: the model can only fill in the blanks in the template, so the structure of the output is strictly enforced. Sampling with constraints like this is useful for some tasks.
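The sketch below illustrates the idea of constrained, fill-in-the-blank output in a generic way; it is not the Guidance library's actual API, and call_llm is a hypothetical constrained completion call.

```python
# Toy illustration of template-constrained output (NOT the Guidance API):
# the JSON structure is fixed by the code, and the model only fills the values.
import json

def fill_character_sheet(call_llm, description):
    template = {"name": None, "age": None, "armor": None}
    for field in template:
        template[field] = call_llm(
            f"{description}\nReturn only the value for '{field}':").strip()
    return json.dumps(template)   # valid JSON is guaranteed by construction
```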


Fine-tuning

Prompt design lets an LLM handle many different tasks, but we can also achieve this through fine-tuning.

Fine-tuning a model means changing the model's weights. This is becoming easier to do, because large open-source models such as LLaMA are available, along with software libraries for fine-tuning them.


Parameter-efficient fine-tuning techniques such as LoRA let users train only small, sparse parts of the model. With this technique, most of the base model stays fixed while a few parts are allowed to change. In practice it works well and lets people adjust a model at very low cost. Moreover, because most of the model is frozen, those frozen parts can be kept in very low precision for inference, since gradient descent never updates them. As a result, overall fine-tuning efficiency can be very high.
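Here is a sketch of what LoRA fine-tuning can look like with the Hugging Face transformers and peft libraries; the model id and hyperparameters are illustrative choices, not recommendations from the talk.

```python
# Sketch of LoRA fine-tuning with peft; model id and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which projections get LoRA adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # only a small fraction of weights will be trained
```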

However, remember that fine-tuning requires technical expertise, and very likely domain expertise as well; whether you hire human experts to write datasets or synthesize data with automated pipelines, it is quite involved, and it also lengthens the iteration cycle.

In addition, Karpathy pointed out that supervised fine-tuning (SFT) is still feasible for ordinary users, because it is really just continued language modeling; RLHF, however, remains a research topic and is much harder to get working, so it is not recommended for beginners.

Karpathy's recommendations for using GPT

To help people use GPT better, Karpathy gave some advice. When using GPT to complete a task, he suggests splitting the work into two goals: first, achieve the best possible result; second, optimize cost, in that order.


For the first part, start by choosing a model; the strongest model at the moment is GPT-4. Once you have a model, when tackling a specific task, design the prompt in enough detail to include background, relevant information, and a description of the task. Think about how a human would accomplish the task, but keep in mind that a human has an inner monologue and can introspect, while an LLM cannot; understanding how LLMs work helps with prompt design. You can retrieve relevant background and information and add it to the prompt. Many people have shared their experience and techniques online.

Don't rush the LLM to complete your task in one step. Run a few experiments to explore what is possible, and provide the LLM with examples so that it truly understands your intent.

Problems that are hard to solve with the LLM alone can be handed to tools and plug-ins. Think about how to integrate those tools; this obviously can't be solved with a single prompt. You need to experiment and learn what works in practice.

Finally, if you manage to find a prompt design that works for you, stick with it for a while, and then consider fine-tuning the model to serve your application even better; but understand that fine-tuning is slower and requires more investment. For research professionals who want to use RLHF: although RLHF currently works better than SFT when it can be made to work, it also costs much more. To save costs, exploratory studies can use lower-performance models or shorter prompts.

Karpathy emphasized that LLMs can run into a number of problems when applied to use cases: biased outputs, fabricated (hallucinated) information, reasoning errors, weaknesses on certain kinds of tasks (such as spelling-related tasks), a knowledge cutoff (GPT-4's training data ends around September 2021), and susceptibility to attacks (such as prompt injection, jailbreaks, and data poisoning attacks).


Karpathy recommends that, for now, users only use LLMs in low-stakes applications and always combine them with human supervision. LLMs can serve as a source of inspiration and suggestions that complement us, rather than acting completely autonomously.

Epilogue

Karpathy concluded by saying, "GPT-4 is an amazing creation. I'm grateful that it exists, and it's beautiful." It has extraordinary abilities to help users answer questions, write code, and more, and the ecosystem around it is thriving.


Finally, Karpathy asked GPT-4 a question: "If you were to inspire the audience at Microsoft Build 2023, what would you say?"


GPT-4 gives the following answer:


The author of this article also took the opportunity to have ChatGPT translate that passage into Chinese:

