
Translation: An Overview of the Research Explosion in AI Large Language Models (LLMs)

Author: Sentence weight difference

Original: Navigating the Large Language Model Revolution with Paperspace, by James Skelton.

In this post, we survey the rapidly expanding Large Language Model (LLM) ecosystem, explaining the key terms and notable recent models.

The past few months have witnessed a long-awaited explosion of AI research. The emergence of the generative pre-trained transformer (GPT) model five years ago was arguably the first stone to pave the way. From there, the development of human-like text generation was almost only a matter of time. With OpenAI's ChatGPT and GPT-4, major contenders like Bard, and open source alternatives like LLaMA entering the public domain over the past six months, it's now more important than ever for everyone to be familiar with these impressive new technologies.

This article will begin by discussing the GPT architecture and succinctly explaining why it has become the default architecture for nearly any NLP/NLU task. Next, we will discuss some of the main terms around LLMs, such as the LoRA fine-tuning method, reinforcement learning from human feedback (RLHF), and quantization methods for faster and lower-cost fine-tuning, such as QLoRA. We'll end with a brief overview of using the best performing open models on your own projects, including Alpaca, LLaVA, MPT-7B, and Guanaco.

GPT architecture

The GPT model is an LLM first introduced by Radford et al. in Improving Language Understanding by Generative Pre-Training in 2018. These OpenAI researchers set out to create a model that takes a natural language prompt as input and, drawing on its understanding of the content, predicts the best possible response. Rather than generating an entire sequence of text at once, the GPT model generates one word piece, called a "token," at a time, feeding each generated token back in as input to guide the prediction of the next. This lets each sentence be generated within its local context, preventing the output from drifting too far from the input.

In addition, the self-attention mechanism built into the transformer enables the model to focus on different parts of the input sequence when generating a response, so it can concentrate its attention on the most important parts of the sentence. "Self-attention works by computing a set of attention weights for each input token. These weights indicate the relevance of each token to every other token. The transformer then uses the attention weights to assign more importance to the most relevant parts of the input and less importance to the less relevant parts."
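To make that description concrete, here is a minimal sketch of single-head, causally masked scaled dot-product self-attention in plain PyTorch. The tensor names and sizes are illustrative assumptions, not taken from any particular GPT implementation, which would use multiple heads and learned projections inside a larger module.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over one token sequence.

    x: (seq_len, d_model) token embeddings; w_q/w_k/w_v: (d_model, d_head)
    projection matrices. Illustrative sketch only.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project tokens
    scores = q @ k.T / (k.shape[-1] ** 0.5)       # pairwise token relevance
    # Causal mask: a token may only attend to itself and earlier tokens.
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)           # attention weights per token
    return weights @ v                            # weighted sum of values

# Toy usage: 4 tokens with an 8-dimensional embedding, single head.
d = 8
x = torch.randn(4, d)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([4, 8])
```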

[Figure: the GPT model architecture, with a transformer block shown on the right]

The generic GPT loop works as follows: a token, together with a positional encoding that indicates its position in the sentence, is taken as input; it then passes through a dropout layer and through N transformer block layers (shown on the right in the figure above). A transformer block consists of self-attention, normalization, and feed-forward (i.e., MLP or dense) layers. Together, these structures determine and output the most likely next token.
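A GPT-style transformer block like the one just described can be sketched in a few lines of PyTorch. The layer sizes, pre-norm ordering, and 4x MLP expansion below are common conventions rather than the exact GPT-1 configuration.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One GPT-style block: self-attention, normalization, feed-forward (MLP).

    A simplified sketch; dimensions and pre-norm ordering are illustrative.
    """
    def __init__(self, d_model=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                 # position-wise feed-forward
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, causal_mask=None):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                          # residual connection
        x = x + self.mlp(self.ln2(x))             # residual connection
        return x
```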


This process loops until the GPT model predicts that the most likely next token is the end-of-sequence token. It can be extended further to produce complete paragraphs, and generation beyond a single sentence is especially common in newer GPT models. When trained on enough data, this long-range, context-driven generation capability makes GPT models unmatched on text synthesis tasks.
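The generation loop itself looks roughly like the sketch below, written with Hugging Face transformers. GPT-2 is used as an illustrative stand-in for any GPT-style model, and the greedy decoding and 40-token budget are arbitrary choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as an illustrative stand-in for any GPT-style model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The LLM revolution began when", return_tensors="pt").input_ids
for _ in range(40):                               # generate up to 40 new tokens
    logits = model(ids).logits                    # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()              # greedy: most likely next token
    if next_id.item() == tokenizer.eos_token_id:  # stop at end-of-sequence
        break
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))
```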

Terminology to know for modern LLMs

This section covers fine-tuning methods for LLMs that we think are worth knowing.

LoRA

The first technique we will discuss is Low-Rank Adaptation (LoRA). LoRA for large language models is an ingenious way to train/fine-tune LLMs that significantly reduces the GPU memory required for training. To achieve this, LoRA freezes the existing model weights and injects pairs of trainable rank-decomposition weight matrices alongside them. These new matrices then become the only variables updated during training, while all the remaining weights stay frozen.


Because the update matrices contain far fewer parameters than the original weights, this allows a significant reduction in training cost without significantly reducing training effectiveness. In addition, because these weights are added to the attention layers of the model, we can adjust the influence of the additional weights as needed.
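The idea fits in a short PyTorch sketch: the frozen weight W is augmented with a trainable low-rank product BA, scaled by alpha/r as in the LoRA paper. This is an illustrative sketch under those conventions, not the reference implementation (see the peft library for that).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Computes base(x) + (alpha/r) * x @ A^T @ B^T; only A and B are trained.
    Illustrative sketch of the LoRA idea.
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze original weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank update path.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable vs. ~590k frozen parameters
```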

RLHF

Reinforcement learning from human feedback (RLHF) for large language models refers to a method that combines reinforcement learning with human feedback to train LLMs. Reinforcement learning is a type of machine learning in which an algorithm learns to make decisions through trial and error. In the LLM domain, reinforcement learning can optimize a model's performance by providing feedback on the quality of the text it generates.


In a large language model like ChatGPT, the RLHF sequence of events can be succinctly broken down into the following steps:

  1. Pre-train a generative pre-trained transformer (GPT) model on a sufficiently large corpus
  2. Train a reward model that accepts a sequence of text and returns a scalar reward numerically representing human preference, i.e., how highly humans rate the response
  3. Fine-tune the model with reinforcement learning, using the reward model trained on human feedback.

In this way, an LLM can go beyond what pure self-supervised learning achieves, incorporating additional human knowledge late in the training process. In practice, this greatly improves how human-like and interactive the model's responses are.
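To make step 2 concrete, a reward model can be as simple as a language-model backbone with a scalar head, trained so that human-preferred responses score higher. The GPT-2 backbone and the pairwise ranking loss below are common community choices, not OpenAI's published recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """LM backbone plus a scalar head: text in, preference score out."""
    def __init__(self, backbone="gpt2"):
        super().__init__()
        self.lm = AutoModel.from_pretrained(backbone)
        self.head = nn.Linear(self.lm.config.hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.lm(input_ids).last_hidden_state   # (batch, seq, dim)
        return self.head(hidden[:, -1]).squeeze(-1)     # score from last token

tok = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()

# Pairwise loss: the human-preferred ("chosen") response should outscore the
# rejected one: loss = -log(sigmoid(r_chosen - r_rejected)).
chosen = tok("Q: hi A: Hello! How can I help?", return_tensors="pt").input_ids
rejected = tok("Q: hi A: whatever", return_tensors="pt").input_ids
loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()  # step 3 then optimizes the LLM against rm's scores (e.g. PPO)
```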

QLoRA

QLoRA is an efficient LLM fine-tuning method that reduces memory requirements enough to fine-tune a 65B-parameter model on a single 48GB GPU, while maintaining full 16-bit fine-tuning task performance. QLoRA adds a quantization step on top of the LoRA method, and although it was only recently published, its results make it worth including in this article. QLoRA is very similar to LoRA, with several major differences.

[Figure: full 16-bit fine-tuning compared with LoRA and QLoRA]

As you can see in the figure above, QLoRA differs from its predecessor, LoRA, in several respects: the QLoRA method quantizes the transformer model to 4-bit precision and uses paged optimizers, which offload optimizer state to CPU memory, to handle any memory spikes. In practice, this makes it possible to fine-tune an LLM such as LLaMA with significantly reduced memory requirements.
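This recipe is available through the Hugging Face stack (transformers, bitsandbytes, peft). A minimal loading sketch follows; the model name is illustrative, and the flags assume recent library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in 16-bit, store in 4-bit
)

# Model name is illustrative -- substitute any causal LM you have access to.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach trainable LoRA adapters on top of the frozen 4-bit base weights.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```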

Models to understand in the LLM revolution

The popularity of GPT models in the open source community over the past six months can largely be attributed to Meta's LLaMA models. While they are not available for commercial use, the weights are accessible to researchers who fill out a simple form. This availability has led to a significant increase in open source projects built on LLaMA. In this section, we'll take a brief look at some of the most important fine-tuned LLaMA models released over the past six months.

Alpaca


LLaMA-Alpaca was the first fine-tuning project to gain prominence. The project, run by researchers at Stanford, used 52K instruction-response pairs generated by OpenAI's text-davinci-003 to create a powerful instruction-following dataset.
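Each of those 52K examples is wrapped in a fixed prompt template before training. The sketch below reproduces the widely published Alpaca template from the tatsu-lab/stanford_alpaca repo; treat the exact wording as an assumption if you work from a different copy.

```python
# The Alpaca prompt template; examples with an empty "input" field use a
# shorter variant without the Input section.
TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def format_example(example: dict) -> str:
    """Turn one {'instruction', 'input', 'output'} record into training text."""
    prompt = TEMPLATE_WITH_INPUT.format(**example)
    return prompt + example["output"]

print(format_example({
    "instruction": "Translate to French.",
    "input": "Hello, world.",
    "output": "Bonjour, le monde.",
}))
```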

The research team behind the project quickly discovered that their model achieved near-SOTA results with a far smaller model than GPT-3.5/GPT-4. They had five students perform a double-blind comparison of their newly trained model against the original text-davinci-003 model, which found the outputs to be very similar, indicating that Alpaca achieved nearly identical capability with a small fraction of the parameters.

The release of Alpaca led to a series of alternatives trained on similar datasets, and to extensions adding further modalities such as vision.

LLaVA

LLaVA (Large Language and Vision Assistant) is the first and most prominent project to attempt to merge LLaMA fine-tuning with visual understanding. This allows the model to accept multimodal input and generate thoughtful responses that demonstrate understanding of both textual and visual inputs.

Their experiments showed that LLaVA has impressive multimodal chat capabilities, sometimes exhibiting behavior similar to multimodal GPT-4 on unseen images and instructions. On a synthetic multimodal instruction-following dataset, it achieved a relative score of 85.1% compared with GPT-4. In addition, when fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieved an accuracy of 92.53%.
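LLaVA checkpoints now have a Hugging Face transformers integration. A minimal inference sketch follows, assuming a recent transformers release and the community llava-hf/llava-1.5-7b-hf checkpoint (a later revision of the model described here); both are assumptions to verify against your environment.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumes a recent transformers release with LLaVA support; the checkpoint
# name is the community conversion of a later LLaVA revision.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat is unusual about this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```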


The authors have since extended this project with a similar instruction-tuning strategy to create LLaVA-Med. This adaptability and extensibility of the LLaVA model, able to cover new and complex topics both textually and visually, indicates that LLaVA is a model to watch as development continues.

MPT-7B

One of our favorite open source projects right now, the MosaicML Pretrained Transformer (MPT) family represents some of the biggest developments brought about by this LLM revolution. Unlike the other projects we are discussing today, it was developed without LLaMA and therefore does not inherit LLaMA's non-commercial license. This arguably makes it the best open source LLM available right now, comparable to fine-tuned LLaMA 7B models.

[Figure: MPT-7B benchmark performance compared with LLaMA-7B]

MPT-7B performs extremely well. As shown in the graph above, its performance across a variety of benchmarks is comparable to that of LLaMA-7B.

MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It comes in three fine-tuned variants (a loading sketch follows the list):

  • Chat: This is probably the type of model readers are most familiar with; it is designed to produce conversational, human-like chat responses.
  • Instruct: This is another common form of these models, as seen in Alpaca, Vicuna, etc.; instruction models can interpret complex instructions and return accurate responses.
  • StoryWriter: The story-writing model is trained on sequences of long-form literary works and can imitate an author's style for long-form story generation.
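All variants load through the standard transformers API; because MPT ships a custom architecture, trust_remote_code=True is required. A minimal sketch, assuming the mosaicml/mpt-7b-instruct checkpoint and a GPU with enough memory:

```python
import torch
import transformers

# MPT ships custom model code, hence trust_remote_code=True. Swap in
# "mosaicml/mpt-7b-chat" or "mosaicml/mpt-7b-storywriter" for other variants.
name = "mosaicml/mpt-7b-instruct"
model = transformers.AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
# MPT reuses the GPT-NeoX-20B tokenizer rather than shipping its own.
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("Here are three uses for a paperclip:", return_tensors="pt")
inputs = inputs.to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```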

Guanaco

Guanaco (Generative Universal Assistant for Natural-language Adaptive Context-aware Omnilingual outputs) was introduced alongside the QLoRA paper. Guanaco is an advanced instruction-following language model built on Meta's LLaMA 7B model.

Building on the Alpaca model's initial 52K dataset, Guanaco was trained on an additional 534,530 entries covering English, Simplified Chinese, Traditional Chinese, Japanese, German, and various linguistic and grammar tasks. This wealth of data allows Guanaco to excel in multilingual settings and extends the model's capabilities to a much wider range of languages.
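The Guanaco weights released with the QLoRA paper ship as LoRA adapters rather than a full model, so they are typically applied on top of a LLaMA base. A sketch follows; the repo names and the "### Human / ### Assistant" prompt convention are assumptions drawn from the QLoRA release, so verify them before use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Repo names are assumptions based on the QLoRA release -- check availability.
base_id, adapter_id = "huggyllama/llama-7b", "timdettmers/guanaco-7b"

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)   # attach Guanaco adapters
tokenizer = AutoTokenizer.from_pretrained(base_id)

prompt = "### Human: What is QLoRA?\n### Assistant:"
ids = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**ids, max_new_tokens=80)[0]))
```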

Conclusion

In this post, we covered a range of topics related to the LLM revolution to help make sense of these complex systems. Overall, we are in the midst of rapid growth in the NLP side of AI. This is the perfect time to get involved, build understanding, and harness the power of these technologies for yourself and your own business interests.