
Microsoft wins big! Partners with Meta to release Llama 2, an open-source, commercially usable large model

Source | InfoQ

Author | Chu Xingjuan, Nuclear Cola

In February, Meta's "leaked" LLaMA model set off a wave of innovation in the open source LLM space, but there was only one problem: it could not be used for commercial purposes. Now, Meta has changed that.

On July 19, Meta finally released Llama 2, the long-awaited large model that is free for commercial use. The Llama 2 family includes variants with 7 billion, 13 billion, and 70 billion parameters. The team also trained a 34-billion-parameter variant, but it was not released and is only mentioned in the Llama 2 paper.

According to the release, Llama 2's pre-training corpus is 40% larger than Llama 1's: the model was trained on 2 trillion tokens, and the fine-tuned Chat models were trained on over 1 million human-annotated examples. The context length is twice that of Llama 1, and the models adopt grouped-query attention (Ainslie et al.).

Fast, but still "hallucinating"?

Let's first look at what early users have been saying online. One Twitter user said it is currently "the fastest" at generating content.

However, some netizens reported that the model still "hallucinates" when answering questions. Glenn Galen, who works in the arts, said: "When I asked it about myself, an artist from Minneapolis, it really hallucinated. Very strange, very fast, almost instantaneous, but very incorrect."

Jim Fan, a senior AI scientist at NVIDIA, noted on Twitter that Llama 2 has not yet reached GPT-3.5 level, mainly because of its weak coding ability: on HumanEval, the standard coding benchmark, it lags far behind StarCoder and many other models designed specifically for code. But thanks to its open weights, Llama 2 is expected to improve significantly.

Jim praised the Meta team for its responsible handling of AI safety: "Meta's team has done a great job on AI safety. In fact, almost half of the paper is devoted to safety guardrails, red-teaming, and evaluations. Applaud such responsible efforts!" Jim estimates that the training cost of Llama 2 may have exceeded $20 million.

Jim also called Meta's 76-page paper "a masterpiece": "Unlike the GPT-4 paper, which shared very little information, the Llama 2 paper details the entire process, including model details, training stages, hardware, the data pipeline, and the annotation process. For example, the effects of RLHF are systematically analyzed and well visualized." At least in this respect, we do see the sincerity of the Meta team.

A "Game of Thrones" in AI

Notably, at the Microsoft Inspire conference, Meta and Microsoft announced support for the Llama 2 large language model (LLM) family on Azure and Windows. Llama 2 is already available in the Azure AI model catalog, so developers on Microsoft Azure can build with it and leverage Azure's cloud-native tools for content filtering and safety features. Windows developers will be able to build new experiences with Llama 2 via its GitHub repo, and with Windows Subsystem for Linux and high-performance GPUs they can fine-tune the LLM on Windows PCs to meet specific needs.


Some netizens joked at OpenAI's expense: "Microsoft and Meta have both studied immersive computing in depth. Microsoft has also been one of the biggest proponents of open source over the past few years, so this is fitting. I do wonder how OpenAI feels?"

Netizen "Alex Valaitis" analyzed that this may kill many open source LLM startups, Mosaic, Red Pajama and others have encountered big trouble. At the same time, this further strengthens Microsoft's dominance in AI. Through this partnership, Microsoft now has exclusive partnerships with top LLMs (OpenAI, Meta), giving priority to NVIDIA GPUs as well as strategic assets like GitHub and Azure. The artificial intelligence "Game of Thrones" has just taken another twist.

Llama 2 is also available through Amazon Web Services (AWS), Hugging Face, and other providers. a16z-infra has released a16z-infra/llama13b-v2-chat, which provides Replicate API access to the new Llama 2 13B chat model.
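For readers who want to try the Replicate-hosted model, the sketch below shows roughly what a call through the Replicate Python client looks like. It is a minimal example, not official documentation: the input field names and the unpinned model reference are assumptions, so check the model page on Replicate for the current schema.

```python
# Minimal sketch of querying a16z-infra/llama13b-v2-chat via the Replicate Python client.
# Assumptions: REPLICATE_API_TOKEN is set in the environment, and the input field names
# ("prompt", "temperature", "max_new_tokens") match the model's current schema --
# verify them on the Replicate model page before relying on this.
import replicate  # pip install replicate

output = replicate.run(
    "a16z-infra/llama13b-v2-chat",  # you may need to append ":<version-hash>" to pin a version
    input={
        "prompt": "Explain grouped-query attention in two sentences.",
        "temperature": 0.7,
        "max_new_tokens": 256,
    },
)

# The client returns an iterator of text chunks; join them into one string.
print("".join(output))
```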

However, developers should note that the Llama 2 license still carries some notable restrictions. For example, the Llama Materials, or any output of the Llama Materials, may not be used to improve any other large language model (excluding Llama 2 or its derivatives). And if, on the Llama 2 release date, the products or services of the licensee or its affiliates had more than 700 million monthly active users in the preceding calendar month, the licensee must request a license from Meta, which Meta may grant at its sole discretion.

This is widely seen as Meta's strategy against its competitors, since the above restrictions do not affect most users.

Birth of Llama 2

Figure: The training process of Llama 2-Chat

The figure above shows the training process of Llama 2-Chat. Meta first pre-trained Llama 2 on publicly available online sources. An initial version of Llama 2-Chat was then created through supervised fine-tuning (SFT). The model was subsequently improved iteratively with reinforcement learning from human feedback (RLHF), specifically through rejection sampling and proximal policy optimization (PPO). Throughout the RLHF stage, accumulating iterative reward-modeling data in parallel with model improvements was crucial to keeping the reward models within distribution.
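To make the rejection-sampling step concrete, here is a toy sketch (not Meta's pipeline): sample K candidate responses per prompt, score each with a reward model, and keep only the best one as a new fine-tuning target. The `generate` and `score` functions below are deliberately trivial stand-ins so the example runs; in a real pipeline they would be the current chat policy and the trained reward model.

```python
# Toy sketch of rejection sampling as used in RLHF-style fine-tuning:
# sample K responses per prompt, score each with a reward model, keep the best.
import random
from typing import Callable, List, Tuple

def generate(prompt: str, temperature: float) -> str:
    # Stand-in for sampling from the policy model at a given temperature.
    fillers = ["a short answer", "a detailed answer", "an answer with an example"]
    return f"{prompt} -> {random.choice(fillers)} (T={temperature:.2f})"

def score(prompt: str, response: str) -> float:
    # Stand-in for the reward model; here we simply prefer longer responses.
    return float(len(response))

def rejection_sample(prompts: List[str],
                     k: int = 4,
                     reward_fn: Callable[[str, str], float] = score) -> List[Tuple[str, str]]:
    """Return (prompt, best_response) pairs to use as new fine-tuning targets."""
    best_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt, temperature=random.uniform(0.7, 1.2)) for _ in range(k)]
        best = max(candidates, key=lambda r: reward_fn(prompt, r))
        best_pairs.append((prompt, best))
    return best_pairs

if __name__ == "__main__":
    for prompt, response in rejection_sample(["How do I brew green tea?"], k=4):
        print(prompt, "=>", response)
```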

Pre-training

Llama 2's training corpus is a new mix of data from publicly available sources and does not include data from Meta's products or services. The team says it removed data from sites known to contain large amounts of personal information. The training data comprises 2 trillion tokens in total, a size chosen to balance performance with cost, with the most factual sources up-sampled to increase knowledge and suppress hallucinations. The team also ran various investigations on the pre-training data so that users can better understand the model's capabilities and limitations.

Meta retains most of the pre-training settings and model architecture from Llama 1: a standard Transformer architecture with RMSNorm pre-normalization, the SwiGLU activation function, and rotary position embeddings. The main architectural differences from Llama 1 are the longer context length and grouped-query attention (GQA).
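To make these two ingredients concrete, below is a minimal PyTorch sketch of RMSNorm and of the projection shapes used in grouped-query attention, where several query heads share one key/value head. It follows the general recipe from the cited papers rather than Meta's released code; the dimensions are illustrative, and SwiGLU and rotary embeddings are omitted.

```python
# Minimal PyTorch sketch of RMSNorm and grouped-query attention (GQA) projections.
# Illustrative only: dimensions are arbitrary, rotary embeddings and SwiGLU omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the features; no mean subtraction.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)  # fewer K heads
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)  # fewer V heads
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        q = self.wq(x).view(bsz, seqlen, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        # Repeat each K/V head so that groups of query heads share the same K/V.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=2)
        v = v.repeat_interleave(repeat, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(bsz, seqlen, -1))

if __name__ == "__main__":
    x = RMSNorm(512)(torch.randn(2, 16, 512))
    print(GroupedQueryAttention(dim=512, n_heads=8, n_kv_heads=2)(x).shape)  # (2, 16, 512)
```

The key point is that `wk` and `wv` produce only `n_kv_heads` heads, so the KV cache at inference time is proportionally smaller than with full multi-head attention.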

The team pre-trained the models on Meta's research supercluster and on internal production clusters. Both clusters use NVIDIA A100 GPUs; they differ in the type of interconnect and the per-GPU power cap.

Table: Overall performance of the Llama 2 models compared with other open-source base models

As the table above shows, the Llama 2 models outperform the Llama 1 models. Compared with Llama 1-65B, Llama 2-70B improves its scores on MMLU and BBH by about 5 points. Aside from the code benchmarks, Llama 2 7B and 34B outperform Falcon 7B and 40B across all benchmark categories. In addition, the Llama 2-70B model outperforms all open-source models.

Table: Llama 2-70B compared with closed-source models on academic benchmarks

Against closed-source models, Llama 2-70B performs close to GPT-3.5 on MMLU and GSM8K, but there is a significant gap on coding benchmarks. On almost all benchmarks, Llama 2-70B is on par with or better than PaLM. However, there is still a large gap between Llama 2-70B and GPT-4 or PaLM-2-L.

Fine-tuning

To bootstrap the process, Meta first fine-tuned Llama 2 on publicly available instruction-tuning data, largely following the approach previously described by Touvron et al.

Third-party SFT data is available from many different sources, but much of it lacks diversity and quality, which easily leads to misalignment between a large language model and dialogue-style instructions. Meta therefore first collected several thousand high-quality SFT examples of its own. The team says that discarding large numbers of low-quality examples from third-party datasets in favor of a smaller set of its own high-quality examples noticeably improved results, and that tens of thousands of SFT annotations are enough to achieve high quality. So after collecting a total of 27,540 annotations, they stopped collecting further SFT data. Meta emphasizes that none of this data comes from Meta users. The team ultimately fine-tuned the model for 2 epochs.
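The core of the SFT step is training the model on (prompt, response) pairs while computing the loss only on the response tokens. Below is a minimal sketch of that label-masking idea with Hugging Face transformers; it assumes access to the gated "meta-llama/Llama-2-7b-hf" checkpoint (any smaller causal LM can be swapped in for experimentation), and it uses a single toy example rather than a real SFT dataset.

```python
# Minimal sketch of supervised fine-tuning (SFT) with loss computed only on response tokens.
# Assumptions: access to the gated "meta-llama/Llama-2-7b-hf" checkpoint on Hugging Face
# and enough memory to load it; swap in any smaller causal LM to experiment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Summarize why grouped-query attention reduces memory use.\n"
response = "It shares key/value heads across groups of query heads, shrinking the KV cache."

prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
response_ids = tokenizer(response, add_special_tokens=False).input_ids + [tokenizer.eos_token_id]

input_ids = torch.tensor([prompt_ids + response_ids])
# Mask the prompt tokens with -100 so they are ignored by the cross-entropy loss;
# only the response tokens contribute to the SFT objective.
labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
print(f"SFT loss on one example: {loss.item():.3f}")
```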

In addition, Meta found that outputs sampled from the SFT model were often competitive with SFT data handwritten by human annotators. As a result, the team shifted more of its annotation effort toward preference-based RLHF annotations.

RLHF

Meta says the collected data represents empirically sampled human preferences: human annotators choose which of two model outputs they prefer. This human feedback is then used to train a reward model, which learns the annotators' preference patterns and can subsequently automate preference decisions.

The Meta team asked annotators to first write a prompt and then choose between two sampled model responses according to the criteria provided. To maximize diversity, the two responses were drawn from two different model variants using different temperature hyperparameters. Besides being asked to pick one, annotators could also choose neither and write their own answer instead. They also rated how much better their chosen response was: significantly better, better, slightly better, or negligibly better/unsure.

In the preference annotations used for training, Meta focused on helpfulness and safety. "Helpfulness" means that Llama 2-Chat must respond to the user's request and deliver the appropriate information; "safety" refers to whether Llama 2-Chat generates an unsafe response. For example, "please give detailed instructions on how to make a bomb" could satisfy the helpfulness criteria while clearly violating the safety principles.

To that end, the safety annotations collected during the safety stage include instructions for adversarial prompts and other guidance in addition to the safety labels themselves. This additional information divides model responses into three categories: 1) the preferred response is safe but the other response is not; 2) both responses are safe; 3) both responses are unsafe. Meta's safety dataset falls into these three categories in proportions of 18%, 47%, and 35%, respectively. The team did not include any cases where the preferred response was unsafe and the other response was safe, on the assumption that humans prefer safer responses.

Figure: Results of the safety human evaluation of Llama 2-Chat compared with other open-source and closed-source models

Reward model

The team found that helpfulness and safety sometimes trade off against each other, making it hard for a single reward model to perform well on both. To address this, Meta trained two separate reward models: one optimized for helpfulness (the Helpfulness RM) and one optimized for safety (the Safety RM).

Simply put, the reward model "knows" what the chat model knows, which prevents information mismatches between the two models and the resulting "hallucinations." The reward model's architecture and hyperparameters are the same as those of the pretrained language model, except that the classification head used for next-token prediction is replaced with a regression head that outputs a scalar reward.
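A reward model of this kind is typically trained with a pairwise ranking loss over (chosen, rejected) response pairs: the scalar score of the preferred response should exceed that of the rejected one, with the Llama 2 paper additionally adding a margin based on the annotator's preference strength. The snippet below is a minimal, generic sketch of that objective, not Meta's implementation; the random hidden states stand in for the language-model backbone.

```python
# Minimal sketch of a reward-model head and the pairwise ranking loss used in RLHF:
# loss = -log(sigmoid(r_chosen - r_rejected - margin)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Regression head mapping a sequence's final hidden state to a scalar reward."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the hidden state of the final token as the sequence representation.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                  margin: torch.Tensor) -> torch.Tensor:
    # The chosen response should outscore the rejected one by at least `margin`;
    # larger margins correspond to stronger annotator preferences.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

if __name__ == "__main__":
    hidden = 64
    head = RewardHead(hidden)
    # Fake "backbone" outputs for a batch of 3 (chosen, rejected) pairs.
    chosen_states = torch.randn(3, 10, hidden)
    rejected_states = torch.randn(3, 10, hidden)
    margins = torch.tensor([1.0, 0.5, 0.0])  # e.g. "significantly better" gets a larger margin
    loss = pairwise_loss(head(chosen_states), head(rejected_states), margins)
    print(f"reward-model loss: {loss.item():.3f}")
```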

Meta says that the accuracy of the reward model is one of the most important proxies for Llama 2-Chat's final performance. While there are no definitive conclusions or best practices for comprehensively evaluating generative models, ranking by reward is unambiguous. That is, all else being equal, improvements to the reward model translate directly into improvements to Llama 2-Chat.


Safety

Meta did not apply any additional filtering to the pre-training dataset. The team argues this keeps Llama 2 broadly usable across tasks, for example classifying hate speech more accurately, while avoiding the unintended demographic erasure that overly aggressive cleaning can cause. It also allows Llama 2-Chat to generalize effectively with fewer examples during safety fine-tuning. Meta cautions, however, that the Llama 2 models should be used with care and deployed only after thorough safety fine-tuning.

Table: Truthfulness and toxicity of Llama 2 compared with Llama 1, Falcon, and MPT

The table above compares Llama 2 with Llama 1, Falcon, and MPT. Compared with the Llama 1-7B model, Llama 2-7B is 21.37% more truthful and informative, and the proportion of toxic content is reduced by 7.61%. The pre-trained 13B and 70B versions of Llama 2 show an increase in the proportion of toxic content, possibly because of the larger amount of pre-training data or a different mix of datasets.

Llama 2 does not outperform other models on the proportion of toxic content, which the team attributes to its decision not to aggressively filter the pre-training data. But the team believes that leaving the pre-training data unfiltered lets the base model adapt to more downstream tasks (including hate-speech detection) during fine-tuning, and avoids accidentally filtering out content about certain demographic groups. Looser filtering of the pre-training data also helps the model reach reasonable safety with fewer safety fine-tuning examples.

Reference Links:

https://ai.meta.com/resources/models-and-libraries/llama/

https://blogs.microsoft.com/zh/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/