Editor: Editorial Department

【New Zhiyuan Guide】After a long hibernation, Meta has just dropped a bombshell: the long-awaited Llama 3, trained on 24,000 GPUs, claimed the iron throne of open-source large models the moment it debuted. The 8B and 70B models achieve SOTA among open-source models of comparable size, with greatly improved reasoning and coding, and both the code and the weights are open source. A 400B version of Llama 3 is also on the way.

The "real Open AI" of the LLM world is back to shake up the AI community!

The industry exclaimed: the first open-source GPT-4-class model is finally here! The historic moment when open-source models catch up with closed-source ones may be just around the corner.

The news made waves: just a few hours after its release, Llama 3 shot to the top of the Hugging Face rankings in record time.

The world's first open-source GPT-4 is born! Llama 3 is released, and Meta AI is available without login

This time, Meta has open-sourced two models, Llama 3 8B and Llama 3 70B, each available in a pre-trained version and an instruction-fine-tuned version.

Zuckerberg and LeCun wasted no time in promoting the release:

Llama 3 was trained on 15 trillion tokens using custom-built 24,000-GPU clusters.

Even the smallest version, 8B, can sometimes beat Llama 2 70B, which is nearly an order of magnitude larger!

More versions of Llama 3 are expected in the coming months.

However, while the context length has been doubled, it is still only 8K.

By the way, Llama 3 is already available in the web version of Meta AI, with no login required.

In response, Hugging Face's co-founder and CEO said: "Llama 1 and Llama 2 have now spawned 30,000 new models. I can't wait to see what impact Llama 3 will have on the AI ecosystem."

The 400B performance beast sets a new open-source SOTA

However, the 8B and 70B versions of Llama 3 are just appetizers, and bigger ones are yet to come!

The true performance beast, Llama 3 400B, is still in training and will be unveiled soon.

Its pre-trained version already scores 96 on the ARC-Challenge reasoning benchmark.

The fine-tuned version of Llama 3 400B performs very well in math (GSM-8K), code (HumanEval), and massive multitask language understanding (MMLU).

What do these numbers mean in practice?

Jim Fan, a senior scientist at NVIDIA, put together a chart comparing it with Claude 3 Opus, GPT-4-2024-04-09, and Gemini on the same benchmarks:

The chart shows that Llama 3 400B is already on par with GPT-4 and Claude 3 in multilingual reasoning tasks and coding ability.

Even more striking, it beats Gemini Ultra 1.0 across the board.

There is also a more detailed comparison chart; judge for yourself.

The internet promptly went wild.

Netizen: The first "open-source GPT-4" is here

Karpathy summed it up neatly: the 400B model will be "the first open-source GPT-4-level model".

Jim Fan remarked:

The upcoming launch of Llama 3 400B will be a watershed moment: the community will gain open-weight access to a GPT-4-class model. It will change the calculus for many research efforts and grassroots startups.

Llama 3 400B is still in training and will hopefully get even better in the coming months. With such a strong backbone, much more research potential can be unlocked. Expect a surge in builder energy across the entire ecosystem!

OpenAI research scientist Will Depue expressed the same view: he is looking forward to an open-source GPT-4-level model in Llama 3 400B, and sees endless possibilities ahead.

Mixtral 8×22B, released only yesterday, had just set a new SOTA when, unexpectedly, it was crushed by Llama 3 70B.

The open-source SOTA crown belongs to Llama 3 400B.

It also happened to be Andrew Ng's birthday, and he received a rather special "gift".

Meta announced in its blog that in the coming months it will release models with several new capabilities, including multilingual conversation, longer context windows, and stronger overall capabilities.

Once Llama 3 has finished training, Meta will publish a technical report.

Meta returns to the "Iron Throne" of open-source models

In terms of performance, 8B and 70B are significantly better than Llama 2 and achieve SOTA.

The pre-trained and instruction-fine-tuned models reach this level of performance at the 8B and 70B scales thanks to improvements in both pre-training and post-training.

Meta's research team also improved the post-training procedure, which substantially reduces the false refusal rate, improves alignment of model output with human intent, and increases the diversity of model responses.

At the same time, the model's logical reasoning, code generation, and instruction following capabilities have also been greatly improved, making Llama 3 a more controllable model.

Among pre-trained open-source models of nearly the same size, Llama 3 8B beats Mistral and Gemma on most benchmarks, although it trails Gemma 7B on reasoning.

Compared with the closed-source Gemini Pro 1.0 and the open-source Mixtral 8×22B, Llama 3 70B comes out on top in several benchmarks.

Let's take a look at the comparison of the two parameter versions of Llama 3 with the pre-trained models of Llama 2-7B, 13B, and 70B.

Unsurprisingly, Llama 3 8B surpasses Llama 2 7B, and it even crushes Llama 2 13B.

Llama 3 70B improves substantially on Llama 2 70B, especially on reasoning (MMLU, ARC-Challenge) and the AGIEval benchmark.

Among instruction-fine-tuned models, Llama 3 8B also surpasses the open-source Gemma 7B and Mistral 7B Instruct.

The 70B version of Llama 3 is better than Gemini Pro 1.5 and Claude 3 Sonnet in terms of reasoning (MMLU), math (GSM-8K), and even code (HumanEval) benchmarks.

Next, let's compare performance against the instruction-fine-tuned Llama 2 models of different sizes.

Both Llama 3 8B and Llama 3 70B improve significantly on the Llama 2 models of comparable size.

During the development of Llama 3, Meta not only focused on benchmarking, but also worked to optimize the model's performance in real-world scenarios.

To this end, Meta developed a high-quality human evaluation set of 1,800 prompts covering 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, information extraction, role-playing, open question answering, logical reasoning, rewriting, and summarization.

To prevent Llama 3 from overfitting to this evaluation set, even Meta's own modeling team did not have access to it.

The human evaluation results show that Llama 3 70B far outperforms Llama 2, GPT-3.5, Mistral Medium, and Claude Sonnet.

As large models mature, where is there still room to innovate?

Throughout the project, Meta focused on four key elements: model architecture, training data, scaling up training, and instruction fine-tuning.

128K-token tokenizer + GQA

In terms of architecture, Meta still chose the Transformer architecture for Llama 3.

It is a relatively standard decoder-only Transformer, but with a few key improvements over Llama 2.

For example, Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language more efficiently, which significantly improves model performance.

In order to improve the inference speed of the Llama 3 model, Meta uses the Grouped Query Attention (GQA) mechanism at both 8B and 70B scales.
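To make the idea concrete, here is a minimal, unoptimized sketch of grouped query attention in plain Python. It is an illustration under stated assumptions, not Meta's implementation: the head counts and tensor sizes are toy values chosen for readability, and real models use batched tensor libraries. The point it shows is that several query heads share one key/value head, so fewer keys and values have to be stored and read at inference time.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v):
    """Causal grouped query attention on nested lists.

    q:    [n_q_heads][seq_len][head_dim]
    k, v: [n_kv_heads][seq_len][head_dim], with n_q_heads % n_kv_heads == 0
    Returns a list shaped like q.
    """
    n_q_heads, n_kv_heads = len(q), len(k)
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    head_dim = len(q[0][0])
    scale = 1.0 / math.sqrt(head_dim)

    output = []
    for head in range(n_q_heads):
        kv_head = head // group_size  # query heads in one group share this K/V head
        head_output = []
        for t, q_vec in enumerate(q[head]):
            # Causal attention: position t only sees positions 0..t.
            scores = [
                scale * sum(a * b for a, b in zip(q_vec, k[kv_head][s]))
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            head_output.append([
                sum(weights[s] * v[kv_head][s][j] for s in range(t + 1))
                for j in range(head_dim)
            ])
        output.append(head_output)
    return output


# Toy example (hypothetical sizes): 4 query heads sharing 2 key/value heads.
q = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] for _ in range(2)]
v = [[[float(h), 1.0], [2.0, float(h)], [3.0, 3.0]] for h in range(2)]
out = grouped_query_attention(q, k, v)
print(len(out), len(out[0]), len(out[0][0]))
```

With 4 query heads and 2 key/value heads, the key/value cache is half the size it would be under standard multi-head attention, which is where the inference speedup comes from.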

In addition, Meta trained the models on sequences of 8,192 tokens and used a mask to ensure that self-attention does not cross document boundaries.
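A document-boundary mask of this kind can be sketched as follows. This is a minimal illustration, assuming several short documents are packed into one training sequence and each token carries a document index; the function name and the `doc_ids` representation are hypothetical, not taken from Meta's code.

```python
def document_causal_mask(doc_ids):
    """Build an attention mask for a packed sequence.

    doc_ids[i] is the index of the document that token i belongs to.
    mask[i][j] is True when token i may attend to token j, which requires
    both causality (j <= i) and that the two tokens share a document.
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two documents packed together: tokens 0-2 and tokens 3-4.
mask = document_causal_mask([0, 0, 0, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal and lower-triangular: the first token of the second document cannot see anything from the first document.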

Trained on 15 trillion tokens, 7 times as many as Llama 2

At the same time, large, high-quality training datasets are crucial.

The team invested heavily in pre-training data.

Ultimately, Llama 3 was pre-trained on over 15 trillion tokens, all collected from publicly available sources.

Its training dataset is 7 times larger than the one used for Llama 2 and contains 4 times more code.

To prepare for multilingual use cases, more than 5% of Llama 3's pre-training dataset is high-quality non-English data covering more than 30 languages.

To ensure the training data is of sufficiently high quality, Meta developed a series of data filtering pipelines.

These pipelines include the use of heuristic filters, NSFW filters, semantic deduplication methods, and text classifiers to predict data quality.
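The stages listed above can be sketched as a simple filter chain. This is a rough illustration only: the thresholds, the blocklist, and the `quality_score` callable are all hypothetical, and exact-hash deduplication stands in for the semantic deduplication the article mentions, which would normally compare embeddings rather than normalized text.

```python
import hashlib
import re


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Cheap rule-based checks: enough words, not dominated by symbols."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def normalized_hash(text):
    """Hash of the lowercased, whitespace-collapsed text (exact dedup stand-in)."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_corpus(docs, quality_score, blocklist=frozenset(), min_quality=0.5):
    """Apply heuristic, blocklist, dedup, and classifier filters in order.

    quality_score is any callable mapping a document to a float in [0, 1];
    in a real pipeline it would be a trained text-quality classifier.
    """
    seen = set()
    kept = []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        if any(term in doc.lower() for term in blocklist):
            continue
        digest = normalized_hash(doc)
        if digest in seen:
            continue
        seen.add(digest)
        if quality_score(doc) < min_quality:
            continue
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",   # near-identical duplicate
    "too short",
    "$$$ ### !!! @@@ %%% ^^^ &&& ***",                 # mostly symbols
    "buy now buy now cheap pills for everyone today",  # low quality
]
toy_score = lambda doc: 0.0 if "buy now" in doc else 1.0  # stand-in classifier
kept = filter_corpus(docs, toy_score)
print(len(kept))
```

Ordering the cheap checks first means the comparatively expensive classifier only runs on documents that survive the earlier stages.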

One interesting detail in this process:

Previous generations of Llama were surprisingly good at identifying high-quality data, so Meta used Llama 2 to generate the training data used to train Llama 3's text quality classifier.

In addition, Meta ran extensive experiments to find the best way to mix data from different sources in the final pre-training dataset.

Eventually, Meta was able to choose a combination of data that would allow Llama 3 to perform well in a variety of use cases, including STEM, coding, and historical knowledge.

Scaling laws still reign supreme

In order to make effective use of the pre-training data, the team invested a lot of effort in scaling up the pre-training.

Meta developed a detailed set of scaling laws for downstream benchmark evaluations. These allow the team to select the best data mix while making the best use of training compute.

Scaling laws let the team predict the performance of the largest models on key tasks before actually training them, which is critical for ensuring that the final models perform well across a wide range of use cases and capabilities.
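The general idea of extrapolating from small runs can be sketched as below. The article does not give the functional form Meta used, so this sketch assumes a simple power law, loss = a * compute^b, fitted by least squares in log-log space; the data points are synthetic, generated from made-up constants purely to show the mechanics.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, compute):
    """Predicted loss at a given compute budget under the fitted power law."""
    return a * compute ** b


# Synthetic "small-scale runs" from a made-up law (not real measurements).
TRUE_A, TRUE_B = 10.0, -0.05
compute = [1e18, 1e19, 1e20, 1e21]
loss = [TRUE_A * c ** TRUE_B for c in compute]

a, b = fit_power_law(compute, loss)
print(f"fitted exponent: {b:.4f}")
print(f"extrapolated loss at 1e24 FLOPs: {predict(a, b, 1e24):.4f}")
```

Real measurements are noisy and the fitted form may not hold far outside the fitted range, which is why such extrapolations are treated as planning estimates rather than guarantees.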

In the process, the team observed several interesting new phenomena in scaling behavior.

For example, while the Chinchilla-optimal amount of training compute for an 8B-parameter model corresponds to about 200 billion tokens, Meta found that model performance kept improving even after the model was trained on roughly two orders of magnitude more data!

Both the 8B and 70B Llama 3 models continued to improve log-linearly after being trained on up to 15T tokens.
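A quick back-of-the-envelope check, using only the figures quoted above (8B parameters, about 200 billion Chinchilla-optimal tokens, 15T training tokens), shows how far past the Chinchilla point the 8B model was trained:

```python
import math

PARAMS = 8e9                # 8B-parameter model
CHINCHILLA_TOKENS = 200e9   # ~200 billion tokens, as quoted in the article
TRAINED_TOKENS = 15e12      # 15T tokens, as quoted in the article

ratio = TRAINED_TOKENS / CHINCHILLA_TOKENS
orders_of_magnitude = math.log10(ratio)
tokens_per_param_chinchilla = CHINCHILLA_TOKENS / PARAMS
tokens_per_param_trained = TRAINED_TOKENS / PARAMS

print(f"data multiple over Chinchilla-optimal: {ratio:.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
print(f"tokens per parameter: {tokens_per_param_chinchilla:.0f} vs {tokens_per_param_trained:.0f}")
```

The multiple works out to 75x, a little under two full orders of magnitude, or 1,875 tokens per parameter instead of 25.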

Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are far more efficient at inference time.

To train the largest Llama 3 model, the team combined three parallelization methods: data parallelism, model parallelism, and pipeline parallelism.

The team's most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when training on 16K GPUs simultaneously.

The team trained on two custom-built 24K GPU clusters. To maximize GPU uptime, Meta has also developed an advanced new training technology stack that automates error detection, handling, and maintenance.

At the same time, Meta has greatly improved hardware reliability and detection mechanisms for silent data corruption, developed new scalable storage systems, and reduced checkpointing and rollback overhead.

These improvements have resulted in an overall effective training time of more than 95%.

Compared to Llama 2, these improvements directly increase the training efficiency of Llama 3 by about three times!

Innovations in instruction fine-tuning

The team also innovated in instruction fine-tuning.

Meta's post-training approach is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).

Meta found that the quality of the prompts used in SFT and of the preference rankings used in PPO and DPO has an outsized influence on the performance of the aligned models.

Some of the biggest improvements in Llama 3's quality came from carefully curating this data and running multiple rounds of quality assurance on the annotations provided by human annotators.

Learning from preference rankings through PPO and DPO also greatly improves Llama 3's performance on reasoning and coding tasks.

When Llama 3 is asked a difficult reasoning question, it can sometimes produce the correct reasoning trace.

The difficulty is that the model knows how to produce the right answer but does not know how to select it. Training on preference rankings teaches the model to make that selection.
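How a preference ranking turns into a training signal can be illustrated with the standard DPO loss for a single (chosen, rejected) pair. This is a sketch of the published DPO objective, not Meta's training code: the log-probabilities below are made-up numbers, and `beta` is a hypothetical setting.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    loss = -log(sigmoid(beta * (chosen_margin - rejected_margin)))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed stably as softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Made-up log-probabilities for illustration only.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)        # policy equals reference
prefers_chosen = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy favors the chosen answer
prefers_rejected = dpo_loss(-6.0, -4.0, -5.0, -5.0)
print(neutral, prefers_chosen, prefers_rejected)
```

The loss falls as the policy assigns relatively more probability to the preferred answer than the reference model does, which is the mechanism by which the model learns to select the answer it was already able to generate.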

Safer

When it came to deployment, the team adopted a new system-level approach.

Meta envisions the Llama model as part of a broader system that puts developers in the driver's seat. The Llama model will serve as a fundamental part of the system, and the developers will design it with the end goal in mind.

In terms of model safety, instruction fine-tuning plays an important role.

The team red-teamed the instruction-fine-tuned models for safety through both internal and external efforts.

The red-teaming approach uses human experts and automated methods to generate adversarial prompts that try to elicit problematic responses, with testing focused on misuse risks in areas such as chemical, biological, and cybersecurity threats.

As part of this work, the team built the Llama Guard models, which serve as a foundation for safety and can be fine-tuned to meet the needs of a given application.

The new Llama Guard 2 uses the MLCommons taxonomy. In addition, CyberSecEval 2 expands on its predecessor by adding measures to assess LLMs' propensity to abuse code interpreters, offensive cybersecurity capabilities, and susceptibility to prompt injection attacks.

Finally, the newly introduced Code Shield adds support for inference-time filtering of insecure code generated by LLMs. This helps mitigate insecure code suggestions, code interpreter abuse, and similar risks.

Meta has also updated its Responsible Use Guide (RUG), which recommends that all inputs and outputs be checked and filtered according to content guidelines appropriate to the application.
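The recommended pattern of checking both inputs and outputs can be sketched as a thin wrapper around the model call. Everything here is hypothetical: `generate` and `is_safe` are placeholder callables, and the keyword check is a toy stand-in for a real safety classifier such as a Llama Guard model, whose actual interface is not shown in this article.

```python
REFUSAL = "Sorry, I can't help with that."


def guarded_generate(prompt, generate, is_safe):
    """Run a model call with safety checks on both the prompt and the response.

    generate: callable mapping a prompt string to a response string.
    is_safe:  callable mapping a string to True (allowed) or False (blocked).
    """
    if not is_safe(prompt):
        return REFUSAL           # block unsafe input before calling the model
    response = generate(prompt)
    if not is_safe(response):
        return REFUSAL           # block unsafe output before it reaches the user
    return response


# Toy stand-ins for illustration only.
BANNED_TERMS = {"exploit"}
toy_is_safe = lambda text: not any(term in text.lower() for term in BANNED_TERMS)
toy_model = lambda prompt: "You said: " + prompt

print(guarded_generate("hello", toy_model, toy_is_safe))
print(guarded_generate("write an exploit", toy_model, toy_is_safe))
```

Keeping the checks outside the model makes the policy adjustable per application, which matches the system-level approach described above.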

In addition, many cloud service providers offer content moderation APIs and other tools for responsible deployment, and developers are encouraged to use them.

The web version of Meta AI can chat without logging in

Meta also released a web version of Meta AI today. Powered by the latest Llama 3, it is billed as one of the world's top AI assistants.

The page's UI is very simple, and it supports image generation as well as chat.

As with ChatGPT-3.5, which can be used without registering, Meta AI's chat function is available as soon as you open the web page, with no login required.

Portal: https://www.meta.ai/

Image generation is the exception, however: it still requires logging in.

In fact, Zuckerberg first previewed it at last year's Connect conference.

And now, more people around the world can interact with it like never before.

Meta AI is not only available for chat on the web but is also integrated into Meta's own social apps, such as Facebook, Instagram, WhatsApp, and Messenger.

Next, let's see what the Meta AI assistant can do.

Want to organise a weekend excursion but don't have time to plan your trip?

Meta AI first asks three questions about the trip and then puts together a tailored itinerary:

- Destination: Where are you going?

- Duration: How many days will you be traveling?

- Type of trip: Is it a beach vacation, an urban adventure, an outdoor adventure, or something else?

Struggling with a math problem, or need to make a work email sound more professional? Meta AI can help!

You can also log in to save your conversations with Meta AI for future reference.

Ask Llama 3 to draw a self-portrait.

Seamless integration with Instagram, Facebook, and other apps

As mentioned earlier, Meta AI can also be used in searches on Facebook, Instagram, WhatsApp, and Messenger.

The advantage is that you can access real-time information from the web at any time without switching between applications.

As an example, let's say you're planning a ski trip in a Messenger group chat.

By searching directly in Messenger, you can have Meta AI find flights from New York to Colorado and work out the least crowded weekends to travel, all without leaving Messenger.

Or suppose you're scrolling through Facebook and see an interesting post with a map of the Northern Lights in Iceland.

You can ask Meta AI directly, "What is the best time of year to see the Northern Lights?"

In addition to the web version, Meta AI's image feature can also be experienced in WhatsApp.

When you start typing a prompt in the search box, you'll see an image that changes with each word you type.

It's clear how Meta AI can turn your imagination into reality.

According to reports, the images generated by Meta AI are clearer and of better quality, and the ability to incorporate text into images has also been improved.

Whether it's album art, wedding signage, birthday decorations, or outfit inspiration, Meta AI can generate images that bring your ideas to life faster and at higher quality than before.

It even provides helpful tips and suggestions for improving your images, allowing you to iterate on the starting point.

And that's not all.

When you find an image you like, you can ask Meta AI to animate it, improve it in a new style, or even turn it into a GIF to share with friends.

Clearly, with Llama 3 behind it, Meta AI performs better than ever.

Meta AI will soon be available on Quest headsets as well.
