
New architecture overturns the Transformer! Infinite context processing, crushing Llama 2 with 2 trillion training tokens

Author: New Zhiyuan

Editors: Momoko, So Sleepy

Meta, USC, CMU, and UCSD have jointly proposed a revolutionary new architecture, Megalodon, which can handle infinitely long contexts and, in a 2-trillion-token training run, surpasses Llama 2-7B while achieving remarkable efficiency.

After Mamba, another architecture that dares to challenge Transformer was born!

Researchers from Meta, the University of Southern California (USC), CMU, and UCSD have proposed a new neural network architecture called Megalodon.


This is an architecture designed to effectively handle LLM pre-training and inference with "infinite context" lengths.


Paper address: https://arxiv.org/abs/2404.08801

As we all know, the Transformer architecture is limited by quadratic complexity and weak length extrapolation ability when dealing with long contexts.

Although subquadratic solutions (e.g., linear attention, state-space models) exist, they are often inferior to Transformer in terms of pre-training efficiency and even the accuracy of downstream tasks.

Megalodon was created to solve the problem of processing infinitely long contexts.


At the same time, it enables both efficient training (reducing communication and computation) and efficient inference (maintaining a constant KV cache).

It is worth mentioning that, in a direct comparison with Llama 2 at the scale of 7 billion parameters and 2 trillion training tokens, Megalodon not only trains more efficiently but also surpasses the Transformer in accuracy.

Specifically, Megalodon has a training loss of 1.70, which sits between Llama 2-7B (1.75) and 13B (1.67).


This paradigm-changing innovation represents a giant leap forward in the field of AI, and Megalodon ushers in a new era of computing efficiency and performance.

The biggest milestone since the release of GPT-3

Netizens commented that, with first Google and now Meta, infinite context is just one step away from us, and LLMs will unleash unlimited potential.


Others believe that "infinite context length is definitely a game-changer"!


What's more, a startup CEO said, "This is the biggest milestone since the release of GPT-3, yet there's hardly any buzz?!

Megalodon is the foundation of AGI."


"Meta's Megalodon is a breakthrough and has significant implications for AGI. Its infinite context length mimics human cognition, enabling seamless task switching."


According to paper co-author Hao Zhang, this is a completely new alternative to the Transformer.


Paper co-author Beidi Chen said, "Attention is good, but you don't need the full attention mechanism!"


Tri Dao, an assistant professor at Princeton, said, "Combining SSM/RNN/EMA with attention is the way to get higher quality, longer context, and faster inference! Griffin, Jamba, Zamba, and now Megalodon are good examples."


Revolutionary architecture, more stable training

So, how is the Megalodon architecture designed to achieve such excellent performance?

Reportedly, it improves on the MEGA architecture and adds a number of new technical components.

First, the Complex Exponential Moving Average (CEMA) component is a new technique that extends the multi-dimensional damped exponential moving average used in MEGA to the complex domain, enhancing the model's ability to handle complex data.
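To make the idea concrete, here is a minimal sketch of a complex-valued damped EMA recurrence in PyTorch. The recurrence form and the parameter names (alpha, delta, theta) are illustrative assumptions rather than the paper's exact definition, and Megalodon's actual CEMA is multi-dimensional and computed in parallel rather than with an explicit loop.

```python
import torch

def complex_ema(x, alpha, delta, theta):
    """Toy complex exponential moving average (CEMA) sketch.

    Assumed recurrence (illustrative, not the paper's exact form):
        h_t = alpha * x_t + (1 - alpha * delta) * exp(i * theta) * h_{t-1}
    with the real part of h_t taken as the output.
    """
    seq_len, dim = x.shape
    # Complex decay: a damping magnitude combined with a rotation in the complex plane.
    decay = (1.0 - alpha * delta) * torch.exp(torch.complex(torch.zeros_like(theta), theta))
    h = torch.zeros(dim, dtype=torch.cfloat)
    out = torch.empty(seq_len, dim)
    for t in range(seq_len):
        h = alpha * x[t] + decay * h   # complex recurrence over time
        out[t] = h.real                # project back to the reals
    return out

# Toy usage: 16 timesteps of an 8-dimensional signal.
x = torch.randn(16, 8)
alpha = torch.rand(8)          # input weight in (0, 1)
delta = torch.rand(8)          # damping factor in (0, 1)
theta = torch.rand(8) * 0.1    # rotation angle of the complex decay
print(complex_ema(x, alpha, delta, theta).shape)  # torch.Size([16, 8])
```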

Secondly, the researchers proposed an innovative normalization technique called the "time-step normalization layer".

It extends the traditional group normalization technique to the autoregressive sequence modeling task, allowing the model to perform effective normalization when processing sequence data.

In the past, the combination of Layer Normalization with the Transformer has delivered impressive performance.

However, layer normalization clearly cannot directly reduce the internal covariate shift along the timestep or sequence dimension.

In addition, although Group Normalization improves on Layer Normalization in CV tasks, it cannot be directly applied to the Transformer's autoregressive sequence modeling, because future information would leak through the mean and variance computed along the timestep dimension.

Panel (c) of the figure in the paper illustrates the layer normalization and timestep normalization approaches in the Megalodon architecture.
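To illustrate the causal-statistics idea, below is a minimal sketch assuming a single normalization group and cumulative mean/variance computed only over the current and earlier timesteps; the actual timestep normalization layer also includes learnable scale/offset parameters and a group structure, which are omitted here.

```python
import torch

def timestep_norm(x, eps=1e-5):
    """Causal "timestep normalization" sketch (illustrative only).

    Each position is normalized with the cumulative mean and variance of all
    features at the current and earlier timesteps, so no future information
    leaks into the statistics, unlike group norm applied across a full sequence.
    """
    b, t, d = x.shape                                           # (batch, time, features)
    counts = torch.arange(1, t + 1, device=x.device).view(1, t, 1) * d
    cum_sum = x.sum(dim=-1, keepdim=True).cumsum(dim=1)         # running sum over time and features
    cum_sq = (x * x).sum(dim=-1, keepdim=True).cumsum(dim=1)    # running sum of squares
    mean = cum_sum / counts
    var = cum_sq / counts - mean * mean                         # biased running variance
    return (x - mean) / torch.sqrt(var.clamp_min(0.0) + eps)

# Toy usage: batch of 2, sequence length 10, 16 features.
print(timestep_norm(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 10, 16])
```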


Finally, in order to enhance the stability of large-scale LLM pre-training, the researchers proposed a configuration that combines normalized attention with pre-normalization and two-hop residuals.

This configuration optimizes the learning process of the model and improves the stability of training.

In Figure 3 of the paper, panel (a) is a complete sketch of a Megalodon layer.

The middle and right panels show standard pre-normalization and pre-normalization with the two-hop residual configuration, respectively.
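For intuition, here is a minimal PyTorch sketch of one plausible reading of pre-normalization with a two-hop residual: the FFN sub-layer's residual connects back to the block input x, hopping over both sub-layers, instead of to the attention output. The attention and FFN modules below are ordinary stand-ins rather than Megalodon's CEMA-based gated attention, and the exact wiring is an assumption based on the figure description.

```python
import torch
import torch.nn as nn

class TwoHopPreNormBlock(nn.Module):
    """One reading of pre-norm with a two-hop residual (illustrative only)."""

    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        q = self.norm1(x)
        h, _ = self.attn(q, q, q, need_weights=False)
        y = x + h                           # first hop: standard pre-norm attention residual
        return x + self.ffn(self.norm2(y))  # second hop: residual reuses the block input x

# Toy usage.
block = TwoHopPreNormBlock(dim=64)
print(block(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])
```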


2T-token training, performance surpasses Llama 2-7B

In the experimental evaluation, the researchers scaled Megalodon up to 7 billion parameters and applied it to large-scale LLM pre-training on 2 trillion tokens.

In addition, the authors conducted experiments on medium/small parameter scale sequential modeling benchmarks, including Long Range Arena (LRA), raw speech classification on Speech Commands, image classification on ImageNet-1K, and language modeling on WikiText-103 and PG19.

The results show that, across these tasks and data modalities, Megalodon significantly outperforms all state-of-the-art baseline models.


Data learning efficiency

The training loss curve and the results on multiple benchmarks show that, at the 7B parameter scale, Megalodon has better data learning efficiency than the Transformer.

Computational efficiency

Under both 4K and 32K context lengths, pre-training the Megalodon architecture is also highly computationally efficient.


Short-context evaluation on academic benchmarks

Specifically, the researchers compared Megalodon with Llama 2 and other open-source base models on standard academic benchmarks with short contexts (4K tokens).

After training on the same 2 trillion tokens, Megalodon-7B performed significantly better than Llama 2-7B.


Long-context evaluation

Perplexity results under various long context lengths prove that Megalodon can make use of long contexts to predict the next token.

Figure 5 shows the perplexity (PPL) of the validation dataset at various context lengths from 4K to 2M.


On the long-context QA tasks in the Scrolls dataset, Megalodon achieves the best F1 on NarrativeQA and is competitive with Llama 2 Long.


Medium-scale benchmark evaluations

In tests on the Long Range Arena (LRA) benchmark, the new architecture significantly narrows the performance gap between chunked attention and full attention.


Results on other evaluation sets, such as raw speech classification on Speech Commands, ImageNet-1K, WikiText-103, and PG-19, are reported in the paper.


Some impressions

Here are some of the insights and experiences shared by the paper's author:

It took nearly two years from the initial idea to final completion. Along the way I experienced several failures, and I also learned a lot about how to do research properly in the era of large-scale pre-training.

Through this project, the researchers also came to understand the issues that deserve attention when designing new model architectures in the era of large models. In summary:

  • A comparison between two different model architectures is only convincing when the training data are identical. When the data differ, even by a small proportion (<10%), the final results can differ significantly. Both the training loss and the downstream task results are strongly affected by the training data.
  • Comparing different architectures is only meaningful when the models are sufficiently trained. For example, for a 7B model, 2T training tokens is almost a basic requirement. Some models may perform well when data is scarce but fall behind others as the data scale increases. Therefore, adequate training is the prerequisite for a convincing comparison of large model architectures.
  • For models with very different architectures, comparisons based on the traditional FLOPs-based scaling law are becoming less meaningful. Two models with different architectures can have the same FLOPs yet differ in actual speed by several times, which depends heavily on whether the architecture's algorithm is well suited to computation on the most advanced GPUs. Therefore, a truly practical comparison covers two aspects, data learning efficiency and computational efficiency, as in this paper. In practice, however, this demands a high level of engineering competence from researchers; in the era of large models, the development of new algorithms has become deeply intertwined with systems work.

Resources:

https://arxiv.org/abs/2404.08801

https://zhuanlan.zhihu.com/p/692682649
