
New architecture overturns the Transformer! Infinite context processing, crushing Llama 2 with 2 trillion training tokens

Author: New Zhiyuan

Editors: Momoko, So Sleepy

Meta, USC, CMU, and UCSD have jointly proposed a revolutionary new architecture, Megalodon, which can handle infinitely long contexts and, in a 2-trillion-token training run, surpasses Llama 2-7B while achieving remarkable efficiency.

After Mamba, another architecture that dares to challenge Transformer was born!

Researchers from Meta, the University of Southern California (USC), CMU, and UCSD have proposed a new neural network architecture called Megalodon.


This is an architecture designed to effectively handle LLM pre-training and inference with "infinite context" lengths.


Paper address: https://arxiv.org/abs/2404.08801

As we all know, the Transformer architecture is limited by quadratic complexity and weak length extrapolation ability when dealing with long contexts.

Although subquadratic solutions (e.g., linear attention, state-space models) exist, they are often inferior to Transformer in terms of pre-training efficiency and even the accuracy of downstream tasks.

Megalodon was created to solve the problem of processing infinitely long contexts.


At the same time, it enables both efficient training (reducing communication and computation) and efficient inference (maintaining a constant KV cache).

It is worth mentioning that, in a direct comparison with Llama 2 at the scale of 7 billion parameters and 2 trillion training tokens, Megalodon not only trains more efficiently but also surpasses the Transformer in accuracy.

Specifically, Megalodon has a training loss of 1.70, which sits between Llama 2-7B (1.75) and 13B (1.67).


This paradigm-changing innovation represents a giant leap forward in the field of AI, and Megalodon ushers in a new era of computing efficiency and performance.

The biggest milestone since the release of GPT-3

Netizens commented that, with first Google and now Meta, infinite context is just one step away from us, and LLMs will unleash unlimited potential.


Others believe that "infinite context length is definitely a game-changer"!


What's more, a startup CEO said, "This is the biggest milestone since the release of GPT-3, yet there's hardly any buzz?!

Megalodon is the foundation of AGI."


"Meta's Megalodon is a breakthrough and has significant implications for AGI. Its infinite context length mimics human cognition, enabling seamless task switching."


According to paper co-author Hao Zhang, this is a completely new alternative to the Transformer.


Paper co-author Beidi Chen said, "Attention is good, but you don't need the full attention mechanism!"


Tri Dao, an assistant professor at Princeton, said, "Combining SSM/RNN/EMA with attention is the way to get higher quality, longer context, and faster inference! Griffin, Jamba, Zamba, and now Megalodon are good examples."


Revolutionary architecture, more stable training

So, how is the Megalodon architecture designed to achieve such excellent performance?

Reportedly, it improves on the MEGA architecture and adds a number of new technical components.

First, the Complex Exponential Moving Average (CEMA) component is a new technique that extends the multi-dimensional damped exponential moving average used in MEGA to the complex domain, enhancing the model's ability to handle complex data.
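To make the idea concrete, here is a minimal sketch of a complex-valued damped EMA recurrence in PyTorch. The recurrence form and the parameter names (alpha, delta, theta) are illustrative assumptions rather than the paper's exact definition, and Megalodon's actual CEMA is multi-dimensional and computed in parallel rather than with an explicit loop.

```python
import torch

def complex_ema(x, alpha, delta, theta):
    """Toy complex exponential moving average (CEMA) sketch.

    Assumed recurrence (illustrative, not the paper's exact form):
        h_t = alpha * x_t + (1 - alpha * delta) * exp(i * theta) * h_{t-1}
    with the real part of h_t taken as the output.
    """
    seq_len, dim = x.shape
    # Complex decay: a damping magnitude combined with a rotation in the complex plane.
    decay = (1.0 - alpha * delta) * torch.exp(torch.complex(torch.zeros_like(theta), theta))
    h = torch.zeros(dim, dtype=torch.cfloat)
    out = torch.empty(seq_len, dim)
    for t in range(seq_len):
        h = alpha * x[t] + decay * h   # complex recurrence over time
        out[t] = h.real                # project back to the reals
    return out

# Toy usage: 16 timesteps of an 8-dimensional signal.
x = torch.randn(16, 8)
alpha = torch.rand(8)          # input weight in (0, 1)
delta = torch.rand(8)          # damping factor in (0, 1)
theta = torch.rand(8) * 0.1    # rotation angle of the complex decay
print(complex_ema(x, alpha, delta, theta).shape)  # torch.Size([16, 8])
```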

Secondly, the researchers proposed an innovative normalization technique called the "time-step normalization layer".

It extends the traditional group normalization technique to the autoregressive sequence modeling task, allowing the model to perform effective normalization when processing sequence data.

In the past, the combination of Layer Normalization with the Transformer has delivered impressive performance.

However, layer normalization clearly cannot directly reduce the internal covariate shift along the timestep or sequence dimension.

In addition, although Group Normalization improves on Layer Normalization in CV tasks, it cannot be directly applied to the Transformer's autoregressive sequence modeling, because future information would leak through the mean and variance computed along the timestep dimension.

Panel (c) of the figure in the paper illustrates the layer normalization and timestep normalization approaches in the Megalodon architecture.
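To illustrate the causal-statistics idea, below is a minimal sketch assuming a single normalization group and cumulative mean/variance computed only over the current and earlier timesteps; the actual timestep normalization layer also includes learnable scale/offset parameters and a group structure, which are omitted here.

```python
import torch

def timestep_norm(x, eps=1e-5):
    """Causal "timestep normalization" sketch (illustrative only).

    Each position is normalized with the cumulative mean and variance of all
    features at the current and earlier timesteps, so no future information
    leaks into the statistics, unlike group norm applied across a full sequence.
    """
    b, t, d = x.shape                                           # (batch, time, features)
    counts = torch.arange(1, t + 1, device=x.device).view(1, t, 1) * d
    cum_sum = x.sum(dim=-1, keepdim=True).cumsum(dim=1)         # running sum over time and features
    cum_sq = (x * x).sum(dim=-1, keepdim=True).cumsum(dim=1)    # running sum of squares
    mean = cum_sum / counts
    var = cum_sq / counts - mean * mean                         # biased running variance
    return (x - mean) / torch.sqrt(var.clamp_min(0.0) + eps)

# Toy usage: batch of 2, sequence length 10, 16 features.
print(timestep_norm(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 10, 16])
```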


Finally, in order to enhance the stability of large-scale LLM pre-training, the researchers proposed a configuration that combines normalized attention with pre-normalization and two-hop residuals.

This configuration optimizes the learning process of the model and improves the stability of training.

In Figure 3 of the paper, panel (a) is a complete sketch of a Megalodon layer.

The middle and right panels show standard pre-normalization and pre-normalization with the two-hop residual configuration, respectively.
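For intuition, here is a minimal PyTorch sketch of one plausible reading of pre-normalization with a two-hop residual: the FFN sub-layer's residual connects back to the block input x, hopping over both sub-layers, instead of to the attention output. The attention and FFN modules below are ordinary stand-ins rather than Megalodon's CEMA-based gated attention, and the exact wiring is an assumption based on the figure description.

```python
import torch
import torch.nn as nn

class TwoHopPreNormBlock(nn.Module):
    """One reading of pre-norm with a two-hop residual (illustrative only)."""

    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        q = self.norm1(x)
        h, _ = self.attn(q, q, q, need_weights=False)
        y = x + h                           # first hop: standard pre-norm attention residual
        return x + self.ffn(self.norm2(y))  # second hop: residual reuses the block input x

# Toy usage.
block = TwoHopPreNormBlock(dim=64)
print(block(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])
```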


2T-token training, performance surpasses Llama 2-7B

In the experimental evaluation, the researchers scaled Megalodon up to 7 billion parameters and applied it to large-scale LLM pre-training on 2 trillion tokens.

In addition, the authors conducted experiments on medium/small parameter scale sequential modeling benchmarks, including Long Range Arena (LRA), raw speech classification on Speech Commands, image classification on ImageNet-1K, and language modeling on WikiText-103 and PG19.

The results show that, across these tasks and data modalities, Megalodon significantly outperforms all state-of-the-art baseline models.


Data learning efficiency

The training loss curve and the results on multiple benchmarks show that, at the 7B parameter scale, Megalodon has better data learning efficiency than the Transformer.

Computational efficiency

Under both 4K and 32K context lengths, pre-training the Megalodon architecture is also highly computationally efficient.


Short-context evaluation on academic benchmarks

Specifically, the researchers compared Megalodon with Llama 2 and other open-source base models on standard academic benchmarks with short contexts (4K tokens).

After training on the same 2 trillion tokens, Megalodon-7B performed significantly better than Llama 2-7B.


Long-context evaluation

Perplexity results under various long context lengths prove that Megalodon can make use of long contexts to predict the next token.

Figure 5 shows the perplexity (PPL) of the validation dataset at various context lengths from 4K to 2M.


On the long-context QA tasks in the Scrolls dataset, Megalodon achieves the best F1 on NarrativeQA and is competitive with Llama 2 Long.


Medium-scale benchmark evaluations

In tests on the Long Range Arena (LRA) benchmark, the new architecture significantly narrows the performance gap between chunked attention and full attention.


Results on other evaluation sets, such as raw speech classification on Speech Commands, ImageNet-1K, WikiText-103, and PG-19, are reported in the paper.


Some impressions

Here are some of the insights and experiences shared by the paper's author:

It took nearly two years from the initial idea to final completion. Along the way I experienced several failures, and I also learned a lot about how to do research properly in the era of large-scale pre-training.

Through this project, the researchers also came to understand the issues that deserve attention when designing new model architectures in the era of large models. In summary:

  • A comparison between two different model architectures is only convincing when the training data are identical. When the data differ, even by a small proportion (<10%), the final results can differ significantly. Both the training loss and the downstream task results are strongly affected by the training data.
  • Comparing different architectures is only meaningful when the models are sufficiently trained. For example, for a 7B model, 2T training tokens is almost a basic requirement. Some models may perform well when data is scarce but fall behind others as the data scale increases. Therefore, adequate training is the prerequisite for a convincing comparison of large model architectures.
  • For models with very different architectures, comparisons based on the traditional FLOPs-based scaling law are becoming less meaningful. Two models with different architectures can have the same FLOPs yet differ in actual speed by several times, which depends heavily on whether the architecture's algorithm is well suited to computation on the most advanced GPUs. Therefore, a truly practical comparison covers two aspects, data learning efficiency and computational efficiency, as in this paper. In practice, however, this demands a high level of engineering competence from researchers; in the era of large models, the development of new algorithms has become deeply intertwined with systems work.

Resources:

https://arxiv.org/abs/2404.08801

https://zhuanlan.zhihu.com/p/692682649
