Trained from scratch on 150B tokens: Princeton and Meta release Lory, a fully differentiable MoE architecture

Author: New Zhiyuan

Editors: Qiao Yang, Hao Kun

A few days ago, Princeton University and Meta posted their latest research on arXiv: Lory, a fully differentiable MoE model and a new approach to pre-training autoregressive language models.

Unlike most model names, which are acronyms, the name is explained in a footnote: a lory is a parrot with rainbow-colored feathers, a fitting mascot for the spirit of "soft MoE".

The paper's author list can fairly be described as a star-studded cast.

Paper address: https://arxiv.org/abs/2405.03133

One of the paper's senior authors, Danqi Chen, is an assistant professor in the Department of Computer Science at Princeton University and a co-leader of the Princeton NLP group. She received her undergraduate degree from Tsinghua University and her Ph.D. from Stanford University in 2018, under the supervision of Christopher Manning.

Dan Jurafsky, a Stanford professor and leading figure in NLP, once said of her: "She has great taste for discovering important research questions. She has already had an extraordinary impact on the field, and that impact will only continue to grow."

Mike Lewis is a research scientist at Meta AI who leads the pre-training of Meta's just-released large language model, Llama 3.

He has previously published a number of influential works, including BART, RoBERTa, and top-k sampling.

The paper's first author is Zexuan Zhong, a fifth-year Ph.D. student at Princeton University advised by Danqi Chen.

Zhong holds a master's degree from the University of Illinois Urbana-Champaign and a bachelor's degree in computer science from Peking University, and he has interned at Meta AI and Microsoft Research Asia, where the present research was completed.

After the release, the authors also posted a detailed explanation of the work on Twitter.

The key techniques involve two aspects. First, token-level routing is replaced with a causal segment routing strategy, which enables efficient expert merging while preserving the autoregressive nature of the language model.

Second, a similarity-based data batching method is proposed: simply concatenating randomly selected documents yields poorly specialized experts, whereas grouping similar texts together makes the experts more specialized.

Based on these methods, the authors trained a series of Lory models from scratch on 150B tokens, at two active-parameter scales (0.3B and 1.5B) and with up to 32 experts.

Compared with an equivalent dense model, Lory trains more efficiently, reaching the same loss with 2.5x fewer steps.

The research team evaluated Lory with in-context learning and found that the models achieve good results on downstream tasks such as commonsense reasoning, reading comprehension, closed-book question answering, and text classification.

It can also be observed that using more experts improves model performance.

Compared with Expert Choice (EC), a state-of-the-art routing method in the MoE field, Lory also shows competitive performance.

In December 2023, the French startup Mistral AI released Mixtral 8x7B, a model comparable to, and in some cases better than, GPT-3.5 and Llama 2 70B.

Mixtral uses a sparse MoE architecture that is both strong and efficient, with inference roughly 6x faster than Llama 2 70B, which drew wide attention to MoE from the open-source community.

There has even been speculation that GPT-4 uses MoE to reach a scale of more than a trillion parameters.

For Transformer-based language models, an MoE has two main elements:

The first is replacing the dense feed-forward network (FFN) layers with sparser MoE layers, in which each expert is an independent neural network, or even an MoE itself, forming a hierarchical MoE structure.

The second is a gating network, or routing mechanism, that decides which experts each token is sent to; this token routing mechanism is the key factor determining an MoE model's performance.
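
To make these two elements concrete, here is a minimal sketch (illustrative PyTorch, not any particular model's implementation; all names and sizes are invented) of a conventional token-level sparse MoE layer with top-k gating: several expert FFNs replace a single dense FFN, and a small router decides which experts each token is sent to.

```python
# Sketch of a standard token-level sparse MoE layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenLevelMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); every token is routed independently.
        scores = self.router(x)                            # (B, T, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens sent to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


# Toy usage: 4 experts, top-2 routing over a small batch of token embeddings.
moe = TokenLevelMoE(d_model=16, d_ff=32, n_experts=4)
print(moe(torch.randn(2, 8, 16)).shape)  # torch.Size([2, 8, 16])
```

The discrete top-k selection in the middle is exactly where the non-differentiability discussed next comes from: gradients do not flow through the expert choice itself, only through the selected weights.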

Causal segment routing

While the MoE mechanism helps scale models efficiently, training the routing network introduces discrete, non-differentiable learning objectives. The SMEAR model, released in 2023, began exploring solutions to this problem, using expert merging to build a fully differentiable MoE.

Paper address: https://arxiv.org/abs/2306.03745

However, SMEAR performs a soft merge of all experts by taking a weighted average of their parameters, which works well for text classification tasks but is difficult to apply to autoregressive language models.
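
As a rough illustration of what soft expert merging looks like (a sketch of the idea under my own simplifications, not SMEAR's released code), the routing weights can be used to average the experts' parameters, after which a single merged FFN is applied; because no discrete selection happens, gradients flow through the routing weights.

```python
# Sketch of soft expert merging with differentiable routing weights (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0


def make_ffn(d_model: int = 16, d_ff: int = 32) -> nn.Module:
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))


def soft_merged_forward(experts: nn.ModuleList, weights: torch.Tensor,
                        x: torch.Tensor) -> torch.Tensor:
    """Apply one FFN whose parameters are the weights-weighted average of the experts.

    weights: (n_experts,) routing probabilities; gradients flow through them.
    """
    merged_params = {
        name: sum(w * dict(e.named_parameters())[name] for w, e in zip(weights, experts))
        for name, _ in experts[0].named_parameters()
    }
    # Run experts[0]'s module structure with the merged parameter tensors.
    return functional_call(experts[0], merged_params, (x,))


experts = nn.ModuleList(make_ffn() for _ in range(4))
router = nn.Linear(16, 4)
x = torch.randn(2, 8, 16)
w = F.softmax(router(x.mean(dim=(0, 1))), dim=-1)  # one routing decision here, for brevity
print(soft_merged_forward(experts, w, x).shape)    # torch.Size([2, 8, 16])
```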

Because per-token soft merging is expensive and ill-suited to autoregressive models, the Lory authors propose segment-level routing: expert merging is performed once per segment rather than once per token, which greatly reduces the number of merge operations.

Routing based only on the current segment, however, would risk losing information that crosses segment boundaries, so the paper proposes causal segment routing, which mirrors the autoregressive setup.

When merging experts for the current segment, the routing weight of each expert is determined from the information in the preceding segment.
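
A minimal sketch of how causal segment routing might be implemented (the shapes, the mean-pooled segment summary, and the uniform fallback for the first segment are all assumptions for illustration, not Lory's actual code): the sequence is split into fixed-length segments, and the weights used to merge experts for segment i are computed from a summary of segment i-1, so routing never depends on future tokens.

```python
# Sketch of causal segment routing with per-segment expert merging (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0


def causal_segment_moe(x, experts, router, segment_len):
    """x: (batch, seq_len, d_model); seq_len is assumed divisible by segment_len."""
    B, T, D = x.shape
    segments = x.view(B, T // segment_len, segment_len, D)
    outputs = []
    for i in range(segments.size(1)):
        if i == 0:
            # No previous segment: fall back to uniform merging (one possible choice).
            w = torch.full((B, len(experts)), 1.0 / len(experts))
        else:
            prev = segments[:, i - 1].mean(dim=1)      # (B, D) summary of segment i-1
            w = F.softmax(router(prev), dim=-1)        # (B, n_experts)
        # Merge expert parameters with batch-averaged weights (a real implementation
        # would merge per example; averaged here to keep the sketch short).
        w_mean = w.mean(dim=0)
        merged = {
            name: sum(wi * dict(e.named_parameters())[name]
                      for wi, e in zip(w_mean, experts))
            for name, _ in experts[0].named_parameters()
        }
        outputs.append(functional_call(experts[0], merged, (segments[:, i],)))
    return torch.cat(outputs, dim=1)


experts = nn.ModuleList(
    nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16)) for _ in range(4)
)
router = nn.Linear(16, 4)
print(causal_segment_moe(torch.randn(2, 64, 16), experts, router, segment_len=16).shape)
# torch.Size([2, 64, 16])
```

Because each segment triggers only one merge, the number of merge operations grows with the number of segments rather than the number of tokens.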

Ablation results also show that, compared with the causal segment routing strategy, routing based only on the prefix degrades language-model performance.

Similarity-based data batching

The standard practice for pre-training language models is to randomly concatenate documents from the dataset into fixed-length training samples.

This is problematic for MoE models: tokens in adjacent segments can come from entirely unrelated documents, which undermines expert specialization.

Therefore, inspired by an ICLR 2024 paper, the Lory authors adopt a similar technique, concatenating similar documents in sequence to construct training samples, which allows the experts to "focus" on different domains or topics.

Paper address: https://arxiv.org/abs/2310.10638
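
Below is a toy sketch of one way similarity-based batching could be implemented (illustrative only; the pipeline in the referenced paper is more sophisticated): given document embeddings, each training sequence is built by greedily chaining each document to its most similar unused neighbor, rather than concatenating randomly sampled documents.

```python
# Sketch of similarity-based document batching (illustrative only).
import numpy as np


def build_similar_sequences(doc_embeddings: np.ndarray, docs_per_sequence: int):
    """Greedily group documents so consecutive documents in a sequence are similar.

    doc_embeddings: (n_docs, dim) unit-normalized embeddings (assumed precomputed).
    Returns a list of index lists, each defining one training sequence.
    """
    unused = set(range(len(doc_embeddings)))
    sequences = []
    while unused:
        cur = unused.pop()                      # start a new sequence anywhere
        seq = [cur]
        while unused and len(seq) < docs_per_sequence:
            rest = np.array(sorted(unused))
            sims = doc_embeddings[rest] @ doc_embeddings[cur]  # cosine similarity
            cur = int(rest[np.argmax(sims)])    # jump to the most similar unused doc
            unused.remove(cur)
            seq.append(cur)
        sequences.append(seq)
    return sequences


# Toy usage: 8 random "document embeddings", 4 documents per training sequence.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(build_similar_sequences(emb, docs_per_sequence=4))
```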

Experiments show that Lory outperforms the dense baseline under both random batching and similarity-based batching, but the similarity-based method yields a larger improvement in loss.
