The world's largest open-source model breaks the record again! The 480-billion-parameter MoE beats Llama 3 and Mixtral

Author: New Zhiyuan

Editor: Editorial Department

Snowflake's Arctic, with 128 experts and 480 billion parameters, has become the largest open-source model to date. It is both large and very sparse, so it reaches performance comparable to Llama 3 8B with less than half the training compute.

Arctic, with 128 experts and 480 billion parameters, has just claimed the title of the largest open-source MoE model to date.

It is based on a new Dense-MoE architecture, consisting of a 10B dense Transformer model and a 128×3.66B MoE MLP, and is trained on 3.5 trillion tokens.

Not only that: in the spirit of being more "open source" than typical open-source releases, the team also disclosed how the training data was processed.

Arctic has two defining characteristics: it is large, and it is very sparse.

The advantage of this design is that it delivers comparable performance with several times less training compute than other models.

In other words, Arctic outperforms other open-source models trained with similar compute budgets.

Compared to Llama 3 8B and Llama 2 70B, Arctic uses less than half their training compute yet achieves comparable evaluation scores.

Figure 1: Enterprise intelligence (average of coding (HumanEval+ and MBPP+), SQL generation (Spider), and instruction following (IFEval)) versus training cost.

The key details are as follows:

480B total parameters, 17B active during generation;

128 experts, 2 active during generation;

Instruct and Base versions released;

Focused on enterprise tasks (code, SQL, reasoning, instruction following);

Released under the Apache 2.0 license;

Approximately 900 GB of memory at FP16 precision and 240 GB at INT4 precision (see the rough arithmetic check after this list);

Trained with DeepSpeed-MoE.
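
As a quick sanity check on the memory figures above (our back-of-the-envelope arithmetic, not Snowflake's), the weight footprint is simply total parameters times bytes per parameter; activations and the KV cache are ignored here:

```python
# Rough weight-memory estimate: total parameters x bytes per parameter.
# Ignores activation memory and the KV cache.
TOTAL_PARAMS = 480e9  # ~480B parameters

bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: ~{TOTAL_PARAMS * nbytes / 1e9:.0f} GB of weights")

# FP16 -> ~960 GB (the same ballpark as the ~900 GB quoted above, which may
# net out differently depending on exactly what is counted);
# INT4 -> ~240 GB, matching the figure in the list.
```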

It's all about cost-effectiveness

The evaluation focuses on two sets of metrics: enterprise intelligence and academic benchmarks.

Enterprise intelligence metrics cover the skills that matter most to enterprise customers: coding (HumanEval+ and MBPP+), SQL generation (Spider), and instruction following (IFEval).

At the same time, the team also adopted the academic benchmarks commonly used in the industry to assess LLMs, including world knowledge, common sense reasoning and mathematical ability.

Arctic surpasses open-source competitors such as Mixtral 8×7B on a number of enterprise intelligence metrics.

Within its compute class it achieves top-tier performance, and it even holds its own against models trained with higher compute budgets.

On academic benchmarks, it doesn't perform badly either.

During the assessment, the team found something interesting.

World-knowledge metrics such as MMLU are commonly used academic benchmarks. As high-quality web and STEM data grows, MMLU scores improve with training FLOPS.

However, because one of Arctic's goals is to maximize training efficiency under a small training budget, it is unsurprising that Arctic scores lower on MMLU than some other models.

Accordingly, models trained with a higher compute budget than Arctic's can be expected to surpass it on MMLU.

Of course, MMLU world-knowledge performance is not necessarily directly related to the enterprise intelligence the team focuses on.

Table 3: Comparison of Arctic with DBRX, Llama 3 8B, Llama 3 70B, Mixtral 8x7B, and Mixtral 8x22B

The training cost of enterprise-level AI has been brought down!

In the past, the cost of building top-tier enterprise AI with LLMs was prohibitively high and resource-intensive.

The cost often runs to tens or even hundreds of millions of dollars, which is staggering.

Researchers on the Snowflake AI team have been working on this problem; team members previously open-sourced systems such as ZeRO, DeepSpeed, PagedAttention/vLLM, and LLM360, significantly reducing the cost of LLM training and inference.

Launched today, Arctic excels at enterprise tasks such as SQL generation, coding, and instruction following.

It sets a new benchmark for cost-effective training, allowing users to create high-quality, customized models that meet the needs of their businesses at a fraction of the cost.

Arctic is also a true open model, under the Apache 2.0 license, providing unlimited access to weights and code.

Starting today, Snowflake Arctic is available on Hugging Face.
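
For illustration, a minimal loading sketch with the Hugging Face transformers library is shown below. The repo id is the one we would expect for the Instruct release and should be verified on the Hub; running the full 480B model also requires a multi-GPU host and the custom modeling code shipped with the checkpoint.

```python
# Minimal sketch of loading Arctic via Hugging Face transformers.
# The repo id is an assumption -- verify the exact name on the Hub.
# This only shows the API shape; the full model needs a multi-GPU host.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Snowflake/snowflake-arctic-instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or a quantized config to shrink the footprint
    device_map="auto",           # shard the weights across available GPUs
    trust_remote_code=True,      # Arctic ships custom modeling code
)

prompt = "Write a SQL query that counts orders per customer."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```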

Performance comparable to Llama 3 8B with only half the compute

The team found that enterprise customers have consistent AI needs and use cases: building conversational SQL data assistants, code assistants, and RAG chatbots.

To facilitate evaluation, the team averaged coding (HumanEval+ and MBPP+), SQL generation (Spider), and instruction following (IFEval) into a single metric called "enterprise intelligence".
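
Concretely, the aggregation is just a mean over the component benchmark scores. The small helper below is our reading of that definition; whether HumanEval+ and MBPP+ are first folded into a single coding score is an assumption, and the scores shown are made up purely to illustrate the calculation.

```python
# Hypothetical helper illustrating the "enterprise intelligence" average.
# Folding HumanEval+ and MBPP+ into one coding score first is our assumption.
def enterprise_intelligence(humaneval_plus, mbpp_plus, spider, ifeval):
    coding = (humaneval_plus + mbpp_plus) / 2
    return (coding + spider + ifeval) / 3

print(enterprise_intelligence(0.60, 0.55, 0.70, 0.65))  # ~0.64, with made-up scores
```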

Among open-source LLMs, Arctic achieves top-tier enterprise intelligence with a training compute budget of under $2 million (fewer than 3,000 GPU weeks).

What's more, it excels on enterprise intelligence tasks even when compared to models trained with significantly higher compute budgets.

The results show that Arctic's performance on enterprise-level metrics is comparable to or better than that of Llama 3 8B and Llama 2 70B, while using less than half the training computing resources of the latter two.

Specifically, Arctic uses 1/17th the compute budget of Llama 3 70B, yet is on par with it on enterprise tasks such as coding (HumanEval+ and MBPP+), SQL (Spider), and instruction following (IFEval).

Table 1: Model architecture and training compute (proportional to the product of active parameters and training tokens) for Arctic, Llama-2 70B, DBRX, and Mixtral 8x22B

In addition, Arctic's high training efficiency means that Snowflake customers and the AI community at large can train custom models in a more cost-effective way.

Training efficiency

To achieve such high training efficiency, Arctic adopts a unique Dense-MoE hybrid Transformer architecture.

The architecture combines a 10B dense Transformer with a 128×3.66B residual MoE MLP; although the total parameter count reaches 480B, only 17B parameters are kept active through top-2 gating.
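
The sketch below shows the general shape of such a layer in PyTorch: a dense MLP branch plus a residual MoE branch whose router keeps only the top-2 of 128 experts per token. It is a simplified illustration of the idea, not Arctic's actual code; the dimensions, routing details, and normalization are schematic.

```python
# Schematic Dense-MoE hybrid layer: dense MLP branch + residual top-2 MoE branch.
# Illustrative only -- dimensions, routing, and load balancing are simplified.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model, d_expert, n_experts=128, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                # only the chosen experts run per token
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

class DenseMoEBlock(nn.Module):
    """Dense MLP plus a residual MoE MLP, in the spirit of Arctic's hybrid design."""
    def __init__(self, d_model=4096, d_dense=11008, d_expert=1024, n_experts=128):
        super().__init__()
        self.dense_mlp = nn.Sequential(
            nn.Linear(d_model, d_dense), nn.GELU(), nn.Linear(d_dense, d_model)
        )
        self.moe = Top2MoE(d_model, d_expert, n_experts)

    def forward(self, x):
        # The MoE output is added as a residual alongside the dense branch.
        return x + self.dense_mlp(x) + self.moe(x)
```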

Arctic's design and training are based on three key innovations:

1. More, finer-grained experts and more expert choices

First, the DeepSpeed team showed in late 2021 that Mixture-of-Experts (MoE) can significantly improve LLM quality without increasing compute cost.

Second, the quality improvement depends mainly on the number of experts and total parameters in the MoE model, and on how those experts are combined.

Based on this, Arctic was designed with 480B parameters distributed across 128 fine-grained experts, with 17B active parameters selected via top-2 gating. By contrast, recent MoE models use far fewer experts (as shown in Table 2).

Intuitively, Arctic uses a large total parameter count and many experts to expand model capacity, while choosing among a large pool of fine-grained experts and activating only a moderate number of parameters. This enables resource-efficient training and inference and, ultimately, top-tier intelligence.

Figure 2: Standard MoE architecture vs. Arctic

2. Architecture and system co-design

Even with the most powerful AI hardware, training a vanilla MoE architecture with a large number of experts is inefficient.

The reason is that the all-to-all communication between experts carries very high overhead. If communication can be overlapped with computation, however, this overhead can be largely hidden.

Therefore, the team combined a dense Transformer with a residual MoE component (Figure 2) in the Arctic architecture, allowing the system to hide most of the communication overhead by overlapping communication with computation and ultimately achieving excellent training efficiency.
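
The general pattern is sketched below: launch the expert all-to-all asynchronously and compute the dense branch while the tokens are in flight. This is a conceptual illustration of the overlap idea using torch.distributed, not Arctic's or DeepSpeed's actual implementation, and the routing/combination helpers are hypothetical placeholders.

```python
# Conceptual sketch of overlapping expert all-to-all communication with the
# dense branch's computation. Not Arctic's actual training code.
import torch
import torch.distributed as dist

def dense_moe_forward(x, dense_mlp, route_tokens, run_local_experts, combine):
    # route_tokens / run_local_experts / combine are hypothetical helpers that
    # stand in for the router, the local expert MLPs, and the return exchange.
    send_buf = route_tokens(x)
    recv_buf = torch.empty_like(send_buf)

    # Launch the all-to-all asynchronously so it proceeds in the background.
    work = dist.all_to_all_single(recv_buf, send_buf, async_op=True)

    # While tokens are in flight, compute the dense residual branch.
    dense_out = dense_mlp(x)

    # Wait for the exchange, run the local experts, and combine the results.
    work.wait()
    expert_out = run_local_experts(recv_buf)
    return x + dense_out + combine(expert_out)
```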

3. An enterprise-focused data curriculum

Excelling at enterprise metrics such as code generation and SQL requires a very different data curriculum than training for generic metrics.

After hundreds of small-scale comparative experiments, the team found that generic skills such as common-sense reasoning can be learned early in training, while more complex skills such as coding, math, and SQL are learned more effectively later in training.

Therefore, Arctic is trained with a three-stage curriculum, each stage with a different data composition:

The first stage (1T tokens) focuses on general-purpose skills, while the latter two stages (1.5T and 1T tokens) focus on enterprise skills.
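
In code form, the schedule amounts to three stages with shifting data mixes. The sketch below records only what is stated here (token budget and focus per stage); the concrete per-source mixture in each stage is given in Table 2 and is not reproduced.

```python
# Three-stage training curriculum as described above.
# The per-source data mixture for each stage is given in Table 2.
CURRICULUM = [
    {"stage": 1, "tokens": 1.0e12, "focus": "general-purpose skills"},
    {"stage": 2, "tokens": 1.5e12, "focus": "enterprise skills (code, math, SQL)"},
    {"stage": 3, "tokens": 1.0e12, "focus": "enterprise skills (code, math, SQL)"},
]

assert sum(s["tokens"] for s in CURRICULUM) == 3.5e12  # matches the 3.5T-token run
```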

Table 2: Dynamic data composition across Arctic's three training stages

Inference efficiency

Training efficiency is just one aspect of Arctic's efficiency.

Inference efficiency is also critical if you want to deploy models at a low cost.

Representing a leap in MoE model scale, Arctic uses more experts and parameters than other open-source autoregressive models.

To run inference on Arctic efficiently, the team therefore made a number of system-level innovations:

a) For interactive inference at small batch sizes (e.g., batch size 1), an MoE model's latency is bottlenecked by the time it takes to read all active parameters; inference is memory-bandwidth bound.

At this batch size, Arctic (17B active parameters) performs up to 4x fewer memory reads than Code-Llama 70B and up to 2.5x fewer than Mixtral 8x22B (44B active parameters), resulting in faster inference.

To this end, the team worked with NVIDIA's TensorRT-LLM and vLLM teams to provide an initial implementation of Arctic for interactive inference.

With FP8 quantization, Arctic can fit within a single GPU node.

While still far from fully optimized, Arctic already exceeds 70 tokens/sec at batch size 1, enough for effective interactive serving.
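
These ratios follow directly from the active-parameter counts, and the single-node claim can be sanity-checked the same way (our arithmetic; the 8-GPU, 80 GB-per-GPU node assumed here is an assumption, not a stated spec):

```python
# Back-of-the-envelope check of the memory-read ratios and the FP8 single-node fit.
ACTIVE_PARAMS = {"Arctic": 17e9, "Code-Llama 70B": 70e9, "Mixtral 8x22B": 44e9}

print(ACTIVE_PARAMS["Code-Llama 70B"] / ACTIVE_PARAMS["Arctic"])  # ~4.1x fewer reads
print(ACTIVE_PARAMS["Mixtral 8x22B"] / ACTIVE_PARAMS["Arctic"])   # ~2.6x fewer reads

# FP8 weights: ~1 byte per parameter across all ~480B parameters.
fp8_weights_gb = 480e9 * 1 / 1e9   # ~480 GB of weights
node_memory_gb = 8 * 80            # assumed node: 8 GPUs x 80 GB each = 640 GB
print(fp8_weights_gb <= node_memory_gb)  # True -- the weights fit on one node
```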

b) As the batch size grows significantly, e.g., to thousands of tokens per forward pass, Arctic shifts from being memory-bandwidth bound to compute bound, where the bottleneck is the number of active parameters per token.

At this point, Arctic's computational requirements are reduced by a factor of 4 compared to CodeLlama 70B and Llama 3 70B.

Reaching the compute-bound regime and the high throughput that Arctic's low active-parameter count enables (as shown in the figure below) requires a large batch size.

This in turn requires enough KV-cache memory to support large batches, as well as enough memory to hold the model's nearly 500B parameters.

Faced with these challenges, the team finally found a solution.

The team achieved this with two-node inference, combining a set of system optimizations: FP8 weights, split-fuse and continuous batching, tensor parallelism within each node, and pipeline parallelism across nodes.

Figure 3: Enterprise intelligence (average of coding (HumanEval+ and MBPP+), SQL generation (Spider), and instruction following (IFEval)) versus active parameters during inference.

Open source code

Both the Arctic base model and the instruction-tuned model are open source, and anyone can use them for research, products, and prototypes.

Project address: https://github.com/Snowflake-Labs/snowflake-arctic

The researchers also provide LoRA-based fine-tuning pipelines and recipes that enable efficient fine-tuning of the model on a single node.
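
A minimal sketch of what LoRA-based fine-tuning looks like with the peft library is shown below. The repo id and target module names are assumptions, and the official recipes in the project repository handle the sharding and memory-saving details this sketch omits.

```python
# Minimal LoRA fine-tuning sketch using the peft library.
# Repo id and target_modules are assumptions; see the Snowflake-Labs/snowflake-arctic
# recipes for the actual single-node setup.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-base",    # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```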

Now, Snowflake is working with NVIDIA TensorRT-LLM and vLLM to develop an initial inference implementation for the Arctic model, optimized for interactive use with a batch size of 1.

Going forward, they will also work with the community to tackle the complexity of large-batch inference for very large MoE models.

Cookbook:https://medium.com/snowflake/snowflake-arctic-cookbook-series-exploring-mixture-of-experts-moe-c7d6b8f14d16

In addition, Arctic is currently trained with a 4K context window; in the coming weeks, the researchers will develop an attention-sinks-based sliding-window method to support unbounded sequence generation.

The next step is to expand to a 32K context window.

Meet the team

Snowflake's CEO is Sridhar Ramaswamy, a former senior vice president at Google.

After 15 years at Google, he became a co-founder of Neeva, which was later acquired by Snowflake.

He received his B.S. in Computer Science from the Indian Institute of Technology, Madras and his Ph.D. in Computer Science from Brown University.

The head of the AI team, Vivek Raghunathan, is also a former Google vice president.

He worked as a researcher at Microsoft, then on machine learning and advertising infrastructure at Google, and in 2018 became a vice president at Google, leading the YouTube team.

He then co-founded Neeva with Sridhar Ramaswamy.

Raghunathan is also an Indian Institute of Technology alumnus, though his bachelor's degree is from IIT Bombay. He went on to earn his master's and doctoral degrees from UIUC.

To build out the AI effort, the two recruited several top veterans of the DeepSpeed team, including Zhewei Yao and Yuxiong He.

Zhewei Yao received his Ph.D. from UC Berkeley, where his research focused on computational statistics, optimization, and machine learning. (Before that, he received his bachelor's degree in mathematics from Shanghai Jiao Tong University in 2016.)

He joined Microsoft in 2021 as a Principal Researcher and R&D manager, focusing on efficient large-scale training and inference.

Currently, he is a Senior Scientist and SDE II at Snowflake, and a founding member of Snowflake's large-scale pre-training effort.

Yuxiong He spent 13 years at Microsoft, is one of the founders of DeepSpeed, and recently joined Snowflake.

She received her bachelor's degree in computer engineering from Nanyang Technological University, Singapore.

Another notable hire, Aurick Qiao, joined Snowflake just last November.

During his PhD at CMU, he won a Best Paper Award at OSDI 2022. Previously, he worked at Microsoft and Dropbox.

He was the CEO of Petuum and the co-founder of LMNet.

Hao Zhang is an assistant professor at the Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering at UCSD.

He received his Ph.D. in Computer Science from CMU under the supervision of Eric Xing. During his PhD, he took a leave from his studies to work at Petuum, an ML platform startup.

Hao Zhang co-founded LMnet.ai in 2023, and the company joined Snowflake in November of the same year.

He also co-founded LMSYS Org, the non-profit that trained the once-popular Vicuna model and that created and maintains Chatbot Arena, one of the most influential large language model evaluation platforms to date.

His research interests are at the intersection of machine learning and systems.

Resources:

https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
