
Two months of hard grind on 3,100 H100s: a 132B model with 2x faster inference

Author: New Zhiyuan

Editor: Editorial Department

Just now, the throne of the world's strongest open-source model changed hands: the startup Databricks released DBRX, surpassing Llama 2, Mixtral, and Grok-1. The whole process took only two months, $10 million, and 3,100 H100s.

The world's strongest open source model changed hands overnight!

Just now, Databricks, a super-unicorn, launched DBRX, an open-source model with 132 billion parameters.

It uses a fine-grained MoE architecture that activates only 36 billion parameters per input, achieving higher per-second token throughput.


This unique MoE architecture makes DBRX an open-source SOTA model with 2x faster inference speed than LLaMA 2-70B!

Most importantly, the training cost is cut in half: with only $10 million and 3,100 H100s, Databricks delivered DBRX in two months.

This is only a fraction of the cost and chips that Meta used to develop Llama2.


DBRX easily beats the open-source models LLaMA2-70B, Mixtral, and Grok-1 in terms of language understanding, programming, math, and logic.


DBRX's overall performance even surpasses GPT-3.5, and in programming it beats GPT-3.5 outright.


DBRX also provides the open community and enterprises with API functionality of the kind usually limited to closed models. The weights of the base model (DBRX Base) and the fine-tuned model (DBRX Instruct) are now available under license on Hugging Face.

Starting today, Databricks customers can use DBRX through the API. It even runs on a MacBook Pro, suggesting LLMs will soon be able to run on personal devices.

Soumith Chintala, the father of PyTorch, is also very optimistic about DBRX, the latest open-source model.


From Mistral, to Grok-1, to DBRX, MoE-based models are taking over the open source world.


A Databricks employee said excitedly that the past three months of telling friends on weekends, "Sorry, I'm busy this week, but I can't say with what," are finally over; DBRX is the "monster" all that overtime produced.


Another netizen said, "If the lab continues to open-source large-scale MoE models, NVIDIA may need to launch the most powerful consumer GPU with Blackwell architecture."


The world's strongest open source model has changed hands

DBRX is a Transformer decoder-only large language model trained with next-token prediction.
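For readers unfamiliar with the objective, a minimal sketch of next-token-prediction training (illustrative only, not Databricks' training code) might look like this:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the next-token-prediction objective a decoder-only LM
# is trained with (illustrative only, not Databricks' actual training code).
def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab) model outputs; token_ids: (batch, seq_len)
    # Position t predicts token t+1, so shift the targets by one.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```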

It uses a fine-grained mixture-of-experts (MoE) architecture, which means it has a larger number of smaller expert models.


Yes, MoE is again the big contributor this time. In an MoE model, only some parts of the network are activated depending on the content of the query, which greatly improves the efficiency of training and running the model.

DBRX has about 132 billion parameters, Llama 2 has 70 billion, Mixtral has 45 billion, and Grok-1 has 314 billion.

However, to process a typical query, DBRX only needs to activate about 36 billion parameters on average.

This improves utilization of the underlying hardware and boosts training efficiency by 30 to 50 percent. The model not only responds faster but also consumes less energy.

Compared to other open-source MoE models such as Mixtral and Grok-1, DBRX uses more small experts.

Specifically, DBRX has 16 different experts, with 4 experts selected for each token in each layer. Mixtral and Grok-1 have 8 experts, and a routing network selects 2 experts for each token at each layer.

As a result, DBRX offers 65x more possible expert combinations, which the team says significantly improves model quality.
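As a rough illustration of how such fine-grained routing works, here is a minimal sketch (not DBRX's actual implementation) of a top-4-of-16 router, along with the binomial count behind the 65x figure:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative top-k MoE router (a sketch, not DBRX's actual implementation):
# DBRX-style routing picks 4 of 16 experts per token per layer, vs. the
# 2-of-8 routing used by Mixtral and Grok-1.
def route(hidden: torch.Tensor, router_weight: torch.Tensor, k: int = 4):
    # hidden: (num_tokens, d_model); router_weight: (d_model, num_experts=16)
    scores = hidden @ router_weight            # (num_tokens, 16) router logits
    gate, expert_idx = scores.topk(k, dim=-1)  # choose the top-4 experts per token
    gate = F.softmax(gate, dim=-1)             # mixing weights over chosen experts
    return gate, expert_idx

# The "65x more combinations" figure is a simple binomial count:
print(math.comb(16, 4) / math.comb(8, 2))      # 1820 / 28 = 65.0
```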

In addition, DBRX uses rotary position embeddings (RoPE), gated linear units (GLU), and grouped-query attention (GQA), and it uses the GPT-4 tokenizer available in the tiktoken repository.
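Since the GPT-4 tokenizer is distributed through the open tiktoken library, a quick way to see it in action (assuming the cl100k_base encoding GPT-4 uses; DBRX's exact integration may differ) is:

```python
import tiktoken  # pip install tiktoken

# The article says DBRX reuses the GPT-4 tokenizer from the tiktoken repo;
# cl100k_base is the encoding GPT-4 uses. How DBRX wires it in may differ.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Databricks released DBRX, a 132B-parameter MoE model.")
print(len(ids), ids[:8])
```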

The DBRX model is pre-trained on 12 trillion tokens of text and code and supports a maximum context length of 32K.


The researchers estimate that this data is at least 2x better, token for token, than the data used to pretrain the MPT family of models.

This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, Unity Catalog for data management and governance, and MLflow for experiment tracking.

The team used "curriculum learning" for pre-training and changed the data composition during the training process, which greatly improved the quality of the model.
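As a toy illustration of that curriculum idea (the sources, phase boundary, and weights below are invented for illustration, not the DBRX recipe), the data mix can simply change as a function of training progress:

```python
import random

# Toy sketch of "curriculum learning" as described above: the sampling mix
# over data sources changes partway through pre-training. The sources, phase
# boundary, and weights here are invented for illustration only.
CURRICULUM = [
    # (training progress at which this phase starts, sampling weights)
    (0.0, {"web": 0.7, "code": 0.2, "curated": 0.1}),
    (0.7, {"web": 0.4, "code": 0.3, "curated": 0.3}),  # shift the mix late in training
]

def sample_source(progress: float) -> str:
    # Use the latest phase whose start point has been reached.
    _, weights = max((p for p in CURRICULUM if p[0] <= progress), key=lambda p: p[0])
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs, k=1)[0]

print(sample_source(0.2), sample_source(0.9))
```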

So, how exactly does DBRX perform?

Beating Grok-1, which has 2.4x the parameters

As Table 1 below shows, DBRX Instruct sets a new open-source SOTA on comprehensive benchmarks, programming and math benchmarks, and MMLU.

Comprehensive benchmarks

The researchers evaluated DBRX Instruct and other open-source models on two comprehensive benchmarks, Hugging Face's Open LLM Leaderboard and Databricks Model Gauntlet.

Databricks Model Gauntlet consists of more than 30 tasks covering 6 categories: World Knowledge, Common Sense Reasoning, Language Understanding, Reading Comprehension, Symbolic Problem Solving, and Programming.

On the overall benchmarks, DBRX Instruct surpasses all the chat and instruction-tuned models compared.


Programming and math benchmarks

DBRX Instruct stands out in programming and math.

It scores higher than other open-source models on both HumanEval and GSM8k.

On programming benchmarks, DBRX Instruct scored 70.1%, Grok-1 63.2%, and LLaMA2-70B Chat 32.2%. On mathematical benchmarks, DBRX Instruct is 66.9%, Grok-1 is 62.9%, and LLaMA2-70B Base is 54.1%.

Although Grok-1 has 2.4x the parameters of DBRX, DBRX surpasses second-place Grok-1 in both programming and math.

On HumanEval, DBRX Instruct (70.1%) even outpaces CodeLLaMA-70B Instruct (67.8%), a model built specifically for programming.

In terms of MMLU, the language comprehension benchmark, DBRX Instruct scored higher than all models at 73.7%.


Surpasses GPT-3.5 across the board

In addition, compared to the closed-source model GPT-3.5, DBRX Instruct surpasses it across the board, and can also compete with Gemini 1.0 Pro and Mistral Medium.

Specifically, DBRX Instruct outperformed GPT-3.5 on general knowledge (MMLU, 73.7% vs. 70.0%) and on the commonsense-reasoning benchmarks HellaSwag (89.0% vs. 85.5%) and WinoGrande (81.8% vs. 81.6%).

In the HumanEval (70.1% vs. 48.1%) and GSM8k (72.8% vs. 57.1%) tests, DBRX also excelled in programming and mathematical reasoning.

In addition, DBRX Instruct scores higher than Gemini 1.0 Pro on the Inflection Corrected MTBench, MMLU, HellaSwag, and HumanEval benchmarks.

However, Gemini 1.0 Pro is significantly stronger on GSM8k.

On HellaSwag, DBRX Instruct and Mistral Medium score similarly, while Mistral Medium is stronger on Winogrande and MMLU.

Meanwhile, DBRX Instruct takes the lead on HumanEval, GSM8k, and Inflection Corrected MTBench.


In Databricks' view, it's important for open source models to beat closed source models.

In the last quarter, the team saw a significant shift among its more than 12,000 customers, who are replacing proprietary models with open-source models to improve efficiency.

Now, many customers can outperform proprietary models in quality and speed by customizing open-source models for specific tasks.

DBRX was launched to speed up this process.

Long context task quality and RAG

DBRX Instruct is trained with contexts of up to 32K tokens.

Table 3 compares its performance against Mixtral Instruct, as well as the latest versions of GPT-3.5 Turbo and GPT-4 Turbo APIs, on a set of long-context benchmarks.

Without a doubt, GPT-4 Turbo is the best model at these tasks.

However, with one exception, DBRX Instruct outperforms GPT-3.5 Turbo across all context lengths and all parts of the sequence.

Overall, DBRX Instruct and Mixtral Instruct perform similarly.


One of the most common ways to leverage a model's context is retrieval-augmented generation (RAG).

In RAG, content relevant to the prompt is retrieved from a database and presented along with the prompt, giving the model more information.
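As a minimal illustration of that flow (not Databricks' pipeline; `embed` stands in for whatever embedding model is used), a RAG prompt can be assembled like this:

```python
import numpy as np

# Minimal RAG sketch (illustrative; `embed` is a placeholder for whatever
# embedding model you use, not part of the Databricks stack described here).
def retrieve(query_vec, doc_vecs, docs, k=2):
    # Cosine similarity between the query embedding and each document embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question, docs, doc_vecs, embed):
    # Prepend the retrieved passages to the question before calling the LLM.
    context = "\n".join(retrieve(embed(question), doc_vecs, docs, k=2))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```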

Table 4 shows the quality of DBRX on two RAG benchmarks, Natural Questions and HotPotQA.

DBRX Instruct is highly competitive with open-source models such as Mixtral Instruct and LLaMA2-70B Chat, as well as GPT-3.5 Turbo.


The training efficiency is twice as high as that of non-MoE models

Model quality must be considered in the context of how efficiently the model can be trained and used, and this is especially true at Databricks.

The researchers found that training the MoE model provided substantial improvements in the computational efficiency of training (Table 5).

For example, training DBRX MoE-B, a smaller member of the DBRX family (23.5B total parameters, 6.6B active), requires 1.7x fewer FLOPs than LLaMA2-13B to reach a score of 45.5% on the Databricks LLM Gauntlet.

DBRX MoE-B also has about half as many active parameters as LLaMA2-13B.

Overall, the end-to-end LLM pre-training pipeline has improved computational efficiency by nearly 4x over the past 10 months.

On May 5, 2023, Databricks released MPT-7B, a 7B-parameter model trained on 1T tokens that scored 30.9% on the Databricks LLM Gauntlet.

DBRX MoE-A, another member of the DBRX family (7.7B total parameters, 2.2B active), scored 30.5% while requiring 3.7x fewer FLOPs.

This efficiency is the result of a series of improvements, including the MoE architecture, other architectural changes to the network, better optimization strategies, better tokenization, and better pre-training data.


Taken on its own, better pre-training data has a big impact on model quality.

The researchers trained a 7B model called DBRX Dense-A on 1T tokens using the DBRX pre-training data. It scored 39.0% on the Databricks Gauntlet, compared with MPT-7B's 30.9%.

The researchers estimate that the new pre-training data is at least 2x better, token for token, than the data used to train MPT-7B.

In other words, only half as many tokens are needed to reach the same model quality.

Further, the researchers verified this by training DBRX Dense-A on just 500B tokens.

It still outperformed MPT-7B on the Databricks Gauntlet, scoring 32.1%.

In addition to better data quality, another important reason for the improved token efficiency may be the GPT-4 tokenizer.

Inference efficiency

Overall, MoE models run inference faster than their total parameter counts would suggest, because they use relatively few parameters for each input.

DBRX's inference throughput is 2-3x higher than that of a 132B non-MoE model.

Inference efficiency and model quality are usually in tension: larger models generally deliver higher quality, while smaller models deliver higher inference efficiency.

Using an MoE architecture can achieve a better balance between model quality and inference efficiency than dense models.
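A back-of-the-envelope calculation makes the point, using the common rule of thumb of roughly 2 FLOPs per active parameter per generated token (an approximation, not a Databricks figure):

```python
# Back-of-the-envelope arithmetic behind the throughput claim: per generated
# token, a decoder costs roughly 2 FLOPs per *active* parameter (a common
# rule of thumb, not an exact Databricks figure).
def flops_per_token(active_params_billion: float) -> float:
    return 2 * active_params_billion * 1e9

dbrx = flops_per_token(36)          # 36B active out of 132B total parameters
llama2_70b = flops_per_token(70)    # dense model: all 70B parameters are active
print(f"{llama2_70b / dbrx:.1f}x")  # ~1.9x fewer FLOPs per token for DBRX
```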


As measured with Mosaic AI Model Serving, DBRX generates text significantly faster than LLaMA2-70B.

For example, DBRX is higher quality than LLaMA2-70B, and because its active parameter count is roughly half of LLaMA2-70B's, DBRX's inference throughput can be up to 2x higher.

Mixtral sits at another point on the improved Pareto frontier that MoE models achieve: it is smaller and lower quality than DBRX, but reaches higher inference throughput.

On Databricks' optimized, 8-bit quantized model-serving platform, the Foundation Model API achieves inference throughput of up to 150 tokens per second.


Free for businesses

Enterprises can access DBRX on the Databricks platform, take advantage of long context capabilities in RAG systems, and build custom DBRX models on their own private data.

The open-source community can access DBRX through GitHub repositories and Hugging Face.


Project address: https://github.com/databricks/dbrx


Project address: https://huggingface.co/databricks
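For reference, here is a hedged sketch of loading the released weights with Hugging Face transformers (repo name from the Databricks org linked above; exact loading flags may vary by transformers version, and a 132B model needs several high-memory GPUs or aggressive quantization):

```python
# Hedged sketch of loading the released weights with Hugging Face
# `transformers`. The repo name follows the Databricks org linked above;
# loading flags (e.g. trust_remote_code) may vary by transformers version.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct", device_map="auto", trust_remote_code=True)

inputs = tok("What is a mixture-of-experts model?", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```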

Because DBRX was built entirely on the Databricks platform, every enterprise user can use the same tools and techniques to create or improve their own custom models.

Training data can be centrally managed in Unity Catalog, and processed and cleaned with tools and services from Apache Spark and Lilac AI.

Large-scale model training and fine-tuning are provided by Mosaic AI, which Databricks recently acquired.

Alignment issues can also be solved through their platforms and services.

Clients and partners such as Nasdaq and Accenture are already using this suite of services and tools.

A $1.3 billion acquisition, and two months of hard grind

A report by Wired details how the world's strongest open-source model was born.


Previously, Databricks had made a name for itself in the industry.

On Monday, more than a dozen engineers and executives at Databricks waited in the conference room for the final results.

What would come of an LLM that had taken months and about $10 million to train?

They didn't know just how powerful the model they had built was until the final benchmark results came in.


"We've surpassed all models!" As Jonathan Frankle, Chief Neural Network Architect and Head of the DBRX Team, announced the results, the members erupted in cheers and cheers.


Databricks' decision-makers: Jonathan Frankle, Naveen Rao, Ali Ghodsi, and Hanlin Tang

Yes, this is how DBRX surpasses Llama 2 and Mixtral, two of today's most popular open source models.

Even Grok-1, recently open-sourced by Musk's xAI, was beaten by DBRX.

Frankle joked: if we get a mean tweet from Musk, we'll know we've succeeded.

The most surprising thing for the team was that DBRX even approached GPT-4, the pinnacle of machine intelligence, in several metrics.

There is no doubt that DBRX now sets a whole new technical standard for open source LLMs.

Unicorns reinvigorate the open source community

By open-sourcing DBRX, Databricks has taken the open-source movement a step further, joining Meta's open-source camp against OpenAI and Google.

However, Meta did not disclose some key details of the Llama 2 model, whereas Databricks says it will disclose the key decisions made in the final stage of the work, even though training DBRX cost millions of dollars.

Ali Farhadi, CEO of the Allen Institute for Artificial Intelligence, said that there is an urgent need for greater transparency in the construction and training of AI models.

Databricks chose open source for good reason. Although giants such as Google have deployed AI in the past year, many large companies in the industry have not yet widely used large models on their own data.

According to Databricks, companies in industries such as finance and medicine are eager for ChatGPT-like tools, but are worried about sending sensitive data to the cloud.

Databricks will customize DBRX for customers or tailor it from the ground up for their business. For large companies, the cost of building a model of DBRX's scale is very reasonable.

"That's our big business opportunity. 」

To that end, Databricks acquired startup MosaicML in July last year, bringing in a number of technical talents, including Frankle. No one in either company had ever built such a large model before.

Internal operations


Databricks CEO Ali Ghodsi

Companies such as OpenAI are obsessively pursuing ever-larger models. But for Frankle, LLMs are about more than just size.

Getting thousands of computers connected intelligently through switches and fiber-optic cables, and keeping them running, is challenging.

MosaicML's employees are experts in this obscure science, which is why Databricks valued the company at $1.3 billion when it bought it last year.

In addition, data has a significant impact on the final result, and perhaps because of this, Databricks does not disclose the details of the data, including data quality, cleaning, filtering, and preprocessing.

Naveen Rao, Vice President of Databricks and Founder and CEO of MosaicML, said, "You could almost argue that this is the top priority for model quality."

Multi-million dollar question

Sometimes, the process of training a large AI model is not only technical, but also emotional.

Two weeks ago, the team at Databricks was faced with a tricky, multi-million dollar problem: how to harness the full potential of the model.

After two months of training the model on 3,072 rented, powerful Nvidia H100 GPUs, DBRX had already achieved excellent results on multiple benchmarks. But soon they would have only one more week of compute to use.

Team members threw ideas at each other on Slack, with one of the proposals being to make a version of the model that generated computer code, or a small version for hobbyists to try.

The team also considered no longer scaling the model up and instead using carefully selected data to improve its performance on specific capabilities, an approach known as curriculum learning.

Or, they can continue to scale up the model as originally planned, hopefully making it even more robust.

This last practice is affectionately referred to by the team members as the "let it go" option, and it seems that some people have a special affection for it.


While the discussion was friendly, a heated exchange of ideas was inevitable as the engineers fought for their preferred solutions.

Eventually, Frankle subtly steered the team toward the data-centric approach (curriculum learning). Two weeks later, that decision has clearly paid off handsomely.

However, Frankle's judgment on the other expected outcomes of the project was less accurate.

He originally didn't think DBRX would be particularly strong at generating computer code, since the team hadn't focused on that area.

He even confidently said that if he made a mistake in judgment, he would dye his hair blue.

However, Monday's results showed that DBRX outperformed all other open-source AI models on standard coding benchmarks.

"Our model code capabilities are very strong. Speaking at the results launch on Monday, he said, "I've made an appointment to dye my hair today. 」

Risk assessment

Finally, there is the question of the risks of open-source models.

DBRX is the most powerful open source model to date, and anyone can use or modify it.

Does this pose an unpredictable risk, such as cybercrime or misuse of chemical and biological weapons?

Databricks says it has conducted comprehensive security testing of the model.

Stella Biderman, executive director of EleutherAI, said there is little evidence that open source increases security risks. "We have no particular reason to believe that an open model will significantly increase the risk over existing closed models."

Previously, EleutherAI, along with Mozilla and about 50 other organizations and academics, sent an open letter to U.S. Secretary of Commerce Raimondo, asking her to ensure that future AI regulation leaves enough room for open source AI projects to grow.

The experts in the letter believe that open source AI is good for economic growth, as they help start-ups and small businesses gain access to this breakthrough and also help accelerate scientific research.

And that's what Databricks wants DBRX to contribute.

In addition to providing other AI researchers with a new model and useful tips for building their own, DBRX helps deepen the understanding of how AI actually works, Frankle said.

The Databricks team plans to study how the model changes in the final stages of training, perhaps revealing how a strong model emerges with additional capabilities.
