
The Mamba architecture goes big for the first time! Hybrid with Transformer, it beats Transformer

Author: QbitAI

Feng Se, from Aofei Temple
QbitAI | WeChat official account QbitAI

Well, the red-hot Mamba architecture has finally, for the first time, been scaled up to a genuinely large size.

52 billion parameters, built on a Mamba + Transformer hybrid architecture.

Its name is Jamba.


It takes the strengths of both architectures: model quality and efficiency together, with high throughput and a low memory footprint.


Preliminary benchmark results show:

  • Jamba's performance is generally close to that of Mixtral 8x7B, while delivering 3x the throughput on 128k-long contexts.
  • It supports a 256k context window in total, and a single A100 GPU can handle 140k of it, making it the most efficient and economical model of its size.

This achievement comes from AI21 Labs, an Israeli AI company.

The original author of Mamba reposted excitedly after reading it:

Absolute "big news".

Mamba and Transformer, combined

Mamba, proposed by CMU and Princeton University, addresses the Transformer's limitations: the longer the inference context, the larger the model's memory footprint and the slower the inference, resulting in huge compute consumption.

But it also has its own drawbacks –

Because it does not attend to the context as a whole, Mamba's output quality suffers, especially on recall-related tasks.

In the spirit of "having it both ways", Jamba steps forward to offer the best of both worlds.


Jamba consists of Transformer, Mamba, and MoE layers that optimize memory, throughput, and performance simultaneously.

To integrate the two architectures, Jamba takes an innovative blocks-and-layers combination approach.

Simply put, each Jamba block contains either an attention layer or a Mamba layer, followed by a multi-layer perceptron (MLP), and the overall ratio is kept at one Transformer (attention) layer for every eight layers, as in the sketch below.
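To make that layout concrete, here is a minimal sketch of how such a stack could be assembled. The class names, the placement of the attention layer within each group of eight, and the every-other-layer MoE placement are illustrative assumptions, not AI21 Labs' actual implementation.

```python
# Illustrative sketch of Jamba's blocks-and-layers idea: every layer is either an
# attention (Transformer) or a Mamba sequence mixer, each followed by an MLP,
# with one attention layer for every eight layers overall. Exact placements and
# the MoE frequency are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class LayerSpec:
    mixer: str  # "attention" or "mamba" -- the sequence-mixing sublayer
    mlp: str    # every layer is followed by an MLP (dense or MoE)

def build_jamba_like_stack(num_layers: int = 32,
                           attention_every: int = 8,
                           moe_every: int = 2) -> list[LayerSpec]:
    """Build a layer list with one attention layer in every `attention_every` layers."""
    layers = []
    for i in range(num_layers):
        mixer = "attention" if i % attention_every == attention_every - 1 else "mamba"
        mlp = "moe_mlp" if i % moe_every == moe_every - 1 else "dense_mlp"
        layers.append(LayerSpec(mixer=mixer, mlp=mlp))
    return layers

if __name__ == "__main__":
    stack = build_jamba_like_stack()
    n_attn = sum(layer.mixer == "attention" for layer in stack)
    print(f"{n_attn} attention layers out of {len(stack)}")  # 4 out of 32 -> 1:8 ratio
```

Running the sketch confirms the stated ratio: 4 attention layers in a 32-layer stack, i.e., one in every eight.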


Second, Jamba leverages MoE to increase the total number of model parameters while keeping the number of active parameters used at inference small.

The result is high model capacity without a corresponding increase in compute requirements.

To maximize model throughput on a single 80GB GPU, Jamba also optimizes the number of MoE layers and experts used, leaving enough memory for common inference workloads.

It is worth noting that at inference time, Jamba's MoE layers activate only 12 billion of the 52 billion available parameters, which makes it more efficient than a Transformer-only model of the same size.
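As a rough back-of-the-envelope illustration of how MoE keeps active parameters low: each token is routed to only a small number of experts, so only those experts' weights participate in the forward pass. The expert size, expert count, and top-k routing below are placeholder numbers for illustration, not Jamba's published configuration.

```python
# Rough illustration of total vs. active parameters in one MoE MLP layer.
# Expert size, expert count, and top-k routing are placeholders, not Jamba's
# published configuration.

def moe_param_counts(expert_params: float, num_experts: int, top_k: int):
    """Return (total, active) parameter counts for a single MoE layer."""
    total = expert_params * num_experts  # all experts must be stored in memory
    active = expert_params * top_k       # only the routed experts run per token
    return total, active

# Example: experts of 0.5B parameters each, 16 experts, top-2 routing.
total, active = moe_param_counts(expert_params=0.5e9, num_experts=16, top_k=2)
print(f"total: {total/1e9:.1f}B, active: {active/1e9:.1f}B per MoE layer")
# Stacking several such layers yields a model whose total parameter count (52B
# for Jamba) far exceeds the parameters actually used at inference (12B).
```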

Keep in mind that others have tried to scale Mamba up before, but no one had managed to push it beyond 3 billion parameters.

So, in addition to successfully merging Mamba and Transformer, Jamba claims a second milestone:

It is the first SSM-Transformer hybrid architecture of its kind to reach production-grade scale and quality. (P.S. Mamba is a state-space model, i.e., an SSM.)

Throughput and efficiency up

Initial assessments show that Jamba excels in key metrics such as throughput and efficiency.

First, Jamba delivers 3x the throughput on long contexts, making it more efficient than similarly sized Transformer-based models such as Mixtral 8x7B.

When the context window reaches 128k, Jamba's throughput is close to 1,500 tokens per second, while the best of the rest, Mixtral 8x7B, is only just above 500.


Second, on a single GPU, Jamba can fit a context of up to 140k tokens, which is both cost-effective and efficient.

By comparison, Mixtral 8x7B manages 64k, and Llama 2 70B only 16k.
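A quick, hedged calculation suggests why replacing most attention layers with Mamba layers helps so much here: a Transformer's KV cache grows linearly with context length and with the number of attention layers, while Mamba layers carry only a fixed-size state. The hidden size, layer counts, and precision below are assumed round numbers, not Jamba's actual configuration (and real models also use tricks like grouped-query attention to shrink the cache further).

```python
# Back-of-the-envelope KV-cache arithmetic showing why swapping most attention
# layers for Mamba layers shrinks long-context memory. All model dimensions are
# assumed round numbers for illustration only.

def kv_cache_gib(num_attn_layers: int, context_len: int,
                 hidden_size: int = 4096, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: 2 (K and V) * layers * tokens * hidden * bytes."""
    total_bytes = 2 * num_attn_layers * context_len * hidden_size * bytes_per_value
    return total_bytes / 2**30

ctx = 140_000
print(f"32 attention layers @ {ctx} tokens: {kv_cache_gib(32, ctx):.1f} GiB")
print(f" 4 attention layers @ {ctx} tokens: {kv_cache_gib(4, ctx):.1f} GiB")
# With only 1 attention layer in every 8, the KV cache is ~8x smaller, leaving
# room on a single 80GB GPU for the weights plus the constant-size Mamba states.
```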


Third, Jamba's output quality also holds up.

It reached SOTA on 3 of the 4 reasoning benchmarks tested, and on benchmarks such as GSM8K it is on par with the SOTA models even where it does not take first place.

Overall, Jamba's performance is close to that of Mixtral 8x7B.


Finally, the authors caution that these are only initial results, and there is still plenty of room for optimization (such as MoE parallelism and a faster Mamba implementation), so performance should get even stronger.

Good news: Jamba is now live on Hugging Face, notably under the Apache 2.0 license.

(An instruction-tuned version of Jamba will soon be available on the AI21 Labs platform.)
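For readers who want to try it, below is a minimal loading sketch using the Hugging Face transformers library. The exact flags, library version, and hardware needed for the full 52B checkpoint may differ from this sketch; check the model card at the portal link below for the authoritative instructions.

```python
# Minimal sketch of loading Jamba from Hugging Face with the transformers library.
# Flags and version requirements are assumptions; consult the model card at
# ai21labs/Jamba-v0.1 for the exact, up-to-date instructions.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # may be needed if your transformers version lacks native Jamba support
    device_map="auto",       # spread the 52B weights across available GPUs
)

prompt = "Mamba and Transformer layers can be combined because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that the full 52B checkpoint is large; running it comfortably likely requires multiple GPUs or a quantized variant.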


Netizens were moved to tears after reading it.


Portal:

https://huggingface.co/ai21labs/Jamba-v0.1

Reference Links:

[1]https://www.ai21.com/blog/announcing-jamba

[2]https://www.ai21.com/jamba

[3]https://twitter.com/AI21Labs/status/1773350888427438424?s=20

[4]https://twitter.com/tri_dao/status/1773418926518734957?s=20

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends