Two alpacas, trimmed at head and tail and spliced together, top the HuggingFace leaderboard

Mengchen, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

HuggingFace's open-source large model leaderboard has been swept yet again.

The top spots are now occupied by various fine-tuned builds of SOLAR 10.7B, squeezing out the assorted fine-tuned versions of Mixtral 8x7B from a few weeks ago.

What is the origin of the SOLAR model?

The paper, just uploaded to arXiv, comes from the South Korean company Upstage AI, and it uses a new method for scaling up large models called depth up-scaling (DUS).

Put simply: take two 7B alpacas (two copies of the same Llama-architecture model), trim them at head and tail, cutting the front 8 layers off one and the back 8 layers off the other.

The two remaining 24-layer stacks are then stitched together, with layer 24 of the first model joined directly to layer 9 of the second, yielding a new 48-layer, 10.7B-parameter model.
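To make the splice concrete, here is a minimal sketch in PyTorch/transformers, assuming the standard HuggingFace Mistral-7B layout (model.model.layers as a list of 32 decoder layers); it illustrates the idea rather than reproducing the team's code.

```python
# Minimal DUS sketch: build a 48-layer model from two trimmed copies of Mistral 7B.
# Assumes the HuggingFace Mistral layout (model.model.layers is an nn.ModuleList);
# an illustration of the splice described above, not the authors' pipeline.
import copy
import torch
from torch import nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
n = base.config.num_hidden_layers   # 32
k = 8                               # layers trimmed at the seam

# First copy keeps layers 1..24 (drops the last 8); second keeps layers 9..32 (drops the first 8).
front = [copy.deepcopy(base.model.layers[i]) for i in range(n - k)]
back = [copy.deepcopy(base.model.layers[i]) for i in range(k, n)]

# Splice: layer 24 of the first copy now feeds directly into layer 9 of the second.
base.model.layers = nn.ModuleList(front + back)
base.config.num_hidden_layers = len(base.model.layers)  # 48

# Re-number the per-layer indices used by the KV cache, where present.
for i, layer in enumerate(base.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i

print(base.config.num_hidden_layers)  # 48-layer, ~10.7B model, before continued pretraining
```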

The paper claims that the new approach outperforms traditional scaling methods such as MoE while keeping the same infrastructure as the underlying base model.

There is no need for add-on modules such as gating networks, MoE-optimized training frameworks, or custom CUDA kernels for fast inference; the model slots seamlessly into existing pipelines while remaining efficient.

The team chose Mistral 7B, the strongest single 7B model, as the base, and the model spliced in this new way surpasses both the original model and its MoE version.

At the same time, the aligned Instruct version also surpasses the corresponding MoE Instruct version.

Carrying the splicing through to the end

Why splice in this particular way? The paper starts from an intuition.

Start from the simplest way of scaling up: repeat the 32-layer base model twice to get 64 layers.

The advantage is that there is no heterogeneity, since every layer comes from the base model; but there is a large "layer distance" at the seam, where layer 32 meets layer 33 (which is just a copy of layer 1).

Previous studies have shown that different Transformer layers do different things; for example, deeper layers are better at handling more abstract concepts.

The team believes that too large a layer distance may hinder the model's ability to make effective use of pre-trained weights.

One potential solution is to sacrifice some of the middle layers and thereby reduce the discrepancy at the seam, and this is where the DUS method comes from.

Weighing performance against model size, the team chose to remove 8 layers from each copy, so the seam goes from layer 32 meeting layer 1 to layer 24 meeting layer 9.
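To see the seam numerically, here is a tiny illustration (layer indices for a 32-layer base model with 8 trimmed layers per copy, not code from the paper):

```python
# Naive duplication vs. DUS for a 32-layer base model, trimming k = 8 layers per copy.
n, k = 32, 8

naive = list(range(1, n + 1)) * 2                            # 64 layers
dus = list(range(1, n - k + 1)) + list(range(k + 1, n + 1))  # 48 layers

# Naive seam: layer 32 is followed by a fresh copy of layer 1 -> layer distance 31.
print(naive[n - 1], naive[n])        # 32 1
# DUS seam: layer 24 is followed by layer 9 -> layer distance 15.
print(dus[n - k - 1], dus[n - k])    # 24 9
print(len(naive), len(dus))          # 64 48
```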

Right after this simple splicing, the model's performance is at first lower than the original base model's, but it recovers quickly with continued pretraining.

In the instruction fine-tuning phase, besides open-source datasets, the team also built a math-enhanced dataset, and in the alignment phase they used DPO.
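For reference, the standard DPO objective used in that alignment step looks roughly like this (a generic sketch of the published DPO loss, not Upstage's training code):

```python
# Generic DPO loss on summed log-probabilities of chosen vs. rejected responses.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of the policy model to the frozen reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one, scaled by beta.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```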

The last step is model merging: taking a weighted average of the model versions trained on different datasets, carrying the splicing through to the end.
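A rough sketch of that kind of weight averaging over fine-tuned checkpoints (the checkpoint names and weights here are hypothetical; the paper does not release a merging script):

```python
# Average the weights of several fine-tuned checkpoints into one merged model.
import torch
from transformers import AutoModelForCausalLM

paths = ["ckpt-sft-open-data", "ckpt-sft-math"]   # hypothetical checkpoint names
weights = [0.5, 0.5]

models = [AutoModelForCausalLM.from_pretrained(p, torch_dtype=torch.float32) for p in paths]
states = [m.state_dict() for m in models]

with torch.no_grad():
    merged_state = {k: sum(w * s[k] for w, s in zip(weights, states)) for k in states[0]}

models[0].load_state_dict(merged_state)
models[0].save_pretrained("solar-merged")
```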

Some netizens questioned whether test data might have leaked into training.

Anticipating this, the team reported data-contamination test results in the appendix of the paper, which showed low levels of contamination.

Finally, both the SOLAR 10.7B base model and the fine-tuned model are open source under the Apache 2.0 license.

Netizens who have tried it report that it performs well at extracting data from JSON.
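If you want to try it yourself, loading the Instruct release through transformers looks roughly like this (model ID as listed on HuggingFace; the prompt here is simplified and may differ from the model card's recommended template):

```python
# Quick try-out of the open-source SOLAR Instruct model via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = 'Extract the "name" field from this JSON: {"name": "SOLAR", "layers": 48}'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```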

Paper:

https://arxiv.org/abs/2312.15166

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
