
Four lines of code triple a large model's context window, and it works on both Llama and Mistral

Author: QbitAI

Cressy, from Aofei Temple

QbitAI | Official WeChat account: QbitAI

No fine-tuning required: just four lines of code can extend a large model's context window by up to 3 times.

Moreover, it is plug and play: in theory it can be adapted to any large model, and it has already been tested successfully on Mistral and Llama 2.

With this technique, an LLM can be turned into a LongLM.


Recently, Chinese researchers from Texas A&M University and other institutions released a new window-extension method for large models, SelfExtend (SE).

On Mistral, the researchers ran a passkey retrieval test: a 5-digit number was randomly inserted into text up to 24k tokens long and the model was asked to find it. After SE was applied, the results came back all green (pass).

The unmodified version, by contrast, already started "seeing red" (failing) at a length of just 6k.
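For context, a passkey retrieval test of this kind is typically set up roughly as in the following sketch. This is a generic illustration, not the researchers' actual script; the filler sentence, prompt wording, and helper name are assumptions.

import random

def build_passkey_prompt(filler_sentence: str, approx_words: int) -> tuple[str, str]:
    # Repeat a filler sentence until the prompt is roughly the desired length,
    # then hide a random 5-digit passkey at a random position inside it.
    passkey = str(random.randint(10000, 99999))
    repeats = max(approx_words // max(len(filler_sentence.split()), 1), 1)
    words = ((filler_sentence + " ") * repeats).split()
    insert_at = random.randint(0, len(words))
    words.insert(insert_at, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(words) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

# Example: a ~20k-word prompt; the model passes if its completion contains the passkey.
prompt, answer = build_passkey_prompt("The grass is green. The sky is blue.", approx_words=20000)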


Alex Graveley, creator of GitHub Copilot, also excitedly announced that the experiment on Llama 2 was a success as well.


When netizens asked for details, Alex explained what "works" meant in his tweet: the noise (garbled output) that used to appear beyond 4k is now gone.


As for the limit on SE's window length, a developer who reproduced the SE code from the paper said that, in theory (given sufficient compute), it can be extended indefinitely.


So, what kind of effect can SE achieve?

Long-text capability significantly enhanced

As the window length grows from 4,096 to 16,384, Llama 2's perplexity skyrockets by two orders of magnitude.

With SE, however, the text length increases 4x while perplexity rises by only 0.4.

On Mistral, SE achieves lower perplexity than Mistral's own sliding window attention (SWA) mechanism.


△ The lower-left plot uses a logarithmic scale

On LongBench, a benchmark designed for long-text models, SE-processed models score higher than the originals on tasks such as single- and multi-document Q&A, summarization, few-shot learning, and code.

Notably, on a model called SOLAR, the SE-processed model at 16k even outperformed the original version at 4k.

SOLAR is built by splicing together two Llama-style models with their top and bottom layers trimmed off, which makes its attention-layer structure different from that of other Transformer-based models.


Meanwhile, on closed-domain question-answering tasks built from exam-style questions such as GSM, the SE-optimized models also achieve higher average scores than the originals, though on Mistral SE falls slightly short of the model's own SWA method.


However, the boost in long-text ability does not come at the cost of the model's performance on short text.

On Hugging Face's Open LLM benchmark, the SE version of Llama 2 scores essentially on par with the original version, with no significant drop.


Currently, the out-of-the-box version of SE supports three model families, Phi, Llama, and Mistral, whose windows can be extended with just four lines of code.

For other models, some changes to the code are required.
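For illustration, the plug-and-play usage would look roughly like the sketch below. The module name and the signature of the patching helper (assumed here to be SelfExtend.apply(model, group_size, window_size)) may differ from the official code, and the parameter values are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer
import SelfExtend  # patching module from the SE repo (name and API assumed)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# Patch the attention layers in place; group_size controls the grouped attention
# and window_size the neighbor window (illustrative values, not tuned settings).
SelfExtend.apply(model, group_size=8, window_size=1024)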


So, how does SE extend a model's window length?

Two attention mechanisms working together

The researchers believe the ability to handle long text is inherent in large models themselves, but it needs to be unlocked by some method.

The main problem is that when processing long text, large models encounter relative position encodings beyond the range seen during training.

To address this, the authors adopt a FLOOR-based grouped attention mechanism as the solution.

It groups the input sequence and floor-divides each token's position by the group size, so that positions far apart are mapped back into a shorter range.

Attention is then computed over these mapped positions, which avoids out-of-range position encodings and enables long-text processing.


For short- and medium-range context, however, the model's original attention mechanism is still used, so the model does not trade one ability for the other, and short-text performance is not sacrificed for long-text gains.
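To make the idea concrete, here is a minimal sketch of how such a position mapping could be computed. It follows the description above rather than the authors' exact formula; the neighbor_window threshold and the boundary shift used to keep the two ranges contiguous are assumptions.

def self_extend_relative_position(q_pos: int, k_pos: int,
                                  group_size: int, neighbor_window: int) -> int:
    rel = q_pos - k_pos
    if rel <= neighbor_window:
        # Nearby tokens: keep the original relative position (normal attention).
        return rel
    # Distant tokens: floor-divide absolute positions by the group size so the
    # relative distance is compressed back into the range seen during training.
    grouped = q_pos // group_size - k_pos // group_size
    # Shift so grouped positions continue after the neighbor window instead of
    # overlapping it (assumed alignment; see the paper for the exact form).
    return grouped + (neighbor_window - neighbor_window // group_size)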

In addition, the developer who reproduced SE on Mistral admits the current implementation is not perfect and may suffer from a spike in computational load.


Meanwhile, SE's original authors also said the current method has not yet been optimized for efficiency, and they plan to address this by incorporating techniques such as FlashAttention in the future.


Paper:

https://arxiv.org/abs/2401.01325

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
