
Four lines of code triple a large model's context window, and it works on both Llama and Mistral

Author: QbitAI

Cressy, from Aofei Temple

QbitAI | Official WeChat account: QbitAI

No fine-tuning required: just four lines of code can extend a large model's context window by up to 3 times.

Moreover, it is plug and play: in theory it can be adapted to any large model, and it has already been tested successfully on Mistral and Llama 2.

With this technique, an LLM can be turned into a LongLM.


Recently, Chinese researchers from Texas A&M University and other institutions released a new window-extension method for large models, SelfExtend (SE).

On Mistral, the researchers ran a passkey retrieval test: a 5-digit number was randomly inserted into text up to 24k tokens long and the model was asked to find it. After SE was applied, the results came back all green (pass).

The unmodified version, by contrast, already started "seeing red" (failing) at a length of just 6k.
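For context, a passkey retrieval test of this kind is typically set up roughly as in the following sketch. This is a generic illustration, not the researchers' actual script; the filler sentence, prompt wording, and helper name are assumptions.

import random

def build_passkey_prompt(filler_sentence: str, approx_words: int) -> tuple[str, str]:
    # Repeat a filler sentence until the prompt is roughly the desired length,
    # then hide a random 5-digit passkey at a random position inside it.
    passkey = str(random.randint(10000, 99999))
    repeats = max(approx_words // max(len(filler_sentence.split()), 1), 1)
    words = ((filler_sentence + " ") * repeats).split()
    insert_at = random.randint(0, len(words))
    words.insert(insert_at, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(words) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

# Example: a ~20k-word prompt; the model passes if its completion contains the passkey.
prompt, answer = build_passkey_prompt("The grass is green. The sky is blue.", approx_words=20000)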


Alex Graveley, creator of GitHub Copilot, also excitedly announced that the experiment on Llama 2 was a success as well.


When netizens asked for details, Alex explained what "works" meant in his tweet: the noise (garbled output) that used to appear beyond 4k is now gone.


As for the limit on SE's window length, a developer who reproduced the SE code from the paper said that, in theory (given sufficient compute), it can be extended indefinitely.


So, what kind of effect can SE achieve?

Long-text capability significantly enhanced

As the window length grows from 4,096 to 16,384, Llama 2's perplexity skyrockets by two orders of magnitude.

With SE, however, the text length increases 4x while perplexity rises by only 0.4.

On Mistral, SE achieves lower perplexity than Mistral's own sliding window attention (SWA) mechanism.


△ The lower-left plot uses a logarithmic scale

On LongBench, a benchmark designed for long-text models, SE-processed models score higher than the originals on tasks such as single- and multi-document Q&A, summarization, few-shot learning, and code.

Notably, on a model called SOLAR, the SE-processed model at 16k even outperformed the original version at 4k.

SOLAR is built by splicing together two Llama-style models with their top and bottom layers trimmed off, which makes its attention-layer structure different from that of other Transformer-based models.


Meanwhile, on closed-domain question-answering tasks built from exam-style questions such as GSM, the SE-optimized models also achieve higher average scores than the originals, though on Mistral SE falls slightly short of the model's own SWA method.


However, the boost in long-text ability does not come at the cost of the model's performance on short text.

On Hugging Face's Open LLM benchmark, the SE version of Llama 2 scores essentially on par with the original version, with no significant drop.


Currently, the out-of-the-box version of SE supports three model families, Phi, Llama, and Mistral, whose windows can be extended with just four lines of code.

For other models, some changes to the code are required.
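For illustration, the plug-and-play usage would look roughly like the sketch below. The module name and the signature of the patching helper (assumed here to be SelfExtend.apply(model, group_size, window_size)) may differ from the official code, and the parameter values are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer
import SelfExtend  # patching module from the SE repo (name and API assumed)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# Patch the attention layers in place; group_size controls the grouped attention
# and window_size the neighbor window (illustrative values, not tuned settings).
SelfExtend.apply(model, group_size=8, window_size=1024)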


So, how does SE extend a model's window length?

Two attention mechanisms working together

The researchers believe the ability to handle long text is inherent in large models themselves, but it needs to be unlocked by some method.

The main problem is that when processing long text, large models encounter relative position encodings beyond the range seen during training.

To address this, the authors adopt a FLOOR-based grouped attention mechanism as the solution.

It groups the input sequence and floor-divides each token's position by the group size, so that positions far apart are mapped back into a shorter range.

Attention is then computed over these mapped positions, which avoids out-of-range position encodings and enables long-text processing.


For short- and medium-range context, however, the model's original attention mechanism is still used, so the model does not trade one ability for the other, and short-text performance is not sacrificed for long-text gains.
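To make the idea concrete, here is a minimal sketch of how such a position mapping could be computed. It follows the description above rather than the authors' exact formula; the neighbor_window threshold and the boundary shift used to keep the two ranges contiguous are assumptions.

def self_extend_relative_position(q_pos: int, k_pos: int,
                                  group_size: int, neighbor_window: int) -> int:
    rel = q_pos - k_pos
    if rel <= neighbor_window:
        # Nearby tokens: keep the original relative position (normal attention).
        return rel
    # Distant tokens: floor-divide absolute positions by the group size so the
    # relative distance is compressed back into the range seen during training.
    grouped = q_pos // group_size - k_pos // group_size
    # Shift so grouped positions continue after the neighbor window instead of
    # overlapping it (assumed alignment; see the paper for the exact form).
    return grouped + (neighbor_window - neighbor_window // group_size)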

In addition, the developer who reproduced SE on Mistral admits the current implementation is not perfect and may suffer from a spike in computational load.


Meanwhile, SE's original authors also said the current method has not yet been optimized for efficiency, and they plan to address this by incorporating techniques such as FlashAttention in the future.


Paper:

https://arxiv.org/abs/2401.01325

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
