Jia Jiaya's team and MIT release an ultra-long-text extension technique: two lines of code to break the dialogue length limit of large models
By Guo Xiaojing, Tencent Technology
Recently, Jia Jiaya's team at The Chinese University of Hong Kong, together with MIT, released a technique called LongLoRA: with only two lines of code and a single 8-card A100 machine, the text length of a 7B model can be extended to 100k tokens, and that of a 70B model to 32k tokens. At the same time, the research team released LongAlpaca, the first 70B-parameter long-text dialogue large language model.
As soon as the new technique and models were released, they shot up the hot lists of the major open-source sites: first on the Hugging Face hot list, first on the Papers with Code hot list, and fifth among all Python projects on GitHub trending; GitHub stars exceeded 1,000 within a week, and related technical posts on Twitter have drawn nearly 180,000 views.

GitHub stars have reached 1.3K
Related tech posts on Twitter received nearly 180,000 views
Getting lost in the middle, lazy responses, models that get dumber as the context grows longer... Anyone who has used a large language model product has felt the limits on input length to some extent: when you want to discuss something slightly longer with the model, you have to split the input into pieces, and the model soon forgets the main points of what you entered earlier.
This is a typical dialogue defect of large language models, like a child born with an attention deficit who struggles to concentrate on a new book. At the root of the defect is the model's lack of long-text processing capability. The reason the new results from Jia Jiaya's team have drawn so much attention on the major open-source sites is precisely that they break through this long-text bottleneck.
First, long text processing is no longer a blind spot for large language models
With LongLoRA, this dialogue defect of large language models is addressed for the first time: dozens of pages of papers, hundreds of pages of reports, and even entire books are no longer blind spots for large models.
Some practitioners commented enthusiastically that LongLoRA is a beacon of hope in the maze of large language models: it reflects the industry's renewed attention to long-text models, effectively expands the context window, lets models reason over and process long text sequences, and is an innovative contribution to the field.
Beyond the technical innovation itself, one of the difficulties large language models face with long text is the lack of publicly available long-text dialogue data.
To this end, the research team collected 9K long-text question-answer pairs, covering Q&A about famous books, papers, in-depth reports, and even financial statements.
Handling long questions alone was not enough: the team also mixed a 3K short Q&A corpus into the 9K long Q&A corpus for training, so that the long-text model keeps its short-text dialogue ability. The complete dataset, named LongAlpaca-12k, is now open source.
Based on the LongAlpaca-12k dataset, the research team trained and evaluated models at different parameter scales (7B, 13B, and 70B) and open-sourced LongAlpaca-7B, LongAlpaca-13B, and LongAlpaca-70B.
Second, hands-on demos: how LongAlpaca, the large model built by applying LongLoRA to the 12K Q&A corpus, actually performs
(1) Have the system read a new paper and propose revisions according to the ICLR review guidelines, so as to improve its chances of acceptance.
LongAlpaca's view is that the paper's chances of acceptance would improve by articulating its novelty more precisely, providing more rigorous and comparative experimental results (including specific datasets and metrics), broadening the applications and future directions, and focusing on the key contributions and impact.
Next, have the system read two different new papers and ask LongAlpaca to summarize the stylistic differences between ICLR and CVPR. LongAlpaca concludes that CVPR papers tend to be more structured and experimental, emphasizing practicality and technical detail, whereas ICLR papers are more flexible in style, focusing on key theoretical analysis and mathematical derivation rather than a standard format.
It is clear that the trained LongAlpaca model readily takes in new long academic papers and answers academic questions quite accurately.
(2) Test LongAlpaca's interpretive performance in economics, a field with a relatively high bar for reading comprehension.
Based on the IMF's collected global economic outlook overviews from 2012 to 2023, ask it to summarize the economic situation year by year and forecast future trends. LongAlpaca's answer: the global economic situation in 2023 is uncertain, with growth expected to slow compared to 2022; in the coming years the global expansion will be moderate and below expectations, while major challenges such as geopolitical tensions and financial market volatility remain and will need to be addressed carefully through multilateral cooperation and structural reform.
(3) Test LongAlpaca's ability to read a new novel and analyze its content afterward.
The models compared below are LongAlpaca and Llama2, both with 13B parameters; let's look at how each performs.
On the classic novel "Journey to the West": why is Sun Wukong so powerful when he wreaks havoc in the Heavenly Palace, yet repeatedly frustrated on the journey to retrieve the scriptures?
LongAlpaca gave five reasons, which can be roughly summarized as: "The Monkey King's immaturity, strong opponents, the limits of his power, deception by his enemies, and an initial lack of companions led to his setbacks on the journey. But over time he gained wisdom, experience, and comrades, which enabled him to defeat his most formidable enemies."
Llama2's answer was simpler: "While he does wreak havoc in the Heavenly Palace, it is not necessarily accurate to say he represents an undefeated force. In fact, he is eventually defeated by the powerful Tang Sanzang in Journey to the West." Llama2 did not give the right answer, and it also asked for more contextual information.
This time, have the system read "The Three-Body Problem", then ask: why did Ye Wenjie contact the aliens in the first part, and why did she come to regret it later?
LongAlpaca's view is that Ye Wenjie's initial contact with the aliens was driven by idealism and revenge, but over time her view of humanity changed; she realized the danger she might have brought upon all of humanity, and her remorse grew by the day. The answer is very clear.
Llama2's answer was: "Through interacting with the aliens, she hopes to gain a deeper understanding of the nature of the universe and her place in it. Ye Wenjie gradually realized that the aliens and their technology could not solve her problem." The answer was generic, and the model then drifted into commenting on the novel as a whole, never really addressing the question.
Judging from these answers, some models such as Llama2 [2] may well have seen the relevant novels during pre-training, but they answer poorly when prompted with only a short question based on the title. LongAlpaca has the edge in long-text tasks such as reading academic papers, commenting on global economic trends, and reading novels.
Third, how LongLoRA breaks through the long-text processing bottleneck
In large language models, the main computational cost of processing long text is concentrated in the self-attention mechanism, whose overhead grows quadratically with text length.
To address this, the research team proposed LongLoRA, which approximates global self-attention by grouping tokens and shifting the groups.
Simply put, the tokens of the long text are split into groups, self-attention is computed within each group, and the grouping is shifted between different attention heads. This greatly reduces the amount of computation while still preserving a global receptive field, as illustrated in the sketch below.
The implementation is also remarkably concise: the core change takes only two lines of code.
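To make the grouping-and-shift idea concrete, below is a minimal PyTorch sketch written from the description above rather than taken from the team's repository; the function name, tensor shapes, and the use of `scaled_dot_product_attention` (PyTorch 2.x) are our own assumptions, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def shifted_group_attention(q, k, v, group_size):
    """Grouped self-attention with half the heads shifted by half a group.

    q, k, v: tensors of shape (batch, seq_len, num_heads, head_dim);
    seq_len is assumed to be divisible by group_size.
    """
    B, N, H, D = q.shape
    G = group_size

    def shift(x, offset):
        # Roll the second half of the heads along the token dimension so that
        # their groups straddle the group boundaries of the first half.
        x = x.clone()
        x[:, :, H // 2:] = x[:, :, H // 2:].roll(offset, dims=1)
        return x

    def group(x):
        # (B, N, H, D) -> (B * N // G, H, G, D): each group attends only within itself.
        return x.view(B, N // G, G, H, D).transpose(2, 3).reshape(B * N // G, H, G, D)

    # Shift half the heads by half a group before grouping.
    q, k, v = (shift(t, -G // 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(group(q), group(k), group(v))

    # Undo the grouping, then roll the shifted heads back into place.
    out = out.reshape(B, N // G, H, G, D).transpose(2, 3).reshape(B, N, H, D)
    return shift(out, G // 2)
```

For example, with 8,192 tokens and a group size of 2,048, each token attends within a 2,048-token group instead of across all 8,192 tokens, while the shifted heads pass information across group boundaries; the "two lines of code" mentioned above correspond roughly to the shift-before-attention and shift-back steps.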
LongLoRA also explores low-rank training. The original low-rank methods, such as LoRA [5], do not transfer well to longer text lengths on their own. On top of low-rank training, LongLoRA additionally fine-tunes the embedding and normalization layers, thereby approaching the effect of full fine-tuning.
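As an illustration of combining LoRA with trainable embedding and normalization layers, here is a minimal sketch, not the team's actual code; it assumes a LLaMA-style model already wrapped with LoRA adapters, and the parameter-name substrings `lora_`, `embed_tokens`, and `norm` are assumptions based on common implementations.

```python
def mark_trainable(model):
    """Freeze everything except LoRA adapters, embedding, and normalization layers."""
    for name, param in model.named_parameters():
        if "lora_" in name:
            # Low-rank adapter weights: trainable, as in standard LoRA.
            param.requires_grad = True
        elif "embed_tokens" in name or "norm" in name:
            # LongLoRA additionally fine-tunes embedding and normalization layers.
            param.requires_grad = True
        else:
            # All other base-model weights stay frozen.
            param.requires_grad = False
```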
When extending and training at different text lengths, the effects of LongLoRA, LoRA, and full-parameter fine-tuning can be compared along three dimensions:
1. Perplexity: the original LoRA method degrades as text length grows, while LongLoRA and full-parameter fine-tuning maintain good results across text lengths.
2. Memory consumption: LongLoRA and the original LoRA both save significantly compared with full-parameter fine-tuning. For example, for training at a length of 8k, LongLoRA cuts memory consumption from 46.3GB (full-parameter fine-tuning) to 25.6GB.
3. Training time: for training at a length of 64k, LongLoRA reduces the training time from roughly 90-100 hours with conventional LoRA to 52.4 hours, while full-parameter fine-tuning takes more than 1,000 hours.
A minimalist training recipe, minimal compute and time costs, and excellent accuracy make large-scale use of LongLoRA practical. The related technology and models are all open source, and interested users can deploy them and try for themselves.
Code and Demo address: https://github.com/dvlab-research/LongLoRA
Paper address: https://arxiv.org/pdf/2309.12307.pdf
References
[1] LLaMA team. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023.
[2] Llama 2 team. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
[3] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv:2306.15595, 2023.
[4] Szymon Tworkowski, Konrad Staniszewski, Mikolaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Milos. Focused Transformer: Contrastive training for context scaling. arXiv:2307.03170, 2023.
[5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.