
Still competing over long text? Google's latest paper takes context length straight to... infinity

Author: Silicon Star Man

While everyone was still competing over context window sizes, Google released a paper titled "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention". According to the paper, the team has developed a new attention technique called "Infini-attention", which enables large Transformer models to process inputs of unbounded length with limited computing resources.

In a Transformer model, attention lets the model assign a weight to every other element in the sequence relative to the element at the current position (such as a word or token). The context window, in turn, bounds the reach of the attention mechanism: when computing attention, the model only considers elements within a specific range around the current one (a limited number of positions before and after).
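To make the mechanism concrete, here is a generic, minimal sketch of scaled dot-product attention in NumPy. It illustrates standard attention over one window; it is not code from the paper, and the shapes and toy example are assumptions.

```python
# Generic illustration of scaled dot-product attention (not code from the paper):
# each query position receives a weight over every key position in the window.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d). Returns (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted mix of value vectors

# Toy example: 6 tokens inside one context window, model dimension 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (6, 4)
```

Because every query attends to every key, the score matrix grows quadratically with the window length, which is exactly the cost the paper targets.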

Infini-attention allows the model to process arbitrarily long input sequences while still maintaining access to contextual information, and it no longer needs to discard the attention states of earlier input segments when processing new input. In other words, its context window can be... boundless.

The key technology behind Infini-attention is called compressive memory: a structure that can store and retrieve large amounts of information in a compact form, updating its parameters to capture new information while ensuring the information can be recovered later. In terms of how it operates, compressive memory is loosely analogous to the compressed files we use every day.

The biggest role of the compressive memory is to overcome the quadratic growth in memory footprint and computation time that the Transformer's standard attention mechanism incurs on long sequences: it stores and recalls information with a fixed number of parameters, keeping storage and computation costs within a controllable range. Because the number of parameters does not grow with the input sequence, the length of the input does not affect the model's memory complexity.
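As a rough sketch of what a fixed-size compressive memory can look like, the snippet below accumulates key-value associations into a matrix of constant size and reads them back with queries. It follows the spirit of the linear-attention-style memory the paper describes, but the class name, the `elu_plus_one` nonlinearity, and the exact update rule here are illustrative simplifications rather than the paper's precise formulation.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive nonlinearity for linear-attention-style memories (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Fixed-size associative memory: its cost does not grow with sequence length."""
    def __init__(self, d_key, d_value):
        self.M = np.zeros((d_key, d_value))  # memory matrix, constant size
        self.z = np.zeros((d_key,))          # normalization term

    def update(self, K, V):
        """Write a segment's keys/values into memory. K: (n, d_key), V: (n, d_value)."""
        sK = elu_plus_one(K)
        self.M += sK.T @ V                   # accumulate key-value associations
        self.z += sK.sum(axis=0)

    def retrieve(self, Q):
        """Read out the values associated with the queries. Q: (n, d_key)."""
        sQ = elu_plus_one(Q)
        return (sQ @ self.M) / (sQ @ self.z + 1e-6)[:, None]

# Toy usage: write one 32-token segment, then query with 4 new tokens.
mem = CompressiveMemory(d_key=16, d_value=16)
rng = np.random.default_rng(0)
mem.update(rng.normal(size=(32, 16)), rng.normal(size=(32, 16)))
print(mem.retrieve(rng.normal(size=(4, 16))).shape)  # (4, 16)
```

However long the stream of segments, `M` and `z` never change shape, which is what keeps memory and compute bounded.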

Next, Infini-attention divides the input sequence into a series of short, contiguous subsequences (segments), each of fixed length, so that the model can keep memory requirements and computational complexity low while processing these shorter pieces. This segmentation avoids the challenge of loading and processing an entire, arbitrarily long sequence at once, allowing the model to process the input step by step in a streaming fashion, handling only one or a few segments at a time rather than loading all the data in one go.
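A streaming loop over fixed-length segments might look like the following sketch; the segment length of 2048 and the helper name are arbitrary choices for illustration.

```python
# Illustrative only: split a long token sequence into fixed-length segments
# and hand them to the model one at a time, so memory use stays bounded.
def stream_segments(tokens, segment_len=2048):
    for start in range(0, len(tokens), segment_len):
        yield tokens[start:start + segment_len]

long_input = list(range(10_000))            # stand-in for a very long token sequence
for segment in stream_segments(long_input):
    pass  # each segment gets local attention plus a memory read/write (see below)
```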

Within each segment, the Infini-attention model employs local attention to process the contextual information inside that segment. Local attention restricts the attention computation to the tokens of the current segment, usually in a causal (autoregressive) form: when the model processes the current token, it can only see the tokens before it and none of the tokens that come after.
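Causal local attention within one segment can be sketched as below: a mask blocks every position from attending to tokens that come after it. This is standard masked attention, shown here for illustration rather than taken from the paper.

```python
import numpy as np

def causal_local_attention(Q, K, V):
    """Causal attention within one segment: token i attends only to positions <= i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = "future"
    scores = np.where(mask, -np.inf, scores)          # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 8 tokens in one segment, dimension 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
print(causal_local_attention(x, x, x).shape)  # (8, 16)
```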

To produce the final contextual output, the Infini-attention model combines the long-term memory retrieved from the compressive memory with the context computed by the current local attention. This fusion ensures that the model accounts for both the local dependencies of the current input segment and the long-term context of past inputs.
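The paper describes blending the two streams with a learned gate; the sketch below shows one simple way such a gated combination can be written, with the scalar gate and the function name chosen here for illustration.

```python
import numpy as np

def combine_memory_and_local(A_mem, A_local, gate):
    """Blend memory-retrieved context with the local attention output.

    A_mem, A_local: (n, d) arrays; gate: a learned scalar (here just a float).
    The sigmoid blend below is a simplified sketch of the gated fusion described in the paper.
    """
    g = 1.0 / (1.0 + np.exp(-gate))        # sigmoid squashes the gate into (0, 1)
    return g * A_mem + (1.0 - g) * A_local
```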

With the mechanism in mind, return to the title: an Infini-attention model processes extremely long input sequences in a streaming fashion, consuming them piece by piece while carrying forward a compressed history, rather than loading an entire "infinite" input at once. For the model, this means it can adapt to and process arbitrarily long context within the constraints of limited memory and computing resources.

First, the Infini-attention model was evaluated on long-context language modeling benchmarks and compared against a range of models, including Transformer-XL.


The Infini-attention model achieved far better results than Transformer-XL on both PG19 (a long-document dataset) and Arxiv-math (a mathematics dataset), and reached a 114x memory compression rate, improving model efficiency while keeping perplexity low.

To further validate Infini-attention, the paper modifies a large language model with 1 billion parameters, replacing its multi-head attention (MHA) modules with Infini-attention, and continues pre-training it. For validation, the team asked the model to locate and retrieve a key hidden in inputs of up to 1 million tokens.

In the pre-training phase, the model uses input sequences of only 4K tokens, matching Infini-attention's segment-based processing. After 30,000 pre-training steps, it is fine-tuned on the key retrieval task using inputs of 5K tokens, in order to simulate the longer contexts that may be encountered in real-world applications.

After pre-training and fine-tuning, the team evaluated the model's accuracy at retrieving the key from long inputs of different lengths (from 32K to 1M tokens), with the key placed at different positions (beginning, middle, end). The experiments show that the Infini-attention model successfully retrieves the hidden key in every test scenario, demonstrating excellent handling of extremely long context.

Subsequently, to demonstrate Infini-attention on a model with more parameters, the team pre-trained an 8-billion-parameter large language model modified with Infini-attention, training for 30,000 steps on inputs of 8K tokens. The model was then fine-tuned on the BookSum dataset with the input length set to 32K, which was increased to 500K during evaluation.

Generating book summaries from inputs up to 500,000 tokens long, the Infini-attention model surpasses encoder-decoder models built specifically for summarization, as well as their long-context extensions, achieving a new SOTA (state-of-the-art) result on the BookSum dataset. As the amount of input book text increases, the model's summarization metrics (such as Rouge scores) show a clear upward trend.


An effective memory system is crucial if large language models are to understand long text. The paper does not drastically modify the Transformer's attention mechanism; instead, in something like minimally invasive surgery, it tightly integrates the compressive memory module into the model's standard (vanilla) dot-product attention layer. Even so, it substantially alleviates the problems the Transformer runs into when handling long sequences.

In 2022, DeepMind published a paper, "∞-former: Infinite Memory Transformer", which proposed a model called ∞-former that uses a continuous-space attention mechanism to attend to long-term memory, making the model's attention complexity independent of context length. Methodologically, Infini-attention and ∞-former are somewhat similar: the latter trades precision for memory length, whereas Infini-attention can still find key information in inputs of extreme length, with much higher accuracy.

Ultimately, both Infini-attention and ∞-former are improvements to the Transformer's memory system. But one of the Transformer's major drawbacks is that it cannot handle non-contiguous data. The Transformer was originally designed for continuous text sequences such as natural language; with the rise of applications in image generation, music generation, video generation, and other fields, models must be able to handle non-contiguous data to cope with multimodal data structures. If Google wants to extend its lead in multimodality, it may need to start researching data structures as well.
