Researchers from Google DeepMind and the prominent large-model company Anthropic have jointly introduced an innovative neural compression method called Equal-Info Windows.
The researchers note that as the parameters and capabilities of large language models such as ChatGPT, Gemini, and Claude grow ever more complex, their training costs rise exponentially. If a model could be trained on neurally compressed text instead, training and inference efficiency would improve qualitatively, and ultra-long texts would become easier to process.
However, training directly on neurally compressed data often produces opaque, unstable outputs. For example, simply compressing text with arithmetic coding yields bitstreams from which large language models cannot learn effectively.
Equal-Info Windows, by contrast, splits the text into multiple windows, each compressed into a fixed-length bitstream so that every window carries roughly the same amount of information. This provides a stable mapping from text to bits, making the compressed data far easier for large language models to learn.
Paper: https://arxiv.org/abs/2404.03626
Window splitting
First, Equal-Info Windows uses "window segmentation" to split the original text into a series of consecutive character sequences, each of which is treated as an independent window.
The window size can be adjusted to specific needs, but it is usually fixed in order to simplify the subsequent compression step.
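To make the segmentation step concrete, here is a minimal Python sketch; the character-level splitting and the window size of 16 are illustrative assumptions, not the paper's exact settings:

```python
# Minimal sketch of the "window segmentation" step: cut the text into
# consecutive fixed-length character windows. Window size is illustrative.

def split_into_windows(text: str, window_size: int = 16) -> list[str]:
    """Split text into consecutive fixed-length character windows."""
    return [text[i:i + window_size] for i in range(0, len(text), window_size)]

windows = split_into_windows("Neural compression makes training data denser.")
print(windows)  # ['Neural compressi', 'on makes trainin', 'g data denser.']
```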
In Transformer-based large language models, the self-attention mechanism is computed over the entire sequence, which is extremely time- and compute-intensive for long texts.
Window segmentation reduces this computational burden when handling long-range dependencies: the model can focus on the local context, improving processing speed and efficiency.
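A quick back-of-the-envelope calculation shows the size of this saving; the sequence length and window count below are hypothetical:

```python
# Self-attention does work proportional to the square of the sequence length,
# so splitting one long sequence into w independent windows cuts the pairwise
# work by roughly a factor of w. Numbers here are hypothetical.

seq_len = 8192
num_windows = 16
window_len = seq_len // num_windows             # 512

full_attention_pairs = seq_len ** 2             # 67,108,864 token pairs
windowed_pairs = num_windows * window_len ** 2  # 4,194,304 token pairs

print(f"speedup ~ {full_attention_pairs / windowed_pairs:.0f}x")  # ~ 16x
```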
Window compression
After segmentation, each window is independently compressed into a fixed-length bit string in the "window compression" step. This keeps the storage space and AI compute required as low as possible while preserving the information in the original text.
Each text window is first converted into a numeric sequence, typically by mapping characters to their indices in the vocabulary. These sequences are then fed into an arithmetic coding (AC) compressor, which exploits learned symbol frequencies and patterns in the text to achieve efficient bit-level compression.
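The sketch below illustrates the data flow of this step. Since the paper's coder is a learned model (the M1 model described below), zlib is used here purely as a stand-in compressor, and the vocabulary, window contents, and 256-bit budget are all illustrative assumptions:

```python
import zlib

# Hedged sketch of "window compression": map each window's characters to
# numeric IDs, compress, and pad to a fixed-length bitstream. zlib stands in
# for the paper's learned arithmetic coder (M1); budget/vocab are assumptions.

VOCAB = {ch: i for i, ch in enumerate(sorted(set("abcdefghijklmnopqrstuvwxyz .")))}

def compress_window(window: str, bit_budget: int = 256) -> bytes:
    ids = bytes(VOCAB.get(ch, 0) for ch in window.lower())  # chars -> IDs
    compressed = zlib.compress(ids)
    byte_budget = bit_budget // 8
    assert len(compressed) <= byte_budget, "window too large for bit budget"
    # Pad so every window yields the same number of bits, mirroring the
    # fixed-length bitstreams described above.
    return compressed.ljust(byte_budget, b"\x00")

windows = ["Neural compressi", "on makes trainin", "g data denser."]
bitstreams = [compress_window(w) for w in windows]
print(len(bitstreams[0]) * 8)  # 256 bits per window
```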
During compression, the researchers trained two models, M1 and M2. M1 converts the raw text into a compressed bitstream; this step is the heart of the neural compression, allowing the subsequent pretraining to run on a far more compact data representation.
The M2 model then learns to recover and understand the information of the original text from the compressed bitstream, including how to process and decode the compressed data generated by M1.
At inference time, M2 can likewise generate uncompressed text output from compressed inputs. In other words, M2 not only understands compressed text but can also reverse the compression, restoring the original text or generating new text sequences.
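One plausible reading of this two-model pipeline is sketched below in Python. `M1Compressor` and `M2LanguageModel` are hypothetical placeholder classes: the real M1 is a small language model driving an arithmetic coder, and M2 is the large model trained over the compressed stream:

```python
# Hedged end-to-end sketch of the M1/M2 pipeline. Both classes are toy
# placeholders illustrating the data flow, not the paper's implementation.

class M1Compressor:
    """Stand-in for M1: maps raw text to a shorter compressed token stream."""

    def compress(self, text: str) -> list[int]:
        # Toy scheme: pack two character codes into one token, halving the
        # sequence length (the real M1 uses neural arithmetic coding).
        codes = [ord(c) for c in text] + [0] * (len(text) % 2)
        return [a * 256 + b for a, b in zip(codes[::2], codes[1::2])]

    def decompress(self, tokens: list[int]) -> str:
        chars = [ch for t in tokens for ch in (chr(t // 256), chr(t % 256))]
        return "".join(chars).rstrip("\x00")

class M2LanguageModel:
    """Stand-in for M2: would be pretrained over M1's compressed tokens."""

    def generate(self, compressed_prompt: list[int]) -> list[int]:
        return compressed_prompt  # echo only; a real M2 predicts new tokens

m1, m2 = M1Compressor(), M2LanguageModel()
prompt = m1.compress("Compressed pretraining sketch")
print(m1.decompress(m2.generate(prompt)))  # -> "Compressed pretraining sketch"
```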
Experimental data
To evaluate the method's performance, the researchers compared text compressed with Equal-Info Windows against text processed by conventional tokenizers such as SentencePiece.
The results show that Equal-Info Windows has a clear advantage in reducing sequence length, although its perplexity is slightly higher than that of a subword tokenizer with the same number of model parameters. This means Equal-Info Windows can generate text in fewer autoregressive steps, reducing model inference latency.
In addition, the research team found that Equal-Info Windows performs very well on long texts. Because each compressed window contains roughly the same amount of information, large language models are better able to capture long-range dependencies in text. This is especially important for tasks such as document retrieval and coding problems.