Researchers from Google DeepMind and the prominent large-model company Anthropic have jointly introduced an innovative neural compression method called Equal-Info Windows.
The researchers note that as the parameters and capabilities of large language models such as ChatGPT, Gemini, and Claude grow ever more complex, their training costs rise exponentially. If a model could be trained on neurally compressed text instead, training and inference efficiency would improve qualitatively, and ultra-long texts would become easier to process.
However, training directly on neurally compressed data often produces opaque, unstable outputs. For example, simply compressing text with arithmetic coding yields bitstreams from which large language models cannot learn effectively.
Equal-Info Windows, by contrast, splits the text into multiple windows, each compressed into a fixed-length bitstream so that every window carries roughly the same amount of information. This provides a stable mapping from text to bits, making the compressed data far easier for large language models to learn.
Paper: https://arxiv.org/abs/2404.03626
Window splitting
First, Equal-Info Windows uses "window segmentation" to split the original text into a series of consecutive character sequences, each of which is treated as an independent window.
The window size can be adjusted to specific needs, but it is usually fixed in order to simplify the subsequent compression step.
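To make the segmentation step concrete, here is a minimal Python sketch; the character-level splitting and the window size of 16 are illustrative assumptions, not the paper's exact settings:

```python
# Minimal sketch of the "window segmentation" step: cut the text into
# consecutive fixed-length character windows. Window size is illustrative.

def split_into_windows(text: str, window_size: int = 16) -> list[str]:
    """Split text into consecutive fixed-length character windows."""
    return [text[i:i + window_size] for i in range(0, len(text), window_size)]

windows = split_into_windows("Neural compression makes training data denser.")
print(windows)  # ['Neural compressi', 'on makes trainin', 'g data denser.']
```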
In Transformer-based large language models, the self-attention mechanism is computed over the entire sequence, which is extremely time- and compute-intensive for long texts.
Window segmentation reduces this computational burden when handling long-range dependencies: the model can focus on the local context, improving processing speed and efficiency.
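A quick back-of-the-envelope calculation shows the size of this saving; the sequence length and window count below are hypothetical:

```python
# Self-attention does work proportional to the square of the sequence length,
# so splitting one long sequence into w independent windows cuts the pairwise
# work by roughly a factor of w. Numbers here are hypothetical.

seq_len = 8192
num_windows = 16
window_len = seq_len // num_windows             # 512

full_attention_pairs = seq_len ** 2             # 67,108,864 token pairs
windowed_pairs = num_windows * window_len ** 2  # 4,194,304 token pairs

print(f"speedup ~ {full_attention_pairs / windowed_pairs:.0f}x")  # ~ 16x
```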
Window compression
After segmentation, each window is independently compressed into a fixed-length bit string in the "window compression" step. This keeps the storage space and AI compute required as low as possible while preserving the information in the original text.
Each text window is first converted into a numeric sequence, typically by mapping characters to their indices in the vocabulary. These sequences are then fed into an arithmetic coding (AC) compressor, which exploits learned symbol frequencies and patterns in the text to achieve efficient bit-level compression.
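The sketch below illustrates the data flow of this step. Since the paper's coder is a learned model (the M1 model described below), zlib is used here purely as a stand-in compressor, and the vocabulary, window contents, and 256-bit budget are all illustrative assumptions:

```python
import zlib

# Hedged sketch of "window compression": map each window's characters to
# numeric IDs, compress, and pad to a fixed-length bitstream. zlib stands in
# for the paper's learned arithmetic coder (M1); budget/vocab are assumptions.

VOCAB = {ch: i for i, ch in enumerate(sorted(set("abcdefghijklmnopqrstuvwxyz .")))}

def compress_window(window: str, bit_budget: int = 256) -> bytes:
    ids = bytes(VOCAB.get(ch, 0) for ch in window.lower())  # chars -> IDs
    compressed = zlib.compress(ids)
    byte_budget = bit_budget // 8
    assert len(compressed) <= byte_budget, "window too large for bit budget"
    # Pad so every window yields the same number of bits, mirroring the
    # fixed-length bitstreams described above.
    return compressed.ljust(byte_budget, b"\x00")

windows = ["Neural compressi", "on makes trainin", "g data denser."]
bitstreams = [compress_window(w) for w in windows]
print(len(bitstreams[0]) * 8)  # 256 bits per window
```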
During compression, the researchers trained two models, M1 and M2. M1 converts the raw text into a compressed bitstream; this step is the heart of the neural compression, allowing the subsequent pretraining to run on a far more compact data representation.
The M2 model then learns to recover and understand the information of the original text from the compressed bitstream, including how to process and decode the compressed data generated by M1.
At inference time, M2 can likewise generate uncompressed text output from compressed inputs. In other words, M2 not only understands compressed text but can also reverse the compression, restoring the original text or generating new text sequences.
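One plausible reading of this two-model pipeline is sketched below in Python. `M1Compressor` and `M2LanguageModel` are hypothetical placeholder classes: the real M1 is a small language model driving an arithmetic coder, and M2 is the large model trained over the compressed stream:

```python
# Hedged end-to-end sketch of the M1/M2 pipeline. Both classes are toy
# placeholders illustrating the data flow, not the paper's implementation.

class M1Compressor:
    """Stand-in for M1: maps raw text to a shorter compressed token stream."""

    def compress(self, text: str) -> list[int]:
        # Toy scheme: pack two character codes into one token, halving the
        # sequence length (the real M1 uses neural arithmetic coding).
        codes = [ord(c) for c in text] + [0] * (len(text) % 2)
        return [a * 256 + b for a, b in zip(codes[::2], codes[1::2])]

    def decompress(self, tokens: list[int]) -> str:
        chars = [ch for t in tokens for ch in (chr(t // 256), chr(t % 256))]
        return "".join(chars).rstrip("\x00")

class M2LanguageModel:
    """Stand-in for M2: would be pretrained over M1's compressed tokens."""

    def generate(self, compressed_prompt: list[int]) -> list[int]:
        return compressed_prompt  # echo only; a real M2 predicts new tokens

m1, m2 = M1Compressor(), M2LanguageModel()
prompt = m1.compress("Compressed pretraining sketch")
print(m1.decompress(m2.generate(prompt)))  # -> "Compressed pretraining sketch"
```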
Experimental data
To evaluate the method's performance, the researchers compared text compressed with Equal-Info Windows against text processed by conventional tokenizers such as SentencePiece.
The results show that Equal-Info Windows has a clear advantage in reducing sequence length, although its perplexity is slightly higher than that of a subword tokenizer with the same number of model parameters. This means Equal-Info Windows can generate text in fewer autoregressive steps, reducing model inference latency.
In addition, the research team found that Equal-Info Windows performs very well on long texts. Because each compressed window contains roughly the same amount of information, large language models are better able to capture long-range dependencies in text. This is especially important for tasks such as document retrieval and coding problems.