Recently, Luo Hongyin, a research scientist at the Massachusetts Institute of Technology (MIT) and founder of the startup BitEnergy AI, together with his collaborators, proposed an algorithm called linear-complexity multiplication (L-Mul).
The defining feature of L-Mul is that it approximates floating-point multiplication with integer addition, greatly reducing the computational cost of large models by changing how the numerical calculation is carried out.
This kind of "lazy computing" significantly cuts the model's arithmetic workload and can reduce the energy consumption of the relevant operations by up to 95%.
In addition, the algorithm is also suitable for scenarios that do not require extremely high-precision calculations.
The potential business value lies in:
For data centers, a significant reduction in energy consumption means that the same energy budget can support more computing power.
It is worth mentioning that L-Mul can also be used in scenarios that depend on AI chips, such as embodied intelligence and edge computing, including robots, laptops, and mobile phones.
L-Mul also offers a new way of thinking about simplifying chip design.
By removing floating-point multipliers, it makes chips easier to design and manufacture, helping chip makers improve product quality and power stability.
As Luo Hongyin put it: "Based on L-Mul, a chip of the same area is expected to accommodate more computing units, making it possible for a 5nm chip to reach the computing speed of chips built on more advanced process nodes."
An industry insider commented on the study: "L-Mul is essentially a drop-in replacement for all multiplication operations in the model. Although its benchmarks focus mainly on the inference phase, the tests use real-world models such as Llama 3.1 8B and show nearly identical performance on standard benchmarks."
Recently, the related paper was published on the preprint site arXiv under the title "Addition is All You Need for Energy-efficient Language Models" [1].
Figure | The related paper (Source: arXiv)
"Multiplication to Addition"
At present, the development of AI technology is accelerating, and as large neural network models become more widely used, their energy consumption during training and inference is becoming more and more prominent.
At the same time, energy consumption is becoming the biggest bottleneck for new data centers, which is reflected in where data centers are actually located.
Large data centers often require an abundant, stable power supply and low operating costs, so they are mostly located in sparsely populated areas that can provide power security.
For example, the United States National Security Agency built the nation's largest data center in the town of Bluffdale, Utah; in China, Huawei, Tencent, and Apple have all chosen Guizhou Province for their data centers.
As a result, some people have put forward the idea that "the end of AI is natural gas, coal, and nuclear power plants".
While there have been many breakthroughs in optimization and hardware acceleration in cloud data centers, improvements in the core computing operation of floating-point multiplication have been relatively conservative.
In floating-point arithmetic, each number is typically represented by a sign bit, an exponent, and a mantissa. Multiplication operates on these components separately: the exponents are added, the mantissas are multiplied, and the result may then require normalization and rounding.
L-Mul, by contrast, omits the mantissa multiplication and approximates the same result using only integer addition and a few simple shift operations, which significantly reduces computational complexity and energy consumption.
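To make this concrete, here is a minimal Python sketch (our own illustration, not the authors' implementation) of the integer-addition trick applied to fp32 values: the two bit patterns are added as integers, the exponent bias is subtracted once, and a small constant offset stands in for the dropped mantissa product. Sign handling and special values (zero, infinity, NaN) are omitted for clarity.

```python
import struct

def f32_bits(x: float) -> int:
    """Raw IEEE-754 bit pattern of a float32 value."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b: int) -> float:
    """Float32 value decoded from a 32-bit pattern."""
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def l_mul_fp32(x: float, y: float) -> float:
    """Approximate x * y for positive, normal fp32 inputs with one integer addition.
    Adding the bit patterns sums the exponents and the mantissas; any mantissa
    carry spills into the exponent field, and a 2^-4 offset replaces the mantissa
    product that exact multiplication would compute."""
    BIAS = 127 << 23          # exponent bias, positioned at the exponent field
    OFFSET = 1 << (23 - 4)    # 2^-4 expressed in the fp32 mantissa field
    return bits_f32(f32_bits(x) + f32_bits(y) - BIAS + OFFSET)

print(l_mul_fp32(3.0, 5.0), 3.0 * 5.0)   # 14.5 vs 15.0
print(l_mul_fp32(1.7, 2.3), 1.7 * 2.3)   # ~3.83 vs ~3.91
```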
Figure | The process of regular floating-point multiplication and linear-complexity multiplication (L-Mul) between two fp32 numbers (Source: arXiv)
Luo Hongyin and his team tried to improve the inference efficiency of the model by reducing the amount of floating-point computation required for model inference.
In the Transformer model, a large number of tensor multiplication operations are used.
He speculated that many components in the model care only about the relative magnitudes across the dimensions of the input tensor and are not very sensitive to their exact values.
For example, both the attention mechanism and the decoding layer responsible for predicting the next token mainly care about the dimensions with the largest values.
Conversely, the accuracy of multiplication results with small values has a negligible impact on the overall performance of the model.
The attention mechanism works by taking a high-dimensional input vector and searching for similar vectors in a high-dimensional space.
Put simply, it is like marking a coordinate on a map and looking up restaurants near that point: whether somewhere outside the "nearby" area is a 5-hour or a 2-day drive away makes little difference to the searcher.
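A small numerical illustration of this intuition (our own example, not from the paper): perturbing only the small attention scores by a few percent, roughly the scale of error an approximate multiplication might introduce, barely changes the softmax weights, which are dominated by the largest scores.

```python
import numpy as np

def softmax(s: np.ndarray) -> np.ndarray:
    e = np.exp(s - s.max())
    return e / e.sum()

# One query's attention scores: two relevant keys and three irrelevant ones.
scores = np.array([8.0, 7.5, 0.3, -0.2, 0.1])

# Perturb only the small scores by 5%, mimicking approximate-multiplication error.
noisy = scores.copy()
noisy[2:] *= 1.05

print(softmax(scores).round(4))                           # dominated by the first two keys
print(np.abs(softmax(scores) - softmax(noisy)).max())     # change is below 1e-5
```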
Based on this principle, previous research work related to model efficiency, such as model quantization (reducing the number of bits of model parameters) or model pruning (reducing the number of non-zero parameters), has achieved good results.
Could they break free of the constraints of the basic arithmetic operations themselves and attempt something more radical?
So the researchers tried replacing all multiplications with additions and found, after some initial calculations, that accuracy held up. They then carried out theoretical derivation and numerical analysis to explain why it remained accurate.
"The ingenuity of L-Mul is that its mathematical principles are very intuitive, and at the same time correspond to the most concise hardware implementation; Its algorithm complexity is lower than that of 8-bit floating-point multiplication, but it can achieve higher computational accuracy. ”
Experimental Verification: L-Mul excels in a wide range of tasks
In order to verify the accuracy of the L-Mul algorithm and explore the effect of large L-Mul-based models in real-world tasks, the researchers experimented with different models in various benchmarks.
The authors evaluated the Llama-3.1-8b-Instruct, Mistral-7b-v0.3-Instruct, Gemma2-2b-It, and Llava-v1.5-7b models and found that the proposed method can replace different modules in Transformer layers, either with fine-tuning or entirely without training.
On natural language reasoning tasks covering common sense, structured reasoning, and language understanding, the performance loss of the L-Mul-based attention mechanism is about 0.07%.
On visual tasks, L-Mul-based attention improved accuracy by 0.12% across visual question answering, object hallucination, and free-form visual instruction tasks.
Figure | Comparing the error levels of linear-complexity multiplication (L-Mul) with different mantissa sizes against 8-bit floating-point multiplication in various formats (Source: arXiv)
It is worth noting that these experimental results were obtained by directly switching the standard attention of pre-trained large models to a new L-Mul-based mechanism without additional training.
Error estimation and ablation studies show that L-Mul with a 4-bit mantissa can achieve accuracy comparable to float8 e4m3 multiplication without any training, while L-Mul with a 3-bit mantissa outperforms float8 e5m2 multiplication.
Experiments have also shown that fine-tuning can close the performance gap between L-Mul and standard multiplication.
In operations involving attention mechanisms, linear transformations, and element-wise multiplication, replacing all multiplications with 3-bit-mantissa L-Mul gives fine-tuned performance comparable to that of a model using float8 e4m3 precision as the standard.
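As a rough, self-contained numerical check of this trade-off (our own sketch: it uses uniformly sampled operands and simple mantissa truncation rather than the paper's real-model operand distributions and proper fp8 rounding, so the exact figures will differ from the published analysis), one can compare the mean relative error of k-bit-mantissa L-Mul against ordinary multiplication of mantissa-truncated operands:

```python
import math
import random

def trunc_mantissa(x: float, k: int) -> float:
    """Truncate x's fractional mantissa to k bits (a crude stand-in for a low-bit fp format)."""
    m, e = math.frexp(x)                                  # x = m * 2**e, with m in [0.5, 1)
    frac = math.floor((2 * m - 1.0) * 2**k) / 2**k        # fractional mantissa in [0, 1)
    return (1.0 + frac) * 2.0 ** (e - 1)

def l_mul(x: float, y: float, k: int) -> float:
    """L-Mul-style approximate product of two positive numbers with k-bit mantissas."""
    mx, ex = math.frexp(x)
    my, ey = math.frexp(y)
    fx = math.floor((2 * mx - 1.0) * 2**k) / 2**k
    fy = math.floor((2 * my - 1.0) * 2**k) / 2**k
    l = k if k <= 3 else (3 if k == 4 else 4)             # offset exponent: our reading of the paper's l(m) rule
    frac, exp = fx + fy + 2.0 ** (-l), (ex - 1) + (ey - 1)
    while frac >= 1.0:                                    # mantissa carry spills into the exponent
        frac, exp = frac - 1.0, exp + 1
    return (1.0 + frac) * 2.0 ** exp

random.seed(0)
pairs = [(random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)) for _ in range(50_000)]
mean_rel_err = lambda f: sum(abs(f(x, y) - x * y) / (x * y) for x, y in pairs) / len(pairs)

print("L-Mul, 4-bit mantissa:          ", mean_rel_err(lambda x, y: l_mul(x, y, 4)))
print("mul of 3-bit-mantissa operands: ", mean_rel_err(lambda x, y: trunc_mantissa(x, 3) * trunc_mantissa(y, 3)))
print("L-Mul, 3-bit mantissa:          ", mean_rel_err(lambda x, y: l_mul(x, y, 3)))
print("mul of 2-bit-mantissa operands: ", mean_rel_err(lambda x, y: trunc_mantissa(x, 2) * trunc_mantissa(y, 2)))
```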
A "non-mainstream" approach in the field of model acceleration
At present, many large enterprises and startups are actively exploring model acceleration to improve computing efficiency and reduce costs.
To reduce the amount of model computation, the industry is developing 4-bit chips, using model quantization techniques to represent the relevant values with fewer bits.
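As a generic illustration of the quantization idea just mentioned (a sketch of a common symmetric scheme, not any particular vendor's chip format), 4-bit quantization maps each weight to an integer in [-7, 7] plus a per-tensor scale:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: integers in [-7, 7] plus one scale factor."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)   # would occupy 4 bits on real hardware
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int4(w)
print(np.abs(w - dequantize(q, scale)).max())   # rounding error is at most about scale / 2
```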
Other companies optimize model speed and energy consumption by improving how chips read and write data, as in Google's Tensor Processing Unit (TPU).
In addition, researchers are also exploring non-transformer architectures.
L-Mul is "different" compared to the current mainstream approaches in the field of model acceleration. "Our approach can be vertically complemented by other technologies to improve the efficiency of the model, and it is not mutually exclusive," said Luo. ”
For example, after a chip's input/output has been optimized, computational complexity can be further reduced with the L-Mul algorithm; quantized models can use L-Mul to optimize their multiplications; and non-Transformer architectures can also borrow L-Mul's idea to speed up multiplication.
Figure | Comparison of attention mechanisms implemented with 16-bit and 8-bit tensor multiplication and with the L-Mul approximation (Source: arXiv)
In theoretical analysis and numerical simulation, the L-Mul algorithm has shown excellent performance.
Because there is as yet no corresponding hardware instruction, existing hardware cannot run L-Mul directly on floating-point numbers. However, the algorithm could be implemented by adding a single, simple new instruction at the hardware level, which would bring significant energy-efficiency gains.
Reportedly, some research groups have already achieved a degree of energy reduction at the software level using fp32 operations on the central processing unit (CPU).
From the perspective of the broader impact on the field of AI and machine learning, L-Mul is expected to alleviate the problem of computing monopoly in the AI industry, reduce the cost of computing power for users, and provide more computing resources.
"From the perspective of improving numerical computation and numerical stability, we hope to improve the training effect of the model, reduce the storage space, and optimize the inference efficiency," Luo said. ”
Therefore, L-Mul is expected to help improve the efficiency of model training and reduce the training difficulties caused by numerical instability, so as to improve the algorithm efficiency and storage space utilization in the fields of model quantization and model pruning.
We will continue to work on improving the efficiency of large models
Luo received his bachelor's degree from Tsinghua University under the supervision of Prof. Liu Zhiyuan and Prof. Sun Maosong. In 2022, he received his Ph.D. from the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, with a research focus on the self-training of language models. After graduating, he stayed on as a postdoctoral fellow and researcher.
Previously, he built a small model with only 350 million parameters, trained entirely on synthetic data, that outperformed the 175-billion-parameter GPT-3 on text classification tasks [2]. He then worked to transfer this efficiency-oriented approach to generative models.
Figure | Luo Hongyin (Source: Luo Hongyin)
At present, Luo Hongyin mainly focuses on improving the efficiency and reasoning ability of AI.
In terms of efficiency, he pays special attention to deepening the modeling granularity of large models from vectors down to the bit level, and to improving AI efficiency through the co-design of model architecture and computing architecture. In terms of reasoning ability, he focuses on the programming ability and fault tolerance of models.
"Programming allows the model to improve efficiency by reusing the inference process, while fault tolerance enables the model to output at one time, reducing the number of repeated inferences, thereby saving computing resources," said Luo. ”
In the future, he plans to conduct simulation studies on a field-programmable gate array (FPGA) platform to measure the concrete reduction in energy consumption brought by the new numerical computation and provide precise supporting data.
"Our long-term goal is to promote an exponential improvement in the efficiency of large models by addressing the various characteristics and challenges currently faced by large models through numerical computational research." Luo Hongyin said.
Resources:
1.https://arxiv.org/abs/2410.00907
2.http://arxiv.org/pdf/2305.17197
Operation/Typesetting: He Chenlong