Recently, Luo Hongyin, a research scientist at the Massachusetts Institute of Technology (MIT) and founder of the startup BitEnergy AI, together with his collaborators, proposed an algorithm called linear-complexity multiplication (L-Mul).
The defining feature of L-Mul is that it approximates floating-point multiplication with integer addition, greatly reducing the computational cost of large models by changing how the numerical calculation is carried out.
This kind of "lazy computing" significantly cuts the model's arithmetic workload and can reduce the energy consumption of the relevant operations by up to 95%.
In addition, the algorithm is also suitable for scenarios that do not require extremely high-precision calculations.
The potential business value lies in:
For data centers, a significant reduction in energy consumption means that the same energy budget can support more computing power.
It is worth mentioning that L-Mul can also be used in scenarios that depend on AI chips, such as embodied intelligence and edge computing, including robots, laptops, and mobile phones.
L-Mul also offers a new way of thinking about simplifying chip design.
By removing floating-point multipliers, it makes chips easier to design and manufacture, helping chip makers improve product quality and power stability.
As Luo Hongyin put it: "Based on L-Mul, a chip of the same area is expected to accommodate more computing units, making it possible for a 5nm chip to reach the computing speed of chips built on more advanced process nodes."
An industry insider commented on the study: "L-Mul is essentially a drop-in replacement for all multiplication operations in the model. Although its benchmarks focus mainly on the inference phase, the tests use real-world models such as Llama 3.1 8B and show nearly identical performance on standard benchmarks."
Recently, the related paper was published on the preprint site arXiv under the title "Addition is All You Need for Energy-efficient Language Models" [1].
Figure | The related paper (Source: arXiv)
"Multiplication to Addition"
At present, the development of AI technology is accelerating, and as large neural network models become more widely used, their energy consumption during training and inference is becoming more and more prominent.
At the same time, energy consumption is becoming the biggest bottleneck for new data centers, which is reflected in where data centers are actually located.
Large data centers often require an abundant, stable power supply and low operating costs, so they are mostly located in sparsely populated areas that can provide power security.
For example, the United States National Security Agency built the nation's largest data center in the town of Bluffdale, Utah; in China, Huawei, Tencent, and Apple have all chosen Guizhou Province for their data centers.
As a result, some people have put forward the idea that "the end of AI is natural gas, coal, and nuclear power plants".
While there have been many breakthroughs in optimization and hardware acceleration in cloud data centers, improvements in the core computing operation of floating-point multiplication have been relatively conservative.
In floating-point arithmetic, each number is typically represented by a sign bit, an exponent, and a mantissa. Multiplication operates on these components separately: the exponents are added, the mantissas are multiplied, and the result may then require normalization and rounding.
L-Mul, by contrast, omits the mantissa multiplication and approximates the same result using only integer addition and a few simple shift operations, which significantly reduces computational complexity and energy consumption.
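To make this concrete, here is a minimal Python sketch (our own illustration, not the authors' implementation) of the integer-addition trick applied to fp32 values: the two bit patterns are added as integers, the exponent bias is subtracted once, and a small constant offset stands in for the dropped mantissa product. Sign handling and special values (zero, infinity, NaN) are omitted for clarity.

```python
import struct

def f32_bits(x: float) -> int:
    """Raw IEEE-754 bit pattern of a float32 value."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b: int) -> float:
    """Float32 value decoded from a 32-bit pattern."""
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def l_mul_fp32(x: float, y: float) -> float:
    """Approximate x * y for positive, normal fp32 inputs with one integer addition.
    Adding the bit patterns sums the exponents and the mantissas; any mantissa
    carry spills into the exponent field, and a 2^-4 offset replaces the mantissa
    product that exact multiplication would compute."""
    BIAS = 127 << 23          # exponent bias, positioned at the exponent field
    OFFSET = 1 << (23 - 4)    # 2^-4 expressed in the fp32 mantissa field
    return bits_f32(f32_bits(x) + f32_bits(y) - BIAS + OFFSET)

print(l_mul_fp32(3.0, 5.0), 3.0 * 5.0)   # 14.5 vs 15.0
print(l_mul_fp32(1.7, 2.3), 1.7 * 2.3)   # ~3.83 vs ~3.91
```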
Figure | The process of regular floating-point multiplication and linear-complexity multiplication (L-Mul) between two fp32 numbers (Source: arXiv)
Luo Hongyin and his team tried to improve the inference efficiency of the model by reducing the amount of floating-point computation required for model inference.
In the Transformer model, a large number of tensor multiplication operations are used.
He speculated that many components in the model care only about the relative magnitudes across the dimensions of the input tensor and are not very sensitive to their exact values.
For example, both the attention mechanism and the decoding layer responsible for predicting the next token mainly care about the dimensions with the largest values.
Conversely, the accuracy of multiplication results with small values has a negligible impact on the overall performance of the model.
The attention mechanism works by taking a high-dimensional input vector and searching for similar vectors in a high-dimensional space.
Put simply, it is like marking a coordinate on a map and looking up restaurants near that point: whether somewhere outside the "nearby" area is a 5-hour or a 2-day drive away makes little difference to the searcher.
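A small numerical illustration of this intuition (our own example, not from the paper): perturbing only the small attention scores by a few percent, roughly the scale of error an approximate multiplication might introduce, barely changes the softmax weights, which are dominated by the largest scores.

```python
import numpy as np

def softmax(s: np.ndarray) -> np.ndarray:
    e = np.exp(s - s.max())
    return e / e.sum()

# One query's attention scores: two relevant keys and three irrelevant ones.
scores = np.array([8.0, 7.5, 0.3, -0.2, 0.1])

# Perturb only the small scores by 5%, mimicking approximate-multiplication error.
noisy = scores.copy()
noisy[2:] *= 1.05

print(softmax(scores).round(4))                           # dominated by the first two keys
print(np.abs(softmax(scores) - softmax(noisy)).max())     # change is below 1e-5
```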
Based on this principle, previous research work related to model efficiency, such as model quantization (reducing the number of bits of model parameters) or model pruning (reducing the number of non-zero parameters), has achieved good results.
Could they break free of the constraints of the basic arithmetic operations themselves and attempt something more radical?
So the researchers tried replacing all multiplications with additions and found, after some initial calculations, that accuracy held up. They then carried out theoretical derivation and numerical analysis to explain why it remained accurate.
"The ingenuity of L-Mul is that its mathematical principles are very intuitive, and at the same time correspond to the most concise hardware implementation; Its algorithm complexity is lower than that of 8-bit floating-point multiplication, but it can achieve higher computational accuracy. ”
Experimental Verification: L-Mul excels in a wide range of tasks
In order to verify the accuracy of the L-Mul algorithm and explore the effect of large L-Mul-based models in real-world tasks, the researchers experimented with different models in various benchmarks.
The authors evaluated the Llama-3.1-8b-Instruct, Mistral-7b-v0.3-Instruct, Gemma2-2b-It, and Llava-v1.5-7b models and found that the proposed method can replace different modules in Transformer layers, either with fine-tuning or entirely without training.
On natural language reasoning tasks covering common sense, structured reasoning, and language understanding, the performance loss of the L-Mul-based attention mechanism is about 0.07%.
On visual tasks, L-Mul-based attention improved accuracy by 0.12% across visual question answering, object hallucination, and free-form visual instruction tasks.
Figure | Comparing the error levels of linear-complexity multiplication (L-Mul) with different mantissa sizes against 8-bit floating-point multiplication in various formats (Source: arXiv)
It is worth noting that these experimental results were obtained by directly switching the standard attention of pre-trained large models to a new L-Mul-based mechanism without additional training.
Error estimation and ablation studies show that L-Mul with a 4-bit mantissa can achieve accuracy comparable to float8 e4m3 multiplication without any training, while L-Mul with a 3-bit mantissa outperforms float8 e5m2 multiplication.
Experiments have also shown that fine-tuning can close the performance gap between L-Mul and standard multiplication.
In operations involving attention mechanisms, linear transformations, and element-wise multiplication, replacing all multiplications with 3-bit-mantissa L-Mul gives fine-tuned performance comparable to that of a model using float8 e4m3 precision as the standard.
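As a rough, self-contained numerical check of this trade-off (our own sketch: it uses uniformly sampled operands and simple mantissa truncation rather than the paper's real-model operand distributions and proper fp8 rounding, so the exact figures will differ from the published analysis), one can compare the mean relative error of k-bit-mantissa L-Mul against ordinary multiplication of mantissa-truncated operands:

```python
import math
import random

def trunc_mantissa(x: float, k: int) -> float:
    """Truncate x's fractional mantissa to k bits (a crude stand-in for a low-bit fp format)."""
    m, e = math.frexp(x)                                  # x = m * 2**e, with m in [0.5, 1)
    frac = math.floor((2 * m - 1.0) * 2**k) / 2**k        # fractional mantissa in [0, 1)
    return (1.0 + frac) * 2.0 ** (e - 1)

def l_mul(x: float, y: float, k: int) -> float:
    """L-Mul-style approximate product of two positive numbers with k-bit mantissas."""
    mx, ex = math.frexp(x)
    my, ey = math.frexp(y)
    fx = math.floor((2 * mx - 1.0) * 2**k) / 2**k
    fy = math.floor((2 * my - 1.0) * 2**k) / 2**k
    l = k if k <= 3 else (3 if k == 4 else 4)             # offset exponent: our reading of the paper's l(m) rule
    frac, exp = fx + fy + 2.0 ** (-l), (ex - 1) + (ey - 1)
    while frac >= 1.0:                                    # mantissa carry spills into the exponent
        frac, exp = frac - 1.0, exp + 1
    return (1.0 + frac) * 2.0 ** exp

random.seed(0)
pairs = [(random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)) for _ in range(50_000)]
mean_rel_err = lambda f: sum(abs(f(x, y) - x * y) / (x * y) for x, y in pairs) / len(pairs)

print("L-Mul, 4-bit mantissa:          ", mean_rel_err(lambda x, y: l_mul(x, y, 4)))
print("mul of 3-bit-mantissa operands: ", mean_rel_err(lambda x, y: trunc_mantissa(x, 3) * trunc_mantissa(y, 3)))
print("L-Mul, 3-bit mantissa:          ", mean_rel_err(lambda x, y: l_mul(x, y, 3)))
print("mul of 2-bit-mantissa operands: ", mean_rel_err(lambda x, y: trunc_mantissa(x, 2) * trunc_mantissa(y, 2)))
```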
A "non-mainstream" approach in the field of model acceleration
At present, many large enterprises and startups are actively exploring model acceleration to improve computing efficiency and reduce costs.
To reduce the amount of model computation, the industry is developing 4-bit chips, using model quantization techniques to represent the relevant values with fewer bits.
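As a generic illustration of the quantization idea just mentioned (a sketch of a common symmetric scheme, not any particular vendor's chip format), 4-bit quantization maps each weight to an integer in [-7, 7] plus a per-tensor scale:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: integers in [-7, 7] plus one scale factor."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)   # would occupy 4 bits on real hardware
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int4(w)
print(np.abs(w - dequantize(q, scale)).max())   # rounding error is at most about scale / 2
```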
Other companies optimize model speed and energy consumption by improving how chips read and write data, as in Google's Tensor Processing Unit (TPU).
In addition, researchers are also exploring non-transformer architectures.
L-Mul is "different" compared to the current mainstream approaches in the field of model acceleration. "Our approach can be vertically complemented by other technologies to improve the efficiency of the model, and it is not mutually exclusive," said Luo. ”
For example, after a chip's input/output has been optimized, computational complexity can be further reduced with the L-Mul algorithm; quantized models can use L-Mul to optimize their multiplications; and non-Transformer architectures can also borrow L-Mul's idea to speed up multiplication.
Figure | Comparison of attention mechanisms implemented with 16-bit and 8-bit tensor multiplication and with the L-Mul approximation (Source: arXiv)
In theoretical analysis and numerical simulation, the L-Mul algorithm has shown excellent performance.
Because there is as yet no corresponding hardware instruction, existing hardware cannot run L-Mul directly on floating-point numbers. However, the algorithm could be implemented by adding a single, simple new instruction at the hardware level, which would bring significant energy-efficiency gains.
Reportedly, some research groups have already achieved a degree of energy reduction at the software level using fp32 operations on the central processing unit (CPU).
From the perspective of the broader impact on the field of AI and machine learning, L-Mul is expected to alleviate the problem of computing monopoly in the AI industry, reduce the cost of computing power for users, and provide more computing resources.
"From the perspective of improving numerical computation and numerical stability, we hope to improve the training effect of the model, reduce the storage space, and optimize the inference efficiency," Luo said. ”
Therefore, L-Mul is expected to help improve the efficiency of model training and reduce the training difficulties caused by numerical instability, so as to improve the algorithm efficiency and storage space utilization in the fields of model quantization and model pruning.
We will continue to work on improving the efficiency of large models
Luo received his bachelor's degree from Tsinghua University under the supervision of Prof. Liu Zhiyuan and Prof. Sun Maosong. In 2022, he received his Ph.D. from the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, with a research focus on the self-training of language models. After graduating, he stayed on as a postdoctoral fellow and researcher.
Previously, he built a small model with only 350 million parameters, trained entirely on synthetic data, that outperformed the 175-billion-parameter GPT-3 on text classification tasks [2]. He then worked to transfer this efficiency-oriented approach to generative models.
Figure | Luo Hongyin (Source: Luo Hongyin)
At present, Luo Hongyin mainly focuses on improving the efficiency and reasoning ability of AI.
In terms of efficiency, he pays special attention to deepening the modeling granularity of large models from vectors down to the bit level, and to improving AI efficiency through the co-design of model architecture and computing architecture. In terms of reasoning ability, he focuses on the programming ability and fault tolerance of models.
"Programming allows the model to improve efficiency by reusing the inference process, while fault tolerance enables the model to output at one time, reducing the number of repeated inferences, thereby saving computing resources," said Luo. ”
In the future, he plans to conduct simulation studies on a field-programmable gate array (FPGA) platform to measure the concrete reduction in energy consumption brought by the new numerical computation and provide precise supporting data.
"Our long-term goal is to promote an exponential improvement in the efficiency of large models by addressing the various characteristics and challenges currently faced by large models through numerical computational research." Luo Hongyin said.
Resources:
1.https://arxiv.org/abs/2410.00907
2.http://arxiv.org/pdf/2305.17197
Operation/Typesetting: He Chenlong