
Li Mou: When large model inference hits a computing power bottleneck, how do you optimize the engineering?


Author | Li Zhongliang

Since OpenAI released ChatGPT, the striking capabilities of large language models have drawn ever more attention and capital into the field. At the same time, model parameter counts and sequence lengths have grown exponentially in recent years, and computing power bottlenecks have followed.

At the AICon Global Artificial Intelligence Development and Application Conference & Large Model Application Ecology Exhibition 2024, InfoQ invited Li Mou, a senior algorithm expert at 01.AI, to share the optimization techniques 01.AI used while building the online inference service for its Yi models, guided by the computing power requirements and model structure of large models. To give the audience a preview of the talk, we interviewed Mr. Li in advance; the following is a summary of the conversation.

InfoQ: Your talk covers the computing power demand and growth trend of large models. Can you elaborate on the main computing power challenges that large models currently face during inference? And given this rapidly growing demand, do you think current technology and resources are sufficient to cope with it?

Li Mou: Large model computation falls into two stages, training and inference, and their emphasis on computing power differs. Training focuses on throughput and requires large-scale, highly scalable, low-power distributed computing clusters; inference focuses on latency and requires powerful computing chips and high-speed memory access technologies. Since deep learning and large models became popular, this demand for computing power has grown exponentially in recent years, which is a huge challenge for hardware manufacturers and power suppliers.

InfoQ: What do you think are the structural differences between traditional models and large language models, and how does inference optimization differ between them?

Li Mou: Traditional models, such as the CNN and other networks used in vision, NLP, and ASR, are characterized by complex structures, many operator types, and many model variants, and each software framework has its own model description language and model format. The vast majority of large language models, by contrast, are based on the Transformer architecture, built by stacking multiple Transformer Blocks in series: the network structure is simple, but the number of parameters is huge.

InfoQ: Distributed parallel acceleration is a common measure in large model inference. How does 01.AI approach it?

Li Mou: To put it simply, the main distributed-parallel inference optimizations are tensor parallelism and context parallelism, which partition the computation along the model dimension and the input sequence dimension respectively, and use multiple devices to compute in parallel to achieve acceleration.
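To make the tensor-parallel idea concrete, here is a toy sketch (my own illustration, not 01.AI's implementation): a linear layer's weight matrix is split along its output dimension across two simulated "devices", each shard is multiplied independently, and the partial results are concatenated. On real GPUs the same split applies, plus a communication step to gather the shards.

```python
# A minimal numpy sketch of tensor (column) parallelism for one linear layer.
# The "devices" are simulated with ordinary arrays; shapes are illustrative.
import numpy as np

def column_parallel_linear(x, weight, num_devices=2):
    """Split the weight along its output dimension, compute each shard
    independently (as separate devices would), then concatenate the results."""
    shards = np.split(weight, num_devices, axis=1)      # one shard per device
    partial_outputs = [x @ shard for shard in shards]   # parallel matmuls
    return np.concatenate(partial_outputs, axis=-1)     # gather the results

x = np.random.randn(4, 512)        # (batch, hidden)
w = np.random.randn(512, 2048)     # (hidden, intermediate)
assert np.allclose(column_parallel_linear(x, w), x @ w)
```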

InfoQ: The memory consumption of large models is often an important consideration during inference. Do you have any optimization strategies or experiences for memory management?

Li Mou: The memory consumption of large models comes mainly from loading the model weights themselves and from the Key/Value cache in the Transformer Blocks. On top of that, applying PagedAttention to the Key/Value matrices greatly improves memory utilization; when a task is idle, we can even temporarily swap its Key/Value matrices out to other memory areas and swap them back when needed, trading time for space.
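The swapping idea can be sketched roughly as below. This is a simplified illustration with hypothetical class and method names, not 01.AI's code; production engines manage the cache at block granularity (as in PagedAttention) rather than per request.

```python
# Sketch of swapping an idle request's KV cache out of GPU memory and
# restoring it later (time traded for space). Names are hypothetical.
import torch

class KVCacheManager:
    def __init__(self):
        self.gpu_cache = {}   # request_id -> (key_tensor, value_tensor) on GPU
        self.cpu_cache = {}   # request_id -> (key_tensor, value_tensor) on CPU

    def swap_out(self, request_id):
        """Move an idle request's KV tensors to host memory to free GPU space."""
        k, v = self.gpu_cache.pop(request_id)
        self.cpu_cache[request_id] = (k.cpu(), v.cpu())

    def swap_in(self, request_id, device="cuda"):
        """Bring the KV tensors back when the request becomes active again."""
        k, v = self.cpu_cache.pop(request_id)
        self.gpu_cache[request_id] = (k.to(device), v.to(device))
```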

InfoQ: When faced with computing bottlenecks, there are sometimes trade-offs, such as sacrificing a certain amount of model accuracy for faster inference. How do you weigh these decisions? Are there general guidelines?

Li Mou: Intuitively, the larger a model's parameter count, the greater its information redundancy. Low-precision quantization is already a common optimization in traditional small-model inference, and it applies all the more to language models with far larger parameter counts. 01.AI's low-precision quantization covers the entire training and inference pipeline, so from the inference side it is lossless and no separate verification of that step is needed. From a production standpoint, I think it is perfectly acceptable if quantization can more than double the cost-effectiveness of the service while keeping accuracy on mainstream task evaluations essentially unchanged (a drop of at most a few tenths of a point).
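As a rough illustration of what low-precision weight quantization does, the sketch below applies generic per-channel symmetric INT8 quantization with numpy; this is a textbook scheme for exposition only, not 01.AI's training-aware pipeline.

```python
# Per-channel symmetric INT8 weight quantization: a generic illustration of
# trading precision for memory and bandwidth, not 01.AI's actual scheme.
import numpy as np

def quantize_int8(weight):
    """Quantize each output channel (column) to int8 with its own scale."""
    scale = np.abs(weight).max(axis=0) / 127.0           # one scale per column
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 2048).astype(np.float32)
q, s = quantize_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```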

InfoQ: Another challenge large models face during inference is latency, which is a critical metric, especially for real-time or interactive applications. How do you approach optimizing inference latency?

Li Mou: Optimizing latency is trickier than optimizing throughput. The simplest route is to buy hardware with more powerful compute, or to reduce latency at the hardware-design level. At the software level, for NVIDIA GPUs for example, you can develop more efficient CUDA kernels or use multi-GPU parallelism and other techniques, but this kind of optimization often carries a large manpower and time cost.
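One software-level example of why custom kernels help is operator fusion: merging adjacent elementwise ops into one kernel saves launches and memory round trips. The sketch below uses torch.compile to fuse a bias add with GELU; it is a generic illustration under that assumption, not 01.AI's hand-written kernels.

```python
# Kernel fusion as one software-level latency optimization: torch.compile can
# fuse elementwise ops (bias add + GELU here) into fewer kernel launches.
import torch

def bias_gelu(x, bias):
    return torch.nn.functional.gelu(x + bias)   # two ops, eager mode

fused_bias_gelu = torch.compile(bias_gelu)       # compiled/fused on first call

x = torch.randn(1024, 4096)
b = torch.randn(4096)
torch.testing.assert_close(fused_bias_gelu(x, b), bias_gelu(x, b))
```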

InfoQ: In addition to hardware accelerators and distributed parallel acceleration, are there any other types of acceleration techniques or optimizations that can be used to alleviate the computing pressure of large model inference?

Li Mou: There are many technical points in this area, and we will share them at the AICon Global Artificial Intelligence Development and Application Conference & Large Model Application Ecology Exhibition 2024 on May 17. You are welcome to follow along.

InfoQ: Do you have different inference optimization strategies for tasks of different sizes and complexity? Can you share some experience of adapting your strategy to the needs of the task?

Li Mou: Tasks of different complexity use different amounts and ratios of hardware. For example, for the same Yi-34B model we deploy two hardware clusters (a low-configuration version and a high-configuration version, with different computing power and cost) and decide which cluster serves a request based on the input length of the user's online request, balancing user experience, service pressure, and service cost.
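To make the routing idea concrete, here is a hypothetical dispatcher: the threshold and endpoint names are invented for illustration, but the logic, choosing a cluster by input length, follows the strategy described above.

```python
# Hypothetical request router: short prompts go to the low-configuration
# cluster, long prompts to the high-configuration one. Threshold and endpoint
# names are illustrative assumptions, not 01.AI's production values.
LOW_COST_CLUSTER = "http://yi-34b-low.internal/v1"
HIGH_PERF_CLUSTER = "http://yi-34b-high.internal/v1"
LENGTH_THRESHOLD = 2048   # tokens; an assumed cut-over point

def route_request(prompt_tokens: int) -> str:
    """Pick a serving cluster based on the request's input length."""
    if prompt_tokens <= LENGTH_THRESHOLD:
        return LOW_COST_CLUSTER
    return HIGH_PERF_CLUSTER

print(route_request(512))    # -> low-configuration cluster
print(route_request(8000))   # -> high-configuration cluster
```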

InfoQ: Looking ahead, what technological breakthroughs or development directions do you see for addressing the current computing power bottleneck in large model inference?

Li Mou: First, there are already some related products in China, but these specialized chips and their supporting software stacks have not yet formed a healthy ecosystem in the market; the lack of users and of consensus is a challenge for that ecosystem's development. Second, as the demand for computing power from large models and AI grows and computing clusters keep getting larger, power supply in some regions may become an issue, which may in turn drive the development of clean energy and efficient power-generation technologies (such as wind power and controlled nuclear fusion).

Speaker Introduction:

Li Mou is a senior algorithm expert at 01.AI and a graduate of Harbin Institute of Technology. He heads the online inference service for 01.AI's large models and previously served as a technical expert at Alibaba DAMO Academy and in the EI Service Product Department of Huawei Cloud. He has long worked on the R&D and optimization of AI model inference and training, has led teams that built a general-purpose inference engine and underlying acceleration libraries, and ranked first in inference performance on the Stanford DAWNBench GPU leaderboard.

