
Li Mou: When large model inference hits a computing power bottleneck, how do you optimize the engineering?


Author | Li Zhongliang

Since OpenAI released ChatGPT, the striking capabilities of large language models have drawn ever more attention and capital into the field. At the same time, model parameter counts and sequence lengths have grown exponentially in recent years, and computing power bottlenecks have followed.

At the AICon Global Artificial Intelligence Development and Application Conference & Large Model Application Ecology Exhibition 2024, InfoQ invited Li Mou, a senior algorithm expert at 01.AI, to share the optimization techniques 01.AI used while building the online inference service for its Yi models, guided by the computing power requirements and model structure of large models. To give the audience a preview of the talk, we interviewed Mr. Li in advance; the following is a summary of the conversation.

InfoQ: Your talk covers the computing power demand and growth trend of large models. Can you elaborate on the main computing power challenges that large models currently face during inference? And given this rapidly growing demand, do you think current technology and resources are sufficient to cope with it?

Li Mou: Large model computation falls into two stages, training and inference, and their emphasis on computing power differs. Training focuses on throughput and requires large-scale, highly scalable, low-power distributed computing clusters; inference focuses on latency and requires powerful computing chips and high-speed memory access technologies. Since deep learning and large models became popular, this demand for computing power has grown exponentially in recent years, which is a huge challenge for hardware manufacturers and power suppliers.

InfoQ: What do you think are the structural differences between traditional models and large language models, and how does inference optimization differ between them?

Li Mou: Traditional models, such as the CNN and other networks used in vision, NLP, and ASR, are characterized by complex structures, many operator types, and many model variants, and each software framework has its own model description language and model format. The vast majority of large language models, by contrast, are based on the Transformer architecture, built by stacking multiple Transformer Blocks in series: the network structure is simple, but the number of parameters is huge.

InfoQ: Distributed parallel acceleration is a common measure in large model inference. How does 01.AI approach it?

Li Mou: To put it simply, the main distributed-parallel inference optimizations are tensor parallelism and context parallelism, which partition the computation along the model dimension and the input sequence dimension respectively, and use multiple devices to compute in parallel to achieve acceleration.
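To make the tensor-parallel idea concrete, here is a toy sketch (my own illustration, not 01.AI's implementation): a linear layer's weight matrix is split along its output dimension across two simulated "devices", each shard is multiplied independently, and the partial results are concatenated. On real GPUs the same split applies, plus a communication step to gather the shards.

```python
# A minimal numpy sketch of tensor (column) parallelism for one linear layer.
# The "devices" are simulated with ordinary arrays; shapes are illustrative.
import numpy as np

def column_parallel_linear(x, weight, num_devices=2):
    """Split the weight along its output dimension, compute each shard
    independently (as separate devices would), then concatenate the results."""
    shards = np.split(weight, num_devices, axis=1)      # one shard per device
    partial_outputs = [x @ shard for shard in shards]   # parallel matmuls
    return np.concatenate(partial_outputs, axis=-1)     # gather the results

x = np.random.randn(4, 512)        # (batch, hidden)
w = np.random.randn(512, 2048)     # (hidden, intermediate)
assert np.allclose(column_parallel_linear(x, w), x @ w)
```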

InfoQ: The memory consumption of large models is often an important consideration during inference. Do you have any optimization strategies or experiences for memory management?

Li Mou: The memory consumption of large models comes mainly from loading the model weights themselves and from the Key/Value cache in the Transformer Blocks. On top of that, applying PagedAttention to the Key/Value matrices greatly improves memory utilization; when a task is idle, we can even temporarily swap its Key/Value matrices out to other memory areas and swap them back when needed, trading time for space.
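The swapping idea can be sketched roughly as below. This is a simplified illustration with hypothetical class and method names, not 01.AI's code; production engines manage the cache at block granularity (as in PagedAttention) rather than per request.

```python
# Sketch of swapping an idle request's KV cache out of GPU memory and
# restoring it later (time traded for space). Names are hypothetical.
import torch

class KVCacheManager:
    def __init__(self):
        self.gpu_cache = {}   # request_id -> (key_tensor, value_tensor) on GPU
        self.cpu_cache = {}   # request_id -> (key_tensor, value_tensor) on CPU

    def swap_out(self, request_id):
        """Move an idle request's KV tensors to host memory to free GPU space."""
        k, v = self.gpu_cache.pop(request_id)
        self.cpu_cache[request_id] = (k.cpu(), v.cpu())

    def swap_in(self, request_id, device="cuda"):
        """Bring the KV tensors back when the request becomes active again."""
        k, v = self.cpu_cache.pop(request_id)
        self.gpu_cache[request_id] = (k.to(device), v.to(device))
```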

InfoQ: When faced with computing bottlenecks, there are sometimes trade-offs, such as sacrificing a certain amount of model accuracy for faster inference. How do you weigh these decisions? Are there general guidelines?

Li Mou: Intuitively, the larger a model's parameter count, the greater its information redundancy. Low-precision quantization is already a common optimization in traditional small-model inference, and it applies all the more to language models with far larger parameter counts. 01.AI's low-precision quantization covers the entire training and inference pipeline, so from the inference side it is lossless and no separate verification of that step is needed. From a production standpoint, I think it is perfectly acceptable if quantization can more than double the cost-effectiveness of the service while keeping accuracy on mainstream task evaluations essentially unchanged (a drop of at most a few tenths of a point).
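As a rough illustration of what low-precision weight quantization does, the sketch below applies generic per-channel symmetric INT8 quantization with numpy; this is a textbook scheme for exposition only, not 01.AI's training-aware pipeline.

```python
# Per-channel symmetric INT8 weight quantization: a generic illustration of
# trading precision for memory and bandwidth, not 01.AI's actual scheme.
import numpy as np

def quantize_int8(weight):
    """Quantize each output channel (column) to int8 with its own scale."""
    scale = np.abs(weight).max(axis=0) / 127.0           # one scale per column
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 2048).astype(np.float32)
q, s = quantize_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```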

InfoQ: Another challenge large models face during inference is latency, which is a critical metric, especially for real-time or interactive applications. How do you approach optimizing inference latency?

Li Mou: Optimizing latency is trickier than optimizing throughput. The simplest route is to buy hardware with more powerful compute, or to reduce latency at the hardware-design level. At the software level, for NVIDIA GPUs for example, you can develop more efficient CUDA kernels or use multi-GPU parallelism and other techniques, but this kind of optimization often carries a large manpower and time cost.
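One software-level example of why custom kernels help is operator fusion: merging adjacent elementwise ops into one kernel saves launches and memory round trips. The sketch below uses torch.compile to fuse a bias add with GELU; it is a generic illustration under that assumption, not 01.AI's hand-written kernels.

```python
# Kernel fusion as one software-level latency optimization: torch.compile can
# fuse elementwise ops (bias add + GELU here) into fewer kernel launches.
import torch

def bias_gelu(x, bias):
    return torch.nn.functional.gelu(x + bias)   # two ops, eager mode

fused_bias_gelu = torch.compile(bias_gelu)       # compiled/fused on first call

x = torch.randn(1024, 4096)
b = torch.randn(4096)
torch.testing.assert_close(fused_bias_gelu(x, b), bias_gelu(x, b))
```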

InfoQ: In addition to hardware accelerators and distributed parallel acceleration, are there any other types of acceleration techniques or optimizations that can be used to alleviate the computing pressure of large model inference?

Li Mou: There are many technical points in this area, and we will share them at the AICon Global Artificial Intelligence Development and Application Conference & Large Model Application Ecology Exhibition 2024 on May 17. You are welcome to follow along.

InfoQ: Do you have different inference optimization strategies for tasks of different sizes and complexity? Can you share some experience of adapting your strategy to the needs of the task?

Li Mou: Tasks of different complexity use different amounts and ratios of hardware. For example, for the same Yi-34B model we deploy two hardware clusters (a low-configuration version and a high-configuration version, with different computing power and cost) and decide which cluster serves a request based on the input length of the user's online request, balancing user experience, service pressure, and service cost.
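To make the routing idea concrete, here is a hypothetical dispatcher: the threshold and endpoint names are invented for illustration, but the logic, choosing a cluster by input length, follows the strategy described above.

```python
# Hypothetical request router: short prompts go to the low-configuration
# cluster, long prompts to the high-configuration one. Threshold and endpoint
# names are illustrative assumptions, not 01.AI's production values.
LOW_COST_CLUSTER = "http://yi-34b-low.internal/v1"
HIGH_PERF_CLUSTER = "http://yi-34b-high.internal/v1"
LENGTH_THRESHOLD = 2048   # tokens; an assumed cut-over point

def route_request(prompt_tokens: int) -> str:
    """Pick a serving cluster based on the request's input length."""
    if prompt_tokens <= LENGTH_THRESHOLD:
        return LOW_COST_CLUSTER
    return HIGH_PERF_CLUSTER

print(route_request(512))    # -> low-configuration cluster
print(route_request(8000))   # -> high-configuration cluster
```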

InfoQ: Looking ahead, what technological breakthroughs or development directions do you see for addressing the current computing power bottleneck in large model inference?

Li Mou: First, there are already some related products in China, but these specialized chips and their supporting software stacks have not yet formed a healthy ecosystem in the market; the lack of users and of consensus is a challenge for that ecosystem's development. Second, as the demand for computing power from large models and AI grows and computing clusters keep getting larger, power supply in some regions may become an issue, which may in turn drive the development of clean energy and efficient power-generation technologies (such as wind power and controlled nuclear fusion).

Speaker Introduction:

Li Mou is a senior algorithm expert at 01.AI and a graduate of Harbin Institute of Technology. He heads the online inference service for 01.AI's large models and previously served as a technical expert at Alibaba DAMO Academy and in the EI Service Product Department of Huawei Cloud. He has long worked on the R&D and optimization of AI model inference and training, has led teams that built a general-purpose inference engine and underlying acceleration libraries, and ranked first in inference performance on the Stanford DAWNBench GPU leaderboard.

