Under the AI paradigm revolution, Lenovo gives a "foolproof strategy" for AI infrastructure

Author: Yan Yuelong

What was the hottest topic of the past year? Collins English Dictionary named "AI" its word of the year for 2023, and "large AI model" also made the top ten buzzwords of 2023 published by the Chinese magazine Yaowen Jiaozi ("Bite and Chew Words").

As AI has heated up, however, new trends and new problems have emerged. In particular, how computing infrastructure, the foundation on which AI runs, can deliver its full potential deserves the industry's attention.

Breaking through the bottleneck of computing power has become a top priority

AI's popularity is driven by technological progress, industry demand, and policy attention, all of which have led to explosive growth in the demand for computing power.

In terms of technology, generative AI, represented by large models, has boomed over the past year. As Zhang Yunquan, a researcher at the Institute of Computing Technology of the Chinese Academy of Sciences, put it, "large model + large computing power + big data" has become the basic paradigm for developing the new generation of artificial intelligence. According to the data cited, the computing power used to train large models based on the Transformer architecture has grown by an average of ten times per year since 2018, and a new Moore's law has emerged in which training compute doubles every 20 months.
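The two growth figures quoted above can be related with a little arithmetic. The sketch below assumes constant exponential growth, which the source does not state, and the function names are purely illustrative. Under that assumption, tenfold growth per year corresponds to a doubling time of about 3.6 months, while doubling every 20 months corresponds to roughly 1.5 times per year, so the two figures appear to describe different quantities or periods.

```python
import math


def doubling_time_months(annual_growth_factor: float) -> float:
    """Months needed to double, given a constant multiplicative growth per year."""
    return 12 * math.log(2) / math.log(annual_growth_factor)


def annual_growth_factor(doubling_months: float) -> float:
    """Yearly multiplier implied by a constant doubling time in months."""
    return 2 ** (12 / doubling_months)


# Tenfold growth per year, as quoted for Transformer-based training compute.
print(round(doubling_time_months(10), 1))   # about 3.6 months
# Doubling every 20 months, as quoted for the "new Moore's law".
print(round(annual_growth_factor(20), 2))   # about 1.52x per year
```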

In terms of industry demand, large models are spreading into every sector and driving deep intelligent transformation. At the 2024 Lenovo Innovation and Technology Conference, Yang Yuanqing, Chairman and CEO of Lenovo Group, said that hybrid artificial intelligence is the inevitable path for AI to reach and empower industry. According to Gartner, 80% of enterprises will be using generative AI by 2026, and enterprises will spend nearly four times as much on generative AI in 2027 as in 2024. Clearly, deeper penetration of AI into industry will further accelerate the demand for computing power.

Notably, AI and the computing infrastructure build-out it drives have also received strong policy attention. This year's Government Work Report, delivered during the Two Sessions, proposed "AI+" for the first time. In particular, China's accelerated development of new quality productivity gives AI and computing power room to show what they can do. The Action Plan for the High-Quality Development of Computing Infrastructure issued by the Ministry of Industry and Information Technology states clearly that computing power is a new type of productivity that integrates information computing capacity, network carrying capacity, and data storage capacity, and that it mainly serves society through computing infrastructure. Computing power, as the foundation supporting the development of AI, is thus a typical form of new quality productivity.

On the one hand, demand for computing power is exploding; on the other, the utilization of computing power remains low. At least three "mountains" stand in the way of using computing power well.

The first mountain is complexity. Application scenarios across industries are numerous, and algorithm frameworks and operator libraries are just as varied, so choosing the computing power that best fits one's own needs is a major problem.

The second mountain is low utilization. In clusters of a thousand or even ten thousand cards, frequent AI training failures with long recovery times, weak GPU virtualization capabilities, and severe network communication bottlenecks all lead to low utilization and poor availability of AI computing power. For example, Meta noted in its logbook for training the OPT-175B model that almost the entire training run was plagued by restarts and interruptions, including more than 40 restarts in one two-week period caused by hardware, infrastructure, or experimental stability issues.

The third mountain is high energy consumption. Tesla CEO Elon Musk believes that in the next few years the AI industry will go from being "short of silicon" to being "short of electricity". Jensen Huang, founder and CEO of NVIDIA, recently said that the endgame of AI is photovoltaics and energy storage. Both remarks point to the energy challenge that the development of AI brings.

In short, the main contradiction in computing power today is between explosive demand across all industries on one side and the shortage and low utilization of computing power on the other.

A "Foolproof Solution" to Improve Computing Power Performance

Chen Zhenkuan, Vice President of Lenovo Group and General Manager of Lenovo China Infrastructure Business Group, outlined three major capabilities of Lenovo's AI infrastructure: matching users with the best computing power, already verified and optimized; enabling users to make full use of that computing power and improve computing efficiency; and using advanced liquid cooling technology to help users save energy, raise efficiency, and break through the bottleneck of chip heat dissipation.

These three capabilities, together with the five latest innovations Lenovo has released, directly target users' pain points and use technological innovation to address the contradictions in computing power.

Especially noteworthy is the newly released Lenovo Heterogeneous Intelligent Computing Platform, which uses differentiated technology to give users more efficient and stable computing power. The platform can carry out the entire AI development process with a high degree of automation, and it brings together a rich computing power ecosystem, models, and AI toolsets optimized for various scenarios. It acts not only as a super scheduler and amplifier of resources but also as a kind of "super brain" for computing power and efficiency.

Chen Zhenkuan described the positioning of Lenovo's Heterogeneous Intelligent Computing Platform: "Lenovo Heterogeneous Intelligent Computing Platform is the core of Lenovo's China infrastructure strategic framework in the AI 2.0 era. It integrates Lenovo's five major technological innovations and is the infrastructure foundation for large model training and inference in the AI 2.0 era."

The biggest breakthrough of Lenovo's heterogeneous intelligent computing platform lies in algorithmic innovation. Take GPU kernel virtualization technology as an example. It addresses the problems that most GPU virtualization algorithms working at the operating system level run into in multi-tenant, multi-container scenarios, such as disorderly resource preemption, performance overhead from waiting, and overly coarse granularity. "Lenovo Research Institute has developed a kernel-mode virtualization algorithm at the GPU driver layer. The new algorithm can cut the GPU computing power lost to virtualization to less than 5%, and in the best case to less than 1%, greatly improving GPU utilization," Chen Zhenkuan said.

Minute-level AI breakpoint training, in which an interrupted job resumes from a saved checkpoint, addresses the poor availability of computing power caused by AI training failures. After a failure, it usually takes several hours to diagnose, isolate, or resolve the fault before training can recover, and computing power is badly wasted in the meantime. Lenovo's technology achieves minute-level recovery through multi-level backup strategies, comprehensive real-time monitoring, and, in particular, the use of AI to predict AI training faults. This greatly improves the availability of computing power; for a thousand-card cluster, for example, it can save millions of yuan in computing power costs every month.
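The general idea behind resuming training from a breakpoint can be sketched in a few lines. The following is only a toy illustration of generic checkpoint-and-resume with two backup levels, not Lenovo's implementation, whose details the source does not describe. The file names, save intervals, the running sum standing in for model weights, and the simulated failure are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write atomically so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)


def load_latest(paths):
    """Return the newest readable checkpoint among the backup levels, or None."""
    best = None
    for p in paths:
        try:
            with open(p) as f:
                ckpt = json.load(f)
        except (OSError, ValueError):
            continue  # missing or corrupt level: fall back to the next one
        if best is None or ckpt["step"] > best["step"]:
            best = ckpt
    return best


def train(total_steps, paths, fast_every=10, slow_every=50, fail_at=None):
    """Toy loop: 'state' is a running sum standing in for model weights."""
    ckpt = load_latest(paths)
    step, state = (ckpt["step"], ckpt["state"]) if ckpt else (0, 0)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        state += step  # stand-in for one training step
        step += 1
        if step % fast_every == 0:
            save_checkpoint(paths[0], step, state)  # frequent level, e.g. local disk
        if step % slow_every == 0:
            save_checkpoint(paths[1], step, state)  # rarer level, e.g. remote storage
    return state


with tempfile.TemporaryDirectory() as d:
    levels = [os.path.join(d, "fast.json"), os.path.join(d, "slow.json")]
    try:
        train(100, levels, fail_at=73)   # crashes before step 73 runs
    except RuntimeError:
        pass
    final = train(100, levels)           # resumes from the step-70 checkpoint
    print(final == sum(range(100)))      # True: only steps 70-99 were redone
```

In this sketch the work lost to a failure is bounded by the interval of the most frequent backup level, which is why more frequent and cheaper checkpoints shorten recovery.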

In addition, Lenovo's collective communication library technology, which breaks through the bottleneck of cluster computing, can improve training efficiency by 10-15%. Its heterogeneous cluster super-scheduling technology breaks down computing power silos and allows AI and HPC workloads to share computing power. Both help customers obtain continuous, stable computing power output.

On the whole, Lenovo's effort to unlock the full potential of AI infrastructure rests on three kinds of "power".

First, technological power. The four major algorithm innovations in Lenovo's heterogeneous intelligent computing platform demonstrate its technical strength. Huang Shan, Director of Strategy at Lenovo's China Infrastructure Business Group, gave a particularly interesting example when explaining Lenovo's collective communication library technology, which was developed with reference to the ant colony algorithm. When an ant colony forages, an ant that successfully retrieves food releases pheromones along its route. The higher the pheromone concentration on a path, the more likely that path is to lead to food, and this is how the colony finds a shortcut. The ant colony algorithm grew out of this behavior and has been applied successfully to the traveling salesman problem, that is, finding the shortest route for a salesman who sets out from one city, traverses ten cities, and returns to the starting point. It also provided an important reference for Lenovo's collective communication library technology. Lenovo continues to explore and innovate in algorithms.
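The ant colony idea described above can be made concrete with the textbook traveling salesman setting. The sketch below is a minimal, generic ant colony optimization, not Lenovo's collective communication library, which the source does not detail. The parameter values (number of ants, iterations, alpha, beta, evaporation rate) and the ten example cities placed on a circle are assumptions chosen for illustration, and the algorithm is a heuristic that is not guaranteed to find the shortest tour. City coordinates must be distinct.

```python
import math
import random


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def ant_colony_tsp(points, n_ants=20, n_iters=100, alpha=1.0, beta=3.0,
                   evaporation=0.5, seed=0):
    """Return (best_tour, best_length) found by a basic ant colony search."""
    rng = random.Random(seed)
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    pher = [[1.0] * n for _ in range(n)]  # pheromone on every edge
    best_tour, best_len = None, float("inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [0]  # every ant starts from city 0
            unvisited = set(range(1, n))
            while unvisited:
                cur = tour[-1]
                cand = sorted(unvisited)
                # Prefer edges with more pheromone and shorter distance.
                weights = [pher[cur][j] ** alpha * (1.0 / dist[cur][j]) ** beta
                           for j in cand]
                nxt = rng.choices(cand, weights=weights)[0]
                tour.append(nxt)
                unvisited.remove(nxt)
            length = tour_length(tour, dist)
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        # Pheromone evaporates everywhere...
        for i in range(n):
            for j in range(n):
                pher[i][j] *= 1.0 - evaporation
        # ...and each ant deposits more on shorter tours.
        for tour, length in tours:
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                pher[a][b] += 1.0 / length
                pher[b][a] += 1.0 / length
    return best_tour, best_len


# Ten hypothetical cities on a unit circle, listed in shuffled order.
cities = [(math.cos(2 * math.pi * k / 10), math.sin(2 * math.pi * k / 10))
          for k in range(10)]
random.Random(1).shuffle(cities)
tour, length = ant_colony_tsp(cities)
print(tour, round(length, 3))
```

Shorter tours receive more pheromone per edge, so later ants are drawn toward them, which mirrors the "higher concentration, better path" behavior in the foraging example.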

Second, evolutionary power. Chen Zhenkuan revealed that Lenovo will push breakpoint-training recovery to within minutes, continue to optimize communication algorithms for ultra-large-scale clusters, conduct in-depth research on phase-change liquid cooling, and plan modular liquid-cooled data centers. Lenovo's AI infrastructure is evolving continuously, which means it will keep raising the ceiling on how much computing power can be put to use.

Third, ecosystem power. Another noteworthy event at the 2024 Lenovo Innovation and Technology Conference was the launch of the Heterogeneous Intelligent Computing Industry Ecological Alliance. The alliance covers the AI chip layer, the AI equipment and system layer, the AI platform and application layer, IaaS platforms, AI training and inference, and industry scenario solutions. It is intended to bring together upstream and downstream infrastructure enterprises, academia, and research institutions to integrate resources, improve industrial competitiveness, and promote the standardized development of the industry.

All of this is key to how Lenovo uses computing power efficiently and unlocks its full potential.

The strategic layout of "one horizontal and five vertical" has created a solid foundation

Notably, with the release of Lenovo's heterogeneous intelligent computing platform, Lenovo's "one horizontal and five vertical" strategic layout in infrastructure has become very clear.

At MWC 2024, held in February this year, Liu Jun, Executive Vice President of Lenovo Group and President of Lenovo China, announced Lenovo's AI-oriented "one horizontal and five vertical" infrastructure layout for the first time. Specifically, the "one horizontal" is Lenovo's heterogeneous intelligent computing platform, and the "five verticals" are products and solutions for servers, storage, data networking, software and hyper-converged infrastructure, and edge infrastructure.

Chen Zhenkuan said that Lenovo's infrastructure business in China is the backbone of Lenovo's "full-stack AI" strategic layout. Guided by the "one horizontal and five vertical" strategic framework, it is building a complete, stable, and efficient AI-oriented infrastructure that gives enterprises a solid and reliable intelligent computing foundation for their intelligent transformation.

Together, the elements of the "one horizontal and five vertical" layout constitute the core competitiveness of Lenovo's AI infrastructure. Chen Zhenkuan said in an interview: "The five major products of Lenovo's AI infrastructure will be upgraded and reconstructed based on AI, and at the same time, Lenovo will integrate the five products into one and achieve full integration through Lenovo's heterogeneous computing platform."

It is this "one horizontal and five vertical" layout that lets Lenovo offer a "foolproof strategy" for building a solid infrastructure foundation in the AI era.

Postscript: The engine that stimulates new quality productivity

Chen Runsheng, an academician of the Chinese Academy of Sciences, used the word "paradigm" when assessing the development of AI: "The development of artificial intelligence, especially large models, provides us with a new paradigm."

The concept of a "paradigm" in this sense was introduced by Thomas Kuhn in The Structure of Scientific Revolutions, where a paradigm shift refers to a major change in the way of thinking in a field. Every major change in human history is the product of a "paradigm revolution".

Whether AI truly triggers a paradigm revolution depends on whether the computing infrastructure underneath it can deliver its maximum value. Seen from this perspective, the significance of Lenovo's Heterogeneous Intelligent Computing Platform and the Heterogeneous Intelligent Computing Industry Ecological Alliance becomes clear.

At the milestone of its 40th anniversary, Lenovo has once again met a historic opportunity, this time offered by the AI paradigm revolution, both in AI PCs on the intelligent terminal side and in AI infrastructure.

Opportunities always favor the prepared. Taking infrastructure as an example, there are at least three reasons why Lenovo is at the forefront of the AI field:

First, experience. As Liu Jun said, as early as the era of local computing and storage, Lenovo launched the first PC server with IA architecture, opening a new era for domestic PC servers. Later, when the Internet gave rise to demand for general-purpose and scientific computing power, Lenovo became a leading enterprise in China's computing power sector. This experience allows Lenovo to build on the past and gives it the opportunity to keep leading in the AI era represented by large models.

Second, a complete layout. In infrastructure, Lenovo has built a complete "one horizontal and five vertical" business layout. The "one horizontal" released this time, the Wanquan heterogeneous intelligent computing platform, is the finishing touch for AI infrastructure: it combines software and hardware and enables AI infrastructure to transform.

Third, insight. The five major innovations in AI infrastructure that Lenovo released this time all come from insight into customer pain points combined with technological innovation. Their end result is to leave the complexity to the technology and the simplicity to the customer, so that customers can get the most value from AI infrastructure and better support the development of AI and large models.

"The new era brought to us by artificial intelligence technology is an era in which a hundred flowers bloom and a hundred boats race the current, and it will also be an era of heroes. Lenovo will continue to invest, insist on innovation, and continue to upgrade the power of everything to help heroes and achieve heroes, accelerating China's intelligent transformation and unleashing new momentum for social progress," Chen Zhenkuan said.

Artificial intelligence has become the core driving force behind China's accelerated development of new quality productivity. Once AI infrastructure is rock solid, AI can be expected to reach every industry and household more quickly and to generate new quality productivity more effectively.