
Strengthening the computing power ecosystem: Zhishen Supercomputing targets the key to intelligent computing development

Author: Communication World

In 1956, the term "artificial intelligence" was first coined. At the time, the potential of data and computing power had yet to be tapped, and artificial intelligence entered a decades-long incubation period. Since the beginning of the 21st century, supported by big data and large-scale computing power, the wave of deep learning has swept through artificial intelligence, ushering in a period of vigorous development. Correspondingly, the growth of artificial intelligence, represented by large models, has accelerated, and intelligent computing has gradually become the focal point of large model competition.

At a time when intelligent computing is becoming ever more important, what development opportunities exist in China? What challenges remain to be overcome? And what key openings must be seized to win on the intelligent computing track? Recently, at the high-quality development forum for the intelligent computing industry at the 2024 China Industrial Economy Summit Forum, an all-media reporter from Communication World interviewed Xia Ke, general manager of Beijing Zhishen Supercomputing Network Technology Co., Ltd. (hereinafter "Zhishen Supercomputing"), seeking to spark ideas and shed light on these questions.


Computing power faces development opportunities, and challenges come with them

At present, the achievements of large model development at home and abroad are plain to see, most intuitively in the rapid growth in the number of large models. According to the latest generative AI filing information released by the Cyberspace Administration of China, as of March 2024, a total of 117 large AI models in mainland China had completed filing. As the number of large models has skyrocketed, their scale has also expanded rapidly, and the computing power they require has exploded.

As a veteran practitioner on the front line of computing services, Xia Ke said frankly that the most intuitive change in computing power in recent years is this: in 2016, computation was still basically based on small models, with scenarios mostly in industrial quality inspection, license plate recognition, advertising recommendation, and click-through rate estimation, and the use of computing power was concentrated at the inference level. With the rise of large models, the use of computing power has gradually shifted from inference scenarios toward a larger proportion of training demand. In particular, as the parameter counts of trained large models keep growing, traditional compute cards have struggled to support such large-scale computation, which has in turn driven a change in the underlying computing architecture, namely AI infrastructure.

The training quality, cost, and time of large models are closely tied to the underlying computing power. Most intuitively, the exponential growth of large model parameter counts from tens of millions to trillions directly drives up computing power demand. Taking ChatGPT as the most typical example: GPT-3, released in May 2020, had 174.6 billion parameters and required about 3,640 PFlop/s-days of computing power (at one quadrillion operations per second, the computation would take 3,640 days). After three years of development, GPT-4, launched in 2023, expanded to 1.8 trillion parameters, a roughly tenfold increase, and its required computing power is estimated to have reached 248,842 PFlop/s-days.
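As a rough cross-check on figures like these, the widely used approximation C ≈ 6·N·D (total training FLOPs ≈ 6 × parameter count × training tokens) can be sketched in a few lines. The parameter and token values below are illustrative assumptions for a GPT-3-scale run, not figures from the interview:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate total training compute via the common 6*N*D rule of thumb."""
    return 6.0 * params * tokens

def to_pflops_days(flops: float) -> float:
    """Convert raw FLOPs to PFlop/s-days (1 PFlop/s sustained for 24 hours)."""
    return flops / (1e15 * 86400)

# Illustrative: a 175-billion-parameter model trained on ~300 billion tokens.
flops = train_flops(175e9, 300e9)
print(f"{to_pflops_days(flops):,.0f} PFlop/s-days")  # roughly 3,600
```

With these assumed inputs the estimate lands near the 3,640 PFlop/s-days cited for GPT-3, which is why the 6·N·D rule is a convenient sanity check on published compute figures.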

Such large-scale reliance on computing power not only underscores its importance, but also implies that a computing power shortage carries huge risks: insufficient computing power directly slows a model's intelligent upgrades and leaves its capabilities lagging behind.

"The importance of computing power is undeniable, but computing power still faces daunting challenges." Xia Ke believes that the challenges facing computing power in the era of large models are mainly reflected in three aspects.

First, the supply of GPUs is seriously insufficient. According to statistics, the global chip gap currently exceeds 1 million units. Looking at NVIDIA, the leader of the GPU industry: on the one hand, its production capacity depends on the combination of core logic chips, HBM memory chips, and CoWoS packaging, making capacity difficult to estimate accurately; on the other hand, it also faces many restrictions on shipments. Industry reports revealed that in the fourth quarter of 2023, the supply of NVIDIA GPUs was strictly limited worldwide, resulting in an obvious shortage in the global market.

Second, it is difficult to form a closed-loop ecosystem. Facing the continued monopoly position of ecosystems such as CUDA, Xia Ke believes that domestic chip companies are currently taking two main routes: either becoming CUDA-compatible and embracing the CUDA ecosystem, or becoming compatible with mainstream frameworks and large models to build a software ecosystem of their own. Fortunately, many domestic large models have now emerged, which reduces the importance of the framework layer. "A large model is like a super app in the mobile era, which shields the underlying Android or iOS; a leading large model will likewise shield the training framework, and this is expected to become a breakthrough point in the transformation of computing power," Xia Ke said by way of metaphor.

Third, combining models with industry takes a long time to put into practice. As in engineering construction, the goal is to achieve satisfactory results at minimum cost, so that industry deployment can actually land. Xia Ke noted that the combination of large models and industries must be analyzed scenario by scenario: for the many segmented inference scenarios across industries, edge-side deployment clusters or centralized large-scale clusters can be built; whereas training scenarios for foundational large models, and the training and fine-tuning of industry large models, require far more computing power, from thousands of cards to ten thousand or even hundreds of thousands of cards.

Breaking the "metaphysics" of large model training, winning by strengthening the computing power ecosystem

Since computing power and large models are so closely intertwined, there is a saying in the industry that "whoever gets computing power gets large models." In fact, this is not quite the case: owning thousands or tens of thousands of GPU cards is only the foundation of success. When investing in large-scale computing clusters for training, if a card fails during training and is not replaced in time, the enterprise can lose its earlier training investment, potentially millions or tens of millions of yuan, or even more.

Here Xia Ke drew a vivid analogy: large model training is like Taishang Laojun's "alchemy," involving both technical components and other factors. Once enough high-quality datasets have been collected and training begins, an experienced engineering team needs to regularly check progress and optimize, to ensure that the training converges to appropriate values as far as possible. After the model is trained, the cost of model inference must then be reduced through methods such as low-bit quantization and model pruning.
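As a minimal sketch of the low-bit quantization idea mentioned above (not Zhishen Supercomputing's actual pipeline), symmetric int8 quantization maps each float weight to an 8-bit integer plus a single shared scale factor, shrinking storage roughly fourfold at a small accuracy cost:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto [-127, 127] via one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.05, 0.0, 0.64]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight differs from the original by at most one quantization step.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

Production schemes add per-channel scales, zero points, and calibration data, but the core trade of precision for memory and bandwidth is the same.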

It follows that if enterprises want to use large models to improve their business, they need not only intelligent computing power, but also flexible and compatible framework platforms, powerful foundational large models, high-quality industry datasets, and solutions well suited to business scenarios for deploying industry large models.

In this regard, Xia Ke said that to realize coordinated development of the intelligent computing industry ecosystem, it is necessary to encourage the formation of a sound business model and a closed loop, and to balance the relationship among capital costs, computing resources, and user needs. At the same time, by leveraging its own software platform and the complementary strengths of its partners, Zhishen Supercomputing is committed to operating computing power well within the intelligent computing industry ecosystem, and is willing to contribute its own strength to this effort.

It is understood that Zhishen Supercomputing currently relies on a mature computing power supply chain and capabilities in construction, operation, and large model optimization, and can provide powerful end-to-end solutions for large model training and inference scenarios at home and abroad. Zhishen Supercomputing has also built a one-stop computing power trading and service platform, which can not only ensure high-performance computing power scheduling, but also help implement industry applications of large models.

On computing power scheduling, Xia Ke explained that Zhishen Supercomputing's high-performance scheduling is mainly reflected on the inference side: when customers request large model services, nodes with better service capability and lower cost are preferred to meet the customers' computing power needs.
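The node-selection logic Xia Ke describes, preferring nodes with better service capability and lower cost, can be sketched as a simple weighted score. The node names, fields, and weights below are hypothetical assumptions for illustration, not details of Zhishen Supercomputing's scheduler:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    latency_ms: float     # lower latency = better service capability
    cost_per_hour: float  # lower cost = cheaper to serve

def pick_node(nodes, w_latency=0.6, w_cost=0.4):
    """Choose the inference node with the lowest weighted latency/cost score."""
    max_lat = max(n.latency_ms for n in nodes)
    max_cost = max(n.cost_per_hour for n in nodes)
    def score(n):
        # Normalize both terms to [0, 1] so the weights are comparable.
        return w_latency * n.latency_ms / max_lat + w_cost * n.cost_per_hour / max_cost
    return min(nodes, key=score)

nodes = [
    Node("node-a", latency_ms=40, cost_per_hour=3.0),
    Node("node-b", latency_ms=25, cost_per_hour=2.2),
    Node("node-c", latency_ms=90, cost_per_hour=1.5),
]
print(pick_node(nodes).name)  # node-b
```

Adjusting the weights shifts the policy between latency-first and cost-first selection; a real scheduler would also track live load and availability per node.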

In addition, it is worth noting that, to advance the intelligent computing industry ecosystem, Zhishen Supercomputing is also actively promoting adaptation to the domestic artificial intelligence ecosystem. Xia Ke explained that the strategy has three aspects. First, drawing on years of industry accumulation, accurately match users to existing computing power. Second, bring more domestic chip capability onto the existing platform and carry out the corresponding adaptations. Third, gather the forces of policymakers, research institutions, and other parties to pool resources and advance the construction of the industrial ecosystem.

As the saying goes, pooled wisdom accomplishes anything, and pooled strength prospers all. It is believed that in the future, with the efforts of Zhishen Supercomputing and its many partners, the intelligent computing industry will gather more innovative resources and wisdom, form closer industrial alliances and cooperation mechanisms, inject more vitality into the development of the digital economy and social progress, and jointly create a better digital future.
