
The 10,000-Card Cluster: How far is it from "fighting in groups" to "breaking through as a group"?

Author: Communication Industry News
Being able to build one is strength; using it well is the key.

From ChatGPT to Sora, Claude 3, and Llama 3, large-model parameter counts have grown from tens of billions and hundreds of billions to trillions, model capabilities have become more general, and the battle of large models is in full swing; domestic large models urgently need to accelerate to catch up and even overtake. At the same time, this has triggered a new wave of AI computing power shortage, shifting from the earlier scarcity of chips to a "hunger and thirst" for AI computing power clusters.

Data show that by 2030, general-purpose computing power will grow tenfold to 3.3 ZFLOPS, while intelligent computing power will grow 500-fold to 105 ZFLOPS. As the computing power base of the artificial intelligence industry, intelligent computing centers are expected to maintain rapid growth of more than 30% over the next 3 to 5 years, and thousand-card and 10,000-card clusters will be an important springboard for completing this transition.


Cracking the AI computing power shortage.

"Cluster" compensates for "single card"

The necessary path to solve the AI computing power shortage

With the exponential explosion of demand for large-model training and inference, compounded by disruptions in GPU supply, the gap between the supply and demand of computing chips is enormous. "N cards" (Nvidia GPUs) being hard to find has triggered panic buying and hoarding, and it is difficult to find domestic products that directly match the single-card performance of the international giants.

Industry experts point out that China's intelligent computing power is currently in severe short supply, and the growth rate of computing power demand from large models has far outpaced the performance growth of a single AI chip. Given the combination of multiple factors and the need to build a commercial closed loop for the AI industry together with domestic large models, building independently innovated, localized clusters is all the more urgent.

Clearly, thousand-card and 10,000-card clusters are the starting point for meeting AI computing power demand. What is a 10,000-card cluster? It means using tens of thousands of GPUs to build a large-scale AI computing power cluster for training foundation models. Such clusters help significantly reduce large-model training time, enabling rapid iteration of model capabilities and timely response to market trends.

Moving from the thousand-card era to the 10,000-card era, and from competing on "models" to competing on "applications", the industry urgently needs efficient and sustainable computing power to carry a variety of new computing tasks. Driven by these diverse new demands, new 10,000-card intelligent computing centers that combine chips into systems have become an important lever for meeting the needs of the large-model industry, and standard new infrastructure in the AI competition among major countries.

It is understood that domestic intelligent computing centers have two main options for building clusters in the future. The first is a "mix and match" cluster model combining domestic and foreign chips, which places high demands on system optimization; the "shortest plank" may prevent the full release of the cluster's overall computing efficiency, and a long running-in period is expected before the optimal path is found. The second is a fully localized cluster model, which moves from "usable" toward "easy to use" and opens up a broad space for independent innovation through solid practical results.

In the past year, the layout of intelligent computing infrastructure such as 10,000-card intelligent computing centers in China has exploded. At present, domestic 10,000-card intelligent computing centers are still at an early stage of development and face many challenges. The opportunity left for domestic AI computing power therefore lies in clusters at and above the thousand-card scale and in the software ecosystem behind them. As Zheng Weimin, an academician of the Chinese Academy of Engineering, has said, building a domestic 10,000-card system is difficult, but necessary.

Liu Qiujiang, a large-model expert and operator of China's first AI large-model industry empowerment center, told the all-media reporter of Communication Industry News that more and more 10,000-card computing power clusters are under construction, but most large models are still at the stage of training iteration and small-scale use; visible industrial demand cannot yet be met, and more computing power clusters will need to be built in the future.


The 10,000-card cluster race

Tech companies "fight in groups"

At present, computing power clusters have moved from the thousand-card scale to 10,000-card and even 50,000-card clusters. Some even predict that when GPT-6 is deployed in the future, it will need 700,000 to 800,000 cards to support it.

In the race to accumulate computing power, major technology companies have invested heavily in research and development and proposed various schemes for training large models on 10,000-card clusters. However, the ability to design and effectively operate a 10,000-card cluster remains in the hands of only a few companies.

On the international stage, technology giants such as Google, Meta, Microsoft, Amazon, and Tesla are using ultra-10,000-card clusters to drive technological innovation in foundation models, intelligent algorithm research and development, and ecosystem services. For example, Google launched the A3 Virtual Machines supercomputer with 26,000 Nvidia H100 GPUs, and built an 8,960-card TPU v5p cluster based on its self-developed chips. Meta launched the AI Research SuperCluster with 16,000 Nvidia A100s in 2022, and announced two 24,576-card Nvidia H100 clusters in early 2024 to support the training of next-generation generative AI models.

In China, telecom operators, leading Internet companies, large AI R&D enterprises, and AI start-ups are all continuously driving technological innovation in the construction and use of clusters at and above the 10,000-card scale.

As the backbone of national computing infrastructure construction, the operators are accelerating the construction of 10,000-card intelligent computing centers. China Mobile recently revealed that this year it will bring into commercial use three independent and controllable 10,000-card clusters in Harbin, Hohhot, and Guiyang, with a total scale of nearly 60,000 GPU cards. In the first half of this year, China Telecom plans to build a domestic 10,000-card computing power pool in Shanghai with total computing power of over 4,500P, which will be the first ultra-large-scale liquid-cooled cluster built on domestic chips in China and an industry-leading national public intelligent computing center integrating cloud and intelligence. China Unicom's Shanghai Lingang International Cloud Data Center will build China Unicom's first 10,000-card cluster within this year.

Among Internet companies, apart from ByteDance, a well-known "N card" collector, and Alibaba and Baidu, which have some self-developed chips, the vast majority of large, medium, and small players are scrambling to find domestic alternatives for AI computing power. ByteDance has built a 12,288-card Ampere-architecture training cluster and developed the MegaScale production system for training large language models. Last year, Ant Group revealed that it had built a 10,000-card heterogeneous computing power cluster. In 2023, Tencent launched a high-performance network with an industry-leading 3.2 Tbps of communication bandwidth, bringing a tenfold increase in communication performance for large AI models; based on Tencent Cloud's next-generation computing cluster HCC, it can support a computing scale of 100,000 GPUs.

In addition, in July 2023, Huawei announced a comprehensive upgrade of the Ascend AI cluster, expanding the cluster size from 4,000 cards to 16,000 cards, making it the industry's first 10,000-card AI cluster, with faster training speed and a stable training cycle of more than 30 days. In 2023, iFLYTEK built "Feixing-1", the first ultra-10,000-card cluster computing platform supporting large-model training. On February 4, 2024, the Shenzhen Open Intelligent Computing Center was lit up along with the Shenzhen Smart City Computing Power Coordination and Scheduling Platform, which will help Shenzhen build a 100,000-card "strongest computing power" cluster.

It is worth mentioning that server manufacturers are no longer limited to providing a single hardware product; they now deliver comprehensive solutions that may include servers, storage, networking, and security, tailored to the specific needs of downstream customers. As an important carrier of computing resources, servers are becoming a core element of enterprises' 10,000-card cluster construction.

So, from thousand-card and 10,000-card clusters to 100,000-card and even million-card clusters, why does intelligent computing keep "stacking cards"? Is this trend sustainable?

Clearly, a non-linear increase in the number of cards in a computing power cluster brings greater instability and coordination difficulty. Experts at H3C believe that, compared with "N cards", there is still a gap at the single-card level, and a multi-card cluster cannot close it simply by "fighting in a group".


From "Build" to "Use"

How the 10,000-card cluster overcomes five challenges

There are many misconceptions in the industry about how to build clusters. Some believe that "a cluster is just a pile of servers stacked together", others that "the more computing power a cluster has, the better". Both views underestimate the difficulty of operating a complex system and the importance of coordinated breakthroughs across multiple elements.

Experts believe that cluster construction is undoubtedly a complex, systematic project. From GPU to server to cluster, it covers computing, storage, networking, software, and large-model scheduling, and places high demands on computing power utilization, stability, reliability, scalability, compatibility, and other indicators. The market is looking forward to "turnkey" solutions that can meet the full-stack needs of intelligent computing centers.

There is no doubt that under a development path that combines large computing power with big data to produce large models, building an ultra-10,000-card cluster is not a simple stacking of computing power; the goal is to make tens of thousands of GPU cards run efficiently like one "supercomputer". The overall design of an ultra-10,000-card cluster should follow five principles: building the ultimate cluster computing power, constructing a collaboratively optimized system, achieving long-term stable and reliable training, providing flexible computing power supply, and promoting green and low-carbon development.


Image source: White Paper on New Intelligent Computing Technology for Ultra-10,000-Card Clusters

In the era of large models, however, computing power is productivity, and even the market giants are only at the beginning of building AI clusters with tens of thousands of GPUs. In interviews, the all-media reporter of Communication Industry News found that the construction of 10,000-card clusters still faces five challenges.

First, the challenge of extreme computing power efficiency. A linear increase in cluster size does not directly translate into a linear increase in usable computing power; the adaptation and optimization of the interconnect network, software, and hardware between cards and between nodes is the key challenge in pursuing the cluster's ultimate effective compute. The White Paper on New Intelligent Computing Technology for Ultra-10,000-Card Clusters points out that systems-engineering methods are needed to comprehensively improve cluster computing efficiency, through refined design of the ultra-10,000-card cluster network and full-stack integration and optimization of software and hardware.
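To make the gap between nominal and effective computing power concrete, here is a minimal back-of-the-envelope sketch. The per-card peak FLOPS, model FLOPS utilization (MFU), and scaling-efficiency figures are illustrative assumptions, not measurements of any particular cluster.

```python
# Back-of-the-envelope estimate of effective cluster compute.
# All numbers below are illustrative assumptions, not measurements.

def effective_cluster_flops(num_cards: int,
                            peak_flops_per_card: float,
                            mfu: float,
                            scaling_efficiency: float) -> float:
    """Effective training FLOPS of a cluster.

    mfu                -- model FLOPS utilization of a single card (0..1)
    scaling_efficiency -- fraction of single-card efficiency retained
                          when scaling to `num_cards` (0..1)
    """
    return num_cards * peak_flops_per_card * mfu * scaling_efficiency

if __name__ == "__main__":
    # Hypothetical figures: 10,000 cards, 300 TFLOPS peak per card (BF16),
    # 40% MFU, and 90% linear scaling efficiency at the 10,000-card scale.
    eff = effective_cluster_flops(10_000, 300e12, 0.40, 0.90)
    print(f"Effective compute: {eff / 1e18:.2f} EFLOPS")
    # A modest drop in scaling efficiency (0.90 -> 0.70) wipes out the
    # equivalent of thousands of cards' worth of compute.
    eff_low = effective_cluster_flops(10_000, 300e12, 0.40, 0.70)
    print(f"With 70% scaling efficiency: {eff_low / 1e18:.2f} EFLOPS")
```

The point of the sketch is that the last factor, scaling efficiency, is exactly what network design and software-hardware co-optimization are fighting for; the card count alone says little about the compute a training job actually receives.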

Second, the challenge of processing massive amounts of data. Training hundred-billion-parameter models already requires multiple protocols to handle petabyte-scale datasets, and training trillion-parameter models in the future will require checkpoint read/write throughput as high as 10 TB/s. Efficient data sharing and processing capabilities must be provided through technical means such as protocol convergence and automatic tiering to meet the needs of large-model training.
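A rough calculation shows why checkpoint throughput on the order of 10 TB/s matters at the trillion-parameter scale. The parameter count and the bytes-per-parameter breakdown below are assumptions for illustration (a common mixed-precision layout), not a description of any specific training system.

```python
# Rough checkpoint-size arithmetic for a trillion-parameter model.
# The bytes-per-parameter breakdown is an assumed mixed-precision layout
# (FP16 weights + FP32 master weights + Adam moments), for illustration only.

PARAMS = 1e12                     # 1 trillion parameters (assumed)
BYTES_PER_PARAM = 2 + 4 + 4 + 4   # fp16 weight + fp32 master + Adam m, v

checkpoint_bytes = PARAMS * BYTES_PER_PARAM
checkpoint_tb = checkpoint_bytes / 1e12
print(f"Checkpoint size: ~{checkpoint_tb:.0f} TB")

# Time to persist one checkpoint at different aggregate write throughputs.
for throughput_tb_s in (0.1, 1.0, 10.0):
    seconds = checkpoint_tb / throughput_tb_s
    print(f"At {throughput_tb_s:5.1f} TB/s: {seconds:7.1f} s per checkpoint")
```

At 10 TB/s a roughly 14 TB checkpoint is flushed in about a second, so it can be taken frequently; at a tenth of that bandwidth the same checkpoint stalls the cluster for minutes, which quickly becomes prohibitive when checkpoints must be frequent.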

Third, the challenge of hyperscale interconnection. As model scale grows, multi-machine, multi-card interconnection and parallel training strategies are required, which places extremely high demands on the network's scale-out and scale-up capabilities. Both the parameter-plane network and the data-plane network need high bandwidth, low latency, and high reliability to support the data throughput and computation requirements of large-model training.
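The sketch below gives a naive upper bound on gradient-synchronization traffic in pure data parallelism, to show why the parameter-plane network needs so much bandwidth. The model size, gradient precision, and bandwidth figures are illustrative assumptions; real systems cut this volume sharply with hybrid parallelism, gradient sharding, and overlapping communication with compute.

```python
# Naive per-step communication estimate for data-parallel gradient sync.
# Figures are illustrative assumptions, not vendor specifications.

PARAMS = 200e9            # 200B-parameter model (assumed)
BYTES_PER_GRAD = 2        # gradients exchanged in FP16 (assumed)
NUM_CARDS = 10_000        # data-parallel ranks (assumed)

# Ring all-reduce moves roughly 2 * (N - 1) / N of the gradient volume
# in and out of every card each step.
grad_bytes = PARAMS * BYTES_PER_GRAD
per_card_traffic = 2 * (NUM_CARDS - 1) / NUM_CARDS * grad_bytes

print(f"Gradient volume per step: {grad_bytes / 1e9:.0f} GB")
print(f"Per-card all-reduce traffic: {per_card_traffic / 1e9:.0f} GB/step")

# Time spent purely on communication at different per-card bandwidths.
# Real systems hide much of this behind compute via overlap and by
# splitting the model across tensor/pipeline-parallel groups.
for gbps in (100, 400, 1600):                 # network bandwidth in Gbit/s
    seconds = per_card_traffic * 8 / (gbps * 1e9)
    print(f"At {gbps:5d} Gbit/s per card: {seconds:6.2f} s of pure comm per step")
```

Even as an upper bound, hundreds of gigabytes of synchronization traffic per card per step explain why scale-up domains, high-radix scale-out fabrics, and communication-computation overlap are all on the critical path of cluster design.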

Fourth, the challenge of keeping large-scale training stable and efficient. Stability is critical in large-model training: failures and slowdowns are common, and they are costly. Shortening failure recovery time is urgent, because once a single card or node fails or falls behind, it not only affects its own progress but can block the operation of tens of thousands of GPUs. Careful optimization is needed to keep training both stable and efficient.
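A simple failure model illustrates why recovery time dominates at the 10,000-card scale; the per-card mean time between failures (MTBF) used here is an assumed illustrative figure.

```python
# Why stability dominates at the 10,000-card scale: a rough failure model.
# The per-card MTBF below is an illustrative assumption.

NUM_CARDS = 10_000
MTBF_PER_CARD_HOURS = 50_000      # assumed mean time between failures per card

# With independent failures, the whole cluster's MTBF shrinks linearly.
cluster_mtbf_hours = MTBF_PER_CARD_HOURS / NUM_CARDS
print(f"Cluster-level MTBF: ~{cluster_mtbf_hours:.1f} hours "
      f"(~{24 / cluster_mtbf_hours:.1f} interruptions per day)")

# Fraction of wall-clock time lost if every interruption costs
# `recovery_minutes` (detection + restart + checkpoint reload).
for recovery_minutes in (30, 10, 2):
    lost = (recovery_minutes / 60) / (cluster_mtbf_hours + recovery_minutes / 60)
    print(f"Recovery {recovery_minutes:2d} min -> ~{lost:.1%} of training time lost")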

Fifth, the challenge of the domestic software ecosystem. Although more than 30 domestic companies have launched domestic AI chips, users are reluctant to adopt them, and the core problem is the weakness of the domestic software ecosystem. Key software such as programming frameworks, parallel acceleration, communication libraries, operator libraries, AI compilers, programming languages, schedulers, memory allocation systems, fault-tolerance systems, and storage systems now all have domestic versions, but shortcomings remain: incomplete functionality, weaker performance, and an ecosystem of contributors that is not yet thriving.

As the parameters of large AI models keep growing, the dependence on and hunger for computing power clusters grows with them, requiring computing power vendors to work hard on chips, tuning, communication, and systematic development and management, so as to truly accelerate the development of the large-model industry.


Written by: Hu Yuan

Chart: Dawn

Editor and proofreader: Hu Yuan

Guidance: Xin Wen

