
Throwing money at Wanka cluster construction, Chinese companies are catching up

Author: Titanium Media APP

Image source: Pixabay

At present, clusters of 10,000 GPUs or more, known as Wanka (10,000-card) clusters, have become a strategic resource that the major technology giants are racing to deploy.

Titanium Media has previously explained that the core of a Wanka cluster is to combine tens of thousands of GPU computing units into a high-performance computing resource. Since GPUs are the key force driving AI training and inference, whoever holds the most GPU cards is regarded as the number-one player in AI. According to CB Insights, Nvidia accounts for about 95% of the machine learning GPU market, with more than 40,000 companies purchasing its GPUs; Meta, Microsoft, Amazon, and Google together contribute about 40% of its revenue. Taking Meta as an example: in 2022 it announced an AI Research SuperCluster with 16,000 NVIDIA A100s, and in early 2024 it announced the completion of two 24,576-GPU clusters, with the goal of building infrastructure including 350,000 NVIDIA H100 GPUs by the end of 2024.

In fact, it is not only international vendors: domestic vendors have also purchased large numbers of GPUs to push forward Wanka cluster construction. Since the beginning of this year in particular, the three major operators have announced deployments of clusters at or above the 10,000-card scale. For example, China Mobile will bring three independent, controllable 10,000-card clusters into commercial use in Harbin, Hohhot, and Guiyang this year, with a total scale of nearly 60,000 GPU cards. In the first half of this year, China Telecom planned to build a domestic 10,000-card computing power pool in Shanghai with total computing power exceeding 4,500P, which will be China's first ultra-large-scale domestic liquid-cooled computing cluster. China Unicom said it will build its first 10,000-card cluster in the Shanghai Lingang International Cloud Data Center this year.

Among them, several types of key players have emerged: telecom operators, leading Internet companies, and leading AI vendors. It is worth noting that in recent years, newly established AI startups building foundation models have also been renting or purchasing computing power backed by investment from cloud vendors, similar to the OpenAI + Microsoft cooperation model. Titanium Media has sorted out the data released by leading vendors at home and abroad, as shown in the figure.

[Figure: Wanka cluster deployments disclosed by leading vendors at home and abroad]

Today, model parameter counts have grown from the hundreds of billions to the trillions, and the demand large models place on underlying computing power has escalated accordingly. Taking GPT-4 as an example, it has 16 expert models with a total of about 1.8 trillion parameters, and a single training run requires roughly 90 to 100 days on about 25,000 NVIDIA A100s, which shows the enormous consumption of computing power. Against this backdrop, scattered small-scale computing resources fall far short, while Wanka clusters can provide sufficient computing power to support large-scale model training.
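To put those figures in perspective, here is a rough back-of-the-envelope calculation (a sketch using only the numbers cited above, not an official NVIDIA or OpenAI figure):

```python
# Rough back-of-the-envelope check of the figures cited above (a sketch,
# not an official NVIDIA or OpenAI number).
gpus = 25_000        # approximate A100 count for one GPT-4-scale run
days = 95            # midpoint of the 90-100 day training window
gpu_hours = gpus * days * 24
print(f"~{gpu_hours:,} A100-hours per training run")  # ~57,000,000
```

Tens of millions of GPU-hours per run is exactly the scale of demand that no collection of scattered small clusters can absorb.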

In addition, the era of rapidly developing AI applications brings challenges completely different from those of traditional general-purpose computing. A CPU-centric architecture alone can no longer meet current needs, and multi-chip computing systems that combine AI chips such as GPUs and TPUs are being repeatedly discussed and validated by the industry.

Server vendors focus on interconnection

Infrastructure service companies see an opportunity here. At this stage, the AI market's requirements for data center construction are shaping how server companies read the market and play their hands. Focusing on customers' AI scenarios, they concentrate on government and enterprise customers, operators, and leading Internet companies, and are gradually expanding into more industries. This business motivation is also prompting server companies to work more closely with upstream hardware partners.

One of the core elements is chip-level interconnect technology, because it enables high-speed, direct communication between chips inside a single server and is the foundation for efficient collaboration in large-scale computing clusters.
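As a small illustration, PyTorch exposes a way to probe whether the GPUs inside one server can address each other directly; this is a generic sketch and says nothing about any particular vendor's interconnect stack:

```python
import torch

# Enumerate GPU pairs in this server and check peer-to-peer (P2P) access,
# the direct chip-to-chip path (e.g. over NVLink or PCIe) described above.
if torch.cuda.is_available():
    n = torch.cuda.device_count()
    for i in range(n):
        peers = [j for j in range(n)
                 if i != j and torch.cuda.can_device_access_peer(i, j)]
        print(f"GPU {i} has direct P2P access to GPUs: {peers}")
```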

Xu Run'an, senior vice president of New H3C Group and president of its cloud and computing-storage product line, said that from the perspective of the whole server, vendors must keep pace with the market. H3C hopes to be a more open platform and the preferred choice for upstream GPU manufacturers, using its own network advantages and its understanding of network communication to help more GPU manufacturers achieve better interconnection and interoperability of computing power.

In his view, the goal in the past may have been to make a single chip with stronger computing power; now the industry is working from another angle: how to assemble chips into larger clusters while making the cluster's communication more efficient and its processing power stronger.

Xu Run'an noted that the biggest challenge in interconnecting clusters across regions is inter-region latency. After evaluation, H3C believes that at distances beyond 1,000 kilometers, latency causes a sharp decline in large-scale model training efficiency, a problem that current WAN technology cannot solve.

To address this, H3C's scientific computing scheduling platform Alpha 3.0 provides unified management across clusters, splitting training tasks and placing each subtask in a suitable near-end or remote cluster. This operation is difficult to achieve, however, when a job consists of a single model and a single subtask.


Take the "computing network brain," a joint project between H3C and a university, as an example. To solve the problem of scheduling computing resources across locations, the computing network brain manages all computing resources as a unified pool and visualizes computing power information and the number of scheduled tasks. When allocating tasks, it follows an affinity principle: tasks are placed in nearby clusters that are idle and closely related to the task, maximizing efficient resource utilization. At the same time, deterministic networking technology keeps latency within a bounded range, which not only solves the allocation of scheduling resources within each center but also enables collaboration across the whole cluster.
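To make the affinity principle concrete, here is a toy sketch of latency-aware placement. The `Cluster` and `Subtask` structures and the scoring rule are hypothetical illustrations, not the actual computing network brain API:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    free_gpus: int
    latency_ms: float  # measured latency from the subtask's data source

@dataclass
class Subtask:
    name: str
    gpus_needed: int

def place(subtask: Subtask, clusters: list[Cluster]) -> Cluster | None:
    """Affinity principle: prefer the nearest (lowest-latency) idle cluster."""
    candidates = [c for c in clusters if c.free_gpus >= subtask.gpus_needed]
    if not candidates:
        return None  # no capacity anywhere: queue the subtask
    best = min(candidates, key=lambda c: c.latency_ms)
    best.free_gpus -= subtask.gpus_needed
    return best

clusters = [
    Cluster("near-end", free_gpus=4096, latency_ms=2.0),
    Cluster("remote", free_gpus=8192, latency_ms=28.0),
]
for t in [Subtask("embedding", 2048), Subtask("pretrain-shard", 6144)]:
    chosen = place(t, clusters)
    print(t.name, "->", chosen.name if chosen else "queued")
```

The first subtask lands on the near-end cluster; the second no longer fits there and falls through to the remote one, which is the near/remote split described above in miniature.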

The path for Chinese enterprises to build Wanka clusters

Given the difficulty of building a Wanka cluster, many domestic enterprises are partnering with infrastructure service providers, such as server and chip vendors, to jointly tackle this thorny problem.

At the technical level, the first requirement is large-scale, efficient training. Since model training must be distributed across many GPUs, with heavy communication between them to make progress, achieving efficient distributed training and communication is one of the major challenges a Wanka cluster faces. The second requirement is high training stability at scale. An AI training cluster requires a very complex network structure, and once the cluster scales up, system reliability drops sharply; failures frequently interrupt the training process and significantly increase training costs. Ensuring the efficiency and stability of the entire training process is therefore another key problem a Wanka cluster must solve.
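As a minimal sketch of both requirements, the following PyTorch snippet trains with DistributedDataParallel, which synchronizes gradients across GPUs via all-reduce, and writes periodic checkpoints so an interrupted run can resume; the model, paths, and hyperparameters are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; NCCL is the usual GPU backend.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = DDP(torch.nn.Linear(1024, 1024).cuda())  # stand-in for a real model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10_000):
        x = torch.randn(32, 1024, device="cuda")     # stand-in for real batches
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # DDP all-reduces gradients across all GPUs here
        opt.step()
        if step % 1_000 == 0 and dist.get_rank() == 0:
            # Periodic checkpoint: an interrupted run resumes from here instead
            # of restarting, which is the stability lever described above.
            torch.save({"step": step, "model": model.module.state_dict()},
                       f"ckpt_{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

At Wanka scale the same two ideas still apply, only with far more elaborate parallelism strategies and failure-recovery machinery layered on top.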

In addition, domestic companies are affected by the tight supply of upstream NVIDIA GPU chips such as the A100. Although domestic AI chips have made great progress in the past two years, a gap remains in overall performance and ecosystem maturity. At this stage, enterprises still rely mainly on NVIDIA GPUs and supporting equipment, and Chinese enterprises are still in the early stage of building Wanka clusters.

Building a Wanka cluster on the domestic ecosystem and leading technology therefore faces many challenges: extreme computing power efficiency, massive data processing, ultra-large-scale interconnection, and the design of high-power-consumption, high-density computer rooms.

Judging from the core design principles for clusters above the 10,000-card scale proposed in the "White Paper on New Intelligent Computing Technology for Ultra-10,000-Card Clusters," Chinese enterprises need to make efforts in computing, storage, networking, platforms, and computer room support.


For example, the white paper elaborates on the chip interconnection problem discussed above, proposing to build peak single-node computing power on Scale-up interconnection, to push a single cluster beyond 10,000 cards with Scale-out interconnection, and to superimpose the two to build the large computing power base of an ultra-10,000-card cluster.

This framing also distinguishes two different ways of scaling: Scale-up and Scale-out.

According to Titanium Media App, Nvidia currently champions Scale-up: the NVLink it has built provides an efficient, scalable chip-to-chip communication protocol that lets all GPUs communicate at full speed simultaneously. Its official products have been upgraded from NVL36 and NVL72 toward NVL576, that is, from 36 or 72 GPUs in one domain to as many as 576. This further eases the Scale-out interconnection problem faced by customers building 10,000-card clusters today. Amazon, Google, Microsoft, and Oracle are already planning to use NVL72-based clusters in their cloud service offerings.
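The relationship between the two dimensions is simple multiplication: the Scale-up domain size times the Scale-out node count gives the cluster size. A quick illustrative sketch (the figures are examples, not vendor specifications):

```python
# Cluster size = Scale-up domain size (GPUs per tightly coupled node or rack)
# x Scale-out node count. Figures below are illustrative, not vendor specs.
def cluster_size(scale_up_gpus: int, scale_out_nodes: int) -> int:
    return scale_up_gpus * scale_out_nodes

print(cluster_size(8, 1250))   # 10,000 GPUs from conventional 8-GPU servers
print(cluster_size(72, 139))   # ~10,000 GPUs from NVL72-style 72-GPU racks
```

The larger the Scale-up domain, the fewer nodes the Scale-out fabric has to stitch together, which is exactly why pushing NVLink domains from 72 toward 576 GPUs eases the 10,000-card interconnection problem.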

Returning to the domestic ecosystem: how can it meet the computing power requirements that large-scale model training and inference place on Wanka clusters, such as high-speed direct communication between chips within a single server and efficient collaboration across a large-scale computing cluster?

Zheng Huiping, chief product manager of H3C Group's intelligent computing product line, added that heterogeneous integration will be the direction of future development, which raises the challenge of how to schedule heterogeneous resources quickly. At the same time, the management of Wanka cluster resources also needs to be implemented at the business level.

Combined with the white paper, the solutions for communication between GPU cards can be sorted into three: first, promote super-node servers that go beyond the single 8-card form factor; second, introduce Scale-up switch chips to optimize the efficiency and scale of the GPU's southbound interconnect; and third, optimize the interconnection protocol between GPU cards to achieve a leap in communication efficiency. The domestic intelligent computing industry is taking its own path, drawing on its own strengths. (This article was first published on Titanium Media APP; author: Yang Li, editor: Gai Hongda)
