Alibaba Cloud will take the lead in formulating international standards for next-generation intelligent computing network architectures

Author: Yang Jianyong, 2024-05-16 14:15:00

Since the birth of ChatGPT, global artificial intelligence has moved up a level and entered the era of large AI models. However, the parameter counts and training-set sizes of large models have grown so much that improving the computing power of GPU chips alone can no longer meet demand, and the industry's attention has turned to innovation at the system architecture level. The network, a core underlying technology, has become a key area for breakthroughs, and global technology companies are now locked in a fierce race over AI network technology.

In this context, China's AI networking has achieved a new breakthrough. According to the latest news, Alibaba Cloud's self-developed HPN7.0 high-performance network for AI intelligent computing clusters has been recognized by the international academic community: it is the first paper on this topic to be accepted in the more than 50-year history of SIGCOMM. Just yesterday, the Ultra Ethernet Consortium (UEC) announced the new members of its Technical Committee. Alibaba Cloud is the only Chinese company selected, and it will take a leading role in developing the next generation of AI network architecture standards alongside Microsoft and Meta.

Led by Alibaba Cloud, China's underlying technology seizes the opportunity in the second half of the AI race

In the past year, as AI models have exploded, UEC, an international open organization that at first glance seems to have little to do with AI, has become the organization that tech giants are most eager to join.

In July 2023, the Ultra Ethernet Consortium (UEC) was founded as an open organization under the Linux Foundation. It is committed to building new ultra-large-scale network technologies and systems for the AI era. It aims to deliver an open, interoperable, high-performance, full-communication-stack architecture based on Ethernet that meets the high-bandwidth, low-latency requirements of HPC and AI.

Put simply, UEC is developing AI infrastructure networks that better support large model training and inference. This is among the most critical underlying technologies in AI, so it has attracted some of the world's most powerful technology companies, such as Microsoft, Meta, AMD, Intel, and Cisco.

More importantly, UEC also develops international technical standards, which is strategically significant for the future of both underlying AI technologies and upper-layer applications. Within UEC, a dedicated Technical Committee is responsible for setting the technology roadmap, steering the core technologies and directions, coordinating the work of each group, and coordinating all technical proposals and standards work.

Just yesterday, three new members were added to the UEC Technical Committee. Alibaba Cloud, the only Chinese company selected, will work with the committee's 12 members, including Microsoft, Meta, AMD, and Broadcom, to jointly advance research and development of open network systems and core technologies, formulate standards, and build the next generation of AI network infrastructure.

This is the latest breakthrough for a Chinese company within UEC. It is worth noting that, with Alibaba Cloud's selection to the Technical Committee leading the way, Chinese companies including Huawei, ByteDance, Baidu, and H3C are also important members of UEC. Chinese companies and Chinese technology are forming a synergy and playing an increasingly important role in the second half of the AI contest.

China's new practice in AI networking is expected to succeed Google's as the new paradigm

Generative AI and large models are in full swing around the world, and demand for computing power is rising sharply with them. According to data from the China Academy of Information and Communications Technology, the global computing power of computing equipment in 2023 is put at 1,369 EFLOPS. China's computing power ranks second in the world, and its growth rate is higher than the global average.

To improve the network's capacity to carry computing power efficiently, the "Action Plan for the High-Quality Development of Computing Infrastructure", issued in China in October 2023, states that it is necessary not only to strengthen computing power access networks but also to improve the transmission efficiency of hub networks, including accelerating the R&D and deployment of 400G/800G high-speed optical transmission networks, in order to promote the high-quality development of computing infrastructure.

In this context, Chinese companies have launched a series of cutting-edge explorations in AI infrastructure, and the latest achievement comes from Alibaba Cloud. The HPN7.0 architecture proposed by Alibaba Cloud has been accepted by SIGCOMM, the top international conference in communication networks, making it the first academic paper on AI high-performance network architecture accepted there.

SIGCOMM is the oldest top-tier academic conference in computer networking. From the TCP/IP protocols found in computer science textbooks to the classic architectures of cloud data centers, SIGCOMM has witnessed the birth and development of many key computer network technologies.

Google's Jupiter data center network architecture set a global industry benchmark through SIGCOMM and became the most representative network architecture of the CPU era. Now Alibaba Cloud's new HPN7.0 architecture has taken up the baton, and experts believe it is very likely to become the new paradigm of network architecture in the GPU era.

HPN7.0 is designed around the characteristics of large model training: large scale, large flows, strong bursts, and high stability requirements. It introduces a "dual-uplink + multi-rail + dual-plane" network architecture and combines the latest generation of 51.2 Tbps single-chip Ethernet switches and 400G high-performance network cards with the self-developed Solar-RDMA protocol and ACCL communication library. Together these provide high-performance, highly stable interconnection of thousands of GPU cards within a single network tier and tens of thousands of cards across two tiers.

The new network architecture based on HPN7.0 can scale clusters to as many as 100,000 cards, efficiently connect heterogeneous computing resources, break through the bottleneck of single-chip performance, and truly turn cloud computing into an intelligent supercomputer.

Not long ago, Alibaba Cloud released version 2.5 of its Tongyi Qianwen large model, whose Chinese-language performance has fully caught up with GPT-4 Turbo. The model was trained on an HPN7.0 high-performance network cluster.
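The port arithmetic behind these figures can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the two-tier formula is the textbook non-blocking leaf-spine bound rather than HPN7.0's actual dual-plane layout, and the only numbers taken from the article are the 51.2 Tbps switch capacity and the 400G port speed.

```python
def switch_radix(switch_capacity_gbps: int, port_speed_gbps: int) -> int:
    """Number of ports a single switch chip can expose at a given port speed."""
    return switch_capacity_gbps // port_speed_gbps


def two_tier_endpoints(radix: int) -> int:
    """Upper bound on endpoints in a generic non-blocking leaf-spine fabric.

    Assumption: each leaf uses half its ports as downlinks and half as
    uplinks, and each spine port connects to a distinct leaf, so there
    can be at most `radix` leaves. This is NOT a model of HPN7.0 itself.
    """
    downlinks_per_leaf = radix // 2
    max_leaves = radix
    return downlinks_per_leaf * max_leaves


radix = switch_radix(51_200, 400)   # 51.2 Tbps chip, 400G ports
print("ports per switch:", radix)
print("endpoints under one leaf:", radix // 2)
print("endpoints across two tiers:", two_tier_endpoints(radix))
```

Under these assumptions a generic two-tier fabric tops out in the thousands of ports, which suggests why design choices such as dual uplinks and dual planes matter for reaching the scales the article describes; the published HPN7.0 design should be consulted for the real topology.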

In the battle to break through on GPUs, Chinese enterprises carry the banner of open source and open systems

The success of HPN7.0 shows that optimizing and innovating on the underlying system architecture, for example through network technology and the integration of heterogeneous computing resources, can also support the training of large AI models with ultra-large parameter counts.

Using system-level innovation to amplify the potential of the hardware, and so gaining a decisive top-down advantage, is an approach to technology iteration of great strategic significance in today's international competition. What is rarer still is that it aligns fully with the technology choices of the world's mainstream tech giants.

There are currently two main technical routes for AI high-performance networking. One is the InfiniBand system led by NVIDIA. This architecture is relatively closed: InfiniBand as used with NVIDIA GPUs has essentially become NVIDIA's private protocol, and it cannot take full advantage of today's thriving Ethernet ecosystem.

The other is the open-source, open-standard system led by UEC, whose founding members include chip and hardware companies such as AMD, Intel, Cisco, and Broadcom. It aims to promote innovation through open-source and open technical cooperation and to compete with NVIDIA in the GPU-based AI era. The rivalry has therefore been likened to Apple (corresponding to NVIDIA) versus Android (corresponding to UEC) in the field of AI networking.

This is undoubtedly a new strategic opportunity for Chinese technology. Back then, the rise of domestic smartphones was inseparable from the open-source Android system. Building on open-source technology, China's mobile phone makers gradually moved toward comprehensive in-house development, from whole-device manufacturing to operating systems and now to chips, and then contributed back to the international open-source community, making outstanding contributions to the progress of mobile phone technology worldwide.

There is reason to believe that, as China's self-developed AI technology continues to accumulate and break through, and as Chinese companies such as Alibaba Cloud play an increasingly important role in the open UEC, China will grow more confident and have a stronger voice in AI, a technological high ground that will be contested in the future.

Yang Jianyong is a writer for Forbes China. The views expressed are his own. He focuses on in-depth interpretation of cutting-edge technologies such as the Internet of Things, cloud services, artificial intelligence, and smart home.
