[Digital Age] What kind of computing power network does AI+ need?

Author: Shandong Provincial Association for Science and Technology

In the AI+ era, computing power will play an increasingly important role and become a key driver of industrial upgrading and productivity gains. China's computing power network is now at a critical stage: a period of integration and unification, of rapid progress toward ubiquitous intelligence, of breakthroughs in original technology, and of shaping the industrial ecosystem. To better serve the AI+ era, the computing network needs to achieve "three qualitative changes": a qualitative change in infrastructure, centered on large-scale intelligent computing clusters; a qualitative change in orchestration and scheduling, centered on the intelligent upgrade of the "computing network brain"; and a qualitative change in the service model, centered on a converged, unified "computing tap".

How can these goals be achieved so that the computing network fully supports AI+? At the 2024 China Mobile Computing Network Conference, China Mobile presented its latest practices and roadmap.

Build large clusters: a super factory for AI model training. China Mobile will continue to optimize the overall layout of its computing network resources for the AI+ upgrade. This year it will put three independently controllable 10,000-card clusters into commercial service in Harbin, Hohhot, and Guiyang, totaling nearly 60,000 GPU cards, to meet the demand for centralized training of large models. As large models move from training to large-scale industry application and demand for ubiquitous inference keeps growing, China Mobile will deploy inference computing power on demand at 1,500 edge nodes, forming an intelligent computing network with "large clusters at the center, wide distribution at the edge, and integrated training and inference". At the same time, it will keep improving its technical system and promoting full-stack technology innovation. The first priority is to break through bottlenecks and accelerate the move toward clusters of more than 10,000 cards. For machine-to-machine interconnection, China Mobile has proposed its original Global Scheduling Ethernet (GSE) technology system to build a new non-blocking, high-bandwidth, ultra-low-latency network for intelligent computing centers. GSE is benchmarked against the mainstream international InfiniBand (IB) and Ultra Ethernet Consortium (UEC) solutions and forms an independent Chinese technical system; GSE pilot tests will be carried out this year to speed up the maturing of its key technologies and industry. For card-to-card interconnection, China Mobile is building a standard, open, bus-level interconnect architecture to improve high-bandwidth, low-latency communication between GPU cards, optimize interconnect topologies and protocols across the full stack, and contribute a Chinese solution to new standard, open intelligent computing interconnects. The second priority is heterogeneous diversity: building an integrated, open ecosystem for large-scale computing power.
This year, China Mobile will upgrade its "Xinhe" computing-native platform so that intelligent computing applications can be migrated quickly to more kinds of GPU chips, and so that large models can be trained in a distributed, heterogeneous, mixed fashion, removing the current restriction that a large model can only be trained on a cluster from a single vendor with a single chip model. It will also pursue further breakthroughs in key cloud-foundation technologies, upgrade the "Dayun Tianyuan" operating system, and bring its cloud-native database and next-generation SDN network to commercial use. The third priority is integrated training and inference, to deliver out-of-the-box AI services. The self-developed intelligent computing platform builds an "automated production line" for model training and provides full-lifecycle services for AI models. It is full-stack and fully independently controllable, manages and schedules resources uniformly across all regions, and offers a one-stop development toolbox that supports parallel training on 10,000 cards, 15 days of stable training on 1,000 cards, and minute-level resumption of training from breakpoints (checkpoints), so that large models train well, quickly, and stably.
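
Resuming training from a breakpoint generally relies on saving checkpoints at regular intervals and reloading the latest one after a failure. The following is a minimal sketch of that idea only; it says nothing about how China Mobile's platform works. The function names, the JSON checkpoint format, and the toy "loss" update are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, save_every=100, fail_at=None):
    """Toy training loop that resumes from the last checkpoint if one exists."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # resume instead of restarting from step 0

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["loss"] *= 0.999  # stand-in for a real optimizer update
        if state["step"] % save_every == 0:
            # Write to a temporary file and rename it, so a crash in the
            # middle of a write never leaves a corrupt checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state


def demo():
    """Interrupt a run at step 450, then resume it from the last checkpoint."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        try:
            train(1000, path, fail_at=450)
        except RuntimeError:
            pass  # the "failure" we want to recover from
        with open(path) as f:
            resumed_from = json.load(f)["step"]
        final = train(1000, path)
        return resumed_from, final["step"]
```

In this sketch the interrupted run loses only the steps since the last save (steps 401 to 450), which is why the checkpoint interval largely determines how quickly training can pick up again.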

Clear the arteries: an information highway for AI data flows. In the near term, China Mobile has drawn on its network strengths, accelerated the launch of 400G high-speed interconnect links between national hub clusters, opened up elastic network capabilities, and built a new transport network with high bandwidth, wide coverage, low latency, and intelligence, further reducing the cost of moving workloads to western regions and contributing to the shared transmission channels within and between national hub nodes. In the medium and long term, it will lead the development of a multi-level converged network solution featuring "high throughput, low latency, and integration". The first is high throughput: to address the performance bottleneck of long-distance network transmission, China Mobile is developing a new high-throughput transport protocol, which will be jointly verified this year with the National Astronomical Observatories and BGI, to provide a long-distance, high-throughput, highly elastic, wide-coverage, and highly secure data express service. The second is low latency: after five years of joint work with industry on anti-resonant hollow-core fiber, China Mobile has designed its own fiber structure, which cuts transmission delay by 33% compared with traditional solid-core fiber. A 20-kilometer pilot will start this year and is expected to extend beyond 100 kilometers next year, which could reshape the optical communications industry. The third is integration: prototype verification shows that average service delay is reduced by 15% and system capacity is increased by 30%.
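
To put the 33% delay figure in perspective, here is a rough back-of-the-envelope sketch. Only the 33% reduction and the 20 km and 100 km distances come from the text above; the group index of 1.468 for solid-core silica fiber is an assumed typical value, and the function names are made up for this example.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum
SOLID_CORE_GROUP_INDEX = 1.468   # assumption: typical value for silica fiber
REPORTED_REDUCTION = 0.33        # delay reduction quoted in the article


def one_way_delay_us(distance_km, group_index):
    """Propagation delay in microseconds through a medium with this group index."""
    return distance_km * group_index / C_VACUUM_KM_PER_S * 1e6


def compare(distance_km):
    """Return (solid-core delay, hollow-core delay, saving), all in microseconds."""
    solid = one_way_delay_us(distance_km, SOLID_CORE_GROUP_INDEX)
    hollow = solid * (1 - REPORTED_REDUCTION)
    return solid, hollow, solid - hollow


if __name__ == "__main__":
    for km in (20, 100):
        solid, hollow, saved = compare(km)
        print(f"{km} km: solid {solid:.1f} us, hollow {hollow:.1f} us, saved {saved:.1f} us")
```

Under these assumptions, light takes roughly 490 microseconds to cross 100 km of solid-core fiber one way, so a 33% reduction would save on the order of 160 microseconds per direction.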

Build a powerful hub: the strongest brain for distributing AI tasks. The computing network brain is a network-based, distributed system for dispatching computing tasks. Building on last year's trial commercial use, China Mobile will comprehensively raise the scheduling capability and intelligence of the whole network. First, this year it will manage its own intelligent computing centers and edge nodes while widely bringing in third-party computing power, achieving integrated scheduling of intelligent computing and edge resources and efficient circulation of data across the network; it will open up more than 3,000 computing network capabilities, with multi-element capabilities fully covering ABCDNETS. Second, performance is improved: the number of daily scheduling operations rises from 10 million to 100 million, more dimensions such as energy efficiency are introduced to solve high-dimensional combinatorial optimization problems, and new parallel algorithms speed up distributed cross-cluster task scheduling. Third, service capabilities take a leap: integrating the Jiutian computing network large model, China Mobile will introduce AI-driven interactive ordering for a new, personalized and intelligent mode of interaction, continuously empower new computing network services, and upgrade traditional products that are local and single-purpose into global, composite ones.
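
Scheduling across several dimensions at once, such as latency, energy efficiency, and price, can be illustrated with a small example. What follows is a simple greedy heuristic with a weighted score, offered as a sketch only; it is not the algorithm used by the computing network brain, and the class names, fields, and weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int
    latency_ms: float               # network latency to the task's data
    energy_kwh_per_gpu_hour: float  # lower means more energy-efficient
    price_per_gpu_hour: float


@dataclass
class Task:
    name: str
    gpus: int


# Hypothetical weights for each cost dimension; they sum to 1.
WEIGHTS = {
    "latency_ms": 0.4,
    "energy_kwh_per_gpu_hour": 0.4,
    "price_per_gpu_hour": 0.2,
}


def normalise(value, values):
    """Min-max scale a value to [0, 1] so different units can be combined."""
    lo, hi = min(values), max(values)
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def score(node, nodes):
    """Weighted sum of normalised costs across all dimensions; lower is better."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        values = [getattr(n, field) for n in nodes]
        total += weight * normalise(getattr(node, field), values)
    return total


def schedule(tasks, nodes):
    """Greedy placement: largest tasks first, each on the best-scoring node with room."""
    placement = {}
    for task in sorted(tasks, key=lambda t: t.gpus, reverse=True):
        candidates = [n for n in nodes if n.free_gpus >= task.gpus]
        if not candidates:
            placement[task.name] = None  # no node can host this task
            continue
        best = min(candidates, key=lambda n: score(n, nodes))
        best.free_gpus -= task.gpus
        placement[task.name] = best.name
    return placement
```

A greedy pass like this is fast but not optimal; with many tasks, nodes, and constraints the problem becomes a combinatorial optimization, which is why production schedulers typically turn to more sophisticated and parallel solvers.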

Shape a large industry: a thriving "tropical rainforest" of AI application innovation. The first is to strengthen the foundations of innovation. China Mobile has already built an initial standards system for computing networks covering international and domestic bodies; in particular, in the computing routing working group established by the IETF, it is leading the overall architecture design for computing-network integration. Going forward, it will further improve the standards system, accelerate the drafting of standards such as those for intelligent computing center networks, and contribute more Chinese solutions to global standards. The second is to let innovation flourish: accelerating the building of future industries and innovation consortia, deepening innovation in service models such as grid-connected computing power and task-based services, stimulating AI+ application innovation, and improving business models. At the same time, China Mobile is speeding up construction of an intensive, efficient data network and data circulation infrastructure, so that high-quality data can come alive, flow, and be put to use, in support of a unified national market for data elements. By the end of this year, China Mobile's grid-connected computing power will exceed 5 EFLOPS, with more than 80 computing network service showcases and more than 10 data network trading nodes. The third is to cultivate fertile soil for innovation: relying on the cross-regional, multi-party national experimental scientific facility for computing networks, China Mobile will join with more industry, university, and research partners to support the construction of a prototype test bed for the national integrated computing network and to incubate more new computing power network technologies and applications.
At the same time, as a centrally administered state-owned enterprise, China Mobile will make full use of its strengths in intelligent computing resources, security, and operations to help the whole of society use intelligent computing services conveniently and quickly, and to create a "tropical rainforest" of AI innovation.

Source: People's Post and Telegraph

Author: Yixin Linjiang
