"Computing power × connection" H3C Network makes AI computing more inclusive

Author: Zhongguancun Online

In the era of generative AI, what is needed is not only large-scale accelerated computing but also high-quality intelligent computing centers. Computing power, networking, storage, and operations and maintenance (O&M) are all indispensable. Training and inference for large models require thousands of GPUs, and the supercomputing clusters formed by connecting them depend on highly reliable, low-latency, resilient intelligent computing networks. For the development of China's computing industry in particular, the parallel and collaborative development of computing clusters matters even more. For example, GPT-3, with 175 billion parameters, was trained on a cluster of 10,000 V100 GPUs and 285,000 CPU cores, in which each GPU server needs 400 Gb/s of network bandwidth to keep up with the computing demand, to say nothing of GPT-4, which is reported to have parameters on the order of trillions.

The "Implementation Opinions on Accelerating the Construction of a National Integrated Computing Network" (hereinafter the "Opinions"), issued by the National Development and Reform Commission and other departments, set the following targets for the end of 2025. A comprehensive computing infrastructure system that is inclusive, easy to use, green, and secure will have taken initial shape. The mechanism for coordinated scheduling of computing power between eastern and western regions will be gradually improved, and the clustering of general-purpose, intelligent, and supercomputing power will accelerate, with new computing power in national hub node areas accounting for more than 60% of all new computing power nationwide. The utilization rate of computing resources at national hub nodes will significantly exceed the national average. Urban computing networks with 1 ms latency, regional computing networks with 5 ms latency, and computing networks across national hub nodes with 20 ms latency will be initially realized in demonstration areas. A two-way coordination mechanism between computing power and electricity will have taken initial shape, and green electricity will account for more than 80% of the power used by new data centers at national hub nodes. Users will find all kinds of computing power markedly easier to use and markedly cheaper, and network transmission costs between national hub nodes will fall substantially. Key core technologies of the computing network will be basically secure and reliable, and a high-quality development pattern for the computing network, characterized by networking, inclusiveness, and greening, will gradually take shape.

"If in the past, the network was an analysis level and a single point of connection, then in the era of intelligent computing, the extension and endogenous connection between networks are expanding, penetrating into the inside of servers, between servers, between data centers, and between WANs and campus networks." Zeng Fugui, Senior Vice President and President of Network Product Line of H3C Group, said, "The heterogeneity of diversified computing power and the greatly improved computing performance require stronger network support, so H3C Network has also ushered in a comprehensive upgrade, evolving from 'computing power + connection' to 'computing power × connection'. ”

"Computing power × connection" H3C Network makes AI computing more inclusive

Zeng Fugui, Senior Vice President of New H3C Group and President of Network Product Line

"Computing power × connection" can be described as the technical foundation for H3C's "AI in All" and "AI for All" strategies. As demand for computing power explodes, connections within and between data centers and massive data transfers place more stringent requirements on the network, including lossless data communication over ultra-long distances. High-quality, deterministic network connectivity is therefore particularly important, and it requires a coordinated upgrade of the network infrastructure, the control and management layer, and the O&M service layer. More importantly, a "smart brain" is needed to collect, analyze, and visualize data from the entire network in real time, keeping the network flexible and stable.

Building on the concept of "computing power × connection", H3C released the Lingxi large model and put it into practice through AD-NET 7.0. AD-NET 7.0 fully integrates AI capabilities, can be deployed in the cloud or on premises, and works with the NAI (Native Artificial Intelligence) technology embedded in network equipment to upgrade capabilities in data center, campus network, WAN, and other scenarios. With the whole network empowered by the Lingxi model, each scenario has its own goal. The data center targets computing power itself, producing it with higher efficiency and lower energy consumption. The campus network aims to be intelligent, fast, and simple, providing ultra-broadband access to computing applications anytime and anywhere. The WAN aims at intelligent service sharing, managing the transmission of computing power in a business-oriented, fine-grained way to meet the need for efficient scheduling of computing power across regions. Drawing on H3C's intelligent computing capabilities, industry knowledge, and practical experience, AD-NET 7.0 has evolved from an "application-driven network" to an "application + AI-driven" one.

Meanwhile, the explosion in model parameters has pushed the number of accelerator cards into the hundreds of thousands. Because the relationship between parameter count and hardware count is nonlinear, the corresponding number of network devices, ports, optical modules, and so on rises into the millions. Data centers will become ever denser and port density will increase, to the point where a single port carries multiple connection cores. The first problem that scale brings is network resilience, which involves challenges in technology, architecture, energy, and other areas, for example lossless routers across metro areas and lossless data center networks.

The second problem is power supply and cost within limited space. Supporting 1,000 cabinets requires several megawatts of electricity, so the cost of electricity is as important as the cost of the computing network.

Li Yutao, vice president of New H3C Group, vice president of the network product line, and general manager of the switch product line, said that H3C focused on energy consumption when developing its 400G and 800G switches, which support LPO (linear-drive pluggable optics) modules. A traditional 400G optical module consumes 10 to 12 watts, whereas an LPO module consumes 6.5 watts, a significant reduction at scale. H3C has also launched a silicon photonics switch based on CPO (co-packaged optics), which saves on optical-module energy consumption and greatly benefits the long-term operation of the equipment room.
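
As a rough illustration of the figures quoted above, the following sketch estimates the optics power saved when traditional 400G modules (10 to 12 W each) are replaced by LPO modules (6.5 W each). The function names and the fleet size of 100,000 modules are assumptions chosen for illustration; they are not figures from H3C.

```python
def optics_saving_kw(n_modules: int, traditional_w: float, lpo_w: float = 6.5) -> float:
    """Total power saved, in kilowatts, when every module is swapped for LPO."""
    return n_modules * (traditional_w - lpo_w) / 1000.0


def saving_percent(traditional_w: float, lpo_w: float = 6.5) -> float:
    """Per-module reduction relative to the traditional module, in percent."""
    return 100.0 * (traditional_w - lpo_w) / traditional_w


N_MODULES = 100_000  # hypothetical fleet size, not a figure from the article

for watts in (10.0, 12.0):  # the 10-12 W range quoted for traditional 400G modules
    print(f"{watts:.0f} W baseline: "
          f"{optics_saving_kw(N_MODULES, watts):.0f} kW saved, "
          f"{saving_percent(watts):.1f}% per module")
```

Under these assumptions the saving is 350 to 550 kW across the fleet, or roughly 35% to 46% per module, before counting any knock-on reduction in cooling load.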

To ensure network reliability while large models are being trained, H3C's computing network products undergo rigorous testing at the factory, and dedicated guidance and specifications keep the failure rate of optical links after deployment very low. At the software level, load balancing and other technologies provide redundancy mechanisms that can switch over quickly when hardware fails. For example, with the DPSH protocol, the switchover after a link failure, which used to take seconds or milliseconds, now takes microseconds, and the hardware can sense link status by itself without any software intervention.

Computing infrastructure, AD-NET 7.0, and the Lingxi model thus form a virtuous cycle for H3C's network AI capabilities: the network strengthens computing, computing improves intelligence, and intelligence increases efficiency. AD-NET uses AIGC for efficient anomaly detection, trend prediction, fault diagnosis, and intelligent tuning. Lingxi Assistant is a typical application: it lets users obtain network knowledge, configurations, product recommendations, and other information in natural language, and helps complex networks tune themselves automatically. Similarly, AI can make system troubleshooting and protection more professional and network operation more efficient. Building on a knowledge corpus accumulated in the ICT field and the experience of tens of thousands of network experts, the Lingxi large model is obtained through training and fine-tuning and is used to upgrade unified O&M capabilities. This also embodies the integration of network and computing power: AD-NET provides the network infrastructure for generating and connecting computing power, bringing efficient computing power to AI model training and supporting the connection services that intelligent services require.

While computing power is being increased, its utilization must also be greatly improved through lossless, high-throughput networking. Load balancing is a representative technology here, including enhanced per-flow balancing, per-packet balancing, and cell-based balancing. In the past, traffic between different services ran at 10G or 20G, but now the difference between flows may reach hundreds of Gbit/s. This leaves per-flow load balancing badly unbalanced, which causes congestion and packet loss, lowers the efficiency of the whole model, and wastes network bandwidth and training time. Balancing per packet evens out the number of packets on each link, but packets may arrive out of order at the network edge, so the endpoint NIC or the network must reorder them, otherwise the application runs into problems. Not all NICs and network endpoints are able to reorder packets. To address this, H3C's box-box networking, chassis-box networking, and DDC networking support heterogeneous computing power as well as both standard NICs and smart NICs at the network edge, and can apply different load balancing algorithms according to customer needs, achieving global load balancing so that a non-blocking network makes use of every link.
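
The contrast between per-flow and per-packet balancing described above can be sketched in a few lines. This is a toy model, not H3C's algorithm: the CRC32 hash, the round-robin spraying, the flow names, and the traffic volumes (in arbitrary units) are all assumptions made for illustration, and packet reordering is not modeled.

```python
import zlib


def per_flow_load(flows, n_links):
    """Per-flow balancing: hash each flow ID to one link, so a flow never splits."""
    load = [0] * n_links
    for flow_id, units in flows:
        link = zlib.crc32(flow_id.encode()) % n_links
        load[link] += units
    return load


def per_packet_load(flows, n_links):
    """Per-packet balancing: spray unit-sized packets round-robin over all links."""
    load = [0] * n_links
    next_link = 0
    for _flow_id, units in flows:
        for _ in range(units):
            load[next_link] += 1
            next_link = (next_link + 1) % n_links
    return load


# Hypothetical traffic mix: two elephant flows and two small flows (arbitrary units).
FLOWS = [("gpu0->gpu8", 200), ("gpu1->gpu9", 200), ("gpu2->gpu10", 10), ("gpu3->gpu11", 10)]

print("per-flow  :", per_flow_load(FLOWS, 4))
print("per-packet:", per_packet_load(FLOWS, 4))
```

With four links, spraying spreads the 420 units evenly at 105 per link, whereas hashing pins each 200-unit flow to a single link, so at least one link carries 200 units or more no matter how the hash falls. The price of spraying is that packets of one flow take different paths, which is why the receiving side must be able to reorder them.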

"For example, for a 400G link, there is usually very little traffic data during the training process, and once the training is completed and the set calculation is done, the data will appear in a jagged form, the jitter is very large, and the traffic bandwidth is almost occupied. If multiple links are retransmitted at the same time, packet loss will occur, which requires load balancing. We can predict instability and schedule traffic to minimize packet loss. Li Yutao said, "The open standard Ethernet should give full play to the maximum lossless capability of RoCE through load balancing technology, whether it is DLB or global path planning, as well as with the agent software, our efficiency in the network link is obviously improved." "H3C's load balancing architecture can provide a suitable combination of load balancing technologies for different intelligent computing scenarios, and improve the scale and efficiency of computing power in intelligent computing centers.

H3C's diversified portfolio of products and technologies connects heterogeneous computing power effectively and supports open intelligent computing solutions. Different networking forms and schemes let customers choose freely among decoupled solutions, which greatly reduces the cost of building the network of an intelligent computing center and safeguards the diversity and continued reliability of the supply chain. For data centers, H3C has launched the H3C S12500 AI series of core switches for computing clusters, based on the DDC (Disaggregated Distributed Chassis) architecture, which aims to give users a distributed, decoupled chassis solution that is more scalable, easier to operate and maintain, and more cost-effective. The H3C S12500 AI series features cell-level load balancing, native losslessness, and very large scale. It can build a natively lossless network with zero packet loss, and it supports automated deployment and self-organizing networking of NCF and NCP nodes.

For campus networks, converged Ethernet all-optical + PON technology greatly increases user bandwidth at the access layer, further reduces the energy consumption and TCO of the campus network, and extends the service life of the whole network. H3C is also bringing more AI capabilities into the campus network, improving the efficiency and experience of O&M through finer-grained management, and building an intelligent, fast, and simple campus network so that computing power can be obtained and used anytime, anywhere. To this end, H3C has upgraded its all-optical network + Wi-Fi 7 solution to provide high-quality "last hop" access for scenario-based AI applications, and has innovated in areas such as the lightweight campus BRAS (Broadband Remote Access Server) and visualized intelligent management and O&M. In addition, H3C is speeding up the launch of new FTTD access products, scenario-based Wi-Fi 7 APs, and industrial switches.

"We will focus on the development of AI technologies to upgrade the intelligent operation and maintenance of campuses, such as wireless 4i technologies (iRadio, iStation, iEdge, and iHeal), and use AI algorithms to optimize the combination of software and hardware to make the overall network experience better." Zhao Yujin, vice president of New H3C Group and general manager of the wireless product line, said. Based on this, H3C has launched the Central AC solution integrated with wireless 4i technology, coupled with lightweight campus BRAS, which can greatly simplify the management complexity of wired and wireless user policies, reduce the O&M workload, and provide on-demand campus policy management and consistent campus user experience.

As the "Opinions" state, the main line is to empower high-quality economic development through the high-quality development of computing power. The national hub nodes of the national integrated computing network should play their leading role, and the "Eastern Data and Western Computing" project should be advanced jointly to form a cross-regional, cross-departmental force for coordinated development. This means coordinating five things: general-purpose, intelligent, and supercomputing power; the layout across eastern, central, and western regions and across large, medium, and small cities; the application of computing power, data, and algorithms; the construction of computing power and green electricity; and the safeguarding of both development and security in computing power. A national integrated computing network that is inclusive, easy to use, green, and secure will help build China into a cyber power and a digital China, and lay a digital foundation for Chinese-style modernization.

Unifying the core hub computing network, cross-regional computing networks, and urban computing networks depends on technology, cost, and bandwidth alike. For WAN computing networks, bandwidth, algorithms, and reliability are key. Taking core routing products such as the CR19000 and CR16000E-F as examples, H3C has made three upgrades. First, the routers provide a higher 400G forwarding rate and use deterministic networking technology to greatly reduce WAN latency and jitter. With DetNet, DetNetOAM, and related technologies, H3C routers can achieve ultra-low transmission latency of 1 millisecond in the metro, 5 milliseconds in the region, and 20 milliseconds in the core, with network jitter of 15 microseconds, which greatly improves the quality of the computing network. Second, a computing power factor is built into the routing algorithm embedded in network equipment, so that the WAN is inherently suited to carrying computing power. Third, users can build dedicated computing power channels on demand, and service-oriented dedicated lines for computing power are provided. Through features such as parameter selection, on-demand construction, automatic network setup, teardown after use, and dynamic bandwidth adjustment, H3C routers can further improve resource utilization and the network SLA of the computing network.
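
The article does not describe how H3C's routing algorithm weighs the computing power factor, so the following is only a minimal sketch of the general idea of computing-aware path selection: choose a compute site that both meets the latency target of its tier (1, 5, or 20 ms, as quoted above) and has enough free capacity. The `Site` class, the `pick_site` function, the `free_gpus` field, and all site data are hypothetical.

```python
from dataclasses import dataclass

# Latency targets quoted in the article, in milliseconds.
LATENCY_BUDGET_MS = {"metro": 1.0, "regional": 5.0, "cross_hub": 20.0}


@dataclass
class Site:
    name: str
    tier: str          # one of the keys of LATENCY_BUDGET_MS
    latency_ms: float  # measured latency from the user to this site
    free_gpus: int     # hypothetical stand-in for the "computing power factor"


def pick_site(sites, gpus_needed):
    """Return the lowest-latency site that meets its tier budget and has capacity."""
    eligible = [
        s for s in sites
        if s.latency_ms <= LATENCY_BUDGET_MS[s.tier] and s.free_gpus >= gpus_needed
    ]
    return min(eligible, key=lambda s: s.latency_ms, default=None)


SITES = [
    Site("metro-a", "metro", 0.8, 16),        # close, but short on GPUs
    Site("regional-b", "regional", 4.2, 256),
    Site("hub-c", "cross_hub", 23.0, 4096),   # misses the 20 ms budget
]

choice = pick_site(SITES, gpus_needed=64)
print(choice.name if choice else "no eligible site")
```

Here a request for 64 GPUs lands on `regional-b`: the metro site is closer but lacks capacity, and the hub site has capacity but exceeds its latency target. A production scheme would weigh many more factors, such as bandwidth, cost, and jitter.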

Wang Xiaoyong, general manager of the router product line of H3C Group, said: "Through product innovation, network-wide energy saving, and large-model optimization, network costs will be reduced further. At the same time, the network will become more flexible and elastic, and easier for customers to operate themselves. H3C's products deliver end-to-end IPv6+ capabilities from access and aggregation to the core network, and provide a common technical base from the urban computing network to the core computing network."

Amid the AIGC boom, every industry is becoming intelligent at an accelerating pace, and H3C Network's AD-NET has been upgraded from the application-driven stage to the dual-driven "Application + AI" stage. "H3C has made in-depth investments across cloud, network, security, computing, storage, and endpoints, and last year launched the Baiye Lingxi large model, which emphasizes private-domain deployment, specialization, and refinement," said Ao Xiangqiao, general manager of the intelligent management and O&M product line of New H3C Group. "On the network side, we have many years of accumulated technology and experience, and we are steadily integrating them into large-model capabilities. AD-NET takes AIGC as its starting point, gradually raises its level of intelligence, and further strengthens its role as an 'expert consultant'." Today, users can ask Lingxi Assistant in natural language to recommend solutions and configurations, to build networks automatically, and to answer all kinds of knowledge questions. It can also predict faults in specific scenarios, such as optical module diagnosis and traffic prediction, and the agent's capabilities are still improving.

Computing power and connectivity are integrating and reinforcing each other at an accelerating pace in the era of intelligent computing. With more than 20 years of experience in enterprise networking, H3C has released the multiplier effect of "computing power × connection" through innovation in data centers, campuses, and WANs, providing high-quality network connections for industry customers and advancing the intelligent transformation of industries. "We have used AI technology to empower and upgrade our network, and at the same time our network provides a solid foundation for AI innovation. We hope that H3C's network can contribute more to making computing power inclusive, which is not only a future technology trend but also our mission," Zeng Fugui said.
