
NVIDIA Spectrum-X Helps IBM Deliver a High-Performance Foundation for the AI Cloud

Author: NVIDIA China

In the era of hybrid cloud and AI, enterprises and organizations must create, analyze, and store massive amounts of data. In distributed application environments, that data tends to fragment into silos, producing systems that are complex to manage and costly to operate. To extract insights from data faster, the underlying information architecture must support hybrid cloud, big data, and artificial intelligence (AI) workloads alongside traditional applications, while ensuring security, reliability, data efficiency, and high performance. It must also scale seamlessly to keep pace with the exponential growth of unstructured data.

IBM Storage Scale is a high-performance, parallel data storage solution. It helps users reach the compute and analytics results they need faster and manage rapidly scaling data and infrastructure, while keeping data secure and reducing overall storage costs.


Figure 1: The need for data storage in AI and hybrid clouds

Facing the explosion of generative AI, the computing performance of GPU clusters is critical: it demands not only higher GPU compute and faster storage, but also a dedicated network infrastructure that keeps many nodes running in parallel at optimal performance. To this end, NVIDIA developed Spectrum-X, the industry's first Ethernet networking platform designed specifically for AI, built to enhance the performance and efficiency of AI clouds. At the heart of the Spectrum-X platform are NVIDIA Spectrum-4 Ethernet switches, NVIDIA BlueField-3 SuperNICs/DPUs, the NVIDIA DOCA and switch software stacks, and NVIDIA LinkX high-quality interconnects; together they form the foundation of an AI-accelerated computing network architecture. BlueField-3 SuperNICs and DPUs are integrated into systems for AI training, recommendation, and inference, meeting the multi-tenancy needs of Ethernet clouds while ensuring the best compute and storage performance for AI clusters.


Figure 2: Overview of the NVIDIA Spectrum-X platform

When it comes to choosing an AI cloud storage platform, IBM Storage Scale provides a proven, enterprise-grade data platform. Originating from GPFS, it reflects more than 30 years of R&D and a track record of successful deployments in the industry's most demanding hyperscale environments, including many of the world's most powerful AI and high-performance computing installations of the past few decades.

To meet the data-access needs of different application types, IBM Storage Scale integrates file, big-data analytics, object, and container interfaces into a unified, scale-out storage solution. It provides a single namespace for all of this data, enables protocol interoperability, and offers a single point of management through an intuitive graphical user interface (GUI). Storage policies that are transparent to end users reduce costs by tiering, compressing, or migrating data to tape or the cloud; data can also be promoted to high-performance media, including server-side caches, to cut latency and improve performance. With Active File Management (AFM), intelligent caching at remote sites keeps data available across geographically dispersed locations with local read/write performance, eliminating the need to replicate all data and reducing the network overhead of data delivery.
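The tiering behavior described above can be illustrated with a small sketch. This is not Storage Scale's actual ILM policy language; the tier names and age thresholds are hypothetical, chosen only to show the kind of age-based placement decision such a policy encodes:

```python
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier by data age; transparent to the application.
    Tier names and thresholds are illustrative, not product defaults."""
    age = now - last_access
    if age < timedelta(days=7):
        return "nvme"           # hot data: high-performance pool
    if age < timedelta(days=90):
        return "capacity"       # warm data: cheaper disk pool
    return "tape_or_cloud"      # cold data: migrate out of the cluster

now = datetime(2024, 6, 1)
print(choose_tier(datetime(2024, 5, 30), now))  # recently used -> "nvme"
print(choose_tier(datetime(2023, 1, 1), now))   # long idle -> "tape_or_cloud"
```

Because the policy engine applies such rules transparently, applications keep a single path and namespace while the data physically moves between tiers.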


Figure 3: IBM Storage Scale overview

For AI cluster applications, keeping pace with growing compute power and the ever-larger parameter counts of foundation models also demands faster data access; otherwise, insufficient storage throughput causes inefficient I/O that leaves expensive GPUs idle. GPU clusters built from many servers require aggregate storage bandwidth ranging from hundreds of GB/s to several TB/s. To further improve GPU efficiency, NVIDIA developed GPUDirect Storage, which transfers data directly from external storage into GPU memory over a high-speed RDMA network, effectively removing CPU I/O bottlenecks, increasing the bandwidth at which GPUs access data, and greatly reducing communication latency. Moreover, each step of an AI workflow, from data ingestion to production inference, relies on different tools to process massive datasets, and the process is iterative; users therefore need an end-to-end, high-speed data pipeline that streamlines these stages and moves data securely and efficiently.
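The aggregate bandwidth demand above can be sized with simple arithmetic. In this minimal sketch, the node count, GPUs per node, and per-GPU ingest rate are all illustrative assumptions, not figures from the text:

```python
def required_storage_bandwidth_gbps(num_nodes: int,
                                    gpus_per_node: int = 8,
                                    gbps_per_gpu: float = 2.0) -> float:
    """Estimate the aggregate read bandwidth (GB/s) a cluster needs so
    its GPUs are not I/O bound. All default figures are assumptions."""
    return num_nodes * gpus_per_node * gbps_per_gpu

# e.g. a 64-node cluster, 8 GPUs per node, ~2 GB/s of input per GPU:
print(required_storage_bandwidth_gbps(64))  # 1024.0 GB/s, i.e. ~1 TB/s
```

Even with conservative per-GPU rates, a moderately sized cluster quickly lands in the hundreds-of-GB/s to TB/s range the text describes.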

The fully optimized IBM Storage Scale System leverages the benefits of parallel architecture and high-speed networking to accelerate a wide range of AI workload applications, including:

  • Extreme performance: Delivers industry-leading file read and write performance; a single Storage Scale System (SSS) module currently provides more than 310 GB/s of file-access bandwidth and 13M IOPS, and can scale out to thousands of modules for higher performance and capacity, while built-in declustered RAID minimizes the performance impact of hardware failures.
  • Certification support: IBM Storage Scale is an NVIDIA-certified storage technology supporting GPUDirect Storage, which avoids GPU I/O bottlenecks, helps users accelerate AI services and data-intensive applications, and greatly improves the utilization of valuable GPU resources.
  • Global access: IBM Storage Scale provides global data-platform access, supports multiple access protocols (object, container, HDFS, etc.) and diverse storage environments, enables data integration and scheduling, and combines with other storage devices (including tape) for tiered storage, reducing total cost of ownership and improving end-to-end data-processing efficiency.
  • Security and resiliency: Provides end-to-end data security and resiliency, including comprehensive high-availability and disaster-recovery solutions, plus Safeguarded Copy and security-log auditing for cyber resiliency.
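The per-module figures quoted above can be aggregated with back-of-envelope math. Linear scaling is an idealized assumption here; real clusters lose some efficiency as they grow:

```python
# Per-module figures from the text; the efficiency factor is an assumption.
MODULE_BW_GBPS = 310       # GB/s of file-access bandwidth per SSS module
MODULE_IOPS = 13_000_000   # IOPS per SSS module

def aggregate(modules: int, efficiency: float = 1.0):
    """Idealized aggregate bandwidth (GB/s) and IOPS for N modules."""
    return (modules * MODULE_BW_GBPS * efficiency,
            modules * MODULE_IOPS * efficiency)

bw, iops = aggregate(10)
print(f"10 modules: {bw:.0f} GB/s, {iops / 1e6:.0f}M IOPS")  # 3100 GB/s, 130M IOPS
```

Ten modules already exceed the multi-TB/s demand estimated for a mid-size GPU cluster, which is why scale-out to thousands of modules covers the largest deployments.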

Figure 4: Performance measurements for a single IBM SSS 6000 module

To take full advantage of IBM Storage Scale's high bandwidth and low latency, users typically access data over RDMA-capable networks, including InfiniBand and RoCE (RDMA over Converged Ethernet). The NVIDIA Spectrum-X platform features NVIDIA's unique adaptive routing and other AI-specific Ethernet optimizations, which fully exploit the high bandwidth of storage systems in large-scale clusters and give customers a stable network foundation for building high-performance AI clusters.

Take the data-service flow of an AI cluster as an example: the network path from GPU memory to the storage server runs from the Leaf switch on the GPU cluster's storage plane up to a Spine switch, back down to a Leaf switch on the storage side, and finally to the storage server. Clearly there are multiple equal-cost paths between the Leaf and Spine layers: paths from one Leaf to different Spines, as well as multiple links from the same Leaf to the same Spine. Traditional per-flow hashing cannot reliably spread a small number of large storage flows across these equal-cost paths, which limits the transmission bandwidth of storage traffic in large-scale AI clusters; even a storage system with strong raw performance then cannot deliver it, because the network becomes the bottleneck. Adaptive routing, by contrast, forwards at packet granularity, so regardless of how few storage flows there are, traffic is spread evenly across all equal-cost paths. This eliminates network bottlenecks, lets the storage system reach its full performance, raises storage bandwidth, and lowers storage-plane latency, which is extremely important for building AI clusters on Ethernet.
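The contrast between per-flow hashing and packet-granular forwarding can be sketched with a toy simulation. The path count, flow count, and the use of a random choice to stand in for a hash are all illustrative assumptions:

```python
import random

def ecmp_per_flow(num_flows: int, num_paths: int, seed: int = 0) -> float:
    """Worst-case link load factor when each flow is pinned to one path
    by a hash (modeled as a random pick); 1.0 means perfect balance."""
    rng = random.Random(seed)
    load = [0] * num_paths
    for _ in range(num_flows):
        load[rng.randrange(num_paths)] += 1   # whole flow lands on one link
    ideal = num_flows / num_paths
    return max(load) / ideal

def per_packet_spray(num_flows: int, num_paths: int) -> float:
    """Packet-granular forwarding spreads every flow over all paths,
    so each link carries an equal share of the traffic."""
    return 1.0

# A few large storage flows over 4 equal-cost uplinks: per-flow hashing
# can overload some links while others sit idle; spraying stays balanced.
print(ecmp_per_flow(4, 4), per_packet_spray(4, 4))
```

With only a handful of elephant flows, the hash frequently maps two flows onto the same uplink (a load factor of 2.0 or worse), halving the effective bandwidth of those flows, which is exactly the failure mode adaptive routing avoids.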


Figure 5: Comparison of forwarding paths with AR enabled and AR disabled

To demonstrate the practical effect of the Spectrum-X platform for storage, a demo environment was built to simulate a typical AI storage scenario, as shown in the figure below. It uses four servers equipped with NVIDIA BlueField-3: two compute nodes with BlueField-3 DPUs and two storage nodes with BlueField-3 SuperNICs. Six SN5600 switches (based on Spectrum-4) form a typical two-layer Spine-Leaf fat-tree network. The BlueField-3 DPUs and SuperNICs are dual-port cards; each port connects to a different Leaf switch for storage-plane high availability, and port bonding is enabled to maximize per-node bandwidth. The test constructs RDMA traffic and covers two scenarios: 2-on-1 and 2-on-2.


Figure 6: Spectrum-X storage AR test topology

In the 2-on-1 and 2-on-2 scenarios, two compute nodes send traffic simultaneously to one or two storage nodes, simulating the impact of a typical storage-write workload on the switching network. As shown in the figure below, with adaptive routing enabled the receiver's network bandwidth reaches more than 95% of the physical bandwidth in both the 2-on-1 and 2-on-2 cases. With adaptive routing disabled, bandwidth utilization between the switches drops significantly, and the measured bandwidth is less than half of the adaptive-routing result. The Spectrum-X platform with adaptive routing thus effectively removes the bandwidth bottleneck in the storage network, lets the storage system deliver its full performance, and improves the overall performance of the AI cluster.
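The comparison above reduces to a simple utilization ratio. The line rate and achieved-bandwidth numbers below are illustrative placeholders, not measured results from the demo:

```python
def utilization(achieved_gbps: float, line_rate_gbps: float) -> float:
    """Fraction of physical link bandwidth actually delivered."""
    return achieved_gbps / line_rate_gbps

LINE_RATE = 400  # Gb/s per port, an assumed figure for illustration

# Assumed achieved rates: ~95% of line rate with adaptive routing on,
# under half of that result with it off.
print(f"AR on : {utilization(380, LINE_RATE):.0%}")
print(f"AR off: {utilization(180, LINE_RATE):.0%}")
```

Framed this way, enabling adaptive routing roughly doubles the delivered storage bandwidth on the same physical network, with no change to the storage system itself.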


Figure 7: Bandwidth comparison with adaptive routing enabled vs. disabled in the 2-on-1 and 2-on-2 scenarios

Through cooperation with the NVIDIA networking team, IBM Storage Scale and the NVIDIA Spectrum-X platform together implement a software-defined data infrastructure. Storage Scale built on Spectrum-X can not only deliver a range of Ethernet-ecosystem storage services for cloud applications, but also greatly improve storage performance, fully realizing Storage Scale's high-throughput, high-bandwidth advantages and meeting the demands of high-performance cloud data storage in the AI era. The combination addresses the challenges and technical bottlenecks of next-generation, data-centric infrastructure, provides a high-performance foundation for AI cloud applications, and helps customers gain competitive advantage in the hybrid cloud and AI era.

