The integration of CPU, GPU and DPU is the inevitable architecture of the future of the data center

It has become a global consensus that data is an important resource and factor of production. The fulcrum behind all of this, the data center, is where data is computed and stored, and it is certain to become the arena in which future technology companies compete.

The data center is no longer in the mainframe era, when it handled a single critical task. It has passed through the era of the software-defined data center, which focused on making the best use of resources when running multiple services, and is now shifting from scale-up to scale-out, where existing computing power has become the bottleneck, said Song Qingchun, senior director of Asia-Pacific market development at NVIDIA's networking division.

GPUs solve the compute bottleneck very well, but only within a single machine. How, then, can the same problem be solved across the wider data center, especially where security and performance isolation are concerned?

NVIDIA's answer is the DPU. "In today's data center, the three processors, CPU, GPU and DPU, are all indispensable. This '3U' combination is the basis for the data center to become a unit of computing, and for computing power to become a service," Song Qingchun pointed out.

The DPU, or Data Processing Unit, is a processor dedicated to data center infrastructure. Its arrival frees up CPU and GPU resources, and in NVIDIA's view it brings a new way of thinking about data-centric computing architectures. The DPU runs the communication, storage, security and service-isolation frameworks, offloading those tasks from the CPU and GPU so their computing power can be devoted entirely to applications. Song Qingchun said that with a DPU, communication and computation can be overlapped: communication in HPC workloads is accelerated on the DPU, leaving the CPU and GPU free to do the actual floating-point computation.
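The offload idea can be illustrated with a toy sketch (this is not NVIDIA's actual software stack): a background worker thread stands in for the DPU and drains communication jobs, while the main thread, standing in for the CPU/GPU, keeps doing floating-point work, so the two overlap in time.

```python
import threading
import queue
import time

def communication_worker(jobs: queue.Queue, done: list) -> None:
    """Stands in for the DPU: handles communication jobs off the host CPU."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: no more traffic
            break
        time.sleep(0.001)        # pretend to move bytes on the wire
        done.append(job)

jobs: queue.Queue = queue.Queue()
completed: list = []
dpu = threading.Thread(target=communication_worker, args=(jobs, completed))
dpu.start()

# The "CPU/GPU" side: enqueue transfers, then immediately go back to math.
partial = 0.0
for step in range(10):
    jobs.put(f"send-chunk-{step}")              # hand off to the offload engine
    partial += sum(x * x for x in range(1000))  # useful floating-point work

jobs.put(None)
dpu.join()
print(len(completed))  # all 10 transfers completed while compute ran
```

The point of the sketch is only the structure: the compute loop never blocks on a transfer, which is the effect a DPU-resident communication framework aims for.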

He noted that the DPU fills the data center's missing capability of accelerating infrastructure services, realizing a new "3U-integrated" architecture that turns the data center into a new unit of computing, which in his view is the inevitable architecture.

At GTC 2021, NVIDIA unveiled Quantum-2, its next-generation InfiniBand networking platform. It comprises the NVIDIA Quantum-2 switch, the ConnectX-7 network adapter, the BlueField-3 DPU and all the software that supports the new architecture, and it is NVIDIA's most advanced end-to-end networking platform to date.

Song Qingchun said that Quantum-2 is a computing network that truly meets the needs of both supercomputing and cloud-native networking. For supercomputers and cloud-native supercomputing systems to reach high performance, every resource must participate in the computation.

In data center network communication, many communication patterns constrain overall system performance, and the traditional von Neumann computing model leads to network congestion. Neither adding bandwidth nor cutting latency solves this, so how to keep improving data center performance has become a new challenge facing the industry.

"Computation should happen where the data is," Song Qingchun pointed out. The new data-centric architecture addresses packet loss and other bottlenecks in network transport, and can cut communication latency by more than 10x, which is why in-network computing has become one of the key technologies of today's data-centric architectures.
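The intuition behind in-network computing can be sketched with a toy tree reduction (a simplified stand-in for what SHARP-style aggregation does inside switches): each "switch" level combines its children's partial results, so a single aggregate, rather than every node's raw data, crosses each upstream link.

```python
def tree_reduce(values, fanout=4):
    """Toy SHARP-style in-network reduction: each level of 'switches'
    sums groups of `fanout` inputs, so only one aggregate value
    travels up each link instead of N raw values to a root node."""
    level = list(values)
    hops = 0
    while len(level) > 1:
        level = [sum(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        hops += 1
    return level[0], hops

# 64 nodes, radix-4 aggregation: the result is ready after 3 switch hops.
total, levels = tree_reduce(range(64))
print(total, levels)  # 2016 3
```

With aggregation done in the fabric, the root never receives 64 separate messages, which is the latency and congestion win the data-centric architecture is after.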

With a throughput of 400 Gb/s, NVIDIA Quantum-2 InfiniBand doubles network speed and triples the number of network ports. It delivers a 3x performance improvement while cutting the number of switches a data center network requires by a factor of 6, and reduces data center power consumption and space by 7% each.

The NVIDIA Quantum-2 platform provides performance isolation between tenants, so one tenant's behavior does not interfere with others. Its advanced, cloud-native, telemetry-based congestion control mechanism keeps data throughput steady even when user or application demand spikes.
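As a rough illustration of what "telemetry-based congestion control" means (this is a generic additive-increase/multiplicative-decrease sketch, not NVIDIA's proprietary algorithm, and all parameter values here are invented): switch telemetry reports queue depth back to the sender, which backs off sharply on congestion and probes upward gently when the path is clear.

```python
def adjust_rate(rate_gbps, queue_depth, threshold=32,
                decrease=0.5, increase=5.0, line_rate=400.0):
    """Toy telemetry-driven congestion control (illustrative only):
    back off multiplicatively when telemetry reports a deep switch queue,
    otherwise probe upward additively, capped at the 400 Gb/s line rate."""
    if queue_depth > threshold:
        return max(rate_gbps * decrease, 1.0)   # congestion: halve the rate
    return min(rate_gbps + increase, line_rate)  # clear: creep back up

# A congested reading halves the rate; calm telemetry lets it recover.
rate = 400.0
rate = adjust_rate(rate, queue_depth=80)  # deep queue reported
rate = adjust_rate(rate, queue_depth=4)   # queue drained
print(rate)
```

The key property, which any such mechanism shares, is that senders react before buffers overflow, so one tenant's burst does not turn into packet loss for everyone else.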

NVIDIA Quantum-2's SHARPv3 in-network computing technology gives AI applications 32 times the acceleration engines of the previous generation, and together with the NVIDIA UFM Cyber-AI platform it provides data centers with advanced InfiniBand network management capabilities, including predictive maintenance.

The NVIDIA Quantum-2 platform also integrates a nanosecond-precision timing system to synchronize distributed applications, such as database processing, reducing waiting and idle time. This new capability makes cloud data centers part of the telecom network, able to host software-defined 5G wireless services.

Compared with traditional supercomputing platforms, Song Qingchun explained, Quantum-2 lets the network participate directly in computation. Through advanced in-network computing, dynamic routing and congestion control, the Quantum-2 platform isolates the performance of each workload, so that when multiple services run concurrently, each can deliver its best performance, pushing the supercomputing cloud to its peak and keeping it at bare-metal levels. Going further, the Quantum-2 InfiniBand DPU can overlap computation and communication, which opens up another optimization path: computation stays on the CPU and GPU while the communication framework runs on the DPU. For some workloads this even beats bare metal; services such as 3D FFT (fast Fourier transform) can achieve better-than-bare-metal performance. For anyone building a cloud-native technology platform, therefore, Quantum-2 is the best network platform to support cloud native.

On the concept of cloud native, Song Qingchun said that from NVIDIA's point of view the name "cloud native" may change in the future, but the technology will certainly keep moving in this direction. Computing power has become a resource, and with governments calling for energy saving and emission reduction, everyone wants the data center to deliver maximum performance with minimum power consumption and the fewest devices, so there is no doubt that cloud native will continue in the direction of better performance. (Proofreading/Sharon)
