
Lao Huang goes all in on CPUs! Nvidia's 80-billion-transistor graphics card and Eos, the world's fastest AI supercomputer


Reporting by XinZhiyuan

Edit: Editorial Office

【New Zhiyuan Guide】An "assembled" CPU, a 4nm graphics card, the world's fastest AI supercomputer, and a metaverse for game developers. What's in Lao Huang's treasure chest this time?

Today, Lao Huang donned his leather jacket once again!

On the evening of March 22, NVIDIA GTC 2022 opened.

Although the familiar kitchen backdrop was gone, the production this time was even more lavish.

Nvidia rendered the new headquarters from the inside out with Omniverse!

80 billion transistors of Hopper H100

With the stage set, Nvidia unveiled the Hopper H100, its latest AI graphics card designed for supercomputing.

Compared to its predecessor A100, which had only 54 billion transistors, NVIDIA loaded 80 billion transistors into the H100 and used a custom TSMC 4nm process.

This means the H100 should offer better power/performance characteristics along with some improvement in density.

In terms of computing power, the H100's FP16, TF32, and FP64 performance is three times that of the A100, at 2,000 TFLOPS, 1,000 TFLOPS, and 60 TFLOPS respectively.

In addition, the H100 adds support for FP8, with throughput of up to 4,000 TFLOPS, six times that of the A100, which lacks native FP8 support and has to fall back on FP16 for such workloads.

In terms of memory, the H100 will support HBM3 by default, with a bandwidth of 3 TB/s, 1.5 times that of the A100's HBM2E.


The fourth-generation NVLink interface supported by the H100 provides up to 900 GB/s of bandwidth, 1.5 times that of the A100, while its PCIe 5.0 interface reaches 128 GB/s, twice the speed of PCIe 4.0.

Meanwhile, the SXM version of the H100 raises the TDP to 700W, up from 400W on the A100. With a 75% power increase, 2 to 3 times the performance can generally be expected.

To optimize performance, Nvidia has also introduced a new Transformer Engine that will automatically switch between FP8 and FP16 formats depending on the workload.

Hopper architecture's new DPX instructions will bring up to 40 times faster computation speeds for dynamic programming.
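Dynamic programming means filling a table by repeatedly combining previously computed entries, usually with a min/max-plus step. As a plain-Python illustration of the class of workload DPX targets (this is ordinary CPU code, not actual DPX usage), here is the classic Levenshtein edit distance:

```python
# Levenshtein edit distance: a classic dynamic-programming kernel of the kind
# Hopper's DPX instructions are designed to accelerate (shown here in plain
# Python purely to illustrate the recurrence).

def edit_distance(a: str, b: str) -> int:
    """O(len(a) * len(b)) DP table fill, keeping only one row at a time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from "" to prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # The min-of-three update is the inner step DPX-style
            # instructions speed up in hardware.
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```

Genome-sequence alignment (e.g. Smith-Waterman) uses the same table-fill pattern, which is why Nvidia cites it as a DPX use case.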

In AI training, the H100 delivers up to 9 times the throughput. On the Megatron 530B benchmark, it provides 16 to 30 times the inference performance. In HPC applications such as 3D FFT (fast Fourier transform) and genome sequencing, performance improves by 6 to 7 times.


DGX server system

The fourth generation of NVIDIA DGX server system will be the world's first AI server platform built with H100 graphics cards.

The DGX H100 server system delivers the scale needed to meet the massive computing needs of large language models, recommendation systems, healthcare research, and climate science.

Each server system contains eight H100 graphics cards linked into a single unit via NVLink, for a total of 640 billion transistors.

At FP8 precision, the DGX H100 delivers 32 PFLOPS performance, 6 times higher than the previous generation.

In addition, each DGX H100 system includes two NVIDIA BlueField-3 DPUs for offloading, accelerating, and isolating network, storage, and security services.

Eight NVIDIA ConnectX-7 Quantum-2 InfiniBand network adapters each provide 400 Gb/s of throughput for connecting to compute and storage, twice the speed of the previous-generation system.

The fourth generation of NVLink, combined with NVSwitch, provides 900 GB/s of connectivity between every GPU in each DGX H100 system, 1.5 times that of the previous generation.

The latest DGX SuperPOD architecture can connect up to 32 nodes, for a total of 256 H100 graphics cards.

The DGX SuperPOD delivers 1 EFLOPS of FP8 performance, which is also 6 times that of the previous generation.


The world's fastest AI supercomputer

The "Eos" supercomputer, comprising 576 DGX H100 server systems with a total of 4,608 H100 graphics cards, is expected to deliver 18.4 EFLOPS of AI computing performance, four times that of Japan's Fugaku, currently the world's fastest supercomputer.

For traditional scientific computing, Eos is expected to deliver 275 PFLOPS performance.
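The system-level figures quoted above follow directly from the per-GPU FP8 number. A quick sanity check (assuming the 4,000 TFLOPS FP8 figure per H100 quoted earlier):

```python
# Verify the quoted DGX, SuperPOD, and Eos figures from the per-GPU FP8 number.
H100_FP8_TFLOPS = 4000            # per-GPU FP8 throughput quoted for the H100

gpus_per_dgx = 8
dgx_fp8_pflops = gpus_per_dgx * H100_FP8_TFLOPS / 1000
print(dgx_fp8_pflops)             # 32.0 PFLOPS per DGX H100, as stated

superpod_gpus = 32 * 8            # 32 nodes x 8 GPUs = 256 H100s
superpod_eflops = superpod_gpus * H100_FP8_TFLOPS / 1e6
print(round(superpod_eflops, 2))  # ~1.02, i.e. the "1 EFLOPS" SuperPOD figure

eos_gpus = 576 * 8                # 576 DGX systems -> 4,608 H100 GPUs
eos_eflops = eos_gpus * H100_FP8_TFLOPS / 1e6
print(round(eos_eflops, 1))       # 18.4 EFLOPS, matching the Eos claim
```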

Transformer Engine

As part of the new Hopper architecture, the Transformer Engine will significantly improve AI performance, allowing large models to be trained in days or even hours.

Traditional neural network training uses a fixed precision throughout, which makes it difficult to apply FP8 to an entire model.

The Transformer Engine, by contrast, can choose between FP16 and FP8 on a layer-by-layer basis during training, using NVIDIA-provided heuristics to select the lowest precision each layer requires.

In addition, the Transformer Engine can pack and process FP8 data at twice the rate of FP16, so every layer of the model that runs in FP8 processes data twice as fast.
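The per-layer idea can be sketched as follows. Note this is a toy illustration: the threshold logic and layer statistics are invented, and NVIDIA's actual heuristics are internal to the Transformer Engine. The format limits themselves are real (FP8 E4M3 maxes out at 448, FP16 at 65504).

```python
# Toy sketch of per-layer precision selection: use FP8 for a layer only when
# its values fit FP8's dynamic range, otherwise fall back to FP16.
# The statistics below are hypothetical; real systems also apply scaling.

FP8_E4M3_MAX = 448.0   # largest magnitude representable in FP8 E4M3
FP16_MAX = 65504.0     # largest magnitude representable in FP16

def choose_precision(abs_max_activation: float) -> str:
    """Return the lowest precision whose range covers the layer's activations."""
    return "fp8" if abs_max_activation <= FP8_E4M3_MAX else "fp16"

# Hypothetical per-layer activation magnitudes gathered during training
layer_stats = {"attention": 310.0, "mlp": 1200.0, "output": 90.5}

plan = {name: choose_precision(m) for name, m in layer_stats.items()}
print(plan)  # {'attention': 'fp8', 'mlp': 'fp16', 'output': 'fp8'}
```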


Grace CPU super chip

In addition to graphics cards, Nvidia today also unveiled its first Arm Neoverse-based processor, the Grace CPU Super Chip.

It builds on the previously announced Grace Hopper CPU + GPU design, with the GPU replaced by a second CPU.

Nvidia Labs estimates that the Grace CPU Super Chip's performance is more than 1.5 times higher when using similar compilers.

In terms of technical specifications, it can be summarized as:

2 x 72-core chips, with up to 144 Arm v9 CPU cores

New-generation LPDDR5x memory with ECC, with a total bandwidth of 1 TB/s

The SPECrate 2017_int_base score is expected to exceed 740

900 GB/s coherent interface, 7 times faster than PCIe 5.0

2 times the packaging density of DIMM-based solutions

2 times the performance per watt of today's leading CPUs
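The headline ratios in the list are easy to verify from the figures themselves (plain arithmetic on the numbers quoted above, nothing beyond them):

```python
# Quick arithmetic check on the Grace spec figures listed above.
cores = 2 * 72                  # two 72-core dies in one superchip
print(cores)                    # 144 Arm v9 cores

nvlink_c2c_gb_s = 900           # NVLink-C2C coherent interface bandwidth
pcie5_x16_gb_s = 128            # PCIe 5.0 x16 bidirectional bandwidth
print(round(nvlink_c2c_gb_s / pcie5_x16_gb_s, 1))  # ~7.0, the "7x PCIe 5.0" claim
```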


The two CPUs in the superchip communicate through NVIDIA's latest NVLink "chip-to-chip" (C2C) interface.

This die-to-die, chip-to-chip interconnect supports low-latency memory coherence, allowing connected devices to work on the same memory pool at the same time.


The Grace CPU Super Chip boasts more advanced energy efficiency and memory bandwidth, and its innovative memory subsystem consists of LPDDR5x memory with ECC.

LPDDR5x provides twice the bandwidth of traditional DDR5 while keeping the power consumption of the entire CPU-plus-memory subsystem down to 500 watts.

By contrast, AMD's chips score between 382 and 424 in this benchmark, with each chip drawing up to 280W (not including memory).

In addition, paired with NVIDIA ConnectX-7 NICs, the Grace CPU Super Chip can be flexibly configured into servers, either as a standalone CPU-only system or as an accelerated server with 1, 2, 4, or 8 Hopper-based graphics cards.


New products in the Ampere architecture lineup

Today, Nvidia introduced seven Ampere-architecture graphics cards for laptops and desktops: the RTX A500, RTX A1000, RTX A2000 8GB, RTX A3000 12GB, RTX A4500, and RTX A5500 for laptops, plus an RTX A5500 for desktops.

The new RTX A5500 desktop GPU delivers outstanding rendering, AI, graphics, and compute performance. Its ray-traced rendering is twice as fast as the previous generation, and its motion-blur rendering performance improves by up to 9 times.


2nd Generation RT Core: Up to 2 times the throughput of the first generation, capable of running ray tracing, shading and denoising tasks simultaneously.

Third Generation Tensor Cores: Training throughput is 12 times higher than the previous generation, supporting the new TF32 and Bfloat16 data formats.

CUDA Cores: Up to 3 times the single-precision floating-point throughput of the previous generation.

Up to 48GB of GPU memory: The RTX A5500 has 24GB of GDDR6 memory with ECC (Error Correction Code). Using NVLink to connect two GPUs, the RTX A5500's memory can be expanded to 48GB.

Virtualization: The RTX A5500 supports NVIDIA RTX Virtual Workstation (vWS) software for multiple high-performance virtual workstation instances, enabling remote users to share resources and drive high-end design, AI, and compute workloads.

PCIe 4.0: 2 times the bandwidth of the previous generation, speeding up data transfer for data-intensive tasks such as AI, data science, and creating 3D models.

Game developers get a metaverse, too

Omniverse, which already has a place in the metaverse, has been strengthened again.


At the conference, Nvidia released new features of NVIDIA Omniverse that make it easier for developers to share assets, categorize asset libraries, collaborate, and deploy AI to animate characters' facial expressions in a new game development process.

With the NVIDIA Omniverse Real-Time Design Collaboration and Simulation Platform, game developers can easily build custom tools to simplify, accelerate, and improve their development workflows using AI- and NVIDIA RTX-enabled tools. Its components include:

Omniverse Audio2Face, an application powered by NVIDIA AI, enables character artists to generate high-quality facial animation from an audio file. Audio2Face supports full-face animation, and artists can also control the emotion of the performance. With Audio2Face, game developers can quickly and easily add realistic facial expressions to their game characters, fostering a stronger emotional connection between players and characters and enhancing immersion.

Omniverse Nucleus Cloud, now in Early Access, enables simple one-click sharing of Omniverse scenes without deploying Nucleus on-premises or in a private cloud. With Nucleus Cloud, game developers can easily share and collaborate on 3D assets in real time across internal and external development teams.

Omniverse DeepSearch, an AI service now available to Omniverse Enterprise users, lets game developers instantly search their entire catalogs of untagged 3D assets, objects, and characters using natural-language input and images.

Omniverse Connectors are plug-ins that enable "live-sync" collaborative workflows between third-party design tools and Omniverse. The new Unreal Engine 5 Omniverse Connector allows game artists to exchange USD and Material Definition Language (MDL) data between the game engine and Omniverse.

Transforming data centers into "AI factories"

Whether it is the Hopper GPU architecture, AI acceleration software, or powerful data center systems, all of it will be brought together by Omniverse to better simulate and understand the real world, and to serve as a testing ground for new kinds of robots, the so-called "next wave of AI."

The advances in AI have been staggering due to accelerated computing technologies, which have fundamentally changed what software can do and how it is developed.

Lao Huang said that the Transformer has freed AI from the need for human-labeled data, making self-supervised learning possible, and artificial intelligence has advanced at an unprecedented pace as a result.

Google's BERT for language understanding, NVIDIA's MegaMolBART for drug discovery, and DeepMind's AlphaFold2 are all breakthroughs brought about by the Transformer.

Nvidia's AI platform has also received major updates, including the Triton inference server, the NeMo Megatron 0.9 framework for training large language models, and the Maxine framework for audio and video quality enhancements.

"We're going to try to achieve a million-fold increase in computing power over the next decade," Mr. Huang said at the end of his speech, "and I can't wait to see what the next million-fold will bring."

Resources:

https://www.nvidia.cn/gtc-global/keynote/
