An in-depth look at the 80-billion-transistor "nuclear bomb" GPU architecture: is it "assembled goods"?

At NVIDIA GTC in March 2022, NVIDIA founder and CEO Jen-Hsun Huang introduced the H100 GPU based on the new Hopper architecture, NVIDIA's most powerful GPU chip to date, built to accelerate workloads such as artificial intelligence (AI), high-performance computing (HPC), and data analytics.

Up to 6x the performance of the previous-generation A100: how does the NVIDIA Hopper architecture achieve it?

A major upgrade: the Hopper-architecture H100 GPU

The Hopper architecture is named after Grace Hopper, a pioneer of computer science. Billed as "the largest generational leap in history," the Hopper H100 packs 80 billion transistors, making it NVIDIA's new performance "nuclear bomb".

So, what is at the core of this "new nuclear bomb"? This article provides an in-depth interpretation and analysis of the Hopper architecture.

Hopper architecture H100 performance comparison with previous generations of GPUs

Note: Dr. Grace Hopper was one of the first programmers of the Harvard Mark I and is known as the mother of the compiler. She is credited with finding the first "bug" in a computer program (a moth trapped in a relay), while also, as the story goes, helping create the biggest bug in the computer world, the Millennium Bug.

01.

Dissecting Hopper's overall structure

The NVIDIA Hopper-architecture H100 chip is fabricated on TSMC's custom 4N process (an optimized derivative of TSMC's N5 process), with a die area of 814 square millimeters (slightly smaller than the A100's 826 square millimeters).

Performance specifications for the H100 Tensor Core GPU

The Hopper GPU can be thought of as two symmetrical halves spliced together. (Doesn't that sound a bit like the stitching idea behind Apple's UltraFusion architecture we covered earlier? Here, however, the GPU is still a single monolithic die. For a review of Apple's UltraFusion architecture, see the earlier article "The secret recipe for Apple's chip 'assembly' is found in the patent".)

At the top level, Hopper's topology does not differ much from its predecessor, the Ampere architecture. The Hopper GPU in the figure is "stitched together" from eight Graphics Processing Clusters (GPCs).

Hopper architecture basic structure

Around the periphery, multiple HBM3 stacks (chiplet technology) are co-packaged with the GPU die to form the complete module - so, at the module level, it is once again "assembled goods". Each GPC on the die is in turn "stitched" together from nine Texture Processor Clusters (TPCs).

Compute tasks arriving over the PCIe 5.0 or SXM interface are assigned to individual GPCs by the GigaThread engine, with Multi-Instance GPU (MIG) control. Intermediate data is shared between GPCs through the L2 cache, and results computed by the GPCs are exchanged with other GPUs over NVLink. Each TPC consists of two Streaming Multiprocessors (SMs).

Hopper's performance gains and major changes are concentrated in the new thread block cluster mechanism and the new generation of streaming multiprocessors with 4th-generation tensor cores.

Thread block clusters and grids with clusters

The Hopper architecture introduces a new thread block cluster mechanism that enables cooperative computation across SM units. On the H100, the thread blocks in a cluster run concurrently on many SMs within the same GPC, which provides better acceleration for larger models.
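To make the cluster idea concrete, here is a minimal CUDA sketch (assuming CUDA 12 and an sm_90 / Hopper target; the kernel name and launch shape are illustrative, not taken from NVIDIA's materials) in which two thread blocks form one cluster and synchronize with each other across SMs:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two thread blocks per cluster, declared at compile time (requires sm_90).
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *out)
{
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();   // this block's index within the cluster

    // ... each block computes its share of the work here ...

    cluster.sync();   // barrier across all blocks of the cluster, not just one block

    if (rank == 0 && threadIdx.x == 0)
        out[blockIdx.x / 2] = 1.0f;             // placeholder per-cluster result
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 4 * sizeof(float));
    cluster_kernel<<<8, 128>>>(d_out);          // 8 blocks -> 4 clusters of 2 blocks each
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Blocks in the same cluster are guaranteed to be scheduled together within one GPC, which is what makes the cross-SM cooperation described above possible.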

02.

Next-generation streaming multiprocessor SM and FP8 support

The Hopper architecture's next-generation streaming multiprocessor introduces FP8 tensor cores to accelerate AI training and inference. The FP8 tensor cores support FP32 and FP16 accumulators, as well as two FP8 input types: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).

Streaming multiprocessor SM

Compared with FP16 or BF16, FP8 halves data storage requirements and doubles throughput. In the analysis of the Transformer engine below, we will also see how FP8 is used adaptively to speed up Transformer computation.
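As a small illustration of the two FP8 formats, the sketch below uses the conversion types from CUDA's cuda_fp8.h header (available since CUDA 11.8); the test value is arbitrary and chosen only to show the rounding behavior:

```cuda
#include <cstdio>
#include <cuda_fp8.h>   // FP8 conversion types

int main()
{
    float x = 0.3456f;

    // E4M3: 4 exponent bits, 3 mantissa bits -> more precision, less range
    __nv_fp8_e4m3 a = __nv_fp8_e4m3(x);
    // E5M2: 5 exponent bits, 2 mantissa bits -> more range, less precision
    __nv_fp8_e5m2 b = __nv_fp8_e5m2(x);

    printf("original:        %f\n", x);
    printf("round-trip E4M3: %f\n", float(a));
    printf("round-trip E5M2: %f\n", float(b));
    return 0;
}
```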

Each SM includes 128 FP32 CUDA cores and four 4th generation tensor cores.

Instructions entering the SM are first placed in the L1 instruction cache and then distributed to the L0 instruction caches. The Warp Scheduler and Dispatch Unit paired with each L0 cache assign computational tasks to the CUDA cores and tensor cores. (Note: the smallest unit of hardware execution in a GPU is the warp, a group of 32 threads.)
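The warp is also the granularity at which software typically cooperates. The hypothetical kernel below (not from NVIDIA's documentation) computes one sum per warp using warp-shuffle intrinsics, just to show the 32-thread grouping in action:

```cuda
// Each warp (32 threads) reduces its own values with warp shuffles.
// Assumes blockDim.x is a multiple of 32 and in[] has one value per thread.
__global__ void warp_sum(const float *in, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // position inside the warp
    int warp = tid / 32;           // global warp index

    float v = in[tid];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if (lane == 0)
        out[warp] = v;             // lane 0 holds the warp's sum
}
```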

FP8 has 2x the throughput of FP16 or BF16

Each SM also performs transcendental and interpolation function calculations using four Special Function Units (SFUs).

03.

Hopper's tensor cores and the Transformer engine

In GPUs, tensor cores are dedicated high-performance compute units for matrix multiply and matrix multiply-accumulate (MMA) operations, which provide breakthrough acceleration for AI and HPC applications.

The tensor core is the key module for AI acceleration in the GPU, and is one of the major differences between recent GPU architectures (from Volta onward) and earlier GPUs.

Hopper's tensor cores support FP8, FP16, BF16, TF32, FP64, and INT8 MMA data types. The key addition in this generation of tensor cores is the accompanying Transformer engine.
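As a concrete (if simplified) view of what an MMA operation looks like to software, here is a minimal warp-level sketch using the long-standing WMMA API with FP16 inputs and an FP32 accumulator. Hopper layers newer warp-group instructions and FP8 paths on top of this, which are not shown here:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 matrix multiply-accumulate (D = A*B + C)
// on the tensor cores via the WMMA API.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                   // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);               // load A tile (leading dim 16)
    wmma::load_matrix_sync(b_frag, b, 16);               // load B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);      // tensor-core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```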

The H100 FP16 Tensor Core has 3 times the throughput of the A100 FP16 Tensor Core

Transformer operators are the basis of mainstream NLP models from BERT to GPT-3, and are increasingly used in other fields such as computer vision and protein structure prediction.

Compared with the previous-generation A100, the new Transformer engine combined with Hopper's FP8 tensor cores delivers up to 9x faster AI training and 30x faster AI inference on large NLP models.

The new Transformer engine dynamically adjusts the data format to make the most of computing power

To improve the computational efficiency of Transformer models, the new Transformer engine uses mixed precision and intelligently manages numerical precision during computation: at each Transformer layer, based on the needs of the next layer and the precision required, it dynamically converts data between FP8 and other floating-point formats, making full use of the tensor cores' computing power.
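Conceptually, the per-tensor bookkeeping might look like the sketch below. This is only an illustration of the general idea of scaling based on the observed maximum magnitude ("amax") and choosing between the two FP8 formats; the names, structure, and policy are hypothetical and do not describe NVIDIA's actual engine:

```cuda
// Conceptual sketch only (host-side C++). Before casting a tensor to FP8, derive a
// per-tensor scale from its maximum magnitude so values fit the chosen format;
// tensors that need extra dynamic range use E5M2, the rest use E4M3.
struct Fp8Recipe {
    bool  use_e5m2;   // true -> E5M2 (wider range), false -> E4M3 (more precision)
    float scale;      // multiply inputs by this before casting to FP8
};

Fp8Recipe choose_fp8_recipe(float amax)
{
    // Approximate maximum representable magnitudes: E4M3 ~448, E5M2 ~57344.
    const float kMaxE4M3 = 448.0f;
    const float kMaxE5M2 = 57344.0f;

    Fp8Recipe r;
    r.use_e5m2 = (amax > kMaxE4M3);                    // does this tensor need extra range?
    float fmt_max = r.use_e5m2 ? kMaxE5M2 : kMaxE4M3;
    r.scale = (amax > 0.0f) ? fmt_max / amax : 1.0f;   // map amax near the format's top
    return r;
}
```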

04.

Tensor Memory Accelerator and asynchronous execution

A new Tensor Memory Accelerator (TMA) has been added to the Hopper architecture to improve the efficiency of data exchange between the tensor cores and global/shared memory.

In a TMA operation, the data transfer is specified using tensor dimensions and block coordinates rather than plain data addresses. TMA supports different tensor layouts (1D to 5D tensors) and different memory access modes, significantly reducing addressing overhead and improving efficiency.

In other words, where data used to be picked up bean by bean, it is now scooped up bowl by bowl. Isn't this design getting closer and closer to the addressing style of a DSA?

The block coordinate addressing mode of the TMA

TMA operations are also asynchronous, and multiple threads can share the data channel and cooperate on the transfer.

A key advantage of TMA is that it frees up the threads' compute capacity to perform other work while data is being copied.

For example, on the A100, the threads themselves are responsible for generating all the addresses and performing all the data-copy operations. On Hopper, the TMA generates the address sequences (an idea similar to a DMA controller) and takes over the copy task, letting the threads do other work.
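The programming pattern looks roughly like the sketch below, which uses the cooperative-groups asynchronous copy API to issue one collective bulk copy per block instead of per-thread load loops. This only illustrates the asynchronous-copy idea; the actual Hopper TMA path additionally works through tensor-map descriptors and block coordinates, which are not shown here:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// The whole block issues one asynchronous bulk copy into shared memory; address
// generation is handled for the group, and threads are free in the meantime.
// Launch with n*sizeof(float) bytes of dynamic shared memory.
__global__ void async_tile_copy(const float *global_in, float *global_out, int n)
{
    extern __shared__ float tile[];
    cg::thread_block block = cg::this_thread_block();

    // One collective call replaces per-thread address generation and load loops.
    cg::memcpy_async(block, tile, global_in + blockIdx.x * n, sizeof(float) * n);

    cg::wait(block);   // wait for the asynchronous copy to land in shared memory

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        global_out[blockIdx.x * n + i] = tile[i] * 2.0f;   // placeholder compute
}
```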

TMA-based memory copies on the Hopper-architecture H100 are more efficient

05.

Conclusion: GPUs are moving toward domain specialization

Overall, the H100 based on the Hopper architecture delivers approximately 6 times the compute performance of the Ampere-architecture A100.

The core reasons for the large performance improvement are the introduction of FP8 tensor cores and the Transformer engine for NLP tasks, as well as TMA technology, which reduces the wasted work SM units spend on data copying.

From a design-philosophy perspective, the data-center-oriented Hopper architecture incorporates more and more DSA (Domain Specific Architecture) ideas, with more cooperation between streaming multiprocessors. Presumably Huang also feels that GPUs should evolve in the direction of domain specialization.

Compared with the Ampere architecture, this year's Hopper release is more a collection of micro-level improvements, and I hope Huang can bring us more technical surprises next time.

References: NVIDIA H100 Tensor Core GPU Architecture White Paper, NVIDIA; GPGPU Chip Design: Principles and Practices, Chen Wei, Geng Yunchuan
