
Demystifying the evolution of NVIDIA's GPU architecture over nearly a decade, from Fermi to Ampere

Source: Tencent Technology Engineering

Author: Tomoyazhang, backend development engineer, Tencent PCG

As software evolves from 1.0 to 2.0, that is, from Turing-machine-style programs to deep-learning-style algorithms, computing hardware is also accelerating its migration from CPUs to GPUs. This article attempts to organize the history of NVIDIA's GPU architecture evolution over the decade from 2010 to 2020.

CPU and GPU

Let's first have an intuitive understanding of the GPU, as shown in the following figure:

[Figure: CPU vs. GPU architecture comparison]

As we all know, because memory has developed more slowly than processors, a multi-level cache structure has evolved on the CPU, as shown on the left above. The GPU has a similar multi-level cache structure. The difference is that the GPU devotes more transistors to numerical computation than the CPU does, rather than to caching and flow control. This stems from different design goals: the CPU is designed to execute dozens of threads in parallel, while the GPU is designed to execute thousands.

As you can see in the figure on the right, the GPU has far more Cores than the CPU, but there are trade-offs: the GPU has much less Cache and Control than the CPU. This makes a single GPU Core far less flexible than a CPU core and subject to many restrictions, and those restrictions are ultimately borne by the programmer. They also make GPU programming fundamentally different from CPU multithreaded programming.

One of the most fundamental differences is visible in the figure on the right: each row has multiple Cores but only one Control, which means those Cores can only execute the same instruction at the same time. This mode is known as SIMT (Single Instruction, Multiple Threads). It is somewhat similar to SIMD on modern CPUs, but fundamentally different, as this article will explore later.

Starting from the GPU's architecture, we find that, because of the reduced Cache and Control, only compute-intensive and data-parallel programs are suitable for the GPU.

Compute-intensive: the proportion of numerical computation is much larger than that of memory accesses, so memory-access latency can be hidden by computation, and the demand on the Cache is lower than on a CPU.

Data-parallel: a large task can be broken into small tasks that execute the same instructions, so there is little need for complex flow control.

Deep learning happens to satisfy both points, and I believe that even if a model appeared with lower computational cost and stronger expressive power than deep learning, it would still be beaten by deep learning running with GPU support if it did not satisfy these two points.

Fermi

Fermi is the architecture NVIDIA released in 2010. It introduced many concepts that are still in use today. Little material can be found on pre-Fermi architectures, so this article starts with Fermi, beginning with an overview.

[Figure: Fermi architecture overview]

The GPU receives CPU commands through the Host Interface, and the GigaThread Engine copies the required data from Host Memory into the GPU's own Framebuffer. The GigaThread Engine then creates Thread Blocks and distributes them to multiple SMs. The SMs are independent of one another, and each schedules its own Thread Warps onto the CUDA Cores and other execution units inside the SM.

This description involves several concepts that need explanation:

SM: corresponds to the SM hardware entity in the figure above; it contains many CUDA Cores;

Thread Block: a Thread Block contains multiple threads (for example, several hundred). Different Blocks execute completely independently, so the hardware is free to schedule them in any order, while the execution rules for the threads inside a Block are determined by the programmer, who also decides how many Blocks there are in total;

Thread Warp: every 32 threads form a Thread Warp, and Warps are scheduled according to special rules, which this article will explore further later.

Since this article is not about how to write CUDA, if the explanation of SM/Block is still unclear, you can refer to this section:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model

The figure above shows 16 SMs, each with 32 CUDA Cores, for a total of 512 CUDA Cores. These numbers are not fixed; they depend on the specific architecture and model.
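
To make the Block/Thread hierarchy concrete, here is a minimal CUDA sketch; the kernel name and sizes are illustrative, not tied to any particular architecture:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element. Blocks are scheduled onto SMs
// independently, and each block is executed as warps of 32 threads.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    int threadsPerBlock = 256;                       // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);  // the hardware decides block order
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```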

Next, let's zoom in on a single SM:

[Figure: Fermi SM overview]

As the figure above shows, the SM contains 32 CUDA Cores, each consisting of an integer arithmetic logic unit (ALU) and a floating point unit (FPU). FMA instructions are provided for both single- and double-precision floating point.

The SM also contains 16 LD/ST (Load/Store) units, which allow 16 threads at a time to access data in Cache/DRAM.

The 4 SFUs are Special Function Units used to compute special instructions such as sin/cos. Each SFU executes one instruction for one thread per clock cycle, so a Warp (32 threads) takes 8 clock cycles. The SFU pipeline is decoupled from the Dispatch Unit, so while the SFUs are busy the Dispatch Unit can issue to other execution units.
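
In CUDA C, these special-function instructions are reachable through the fast-math intrinsics; a small sketch (the intrinsics are standard CUDA, the kernel itself is illustrative):

```cuda
// __sinf/__cosf map to fast hardware approximations backed by the SFUs,
// while sinf/cosf use a slower but more accurate software sequence.
__global__ void wave(float *out, const float *phase, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __sinf(phase[i]) + __cosf(phase[i]);
}
```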

Warp has been mentioned before only as "a group of 32 threads"; here we finally start to detail it, beginning with an overview of the Dual Warp Scheduler.

[Figure: Dual Warp Scheduler]

In the SM overview and the figure above, note that the SM has two Warp Schedulers and two Dispatch Units. This means that at any given time two Warps are being executed concurrently, each dispatched to a group of 16 CUDA Cores, or to the 16 Load/Store units, or to the 4 SFUs, with one instruction issued per dispatch, while each Warp Scheduler maintains the state of many (say, dozens of) Warps.

A core constraint appears here: at any moment, the threads in a Warp are executing the same instruction, and from the programmer's point of view it is impossible to observe different threads within a Warp executing different instructions at the same time.

But as we all know, different threads may take different branches. How can they execute the same instruction?

[Figure: warp execution when threads take different branches]

As the diagram shows, when a branch occurs, only the threads that take that branch execute; if few threads take the branch, resources are wasted.
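
A minimal sketch of the kind of branch that causes this serialization (the kernel is illustrative):

```cuda
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Threads in the same warp take different paths: the two branches
    // are executed one after the other, with part of the warp masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;   // even lanes active, odd lanes idle
    else
        out[i] = in[i] + 1.0f;   // odd lanes active, even lanes idle
}
```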

In the SM overview we can also see 64KB of on-chip memory in each SM, of which 48KB is Shared Memory and 16KB is L1 Cache. The L1 Cache and the L2 Cache (which sits outside the SM) play roles very close to those of the L1/L2 caches in a CPU's cache hierarchy, but Shared Memory is a big difference from the CPU. Whether on a CPU or a GPU, the L1/L2 Cache generally cannot be explicitly managed by the programmer, whereas Shared Memory is an on-chip memory designed to be managed by the programmer.
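
A small sketch of programmer-managed Shared Memory, used here as a block-wide staging buffer (the kernel is illustrative and assumes blocks of 256 threads):

```cuda
// Each block stages a tile of the input in on-chip Shared Memory,
// which the programmer allocates and indexes explicitly.
__global__ void reverseInBlock(float *data) {
    __shared__ float tile[256];                 // per-block on-chip buffer
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                   // load into shared memory
    __syncthreads();                            // wait for the whole block
    data[base + t] = tile[blockDim.x - 1 - t];  // read back in reversed order
}
```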

Kepler

In 2012, NVIDIA released the Kepler architecture. Let's look directly at the GTX 680, which uses Kepler:

[Figure: Kepler GTX 680 architecture overview]

As you can see, the SM is renamed SMX, but the concept it represents has not changed much. Let's look inside an SMX first:

[Figure: Kepler SMX]

Everything inside is still familiar from Fermi; there is simply much more of it.

To me, the most noteworthy addition in Kepler is GPUDirect, which bypasses the CPU and system memory to exchange data directly with other GPUs in the same machine or with GPUs in other machines. After all, in 2021, bypassing the CPU/OS is already one of the most important acceleration techniques.
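
Within a single machine, this peer-to-peer path is exposed in the CUDA runtime as peer access; a minimal sketch using the standard runtime calls (device IDs and sizes are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    if (!canAccess) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 0; }

    size_t bytes = 1 << 20;
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // allow device 0 to reach device 1 directly
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy directly between the two GPUs' memories without staging in host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    return 0;
}
```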

Maxwell

In 2014, NVIDIA released the Maxwell architecture. Let's look directly at the architecture diagram:

[Figure: Maxwell architecture overview]

You can see that the SM is renamed SMM, and there are more Cores with more capability; we won't go into further detail here.

Pascal

In 2016, NVIDIA released the Pascal architecture, the first architecture designed with deep learning in mind and one worth writing about at length. Look first at the P100 below.

[Figure: Pascal P100 architecture overview]

As always, many more Cores have been added. Let's take a closer look inside the SM:

[Figure: Pascal SM]

A single SM now has only 64 FP32 CUDA Cores, far fewer than Maxwell's 128 and Kepler's 192, and these 64 CUDA Cores are split into two blocks. Note that the Register File has not shrunk, which means each thread can use more registers and a single SM can have more threads/warps/blocks in flight. Since Shared Memory has not shrunk either, the Shared Memory capacity and bandwidth available to each thread also increase.

Pascal adds 32 FP64 CUDA Cores, the DP Units in the figure above. In addition, the FP32 CUDA Cores can process FP16 at twice the FP32 throughput, a feature prepared for deep learning.
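
The doubled FP16 rate comes from packing two FP16 values into one 32-bit lane, which CUDA exposes through the half2 types and intrinsics in cuda_fp16.h; a hedged sketch (the kernel is illustrative):

```cuda
#include <cuda_fp16.h>

// Each __half2 holds two FP16 values, so one instruction operates on a pair,
// which is where the 2x FP16 throughput relative to FP32 comes from.
__global__ void axpyHalf2(__half2 *y, const __half2 *x, __half2 a, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) y[i] = __hfma2(a, x[i], y[i]);   // fused multiply-add on both halves
}
```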

This version introduces something very important: NVLink.

As the computing power of a single GPU became less and less able to cope with the demands of deep learning, people naturally turned to multiple GPUs, from single-machine multi-GPU to multi-machine multi-GPU, and the demand for GPU interconnect bandwidth kept growing. Between machines, InfiniBand and 100Gb Ethernet are used for communication; within a single machine, especially as configurations went from one GPU to eight, PCIe bandwidth often became the bottleneck. To solve this, NVIDIA introduced NVLink for point-to-point communication among the GPUs inside a machine, with a bandwidth of 160GB/s, about 5x that of PCIe 3.0 x16. The figure below shows a typical single-machine 8x P100 topology.

[Figure: typical single-machine 8x P100 NVLink topology]

Some special CPUs can also be connected to GPUs via NVLink, such as IBM's POWER8.

Volta

In 2017, NVIDIA released the Volta architecture, which can be said to revolve entirely around deep learning and is a major release relative to Pascal. As always there are more SMs and Cores, so let's look directly inside a single SM.

[Figure: Volta SM]

Similar to the change in Pascal, Volta splits the SM directly into 4 blocks, each with its own L0 instruction cache, while the Shared Memory/Register File has not shrunk, so, as with Pascal, a single thread can use more resources. Each block also gains two units called Tensor Cores, the heart of this release. One complaint: this version merges L1 and Shared Memory again.

Looking at the CUDA Cores first, we can see that the original CUDA Core has been split into an FP32 CUDA Core and an INT32 CUDA Core, which means FP32 and INT32 operations can execute at the same time.

As we all know, the computational bottleneck of deep learning is matrix multiplication, called GEMM in BLAS, and the Tensor Core is a unit that does nothing but GEMM. From this point on, NVIDIA moves from pure SIMT to a SIMT+DSA hybrid.

Each Tensor Core performs only the following operation:

D = A × B + C

where A, B, C, and D are all 4x4 matrices; A and B are FP16 matrices, while C and D can be FP16 or FP32. Typically, larger matrix multiplications are broken down into such 4x4 tiles.

This matrix multiplication is exposed to programmers as warp-level operations starting with CUDA 9; in addition, Tensor Cores can also be used through cuBLAS and cuDNN.
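
The warp-level interface is the wmma API in mma.h; below is a minimal sketch for one 16x16x16 FP16 tile with an FP32 accumulator (the tile size and layouts chosen here are common defaults, not the only options, and the kernel must be launched with at least one full warp):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes D = A * B + C for a 16x16x16 tile.
// A and B are FP16; the accumulator is FP32, matching the text above.
__global__ void wmmaTile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // C = 0
    wmma::load_matrix_sync(aFrag, A, 16);        // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}
```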

The other important update in this release is NVLink: simply put, more links and faster links. Each link provides 25GB/s of bandwidth per direction, and each GPU can now have 6 NVLink connections instead of the 4 of the Pascal era. A typical topology looks like this:

[Figure: typical Volta NVLink topology]

Starting with Volta, thread scheduling changes. On Pascal and earlier GPUs, the 32 threads in a Warp share one Program Counter (PC) and use an Active Mask to indicate which threads are runnable at any moment. A classic execution looks like this:

[Figure: pre-Volta warp execution with a shared PC and active mask]

The second branch does not execute until the first branch is entirely finished. This means that different branches within the same warp lose concurrency: threads on different branches cannot signal each other or exchange data. At the same time, threads in different warps keep their concurrency, so the concurrency semantics of threads become inconsistent, and a programmer who is unaware of this can easily end up with a deadlock.

Volta solves this problem: each thread within a warp has its own PC and stack, as follows:

[Figure: Volta warp with per-thread PC and stack]

Because execution still has to conform to SIMT at runtime, a schedule optimizer is responsible for grouping runnable threads and executing them in SIMT fashion. A classic execution looks like this:

[Figure: interleaved execution of divergent branches on Volta]

Note in the figure above that the execution of Z is not merged. This is because Z may produce data needed by the other branch, so the schedule optimizer only merges Z when it determines that doing so is safe; the unmerged Z above is just one possible case, and in general the optimizer is smart enough to find safe merges. Programmers can also force reconvergence through an API, as follows:

[Figure: forcing reconvergence through an explicit synchronization API]
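
A sketch of the same idea using __syncwarp(), the warp-level synchronization primitive introduced with CUDA 9 (the surrounding kernel is illustrative and assumes every warp is full):

```cuda
// Assumes the launch covers exactly the data, i.e. n is a multiple of blockDim.x,
// so no lane exits early and the full-warp __syncwarp() mask is valid.
__global__ void branchThenJoin(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v;
    if (threadIdx.x % 2 == 0)
        v = in[i] * 2.0f;        // path A
    else
        v = in[i] + 1.0f;        // path B

    __syncwarp();                // explicitly reconverge the warp before continuing
    out[i] = v;
}
```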

Volta also improves support for concurrent use of a GPU by multiple processes. On Pascal and earlier, multiple processes shared a single GPU through classic time slicing. Starting with Volta, multiple processes that individually cannot saturate the GPU can run on it in parallel, as shown below:

[Figure: multi-process execution on Volta]

Turing

In 2018, NVIDIA released the Turing architecture, which I personally regard as an extension of Volta. There are of course all kinds of parameter improvements, but we won't dwell on those here.

[Figure: Turing SM]

More important is the addition of the RT Core, short for Ray Tracing Core. As the name suggests, this targets games and simulation; since I have not worked in that area, I won't introduce it.

In addition, the Tensor Cores in Turing add support for INT8/INT4/Binary to accelerate deep learning inference, as quantized deployment of deep learning models gradually matures.

Ampere

In 2020, NVIDIA released the Ampere architecture, a major release subdivided into GA100, GA102, and GA104; here we focus only on GA100.

Let's first look at the SM of the GA100:

[Figure: GA100 SM]

The most important upgrade here is again the Tensor Core.

In addition to the FP16 support from Volta and the INT8/INT4/Binary support from Turing, this release adds TF32, BF16, and FP64. Let's focus on TF32 and BF16, shown below:

[Figure: bit layouts of FP32, TF32, BF16, and FP16]

The problem with FP16 is that its representable range is too small: gradients easily underflow, and forward and backward computations overflow relatively easily. In deep learning, range matters much more than precision, hence BF16, which sacrifices precision in exchange for a range similar to FP32 and was already supported by other well-known hardware before this. TF32 is designed to absorb the benefits of BF16 while staying largely compatible with mainstream FP32: an FP32 value becomes TF32 simply by truncation. Inputs are truncated to TF32 for the computation and the result is converted back to FP32, which has almost no effect on existing code, as shown below:

[Figure: TF32 Tensor Core computation with FP32 inputs and FP32 output]
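
To make "FP32 is TF32 as long as it is truncated" concrete, here is a small host-side sketch that keeps FP32's sign bit, all 8 exponent bits, and the top 10 of its 23 mantissa bits; this is purely an illustration of the truncation described above, the hardware performs the conversion internally:

```cuda
#include <cstdint>
#include <cstring>
#include <cstdio>

// FP32: 1 sign + 8 exponent + 23 mantissa bits.
// TF32 keeps the sign, all 8 exponent bits, and only 10 mantissa bits,
// so emulating the truncation just means clearing the low 13 mantissa bits.
float truncateToTf32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;               // zero the 13 least-significant mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float v = 1.2345678f;
    printf("%.9f -> %.9f\n", v, truncateToTf32(v));
    return 0;
}
```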

Another change is fine-grained structured sparsity. In deep learning model compression, besides quantization, sparsity is another broad direction, but sparse models are hard to accelerate in hardware. This version of the GPU provides some support for sparsity, currently aimed mainly at inference scenarios.

First, the sparse matrix format NVIDIA defines here is called 2:4 structured sparsity, meaning that in every group of 4 elements, only 2 are non-zero, as shown below:

[Figure: 2:4 structured sparsity]

The workflow is: first train with ordinary dense weights until convergence; then prune the weights to a 2:4 structured sparse tensor; then fine-tune, continuing to train only the non-zero weights. The result is a 2:4 structured sparse weight tensor with, ideally, the same accuracy as the dense weights, which is then used for inference. The Tensor Cores in this version support multiplying a 2:4 structured sparse matrix directly by another dense matrix.
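
As a concrete reading of the 2:4 pattern, here is a small sketch that prunes each group of four weights down to its two largest-magnitude entries; the magnitude criterion is a common pruning choice for illustration, not necessarily what NVIDIA's tools do:

```cuda
#include <cmath>
#include <cstdio>

// Enforce 2:4 structured sparsity: in every group of 4 consecutive weights,
// keep the 2 largest-magnitude values and set the other 2 to zero.
void pruneTo2of4(float *w, int n) {
    for (int g = 0; g + 4 <= n; g += 4) {
        int best = 0, second = 1;
        if (std::fabs(w[g + 1]) > std::fabs(w[g + 0])) { best = 1; second = 0; }
        for (int k = 2; k < 4; ++k) {
            float v = std::fabs(w[g + k]);
            if (v > std::fabs(w[g + best]))        { second = best; best = k; }
            else if (v > std::fabs(w[g + second])) { second = k; }
        }
        for (int k = 0; k < 4; ++k)
            if (k != best && k != second) w[g + k] = 0.0f;
    }
}

int main() {
    float w[8] = {0.9f, -0.1f, 0.4f, 0.05f, -0.7f, 0.2f, 0.0f, 0.6f};
    pruneTo2of4(w, 8);
    for (float v : w) printf("%.2f ", v);   // two non-zeros per group of four
    printf("\n");
    return 0;
}
```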

The last important feature is MIG (Multi-Instance GPU). Although compute scale in industry keeps growing, many tasks, by their nature, cannot fill a whole GPU, which wastes resources, so there is a need to run multiple tasks on one GPU. Before this, some cloud vendors offered virtualization solutions for this; in Ampere, the need is supported directly, and the feature is called MIG.

One might wonder whether the multi-process support introduced in Volta already solved this. Not entirely: on Volta, although multiple processes can run in parallel, all of them can access all memory resources, so one process can consume all the DRAM bandwidth and hurt the others, which may well have their own throughput/latency requirements. Stricter isolation is needed.

With MIG on Ampere, each A100 can be divided into up to 7 GPU instances serving different tasks. Each instance has its own SMs and its own memory resources, which guarantees each task a stable throughput and latency that meet expectations. Users can treat these virtual GPU instances as real GPUs.

Conclusion

There are of course many more details to each architecture; this is only a brief summary. I will share more specifics about CUDA programming later. You are also welcome to exchange ideas with me (online or offline) so we can make progress together.

Original source: https://zhuanlan.zhihu.com/p/413145211

Reference

https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf

https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

https://www.microway.com/download/whitepaper/NVIDIA_Maxwell_GM204_Architecture_Whitepaper.pdf

https://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf

http://www.hardwarebg.com/b4k/files/nvidia_gf100_whitepaper.pdf

https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

https://blog.nowcoder.net/n/4dcb2f6a55a34de9ae6c9067ba3d3bfb

https://jcf94.com/2020/05/24/2020-05-24-nvidia-arch/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
