
What tricks can make a model train faster? First, understand the problem from first principles

Excerpted from Horace He's blog

Written by Horace He

Compiled by Machine Heart

Editor: Juniper

Is deep learning just alchemy? Not quite.

Everyone wants their models to train faster, but have you really found the right way to do it? According to Horace He, a Cornell undergraduate who interned with the PyTorch team, the problem should be approached in steps: first figure out why training is slow, that is, where the bottleneck is, and only then look for the corresponding fix. Trying things at random before you understand the fundamentals (first principles) is a waste of time.

In this article, Horace He analyzes the possible bottlenecks from three angles: compute, memory bandwidth, and overhead, and shows how to determine which regime you are currently in, so you can speed up your system in a targeted way. The article has been endorsed by senior researchers and developers such as Tianqi Chen.


Here is what the original post says:

How can I improve the performance of my deep learning models? Most people reach for random tricks collected from blog posts, such as "use the framework's built-in operators, set the gradients differently, use PyTorch 1.10.0 instead of 1.10.1..."

In this area, contemporary (especially deep learning) systems often feel less like science and more like alchemy, so it is easy to see why users fall back on this random approach. Even so, there are first principles we can follow, and they let us rule out whole classes of methods, which makes the problem much more tractable.

For example, if your training loss is far lower than your test loss, you probably have an "overfitting" problem, and increasing the capacity of your model is a waste of time. Conversely, if your training loss matches your validation loss, regularizing the model would be unwise.

Similarly, you can divide the problem of efficient deep learning into three distinct components:

Compute: the time the GPU spends performing actual floating-point operations (FLOPs);

Memory bandwidth: the time spent moving tensors around within the GPU;

Overhead: the time spent on everything else.

When training a machine learning model, it is critical to know what kind of problem you have, and the same is true for making models efficient. For example, if a model spends most of its time on memory transfers (i.e., it is memory-bandwidth bound), increasing the GPU's FLOPS will not help. On the other hand, if it spends most of its time running big matrix multiplications (i.e., it is compute bound), rewriting your model logic in C++ to shave off overhead will not help either.

So, if you want your GPU to run smoothly, understanding these three aspects is essential.


Behind "The Bitter Lesson" stands a large number of engineers keeping the GPUs running efficiently.

Note: most of this post uses GPUs and PyTorch as examples, but the principles generally hold across hardware and frameworks.

Compute

One way to look at optimizing deep learning systems is that we want to maximize the time spent in compute. You paid for 312 teraFLOPS, and ideally you get all 312 of them. But to get your money's worth out of your expensive matrix multiplications, you need to spend less time on everything else.

But why focus on maximizing compute rather than maximizing memory bandwidth? The reason is simple: you can reduce overhead or memory cost, but you can hardly reduce the required computation without changing the actual operations you run.

Maximizing compute utilization is also getting harder, because compute has been growing much faster than memory bandwidth. The following table shows how often peak FLOPS have doubled compared with how often memory bandwidth has doubled (focus on the highlighted column).

[Table: how often peak FLOPS vs. memory bandwidth have doubled]

One way to think about compute is as a factory. We send instructions to the factory (overhead) and ship raw materials to it (memory bandwidth), all so that the factory can run efficiently (compute).


So, if the factory increases its capacity faster than the rate at which we can supply it with raw materials, it becomes harder for it to reach peak efficiency.


Even if the factory's capacity (FLOPS) doubles, performance will not double if the bandwidth cannot keep up.

One more thing about FLOPS: more and more machine learning accelerators have hardware units dedicated to matrix multiplication, such as NVIDIA's Tensor Cores.


So, if you are not doing matrix multiplication, you only get 19.5 teraFLOPS rather than 312. And GPUs are not special here; in fact, TPUs are even more specialized than GPUs.

The fact that the GPU is much slower at everything other than matrix multiplication may seem problematic at first: what about the other operators, like layer normalization or activation functions? In truth, in terms of FLOPs those operators are rounding errors next to matrix multiplication. Consider, for example, the table below of FLOP counts for different operator types in BERT, where "Tensor Contraction" means matrix multiplication.

[Table: FLOPs by operator type in BERT]

As you can see, non-matrix-multiplication operations account for only 0.2% of all FLOPs, so even if they ran at 1/15 the speed of matrix multiplication it would hardly matter.

In practice, though, normalization and pointwise operations achieve only about 1/250 and 1/700 of the FLOPS of matrix multiplication. So why do non-matmul operations take so much more time than their FLOP count suggests?

Going back to the "factory" analogy earlier, the culprit is often how the raw material is transported to and from the factory, in other words, the "memory bandwidth."

Bandwidth

Bandwidth cost is essentially the cost of moving data from one place to another: from CPU to GPU, from one node to another, or even from CUDA global memory to CUDA shared memory. The last one is the focus of this article, and it is what we usually mean by "bandwidth cost" or "memory bandwidth cost." The first two are usually called "data transfer cost" or "network cost" and are outside the scope of this article.

Back to the factory analogy. Although the factory is where we do the actual work, it is not suited for bulk storage. What matters is that its local storage is fast (SRAM), not that it is large.

So where do we keep the results and the "raw materials"? Usually in a warehouse where space is cheap and plentiful (DRAM), and we ship things back and forth between the warehouse and the factory (memory bandwidth).


The cost of shipping things between DRAM and the compute units is the so-called "memory bandwidth" cost. Incidentally, the "memory" reported by nvidia-smi is this DRAM, and the dreaded "CUDA out of memory" error is complaining about this DRAM too.

It's worth noting that every time we run a GPU kernel, we need to ship the data out of and back into our warehouse, the DRAM.

Now imagine what happens when we run a unary operation such as torch.cos: we ship the data from the warehouse (DRAM) to the factory (SRAM), perform a tiny bit of computation in the factory, and ship the result back to the warehouse. Shipping is quite time-consuming, and in this case almost all of our time is spent shipping data rather than computing.

Because we spend essentially all of our time on memory transfers, such an operation is called memory-bound: very little of the time goes to actual compute.
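
To make that concrete, here is a rough sketch (assuming PyTorch, a CUDA GPU, and the A100's roughly 1.5 TB/s of global memory bandwidth quoted later in this post) that compares the measured time of torch.cos on a large tensor with the time a purely bandwidth-limited operation would need:

import torch

N = 2 ** 27                       # ~134M fp32 elements
x = torch.randn(N, device="cuda")

torch.cos(x)                      # warm-up
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = torch.cos(x)
end.record()
torch.cuda.synchronize()
measured_ms = start.elapsed_time(end)

# Lower bound if we only paid for memory traffic: read N floats + write N floats
bandwidth_bound_ms = (2 * 4 * N) / 1.5e12 * 1e3   # assumes ~1.5 TB/s
print(measured_ms, bandwidth_bound_ms)            # for a memory-bound op these land close together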

Obviously, that's not what we want. So what can we do? Let's see what an operator sequence looks like.


What a sequence of pointwise operators might look like.

Shuttling data back and forth between global memory and the compute units is clearly suboptimal. A better approach is to perform the whole chain of operations in the factory in one go and only then send the data back.


This is operator fusion, the most important optimization in deep learning compilers. Simply put, instead of writing intermediate results to global memory only to read them back again, we avoid the extra memory accesses by performing several computations in one pass.

For example, computing x.cos().cos() the naive way requires 4 global memory reads and writes.

With operator fusion it takes only 2 global memory reads and writes, a 2x speedup.
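
To make the counting concrete, here is a small sketch of where the 4 versus 2 global memory accesses come from (PyTorch-style pseudocode; note that eager PyTorch alone will not fuse this, so the fused line assumes a fusing compiler or a hand-written kernel is doing the work):

# Unfused: two separate kernels, 4 global memory accesses in total
x1 = x.cos()        # read x from global memory, write x1 back to global memory
x2 = x1.cos()       # read x1 from global memory, write x2 back to global memory

# Fused: one kernel, 2 global memory accesses in total
x2 = x.cos().cos()  # read x once, apply cos twice in on-chip memory, write x2 once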

However, fusion is not free and has its preconditions. First, the GPU needs to know what comes next when it executes the current operation, so this optimization cannot be done in PyTorch's eager mode (which runs one operator at a time). Second, we have to generate CUDA code for the fused kernel, and that is not a simple task.

Not every fusion is as simple as chaining pointwise operators. You can fuse pointwise operators onto reductions, or onto matrix multiplications. Even a matrix multiplication itself can be viewed as fusing a broadcasted multiply with a reduction.

Nearly any two adjacent PyTorch operators can be fused, saving the memory bandwidth cost of reading from and writing to global memory in between. Many existing compilers can do the "simple" fusions automatically (e.g., NVFuser and XLA), but more complex fusions still have to be written by hand, and if you want to try writing custom CUDA kernels, Triton is a good starting point.

Perhaps surprisingly, the fused x.cos().cos() takes almost the same time as a single call to x.cos(). This is also why activation functions all cost roughly the same, even though gelu clearly involves far more operations than relu.

This fact leads to some interesting consequences for rematerialization/activation checkpointing. Essentially, doing extra recomputation can mean less memory traffic, and therefore less runtime. Rematerialization can thus reduce both memory footprint and runtime, which is what the neat min-cut optimization pass in AOTAutograd builds on.

Reasoning about memory bandwidth costs

For simple operations you can reason about memory bandwidth directly. For example, an A100 has 1.5 terabytes/second of global memory bandwidth and can perform 19.5 teraFLOPS of compute. So if you use 32-bit floats (4 bytes each), the GPU can perform 20 trillion operations in the time it takes to load 400 billion numbers.

Moreover, to perform something as simple as multiplying a tensor by 2, we also have to write the tensor back to global memory.

So unless we perform on the order of a hundred operations per load and store, we spend more time on memory accesses than on actual compute.
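
Putting numbers to that "about a hundred" claim, here is a back-of-the-envelope sketch using the A100 figures quoted above (and assuming the chain of unary ops is fused, so the tensor is read and written only once):

bandwidth = 1.5e12      # global memory bandwidth, bytes/s (A100)
flops = 19.5e12         # non-matmul compute, FLOP/s (A100)
bytes_per_elem = 4      # fp32

# k fused unary ops on N elements: memory time = 2 * bytes_per_elem * N / bandwidth,
# compute time = k * N / flops. Setting the two equal gives the break-even k:
k_breakeven = 2 * bytes_per_elem * flops / bandwidth
print(k_breakeven)      # = 104.0, i.e. roughly a hundred ops per load/store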

If you execute the following PyTorch function:
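
(The function itself is not reproduced in this excerpt; a minimal reconstruction consistent with the description that follows, i.e. multiply a tensor by 2, repeat times, might look like this.)

import torch

def f(x: torch.Tensor, repeat: int) -> torch.Tensor:
    # repeated pointwise multiplications on the same tensor
    for _ in range(repeat):
        x = x * 2
    return x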

Benchmarking it with a fusing compiler lets us compute the achieved FLOPS and memory bandwidth for different values of repeat. Increasing repeat is a simple way to add computation without adding memory accesses; this is known as increasing the compute intensity.

Specifically, suppose we benchmark this code and measure the number of iterations executed per second. Each iteration performs 2N memory accesses (where N is the tensor size) and N * repeat FLOPs, so the achieved memory bandwidth is bytes_per_elem * 2 * N * itrs_per_second, while the achieved FLOPS is N * repeat * itrs_per_second.
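
As a sketch of that bookkeeping (assuming a recent PyTorch where torch.compile is available to act as the fusing compiler — the original post used a different one — and using the reconstructed f above):

import torch

N = 2 ** 24
repeat = 64
bytes_per_elem = 4                 # fp32
x = torch.randn(N, device="cuda")

compiled_f = torch.compile(f)
compiled_f(x, repeat)              # warm-up / trigger compilation
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    compiled_f(x, repeat)
end.record()
torch.cuda.synchronize()

itrs_per_second = iters / (start.elapsed_time(end) / 1e3)
mem_bandwidth = bytes_per_elem * 2 * N * itrs_per_second   # bytes/s
achieved_flops = N * repeat * itrs_per_second              # FLOP/s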

Now, let's plot runtime, achieved FLOPS, and achieved memory bandwidth as functions of compute intensity.

[Figure: runtime, achieved FLOPS, and achieved memory bandwidth versus compute intensity]

Notice that the runtime barely increases at all until we are doing 64 multiplications. That means that up to that point we are almost entirely memory-bandwidth bound, and the compute units are mostly sitting idle.

The achieved FLOPS starts at 0.2 teraFLOPS. As we double the compute intensity it grows linearly until it approaches the peak of 9.75 teraFLOPS; once it is near that peak we say the operation is "compute bound."

Finally, we can see that the achieved memory bandwidth starts near its peak and falls as we increase the compute intensity. This is exactly what we expect: more and more of the time goes to doing actual computation rather than accessing memory.

In this case it is easy to see when we are compute bound and when we are memory bound. Once repeat reaches about 64, compute is close to saturation (that is, close to peak FLOPS) and the memory bandwidth utilization begins to drop.

For larger systems it is often hard to say categorically whether they are compute bound or memory-bandwidth bound, since they usually contain a mix of both. One common way to measure how compute bound you are is to compute the percentage of peak FLOPS you actually achieve.
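
For instance, a tiny sketch of that metric (the peak value defaults to the A100 tensor-core figure quoted earlier; the example numbers at the bottom are made up):

def flops_utilization(flop_per_step: float, seconds_per_step: float,
                      peak_flops: float = 312e12) -> float:
    # fraction of peak FLOPS actually achieved
    return (flop_per_step / seconds_per_step) / peak_flops

# e.g. a training step needing 1e15 FLOP that takes 5 s runs at ~64% of peak
print(flops_utilization(1e15, 5.0))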

However, in addition to memory bandwidth costs, there is one more thing that can keep the GPU from running smoothly.

Additional overhead

Overhead is the time your code spends on anything other than transferring tensors or computing: time spent in the Python interpreter, time spent in the PyTorch framework, time spent launching (but not executing) CUDA kernels, and so on.

Overhead matters because modern GPUs are really fast. An A100 can perform 312 trillion floating-point operations per second (312 teraFLOPS). By comparison, Python is painfully slow: it manages only about 32 million additions per second.

That means that in the time Python performs a single FLOP, an A100 could have performed 9.75 million FLOPs (312 trillion / 32 million ≈ 9.75 million).

To make matters worse, the Python interpreter is not even the only source of overhead; frameworks like PyTorch also have many layers of dispatch before they reach the actual kernel. PyTorch can do roughly 280,000 operations per second, so if you work with tiny tensors (as in some scientific computing workloads), PyTorch will feel very slow compared with C++.
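
If you want to see numbers like these for yourself, here is a quick, machine-dependent sketch with timeit (your figures will differ from the ones quoted above):

import timeit
import torch

n = 1_000_000

# Rough throughput of pure-Python addition
t = timeit.timeit("x + 1", setup="x = 1", number=n)
print(f"Python adds/sec: {n / t:,.0f}")

# Rough throughput of an op on a tiny tensor, dominated by dispatch overhead
a = torch.randn(1)
t = timeit.timeit(lambda: a + 1, number=n)
print(f"tiny-tensor torch adds/sec: {n / t:,.0f}")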

For example, in the profile below of PyTorch performing a single addition, only one small slice is the kernel doing the actual computation; everything else is pure overhead.

[Figure: profile of a single PyTorch addition; only a thin slice is the actual kernel]

Given that, you might be puzzled that PyTorch became a mainstream framework at all. The reason is that modern deep learning models typically run very large operations, and frameworks like PyTorch execute asynchronously, so most of the framework overhead can be hidden entirely.


As long as each GPU operation is big enough, the CPU can run ahead of the GPU (and the CPU overhead doesn't matter). If the GPU operations are too small, on the other hand, the GPU spends most of its time as an expensive paperweight.

So, how do you tell whether you are in this regime? Since overhead generally does not scale with problem size (while compute and memory do), the easiest test is simply to increase the size of your data. If the runtime does not increase proportionally, you are overhead bound. For example, if you double the batch size and the runtime only grows by 10 percent, you are probably overhead bound.
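
A crude way to run that test (a sketch only: model, batch, and the forward-only timing are stand-ins for whatever you are actually training):

import time
import torch

def seconds_per_step(model, batch, iters=20):
    # time the GPU work, not just the kernel launches
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# t1 = seconds_per_step(model, batch)
# t2 = seconds_per_step(model, doubled_batch)
# if t2 is well below 2 * t1, you are probably overhead bound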

Another approach is to use the PyTorch profiler. In the figure below, the pink blocks show how the CPU kernels line up with the GPU kernels.


The CPU runs well ahead of the GPU.

On the other hand, the "GPU-Util" entry in nvidia-smi (not "Volatile GPU-Util") measures the percentage of time a GPU kernel is actually running, so it is another good way to spot overhead-bound situations. This overhead is the price of the flexibility of frameworks like PyTorch, which essentially have to spend a lot of time "figuring out what to do."

That "figuring out" can happen in Python (looking up attributes or dispatching to the correct function) or inside PyTorch itself. For example, when you run a + b, the following steps need to happen:

Python needs to look up what __add__ dispatches to on a.

PyTorch needs to determine many attributes of the tensor (such as dtype, device, and whether autograd is needed) to decide which kernel to call.

PyTorch needs to actually launch the kernel.

Fundamentally, this overhead comes from the flexibility of being able to do something different at every step. If you don't need that flexibility, one way out is to trace it away, for example with jit.trace, FX, or jax.jit. Alternatively, you can drop to a lower level with something like CUDA Graphs.
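
As a small illustration of the tracing route, here is a sketch using torch.jit.trace (one of the options named above; the function and shapes are made up, and a CUDA GPU is assumed — drop the device argument to run on CPU):

import torch

def gelu_ish(x):
    # a few pointwise ops; in eager mode every one of them pays Python + dispatch overhead
    return 0.5 * x * (1 + torch.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

example = torch.randn(1024, device="cuda")
traced = torch.jit.trace(gelu_ish, example)   # record the ops once

y = traced(example)   # replaying the recorded graph skips the Python-level dispatch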

Unfortunately, this comes at the cost of flexibility. One way to get the best of both worlds is to build a more "real" JIT by introspecting at the VM level; for more on this, see TorchDynamo (https://dev-discuss.pytorch.org/t/torchdynamo-an-experiment-in-dynamic-python-bytecode-transformation/361).

Summary

If you want to speed up a deep learning system, the most important thing is to understand what the bottleneck in your model is, because the bottleneck determines which methods will actually help.

All too often, I see researchers and other people interested in speeding up their PyTorch code trying things blindly without understanding which regime they are in.


Of course, if users have to think about all of this, that also reflects a partial failure on the framework's part. PyTorch's compilers and profiling APIs are an active area of work, but they are not the easiest things to use.

All in all, I find that understanding the first principles of the system almost always pays off, and I hope it proves useful to you as well.
