NVIDIA CUDA's monopoly is hard to hold: PyTorch keeps tearing down the towers, and OpenAI is raiding the base

James Alex, from Aofeisi

Qubits | Official account QbitAI

Nvidia's software moat is fading away.

As PyTorch adds support for more GPU vendors and OpenAI's Triton shakes up the market, CUDA, NVIDIA's sharpest weapon, is gradually losing its edge.

That view comes from Dylan Patel, chief analyst at SemiAnalysis, whose article on the subject has drawn a wave of industry attention.

Some netizens commented after reading it:

NVIDIA got into this situation because it chased short-term profits and gave up on innovation.

Sasank Chilamkurthy, one of PyTorch's authors, twisted the knife:

When Nvidia offered to buy Arm, I was very uneasy about the potential monopoly. So I started doing what any normal person would do: getting CUDA removed from the leading AI framework.

Let's walk through the reasons behind all this, as Patel lays them out.

PyTorch has won the AI framework war, and will support more GPUs

First, a brief look at CUDA's glory days.

CUDA is a parallel computing platform and programming model launched by NVIDIA.

For NVIDIA, CUDA marked a historic turning point: its arrival let the company take off in the AI chip field.

Before CUDA, NVIDIA's GPUs were just "graphics processing units" responsible for drawing images on a screen.

CUDA opened the GPU to general-purpose computing: it exposes the GPU's hardware acceleration for solving complex computational problems and lets customers program the processor for many different tasks.

Beyond ordinary PCs, GPUs now sit in self-driving cars, robots, supercomputers, VR headsets and other popular devices. And for a long time, only NVIDIA's GPUs could handle a wide range of complex AI tasks quickly.

So how did CUDA, once riding high, later see its position shaken?

The story starts with the AI framework wars, above all PyTorch vs. TensorFlow.

If you think of a framework like PyTorch as a car, CUDA is its gearbox: it accelerates the framework's computations, so that PyTorch running on NVIDIA GPUs trains and runs deep learning models faster.

TensorFlow got an early start and remains a powerful tool in Google's hands, but over the past two years its momentum has been overtaken by PyTorch. At the top AI conferences, the share of papers using PyTorch has climbed markedly:

Source: The Gradient; share of papers exclusively mentioning PyTorch

A longtime TensorFlow user chimed in: "I'm switching to PyTorch now."

A key factor in PyTorch's success is that it is more flexible and easier to use than TensorFlow.

This is thanks to PyTorch's eager mode, which lets you modify the model at runtime and see the result of each step immediately. TensorFlow now has an eager mode too, but most of the big tech companies are already building their solutions around PyTorch. (Ouch...)
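
For a feel of what eager mode means in practice, here is a minimal sketch (shapes and values are arbitrary illustration): every line runs the moment it is written, so intermediate tensors can be printed or the code changed on the spot.

```python
import torch

# Eager mode: each operation executes as soon as it is written,
# so intermediate results can be inspected (or the code changed) on the fly.
x = torch.randn(2, 3, requires_grad=True)
h = x.relu()
print(h)              # the intermediate value is available immediately

loss = h.sum()
loss.backward()       # autograd replays the graph recorded during execution
print(x.grad)
```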

Beyond that, although both are driven from Python, PyTorch is the more comfortable of the two to write.

PyTorch also offers more off-the-shelf models and a richer ecosystem: by one count, 85% of the large models on Hugging Face are implemented in PyTorch.

In the past, however fiercely the AI frameworks fought, the parallel computing layer beneath them, CUDA, reigned unchallenged.

But times change. PyTorch eventually beat the former leader TensorFlow in the framework contest and, with its position secure for now, began making moves.

In recent years PyTorch has expanded to support more GPUs, and the first stable release of PyTorch 2.0 will further improve support for other GPUs and accelerators, including those from AMD, Intel, Tesla, Google, Amazon, Microsoft, Meta and others.

In other words, NVIDIA GPUs are no longer the only option.
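
In everyday code this shows up as device-agnostic models: nothing below is NVIDIA-specific, and vendor backends plug in behind the same device abstraction (a minimal sketch; the fallback here is simply CPU).

```python
import torch

# The same model code runs on whichever backend is present;
# non-NVIDIA backends register their own device types behind this API.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(4, 128, device=device)
print(model(x).shape, "on", device)
```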

But behind this lie problems of CUDA's own.

The memory wall problem

As noted above, the rise of CUDA and the machine learning wave reinforced each other, growing in tandem. But one phenomenon deserves attention:

In recent years, the FLOPS of NVIDIA's leading hardware has risen steadily, while its memory has improved far less. Take the V100, the state-of-the-art GPU that trained BERT in 2018: since then, FLOPS have grown by an order of magnitude, but memory capacity has not grown much.

Source: SemiAnalysis

In real-world AI training, models keep getting bigger, and their memory requirements grow with them.

Baidu and Meta, for example, need tens of terabytes of memory to store their massive embedding tables when deploying production recommendation networks.

In both training and inference, much of the time is spent not on matrix multiplication but on waiting for data to reach the compute resources.
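
A back-of-envelope calculation shows why: for a pointwise op, the bytes moved dwarf the useful arithmetic, so memory bandwidth, not FLOPS, sets the speed limit. This sketch uses rough, publicly quoted A100-class figures (~312 TFLOPS FP16, ~1.5 TB/s HBM bandwidth); the exact numbers are assumptions for illustration.

```python
# Why pointwise ops are memory-bound: y = f(x) over N fp16 elements
# moves ~4 bytes per element (2-byte read + 2-byte write) for ~1 flop each.
N = 8192 * 8192
bytes_moved = 4 * N
flops = N

PEAK_FLOPS = 312e12   # ~A100 FP16 tensor-core peak (approximate)
PEAK_BW = 1.5e12      # ~A100 HBM bandwidth in bytes/s (approximate)

t_compute = flops / PEAK_FLOPS
t_memory = bytes_moved / PEAK_BW
print(f"compute time: {t_compute * 1e6:8.2f} us")
print(f"memory time:  {t_memory * 1e6:8.2f} us")  # ~800x larger: the op waits on memory
```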

So why not just add more memory?

In short: even "money power" has its limits. Memory is expensive.

In general, a memory system arranges resources in a hierarchy, from "close and fast" to "slow and cheap," according to how the data is used. The nearest shared memory pool usually sits on the same die and typically consists of SRAM.

In machine learning, some ASICs try to hold model weights in a huge SRAM, but that falls short in an era when 100B+-parameter models appear at every turn. After all, even a wafer-scale chip costing around $5 million offers only 40GB of SRAM.

On NVIDIA's GPUs, the on-chip memory is smaller still: just 40MB on the A100 and 50MB on the next-generation H100. At mass-production prices, that SRAM costs as much as $100 per GB.

And the bill doesn't end there. The cost of on-chip SRAM is no longer dropping much as Moore's Law processes advance; on TSMC's next-generation 3nm process, the same 1GB will actually cost more.

DRAM is much cheaper than SRAM, but its latency is an order of magnitude higher, and DRAM cost has barely fallen since 2012.

As AI keeps evolving, its appetite for memory keeps growing. This is how the "memory wall" problem was born.

DRAM already accounts for 50% of total server cost. Compare NVIDIA's 2016 P100 with the latest H100: FP16 performance rose 46x, but memory capacity grew only 5x.

△ NVIDIA H100 Tensor Core GPU

The other problem is also memory-related: bandwidth.

In computation, higher memory bandwidth is obtained through parallelism. For this, NVIDIA uses HBM (High Bandwidth Memory), a structure of 3D-stacked DRAM layers; its packaging is more expensive, which leaves users on modest budgets looking on empty-handed.

As mentioned above, one of PyTorch's great advantages is that eager mode makes training and inference flexible and easy to use. But eager mode is also very hungry for memory bandwidth.

The main remedy for this is operator fusion. The essence is right there in the name: instead of writing every intermediate result out to memory, a single pass computes several functions in a row, so the volume of memory reads and writes shrinks (see the sketch after the figure below).

△ Operator fusion Source: horace.io/brrr_intro.html
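
A minimal sketch of the idea (assuming a CUDA device is available): in eager mode, each pointwise op below launches its own kernel and round-trips its intermediate through GPU memory, which is exactly the traffic fusion eliminates.

```python
import torch

def chain(x):
    a = x.cos()     # kernel 1: read x, write a
    b = a.sin()     # kernel 2: read a, write b
    return b.exp()  # kernel 3: read b, write result

# A fused kernel would read each element once, apply cos/sin/exp in
# registers, and write once -- one memory round trip instead of three.
# Writing that kernel by hand means CUDA C++; PyTorch 2.0's compiler
# (discussed below) can generate it automatically:
fused_chain = torch.compile(chain)

x = torch.randn(4096, 4096, device="cuda")
torch.testing.assert_close(fused_chain(x), chain(x))
```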

To implement operator fusion, you have to write a custom CUDA kernel, and that means writing C++.

This is where CUDA's drawback shows itself: for many people, writing CUDA is far harder than writing Python scripts...

By contrast, PyTorch 2.0 dramatically lowers that threshold. With NVIDIA and third-party libraries built in, there is no need to learn CUDA separately; operators can be added directly in PyTorch, which is naturally friendly to the model-training "alchemists."

Of course, this has also caused PyTorch's operator count to balloon over the years, at one point topping 2,000 (doge).

PyTorch 2.0, unveiled at the end of 2022, goes further still, taking direct aim at compilation.

With a new graph-execution compilation scheme, the framework improved training performance by 86% on the A100 and CPU inference performance by 26%.
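
Opting in to that compilation is a one-line change via torch.compile, PyTorch 2.0's entry point. A minimal sketch of a compiled training step (layer sizes and optimizer are arbitrary illustration):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# torch.compile captures the graph (TorchDynamo) and emits fused
# kernels for the target backend (TorchInductor).
compiled = torch.compile(model)

opt = torch.optim.AdamW(compiled.parameters(), lr=1e-4)
x = torch.randn(64, 1024, device="cuda")
loss = compiled(x).square().mean()
loss.backward()
opt.step()
```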

PyTorch 2.0 also relies on PrimTorch to shrink the original 2,000-plus operators down to about 250 primitives, making the framework much easier for non-NVIDIA backends to target, and on TorchInductor to automatically generate fast code for multiple accelerators and backends.

Moreover, PyTorch 2.0 better supports data parallelism, sharding, pipeline parallelism and tensor parallelism, making distributed training smoother.
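
As one concrete slice of that story, here is a minimal data-parallel sketch using DistributedDataParallel, launched with torchrun (sharding and pipeline/tensor parallelism go through other PyTorch APIs, such as FSDP; model and batch sizes here are arbitrary):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU, e.g. `torchrun --nproc_per_node=8 train.py`
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients all-reduced across ranks

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device=f"cuda:{rank}")
    model(x).square().mean().backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```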

It is these techniques, combined with support for GPUs and accelerators from vendors beyond NVIDIA, that make the software wall CUDA built around NVIDIA look far less insurmountable.

And a challenger is coming up from behind

So NVIDIA's own memory gains haven't kept pace, and PyTorch 2.0 looms on the other side. And it still isn't over:

OpenAI has launched a "simplified CUDA": Triton. (A raid on the base, no less.)

Triton is a new language and compiler. It is easier to work with than CUDA, yet its performance is comparable.

OpenAI claims:

In just 25 lines of code, Triton can match cuBLAS performance on FP16 matrix multiplication.

OpenAI researchers have already used Triton to produce kernels twice as efficient as equivalent Torch implementations.
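
For a taste of the language, here is the vector-addition kernel from the official Triton tutorial, lightly condensed: the tile size is a compile-time constant, and out-of-range lanes are masked off.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which tile this instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(98432, device="cuda")
y = torch.randn(98432, device="cuda")
torch.testing.assert_close(add(x, y), x + y)
```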

Although Triton currently only officially supports NVIDIA GPUs, the architecture will support multiple hardware vendors down the line.

Also worth noting: Triton is open source. Compared with closed-source CUDA, other hardware accelerators can integrate directly into Triton, greatly cutting the time needed to build an AI compiler stack for new hardware.

Then again, some feel CUDA's monopoly is far from broken. Soumith Chintala, another PyTorch author and a Distinguished Engineer at Meta, put it this way:

This article exaggerates reality. CUDA will continue to be the key architecture PyTorch relies on.

Triton is not the first (optimizing) compiler, either; most of the current attention is still on the XLA compiler.

Whether Triton will slowly gain acceptance, he said, remains unclear; time will tell. In short, Triton does not pose much of a threat to CUDA.

Patel himself saw the comment and replied:

I'm not saying [CUDA's monopoly] is broken; I'm saying it's receding.

Besides, Triton at present only officially supports NVIDIA GPUs (there are no performance tests on other GPUs), and if XLA's performance on NVIDIA GPUs isn't dominant, it may well fall short of Triton.

But Soumith Chintala pushed back even on the claim that CUDA's standing is declining: for Triton to win over hardware vendors, many risks remain, and there is still a long way to go.

Some netizens side with the PyTorch author:

I too hope the monopoly gets broken, but CUDA is still on top; without it, much of the software and many of the systems people build simply wouldn't run.

So, what do you think of CUDA's current situation?

Reference Links:

[1]https://www.semianalysis.com/p/nvidiaopenaitritonpytorch

[2]https://analyticsindiamag.com/how-is-openais-triton-different-from-nvidia-cuda/

[3]https://pytorch.org/features/

[4]https://news.ycombinator.com/item?id=34398791

[5]https://twitter.com/soumithchintala/status/1615371866503352321

[6]https://twitter.com/sasank51/status/1615065801639489539

— End —
