Silicon Valley giants develop their own GPUs, unwilling to be "choked" by NVIDIA

This issue focuses on the development and technical context of GPU chips. Exclusive to Tencent News; please do not reprint without authorization.
Text / Chen Jing, R&D Director of Asia Vision Technology, member of Fengyun Society
Editor / Su Yang
ChatGPT, the biggest star in the global technology industry, and the large models behind it have unexpectedly given new impetus to a chip industry that had been facing "declining demand".
The "Magnificent Seven" of U.S. stocks (Apple, Microsoft, Google, Amazon, Nvidia, Meta, and Tesla), all technology giants tied to chips and AI, have soared in market value over the past year, with even the smallest gain exceeding 50%. GPU leader Nvidia was the strongest, with its market value up 239.2% in 2023, helping drive the Nasdaq up 44.2% for the year. Against the backdrop of aggressive Federal Reserve interest-rate hikes, it has been called the strangest bull market ever.
The key logic behind these staggering market-cap swings is technological progress in the chip industry, especially the revolution in GPU technology.
Jensen Huang speaks at Computex 2023
Since China's chip industry came under U.S. pressure, public attention has focused mostly on smartphone SoCs and the associated process technologies, such as 7nm and 5nm. The debut of Huawei's Kirin 9000S showed that the Chinese mainland has broken through 7nm SoC manufacturing, forming a closed loop from design and manufacturing to commercialization.
Consumers, however, are far less familiar with GPU technology, and it is precisely here that the gap between Chinese and American products is most obvious. Silicon Valley giants such as Nvidia and AMD are still pushing advanced GPU performance forward, and the gap with China's homegrown products has stretched to more than tenfold.
With the United States tightening export controls on advanced chips, will China's artificial intelligence industry fall behind because of limits on computing power and the chips that provide it? And what can be learned from the development of GPU technology and the growth of chip giants like NVIDIA?
01 Is NVIDIA's success due to luck?
The first GPU: NVIDIA GeForce 256
In 1999, with the release of the GeForce 256, Nvidia formally introduced the concept of the Graphics Processing Unit (GPU).
In its narrow sense, the GPU is the graphics card: the CPU alone is too slow to render complex 3D scenes, so a dedicated card is needed. 3dfx, quite famous in the 1990s, launched its 3D accelerator card Voodoo in 1995, dramatically boosting PC game performance; titles like Tomb Raider and Quake made Voodoo a household name. The Voodoo 2, released in 1997, left the rest of the 3D graphics field far behind. To this day, gamers remain an avid user base for high-end graphics cards.
Voodoo 2 did not keep 3dfx ahead for long, though. With many kinds of graphics cards on the market, getting games and 3D applications to run well across them became developers' biggest headache, and 3dfx, which stubbornly clung to its own proprietary ecosystem, found itself increasingly isolated by game and application developers.
Meanwhile, NVIDIA began catching up with its TNT series, and by the second-generation TNT2 it had surpassed 3dfx in performance, dragging the latter reluctantly into a performance war. After Nvidia launched the revolutionary GeForce 256, moving the T&L (Transform & Lighting) work that the CPU had previously handled onto the graphics chip itself, 3dfx planned a counterattack, but its products kept slipping their release dates and customers switched to Nvidia.
In the midst of this fierce battle, Microsoft became the final straw that broke 3dfx.
Facing Nvidia's challenge, 3dfx made a desperate bet, spending roughly $180 million to acquire a graphics chip company in hopes of winning the large contract to supply graphics chips for Microsoft's Xbox console, but Microsoft chose Nvidia. In the end, the struggling 3dfx was bought by Nvidia at a bargain price: $70 million plus 1 million shares of Nvidia stock.
From the TNT2 in February 1999, to the launch of the GeForce 256 in August 1999, to the acquisition of 3dfx at the end of 2000, the rapid fall of a star company reflects how fast the technology was moving, how important it is to respect customers, and how brutal the competition was.
NVIDIA's winning "genes" were already visible back then, though no one imagined how valuable they would prove to be. In the GPU competition that followed, customer support, rapid iteration of performance metrics, efficient optimization for a wide range of application algorithms, stable and easy-to-use software interfaces, backing from large companies, and the separation of design from manufacturing became the decisive weapons for IT hardware companies.
During the era of rapid PC growth, Intel, Nvidia, and AMD were the core players in the graphics card field.
From Q3 2021 to Q3 2022, Nvidia and Intel held commanding shares of the discrete and integrated graphics card markets, respectively
Intel took the integrated graphics route and did not launch discrete graphics cards until 2022. AMD offers both integrated and discrete graphics, and in 2006 acquired the graphics company ATI for $5.4 billion, adopting a "dual-track" strategy of CPUs plus GPUs. Before the acquisition, ATI's Radeon 9700, with 110 million transistors, had put it ahead of NVIDIA in technology for the first time; ATI also had the better relationship with Microsoft, taking the lead in supporting DirectX 9.0 and winning an Xbox contract. Nvidia has focused on discrete graphics cards, where its market share exceeds 80%.
02 General-Purpose Computing Helps NVIDIA "Break Out"
At the start of the millennium, integrated graphics were enough for the average user, and demand for discrete graphics cards came mainly from gamers rather than the mainstream market. Computing itself was still epitomized by supercomputers built from stacks of CPUs; by comparison, GPUs' computational performance was unremarkable.
In 2003, the concept of general-purpose computing on GPUs (GPGPU) was proposed. Over the years that followed, NVIDIA almost single-handedly pushed the general-purpose computing power of GPUs to unimaginable heights. The GPU broke out of its single role of 3D graphics display and, over the next two decades, was gradually applied to neural networks, scientific computing, cloud computing, AIGC, and large language models.
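To make "general-purpose" concrete, below is a minimal, illustrative CUDA sketch (CUDA itself is introduced later in this article): thousands of lightweight GPU threads each compute one element of an ordinary numerical operation that has nothing to do with drawing pixels. The kernel, array size, and values are assumptions made up for this example, not taken from any NVIDIA product.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SAXPY (y = a*x + y): a classic non-graphics workload.
// Each GPU thread computes exactly one element of the result.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                 // about one million elements (illustrative)
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                     // threads per block
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);  // launch roughly one million threads
    cudaDeviceSynchronize();

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```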
General-purpose computing became a new growth engine for NVIDIA, while its gaming products continued to thrive: the GeForce line kept iterating in the gaming market and stayed ahead in performance by adding advanced features such as ray tracing. Nvidia also slices its lineup into precisely targeted SKUs for user groups with different needs, a practice known as founder Jensen Huang's "knife skills".
But what really turned Nvidia into a trillion-dollar legend was its Tesla line of data-center products, with successive generations such as the V100, A100, H100, and H200 continuously improving high-density parallel GPGPU computing.
Nvidia's GPU architecture and its corresponding computing performance since 2010
NVIDIA's golden period of "breaking out" has lasted about 15 years so far, driven by the continuous evolution of its GPU architectures. Along the way, NVIDIA has named each architecture after a scientist, as a tribute to the pioneers. The H100 and H200 use the Hopper architecture, named after Grace Hopper, the software engineering pioneer; the term "bug" traces back to her finding an actual insect in a computer relay.
Looking at NVIDIA's revenue breakdown today, the "data center" business accounts for more than 50%, having overtaken the "gaming" business, which now accounts for just over 30%. In addition, segments each accounting for less than 10% serve professional markets: the Quadro line, focused on high-end graphics for industrial visualization in design and construction, and the Orin line, focused on autonomous driving.
Overall, the subjective reasons for NVIDIA's rise to GPU dominance are continuous technical iteration, algorithm optimization, constant response to customers' differentiated needs, and the backing of large customers.
Take performance metrics as an example: using GPT-3 workloads to gauge computing power, the 2023 H100 delivers 11 times the performance of the 2021 A100, the H200 delivers 18 times, and the upcoming B100 is expected to double the H200 again. Although Nvidia has been "raising prices" along the way, the amount of training work that each dollar buys has still increased significantly.
Objectively, NVIDIA's victory owes much to the rise of deep learning. Its CUDA (Compute Unified Device Architecture) ecosystem, deeply optimized for AI applications, is the cornerstone of that success: backed by tens of thousands of R&D staff, CUDA lets developers call ready-made libraries instead of hand-optimizing GPU code themselves, which creates strong user inertia, and NVIDIA is formidable in both software and hardware. In the AI software ecosystem, NVIDIA plays a role comparable to Microsoft's in the PC software ecosystem.
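As a hedged illustration of "calling a ready-made CUDA library instead of optimizing GPU code yourself", the sketch below multiplies two matrices through cuBLAS, one of the standard CUDA libraries, which picks tuned code paths for the underlying GPU automatically. The matrix size and contents are arbitrary example values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Multiply two n x n matrices with cuBLAS instead of writing a kernel by hand.
int main() {
    const int n = 512;                       // illustrative size
    const float alpha = 1.0f, beta = 0.0f;

    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    // C = alpha * A * B + beta * C  (column-major storage, the cuBLAS convention)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %f (expected %d)\n", C[0], 2 * n);  // every entry is n * 1 * 2
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```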
Nvidia's market capitalization has passed one trillion dollars, boosted by another rare advantage: time. Even if competitors can catch up with Nvidia, they will always arrive some time later.
For the giants of artificial intelligence, time is a crucial consideration, and several factors shape their planning:
First, it takes time for NVIDIA's competitors to develop and produce comparable products;
Second, it will take a long time for companies that apply GPUs to adapt software and hardware, while NVIDIA's software ecosystem is mature and can be used quickly;
Third, some large-scale training takes a lot of time, and if the hardware performance and software optimization are slightly worse, the time cost will be unbearable.
With ChatGPT's breakout and the unexpected explosion in demand for AI hardware, Nvidia's high-performance GPUs have been "snapped up". Even with single GPUs priced at tens of thousands of dollars, major companies have been lavishly buying cards by the tens or hundreds of thousands.
For example, Yotta, an Indian data center operator, ordered 16,000 H100s for delivery in July 2024, and recently spent another $500 million on 16,000 more H100s and GH200s due to arrive by March 2025. To pursue AGI (artificial general intelligence), social media giant Meta announced at the start of 2024 plans to buy 350,000 H100s, for total computing power equivalent to about 600,000 H100s when its other purchases are counted. An H100 that costs roughly $3,000 to make sells for about $30,000, so the profit on such orders is beyond imagination.
It was these extraordinarily large orders that pushed Nvidia's stock price past $600 and its market capitalization above $1.5 trillion. And the "good news" on demand seems far from over: AI PCs, for instance, have become the new craze, and everyone will need high-performance GPUs to make their PCs smarter.
03 Where does the powerful computing power of GPUs come from?
Architecturally, GPUs offer more outstanding acceleration than CPUs and more avenues for improving their metrics, so their performance has risen rapidly. CPU speed gains, by contrast, have slowed, with progress focused more on power efficiency for mobile applications. Some of the remaining gains come from integrating GPUs, as in Apple's M1, M2, and M3 chips for PCs, whose integrated graphics have improved quickly: the M3 delivers the M1's graphics performance at half the power, and its peak performance is 65% higher.
From a hardware perspective, a GPU has two main kinds of components: a large number of computing units that perform parallel calculations, and the memory (VRAM) that stores data.
NVIDIA H100 SXM5 module: in the middle, one Hopper GPU compute die plus six HBM memory stacks
The H100, for example, combines a compute die designed by NVIDIA with six 16GB HBM3 memory stacks (as shown above), bonded tightly into a single package using TSMC's CoWoS (Chip-on-Wafer-on-Substrate) advanced packaging technology.
In layman's terms, the GPU's enormous computing power comes from thousands of computing units with different functions, which must cooperate, under several layers of management, to complete a wide variety of graphics and AI tasks. It is roughly like having many CPUs running at once: each individual unit is weak, but combined their performance leaps dramatically. The GPU's computing units are, however, a different concept from the 4-core or 8-core multithreaded parallelism of a CPU.
A growing number of computing units and increasingly complex, refined functions have been the driving force behind NVIDIA's continuous optimization from the Fermi architecture through to Hopper.
Viewed from the computing side, the top layer is the GPU itself, which contains several Graphics Processing Clusters (GPCs); data processed by one cluster can be passed over a crossbar to other clusters for further processing. Each graphics processing cluster in turn contains several Streaming Multiprocessors (SMs).
Details of the streaming multiprocessor architecture
Take a single streaming multiprocessor in the Fermi architecture as an example. It contains a variety of components, the most conspicuous being its 32 cores (known as CUDA cores, shown on the left of the figure above), i.e. "stream processor" compute cores. Other components are grouped into engines that handle specific tasks efficiently, such as the PolyMorph Engine, which greatly improves geometry processing.
Note: In general, a frame of a large game contains millions of geometric polygons, and the number of polygons in a frame of CG animation can reach hundreds of millions.
Schematic diagram of the complete NVIDIA Fermi architecture
As shown above, the full Fermi architecture has 4 graphics processing clusters, 16 streaming multiprocessors, and 512 CUDA cores in total. These cores are built for vector processing: each GPU clock cycle they perform operations such as multiplication and addition, so any workload that can be organized into vectors can be accelerated. In short, the number of CUDA cores largely determines the GPU's general-purpose processing power.
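As a small sketch of how this hierarchy is visible to a programmer, the CUDA runtime can report how many streaming multiprocessors and related limits the installed GPU has; the printed figures depend entirely on the card in the machine, and the fields queried here are standard cudaDeviceProp members.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the structural parameters discussed above for whatever GPU is installed.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // device 0

    printf("GPU name:                  %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size:                 %d threads\n", prop.warpSize);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Global memory:             %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    // CUDA cores per SM are not reported directly; they depend on the
    // architecture generation (e.g. 32 per SM on Fermi, as described above).
    return 0;
}
```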
With the rise of deep learning, matrix and convolution operations came to dominate computational workloads. CUDA cores can compute them, but only a step at a time, so acceleration requires many CUDA cores operating in parallel. Starting with the Volta architecture in 2017, NVIDIA added Tensor Cores specifically to handle matrix and convolution work, optimizing efficiency further. For GEMM, the matrix multiplication at the heart of deep learning, a Tensor Core can multiply two 4x4 FP16 matrices and accumulate the result into an FP32 matrix in a single clock cycle, delivering a speedup of more than 10x.
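In CUDA, Tensor Cores are exposed to programmers through the warp-level WMMA API (and through libraries such as cuBLAS and cuDNN that use them internally). The hedged sketch below multiplies one 16x16 FP16 tile pair into an FP32 accumulator, the smallest shape the WMMA interface exposes, which the hardware internally decomposes into the 4x4 operations described above; tile data and launch configuration are illustrative only.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile pair and accumulates into FP32,
// using Tensor Cores via the WMMA API (requires Volta or newer hardware).
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);          // C starts at zero
    wmma::load_matrix_sync(fa, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);         // fc = fa * fb + fc on Tensor Cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g.: wmma_tile<<<1, 32>>>(d_a, d_b, d_c);
```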
In short, a GPU's processing power is determined by its numbers of CUDA cores and Tensor Cores. According to public specifications, the SXM5 version of the H100 has 132 streaming multiprocessors, 16,896 FP32 CUDA cores, and 528 Tensor Cores.
Note: FP32 and FP16 denote floating-point formats. Floating-point numbers can be stored in 64, 32, 16, or 8 bits, called FP64, FP32, FP16, and FP8 respectively. Fewer bits means cheaper compute hardware and higher performance, but lower numerical precision. AI workloads are not very sensitive to precision, and many coefficient values are zero or near zero, so a low-bit format such as FP8 can be chosen for acceleration. NVIDIA's Transformer Engine exploits FP8's speed to optimize Tensor Core operations and support large-model training well.
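A tiny demonstration of the precision trade-off, assuming a recent CUDA toolkit in which the FP16 conversion helpers in cuda_fp16.h are callable from host code: FP16 keeps only about 10 bits of mantissa, so an increment of 0.0004 added to 1.0 is rounded away entirely.

```cuda
#include <cstdio>
#include <cuda_fp16.h>   // __float2half / __half2float conversion helpers

int main() {
    float a = 1.0f + 0.0004f;                 // representable in FP32
    float b = __half2float(__float2half(a));  // round-trip through FP16

    printf("FP32 value: %.6f\n", a);          // ~1.000400
    printf("FP16 value: %.6f\n", b);          // 1.000000, the 0.0004 is rounded away
    return 0;
}
```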
Interconnect bandwidth of NVIDIA's main products; the H100 uses fourth-generation NVLink with a transfer rate of 900GB/s
In addition to the large number of compute units, another important metric that affects the performance of GPU hardware is the transmission channel bandwidth.
Traditionally, data moved between the host and the GPU over PCIe, for example PCIe 3.0 x16 at 16GB/s. Once GPUs are used for computing, the cards also need to exchange data with one another, so NVIDIA developed the dedicated NVLink interconnect between GPUs, and its bandwidth keeps rising: the H100's fourth-generation NVLink reaches 900GB/s, more than 10 times the bandwidth available over PCIe.
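A hedged sketch of how such bandwidth figures can be observed in practice: time a large pinned-memory host-to-device copy over PCIe with CUDA events and compute the effective GB/s. The buffer size is an arbitrary example, and the closing comment notes how the same timing pattern applies to GPU-to-GPU copies over NVLink.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1UL << 30;          // 1 GiB test buffer (illustrative)
    void *host, *dev;
    cudaMallocHost(&host, bytes);            // pinned host memory for a fair test
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Host-to-device traffic travels over PCIe.
    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    // A GPU-to-GPU copy between two NVLink-connected cards can be timed the
    // same way with cudaMemcpyPeer after enabling peer access.
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```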
The examples above, covering the number of compute units, CUDA cores and Tensor Cores, and transfer bandwidth, all make the same point: the GPU's computing power comes from many sources, and its design leaves plenty of room for optimization, including memory capacity, memory speed, cache management, data channels, core counts, per-core performance, numerical precision, and parallel-programming optimization.
Even in manufacturing and packaging, such as process-node improvements and advanced-packaging R&D, there is room for NVIDIA itself and for later challengers to participate, and the number of links that can be improved far exceeds that of the CPU. For now, advanced packaging such as CoWoS depends on TSMC's technology and capacity; unofficial figures put TSMC's 2024 CoWoS capacity at roughly 20,000 to 30,000 wafers per month, capacity that must also be allocated to other major customers such as Apple.
04 Do the giants' self-developed GPUs still have a chance?
In absolute terms, Nvidia's GPU technology is not out of reach, and its rival AMD is chasing hard.
AMD officially claims that its MI300 outperforms the H100 by 25% when serving GPT-4, to which Nvidia responded that a fair comparison should use performance figures optimized on both sides. Meanwhile, many companies are planning to develop their own GPUs or already have products optimized for specific tasks. Google, for example, developed its own TPU, giving it the ability to tailor hardware to specific computing tasks all the way from design to application. Recently, OpenAI also signaled publicly that Nvidia's GPUs are too expensive and that it plans to develop its own chips.
Technically, developing GPUs is within reach for large companies: big U.S. IT firms have strong chip-design capabilities, the seven giants are all accomplished chip designers, and assembling teams to design high-performance GPUs is not the problem.
The bigger problem is cost. GPU design and deployment involve many complex links: get them all right and the optimization space is huge, yielding better computing performance; get any one wrong and performance suffers badly. And performance is critical at scale, because the stakes are not a handful of GPUs but tens or hundreds of thousands of them. Costs running into the billions of dollars directly limit how many companies can afford the self-developed route.
NVIDIA, by contrast, enjoys a first-mover advantage: its application ecosystem and R&D pipeline run smoothly across many links, and its performance keeps improving. Even when competitors close the gap in game rendering or large-model training efficiency, they are usually only catching up with the previous generation. As long as GPU performance keeps improving at high speed, NVIDIA's dominance will not be shaken, and it can reasonably be expected to keep leading by a wide margin in both technical metrics and market share.
Schematic diagram of TSMC CoWoS 2.5D package
Moreover, thanks to continuing progress in TSMC's CoWoS and other advanced packaging and manufacturing technologies, the various components have not yet hit the kind of physical wall facing chip process nodes, so GPU architectures can keep being upgraded and the numbers of streaming multiprocessors and cores can keep growing.
Nvidia's advantages are obvious, but that does not mean its rivals have no chance. The core problem is production capacity: GPUs snapped up from Nvidia today may not arrive for a year, so companies with sufficient GPU expertise will find a market too. AMD is one alternative, and Silicon Valley's technology giants are in any case accustomed to cultivating backup suppliers so that their supply chains are never "choked" by a single partner.
The Chinese market is a more special case, with a large number of potential gaps left to fill because of export controls. It is understood that some companies are currently adapting domestic GPUs for large-model training, and that roughly half a year of development work can reach 60%-70% of the A100's performance.
At the same time, the large companies had already stockpiled substantial computing power before the export-control rules took effect, so cultivating a domestic GPU ecosystem still has a long way to go.
Looking at the chip-design market as a whole, CPUs are still improving too, especially by integrating GPUs through advanced packaging, as in Apple's M-series chips and Qualcomm's Snapdragon X Elite, which shows there is plenty of potential left in high-performance chips. But from the standpoint of mass demand, CPU performance in PCs and phones is already in surplus, leaving little room for the market's imagination; by contrast, large-scale AI applications have thrown the imagination around mass GPU deployment wide open. In Go AI, for example, the more GPU computing power a system has, the stronger it plays, and GPU performance has effectively become the arbiter of which Go AI machines are strong and which are weak.
It is easy to imagine that how smart your AI assistant is will largely be determined by the GPU behind it, and that the market prospects for GPUs are all but limitless.