Under the U.S. export ban, Nvidia has "castrated" a chip for China
This issue focuses on the techniques NVIDIA used to create the H20 for the Chinese market, along with an analysis of its performance.
Author: Zhang Shujia (Morris) | Editor: Su Yang
On October 17, the United States updated its export-control standards, requiring an export license for any advanced chip that exceeds certain thresholds. Under the tightened restrictions, NVIDIA's special China-market versions of the H800 and A800 are also facing a ban. The U.S. Department of Commerce's criteria for advanced chip performance are as follows:
Total processing performance (TPP) ≥ 4800 (in TOPS-equivalent terms); or
TPP ≥ 1600 and performance density ≥ 5.92; or
2400 ≤ TPP < 4800 and 1.6 ≤ performance density < 5.92; or
TPP ≥ 1600 and 3.2 ≤ performance density < 5.92.
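Read together, the thresholds form a simple disjunction. The minimal sketch below is illustrative only, not a legal reading of the rule: the function name is our own, and the license/notification distinction between the tiers is collapsed into a single check.

```python
def export_restricted(tpp: float, density: float) -> bool:
    """Return True if a chip trips any of the October 17 thresholds.

    tpp: total processing performance (TPP), TOPS-equivalent
    density: performance density = TPP / die area in mm^2
    Illustrative sketch only; not a legal interpretation of the rule.
    """
    return (
        tpp >= 4800
        or (tpp >= 1600 and density >= 5.92)
        or (2400 <= tpp < 4800 and 1.6 <= density < 5.92)
        or (tpp >= 1600 and 3.2 <= density < 5.92)
    )

# A hypothetical chip with TPP 2000 and density 3.5 trips the last tier;
# one with TPP 1000 and density 2.0 trips none of them.
print(export_restricted(2000, 3.5))  # True
print(export_restricted(1000, 2.0))  # False
```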
Facing the new rules, Nvidia offered two responses: first, communicate with the U.S. Department of Commerce and apply for licenses to keep supplying specific Chinese customers; second, customize new special-edition chips that comply with the new rules.
Nvidia CFO Colette Kress confirmed this during the recent fiscal third-quarter earnings call. Kress said Nvidia is working with a number of customers in the Middle East and China to obtain U.S. government licenses to sell high-performance products. In addition, Nvidia is trying to develop new data-center products that comply with government policy and do not require licenses.
01 How was the H800 "castrated" into the H20?
Nvidia is indeed developing new special editions, namely the H20, L20, and other products rumored in the industry, and the latest reports indicate their launch has been postponed to the first quarter of 2024.
The puzzle is that the R&D, design, and production of new special chips such as the H20 fall completely outside the normal cadence of chip development. How is that possible?
The answer, one of the key questions this article examines, is the back-end physical fusing (circuit point-breaking) process, summed up in the industry's blunter slang as "castration."
Product specifications: HGX H20 / L20 PCIe / L2 PCIe
Given normal design and production cycles and release cadence, chips such as the H20/L20 announced at this point in time are unlikely to be the product of a redone mask and a fresh tape-out. A more reasonable inference is that they are new SKUs created by reworking existing dies with back-end physical fusing and then repackaging them.
The fusing process is a retrofit performed in the back end of semiconductor manufacturing (BEOL and packaging). It can use certain circuit-repair techniques without redoing the mask, including surface laser cutting, fusing at the CoWoS interposer level, and even manual editing under a tunneling microscope.
The main process of chip manufacturing, source: Soochow Securities
One can imagine that, in the clean rooms of TSMC's Fab 18A (Southern Taiwan Science Park), Fab 15B (Taichung), and Advanced Packaging Fab 5, which handle NVIDIA's H800 production, earlier batches of down-binned bare dies that had not yet been diced, wired, and fitted with electrodes, and had not yet been packaged as H800 or L40S, are instead being packaged into H20 and L20 through the back-end fusing process.
02 Surface laser cutting is a traditional craft in semiconductor manufacturing
As common industry practice, the cache and the underlying physical interconnect (PHY channels) of a digital logic chip can be reworked or fused off during back-end packaging and test to disable failed blocks. Salvaging low-binned dies this way has been a traditional craft for decades; one of the key differences between early Pentium and Celeron processors, for instance, was fused-off cache.
If the area involved is small, the work can be done by hand (essentially micro-engraving); for slightly larger areas, fuse points can be reserved in the design in advance so that a machine can complete the disabling.
Design layout of a temperature sensor with built-in digital display
In practice, fabs are usually equipped with professional tools that cut lines or trenches directly on the die with a laser. At the Chandler, Arizona fab there is reportedly also a tool for hand-editing transistors directly under a special tunneling microscope, claimed to operate at atomic scale and distinct from an ordinary scanning tunneling microscope; Intel mentioned this equipment in a promotional video a few years ago, and it is rumored that there are no more than 14 licensed operators worldwide.
In fact, in the planar-transistor era, hand-editing under a microscope was not considered particularly difficult; after the move to FinFET, with its vertical 3D gate structure, the cost of the equipment and the operators became prohibitive.
Specific to the H20/L20: how are these two special products derived by down-binning the H800 and L40S?
H20: from the H100/H800 family, Hopper architecture (HBM3, 2.5D CoWoS packaging, NVLink)
L20: from the L40S family, Ada Lovelace architecture (GDDR6, 2D InFO packaging, PCIe Gen4)
Note: the firmware is modified accordingly.
Looking back at the H100 and H800, which share the same architecture, their key difference lies in the underlying physical interconnect (SerDes PHY), so cutting the H100 down to the H800 can be achieved with localized physical fusing. The H20, although structurally similar to both, is speculated to require a much larger dark-silicon area to be cut away; it is unclear whether conventional fusing is still worthwhile, and the layout may need to be redone.
Beyond the differences in the underlying physical interconnect (SerDes PHY), there are also differences in the per-area counts of double-precision floating-point (FP64) units and tensor cores (used for matrix and convolution workloads). How these are handled is hard to determine, but it is plausibly similar: rely on physical redundancy in the design, then mask off blocks. After all, today's design methodology is thoroughly modular, post-tape-out testing already distinguishes 70-point dies from 90-point dies, and a GPU carries more than one FP64 unit, so locally fusing off compute blocks is also reasonable.
03 Design redundancy creates the conditions for fusing, and it is routine practice at major chipmakers
For example: (a) the Intel F-series CPUs still on the market today are 70-point dies with the graphics core fused off; (b) in the first two generations of Apple Silicon, the officially announced 8-core NPU actually contained 9 cores, i.e., design redundancy.
The above is also considered routine in wafer manufacturing, especially in the transition period between pilot production and the beta tape-out: a small mistake is corrected directly by hand rather than sent back for mask modification and a new tape-out.
From a chip designer's perspective, design redundancy is built into the development process from the start. Because the front-end lithography process emphasizes high yield down to the number of failed transistors, module-level yield is judged at test time; dead blocks can be cut off directly in the circuit, and the subsequent wiring and capping steps remain unchanged.
For example, three years ago Intel brought the graphics-free F-series CPUs to market, products of physical down-binning that fused off the graphics core and repackaged the chip for sale. However, some of these chips occasionally drew enormous power; after user complaints, it was found that the graphics core, supposedly disabled by physical fusing, was running uncontrolled once powered.
This case reflects the situation described above: on the same assembly line, after a chip's blocks are fused off, the subsequent wiring/pinning and packaging steps remain unchanged and the chip can still be sold. In the early days, Intel's 10 nm yield was very low and many such low-binned dies piled up, so chips with failed graphics cores were stamped with the F mark and sold on.
There may be plenty of room for such "redundancy" today; after all, the H100 is already a huge 814 mm² chip, close to the edge of the reticle limit (26 mm × 33 mm = 858 mm²). The down-binned H20 released now delivers about 15% of the H100's performance, yet its material cost is almost the same.
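The reticle figure is easy to verify from the numbers above (the variable names are our own):

```python
# Die size versus the lithography reticle limit, using the figures above.
reticle_mm2 = 26 * 33      # maximum mask field: 858 mm^2
h100_die_mm2 = 814         # H100 die area

print(f"{h100_die_mm2 / reticle_mm2:.1%} of the reticle limit")  # 94.9% of the reticle limit
```

In other words, the H100 die already consumes nearly the entire mask field, which is why a ground-up smaller die would be an expensive detour compared with reworking existing dies.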
04 Fusing at the package level is more workable and more economical
Besides laser cutting on the surface of the logic die, certain special locations also call for fusing, such as the CoWoS interposer.
As TSMC's 2.5D packaging solution, CoWoS allows multiple chips to be packaged together, with logic and memory devices interconnected through a silicon interposer, yielding a small package, low power consumption, and fewer pins.
Compared with surface laser cutting, operating at the front-end portion of CoWoS, that is, the CoW (chip-on-wafer) part containing the through-silicon vias and the interposer, is a more economical way to differentiate products and makes yield easier to guarantee.
Because the compute logic die and the I/O die are separate, fusing at this level can mask the underlying physical interconnect channels and can also cut HBM3 memory performance. Making the differentiating modifications in the silicon interposer costs less than modifying every logic die, because the linewidth precision required on the interposer is lower; even the lines in the top metal layer can be severed.
However, the CoWoS interposer level can only mask the physical interconnect and HBM memory; it cannot disable on-die compute blocks such as FP64 units and tensor cores. For those, the die-surface fusing methods described above must be used as a supplement.
In addition, under normal circumstances a physically fused circuit cannot be detected by an outside third party, and the process is irreversible. With today's chips carrying a dozen or more metal layers, modifications on the bare die surface are hidden beneath the upper layers and cannot be seen unless reverse-engineering see-through scanning is used.
To sum up: the H20/L20 and other special-supply/down-binned models can be judged to be products of back-end physical fusing applied to H800 and L40S bare dies, repackaged, with firmware modified accordingly, into new SKUs.
Considering that NVIDIA's previous backlog of $5 billion in GPU products originally destined for China has not yet been delivered, and that new SKUs could be announced so quickly after sending parts back for back-end rework, it is plausible that the $5 billion in orders from domestic manufacturers will be converted into these three models.
05 What the "castrated" H20 can and cannot do
Core parameters of mainstream AI chips and their export-control status (controlled vs. not controlled)
The following is a side-by-side comparison of the H20 against the H100/H800/A100, covering product specifications, single-card and cluster compute efficiency, material cost, and pricing.
In terms of aggregate cluster compute, the H100/H800 are currently the top choices for AIDC compute clusters. The H100's theoretical scaling limit is a 50,000-card cluster delivering up to 100,000 P of compute; the H800 tops out at roughly 20,000-30,000 cards for about 40,000 P in total; and the A100 at 16,000 cards for about 9,600 P.
The H20's theoretical cluster scaling limit is also 50,000 cards, but its single-card compute is only 0.148 P (FP16/BF16), far below the H100/H800/A100.
8-card server module based on NVIDIA H800
Meanwhile, based on estimates that balance compute against communication, a reasonable median for the aggregate compute of 50,000 H20 cards is about 3,000 P.
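The gap between the theoretical ceiling and that median can be reproduced with back-of-the-envelope arithmetic. The 40% effective-utilization factor below is our own assumption, chosen only to land near the article's ~3,000 P figure; the card count and per-card compute come from the text.

```python
# Back-of-the-envelope aggregate cluster compute, using the article's figures.
def cluster_pflops(cards: int, pflops_per_card: float, effective_util: float = 1.0) -> float:
    """Aggregate PFLOPS: card count x per-card PFLOPS x an assumed utilization factor."""
    return cards * pflops_per_card * effective_util

h20_peak = cluster_pflops(50_000, 0.148)            # theoretical ceiling: 7,400 P
h20_balanced = cluster_pflops(50_000, 0.148, 0.40)  # assumed 40% utilization: 2,960 P

print(h20_peak)              # 7400.0
print(round(h20_balanced))   # 2960, near the cited ~3,000 P median
```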
Judging from the HGX H20's overall hardware specifications, however, nearly every metric other than the compute thresholds strictly limited by the U.S. Commerce Department's performance-density rules is maxed out; it is clearly positioned as a general-purpose processor.
For LLM-style workloads, when H20s are actually used for thousand-card distributed training, most of the effective time is indeed spent on matrix multiply-accumulate on the GPU, and the share of communication and memory-access time falls. But single-card compute is low, and over-scaling thousand-card clusters erodes cost-effectiveness. The H20 is better suited to training and inference for vertical (domain-specific) models; it will struggle to meet the training needs of LLMs at the hundred-billion-parameter level.
Note that trying to match or exceed the performance of an ultra-high-compute GH200 by assembling more low-spec, cheaper GPUs into a parallel cluster is something of a paradox.
The approach faces many constraints, and the ROI of building and operating such an environment is low. Because no ideal solution can be reached in compute utilization, parallel-strategy execution, overall cluster power consumption, hardware cost, networking cost, and so on, an H20 cluster's performance can be compared with an A800 cluster's, but comparing it against an H100/GH200 cluster is not realistic.
In basic specifications, the H20's compute is roughly 50% of an A100 and 15% of an H100: 0.148 P (FP16) / 0.296 P (INT8) per card, 900 GB/s NVLink, six HBM3 stacks (the same memory configuration as the H100 SXM, i.e., 6 × 16 GB = 96 GB), and the same 814 mm² die size.
Considering that HBM accounts for 55-60% of the material cost of a single H100 GPU, and the full card's bill of materials is about $3,320 (the H20's cost is similar, possibly even higher given the enlarged L2 cache, the extra fusing step, and HBM3 capacity and NVLink bandwidth exceeding the H800's), the H20's channel unit price may end up at a level similar to the H100/H800 under the usual channel pricing rules.
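As a quick sanity check on those cost figures (only the $3,320 BOM and the 55-60% HBM share come from the text; the variable names are ours):

```python
# HBM share of the H100 bill of materials, per the figures cited above.
bom_usd = 3320.0
hbm_share_low, hbm_share_high = 0.55, 0.60

hbm_cost_low = bom_usd * hbm_share_low    # $1,826
hbm_cost_high = bom_usd * hbm_share_high  # $1,992

print(f"HBM cost: ${hbm_cost_low:.0f}-${hbm_cost_high:.0f} of a ${bom_usd:.0f} BOM")
```

Since memory dominates the BOM and the H20 carries at least as much HBM as the H100, cutting compute blocks barely reduces the cost of building the card, which is why channel pricing may stay near H100/H800 levels.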
For reference, several market prices (channel prices from a tier-1 Internet company and a tier-1 server maker):
- DGX A800 PCIe 8-card server: about RMB 1.45 million per unit; NVLink version about RMB 2 million per unit
- DGX H800 NVLink server: domestic channel price about RMB 3.1 million per unit (excluding IB)
- DGX H100 NVLink server: about US$450,000 per unit in Hong Kong (excluding IB)
- H100 PCIe single card: about US$25,000-30,000; the H800 PCIe single-card price is still uncertain, and single-card circulation channels are not official