
Nvidia, overtaken on the curve?


Build chips like building blocks.

According to incomplete statistics, the semiconductor industry has developed roughly 1,000 package types. Grouped by interconnect type, they include wire bonding, flip chip, wafer-level packaging (WLP), through-silicon vias (TSVs) and more; countless dies joined by these interconnects make up today's increasingly prosperous packaging market.

Among them, advanced packaging has become the most watched and popular field of the past two years, and the slower advanced process nodes progress, the more its importance stands out. AMD, Intel and Nvidia, the traditional "big three", have all entered the field, moving from 2D to 2.5D packaging and now challenging the summit of 3D packaging.

In June 2023, AMD unveiled the MI300X and MI300A AI accelerators in San Francisco. The MI300X combines 8 XCDs, 4 I/O dies, 8 HBM3 stacks and up to 256 MB of AMD Infinity Cache in a 3.5D package, supports new math formats such as FP8 and sparsity, and is designed for AI and HPC workloads; at 153 billion transistors, it is the largest chip AMD has ever made.


AMD says the MI300X outperforms Nvidia's H100 by up to 1.6x in AI inference workloads and matches the H100 in training jobs, giving the industry a much-needed high-performance alternative to Nvidia's GPUs. These accelerators also carry more than twice the HBM3 memory of Nvidia's GPUs, a staggering 192 GB, allowing the MI300X platform to run more than twice as many LLMs per system and support larger models than the H100 HGX.

The most eye-catching claim, of course, is AMD's "3.5D packaging", which the company says it achieved by combining 3D hybrid bonding with a 2.5D silicon interposer.

"This is a truly amazing piece of silicon stacking that delivers the highest-density performance the industry has ever seen," said Sam Naffziger, senior vice president and corporate fellow at AMD. The integration uses two of TSMC's technologies: SoIC (System on Integrated Chips) and CoWoS (Chip on Wafer on Substrate). The former uses hybrid bonding to stack smaller chips on top of larger ones, joining the copper pads on each chip directly without solder, which is how V-Cache chiplets are stacked on top of CPU dies; the latter stacks chips on a larger piece of silicon, called an interposer, that accommodates high-density interconnects.

While Nvidia's H200 still relies on TSMC's 2.5D CoWoS packaging, AMD has become the first to combine TSMC's SoIC 3D packaging with CoWoS 2.5D packaging.

Build chips like building blocks

First, a review of the MI300X and MI300A architecture. According to AMD, the MI300 series uses TSMC's 3D hybrid-bonding SoIC (System on Integrated Chips) technology to stack the compute elements, whether CPU CCDs (core complex dies) or GPU XCDs (accelerator complex dies), on top of four underlying I/O dies. Each I/O die can carry either two XCDs or three CCDs. Each CCD is the same as those used in existing EPYC chips, with eight SMT-enabled Zen 4 cores. The MI300A uses three such CCDs and six XCDs, while the MI300X uses eight XCDs.

The XCD is AMD's GPU compute chiplet; on the MI300X, the 8 XCDs together contain 304 CDNA 3 compute units, or 38 CUs per XCD. For comparison, the AMD MI250X has 220 CUs, so this is a big leap.

The HBM stacks, meanwhile, are connected through a standard interposer using 2.5D packaging. Each I/O die contains a 32-channel HBM3 memory controller hosting two of the eight HBM stacks, giving the device a total of 128 16-bit memory channels. The MI300X uses 12-Hi HBM3 stacks for 192 GB of capacity, whereas the MI300A uses 8-Hi stacks for 128 GB.
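As a quick sanity check on these figures, here is a minimal sketch; the 16 Gb (2 GB) HBM3 DRAM die size is an assumption not stated in the article, but it reproduces the CU-per-XCD and capacity numbers quoted above:

```python
# Back-of-the-envelope check of the MI300 figures quoted above.
XCDS_MI300X = 8
TOTAL_CUS_MI300X = 304
print("CUs per XCD:", TOTAL_CUS_MI300X // XCDS_MI300X)   # 38

HBM_STACKS = 8
GB_PER_DRAM_DIE = 2  # assumed 16 Gb HBM3 dies; not stated in the article

def hbm_capacity_gb(stack_height: int) -> int:
    """Total HBM capacity in GB across 8 stacks of the given height."""
    return HBM_STACKS * stack_height * GB_PER_DRAM_DIE

print("MI300X (12-Hi):", hbm_capacity_gb(12), "GB")  # 192 GB
print("MI300A (8-Hi): ", hbm_capacity_gb(8), "GB")   # 128 GB
```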

Specifically, AMD's CPU CCDs are 3D hybrid-bonded to the underlying I/O die. In a standard 2.5D package they would communicate over the GMI3 interface; for vertical stacking, AMD added a new bond-pad via interface that bypasses the GMI3 link and provides the TSVs the stacked chips require.

The 5nm XCD GPU die marks the full chipletization of AMD's GPU designs. The XCDs and IODs include hardware-assisted mechanisms that break jobs into smaller parts, dispatch them and keep them synchronized, reducing host-system overhead; the units also feature hardware-assisted cache coherence.

AMD has been preparing for this small step in MI300-series packaging for many years; the origins trace back to the development of Zen, when AMD engineers began splitting each large chip into smaller pieces based on the chiplet concept.


In the CPU competition with Intel, the failure of the Bulldozer architecture put AMD in a precarious position, and it urgently needed a low-cost way to compete with Intel's more advanced architectures. Zen was born out of that need, and the new generation of Ryzen and EPYC processors, together with the chiplet, or MCM (multi-chip module), approach they eventually adopted, went on to transform the entire PC and chip-manufacturing industry.

The original Zen architecture was relatively simple: an SoC design with everything from the cores to the I/O and controllers on the same die. It introduced the CCX concept, grouping the CPU cores into quad-core complexes tied together with Infinity Fabric, with two quad-core CCXs per die, although the consumer parts were still single-chip designs.

While Zen+ changed little (beyond a more advanced node), Zen 2 was a major upgrade: the first chiplet-based consumer CPU design, with up to two compute dies (CCDs) plus an I/O die. AMD added a second CCD to the Ryzen 9, bringing core counts never before seen in the consumer space.

Zen 3 further refined the chiplet design, eliminating the split CCX and merging eight cores and 32 MB of L3 cache into a single unified complex per CCD, which greatly reduced cache latency and simplified the memory subsystem; for the first time, AMD Ryzen processors delivered better gaming performance than rival Intel parts. Zen 4 made no significant changes to the CCD layout beyond shrinking the die on a newer node.

In the EPYC line, the first-generation AMD EPYC processors were built from four replicated chiplets. Each chiplet carried 8 Zen CPU cores, 2 DDR4 memory channels and 32 PCIe lanes; to meet its performance targets, AMD also had to reserve some extra headroom for the Infinity Fabric interconnect between the four chiplets.

The second-generation EPYC used two kinds of chiplets. The first, called the I/O die (IOD), is built on a 12nm process and contains 8 DDR4 memory channels, 128 PCIe Gen4 lanes, other I/O such as USB and SATA, the SoC data fabric, and other system-level functions. The second is the core complex die (CCD), built on a 7nm process. In practice, AMD pairs one IOD with up to 8 CCDs, each providing 8 Zen 2 cores, for up to 64 cores per package.

The third-generation EPYC offers up to 64 cores and 128 threads, powered by AMD's latest Zen 3 core. The processor is built from eight chiplets of eight cores each, and this time all eight cores in a chiplet are directly connected, doubling the L3 cache visible to each core and lowering overall cache latency.

In the fourth-generation EPYC, AMD extends the same architecture to a chiplet design with up to 12 5nm core complex dies (CCDs); the I/O die uses a 6nm process while the surrounding CCDs use 5nm. Each CCD carries 32 MB of L3 cache, and each core gets 1 MB of L2 cache.
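A small sketch that recomputes the chiplet arithmetic across these four generations; the 96-core and 384 MB totals are simply derived from the per-die figures quoted above, not additional claims:

```python
# Recomputing EPYC core counts from the chiplet configurations described above.
EPYC_GENERATIONS = {
    "1st gen (Zen)":   (4, 8),    # (compute dies, cores per die)
    "2nd gen (Zen 2)": (8, 8),
    "3rd gen (Zen 3)": (8, 8),
    "4th gen (Zen 4)": (12, 8),
}

for gen, (dies, cores_per_die) in EPYC_GENERATIONS.items():
    print(f"{gen}: {dies} dies x {cores_per_die} cores = {dies * cores_per_die} cores")

L3_PER_CCD_MB = 32  # 4th-gen figure from the text
print("4th-gen total L3:", 12 * L3_PER_CCD_MB, "MB")
```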

These CPUs eventually paved the technical way for the chiplet-based MI300 series.

In January 2021, an AMD patent filing for an MCM GPU chiplet design came to light, titled "GPU Chiplets Using High-Bandwidth Crosslinks" and published under number US 2020/0409859 A1. In it, AMD outlines what future chiplet-based graphics chips could look like: one GPU chiplet communicates directly with the CPU, while the other chiplets talk to each other over passive, high-bandwidth crosslinks, with the whole arrangement laid out as a system-on-chip (SoC) on a suitable interposer.

In November 2023, another AMD chiplet patent came to light, describing a GPU layout very different from existing designs: a large number of memory cache dies (MCDs) distributed around a large main GPU die. It describes a system that distributes the geometry workload across multiple chiplets, all working in parallel; no "central" chip assigns work to subordinate chips, as each operates independently. The patent suggests AMD is exploring chiplets for building GCDs, rather than one giant piece of silicon.

From the consumer market to supercomputing to AI, AMD has whipped up a red storm with chiplets, and it is TSMC's advanced packaging technology that keeps powering it.

The people behind AMD

In an interview with IEEE Spectrum, Sam Naffziger, product technology architect at AMD, said: "Five or six years ago, we started working on the EPYC and Ryzen CPU families. At that time, we conducted extensive research to find the most suitable packaging technology for connecting chips. It's a complex equation involving cost, performance, bandwidth density, power consumption and manufacturing capacity. It's relatively easy to come up with a good packaging technology, but it's quite another thing to actually produce it in high volume and at low cost."

TSMC first developed its 2.5D CoWoS package in 2011, and it was quickly adopted by Xilinx's high-end FPGAs, but its high cost kept it from gaining broader traction in the packaging market, until the AI wave swept the world and Nvidia, AMD, Google and Intel all came calling, pushing CoWoS onto the throne of the hottest advanced packaging.

The figure below is a schematic of TSMC's CoWoS (Chip on Wafer on Substrate) package. CoWoS allows multiple chips or dies to be integrated into a single package, so that different types of chips such as processors, memory and graphics can sit together, yielding higher performance, lower power consumption and smaller form factors. Multiple chips are stacked vertically via through-silicon vias (TSVs) and interconnected with microbumps; compared with traditional 2D packaging, this shortens interconnect lengths, reduces power consumption and improves signal integrity.

[Figure: schematic of TSMC's CoWoS package]

CoWoS is a big part of AMD's chiplet strategy: by dividing a large monolithic chip into smaller chiplets, designers can focus on optimizing specific functions in each one, enabling better power management, higher clock speeds and better performance per watt, while also making it easier to integrate these high-performance dies into a single package alongside other components such as memory, further improving system performance.

In 2018, TSMC launched SoIC, an innovative multi-die stacking technology aimed mainly at wafer-level bonding for process nodes below 10nm. Compared with CoWoS, SoIC offers higher packaging density and smaller bonding pitches, and it can also be combined with CoWoS or InFO to integrate multiple chiplets.

At the IEDM conference, a TSMC vice president presented more details of the company's SoIC roadmap. According to it, TSMC will first use the currently available 9μm bonding pitch, then roll out 6μm, followed by 4.5μm and 3μm. In other words, TSMC intends to introduce a new bonding pitch roughly every two years, with each generation shrinking the pitch to about 70% of the previous one.
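As a rough illustration of what that roadmap means (assuming a square bond array, which the roadmap itself does not spell out), connection density scales with the inverse square of the pitch, so each ~0.7x pitch step roughly doubles the bonds per unit area:

```python
# Rough illustration of the SoIC bond-pitch roadmap quoted above.
# Assumes a square bond array, so connection density scales as 1/pitch^2.
pitches_um = [9.0, 6.0, 4.5, 3.0]

for pitch in pitches_um:
    density_per_mm2 = (1000.0 / pitch) ** 2
    gain = (pitches_um[0] / pitch) ** 2
    print(f"{pitch:>4.1f} um pitch -> ~{density_per_mm2:9,.0f} bonds/mm^2 "
          f"({gain:.1f}x the 9 um generation)")
```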

The same talk used AMD's processors as an example of SoIC in action: AMD designed the processor and SRAM dies on a 7nm process, TSMC manufactured them, and the dies were then joined using SoIC at a 9μm bonding pitch.

Milan-X, the 2021 EPYC processor to which AMD added 3D V-Cache, was the world's first data-center processor built with 3D chip stacking.

AMD says 3D V-Cache adds another 64 MB of SRAM on top of the 32 MB already present in each third-generation EPYC compute die, bringing Milan-X's L3 cache to 96 MB per compute die; with up to 8 compute dies in the Milan-X architecture, up to 768 MB of L3 cache is shared across the CPU. The extra L3 relieves memory-bandwidth pressure and reduces latency, significantly improving application performance.
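Spelled out, the capacity arithmetic is simply the per-die base L3 plus the stacked SRAM, multiplied across the dies:

```python
# The V-Cache capacity arithmetic quoted above, spelled out.
BASE_L3_PER_CCD_MB = 32   # L3 already on each 3rd-gen EPYC compute die
VCACHE_PER_CCD_MB  = 64   # SRAM added by the stacked 3D V-Cache die
CCDS_PER_MILAN_X   = 8    # compute dies in the top Milan-X configuration

l3_per_ccd = BASE_L3_PER_CCD_MB + VCACHE_PER_CCD_MB   # 96 MB
total_l3   = l3_per_ccd * CCDS_PER_MILAN_X            # 768 MB
print(f"L3 per CCD: {l3_per_ccd} MB; shared L3 per CPU: {total_l3} MB")
```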

TSMC's SoIC technology makes this possible: it permanently bonds the V-Cache interconnect to the CPU, shrinking the distance between the dies to reach 2 TB/s of communication bandwidth. Compared with the 2D chiplet packaging used in the third-generation EPYC CPUs, the interconnect in the Milan-X CPU consumes only a third of the energy per bit, while interconnect density rises by more than 200x and power efficiency improves threefold.

This technology has since trickled down to the consumer market, starting with the Ryzen 7 5800X3D processor; the latest Ryzen 9 7950X3D also uses 3D V-Cache.

At its 2023 North America Technology Forum, TSMC focused on its 3DFabric technology, which comprises three parts: advanced packaging, 3D chip stacking and design enablement. Advanced packaging lets more processors and memory be packed into a single package to raise computing performance, while on the design side TSMC released the latest version of its open-standard design language to help chip designers handle large, complex chips.

From 2011 to 2023, more than a decade of evolution in TSMC's packaging technology finally made AMD's chiplet dream come true. The MI300 series is built on the latest 3DFabric, integrating TSMC's SoIC front-end 3D stacking with CoWoS back-end packaging, and can fairly be called the culmination of mass-produced advanced packaging technology.

The blue giant's packaging layout

Packaging is also one of Intel's development priorities. Unlike AMD, Intel has chosen to do the packaging itself, trying to master the whole chain of chip development, production and deployment.

Intel's counterpart to TSMC's CoWoS 2.5D packaging is called EMIB, first applied to products in 2017 and used in Intel's Sapphire Rapids data-center processors. Its first-generation 3D IC packaging is called Foveros, used in Intel's Lakefield PC processor in 2019.

EMIB's most distinctive feature is that different dies, such as memory (HBM) and compute dies, can be connected from below through a silicon bridge. Because the bridge is buried in the substrate and bonded to the chips, memory and compute dies can be linked directly, improving the bandwidth and energy efficiency of the connection.

Foveros is a 3D stack: dies with different functions, such as memory and compute, are stacked on top of one another, with copper vias running through each layer to connect them; the stacked dies are then sent to the packaging plant for assembly, where they are wired to the circuit board.

In 2022, Intel combined its 2.5D and 3D packaging technologies for the first time under the name Co-EMIB, an innovation that merges EMIB and Foveros: it can interconnect two or more Foveros stacks and essentially reach the performance level of a single chip.

Each Ponte Vecchio processor is actually two mirrored chiplet assemblies tied together using Intel's Co-EMIB, which forms a high-density interconnect between the two 3D chiplet stacks. The EMIB bridge itself is a small piece of silicon embedded in the package's organic substrate, and interconnects on silicon can be drawn finer than those on an organic substrate. Ponte Vecchio's ordinary connections to the package substrate sit at a 100-micron pitch, while the connection density across the Co-EMIB bridge is nearly twice as high. Co-EMIB bridges also connect the high-bandwidth memory (HBM) and the Xe Link I/O chiplets to the "base silicon", the largest chiplet, on which the other dies are stacked.


The base die uses Intel's 3D stacking technology, Foveros, which creates a dense array of vertical die-to-die connections between two chips. These connections sit just 36 microns apart and are made by bonding the chips "face to face"; that is, the top of one die is attached to the top of the other. Signals and power enter the stack through through-silicon vias (TSVs), fairly wide vertical interconnects that pass straight through most of the silicon. The Foveros implementation in Ponte Vecchio improves on the version used in Intel's Lakefield mobile processors, doubling the signal-connection density.
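To put those pitch numbers in perspective, here is a minimal sketch (assuming a square connection array, which the article does not specify) comparing the 36-micron Foveros bonds with the 100-micron substrate bumps:

```python
# Illustrative only: if connections form a square array, density scales as 1/pitch^2.
# The pitches are the ones quoted in the text; the square-array layout is an assumption.
def connections_per_mm2(pitch_um: float) -> float:
    return (1000.0 / pitch_um) ** 2

substrate_pitch_um = 100.0  # ordinary package-substrate connections on Ponte Vecchio
foveros_pitch_um   = 36.0   # Foveros die-to-die connections

print(f"Substrate bumps: ~{connections_per_mm2(substrate_pitch_um):,.0f} per mm^2")
print(f"Foveros bonds:   ~{connections_per_mm2(foveros_pitch_um):,.0f} per mm^2")
print(f"Foveros is ~{(substrate_pitch_um / foveros_pitch_um) ** 2:.1f}x denser")
```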

Doing this is not easy. Intel Fellow Wilfred Gomes says it required innovation in yield management, clock circuitry, thermal regulation and power delivery. For example, Intel engineers chose to feed the processor a higher-than-normal voltage (1.8 volts) to reduce current and simplify the package; circuitry in the base die then steps the voltage down to roughly 0.7 volts for the compute dies, and each compute die has to have its own power domain in the base silicon.

For Intel, Ponte Vecchio pushes its current advanced packaging to a pinnacle; set against AMD's MI300 series it gives little ground, and the two can fairly be called the red and blue twin stars of today's advanced packaging.

In fact, although Intel trails TSMC slightly in advanced process nodes, it is on par with TSMC in advanced packaging. Intel says its flexible foundry services let customers mix and match its wafer-fabrication and packaging offerings, and as an established vendor with fabs and packaging plants scattered around the world, it can use its geographic footprint to expand capacity and services.

Intel CEO Pat Gelsinger has also said in an interview that Intel has advanced capabilities in next-generation memory architectures as well as strengths in 3D stacking, which can be applied not only to chiplets but also to artificial intelligence and the ultra-large packages needed by high-performance servers.

Why Chiplets?

After this tour through the technology history of AMD, Intel and TSMC, many readers will have the same question: why are they so obsessed with 3D packaging and chiplets?

The root cause is demand within the semiconductor industry itself. Under Moore's Law, ever more devices keep fitting into the same physical footprint: a lithographic shrink reduces the area of the basic building blocks by about 30%, which lets roughly 42% more circuitry fit without increasing chip size.
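The relationship between those two percentages is just the reciprocal of the area shrink:

\[
\frac{1}{1 - 0.30} \approx 1.42
\]

That is, if each building block occupies only 70% of its former area, about 42% more of them fit in the same die area.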

But not every semiconductor device enjoys this dividend. I/O, for example, often includes analog circuitry that scales at roughly half the rate of logic, forcing designers to look for other ways forward. Nor are lithographic shrinks cheap: wafers processed on a 7nm node cost more than those on 14nm, 5nm costs more than 7nm, and so on. As wafer prices rise, chiplets tend to be more affordable than monolithic chips.

In addition, every new chip design consumes design and engineering resources, and the typical cost of a new design rises at each new process node as node complexity grows, which further motivates the creation of reusable designs.

The chiplet design philosophy makes this possible, because new product configurations can be created simply by changing the number and combination of dies: integrating a single chiplet into 1-, 2-, 3- and 4-die configurations yields four different processor variants from a single tape-out, whereas building each as a monolithic chip would have required four separate tape-outs.
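To make the economics concrete, here is a minimal, illustrative sketch; all numbers are assumptions rather than AMD figures, the yield model is a simple exponential defect model, and packaging and test costs, which partly offset the savings, are ignored:

```python
# Illustrative cost sketch, not AMD data: one large monolithic die versus four
# smaller chiplets of the same total area, using a simple exponential
# defect-yield model (yield = exp(-defect_density * area)).
import math

DEFECT_DENSITY   = 0.1   # defects per cm^2 (assumed)
MONO_AREA_CM2    = 6.0   # one large monolithic die (assumed)
CHIPLET_AREA_CM2 = 1.5   # each of four chiplets covering the same total area

def yield_rate(area_cm2: float) -> float:
    return math.exp(-DEFECT_DENSITY * area_cm2)

mono_yield    = yield_rate(MONO_AREA_CM2)     # ~55%
chiplet_yield = yield_rate(CHIPLET_AREA_CM2)  # ~86%

# Wafer area consumed per good product, used here as a proxy for silicon cost.
mono_cost    = MONO_AREA_CM2 / mono_yield
chiplet_cost = 4 * CHIPLET_AREA_CM2 / chiplet_yield

print(f"Monolithic yield:  {mono_yield:.1%}")
print(f"Chiplet yield:     {chiplet_yield:.1%}")
print(f"Chiplet silicon cost relative to monolithic: {chiplet_cost / mono_cost:.2f}x")
```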


In its technical presentation of the new Radeon RX 7900 series "Navi 31" graphics processors, AMD explained in detail why high-end GPUs must take the chiplet route.

In fact, AMD's Radeon GPU business over the past decade has looked far less rosy than its CPU business in both profit and revenue, and in the face of Nvidia's competition the need to cut manufacturing costs has become more pressing. With the GeForce "Ada Lovelace" generation, Nvidia continues to bet on monolithic GPUs: even its largest chip, "AD102", is still a single die, which gives AMD an opening to reduce GPU manufacturing costs.

Chiplets let AMD wage a price war with Nvidia and win more market share. The most typical example is AMD's relatively aggressive pricing of the Radeon RX 7900 XTX and RX 7900 XT at $999 and $899; according to AMD, these two products can compete with Nvidia's $1,199 RTX 4080 and, in some cases, even the $1,599 RTX 4090.

This is in fact one of the most significant benefits of chiplets: AMD can improve yields quickly and simplify design and verification while choosing the best process for each chiplet. The logic can be fabricated on a cutting-edge process, high-capacity SRAM on a roughly 7nm process, and I/O and peripheral circuits on something like 12nm or 28nm, reducing design and manufacturing costs.

Chiplets also make it easy to build derivative products, for example the same logic with different peripheral circuits, or the same peripherals with different logic, and they allow mixing chiplets from different manufacturers rather than being tied to a single one.

This is true for AMD and no less true for Intel. AMD leans on TSMC's existing technology and concentrates on chip architecture design; Intel has to work a little harder, researching advanced processes and packaging on one hand while iterating on its chips and chiplets on the other, and the two have even started a packaging race of their own.

Today it matters less who wins that race, because 3D packaging and chiplets are gradually moving from data centers and AI accelerators into consumer PC processors, and will eventually reach notebooks and phones, a trend everyone now recognizes.

Final thoughts

Compared with AMD and Intel, Nvidia looks rather "sluggish" on 3D packaging and chiplets.

In June 2017, Nvidia published the paper "MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability", proposing an MCM design that can essentially be seen as today's chiplet.

However, Nvidia has not put this design into practice. Instead, in December 2021 it published a paper titled "GPU Domain Specialization via Composable On-Package Architecture", in which the proposed COPA-GPU architecture really only splits off the L2 cache, suggesting that Nvidia intends to stick with monolithic, single-die designs for the time being.

Nvidia's insistence on large dies is actually simple to explain: die-to-die communication bandwidth can never match the bandwidth inside a monolithic chip, so chiplets may be less suitable for the highest AI compute workloads and better suited to showing their strength in CPUs. The Grace CPU Superchip Nvidia released in 2022, for example, achieves high-speed chip-to-chip interconnect through NVLink-C2C, and the chip also follows UCIe, the chiplet interconnect specification jointly defined by the industry.

Although Nvidia is currently one of the largest customers for TSMC's 2.5D CoWoS packaging, it is not yet among SoIC's customers, which makes it one of the last to embrace that advanced technology.

With the rapid development of chiplets, Nvidia may also embrace this design philosophy in the future: leaker Kopite7kimi said this year that Nvidia's next-generation Blackwell GB100 GPU for high-performance computing (HPC) and artificial intelligence (AI) customers will be a fully chiplet-based design.

For now, AMD has taken a step ahead in AI chips, using chiplets and 3.5D packaging to build a bigger, stronger MI300X, and Intel has fully embraced chiplets and 3D packaging. Nvidia still sits on a huge AI market, but an almost imperceptible crack has appeared in its throne. Red, blue and green: who will hold the real say in chip packaging?

Reference Links

AMD Explains the Economics Behind Chiplets for GPUs, TechPowerUp

New AMD Patent Describes Potential Chiplet-Based GPU Design, ExtremeTech

AMD unveils Instinct MI300X GPU and MI300A APU, claims up to 1.6X lead over Nvidia's competing GPUs, Tom's Hardware

This article is from the WeChat public account "Semiconductor Industry Observation" (ID: icbank), author: Shao Yiqi