
Three methods for 3D chips

Source: content compiled by Semiconductor Industry Watch (ID: icbank) from the IEEE. Thank you.

A recently unveiled batch of high-performance processors suggests that a new direction for continuing Moore's Law is taking shape. By long-standing convention, each generation of processors must perform better than the last, which means integrating more logic into the silicon. But two problems stand in the way. One is that our ability to shrink transistors, and the logic and memory blocks built from them, is slowing down. The other is that chips have reached their size limit: lithography tools can pattern an area of only about 850 square millimeters, roughly the size of a top-of-the-line Nvidia GPU.

For years, system-on-chip developers have been breaking their ever-larger designs into smaller chiplets and linking them together inside the same package, effectively increasing the silicon area, among other advantages. In CPUs, these links have mostly been so-called 2.5D, in which chiplets are set side by side and connected with short, dense interconnects. With most major manufacturers having agreed on a 2.5D chiplet-to-chiplet communication standard, this style of integration is likely to keep growing.

But to move really large amounts of data as if everything were on a single chip, you need even shorter, denser connections, and that can only be achieved by stacking one chip on top of another. Connecting two chips face to face can mean thousands of connections per square millimeter.

Making this work takes a lot of innovation. Engineers have to figure out how to keep the heat from one chip in the stack from killing another, decide which functions should go where and how they should be manufactured, prevent the occasional bad chiplet from producing a lot of expensive dud systems, and deal with the complexity of solving all of these problems at once.

Here are three examples, ranging from the fairly simple to the bewilderingly complex, that show where 3D stacking stands today:

AMD's Zen 3


AMD's 3D V-Cache technology bonds a 64-megabyte SRAM cache [red] and two blank structural chiplets to the Zen 3 compute chiplet.

PCs have long offered the option of adding more memory, giving outsize speedups to very large applications and data-heavy work. Thanks to 3D chip stacking, AMD's next generation of CPU chiplets offers a similar option. It isn't an aftermarket add-on, of course, but if you're after a more capable computer, ordering a processor with a massive cache could be the way to get it.

Although both Zen 2 and the new Zen 3 processor cores are made using the same TSMC manufacturing process, with the same size transistors, interconnects, and everything else, AMD made so many architectural changes that, even without the extra cache, Zen 3 delivers an average 19 percent performance gain. One architectural gem is the inclusion of a set of through-silicon vias (TSVs), vertical interconnects that run directly through most of the silicon. The TSVs are built into Zen 3's highest-level cache, an SRAM block called L3 that sits in the middle of the compute chiplet and is shared among all eight of its cores.

In processors intended for data-heavy workloads, the back side of the Zen 3 wafer is thinned until the TSVs are exposed. A 64-megabyte SRAM chiplet is then bonded to those exposed TSVs using so-called hybrid bonding, a process akin to cold-welding copper together. The result is a dense set of connections that can be spaced as tightly as 9 micrometers apart. Finally, for structural stability and heat conduction, blank silicon chiplets are attached to cover the rest of the Zen 3 chip.
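
To get a feel for what a 9-micrometer pitch implies, here is a quick back-of-envelope calculation. It assumes a uniform square grid of bond pads, which is a simplification of real layouts, but it shows why face-to-face bonding yields the "thousands of connections per square millimeter" mentioned above:

```python
# Estimate hybrid-bonding connection density from pad pitch.
# Assumes a uniform square grid of pads; real layouts vary.
pitch_um = 9.0                   # pad-to-pad pitch, micrometers (from the article)
pads_per_mm = 1000.0 / pitch_um  # pads along one millimeter
density = pads_per_mm ** 2       # pads per square millimeter

print(f"{density:,.0f} connections per mm^2")  # ~12,346
```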

Adding the extra memory by placing it beside the CPU chiplet isn't an option, because data would take too long to reach the processor cores. "Even though the L3 [cache] size tripled, 3D V-Cache adds only four [clock] cycles of latency, and that can only be achieved through 3D stacking," says John Wuu, senior design engineer at AMD.
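
The trade-off Wuu describes can be sketched with the standard average-memory-access-time formula. All hit rates and cycle counts below are assumptions made up for the sake of illustration; only the four-cycle penalty comes from the article:

```python
# Illustrative average memory access time (AMAT) comparison.
# Hit rates and cycle counts are assumed, except the +4-cycle
# L3 penalty reported for 3D V-Cache.
def amat(l3_hit_rate, l3_latency_cycles, dram_latency_cycles=300):
    """Average cycles for an access that reaches L3 (misses go to DRAM)."""
    return (l3_hit_rate * l3_latency_cycles
            + (1 - l3_hit_rate) * dram_latency_cycles)

base   = amat(l3_hit_rate=0.70, l3_latency_cycles=46)      # smaller L3 (assumed)
vcache = amat(l3_hit_rate=0.85, l3_latency_cycles=46 + 4)  # tripled L3 (assumed)

print(f"baseline: {base:.0f} cycles, 3D V-Cache: {vcache:.0f} cycles")
```

Under these assumed numbers, the bigger cache's higher hit rate more than pays for the extra four cycles.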

The larger cache has an obvious place in high-end gaming: using a desktop Ryzen CPU with 3D V-Cache speeds up 1080p games by an average of 15 percent. It also suits more serious work, cutting the run time of difficult semiconductor-design calculations by 66 percent.

Wuu notes that the industry's ability to shrink SRAM is slowing compared with its ability to shrink logic. So you can expect future SRAM expansion packs to keep being made with more mature manufacturing processes, while the compute chiplets are pushed to the leading edge of Moore's Law.

Graphcore's Bow AI processor

The Graphcore Bow AI accelerator uses 3D chip stacking to boost performance by 40 percent.

3D integration can speed up computing even when one chip in the stack has no transistors at all. Graphcore, a UK-based AI computer company, dramatically improved the performance of its systems simply by attaching a power-delivery chip to its AI processor. Adding the power-delivery silicon means that the combined chip, called Bow, can run faster (1.85 gigahertz versus 1.35 GHz) at lower voltage than its predecessor. The result is a computer that trains neural networks 40 percent faster while consuming 16 percent less energy than the previous generation. Importantly, users get this improvement without having to change their software at all.

The power-delivery chip is a combination of capacitors and through-silicon vias. The latter simply carry power and data to the processor chip. It's the capacitors that make the difference. Like the bit-storage components in DRAM, these capacitors are formed in deep, narrow trenches in the silicon. Because these reservoirs of charge sit so close to the processor's transistors, power delivery is smoothed out, letting the processor cores run faster at lower voltage. Without the power-delivery chip, the processor would have to raise its operating voltage above its nominal level to run at 1.85 GHz, consuming much more power. With it, the chip reaches that clock frequency while consuming less power.
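
The voltage-frequency relationship at work here follows the standard dynamic-power model for CMOS logic, P ≈ C·V²·f. In the sketch below, the two clock frequencies are Graphcore's published figures, but the 10 percent voltage reduction is an assumed, illustrative value, not a number from the article:

```python
# Dynamic CMOS power scales roughly as P ~ C * V^2 * f.
# Clock frequencies are from the article; the voltage reduction
# enabled by the power-delivery die is an assumed 10%.
def relative_power(v, f, v0=1.0, f0=1.35):
    """Power relative to a (v0, f0) baseline with the same capacitance."""
    return (v / v0) ** 2 * (f / f0)

p = relative_power(v=0.90, f=1.85)  # Bow: faster clock, lower voltage
speedup = 1.85 / 1.35

print(f"{speedup:.2f}x clock for {p:.2f}x power "
      f"-> {p / speedup:.2f}x energy per operation")
```

Under these assumed numbers, energy per operation drops by roughly a fifth, in the same ballpark as the 16 percent saving Graphcore reports.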

The manufacturing process used to make Bow is unique, but it's unlikely to stay that way. Most 3D stacking is done by bonding one die to another while one of them is still on the wafer, a process called chip-on-wafer [see "AMD's Zen 3" above]. Bow instead used TSMC's wafer-on-wafer process, in which an entire wafer of one type is bonded to an entire wafer of the other and then diced into chips. Simon Knowles, Graphcore's chief technology officer, says it's the first chip on the market to use the technology, and that it yields a higher density of connections between the two dies than a chip-on-wafer process can achieve.

And although the power-delivery chiplet has no transistors today, they may be coming. Using the technology only for power delivery "was just the first step for us," Knowles says. "In the near future, it will go much further."

Intel's Ponte Vecchio supercomputer chip


Intel's Ponte Vecchio integrates 47 chiplets into a single processor.

The Aurora supercomputer is designed to be one of the first high-performance computers (HPCs) in the United States to break the exaflop barrier: a billion billion (10^18) high-precision floating-point calculations per second. To push Aurora to those heights, Intel's Ponte Vecchio packs more than 100 billion transistors, spread across 47 pieces of silicon, into a single processor. Using both 2.5D and 3D technologies, Intel squeezed 3,100 square millimeters of silicon (almost equal to four Nvidia A100 GPUs) into a 2,330-square-millimeter footprint.
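
Two of these numbers are easy to sanity-check with a few lines of arithmetic. The A100 die area used below is Nvidia's published figure for that GPU, not a number from this article:

```python
# Sanity-check the packaging numbers quoted for Ponte Vecchio.
EXAFLOP = 1e18            # one exaflop: 10^18 floating-point ops per second

silicon_mm2   = 3100      # total silicon across the 47 chiplets
footprint_mm2 = 2330      # the package footprint
a100_die_mm2  = 826       # Nvidia A100 die area (Nvidia's published spec)

print(f"exaflop = {EXAFLOP:.0e} FLOPS")
print(f"silicon vs. footprint: {silicon_mm2 / footprint_mm2:.2f}x")  # ~1.33x
print(f"A100 equivalents:      {silicon_mm2 / a100_die_mm2:.1f}")    # ~3.8
```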

Intel researcher Wilfred Gomes told engineers attending the IEEE International Solid-State Circuits Conference that the processor pushed Intel's 2D and 3D chiplet-integration technologies to their limits.

Each Ponte Vecchio is really two mirror-image sets of chiplets tied together using Intel's 2.5D integration technology, called Co-EMIB. Co-EMIB forms a bridge of high-density interconnects between the two 3D stacks of chiplets. The bridge itself is a small piece of silicon embedded in the package's organic substrate; interconnects can be made about twice as dense on silicon as on an organic substrate.

Co-EMIB dies also connect high-bandwidth memory and I/O chiplets to the "base tile," the largest chiplet, on which the rest of the stack sits.

The base tile uses Intel's 3D stacking technology, called Foveros, to stack compute and cache chiplets on top of it. The technology creates a dense array of vertical die-to-die connections between two chips. These connections can be placed just 36 micrometers apart, using short copper pillars and solder microbumps. Signals and power enter the stack through through-silicon vias, fairly wide vertical interconnects that run directly through most of the silicon.

Eight compute tiles, four cache tiles, and eight blank "thermal" tiles meant to carry heat away from the processor are all attached to the base tile. The base tile itself provides cache memory and a network that lets any compute tile access any memory.

None of this is easy, needless to say. Gomes says it took innovations in yield management, clock circuits, thermal regulation, and power delivery. For example, Intel engineers chose to supply the processor with a higher-than-normal voltage (1.8 volts) so that the current would be low enough to simplify the package. Circuits in the base tile step the voltage down to about 0.7 V for the compute tiles, and each compute tile has to have its own power domain in the base tile. Key to this ability is a new type of high-efficiency inductor, called a coaxial magnetic integrated inductor. Because these are built into the package substrate, the circuit actually snakes back and forth between the base tile and the package before supplying the voltage to the compute tiles.
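
The reason a higher supply voltage simplifies the package is plain Ohm's-law arithmetic: for a fixed power draw, current scales as I = P/V. The 600-watt figure below is a hypothetical placeholder, since the article gives no power number; only the 1.8 V supply and the roughly 0.7 V compute-tile level come from the text:

```python
# For a fixed power draw, current scales inversely with supply voltage:
# I = P / V. The 600 W figure is hypothetical; only the 1.8 V supply
# and the ~0.7 V compute-tile level come from the article.
power_w = 600.0

for volts in (0.7, 1.8):
    amps = power_w / volts
    print(f"deliver at {volts:.1f} V -> {amps:,.0f} A")
# Delivering at 0.7 V would need ~857 A through the package; 1.8 V
# needs only ~333 A, before the base tile steps the voltage down.
```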

It took 14 years to go from the first petaflop supercomputer in 2008 to this year's exaflop machine, Gomes says. But advanced packaging, such as 3D stacking, is among the technologies that could help shrink the next thousandfold computing improvement to just six years, Gomes told engineers.
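
Those two timescales imply very different annual improvement rates, which a short calculation makes concrete:

```python
# Implied annual performance growth for a 1,000x improvement.
for years in (14, 6):
    rate = 1000 ** (1 / years)
    print(f"1,000x over {years:2d} years -> {rate:.2f}x per year")
# 14 years: ~1.64x/year (petaflop in 2008 to exaflop now)
#  6 years: ~3.16x/year (the pace Gomes hopes advanced packaging enables)
```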
