
AMD Strikes Back: Zen 4c

Author: Semiconductor Industry Vertical

Bergamo, AMD's upcoming 128-core server part, pushes x86 CPUs to new heights.

Bergamo's cloud-native architecture, born of the gradual slowdown of Moore's Law, represents an important inflection point in data center CPU design. At its heart is Zen 4c, a new CPU core variant of AMD's successful 5nm Zen 4 microarchitecture that packs more cores per socket.

While official details on Zen 4c have been fairly scarce so far, AMD CTO Mark Papermaster said this in the Ryzen 7000 keynote: "Zen 4c, our compact, density-optimized complement, is a new addition to our core roadmap that offers the same features as Zen 4 in roughly half the core area."

In this deep dive, we'll share an analysis of the Zen 4c architecture, its market impact, average selling prices, sales volumes, hyperscaler order conversion, and how AMD managed to halve core area while keeping the same core features and IPC.

We'll examine why AMD is pursuing this new direction in CPU design in response to market demand and to competition from ARM-based chips from Amazon, Google, Microsoft, Alibaba, and Ampere Computing, as well as from Intel's x86 Atom-derived E-cores.


Finally, we look at Bergamo's reduced production cost and expected sales, as well as AMD's future adoption of dense core variants across its client, embedded, and data center product lines. Before diving into these market and architecture details, let's first cover the background.

The cloud CPU era at the end of Moore's Law

The rationale behind Zen 4c and Bergamo's design is to provide as many compute resources as possible while battling the physical limits of silicon as Moore's Law slows. That slowdown is an industry-wide phenomenon, and it poses a challenge for designers who must nonetheless keep increasing core counts.

As AMD brings its 128-core Bergamo to market, rival Intel is preparing its 144-core "Sierra Forest" part. Both are responding to the rise of ARM CPU cores in the data center, from the in-house efforts of hyperscalers Amazon, Google, Microsoft, and Alibaba to merchant silicon such as the 192-core AmpereOne cloud-native CPU.

With the rise of generative AI, GPUs, accelerators, and ASICs are all the rage and are taking a growing share of capital expenditure, but the humble general-purpose CPU remains the foundational backbone of most data center deployments worldwide. In the cloud computing paradigm, maximizing compute resources while minimizing total cost of ownership (TCO) is the name of the game.

Increasing the number of cores per socket is one of the main ways to save power and cost. Socket consolidation, where a new CPU replaces four or more old CPUs, is all the rage. There is a large installed base of power-hungry 22 to 28-core Intel CPUs on 14nm that needs to be replaced. There has not been a major infrastructure replacement cycle since the mid-2010s, and the cloud has stretched server lifecycles from 3 to 6 years. This will soon change as the performance/TCO improvements of new cloud-native CPUs spur a refresh.
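
As rough back-of-the-envelope math (a core-count ratio only, ignoring the per-core IPC gains that make the trade even more favorable), a single 128-core Bergamo socket can absorb the threads of four to six of those 14nm parts:

\[
\frac{128}{28} \approx 4.6 \qquad\qquad \frac{128}{22} \approx 5.8
\]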

Consolidation eliminates slow, power-hungry inter-socket and network communication and requires fewer physical resources (fans, power supplies, boards, and so on). Even within the same generation, two 32-core servers fundamentally consume more power than a single 64-core server delivering the same performance. In the cloud, it is also simpler to spin up, shut down, and migrate tenants across a compute fleet made of fewer, larger nodes.

However, more cores mean more power. The thermal design power (TDP) of CPU sockets has skyrocketed over the past 7 years, from 140W to 400W, and 2024 platforms will crack 500W.

Still, the limits on power delivery and cooling imposed by rising thermal density mean that TDP does not grow in step with core count, so the power budget per core shrinks. Running cores at high clock speeds and power maximizes performance per core and per square millimeter of silicon, the basic unit of cost, but the shrinking per-core budget makes that approach harder to sustain.
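
To put rough numbers on the shrinking per-core budget, pair the 140W sockets of the 14nm era with their 28-core parts, and Bergamo's 360W TDP with its 128 cores (a crude estimate that ignores IO die and uncore power):

\[
\frac{140\,\text{W}}{28\ \text{cores}} = 5.0\,\text{W/core} \qquad\text{vs.}\qquad \frac{360\,\text{W}}{128\ \text{cores}} \approx 2.8\,\text{W/core}
\]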

The trend is that performance per watt is now the most important factor for any given workload, and CPUs that deliver it can command significant price premiums. Looking at the transition from AMD Milan to Genoa, AMD was able to demand roughly 80% higher prices simply on the strength of increased deployment density and performance per watt.

CPU architects must therefore carefully balance their core designs to optimize performance per watt. At the same time, as Moore's Law slows, cost per transistor is no longer falling with new process nodes, which makes the task harder still: the transistor budget and core size must also be kept in check.


Engineers make fundamental design decisions with imperfect information about performance, power, area, and more. At one end of the performance, power, area (PPA) curve is IBM's Telum, which maximizes per-core performance for legacy mainframe-style applications. To serve its banking, airline, and government customers, IBM must design huge cores with clock speeds above 5GHz and extreme reliability, which is far too costly for newer containerized, distributed workloads.

At the other end are microcontroller CPUs and low-power mobile chips, which prioritize energy efficiency and minimal area (cost). Intel's failure in the smartphone revolution means it lacks the decade of design experience ARM accumulated in energy-efficiency optimization.

When Apple scaled its mobile architecture up to the M1 Macs and beat Intel, the difference in design points was laid bare. Over the years, Intel's high-performance P-cores have grown increasingly bloated in the pursuit of per-core performance and 6GHz clock speeds, at the expense of power and area. Running that same core at 3GHz in a server chip is far from the most efficient choice.

Next year, Intel's Sierra Forest will address this problem by bringing its E-core design to the data center. Derived from the Atom family of low-power cores, the E-core lets Intel pack 3-4 times more cores into a given die size. The caveat is that E-cores cut back the Instruction Set Architecture (ISA) feature level and have lower instructions per clock (IPC), resulting in worse per-core performance, which in many workloads is compensated for by the sheer increase in core count.

Intel began pairing E-cores with P-cores in its client line to improve multi-threaded performance per square millimeter, but the ISA mismatch led to issues such as AVX-512 being disabled on the P-cores and a hardware thread scheduler being required to assign work to cores with very different characteristics. The all-E-core Sierra Forest, by contrast, focuses on delivering socket performance close to that of the P-core Granite Rapids while using less silicon. Its successor, Clearwater Forest, will go all out on per-socket performance and core count.

Back to AMD: it has neither smartphone experience nor a separate low-power core design lineage or team. Its Zen core has to scale from 5.7GHz desktops down to efficient laptops and servers. In response to ARM and Atom, AMD created Zen 4c.

Zen 4c is AMD's design team's effort to introduce a core at a different point on the performance, power, area (PPA) curve, one better suited to the latest trends in data center CPU workloads. AMD took a rather clever route: reuse the same Zen 4 architecture and apply a variety of physical design tricks to save a large amount of area.

That means the same IPC and the same ISA feature set, which simplifies integration into client products. In fact, AMD is already quietly replacing some Zen 4 cores with Zen 4c cores in its low-end 4nm Ryzen 7000 "Phoenix" mobile processors.

In Bergamo, Zen 4c lets AMD increase the core count from 96 to 128 while saving area and cost. This divergence in design philosophy will only widen in future generations of hardware.

Next, let's get into the specific technical details before finally zooming back out to cover cost, ASPs, hyperscaler order conversion, volumes, and adoption outside the data center.


Here is the specification table for Bergamo and how it differs from Genoa.

Two models arrive in June: the fully enabled 128-core EPYC 9754 and the cut-down 112-core EPYC 9734, which has one in eight Zen 4c cores disabled. Compared to Genoa's top 96-core EPYC 9654, Zen 4c lets Bergamo fit 1.33x the cores into the same SP5 socket and 360W TDP.

Zen 4c has the same amount of private cache as Zen 4: the same L1 caches and 1MB of L2. Maintaining a sufficiently large private cache matters in cloud and virtualized environments; it helps keep performance consistent by reducing dependence on shared resources.

Bergamo's clock speeds also drop: the base clock is 150MHz lower and the boost clock 600MHz lower. Of course, more cores in the same 360W socket TDP means lower operating frequencies. Bergamo still holds a 1.25x advantage in raw CPU throughput (cores x base clock), and while Genoa can boost higher, that only helps at lower utilization. Bergamo targets cloud environments where predictable performance is key and clocks stay within a narrow, lower operating range.
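
The 1.25x figure follows directly from the core counts and base clocks, taking the EPYC 9654's 2.4GHz base clock and Bergamo's 150MHz-lower base:

\[
\frac{128 \times 2.25\,\text{GHz}}{96 \times 2.40\,\text{GHz}} = \frac{288}{230.4} = 1.25
\]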

Another major difference in Bergamo is the die and L3 cache configuration. The CCD count drops from 12 in Genoa to 8 in Bergamo, which means Bergamo carries 16 Zen 4c cores per CCD versus Genoa's 8 Zen 4 cores.

Bergamo also sees the return of multiple CCXs per CCD, last seen in the EPYC 7002 "Rome" generation. This splits the die in two, and one half of the cores can only talk to the other half via a long round trip through the IO die.

The performance impact is detailed below. Each Bergamo CCX still has 8 cores communicating locally, but their shared L3 cache is halved to 16MB. This half-size L3 also appears in AMD's mobile designs to save area. While it will hurt IPC in some workloads, it makes sense for Bergamo, which cares less about shared resources and more about performance per square millimeter. Those who want a large L3 can look to Genoa-X and its up to 1152MB of L3.
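
Summing the configuration described above gives the total L3 per socket (treating Genoa-X as 12 CCDs each carrying 32MB of on-die L3 plus 64MB of stacked V-Cache, the same stack size as Milan-X):

\[
\text{Bergamo: } 8 \times 2 \times 16\,\text{MB} = 256\,\text{MB} \qquad \text{Genoa: } 12 \times 32\,\text{MB} = 384\,\text{MB} \qquad \text{Genoa-X: } 12 \times (32+64)\,\text{MB} = 1152\,\text{MB}
\]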

Bergamo uses the same IO die as Genoa, so SP5 socket IO is identical: 12 channels of DDR5-4800, 128 PCIe 5.0 lanes, and dual-socket capability. However, Bergamo's IO die connects to only 8 CCDs versus Genoa's 12, which raises the question: could AMD build a 12-CCD, 192-core Bergamo?


The IO die has 12 Global Memory Interconnect 3 (GMI3) chiplet links routed through the package substrate. In Genoa, the GMI3 wires of the CCDs farther from the IO die are routed beneath the L3 cache area of the closer CCDs.

This is harder on Bergamo: the denser Zen 4c CCD has a smaller L3 on the closer CCD to route under, so more package layers are needed for those wires. The visual result shows up in the CCD placement.

On Genoa, CCDs sit side by side in groups of 3, while on Bergamo there are gaps between the CCDs to make room for routing. The package also routes PCIe through the middle and DDR5 along the top and bottom edges, so there is simply not enough free space for 12 Zen 4c CCDs.

Die shots, floorplans, and core analysis

This is a die shot of Bergamo's Zen 4c CCD, codenamed "Vindhya", annotated using assets from the Zen 4 CCD, codenamed "Durango", courtesy of AMD at ISSCC 2023. Note the two 8-core CCXs (Core Complexes) sitting side by side, each with a 16MB shared L3. The L3 also lacks the through-silicon via (TSV) array for 3D V-Cache, saving a small amount of area, which makes sense because cloud workloads don't benefit much from a large shared cache.

What is really striking here, however, is the die size: the CCD with 16 Zen 4c cores is only slightly larger than the one with 8 Zen 4 cores. At ISSCC 2023, AMD revealed that the Zen 4 CCD is 66.3mm². This is the design area, excluding the die seal and scribe lines at the edges. The Zen 4c CCD comes in at just 72.7mm², less than 10% larger.

Keep in mind that this die packs twice the cores, twice the total L2 cache, and the same amount of L3 cache. The cores must shrink dramatically to fit all of that extra cache with only a fraction more area.
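
A quick division drives the point home; the per-core slice here includes each core's share of L2, L3, and CCD overhead:

\[
\frac{72.7}{66.3} \approx 1.10 \quad (\sim 10\%\ \text{larger}) \qquad\quad \frac{66.3\,\text{mm}^2}{8} \approx 8.3\,\text{mm}^2/\text{core} \quad\text{vs.}\quad \frac{72.7\,\text{mm}^2}{16} \approx 4.5\,\text{mm}^2/\text{core}
\]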

As for the chiplet interconnect, the Infinity Fabric on Package (IFOP) area is identical on both dies and includes two GMI3-Narrow links. However, although the die supports it, Bergamo does not appear to use the wide mode that bonds two links to a single CCD; instead, the signals from the two independent CCXs are multiplexed onto the IO die over a single link.


A closer look at the core reveals clear differences in design and layout. The table below breaks down the area by region for Zen 4c, codenamed "Dionysus", and Zen 4, codenamed "Persephone".

[Table: core area breakdown by region, Zen 4c "Dionysus" vs Zen 4 "Persephone"]

Compared to Zen 4, Zen 4c's core area is down 35.4%, which is remarkable given that both contain 1MB of L2 cache. The L2 SRAM cells therefore occupy the same area, but AMD shrank the L2 region by making the L2 control logic more compact. Excluding the L2 and chip pervasive logic (CPL) regions, the core shrinks by a staggering 44.1%, and the engine (front-end + execution) region is nearly halved.

This is what Papermaster was referring to, and it is an impressive engineering feat: Zen 4c is essentially the same design as Zen 4, with the same IPC, just with a different implementation and layout. The floating-point unit (FPU) does not shrink to quite the same extent, possibly because of thermal hotspots, as the FPU is usually the hottest part of the core under heavy load. We also note that the SRAM cells within the core itself look more compact, with a 32.6% reduction in area; you can see this clearly in the Page Table Walker in the lower right corner.

Physical design tricks

AMD created Zen 4c by taking the exact same Zen 4 register transfer level (RTL) description, which captures the logical design of the Zen 4 core IP, and implementing it with a more compact physical design. The design rules are the same TSMC N5 rules, yet the area is very different. We detail three key physical design techniques that made this possible.

First, lowering the design's clock target yields a smaller area when the core is synthesized. This is the speed vs. area curve of an ARM Cortex-A72 CPU core synthesized on TSMC's N5 and N3E nodes: even with the same core design on the same node, you can choose the core area and the clock speed achievable at that area.

With a lower clock target, designers have more slack on the critical paths, which simplifies timing closure and reduces the number of extra buffer cells needed to fix marginal timing paths. Most designs today are limited by routing density and congestion, and a lower operating clock lets designers squeeze signal paths closer together and increase standard cell density.
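
As a toy illustration of picking a point on such a speed-vs-area curve (the numbers below are invented for illustration and are not AMD's or ARM's data):

```python
# Hypothetical (clock target GHz, synthesized area mm^2) points for one core on one node.
# Illustrative numbers only; area grows steeply as the target approaches the node's limit.
speed_area_curve = [
    (2.0, 1.00),
    (2.5, 1.05),
    (3.0, 1.15),
    (3.5, 1.35),
    (4.0, 1.70),
]

def smallest_area_meeting(f_target_ghz, curve):
    """Return (area, clock) of the smallest-area point that still meets the frequency target."""
    feasible = [(area, f) for f, area in curve if f >= f_target_ghz]
    if not feasible:
        raise ValueError("no synthesis point meets the frequency target")
    return min(feasible)

# A density-optimized core gives up peak clocks for area...
print(smallest_area_meeting(2.5, speed_area_curve))   # (1.05, 2.5)
# ...while chasing the last few hundred MHz costs a large area premium.
print(smallest_area_meeting(4.0, speed_area_curve))   # (1.7, 4.0)
```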


Standard cell density refers to the fraction of the placeable area in a design that is occupied by standard cells. Standard cells are small functional circuits such as flip-flops and inverters that repeat throughout the design and combine to form complex digital logic. As this close-up view from the placement software shows, they come in many different sizes.

The blue rectangles are standard cells, while the black areas are unfilled. One highlighted region has low cell density with roughly 50% area utilization, and another has high cell density above 90%. Standard cells with many input and output signal pins hog nearby routing resources, effectively blocking adjacent space where other standard cells could be placed.
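
As a minimal sketch of what the utilization metric means, assuming cells are modelled only by their rectangular footprints (real placement tools compute this per region and account for routing blockages):

```python
# Standard-cell utilization: fraction of a placeable region covered by cell footprints.
# Cells are modelled only as (width, height) rectangles in microns -- illustrative only.

def cell_density(cells_um, region_w_um, region_h_um):
    cell_area = sum(w * h for w, h in cells_um)
    return cell_area / (region_w_um * region_h_um)

# A sparsely packed 10x10um region (~50% utilization) leaves slack for routing and fixes...
sparse = [(0.5, 0.21)] * 480
print(f"{cell_density(sparse, 10, 10):.0%}")  # ~50%

# ...while a densely packed one (~90%) trades that slack for area, risking congestion.
dense = [(0.5, 0.21)] * 860
print(f"{cell_density(dense, 10, 10):.0%}")   # ~90%
```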


Zooming out to the entire core produces a cell density map that outlines areas where standard cells are tightly packed (orange, yellow) and areas with low utilization (green, blue). The black rectangles are large SRAM macros placed before the standard cells.

All of this means AMD could take its Zen 4 core and shrink it directly by moving down the speed vs. area curve, with the result looking roughly similar but with higher cell density. Zen 4c actually looks very different, however, because of the next physical design technique.


Zen 4c looks very different because it has a flatter design hierarchy with fewer partitions. For a complex core with hundreds of millions of transistors, it makes sense to divide the floorplan into regions so that designers and the physical design tools can work in parallel and shorten time to market (TTM). Any engineering change to a circuit can then be isolated to a sub-region without redoing place-and-route for the whole core.

Deliberately separating timing-critical areas can also help resolve routing congestion and reach higher clock speeds with less interference. ARM's Neoverse V1 and Cortex-X2 cores, by contrast, have no hard partitions between logical regions and are laid out as compactly as possible; on the physical die these areas look homogeneous. Intel's Crestmont E-core, on the other hand, has many visible partitions, with the borders highlighted in purple.

As the Zen 4 core annotations show, each logical block in the core has many partitions, but in Zen 4c this is cut down to just 4 partitions (L2, front-end, execution, FPU). Merging the Zen 4 partitions lets these areas pack together more tightly, adding yet another way to save area by further raising standard cell density. It is fair to say that AMD's Zen 4c "looks like an ARM core."


The last way to save area is to use denser memory. Zen 4c reduces the SRAM area inside the core itself because AMD switched to a new SRAM bitcell. The figure shows an 8T SRAM circuit with 8 transistors: 4 in the middle store 1 bit of information, and 2 pairs of access transistors serve 2 pairs of word lines and bit lines.

High-performance out-of-order cores need to read from and write to the same block of memory simultaneously, so these 8T dual-port bitcells are used. They occupy more area and require double the signaling resources compared to denser 6T single-port bitcells.

To save space, AMD replaced these 8T dual-port bitcells with new 6T pseudo-dual-port bitcells developed by TSMC.

In a paper on a 4.24GHz 128x256 SRAM in 5nm technology that reads and writes in the same cycle using double pumping, TSMC proposed a high-speed 1R1W dual-port 32Kbit (128x256) SRAM macro built from single-port 6T bitcells.

A read-then-write (RTW) double-pump clock generation circuit with TRKBL bypass is proposed to improve read performance. A dual-metal scheme improves signal integrity and overall cycle time, and a local interlock circuit (LIC) in the sense amplifier reduces active power and pushes Fmax further. The results show that, in the 5nm FinFET technology, slow-corner wafers can reach 4.24GHz at 1.0V and 100 degrees Celsius.

From this description we can see that TSMC emulates a dual-port bitcell by performing sequential read and write operations within the same clock cycle. While this is not as flexible as two truly independent access ports, the area savings were enough for AMD to adopt the technique for Zen 4c. As SRAM area scaling flattens out, we will see more of these area-saving techniques.
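
Behaviorally, the pseudo-dual-port trick amounts to time-multiplexing one physical port. Below is a minimal Python sketch of one external clock cycle, assuming the read is serviced before the write within the cycle; it is an illustration of the concept, not TSMC's circuit:

```python
# Behavioral model of a pseudo-dual-port (1R1W) SRAM built from single-port 6T bitcells.
# Externally, one read and one write are accepted per cycle; internally the macro is
# double pumped, so the single physical port serves the read first, then the write.

class PseudoDualPortSRAM:
    def __init__(self, words=256, width_bits=128):
        self.mem = [0] * words
        self.width_bits = width_bits

    def cycle(self, read_addr=None, write_addr=None, write_data=None):
        """One external clock cycle carrying up to one read and one write."""
        read_data = None
        if read_addr is not None:          # first internal pump: read
            read_data = self.mem[read_addr]
        if write_addr is not None:         # second internal pump: write
            self.mem[write_addr] = write_data
        return read_data

sram = PseudoDualPortSRAM()
sram.cycle(write_addr=5, write_data=0xABCD)
# Same-address read+write in one cycle: the read returns the old value in this model.
print(hex(sram.cycle(read_addr=5, write_addr=5, write_data=0x1234)))  # 0xabcd
print(hex(sram.cycle(read_addr=5)))                                   # 0x1234
```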
