
CICC Communication Technology's 10-year outlook: 224G PHY sets sail, and data center wired communication is on a new journey

Author: CICC Research
At NVIDIA GTC 2024 last week, we clearly observed the trend of network capabilities being upgraded, with continuous innovation and bandwidth acceleration between dies, chips, and racks. With interconnect protocols such as NVLink supported by underlying PHY-layer technology, the evolution of single-lane transmission rates to 224G or even higher has become an inevitable trend. In this article, we start from the 224G PHY & SerDes and look ahead to the development of wired communication within AI data centers.

Summary

D2D: The demand for die-to-die communication is increasing, and advanced packaging processes are constantly advancing. D2D communication takes place inside the chip package, and its interface physical layer can adopt either a high-speed SerDes or a high-density parallel architecture. We believe that as the penetration of many-core heterogeneous designs such as chiplets gradually increases, demand for inter-die communication is expected to grow further, placing higher requirements on advanced packaging technology and interconnect standards: 1) advances in advanced packaging, such as 2.5D/3D packaging, can bring higher I/O density to D2D connections, and TSMC's CoWoS is widely used in AI chip packaging; 2) the emergence of the UCIe protocol is expected to standardize inter-die interconnect ports, enabling the free integration of different chips, and the D2D interconnect ecosystem is expected to become more open.

C2C: The PCIe bus continues to be upgraded, and NVLink leads a new revolution in inter-chip communication. The motherboard bus is an important medium for C2C communication. Among bus standards, PCIe is mainly used to connect the CPU to high-speed peripherals; on average a new generation of the standard is released every three years, and it has now iterated to version 6.0, which achieves a transmission rate of 256GB/s across 16 lanes. In AI scenarios, heterogeneous parallel computing architectures have become mainstream, and we observe that C2C interconnection between GPUs and heterogeneous xPUs has gradually evolved from PCIe toward more powerful dedicated interconnect technologies: NVLink realizes high-speed, low-latency direct interconnection between GPUs and introduces NVSwitch to resolve communication imbalance. At GTC 2024, NVIDIA released a new generation of NVLink and NVSwitch, raising bidirectional C2C interconnect bandwidth to 1.8TB/s.

B2B: High-speed interconnection between machines improves AI training performance, with protocols and hardware advancing hand in hand. We believe improving inter-machine communication efficiency requires both supporting protocols and hardware: 1) protocols: evolving from traditional TCP/IP to RDMA to optimize network performance; 2) hardware: interface transmission rates need to keep rising, and the performance of the switch chip, as the core hardware, faces upgrades. We expect the evolution of SerDes to 224G to provide the underlying support for the launch of 102.4T switch chips and help realize 1.6T network connectivity.

Risks

Development of large AI models and applications falls short of expectations; iteration of SerDes technology falls short of expectations.

Body

A Preliminary Study of PHY: Physical Layer Functions and SerDes Technology Evolution

What is PHY?

The PHY (physical layer) sits at the lowest level of the OSI reference model, connecting data link layer devices (commonly known as the MAC, i.e., media access control) and the physical medium (such as optical fiber and copper cable), and is responsible for converting between digital signals on the device side and analog signals on the medium.

According to the physical layer protocol specification given by the IEEE 802.3 standard, the internal structure of the PHY can generally be disassembled into 3 sub-layers (PCS, PMA, PMD) and 2 interfaces (MII, MDI).

► PCS (Physical Coding Sublayer): The PCS sits at the top of the physical layer architecture. Upward, the PCS connects to the RS (reconciliation sublayer) via the MII/GMII to realize interconnection between the MAC layer and the PHY layer. The main functions of the PCS include: 1) line encoding/decoding and scrambling/descrambling to ensure reliable and orderly data transmission; 2) compensating for uplink/downlink rate differences to avoid data confusion and errors; and 3) forward error correction (FEC) to reduce the impact of noise on transmission quality.

► PMA (Physical Medium Attachment): The main functions of the PMA are serial-parallel conversion, link status monitoring, clock recovery, and error detection.

► PMD (Physical Medium Dependent): The PMD mainly handles signal output to the MDI, line voltage transformation, electro-optical conversion, and related functions.

Figure 1: Relationship of the 1000BASE-T1 PHY to the OSI reference model and the IEEE 802.3 CSMA/CD LAN model


Note: 1) The MAC interface of the 1000BASE-T1 standard is the GMII (Gigabit MII), which supports Gigabit Ethernet; 2) Auto-Negotiation is required for backplane application scenarios at rates ≥ 1Gbps

Source: IEEE official website, CICC Research Department

Underlying SerDes technology

SerDes provides the underlying physical-layer technical support for data communication protocols such as Ethernet and PCIe. SerDes is short for SERializer/DESerializer, a mainstream time-division-multiplexed, point-to-point high-speed serial communication technology; it underpins high-speed serial-link data communication protocols such as Ethernet, HDMI, PCIe, and USB. By transmission distance, SerDes can be divided into long/medium/very-short reach SerDes (LR/MR/VSR SerDes), extra-short reach (XSR) SerDes, and ultra-short reach (USR) SerDes, used respectively for backplanes (e.g., in Ethernet switch PHYs), chip-to-chip links (e.g., in PCIe/CXL PHYs), chip-to-module links (e.g., optical DSP PHYs), and die-to-die links (e.g., in die-to-die PHYs).

Figure 2: SerDes and its application scenarios for each transmission distance


Source: Cadence, OIF, CICC Research

A SerDes system typically consists of a serializer and driver on the transmitter side and an analog front-end and deserializer on the receiver side, comprising both analog and digital circuits, and is usually integrated as an IP core or exists as a stand-alone chip. The earliest SerDes single-lane data rates were generally 1.25-3.125Gbps; the highest rate in mature deployment worldwide is 112Gbps per lane, designed on a PAM4 ADC+DSP architecture and commercialized in 2022. We believe that improvements in SerDes single-lane transmission rate and performance depend on the evolution of technologies such as pulse amplitude modulation, high-speed ADCs, and digital signal processing, together with upgrades in manufacturing processes.
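The serializer/deserializer pair at the heart of such a system can be sketched as a toy round-trip; this is purely an illustration of the concept (real SerDes adds line coding, equalization, and clock recovery), with hypothetical word widths and data:

```python
# Toy serializer/deserializer sketch: parallel words -> serial bit stream -> words.
def serialize(words, width=8):
    bits = []
    for w in words:
        bits.extend((w >> i) & 1 for i in reversed(range(width)))  # MSB first
    return bits

def deserialize(bits, width=8):
    words = []
    for i in range(0, len(bits), width):
        word = 0
        for b in bits[i:i + width]:
            word = (word << 1) | b
        words.append(word)
    return words

data = [0xA5, 0x3C]
assert deserialize(serialize(data)) == data  # round-trip recovers the words
```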

► Pulse Amplitude Modulation (PAM): Advanced modulation increases the number of bits carried by a single pulse (a single symbol), thereby raising the SerDes transmission rate. For example, under NRZ (non-return-to-zero) coding, each symbol carries 1 bit; under PAM4 (4-level pulse amplitude modulation), a pulse can take 4 levels, so each symbol carries 2 bits, doubling the bit rate at the same bandwidth.

► High-speed ADC: Under NRZ modulation, a traditional analog front-end can be used, with the signal entering the digital domain after matching, equalization, and sampling; under the PAM4 ADC+DSP architecture, a high-speed ADC digitizes the received signal early in the chain, and equalization, sampling, and deserialization are completed in digital circuitry.

► Digital Signal Processing (DSP): The PAM4 SerDes system adds a DSP to perform clock recovery; by generating all possible data sequences and comparing them with the received signal to identify the most likely transmitted sequence, it can effectively compensate for gain errors and timing offsets, enhance the system's noise resistance, and improve data transmission efficiency and stability.

► Advanced process technology: Process upgrades can help SerDes achieve lower-power, higher-performance interconnects, or drive the emergence of new architectures. Chip manufacturers have successively launched 3nm-process SerDes to meet the higher data bandwidth requirements of AI and other high-speed network infrastructure.
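The NRZ-vs-PAM4 point in the list above can be sketched numerically. The Gray-coded bit-pair-to-level mapping below is a common convention, assumed here purely for illustration:

```python
# Gray-coded mapping of bit pairs to the four PAM4 levels (illustrative).
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def nrz_symbols(bits):
    return [1 if b else -1 for b in bits]          # NRZ: 1 bit per symbol

def pam4_symbols(bits):
    assert len(bits) % 2 == 0
    return [PAM4_LEVELS[(bits[i], bits[i + 1])]    # PAM4: 2 bits per symbol
            for i in range(0, len(bits), 2)]

bits = [1, 0, 1, 1, 0, 0, 0, 1]
# Same bit rate, but half the symbol (baud) rate under PAM4:
assert len(pam4_symbols(bits)) == len(nrz_symbols(bits)) // 2
```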

Why upgrade to 224G?

The AI wave is driving a surge in data processing and demand for cloud computing power. Demand for high-speed connectivity within data centers has grown dramatically, requiring interconnect channels with greater carrying capacity. Beyond performance upgrades of core hardware such as individual GPU cards and SSDs, improving AI computing power also requires stronger inter-chip and intra-chip interconnect capabilities, so as to aggregate multiple GPUs efficiently and meet the needs of GPU memory access and data exchange. We therefore believe that a large cluster does not equal large computing power: to improve the training efficiency of AI clusters, interconnect capabilities between chips on a board, between on-package dies, and outside the package all need further upgrades.

Comparing I/O bandwidth with compute-unit computing power, we see the gap between the two widening as Moore's Law slows and semiconductor processes approach physical limits. As wafers iterate to 5nm/3nm processes, transistor density approaches its limit, R&D costs rise, and Moore's Law slows. In the post-Moore era, chip manufacturers such as Intel and AMD have adopted a multi-die expansion route: a multi-die system integrates multiple heterogeneous dies in a single package, the transistor count can grow into the trillions, and processor computing power is greatly improved. However, because the number of I/O pins faces a physical limit, growth in I/O rate is disproportionate to growth in computing power; according to the Synopsys website, I/O performance improves by less than 5% while transistor density doubles, so data transmission capacity constrains the increase in total chip computing power to a certain extent.
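A back-of-the-envelope compounding calculation makes the widening gap concrete; the per-generation growth rates below are assumptions based only on the Synopsys figure cited above (density doubles, I/O improves by under ~5%):

```python
# Illustrative model: compound per-generation gains over several generations.
def compounded(growth_per_gen, gens):
    return (1 + growth_per_gen) ** gens

gens = 5
compute_gain = compounded(1.00, gens)  # density doubles each gen -> 2^5 = 32x
io_gain = compounded(0.05, gens)       # I/O +5% each gen -> ~1.28x
gap = compute_gain / io_gain           # the compute-vs-I/O gap after 5 gens
```

After just five generations the assumed compute gain outpaces I/O bandwidth by roughly 25x, which is the sense in which data transmission capacity constrains total chip computing power.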

Figure 3: The gap between I/O bandwidth and computing power is widening


Source: Synopsys official account, CICC Research

In summary, we believe the evolution of the physical layer (PHY) standard of interconnect interfaces toward 224G or even higher transmission rates is accelerating, against the backdrop of ever-expanding data center clusters and growing interconnect bandwidth demand. The advantages of 224G SerDes include the following. Taking 224G Ethernet SerDes as an example, single-lane 224G SerDes can greatly reduce the number of cables and switches required in a data center, thereby optimizing network efficiency and reducing the extra communication cost brought by additional nodes. The OIF CEI-224G framework uses co-packaged optics (CPO) and optical engines (OE) to shorten the electrical link between the host SoC and the optical interface; according to Synopsys' website, 224G SerDes reduces power consumption per bit by about one-third compared with 112G.
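The lane-count savings behind the cable and switch reduction can be sketched with simple arithmetic. A key assumption here: a "224G"-class lane carries roughly 200Gbps of usable payload (the 224G figure includes coding/FEC overhead), and a "112G"-class lane roughly 100Gbps:

```python
# Hypothetical lane-count arithmetic for a 102.4T switch ASIC.
def lanes_needed(switch_capacity_gbps, lane_payload_gbps):
    return switch_capacity_gbps // lane_payload_gbps

capacity = 102_400  # 102.4 Tbps of switching capacity
assert lanes_needed(capacity, 200) == 512    # 224G-class SerDes
assert lanes_needed(capacity, 100) == 1024   # 112G-class SerDes: 2x the lanes
```

Halving the number of lanes for the same capacity is what translates into fewer cables, connectors, and switch hops per bit.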

224G SerDes faces challenges in achieving higher transmission performance over a given channel or distance class. Looking back at the development of Ethernet data transmission: the Nyquist frequency of analog 28G SerDes based on NRZ modulation is 14GHz, and the Nyquist frequencies of 56G/112G PAM4 mixed-signal SerDes are 14GHz and 28GHz, respectively. When the SerDes rate rises to 224G, the Nyquist frequency must double to 56GHz, causing more severe link loss. In addition, channel isolation between the SerDes signal and interference sources does not improve, leading to problems such as increased signal crosstalk. At the same time, higher data rates also demand lower power consumption per bit. Synopsys says SerDes design complexity increases roughly fivefold to reach the performance levels of the previous generation at 224G.
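The Nyquist figures quoted above follow directly from the relation "Nyquist frequency = half the symbol (baud) rate", where the baud rate is the bit rate divided by bits per symbol (1 for NRZ, 2 for PAM4):

```python
# Nyquist frequency for a SerDes lane, reproducing the figures in the text.
def nyquist_ghz(bit_rate_gbps, bits_per_symbol):
    baud = bit_rate_gbps / bits_per_symbol  # symbol rate in Gbaud
    return baud / 2

assert nyquist_ghz(28, 1) == 14.0    # 28G NRZ
assert nyquist_ghz(56, 2) == 14.0    # 56G PAM4
assert nyquist_ghz(112, 2) == 28.0   # 112G PAM4
assert nyquist_ghz(224, 2) == 56.0   # 224G PAM4: Nyquist doubles vs. 112G
```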

Design of 224G SerDes PHYs has been launched, and commercial deployment is expected to accelerate. According to IP Nest, several 224G SerDes designs were launched in 2023, but design complexity, power constraints, and the need for more advanced modulation make 224G SerDes difficult to implement and deploy. LightCounting predicts the first batch of 224G SerDes will see deployment in 2026, with early applications including retimers, switches, AI scale-up interconnects, optical modules, I/O chips, and FPGAs, and extension to more data-intensive areas expected after applications mature. Synopsys first demonstrated 224G SerDes at ECOC 2022 in September 2022, enabling a high-performance 224G Ethernet PHY IP by minimizing the analog front-end and introducing massive parallelism and advanced DSP techniques throughout the system. Marvell said on its FQ4 2024 earnings call that its next-generation 1.6T PAM DSP products with single-lane 200Gb/s have been qualified by customers and are expected to begin deployment by the end of this year; Broadcom said at its "Enabling AI Infrastructure" investor conference on March 20 that its underlying SerDes technology has been upgraded to single-lane 200Gb/s, manufactured on a 3nm process.

Figure 4: 224G SerDes frequency, loss, and volume forecasts


Note: The Nyquist frequency is half the symbol (baud) rate, the minimum channel bandwidth needed to avoid signal aliasing; for example, for a 56G SerDes it is 28GHz under NRZ, and 14GHz under PAM4 because each pulse carries 2 bits

Source: Synopsys, LightCounting, IP Nest, CICC Research

Outlook: The development trend of wired communication within AI data centers

By packaging level, wired communication within the data center can be roughly divided into three layers from the inside out: die-to-die, chip-to-chip, and board-to-board communication. Die-to-die communication is the innermost level, occurring within the chip package to exchange data between different functional modules inside the chip; moving outward, chip-to-chip communication carries data between different chips on the server motherboard (such as CPU-GPU); outside the server, board-to-board communication carries data between servers and switches and between switches, layered to form the internal networking architecture of the data center cluster.

We believe that efficient training of multimodal large models with more than a trillion parameters must be premised on larger-scale data carrying capacity and efficient transmission rates within the data center, and an all-round improvement of die-to-die, chip-to-chip, and board-to-board interconnect capabilities within the cluster has become a definitive development trend.

Figure 5: Schematic diagram of data center communication at each level


Source: OIF, Montage Investments, Alphawave, CICC Research

Outlook #1: Die-to-Die

D2D communication is ultra-short-reach data transmission between dies within a chip package. The D2D interface is a functional module for data transmission between dies, usually consisting of a PHY and a controller module. At the physical layer, either a high-speed SerDes architecture or a high-density parallel architecture can be used between dies, realizing serial or parallel data transmission respectively, and supporting 2D, 2.5D, and 3D package structures.

Many-core heterogeneous solutions based on chiplets have many advantages, and communication requirements between dies are rising further. Chiplets disassemble an SoC into dies that each implement a specific function, allowing IP in different process nodes to be reused. The advantages of chiplets are: 1) yield: integrating small-area dies reduces the impact of wafer defects on yield; 2) cost: processes can be flexibly chosen per functional IP to balance performance and R&D cost; 3) computing power: breaking the single-die area limit provides more physical room for transistors; 4) storage capacity: the chiplet approach allows multiple stacked dies in a single package, increasing memory capacity while maintaining miniaturization; 5) communication bandwidth: chiplets adopt high-density, high-speed packaging and interconnect designs that effectively improve bandwidth and signal quality between compute and memory and between compute units, alleviating the "memory wall" problem.
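The yield advantage in point 1) can be illustrated with a simple Poisson yield model; the die areas and defect density below are hypothetical, chosen only to show the effect, not taken from any foundry data:

```python
import math

# Illustrative Poisson yield model: Y = exp(-A * D0),
# with die area A in cm^2 and defect density D0 in defects/cm^2.
def die_yield(area_cm2, d0_per_cm2=0.1):
    return math.exp(-area_cm2 * d0_per_cm2)

mono = die_yield(8.0)   # one monolithic 8 cm^2 die: ~45% yield
small = die_yield(2.0)  # one 2 cm^2 chiplet die: ~82% yield
# With known-good-die testing, a defect scraps one 2 cm^2 chiplet rather than
# the whole 8 cm^2 die, so far less silicon is wasted per defect.
```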

There are various protocols for interconnect ports between chiplet dies. Designing a chiplet die interconnect interface protocol must consider complex factors such as adaptation to process and packaging technology and system integration and expansion, while also meeting the differentiated requirements of different application fields for performance metrics such as transmission bandwidth per unit area and power consumption per bit. Chiplet interface interconnect protocols can be divided into physical, data link, network, and transport layers. Interfaces at the link layer and above mostly inherit or extend existing interface standards and protocols, while at the physical layer there are many interconnect protocols that differ in metrics and implementation, such as bandwidth density, transmission latency, and energy consumption. In terms of connection mode, chiplets have two physical-layer interconnect modes: serial and parallel.

The lack of uniform D2D interconnect standards hinders the further development of chiplets. Free mixing of chiplets relies on open, unified inter-die communication protocols, but at present D2D interfaces are usually developed around each manufacturer's own interconnect needs, so the ideal of freely combining dies into SiP packages cannot yet be realized. We believe the further development of chiplets is largely constrained by inconsistent PHY interconnect standards for inter-die communication, which leads to problems such as interface mismatches between finished designs and wasted resources when interconnecting different dies.

Figure 6: Chiplet physical layer partial interconnect standards at a glance


Source: CSDN, Semiconductor Industry Watch, CICC Research

UCIe helps standardize chiplet interfaces and gradually opens up the D2D interconnect ecosystem. In March 2022, chip manufacturers such as Intel, AMD, Arm, Qualcomm, Samsung, TSMC, and ASE Group, together with cloud vendors such as Google Cloud, Meta, and Microsoft, jointly established an industry alliance to formulate the Universal Chiplet Interconnect Express (UCIe) standard, and Nvidia announced in August 2022 that it would support the new UCIe specification [1]. According to IP Nest forecasts, the number of UCIe-based D2D IP design launches is far higher than that of other standards. We believe that, driven by the UCIe standard, chiplets from different vendors built on the same interface standard are expected to be further integrated through advanced packaging, and the chiplet ecosystem is expected to gradually mature.

Figure 7: UCIe specification chiplet physical layer standards and performance improvements


Source: UCIe website, IP Nest, CICC Research

Advanced packaging technology optimizes connectivity and improves communication speed between dies. Traditional packaging mainly uses wire bonding to connect the die's bonding pads to the substrate's pins for electrical connection, then covers them with a shell for protection; the main forms are DIP, SOP, QFP, etc. Because the chiplet approach sacrifices routing density and transmission stability between functional modules compared with monolithic SoCs, traditional packaging may struggle to meet inter-die communication requirements. Advanced packaging optimizes the connections between dies, effectively shortens signal distances, and provides higher connection density and communication bandwidth, improving communication quality and reducing power consumption.

Advanced packaging increases pin density by replacing wire leads with point or layer layouts. Point connections include bumping and TSVs (through-silicon vias); layer connections include the RDL (redistribution layer) and the interposer. Combinations of point and layer technologies have produced a variety of advanced packaging forms, such as fan-out, WLCSP (wafer-level chip-scale packaging), flip-chip (subdivided into FCBGA and FCCSP), 2.5D/3D packaging, and SiP (system-in-package). Yole forecasts that the penetration of advanced packaging in the overall packaging market will continue to rise, reaching 49.4% by 2025, and that the advanced packaging market will expand from US$44.3 billion in 2022 to US$78.6 billion in 2028, a compound growth rate of 10% over 2022-2028. By type, 2.5D/3D packaging leads market growth, with a CAGR of nearly 40% over 2022-2028.

Figure 8: Market Size Forecast for Advanced Packaging by Type


Source: Yole, CICC Research

Leading manufacturers accelerate their layouts in advanced packaging and expand 2.5D/3D packaging platforms. Global advanced packaging technology is dominated by leaders such as TSMC, Intel, and Samsung, with other packaging houses gradually following. Taking TSMC as an example, CoWoS (Chip on Wafer on Substrate) first attaches the chip to a silicon interposer wafer (CoW), then attaches the assembly to the substrate; its core is the use of a silicon interposer, bumping, and TSV technology to replace traditional wire bonding, increasing pin count and interconnect density. At present, 2.5D packaging methods such as CoWoS are widely used in packaging CPUs, GPUs, FPGAs, and other chips, and are the mainstream solution for chiplet-based packaging.

Bump pitch continues to shrink, and interconnect density between dies rises. We observe that as demand for chip speed and computing power keeps increasing, advanced packaging continues to develop toward diversified functions, connections, and stacking; the bump pitch of each packaging form keeps shrinking, while I/O density and package integration increase, which also raises packaging difficulty. Comparing manufacturers' technologies, TSMC and Intel are further ahead in packaging capability.
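The relationship between bump pitch and I/O density follows from simple geometry: on a square grid, density scales as the inverse square of the pitch. The pitch values below are order-of-magnitude assumptions for illustration only, not any vendor's published numbers:

```python
# Bumps per mm^2 on a square grid with a given bump pitch (in micrometers).
def io_per_mm2(pitch_um):
    per_side = 1000.0 / pitch_um  # bumps per mm along one edge
    return per_side * per_side

assert io_per_mm2(100) == 100.0      # ~100um-class flip-chip bump pitch
assert io_per_mm2(10) == 10_000.0    # ~10um-class fine pitch
# A 10x finer pitch yields 100x the I/O density per unit area.
```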

Outlook #2: Chip-to-Chip

The motherboard bus is an important medium for C2C communication, enabling data transmission between chips on the board. In AI scenarios, heterogeneous parallel computing architectures represented by CPU+GPU have become the mainstream, and C2C interconnection technology has gradually evolved from PCIe to multi-node lossless networks.

PCIe is a high-speed serial expansion bus standard connecting CPUs to high-speed peripherals. PCIe has good backward compatibility; on average a new generation of the standard is released every three years, each doubling the per-lane rate, roughly tracking processor I/O bandwidth, which also doubles about every three years. PCIe 1.0 was officially launched in 2003, and the standard had iterated to 6.0 by 2022. In June 2022, PCI-SIG released a forward-looking PCIe 7.0 document, which is expected to double the transfer rate again to 128GT/s (up to 512GB/s at 16 lanes) while keeping the same coding/modulation scheme, with the full specification to be released in 2025.

The PCIe generation built on 224G SerDes can support 800G Ethernet after conversion by a network card. As mentioned earlier, PCIe is used for connectivity within a chip or rack and Ethernet for connectivity beyond the rack, and PCIe can be converted to Ethernet through a PCIe network interface card (NIC), after which an Ethernet fabric can be built through multiple tiers of network switches. According to Synopsys' website, 4-lane PCIe throughput matches the highest single-lane Ethernet data rate, meaning x16 PCIe throughput is equivalent to x4 Ethernet port bandwidth. PCIe 7.0 is expected to be based on next-generation 224G SerDes and, with a 16-lane NIC conversion, can efficiently support an 800G (4-lane) Ethernet port; PCIe 6.0 is expected to be commercially available around 2025, according to PCI-SIG.
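The x16 bandwidth figures quoted above can be reproduced with a simplified formula that ignores encoding/flit overhead (an assumption; real delivered throughput is slightly lower):

```python
# x16 bidirectional throughput per PCIe generation, overhead ignored:
# GB/s = (GT/s per lane) x 16 lanes x 2 directions / 8 bits per byte.
def x16_bidir_gbytes(rate_gt_per_s):
    return rate_gt_per_s * 16 * 2 / 8

assert x16_bidir_gbytes(32) == 128.0   # PCIe 5.0
assert x16_bidir_gbytes(64) == 256.0   # PCIe 6.0 (matches the 256GB/s above)
assert x16_bidir_gbytes(128) == 512.0  # PCIe 7.0 (matches the 512GB/s above)
```

Under the x16-PCIe-to-x4-Ethernet rule of thumb cited from Synopsys, an x16 link of the 224G-SerDes generation pairs with one 800G (4x200G) Ethernet port.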

Figure 10: PCIe Specifications for each version


Note: Flit Mode* refers to the flow-control-unit mode, in which data is transmitted with the flit as the smallest unit

Source: PCI-SIG official website, CICC Research

Signal insertion loss also increases as the PCIe standard iterates, and introducing signal-conditioning technology can effectively improve signal quality. To deal with growing insertion loss, PCIe has introduced signal-conditioning chips since the 4.0 era: 1) PCIe Retimer: a mixed-signal (analog-digital) device that extracts the embedded clock from the input signal through a CDR circuit, then retransmits the data with a clean, undistorted clock, improving signal integrity and eliminating the effect of jitter; 2) PCIe Redriver: amplifies the impaired signal via a driver at the transmitter and a filter at the receiver to compensate for loss. By comparison, a Retimer recovers channel loss better than a Redriver but adds latency because of the extra data processing. PCIe Retimers are widely used in AI servers, and their market size is expected to expand: Retimer chips improve signal integrity during data transmission in servers, enterprise storage, heterogeneous computing, and communication systems, with typical application scenarios including NVMe SSDs, AI servers, and riser cards.

The Compute Express Link (CXL) protocol is built on the PCIe physical standard and shares memory to improve performance. In 2019, Intel led the release of the new CXL interconnect protocol in cooperation with Meta, Google, and other companies. CXL (CXL.io) runs on top of the PCIe 5/6 physical layer, sharing PCIe's physical and electrical interface characteristics and providing high bandwidth and scalability, while adding protocols (CXL.cache/mem) for memory exchange between devices in the data center. In November 2023, CXL 3.1 was officially released as an incremental update to 3.0, proposing a new trusted execution environment and optimizing the fabric and memory expanders. The advantages of CXL over PCIe are: 1) lower cross-device access latency: through CXL, the CPU and GPU can bypass the PCIe protocol to share memory resources, forming a memory pool and effectively reducing CPU-GPU latency; 2) larger memory capacity: additional devices attached via CXL provide more memory to the CPU, and the low-latency CXL link lets the CPU use this extra memory alongside its DRAM.

Exhibit 11: CXL protocols and specifications


Source: CXL official website, CICC Research Department

NVLink is a point-to-point interconnect protocol dedicated to NVIDIA GPUs. NVIDIA developed NVLink in 2014 for heterogeneous computing scenarios; it enables direct GPU-to-GPU interconnection, expands multi-GPU I/O within servers, and provides a faster, lower-latency in-system interconnect than the traditional PCIe bus. NVLink 1.0 offered a bidirectional transfer rate of 160GB/s, and NVLink has been upgraded in step with the evolution of GPU architectures. At the GTC 2024 keynote on March 19, NVIDIA announced the fifth-generation NVLink high-speed interconnect, raising maximum total bidirectional bandwidth to 1.8TB/s, double that of the fourth generation and about 14 times the total bandwidth of an x16 PCIe 5.0 link. We believe the launch of fifth-generation NVLink will significantly improve inter-GPU communication efficiency and is expected to further enhance the computing performance of NVIDIA's AI chip clusters at the C2C interconnect level.

NVSwitch extends NVLink technology to solve uneven communication between GPUs. At GTC 2024, NVIDIA announced a new-generation NVLink Switch: a single NVSwitch chip is built on TSMC's 4NP process and supports 72 bidirectional 200G SerDes ports (applying 224G PAM4 SerDes technology). The next-generation NVLink Switch can interconnect up to 576 GPUs, greatly expanding the NVLink domain and raising total aggregate bandwidth to 1PB/s, helping AI models with more than a trillion parameters unleash acceleration performance.
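The NVLink and NVSwitch figures quoted in the two paragraphs above are mutually consistent, as a quick arithmetic check shows (bandwidth values are taken from the text; the PCIe 5.0 x16 figure assumes 2 x 64 GB/s bidirectional):

```python
# Cross-checking the quoted NVLink/NVSwitch bandwidth figures.
nvlink5_bidir_tb = 1.8        # 5th-gen NVLink, per-GPU bidirectional, TB/s
nvlink4_bidir_tb = 0.9        # 4th-gen NVLink
pcie5_x16_bidir_tb = 0.128    # x16 PCIe 5.0, 2 x 64 GB/s

assert round(nvlink5_bidir_tb / nvlink4_bidir_tb, 6) == 2.0   # "double"
assert round(nvlink5_bidir_tb / pcie5_x16_bidir_tb) == 14     # "about 14x"

# 576 GPUs x 1.8 TB/s each ~ 1 PB/s of aggregate NVLink-domain bandwidth.
aggregate_pb = 576 * nvlink5_bidir_tb / 1000
assert abs(aggregate_pb - 1.0368) < 1e-6
```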

Figure 12: NVLink and NVSwitch specifications by generation


Note: "-" indicates that no public information has been disclosed

Source: Nvidia's official website, CICC Research

NVLink-C2C extends NVLink to the package level, supporting die interconnects with advanced packaging. Built on NVIDIA's SerDes and Link technologies, NVLink-C2C scales from PCB-level integration to multi-chip modules (MCM) and silicon interposer or wafer-level connections. Taking the GB200 superchip as an example, NVIDIA uses NVLink-C2C to build a package-level interconnect in which the Grace CPU and Blackwell GPU communicate at 900GB/s bidirectional bandwidth.

Outlook #3: Board-to-Board

With the development of AI models, data centers are showing two trends: 1) network traffic is growing rapidly, and the proportion of east-west traffic (i.e., between servers) has risen significantly; according to Cisco's forecast, east-west traffic may now account for 80-90% of total network traffic; 2) the network architecture is gradually moving toward multi-layer, low-convergence, more scalable forms; for example, NVIDIA's data centers use a fat-tree architecture to build a non-convergent network in which the total bandwidth of each layer is kept consistent. In our previous report, "Top of the AI Wave Series: InfiniBand VS Ethernet, Computing Center Network Requirements Upgrade", we discussed in detail what kind of network an AI data center needs.
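The "non-convergent" property mentioned above can be illustrated with a small sketch: at every switch tier of a 1:1 fat tree, uplink bandwidth equals downlink bandwidth. The port counts below are hypothetical:

```python
# Oversubscription ratio at one switch tier of a fat-tree network:
# downlink bandwidth divided by uplink bandwidth. A ratio of 1.0 means
# a non-convergent (non-blocking) tier; port counts are hypothetical.
def oversubscription(down_ports: int, up_ports: int, port_gbps: int) -> float:
    return (down_ports * port_gbps) / (up_ports * port_gbps)

# A 64-port leaf switch split evenly: 32 ports down to servers, 32 up.
assert oversubscription(32, 32, 400) == 1.0  # non-convergent

# An uneven split converges (oversubscribes) traffic 3:1.
assert oversubscription(48, 16, 400) == 3.0
```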

High-speed interconnection between machines (B2B) is an important part of improving AI training efficiency. On the one hand, the networking protocol, as the artery that carries data, needs to evolve from traditional TCP/IP to RDMA to optimize network performance; on the other hand, interface transmission rates need to keep rising with each iteration.

Figure 13: Compared with traditional data centers, intelligent computing centers have higher requirements for communication performance


Source: China Mobile Research Institute, Baidu Developer Center, Information Observation Network, CICC Research

Remote Direct Memory Access (RDMA) reduces the number of data-transfer steps and improves communication efficiency. At present there are three mainstream RDMA solutions: InfiniBand, RoCE, and iWARP.

► InfiniBand

InfiniBand has gradually moved from a niche supercomputing market to become the preferred choice for large-scale AI training clusters. Since 2014, the share of InfiniBand in the TOP100 (the world's top 100 supercomputers) has been significantly higher than that of Ethernet, and since 2020 InfiniBand has accounted for more than 50%. For AI-large-model-driven intelligent computing centers, InfiniBand can better meet intelligent computing needs with its extremely high throughput, extremely low latency, high scalability (clusters can scale to tens of thousands of nodes), rapid rollout, large-scale network tuning and maintenance capabilities, and lossless network construction, and its penetration rate in AI back-end networking has risen rapidly.

According to the InfiniBand roadmap, transfer speeds are planned to reach 3.2Tb/s in the future. The current InfiniBand transmission speed is 400Gb/s (4-lane, 8-fiber mode). In November 2023, IBTA released the initial specification for XDR 800Gb/s InfiniBand, and the latest roadmap shows XDR 800Gb/s landing in 2024 by doubling the single-lane SerDes rate; XDR-enabled NICs and switches will provide 800Gb/s per port and support switch-to-switch connectivity at 1.6Tb/s port rates. By 2030, InfiniBand network performance is planned to rise further to 1.6Tb/s GDR and 3.2Tb/s LDR.
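The per-port speeds above follow from lanes multiplied by the per-lane SerDes rate. The lane-rate values below follow the IBTA roadmap figures cited in the text, treating each generation as a clean doubling, which is a simplification:

```python
# Per-port InfiniBand speed = link width (lanes) x per-lane rate (Gb/s).
# Lane rates follow the IBTA roadmap cited above; treating each step as
# an exact doubling is a simplifying assumption.
lane_rate = {"NDR": 100, "XDR": 200, "GDR": 400}  # effective Gb/s per lane
lanes = 4  # 4X link width

for gen, rate in lane_rate.items():
    print(f"{gen} 4X port: {lanes * rate} Gb/s")
# NDR 4X port: 400 Gb/s
# XDR 4X port: 800 Gb/s
# GDR 4X port: 1600 Gb/s
```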

Figure 14: InfiniBand roadmap


Note: At 4X (4 lanes) the link speed is expressed in Gb/s

Source: InfiniBand Trade Association, CICC Research

InfiniBand networking has outstanding advantages, but it suffers from high cost and a closed ecosystem. At present, Mellanox (a subsidiary of NVIDIA) dominates the InfiniBand market. NVIDIA's strong bargaining power in the industry chain makes the BOM cost of AI training and inference infrastructure deployed over the InfiniBand protocol higher than that of Ethernet: according to Broadcom's "Infrastructure Enabling AI" conference [2], the cost of computing clusters built on the Ethernet RDMA protocol is about 50% of that of InfiniBand-based clusters, or lower. We therefore believe InfiniBand may have some shortcomings in terms of universality and affordability.

► RoCE

RoCE performance is close to that of InfiniBand, and multiple vendors have optimized its flow control and congestion management. Unlike IB's credit-based flow control mechanism, RoCEv2 retrofits traditional Ethernet with PFC (priority-based flow control), ECN (explicit congestion notification), and DCQCN (data center quantized congestion notification) to guarantee a lossless network, which may introduce problems such as PFC deadlock and congestion spreading. Huawei, H3C, Inspur and other manufacturers have launched their own optimized lossless network solutions.

A number of technology giants jointly established the UEC to build high-performance Ethernet in the form of an alliance. In response to the potential impact of rising InfiniBand deployment on Ethernet market share, according to the Linux Foundation, the UEC (Ultra Ethernet Consortium) was co-founded in July 2023 by hardware vendors Broadcom, AMD, Cisco, Intel, Arista, Eviden, and HPE, together with hyperscale cloud vendors Meta and Microsoft. The UEC has continued to expand its membership since its establishment, demonstrating the advantages of an open ecosystem. We believe that although UET (Ultra Ethernet Transport) standards and technologies are still at an early stage, as they are gradually promoted and implemented, UET is expected to surpass the RoCE protocol, benchmark against InfiniBand, and drive Ethernet penetration in the intelligent computing field.

Summary: InfiniBand and RoCE each have advantages and disadvantages; in AI training scenarios InfiniBand currently takes precedence, while Ethernet + RDMA penetration is accelerating. In the short term, given the shortage of computing resources, some vendors choose to have NVIDIA build data centers on a turnkey basis and quickly bring the network online, training large models with the help of InfiniBand's features. As AI networks are continuously optimized, demand for upgraded high-performance Ethernet solutions that are compatible and interoperable with existing large-scale IP networks is expected to accelerate, and their penetration rate in intelligent computing centers is expected to keep growing.

Switches are the core of B2B communication, and rising switch-chip bandwidth drives high-speed interconnection between machines. As the performance anchor of a switch, the switch chip determines its total bandwidth, maximum port transmission rate, buffering capability, and more. Based on our industry-chain observations, as data traffic continues to rise, switch-chip rates have roughly doubled every two years. At present, the bandwidth of the world's most advanced Ethernet switch chips has reached 51.2Tbps; Broadcom, NVIDIA, Marvell, and Cisco have successively released related products, some of which shipped in volume in 2023, and LightCounting expects 51.2Tbps InfiniBand switch chips to enter mass production in 2024. According to LightCounting (April 2023 report), by 2028 the penetration rate of 51.2Tbps switch chips in Ethernet and InfiniBand is expected to reach 8% and 54%, respectively, with wide application in AI data centers.
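A switch chip's total bandwidth maps directly to its possible port configurations (radix x port rate). The pairings below match the 51.2T chips discussed above and are illustrative:

```python
# Number of ports a switch chip can serve at a given per-port rate:
# ports = chip bandwidth (Gb/s) / port rate (Gb/s). Pairings illustrative.
def max_ports(chip_tbps: float, port_gbps: int) -> int:
    return int(chip_tbps * 1000 // port_gbps)

assert max_ports(51.2, 800) == 64     # 51.2T as 64 x 800G
assert max_ports(51.2, 400) == 128    # or 128 x 400G
assert max_ports(102.4, 1600) == 64   # next gen: 64 x 1.6T
```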

Figure 15: Ethernet switch chip shipments and forecasts by rate


Note: Based on LightCounting's April 2023 report

Source: LightCounting, CICC Research

Figure 16: InfiniBand switch chip shipments and forecasts by rate


Note: Based on LightCounting's April 2023 report

Source: LightCounting, CICC Research

Figure 17: High-bandwidth switch chips and switches from the world's leading semiconductor manufacturers


Note: 1) "-" indicates not publicly disclosed; 2) the Jericho3-AI switch chip can reach 28.8Tbps in full-duplex mode; 3) NVIDIA's official website discloses only the switch configurations of the Quantum series, not the specific performance parameters of the switch chip

Source: Companies' official websites, TechWeb, Semiconductor Industry Watch, CICC Research

The gradual maturity and commercial implementation of 100G+ high-speed SerDes enabled the launch of 51.2T ultra-high-bandwidth switch chips on the technical side. Looking ahead, we believe SerDes will continue to evolve to 224G, which is expected to drive the launch of 102.4T and higher-bandwidth switch chips. According to the switch-chip upgrade roadmap in Yole's report, Yole expects switch chips with bandwidths of 102.4T and 204.8T to launch in volume in 2025 and 2027, respectively, with the single-lane SerDes rate of electrical interfaces breaking through from 112G to 224G. Broadcom, the global leader in Ethernet switch chips, said at its 3QFY23 earnings call that, to enable next-generation 1.6T Ethernet connectivity, it began developing the Tomahawk 6 switch chip (using 224G SerDes) in 2023, with throughput of more than 100Tb/s; given that Broadcom has previously upgraded its switch chips every 1.5-2 years, we expect the company's 100Tb/s+ switch chips to launch later in 2024.
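The 51.2T-to-102.4T step described above comes from doubling the per-lane SerDes rate while keeping the lane count fixed. Treating 224G PAM4 as roughly 200G of effective throughput (after encoding overhead) is a simplifying assumption:

```python
# Chip bandwidth = lane count x effective per-lane rate. The lane count
# stays at 512 while the SerDes rate doubles; "~200G effective for 224G
# PAM4" is a simplifying assumption for illustration.
lanes = 512
for serdes_effective_gbps in (100, 200):
    chip_tbps = lanes * serdes_effective_gbps / 1000
    print(f"{lanes} lanes x {serdes_effective_gbps}G -> {chip_tbps:.1f} Tbps")
# 512 lanes x 100G -> 51.2 Tbps
# 512 lanes x 200G -> 102.4 Tbps
```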

From the perspective of switch bandwidth and port rates, as AI back-end networks accelerate their migration to higher speeds, 51.2T switch chips will be further deployed in 2024, and switches with 400G+ port rates are expected to gain share in AI clusters. Dell'Oro predicts that by 2025 the penetration rate of 400/800G switches will reach 85%, and 1.6T switches are also expected to ramp gradually and become the mainstream port rate for data center switches by 2027.

Figure 18: Penetration and forecast of switches with different port rates in the backend network of AI clusters


Note: Includes Ethernet switches and InfiniBand switches

Source: Dell' Oro, CICC Research

PHY & SerDes Industry Chain Overview

From the perspective of the PHY & SerDes chip industry chain, the upstream mainly comprises PHY IP core and EDA vendors such as Synopsys and Cadence, while midstream chip designers include fabless and IDM vendors such as Broadcom, Marvell, Realtek, Texas Instruments, and Qualcomm.

Figure 19: Overview of the upstream and midstream of the PHY chip industry chain


Note: "~" indicates the company's PHY chip market share

Source: Yutai Micro prospectus, CICC Research

SerDes is delivered as IP and is an important driver of interface IP market growth. Interface IP continues to evolve toward higher speeds, and the high-end interface IP market has strong momentum. We believe the development of AIGC will place higher requirements on transmission bandwidth and latency, further driving upgrades of interface protocols such as PCIe, Ethernet, and storage, and the underlying SerDes technology will keep moving upmarket. According to IPnest's forecast, the IP markets for PCIe, DDR, Ethernet, and D2D interfaces will grow at a combined CAGR of about 27% from 2022 to 2026, with the high-end categories (PCIe 4.0 and above & CXL, advanced DDR, high-end Ethernet, and D2D) growing rapidly at a CAGR of up to 75% over the same period; IPnest expects the total market size of these four high-end interface IP categories to reach US$2.115bn by 2026.
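The implied arithmetic behind the IPnest forecast cited above can be back-calculated: a 75% CAGR over 2022-2026 compounds over four years. The 2022 base derived below is our own back-calculation for illustration, not an IPnest figure:

```python
# Back-calculating the implied 2022 base of the high-end interface IP
# market from the cited 2026 total (~US$2,115mn) and 75% CAGR.
# This is a back-calculation for illustration, not an IPnest figure.
cagr = 0.75
end_2026_musd = 2115
years = 4  # 2022 -> 2026

implied_2022_musd = end_2026_musd / (1 + cagr) ** years
print(f"Implied 2022 high-end interface IP market: ~US${implied_2022_musd:.0f}mn")
```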

Figure 20: 2021-2026 high-end interface IP market size forecast


Note: High-end Ethernet refers to PHYs based on 56G, 112G, and 224G SerDes

Source: IPnest, CICC Research

The IP core and EDA markets are dominated by overseas leaders with high concentration, and leading players' IP and EDA software businesses are synergistic. According to IPnest, the top three vendors in the global IP market in 2022 were Synopsys, ARM, and Cadence, with market shares of 30%, 25%, and 7%, respectively, and a CR3 exceeding 60%. According to TrendForce, the top three vendors in the global EDA market in 2021 were Synopsys, Cadence, and Siemens, with shares of 32%, 30%, and 13%, respectively.

Focus on SerDes IP: SerDes technology suppliers are concentrated in North America, and domestic manufacturers are accelerating their layout. There are currently two main types of SerDes vendors in the market. Third-party suppliers license SerDes IP to chip makers and charge licensing fees; the world's leading third-party SerDes vendors, such as Synopsys, Cadence, and Alphawave, are all North American companies, while the SerDes IP self-sufficiency rate in the domestic market remains low, with local manufacturers working to break through 112G SerDes technology. Meanwhile, chip makers such as Broadcom, Marvell, and Intel design SerDes IP for their own needs or for downstream customers, with strong customization attributes.

Focus on Ethernet PHY chips: the Ethernet PHY chip market is dominated by overseas manufacturers, and the level of localization is relatively low. According to the China Automotive Technology and Research Center, the global Ethernet PHY chip market is led by overseas manufacturers such as Broadcom (US), Marvell (US), Realtek (Taiwan, China), and Texas Instruments (US), with the top five vendors holding a combined share of more than 91% in 2020. The domestic market is broadly similar, with Realtek holding a relatively high share (28%) and the top five vendors together exceeding 87%. Domestic Ethernet PHY chip makers mainly include Yutai Micro and Jinglue Semiconductor (founded by a team from Marvell); domestic PHY chips currently hold a low market share and focus on low-speed products for automotive scenarios, and we believe the room for substitution in mid- and high-speed PHY chips remains broad.

Risk Warning

The development of AI models and applications falls short of expectations. As digital transformation and intelligent penetration increase across society, artificial intelligence continues to empower all industries. At the same time, AI relies on massive data for model training, which drives a significant increase in society-wide computing power demand and places high requirements on the transmission rate, scalability, and compatibility of communication hardware. If the development of large AI models and applications falls short of expectations, demand for upgrading D2D, C2C, B2B and other communications to ultra-high bandwidth may slow.

SerDes technology iteration falls short of expectations. High-speed SerDes design faces challenges such as reducing power consumption, noise, and interference, and relies on advanced process technology to optimize overall performance. If breakthroughs in these areas fall short of expectations, the iteration of single-lane SerDes transmission rates will slow, dragging on the overall pace of interface physical layer upgrades.

[1]https://www.uciexpress.org/post/ucie-announces-incorporation-and-new-board-members-at-fms-2022

[2]https://investors.broadcom.com/static-files/4378d14e-a52f-409f-9ae4-03d810bc7a6c

Article source:

This article is excerpted from: "Communication Technology 10-Year Outlook Series - 224G PHY Has Set Sail, Data Center Wired Communication Towards a New Journey" released on March 23, 2024

Analyst: Chen Hao, SAC License No.: S0080520120009 SFC CE Ref: BQS925

Contact: Zheng Xinyi SAC License Number: S0080122070103

Analyst: Shiwen Li SAC License No.: S0080521070008 SFC CE Ref: BRG963

Analyst: Peng Hu, SAC License No.: S0080521020001 SFC CE Ref: BRE806
