Modern GPUs for AI and HPC ship with a limited amount of on-package high-bandwidth memory (HBM), which caps their performance in AI and other memory-hungry workloads. A new technology promises to expand a GPU's memory capacity with additional memory attached over the PCIe bus, rather than limiting it to the HBM built into the package, and it even allows capacity expansion using SSDs. Panmnesia, a startup backed by South Korea's KAIST, has developed a low-latency CXL IP that can be used to expand GPU memory with CXL memory expanders.
The memory requirements of ever-larger AI training datasets are growing rapidly, so AI companies must either buy more GPUs, settle for less complex datasets, or fall back on CPU memory at a steep performance cost. CXL is a protocol that officially runs on top of a PCIe link and lets users attach more memory to a system over the PCIe bus, but the technology must be recognized by the ASIC and its subsystems: simply bolting on a CXL controller is not enough to make it work, especially on GPUs.
Panmnesia faced two obstacles in applying CXL to GPU memory scaling: GPUs lack the CXL logic fabric and subsystems needed to support DRAM and/or SSD endpoints, and their cache and memory subsystems recognize no expansion mechanism other than Unified Virtual Memory (UVM), which tends to be slow.
To solve this problem, Panmnesia developed a CXL 3.1-compliant root complex (RC) with multiple root ports (RPs) that support external memory over PCIe, plus a host bridge with a host-managed device memory (HDM) decoder that connects to the GPU's system bus. The HDM decoder manages the address ranges of system memory, so the GPU's memory subsystem "thinks" it is working with ordinary system memory when it is in fact addressing DRAM or NAND attached over PCIe. In other words, the GPU's memory pool can be expanded using DDR5 or SSDs.
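To make that routing role concrete, here is a minimal host-side sketch of the decision an HDM decoder performs: an address window is claimed for each CXL endpoint, and any physical address falling inside a window is forwarded to that device instead of local HBM. The struct fields, port numbering, and address layout are illustrative assumptions, not Panmnesia's actual design or RTL.

```cuda
// Toy model of HDM-decoder routing (illustrative only, not Panmnesia's design).
#include <cstdint>
#include <cstdio>

struct HdmDecoder {
    uint64_t base;   // start of the host physical address window the endpoint claims
    uint64_t size;   // window length
    int      port;   // root port behind which the expander sits (hypothetical field)
};

// Decide where a physical address is routed: a CXL root port, or local HBM.
int route(const HdmDecoder *dec, int n, uint64_t hpa) {
    for (int i = 0; i < n; ++i)
        if (hpa >= dec[i].base && hpa < dec[i].base + dec[i].size)
            return dec[i].port;   // access is forwarded as CXL.mem reads/writes
    return -1;                    // falls through to the GPU's own HBM
}

int main() {
    // Two expanders mapped above a notional 16 GiB of HBM (made-up layout).
    HdmDecoder decs[] = {
        { 16ull << 30,  64ull << 30, 0 },  // 64 GiB DDR5 expander on port 0
        { 80ull << 30, 512ull << 30, 1 },  // 512 GiB SSD-backed expander on port 1
    };
    printf("0x%llx -> port %d\n", 0x500000000ull, route(decs, 2, 0x500000000ull)); // 20 GiB: port 0
    printf("0x%llx -> port %d\n", 0x100000000ull, route(decs, 2, 0x100000000ull)); //  4 GiB: HBM (-1)
    return 0;
}
```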
According to Panmnesia, this custom solution, labeled CXL-Opt, has been extensively tested and achieves double-digit-nanosecond round-trip latency (versus roughly 250 nanoseconds for the CXL-Proto developed by Samsung and Meta, as shown in the company's chart), including the time required for protocol conversion between standard memory requests and CXL flit transfers. It has also been successfully integrated into hardware RTL for memory expanders and GPU/CPU prototypes, demonstrating compatibility with a wide range of computing hardware.
According to Panmnesia's tests, UVM performed worst across all tested GPU kernels, due to the overhead of host runtime intervention on page faults and of transferring data at page granularity, which often exceeds what the GPU actually needs. CXL eliminates both issues by letting the GPU access the expanded memory directly via load/store instructions.
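The distinction is easiest to see with two access paths CUDA already exposes publicly. The sketch below contrasts UVM (cudaMallocManaged, where a GPU touch of a non-resident page raises a fault serviced by the host runtime) with zero-copy pinned host memory (cudaHostAlloc), where each load/store travels over the bus with no page-fault round trip; the zero-copy path is only an analogy for CXL.mem direct access, not Panmnesia's stack.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The same kernel serves both paths: a plain load/store access pattern.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // each access is an ordinary load/store
}

int main() {
    const int n = 1 << 20;

    // Path 1: UVM. Pages migrate on demand; GPU faults are serviced by the
    // host driver at page granularity, which is the overhead described above.
    float *uvm;
    cudaMallocManaged(&uvm, n * sizeof(float));
    for (int i = 0; i < n; ++i) uvm[i] = 1.0f;   // pages start resident on the host
    scale<<<(n + 255) / 256, 256>>>(uvm, n);     // faults trigger page migration
    cudaDeviceSynchronize();

    // Path 2: zero-copy pinned host memory, the closest public analog of
    // CXL-style direct access: loads/stores become individual bus
    // transactions, with no host-runtime fault handling in the loop.
    float *host, *dev;
    cudaHostAlloc((void **)&host, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev, host, 0);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(dev, n);     // direct per-access bus traffic
    cudaDeviceSynchronize();

    printf("uvm[0]=%f zero-copy[0]=%f\n", uvm[0], host[0]);
    cudaFree(uvm);
    cudaFreeHost(host);
    return 0;
}
```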
As a result, CXL-Proto cut kernel execution time by a factor of 1.94 versus UVM, and Panmnesia's CXL-Opt, whose optimized controller achieves double-digit-nanosecond latency and minimizes read/write overhead, reduced it by a further factor of 1.66. The same pattern appears in a second chart showing the IPC values recorded while running the GPU kernels: Panmnesia's CXL-Opt comes out 3.22 and 1.65 times faster than UVM and CXL-Proto, respectively, consistent with the two execution-time factors compounding (1.94 × 1.66 ≈ 3.22).
Overall, CXL support could bring real benefits to AI/HPC GPUs, but performance remains the big question mark. It also remains to be seen whether companies such as AMD and Nvidia will add CXL support to their GPUs at all. If PCIe-attached memory for GPUs does catch on, only time will tell whether the industry heavyweights license IP blocks from companies like Panmnesia or simply develop their own technology.