There is more than one way to solve the network I/O problem
In the past few months, AI applications such as OpenAI's ChatGPT, Google's Bard, and Baidu's Wenxin Yiyan have exploded in popularity. The exponential growth in the scale of AI models, and of the user bases they serve, has driven up demand for GPUs, CPUs, accelerators, memory, and storage.
Memory capacity and GPU performance are growing rapidly, but the network that bridges these resources is not keeping pace: I/O bandwidth lags accelerated-compute scaling by roughly two orders of magnitude. The result is stranded, underutilized resources, with expensive GPUs and other accelerators sitting idle while they wait for data.
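To make the gap concrete, here is a back-of-envelope model in Python; the bandwidth and compute figures are illustrative assumptions, not measured vendor specifications:

```python
# Back-of-envelope model: how network I/O caps GPU utilization.
# All figures are illustrative assumptions, not vendor specifications.

gpu_ingest_tbps = 24.0    # rate at which a GPU could consume data (assumed)
nic_bandwidth_tbps = 0.4  # a 400 GbE NIC feeding that GPU

# If every byte the GPU processes must first cross the network,
# utilization is bounded by the I/O-to-compute bandwidth ratio.
utilization_cap = min(1.0, nic_bandwidth_tbps / gpu_ingest_tbps)
print(f"I/O-bound utilization cap: {utilization_cap:.1%}")  # ~1.7%
```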
Network I/O performance can't keep up with GPU compute performance
While other companies, including industry giant Nvidia, have attacked this networking problem with proprietary interconnects, networking-chip startup Enfabrica has gone the other way, choosing to scale by combining industry standards such as PCIe and CXL with open-source software frameworks.
Enfabrica has introduced new Accelerated Compute Fabric (ACF) chips, optimized for AI and accelerated-computing workloads and designed to provide scalable, streaming, multi-terabit-per-second data movement between GPUs, CPUs, accelerator ASICs, memory, and networking devices while reducing the total cost of cloud networking.
The advent of ACF runs counter to the increasingly prevalent practice of putting intelligence inside the switch itself or the network interface card, and it may even reduce the need for DPUs/IPUs.
A startup with a star-studded team
Enfabrica was founded in 2020. Although the company is young, its founding team has deep industry roots.
Star team
- CEO Rochan Sankar was previously Broadcom's Director of Product Marketing and Management, where he drove five generations of "Trident" and "Tomahawk" data center switch ASICs.
- Chief Development Officer Shrijeet Mukherjee has worked for Cisco, Cumulus Networks, Google, and others.
- Mike Jorda, director of chip design, was responsible for data center chip design at Broadcom for 21 years.
- Michael Goldflam, Director of Systems Testing, was responsible for switching software at Broadcom for 15 years.
- Carlo Contavalli, VP of Software Engineering, was responsible for software engineering at Google for 12 years.
- Chief architect Thomas Norrie was responsible for hardware at Google for 12 years.
- Chip architect Gavin Starks was the CTO of Netronome Systems, a smart NIC company.
The company's founding advisor is Christos Kozyrakis, a professor of electrical engineering and computer science at Stanford University and a principal at MAST, who has done research at organizations such as Google and Intel. Another high-profile advisor is Albert Greenberg, currently Uber's vice president of platform engineering; he led Azure Networking at Microsoft for more than a decade and, before that, was a networking specialist at AT&T Bell Labs. Rachit Agarwal, an associate professor at Cornell University with expertise in large-scale data analytics, is also an advisor to Enfabrica.
As the staffing shows, this team not only understands the data center but also knows how to bring a product to market.
ACF-S
Enfabrica's ACF device enables composable AI fabrics of compute, memory, and network resources that can scale from a single system to tens of thousands of nodes, and it provides uncontested access to more than 50× the DRAM of existing GPU networks via Compute Express Link (CXL) bridging.
Collapsing multiple network tiers to improve performance
At the heart of Enfabrica's design is the goal of replacing multi-tier network infrastructure with its accelerated compute fabric. Sankar explained that the Enfabrica architecture "acts as a hub-and-spoke model" that can "break down and scale arbitrary computing resources": "whether it's a CPU, GPU, accelerator, memory, or flash, they can all be connected to this hub, effectively acting as an aggregate I/O fabric device."
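As a rough illustration of what collapsing those tiers means, the toy Python model below counts device hops on a conventional path versus a single-hub path; the device names and counts are assumptions for illustration, not Enfabrica's documented datapath:

```python
# Toy model of tier collapse: count device-to-device hops.
# Device lists are illustrative assumptions, not Enfabrica's datapath.

def hops(path):
    """Number of device-to-device transitions data must cross."""
    return len(path) - 1

# Conventional server: GPU traffic crosses several discrete devices
# before reaching another node's memory.
conventional = ["GPU", "PCIe switch", "RDMA NIC", "ToR switch",
                "RDMA NIC", "PCIe switch", "remote memory"]

# Hub-and-spoke: any two resources are one hub apart.
hub_and_spoke = ["GPU", "ACF-S hub", "remote memory"]

print("conventional hops:", hops(conventional))    # 6
print("hub-and-spoke hops:", hops(hub_and_spoke))  # 2
```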
Enfabrica's first chip, the ACF-S, is manufactured on TSMC's 5 nm FinFET process and uses fully standards-based hardware and software interfaces, including multi-port 800 GbE networking and high-performance PCIe Gen5 and CXL 2.0+ interfaces.
Enfabrica's first-generation multi-Tbps fabric silicon IC architecture
Without changing the physical interfaces, protocols, or software layers above the device driver, the ACF-S delivers multi-terabit switching and bridging between heterogeneous compute and memory resources in a single piece of silicon, while sharply reducing the device count, I/O latency hops, and power that AI clusters currently spend on top-of-rack network switches, RDMA-over-Ethernet NICs, InfiniBand HCAs, PCIe/CXL switches, and CPU-attached DRAM. Sankar describes the chip as a "sandwich": "a high-performance Ethernet switching pipe, a large shared buffer called a terabit NIC replication engine, and high-performance PCIe Gen5 and CXL 2.0+ switching."
The chart below compares the ACF system with NVIDIA's DGX-H100 system and Meta's Grand Teton AI server. Enfabrica says the ACF system offers cost, scale, and performance advantages over both.
In summary, the benefits of Enfabrica's ACF device include:
- Scalable, streaming, multi-terabit-per-second data movement between GPUs, CPUs, accelerators, memory, and network devices.
- 100% standards-based hardware and software interfaces.
- Elimination of the latency tiers and interface bottlenecks found in today's top-of-rack network switches, server NICs, PCIe switches, and CPU-controlled DRAM.
- Composable AI fabrics for compute, memory, and network resources, from a single system to tens of thousands of nodes.
- Uncontested access to more than 50× the DRAM of existing GPU networks via Compute Express Link (CXL) bridging (rough arithmetic sketched below).
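The arithmetic behind a ">50×" figure is easy to reconstruct under assumed capacities; the part sizes below are typical commodity numbers, not Enfabrica's published configuration:

```python
# Rough reconstruction of a ">50x DRAM" expansion factor.
# Capacities are assumed typical parts, not a published configuration.

hbm_per_gpu_gb = 80        # one H100-class GPU's on-package HBM (assumed)
gpus_per_node = 8
local_hbm_gb = hbm_per_gpu_gb * gpus_per_node    # 640 GB visible today

cxl_module_gb = 256        # one CXL memory expander module (assumed)
cxl_modules = 128          # modules reachable via CXL bridging (assumed)
pooled_dram_gb = cxl_module_gb * cxl_modules     # 32,768 GB of far DRAM

print(f"expansion factor: {pooled_dram_gb / local_hbm_gb:.0f}x")  # 51x
```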
No need to wait for CXL 3.0 to expand and share memory
Currently, the CXL hardware ecosystem is still immature, and CXL 3.x components, including CPUs, GPUs, switches, and memory expanders, are still under development. The CXL 3.0 protocol can provide true memory sharing for systems that mix near and far memory, but CXL 3.0 components are not expected to deliver a true memory pool until 2027.
According to reports, ACF can expand and pool memory for sharing across compute engines without waiting for the PCI-Express 6.0 interconnect or the CXL 3.0 protocol.
Enfabrica says that ACF uses standard interfaces, requires no changes to the application, compute, storage, or networking elements of the AI/ML stack, provides access to disaggregated memory ahead of CXL 3.0, and will support CXL 3.0 in the future without breaking the standard.
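The distinction the article draws between pooling (available before CXL 3.0) and true sharing (a CXL 3.0 feature) can be sketched as follows; this is a toy model only, since real CXL memory regions are assigned by a fabric manager, not application code:

```python
# Toy model of CXL pooling vs. sharing. Illustration only; real CXL
# memory regions are assigned by a fabric manager, not Python objects.

class PooledMemory:
    """CXL 2.0-style pooling: each region belongs to exactly one host
    at a time. Hand-off requires an explicit release."""

    def __init__(self, regions):
        self.owner = {region: None for region in regions}

    def allocate(self, region, host):
        if self.owner[region] is not None:
            raise RuntimeError(f"{region} is owned by {self.owner[region]}")
        self.owner[region] = host

    def release(self, region):
        self.owner[region] = None

pool = PooledMemory(["dram0", "dram1"])
pool.allocate("dram0", "node-A")   # exclusive assignment: fine
pool.release("dram0")
pool.allocate("dram0", "node-B")   # re-assignment after release: fine

# True sharing -- node-A and node-B mapping "dram0" simultaneously,
# with hardware coherency -- is what CXL 3.0 adds on top of this.
```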
Storage hierarchy diagram
It is not yet known how ACF chips will deliver CXL 3.0-style memory pooling and sharing in the future.
$20 billion market
It is reported that the entire data center market will reach $2 trillion by 2033 (see figure below). According to 650 Group, data center spending on high-performance I/O chips for compute, storage, and networking is expected to double to more than $20 billion by 2027.
The entire data center market will reach $2 trillion in the next decade
According to Enfabrica, applying ACF solutions with CXL memory to generative AI workloads allows user contexts to be dispatched dynamically to GPUs with massive parallelism. Simulation tests have shown that ACF-based systems achieve the same target inference performance using only half the number of GPUs and CPU hosts of the latest "big iron" GPU servers on the market.
In addition, the ACF-S chip enables customers to reduce the GPU compute cost of large language model (LLM) inference by approximately 50%, and of deep learning recommendation model (DLRM) inference by 75%, at the same performance point.
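Those percentages follow directly from the GPU-count reductions; a quick Python sanity check, with a placeholder GPU price (an assumption, not a quoted figure):

```python
# Sanity check of the claimed cost cuts. The GPU price is a placeholder
# assumption; the 2x and 4x GPU-count reductions come from the article.

gpu_price_usd = 30_000

def gpu_compute_cost(num_gpus):
    return num_gpus * gpu_price_usd

baseline = gpu_compute_cost(8)      # assumed baseline inference node
acf_llm = gpu_compute_cost(4)       # half the GPUs, same LLM performance
acf_dlrm = gpu_compute_cost(2)      # a quarter of the GPUs for DLRM

print(f"LLM inference cost cut:  {1 - acf_llm / baseline:.0%}")   # 50%
print(f"DLRM inference cost cut: {1 - acf_dlrm / baseline:.0%}")  # 75%
```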
Enfabrica says the target market for ACF chips is public and private cloud operators, along with HPC and network system builders. The chip lets customers remove existing interconnect components, freeing rack space and reducing component complexity. Link speeds also improve significantly, yielding higher accelerator utilization, shorter AI model training runs, and lower costs.
But whether Enfabrica, for all its strength and star power, can win over the market is something only time will tell.