
Enhancing 5G/6G DU Performance and Workload Consolidation with NVIDIA Aerial CUDA-Accelerated RAN

Author: NVIDIA China

Aerial CUDA accelerates telecom workloads using CPUs, GPUs, and DPUs to deliver higher levels of spectral efficiency (SE) on cloud-native accelerated computing platforms.

The NVIDIA MGX system for Aerial is built on the advanced NVIDIA Grace Hopper Superchip and the NVIDIA BlueField-3 DPU, and is designed to accelerate the 5G wireless network end to end:

  • Virtualized RAN (vRAN) Distributed Unit (DU)
  • Centralized Unit (CU)
  • User Plane Function (UPF)
  • vRouter
  • Cybersecurity

This full-stack acceleration approach delivers leading performance and spectral efficiency while reducing total cost of ownership (TCO) and opening up new profitable opportunities for better return on assets (ROA). The Aerial CUDA-accelerated RAN software stack is available in the NVIDIA 6G Research Cloud platform.

Telcos have already invested billions in 4G/5G spectrum, and they are expected to invest again in 6G spectrum to meet the growing demands of mobile subscribers.

The ecosystem includes chipmakers, OEMs, and independent software vendors (ISVs) to provide solutions with different performance characteristics. These solutions are primarily based on specialized hardware, such as application-specific integrated circuits (ASICs) or system-on-chips (SoCs), to handle compute-intensive Layer 1 (L1) and Layer 2 (L2) functions.

The challenge is to balance the complexity of implementing algorithms in a RAN solution with the cost and power consumption of implementation.

Telcos want to be able to disaggregate the hardware and software of RAN workloads so that they can build networks on cloud infrastructure, opening up possibilities for software innovation, new differentiated services, control hardware lifecycle management, and improved total cost of ownership (TCO).

vRAN demonstrates the ability of a commercial off-the-shelf (COTS) platform to run RAN Distributed Unit (DU) workloads. However, due to the compute performance gap, fixed-function accelerators are required for certain workloads, such as Forward Error Correction (FEC).

In this article, we discuss the progress of Aerial CUDA-accelerated RAN for DU workload acceleration, detailing the algorithms used and their expected benefits, the underlying hardware, and its ability to consolidate telecom workloads such as the DU, centralized units (CUs), and core network functions, as well as host revenue-generating workloads using multi-tenancy capabilities. Finally, we explore the overall TCO and ROA benefits that telcos can expect to achieve.

Aerial CUDA-accelerated RAN

NVIDIA Aerial RAN combines Aerial software for 5G and AI frameworks with NVIDIA accelerated computing platforms to help telcos reduce TCO and monetize their infrastructure.

Aerial RAN has the following key features:

  • A software-defined, scalable, modular, highly programmable, and cloud-native framework that eliminates the need for any fixed function accelerators. It gives the ecosystem the flexibility to adopt the modules required for its commercial products.
  • Full-stack acceleration of DU L1, DU L2+, CU, UPF, and other network functions enables workload consolidation to maximize performance and spectral efficiency for superior system TCO.
  • A general-purpose infrastructure with multi-tenancy to support traditional workloads and advanced AI applications for superior RoA.

Full-stack acceleration

Full-stack acceleration is based on the following two pillars:

  • NVIDIA Aerial software, which accelerates the DU L1 and L2 functions
  • The NVIDIA accelerated computing platform, which enables the ecosystem to run and optimize workloads such as the CU or UPF on the platform and to consolidate them

Figure 1 shows that accelerating DU L1 and L2 is a key aspect of NVIDIA's full-stack acceleration.


Figure 1. The Aerial RAN stack

DU acceleration

Aerial has implemented advanced algorithms to improve the spectral efficiency of the RAN stack, covering DU L1 and L2.

The accelerated L1 and L2 capabilities described in this article are implemented through a general-purpose approach that leverages the parallel computing power of GPUs within an accelerated computing platform.

Figure 2 shows the MGX server platform hosting the accelerated L1 cuPHY and L2 MAC scheduler cuMAC on the same GPU instance, with the CPU hosting the L2+ stack. This demonstrates the power of GPU-based platforms to accelerate multiple compute-intensive workloads simultaneously.


Figure 2. cuPHY and cuMAC software architecture

L1 (cuPHY)

Aerial cuPHY is a 3GPP-compliant, GPU-accelerated, fully inline implementation of the data and control channels of the RAN physical layer L1. It offers an L1 high PHY library that provides unmatched scalability by leveraging the computing power and high parallelism of GPUs to handle the compute-intensive portions of L1. It supports standard multiple-input, multiple-output (sMIMO) and massive MIMO (mMIMO) configurations.

As a software implementation, it can be continuously enhanced and optimized; cuPHY has continued to deliver capacity gains over time on the NVIDIA AX800 accelerated platform and the new MGX platform.

Channel estimation in L1 is the foundational block in any wireless receiver, and an optimized channel estimator can significantly improve performance. Traditional channel estimation methods include least square (LS) or minimum mean square error (MMSE). A comparison of these methods is summarized in Table 1.


Table 1. Comparison of different channel estimation methods
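For reference, the baseline LS estimate, and a simple per-tone MMSE shrinkage of it, can be sketched in a few lines of NumPy. The QPSK pilots, noise level, and unit channel power below are illustrative assumptions for the sketch, not Aerial's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pilots = 64
noise_std = 0.1

# Illustrative QPSK pilot symbols (unit power) over a Rayleigh-fading channel
x = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, n_pilots)))
h = (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots)) / np.sqrt(2)
noise = noise_std * (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots))
y = h * x + noise                      # received pilot observations

h_ls = y / x                           # LS: invert the known pilot, tone by tone
sigma2 = 2 * noise_std**2              # complex noise variance
h_mmse = h_ls / (1.0 + sigma2)         # per-tone MMSE shrinkage (unit channel power)

mse_ls = np.mean(np.abs(h_ls - h) ** 2)
mse_mmse = np.mean(np.abs(h_mmse - h) ** 2)
```

LS passes the noise straight through (its MSE equals the noise variance), which is why smoothing or shrinkage, as in MMSE and richer estimators, is where the gains come from.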

NVIDIA has enhanced cuPHY L1 with a new channel estimator that outperforms the methods listed in Table 1. This implementation uses a reproducing kernel Hilbert space (RKHS) channel estimation algorithm.

RKHS L1 channel estimation

RKHS channel estimation focuses on the meaningful part of the time-domain channel impulse response (CIR), limiting unwanted noise and amplifying the relevant part of the impulse response (Figure 3).


Figure 3. RKHS Channel Estimation Method

RKHS channel estimation in principle requires solving an infinite-dimensional convex optimization problem. The RKHS framework converts this infinite-dimensional problem into a finite-dimensional convex problem without any loss of performance.
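As an illustration of that reduction, a generic kernel ridge estimator in an RKHS can be fit against the pilot subcarriers: by the representer theorem, the infinite-dimensional fit collapses to solving a linear system whose size is just the number of pilots. The Gaussian kernel, length scale, and synthetic channel below are assumptions for the sketch, not the Aerial algorithm:

```python
import numpy as np

def rbf_kernel(a, b, ell=8.0):
    # Gaussian (RBF) kernel over subcarrier indices; ell controls smoothness
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(1)
n_sc = 128
pilot_idx = np.arange(0, n_sc, 4).astype(float)   # pilot on every 4th subcarrier

# Synthetic frequency-selective channel: 4 time-domain taps, zero-padded FFT
taps = (rng.standard_normal(4) + 1j * rng.standard_normal(4)) / np.sqrt(8)
h_true = np.fft.fft(taps, n_sc)
noise = 0.05 * (rng.standard_normal(pilot_idx.size) + 1j * rng.standard_normal(pilot_idx.size))
y = h_true[pilot_idx.astype(int)] + noise          # noisy observations at pilots

# Representer theorem: solve (K + lam*I) alpha = y over pilot locations only
lam = 0.05**2
K = rbf_kernel(pilot_idx, pilot_idx)
alpha = np.linalg.solve(K + lam * np.eye(pilot_idx.size), y)

all_idx = np.arange(n_sc, dtype=float)
h_hat = rbf_kernel(all_idx, pilot_idx) @ alpha     # interpolate to all subcarriers
mse = np.mean(np.abs(h_hat - h_true) ** 2)
```

The per-estimate cost is dominated by dense linear algebra over the pilot Gram matrix, which is exactly the kind of work that parallelizes well on a GPU.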

RKHS is computationally intensive, making it ideal for parallel processing on GPUs. Table 2 summarizes the RKHS gain and computational requirements for sMIMO and mMIMO configurations.


Table 2. Summary of RKHS channel estimation benefits and implementation requirements

The CIR calculated by RKHS (Figure 4) is very close to the actual channel (measured in a simulated environment) for the tapped delay line (TDL)-C channel model with four antennas and two UL layers.


Figure 4. RKHS channel estimation improvements

Across a range of modulation and coding schemes (MCS), the improved CIR significantly improves the bit error rate (BER) versus signal-to-noise ratio (SNR) curve. Figure 5 shows the advantage of RKHS over MMSE (with two different window lengths, 1 µs and 2.3 µs), providing up to 2.5 dB of gain at MCS 15.


Figure 5. RKHS channel estimation dB gain
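The 2.5 dB figure can be read as a link-budget saving: the same BER is reached at an SNR lower by that amount. A sketch of the conversion:

```latex
% dB gain between two estimators at equal BER
G_{\mathrm{dB}} = 10\log_{10}\!\left(\frac{\mathrm{SNR}_{\mathrm{MMSE}}}{\mathrm{SNR}_{\mathrm{RKHS}}}\right)
\quad\Rightarrow\quad
\frac{\mathrm{SNR}_{\mathrm{MMSE}}}{\mathrm{SNR}_{\mathrm{RKHS}}} = 10^{2.5/10} \approx 1.78
```

In other words, roughly 1.78x less received signal power is needed for the same error rate, headroom that can instead be spent on a higher MCS.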

L2 (cuMAC)

The L2 MAC scheduler in the RAN stack plays an important role in determining how the UE accesses radio resources. This, in turn, determines the spectral efficiency of the entire network.

For 5G systems, there are many degrees of freedom, including:

  • Transmission Time Interval (TTI) slot
  • Allocated Physical Resource Blocks (PRBs)
  • MCS
  • MIMO layer selection

A typical scheduler optimizes a single cell at a time, which limits the achievable performance. Table 3 shows a comparison of typical scheduler approaches.


Table 3. Comparison of typical scheduler methods

At NVIDIA, we implemented a multi-cell scheduler using a proportional fairness (PF) algorithm that outperforms the two methods listed in Table 3.

Multi-cell scheduler

The NVIDIA multi-cell scheduler significantly improves wireless performance by jointly optimizing the scheduling parameters (TTI, PRB, MCS, and MIMO layers) across a large number of adjacent cells (Figure 6).


Figure 6. Multi-cell scheduler approach

Multi-cell scheduling with the PF algorithm requires complex computational logic to account for the many variables across all cells, making it ideal for the massively parallel processing capabilities of GPUs. Table 4 summarizes the benefits and computational requirements for sMIMO and mMIMO (20 cells scheduled jointly). As shown, the CPU compute requirements are high.


Table 4. Summary of multi-cell scheduler benefits and implementation requirements
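The per-TTI PF decision can be sketched in a few lines. The UE counts, fading model, and EWMA constant below are illustrative, and this single-cell greedy loop is a simplification; a real multi-cell scheduler jointly optimizes across cells rather than per PRB:

```python
import numpy as np

rng = np.random.default_rng(2)
n_ues, n_prbs, n_ttis = 8, 16, 200
beta = 0.05                        # EWMA forgetting factor for average throughput
avg_tput = np.full(n_ues, 1e-6)    # tiny init avoids divide-by-zero at start
served = np.zeros(n_ues)           # cumulative throughput per UE

for _ in range(n_ttis):
    # Instantaneous achievable rate per UE per PRB under independent fading
    rates = rng.exponential(1.0, size=(n_ues, n_prbs))
    tti_tput = np.zeros(n_ues)
    for prb in range(n_prbs):
        # PF metric: instantaneous rate divided by long-run average throughput
        pf_metric = rates[:, prb] / avg_tput
        ue = int(np.argmax(pf_metric))
        tti_tput[ue] += rates[ue, prb]
    served += tti_tput
    avg_tput = (1 - beta) * avg_tput + beta * tti_tput
```

Even this toy version shows the tension PF resolves: UEs with momentarily good channels are favored, but a UE starved of throughput sees its PF metric grow until it is scheduled, so every UE ends up served.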

Figure 7 shows the spectral efficiency of 20 100 MHz 4T4R 4DL/2UL cells, each with 500 active UEs and 16 UEs per TTI.


Figure 7. Multi-cell scheduler gain

DU comprehensive acceleration and improvement

All in all, RKHS channel estimation supports higher MCS allocations per UE, while the multi-cell scheduler represents a major leap forward in radio resource scheduling. Both approaches significantly improve spectral efficiency and are optimally implemented on GPUs.

For example, for a 6-cell 100 MHz 64T64R system, achieving more than a 2x SE gain on CPUs would require about 240 cores (about 8 x 32-core CPUs), that is, additional CPU servers. In contrast, the GPU implementation hosts both L1 PHY processing and the L2 scheduler on a single GPU in a single server.

Workload consolidation

As mentioned earlier, the second pillar of full-stack acceleration is to consolidate multiple workloads and accelerate them on Aerial RAN. This is achieved by leveraging the available compute resources of GPUs, CPUs, and DPUs in the NVIDIA accelerated computing platform.

For telecom workloads, MGX systems offer a modular and scalable architecture for data centers. The system provides the computing power needed to integrate functions such as the RAN CU, RAN Intelligent Controller (RIC) application, and core functions such as UPF.

The NVIDIA Grace Hopper Superchip combines the NVIDIA Grace and NVIDIA Hopper architectures, using NVIDIA NVLink-C2C to provide a coherent CPU+GPU memory model for 5G and AI applications.

CUs can take advantage of many Grace CPU cores. RIC applications, such as xApps, which often incorporate AI/ML techniques to improve spectral efficiency, can be accelerated on the GPU.

As we move further into the network, features such as UPF can benefit from DPU acceleration through the use of key DPU features:

  • GTP encapsulation and decapsulation
  • Stream Hashing and Receiver-Side Scaling (RSS)
  • Deep Packet Inspection (DPI)
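As a rough illustration of the flow-hashing idea behind RSS, the sketch below maps a flow 4-tuple to a receive queue so all packets of one flow land on the same queue (and hence the same core). Real NICs and DPUs use a keyed Toeplitz hash in hardware; CRC32 here is a stand-in, and the addresses and ports are made up:

```python
import socket
import struct
import zlib

def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              n_queues: int = 8) -> int:
    """Hash a flow 4-tuple to a receive-queue index (CRC32 as a toy hash)."""
    key = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
           + struct.pack("!HH", src_port, dst_port))
    return zlib.crc32(key) % n_queues

# All packets of one flow hash to the same queue (port 2152 is GTP-U)
q1 = rss_queue("10.0.0.1", "10.0.0.2", 2152, 2152)
q2 = rss_queue("10.0.0.1", "10.0.0.2", 2152, 2152)
```

The determinism is the point: per-flow ordering is preserved while different flows spread across queues, which is what lets a UPF scale packet processing across cores.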

Workload consolidation enables telcos to minimize the number of servers deployed in the data center, improving overall TCO.


Figure 8. NVIDIA MGX for workload consolidation

Multi-tenant Aerial RAN

Telcos need a platform that can meet the demanding performance and reliability requirements of telco workloads, with the ability to host different types of telco workloads, from RAN to core, on a common platform.

The telecom RAN infrastructure is significantly underutilized. With a multi-tenant cloud infrastructure, telcos can increase utilization with profitable applications when spare capacity is available.

The types of workloads that can provide monetization opportunities for telcos include generative AI and multi-access edge computing (MEC) applications based on large language models (LLMs). These types of workloads are driving unprecedented computing demands in distributed telco edge data centers.

Supporting a large number of LLM-based applications at the edge drives a significant increase in edge GPU servers and in MEC applications dedicated to LLM inference.

Figure 9 shows the MGX platform, which hosts all workloads and helps telcos overcome underutilization of computing resources, reduce their overall energy footprint, and improve the monetization of their infrastructure.


Figure 9. NVIDIA MGX shares AI and telecom infrastructure

Benefits of Aerial CUDA-accelerated RAN

So far, we've discussed how NVIDIA Aerial software can help improve overall spectral efficiency, and how accelerated computing platforms can provide processing power to consolidate multiple workloads on the same platform.

A multi-tenant platform enables the monetization of AI workloads. A 5-year TCO analysis shows that if the platform is available for AI workloads approximately 30% of the time, the resulting revenue significantly offsets the cost of the platform, assuming typical hourly GPU pricing. This ROA has a significant impact on the performance-per-dollar metric compared to CPU-only systems.

The analysis shows that GPUs, with AI revenue factored in, deliver a 4.1x performance-per-cost improvement compared to x86 CPUs.

Conclusion

All in all, Aerial RAN delivers superior TCO and unlocks new revenue opportunities to maximize return on assets (ROA).

NVIDIA is transforming telecommunications infrastructure with a platform built on NVIDIA accelerated computing and powered by Aerial software. Aerial CUDA-accelerated RAN meets telcos' need to deliver market-leading wireless capacity in a TCO-efficient manner and enables them to monetize deployed infrastructure in ways that are not possible today.

In this article, we detailed the spectral efficiency gains achieved in L1 and L2 using the new algorithms, and discussed the platform's ability to consolidate RAN with LLM-based AI workloads. The next generation of NVIDIA platforms will further improve these key metrics by delivering higher cell densities and greater workload acceleration.

Aerial CUDA-accelerated RAN is available as part of the NVIDIA 6G Research Cloud platform. For more information on access, see NVIDIA Aerial.
