
Scale-out as the better path for high-performance computing, and a bright future for general-purpose interconnect technology: an interview with the CEO of Kiwimoore

Author: Niuhua Net

From the chatbot ChatGPT to the text-to-video model Sora, the vigorous development of AI models has posed continual challenges to algorithms, high-quality data, and computing infrastructure. "When enterprises grow their clusters through scale-out, they need to connect the data center from the micro to the macro level, point to point, enhancing interconnect performance at every layer so that computing resources are truly and effectively used," said Tian Mochen, founder and CEO of Kiwimoore, in an interview with Elecfans.


With the slowdown of Moore's Law, improving the performance of single-processor systems through scale-up has run into difficulties such as long pipelines, high latency, and hard-to-route wiring. As a continuation of scale-up, scale-out introduces large-scale interconnection at the physical level, making "compute plus interconnect" the new starting point for raising computing power. Research firm IPnest predicts that in 2025 the market share of inter-chip interconnect interface IP will surpass that of processor IP, making it the No. 1 IP category. So what will intra-chip, inter-chip, and inter-network interconnect technology look like in the future, and how will high-performance computing systems evolve? On these topics, we interviewed Tian Mochen, CEO of Kiwimoore, a representative enterprise in the field of interconnect technology.

On-chip interconnect: from dedicated to general-purpose

In theory, a die can be treated as a fixed module and reused across different products and generations. In the build-out of intelligent computing clusters, the advantages of interconnect dies, represented by the IO Die, in improving yield and reducing manufacturing complexity and cost have become an industry consensus. AMD's Zen series and Intel's Clearwater Forest flagship data center processors are typical examples.


Intel Clearwater Forest 2

Tian Mochen believes that on-chip interconnect technology represented by the IO Die shows two major trends: chipletization and 3D integration. Chipletization improves architectural flexibility and reduces a chip's dependence on advanced process nodes, while 3D integration further raises interconnect density through the vertical dimension.

At present, the IO Die market is dominated by large vendors such as AMD and Intel, but their proprietary protocols are incompatible with chips from other sources, and the closed ecosystem of dedicated IO Dies has become a constraint on their development. Driven by huge demand, general-purpose IO Dies are beginning to emerge. Take Kiwimoore's general-purpose interconnect chiplet, the Kiwi IO Die, as an example: it integrates a large number of memory and interconnect interfaces such as D2D, DDR, PCIe, and CXL, supports 10+ chiplets, and can be used to build computing platforms with up to a 192-core CPU or a 1000T-class GPU.


Kiwimoore's general-purpose interconnect chiplet, the Kiwi IO Die

At the same time, benefiting from advances in advanced packaging, the IO Die is also undergoing a structural shift from 2.5D to 3D. The Base Die can be regarded as the 3D form of the IO Die: it allows different compute and memory dies to be stacked or placed side by side, significantly increasing transistor integration per unit of chip area and bringing higher bandwidth, lower latency, and lower power consumption.

The market situation for the Base Die resembles that of the IO Die: although dedicated products have demonstrated commercial value, the technology has not proliferated and remains monopolized by a few leading companies. With the efforts of innovative companies such as Kiwimoore, a general-purpose Base Die market is taking shape. According to Tian Mochen, Kiwimoore's general-purpose interconnect base, the Kiwi 3D Base Die, is the world's first 3D high-performance general-purpose base die, achieving breakthroughs in bandwidth, energy efficiency, and the number of dies carried: it delivers 8 times the interconnect density of a 2.5D structure at 20% of the power consumption, and supports stacking of up to 16 compute dies.
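Taken at face value, the two quoted figures compound. A quick arithmetic check (illustrative only, using nothing beyond the 8x density and 20% power numbers from the interview) shows what they imply for interconnect density per watt:

```python
# Back-of-the-envelope comparison of the quoted 3D figures against a 2.5D
# baseline: 8x interconnect density at 20% of the power consumption.
DENSITY_GAIN = 8.0   # interconnect density, 3D relative to 2.5D
POWER_PCT = 20       # power consumption of 3D as a percent of 2.5D

# Density per watt improves by the density gain divided by the power ratio.
density_per_watt_gain = DENSITY_GAIN * 100 / POWER_PCT
print(density_per_watt_gain)  # 40.0 -> 40x interconnect density per watt
```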


Kiwimoore's general-purpose interconnect base, the Kiwi 3D Base Die

The IO Die and the Base Die are just two typical examples of how, amid the wave of large intelligent computing centers and scale-out, on-chip interconnect technology can generate more computing power between compute and memory. Beyond on-chip interconnect, there are many other routes to connecting more data with higher quality at lower cost, such as breakthroughs, from single points to across the board, in inter-chip and inter-network interconnect technology.

Inter-chip interconnect in need of acceleration: the D2D interface

Like on-chip interconnect, inter-chip interconnect technology must accelerate to keep pace with the rapid growth of computing demand. Die-to-die (D2D) technology, built on chiplets, brings a more efficient way to connect compute and memory, integrating compute and memory dies at the interconnect level into what amounts to an SoC-class chip.

"Compared with the traditional interconnect between compute and memory chips, D2D provides a more efficient, lower-latency connection and is the foundation on which chiplets are implemented," Tian Mochen explained. D2D shortens the physical distance of data transmission, reducing latency and increasing processing speed. As the basis of advanced packaging, it enables seamless connection of compute and memory units, further improving performance and reducing power consumption. Building on D2D, enterprises can more flexibly configure multi-module combinations of compute and memory units, improving system scalability and flexibility while reducing maintenance costs. These advantages give the D2D interface a key role in the scale-out construction of high-performance clusters.

Like the IO Die, D2D also needs vigorous generalization. Based on the UCIe standard, Kiwimoore has launched some of the world's first Die2Die IP supporting UCIe v1.1, with interconnect speeds of up to 32 GT/s, latency as low as a few ns, and full support for mainstream protocols such as UCIe, CXL, and Streaming, in plug-and-play fashion. Tian Mochen said that all Kiwimoore products are built on international standard protocols, with the aim of making every product interoperable and forming an open chiplet ecosystem.
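For a sense of scale, the quoted 32 GT/s per-lane rate can be turned into raw module bandwidth. The sketch below is a rough estimate under the UCIe module definitions (x16 lanes for a standard-package module, x64 for an advanced-package module); the lane counts come from the spec, not the interview, and real usable bandwidth is lower once encoding and protocol overheads are subtracted.

```python
# Rough raw-bandwidth estimate for a UCIe-style D2D link at the quoted
# 32 GT/s data rate; one bit crosses each lane per transfer.
GT_PER_LANE = 32  # transfers per second per lane, in billions (32 GT/s)

def raw_gbytes_per_s(lanes: int) -> float:
    """Raw unidirectional bandwidth in GB/s for a module with `lanes` lanes."""
    return lanes * GT_PER_LANE / 8  # bits -> bytes

print(raw_gbytes_per_s(16))  # 64.0 GB/s, standard-package module (x16)
print(raw_gbytes_per_s(64))  # 256.0 GB/s, advanced-package module (x64)
```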


Kiwimoore's high-speed interconnect interface, the Kiwi Die2Die IP

RISC-V + Chiplet: 1 + 1 > 2

Today, alongside chiplets, the RISC-V architecture is also making great strides toward high-performance computing. In the edge computing market, traditional general-purpose MCUs/MPUs/CPUs struggle to satisfy diverse application scenarios and performance requirements, and RISC-V brings a better PPA trade-off. RISC-V is an open standard by nature, so its push into the high-performance computing market is inevitable, and combining it with chiplets is widely expected to unlock 1+1>2 innovation momentum there. This is also the deeper motivation for the cooperation between Ventana, a leader in high-performance RISC-V processors, and Kiwimoore.

According to Balaji Baktha, founder and CEO of Ventana, the two companies have teamed up to create a scalable processor architecture that combines multiple Ventana Veyron V2 chiplets with Kiwimoore's IO Die into SoCs in different configurations. Tian Mochen regards the combination of the V2 and the Kiwimoore IO Die as a successful case of RISC-V and chiplet integration in high-performance computing.


"RISC-V is open-source, open, flexible, and highly customizable, and it defines a variety of instruction-set extensions for task acceleration, such as vector computing and encryption/decryption. It delivers high computing performance, while its simplicity keeps chip power consumption down. Chiplets are an important part of the strategy for building next-generation semiconductor products, making it easy to build high-performance CPUs. Their composability allows users to combine compute, memory, and IO in the best proportions to create systems that are more efficient in performance, cost-effectiveness, workload fit, and more. Combining RISC-V's open architecture with the open hardware design of chiplets drives workflow efficiency in the data center and maximizes single-socket performance."

The reporter learned that Kiwimoore and Ventana have been working to push the combination of RISC-V and the IO Die to the forefront of next-generation computing architectures to improve the efficiency of data center services and workloads. Together they created a high-performance data-center-class RISC-V processor that combines the advantages of the RISC-V architecture with a modular chiplet design: each V2 unit contains 32 cores, scaling to a maximum of 192 cores, making it the world's first data-center-class RISC-V chiplet processor.
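The two quoted numbers, 32 cores per Veyron V2 chiplet and a 192-core maximum, pin down the topology. A minimal composition model of the chiplet-based SoC (illustrative; the 6-chiplet count is derived from those two figures and is not stated in the interview):

```python
# Composition model for a chiplet SoC built from the figures in the article:
# 32 cores per Veyron V2 compute chiplet, 192 cores maximum.
CORES_PER_V2 = 32
MAX_CORES = 192

chiplets_needed = MAX_CORES // CORES_PER_V2
print(chiplets_needed)  # 6 V2 compute chiplets around one central IO Die

def total_cores(n_chiplets: int) -> int:
    """Core count of a configuration with n V2 chiplets on the IO Die."""
    return n_chiplets * CORES_PER_V2

print([total_cores(n) for n in range(1, chiplets_needed + 1)])
# [32, 64, 96, 128, 160, 192]
```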

Looking back on the cooperation with Ventana, Tian Mochen noted that, in terms of technical interoperability, when interconnected through an IO Die all three major architectures (x86, ARM, and RISC-V) rely on large numbers of memory-access and external interfaces for heavy data transfer, reads, and scheduling. The convergence of RISC-V and chiplet technology further enhances the customizability of computing platforms, keeps high-performance computing from being locked into a single vendor's ecosystem, and enables enterprises to meet the architectural challenges of AGI, which is difficult to achieve with x86 and ARM chiplet designs.

For example, to enable chips from different sources to communicate, the two companies adopted a design with the IO Die at the center connecting the CPU chiplets, achieving ns-level low latency and efficient data transfer through Kiwi Fabric, so that from the workload's perspective the entire SoC behaves like a single, unified CPU.

In terms of performance, the two companies scale the design to up to 192 cores, meeting the high-performance benchmarks set by existing ISAs (x86/ARM) and ensuring that the processor microarchitecture delivers world-class performance. At the same time, all cores share high-performance cache and memory through a coherent interconnect.

For specific domains, the platform offers flexible hardware configuration options for various workloads by planning the overall ratio of compute chiplets, memory, and accelerators. End-to-end RAS is built into the CPU, all buses are protected by secure-boot and hierarchical verification, and side-channel attacks and other vulnerabilities are mitigated to secure both the CPU chiplet and the SoC as a whole.

From compute acceleration to network acceleration

From the perspective of industry dynamics, high-performance computing's transformation from scale-up to scale-out is comprehensive, spanning chip design, compute cards, and clusters. Put simply, the core change of scale-out is connectivity. Tian Mochen believes that behind the massive data-interaction challenge brought by scale-out lies the need to accelerate shifting the focus from compute to the network, and to optimize the three elements of bandwidth, efficiency, and workload.

In terms of cluster network transport protocols, the traditional TCP/IP stack suffers from heavy CPU load and high latency, making it hard to meet the strict throughput and latency requirements of high-performance computing. RDMA lets the network interface access remote memory directly, without involving the operating system kernel, making massively parallel computing clusters feasible. The cluster network protocol stack will therefore transition from TCP/IP to RDMA, turning a cluster into a single device at the network level.
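The kernel-bypass argument can be illustrated with a toy per-message latency model. All numbers below are placeholder assumptions chosen for illustration, not measurements; the point is only that RDMA removes the per-message kernel cost (syscalls, buffer copies, interrupts) while the wire time is common to both paths.

```python
# Toy per-message latency model contrasting a kernel TCP/IP path with a
# kernel-bypass RDMA path. All microsecond figures are illustrative
# placeholders, not measurements.
WIRE_US = 1.0           # time on the wire + NIC, identical for both paths
TCP_KERNEL_US = 5.0     # assumed: syscall + protocol stack + buffer copies
RDMA_OVERHEAD_US = 0.5  # assumed: user-space doorbell + NIC DMA setup

def tcp_latency(msgs: int) -> float:
    """Total latency (us) for `msgs` messages over the kernel TCP/IP path."""
    return msgs * (WIRE_US + TCP_KERNEL_US)

def rdma_latency(msgs: int) -> float:
    """Total latency (us) for `msgs` messages over the kernel-bypass path."""
    return msgs * (WIRE_US + RDMA_OVERHEAD_US)

print(tcp_latency(1000), rdma_latency(1000))  # 6000.0 1500.0 (microseconds)
```

Under these assumed costs the RDMA path wins by 4x; the real-world gap depends entirely on the per-message kernel overhead being bypassed.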

RDMA does not mandate a single complete protocol stack, so it has several branches. NVIDIA's Quantum InfiniBand, for example, is an ultra-low-latency, ultra-high-throughput proprietary network engine designed specifically for RDMA. The industry, however, needs a more general solution. RoCE, comparable to InfiniBand in performance, significantly reduces the cost of RDMA communication and is widely expected to break NVIDIA's technology monopoly in this field.

That is why Kiwimoore launched its high-performance Kiwi NDSA (Network Domain-Specific Accelerator) series. According to the company, the Kiwi NDSA has built-in RoCE v2 high-performance RDMA (Remote Direct Memory Access) support and dozens of offload/acceleration engines, and can serve as a standalone chip providing acceleration at different points in the system. The product family includes the NDSA-RN-F and the NDSA-RN. The former is among the world's first 200/400G high-performance FPGA RDMA NICs and will be available soon; the latter is the world's first RDMA NIC chiplet product supporting 800G bandwidth, and beyond the bandwidth upgrade it cuts latency to the ns level and supports ultra-large packets at the tens-of-GB scale.
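What makes RoCE v2 routable over ordinary Ethernet is its framing: an InfiniBand Base Transport Header (BTH) carried inside a UDP datagram addressed to the IANA-assigned port 4791. A minimal sketch of the 12-byte BTH layout (field values here are illustrative, not taken from any Kiwimoore product):

```python
import struct

# RoCE v2 = InfiniBand Base Transport Header (BTH) over UDP port 4791.
ROCEV2_UDP_DPORT = 4791

def pack_bth(opcode: int, pkey: int, dest_qp: int, psn: int) -> bytes:
    """Pack a 12-byte BTH: opcode, flags, P_Key, DestQP (24-bit), PSN (24-bit)."""
    flags = 0                   # SE / MigReq / PadCnt / TVer all zero here
    word1 = dest_qp & 0xFFFFFF  # 8 reserved bits + 24-bit destination QP
    word2 = psn & 0xFFFFFF      # ack-request bit + 7 reserved + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, word1, word2)

# Opcode 0x04 is the RC "SEND Only" operation in the IB transport spec.
bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12, psn=1)
print(len(bth), ROCEV2_UDP_DPORT)  # 12 4791
```

Because everything above the BTH is plain UDP/IP, any Ethernet switch or router can carry the traffic, which is precisely the generality advantage over native InfiniBand fabrics.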


Kiwimoore's high-performance network accelerator, the Kiwi NDSA

Epilogue

Against the backdrop of high-performance computing's comprehensive shift from scale-up to scale-out, interconnect technology has become the new lever for raising cluster computing power. Interconnect dies represented by the IO Die are accelerating both their generalization and the transition from 2.5D to 3D; traditional inter-chip transports such as PCIe are being replaced by low-latency, low-power D2D technology; and the cluster interconnect network is shifting from TCP/IP to RDMA, where general-purpose RDMA solutions will find broader opportunities. The convergence of interconnect technology with the RISC-V architecture can help enterprises better cope with the architectural changes brought by AGI and help users in the high-performance computing field achieve scale-out. In the future, connectivity will be a critical market for almost every enterprise in the high-performance computing value chain.