
El Capitan shakes up the supercomputing market: what was announced today, and why is it so important?

Author: Codename 8

Last fall, HPE acquired Cray for $1.6 billion.

Today, HPE announced that the U.S. Department of Energy has selected it to build its latest exascale supercomputer. The El Capitan system, to be installed at Lawrence Livermore National Laboratory (LLNL), will cost about $600 million and is scheduled to go live in 2023.


The El Capitan supercomputer is built around next-generation "Genoa" AMD EPYC CPUs, each paired with four next-generation AMD Radeon GPUs, connected by the third-generation AMD Infinity Architecture.

Although details are scarce, I believe this tightly coupled four-GPUs-plus-CPU design may pose a real threat to NVIDIA.

Note, though, that AMD may be years away from catching up with NVIDIA's leadership in artificial intelligence and high-performance computing.


What was announced today?

Spokespeople for Lawrence Livermore National Laboratory, Cray, and AMD announced that El Capitan's performance (for both simulation and machine learning) will be used to help safeguard the U.S. nuclear arsenal, and that the system will exceed 2 exaflops, roughly 30 percent more than previously announced.

While none of the three parties revealed the node count or the core count of the "Genoa" CPUs, they did point out that the GPUs and CPU within a node are connected by AMD's memory-coherent Infinity Fabric, with the nodes themselves interconnected via Cray Slingshot.

Steve Scott, chief technology officer at HPE Cray, said El Capitan will deliver more than 10 times the performance of today's number-one supercomputer (ORNL's roughly 200-petaflop Summit, powered by IBM POWER CPUs and NVIDIA GPUs), and more than the top 200 supercomputers in the world combined.

AMD's Forrest Norrod said the EPYC CPUs and Radeon GPUs will be standard products, not proprietary SKUs built only for Lawrence Livermore National Laboratory.

In addition to industry-leading single-core and multi-core performance, another key feature is ease of programming: the shared memory coherency of Infinity Fabric greatly simplifies development. Every software thread running on a node can address the HBM stacks of all four GPUs plus the CPU's memory as a single memory space.
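The announcement does not spell out the programming model, but HIP's managed-memory API gives a rough feel for what a single coherent CPU-plus-GPU address space buys developers. The sketch below is my own illustration, not vendor code: one buffer is allocated once, written by the CPU, updated by a GPU kernel, and read back by the CPU through the same pointer, with no explicit copies. The buffer size and the kernel itself are arbitrary.

```cpp
// Sketch only: HIP managed memory as a stand-in for a cache-coherent
// CPU+GPU address space. Buffer size and kernel are illustrative.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double* data, double factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                    // GPU writes the shared buffer
}

int main() {
    const size_t n = 1 << 20;
    double* data = nullptr;

    // One allocation visible to both CPU and GPU -- no hipMemcpy required.
    hipMallocManaged((void**)&data, n * sizeof(double));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0;    // CPU initializes in place

    const unsigned blocks = unsigned((n + 255) / 256);
    scale<<<blocks, 256>>>(data, 2.0, n);            // GPU updates the same memory
    hipDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // CPU reads the result directly
    hipFree(data);
    return 0;
}
```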


Why is it so important?

First, choosing AMD for this performance-first system is a strong endorsement of AMD's CPU and GPU roadmaps.

In other words, the U.S. Department of Energy and Cray's engineers and management are confident in AMD's roadmap and its ability to execute.

Technically, putting the CPU and 4 GPUs on an integrated cache-coherent fabric has enormous potential to optimize performance and minimize programming hassle.

In stark contrast, Nvidia (which has no data-center-class CPU of its own) uses its proprietary NVLink 2 to interconnect GPUs, and relies on the slower PCIe Gen 3 I/O interconnect to reach the CPU.

Meanwhile, IBM POWER does support NVLink 2, but POWER's relevance in the supercomputing space is limited.

So unless Nvidia starts investing seriously in Arm server CPUs or acquires IBM POWER, I don't see how it will respond to this challenge.

This means that, while AMD faces a huge software challenge, it may hold an advantage in medium-sized (4-GPU) nodes.

It should also be noted that AMD's approach tops out at four GPUs per node and relies on Cray Slingshot to interconnect more GPUs at scale.

But Slingshot is no slouch: with 64 ports at 200 Gbps each, a switch delivers a staggering 12.8 Tb/s of bandwidth in each direction (64 × 200 Gb/s = 12,800 Gb/s).

And although AI training may require thousands of GPUs, 4-GPU nodes are ideal for high-performance computing and can also run today's DNN models well.

Nvidia sees the coming scramble and is acquiring Mellanox, which will give it a solution for a highly scalable GPU fabric.

But of course, no CPU natively supports Mellanox InfiniBand, and it is unclear how Nvidia will solve the CPU-GPU bottleneck.

Intel, for its part, has acquired Habana Labs, whose accelerators use 100 Gb Ethernet with RDMA to attack the same scaling problem. Ethernet connects the nodes, while Infinity Fabric connects the processors within a node.

Finally, I'd like to point out that AMD and Nvidia (and, in the future, Intel) hold an advantage over startups building AI-specific ASICs: a GPU can handle the 64-bit floating-point-intensive workloads common in high-performance computing as well as lower-precision AI workloads. That versatility is one of the reasons NVIDIA is so popular in the public cloud.
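To make that versatility concrete, here is a minimal sketch of my own (HIP C++, with arbitrary sizes and values, not code from AMD, Nvidia, or Intel): the same GPU kernel template is run once at FP64 for a "simulation-style" workload and once at a reduced precision for an "AI-style" workload on the same device. Production AI code would push further down, to FP16 or BF16, on the same silicon; single precision keeps the sketch simple.

```cpp
// Sketch only: one GPU kernel template run at FP64 (HPC-style) and at a
// reduced precision (FP32 here for simplicity; real AI code would use
// FP16/BF16 on the same hardware).
#include <hip/hip_runtime.h>
#include <cstdio>

template <typename T>
__global__ void axpy(const T* x, T* y, T a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];               // same math, different precision
}

template <typename T>
void run_axpy(T a, size_t n, const char* label) {
    T *x = nullptr, *y = nullptr;
    hipMallocManaged((void**)&x, n * sizeof(T));     // shared CPU/GPU buffers
    hipMallocManaged((void**)&y, n * sizeof(T));
    for (size_t i = 0; i < n; ++i) { x[i] = T(1); y[i] = T(2); }

    const unsigned threads = 256;
    const unsigned blocks  = unsigned((n + threads - 1) / threads);
    axpy<T><<<blocks, threads>>>(x, y, a, n);
    hipDeviceSynchronize();

    printf("%s: y[0] = %f\n", label, double(y[0]));
    hipFree(x);
    hipFree(y);
}

int main() {
    const size_t n = 1 << 20;
    run_axpy<double>(3.0, n, "FP64, simulation-style");
    run_axpy<float>(3.0f, n, "FP32, reduced-precision");
    return 0;
}
```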

All three exascale supercomputers planned in the U.S. will use GPUs (from Intel and AMD) to handle high-performance computing workloads while also accelerating AI.

Conclusion

By integrating CPUs and GPUs, AMD is shaping itself into an attractive option in high-performance computing, much as it used its APU strategy to differentiate itself in the notebook market.

To capitalize, AMD needs to build out an AI software ecosystem to penetrate the AI market, and we will be watching the next-generation chips from AMD, Intel, and Nvidia to see whether AMD can turn this win into a market lead.

In short, a great war is about to break out.
