
New choice, what changes will Intel's new Gaudi2 processor bring?

Source: Top Technology

Author: Li Xiangjing

Demand for computing power driven by large models continues to surge. While GPUs remain the preferred option, computing infrastructure also needs new combinations of processors to give users more choice.

Recently, Intel launched the second-generation Gaudi deep learning accelerator, Habana Gaudi2, for the Chinese market. With its price/performance advantages, Gaudi2 delivers higher deep learning performance and efficiency, making it a compelling option for large-scale AI deployment.


Sandra Rivera, executive vice president and general manager of Intel's Data Center and Artificial Intelligence Group, said that artificial intelligence is the most disruptive technology in the industry today and is having a huge impact on our lives. Almost every industry is looking for ways to deploy AI to increase productivity and drive innovation. To that end, Intel is actively working with customers across a wide range of market segments to deploy AI successfully into their businesses.

Intel actively promotes the implementation of AI

The recent boom in generative AI and LLMs (large language models) has greatly accelerated the development of AI and created substantial new computing demand.

AI pipelines span a broad range of complex workloads and multimodal datasets, and no single solution fits every computing need. General-purpose processors are widely used in the data-ingestion stage and in classical machine learning to train small and medium-sized models. The massive installed base of the x86 architecture, combined with its built-in AI capabilities, makes general-purpose processors well suited to these stages of the AI pipeline.

Sandra Rivera said Intel is committed to making it easier for customers to deploy AI wherever computing happens. For example, the AMX AI acceleration engine integrated into the fourth-generation Intel Xeon Scalable processor delivers up to 10x the AI inference and training performance of the previous generation.


In addition to hardware-level innovations, Intel continues to invest in software tools such as oneAPI and OpenVINO and in frameworks such as PyTorch, TensorFlow, and DeepSpeed, giving developers openness and choice across hardware architectures.

"Intel has a proven track record of working with open ecosystems, extending technology through long-term investments in developer ecosystems, tools, technologies, and open platforms that enable customers to easily deploy AI on the general-purpose processors already in their infrastructure," Sandra Rivera said.

The new Gaudi2 training accelerator

Although Intel Xeon Scalable processors can run many AI workloads, supporting larger model sizes and a wide range of system requirements calls for heterogeneous computing approaches and different computing architectures. The Gaudi deep learning accelerator further enriches Intel's AI portfolio for large language model workloads.


The Gaudi2 deep learning accelerator and the Gaudi2 mezzanine card HL-225B are built on the first-generation Gaudi high-performance architecture and accelerate large language models with across-the-board improvements in performance and energy efficiency. The accelerator features 24 programmable Tensor Processor Cores (TPCs), 21 integrated 100 Gbps (RoCE v2) Ethernet ports, 96 GB of HBM2E memory, 2.4 TB/s of total memory bandwidth, 48 MB of on-chip SRAM, and an integrated multimedia processing engine.

Eitan Medina, COO of Habana Labs, said that the key factors that Gaudi2 can bring value to Chinese customers are its outstanding performance, scalability, all-round energy efficiency improvement, and ease of use.

The performance of the Gaudi2 accelerator was validated in the MLCommons MLPerf benchmark results published in June, with strong training results on the GPT-3 model, the computer vision models ResNet-50 (using 8 accelerators) and UNet3D (using 8 accelerators), and the natural language processing model BERT (using 8 and 64 accelerators). Compared with other products on the market for large-scale generative AI and large language models, Gaudi2 offers excellent performance and leading price/performance, helping users improve operational efficiency while reducing operating costs.

In addition, Gaudi2 provides excellent inference performance for large-scale multimodal and language models. In recent Hugging Face evaluations, it demonstrated strong at-scale inference, including industry-leading results on Stable Diffusion (a state-of-the-art text-to-image generative model) and on the 7-billion and 176-billion-parameter BLOOMz models.

The computational needs of generative AI and LLMs require massive scaling, and the architecture of the Gaudi2 deep learning accelerator is designed to scale efficiently to meet the needs of large language models and generative AI models. Each chip integrates 21 dedicated 100 Gbps (RoCE v2 RDMA) Ethernet ports for interconnect, enabling low-latency in-server scaling.
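As a quick sanity check on the interconnect figures above, the per-chip aggregate Ethernet bandwidth follows directly from the port count and per-port speed (a trivial sketch in Python; it matches the 2.1 Tbps total P2P bandwidth quoted later for the eight-card server):

```python
# Gaudi2 integrates 21 Ethernet ports per chip, each running at 100 Gbps (RoCE v2).
ports_per_chip = 21
port_speed_gbps = 100

aggregate_gbps = ports_per_chip * port_speed_gbps
print(f"{aggregate_gbps} Gbps = {aggregate_gbps / 1000} Tbps")  # 2100 Gbps = 2.1 Tbps
```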

On Stable Diffusion training, Gaudi2 demonstrated near-linear 99% scalability from 1 card to 64 cards. In addition, MLCommons' just-announced MLPerf Training 3.0 results also verify that the Gaudi2 processor can achieve an impressive near-linear 95% scaling effect from 256 accelerators to 384 accelerators on the 175 billion parameter GPT-3 model.
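Scaling efficiency of the kind quoted above is conventionally computed as achieved speedup divided by ideal (linear) speedup. A minimal sketch; the throughput numbers below are illustrative placeholders, not published results:

```python
def scaling_efficiency(base_devices, base_throughput, scaled_devices, scaled_throughput):
    """Achieved speedup divided by ideal (linear) speedup."""
    ideal_speedup = scaled_devices / base_devices
    achieved_speedup = scaled_throughput / base_throughput
    return achieved_speedup / ideal_speedup

# Illustrative: 95% efficiency when scaling from 256 to 384 accelerators
print(f"{scaling_efficiency(256, 100.0, 384, 142.5):.0%}")  # 95%
```

A value of 1.0 (100%) means perfectly linear scaling, so the 99% and 95% figures above indicate throughput growing almost in lockstep with accelerator count.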

Intel is committed to helping customers easily build new models and migrate existing GPU-based models and systems to Gaudi-based servers. To that end, Intel created the SynapseAI software suite, optimized for deep learning training and inference on the Gaudi platform.
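A common first step in such a migration is switching the target device: with the SynapseAI PyTorch bridge (shipped as the `habana_frameworks` package) installed, models and tensors are moved to the "hpu" device instead of "cuda". The probe below is a hedged, stdlib-only sketch of that device choice; the package name reflects Habana's public PyTorch integration, and the fallback logic is illustrative:

```python
import importlib.util

def select_device() -> str:
    """Return "hpu" when the Habana PyTorch bridge is installed, else "cpu".

    Assumption: on Gaudi systems the SynapseAI PyTorch integration is
    available as the `habana_frameworks` package; importing it registers
    the "hpu" device with PyTorch, so `tensor.to("hpu")` works.
    """
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"
    return "cpu"  # fallback for machines without Gaudi accelerators

device = select_device()
```

Training scripts parameterized on such a device string (`model.to(device)`, `batch.to(device)`) are what make a CUDA-to-Gaudi port largely mechanical rather than a rewrite.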

Partnering with China's ecosystem to accelerate Gaudi2 adoption

In addition to innovative hardware products, Intel accelerates AI adoption through an open ecosystem. For example, Baidu Intelligent Cloud achieved a multi-fold performance improvement for the ERNIE-Tiny model using fourth-generation Intel Xeon Scalable processors with the integrated Intel AMX acceleration engine.

He Yongzhan, senior manager of Baidu Intelligent Cloud's server business, said Baidu and Intel have jointly carried out multiple performance optimizations on the fourth-generation Xeon Scalable processor based on the AMX acceleration engine, for example optimizing the inference engine for higher processing efficiency and using oneDNN to invoke AMX instructions efficiently and optimize memory performance. This work delivered a 2.66x performance improvement for ERNIE-Tiny, the lightweight version of Baidu's ERNIE (Wenxin) large model, with satisfactory results.

At present, Intel is cooperating with Inspur to build and sell the Inspur AI server NF5698G7 based on the Gaudi2 deep learning accelerator.

Liu Jun, general manager of Inspur Information's AI & HPC product line, said the NF5698G7 server supports eight Gaudi2 AI accelerators interconnected at high speed via OAM (Open Accelerator Module) in a 6U chassis. Each Gaudi2 chip carries 96 GB of HBM high-speed memory, and the system provides a total of 2.1 Tbps of peer-to-peer interconnect bandwidth with a fully connected topology, meeting the tensor-parallel communication needs of large-model training. The server is also equipped with two fourth-generation Xeon processors supporting acceleration engines such as AMX and DSA.

The NF5698G7 is designed to the OCP (Open Compute Project) open accelerator specifications OAM/UBB, supports mainstream AI frameworks such as PyTorch and TensorFlow as well as popular development tools such as Megatron and DeepSpeed, and provides a mature, cost-effective open-ecosystem solution for generative AI.

In addition to Inspur Information, New H3C and xFusion will also launch Gaudi2-based server products.

Conclusion

For decades, Intel has provided the Chinese market with a leading, standards-based heterogeneous data center portfolio, enabling customers to deploy AI anywhere.

"We will continue to build an open ecosystem for general-purpose computing, deliver higher deep learning training performance through the Gaudi2 deep learning accelerator, improve user productivity, and help accelerate the deployment and application of AI in China," Sandra Rivera concluded.