NVIDIA released a dedicated GPU for ChatGPT, which increased the inference speed by 10 times

Author: Machine Heart Pro

Machine Heart report

Editors: Zenan, Egg Sauce

The iPhone moment of AI must have a good chip.

Artificial intelligence was once stuck in a decades-long bottleneck because of insufficient computing power, until GPUs ignited deep learning. In the ChatGPT era, large models have left AI short of computing power once again. Does NVIDIA have an answer this time?

On March 22, the GTC conference officially opened, and in the keynote that had just concluded, NVIDIA CEO Jensen Huang unveiled the chip prepared for ChatGPT.

"Accelerated computing is not an easy task. In 2012, the computer vision model AlexNet was trained on the GeForce GTX 580, processing 262 petaFLOPs (quadrillion floating-point operations). The model sparked an explosion in AI technology," Huang said. "Ten years later, the Transformer appeared, and GPT-3 required 323 zettaFLOPs (sextillion floating-point operations) of training compute, 1 million times more than AlexNet, to create ChatGPT, the AI that shocked the world. A new computing platform has emerged, and the iPhone moment of AI has arrived."
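
The "1 million times" figure can be checked directly from the two numbers in the quote. The sketch below treats both figures as total floating-point operation counts for training, which is an assumption about how to read the quote; the variable names are illustrative only.

```python
# Rough check of the "1 million times" claim.
# Assumption: both figures are total floating-point operations used in training.
PETA = 10**15
ZETTA = 10**21

alexnet_flop = 262 * PETA   # AlexNet figure from the quote
gpt3_flop = 323 * ZETTA     # GPT-3 figure from the quote

ratio = gpt3_flop / alexnet_flop
print(f"GPT-3 / AlexNet compute ratio: {ratio:,.0f}")
```

The ratio comes out to roughly 1.2 million, so "1 million times" is a rounded-down version of the quoted numbers.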

The AI boom has boosted NVIDIA's stock price by 77% this year, and its market capitalization of $640 billion is nearly five times that of Intel. But today's announcements show that NVIDIA is not slowing down.

Design dedicated computing power for AIGC

The development of generative AI (AIGC) is changing the need for computing power in technology companies, and NVIDIA showed four inference platforms for AI tasks at once, all using a unified architecture.

Among them, NVIDIA L4 provides "120 times higher AI-driven video performance than CPUs, and 99% better energy efficiency", and can be used for video streaming, encoding and decoding, and generative AI video. The more powerful NVIDIA L40 is dedicated to 2D/3D image generation.

In response to ChatGPT's huge demand for computing power, NVIDIA released the NVIDIA H100 NVL, a dedicated solution for large language models (LLMs) with 94GB of memory and an accelerated Transformer Engine. It pairs two PCIe H100 GPUs connected over NVLink.

"The only GPU that can actually handle ChatGPT is the NVIDIA HGX A100. A standard server with four pairs of H100 with dual-GPU NVLink is now 10 times faster than the former, reducing the processing cost of large language models by an order of magnitude," said Jensen Huang.

Finally, there is NVIDIA Grace Hopper for recommendation models, which, in addition to being optimized for recommendation tasks, can power graph neural networks and vector databases.

Push the limits of physics with chips

Semiconductor manufacturing processes are approaching the limits of what physics allows. After the 2nm process, where will the next breakthrough come from? NVIDIA decided to start with the most fundamental stage of chip manufacturing: lithography.

Fundamentally, this is an imaging problem at the limits of physics. In advanced processes, many features on the chip are smaller than the wavelength of the light used to print them, so the mask design must be repeatedly adjusted, a step called optical proximity correction. Computational lithography simulates the behavior of light as it passes through the optics and interacts with the photoresist, as described by Maxwell's equations. It is the most computationally demanding task in chip design and manufacturing.

Jensen Huang announced a new technology at GTC called cuLitho to accelerate the design and manufacture of semiconductors. The software uses NVIDIA GPUs to speed up the steps between a software-based chip design and the physical fabrication of the lithographic masks used to print that design onto the chip.

Running on GPUs, cuLitho delivers 40x better performance than current computational lithography and can accelerate large-scale computing workloads that currently consume tens of billions of CPU hours per year. "It takes 89 masks to build the H100, and it takes two weeks to compute one on the CPU, but it only takes eight hours to run on cuLitho with the H100," Huang said.

This means that 500 NVIDIA DGX H100 systems can do the work of 40,000 CPU systems and run all parts of the computational lithography process, helping to reduce power demand and potential environmental impact.
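
The figures quoted above can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the quoted numbers, reading "two weeks" as 14 full days of compute (an assumption); the variable names are illustrative.

```python
# Sanity check of the lithography figures quoted in the article.
HOURS_PER_WEEK = 7 * 24

cpu_hours_per_mask = 2 * HOURS_PER_WEEK  # "two weeks" on CPUs (assumed 14 full days)
gpu_hours_per_mask = 8                   # "eight hours" on H100

# Per-mask speedup implied by Huang's example.
speedup = cpu_hours_per_mask / gpu_hours_per_mask

# Consolidation ratio implied by "500 DGX H100 systems" vs. "40,000 CPU systems".
cpu_systems = 40_000
dgx_systems = 500
cpu_systems_per_dgx = cpu_systems / dgx_systems

print(f"Per-mask speedup: {speedup:.0f}x")
print(f"CPU systems replaced per DGX H100: {cpu_systems_per_dgx:.0f}")
```

The per-mask example works out to 42x, in line with the "40x" headline figure, and each DGX H100 system stands in for 80 CPU systems.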

This advance will allow chips with smaller transistors and circuits than is possible today, while shortening time to market and improving the energy efficiency of the large-scale data centers that run around the clock to drive the manufacturing process.

NVIDIA says it is working with ASML, Synopsys and TSMC to bring the technology to market. According to reports, TSMC will begin preparing for trial production with this technology in June.

"The chip industry is the foundation of almost every other industry in the world," Mr. Huang said. "As lithography is already at the limits of physics, through cuLitho and collaboration with our partners TSMC, ASML, and Synopsys, fabs are able to increase yields, reduce carbon footprints, and lay the groundwork for 2nm and beyond."

The first GPU-accelerated quantum computing system

At today's event, NVIDIA also announced a new system built with Quantum Machines that provides a revolutionary new architecture for researchers working on high-performance, low-latency quantum-classical computing.

As the world's first GPU-accelerated quantum computing system, NVIDIA DGX Quantum combines the world's most powerful accelerated computing platform (enabled by the NVIDIA Grace Hopper Superchip and the open source CUDA Quantum programming model) with the world's most advanced quantum control platform, OPX (provided by Quantum Machines). This combination lets researchers build more powerful applications than ever before, pairing quantum computing with state-of-the-art classical computing for calibration, control, quantum error correction, and hybrid algorithms.

At the heart of the DGX Quantum is an NVIDIA Grace Hopper system connected by PCIe to Quantum Machines OPX+, enabling sub-microsecond latency between the GPU and the quantum processing unit (QPU).

Tim Costa, Head of HPC and Quantum at NVIDIA, said: "Quantum-accelerated supercomputing has the potential to reshape science and industry, and NVIDIA DGX Quantum will enable researchers to push the boundaries of quantum-classical computing."

To this end, NVIDIA integrated its high-performance Hopper architecture GPU with the company's new Grace CPU into "Grace Hopper", which powers giant AI and HPC applications. It delivers up to 10x the performance for applications running terabytes of data, giving quantum-classical researchers more power to solve the world's most complex problems.

DGX Quantum also equips developers with NVIDIA CUDA Quantum, a powerful unified software stack that is now open source. CUDA Quantum is a hybrid quantum-classical computing platform capable of integrating and programming QPUs, GPUs, and CPUs in a single system.

$37,000 a month to train your own ChatGPT in a web browser

Microsoft spent hundreds of millions of dollars on tens of thousands of A100s to build a supercomputer dedicated to GPT. Now you can rent the same GPUs that OpenAI and Microsoft used to train ChatGPT and Bing search, and train your own large models on them.

NVIDIA's DGX Cloud provides a dedicated NVIDIA DGX AI supercomputing cluster paired with NVIDIA AI software, which enables every enterprise to access AI supercomputing using a simple web browser, eliminating the complexity of acquiring, deploying, and managing on-premises infrastructure.

According to reports, each DGX Cloud instance has eight H100 or A100 80GB Tensor Core GPUs, and each node has a total of 640GB GPU memory. A high-performance, low-latency fabric built with NVIDIA Networking ensures workloads can scale across clusters of interconnected systems, allowing multiple instances to act as one giant GPU to meet the performance requirements of advanced AI training.

Enterprises can now rent DGX Cloud clusters on a monthly basis to scale the development of large multi-node training workloads quickly and easily, without waiting for accelerated compute resources that are typically in high demand.

According to Huang, pricing starts at $36,999 per instance per month.
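
To put the instance specification and price in perspective, the short sketch below derives per-GPU figures from the numbers reported above. The 30-day month and the assumption that the instance is billed for every hour are illustrative simplifications, not NVIDIA's published pricing terms.

```python
# Per-GPU figures derived from the reported DGX Cloud instance spec and price.
GPUS_PER_INSTANCE = 8
MEMORY_PER_GPU_GB = 80
PRICE_PER_INSTANCE_MONTH_USD = 36_999
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, billed around the clock

total_memory_gb = GPUS_PER_INSTANCE * MEMORY_PER_GPU_GB
price_per_gpu_month = PRICE_PER_INSTANCE_MONTH_USD / GPUS_PER_INSTANCE
price_per_gpu_hour = price_per_gpu_month / HOURS_PER_MONTH

print(f"GPU memory per instance: {total_memory_gb} GB")
print(f"Price per GPU per month: ${price_per_gpu_month:,.2f}")
print(f"Price per GPU per hour:  ${price_per_gpu_hour:.2f}")
```

Eight 80GB GPUs account for the 640GB total, and under these assumptions the starting price works out to about $4,600 per GPU per month, or a little over $6 per GPU-hour.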

"We are in the iPhone moment of artificial intelligence," said Jensen Huang. "Startups are racing to create disruptive products and business models, and incumbents are looking to respond. DGX Cloud gives customers instant access to NVIDIA AI supercomputing in the cloud at global scale."

To help enterprises embrace the wave of generative AI, NVIDIA also announced a series of cloud services that enable enterprises to build and improve custom large language models and generative AI models.

Now, people can use the NVIDIA NeMo language service and the NVIDIA Picasso image, video, and 3D service to build proprietary, domain-specific generative AI applications for intelligent conversation and customer support, professional content creation, digital simulation, and more. In addition, NVIDIA announced new models for its NVIDIA BioNeMo cloud service for biology.

"Generative AI is a new type of computer that can be programmed in the natural language of humans. The impact of this ability is profound: everyone can command a computer to solve a problem, something that not long ago was the preserve of programmers," Huang said.

Judging from today's announcements, NVIDIA is not only continuing to improve its hardware designs for the AI workloads of technology companies, but also proposing new business models. In the eyes of some, NVIDIA wants to be the "TSMC of the AI field": just as a fab offers advanced foundry services, it would help other companies train AI algorithms for their own specific scenarios on top of its platform.

Will training directly on NVIDIA's supercomputers, cutting out the middleman who pockets the difference, be the future direction of AI development?