Decode AI performance on RTX AI PCs and workstations

The era of AI PCs powered by NVIDIA RTX and GeForce RTX technology has arrived. With it comes a new way to evaluate AI-accelerated performance, along with new terminology that serves as a reference for users choosing desktops and laptops.

While PC gamers are aware of frames per second (FPS) and similar statistics, measuring AI performance requires new metrics.

TOPS stands out

TOPS, or trillions of operations per second, is the primary benchmark. "Trillions" is the key word here: the amount of processing behind generative AI tasks is enormous. You can think of TOPS as a raw performance indicator, much like a horsepower rating for an engine. Naturally, the higher the number, the better.

Consider, for example, Microsoft's recently announced Windows 11 AI PCs, which include a neural processing unit (NPU) capable of performing at least 40 trillion operations per second. 40 TOPS is more than enough compute for some lightweight AI-assisted tasks, such as asking a local chatbot where yesterday's notes are.

But many generative AI tasks demand far more computing power than that. NVIDIA RTX and GeForce RTX GPUs deliver exceptional performance across all generative tasks, with the GeForce RTX 4090 GPU delivering up to 1,177 TOPS. That's the kind of computing power needed for tasks like AI-assisted digital content creation (DCC), AI super-resolution for PC games, generating images from text or video, interacting with local large language models (LLMs), and more.
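As a rough back-of-envelope illustration of why those TOPS figures matter, the sketch below estimates how long a fixed generative workload would take at different TOPS ratings. All of the workload and utilization numbers are purely hypothetical assumptions, not measurements from this article.

```python
# Back-of-envelope sketch relating raw TOPS to generative AI throughput.
# Every number below is an illustrative assumption, not a measured figure.

npu_tops = 40      # entry-level NPU, trillions of ops per second
gpu_tops = 1177    # high-end GPU, trillions of ops per second

# Hypothetical image-generation workload: ~10 trillion operations per
# denoising step, 30 steps per image.
ops_per_image = 10e12 * 30

for name, tops in [("NPU", npu_tops), ("RTX GPU", gpu_tops)]:
    # Assume only a fraction of peak TOPS is sustained in practice.
    sustained_ops_per_sec = tops * 1e12 * 0.4
    seconds_per_image = ops_per_image / sustained_ops_per_sec
    print(f"{name}: ~{seconds_per_image:.1f} s per image (hypothetical workload)")
```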

Performance is measured in tokens

TOPS is only a baseline metric. The performance of an LLM is measured by the number of tokens the model generates.

A token is the output of an LLM: it can be a word in a sentence, or an even smaller fragment such as a punctuation mark or a space. The performance of AI-accelerated tasks can be measured in tokens per second.
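A quick way to see what tokens look like is to run a tokenizer over a sentence. The sketch below uses the Hugging Face transformers library and the GPT-2 tokenizer purely as an illustration; the article doesn't prescribe a particular toolchain.

```python
# Minimal sketch of what "tokens" are, using the Hugging Face transformers
# library (an assumption; any LLM tokenizer behaves similarly).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Where were yesterday's notes?"
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)       # words and sub-word fragments the model actually emits
print(len(tokens))  # the count that a tokens-per-second figure is based on
```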

Another important factor is batch size, which is the number of inputs that can be processed at the same time in a single inference pass. As large language models (LLMs) sit at the heart of many modern AI systems, the ability to process multiple inputs, such as from a single application or across multiple applications, will be a key differentiator. While a larger batch size improves performance for concurrent inputs, it also requires more memory, especially when running larger models.
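To see how batch size and tokens per second interact in practice, here is a minimal, hedged sketch using the transformers library; the model, prompts, and batch sizes are placeholders, and production LLM serving stacks measure this far more carefully.

```python
# Hedged sketch: measuring tokens per second at several batch sizes with the
# Hugging Face transformers library. Model, prompts, and sizes are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "gpt2"  # stand-in; any causal LM follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

for batch_size in (1, 4, 8):
    prompts = ["Summarize yesterday's meeting notes:"] * batch_size
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

    start = time.time()
    outputs = model.generate(
        **inputs, max_new_tokens=64, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    elapsed = time.time() - start

    # New tokens produced across the whole batch divided by wall-clock time.
    new_tokens = (outputs.shape[-1] - inputs["input_ids"].shape[-1]) * batch_size
    print(f"batch={batch_size}: {new_tokens / elapsed:.1f} tokens/s")
```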

RTX GPUs are well suited to LLMs because they have large amounts of dedicated video memory (VRAM), Tensor Cores, and TensorRT-LLM software.

GeForce RTX GPUs offer up to 24GB of high-speed VRAM, while NVIDIA RTX GPUs offer up to 48GB, supporting larger models and larger batch sizes. RTX GPUs also feature Tensor Cores, purpose-built AI accelerators that dramatically speed up the compute-intensive operations in deep learning and generative AI models. Applications can easily tap into that performance with the NVIDIA TensorRT software development kit (SDK), which unlocks high-performance generative AI on more than 100 million Windows PCs and workstations powered by RTX GPUs.

The combination of large video memory, dedicated AI accelerators, and optimized software gives RTX GPUs a huge boost in throughput, especially as batch sizes increase.
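As a rough way to see why VRAM matters for model size and batch size, the sketch below checks available GPU memory with PyTorch and applies a simplified sizing estimate. The model size, precision, and per-sequence cache allowance are illustrative assumptions, not figures from the article; real memory use depends on the runtime.

```python
# Rough sketch: checking GPU VRAM with PyTorch and estimating whether a model
# and batch size fit. The sizing formula is a simplification (weights plus a
# flat per-sequence KV-cache allowance); actual memory use varies by runtime.
import torch

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")

params_billions = 7          # e.g. a 7B-parameter LLM (illustrative)
bytes_per_param = 2          # FP16/BF16 weights
weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3

kv_cache_gb_per_seq = 0.5    # hypothetical per-sequence allowance
for batch_size in (1, 4, 8):
    needed_gb = weights_gb + batch_size * kv_cache_gb_per_seq
    verdict = "fits" if needed_gb < vram_gb else "does not fit"
    print(f"batch={batch_size}: ~{needed_gb:.1f} GB needed -> {verdict}")
```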

Text-to-image production is faster than ever

Measuring image generation speed is another way to evaluate performance. One of the most straightforward ways to do this is with Stable Diffusion, a popular image generation AI model that lets users easily turn text descriptions into complex visuals.


With Stable Diffusion, users can quickly create the image they want by entering a text prompt, and when the model runs on an RTX GPU, it generates results faster than on a CPU or NPU.
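For readers who want to try this themselves, below is a minimal sketch of running Stable Diffusion from Python with the Hugging Face diffusers library. The checkpoint name and prompt are just examples, and this is only one of several ways to run the model alongside the Automatic1111 and ComfyUI interfaces mentioned next.

```python
# Minimal sketch: text-to-image with Stable Diffusion via the Hugging Face
# diffusers library (one common toolchain; not mandated by the article).
# Timing the call gives a simple seconds-per-image figure.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a watercolor painting of a mountain lake at sunrise"

start = time.time()
image = pipe(prompt, num_inference_steps=30).images[0]
print(f"generated in {time.time() - start:.1f} s")
image.save("lake.png")
```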

Performance is even better when using the TensorRT extension in the popular Automatic1111 interface. With SDXL models, RTX users can generate images from prompts up to 2x faster, greatly streamlining Stable Diffusion workflows.

ComfyUI, another popular Stable Diffusion user interface, also added TensorRT acceleration last week. RTX users can now generate images from text prompts up to 60% faster, and can use Stable Video Diffusion to convert those images into video up to 70% faster with TensorRT.


The new UL Procyon AI Image Generation benchmark now supports TensorRT acceleration. Compared with the fastest non-TensorRT configuration, TensorRT acceleration delivers a 50% speedup on the GeForce RTX 4080 SUPER GPU.

TensorRT acceleration for Stable Diffusion 3, Stability AI's highly anticipated new text-to-image model, was also recently announced. In addition, the new TensorRT-Model Optimizer pushes performance even further, delivering significant speed gains while reducing memory consumption compared with non-TensorRT implementations.

Of course, seeing is believing. The real test is the real-world scenario of iterating on a prompt. On RTX GPUs, users can refine images significantly faster by tweaking their prompts, with each iteration taking only a few seconds; on a MacBook Pro with the M3 Max, the same iteration takes minutes. And by running locally on an RTX-powered PC or workstation, users get both speed and security, with everything kept private.

The results are in, and the technology is open source

But don't just take our word for it. The team of AI researchers and engineers behind the open-source Jan.ai recently integrated TensorRT-LLM into their local chatbot application and then tested these optimizations for themselves.

The researchers tested the real-world performance of TensorRT-LLM on a variety of GPUs and CPUs used by the community, using the open-source llama.cpp inference engine as a baseline. They found that TensorRT-LLM was "30-70% faster than llama.cpp on the same hardware" and more efficient on consecutive processing runs. The team has also published its testing methodology, inviting others to measure generative AI performance for themselves.
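For a simple at-home version of that kind of measurement, the sketch below uses llama-cpp-python, the Python bindings for the llama.cpp engine used as the baseline above, to compute a basic tokens-per-second figure. The model path is a placeholder, and Jan.ai's published methodology is far more rigorous than this.

```python
# Hedged sketch: a basic tokens-per-second check with llama-cpp-python, the
# Python bindings for llama.cpp. The model path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.time()
out = llm("Explain what a token is in one sentence.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tokens/s")
```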

Whether for gaming or generative AI, speed is the key to success. TOPS, tokens per second, and batch size all factor into determining a performance champion.
