
Today's Groq chips are 20 times faster than NVIDIA, but they are also 40 times more expensive

"Speed is a double-edged sword for Groq here. “

Two days before Nvidia's earnings report, a fierce rival suddenly emerged.

A company called Groq has gone viral in AI circles today, and its killer feature is just one thing: speed.

With typical generative AI, waiting is the norm: characters pop up one by one, and a full answer can take what feels like forever. On the cloud demo platform Groq opened today, however, text streams out a screenful at a time. Given a prompt, the model produces an answer almost instantly, and the answers are not only credible but also cited and hundreds of words long.

Matt Shumer, CEO and co-founder of email startup Otherside AI, saw Groq's power firsthand in the demo. He praised it as lightning fast, generating factual, cited answers of hundreds of words in under a second. Even more surprising, more than three quarters of that time goes to retrieving information, while generating the answer itself takes only a fraction of a second.

Although its cloud service only opened today, Groq is not a fledgling startup. The company was founded in 2016 and registered the Groq trademark that same year. Last November, when Musk released the AI model Grok, Groq's developers published an open letter pointing out that Musk had taken their name. The letter was quite funny, but it didn't ride that wave of attention at all.

An open letter from Groq to Musk at the time

The reason they blew up this time is mainly the launch of the Groq cloud service, which lets everyone feel for themselves how good lag-free AI can be.

Users working in AI development have praised Groq as a "game changer" in the pursuit of low latency, the time from submitting a request to getting a response. Another user said Groq's LPUs could "revolutionize" the demand for GPUs in AI applications and might become a powerful alternative to the "high-performance hardware" of Nvidia's A100 and H100 chips.

The core technology behind the Groq chip's speed advantage is the LPU

Its first public benchmark shows that the Llama 2 and Mistral models served from Groq's cloud far outperform ChatGPT in compute and response speed. Behind this performance is an ASIC the Groq team tailor-made for large language models (LLMs), which lets Groq generate up to 500 tokens per second. By comparison, the GPT-3.5 model behind the current public version of ChatGPT generates only about 40 tokens per second.

Groq is way ahead in terms of speed
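To make the gap concrete, here is a minimal back-of-envelope sketch in Python using the two throughput figures above; the 300-token answer length is an assumption of mine for illustration, not a figure from the benchmark.

```python
# Back-of-envelope latency for a "hundreds of words" answer at the quoted speeds.
GROQ_TOKENS_PER_SEC = 500   # Groq cloud, per the benchmark cited above
GPT35_TOKENS_PER_SEC = 40   # public ChatGPT (GPT-3.5), per the article
ANSWER_TOKENS = 300         # assumed answer length, for illustration only

print(f"Groq:    {ANSWER_TOKENS / GROQ_TOKENS_PER_SEC:.1f} s")   # ~0.6 s
print(f"GPT-3.5: {ANSWER_TOKENS / GPT35_TOKENS_PER_SEC:.1f} s")  # ~7.5 s
```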

The core technology that gives the chip its speed advantage is Groq's first-of-its-kind LPU.

According to k_zeroS, an investor close to Groq who posted on Twitter, the LPU works very differently from a GPU. It uses a temporal instruction set computer architecture, which means it does not need to reload data from memory as often as GPUs that rely on high-bandwidth memory (HBM). This not only sidesteps the HBM shortage but also effectively reduces costs.

Comparison of LPU vs. GPU

Unlike Nvidia GPUs, which depend on high-speed data transfer from HBM, Groq's LPU does not use high-bandwidth memory at all. It relies on SRAM, which is roughly 20 times faster than the memory GPUs use.
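As a rough sanity check on the "about 20 times" figure, the sketch below compares the 80 TB/s on-chip SRAM bandwidth Groq lists in its spec sheet (quoted later in this article) with the roughly 3.35 TB/s HBM3 bandwidth commonly cited for an H100 SXM card; the H100 number is my assumption, not something stated in this article.

```python
# Rough ratio behind the "about 20x" memory-speed claim.
GROQ_SRAM_BW_TBPS = 80.0    # from Groq's published chip specs (see the spec sheet below)
H100_HBM3_BW_TBPS = 3.35    # commonly cited H100 SXM HBM3 bandwidth (assumed, not from the article)

print(f"SRAM vs HBM bandwidth: ~{GROQ_SRAM_BW_TBPS / H100_HBM3_BW_TBPS:.0f}x")  # ~24x
```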

Because AI inference needs far less data than model training, Groq's LPU is also more energy efficient: for inference tasks it reads less data from external memory and draws less power than an Nvidia GPU.

If the LPU is used for AI workloads, there may be no need for the specialized storage that Nvidia GPUs require, since the LPU does not have the same demand for external memory speed. Groq claims that, through its chips and software, its technology can take over the role GPUs play in AI tasks.

A teaching assistant at Ankara University explained the difference between the LPU and the GPU more vividly: "Imagine you have two workers, one from Groq (call it the 'LPU') and one from Nvidia (call it the 'GPU'). Both are tasked with sorting through a large pile of documents as quickly as possible.

The GPU is like a fast worker, but it needs a high-speed delivery system (think of high-bandwidth memory, or HBM) to rush all the files to its desk. That system is expensive and sometimes hard to come by (HBM is in limited supply).

Groq's LPU, on the other hand, is like a worker who organizes tasks so efficiently that documents don't need to be rushed over at all. It keeps a smaller desk right beside it (like SRAM, a faster but smaller memory), so it can grab what it needs almost instantly. That means it can work quickly without depending on a fast delivery system.

The LPU is even better suited to tasks that don't require going through every file in the pile (much like AI inference, which doesn't move that much data). It avoids the usual back-and-forth shuffling, saving energy while getting the job done quickly.

LPU structure

The particular way the LPU organizes its work (its temporal instruction set computer architecture) means it doesn't have to keep getting up to grab more papers from the pile. That sets it apart from the GPU, which constantly demands more files from the high-speed delivery system."

It's fast but expensive, and that doesn't yet make it a competitor to Nvidia

When Groq first went viral, the AI industry was absorbed in the shock of its lightning speed. Once the shock wore off, though, many industry figures ran the numbers and found that this speed may come at a steep price.

Jia Yangqing ran the numbers on Twitter: because each Groq card carries only 230 MB of memory, running the Llama-2 70B model takes 305 Groq cards, whereas the H100 needs only 8. At current prices, that works out to roughly 40 times the hardware cost and 10 times the energy cost of an H100 setup at the same throughput.
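A minimal sketch of the card-count arithmetic, assuming the 70B weights are held at one byte per parameter (INT8); that assumption is mine, chosen because it reproduces the roughly 305-card figure, and it ignores KV cache and activations.

```python
MODEL_PARAMS = 70e9        # Llama-2 70B
BYTES_PER_PARAM = 1        # INT8 weights (assumed)
GROQ_SRAM_BYTES = 230e6    # 230 MB of on-chip SRAM per Groq card

weights_bytes = MODEL_PARAMS * BYTES_PER_PARAM
groq_cards = weights_bytes / GROQ_SRAM_BYTES
print(f"Groq cards needed just to hold the weights: ~{groq_cards:.0f}")  # ~304

# An H100 carries 80 GB of HBM, so the same weights fit on a handful of cards;
# the calculation above cites 8 H100s as the reference deployment.
```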

Chip expert Yao Jinxin (Uncle J) explained to Tencent Technology in more detail:

According to Groq's information, the specifications of this AI chip are as follows:

Groq chip specification sheet

The spec sheet shows several key figures: 230 MB of SRAM, 80 TB/s of bandwidth, and 188 TFLOPS of FP16 compute.

Under current practice for deploying large models for inference, a 7B model needs more than 14 GB of memory, so hosting one 7B model takes roughly 70 Groq chips. According to the information disclosed, each chip corresponds to one compute card, and at eight cards per 4U server that means nine 4U servers (almost a full standard rack) and 72 chips in total. At that point the aggregate compute (FP16) reaches an astonishing 188T * 72 = 13.5P, or about 54P at INT8. Using 54P of compute to serve a 7B model is, without exaggeration, shooting a mosquito with a cannon.
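The arithmetic above can be reproduced with a short sketch; all figures come from the paragraph and spec sheet, and the 750 TOPS INT8 per chip is simply what the quoted 54P total implies for 72 chips.

```python
import math

MODEL_MEM_GB = 14           # memory needed for one 7B model instance
SRAM_PER_CHIP_GB = 0.230    # 230 MB SRAM per Groq chip
CHIPS_PER_SERVER = 8        # compute cards per 4U server
FP16_TFLOPS_PER_CHIP = 188
INT8_TOPS_PER_CHIP = 750    # implied by the 54P INT8 total across 72 chips

chips_for_model = math.ceil(MODEL_MEM_GB / SRAM_PER_CHIP_GB)   # 61, "about 70" in the text
servers = math.ceil(70 / CHIPS_PER_SERVER)                     # 9 x 4U servers
total_chips = servers * CHIPS_PER_SERVER                       # 72 chips
print(FP16_TFLOPS_PER_CHIP * total_chips / 1000)               # ~13.5 PFLOPS FP16
print(INT8_TOPS_PER_CHIP * total_chips / 1000)                 # ~54 POPS INT8
```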

The article now circulating on social media compares against the Nvidia H100, which carries 80 GB of HBM, enough to host five 7B model instances per card. As for compute, with sparsity the H100 delivers nearly 2P at FP16 and nearly 4P at INT8.

Now compare the two at the same level of compute. Running inference at INT8, the Groq solution requires a cluster of 9 servers holding 72 chips; to match that with the H100 takes roughly two 8-card servers, whose combined INT8 compute reaches 64P and which can host more than 80 concurrent 7B model instances.
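The H100 side of that comparison works out as follows; all figures are taken from the two preceding paragraphs.

```python
H100_HBM_GB = 80
MODEL_MEM_GB = 14
H100_CARDS = 2 * 8                      # two 8-card servers
INT8_POPS_PER_CARD = 4                  # ~4P INT8 with sparsity, per the article

instances_per_card = H100_HBM_GB // MODEL_MEM_GB    # 5 x 7B instances per card
print(H100_CARDS * INT8_POPS_PER_CARD)              # 64P INT8 in total
print(H100_CARDS * instances_per_card)              # 80 concurrent 7B instances
```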

The original article cites a token generation speed of 750 tokens/s for Llama2-7B on Groq; against that, the concurrent throughput of two H100 servers, 16 GPUs in total, is vastly higher. And in terms of cost, 9 Groq servers are far more expensive than 2 H100 servers (even at today's absurdly inflated H100 prices):

● Groq: $20,000 * 72 cards = $1.44 million, plus $20,000 * 9 servers = $180,000, a pure BOM cost of more than $1.6 million (all using the lowest estimates).

● H100: $300,000 * 2 = $600,000 (overseas), or 3 million RMB * 2 = 6 million RMB (actual market price in China).

For a 70B model, again at INT8, at least 600 cards and nearly 80 servers would be needed, and the cost would be even higher.

And that's before counting rack costs and electricity (nine 4U servers fill almost an entire standard rack).

In fact, the most cost-effective card for inference deployment is actually the RTX 4090.

Does Groq really surpass Nvidia? Yao Jinxin (Uncle J) offered a dissenting view on this as well:

"NVIDIA's absolute leadership in this AI wave has made the world look forward to the challenger. Every time an article attracts attention, it will always be believed at first, in addition to this reason, it is because of the "routine" when making comparisons, deliberately ignoring other factors, and using a single dimension to make comparisons. It's like the famous saying, "Facts aside, aren't you at fault?"

Comparing without regard to the use case is really not appropriate. An architecture like Groq's does have scenarios where its strengths show; after all, such high bandwidth is a great fit for workloads that shuffle data around frequently.

To sum up, Groq's architecture pairs small memory with large compute, so the limited amount of data it holds is matched with extremely high compute, which is what makes it so fast.

Put the other way around, Groq's extreme speed rests on very limited single-card throughput. To guarantee the same throughput as the H100, you will need many more cards.

Speed is a double-edged sword for Groq here."

Legendary CEO, small team

While Groq still faces a number of potential issues, it offers a glimpse of possible paths beyond the GPU, thanks largely to the formidable team behind it.

Groq is led by Jonathan Ross, a former Google employee known as the "father of the TPU." Co-founder Douglas Wightman also came from Google's TPU team and has founded four companies. Chief technology officer Jim Miller previously led the design of computing hardware at Amazon Web Services (AWS), and the CMO led the launch of Apple's Macintosh.

Jonathan Ross

Groq's team remains relatively small: headquartered in Mountain View, California, it has just over 180 employees, less than a quarter of the engineering headcount large chipmakers like Intel require.

Ross and his colleagues aim to repeat at Groq what he achieved at Google: build a chip project that leads the industry toward new technology. He hopes to win a small number of key customers whose large-scale deployments of Groq chips will provide a steady stream of revenue and drive the company's independent growth. The startup has already begun sending samples to potential customers.

"It's like hunting an elephant," Ross said, "and you only need a handful of prey to keep you alive, especially when we're so weak." ”
