
The price war in large language model inference: winning on scale?

Source | CSDN

This article is reprinted with permission from Mr. Baoyu's personal blog (Weibo @Baoyu xp), link: https://baoyu.io/translations/llm/inference-race-to-the-bottom-make

Authors | Dylan Patel, Daniel Nishball    Editor | Xia Meng

Translation | baoyu.io


Mixtral inference costs on H100, MI300X, H200, and A100, plus speculative decoding

Currently, besides OpenAI, five companies have models that surpass GPT-3.5 on multiple benchmarks: Mistral's Mixtral, Inflection-2, Anthropic's Claude 2, Google's Gemini Pro, and X.AI's Grok. Even more surprising, Mistral and X.AI achieved this with teams of fewer than 20 people each. We also expect Meta, Databricks, 01.AI (Yi), Baidu, and ByteDance to surpass GPT-3.5 soon. Of course, these results come from benchmarks, and some companies are rumored to train on evaluation data… but let's not dwell on that little detail.

For those keeping count, that makes 11 companies in total within just a few months. Clearly, pre-training GPT-3.5-class models has become commonplace. OpenAI still leads with GPT-4, but that lead has shrunk significantly. While we believe the top-end models will capture most of the long-term value, the next tier of models, especially after fine-tuning, will also carve out a multi-billion-dollar segment of the market on quality and cost.

So, if these models are ubiquitous, which companies can profit from them?

Companies that reach customers directly through complete software-as-a-service offerings or social media, and thus own unique distribution channels, will have an advantage. Companies that provide end-to-end training or fine-tuning services, helping others handle every stage from data to serving, will have an advantage. Companies that can provide data protection and ensure all models are used legitimately will also have an advantage. Companies that simply serve open models will have no competitive advantage.

Some of these advantages show up clearly in Microsoft's Azure GPT API versus OpenAI's. In terms of inference volume across public and private instances, Microsoft handles more than OpenAI's own API. For cautious enterprises, the security, data assurances, and bundled service contracts Microsoft provides matter. At the same time, these protections make it easier for bad actors to evade accountability, as shown by ByteDance's use of Azure GPT-4 to help train its upcoming large language models (LLMs).

The truth is, if you are not the market leader, you have to compete on price to attract customers. Google, for example, offers 60 free API calls per minute for Gemini Pro, its GPT-3.5 competitor. Google is not alone: almost everyone today is losing money on large language model (LLM) inference.

Serving purely open models has become commoditized. Although the operating costs of such a service are significant, the initial capital requirements are low. Many sub-cloud providers offer attractive pricing (mainly because they need to earn a return on their GPU investments), though using these services can carry certain security risks.

It is trivially easy for companies to rent a few GPUs and start serving open-source models on Nvidia and AMD hardware with libraries like vLLM and TensorRT-LLM. PyTorch inference performance is also improving quickly, so the barrier to entry keeps dropping. On a related note, see the heated public exchange between Nvidia and AMD over MI300 vs. H100 LLM inference performance; AMD's response left Nvidia rather embarrassed, since Nvidia had initially published a misleading blog post.

With Mistral's launch of Mixtral, a fierce price war over inference costs has broken out, driven by venture-backed startups burning cash in the hope of becoming profitable at scale. OpenAI's GPT-3.5 Turbo is inherently much cheaper to run than Mixtral, and OpenAI's ability to maintain high margins comes largely from running at very large batch sizes, a scale of batching that competitors with smaller user bases simply cannot achieve.

OpenAI charges $1.00 per million input tokens and $2.00 per million output tokens. Despite having a higher-quality model with higher serving costs, Mistral had to price below OpenAI to attract customers: $0.65 per million input tokens and $1.96 per million output tokens. In effect, their pricing is dictated by market forces rather than by Mistral's inference costs and target margins. Note that the performance figures below are based on a custom inference stack that exists today, which we have discussed with a large number of deployers, not on unoptimized TensorRT-LLM or vLLM. Since Mistral has not yet built such a highly optimized custom inference stack, its actual performance is lower than these figures suggest.


We will dig into these numbers later, but in short, even in the most optimistic scenario, assuming two H100 GPUs rented around the clock at $1.95 per hour and running in BF16, Mistral barely breaks even, and only with sizable batch sizes. Of course, you can test the API yourself and see that their token throughput is quite fast, which implies they are not running at such large batch sizes. Their API is therefore most likely a loss-leader strategy, which is a logical necessity for attracting customers in a market with strong incumbents. Mistral's medium-term goal is presumably to build volume and eventually turn a profit as hardware and software costs come down.
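
As a sanity check on these economics, here is a back-of-the-envelope sketch of hardware cost per million output tokens. The $1.95 per GPU-hour rate and the throughput numbers are illustrative assumptions, not figures from our model.

```python
# Back-of-the-envelope serving cost per million output tokens.
# Assumptions: 2x H100 at $1.95 per GPU-hour, and hypothetical aggregate
# throughputs in tokens/second across all batched users.

def cost_per_million_tokens(num_gpus: int, price_per_gpu_hour: float,
                            tokens_per_second: float) -> float:
    """Dollar cost to generate one million output tokens."""
    cost_per_hour = num_gpus * price_per_gpu_hour
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

# If 2x H100 sustain 3,000 aggregate tokens/s (large batches):
print(cost_per_million_tokens(2, 1.95, 3_000))   # ~$0.36 per million tokens
# At only 500 tokens/s (small batches), the same hardware costs ~$2.17 per
# million, well above the prices being charged in this price war.
```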

Not to be outdone by Mistral, everyone is racing to offer Mixtral inference at lower prices, with a new company announcing its pricing every few hours. First came Together, proposing $0.60 per million output tokens with no charge for input. Then Perplexity at $0.14 input / $0.56 output, followed by Anyscale at $0.50 output. Then Deepinfra bid $0.27 for output… we thought that was the floor… but then OpenRouter made the service available for free! It should be noted that OpenRouter's claimed tokens per second were practically unachievable, and its rate limits were so strict that it was nearly impossible to test.

All of these inference services are currently at a loss.

It is worth noting that 2x H100 is not actually the best choice for Mixtral inference. 2x A100 80GB is more cost-effective, delivering roughly 32% more memory bandwidth per dollar (assuming similar memory bandwidth utilization). The A100's much lower floating-point throughput (FLOPS) has little impact on inference performance here. That said, at today's rock-bottom prices, even 2x A100 is not profitable. Later in this report we will also show the large inference advantages of the H200 and MI300X.
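
For intuition on the bandwidth-per-dollar claim, here is a quick comparison. The rental prices are assumptions and the bandwidths are approximate public specs, so the exact percentage will shift with market prices.

```python
# Rough memory-bandwidth-per-dollar comparison. Rental prices are assumptions;
# bandwidths are approximate public specs, not figures from the article.
GPUS = {
    # name: (memory bandwidth in TB/s, assumed rental price in $/GPU-hour)
    "H100 SXM (80GB)": (3.35, 1.95),
    "A100 (80GB)":     (2.00, 0.90),
}

for name, (bw_tbs, price) in GPUS.items():
    print(f"{name}: {bw_tbs / price:.2f} TB/s per $/hour")

# H100 SXM (80GB): 1.72 TB/s per $/hour
# A100 (80GB):     2.22 TB/s per $/hour
# That is ~29% more bandwidth per dollar under these assumed prices, in the
# same ballpark as the ~32% figure quoted above.
```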


Because Mixtral is a mixture-of-experts (MoE) model, its behavior changes significantly as batch size grows. At a batch size of 1, each forward pass activates only a small subset of the model's parameters, so the model can deliver strong per-token throughput with relatively little memory bandwidth and few floating-point operations (FLOPS) per token. But this ideal only holds when the batch size is 1 and there is enough memory capacity to hold the entire model.

As the batch size increases, more of the model's experts are activated, until every forward pass must read all of the model's parameters, even though each decoded token still passes through only two experts. As a result, MoE models like Mixtral and GPT-4 are more bandwidth-hungry than dense models.
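
A small sketch makes the batching effect concrete. It assumes 8 experts with top-2 routing and uniformly random token-to-expert assignment, which is a simplification of Mixtral's learned router.

```python
# Minimal sketch (not Mixtral's actual router) of why MoE inference gets
# bandwidth-hungry as batch size grows. Assumes 8 experts per layer, top-2
# routing, and tokens routed uniformly at random.

NUM_EXPERTS = 8
TOP_K = 2

def expected_fraction_of_experts_read(batch_size: int) -> float:
    """Expected fraction of a layer's expert weights touched in one forward pass."""
    p_not_chosen_by_one_token = 1 - TOP_K / NUM_EXPERTS        # 0.75
    p_expert_idle = p_not_chosen_by_one_token ** batch_size
    return 1 - p_expert_idle

for b in (1, 2, 4, 8, 16, 32):
    print(f"batch {b:>2}: ~{expected_fraction_of_experts_read(b):.0%} of expert weights read")

# batch  1: ~25%   (only 2 of 8 experts)
# batch  8: ~90%
# batch 16: ~99%   -> nearly the whole model is read per forward pass, while
# each token still only gets the compute of 2 experts.
```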


This has a significant impact on LLM inference, because costs scale very differently than for dense models. In short, increasing the batch size still reduces per-token cost for an MoE model, but because of the extra memory bandwidth required, the reduction is not as large as for a dense model. This is one of the main reasons a base model cannot simply be scaled up with ever more experts. For large-scale inference, large batch sizes are ideal, but MoE models benefit from them less than dense models do.

Together clearly has the best inference engine in this race: best in reliability, time to first token, tokens generated per second, and rate limits that are not artificially low, and they are committed to never silently changing how the model is quantized behind users' backs, as other providers do. So we decided to dig into their offering and use it as our benchmark.

We analyzed their test playground and API over the past few days. With temperature set to 0 and a fairly regular long sequence, we hit a peak of about 170 tokens per second per sequence. For a harder query of the same length, with temperature set to 2, we could only reach about 80 tokens per second.

Note that in reality, because Together serves many users and batch sizes are quite large, the situation is worse than these numbers suggest: the tokens-per-second figures above represent an ideal small-batch scenario. Our research indicates Together serves this on 2x A100 80GB systems rather than H100-based systems. The temperature and performance tests also show that Together uses a technique called speculative decoding.
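
For readers who want to reproduce this kind of probing, here is a rough sketch of measuring streamed tokens per second against an OpenAI-compatible endpoint. The base_url and model name are placeholders rather than any provider's actual values, and counting stream chunks only approximates the token count.

```python
# Rough tokens-per-second measurement against an OpenAI-compatible streaming
# endpoint. base_url and model are placeholders; chunk counting approximates
# token counting.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def measure_tokens_per_second(prompt: str, temperature: float) -> float:
    start = None
    chunks = 0
    stream = client.chat.completions.create(
        model="placeholder/mixtral-8x7b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=512,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if start is None:
                start = time.time()      # exclude time-to-first-token
            chunks += 1
    if start is None:
        return 0.0
    return chunks / (time.time() - start)

# Comparing temperature 0 vs. 2 on the same long prompt, as in the text,
# exposes how much speculative decoding is helping:
print(measure_tokens_per_second("Summarize the history of GPUs.", 0.0))
print(measure_tokens_per_second("Summarize the history of GPUs.", 2.0))
```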


Speculative Decoding / Medusa

We have explained speculative decoding in more detail elsewhere. In short, speculative decoding runs a small, fast draft model ahead of the large, slow model. The draft model proposes several tokens in advance, and the large model then verifies those proposals all at once rather than generating tokens one by one as usual. The verifier may accept the draft model's suggestions and emit multiple tokens per step, or reject them and fall back to generating one token at a time.
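
To make the mechanism concrete, here is a minimal greedy sketch of the draft-then-verify loop. The callables draft_next and target_logits are hypothetical stand-ins for the two models; this illustrates the general technique, not any provider's implementation.

```python
# Minimal greedy speculative-decoding sketch. `draft_next` and `target_logits`
# are placeholder callables for the small draft model and large target model.
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],                    # draft: next token id
    target_logits: Callable[[List[int]], List[List[float]]],   # target: logits per position
    num_draft: int = 4,
    max_new: int = 64,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) Draft model guesses `num_draft` tokens autoregressively (cheap).
        draft = []
        ctx = list(tokens)
        for _ in range(num_draft):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) Target model scores prompt + draft in ONE forward pass (one weight read).
        logits = target_logits(tokens + draft)

        # 3) Accept the longest prefix where the target's greedy choice agrees.
        accepted = []
        for i, guess in enumerate(draft):
            pos = len(tokens) + i - 1           # logits predicting position len(tokens)+i
            target_choice = max(range(len(logits[pos])), key=logits[pos].__getitem__)
            if target_choice == guess:
                accepted.append(guess)
            else:
                accepted.append(target_choice)  # take the target's correction and stop
                break
        tokens.extend(accepted)                 # always makes progress (>= 1 token)
    return tokens
```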

The main purpose of speculative decoding is to reduce the memory bandwidth needed per generated token. Unfortunately, techniques like speculative decoding do not help nearly as much on mixture-of-experts models such as Mixtral. As the effective batch size grows, so does the required memory bandwidth, since the draft model's different proposals get routed to different experts. Likewise, prefill tokens are more expensive on MoE models than on dense models.

Back to the temperature issue mentioned earlier. An LLM's temperature is essentially a dial that controls creativity or randomness. We tested different temperatures because at low temperature, the draft model is more likely to propose tokens that the verifier accepts, while at high temperature, the verifier's outputs become more erratic and the draft model's guesses are rejected more often. Varying the temperature is one of several ways to probe a model's real tokens-per-second rate; without it, techniques like speculative decoding would confound any attempt at reverse engineering or performance analysis.
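
As a concrete illustration of the temperature dial, here is a minimal sampling sketch. The logits are made up; real serving stacks apply the same rescaling inside their samplers.

```python
# What temperature does: rescale logits before softmax sampling. As T -> 0,
# sampling approaches greedy argmax; at high T the distribution flattens and
# outputs get harder for a draft model to predict.
import math, random

def sample_with_temperature(logits: list[float], temperature: float) -> int:
    if temperature <= 0:                      # treat T=0 as greedy decoding
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.2]
print(sample_with_temperature(logits, 0.0))   # always index 0
print(sample_with_temperature(logits, 2.0))   # frequently indices 1 or 2
```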


Quantization techniques

Quantization can significantly improve a model's speed and cost efficiency, but handled poorly it can severely degrade model quality. In general, a model needs fine-tuning after it is quantized. Some of today's low-cost vendors skip this necessary step, cutting corners and ignoring accuracy, and the output quality of their models falls far below that of the 16-bit Mixtral model.
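
To show where that quality loss comes from, here is a toy symmetric INT8 round trip with a single per-tensor scale. Production quantizers use finer-grained per-channel or per-group scales (and, as noted, fine-tuning afterwards), so this deliberately coarse version exaggerates the effect.

```python
# Toy symmetric INT8 weight-quantization round trip, illustrating the accuracy
# loss the text describes. Real stacks use per-channel/group scales.
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0          # one scale per tensor (coarse)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory: %.0f MB -> %.0f MB" % (w.nbytes / 2**20, q.nbytes / 2**20))
print("mean abs rounding error:", np.abs(w - w_hat).mean())
```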


We believe FP8 inference is possible without compromising model quality, but INT4 inference is not viable for these large models. Even with FP8, two H100 or A100 GPUs are still required, because throughput would otherwise fall well short of the 40-50+ concurrent users most chat apps need, and the size of the KV cache also has to be accounted for.
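
A quick back-of-the-envelope on the KV cache point, using Mixtral-8x7B's published attention configuration (32 layers, 8 KV heads, head dimension 128); the user count and context length below are illustrative assumptions, not the article's exact workload.

```python
# Rough KV-cache sizing for Mixtral-8x7B.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_bytes(context_tokens: int, users: int, bytes_per_value: int = 2) -> int:
    """Total KV cache size: keys + values for every layer, head, and token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value   # K and V
    return per_token * context_tokens * users

GiB = 2**30
print(kv_cache_bytes(8_192, 1) / GiB)    # ~1 GiB per user at 8k context (FP16)
print(kv_cache_bytes(8_192, 50) / GiB)   # ~50 GiB for 50 concurrent users
# Mixtral's ~47B weights already take ~87 GiB in FP16, so on 2x 80GB GPUs the
# KV cache quickly becomes the binding constraint at high user counts.
```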


H200 & MI300X Performance Analysis

The upcoming H200 and MI300X will change the game. They carry 141GB and 192GB of memory respectively, along with higher memory bandwidth than the H100 and A100. In our model, the cost per token on the H200 and MI300X beats the A100 and H100 available on the market today. We have also found large gains from dropping tensor parallelism entirely (Nvidia's current NCCL all-reduce implementation is poor).
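
As a rough illustration of why the extra bandwidth and capacity matter, here is a sketch of the memory-bandwidth-bound decode ceiling. The bandwidth figures are approximate public specs and the batch size is an assumption, so treat the outputs as ceilings rather than measured throughput.

```python
# Memory-bandwidth-bound ceiling on Mixtral decode throughput, ignoring
# KV-cache traffic and compute limits. Treating every forward pass as reading
# all ~47B weights is the large-batch MoE case discussed above.
MODEL_BYTES = 46.7e9 * 2        # Mixtral-8x7B weights in BF16/FP16

BANDWIDTH_TBS = {               # approximate peak memory bandwidth, TB/s
    "A100 80GB": 2.0,
    "H100 SXM":  3.35,
    "H200":      4.8,
    "MI300X":    5.3,
}

batch = 32
for name, bw in BANDWIDTH_TBS.items():
    passes_per_s = bw * 1e12 / MODEL_BYTES      # full weight reads per second
    print(f"{name}: ~{passes_per_s * batch:,.0f} tokens/s ceiling at batch {batch}")

# The H200 and MI300X also fit the whole FP16 model plus a useful KV cache in
# a single GPU's memory, which is what makes skipping tensor parallelism
# possible in the first place.
```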

Next, we will show how these systems stack up on cost-effectiveness. Keep in mind that they are still in the early stages of deployment. We expect them to eventually be paired with highly optimized custom inference stacks, just as some major providers already run on H100 systems. Today, neither Nvidia's closed-source TensorRT-LLM nor AMD's relatively open vLLM integration strategy delivers that level of optimization out of the box, but we expect this to change over time.

