
The price war in large language model inference: winning on scale?

Source | CSDN

This article is reprinted with permission from Mr. Baoyu's personal blog (Weibo @Baoyu xp), link: https://baoyu.io/translations/llm/inference-race-to-the-bottom-make

Authors | Dylan Patel, Daniel Nishball    Editor | Xia Meng

Translation | baoyu.io


Mixtral inference costs on H100, MI300X, H200, and A100, plus speculative decoding

Currently, besides OpenAI, five companies have models that surpass GPT-3.5 on multiple benchmarks: Mistral's Mixtral, Inflection-2, Anthropic's Claude 2, Google's Gemini Pro, and X.AI's Grok. Even more surprising, Mistral and X.AI achieved this with teams of fewer than 20 people each. We also expect Meta, Databricks, 01.AI (Yi), Baidu, and ByteDance to surpass GPT-3.5 soon. Of course, these results come from benchmarks, and some companies are rumored to train on evaluation data… but let's not dwell on that little detail.

For those keeping count, that makes 11 companies in total within just a few months. Clearly, pre-training GPT-3.5-class models has become commonplace. OpenAI still leads with GPT-4, but that lead has shrunk significantly. While we believe the top-end models will capture most of the long-term value, the next tier of models, especially after fine-tuning, will also carve out a multi-billion-dollar segment of the market on quality and cost.

So, if these models are ubiquitous, which companies can profit from them?

Companies that reach customers directly through complete software-as-a-service offerings or social media, and thus own unique distribution channels, will have an advantage. Companies that provide end-to-end training or fine-tuning services, helping others handle every stage from data to serving, will have an advantage. Companies that can provide data protection and ensure all models are used legitimately will also have an advantage. Companies that simply serve open models will have no competitive advantage.

Some of these advantages show up clearly in Microsoft's Azure GPT API versus OpenAI's. In terms of inference volume across public and private instances, Microsoft handles more than OpenAI's own API. For cautious enterprises, the security, data assurances, and bundled service contracts Microsoft provides matter. At the same time, these protections make it easier for bad actors to evade accountability, as shown by ByteDance's use of Azure GPT-4 to help train its upcoming large language models (LLMs).

The truth is, if you are not the market leader, you have to compete on price to attract customers. Google, for example, offers 60 free API calls per minute for Gemini Pro, its GPT-3.5 competitor. Google is not alone: almost everyone today is losing money on large language model (LLM) inference.

Serving purely open models has become commoditized. Although the operating costs of such a service are significant, the initial capital requirements are low. Many sub-cloud providers offer attractive pricing (mainly because they need to earn a return on their GPU investments), though using these services can carry certain security risks.

It is trivially easy for companies to rent a few GPUs and start serving open-source models on Nvidia and AMD hardware with libraries like vLLM and TensorRT-LLM. PyTorch inference performance is also improving quickly, so the barrier to entry keeps dropping. On a related note, see the heated public exchange between Nvidia and AMD over MI300 vs. H100 LLM inference performance; AMD's response left Nvidia rather embarrassed, since Nvidia had initially published a misleading blog post.

With Mistral's launch of Mixtral, a fierce price war over inference costs has broken out, driven by venture-backed startups burning cash in the hope of becoming profitable at scale. OpenAI's GPT-3.5 Turbo is inherently much cheaper to run than Mixtral, and OpenAI's ability to maintain high margins comes largely from running at very large batch sizes, a scale of batching that competitors with smaller user bases simply cannot achieve.

OpenAI charges $1.00 per million input tokens and $2.00 per million output tokens. Despite having a higher-quality model with higher serving costs, Mistral had to price below OpenAI to attract customers: $0.65 per million input tokens and $1.96 per million output tokens. In effect, their pricing is dictated by market forces rather than by Mistral's inference costs and target margins. Note that the performance figures below are based on a custom inference stack that exists today, which we have discussed with a large number of deployers, not on unoptimized TensorRT-LLM or vLLM. Since Mistral has not yet built such a highly optimized custom inference stack, its actual performance is lower than these figures suggest.


We will dig into these numbers later, but in short, even in the most optimistic scenario, assuming two H100 GPUs rented around the clock at $1.95 per hour and running in BF16, Mistral barely breaks even, and only with sizable batch sizes. Of course, you can test the API yourself and see that their token throughput is quite fast, which implies they are not running at such large batch sizes. Their API is therefore most likely a loss-leader strategy, which is a logical necessity for attracting customers in a market with strong incumbents. Mistral's medium-term goal is presumably to build volume and eventually turn a profit as hardware and software costs come down.
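
As a sanity check on these economics, here is a back-of-the-envelope sketch of hardware cost per million output tokens. The $1.95 per GPU-hour rate and the throughput numbers are illustrative assumptions, not figures from our model.

```python
# Back-of-the-envelope serving cost per million output tokens.
# Assumptions: 2x H100 at $1.95 per GPU-hour, and hypothetical aggregate
# throughputs in tokens/second across all batched users.

def cost_per_million_tokens(num_gpus: int, price_per_gpu_hour: float,
                            tokens_per_second: float) -> float:
    """Dollar cost to generate one million output tokens."""
    cost_per_hour = num_gpus * price_per_gpu_hour
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

# If 2x H100 sustain 3,000 aggregate tokens/s (large batches):
print(cost_per_million_tokens(2, 1.95, 3_000))   # ~$0.36 per million tokens
# At only 500 tokens/s (small batches), the same hardware costs ~$2.17 per
# million, well above the prices being charged in this price war.
```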

Not to be outdone by Mistral, everyone is racing to offer Mixtral inference at lower prices, with a new company announcing its pricing every few hours. First came Together, proposing $0.60 per million output tokens with no charge for input. Then Perplexity at $0.14 input / $0.56 output, followed by Anyscale at $0.50 output. Then Deepinfra bid $0.27 for output… we thought that was the floor… but then OpenRouter made the service available for free! It should be noted that OpenRouter's claimed tokens per second were practically unachievable, and its rate limits were so strict that it was nearly impossible to test.

All of these inference services are currently at a loss.

It is worth noting that 2x H100 is not actually the best choice for Mixtral inference. 2x A100 80GB is more cost-effective, delivering roughly 32% more memory bandwidth per dollar (assuming similar memory bandwidth utilization). The A100's much lower floating-point throughput (FLOPS) has little impact on inference performance here. That said, at today's rock-bottom prices, even 2x A100 is not profitable. Later in this report we will also show the large inference advantages of the H200 and MI300X.
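
For intuition on the bandwidth-per-dollar claim, here is a quick comparison. The rental prices are assumptions and the bandwidths are approximate public specs, so the exact percentage will shift with market prices.

```python
# Rough memory-bandwidth-per-dollar comparison. Rental prices are assumptions;
# bandwidths are approximate public specs, not figures from the article.
GPUS = {
    # name: (memory bandwidth in TB/s, assumed rental price in $/GPU-hour)
    "H100 SXM (80GB)": (3.35, 1.95),
    "A100 (80GB)":     (2.00, 0.90),
}

for name, (bw_tbs, price) in GPUS.items():
    print(f"{name}: {bw_tbs / price:.2f} TB/s per $/hour")

# H100 SXM (80GB): 1.72 TB/s per $/hour
# A100 (80GB):     2.22 TB/s per $/hour
# That is ~29% more bandwidth per dollar under these assumed prices, in the
# same ballpark as the ~32% figure quoted above.
```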


Because Mixtral is a mixture-of-experts (MoE) model, its behavior changes significantly as batch size grows. At a batch size of 1, each forward pass activates only a small subset of the model's parameters, so the model can deliver strong per-token throughput with relatively little memory bandwidth and few floating-point operations (FLOPS) per token. But this ideal only holds when the batch size is 1 and there is enough memory capacity to hold the entire model.

As the batch size increases, more of the model's experts are activated, until every forward pass must read all of the model's parameters, even though each decoded token still passes through only two experts. As a result, MoE models like Mixtral and GPT-4 are more bandwidth-hungry than dense models.
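
A small sketch makes the batching effect concrete. It assumes 8 experts with top-2 routing and uniformly random token-to-expert assignment, which is a simplification of Mixtral's learned router.

```python
# Minimal sketch (not Mixtral's actual router) of why MoE inference gets
# bandwidth-hungry as batch size grows. Assumes 8 experts per layer, top-2
# routing, and tokens routed uniformly at random.

NUM_EXPERTS = 8
TOP_K = 2

def expected_fraction_of_experts_read(batch_size: int) -> float:
    """Expected fraction of a layer's expert weights touched in one forward pass."""
    p_not_chosen_by_one_token = 1 - TOP_K / NUM_EXPERTS        # 0.75
    p_expert_idle = p_not_chosen_by_one_token ** batch_size
    return 1 - p_expert_idle

for b in (1, 2, 4, 8, 16, 32):
    print(f"batch {b:>2}: ~{expected_fraction_of_experts_read(b):.0%} of expert weights read")

# batch  1: ~25%   (only 2 of 8 experts)
# batch  8: ~90%
# batch 16: ~99%   -> nearly the whole model is read per forward pass, while
# each token still only gets the compute of 2 experts.
```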


This has a significant impact on LLM inference, because costs scale very differently than for dense models. In short, increasing the batch size still reduces per-token cost for an MoE model, but because of the extra memory bandwidth required, the reduction is not as large as for a dense model. This is one of the main reasons a base model cannot simply be scaled up with ever more experts. For large-scale inference, large batch sizes are ideal, but MoE models benefit from them less than dense models do.

Together clearly has the best inference engine in this race: best in reliability, time to first token, tokens generated per second, and rate limits that are not artificially low, and they are committed to never silently changing how the model is quantized behind users' backs, as other providers do. So we decided to dig into their offering and use it as our benchmark.

We analyzed their test playground and API over the past few days. With temperature set to 0 and a fairly regular long sequence, we hit a peak of about 170 tokens per second per sequence. For a harder query of the same length, with temperature set to 2, we could only reach about 80 tokens per second.

Note that in reality, because Together serves many users and batch sizes are quite large, the situation is worse than these numbers suggest: the tokens-per-second figures above represent an ideal small-batch scenario. Our research indicates Together serves this on 2x A100 80GB systems rather than H100-based systems. The temperature and performance tests also show that Together uses a technique called speculative decoding.
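
For readers who want to reproduce this kind of probing, here is a rough sketch of measuring streamed tokens per second against an OpenAI-compatible endpoint. The base_url and model name are placeholders rather than any provider's actual values, and counting stream chunks only approximates the token count.

```python
# Rough tokens-per-second measurement against an OpenAI-compatible streaming
# endpoint. base_url and model are placeholders; chunk counting approximates
# token counting.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def measure_tokens_per_second(prompt: str, temperature: float) -> float:
    start = None
    chunks = 0
    stream = client.chat.completions.create(
        model="placeholder/mixtral-8x7b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=512,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if start is None:
                start = time.time()      # exclude time-to-first-token
            chunks += 1
    if start is None:
        return 0.0
    return chunks / (time.time() - start)

# Comparing temperature 0 vs. 2 on the same long prompt, as in the text,
# exposes how much speculative decoding is helping:
print(measure_tokens_per_second("Summarize the history of GPUs.", 0.0))
print(measure_tokens_per_second("Summarize the history of GPUs.", 2.0))
```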


Speculative Decoding / Medusa

We have explained speculative decoding in more detail elsewhere. In short, speculative decoding runs a small, fast draft model ahead of the large, slow model. The draft model proposes several tokens in advance, and the large model then verifies those proposals all at once rather than generating tokens one by one as usual. The verifier may accept the draft model's suggestions and emit multiple tokens per step, or reject them and fall back to generating one token at a time.
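
To make the mechanism concrete, here is a minimal greedy sketch of the draft-then-verify loop. The callables draft_next and target_logits are hypothetical stand-ins for the two models; this illustrates the general technique, not any provider's implementation.

```python
# Minimal greedy speculative-decoding sketch. `draft_next` and `target_logits`
# are placeholder callables for the small draft model and large target model.
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],                    # draft: next token id
    target_logits: Callable[[List[int]], List[List[float]]],   # target: logits per position
    num_draft: int = 4,
    max_new: int = 64,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) Draft model guesses `num_draft` tokens autoregressively (cheap).
        draft = []
        ctx = list(tokens)
        for _ in range(num_draft):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) Target model scores prompt + draft in ONE forward pass (one weight read).
        logits = target_logits(tokens + draft)

        # 3) Accept the longest prefix where the target's greedy choice agrees.
        accepted = []
        for i, guess in enumerate(draft):
            pos = len(tokens) + i - 1           # logits predicting position len(tokens)+i
            target_choice = max(range(len(logits[pos])), key=logits[pos].__getitem__)
            if target_choice == guess:
                accepted.append(guess)
            else:
                accepted.append(target_choice)  # take the target's correction and stop
                break
        tokens.extend(accepted)                 # always makes progress (>= 1 token)
    return tokens
```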

The main purpose of speculative decoding is to reduce the memory bandwidth needed per generated token. Unfortunately, techniques like speculative decoding do not help nearly as much on mixture-of-experts models such as Mixtral. As the effective batch size grows, so does the required memory bandwidth, since the draft model's different proposals get routed to different experts. Likewise, prefill tokens are more expensive on MoE models than on dense models.

Back to the temperature issue mentioned earlier. An LLM's temperature is essentially a dial that controls creativity or randomness. We tested different temperatures because at low temperature, the draft model is more likely to propose tokens that the verifier accepts, while at high temperature, the verifier's outputs become more erratic and the draft model's guesses are rejected more often. Varying the temperature is one of several ways to probe a model's real tokens-per-second rate; without it, techniques like speculative decoding would confound any attempt at reverse engineering or performance analysis.
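
As a concrete illustration of the temperature dial, here is a minimal sampling sketch. The logits are made up; real serving stacks apply the same rescaling inside their samplers.

```python
# What temperature does: rescale logits before softmax sampling. As T -> 0,
# sampling approaches greedy argmax; at high T the distribution flattens and
# outputs get harder for a draft model to predict.
import math, random

def sample_with_temperature(logits: list[float], temperature: float) -> int:
    if temperature <= 0:                      # treat T=0 as greedy decoding
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.2]
print(sample_with_temperature(logits, 0.0))   # always index 0
print(sample_with_temperature(logits, 2.0))   # frequently indices 1 or 2
```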


Quantization techniques

Quantization can significantly improve a model's speed and cost efficiency, but handled poorly it can severely degrade model quality. In general, a model needs fine-tuning after it is quantized. Some of today's low-cost vendors skip this necessary step, cutting corners and ignoring accuracy, and the output quality of their models falls far below that of the 16-bit Mixtral model.
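
To show where that quality loss comes from, here is a toy symmetric INT8 round trip with a single per-tensor scale. Production quantizers use finer-grained per-channel or per-group scales (and, as noted, fine-tuning afterwards), so this deliberately coarse version exaggerates the effect.

```python
# Toy symmetric INT8 weight-quantization round trip, illustrating the accuracy
# loss the text describes. Real stacks use per-channel/group scales.
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0          # one scale per tensor (coarse)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory: %.0f MB -> %.0f MB" % (w.nbytes / 2**20, q.nbytes / 2**20))
print("mean abs rounding error:", np.abs(w - w_hat).mean())
```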


We believe FP8 inference is possible without compromising model quality, but INT4 inference is not viable for these large models. Even with FP8, two H100 or A100 GPUs are still required, because throughput would otherwise fall well short of the 40-50+ concurrent users most chat apps need, and the size of the KV cache also has to be accounted for.
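
A quick back-of-the-envelope on the KV cache point, using Mixtral-8x7B's published attention configuration (32 layers, 8 KV heads, head dimension 128); the user count and context length below are illustrative assumptions, not the article's exact workload.

```python
# Rough KV-cache sizing for Mixtral-8x7B.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_bytes(context_tokens: int, users: int, bytes_per_value: int = 2) -> int:
    """Total KV cache size: keys + values for every layer, head, and token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value   # K and V
    return per_token * context_tokens * users

GiB = 2**30
print(kv_cache_bytes(8_192, 1) / GiB)    # ~1 GiB per user at 8k context (FP16)
print(kv_cache_bytes(8_192, 50) / GiB)   # ~50 GiB for 50 concurrent users
# Mixtral's ~47B weights already take ~87 GiB in FP16, so on 2x 80GB GPUs the
# KV cache quickly becomes the binding constraint at high user counts.
```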


H200 & MI300X Performance Analysis

The upcoming H200 and MI300X will change the game. They carry 141GB and 192GB of memory respectively, along with higher memory bandwidth than the H100 and A100. In our model, the cost per token on the H200 and MI300X beats the A100 and H100 available on the market today. We have also found large gains from dropping tensor parallelism entirely (Nvidia's current NCCL all-reduce implementation is poor).
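
As a rough illustration of why the extra bandwidth and capacity matter, here is a sketch of the memory-bandwidth-bound decode ceiling. The bandwidth figures are approximate public specs and the batch size is an assumption, so treat the outputs as ceilings rather than measured throughput.

```python
# Memory-bandwidth-bound ceiling on Mixtral decode throughput, ignoring
# KV-cache traffic and compute limits. Treating every forward pass as reading
# all ~47B weights is the large-batch MoE case discussed above.
MODEL_BYTES = 46.7e9 * 2        # Mixtral-8x7B weights in BF16/FP16

BANDWIDTH_TBS = {               # approximate peak memory bandwidth, TB/s
    "A100 80GB": 2.0,
    "H100 SXM":  3.35,
    "H200":      4.8,
    "MI300X":    5.3,
}

batch = 32
for name, bw in BANDWIDTH_TBS.items():
    passes_per_s = bw * 1e12 / MODEL_BYTES      # full weight reads per second
    print(f"{name}: ~{passes_per_s * batch:,.0f} tokens/s ceiling at batch {batch}")

# The H200 and MI300X also fit the whole FP16 model plus a useful KV cache in
# a single GPU's memory, which is what makes skipping tensor parallelism
# possible in the first place.
```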

Next, we will show how these systems stack up on cost-effectiveness. Keep in mind that they are still in the early stages of deployment. We expect them to eventually be paired with highly optimized custom inference stacks, just as some major providers already run on H100 systems. Today, neither Nvidia's closed-source TensorRT-LLM nor AMD's relatively open vLLM integration strategy delivers that level of optimization out of the box, but we expect this to change over time.

