Editor: Editorial Office
GPT-4 has made waves in the industry yet again! Parameters, architecture, training dataset, token counts, training and inference costs... everything has been dumped at once. Given the author's track record, this leak carries a certain amount of credibility.
Just now, OpenAI's GPT-4 has been "open sourced" by industry insiders!
The leak covers GPT-4's architecture, training and inference infrastructure, parameter count, training dataset, token counts, costs, the Mixture of Experts (MoE) design, and other very specific details.
In particular, it describes the trade-offs OpenAI weighed between different approaches, and how it overcame the biggest bottleneck in giant-model inference.
Who is behind such a heavyweight leak?
The authors are two contributors to SemiAnalysis, Dylan Patel and Gerald Wong.
It is worth mentioning that Dylan Patel was also one of the authors behind the earlier leak of the Google internal document that caused an uproar in the industry ("We have no moat, and neither does OpenAI").
DeepMind CEO Demis Hassabis recently confirmed the authenticity of that leaked Google document in an interview with The Verge.
It seems that Dylan Patel does have some special channels, which lends today's revelations a bit more credibility.
Mobvoi CEO Li Zhifei has also commented on the leak.
Many companies can make GPT-4
In the view of the author of the leak, the reason OpenAI is not open is not to protect humanity from being destroyed by AI, but because what they have built is replicable.
He even predicts that in the future, every major Chinese and American internet giant and AI startup will be able to build models on par with, or even stronger than, GPT-4.
But he also acknowledges that GPT-4 is OpenAI's masterpiece: it embodies the ingenuity of its engineers, a complex architecture, and a variety of clever engineering trade-offs.
OpenAI's most enduring moat is that they have feedback from real users, the top engineering talent in the industry, and continuous leadership brought about by first-mover advantage.
Model architecture
First of all, the author believes that GPT-4 contains a total of about 1.8 trillion parameters across 120 layers, whereas GPT-3 has only about 175 billion parameters.
That is, GPT-4 is more than 10 times larger than GPT-3.
Previously, the figure circulating online put GPT-4 at 1 trillion parameters, which now appears to be an underestimate.
To keep costs reasonable, OpenAI built GPT-4 as a Mixture of Experts (MoE) model.
Specifically, GPT-4 has 16 experts, each an MLP with approximately 111 billion parameters, and two of these experts are routed to on each forward pass.
While the literature discusses many advanced algorithms for deciding which experts each token is routed to, OpenAI's algorithm for GPT-4 is said to be fairly simple.
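The leak does not describe the router itself, so the following is only a generic sketch of the kind of simple top-2 gating used in standard MoE layers (NumPy, with made-up dimensions), not OpenAI's actual algorithm:

```python
import numpy as np

def top2_route(token_hidden, gate_weights):
    """Pick the 2 experts with the highest gate scores for one token."""
    logits = token_hidden @ gate_weights               # shape: (n_experts,)
    top2 = np.argsort(logits)[-2:]                     # indices of the 2 highest-scoring experts
    probs = np.exp(logits[top2] - logits[top2].max())
    probs /= probs.sum()                               # softmax over just the chosen 2
    return top2, probs                                 # experts and their mixing weights

rng = np.random.default_rng(0)
d_model, n_experts = 64, 16                            # 16 experts as in the leak; d_model is a toy value
experts, weights = top2_route(rng.normal(size=d_model),
                              rng.normal(size=(d_model, n_experts)))
print(experts, weights)
```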
In addition, about 55 billion parameters are shared for the attention mechanism.
For each forward-pass inference (generating one token), GPT-4 therefore only needs to use about 280 billion parameters and about 560 TFLOPs.
This stands in stark contrast to a purely dense model, which would need about 1.8 trillion parameters and roughly 3,700 TFLOPs per forward pass.
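A quick consistency check of those numbers, as a sketch using only the figures quoted above:

```python
# Sanity check of the per-token active-parameter figure quoted above
# (all inputs are the leaked numbers).
experts_active    = 2
params_per_expert = 111e9     # ~111B MLP parameters per expert
shared_attention  = 55e9      # ~55B shared attention parameters

active_params = experts_active * params_per_expert + shared_attention
print(f"active parameters per token: ~{active_params/1e9:.0f}B")   # ~277B, i.e. the ~280B quoted

# Dense comparison: a 1.8T-parameter dense model touches every parameter,
# roughly 1.8e12 / 2.77e11 ≈ 6.5x more weights per forward pass.
```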
The composition of the dataset
OpenAI trained GPT-4 with 13 trillion tokens.
These 13 trillion tokens are not all unique: because high-quality tokens are scarce, the dataset includes multiple epochs over the same data.
The dataset also includes millions of rows of instruction fine-tuning data from Scale AI and from internal sources.
However, the authors said that they did not find much information on these RLHF data.
The context length in the pre-training phase reaches 8K (seqlen), while the 32k version is fine-tuned based on the pre-trained 8K version.
The batch size was gradually ramped up over several days on the cluster, eventually reaching 60 million tokens.
Of course, this is "only" about 7.5 million tokens per expert, because not every expert sees every token.
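That per-expert figure follows directly from the numbers above (assuming perfectly even load balancing across experts, which in practice is only approximate):

```python
# How the "7.5 million tokens per expert" figure follows from a 60M-token batch
# under top-2-of-16 routing, assuming a perfectly even load balance.
global_batch_tokens = 60_000_000
n_experts           = 16
experts_per_token   = 2

tokens_per_expert = global_batch_tokens * experts_per_token / n_experts
print(f"{tokens_per_expert/1e6:.1f}M tokens per expert")   # 7.5M
```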
Parallel strategy
The parallelization strategy is crucial for making use of all the A100 GPUs.
OpenAI uses 8-way tensor parallelism, because that is the limit of what NVLink supports.
Beyond that, the author has heard that OpenAI uses 15-way pipeline parallelism.
In theory, considering data communication and compute time, 15 pipeline stages is on the high side.
But given the limits of memory capacity, that many stages makes sense.
With pure pipeline plus tensor parallelism, the FP16 parameters alone take about 30GB per GPU.
Once the KV cache and other overhead are added, this architecture makes sense in theory if most of the GPUs OpenAI uses are 40GB A100s.
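A sketch of the memory arithmetic behind that 30GB figure, using only the parameter count and parallelism degrees quoted above:

```python
# Why ~30GB of FP16 weights land on each GPU under 8-way tensor x 15-way
# pipeline parallelism (using the leaked 1.8T total parameter count).
total_params       = 1.8e12
tensor_parallel    = 8
pipeline_parallel  = 15
bytes_per_fp16     = 2

params_per_gpu = total_params / (tensor_parallel * pipeline_parallel)
weights_gb     = params_per_gpu * bytes_per_fp16 / 1e9
print(f"~{weights_gb:.0f} GB of weights per GPU")   # ~30 GB
# KV cache, activations, and communication buffers come on top of this,
# which is why 40GB A100s are a tight but plausible fit.
```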
OpenAI may be using ZeRO Stage 1, and may be using block-level FSDP or hybrid sharded data parallelism.
Why not use full-model FSDP? Probably because of the higher communication cost.
Although OpenAI has high-speed networks between most nodes, it does not cover all nodes.
At least some of these clusters will have much lower connection bandwidth than others.
However, the author says he does not understand how OpenAI avoids generating "huge bubbles" in each batch at such a high degree of pipeline parallelism; most likely OpenAI simply absorbs these costs.
Training costs
OpenAI trained GPT-4 using about 2.15e25 FLOPs, running on roughly 25,000 A100s for 90 to 100 days at a utilization rate of 32% to 36%.
This extremely low utilization is partly due to the large number of failures, which required restarting training from earlier checkpoints, with the pipeline-bubble cost mentioned above adding to it.
Waste of this kind is extremely expensive.
Another reason is that all-reduce between so many GPUs is very expensive.
This chart assumes that inefficiencies arise from the inability to fuse every operation, from the memory bandwidth required by the attention mechanism, and from hardware overhead equivalent to parameter reads. In reality, even with optimized libraries such as NVIDIA's FasterTransformer, the total overhead is even larger.
The author suspects that this cluster is actually a collection of smaller clusters with relatively weak network links between them: non-blocking 800G/1.6T connections within each section of the cluster, but only 200G/400G connections between the sections.
If OpenAI's cloud-computing cost is about $1 per A100-hour, then under these conditions the training cost works out to roughly $63 million.
This does not include all the experiments, failed training, and other costs such as data collection, RLHF, labor costs, etc.
If you take into account the factors just mentioned, the real cost is much higher.
Moreover, all of this assumes that someone else buys the chips, networking, and data centers, bears the capital expenditure, and leases the hardware to OpenAI.
Today, however, at $2 per H100-hour, the same pre-training could be done on about 8,192 H100s in just 55 days, at a cost of roughly $21.5 million.
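These figures can be roughly reproduced from the GPU counts, durations, and hourly rates quoted above (the small gap on the A100 number presumably comes from rounding in the leak):

```python
# Back-of-the-envelope check of the quoted cost figures
# (GPU count x hours x assumed hourly rate; results are approximate).
def cost(gpus, days, dollars_per_gpu_hour):
    return gpus * days * 24 * dollars_per_gpu_hour

a100 = cost(25_000, 95, 1.0)    # ~25k A100s, ~90-100 days, ~$1/A100-hour
h100 = cost(8_192, 55, 2.0)     # ~8,192 H100s, ~55 days, ~$2/H100-hour
print(f"A100 run: ~${a100/1e6:.0f}M")   # ~$57M, in the ballpark of the quoted ~$63M
print(f"H100 run: ~${h100/1e6:.1f}M")   # ~$21.6M, matching the quoted $21.5M
```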
The figure above shows parameter and token counts for some publicly known advanced models. The line in the graph is Google DeepMind's Chinchilla scaling observation (with its large error bars smoothed out); each point on the line shows the theoretical FLOPs required to train a model with that number of parameters and tokens.
However, the author says that by the end of this year, at least nine companies will have H100 clusters larger than the one described above.
Not all of these companies will devote them to training a single model, but any that do will be able to train models larger than GPT-4.
For example, Meta will have more than 100,000 H100s by the end of this year, but a significant portion of them will be distributed in its own data centers for inference.
But its largest single cluster will still exceed 25,000 H100.
In short, by the end of this year, many companies will have enough computing power resources to train GPT-4-sized models.
This table shows the theoretical optimal cost of training a model on NVIDIA A100s, without accounting for the manpower required, ML Ops tooling, data collection and preprocessing, failure recovery, one-shot/few-shot learning examples, inference, and so on. The cost of many of these pieces is surprisingly high.
Trade-offs of the Mixture of Experts model
MoE (Mixture of Experts) is a good way to reduce the number of parameters used at inference time while still increasing the total parameter count.
That larger parameter count is needed so that each training token can encode more information, because obtaining enough high-quality tokens is very difficult.
If OpenAI were really trying to maximize performance, they would need to train on twice as many tokens to get there.
That being said, OpenAI has made a lot of trade-offs.
For example, MoE is very hard to handle at inference time, because not every part of the model is used for every token generated.
This means that some parts may be dormant while others are working.
This situation significantly reduces utilization when serving users.
Researchers have shown that using 64 to 128 experts achieves lower loss than using 16 experts, but that is purely a research finding.
There are many reasons to use relatively few experts, and one reason OpenAI chose 16 is that larger numbers of experts are harder to generalize across many tasks.
More experts can also be harder to get to converge.
For a training run this large, OpenAI chose to be conservative in the number of experts.
In addition, using fewer experts helps their inference infrastructure. Moving to a Mixture of Experts inference architecture involves all sorts of difficult trade-offs.
The authors start with the basic trade-offs of LLM inference, and then discuss the problems OpenAI faces and the choices they make.
Inference trade-offs
Before getting into the inference trade-offs, as an aside: after talking with every LLM company, the whistleblower found that NVIDIA's FasterTransformer inference library is quite bad, and TensorRT is even worse.
This means that if NVIDIA doesn't modify it, people will also need to create their own solution from scratch.
There are three main trade-offs in inference for large language models, spanning the batch-size dimension (the number of users served simultaneously) and the number of chips used, as follows:
1. Latency
The model must respond within a reasonable latency. Nobody wants to wait several seconds in a chat application before output starts arriving. The processing times for prefill (input tokens) and decoding (output tokens) differ.
2. Throughput
The model must output a certain number of tokens per second. Humans need about 30 tokens per second. For a variety of other use cases, both lower and higher throughput are acceptable.
3. Utilization
The hardware the model runs on must be highly utilized, or the cost becomes prohibitive. Higher latency and lower throughput can be used to batch more user requests together for higher utilization, but they make everything harder.
The key to LLM inference is to balance the two points of memory bandwidth and computation.
LLM theoretical bandwidth requirements: assuming the largest model that can run on an iPhone 14 is ~1 billion FP16 parameters, or ~4 billion int4 parameters, this is the fundamental limit for smartphone-based LLMs; any larger model simply will not fit.
Put simply, every parameter must be read to generate a token, and each parameter carries 2 FLOPs of associated compute.
Therefore, the FLOPs-to-bandwidth ratio of most chips (an H100 SXM has only about 3 TB/s of memory bandwidth but 2,000 TFLOP/s of FP8 compute) is completely out of balance at a batch size of 1.
If there is only one user (batch size 1), then every time a token is generated, the memory bandwidth needed to read every parameter dominates the inference time, while the compute time is almost negligible.
To serve a large language model to many users efficiently, the batch size must exceed 1, so that multiple users amortize the cost of reading the parameters. For example, at a batch size of 256 or 512, each byte of memory read corresponds to 512 or 1,024 FLOPs of compute.
This ratio is closer to the balance between the memory bandwidth of the H100 and FLOPS. This helps achieve higher utilization, but at the cost of higher latency.
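The batch-size numbers above can be reproduced with a small sketch; the assumption of 1-byte (FP8/int8) weights and ~2 FLOPs per parameter per token is what makes the 512/1,024 figures come out:

```python
# Arithmetic intensity of decoding versus batch size, assuming ~2 FLOPs per
# parameter per token and 1-byte (FP8/int8) weights -- the assumptions under
# which the 512 / 1,024 FLOPs-per-byte figures above are obtained.
h100_flops = 2000e12   # ~2,000 TFLOP/s FP8, as quoted
h100_bw    = 3e12      # ~3 TB/s HBM bandwidth, as quoted
balance    = h100_flops / h100_bw    # ~667 FLOPs needed per byte read to keep the compute units busy

for batch in (1, 256, 512):
    flops_per_byte = 2 * batch / 1   # each parameter byte is reused by every sequence in the batch
    regime = "compute-bound" if flops_per_byte >= balance else "memory-bandwidth-bound"
    print(f"batch {batch:3d}: {flops_per_byte:6.0f} FLOPs/byte  ({regime})")
```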
Many people think that memory capacity is a major bottleneck in LLM inference, because large models require multiple chips for inference, and higher memory capacity means they can accommodate fewer chips.
However, it is actually better to use more chips in order to reduce latency, increase throughput, and use larger batch sizes for higher utilization.
GPT-4 Inference Tradeoffs and Infrastructure
As discussed above, GPT-4 inference is already difficult; being a MoE model adds a whole new set of difficulties on top.
Each forward pass of the generated token can be routed to a different expert group. This creates a trade-off between throughput, latency, and utilization at larger batch sizes.
OpenAI's GPT-4 has 16 experts, and each forward pass is routed to 2 of them.
This means that at a batch size of 8, the parameter reads for each expert may correspond to a batch size of only 1.
Worse, this could mean that one expert has a batch size of 8 and the other experts have a batch size of 4, 1, or 0.
Each time a token is generated, the routing algorithm sends the forward pass in a different direction, causing significant variation in token-to-token latency and in per-expert batch sizes.
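The effect is easy to see with a toy simulation (random routing stands in for the real gating function here):

```python
# Toy illustration of how a small batch fragments across MoE experts: route
# each token in a batch of 8 to 2 of 16 experts at random and count the
# per-expert batch sizes. (Purely illustrative; not OpenAI's actual router.)
import random
from collections import Counter

random.seed(0)
n_experts, batch_size, top_k = 16, 8, 2

counts = Counter()
for _ in range(batch_size):
    for expert in random.sample(range(n_experts), top_k):
        counts[expert] += 1

print([counts.get(e, 0) for e in range(n_experts)])
# Most experts see only 0, 1, or 2 tokens, so their parameter reads are
# barely amortized at all.
```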
Inference infrastructure is one of the main reasons why OpenAI selects a smaller number of experts. If they choose more experts, memory bandwidth becomes a bottleneck for inference.
OpenAI's inference clusters can typically reach batch sizes of 4k+, which means that even with the best load balancing across experts, each expert only sees a batch size of around 500. Reaching this requires an enormous amount of usage.
According to the whistleblower, OpenAI runs inference on clusters of 128 GPUs, and has multiple such clusters across several data centers and geographies.
Inference uses 8-way tensor parallel and 16-way pipeline parallelism. Each node consisting of 8 GPUs has only about 130B parameters, or less than 30GB per GPU under FP16 and less than 15GB under FP8/int8.
This allows inference to be run on a 40GB A100, as long as the KV cache size for all batches is not excessive.
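A rough check of those per-GPU numbers under the stated parallelism (the gap between ~112B and the quoted ~130B per node is presumably replication or rounding; treat this as an approximation):

```python
# Rough per-GPU weight footprint for the inference setup described above
# (128 GPUs = 8-way tensor parallel x 16-way pipeline parallel, 1.8T parameters).
total_params = 1.8e12
tp, pp       = 8, 16

params_per_node = total_params / pp     # ~112B per 8-GPU pipeline stage (the text quotes ~130B)
params_per_gpu  = params_per_node / tp  # ~14B per GPU

for name, bytes_per_param in (("FP16", 2), ("FP8/int8", 1)):
    print(f"{name}: ~{params_per_gpu * bytes_per_param / 1e9:.0f} GB of weights per GPU")
# ~28 GB in FP16 and ~14 GB in FP8/int8, consistent with the <30GB / <15GB figures above.
```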
The layers containing the different experts are not split across different nodes, because that would make network traffic too irregular and make recomputing the KV cache between each generated token far too expensive.
For future MoE model extensions and conditional routing, the biggest difficulty is how to handle KV cached routing.
The model has 120 layers, so they could simply be split evenly across 15 nodes, but since the first node also has to handle data loading and embedding, it makes sense to place fewer layers on the head node of the inference cluster.
In addition, there are rumors of "speculative decoding" (more on this below), which would also explain why the head node needs to hold fewer layers.
Inference costs
Compared with the 175-billion-parameter DaVinci model, GPT-4 costs three times as much, even though its feed-forward parameters only grow by a factor of 1.6.
This is mainly because GPT-4 requires larger clusters and achieves lower utilization.
The author estimates that inference on GPT-4's 8k sequence length costs $0.0049 per 1,000 tokens on a cluster of 128 A100s, and $0.0021 per 1,000 tokens on a cluster of 128 H100s.
Note that this is assuming that there is a fairly high utilization and a high batch size is maintained.
But it's clear that OpenAI is sometimes very underutilized.
The author hypothesizes that OpenAI reduces inference costs by shutting down clusters, reconfiguring nodes, resuming training of smaller test models, and experimenting with various new techniques.
If OpenAI doesn't, their utilization will be lower and the cost will more than double.
Multi-query attention
In addition, OpenAI is also using Multi-Query Attention (MQA).
Address: https://arxiv.org/pdf/1911.02150.pdf
In short, only a single key/value attention head is needed, which significantly reduces the memory footprint of the KV cache.
Even so, GPT-4 with a 32k context definitely cannot run on 40GB A100s, and the 8k version is capped in its maximum batch size.
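To see the scale of the KV-cache saving from MQA, here is a sketch; the 120-layer count comes from the leak, while the head count and head dimension are purely hypothetical round numbers:

```python
# KV-cache footprint per token under multi-head vs. multi-query attention.
# n_layers is the leaked figure; n_heads and head_dim are HYPOTHETICAL values
# chosen only to illustrate the size of the saving.
n_layers, n_heads, head_dim, bytes_fp16 = 120, 96, 128, 2

def kv_bytes_per_token(kv_heads):
    return 2 * n_layers * kv_heads * head_dim * bytes_fp16   # factor 2 = keys + values

mha = kv_bytes_per_token(n_heads)   # one K/V head per query head
mqa = kv_bytes_per_token(1)         # a single shared K/V head (MQA)
print(f"MHA: {mha/1e6:.1f} MB/token, MQA: {mqa/1e3:.0f} KB/token ({mha // mqa}x smaller)")
```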
Continuous batching
OpenAI implements variable batch sizes and continuous batching.
This allows a certain amount of maximum latency to be tolerated while optimizing inference cost.
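A minimal sketch of the scheduling idea behind continuous batching (purely illustrative; the request lengths and slot count are made up): finished requests free their slot immediately instead of the whole batch waiting for the slowest request.

```python
from collections import deque

# Each queued request needs a different number of output tokens (made-up values).
queue     = deque([("req%d" % i, need) for i, need in enumerate([3, 1, 4, 2, 5])])
max_batch = 2
active    = {}    # request id -> tokens still to generate
steps     = 0

while queue or active:
    # Admit new requests whenever a batch slot is free (the "continuous" part).
    while queue and len(active) < max_batch:
        rid, need = queue.popleft()
        active[rid] = need
    # One decode step: every active request emits one token.
    for rid in list(active):
        active[rid] -= 1
        if active[rid] == 0:
            del active[rid]          # finished -> slot is reusable on the very next step
    steps += 1

print(f"all requests served in {steps} decode steps")
```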
Speculative Decoding
OpenAI reportedly uses "speculative decoding" in GPT-4 inference, although this is not 100% certain.
The token-to-token latency variation, and the difference between simple retrieval tasks and more complex tasks, seems to indicate that this is possible, although there are still too many variables to determine.
Here, the whistleblower explains it using (with appropriate modifications/additions) text from the study "Accelerating LLM Inference with Staged Speculative Decoding".
Using an LLM is generally divided into two phases.
The first is prefill: the prompt text is run through the model to generate the KV cache and the log probabilities of the first output (a probability distribution over possible output tokens). This phase is usually fast, because the entire prompt can be processed in parallel.
The second phase is decoding: a token is selected from the output log probabilities and fed back into the model, which then produces the log probabilities for the next token. This repeats until the required number of tokens has been generated.
Because decoding must happen sequentially, the full set of weights has to be streamed through the compute units each time a single token is generated. As a result, at small batch sizes the arithmetic intensity of this second phase (i.e., FLOPs computed per byte of memory bandwidth) is very low, which makes decoding usually the most expensive part of autoregressive generation.
This is why in OpenAI's API calls, the input token is much cheaper than the output token.
The basic idea of "speculative decoding" is to use a smaller, faster draft model to decode several tokens in advance, and then feed them all into the larger model as a single batch.
If the draft model's predictions are correct, i.e. the larger model agrees with them, then multiple tokens can be decoded using a single batch, which can save a lot of memory bandwidth and time.
However, if the larger model rejects the token predicted by the draft model, the remaining batches are discarded and the algorithm naturally reverts to standard token-by-token decoding.
"Conjecture decoding" may also be accompanied by a scheme of rejecting sampling to sample from the original distribution. It's worth noting that this is only useful in small batch setups where bandwidth is the bottleneck.
"Speculative decoding" in exchange for bandwidth has become an attractive performance engineering goal for two key reasons:
First, it does not degrade the model quality. Second, the performance improvements it provides are often orthogonal to other methods, as their performance comes from converting "sequential execution" to "parallel execution".
Current speculative methods predict a single draft sequence per batch. However, this approach does not scale well to large batch sizes, or when the draft model is poorly aligned with the large model.
Intuitively, the probability that the two models agree on a long consecutive sequence of tokens drops off exponentially, which means the gains from speculative decoding fall quickly as arithmetic intensity increases.
The whistleblower believes that if OpenAI uses "speculative decoding", they may only use it in a sequence of about 4 tokens.
Incidentally, the whole conspiracy theory about OpenAI "nerfing" GPT-4 may simply come down to them letting the larger model accept lower-probability sequences from the speculative-decoding draft model.
There has also been speculation that Bard uses "speculative decoding", since Google waits until the entire sequence is fully generated before sending it to the user, but the whistleblower believes this guess is completely wrong.
Visual multimodality
Visual multimodal capabilities are the least impressive part of GPT-4, at least compared to leading research.
Of course, no one has yet commercialized the results of multimodal LLM.
According to the whistleblower, it is a separate vision encoder with cross-attention, with an architecture similar to Flamingo, which adds more parameters on top of GPT-4's 1.8T.
The GPT-4 multimodal capability is fine-tuned with about 2 trillion tokens after text pre-training.
It is said that OpenAI originally hoped to train the vision model from scratch, but because it was not mature enough, they had to settle for fine-tuning from the text-trained model.
The next generation model, GPT-5, should train a vision model from scratch and be able to generate images and even audio.
One of the main purposes of such visual capabilities is to allow autonomous agents to read web pages and transcribe content in images and videos.
It is worth mentioning that the data OpenAI uses to train the multimodal model includes "joint data" (LaTeX/text), screenshots of web pages, and YouTube videos (sampled frames, plus running Whisper to obtain transcripts).
An interesting fact about all this LLM over-optimization is that the IO cost of the vision model differs from that of the text model: data-loading IO for the vision model is roughly 150 times that of the text model.
The vision model's IO cost is far from low:
each token in the vision model is 600 bytes, versus 4 bytes per text token.
This therefore demands a lot of work on image compression. It is extremely important for hardware vendors, who are optimizing hardware two to three years out around LLM use cases and ratios.
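The 150x figure follows directly from the per-token byte counts just quoted:

```python
bytes_per_vision_token = 600   # as quoted above
bytes_per_text_token   = 4     # as quoted above
print(f"data-loading IO ratio: ~{bytes_per_vision_token / bytes_per_text_token:.0f}x")   # ~150x
```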
They may find themselves in a world where each model has powerful visual and audio capabilities.
They may find that their architectures adapt poorly to such models.
Overall, model architectures will certainly evolve beyond the simplified text-based dense models, and the MoE models, that we see today.
Resources:
https://www.semianalysis.com/p/gpt-4-architecture-infrastructure