
Demystifying GPT-4: The Engineering Trade-Offs OpenAI Made in Architecture Design

Author: GGV Capital

GGV has something to say:

Recently, a document claiming to reveal GPT-4's internal technical details circulated online. The original source:

GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE, by Dylan Patel and Gerald Wong, July 11, 2023

The document details GPT-4's architecture, infrastructure, training dataset, costs, vision capabilities, and MoE design.

Speculation and leaks about the GPT-4 architecture have been circulating for months. In today's GGView, let's look at the engineering trade-offs OpenAI made in designing GPT-4.

Source | Web3 Sky City

OpenAI keeps the GPT-4 architecture closed not because of some existential risk to humanity, but because what they have built is replicable. In fact, we expect Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, Baidu, and others to have models as capable as GPT-4 in the near term. Make no mistake, OpenAI has amazing engineering capability and what they built is incredible, but the solution they arrived at is not magic. It is an elegant solution built on many complex trade-offs, and scaling up is only part of the battle. OpenAI's most enduring advantage is that they have the most real-world usage, leading engineering talent, and can continue to outpace other companies on future models.

We have collected a great deal of information about GPT-4 from many sources, and today we want to share it. This includes the model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token count, layer count, parallelism strategy, multimodal vision adaptation, the thought process behind the various engineering trade-offs, the unique techniques implemented, and how they mitigated some of the biggest bottlenecks of huge-model inference. The most interesting part of GPT-4 is understanding why they made the architectural decisions they did. In addition, we will outline the cost of training and inference for GPT-4 on A100s, and how it scales with H100s for next-generation model architectures.

First, the problem statement. Going from GPT-3 to GPT-4, OpenAI wanted to scale up 100x, but cost is the thorny issue. Dense Transformer models cannot scale much further. The dense Transformer is the architecture used by OpenAI GPT-3, Google PaLM, Meta LLaMA, TII Falcon, MosaicML MPT, and others; we could easily name 50 companies training LLMs (large language models) with this same architecture. It is a good architecture, but it has problems at scale. See our earlier discussion of dense-model training costs, published before the GPT-4 announcement, in the context of the coming AI bottleneck. There, we revealed OpenAI's high-level choices for the GPT-4 architecture as well as the training costs of various existing models.


Over the past six months we have come to realize that training cost is irrelevant. Sure, it seems crazy on the surface to spend tens or even hundreds of millions of dollars of compute time to train a model, but for these companies the expense is trivial. It is effectively a fixed capital expenditure, and scaling up consistently delivers better results. The only limiting factor is scaling compute to a timescale on which humans can get feedback and modify the architecture.

Over the next few years, several companies, including Google, Meta, and OpenAI/Microsoft, will train models on supercomputers worth more than $100 billion. Meta burns $16 billion a year on the "metaverse," Google wastes $10 billion a year on all sorts of far-fetched projects, Amazon has lost more than $50 billion on Alexa, and cryptocurrencies have squandered $100 billion on things with no value. These companies, and society as a whole, can and will spend more than $100 billion to create supercomputers capable of training single large-scale models, which can then be productized in many ways. This effort will be replicated across multiple countries and companies. It is the new space race. Unlike the earlier waste, AI has clear value today and will deliver real value from human assistants and autonomous agents in the short term.

The more important problem with scaling AI, the real AI bottleneck, is inference. The goal is to decouple training compute from inference compute. That is why it makes sense to train well past the compute-optimal point for any model that will actually be deployed, and why sparse model architectures are used: during inference, not every parameter is activated.

The real challenge is that the cost of scaling these models out to users and agents is prohibitive. The cost of inference exceeds the cost of training many times over, and this is where OpenAI's innovation in model architecture and infrastructure is aimed.

For dense models, large-model inference is a multivariate problem. We have discussed the edge-computing side of this in detail elsewhere, but the data-center problem statement is very similar. In short, a device can never supply enough memory bandwidth to hit certain throughput targets for large language models, and even with enough bandwidth, hardware compute utilization on edge devices is low.


In data centers and the cloud, utilization is everything. Part of the reason Nvidia is praised for software excellence is that, over a GPU generation's lifetime, Nvidia keeps updating its low-level software, raising FLOPS utilization by moving data more intelligently within chips, between chips, and to and from memory.

In most current use cases, LLM inference runs as a live assistant, which means it must reach high enough throughput that users actually want to use it. The average human reads about 250 words per minute, but some read as fast as 1,000 words per minute. That translates to at least about 8.33 tokens per second, and closer to 33.33 tokens per second to cover all cases. By the math, a dense model with on the order of a trillion-plus parameters cannot reach that throughput even on the latest Nvidia H100 GPU servers, because it requires more memory bandwidth than they have. Every generated token requires every parameter to be loaded from memory onto the chip; the generated token is then fed back into the prompt to produce the next one. On top of that, the KV cache for the attention mechanism requires additional bandwidth for streaming.

[Chart: memory bandwidth required to serve an LLM at high enough throughput for a single user]

(The chart assumes that inefficiencies across operations cannot be overlapped, and that the memory bandwidth needed by the attention mechanism plus hardware overhead is roughly equivalent to the parameter reads. In practice, even with an "optimized" library such as Nvidia's FasterTransformer, total overhead is higher.)

The chart above shows the memory bandwidth required to serve an LLM at high enough throughput for a single user. It shows that even with 8 H100 GPUs, a dense model with a trillion parameters cannot be served at 33.33 tokens per second. Moreover, at 20 tokens per second the FLOPS utilization of 8 H100s is still under 5%, which makes inference extremely expensive. In effect, an 8-way tensor-parallel H100 system has an inference ceiling of roughly 300 billion feed-forward parameters. Yet OpenAI achieves human reading speed on A100 GPUs, with a model of more than a trillion parameters, and serves it broadly at just $0.06 per 1,000 tokens. That is possible because the model is sparse: not every parameter is used for every token.

With that, let's discuss the GPT-4 model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token count, layer count, parallelism strategy, multimodal vision encoder, the thinking behind the engineering trade-offs, the unique techniques implemented, and how they mitigated some of the biggest bottlenecks of large-model inference.
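As a back-of-envelope check (our own sketch, not from the source), the target token rate and required bandwidth can be derived directly. The 2-tokens-per-word factor is implied by the article's 250 wpm to 8.33 tokens/s conversion; FP8 (1 byte) weights are assumed and KV-cache traffic is ignored, so the real requirement is even higher.

```python
# Sketch: bandwidth a dense model would need to hit a "fast reader" token rate.
def tokens_per_second(words_per_minute: float, tokens_per_word: float = 2.0) -> float:
    return words_per_minute * tokens_per_word / 60

def required_bandwidth_tb_s(n_params: float, bytes_per_param: float, tok_per_s: float) -> float:
    # Every parameter must be streamed from memory once per generated token.
    return n_params * bytes_per_param * tok_per_s / 1e12

fast_reader = tokens_per_second(1000)                       # ~33.3 tokens/s
need = required_bandwidth_tb_s(1.8e12, 1.0, fast_reader)    # dense 1.8T params at FP8 -> ~60 TB/s
have = 8 * 3.35                                             # 8x H100 SXM at ~3.35 TB/s each
print(f"target ~{fast_reader:.1f} tok/s -> need ~{need:.0f} TB/s vs ~{have:.1f} TB/s available")
```

Even under these generous assumptions, the required bandwidth is far above what 8 H100s provide, which is why a dense trillion-parameter model cannot be served at reading speed.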


Model architecture. GPT-4 is more than 10 times the size of GPT-3. We believe it has roughly 1.8 trillion parameters across 120 layers, versus GPT-3's roughly 175 billion. OpenAI kept costs reasonable by using a mixture-of-experts (MoE) model. If you are new to MoE, see our post from six months ago on the general GPT-4 architecture and training costs. OpenAI uses 16 experts, each with about 111 billion MLP parameters, and each forward pass is routed through 2 of them. While the literature discusses many advanced routing algorithms for choosing which experts handle each token, OpenAI's routing is said to be fairly simple for the current GPT-4 model. In addition, roughly 55 billion parameters are shared for the attention mechanism. Each forward-pass inference (generating one token) therefore uses only about 280 billion parameters and roughly 560 TFLOPs, compared with the approximately 1.8 trillion parameters and 3,700 TFLOPs a purely dense model would need per forward pass.
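A quick sketch of the active-parameter arithmetic behind those figures, using the reported values above:

```python
# Active parameters per token under top-2 routing (figures as reported in the text).
n_experts_routed = 2              # experts each forward pass is routed through
mlp_params_per_expert = 111e9     # ~111B MLP parameters per expert
shared_attention_params = 55e9    # ~55B parameters shared for attention

active = n_experts_routed * mlp_params_per_expert + shared_attention_params
total = 1.8e12
print(f"active ≈ {active/1e9:.0f}B of {total/1e12:.1f}T total "
      f"({active/total:.0%} of parameters touched per token)")
```

This is the sense in which the model is "sparse": only about a sixth of the weights are read for any given token.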


Dataset composition. OpenAI trained GPT-4 on roughly 13 trillion tokens. That is plausible given that RefinedWeb, built from CommonCrawl, contains about 5 trillion high-quality tokens. For reference, DeepMind's Chinchilla and Google's PaLM were trained on about 1.4 trillion and about 780 billion tokens respectively, and even PaLM 2 is said to have been trained on about 5 trillion tokens.

The 13 trillion tokens are not all unique. Because high-quality tokens are scarce, the dataset is run for multiple epochs: 2 epochs for text data and 4 epochs for code. Interestingly, this is nowhere near Chinchilla-optimal, which would call for training on roughly twice as many tokens, and it shows how limited the supply of easily obtainable tokens on the web is. There are said to be 1,000 times as many high-quality text tokens out there, plus even more audio and video, but acquiring them is not as simple as web scraping. There are also millions of rows of instruction fine-tuning data from Scale AI and from internal sources. Unfortunately, we could not find much about the RLHF data.

The context length (seqlen) during pre-training was 8k. The 32k-seqlen version of GPT-4 was fine-tuned from the 8k model after pre-training. On the cluster, the batch size was ramped up over several days, and by the end OpenAI was using a batch size of 60 million tokens. Of course, since not every expert sees every token, that is "only" a batch of about 7.5 million tokens per expert.
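The per-expert figure follows directly from the routing arithmetic, assuming balanced top-2 routing (our sketch, not the source's calculation):

```python
# Per-expert batch size under balanced top-2 routing.
global_batch_tokens = 60_000_000   # reported final batch size, in tokens
n_experts = 16
experts_per_token = 2

tokens_per_expert = global_batch_tokens * experts_per_token / n_experts
print(f"~{tokens_per_expert/1e6:.1f}M tokens per expert per batch")   # ~7.5M
```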


Parallelism strategy. To parallelize across all their A100 GPUs, they used 8-way tensor parallelism, since that is the limit imposed by NVLink. Beyond that, we hear they also used 15-way pipeline parallelism. In theory that is too many pipeline stages once data communication and compute time are considered, but it makes sense if they were constrained by memory capacity. With pipeline plus tensor parallelism alone, the parameters take up roughly 30 GB per GPU at FP16; once KV cache and overhead are added, this is consistent with most of OpenAI's GPUs being 40 GB A100s. They probably used ZeRO Stage 1, and possibly block-level FSDP or hybrid sharded data parallelism. As for why they did not use full-model FSDP, it may be because of the higher communication overhead. Although most of OpenAI's nodes have high-speed network links between them, that is likely not true of all nodes; we believe at least some clusters are connected with much lower bandwidth. We do not understand how they avoid huge pipeline bubbles per batch at such high pipeline parallelism; most likely they simply absorbed the cost.
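The ~30 GB figure is easy to sanity-check (our sketch; it counts parameters only and ignores optimizer state, gradients, activations, and KV cache):

```python
# Per-GPU parameter memory under 8-way tensor x 15-way pipeline parallelism.
total_params = 1.8e12
bytes_per_param = 2            # FP16
tensor_parallel = 8
pipeline_parallel = 15

params_bytes_per_gpu = total_params * bytes_per_param / (tensor_parallel * pipeline_parallel)
print(f"~{params_bytes_per_gpu/1e9:.0f} GB of parameters per GPU")   # ~30 GB
```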


Training costs. OpenAI's training FLOPS for GPT-4 is about 2.15e25, using roughly 25,000 A100 GPUs for 90 to 100 days at about 32% to 36% utilization. That very low utilization is partly due to the large number of failures requiring restarts from checkpoints; the pipeline bubbles mentioned above are extremely costly. Another reason is that an all-reduce across that many GPUs is extremely expensive. If our guess is right, the cluster is really a collection of many smaller clusters with fairly weak network connections between them, i.e. non-blocking 800G/1.6T links within each section of the cluster, but only 200G/400G between sections.

At roughly $1 per A100-hour in the cloud, this run alone would cost about $63 million. That excludes all the experiments, failed training runs, and other costs such as data collection, RLHF, and staff, so the actual cost is much higher. It also assumes someone else buys the chips, networking, and data center, takes on the capital expenditure, and rents it to you. Today, the same pre-training could be done with about 8,192 H100s in roughly 55 days for about $21.5 million, at $2 per H100-hour.

Note that we believe nine companies will have more H100s than that by the end of the year. Not all of them will devote every H100 to a single training run, but those that do will have larger models. Meta will have more than 100,000 H100s by year-end, though a significant share will be spread across their data centers for inference; their largest single cluster should still exceed 25,000 H100s. By the end of this year, many companies will have enough compute to train a GPT-4-scale model.
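The cost figures can be roughly reproduced from the GPU counts and durations above (our arithmetic; the hourly rates are the source's assumptions, and the exact rate behind the $63M figure is not stated):

```python
# Rough reproduction of the training cost estimates.
def train_cost(n_gpus: int, days: float, dollars_per_gpu_hour: float) -> float:
    return n_gpus * days * 24 * dollars_per_gpu_hour

a100_run = train_cost(25_000, 95, 1.00)    # ~$57M at $1/hr, in the ballpark of the ~$63M cited
h100_run = train_cost(8_192, 55, 2.00)     # ~$21.6M, matching the ~$21.5M cited
print(f"A100 run ≈ ${a100_run/1e6:.0f}M, H100 run ≈ ${h100_run/1e6:.1f}M")
```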


Mixture-of-experts trade-offs. MoE is an excellent way to cut the number of parameters used at inference time while still growing the total parameter count, which is needed so that more information can be encoded per training token. This matters because acquiring enough high-quality tokens is very hard: if OpenAI were really trying to hit Chinchilla-optimal, they would have to train on twice as many tokens.

That said, OpenAI made several trade-offs. MoE is notoriously hard to handle at inference, because not every part of the model is used for every generated token: some parts sit idle while others are busy, and when serving users this seriously hurts utilization. Researchers have shown that using 64 to 128 experts gives better results than 16, but that is purely a research finding. There are several reasons to use fewer experts. One reason OpenAI chose 16 is that more experts are harder to generalize across many tasks; more experts can also be harder to get to converge. For a training run this large, OpenAI chose to be conservative on the expert count. Using fewer experts also helps their inference infrastructure. There are many difficult trade-offs in moving to a mixture-of-experts inference architecture; let's start with the basic trade-offs of LLM inference before turning to the dilemmas OpenAI faced and the choices they made.


Before we begin, we want to point out that after talking to every LLM company we could, we found Nvidia's FasterTransformer inference library to be quite bad, and TensorRT even worse. Not being able to take Nvidia's templates and modify them means people end up building their own solutions from scratch. Nvidia, if you are reading this: fix this for LLM inference soon, or the de facto standard will become an open tool that makes it much easier to add third-party hardware support. A wave of giant models is coming. If there is no software advantage in inference and kernels still have to be written by hand, there is a much bigger market for AMD's MI300 and other hardware. For large language model inference, there are three main trade-offs, spanning batch size (the number of users served concurrently) and the number of chips used:

  1. Latency - The model must respond within a reasonable latency. Nobody wants to wait several seconds before output starts streaming into a chat application. Prefill (processing input tokens) and decode (generating output tokens) take different amounts of time.
  2. Throughput - The model must output a certain number of tokens per second. For human use, roughly 30 tokens per second is needed; lower or higher throughput is acceptable for various other use cases.
  3. Utilization - The hardware running the model must achieve high utilization, or the cost becomes too high. Accepting higher latency and lower throughput lets you group more user requests together and raise utilization, but it makes everything harder.

The key to LLM inference is balancing two factors: memory bandwidth and compute. Simplifying, every parameter must be read, and there are 2 FLOPs associated with each. The ratio on most chips is therefore completely lopsided at batch size 1: an H100 SXM has only about 3.35 TB/s of memory bandwidth but 2,000 TFLOP/s of FP8 compute. Serving a single user at batch size 1, the memory bandwidth needed to stream every parameter for every generated token dominates inference time; the compute time is nearly negligible.

To scale a large language model to many users efficiently, the batch size must be greater than 1, so that multiple users amortize the cost of each parameter read. For example, at a batch size of 256 or 512, every byte of parameters read supports 512 or 1,024 FLOPs, which is much closer to the H100's ratio of FLOPS to memory bandwidth. That buys higher utilization at the cost of higher latency.

Many people believe the main bottleneck of LLM inference is memory capacity, because model size limits how few chips it fits on, but that is not quite right. Large models do need multiple chips for inference, and higher memory capacity lets them fit on fewer chips, but it is actually better to use more chips than strictly necessary, so that latency drops, throughput rises, and larger batch sizes can be used for ever-higher utilization.
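To make the batch-size argument concrete, here is a small sketch of arithmetic intensity versus the H100's compute-to-bandwidth ratio. It assumes 1 byte per parameter (FP8/int8) so that each parameter byte read supports 2 × batch FLOPs, matching the 512/1,024 figures above; these numbers are illustrative, not measurements.

```python
# Arithmetic intensity (FLOPs per byte read) vs. chip ratio as batch size grows.
peak_flops = 2000e12           # ~2,000 TFLOP/s FP8 (H100 SXM)
mem_bw = 3.35e12               # ~3.35 TB/s HBM bandwidth

chip_ratio = peak_flops / mem_bw            # ~600 FLOPs available per byte streamed
for batch in (1, 64, 256, 512):
    intensity = 2 * batch                   # FLOPs performed per parameter byte read
    util_bound = min(1.0, intensity / chip_ratio)
    print(f"batch {batch:4d}: ~{intensity} FLOPs/byte, compute utilization bound ~{util_bound:.0%}")
```

At batch size 1 the chip can only be a fraction of a percent compute-utilized; only in the hundreds does the workload approach the chip's balance point.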


Google demonstrates these trade-offs in their PaLM inference paper, though note that it concerns a dense model like PaLM, not a sparse model like GPT-4.

If an application needs the lowest possible latency, we have to throw more chips at it and partition the model in as many ways as possible. Smaller batch sizes usually give lower latency, but smaller batches also mean worse MFU (utilization) and therefore a higher total cost per token (in chip-seconds or dollars). If an application does offline inference and latency is not a concern, the main goal is to maximize per-chip throughput (i.e. minimize total cost per token). Increasing batch size is the most effective lever, since larger batches generally give better MFU, and some partitioning strategies that are inefficient at small batch sizes become efficient as the batch grows.

More chips and larger batch sizes are cheapest because they raise utilization, but they introduce a third variable: network time. Some ways of splitting a model across chips are better for latency but trade off against utilization. The memory-loading time and the non-attention compute time scale with model size and inversely with chip count, but for a given partition layout the chip-to-chip communication time falls off more slowly (or not at all) as chips are added, so it becomes an increasingly important bottleneck as the chip count grows.

Although we will only touch on it briefly today, note that the memory requirements of the KV cache grow dramatically with batch size and sequence length. If an application needs to generate text with long attention contexts, inference time rises sharply. For a 500B+ model with multi-head attention, the attention KV cache becomes enormous: at a batch size of 512 and a context length of 2048, the KV cache totals around 3 TB, three times the size of the model's parameters. The on-chip memory has to load this KV cache from off-chip memory, and while it does, the chip's compute cores sit essentially idle.

Long sequence lengths are especially punishing for memory bandwidth and memory capacity. OpenAI's 16k-context GPT-3.5 Turbo and 32k-context GPT-4 are much more expensive because memory limits prevent them from using larger batch sizes, and smaller batch sizes mean lower hardware utilization. On top of that, the KV cache grows with sequence length, and since the KV cache cannot be shared between users, it requires separate memory reads, further pressuring memory bandwidth. More on MQA below.
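For reference, the standard multi-head-attention KV-cache sizing formula is below (a sketch; the exact value for the 500B+ model in the text depends on its undisclosed shape, and the configuration here is purely illustrative, though it lands in the same multi-terabyte range as the ~3 TB cited):

```python
# Generic KV-cache sizing for multi-head attention.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values, stored for every layer, head, position, and sequence.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical large MHA configuration at batch 512, context 2048, FP16:
example = kv_cache_bytes(n_layers=96, n_kv_heads=64, head_dim=128, seq_len=2048, batch=512)
print(f"~{example/1e12:.1f} TB of KV cache")   # multi-terabyte, i.e. larger than the weights
```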


GPT-4 inference trade-offs and infrastructure. Everything above is even harder for GPT-4 inference, because the mixture-of-experts (MoE) architecture introduces a whole new set of difficulties. The forward pass for each generated token can be routed to a different set of experts, which creates a new tension among throughput, latency, and utilization, especially at larger batch sizes.

OpenAI's GPT-4 has 16 experts, and each forward pass is routed to 2 of them. That means with a batch size of 8, each expert's parameter reads may cover a batch of only 1; worse, one expert might see a batch of 8 while others see 4, 1, or 0. With every token generated, the routing algorithm sends forward passes in different directions, so latency between tokens and per-expert batch sizes fluctuate significantly.

Inference infrastructure is one of the main reasons OpenAI chose fewer experts: with more experts, memory bandwidth becomes even more of an inference bottleneck. OpenAI's inference clusters regularly reach batch sizes of 4k+ tokens, which even with ideal load balancing across experts means per-expert batches of only about 500; reaching that requires an enormous volume of usage.

Our understanding is that OpenAI runs inference on clusters of 128 GPUs, of which they have several across multiple data centers and geographies. Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. Each 8-GPU node holds only about 130B parameters, which is under 30 GB per GPU at FP16 and under 15 GB at FP8/int8. That lets inference run on 40 GB A100s, provided the KV cache across all batches does not balloon too much.

Layers containing the various experts are not split across different nodes, because that would make network traffic too irregular and recomputing the KV cache between each token generation too expensive. For any future MoE scaling and conditional routing, the biggest difficulty is how to handle routing around the KV cache.

The model has 120 layers, so spreading them evenly across 15 nodes would be straightforward, but since the first node also has to handle data loading and embeddings, it makes sense to put fewer layers on the head node of the inference cluster. There are also rumors of speculative decoding (discussed below) which we are not sure whether to believe; that too would explain why the head node holds fewer layers.
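The "about 500 per expert" and layer-placement figures follow directly from the numbers above (our arithmetic, assuming balanced top-2 routing):

```python
# Expert-batch and layer-placement arithmetic for the inference cluster described above.
cluster_batch = 4096        # "4k+" tokens in flight across the inference cluster
n_experts = 16
experts_per_token = 2

per_expert_batch = cluster_batch * experts_per_token / n_experts
print(f"best-case per-expert batch ≈ {per_expert_batch:.0f}")   # ~512, i.e. "only about 500"

n_layers, n_pipeline_nodes = 120, 15
print(f"{n_layers} layers / {n_pipeline_nodes} nodes = {n_layers // n_pipeline_nodes} layers per node "
      "(with the head node likely carrying fewer, plus embeddings)")
```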


GPT-4 inference cost. Although GPT-4 uses only 1.6 times as many feed-forward parameters as the 175B Davinci model, it costs about 3 times as much to run, mainly because GPT-4 needs larger clusters and achieves lower utilization. We estimate the cost at $0.0049 per 1,000 tokens for GPT-4 8k-context inference on 128 A100s, and $0.0021 per 1,000 tokens on 128 H100s. Note that we assume decent, high utilization and large batch sizes. That may be a bad assumption, as OpenAI is clearly sometimes badly under-utilized. We assume OpenAI winds clusters down during off-peak hours and re-purposes those nodes to resume training of smaller test models from checkpoints and try out various new techniques, which helps keep inference costs down. If OpenAI does not do this, their utilization is lower and our cost estimates more than double.
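For context, here is the general shape of how a per-1,000-token estimate like this is derived (our sketch; the hourly rate and cluster throughput below are placeholder assumptions, not the source's inputs):

```python
# Generic per-1k-token serving cost: cluster hourly cost divided by token throughput.
def cost_per_1k_tokens(n_gpus: int, dollars_per_gpu_hour: float,
                       tokens_per_second_cluster: float) -> float:
    cluster_cost_per_s = n_gpus * dollars_per_gpu_hour / 3600
    return 1000 * cluster_cost_per_s / tokens_per_second_cluster

# Example: a 128-GPU cluster at a hypothetical $1.50/GPU-hour sustaining ~11,000 tokens/s overall.
print(f"${cost_per_1k_tokens(128, 1.50, 11_000):.4f} per 1k tokens")
```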

Multi-query attention

MQA is something everyone else is doing, but we want to point out that OpenAI does it too. In short, only one KV head is needed, so the memory footprint of the KV cache shrinks dramatically. Even so, the 32k-context GPT-4 definitely cannot run on 40 GB A100s, and the 8k version is capped in maximum batch size. Without MQA, the 8k model's maximum batch size would be so limited as to be uneconomical.
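The saving comes from keeping a single KV head instead of one per attention head; a small sketch with illustrative configuration values (not GPT-4's actual shape):

```python
# KV-cache bytes per token: multi-head attention (MHA) vs. multi-query attention (MQA).
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # keys + values

mha = kv_bytes_per_token(n_layers=120, n_kv_heads=96, head_dim=128)   # one KV head per attention head
mqa = kv_bytes_per_token(n_layers=120, n_kv_heads=1,  head_dim=128)   # a single shared KV head
print(f"per-token KV cache: MHA ~{mha/1e6:.1f} MB vs MQA ~{mqa/1e3:.0f} KB ({mha//mqa}x smaller)")
```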

Continuous batching

OpenAI implements both variable batch sizes and continuous batching, striking a balance between allowing some maximum latency and optimizing inference cost. If this concept is new to you, this post from Anyscale is worth reading:

[Figure: completing four sequences with static batching]

Completing four sequences with static batching. In the first iteration (left), each sequence generates one token (blue) from the prompt tokens (yellow). After several iterations (right), the completed sequences have different lengths because each emits its end-of-sequence token (red) at a different iteration. Even though sequence 3 finishes after two iterations, static batching means the GPU stays under-utilized until the last sequence in the batch finishes (here, sequence 2 after six iterations).

[Figure: completing seven sequences with continuous batching]

Completing seven sequences with continuous batching. The left shows the batch after a single iteration, the right after several iterations. As soon as a sequence emits its end-of-sequence token, a new sequence is inserted in its place (sequences S5, S6, and S7). This achieves higher GPU utilization because the GPU does not have to wait for every sequence to finish before starting new ones.


We have heard from some reliable sources that OpenAI uses speculative decoding in GPT-4's inference. We are not sure whether to believe it. The general token-to-token variation in inference time, and the difference between simple retrieval tasks and more complex ones, seem to suggest it is possible, but there are too many variables to be certain. Just in case, we will explain it here, borrowing (with appropriate modifications and additions) some text from the "Staged Speculative Decoding to Accelerate LLM Inference" write-up.

Using an LLM generally happens in two phases. The first is prefill, where the prompt is run through the model to produce the KV cache and the logprobs (the probability distribution over possible output tokens) of the first output token. Prefill is usually fast because the whole prompt can be processed in parallel. The second phase is decoding: a token is sampled from the output logprobs and fed back into the model, which produces logprobs for the next token, and so on until the desired number of tokens has been generated. Because decoding must happen sequentially, the weights have to be streamed through the compute units to produce each single token, so the arithmetic intensity (FLOPs per byte of memory bandwidth) of this phase is very low at small batch sizes. Decoding is therefore typically the most expensive part of autoregressive generation, which is why input tokens are cheaper than output tokens in OpenAI's API pricing.

The basic idea of speculative decoding is to use a smaller, faster draft model to decode several tokens ahead, then feed them into the oracle (full) model as a single batch. If the draft model's predictions are correct, i.e. the larger model agrees, several tokens can be decoded with one batch, saving considerable memory bandwidth and time. If the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and the algorithm naturally falls back to standard token-by-token decoding. Speculative decoding can also be combined with a rejection-sampling scheme to sample from the original distribution. Note that it only helps in small-batch settings where bandwidth is the bottleneck: speculative decoding trades compute for bandwidth.

There are two key reasons speculative decoding is an attractive optimization target. First, it does not degrade model quality at all. Second, the gains it provides are generally independent of other methods, because its speedup comes from converting sequential execution into parallel execution.

Current speculative methods predict a single sequence per batch, but this does not scale well to large batch sizes or low draft-model alignment. Intuitively, the probability of the two models agreeing over a long run of tokens drops exponentially, which means the gains from speculative decoding fade quickly as arithmetic intensity rises. We think that if OpenAI uses speculative decoding, they probably only use it for spans of roughly 4 tokens. Incidentally, some speculate that the whole "GPT-4 got worse" conspiracy theory might simply be because OpenAI lets the oracle model accept lower-probability sequences from the speculative decoding model. There has also been speculation that Bard uses speculative decoding, since Google waits for a sequence to be fully generated before sending it to the user, but we do not believe that guess is correct.
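To make the mechanism concrete, here is a minimal sketch of one speculative-decoding step in a greedy-acceptance variant. This is not OpenAI's implementation; the `draft_next` and `target_next_batch` callables are stand-ins for real models, and rejection sampling is omitted.

```python
# Minimal speculative-decoding sketch: draft proposes k tokens, target verifies them in one batch.
from typing import Callable, List

def speculative_step(draft_next: Callable[[List[int]], int],
                     target_next_batch: Callable[[List[int]], List[int]],
                     context: List[int], k: int = 4) -> List[int]:
    """One round of speculative decoding (greedy-acceptance variant)."""
    # 1) The cheap draft model proposes k tokens, one at a time.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2) The expensive target model checks all k positions in a single batched forward
    #    pass; here abstracted as returning its preferred token at each position.
    target_choices = target_next_batch(context + proposal[:-1])
    # 3) Accept the longest prefix on which the two models agree; at the first
    #    disagreement, keep the target model's token and stop.
    out = list(context)
    for drafted, chosen in zip(proposal, target_choices):
        out.append(chosen)
        if drafted != chosen:
            break
    return out

# Toy usage with stand-in "models": the draft always guesses 7, the target wants 7, 7, 7, 9.
draft = lambda ctx: 7
target = lambda ctx: [7, 7, 7, 9]
print(speculative_step(draft, target, [1, 2, 3]))   # -> [1, 2, 3, 7, 7, 7, 9]
```

In the toy run, three drafted tokens are accepted and the fourth is replaced by the target's choice, so four tokens are produced for the price of a single large-model forward pass.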


Vision multimodality. GPT-4's visual multimodal capability is the least impressive part relative to leading research, though of course no one has commercialized multimodal LLM research yet.

GPT-4's vision encoder is separate from the text encoder, with cross-attention between them. As far as we know, the architecture is similar to Flamingo, and it adds to GPT-4's parameter count. After text-only pre-training, it is fine-tuned on roughly another 2 trillion tokens. OpenAI originally wanted to train the vision model from scratch, but the approach was not yet mature, so they chose to de-risk by starting from text. The next model, GPT-5, is said to be trained on vision from scratch, to be able to generate images itself, and to handle audio as well.

A major purpose of the vision capability is to let autonomous agents read web pages and transcribe the content of images and videos. The training data includes joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampled frames, with the audio transcribed by Whisper.

One interesting consequence of all this over-optimization for LLMs is that the IO cost of a vision model differs from that of a text model. In a text model, data loading is extremely cheap, as we described in our article on the Amazon cloud crisis. In a vision model, the IO cost of data loading is roughly 150 times higher: about 600 bytes per token instead of text's roughly 4 bytes. A great deal of work is going into image compression.

This matters for hardware vendors who are optimizing their hardware around the use cases and ratios of LLMs 2 to 3 years out. They may find themselves in a world where every model has strong vision and audio capabilities, and discover that their architectures are poorly suited to it. Overall, architectures will certainly evolve beyond the simplified dense text models and/or MoE models we see today.
