
The impact of large models and pre-training on the automotive industry: AI accelerator chips may lose the market

Author: Zoth Automotive Research

The large language model (LLM) has no precise definition. In 2018 the term generally meant a model with billions of parameters; today it usually means a model with more than 100 billion parameters. Large models emerged together with the pre-trained model (PTM), the two complementing each other, and large models formally arrived with the advent of BERT and GPT in 2018.

The massive parameter counts of large models translate into enormous memory cost: for chips that can host a true large model, the memory alone runs to more than $2,000, which means genuine large models cannot be put in a car. In the automotive industry, however, a model with more than 1 billion parameters is often also called a "large model", which is very different from a true large model. What impact will this have?

  • First, compute chips: these models demand higher-bandwidth memory, so very expensive HBM is likely to appear on automotive chips, which means the price of automotive compute chips will exceed $1,000;
  • Second, older accelerators designed for CNNs become unusable or extremely inefficient, while GPUs or CPU+GPU systems are better suited and better able to cope with future model changes;
  • Finally, PTM greatly reduces the cost of data training and diminishes the value of proprietary datasets.

To analyze the impact of large models, or Transformers, on computing hardware, see the paper "Full Stack Optimization of Transformer Inference: a Survey" from NVIDIA and UC Berkeley. The paper is very detailed, 45 pages long; those interested can read the full text.

The paper points out that the bottleneck in Transformer accelerator operation lies mainly with the CPU: the non-matrix operations consume 96% of total time. In other words, the CPU matters far more than the accelerator; in the Transformer era dedicated accelerators hardly need to exist, and CPU+GPU is the best choice. Even if an accelerator must be used, it has to be paired with a high-performance CPU.
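
As a rough illustration of matrix versus non-matrix operations (this is not the paper's benchmark, and the absolute numbers depend entirely on the hardware), a small PyTorch timing sketch might look like this:

```python
# Rough timing sketch: compare time spent in matrix multiplication vs.
# memory-bound non-matrix ops (softmax, LayerNorm) for one attention-like
# block. Assumes PyTorch on CPU; shapes and numbers are illustrative only.
import time
import torch

torch.manual_seed(0)
seq_len, d_model, n_runs = 512, 768, 50
x = torch.randn(seq_len, d_model)
w_qkv = torch.randn(d_model, 3 * d_model)
layer_norm = torch.nn.LayerNorm(d_model)

def bench(fn):
    fn()                                    # warm-up run
    t0 = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - t0) / n_runs

t_matmul = bench(lambda: x @ w_qkv)                       # matrix math
scores = (x @ x.T) / d_model ** 0.5
t_softmax = bench(lambda: torch.softmax(scores, dim=-1))  # non-matrix op
t_norm = bench(lambda: layer_norm(x))                     # non-matrix op

print(f"matmul   : {t_matmul * 1e3:.3f} ms")
print(f"softmax  : {t_softmax * 1e3:.3f} ms")
print(f"layernorm: {t_norm * 1e3:.3f} ms")
```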

It should also be pointed out that today's large models can essentially all be regarded as Transformer variants.


The foundation of the large-model evolutionary tree is the Transformer. The Transformer is a language model architecture proposed in 2017, originally to solve machine translation. As research deepened, it shone on other problems and even in other fields, becoming a strong solution for text representation, classification, generation, question answering and other natural-language tasks, and performing excellently in vision as well.

Compared with traditional CNN-oriented designs, Transformers consist mainly of matrix multiplications plus memory-intensive nonlinear operations. The computational graph and data flow of a Transformer are more complex than a CNN's, with more operator nodes and more splitting and joining of the data flow. The previous generation of automotive AI accelerators was designed essentially for CNNs and cannot cope with data-flow splitting: a traditional accelerator tries as much as possible to keep data flowing between compute units to reduce stores and loads, and is helpless when the data flow must be split.

[Figure: Transformer architecture schematic]

In the schematic above, the encoder is a stack of identical layers, each with two sub-layers: multi-head attention and a feed-forward network. The decoder is likewise a stack of identical layers, each consisting of multi-head self-attention, encoder-decoder attention, and a feed-forward network. Every element in the encoder can see the entire sequence. Each decoder layer contains two multi-head attention blocks: one is self-attention, taking the decoder's input as Q, K and V; the other is encoder-decoder attention, taking the output of the preceding decoder sub-layer as Q and the output of the final encoder layer as K and V.
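
A minimal NumPy sketch of the two attention patterns just described (shapes, head count, and random weights are illustrative only, not taken from any real model):

```python
# Self-attention (Q, K, V all from the same sequence) vs. encoder-decoder
# attention (Q from the decoder, K and V from the encoder output).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 64, 4
d_head = d_model // n_heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q_in, kv_in):
    """Project q_in to Q and kv_in to K, V, attend per head, then concatenate."""
    wq, wk, wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ v[:, sl])
    return np.concatenate(heads, axis=-1)            # (seq_len, d_model)

enc_out = rng.standard_normal((10, d_model))          # encoder output, 10 tokens
dec_in = rng.standard_normal((7, d_model))            # decoder input, 7 tokens

self_attn = multi_head_attention(dec_in, dec_in)      # decoder self-attention
cross_attn = multi_head_attention(dec_in, enc_out)    # encoder-decoder attention
print(self_attn.shape, cross_attn.shape)              # (7, 64) (7, 64)
```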

Each sub-layer of the encoder and decoder is wrapped in a residual block and includes a layer norm. Positional encoding is applied at the input of both the encoder and decoder, and is summed with the token embeddings. The Transformer uses trigonometric (sinusoidal) positional encoding, which means trigonometric functions. These are scalar operations, exactly what the CPU is best at and something AI accelerators cannot handle, since they can only compute matrix multiply-accumulate. Even a GPU needs strong cooperation from the CPU here, which is one reason NVIDIA is determined to develop its own CPU.
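
For reference, the sinusoidal positional encoding from the original Transformer paper can be sketched as follows; the sequence length and model width are arbitrary illustrative values:

```python
# PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
# These are elementwise/scalar trigonometric ops, not matrix multiply-accumulate,
# which is why they fall to the CPU rather than a MAC-array accelerator.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(seq_len=128, d_model=512)
# The encoding is summed with the token embeddings at the encoder/decoder input:
# x = token_embedding + pe
print(pe.shape)   # (128, 512)
```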


At present, LLM architectures are mainly based on the Transformer and are further divided into encoder-only, decoder-only, and encoder-decoder. Two representative models, BERT (encoder-only) and GPT-2 (decoder-only), are chosen for analysis.
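
The practical difference between the two can be illustrated with their attention masks; the sketch below is a simplified illustration (real models also add padding masks and other details):

```python
# Encoder-only (BERT-style): every token attends to the whole sequence.
# Decoder-only (GPT-style): a causal mask lets each token attend only to
# itself and earlier tokens.
import numpy as np

seq_len = 5
encoder_mask = np.ones((seq_len, seq_len), dtype=bool)           # fully visible
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal

print(decoder_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```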

Because the parameters of a true large model are so numerous, even a $30,000-per-chip part such as the H100 cannot hold 175 billion parameters, so NVIDIA proposed a tensor-parallel computing method, which once again shows that a real large model cannot go into a car. NVIDIA's paper is titled "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism."


Simply put, the parameter (weight) matrices are cut vertically in some places and horizontally in others, and an AllReduce is inserted to combine the partial results.


Attention's multi-head computation is practically tailor-made for tensor parallelism, because each head can be computed independently and the results concatenated. In other words, each head's parameters can be placed on a separate GPU. The three parameter matrices Q, K and V are split by columns, with each head placed on one GPU for parallel computation; the output linear layer B is split by rows. The cutting scheme is essentially the same as for the MLP layer, and the forward and backward principles are also the same, so they are not repeated here.
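
A minimal single-process sketch of the column/row splitting described above (the "devices" here are just array shards, and the nonlinearity between the two layers is omitted for brevity); in a real Megatron setup the final summation is performed by an AllReduce across GPUs:

```python
# Megatron-style tensor parallelism, simulated in NumPy: split the first
# linear layer's weight A by columns and the second layer's weight B by rows;
# the per-device partial outputs of the row-split layer must be summed,
# which is exactly what the AllReduce does in the distributed setting.
import numpy as np

rng = np.random.default_rng(0)
n_devices, d_in, d_hidden = 2, 8, 16
x = rng.standard_normal((4, d_in))               # a batch of 4 tokens
A = rng.standard_normal((d_in, d_hidden))        # first linear layer
B = rng.standard_normal((d_hidden, d_in))        # second linear layer

# Column-split A: each "device" holds d_hidden / n_devices output columns.
A_shards = np.split(A, n_devices, axis=1)
h_shards = [x @ A_k for A_k in A_shards]         # no communication needed here

# Row-split B: each device multiplies its hidden shard by its rows of B.
B_shards = np.split(B, n_devices, axis=0)
partial = [h_k @ B_k for h_k, B_k in zip(h_shards, B_shards)]

# "AllReduce": sum the partial results across devices.
y_parallel = sum(partial)
y_reference = (x @ A) @ B                        # the unsplit computation
print(np.allclose(y_parallel, y_reference))      # True
```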

We will not analyze model cutting in depth here; it is enough to know that current large models run across multiple GPUs. Typically 8 GPUs plus 2 CPUs form one node: the 8 GPUs are connected with NVLink, and nodes are connected to each other with NVIDIA ConnectX-7 smart NICs. This NIC is no ordinary card: TSMC 7nm process, 80 billion transistors, 400G GPUDirect throughput, 400G encryption acceleration, a message rate of 405 million per second, and an estimated cost of more than $500.

[Figure: typical AI accelerator framework diagram]

Without a CPU to schedule its tasks, the accelerator in the diagram above is like a headless fly and cannot work.

The cutting of model data and the scheduling of tasks are better suited to the CPU. The GPU is not good at this kind of computation full of interrupts and branches, but it can at least handle vectors. AI accelerators are worse off still: they can basically handle only tensors or matrices and require that there be no dependencies between data items. The Transformer has feed-forward and temporal dependencies, so an AI accelerator needs the CPU's full cooperation to strip out the tasks it handles inefficiently and keep only the work it is suited to; the CPU's say in the system is therefore much greater than the accelerator's. The CPU also needs high compute: Tesla's second-generation FSD chip in HW4.0 uses as many as 24 Cortex-A72 cores, and NVIDIA's Thor has at least 32 Arm V1 cores.


MOPs (memory operations) measure the number of memory accesses per second, and the larger the model, the higher the MOPs. For models below 100 million parameters, memory like GDDR6 is barely adequate, and Tesla and Mobileye's EyeQ6H already use the expensive GDDR6. Most manufacturers, however, still use cheap LPDDR5, which can only handle models of around 10 million parameters; in the future, expensive HBM will be required. HBM itself is not outrageously expensive, several times the price of GDDR6, but once HBM is used it must be paired with TSMC's CoWoS packaging. CoWoS capacity is extremely tight and prices are extremely high, so it is generally used only in server chips.
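
To see why parameter count translates into memory-bandwidth pressure, here is a back-of-the-envelope sketch; the per-parameter byte count, frame rate, and bandwidth figures are rough assumptions for illustration, not vendor specifications:

```python
# If the weights do not fit on-chip, each inference has to stream them from
# DRAM at least once, so required bandwidth >= params * bytes_per_param * fps.
param_counts = {"10M params": 10e6, "100M params": 100e6, "1B params": 1e9}
bytes_per_param = 1            # assume INT8 weights
fps = 30                       # assume 30 camera frames per second

memory_bandwidth_gbps = {      # rough, illustrative module-level bandwidths
    "LPDDR5 (64-bit)": 51,
    "GDDR6 (128-bit)": 224,
    "HBM2e (1 stack)": 460,
}

for name, n in param_counts.items():
    need = n * bytes_per_param * fps / 1e9          # GB/s just to stream weights
    print(f"{name}: needs ~{need:.1f} GB/s for weights alone at {fps} fps")
    for mem, bw in memory_bandwidth_gbps.items():
        print(f"    {mem}: {100 * need / bw:.0f}% of ~{bw} GB/s")
```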

Next, let's talk about pre-training. Understanding PTM starts from the human learning mechanism: features learned automatically by deep learning have gradually replaced hand-crafted features and statistical methods. A key problem, however, is that a great deal of data is required, otherwise the model overfits because it has too many parameters. And data is very expensive. Taking image captioning as an example, the MSCOCO dataset labels only 120,000 images, five captions per image, at a total cost of roughly $108,000. In autonomous driving, each image carries at least 10 annotations, and datasets jump to the million-image level.

As we all know, the Transformer was originally a product of natural language processing (NLP), where self-supervised learning is used for pre-training; the motivation is to use the intrinsic structure of text as the supervision signal instead of manual labels. Early explorations focused on the word semantics captured by shallow pre-trained models such as Word2Vec and GloVe, but their limitation was that they could not represent polysemy well. RNNs were then naturally considered to provide contextual representations, but performance was still limited by model size and depth.

In 2018, GPT and BERT were born, bringing NLP's PTMs into a new era. These models are very large; their huge number of parameters can capture polysemy, lexical and syntactic structure, real-world knowledge and other information from text, and after fine-tuning, only a few examples are needed to achieve astonishing performance on downstream tasks.


The original motivation of transfer learning is to save the time spent manually labeling samples: a model migrates from existing labeled data (the source domain) to unlabeled data (the target domain), yielding a model suited to the target domain. It should be pointed out that transfer learning is a branch of machine learning and can be implemented without neural networks, although nowadays it has largely merged with them. In plain terms, it means using existing knowledge to learn new knowledge; the core is finding the similarity between the two, or, as the idiom puts it, drawing inferences from one instance. Since learning from scratch in the target domain is too expensive, we instead use the relevant knowledge we already have to learn the new knowledge as quickly as possible.

There are two strategies for applying transfer learning to deep learning, though their naming is not yet unified. One is fine-tuning: take a network pre-trained on a base dataset and train all of its layers on the target dataset. The other is freeze-and-train: freeze all layers except the last (their weights are not updated) and train only the last layer. Of course, transfer learning is not limited to deep learning, but that is where most of its current applications are.
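
As a concrete illustration, here is a minimal PyTorch sketch of the two strategies, using a small toy network in place of a real pre-trained model (the layer sizes and learning rates are invented for illustration):

```python
# "Fine-tuning" updates every layer; "freeze and train" updates only the last layer.
import torch
import torch.nn as nn

def build_model() -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 10),            # the "last layer" / task head
    )

# Strategy 1: fine-tuning -- all parameters stay trainable.
finetune_model = build_model()
finetune_params = [p for p in finetune_model.parameters() if p.requires_grad]

# Strategy 2: freeze and train -- freeze everything except the last layer.
frozen_model = build_model()
for p in frozen_model.parameters():
    p.requires_grad = False
for p in frozen_model[-1].parameters():     # unfreeze only the task head
    p.requires_grad = True
frozen_params = [p for p in frozen_model.parameters() if p.requires_grad]

# Only the trainable parameters are handed to the optimizer.
opt_finetune = torch.optim.Adam(finetune_params, lr=1e-4)
opt_frozen = torch.optim.Adam(frozen_params, lr=1e-3)
print(len(finetune_params), len(frozen_params))   # 6 tensors vs. 2 tensors
```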

A typical example can be found in the paper "Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks", produced by Microsoft Research Asia.


The idea is very simple: treat images as a language, like text, and use generative self-supervised pre-training. BEiT-3 uses a shared Multiway Transformer structure, completes pre-training through masked data modeling on unimodal and multimodal data, and can then be transferred to a variety of vision and vision-language downstream tasks. In plain terms, only a small amount of finely labeled video is needed: a model is trained on this finely labeled dataset, that model is then used to label a large amount of unlabeled video, and finally a better weight model is obtained. The principle is similar to learning a foreign language: when we meet a word we do not know (the masked data), we do not need to know its exact meaning; we can guess it from prior experience and treat the guess as an accurate label.
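
The masked-data-modeling objective itself is beyond a short snippet, but the "guess labels for unlabeled data, then retrain" step described above can be sketched in its simplest form as self-training / pseudo-labeling. The code below uses scikit-learn on synthetic data purely as an illustration; it is not BEiT-3's actual procedure:

```python
# Train a "teacher" on a small labeled set, use it to pseudo-label a large
# unlabeled pool, then retrain a "student" on labeled + pseudo-labeled data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_labeled, y_labeled = X[:100], y[:100]        # small, finely labeled set
X_unlabeled = X[100:]                          # large unlabeled pool

teacher = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
pseudo_labels = teacher.predict(X_unlabeled)   # the "guessed meaning of the word"

X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_labels])
student = LogisticRegression(max_iter=1000).fit(X_all, y_all)
print(f"student train accuracy: {student.score(X_all, y_all):.2f}")
```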

Despite the great success, some fundamental problems remain unsolved. We still do not know what nature is hidden in the vast number of model parameters, and the huge computational cost of training these behemoths hinders further exploration of that nature, which may soon push large models to a ceiling beyond which they cannot go.

Disclaimer: The views and data in this article are for reference only and may deviate from the actual situation. This article does not constitute investment advice; all views and data represent only the author's position and carry no guidance for investment or decision-making.