
The memory bottleneck has been broken! New techniques for large model training arrive, and bandwidth no longer constrains large model training

Author: Xi Xiaoyao Technology Says

Here's a quick question: how do you train a ChatGPT from scratch with GPU cards of limited performance?


At present, as model parameter counts keep skyrocketing, the demand for compute keeps rising with them. GPT-3 reportedly used some 10,000 GPUs for about 30 days to train its 175 billion parameters; under realistic conditions we cannot marshal unlimited compute, and on top of that, the accelerator cards in our hands may differ widely, with uneven levels of GPU memory and bandwidth.

We find the answer in this paper.

Title of the paper:

YUAN 2.0: A Large Language Model with Localized Filtering-based Attention

Paper Links:

https://arxiv.org/ftp/arxiv/papers/2311/2311.15786.pdf

Project Address:

https://github.com/IEIT-Yuan/Yuan-2.0

Model Download:

https://huggingface.co/IEITYuan

Last month, Inspur Information released Yuan 2.0, an open-source large model at the hundred-billion-parameter scale. According to its release, Yuan 2.0 surpasses GPT-3.5 and approaches GPT-4 on several evaluation metrics. As the technical report for Yuan 2.0, this paper focuses mainly on the model-structure innovation, namely the new attention mechanism LFA (Localized Filtering-based Attention) mentioned in the title, but Yuan 2.0's innovations in distributed training strategy are just as worth digging into.


Starting from the scaling laws, there is an inherent tension between today's limited compute and model parameter counts whose ceiling is nowhere in sight. Training and developing large models efficiently under limited resources is a key engineering challenge and research problem worldwide. Yuan 2.0 puts its effort into the distributed training strategy, and in doing so tackles the core engineering problem of training large models efficiently on compute that is both limited and heterogeneous.

So how did Yuan 2.0 train a high-quality, efficient large model from scratch in a constrained GPU environment? Let's begin at the beginning, with how a large model is born, and take it step by step.

1. The birth of large models: 3D parallelism

Training a large model is inseparable from parallelism. Why must we consider parallelism at all? Simply because the workload is huge and the horsepower is insufficient.

In large model training, "huge" has two sides: a huge model, with the parameter arms race now soaring toward the trillions, and huge data, with training corpora reaching astronomical sizes. "Insufficient horsepower" also has two sides: insufficient compute, so training is too slow, and insufficient GPU memory, so the model simply does not fit. To make it concrete, GPT-3 has 175 billion parameters trained on 570 GB of corpus data; training such a behemoth on 8 V100 cards would be expected to take around 36 years.


Obviously that is not a reasonable timescale. Let's review the basic flow of model training. For a neural network, we feed in data and run the forward pass: compute wx + b, apply the activation functions, and then compute the loss. From the loss we run the backward pass, differentiating with respect to the parameters to obtain the gradients, then hand the gradients to the optimizer to update the model weights, and repeat.
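To make that loop concrete, here is a minimal sketch of one training step in PyTorch. The toy model, shapes, and hyperparameters are illustrative assumptions, not Yuan 2.0's actual training code:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 512)             # a batch of inputs
y = torch.randint(0, 10, (32,))      # dummy labels

logits = model(x)                    # forward pass: wx + b and activations, layer by layer
loss = loss_fn(logits, y)            # compute the loss
loss.backward()                      # backward pass: gradients for every parameter
optimizer.step()                     # optimizer updates the weights from the gradients
optimizer.zero_grad()                # clear gradients before the next step
```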


In practice, the training cost is bounded by both the size of the model and the amount of input data, which gives rise to the two classic distributed training strategies: model parallelism and data parallelism.

First, model parallelism. One pain point of large model training is that the model itself is too big, with too many parameters, and the direct response is: if the model is too big, take the model apart. That is the idea behind model parallelism (MP), which comes in two flavors: pipeline parallelism (PP) and tensor parallelism (TP).


Let's look at pipeline parallelism first. Deep learning models are generally built up in layers. Suppose we have an 8-layer model; the simplest form of pipeline parallelism is to slice it by layers. For example, with 4 GPUs in the pipeline, GPU 0 handles layers 1 and 2, GPU 1 handles layers 3 and 4, GPU 2 handles layers 5 and 6, and GPU 3 handles layers 7 and 8. GPU 0 computes its intermediate activations and passes them to GPU 1 as tensors, GPU 1 passes its results on to GPU 2, and so on until GPU 3 produces the model output and backpropagation runs back through the stages to complete one training step.
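A toy single-process sketch of this layer-slicing idea follows; it only mimics the stage-by-stage dataflow, whereas real pipeline parallelism would place each stage on its own GPU and send the activation tensors over the interconnect:

```python
import torch
import torch.nn as nn

# 8 layers split into 4 stages of 2 layers each; stage k would live on GPU k
layers = [nn.Linear(256, 256) for _ in range(8)]
stages = [nn.Sequential(*layers[i:i + 2]) for i in range(0, 8, 2)]

x = torch.randn(16, 256)
h = x
for stage in stages:      # GPU 0 -> GPU 1 -> GPU 2 -> GPU 3
    h = stage(h)          # in a real pipeline this tensor is sent to the next GPU
output = h                # the last stage holds the output and starts backpropagation
```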


Besides splitting by layers, another model-parallel idea is to split the tensors directly. This comes straight from the structure of matrix multiplication: for Y = XA, we can split the weight matrix A by columns into [A1, A2, ..., AN], so that Y = [XA1, XA2, ..., XAN]. With N GPUs, one large matrix multiplication is thus split into N smaller multiplications, one per device.
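Here is a minimal sketch of the column-split idea, simulated in a single process; in real tensor parallelism each partial product would live on a different GPU:

```python
import torch

X = torch.randn(16, 256)
A = torch.randn(256, 512)

A1, A2 = A.chunk(2, dim=1)           # split the weight matrix by columns
Y1 = X @ A1                          # would run on GPU 0
Y2 = X @ A2                          # would run on GPU 1
Y = torch.cat([Y1, Y2], dim=1)       # gather the partial results

assert torch.allclose(Y, X @ A, atol=1e-5)   # the split computes the same product
```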


Beyond model parallelism, the size of the input data also directly determines memory usage and compute. So when the model is too large we split the model, and when the data is too large we split the data. That is the other classic distributed strategy, data parallelism (DP): in short, if one card cannot chew through all the data, we split the data by batch and hand the shards to multiple cards, each training the same model.
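A minimal sketch of the batch-splitting idea, simulated in one process; a real setup would run one process per GPU (for example with PyTorch's DistributedDataParallel) and all-reduce the gradients:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 128)
y = torch.randint(0, 10, (64,))

grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):     # 4 workers, each gets 16 samples
    model.zero_grad()
    loss_fn(model(xs), ys).backward()          # each worker computes gradients on its shard
    grads.append([p.grad.clone() for p in model.parameters()])

# the all-reduce step: average gradients across workers before the optimizer update
avg_grads = [torch.stack(gs).mean(dim=0) for gs in zip(*grads)]
```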


Together, the pipeline parallelism (PP), tensor parallelism (TP), and data parallelism (DP) described above make up the basic method of large-scale model training: 3D parallelism. By applying distributed training along three axes (training data, model tensors, and model layers), 3D parallelism has successfully trained models with trillions of parameters. It sidesteps the compute and GPU-memory bottlenecks of large model training and pushes model scale to heights that were once unimaginable.


In fact, as early as 2021, Inspur Information launched Yuan 1.0 as its counterpart to GPT-3, already using the traditional 3D parallel strategy of data + pipeline + tensor parallelism. According to public figures, Yuan 1.0, with 245.7 billion parameters, reached a training efficiency of 44.8%, versus 21.3% for the 175-billion-parameter GPT-3.


2. Beyond 3D parallelism: another possibility for large model training

Is 3D parallelism the end of the story? Obviously not. When the hardware environment is heterogeneous, different GPU cards can differ significantly in memory and bandwidth, and even identical cards may carry different loads. In this mixed-compute setting, 3D parallelism is prone to blowing past the memory budget and hitting OOM partway through training, yet developing large models on heterogeneous compute chips is often an unavoidable reality. How to handle this messy practical problem on top of existing 3D parallel techniques has therefore become a pressing pain point.

Two years later, Yuan 2.0 has 102.6 billion parameters, less than half of Yuan 1.0, yet it not only clearly surpasses Yuan 1.0 in capability across the board but also greatly improves compute efficiency. The answer is a set of innovative distributed training strategies: non-uniform pipeline parallelism + optimizer parameter parallelism (ZeRO-1) + data parallelism + blocked loss computation. This is the key to Yuan 2.0's success!

Specifically, think back to the pipeline-parallel workflow. One problem with traditional pipeline parallelism is that it assigns the same number of model layers to every working GPU. That seems fine at first glance, but in actual training the first few pipeline stages have to cache far more activations for backpropagation, while the later stages only need to cache a few. In other words, the front stages may be squeezing their memory to the limit while the later stages still have plenty left over. Suppose a 24-layer Transformer is pipelined over 8 GPUs: an even split puts 3 layers on each GPU, but since backpropagation proceeds from the back of the pipeline to the front, the GPUs at the front must hold their activations the longest and store the most of them, which can easily push them past the memory limit and crash the training, and the problem gets worse when the GPU cards themselves are not high-end.

As Figure (b) above shows, the "absolute egalitarianism" of uniform pipeline parallelism makes some GPUs do more work for less benefit, so it is better to "distribute according to workload": let each GPU in the pipeline carry an uneven number of layers, reducing peak memory overhead and achieving the best overall balance of memory and compute. For example, a 12-layer model split over 4 GPUs might give GPU 1 only 2 layers, GPU 2 3 layers, GPU 3 4 layers, and GPU 4 3 layers, as in the sketch below.
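A toy sketch of such a non-uniform partition, using the hypothetical 12-layer / 4-GPU split from the example above; the numbers are illustrative, not Yuan 2.0's actual configuration:

```python
import torch.nn as nn

layers = [nn.Linear(256, 256) for _ in range(12)]   # stand-ins for transformer blocks
layers_per_stage = [2, 3, 4, 3]                     # earlier stages get fewer layers

stages, start = [], 0
for n in layers_per_stage:
    stages.append(nn.Sequential(*layers[start:start + n]))   # stage k would live on GPU k
    start += n

assert sum(layers_per_stage) == len(layers)
```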


At the same time, to further cut memory overhead, Yuan 2.0 proposes block-by-block cross-entropy computation. Take an input of 2,048 tokens, for example: the blocked cross-entropy first splits the tokens into 16 blocks of 128 tokens each and computes 128 per-token losses for each block. By concatenating all the losses into a single tensor and releasing the intermediate temporaries along the way, training resolves the memory bottleneck in the last pipeline stage mentioned above without any extra computation or communication.
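Here is a minimal sketch of the blocked cross-entropy idea with the numbers from the example (2,048 tokens in 16 blocks of 128); the vocabulary size and tensor names are assumptions for illustration, not Yuan 2.0's implementation:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, n_blocks = 32000, 2048, 16
logits = torch.randn(seq_len, vocab_size)           # output of the last pipeline stage
targets = torch.randint(0, vocab_size, (seq_len,))

losses = []
for lg, tg in zip(logits.chunk(n_blocks), targets.chunk(n_blocks)):
    # 128 per-token losses for one block; temporaries from other blocks can be freed in between
    losses.append(F.cross_entropy(lg, tg, reduction="none"))

loss = torch.cat(losses).mean()   # same value as one big cross_entropy over all 2,048 tokens
```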

In addition, if we want to keep pushing the idea of data parallelism, a closer look shows that in data parallelism every working GPU still stores a complete copy of the model state, which directly makes GPU memory the bottleneck on model scale. Does every GPU really need to keep a full copy? Obviously not, and that is what leads to the famous ZeRO.

Spelling out the letters, ZeRO stands for Zero Redundancy Optimizer. ZeRO starts by analyzing what the "model state" actually consists of. From the training loop above, it has three main parts: the model parameters themselves, the gradients, and the optimizer states. Yet keeping a full redundant copy of all of them on every GPU is unnecessary; in principle, one copy across the cluster is enough.


From this point of view it is very natural: with N GPUs, for any of these three kinds of state, each working GPU really only needs 1/N of it to do its share of the training. Yuan 2.0 therefore adopts the ZeRO idea and, weighing the memory savings against the added communication cost for each of the three states, chooses ZeRO-1, i.e. partitioning the optimizer states first. As the figure above shows, partitioning the optimizer states alone cuts the per-GPU memory in the example from 120 GB at the baseline to 31.4 GB, a major saving in GPU memory usage.
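A conceptual sketch of the ZeRO-1 partitioning idea follows; it is pure bookkeeping in one process, and the names, shapes, and round-robin assignment are illustrative assumptions, not the actual Yuan 2.0 or DeepSpeed implementation. Each data-parallel worker keeps the full parameters and gradients but owns the optimizer states (e.g. Adam moments) for only its 1/N shard of the parameters:

```python
import torch

n_workers = 4
params = [torch.randn(1024) for _ in range(8)]   # flattened parameter shards of the model

# round-robin assignment: worker k holds optimizer states only for the parameters it owns
owned = {k: [p for i, p in enumerate(params) if i % n_workers == k]
         for k in range(n_workers)}

total = sum(p.numel() for p in params)
for k in range(n_workers):
    n_owned = sum(p.numel() for p in owned[k])
    print(f"worker {k}: optimizer state kept for {n_owned}/{total} parameters")
```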


Beyond pairing non-uniform pipeline parallelism with ZeRO-1, Yuan 2.0 also drops tensor parallelism entirely. Tensor parallelism slices the model parameters at a very fine granularity right at the source, and fine-grained computation means the GPUs must communicate frequently, making the chip-to-chip P2P bandwidth the main bottleneck of training. For example, when the P2P bandwidth between chips is 900 GB/s, communication takes about 10% of the time; at 400 GB/s it rises to about 23%. Converted, the drop in P2P bandwidth translates into a performance loss of roughly 17%.

Note: communication analysis of the 3D parallel training strategy of Inspur Information's Yuan 1.0

Based on this finding, Yuan 2.0 cancels tensor parallelism and rebuilds its distributed training strategy around non-uniform pipeline parallelism, blocked loss computation, ZeRO-1, and data parallelism, so that inter-chip communication bandwidth is no longer an obstacle to model training. As the figure below shows, when the chip-to-chip P2P bandwidth drops from 400 GB/s to 100 GB/s, the performance of this distributed training algorithm barely changes (about 0.23%), which makes it far better suited to compute clusters with widely varying performance levels.


3. A brief summary

At the end of 2023, looking back on a year of breakneck progress in large models, from the arrival of ChatGPT to the rising tide of AI Agents, from GPT-4 leading the world to today's profusion of competing models... In just one year, large models have given us our most vivid glimpse yet of a coming "age of intelligence". Technologies have sprung up and industry applications have poured in, and even in the past month alone we have seen models such as Google's Gemini, Microsoft's Phi-2, and Mistral AI's Mixtral.

And batch after batch of these models come and go. If all we pay attention to is how many places a model climbs on some benchmark or how well it scores on some dataset, it may fuel lively conversation over tea for a while, but at today's breakneck pace of development it will soon be forgotten and become a thing of the past.

Setting SOTA aside, innovating in training technology for real-world scenarios matters a great deal. We often expect domestic technology to "catch up with the United States", but a true "engineering advantage" never comes from a single paper or a technology that appears overnight; the real engineering strength of a major power grows out of the accumulation of innovations like Yuan 2.0's distributed training.

Proposing a fancy concept in a paper may be easy, and posting a couple of pretty numbers on a toy example may be simple, but it is far harder to assemble the theory piece by piece, face real-world problems, and link one small technical innovation to the next until they support a massive project, let alone open-source it.

And at a moment when large models are blooming everywhere, perhaps that is exactly why work like Yuan 2.0 is so rare and so precious!
