
The memory bottleneck has been broken! New techniques for large model training arrive, and bandwidth no longer constrains large model training

Author: Xi Xiaoyao Technology Says

Here's a quick question: how do you train a ChatGPT from scratch with GPU cards of limited performance?


At present, as model parameter counts keep skyrocketing, the demand for compute keeps rising with them. GPT-3 reportedly used some 10,000 GPUs for about 30 days to train its 175 billion parameters; under realistic conditions we cannot marshal unlimited compute, and on top of that, the accelerator cards in our hands may differ widely, with uneven levels of GPU memory and bandwidth.

We find the answer in this paper.

Title of the paper:

YUAN 2.0: A Large Language Model with Localized Filtering-based Attention

Paper Links:

https://arxiv.org/ftp/arxiv/papers/2311/2311.15786.pdf

Project Address:

https://github.com/IEIT-Yuan/Yuan-2.0

Model Download:

https://huggingface.co/IEITYuan

Last month, Inspur Information released Yuan 2.0, an open-source large model at the hundred-billion-parameter scale. According to its release, Yuan 2.0 surpasses GPT-3.5 and approaches GPT-4 on several evaluation metrics. As the technical report for Yuan 2.0, this paper focuses mainly on the model-structure innovation, namely the new attention mechanism LFA (Localized Filtering-based Attention) mentioned in the title, but Yuan 2.0's innovations in distributed training strategy are just as worth digging into.


Starting from the scaling laws, there is an inherent tension between today's limited compute and model parameter counts whose ceiling is nowhere in sight. Training and developing large models efficiently under limited resources is a key engineering challenge and research problem worldwide. Yuan 2.0 puts its effort into the distributed training strategy, and in doing so tackles the core engineering problem of training large models efficiently on compute that is both limited and heterogeneous.

So how did Yuan 2.0 train a high-quality, efficient large model from scratch in a constrained GPU environment? Let's begin at the beginning, with how a large model is born, and take it step by step.

1. The birth of large models: 3D parallelism

Training a large model is inseparable from parallelism. Why must we consider parallelism at all? Simply because the workload is huge and the horsepower is insufficient.

In large model training, "huge" has two sides: a huge model, with the parameter arms race now soaring toward the trillions, and huge data, with training corpora reaching astronomical sizes. "Insufficient horsepower" also has two sides: insufficient compute, so training is too slow, and insufficient GPU memory, so the model simply does not fit. To make it concrete, GPT-3 has 175 billion parameters trained on 570 GB of corpus data; training such a behemoth on 8 V100 cards would be expected to take around 36 years.


Obviously that is not a reasonable timescale. Let's review the basic flow of model training. For a neural network, we feed in data and run the forward pass: compute wx + b, apply the activation functions, and then compute the loss. From the loss we run the backward pass, differentiating with respect to the parameters to obtain the gradients, then hand the gradients to the optimizer to update the model weights, and repeat.
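To make that loop concrete, here is a minimal sketch of one training step in PyTorch. The toy model, shapes, and hyperparameters are illustrative assumptions, not Yuan 2.0's actual training code:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 512)             # a batch of inputs
y = torch.randint(0, 10, (32,))      # dummy labels

logits = model(x)                    # forward pass: wx + b and activations, layer by layer
loss = loss_fn(logits, y)            # compute the loss
loss.backward()                      # backward pass: gradients for every parameter
optimizer.step()                     # optimizer updates the weights from the gradients
optimizer.zero_grad()                # clear gradients before the next step
```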


In practice, the training cost is bounded by both the size of the model and the amount of input data, which gives rise to the two classic distributed training strategies: model parallelism and data parallelism.

First, model parallelism. One pain point of large model training is that the model itself is too big, with too many parameters, and the direct response is: if the model is too big, take the model apart. That is the idea behind model parallelism (MP), which comes in two flavors: pipeline parallelism (PP) and tensor parallelism (TP).


Let's look at pipeline parallelism first. Deep learning models are generally built up in layers. Suppose we have an 8-layer model; the simplest form of pipeline parallelism is to slice it by layers. For example, with 4 GPUs in the pipeline, GPU 0 handles layers 1 and 2, GPU 1 handles layers 3 and 4, GPU 2 handles layers 5 and 6, and GPU 3 handles layers 7 and 8. GPU 0 computes its intermediate activations and passes them to GPU 1 as tensors, GPU 1 passes its results on to GPU 2, and so on until GPU 3 produces the model output and backpropagation runs back through the stages to complete one training step.
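A toy single-process sketch of this layer-slicing idea follows; it only mimics the stage-by-stage dataflow, whereas real pipeline parallelism would place each stage on its own GPU and send the activation tensors over the interconnect:

```python
import torch
import torch.nn as nn

# 8 layers split into 4 stages of 2 layers each; stage k would live on GPU k
layers = [nn.Linear(256, 256) for _ in range(8)]
stages = [nn.Sequential(*layers[i:i + 2]) for i in range(0, 8, 2)]

x = torch.randn(16, 256)
h = x
for stage in stages:      # GPU 0 -> GPU 1 -> GPU 2 -> GPU 3
    h = stage(h)          # in a real pipeline this tensor is sent to the next GPU
output = h                # the last stage holds the output and starts backpropagation
```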


Besides splitting by layers, another model-parallel idea is to split the tensors directly. This comes straight from the structure of matrix multiplication: for Y = XA, we can split the weight matrix A by columns into [A1, A2, ..., AN], so that Y = [XA1, XA2, ..., XAN]. With N GPUs, one large matrix multiplication is thus split into N smaller multiplications, one per device.
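Here is a minimal sketch of the column-split idea, simulated in a single process; in real tensor parallelism each partial product would live on a different GPU:

```python
import torch

X = torch.randn(16, 256)
A = torch.randn(256, 512)

A1, A2 = A.chunk(2, dim=1)           # split the weight matrix by columns
Y1 = X @ A1                          # would run on GPU 0
Y2 = X @ A2                          # would run on GPU 1
Y = torch.cat([Y1, Y2], dim=1)       # gather the partial results

assert torch.allclose(Y, X @ A, atol=1e-5)   # the split computes the same product
```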


Beyond model parallelism, the size of the input data also directly determines memory usage and compute. So when the model is too large we split the model, and when the data is too large we split the data. That is the other classic distributed strategy, data parallelism (DP): in short, if one card cannot chew through all the data, we split the data by batch and hand the shards to multiple cards, each training the same model.
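A minimal sketch of the batch-splitting idea, simulated in one process; a real setup would run one process per GPU (for example with PyTorch's DistributedDataParallel) and all-reduce the gradients:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 128)
y = torch.randint(0, 10, (64,))

grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):     # 4 workers, each gets 16 samples
    model.zero_grad()
    loss_fn(model(xs), ys).backward()          # each worker computes gradients on its shard
    grads.append([p.grad.clone() for p in model.parameters()])

# the all-reduce step: average gradients across workers before the optimizer update
avg_grads = [torch.stack(gs).mean(dim=0) for gs in zip(*grads)]
```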


Together, the pipeline parallelism (PP), tensor parallelism (TP), and data parallelism (DP) described above make up the basic method of large-scale model training: 3D parallelism. By applying distributed training along three axes (training data, model tensors, and model layers), 3D parallelism has successfully trained models with trillions of parameters. It sidesteps the compute and GPU-memory bottlenecks of large model training and pushes model scale to heights that were once unimaginable.


In fact, as early as 2021, Inspur Information launched Yuan 1.0 as its counterpart to GPT-3, already using the traditional 3D parallel strategy of data + pipeline + tensor parallelism. According to public figures, Yuan 1.0, with 245.7 billion parameters, reached a training efficiency of 44.8%, versus 21.3% for the 175-billion-parameter GPT-3.


2. Beyond 3D parallelism: another possibility for large model training

Is 3D parallelism the end of the story? Obviously not. When the hardware environment is heterogeneous, different GPU cards can differ significantly in memory and bandwidth, and even identical cards may carry different loads. In this mixed-compute setting, 3D parallelism is prone to blowing past the memory budget and hitting OOM partway through training, yet developing large models on heterogeneous compute chips is often an unavoidable reality. How to handle this messy practical problem on top of existing 3D parallel techniques has therefore become a pressing pain point.

Two years later, Yuan 2.0 has 102.6 billion parameters, less than half of Yuan 1.0, yet it not only clearly surpasses Yuan 1.0 in capability across the board but also greatly improves compute efficiency. The answer is a set of innovative distributed training strategies: non-uniform pipeline parallelism + optimizer parameter parallelism (ZeRO-1) + data parallelism + blocked loss computation. This is the key to Yuan 2.0's success!

Specifically, think back to the pipeline-parallel workflow. One problem with traditional pipeline parallelism is that it assigns the same number of model layers to every working GPU. That seems fine at first glance, but in actual training the first few pipeline stages have to cache far more activations for backpropagation, while the later stages only need to cache a few. In other words, the front stages may be squeezing their memory to the limit while the later stages still have plenty left over. Suppose a 24-layer Transformer is pipelined over 8 GPUs: an even split puts 3 layers on each GPU, but since backpropagation proceeds from the back of the pipeline to the front, the GPUs at the front must hold their activations the longest and store the most of them, which can easily push them past the memory limit and crash the training, and the problem gets worse when the GPU cards themselves are not high-end.

As Figure (b) above shows, the "absolute egalitarianism" of uniform pipeline parallelism makes some GPUs do more work for less benefit, so it is better to "distribute according to workload": let each GPU in the pipeline carry an uneven number of layers, reducing peak memory overhead and achieving the best overall balance of memory and compute. For example, a 12-layer model split over 4 GPUs might give GPU 1 only 2 layers, GPU 2 3 layers, GPU 3 4 layers, and GPU 4 3 layers, as in the sketch below.
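A toy sketch of such a non-uniform partition, using the hypothetical 12-layer / 4-GPU split from the example above; the numbers are illustrative, not Yuan 2.0's actual configuration:

```python
import torch.nn as nn

layers = [nn.Linear(256, 256) for _ in range(12)]   # stand-ins for transformer blocks
layers_per_stage = [2, 3, 4, 3]                     # earlier stages get fewer layers

stages, start = [], 0
for n in layers_per_stage:
    stages.append(nn.Sequential(*layers[start:start + n]))   # stage k would live on GPU k
    start += n

assert sum(layers_per_stage) == len(layers)
```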


At the same time, to further cut memory overhead, Yuan 2.0 proposes block-by-block cross-entropy computation. Take an input of 2,048 tokens, for example: the blocked cross-entropy first splits the tokens into 16 blocks of 128 tokens each and computes 128 per-token losses for each block. By concatenating all the losses into a single tensor and releasing the intermediate temporaries along the way, training resolves the memory bottleneck in the last pipeline stage mentioned above without any extra computation or communication.
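Here is a minimal sketch of the blocked cross-entropy idea with the numbers from the example (2,048 tokens in 16 blocks of 128); the vocabulary size and tensor names are assumptions for illustration, not Yuan 2.0's implementation:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, n_blocks = 32000, 2048, 16
logits = torch.randn(seq_len, vocab_size)           # output of the last pipeline stage
targets = torch.randint(0, vocab_size, (seq_len,))

losses = []
for lg, tg in zip(logits.chunk(n_blocks), targets.chunk(n_blocks)):
    # 128 per-token losses for one block; temporaries from other blocks can be freed in between
    losses.append(F.cross_entropy(lg, tg, reduction="none"))

loss = torch.cat(losses).mean()   # same value as one big cross_entropy over all 2,048 tokens
```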

In addition, if we want to keep pushing the idea of data parallelism, a closer look shows that in data parallelism every working GPU still stores a complete copy of the model state, which directly makes GPU memory the bottleneck on model scale. Does every GPU really need to keep a full copy? Obviously not, and that is what leads to the famous ZeRO.

Spelling out the letters, ZeRO stands for Zero Redundancy Optimizer. ZeRO starts by analyzing what the "model state" actually consists of. From the training loop above, it has three main parts: the model parameters themselves, the gradients, and the optimizer states. Yet keeping a full redundant copy of all of them on every GPU is unnecessary; in principle, one copy across the cluster is enough.


From this point of view it is very natural: with N GPUs, for any of these three kinds of state, each working GPU really only needs 1/N of it to do its share of the training. Yuan 2.0 therefore adopts the ZeRO idea and, weighing the memory savings against the added communication cost for each of the three states, chooses ZeRO-1, i.e. partitioning the optimizer states first. As the figure above shows, partitioning the optimizer states alone cuts the per-GPU memory in the example from 120 GB at the baseline to 31.4 GB, a major saving in GPU memory usage.
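A conceptual sketch of the ZeRO-1 partitioning idea follows; it is pure bookkeeping in one process, and the names, shapes, and round-robin assignment are illustrative assumptions, not the actual Yuan 2.0 or DeepSpeed implementation. Each data-parallel worker keeps the full parameters and gradients but owns the optimizer states (e.g. Adam moments) for only its 1/N shard of the parameters:

```python
import torch

n_workers = 4
params = [torch.randn(1024) for _ in range(8)]   # flattened parameter shards of the model

# round-robin assignment: worker k holds optimizer states only for the parameters it owns
owned = {k: [p for i, p in enumerate(params) if i % n_workers == k]
         for k in range(n_workers)}

total = sum(p.numel() for p in params)
for k in range(n_workers):
    n_owned = sum(p.numel() for p in owned[k])
    print(f"worker {k}: optimizer state kept for {n_owned}/{total} parameters")
```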


Beyond pairing non-uniform pipeline parallelism with ZeRO-1, Yuan 2.0 also drops tensor parallelism entirely. Tensor parallelism slices the model parameters at a very fine granularity right at the source, and fine-grained computation means the GPUs must communicate frequently, making the chip-to-chip P2P bandwidth the main bottleneck of training. For example, when the P2P bandwidth between chips is 900 GB/s, communication takes about 10% of the time; at 400 GB/s it rises to about 23%. Converted, the drop in P2P bandwidth translates into a performance loss of roughly 17%.

Note: communication analysis of the 3D parallel training strategy of Inspur Information's Yuan 1.0

Based on this finding, Yuan 2.0 cancels tensor parallelism and rebuilds its distributed training strategy around non-uniform pipeline parallelism, blocked loss computation, ZeRO-1, and data parallelism, so that inter-chip communication bandwidth is no longer an obstacle to model training. As the figure below shows, when the chip-to-chip P2P bandwidth drops from 400 GB/s to 100 GB/s, the performance of this distributed training algorithm barely changes (about 0.23%), which makes it far better suited to compute clusters with widely varying performance levels.


3. A brief summary

At the end of 2023, looking back on a year of breakneck progress in large models, from the arrival of ChatGPT to the rising tide of AI Agents, from GPT-4 leading the world to today's profusion of competing models... In just one year, large models have given us our most vivid glimpse yet of a coming "age of intelligence". Technologies have sprung up and industry applications have poured in, and even in the past month alone we have seen models such as Google's Gemini, Microsoft's Phi-2, and Mistral AI's Mixtral.

And batch after batch of these models come and go. If all we pay attention to is how many places a model climbs on some benchmark or how well it scores on some dataset, it may fuel lively conversation over tea for a while, but at today's breakneck pace of development it will soon be forgotten and become a thing of the past.

Setting SOTA aside, innovating in training technology for real-world scenarios matters a great deal. We often expect domestic technology to "catch up with the United States", but a true "engineering advantage" never comes from a single paper or a technology that appears overnight; the real engineering strength of a major power grows out of the accumulation of innovations like Yuan 2.0's distributed training.

Proposing a fancy concept in a paper may be easy, and posting a couple of pretty numbers on a toy example may be simple, but it is far harder to assemble the theory piece by piece, face real-world problems, and link one small technical innovation to the next until they support a massive project, let alone open-source it.

And at a moment when large models are blooming everywhere, perhaps that is exactly why work like Yuan 2.0 is so rare and so precious!
