DeepSpeed ZeRO++: Significantly improves the training efficiency of large models and ChatGPT-like models

Author: Heart of the Machine Pro

Reprinted at the Heart of the Machine

Source: Zhihu

Written by Microsoft DeepSpeed

Large AI models are changing the digital world. Generative language models built on large language models (LLMs), such as Turing-NLG, ChatGPT, and GPT-4, are versatile and capable of performing tasks such as summarization, code generation, and translation. Similarly, large multimodal generative models such as DALL-E, Microsoft Designer, and Bing Image Creator can generate art, architecture, video, and other digital assets, enabling content creators, architects, and engineers to explore new forms of creative productivity.

However, training these large models requires substantial memory and compute resources across hundreds or even thousands of GPU devices. For example, training the Megatron-Turing NLG 530B model required more than 4,000 NVIDIA A100 GPUs. Efficient use of these resources requires a sophisticated optimization system that appropriately distributes the model across the memory of individual devices and efficiently parallelizes computation on those devices. At the same time, for the deep learning community to train large models easily, these optimizations must be easy to use.

DeepSpeed's ZeRO optimization series provides a powerful solution to these challenges and has been widely used in the training of large deep learning models such as TNLG-17B, Bloom-176B, MPT-7B, and Jurassic-1. Despite its transformative capabilities, ZeRO incurs significant data transfer overhead between GPUs in some key scenarios, which reduces training efficiency. This happens especially when a) the global batch size is small while the number of GPUs is large, which results in a small per-GPU batch size and requires frequent communication; or b) training on low-end clusters where limited cross-node network bandwidth causes high communication latency. In these cases, ZeRO's training efficiency is limited.

To address these limitations, we released ZeRO++. ZeRO++ reduces total traffic by 4x compared to ZeRO without compromising model quality. This has two key implications:

1. ZeRO++ accelerates large-scale model pre-training and fine-tuning

When batch size is small per GPU: Whether pre-training large models on thousands of GPUs or fine-tuning them on hundreds or even dozens of GPUs, ZeRO++ provides 2.2x higher throughput than ZeRO when the batch size per GPU is small, directly reducing training time and cost.

Low-bandwidth compute clusters: ZeRO++ enables low-bandwidth clusters to achieve similar throughput to high-end clusters with 4x higher bandwidth. As a result, ZeRO++ enables efficient training of large models across a wider range of clusters.

2. ZeRO++ accelerates RLHF training for ChatGPT-like models

Although ZeRO++ is primarily designed for training, its optimizations also apply automatically to ZeRO-Inference, because the communication overheads are common to ZeRO-powered training and inference. Thus, ZeRO++ can improve the efficiency of algorithms such as reinforcement learning from human feedback (RLHF), which combines training and inference.

Through integration with DeepSpeed-Chat, ZeRO++ can improve the efficiency of the generative phase of RLHF training by up to 2x and the reinforcement learning training phase by up to 1.3x compared to the original ZeRO.

Next, we'll explain ZeRO and its communication overhead in more depth, and discuss key optimizations in ZeRO++ to address these issues. Then we will show the impact of ZeRO++ on training throughput for different model sizes, batch sizes, and bandwidth limitations. We will also discuss how ZeRO++ can be applied to DeepSpeed-Chat to speed up the training of conversational models using RLHF.

ZeRO++ explained

Figure 2: ZeRO optimizer workflow diagram (partial illustration; see the original Zhihu post for the complete process)

ZeRO is a memory-efficient variant of data parallelism in which model states are partitioned across all GPUs instead of being replicated, and are reconstructed on the fly during training using gather/broadcast-based communication collectives. This enables ZeRO to efficiently utilize the aggregate GPU memory and compute power of all devices, while providing easy-to-use data-parallel training.

Suppose the model size is M. During the forward pass, ZeRO performs all-gather/broadcast operations to collect the parameters of each model layer just before they are needed (a total of M). During the backward pass, ZeRO uses a similar communication pattern on the parameters of each layer to compute its local gradients (a total of M). In addition, ZeRO averages and partitions each local gradient immediately after it is computed, using reduce or reduce-scatter communication (a total of M). In total, ZeRO therefore communicates 3M of data, evenly distributed across two all-gather/broadcast operations and one reduce-scatter/reduce operation.
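As a rough, illustrative calculation (the model size here is assumed for the example): for a 10B-parameter model trained in FP16, M corresponds to about 20 GB of parameter data, so each training iteration moves on the order of 60 GB per GPU across the forward all-gather, the backward all-gather, and the gradient reduce-scatter. When the per-GPU batch size is small, there is little computation available to hide this communication, which is exactly the regime ZeRO++ targets.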

To reduce these communication overheads, ZeRO++ introduces three sets of communication optimizations, one for each of the three collectives described above:

Figure 3: Illustration of partitioned quantization in qwZ

Quantized weight communication for ZeRO (qwZ)

First, to reduce parameter communication volume during all-gather, we use weight quantization to dynamically shrink each model parameter from FP16 (two bytes) to INT8 (one byte) before communication, and dequantize the weights after communication. However, naively quantizing the weights reduces model training accuracy. To maintain good accuracy, we use partitioned quantization, in which each subset of the model parameters is quantized independently. No high-performance implementation of partitioned quantization existed, so we implemented highly optimized quantization CUDA kernels from scratch that achieve 3x better accuracy and 5x higher speed than basic quantization.
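To illustrate the idea, below is a minimal sketch of partitioned (block-wise) quantization in PyTorch. The block size and function names are illustrative only; this is not DeepSpeed's fused CUDA kernel, just a readable reference for what happens to the weights before and after the all-gather.

```python
import torch

def partitioned_quantize(weight: torch.Tensor, block_size: int = 2048):
    """Quantize an FP16 weight tensor to INT8 one block at a time.

    Each block gets its own scale, which confines the error introduced
    by outliers to a single block instead of the whole tensor.
    """
    flat = weight.flatten().float()
    pad = (-flat.numel()) % block_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    # One symmetric scale per block, mapping [-max, max] onto [-127, 127].
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(blocks / scales).clamp(-127, 127).to(torch.int8)
    return q, scales, weight.shape, pad

def partitioned_dequantize(q, scales, shape, pad):
    """Reverse the mapping after the all-gather completes."""
    deq = (q.float() * scales).flatten()
    if pad:
        deq = deq[:-pad]
    return deq.view(shape).to(torch.float16)

# Example: quantize before communication, dequantize after.
w = torch.randn(4096, 1024, dtype=torch.float16)
q, s, shape, pad = partitioned_quantize(w)
w_restored = partitioned_dequantize(q, s, shape, pad)
print((w - w_restored).abs().max())  # small per-block quantization error
```

Giving each block its own scale keeps the quantization error of any outlier confined to that block, which is why partitioned quantization preserves accuracy better than quantizing the whole tensor with a single scale.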

Figure 4: Hierarchical partitioning of weights (hpZ)

Hierarchical partitioning of ZeRO model weights (hpZ)

Second, to reduce the communication overhead of the all-gather on weights during the backward pass, we trade GPU memory for communication. More specifically, instead of spreading the entire model weights across all machines as in ZeRO, we maintain a complete copy of the model within each machine. At the cost of higher memory overhead, this allows us to replace the expensive cross-machine all-gather/broadcast on weights with an intra-machine all-gather/broadcast, which is much faster thanks to the higher intra-machine communication bandwidth.

Figure 5: qgZ end-to-end workflow

Quantized gradient communication for ZeRO (qgZ)

Third, reducing the cost of the reduce-scatter communication for gradients is more challenging, because directly applying quantization to reduce communication volume is not feasible. Even with partitioned quantization to reduce quantization error, the gradient reduction accumulates and amplifies the quantization errors. To solve this problem, we quantize gradients only before communication and dequantize them back to full precision before any reduce operation. To do this efficiently, we invented a novel all-to-all-based quantized gradient communication paradigm called qgZ, which is functionally equivalent to a compressed reduce-scatter operation.

qgZ is designed to solve two challenges: i) simply implementing reduce-scatter in INT4/INT8 would result in a significant loss of precision, and ii) using quantization in a traditional tree- or ring-based reduce-scatter requires a long sequence of quantization and dequantization steps, which leads directly to error accumulation and significant latency, even if the reduction itself is performed at full precision. To address these two challenges, qgZ does not use tree- or ring-based reduce-scatter algorithms; instead, it is based on a novel hierarchical all-to-all approach.

There are three main steps in qgZ:

  • gradient slice reordering;
  • intra-node communication and reduction;
  • inter-node communication and reduction.

First, before any communication takes place, we slice the gradients and reorder the tensor slices so that the final gradient placement on each GPU at the end of communication (i.e., the green blocks in Figure 5) is correct. Second, we quantize the reordered gradient slices, perform all-to-all communication within each node, dequantize the received slices, and perform a local reduction. Third, we quantize the locally reduced gradients again, perform all-to-all communication across nodes, dequantize the received gradients once more, and compute the final high-precision gradient reduction to obtain the result shown as the green blocks in Figure 5.

The reason for this hierarchical approach is to reduce cross-node traffic. More precisely, given N GPUs per node, a model size of M, and a quantization ratio of Z, a single-hop all-to-all would generate M*N/Z of cross-node traffic. In contrast, with the hierarchical approach, we reduce the cross-node traffic per GPU from M/Z to M/(Z*N), so the total traffic drops from M*N/Z to M*N/(Z*N) = M/Z. We further optimize the end-to-end latency of qgZ by overlapping intra-node and inter-node communication, and by fusing the CUDA kernels for (tensor slice reordering + intra-node quantization) and for (intra-node dequantization + intra-node reduction + inter-node quantization).
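As a concrete, illustrative example with assumed values: with N = 8 GPUs per node and a quantization ratio Z = 4 (FP16 to INT4), a single-hop all-to-all would send M*8/4 = 2M across nodes, while the hierarchical approach sends only M/4 in total, an N-fold (here 8x) reduction in cross-node traffic.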

Traffic volume optimization

By combining all three components above, we reduce cross-node traffic from 3M to 0.75M. More specifically, we use qwZ to reduce the forward all-gather/broadcast of model weights from M to 0.5M. We use hpZ to eliminate the cross-node all-gather during backpropagation, reducing that communication from M to 0. Finally, we use qgZ to reduce cross-node reduce-scatter communication during backpropagation from M to 0.25M.
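Putting the three together: 0.5M (weights all-gathered in INT8 instead of FP16) + 0 (backward all-gather kept entirely within nodes) + 0.25M (gradients quantized, e.g., to INT4, for the reduce-scatter) = 0.75M, which is the 4x reduction in total communication volume relative to ZeRO's 3M mentioned earlier.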

ZeRO++ accelerates large-scale language model training

Here, we present test results of ZeRO++ on real LLM training scenarios on 384 NVIDIA V100 GPUs.

Figure 6: Throughput of ZeRO++ vs. ZeRO for various model sizes on 384 V100 GPUs, with nodes interconnected by four InfiniBand (IB) links, each running at 100 Gbps.

ZeRO++ achieves higher training efficiency when the per-GPU batch size is small

High-bandwidth clusters: As shown in Figure 6, we first demonstrate the throughput improvement of ZeRO++ over ZeRO across different model sizes and micro-batch sizes, using a cross-node interconnect of four InfiniBand (IB) links, each running at 100 Gbps (400 Gbps aggregate). With a micro-batch size of 1k tokens per GPU, ZeRO++ delivers 28% to 36% higher throughput than ZeRO-3. With a 2k-token micro-batch size, ZeRO++ achieves throughput gains of 24% to 29% over ZeRO-3.

Figure 7: Throughput of various LLMs at 100 Gbps cross-node bandwidth on 384 V100 GPUs

Low-bandwidth clusters: In low-bandwidth network environments such as 100 Gbps, ZeRO++ performs significantly better than ZeRO-3. As shown in Figure 7, ZeRO++ achieves up to 2.2x speedup in end-to-end throughput compared to ZeRO-3. On average, ZeRO++ achieves approximately 2x speedup over the ZeRO-3 baseline.

Figure 8: ZeRO++ achieves high-bandwidth cluster performance with significantly reduced bandwidth

Achieving equivalent training efficiency on low-bandwidth clusters with ZeRO++ and high-bandwidth clusters with ZeRO

In addition, ZeRO++ on a low-bandwidth cluster can achieve system throughput comparable to ZeRO on a cluster with much higher bandwidth. As shown in Figure 8, for the 18B and 138B model sizes, ZeRO++ with 200 Gbps of cross-node bandwidth achieves TFLOPS similar to ZeRO-3 with 800 Gbps of cross-node bandwidth.

Given ZeRO++'s excellent scalability, we see ZeRO++ as the next generation of ZeRO for training large AI models.

Combining DeepSpeed-Chat with ZeRO++ for RLHF training

Introduction to RLHF training

ChatGPT-like models are powered by LLMs and fine-tuned with RLHF. RLHF consists of a generation (inference) phase and a training phase. In the generation phase, the actor model takes part of a conversation as input and generates responses using a series of forward passes. Then, during the training phase, the critic model ranks the generated responses by quality, providing a reinforcement signal for the actor model. The actor model is fine-tuned using these rankings so that it can produce more accurate and appropriate responses in subsequent iterations.

RLHF training introduces significant memory pressure because it uses four models (actor, reference, critic, reward). A common solution is Low-Rank Adaptation (LoRA) to address the memory pressure of RLHF. LoRA freezes the weights of the pretrained model and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, significantly reducing the number of trainable parameters. LoRA accelerates RLHF by reducing memory usage, allowing larger batch sizes and thereby greatly improving throughput.
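To make the LoRA idea concrete, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. The rank, scaling, and class name are illustrative assumptions, not DeepSpeed-Chat's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update.

    The pretrained weight stays frozen; only the rank-r factors A and B
    are trained, so trainable parameters drop from d_out * d_in to
    r * (d_in + d_out).
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + scaling * x A^T B^T
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 1024 = 16384, versus ~1M weights in the full layer
```

Because lora_B starts at zero, the layer initially behaves exactly like the frozen pretrained layer, and the low-rank update is learned from there.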

DeepSpeed-Chat with ZeRO++ is used for RLHF training

Figure 9: ZeRO++ accelerates the generation and training phases of RLHF training

ZeRO++ has a unique role in the RLHF + LoRA setting, because most of the model weights are frozen. This means ZeRO++ can keep these frozen weights quantized in INT4/INT8, rather than storing them in FP16 and quantizing them before each communication operation. The dequantization after communication is still performed to prepare the weights for computation, but the dequantized weights are simply discarded after the computation.
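The sketch below is a minimal illustration of this idea (not DeepSpeed's internal code): because the frozen base weights never change, they can be quantized once up front and only materialized in FP16 just before they are needed.

```python
import torch

class FrozenQuantizedWeight:
    """Holds a frozen weight in INT8 permanently; dequantizes on demand.

    With RLHF + LoRA the base weights never change, so quantization is
    done once up front instead of before every communication step.
    """
    def __init__(self, weight: torch.Tensor):
        self.scale = weight.abs().max().clamp(min=1e-8) / 127.0
        self.q = torch.round(weight / self.scale).clamp(-127, 127).to(torch.int8)

    def materialize(self) -> torch.Tensor:
        # Dequantize just before use; the FP16 copy can be discarded
        # immediately after the computation that needs it.
        return (self.q.float() * self.scale).to(torch.float16)

frozen = FrozenQuantizedWeight(torch.randn(1024, 1024, dtype=torch.float16))
w = frozen.materialize()  # temporary FP16 view used for one computation
```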

Using ZeRO++ in this way for RLHF training reduces both memory usage and communication volume, which means higher training throughput: communication is reduced, and the memory savings enable larger batch sizes. During the generation phase, ZeRO++ uses hpZ to keep all weight communication within each node, exploiting the higher intra-node bandwidth, reducing traffic, and further increasing generation throughput.

ZeRO++ is integrated into DeepSpeed-Chat to support RLHF training for ChatGPT-like models. In Figure 9, we compare RLHF generation throughput for actor models of different sizes. The test configuration used 32 V100 GPUs with 30B and 66B actor models to compare ZeRO and ZeRO++ performance. The results show that ZeRO++ achieves 2.25x higher RLHF generation throughput than ZeRO. We also show the speedup in the training phase on 16 V100 GPUs, where ZeRO++ achieves 1.26x higher throughput than ZeRO, thanks to its lower communication volume and the larger batch sizes it enables.

DeepSpeed ZeRO++ is now available!
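As a rough idea of how this looks in practice, the sketch below enables the ZeRO++ features through a ZeRO stage-3 configuration. The key names shown (zero_quantized_weights, zero_hpz_partition_size, zero_quantized_gradients) follow the public ZeRO++ tutorial as we understand it; please verify them against the official DeepSpeed documentation for the version you have installed.

```python
import deepspeed

# Minimal ZeRO stage-3 configuration with the ZeRO++ features enabled.
# Key names are assumptions based on the ZeRO++ tutorial; check the
# DeepSpeed docs for your installed version.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,    # qwZ: quantized weight all-gather
        "zero_hpz_partition_size": 8,      # hpZ: set to the number of GPUs per node
        "zero_quantized_gradients": True,  # qgZ: quantized gradient reduce-scatter
    },
}

# model and model.parameters() come from your own training script:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```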

Contributors:

This project was made possible by contributions from the following members of the DeepSpeed team:

Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Ammar Ahmad Awan, Jeff Rasley, Michael Wyatt, Yuxiong He (team lead)

This article is reproduced from the Microsoft DeepSpeed group.
