
A review of the latest literature on large-scale neural networks: training efficient DNNs, saving memory usage, optimizer design

Selected from arXiv

Authors: Julia Gusak et al.

Machine Heart Compilation

Editors: Du Wei, Zenan

In this review paper, the researchers explain how these different techniques work, survey the literature that evaluates and compares them, and analyze some of the frameworks that implement them.

Modern deep learning and artificial intelligence techniques rely on deep neural networks (DNNs) to solve a wide range of problems involving images, video, audio, and natural language processing, as well as generative tasks such as producing images or generating text on a given topic.

The Skolkovo Institute of Science and Technology in Russia, the University of Lille in France, the University of Bordeaux, Inria, and other scientific institutions jointly published the paper "Survey on Large Scale Neural Network Training", which attempts to answer the question: given a model and a computing platform, how can training be carried out most efficiently? For training to be efficient, the computing power of the resources must be exploited to the maximum and, in parallel, data transfers must not become a bottleneck. Training efficiency fundamentally depends on the efficient implementation of compute kernels on the computing resources (CPU, TPU, GPU) and on the efficient implementation of communication between GPUs and between different memories.


Paper link: https://arxiv.org/abs/2202.10435

In both cases, much work has been done to optimize the arithmetic intensity of the compute kernels and to implement communication over the hardware network efficiently. For users, powerful profiling tools exist to identify hardware bottlenecks and can be used to decide which of the strategies described in this survey should be applied to manage arithmetic intensity, memory, and the volume of exchanged data.

The review covers common techniques for addressing these limitations. If the computation cannot be performed at all because the model, the optimizer states, and the activations do not fit in memory, memory can be traded for computation (rematerialization) or for data transfers (activation and weight offloading). Memory usage can also be reduced by approximating the optimizer states and gradients (compression, pruning, quantization).

Parallel methods (data parallelism, model parallelism, pipelined model parallelism) can also distribute the memory requirements across multiple computing resources. If the arithmetic intensity is too low to fully exploit GPUs and TPUs, usually because the mini-batch is too small, the techniques above can also be used to increase the mini-batch size. Finally, if the communication overhead of data parallelism is so expensive that it slows down computation, other forms of parallelism (model parallelism, pipelined model parallelism) can be used, and gradient compression can limit the volume of data exchanged.

In this survey, the researchers explain how these different techniques work, describe the literature that evaluates and compares the proposed methods, and analyze some of the frameworks that implement these techniques.

Table 1 below shows the different techniques discussed in the article and their impact on communication, memory, and compute efficiency.

[Table 1]

The researchers distinguish the approaches by purpose: they first discuss reducing GPU memory usage, then consider parallel training for models that do not fit on a single GPU, and finally discuss the design of optimizers developed to train models stored across multiple devices.

Reducing memory usage on a single GPU

During forward propagation, the neural network stores the activations required to perform backpropagation. In some cases, these activations consume a large amount of memory and make it impossible to train the model. There are two main ways to reduce memory usage: rematerialization (also known as checkpointing) and offloading.

Rematerialization of activations

Rematerialization strategies store only a small fraction of the activations during the forward pass and recompute the rest during backpropagation. Rematerialization methods can be distinguished by the computational graphs they handle. The first group comes from automatic differentiation (AD) and finds optimal schedules for homogeneous sequential networks (multi-layer DNNs that execute sequentially and have identical compute and memory costs per layer). The second group focuses on heterogeneous sequential networks (any sequential neural network composed of arbitrarily complex modules, such as CNNs, ResNets, and some transformers) and adapts the solutions from AD to these heterogeneous settings.
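As a concrete illustration, the sketch below applies checkpointing to a homogeneous sequential model with PyTorch's torch.utils.checkpoint.checkpoint_sequential; the toy model, segment count, and batch size are arbitrary choices for demonstration and are not taken from the survey.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A homogeneous sequential network: identical blocks executed one after another.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])

x = torch.randn(64, 1024, requires_grad=True)

# Keep activations only at 4 segment boundaries during the forward pass;
# activations inside each segment are recomputed during backpropagation.
out = checkpoint_sequential(model, 4, x)
loss = out.sum()
loss.backward()  # triggers segment-by-segment recomputation
```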

Some methods can perform rematerialization on general computational graphs, although the cost of computing an exact solution may grow exponentially, as shown in Table 2 below.

[Table 2]

Activation offloading

Offloading (also known as memory swapping) is a technique that saves GPU memory by transferring activations to CPU memory during the forward pass and prefetching them back into GPU memory for the corresponding backward computation.

Because of the limited bandwidth of the PCIe bus between the CPU and the GPU, the choice of which activations to transfer and when to transfer them must be optimized.
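As a minimal sketch of this idea, recent PyTorch versions (1.10 and later) provide the torch.autograd.graph.save_on_cpu context manager, which moves the activations saved for backward to (optionally pinned) CPU memory during the forward pass and copies them back when backpropagation needs them; the toy model below is purely illustrative.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(),
                      nn.Linear(2048, 2048)).to(device)
x = torch.randn(256, 2048, device=device)

# Activations saved for backward are offloaded to pinned CPU memory during the
# forward pass and prefetched back to the GPU when backpropagation needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).pow(2).mean()
loss.backward()
```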

In the vDNN study [Rhu et al., 2016], the authors followed a heuristic that works well for CNNs by simply offloading the inputs of convolutional layers; however, it does not generalize well to general DNNs. Another study [Le et al., 2018] considered the lifetime of activations to select what to offload and used a graph-search method to identify when to insert offload/prefetch operations. AutoSwap [Zhang et al., 2019] decides which activations to offload by assigning a priority score to each variable.

Weight offloading

Many of the methods mentioned earlier also apply to offloading weights, since they rely on generic techniques that work with any tensor, such as TFLMS, AutoSwap, and SwapAdvisor.

Parallelism for models that do not fit on a single GPU

In model parallelism, only the activations need to be communicated, and transfers occur only between consecutive layers assigned to different processors. The work mentioned in this section is summarized in Table 4 below.

[Table 4]
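The sketch below illustrates this form of model parallelism with a toy two-stage model: consecutive layers are placed on different devices and only the activation tensor crosses the device boundary. The two-device split and the layer sizes are illustrative choices, and the code falls back to a single CPU device if two GPUs are not available.

```python
import torch
import torch.nn as nn

# Pick two devices; fall back to CPU so the sketch still runs without 2 GPUs.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = "cuda:0", "cuda:1"
else:
    dev0 = dev1 = "cpu"

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
        self.stage1 = nn.Sequential(nn.Linear(4096, 10)).to(dev1)

    def forward(self, x):
        h = self.stage0(x.to(dev0))
        # Only the activation tensor `h` is communicated between devices.
        return self.stage1(h.to(dev1))

model = TwoStageModel()
x = torch.randn(32, 1024)
loss = model(x).sum()
loss.backward()  # gradients flow back across the same device boundary
```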

Execution under model parallelism can be accelerated by pipelining several mini-batches, so that multiple training iterations are active at the same time, as in [Huang et al., 2019]. Once the forward and backward passes have been computed for all of these mini-batches, the weights are updated. This approach is fairly simple to implement, but it also leaves most of the computing resources idle. The PipeDream method proposed in [Narayanan et al., 2019] improves on this training process by only forcing the forward and backward passes of a given mini-batch to use the same model weights.
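The following sketch shows only the micro-batching and delayed weight update in this schedule: gradients are accumulated over all micro-batches before a single optimizer step, as described above. The actual pipelining of the stages across devices, which is where the speedup comes from, is omitted, and the model and sizes are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

batch = torch.randn(64, 512)
target = torch.randn(64, 10)

opt.zero_grad()
# Split the mini-batch into 4 micro-batches; in pipelined model parallelism the
# forward/backward passes of different micro-batches run concurrently on
# different pipeline stages.
for xb, yb in zip(batch.chunk(4), target.chunk(4)):
    loss = nn.functional.mse_loss(model(xb), yb) / 4  # average over micro-batches
    loss.backward()                                   # gradients accumulate
opt.step()  # weights are updated only after all micro-batches are processed
```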

Reducing the frequency of weight updates has also been shown to help limit weight staleness (Narayanan et al., 2021a). PipeMare, proposed in [Yang et al., 2021], adjusts the learning rate and the backward-pass model weights according to the pipeline stage.

Modeling the memory cost of activations in pipelined approaches is a daunting task (Beaumont et al., 2021b). For example, DAPPLE [Fan et al., 2021] and Chimera [Li and Hoefler, 2021] use 1F1B (one-forward-one-backward) scheduling to reduce activation-related memory consumption. 1F1B is a synchronous weight-update technique that schedules the backward pass of each micro-batch as early as possible in order to release the memory occupied by its activations.

Some papers deal specifically with challenging topologies. For example, to address high communication costs and heterogeneous network capabilities, Pipe-torch [Zhan and Zhang, 2019] proposes an updated dynamic-programming strategy that assumes no overlap between computation and communication. HetPipe [Park et al., 2020] tackles the additional problem of heterogeneous GPUs by grouping them into virtual workers, running pipeline parallelism within each virtual worker, and relying on data parallelism between workers.

Optimizers for cross-device model training

Zero redundancy optimizer

In 2020, Rajbhandari, S. et al. proposed the Zero Redundancy Optimizer (ZeRO), a data-parallel implementation that reduces memory usage, in the paper ZeRO: Memory Optimizations toward Training Trillion Parameter Models. Depending on which tensors are partitioned across devices, the algorithm has three stages: Stage 1 partitions the optimizer states, Stage 2 the optimizer states and gradients, and Stage 3 the optimizer states, gradients, and model parameters.
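The single-process sketch below simulates the Stage 1 idea (optimizer state partitioning): each of four simulated workers keeps the momentum only for its own parameter shard, updates that shard, and the shards are then gathered back into the full parameter vector. The world size, shard layout, and SGD-with-momentum update are illustrative assumptions, not ZeRO's actual implementation, which uses torch.distributed collectives.

```python
import torch

world_size = 4
params = torch.randn(1024)   # flattened parameters, replicated on every worker
grads = torch.randn(1024)    # gradients after the usual data-parallel all-reduce

# Each worker keeps optimizer state (here: momentum) only for its own shard.
shard_idx = torch.chunk(torch.arange(params.numel()), world_size)
momentum = [torch.zeros(idx.numel()) for idx in shard_idx]

lr, beta = 0.01, 0.9
updated_shards = []
for rank, idx in enumerate(shard_idx):
    # Worker `rank` updates only its 1/world_size slice of the parameters.
    momentum[rank] = beta * momentum[rank] + grads[idx]
    updated_shards.append(params[idx] - lr * momentum[rank])

# An all-gather would then redistribute the updated shards so that every
# worker again holds the full parameter vector.
params = torch.cat(updated_shards)
```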

In 2021, Ren, J. et al. combined ZeRO with CPU-side computation of the parameter updates in ZeRO-Offload, described in the paper ZeRO-Offload: Democratizing Billion-Scale Model Training: gradients are transferred to the CPU, where the parameter copies are stored and updated, and the updated weights are transferred back to the GPU.
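The sketch below reproduces only the data flow of this scheme with a toy model: the compute copy of the model sits on the GPU, the master parameter copies and the Adam states stay on the CPU, gradients are moved to the CPU for the update, and the updated weights are copied back. The module, optimizer, and sizes are illustrative assumptions; the real system additionally overlaps these transfers with computation and combines them with ZeRO-style partitioning.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_gpu = torch.nn.Linear(512, 512).to(device)   # compute copy (GPU if available)
# Master parameter copies and all optimizer states live in CPU memory.
params_cpu = [p.detach().cpu().clone().requires_grad_(True)
              for p in model_gpu.parameters()]
opt = torch.optim.Adam(params_cpu, lr=1e-3)

x = torch.randn(32, 512, device=device)
loss = model_gpu(x).pow(2).mean()
loss.backward()                                    # gradients are produced on the GPU

# 1) Offload the gradients to the CPU parameter copies.
for p_cpu, p_gpu in zip(params_cpu, model_gpu.parameters()):
    p_cpu.grad = p_gpu.grad.detach().cpu()
    p_gpu.grad = None

# 2) The parameter update (and the Adam state it touches) stays on the CPU.
opt.step()
opt.zero_grad()

# 3) Copy the updated weights back to the compute copy on the GPU.
with torch.no_grad():
    for p_cpu, p_gpu in zip(params_cpu, model_gpu.parameters()):
        p_gpu.copy_(p_cpu.to(device))
```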

Low-precision optimizer

Low-precision optimizers go one step further in reducing memory usage. These methods represent the optimizer states and the auxiliary vectors of those states with low-precision tensors. Error-compensation techniques can additionally be used to preserve the approximate accuracy of the tracked statistics.

In 2021, Dettmers, T. et al. showed in the paper 8-bit Optimizers via Block-wise Quantization that the state of the Adam optimizer can be stored in 8 bits while maintaining the overall performance obtained with the 32-bit format. In 2020, Sun, X. et al. proposed a more aggressive reduction in their paper Ultra-Low Precision 4-bit Training of Deep Neural Networks, developing specific routines for handling 4-bit representations.
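As a toy illustration of the idea (not the block-wise scheme of the cited works), the sketch below stores an optimizer statistic in 8-bit integers with one absmax scale per block and dequantizes it before it would be used in an update; the block size is an arbitrary assumption.

```python
import torch

def quantize_8bit(x, block=256):
    """Store a tensor as int8 values plus one float scale per block."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_8bit(q, scale):
    """Reconstruct an approximate full-precision tensor from int8 + scales."""
    return (q.float() * scale).reshape(-1)

state = torch.randn(1024)            # e.g. Adam's first-moment estimate
q, scale = quantize_8bit(state)      # kept in memory in 8 bits (+ scales)
approx = dequantize_8bit(q, scale)   # reconstructed before each update
print((state - approx).abs().max())  # small quantization error
```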

Accelerating convergence

Another way to accelerate the training of large-scale deep learning models is to reduce the communication time between nodes and the number of epochs required to converge to a good local minimum.

Reducing communication costs. Different methods have emerged for compressing the gradients exchanged between compute nodes; they fall into three categories: sparsification, quantization, and low-rank methods.

Sparsification methods communicate only a subset of the elements of the full gradient and update the corresponding elements of the parameter vector. This approximation can significantly reduce communication costs while preserving the performance of the trained model; representative works are the 2017 paper Sparse Communication for Distributed Gradient Descent by Aji, A. F. and Heafield, K., and the 2018 paper The Convergence of Sparsified Gradient Methods by Alistarh, D. et al.
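A minimal top-k sparsification sketch is shown below: only the indices and values of the k largest-magnitude gradient entries would be communicated, and the receiver scatters them back into a zero vector. The value of k is an arbitrary assumption, and the error-feedback residual that many of these methods keep locally is omitted.

```python
import torch

def sparsify_topk(grad, k):
    """Select the k largest-magnitude entries; only (indices, values) are sent."""
    flat = grad.reshape(-1)
    _, idx = flat.abs().topk(k)
    return idx, flat[idx]

def desparsify(idx, values, numel):
    """Receiver side: scatter the received entries back into a dense vector."""
    out = torch.zeros(numel)
    out[idx] = values
    return out

grad = torch.randn(10_000)
idx, vals = sparsify_topk(grad, k=100)          # communicate ~1% of the entries
recovered = desparsify(idx, vals, grad.numel())
```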

Another approach is based on quantization of the communicated gradient: only a quantized representation using a limited number of bits is transmitted, the full gradient vector is reconstructed from it, and all elements of the parameter vector are updated. This approach has shown good results for some neural network architectures and experimental setups; a representative work is Alistarh, D. et al.'s 2017 paper QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding.
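The sketch below shows the simplest form of this idea, 1-bit sign compression with a single scaling factor; it conveys the spirit of gradient quantization but is not the exact (stochastic, multi-level) QSGD encoding.

```python
import torch

grad = torch.randn(10_000)

# Sender: one scaling factor plus one sign bit per element (after bit-packing).
scale = grad.abs().mean()
signs = torch.sign(grad)

# Receiver: reconstruct an approximate gradient and update all parameters with it.
recovered = scale * signs
```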

The last way to reduce communication costs is the low-rank method: a low-rank approximation of the gradient is constructed, communicated, and then used to restore the full-format gradient before updating the parameter vector. The low-rank approximation can be constructed by block power methods or by minimization strategies, with representative works being Vogels et al., 2019 and Cho et al., 2019, respectively.
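The sketch below builds a rank-1 approximation of a gradient matrix with a few power iterations, in the spirit of the block power methods mentioned above; the rank, the iteration count, and the matrix size are illustrative assumptions.

```python
import torch

G = torch.randn(256, 512)     # gradient of a weight matrix
q = torch.randn(512, 1)

for _ in range(2):            # a few power iterations
    p = G @ q                 # (256, 1): this small factor is what gets communicated
    q = G.t() @ p             # (512, 1): likewise communicated instead of G itself
    q = q / q.norm()

p = G @ q
G_approx = p @ q.t()          # rank-1 reconstruction used for the parameter update
```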

Large-batch training. Another way to accelerate optimizer convergence is to use a large number of samples in each batch. This training setup reduces the number of iterations per epoch and improves GPU utilization. In Goyal, P. et al.'s 2017 paper Accurate, Large Minibatch SGD, the researchers proposed a linear scaling rule that updates the learning rate along with the batch size. This setting stabilizes the optimization process and lets the model converge to the same final performance.
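A minimal sketch of the linear scaling rule follows; the reference batch size of 256 and base learning rate of 0.1 correspond to the ResNet-50 setup used in that paper, and the warmup is only indicated in a comment.

```python
base_lr = 0.1            # learning rate tuned for the reference batch size
base_batch_size = 256    # reference mini-batch size from the paper's setup
batch_size = 8192        # large mini-batch used for distributed training

# Linear scaling rule: multiply the learning rate by the batch-size ratio.
lr = base_lr * batch_size / base_batch_size   # -> 3.2

# Goyal et al. also ramp the learning rate up gradually ("warmup") over the
# first few epochs to keep optimization stable at this large batch size.
print(lr)
```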

Cover source: https://www.youtube.com/watch?v=RSRkp8VAavQ
