
Jeff Dean: We've written a "Sparse Model Design Guide," which you can check out

Report from Machine Heart

Editors: Zhang Qian, Du Wei

Sparse models are playing an increasingly important role in deep learning. For a given token or sample, they activate only a small part of the model, which keeps computation manageable while allowing a very large parameter count. However, how to reliably train such models remains an open problem. In this article, researchers from Google, including Barret Zoph, Irwan Bello, William Fedus, and Jeff Dean, present a guide to designing efficient sparse expert models.


Sparse expert neural networks demonstrate the benefits of sheer scale and offer an efficient alternative to the static neural network architectures commonly used today. Rather than applying the same parameters to every input, a sparse expert network dynamically selects which parameters to use for each input. This lets the network greatly expand its parameter count while keeping the FLOPs per token roughly unchanged. These methods have produced SOTA translation models, 4-7x pre-training speedups, and GPT-3-level one-shot performance at only 1/3 of the training cost. Despite their staggering parameter counts, sparse models reduce the carbon footprint of training large neural networks by an order of magnitude. However, difficulties remain.
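To make the mechanism concrete, here is a minimal numpy sketch of top-k expert routing. It is not the paper's implementation; every shape, weight, and name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 8 tokens routed among 4 expert FFNs, top-2 per token.
num_tokens, d_model, d_ff, num_experts, k = 8, 16, 32, 4, 2

tokens = rng.normal(size=(num_tokens, d_model))
router_w = rng.normal(size=(d_model, num_experts)) * 0.02
# One small feed-forward network per expert (illustrative weights).
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.02, rng.normal(size=(d_ff, d_model)) * 0.02)
    for _ in range(num_experts)
]

logits = tokens @ router_w                        # (num_tokens, num_experts)
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)             # softmax over experts
chosen = np.argsort(-probs, axis=-1)[:, :k]       # top-k experts per token

# Only the chosen experts run for each token, so per-token FLOPs stay roughly
# constant no matter how many experts (and thus parameters) exist in total.
output = np.zeros_like(tokens)
for t in range(num_tokens):
    for e in chosen[t]:
        w_in, w_out = experts[e]
        output[t] += probs[t, e] * (np.maximum(tokens[t] @ w_in, 0.0) @ w_out)
```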

Fedus et al. (2021) observed that their sparse 1.6T-parameter model achieved a 4x pre-training speedup over the previous SOTA method (Raffel et al., 2019), yet lagged behind smaller models when fine-tuned on commonly used benchmarks such as SuperGLUE. In Artetxe et al. (2021), researchers fine-tuned an MoE language model on out-of-domain data and observed a similar gap.

To address this, the Switch-XXL model was proposed: it has fewer parameters but 8x the computational footprint (FLOPs roughly equal to the largest T5 model), and it improves performance on natural language understanding tasks. However, the necessary pre-training was hampered by training instabilities that had not been detected in earlier small-scale studies. These instabilities were later identified in other sparse models as well. These results reveal a necessary balance between parameters and computation, but how to reliably train such models remains an open problem.

The purpose of this paper is to improve the practicality and reliability of sparse models. The researchers studied both issues and distilled their findings into design guidelines. Finally, they scaled a sparse model to 269B parameters at a computational cost comparable to a 32B dense encoder-decoder Transformer (the Stable and Transferable Mixture-of-Experts model, ST-MoE-32B). This is the first time a sparse model has achieved SOTA performance in transfer learning across a range of tasks, including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed-book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).


The contributions of this paper can be summarized as follows:

1. Conducted a large-scale study of the quality-stability trade-offs of stability techniques;

2. Introduced the router z-loss, which resolves the stability issues while slightly improving model quality;

3. Provided a fine-tuning analysis of sparse and dense models, revealing how differently the two respond to the batch size and learning rate hyperparameters; they found that with poor hyperparameters, the sparse model yields almost no fine-tuning gain over the dense model, despite its large pre-training speedup;

4. Provided architecture, routing, and model design principles for building Pareto-efficient sparse models in a distributed setting;

5. Provided a qualitative analysis tracing token routing decisions across expert layers;

6. Trained a 269B sparse model that achieves SOTA performance on a diverse set of natural language benchmarks.

router z-loss

One of the most successful ways to stabilize neural networks is to constrain their activations and gradients. A popular approach is to mitigate exploding gradients during backpropagation through deep networks by clipping the gradient norm.

In this paper, the researchers used the Adafactor optimizer for its memory efficiency (although the recently introduced 8-bit optimizers (Dettmers et al., 2021) may offer a better trade-off). Adafactor uses update clipping rather than gradient clipping: changes to the weights are constrained to stay below a certain norm. They experimented with tightening the update clipping to smaller values.
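For intuition, here is a plain numpy sketch contrasting the two schemes; the function names and the epsilon constant are illustrative and do not reflect Adafactor's actual API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Gradient clipping: rescale all gradients when their global norm
    # exceeds max_norm, to damp exploding gradients.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-6))
    return [g * scale for g in grads]

def clip_update(update, d):
    # Update clipping (Adafactor-style): bound the root-mean-square of the
    # weight update itself, rather than the gradient.
    rms = np.sqrt(np.mean(update ** 2))
    return update / max(1.0, rms / d)
```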

Next, they examined constraints on the logits entering the router. The router computes the probability distribution over experts in float32. However, the researchers found that at the largest scales this alone was not enough to ensure reliable training. To solve this problem, they introduced the router z-loss:

L_z(x) = \frac{1}{B} \sum_{i=1}^{B} \left( \log \sum_{j=1}^{N} e^{x_j^{(i)}} \right)^2

where B is the number of tokens, N is the number of experts, and x ∈ R^{B×N} are the logits entering the router.
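Below is a minimal numpy sketch of this loss, assuming the router logits have already been gathered into a (B, N) float32 array; it is an illustration rather than the paper's implementation.

```python
import numpy as np

def router_z_loss(router_logits):
    # Squared log-sum-exp of each token's router logits, averaged over the
    # B tokens; penalizing large logits keeps the router softmax
    # numerically well-behaved.
    m = router_logits.max(axis=-1, keepdims=True)                   # (B, 1)
    log_z = np.log(np.exp(router_logits - m).sum(axis=-1)) + m.squeeze(-1)
    return np.mean(log_z ** 2)
```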

Table 4 of the paper shows that both update clipping and the router z-loss stabilized the model across three runs, but update clipping severely hurt model quality. The researchers therefore use the z-loss to stabilize the model.


The router z-loss introduces another hyperparameter, c_z, the weighting coefficient of this term in the total loss being optimized. The total loss is a linearly weighted combination of the cross-entropy loss (L_CE), the auxiliary load-balancing loss (L_B), and the router z-loss (L_Z):

L = L_CE + c_B \cdot L_B + c_z \cdot L_Z

where c_B is the weight on the load-balancing loss.

Based on the best model quality after a hyperparameter sweep during pre-training, the researchers chose c_z = 0.001. Appendix B of the paper records the losses during pre-training.
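As a concrete illustration, here is a minimal Python sketch of this weighted combination; c_z = 0.001 follows the text, while the load-balancing weight c_b used below is an illustrative placeholder rather than a value taken from the text.

```python
def total_loss(l_ce, l_b, l_z, c_b=0.01, c_z=0.001):
    # Linearly weighted combination of cross-entropy, auxiliary
    # load-balancing, and router z-loss terms.
    # c_b is a placeholder default, not quoted from the text above.
    return l_ce + c_b * l_b + c_z * l_z
```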

The design of the sparse model

The design of dense models has been guided by the foundational work of Kaplan et al. (2020). But for sparse models, countless additional questions arise, such as: (1) How many experts should be used? (2) Which routing algorithm? (3) What value for the capacity factor? (4) How does hardware change these decisions? In this paper, the researchers give the following advice:

1. In their setting, they recommend top-2 routing with a capacity factor of 1.25 and at most one expert per core;

2. The capacity factor can be changed during evaluation to adapt to new memory/compute requirements;

3. Dense layering and a multiplicative bias can improve quality.

Please refer to the original paper for more details.
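To make points 1 and 2 above concrete, here is a small Python sketch of how a capacity factor translates into each expert's per-batch token budget, using one common MoE convention; the exact formula, including the top-2 factor k, is an assumption for illustration rather than a quote from the paper.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25, k=2):
    # Each expert processes at most this many tokens per batch; tokens beyond
    # the cap are dropped or rerouted, so a larger capacity factor trades
    # memory/compute for fewer dropped tokens.
    return math.ceil(capacity_factor * k * tokens_per_batch / num_experts)

# Raising the capacity factor only at evaluation time loosens the cap without
# retraining (point 2 above).
print(expert_capacity(tokens_per_batch=1024, num_experts=64))                       # 40
print(expert_capacity(tokens_per_batch=1024, num_experts=64, capacity_factor=2.0))  # 64
```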


Paper link: https://arxiv.org/pdf/2202.08906.pdf