
With 15 billion parameters, Google open-sources the entire code of V-MoE, the largest vision model to date

Selected from Google AI

Compiled by Machine Heart

Editors: Du Wei, Chen Ping

Remember the 43-page paper "Scaling Vision with Sparse Mixture of Experts" published by the Google Brain team last June? It introduced V-MoE, the largest vision model to date, which achieved Top-1 accuracy close to the state of the art (SOTA). Now, Google Brain has open-sourced the entire codebase for training and fine-tuning the model.

Advances in deep learning over the past few decades have been driven by several key factors: a small number of simple and flexible mechanisms, large datasets, and increasingly specialized hardware, which together have allowed neural networks to achieve impressive results on tasks such as image classification, machine translation, protein prediction, and more.

However, the use of large models and datasets comes at the cost of substantial compute. Recent studies have shown that the generalization ability and robustness of models depend on large-scale training, so it is important to balance training large models against the limits of available training resources. One promising approach is conditional computation: instead of activating the entire network for every input, different parts of the model are activated for different inputs. This paradigm features prominently in Google's Pathways vision (a new AI approach that overcomes many of the shortcomings of existing systems while reinforcing their strengths) and in recent work on large language models, but it has not been well explored in computer vision.

Sparsely gated mixture-of-experts networks (MoEs) have demonstrated excellent scalability in natural language processing. In computer vision, however, almost all high-performing networks are dense, that is, every input is processed by every parameter.

Last June, researchers from Google Brain proposed V-MoE (Vision MoE), a new vision architecture based on a sparse mixture of experts. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks while requiring as little as half the compute at inference time. In addition, the study proposes an extension to the routing algorithm that prioritizes subsets of each input across the entire batch, enabling adaptive per-image computation. This allows V-MoE to smoothly trade off performance and compute at test time. Finally, the study demonstrates the potential of V-MoE to scale vision models, training a model with 15 billion parameters that reaches 90.35% Top-1 accuracy on ImageNet.


Address of the paper: https://arxiv.org/pdf/2106.05974.pdf

Code address: https://github.com/google-research/vmoe

V-MoE

Google Brain builds V-MoE on different variants of ViT: ViT-S(mall), ViT-B(ase), ViT-L(arge), and ViT-H(uge), with the following hyperparameters:

[Table: hyperparameters of the ViT variants]

ViT has been shown to scale well in transfer learning settings, achieving higher accuracy than CNNs with less pre-training compute. ViT processes an image as a sequence of patches: the input image is first divided into patches of equal size, which are linearly projected into the Transformer's hidden dimension. After positional embeddings are added, the patch embeddings (tokens) are processed by the Transformer, which consists mainly of alternating self-attention and MLP layers. Each MLP has two layers and a GeLU nonlinearity. For Vision MoE, the study replaces a subset of these MLP layers with MoE layers, where each expert is itself an MLP, as shown in the following figure:

[Figure: ViT architecture with a subset of MLP layers replaced by MoE layers]
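For reference, the patch embedding and MLP block just described can be sketched as follows. This is a minimal sketch assuming Flax; it omits positional embeddings and self-attention, and the module name, sizes, and layer names are illustrative rather than the released vmoe code.

```python
# A minimal sketch of the ViT front end described above: an image is split into
# equal-sized patches, each patch is linearly projected into the Transformer's
# hidden dimension, and the tokens pass through a two-layer MLP with a GeLU.
# Names and sizes are illustrative assumptions, not the released vmoe code.
import jax
import jax.numpy as jnp
import flax.linen as nn


class PatchEmbedAndMlp(nn.Module):
    patch_size: int = 16
    hidden_dim: int = 768    # model (embedding) dimension
    mlp_dim: int = 3072      # intermediate MLP width

    @nn.compact
    def __call__(self, images):                      # (batch, H, W, 3)
        b, h, w, c = images.shape
        p = self.patch_size
        # Split into non-overlapping p x p patches and flatten each patch.
        patches = images.reshape(b, h // p, p, w // p, p, c)
        patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(b, -1, p * p * c)
        tokens = nn.Dense(self.hidden_dim)(patches)  # linear patch projection
        # (Positional embeddings and self-attention are omitted in this sketch.)
        x = nn.Dense(self.mlp_dim)(tokens)           # MLP layer 1
        x = nn.gelu(x)                               # GeLU nonlinearity
        return nn.Dense(self.hidden_dim)(x)          # MLP layer 2


images = jnp.ones((2, 224, 224, 3))
model = PatchEmbedAndMlp()
params = model.init(jax.random.PRNGKey(0), images)
print(model.apply(params, images).shape)             # (2, 196, 768)
```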

To massively scale the vision model, the study replaces some of the dense feed-forward layers (FFNs) in the ViT architecture with a sparse mixture of independent FFNs, called experts. A learnable routing layer selects the corresponding experts for each individual token; that is, different tokens from the same image may be routed to different experts. Of E experts in total (E is typically 32), each token can be routed to at most K (typically 1 or 2) of them. This allows the model size to be scaled up while keeping the per-token computation roughly constant. The following figure shows the structure of the V-MoE encoder block in more detail.


V-MoE Transformer encoder block
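To make the routing concrete, here is a minimal sketch of a sparse MoE layer with top-K gating, assuming Flax. For clarity it applies every expert to every token and masks the results by the router's gates, whereas the real V-MoE implementation dispatches only the routed tokens to each expert; expert-capacity limits and auxiliary load-balancing losses are also omitted. All names and sizes are illustrative.

```python
# A minimal sketch of a sparse MoE layer with top-K routing, assuming Flax.
# Every expert is applied to every token and the results are masked by the
# top-K gates; the real V-MoE dispatches only the routed tokens for efficiency.
import jax
import jax.numpy as jnp
import flax.linen as nn


class MoeMlpLayer(nn.Module):
    num_experts: int = 32   # E in the text
    k: int = 2              # K: experts selected per token
    hidden_dim: int = 3072
    out_dim: int = 768

    @nn.compact
    def __call__(self, x):                       # x: (batch, tokens, dim)
        # Router: one logit per expert for every token.
        logits = nn.Dense(self.num_experts)(x)   # (b, t, E)
        gates = jax.nn.softmax(logits, axis=-1)

        # Keep only the K largest gates per token, zero out the rest.
        threshold = jnp.sort(gates, axis=-1)[..., -self.k][..., None]
        topk_gates = jnp.where(gates >= threshold, gates, 0.0)

        # Each expert is an independent two-layer MLP with a GeLU.
        outputs = jnp.zeros(x.shape[:-1] + (self.out_dim,))
        for e in range(self.num_experts):
            h = nn.Dense(self.hidden_dim, name=f"expert_{e}_in")(x)
            h = nn.Dense(self.out_dim, name=f"expert_{e}_out")(nn.gelu(h))
            outputs = outputs + topk_gates[..., e:e + 1] * h
        return outputs


tokens = jnp.ones((2, 196, 768))
layer = MoeMlpLayer(num_experts=8, k=2)          # fewer experts to keep the demo small
params = layer.init(jax.random.PRNGKey(0), tokens)
print(layer.apply(params, tokens).shape)         # (2, 196, 768)
```

Reducing `k` in a sketch like this mirrors the inference-time flexibility described later in the post: selecting fewer experts per token reduces computation without changing the trained weights.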

Experimental results

Google Brain first pre-trained the model on a large image dataset, JFT-300M.

The left of the following figure shows the pre-training results for models of all sizes, from small s/32 to huge H/14. The models are then transferred to a new downstream task, such as ImageNet, using a new head (the last layer of the model). Two transfer settings are explored: fine-tuning the entire model on all available examples of the new task, or freezing the pre-trained network and training only the new head on a small number of examples (so-called few-shot transfer).
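The frozen-backbone setting can be sketched as follows, assuming Optax for optimization; `backbone_features` is a hypothetical stand-in for features produced by the frozen pre-trained network, and the training loop is illustrative rather than the paper's exact recipe.

```python
# A minimal sketch of frozen-backbone few-shot transfer: the pre-trained network
# is treated as a fixed feature extractor and only a new linear head is trained
# on a handful of labelled examples per class. All data here are random stand-ins.
import jax
import jax.numpy as jnp
import optax


def new_head_loss(head_params, features, labels, num_classes):
    logits = features @ head_params["w"] + head_params["b"]
    one_hot = jax.nn.one_hot(labels, num_classes)
    return optax.softmax_cross_entropy(logits, one_hot).mean()


num_classes, feat_dim = 1000, 768
key = jax.random.PRNGKey(0)
# Pretend these came from the frozen backbone; 5 examples per class would give
# 5 * num_classes rows, but a tiny random batch keeps the sketch runnable.
backbone_features = jax.random.normal(key, (32, feat_dim))
labels = jax.random.randint(key, (32,), 0, num_classes)

head_params = {"w": jnp.zeros((feat_dim, num_classes)), "b": jnp.zeros((num_classes,))}
opt = optax.sgd(0.1)
opt_state = opt.init(head_params)

for _ in range(10):  # a few gradient steps on the head only
    grads = jax.grad(new_head_loss)(head_params, backbone_features, labels, num_classes)
    updates, opt_state = opt.update(grads, opt_state)
    head_params = optax.apply_updates(head_params, updates)
```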

The right of the figure summarizes the effect of transferring the models to ImageNet when each image class is trained on only 5 images (so-called 5-shot transfer).


On the left is the Precision@1 curve on the JFT-300M dataset; on the right is the ImageNet 5-shot accuracy curve.

In both cases, Google Brain found that, for a given amount of training compute, sparse models either significantly outperform dense models or reach similar performance much faster. To explore the limits of vision models, they trained a model with 15 billion parameters and 24 MoE layers (out of 48 blocks) on an extended version of the JFT-300M dataset. This largest vision model to date achieves a Top-1 accuracy of 90.35% on ImageNet.


Priority routing

In practice, dynamically sized buffers are inefficient given hardware constraints, so models typically use a predefined buffer capacity for each expert. Once an expert becomes "full", tokens assigned beyond this capacity are dropped and not processed. As a result, higher capacities yield higher accuracy, but they are also more computationally expensive.

Google Brain exploits this implementation constraint to make V-MoE faster at inference. By reducing the total combined buffer capacity below the number of tokens to be processed, the network is forced to skip some tokens in the expert layers. Instead of choosing which tokens to skip arbitrarily (as in previous work), the model learns to sort them by an importance score. This maintains high-quality predictions while saving a lot of computation. They call this approach Batch Priority Routing (BPR), illustrated in the animation below:


At high capacity, both vanilla routing and priority routing handle all patches well. However, when the buffer size is reduced to save computation, vanilla routing processes arbitrary patches, which often results in poor predictions; BPR prioritizes the important patches, yielding better predictions at a lower computational cost.
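A minimal sketch of the priority idea follows, assuming top-1 routing and using each token's largest routing gate as its importance score; the real implementation handles top-K routing, load balancing, and distributed execution, which are omitted here.

```python
# A minimal sketch of Batch Priority Routing (BPR): tokens compete for a fixed
# per-expert buffer, and the buffer is filled in order of an importance score
# (here, the token's largest routing gate) instead of arbitrary batch order.
# This is a simplified illustration, not the released implementation.
import jax
import jax.numpy as jnp


def batch_priority_assignment(gates, capacity):
    """gates: (num_tokens, num_experts) routing probabilities.
    Returns a boolean (num_tokens, num_experts) mask of the tokens each expert
    keeps, using top-1 routing for simplicity."""
    num_tokens, num_experts = gates.shape
    expert_choice = jnp.argmax(gates, axis=-1)          # each token's top expert
    priority = jnp.max(gates, axis=-1)                  # importance score

    # Process tokens from highest to lowest priority.
    order = jnp.argsort(-priority)
    keep = jnp.zeros((num_tokens, num_experts), dtype=bool)
    fill = jnp.zeros((num_experts,), dtype=jnp.int32)   # how full each buffer is
    for t in order:                                     # plain Python loop: sketch only
        e = expert_choice[t]
        within_capacity = fill[e] < capacity
        keep = keep.at[t, e].set(within_capacity)
        fill = fill.at[e].add(jnp.where(within_capacity, 1, 0))
    return keep


gates = jax.nn.softmax(jax.random.normal(jax.random.PRNGKey(0), (16, 4)), axis=-1)
mask = batch_priority_assignment(gates, capacity=3)     # at most 3 tokens per expert
print(mask.sum(axis=0))                                 # tokens kept by each expert
```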

Dropping the right tokens turns out to be critical for delivering high-quality and efficient inference predictions. When expert capacity decreases, performance with the vanilla routing mechanism degrades rapidly; BPR, by contrast, is much more robust to low capacities.


Overall, Google Brain observed that V-MoE is very flexible at inference time: for example, the number of experts selected per token can be reduced to save time and compute, without any further training of the model weights.

Exploring V-MoE

Since there is still much to discover about the inner workings of sparse networks, Google Brain also explored the routing patterns of V-MoE. One hypothesis is that routers learn to distinguish tokens and assign them to experts based on semantic context (e.g., a "car" expert, an "animal" expert, and so on).

To test this, they plot the routing behavior of two different MoE layers below, one very early in the network and one closer to the head. The x-axis corresponds to each of the 32 experts, and the y-axis shows the ID of the image class (from 1 to 1000). Each entry in the plot shows how often an expert is selected for tokens of a particular image class, with darker colors indicating higher frequency.

The results show that while there is little correlation in the early layers, in the later parts of the network each expert receives and processes tokens from only a handful of classes. It can therefore be concluded that some semantic clustering of patches emerges deeper in the network.


Routing decisions in higher layers correlate with image classes.
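The frequency matrix behind such a plot can be sketched as follows; the routing decisions and labels here are random stand-ins rather than data from the paper, and the analysis code is illustrative, not the authors'.

```python
# A minimal sketch of the analysis described above: for one MoE layer, count how
# often each expert is selected for tokens belonging to each image class, then
# normalise into a frequency matrix (classes x experts). Inputs are assumed to
# be precomputed routing decisions; all values below are random stand-ins.
import jax
import jax.numpy as jnp

num_classes, num_experts, tokens_per_image, num_images = 1000, 32, 196, 64
key = jax.random.PRNGKey(0)

# Stand-ins: the class label of each image and, for every token, the expert it
# was routed to in the layer under inspection (top-1 for simplicity).
labels = jax.random.randint(key, (num_images,), 0, num_classes)
expert_ids = jax.random.randint(key, (num_images, tokens_per_image), 0, num_experts)

# One-hot encode and aggregate: entry (c, e) counts tokens of class c sent to e.
class_one_hot = jax.nn.one_hot(labels, num_classes)                   # (I, C)
expert_one_hot = jax.nn.one_hot(expert_ids, num_experts).sum(axis=1)  # (I, E)
counts = class_one_hot.T @ expert_one_hot                             # (C, E)
freq = counts / jnp.clip(counts.sum(axis=1, keepdims=True), 1)        # rows sum to 1
print(freq.shape)  # (1000, 32)
```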

Google Brain believes this is just the beginning of large-scale conditional computation in computer vision. Heterogeneous expert architectures and conditional variable-length routing are also potential research directions. Sparse models are especially beneficial in data-rich domains such as large-scale video modeling. They hope that the open-sourced code and models will attract more researchers to the field.

Original link: https://ai.googleblog.com/2022/01/scaling-vision-with-sparse-mixture-of.html?continueFlag=b96fa8ed72dfc82b777e51b7e954c7dc
