A report from Heart of the Machine
Researchers from the University of Texas at Austin, the University of Technology Sydney, and Google have proposed a training-free, self-scaling framework, As-ViT, that automatically discovers and scales ViTs in an efficient and principled way.
There are two main pain points in the current Vision Transformer (ViT) field: 1. a lack of effective methods for designing and scaling ViTs; 2. the computational cost of training ViTs is much higher than that of convolutional networks.
To address both of these problems, researchers from the University of Texas at Austin, the University of Technology Sydney, and Google have proposed As-ViT (Auto-scaling Vision Transformers), a training-free ViT auto-scaling framework that automatically designs and scales ViTs in an efficient and principled manner.

Paper link: https://arxiv.org/abs/2202.11921
Specifically, the researchers first design the "seed" ViT topology with a training-free search process. This extremely fast search is enabled by a comprehensive study of ViT network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracy. Second, starting from the "seed" topology, the scaling rule for ViTs is automated by growing the width/depth of different ViT layers, producing a series of architectures with different parameter counts in a single run. Finally, based on the observation that ViTs can tolerate coarse-grained tokenization in the early training stages, the study proposes a progressive tokenization strategy to train ViTs faster and more cheaply.
As a unified framework, As-ViT delivers strong performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual tuning or scaling of the ViT architecture, and the end-to-end model design and scaling process takes only 12 hours on a single V100 GPU.
Automatically designing and scaling ViTs via network complexity
To speed up ViT design and avoid tedious manual effort, the study aims at efficient, automated, and principled ViT search and scaling. Specifically, two problems need to be solved: 1) How can the optimal ViT architecture topology be found efficiently, with minimal or even zero training cost? 2) How can the depth and width of the ViT topology be scaled to meet different model-size requirements?
An expanded ViT topology search space
Before designing and scaling, the first step is to define the expanded topology search space of As-ViT: the input image is first embedded into patches at 1/4-scale resolution, and a progressive spatial-reduction and channel-doubling strategy is adopted. This facilitates dense prediction tasks, such as detection, that require multi-scale features.
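As a rough illustration of this layout (not the searched As-ViT operators), the sketch below uses hypothetical channel counts, kernel sizes, and strides to show how a 1/4-scale stem followed by spatial reduction and channel doubling produces multi-scale feature maps; the actual searched values are reported in the paper's Table 4.

```python
import torch
import torch.nn as nn

# Hypothetical settings for illustration only; the searched As-ViT values
# (kernel sizes K1..K4, expansion ratios E1..E4, etc.) are in the paper's Table 4.
widths  = (32, 64, 128, 256)   # channels double at each stage
strides = (4, 2, 2, 2)         # 1/4-scale stem, then halve resolution per stage
kernels = (7, 3, 3, 3)         # kernels larger than strides -> overlapping tokens

embeds, c_in = nn.ModuleList(), 3
for c_out, k, s in zip(widths, kernels, strides):
    embeds.append(nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=k // 2))
    c_in = c_out

x = torch.randn(1, 3, 224, 224)
for conv in embeds:
    x = conv(x)              # the stage's attention blocks would follow here
    print(tuple(x.shape))    # resolution shrinks while channels double
# (1, 32, 56, 56) -> (1, 64, 28, 28) -> (1, 128, 14, 14) -> (1, 256, 7, 7)
```

The progressively coarser, higher-dimensional feature maps are what makes the backbone usable for detection-style multi-scale heads.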
Assessing ViT complexity at initialization via manifold propagation
ViT training is slow, so performing an architecture search by evaluating the accuracy of trained models would be prohibitively expensive. Recently, many training-free neural architecture search methods have emerged for ReLU-based CNNs, using local linear maps (Mellor et al., 2020), gradient sensitivity (Abdelfattah et al., 2021), the number of linear regions (Chen et al., 2021e;f), or network topology (Bhardwaj et al., 2021).
However, ViTs are equipped with more complex nonlinear functions, such as self-attention, softmax, and GeLU, so their learning capacity needs to be measured in a more general way. In the new study, the researchers measure the complexity of manifold propagation through the ViT to estimate how complex a function the ViT can approximate. Intuitively, a complex network can propagate a simple input into a complex manifold at its output layer and is therefore likely to have strong learning capacity. Concretely, the researchers propagate a simple circle input through the ViT: h(θ) = √N [u^0 cos(θ) + u^1 sin(θ)]. Here, N is the dimension of the ViT input (for example, N = 3 × 224 × 224 for an ImageNet image), and u^0 and u^1 form an orthonormal basis of the two-dimensional subspace of R^N in which the circle lies.
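The following is a minimal sketch of this idea on a toy network, assuming a simple discretization of the circle and using the total distance between consecutive propagated points as a stand-in for the expected curve length L^E; it follows the formula above but is not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def circle_input(n_points, dim):
    """Sample h(theta) = sqrt(dim) * (u0*cos(theta) + u1*sin(theta)) on a
    discretized circle, with u0, u1 an orthonormal basis of a random
    2-D subspace of R^dim."""
    u, _ = torch.linalg.qr(torch.randn(dim, 2))          # orthonormal u0, u1
    thetas = torch.linspace(0, 2 * math.pi, n_points)
    return math.sqrt(dim) * (torch.outer(torch.cos(thetas), u[:, 0]) +
                             torch.outer(torch.sin(thetas), u[:, 1]))

@torch.no_grad()
def curve_length(net, n_points=64, dim=3 * 32 * 32):
    """Approximate the length of the output curve traced as theta sweeps the
    circle: the sum of distances between consecutive propagated points."""
    out = net(circle_input(n_points, dim))
    return (out[1:] - out[:-1]).norm(dim=1).sum().item()

# Toy stand-in for a ViT at initialization.
toy = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.GELU(), nn.Linear(256, 64))
print(curve_length(toy))   # larger values -> more complex propagated manifold
```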
Searching ViT topologies with L^E as the reward
The researchers propose an L^E-based training-free search (Algorithm 1). Most NAS (neural architecture search) methods evaluate the accuracy or loss of single-path or super networks as a proxy, and applying such training-based search to ViTs would require far more computation. For each sampled architecture, instead of training the ViT, L^E is computed and treated as the reward guiding the search process.
In addition to L^E, the NTK condition number κΘ = λ_max/λ_min is included to indicate the trainability of the ViT (Chen et al., 2021e; Xiao et al., 2019; Yang, 2020; Hron et al., 2020), where λ_max and λ_min are the largest and smallest eigenvalues of the NTK matrix Θ.
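As a rough illustration (again on a toy model, not the authors' code), the empirical NTK matrix Θ for a small batch can be built from per-example parameter gradients, after which κΘ is simply the ratio of its extreme eigenvalues:

```python
import torch
import torch.nn as nn

def ntk_condition_number(net, x):
    """Empirical NTK on a small batch: Theta[i, j] is the inner product of
    parameter gradients of the (scalarized) outputs for samples i and j;
    kappa = lambda_max / lambda_min."""
    params = [p for p in net.parameters() if p.requires_grad]
    grads = []
    for i in range(x.shape[0]):
        out = net(x[i:i + 1]).sum()                       # scalarize the output
        g = torch.autograd.grad(out, params)
        grads.append(torch.cat([gi.flatten() for gi in g]))
    jac = torch.stack(grads)                              # (batch, n_params)
    theta = jac @ jac.T                                   # empirical NTK matrix
    eigs = torch.linalg.eigvalsh(theta)                   # ascending eigenvalues
    return (eigs[-1] / eigs[0]).item()

toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128),
                    nn.GELU(), nn.Linear(128, 10))
x = torch.randn(8, 3, 32, 32)
print(ntk_condition_number(toy, x))   # smaller kappa -> better trainability
```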
The search uses reinforcement learning: the policy is defined as a joint categorical distribution and updated by policy gradient. The policy is updated for 500 steps, which the researchers observe is enough for it to converge (its entropy drops from 15.3 to 5.7). The search process is very fast, requiring only seven GPU-hours (V100) on ImageNet-1k, because computing L^E is cheap and bypasses ViT training. To account for the different magnitudes of L^E and κΘ, the study normalizes them by their relative ranges (line 5 of Algorithm 1).
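Below is a minimal sketch of how the two signals might be combined into a single reward by normalizing each with the range observed so far in the search, mirroring the normalization described above; the equal weighting is an illustrative assumption, not the paper's exact formula.

```python
def normalized_reward(le, kappa, le_history, kappa_history):
    """Combine L^E (higher is better) and the NTK condition number kappa
    (lower is better) after normalizing each by the range of values seen so
    far during the search."""
    le_history.append(le)
    kappa_history.append(kappa)
    le_range = (max(le_history) - min(le_history)) or 1.0
    kappa_range = (max(kappa_history) - min(kappa_history)) or 1.0
    le_norm = (le - min(le_history)) / le_range
    kappa_norm = (kappa - min(kappa_history)) / kappa_range
    return le_norm - kappa_norm    # reward high complexity and low kappa

# The reward would then drive a policy-gradient update of the categorical
# distribution over topology choices.
hist_le, hist_kappa = [], []
print(normalized_reward(120.0, 35.0, hist_le, hist_kappa))
```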
Table 3 summarizes the statistics of ViT topologies found by the new search method. We can see that L^E and κΘ highly prefer: (1) tokens with overlap (K_1–K_4 are all larger than the strides), and (2) larger FFN expansion ratios in deeper layers.
Automated, principled ViT scaling
Once the optimal topology is found, the next question is: how should the depth and width of the network be balanced?
Currently, there is no such rule of thumb for scaling ViTs. Recent work has attempted to grow or scale convolutional networks of different sizes to meet various resource constraints (Liu et al., 2019a; Tan & Le, 2019). However, finding a principled scaling rule automatically by training ViTs would incur a huge computational cost. One could also search for different ViT variants, as described in Section 3.3, but that requires multiple runs. Instead, scaling up a single seed is a more natural way to generate multiple model variants in one experiment. The study therefore scales the searched "seed" ViT up to larger models in a training-free, principled, and efficient way. This automatic scaling method is described in Algorithm 2:
The initial architecture has one attention block per stage and an initial hidden dimension C = 32. Each iteration finds the optimal depth and width for further expansion: for depth, the study determines which stage to deepen (i.e., which stage to add an attention block to); for width, it finds the optimal expansion ratio (i.e., how much to widen the number of channels).
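A condensed sketch of what such a greedy, training-free scaling loop could look like; `score` and `param_count` here are illustrative stand-ins (the real algorithm would instantiate each candidate ViT and evaluate its L^E/κΘ-based score), and the widening ratios and parameter budget are assumptions, not the paper's values.

```python
import random

def param_count(depths, width):
    """Crude parameter estimate: roughly 12*C^2 parameters per attention
    block (an illustrative approximation, not the paper's formula)."""
    return sum(depths) * 12 * width ** 2

def score(depths, width):
    """Stand-in for the training-free L^E / kappa reward of a candidate;
    the real algorithm would build the ViT and measure it at initialization."""
    return random.random()

def auto_scale(num_stages=4, c_init=32, budget=25e6):
    """Greedy, training-free scaling: each iteration picks the best stage to
    deepen and the best widening ratio (by the proxy score), applies both,
    and stops once the parameter budget is reached."""
    depths, width = [1] * num_stages, c_init
    while param_count(depths, width) < budget:
        best_stage = max(range(num_stages), key=lambda s: score(
            depths[:s] + [depths[s] + 1] + depths[s + 1:], width))
        depths[best_stage] += 1
        best_ratio = max((1.1, 1.2, 1.3),
                         key=lambda r: score(depths, int(width * r)))
        width = int(width * best_ratio)
    return depths, width

print(auto_scale())   # e.g. ([3, 2, 4, 3], 412) -- one scaling trajectory
```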
The scaling trajectory is shown in Figure 3 below. Comparing the automated scaling with random scaling, the researchers found that the scaling principle prefers to sacrifice depth in exchange for more width, yielding shallower but wider networks. This is more similar to the scaling rule of Zhai et al. (2021). In contrast, ResNet and the Swin Transformer (Liu et al., 2021) choose to be narrower and deeper.
Efficient ViT training with progressive and flexible re-tokenization
Can ViTs be trained with coarser tokenization in the early stages? The study gives a positive answer by proposing a progressive, flexible re-tokenization training strategy. To change the number of tokens during training without affecting the shape of the weights in the linear projection, the study uses different sampling granularities in the first linear projection layer. Taking a first projection kernel of K_1 = 4 with stride = 4 as an example: during training, the researchers gradually change the (stride, dilation) pair of the first projection from (16, 5) to (8, 2) and then (4, 1), keeping the shape and structure of the weights unchanged.
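A small sketch of why this works: the weight tensor of a convolutional projection depends only on the kernel size and the channel counts, so changing (stride, dilation) alters the number of output tokens without touching the weights. The output channel count and input resolution below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

proj = nn.Conv2d(3, 96, kernel_size=4, stride=4)   # first linear projection
x = torch.randn(1, 3, 224, 224)

# Progressively refine tokenization: same weights, different (stride, dilation).
for stride, dilation in [(16, 5), (8, 2), (4, 1)]:
    proj.stride, proj.dilation = (stride, stride), (dilation, dilation)
    tokens = proj(x)
    print(stride, dilation, tuple(tokens.shape[-2:]))   # (14,14) -> (28,28) -> (56,56)

# The weight shape (96, 3, 4, 4) never changes, so training can continue
# seamlessly when the sampling granularity is switched.
print(tuple(proj.weight.shape))
```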
This re-tokenization strategy enables a form of curriculum learning for ViT training: training begins with coarse sampling to significantly reduce the number of tokens. In other words, during the early training phase As-ViT quickly learns coarse information from images at very low computational cost (only 13.2% of the FLOPs of full-resolution training). In the later stages of training, the study gradually switches to fine-grained sampling, recovering the full token resolution and maintaining competitive accuracy.
As Figure 4 shows, when the ViT is trained with coarse sampling in the early training phases, it can still achieve high accuracy at extremely low computational cost. Switching between sampling granularities introduces jumps in performance, and the network eventually recovers competitive final accuracy.
Experiments
As-ViT: automatically scaled ViTs
The searched As-ViT topology is shown in Table 4. This architecture promotes strong overlap between tokens in the first tokenization step and the three re-embedding steps. Its FFN expansion ratio is narrow first and then widens in deeper layers, and it uses a small number of attention splits to better aggregate global information.
Image classification
Table 5 below compares As-ViT with other models. Relative to previous Transformer-based and CNN-based architectures, As-ViT achieves SOTA performance with a comparable number of parameters and FLOPs.
Efficient training
The researchers varied the number of epochs spent in each token-reduction stage and report the results in Table 6. Standard training requires 42.8 TPU-days, while efficient training saves up to 56.2% of training FLOPs and 41.1% of training TPU-days while still achieving high accuracy.
Contributions of topology search and scaling
To better validate the contributions of the searched topology and the scaling rule, the study conducted further ablations (Table 7). First, the searched topology was trained directly, before scaling: the seed topology found by the search outperforms the best of the 87 random topologies in Figure 2.
Second, the study compared the complexity-based scaling rule with "random scaling + As-ViT topology." Across different scales, the automated scaling outperforms random scaling.
Object detection on the COCO dataset
The study compared As-ViT with standard CNNs and previous Transformer networks, changing only the backbone while keeping all other settings the same. As the results in Table 8 below show, As-ViT can also capture multi-scale features and achieves state-of-the-art detection performance, even though it was designed on ImageNet and its complexity was measured for classification.