
ViT's Revenge: Meta AI proposes a new baseline for ViT training

Selected from arXiv

By Adam Zewe

Machine Heart Compilation

Editors: Zhao Yang, Zhang Qian

This paper proposes a data augmentation scheme for training vision Transformers (ViT) built from three transformations: grayscale, solarization (overexposure), and Gaussian blur, together with a simple random cropping method (SRC). Experimental results show that this new recipe is substantially more effective than previous fully supervised training procedures for ViT.


After its great success in NLP, the Transformer model [55] and its derivatives have also grown in popularity for computer vision tasks. This family of models is increasingly used in areas such as image classification [13], detection and segmentation [3], and video analysis. In particular, the vision Transformer (ViT) proposed by Dosovitskiy et al. [13] has become a reasonable alternative to convolutional architectures. These developments show that the Transformer is a general architecture that can learn convolution-like operations as well as longer-range interactions through its attention mechanism [5, 8]. In contrast, convolutional networks [20, 27, 29, 41] have translation invariance built in and do not need to acquire it through training. It is therefore not surprising that hybrid architectures that incorporate convolutions converge faster than plain Transformers [18].

Because a Transformer treats an image only as a set of patches combined through attention, it must learn the structure of the image while being optimized so that its processing of the input serves the objective of a given task. That objective can be a label under supervision, or a proxy task under a self-supervised method. However, despite the Transformer's great success, little work in computer vision has studied how to train vision Transformers effectively, especially on medium-sized datasets such as ImageNet-1k. Since the work of Dosovitskiy et al. [13], training has mostly relied on variants of the procedures proposed by Touvron et al. [48] and Steiner et al. [42]. In contrast, much work has proposed alternative architectures that introduce pooling, more efficient attention mechanisms, or hybrid designs combining convolutional and pyramidal structures. These new designs, while particularly effective for some tasks, are less versatile. One may therefore wonder whether the reported performance improvements come from a specific architectural design, or simply because the design makes optimization easier, as has been argued for ViTs that incorporate convolutions.

Recently, the popularity of BERT-inspired self-supervised pre-training has raised hopes of a BERT moment in computer vision. From the viewpoint of the Transformer architecture itself, NLP and CV share some similarities. However, not everything carries over: the processed modalities have different properties (images are continuous while text is discrete). CV also offers large annotated databases such as ImageNet [40], and fully supervised pre-training on ImageNet is effective for many downstream tasks, such as transfer learning [37] or semantic segmentation.

Without further study of the fully supervised approach on ImageNet, it is difficult to determine whether the performance of a self-supervised method such as BeiT [2] should be attributed to the training procedure itself (data augmentation, regularization, optimization) or to an underlying mechanism that learns more general implicit representations. In this paper, the researchers do not try to settle this question; instead, they explore it by updating the training recipe of the plain ViT architecture.


Address: https://arxiv.org/pdf/2204.07118.pdf

The researchers hope that this work will help the community make the most of the Transformer's potential and clarify the importance of BERT-like pre-training. Their work builds on the latest fully supervised and self-supervised approaches and provides new insights into data augmentation. The authors propose new training recipes for ViTs on ImageNet-1k and ImageNet-21k. The main components are as follows:

The authors build on the training recipe that Wightman et al. [57] introduced for ResNet-50. In particular, they adopt a binary cross-entropy loss for ImageNet-1k-only training, and they adapt the recipe by adding components known to significantly improve the training of large ViTs [51], namely stochastic depth [24] and LayerScale [51] (a minimal sketch of these pieces appears after this list).

3-Augment: a simple data augmentation scheme inspired by self-supervised learning. Surprisingly, the authors observe that for ViT it works better than the automatic or learned augmentation policies commonly used to train ViTs, such as RandAugment [6].

When pre-training on a larger dataset such as ImageNet-21k, simple random cropping is more effective than random resized cropping.

Lower the resolution during training. This choice reduces the train/test discrepancy [53] and had not yet been exploited for ViT. The authors observe that it also regularizes the largest models by preventing overfitting. For example, with a target resolution of 224 × 224, a ViT-H pre-trained at a resolution of 126 × 126 (81 tokens) performs better on ImageNet-1k than one pre-trained at 224 × 224 (256 tokens). Pre-training is also cheaper, because the number of tokens is reduced by about 70% (see the short calculation below). From this perspective, the approach provides scaling properties similar to those of the masked autoencoder [19].
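
The token-count claim above follows directly from ViT-H's patch size of 14 pixels; the snippet below is just that arithmetic, not code from the paper.

```python
# Token counts for ViT-H (patch size 14) at the two pre-training resolutions above.
patch = 14
for res in (224, 126):
    tokens = (res // patch) ** 2
    print(f"{res}x{res}: {tokens} tokens")
# 224x224 -> 256 tokens, 126x126 -> 81 tokens; 1 - 81/256 ≈ 0.68, i.e. roughly 70% fewer.
```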
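As a companion to the first component in the list, here is a minimal PyTorch sketch of LayerScale [51] and of a binary cross-entropy classification loss in the spirit of Wightman et al. [57]; the initialization value and the usage pattern are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Scale a residual-branch output by a learnable per-channel vector,
    initialized to a small value (the init value here is an assumption)."""
    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x

# Binary cross-entropy over classes (one logit per class) instead of softmax cross-entropy.
criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 1000)  # batch of 8 images, 1000 classes
targets = torch.zeros(8, 1000).scatter_(1, torch.randint(0, 1000, (8, 1)), 1.0)
loss = criterion(logits, targets)
```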

This "new" training strategy does not saturate with the largest models, which is a step beyond the Data-Efficient Image Transformer (DeiT) [48] of Touvron et al. The researchers achieve competitive performance in both image classification and segmentation, even compared with recently popular architectures such as Swin Transformers [31] or modern convolutional architectures such as ConvNeXt [32]. Here are some of the results the authors find most interesting.

Even on medium-sized datasets, the recipe can take advantage of higher-capacity models. For example, a ViT-H trained only on ImageNet-1k reaches 85.2% top-1 accuracy, +5.1% higher than the best supervised ViT-H at resolution 224×224 reported in the literature.

The ImageNet-1k training recipe allows training a one-billion-parameter ViT-H (52 layers) without any hyperparameter adaptation, simply reusing the same stochastic depth drop rate as ViT-H. It reaches 84.9% top-1 accuracy at 224×224, +0.2% higher than the corresponding ViT-H trained in the same setting.

Without sacrificing performance, the number of GPUs required and the training time for ViT-H can be cut by more than half, so such models can be trained efficiently with fewer resources. This is thanks to pre-training at a lower resolution, which reduces peak memory.

For the ViT-B and ViT-L models, the authors propose a supervised training procedure that, with default settings, is comparable to BERT-like self-supervised methods [2, 19], for both image classification and semantic segmentation, while using the same level of annotation and fewer epochs.

With this improved training procedure, a vanilla ViT closes the gap with the most recent state-of-the-art architectures while generally offering a better compute/performance trade-off. The proposed models also fare relatively better on the additional test set ImageNet-V2 [39], suggesting that they generalize better to another validation set than most previous work.

Ablation experiments were performed on the effect of the crop ratio used in transfer-learning classification tasks. The researchers observed that the crop ratio has a significant effect on performance, but that the optimal value depends heavily on the target dataset/task.

Revisiting training and pre-training for vision Transformers

In this section, the researchers describe the training process for visual Transformers and compare them to existing methods. They detail the different ingredients in Table 1. Based on the work of Wightman et al. [57] and Touvron et al. [48], the authors describe several changes that have a significant impact on the accuracy of the final model.


Data augmentation

Since the advent of AlexNet, the data augmentation procedures used to train neural networks have gone through several major changes. Interestingly, the same augmentations, such as RandAugment [6], are widely reused for ViT even though these policies were originally developed for convolutional networks. Given that the architectural priors and biases of the two families are quite different, these augmentation policies may not be well adapted, and may even be overfitted, given the large number of choices involved in their selection. The researchers therefore revisited this step.

3-Augment: The authors propose a simple data augmentation inspired by what is used in self-supervised learning (SSL). They consider the following three transformations:

Grayscale: favors color invariance and puts more focus on shape.

Solarization (overexposure): adds strong noise to the colors to make the model more robust to variations in color intensity, which again puts more focus on shape.

Gaussian blur: slightly alters the details of the image.

For each image, one of these augmentations is selected with uniform probability. In addition to these three choices, the common color jitter and horizontal flip are applied. Figure 2 illustrates the different augmentations used in 3-Augment.
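
Below is a minimal torchvision sketch of such a policy under the description above; the parameter values (blur kernel and sigma, solarization threshold, jitter strength) are illustrative assumptions, not the authors' reported settings.

```python
from torchvision import transforms

# Sketch of a 3-Augment-style pipeline: pick exactly one of the three
# transformations with equal probability, then apply color jitter and a flip.
three_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomChoice([
        transforms.RandomGrayscale(p=1.0),
        transforms.RandomSolarize(threshold=128, p=1.0),
        transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
    ]),
    transforms.ColorJitter(0.3, 0.3, 0.3),  # common color jitter (assumed strength)
    transforms.ToTensor(),
])
# A cropping transform (RRC or SRC, discussed below) would normally come first.
```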


In Table 2, they provide the results of ablation experiments on the different data augmentation components.


Cropping

Random Resized Crop (RRC) was introduced with GoogLeNet [43]. It acts as a regularizer that limits overfitting, while encouraging the model's decisions to be invariant to a certain class of transformations. This augmentation is considered important on ImageNet-1k to prevent overfitting, which is all the more common with modern large models.

However, this cropping strategy introduces a discrepancy in aspect ratio and apparent object size between training and test images [53]. Because ImageNet-21k contains far more images, it is less prone to overfitting, so the researchers question whether the benefits of strong RRC regularization outweigh its drawbacks when training on such larger datasets.

Simple Random Crop (SRC) is a simpler way to extract crops, similar to the original cropping scheme proposed in AlexNet [27]: the image is resized so that its smallest side matches the training resolution, a 4-pixel reflection padding is applied on all sides, and finally a square crop of the training size is taken at a position chosen at random along the x-axis of the image.
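
A minimal sketch of this SRC procedure with torchvision, based on the reading above; the class name and the handling of the small vertical offset left by the padding are assumptions for illustration.

```python
import random
from PIL import Image
import torchvision.transforms.functional as F

class SimpleRandomCrop:
    """Sketch of Simple Random Crop (SRC) as described above: resize so the
    shorter side matches the training resolution, reflect-pad 4 pixels,
    then take a square crop of the training size at a random x-offset."""
    def __init__(self, size: int):
        self.size = size

    def __call__(self, img: Image.Image) -> Image.Image:
        img = F.resize(img, self.size)                       # shorter side -> training size
        img = F.pad(img, padding=4, padding_mode="reflect")  # 4-pixel reflection padding
        w, h = img.size
        left = random.randint(0, w - self.size)              # random position along the x-axis
        top = random.randint(0, h - self.size)               # small slack from the padding
        return F.crop(img, top, left, self.size, self.size)
```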

Figure 3 shows the crop boxes sampled by RRC and SRC. RRC produces crops with a wide variety of sizes and shapes. In contrast, SRC covers a much larger fraction of the image and preserves the aspect ratio, but offers less diversity: the crop boxes overlap significantly. As a result, when training on ImageNet-1k, the commonly used RRC remains preferable; for example, removing RRC reduces the top-1 accuracy of ViT-S by 0.9%.


However, on ImageNet-21k (about 10 times larger than ImageNet-1k), the risk of overfitting is smaller, and the extra regularization and diversity provided by RRC matter less. In this setting, SRC has the advantage of reducing the discrepancy in apparent size and aspect ratio. More importantly, it makes it more likely that the crop actually contains the labeled object: RRC is relatively aggressive, and in many cases the labeled object is not even present in the crop, as shown in Figure 4. For example, with RRC there is no zebra in one of the crops of the left example, and no train in three crops of the middle example. This is unlikely to happen with SRC, because SRC covers most of the image pixels.


In Table 5, the researchers provide ablation results on the cropping method used during ImageNet-21k pre-training, showing that this choice translates into significant performance differences.


Experimental results

The researchers' evaluation on image classification (ImageNet-1k with and without ImageNet-21k pre-training), transfer learning, and semantic segmentation shows that their procedure is significantly better than previous fully supervised training recipes for ViT. The study also shows that supervised ViTs trained this way are comparable in performance to more recent architectures. These results can serve as a stronger baseline for the self-supervised approaches recently applied to ViT.


Please refer to the original paper for more details.
