
RepVGG: SOTA performance with only 3x3 convolution and ReLU, making VGG-style models great again

Source: I Love Computer Vision

Author: Ding Xiaohan

This article is reproduced from Zhihu and has been reprinted with the author's permission.

Link: https://zhuanlan.zhihu.com/p/344324470

Bilibili's bullet-comment of the year for 2020 was "ye qing hui" ("my youth is back"). There must have been many moments that made you feel your youth had returned. In this era when every hyperparameter of a convolutional network is tuned to three decimal places, do you still remember the pastoral era of five or six years ago, and the simple joy of gaining accuracy just by stacking a few more convolutional layers?

Our recent work, RepVGG, uses structural re-parameterization to implement a VGG-style, single-path, minimalist architecture with 3x3 convolutions all the way through. It reaches SOTA levels in both speed and accuracy, exceeding 80% top-1 accuracy on ImageNet.

Can SOTA performance be achieved without NAS, without attention, without various novel activation functions, without even a branching structure, with only 3x3 convolution and ReLU?


The open-source pre-trained models and code (PyTorch version) have been released, gathering 300 stars within two days, and the models have been downloaded hundreds of times; according to feedback from peers, it works well in real business scenarios.

TL;DR version

How easy is the method? If you finish reading this article at 5 p.m., you can finish writing the code before dinner, start training, and see the results the next day. If you don't have time to read the article, just open the code linked below and read the first 100 lines to fully understand it.

Here's a closer look.

Model definition

By "VGG" we mean:

1. There is no branching structure. This is commonly referred to as a plain or feed-forward architecture.

2. Use only 3x3 convolution.

3. Use ReLU only as the activation function.

The basic architecture of the RepVGG model in one sentence: stack some twenty-odd 3x3 convolutional layers into 5 stages; the first layer of each stage downsamples with stride=2, and every convolutional layer uses ReLU as the activation function.

Another sentence to introduce the detailed structure: RepVGG-A's 5 stages have [1, 2, 4, 14, 1] layers, RepVGG-B's 5 stages have [1, 4, 6, 16, 1] layers, and the widths are [64, 128, 256, 512] multiplied by a few factors. These multipliers are arbitrarily chosen "neat" numbers such as 1.5 or 2.5 and have not been finely tuned.
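
To make the structure concrete, here is a minimal PyTorch sketch of what such a plain inference-time stack could look like. This is not the official implementation: the class and function names are illustrative, the widths assume a multiplier of 1, and the classifier head is simplified.

```python
import torch.nn as nn

def make_plain_stage(in_ch, out_ch, num_layers):
    """A plain stage: 3x3 conv + ReLU repeated; the first layer downsamples with stride=2."""
    layers = []
    for i in range(num_layers):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, stride=2 if i == 0 else 1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class PlainRepVGGASketch(nn.Module):
    """Illustrative inference-time body: 5 stages with [1, 2, 4, 14, 1] layers."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.body = nn.Sequential(
            make_plain_stage(3, 64, 1),
            make_plain_stage(64, 64, 2),
            make_plain_stage(64, 128, 4),
            make_plain_stage(128, 256, 14),
            make_plain_stage(256, 512, 1),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512, num_classes))

    def forward(self, x):
        return self.head(self.body(x))
```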

Another sentence for the training settings: 120 epochs on ImageNet, with no tricks; it can even be trained directly with the training code of the official PyTorch example!
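
For reference, a minimal training-loop sketch consistent with the "no tricks" claim might look like the following. Only the 120-epoch count comes from the text; the optimizer settings, scheduler, and function names are illustrative assumptions, not the reported recipe.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=120, device="cuda"):
    # Standard cross-entropy + SGD; hyperparameters here are illustrative.
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)
    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```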

Why design this minimalist model, and how does such a simple, manually designed model achieve SOTA on ImageNet?

Why use a VGG-style model

In addition to our belief that simplicity is beauty, the VGG-style minimalist model has at least five practical advantages (see the paper for details).

1. 3x3 convolution is very fast. On GPUs, the computational density of 3x3 convolution (theoretical FLOPs divided by elapsed time) can reach four times that of 1x1 or 5x5 convolution.

2. The single-path architecture is very fast because of its high degree of parallelism. With the same FLOPs, a few "large and unified" operations run far more efficiently than many "small and fragmented" ones.

3. The single-path architecture saves memory. For example, ResNet's shortcut, although it accounts for no computation, doubles the memory footprint.

4. The single-path architecture is more flexible, making it easy to change the width of each layer (e.g., for pruning).

5. The main body of RepVGG contains only one type of operator: 3x3 convolution followed by ReLU. When designing specialized chips, given a fixed chip size or cost, we can integrate a massive number of 3x3 convolution-ReLU compute units to achieve high efficiency. And don't forget, the memory-saving property of the single-path architecture also lets us get away with fewer storage units.

Structural re-parameterization makes VGG great again

Compared with the various multi-branch architectures (such as ResNet, Inception, DenseNet, and various NAS-produced architectures), VGG-style models have received little attention in recent years, mainly because of their poor performance.

For example, one explanation for ResNet's good performance is that its shortcuts produce an implicit ensemble of a huge number of sub-models [1] (because the number of paths doubles every time a branch is encountered), something a single-path architecture obviously does not have.

Since a multi-branch architecture is beneficial for training, while the model we want to deploy is single-path, we propose decoupling the training-time architecture from the inference-time architecture. The way we usually use a model is:

1. Train a model

2. Deploy the model

But here we propose a new approach:

1. Train a multi-branch model

2. Equivalently convert the multi-branch model into a single-path model

3. Deploy the single-path model

This lets you enjoy both the advantage of multi-branch models at training time (high performance) and the advantages of single-path models at inference time (fast, memory-saving). The key clearly lies in how this multi-branch model is constructed and converted.

Our implementation is to add a parallel 1x1 convolution branch and an identity branch to each 3x3 convolutional layer during training, forming a RepVGG Block. This design borrows from ResNet, the difference being that ResNet adds a branch every two or three layers, whereas we add one to every layer.
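
A simplified sketch of such a training-time block is shown below, assuming stride 1 and equal input/output channels so that the identity branch exists; the names are illustrative, not the official code.

```python
import torch.nn as nn

class RepVGGBlockSketch(nn.Module):
    """Training-time block: parallel 3x3 conv, 1x1 conv, and identity branches,
    each followed by BN; their outputs are summed and passed through ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3x3 = nn.BatchNorm2d(channels)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1x1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)   # identity branch: BN only
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn3x3(self.conv3x3(x))
                         + self.bn1x1(self.conv1x1(x))
                         + self.bn_id(x))
```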


After training is complete, we perform an equivalent transformation on the model to obtain the deployment model. This conversion is also very simple: a 1x1 convolution is a special case of a 3x3 convolution (one whose kernel has many zeros), and the identity map is a special case of a 1x1 convolution (one whose kernel is the identity matrix)! Thanks to the linearity of convolution (specifically, its additivity), the three branches of each RepVGG Block can be merged into a single 3x3 convolution.

The following diagram depicts this conversion process. In this example, both the input and output channels are 2, so the parameters of the 3x3 convolution are four 3x3 matrices, and the parameter of the 1x1 convolution is a 2x2 matrix. Note that all three branches have a BN (batch normalization) layer, whose parameters include the running mean and standard deviation accumulated during training and the learned scaling factor and bias.

This does not hinder the feasibility of the conversion, because at inference time a convolutional layer followed by its BN layer can be equivalently replaced by a single convolutional layer with bias (a step commonly referred to as "absorbing BN", i.e., BN fusion).
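
A sketch of this BN-fusion step, assuming a bias-free convolution followed by BatchNorm (the function name is illustrative):

```python
import torch

def fuse_conv_bn(kernel, bn):
    """Fold a BatchNorm that follows a bias-free conv into an equivalent
    (kernel, bias) pair: y = gamma * (W*x - mean) / std + beta."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                        # per-output-channel factor
    fused_kernel = kernel * scale.reshape(-1, 1, 1, 1)
    fused_bias = bn.bias - bn.running_mean * scale
    return fused_kernel, fused_bias
```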

[Figure: converting the three branches of a RepVGG Block into a single 3x3 convolution]

After "absorbing BN" into each of the three branches (note that the identity map can be viewed as a "convolutional layer" whose parameter is a 2x2 identity matrix!), zero-pad the resulting 1x1 convolutional kernel to 3x3. Finally, the kernels and the biases obtained from the three branches are each added up.

In this way, the output of each RepVGG Block before and after conversion is exactly the same, so the trained model can be equivalently converted into a single-path model containing only 3x3 convolutions.
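
Continuing the sketch above, the merging step might look like the following, assuming each branch's (kernel, bias) pair has already been obtained by fusing its BN (for the identity branch, by fusing its BN against the identity kernel built below). The function names are illustrative.

```python
import torch
import torch.nn.functional as F

def identity_kernel(channels):
    """The identity map written as a channels x channels x 3 x 3 conv kernel."""
    k = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        k[c, c, 1, 1] = 1.0   # a 1 at the center of each channel's own filter
    return k

def merge_branches(k3x3, b3x3, k1x1, b1x1, k_id, b_id):
    """Zero-pad the 1x1 kernel to 3x3, then add up kernels and biases."""
    k1x1_padded = F.pad(k1x1, [1, 1, 1, 1])   # the 1x1 value sits at the kernel center
    return k3x3 + k1x1_padded + k_id, b3x3 + b1x1 + b_id
```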


From this conversion process we can see the essence of "structural re-parameterization": the training-time structure corresponds to one set of parameters, and the inference-time structure we want corresponds to another set; as long as we can equivalently convert the parameters of the former into those of the latter, we can also convert the structure of the former into that of the latter.
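
Putting the sketches together, the claimed equivalence can be checked numerically (in eval mode, so BN uses its running statistics). All names below refer to the illustrative sketches above, not the released code.

```python
import torch
import torch.nn as nn

channels = 2
block = RepVGGBlockSketch(channels).eval()   # eval mode: BN uses running stats

with torch.no_grad():
    k3, b3 = fuse_conv_bn(block.conv3x3.weight, block.bn3x3)
    k1, b1 = fuse_conv_bn(block.conv1x1.weight, block.bn1x1)
    kid, bid = fuse_conv_bn(identity_kernel(channels), block.bn_id)
    k, b = merge_branches(k3, b3, k1, b1, kid, bid)

    plain = nn.Conv2d(channels, channels, 3, padding=1)
    plain.weight.copy_(k)
    plain.bias.copy_(b)

    x = torch.randn(1, channels, 8, 8)
    diff = (block(x) - torch.relu(plain(x))).abs().max()
    print(diff)   # should be tiny (floating-point error only)
```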

Experimental results

Tested on a 1080Ti, the speed-accuracy trade-off of the RepVGG models is quite outstanding. Under a fair training setup, at the same accuracy, RepVGG's speed is 183% of ResNet-50's, 201% of ResNet-101's, 259% of EfficientNet's, and 131% of RegNet's.

Note that RepVGG achieves performance exceeding EfficientNet and RegNet without using any NAS or heavy manual iterative design.


This also shows that FLOPs is an inappropriate measure of the real speed of different architectures. For example, the FLOPs of RepVGG-B2 are 10 times those of EfficientNet-B3, yet it runs twice as fast on a 1080Ti, which means the computational density of the former is more than 20 times that of the latter.

Semantic segmentation experiments on Cityscapes show that the RepVGG models achieve mIoU about 1% to 1.7% higher than the ResNet family while running faster, or run 62% faster with 0.37% higher mIoU.


A series of ablation studies and comparative experiments show that structural reparameterization is the key to the excellent performance of the RepVGG model (see paper for details).

Finally, it should be noted that RepVGG is an efficient model designed for GPUs and specialized hardware, pursuing high speed and memory savings while paying less attention to parameter count and theoretical FLOPs. On low-power devices, it may be less suitable than the MobileNet and ShuffleNet families.

Paper:

https://arxiv.org/abs/2101.03697

Open-source pre-trained models and code (PyTorch version):

https://github.com/DingXiaoH/RepVGG

(MegEngine version):

https://github.com/megvii-model/RepVGG

References:

[1] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.