
Topping the charts on major CV tasks: how strong is Microsoft's open-source Swin Transformer? With the official implementation now open sourced, is Transformer likely to replace CNN in computer vision?

Author: InfoQ

Microsoft's Swin Transformer code was recently officially open sourced. In just two days it earned 1.9k stars on GitHub, and related topics have drawn wide discussion and attention on Zhihu.

<h1 class="pgc-h-arrow-right" data-track="2" > Microsoft Swin Transformer officially open source</h1>

Swin Transformer is a general-purpose vision backbone. It builds a hierarchical representation: the input is first split into small patches, and neighboring patches are gradually merged in deeper Transformer layers. This hierarchy lets Swin Transformer serve dense prediction techniques such as FPN and U-Net, while computation is kept tractable by restricting self-attention to non-overlapping local windows. Because the number of patches in each window is fixed, the complexity grows only linearly as the image size increases. Compared with ViT, whose complexity is quadratic in the number of patches, Swin Transformer's computational complexity is significantly reduced, scaling linearly with the size of the input image.
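To make the hierarchy concrete, here is a minimal PyTorch sketch of the patch-merging idea (an illustrative stand-in, not the repository's exact code): each 2 x 2 group of neighboring patch features is concatenated and linearly projected, halving spatial resolution while doubling channels.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighboring patches, halving spatial
    resolution and doubling channels (a sketch of the hierarchy idea)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W even
        x0 = x[:, 0::2, 0::2, :]   # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```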

The core design of Swin Transformer is the shifted window: the window partitioning is shifted between consecutive self-attention layers, so the shifted windows bridge the windows of the preceding layer, maintaining connections between them and significantly enhancing modeling power. The strategy is also efficient in terms of real-world latency: all query patches within a window share the same key set, which facilitates memory access in hardware, whereas earlier sliding-window Transformer approaches suffer higher latency on real hardware because different query pixels use different key sets. The paper's experiments show that the shifted window approach has much lower latency than the traditional sliding-window method, while the two are similar in modeling power.

Based on this design, Swin Transformer performs strongly across a range of recognition tasks, including image classification, object detection, and semantic segmentation, outperforming ViT/DeiT and ResNe(X)t.


GitHub Address: https://github.com/microsoft/Swin-Transformer

Zhihu answer from Cao Yue of the original author team: https://www.zhihu.com/question/437495132/answer/1800881612

Paper address: https://arxiv.org/pdf/2103.14030.pdf

<h1 class="pgc-h-arrow-right" data-track="17" > implementation method</h1>

The architecture of Swin Transformer is shown in the figure below. Like ViT, it first splits the input image into multiple patches (the authors use a patch size of 4 x 4), which then pass through a linear embedding layer and the authors' Swin Transformer blocks.

[Figure: overall architecture of Swin Transformer]
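As an illustration of the first step, the 4 x 4 patch split plus linear embedding can be expressed as a single strided convolution, a common implementation trick. The PatchEmbed module below is a hypothetical sketch, not the official code:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project
    each patch to an embedding of dimension embed_dim."""
    def __init__(self, patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> (B, C, H/4, W/4) -> (B, H/4 * W/4, C)
        return self.proj(x).flatten(2).transpose(1, 2)
```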

Swin Transformer block: this module replaces the standard multi-head self-attention (MSA) module of a Transformer block with a shifted-window-based MSA module, keeping the other layers the same. A Swin Transformer block thus contains a shifted-window-based MSA module followed by a 2-layer MLP; a LayerNorm layer is applied before each MSA module and each MLP, and a residual connection is applied after each.
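The block structure can be sketched in PyTorch as follows. This is a simplified illustration: it uses PyTorch's built-in MultiheadAttention over all tokens as a stand-in for the window-based (shifted) attention, and omits the window partitioning logic; SwinBlockSketch is a hypothetical name, not the official module.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Pre-norm block: LN -> (S)W-MSA -> residual, then
    LN -> 2-layer MLP -> residual. `attn` stands in for window-based
    attention; window partitioning/shifting is omitted for brevity."""
    def __init__(self, dim: int, shift_size: int = 0, mlp_ratio: int = 4):
        super().__init__()
        self.shift_size = shift_size  # 0 for W-MSA, M // 2 for SW-MSA
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```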

Shifted-window-based self-attention

[Figure: illustration of the shifted window approach]

The standard Transformer uses global self-attention to model the relationships between all tokens, but its complexity is quadratic in the number of tokens, making it unsuitable for dense prediction tasks and high-resolution images. To improve modeling efficiency, the authors propose computing self-attention within local windows. Supposing each window contains M x M patches, the complexities of a global MSA module and a window-based MSA module on an image of h x w patches are:

Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC

(1) The former is quadratic in the patch count hw; (2) the latter is linear in hw when M is fixed (the default is 7). Global self-attention is generally unaffordable for a large hw, while window-based self-attention keeps the computation modest.
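For intuition, plugging concrete numbers into the two formulas shows the gap. The values below are an assumed example (Swin-T's stage-1 feature map: 56 x 56 patches, C = 96 channels, window size M = 7):

```python
# The paper's two complexity formulas, evaluated for a concrete case.
h, w, C, M = 56, 56, 96, 7

msa   = 4 * h * w * C**2 + 2 * (h * w)**2 * C    # global self-attention
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C  # window-based

print(f"global MSA : {msa / 1e9:.2f}e9 ops")     # ~2.00e9
print(f"window MSA : {w_msa / 1e9:.2f}e9 ops")   # ~0.15e9
```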

Shifted window partitioning in successive blocks

Window-based self-attention reduces complexity, but it lacks connections across windows, which limits the model's modeling power.

To introduce cross-window connections while maintaining the efficient computation of non-overlapping windows, the authors propose shifted window partitioning. The first module uses a regular partitioning strategy: for example, an 8 x 8 feature map with window size M = 4 is divided into 2 x 2 windows. The next module then shifts the windows of the preceding layer by (M/2, M/2) = (2, 2) pixels, as in the sketch below.
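Regular window partitioning itself is just a reshape. A minimal sketch matching the 8 x 8, M = 4 example above (window_partition is a hypothetical helper):

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping M x M
    windows, returning (num_windows * B, M, M, C).
    Assumes H and W are multiples of M."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

x = torch.randn(1, 8, 8, 96)         # the 8x8 example from the text
print(window_partition(x, 4).shape)  # torch.Size([4, 4, 4, 96]): 2x2 windows
```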

With shifted partitioning, consecutive Swin Transformer blocks are computed as:

ẑ^l = W-MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

where W-MSA and SW-MSA denote window-based multi-head self-attention using regular and shifted window partitioning, respectively.

The shifted window approach introduces connections between neighboring non-overlapping windows of the previous layer while keeping the model running efficiently.

Efficient batch computation for shifted windows

[Figure: efficient batch computation via cyclic shift]

Shifted window partitioning has its own problem: it increases the number of windows, from ⌈h/M⌉ x ⌈w/M⌉ to (⌈h/M⌉ + 1) x (⌈w/M⌉ + 1), and some of the resulting windows are smaller than M x M. A naive solution pads the smaller windows up to M x M and masks the padded values when computing attention; but when the regular partitioning yields few windows, e.g. 2 x 2, the extra computation is considerable (going from 2 x 2 to 3 x 3 windows increases the count by 2.25x, as the quick count below shows).
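A quick count makes the overhead concrete (assumed toy sizes matching the example above):

```python
import math

# Window counts for an h x w feature map with window size M:
# regular partitioning vs. the naively padded shifted partitioning.
h, w, M = 8, 8, 4
regular = math.ceil(h / M) * math.ceil(w / M)               # 2 x 2 = 4
padded = (math.ceil(h / M) + 1) * (math.ceil(w / M) + 1)    # 3 x 3 = 9
print(regular, padded, padded / regular)                    # 4 9 2.25
```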

Therefore, the authors propose a more efficient batch computation method: cyclically shifting the feature map toward the top-left. After this shift, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism limits self-attention to within each sub-window. With cyclic shift, the number of batched windows remains the same as with regular partitioning, which greatly improves computational efficiency.
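The cyclic shift itself can be expressed with a single torch.roll. A minimal sketch of the shift and its reversal on a toy feature map (the attention masking between non-adjacent sub-windows is omitted):

```python
import torch

M = 4                                        # window size
shift = M // 2                               # shift by (M/2, M/2) pixels
x = torch.arange(64.).reshape(1, 8, 8, 1)    # toy (B, H, W, C) feature map

# Cyclic shift toward the top-left: rows/columns that fall off wrap
# around, so the window count stays (H/M) x (W/M) with no padding.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# ... window attention would run here, with a mask confining attention
# to each sub-window of the now non-adjacent regions ...

# Reverse the shift to restore the original spatial layout.
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
assert torch.equal(x, restored)
```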

Relative position bias:

In computing self-attention, the authors include a relative position bias B for each head:

Attention(Q, K, V) = SoftMax(QK^T / √d + B)V

where Q, K, V ∈ R^(M²×d) are the query, key, and value matrices, d is the query/key dimension, and M² is the number of patches in a window. Since the relative position along each axis lies in the range [-M+1, M-1], the values of B are taken from a smaller parameterized bias matrix B̂ ∈ R^((2M-1)×(2M-1)).

The paper reports that this bias brings significant performance improvements over using no position bias or an absolute position embedding.
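A sketch of how the bias can be indexed: each pair of positions in an M x M window maps to one of (2M-1)² relative offsets, which index the learnable table B̂. Below, bias_table is a zero placeholder standing in for the learned parameter:

```python
import torch

M = 7  # window size; each window holds M * M patches

# Coordinates of every position in a window, then all pairwise offsets.
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij"))   # (2, M, M)
coords = coords.flatten(1)                              # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]           # (2, M*M, M*M)

# Each axis offset lies in [-M+1, M-1]; shift to be non-negative and
# flatten (dy, dx) into one index into the (2M-1) x (2M-1) table.
rel = rel.permute(1, 2, 0) + (M - 1)                    # (M*M, M*M, 2)
index = rel[..., 0] * (2 * M - 1) + rel[..., 1]         # (M*M, M*M)

# In the model, bias_table is a learned nn.Parameter; zeros here
# serve only as a placeholder to show the gather.
bias_table = torch.zeros((2 * M - 1) ** 2)
B = bias_table[index.view(-1)].view(M * M, M * M)       # added to Q K^T
```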

The authors build a base model called Swin-B, comparable in model size and computational complexity to ViT-B, and also introduce Swin-T, Swin-S, and Swin-L, which are about 0.25x, 0.5x, and 2x the model size and computational complexity of Swin-B, respectively.

By default, the window size M is set to 7, the query dimension of each head is d = 32, and the expansion ratio α of each MLP layer is 4.

Swin-T: C = 96, layer numbers = {2, 2, 6, 2}
Swin-S: C = 96, layer numbers = {2, 2, 18, 2}
Swin-B: C = 128, layer numbers = {2, 2, 18, 2}
Swin-L: C = 192, layer numbers = {2, 2, 18, 2}

where C is the channel number of the hidden layers in the first stage.

<h1 class="pgc-h-arrow-right" data-track="69" > specific performance</h1>

Image classification

Table 1A, training on ImageNet-1K from scratch:


Table 1B, pre-training on ImageNet-22K, then fine-tuning on ImageNet-1K:


Object detection

Table 2A, using Swin Transformer as the backbone in different detection frameworks:


Table 2C, comparison with SOTA:


Semantic segmentation

Table 3, even with smaller models, Swin achieves higher mIoU than the previous SOTA, SETR:


Ablation experiments

Table 4, performance improvements from the shifted window approach and the relative position bias:


Table 5, real speed of the shifted window approach and the cyclic implementation:


Table 6, comparison of different self-attention computation methods:


<h1 class="pgc-h-arrow-right" data-track="102" > is it possible for Transformer to replace CNN on CV? </h1>

The emergence of ViT has expanded the uses of Transformer, which has also led practitioners in the field of AI to pay attention to the relationship between Transformer and CNN. On Zhihu, a user asked:

There are already Transformer-based applications for the three major image problems: classification (ViT), detection (DETR), and segmentation (SETR), all of which have achieved good results. So in the future, is it possible for Transformer to replace CNN, and will Transformer revolutionize the CV field as it did NLP? What research directions might follow?

Many Zhihu respondents offered their own answers to this question, including a PhD in control science and engineering from Zhejiang University, a master's graduate from Fudan University's School of Microelectronics, and a senior algorithm expert from Alibaba. The general consensus: whether a complete replacement will happen remains open, but Transformers have already dealt CNNs a heavy blow.

Interested readers can read the full answers here: https://www.zhihu.com/question/437495132/answer/1800881612

Reference link: https://www.programmersought.com/article/61847904617/