
YOLO Backbone Improvement Series: DaViT

Author: Nuist Object Detection

Abstract: In this work, we introduce the Dual Attention Vision Transformer (DaViT), a simple yet effective vision transformer architecture that captures global context while maintaining computational efficiency. We approach the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, while the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other.

(i) Since each channel token contains an abstract representation of the entire image, channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels, and (ii) spatial attention refines local representations through fine-grained interactions across spatial locations, which in turn helps the global information modeling of channel attention. Extensive experiments show that DaViT achieves state-of-the-art performance on four different tasks with efficient computation. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5 billion weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K.

The dual attention mechanism performs self-attention from two orthogonal perspectives:

(1) Self-attention over spatial tokens: the spatial dimension (HW) defines the number of tokens, and the channel dimension (C) defines the feature size of each token. This is the standard formulation used by ViT.

(2) Self-attention over channel tokens: exactly the opposite of the above, the channel dimension (C) defines the number of tokens, and the spatial dimension (HW) defines the feature size of each token.

The two kinds of self-attention are thus completely opposite in design. To reduce computation, both use grouped attention. For spatial tokens, the spatial dimension is divided into different windows, which is the window attention proposed in Swin; the paper calls it spatial window attention. For channel tokens, the channel dimension can likewise be divided into different groups, which the paper calls channel group attention. These two types of attention are shown in the following figure:

[Figure: spatial window attention vs. channel group attention]
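To make the orthogonal view concrete, here is a simplified PyTorch sketch of channel group attention, written from the description above rather than copied from the authors' code: channels are split into groups, and attention scores are computed between channels instead of between spatial positions.

```python
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    """Self-attention over channel tokens: the channel dimension C defines the
    number of tokens, and the flattened spatial dimension N = H*W defines each
    token's feature size. Channels are split into groups so complexity stays
    linear in the number of pixels."""
    def __init__(self, dim, groups=8):
        super().__init__()
        self.groups = groups
        self.scale = (dim // groups) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)      # each: (B, groups, C/g, N)
        attn = (q * self.scale) @ k.transpose(-2, -1)  # (B, g, C/g, C/g)
        attn = attn.softmax(dim=-1)               # channel-to-channel scores
        out = attn @ v                            # (B, g, C/g, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because every channel token aggregates all N spatial positions, each attention score already "sees" the whole image, which is exactly the global-interaction property the abstract claims.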

DaViT's dual attention mechanism includes the following two self-attention methods:

● Spatial window self-attention: tokens along the spatial dimension are divided into different windows, similar to the spatial window attention in Swin. By performing self-attention within the spatial dimension, DaViT captures the relationships between different locations.

● Channel group self-attention: tokens along the channel dimension are divided into different groups (channel group attention). By performing self-attention at the channel level, DaViT captures the relationships between different features.

Dual attention block: it contains two transformer blocks, a spatial window self-attention block followed by a channel group self-attention block. By interleaving these two attention mechanisms, the DaViT model achieves both local fine-grained and global image-level interactions.

[Figure: structure of the dual attention block]
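Continuing the sketch above, the dual attention block can be outlined as follows. This is only a structural sketch: SpatialWindowAttention here is a plain unshifted-window stand-in built on nn.MultiheadAttention, and details such as DaViT's convolutional position encodings are omitted.

```python
import torch.nn as nn

class SpatialWindowAttention(nn.Module):
    """Window attention: self-attention runs independently inside each
    ws x ws window. Assumes H and W are divisible by ws."""
    def __init__(self, dim, heads, ws=7):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                   # x: (B, H*W, C)
        B, N, C = x.shape
        ws = self.ws
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # windows -> batch
        x, _ = self.attn(x, x, x, need_weights=False)
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x

def ffn(dim):                                     # standard transformer MLP
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class DualAttentionBlock(nn.Module):
    """Spatial window sub-block followed by a channel group sub-block; each
    has pre-norm residual attention plus its own FFN."""
    def __init__(self, dim, heads, ws=7):
        super().__init__()
        self.n1, self.n2, self.n3, self.n4 = (nn.LayerNorm(dim) for _ in range(4))
        self.spatial = SpatialWindowAttention(dim, heads, ws)
        self.channel = ChannelGroupAttention(dim, groups=heads)  # from the sketch above
        self.ffn_s, self.ffn_c = ffn(dim), ffn(dim)

    def forward(self, x, H, W):                   # x: (B, H*W, C)
        x = x + self.spatial(self.n1(x), H, W)    # local fine-grained interactions
        x = x + self.ffn_s(self.n2(x))
        x = x + self.channel(self.n3(x))          # global image-level interactions
        x = x + self.ffn_c(self.n4(x))
        return x
```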

DaViT has a pyramid structure consisting of 4 stages, each of which begins with a patch embedding layer. Within each stage the authors stack dual attention blocks, i.e., the two types of attention (each with its own FFN) alternate, while the resolution and feature dimension stay constant. The stem is a 7x7 convolution with stride 4, followed by the 4 stages; between stages, a 2x2 convolution with stride 2 performs the downsampling.
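A minimal sketch of that layout's embedding layers, using DaViT-Tiny-like widths (96 doubling to 768; assumed here for illustration):

```python
import torch
import torch.nn as nn

# Stem: 7x7 patch embedding with stride 4 -> feature map at H/4
stem = nn.Conv2d(3, 96, kernel_size=7, stride=4, padding=3)

# Between stages: 2x2 patch embedding with stride 2 halves the resolution
# and doubles the channel width -> H/8, H/16, H/32
downsamples = nn.ModuleList(
    nn.Conv2d(c, 2 * c, kernel_size=2, stride=2) for c in (96, 192, 384)
)

x = torch.randn(1, 3, 224, 224)
x = stem(x)                     # (1, 96, 56, 56)
for down in downsamples:
    x = down(x)                 # channels 96 -> 192 -> 384 -> 768
print(x.shape)                  # torch.Size([1, 768, 7, 7])
```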

The configurations of the three models DaViT-Tiny, DaViT-Small, and DaViT-Base are shown below:

[Table: DaViT-Tiny/Small/Base configurations]

Tutorial for adding this model as a backbone in a YOLOv5 project:

(1) In models/yolo.py of the YOLOv5 project, modify the parse_model function and the _forward_once method of BaseModel (sketched below):

[Screenshots: modified parse_model and _forward_once]
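The original article shows these edits only as screenshots. Below is a minimal sketch of the _forward_once change in the style such tutorials typically use, assuming the backbone wrapper from step (2) sets a backbone flag on itself and returns a list of feature maps [P3, P4, P5]; the profiling and visualization branches of the real method are omitted, and the matching parse_model change is sketched under step (3).

```python
# In BaseModel (models/yolo.py), simplified sketch:
def _forward_once(self, x, profile=False, visualize=False):
    y = []                                    # saved layer outputs
    for m in self.model:
        if m.f != -1:                         # input not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        if hasattr(m, 'backbone'):            # multi-output transformer backbone
            x = m(x)                          # e.g. [P3, P4, P5]
            for _ in range(5 - len(x)):       # pad so P3/P4/P5 sit at indices 2/3/4
                x.insert(0, None)
            for i, f in enumerate(x):
                y.append(f if i in self.save else None)
            x = x[-1]                         # deepest map feeds the next layer
        else:
            x = m(x)
            y.append(x if m.i in self.save else None)
    return x
```

The padding trick assumes the backbone is the first row of the yaml and occupies saved-output slots 0-4; the exact index bookkeeping varies between YOLOv5 versions.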

(2) Create a new Davit.py in the models/backbone directory and add the model code (see the sketch below):

[Screenshot: contents of Davit.py]
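The screenshot contains the article's full model code, which is not reproduced here. As a lighter-weight sketch of what Davit.py needs to expose, one can wrap the DaViT implementation shipped with timm; this assumes a timm version that includes DaViT (roughly >= 0.9), and the backbone flag and channel attribute are the conventions assumed in step (1).

```python
# models/backbone/Davit.py -- minimal sketch wrapping timm's DaViT
import torch.nn as nn
import timm

class DaViT(nn.Module):
    def __init__(self, variant='davit_tiny', pretrained=False):
        super().__init__()
        self.backbone = True                  # flag checked in _forward_once
        # features_only returns the feature map after each requested stage;
        # out_indices (1, 2, 3) correspond to strides 8, 16, 32
        self.model = timm.create_model(variant, pretrained=pretrained,
                                       features_only=True, out_indices=(1, 2, 3))
        self.channel = self.model.feature_info.channels()[-1]  # read by parse_model

    def forward(self, x):
        return list(self.model(x))            # [P3/8, P4/16, P5/32]
```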

(3) Import the model in models/yolo.py and register it in the parse_model function (import the file first), as sketched below:

[Screenshot: import and parse_model registration]
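A sketch of the two edits, under the same assumptions as above; the surrounding parse_model bookkeeping (argument evaluation, parameter counting, layer-index offsets for the backbone's five output slots) follows the existing branches of your YOLOv5 version.

```python
# At the top of models/yolo.py (assuming models/backbone is importable):
from models.backbone.Davit import DaViT

# Inside parse_model, alongside the existing per-module branches:
elif m is DaViT:
    m = m(*args)        # instantiate the whole backbone as a single layer
    c2 = m.channel      # its last output width, so later layers are sized correctly
```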

(4) Create a new configuration file under the models directory: yolov5_davit.yaml (see the sketch below):

[Screenshot: yolov5_davit.yaml]
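An illustrative configuration, not the article's exact file: it assumes the slot convention from step (1) (the DaViT row occupies saved-output indices 0-4, with P3/P4/P5 at 2/3/4), and the channel numbers are placeholders to tune.

```yaml
# models/yolov5_davit.yaml -- illustrative sketch
nc: 80
depth_multiple: 0.33
width_multiple: 0.50
anchors:
  - [10,13, 16,30, 33,23]        # P3/8
  - [30,61, 62,45, 59,119]       # P4/16
  - [116,90, 156,198, 373,326]   # P5/32

backbone:
  # [from, number, module, args]
  [[-1, 1, DaViT, []],           # 0-4: P3/8, P4/16, P5/32 feature maps
   [-1, 1, SPPF, [1024, 5]],     # 5
  ]

head:
  [[-1, 1, Conv, [512, 1, 1]],                  # 6
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 7
   [[-1, 3], 1, Concat, [1]],                   # 8: cat backbone P4
   [-1, 3, C3, [512, False]],                   # 9

   [-1, 1, Conv, [256, 1, 1]],                  # 10
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 11
   [[-1, 2], 1, Concat, [1]],                   # 12: cat backbone P3
   [-1, 3, C3, [256, False]],                   # 13 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],                  # 14
   [[-1, 10], 1, Concat, [1]],                  # 15
   [-1, 3, C3, [512, False]],                   # 16 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],                  # 17
   [[-1, 6], 1, Concat, [1]],                   # 18
   [-1, 3, C3, [1024, False]],                  # 19 (P5/32-large)

   [[13, 16, 19], 1, Detect, [nc, anchors]],    # Detect(P3, P4, P5)
  ]
```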

(5) Run verification: point the --cfg argument of models/yolo.py at the new yolov5_davit.yaml (either on the command line or by editing its default value):

[Screenshot: verification output]
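For example, from the project root:

```
python models/yolo.py --cfg models/yolov5_davit.yaml
```

If the yaml parses and the model builds, parse_model prints the usual layer table with parameter counts, which confirms the backbone is wired in correctly.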
