
YOLO Backbone Improvement Series: DaViT

Author: Nuist Object Detection

Abstract: In this work, we introduce the Dual Attention Vision Transformer (DaViT), a simple yet effective vision transformer architecture that captures global context while maintaining computational efficiency. We approach the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, while the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other.

(i) Since each channel token contains an abstract representation of the entire image, channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels, and (ii) spatial attention refines local representations through fine-grained interactions across spatial locations, which in turn helps the global information modeling of channel attention. Extensive experiments show that DaViT achieves state-of-the-art performance on four different tasks with efficient computation. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5 billion weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K.

The dual attention mechanism performs self-attention from two orthogonal perspectives:

(1) Self-attention over spatial tokens: the spatial dimension (HW) defines the number of tokens, and the channel dimension (C) defines the feature size of each token. This is the standard formulation used by ViT.

(2) Self-attention over channel tokens: exactly the opposite of the above, the channel dimension (C) defines the number of tokens, and the spatial dimension (HW) defines the feature size of each token.

The two kinds of self-attention are thus completely opposite in design. To reduce computation, both use grouped attention. For spatial tokens, the spatial dimension is divided into different windows, which is the window attention proposed in Swin; the paper calls it spatial window attention. For channel tokens, the channel dimension can likewise be divided into different groups, which the paper calls channel group attention. These two types of attention are shown in the following figure:

[Figure: spatial window attention vs. channel group attention]
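To make the orthogonal view concrete, here is a simplified PyTorch sketch of channel group attention, written from the description above rather than copied from the authors' code: channels are split into groups, and attention scores are computed between channels instead of between spatial positions.

```python
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    """Self-attention over channel tokens: the channel dimension C defines the
    number of tokens, and the flattened spatial dimension N = H*W defines each
    token's feature size. Channels are split into groups so complexity stays
    linear in the number of pixels."""
    def __init__(self, dim, groups=8):
        super().__init__()
        self.groups = groups
        self.scale = (dim // groups) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)      # each: (B, groups, C/g, N)
        attn = (q * self.scale) @ k.transpose(-2, -1)  # (B, g, C/g, C/g)
        attn = attn.softmax(dim=-1)               # channel-to-channel scores
        out = attn @ v                            # (B, g, C/g, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because every channel token aggregates all N spatial positions, each attention score already "sees" the whole image, which is exactly the global-interaction property the abstract claims.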

DaViT's dual attention mechanism includes the following two self-attention methods:

● Spatial window self-attention: tokens along the spatial dimension are divided into different windows, similar to the spatial window attention in Swin. By performing self-attention within the spatial dimension, DaViT captures the relationships between different locations.

● Channel group self-attention: tokens along the channel dimension are divided into different groups (channel group attention). By performing self-attention at the channel level, DaViT captures the relationships between different features.

Dual attention block: it contains two transformer blocks, a spatial window self-attention block followed by a channel group self-attention block. By interleaving these two attention mechanisms, the DaViT model achieves both local fine-grained and global image-level interactions.

[Figure: structure of the dual attention block]
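Continuing the sketch above, the dual attention block can be outlined as follows. This is only a structural sketch: SpatialWindowAttention here is a plain unshifted-window stand-in built on nn.MultiheadAttention, and details such as DaViT's convolutional position encodings are omitted.

```python
import torch.nn as nn

class SpatialWindowAttention(nn.Module):
    """Window attention: self-attention runs independently inside each
    ws x ws window. Assumes H and W are divisible by ws."""
    def __init__(self, dim, heads, ws=7):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                   # x: (B, H*W, C)
        B, N, C = x.shape
        ws = self.ws
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # windows -> batch
        x, _ = self.attn(x, x, x, need_weights=False)
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x

def ffn(dim):                                     # standard transformer MLP
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class DualAttentionBlock(nn.Module):
    """Spatial window sub-block followed by a channel group sub-block; each
    has pre-norm residual attention plus its own FFN."""
    def __init__(self, dim, heads, ws=7):
        super().__init__()
        self.n1, self.n2, self.n3, self.n4 = (nn.LayerNorm(dim) for _ in range(4))
        self.spatial = SpatialWindowAttention(dim, heads, ws)
        self.channel = ChannelGroupAttention(dim, groups=heads)  # from the sketch above
        self.ffn_s, self.ffn_c = ffn(dim), ffn(dim)

    def forward(self, x, H, W):                   # x: (B, H*W, C)
        x = x + self.spatial(self.n1(x), H, W)    # local fine-grained interactions
        x = x + self.ffn_s(self.n2(x))
        x = x + self.channel(self.n3(x))          # global image-level interactions
        x = x + self.ffn_c(self.n4(x))
        return x
```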

DaViT has a pyramid structure consisting of 4 stages, each of which begins with a patch embedding layer. Within each stage the authors stack dual attention blocks, i.e., the two types of attention (each with its own FFN) alternate, while the resolution and feature dimension stay constant. The stem is a 7x7 convolution with stride 4, followed by the 4 stages; between stages, a 2x2 convolution with stride 2 performs the downsampling.
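A minimal sketch of that layout's embedding layers, using DaViT-Tiny-like widths (96 doubling to 768; assumed here for illustration):

```python
import torch
import torch.nn as nn

# Stem: 7x7 patch embedding with stride 4 -> feature map at H/4
stem = nn.Conv2d(3, 96, kernel_size=7, stride=4, padding=3)

# Between stages: 2x2 patch embedding with stride 2 halves the resolution
# and doubles the channel width -> H/8, H/16, H/32
downsamples = nn.ModuleList(
    nn.Conv2d(c, 2 * c, kernel_size=2, stride=2) for c in (96, 192, 384)
)

x = torch.randn(1, 3, 224, 224)
x = stem(x)                     # (1, 96, 56, 56)
for down in downsamples:
    x = down(x)                 # channels 96 -> 192 -> 384 -> 768
print(x.shape)                  # torch.Size([1, 768, 7, 7])
```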

The configurations of the three models DaViT-Tiny, DaViT-Small, and DaViT-Base are shown below:

[Table: DaViT-Tiny/Small/Base configurations]

Tutorial for adding this model as a backbone in a YOLOv5 project:

(1) In models/yolo.py of the YOLOv5 project, modify the parse_model function and the _forward_once method of BaseModel (sketched below):

[Screenshots: modified parse_model and _forward_once]
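The original article shows these edits only as screenshots. Below is a minimal sketch of the _forward_once change in the style such tutorials typically use, assuming the backbone wrapper from step (2) sets a backbone flag on itself and returns a list of feature maps [P3, P4, P5]; the profiling and visualization branches of the real method are omitted, and the matching parse_model change is sketched under step (3).

```python
# In BaseModel (models/yolo.py), simplified sketch:
def _forward_once(self, x, profile=False, visualize=False):
    y = []                                    # saved layer outputs
    for m in self.model:
        if m.f != -1:                         # input not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        if hasattr(m, 'backbone'):            # multi-output transformer backbone
            x = m(x)                          # e.g. [P3, P4, P5]
            for _ in range(5 - len(x)):       # pad so P3/P4/P5 sit at indices 2/3/4
                x.insert(0, None)
            for i, f in enumerate(x):
                y.append(f if i in self.save else None)
            x = x[-1]                         # deepest map feeds the next layer
        else:
            x = m(x)
            y.append(x if m.i in self.save else None)
    return x
```

The padding trick assumes the backbone is the first row of the yaml and occupies saved-output slots 0-4; the exact index bookkeeping varies between YOLOv5 versions.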

(2) Create a new Davit.py in the models/backbone directory and add the model code (see the sketch below):

[Screenshot: contents of Davit.py]
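The screenshot contains the article's full model code, which is not reproduced here. As a lighter-weight sketch of what Davit.py needs to expose, one can wrap the DaViT implementation shipped with timm; this assumes a timm version that includes DaViT (roughly >= 0.9), and the backbone flag and channel attribute are the conventions assumed in step (1).

```python
# models/backbone/Davit.py -- minimal sketch wrapping timm's DaViT
import torch.nn as nn
import timm

class DaViT(nn.Module):
    def __init__(self, variant='davit_tiny', pretrained=False):
        super().__init__()
        self.backbone = True                  # flag checked in _forward_once
        # features_only returns the feature map after each requested stage;
        # out_indices (1, 2, 3) correspond to strides 8, 16, 32
        self.model = timm.create_model(variant, pretrained=pretrained,
                                       features_only=True, out_indices=(1, 2, 3))
        self.channel = self.model.feature_info.channels()[-1]  # read by parse_model

    def forward(self, x):
        return list(self.model(x))            # [P3/8, P4/16, P5/32]
```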

(3) Import the model in models/yolo.py and register it in the parse_model function (import the file first), as sketched below:

[Screenshot: import and parse_model registration]
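A sketch of the two edits, under the same assumptions as above; the surrounding parse_model bookkeeping (argument evaluation, parameter counting, layer-index offsets for the backbone's five output slots) follows the existing branches of your YOLOv5 version.

```python
# At the top of models/yolo.py (assuming models/backbone is importable):
from models.backbone.Davit import DaViT

# Inside parse_model, alongside the existing per-module branches:
elif m is DaViT:
    m = m(*args)        # instantiate the whole backbone as a single layer
    c2 = m.channel      # its last output width, so later layers are sized correctly
```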

(4) Create a new configuration file under the models directory: yolov5_davit.yaml (see the sketch below):

[Screenshot: yolov5_davit.yaml]
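An illustrative configuration, not the article's exact file: it assumes the slot convention from step (1) (the DaViT row occupies saved-output indices 0-4, with P3/P4/P5 at 2/3/4), and the channel numbers are placeholders to tune.

```yaml
# models/yolov5_davit.yaml -- illustrative sketch
nc: 80
depth_multiple: 0.33
width_multiple: 0.50
anchors:
  - [10,13, 16,30, 33,23]        # P3/8
  - [30,61, 62,45, 59,119]       # P4/16
  - [116,90, 156,198, 373,326]   # P5/32

backbone:
  # [from, number, module, args]
  [[-1, 1, DaViT, []],           # 0-4: P3/8, P4/16, P5/32 feature maps
   [-1, 1, SPPF, [1024, 5]],     # 5
  ]

head:
  [[-1, 1, Conv, [512, 1, 1]],                  # 6
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 7
   [[-1, 3], 1, Concat, [1]],                   # 8: cat backbone P4
   [-1, 3, C3, [512, False]],                   # 9

   [-1, 1, Conv, [256, 1, 1]],                  # 10
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 11
   [[-1, 2], 1, Concat, [1]],                   # 12: cat backbone P3
   [-1, 3, C3, [256, False]],                   # 13 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],                  # 14
   [[-1, 10], 1, Concat, [1]],                  # 15
   [-1, 3, C3, [512, False]],                   # 16 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],                  # 17
   [[-1, 6], 1, Concat, [1]],                   # 18
   [-1, 3, C3, [1024, False]],                  # 19 (P5/32-large)

   [[13, 16, 19], 1, Detect, [nc, anchors]],    # Detect(P3, P4, P5)
  ]
```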

(5) Run verification: point the --cfg argument of models/yolo.py at the new yolov5_davit.yaml (either on the command line or by editing its default value):

[Screenshot: verification output]
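For example, from the project root:

```
python models/yolo.py --cfg models/yolov5_davit.yaml
```

If the yaml parses and the model builds, parse_model prints the usual layer table with parameter counts, which confirms the backbone is wired in correctly.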
