
YOLO Algorithm Improvement Backbone Series: HAT-Net

Author: Nuist Object Detection

This paper aims to solve the problem of the high computational and spatial complexity of multi-head self-attention (MHSA) in ViT. To this end, we propose hierarchical multi-head self-attention (H-MHSA), a new method that computes self-attention in a hierarchical manner. Specifically, we first divide the input image into patches in the usual way, with each patch treated as a token. H-MHSA then learns the relationships among tokens within local patches and models them as local relationships. The small patches are then merged into larger patches, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attention features are aggregated to obtain features with strong representational ability. Since attention is computed over only a limited number of tokens at each step, the computational load is greatly reduced.

As a result, H-MHSA can effectively model global relationships among tokens without sacrificing fine-grained information. Using the H-MHSA module, we build a family of Transformer networks based on hierarchical attention, named HAT-Net. To demonstrate the superiority of HAT-Net for scene understanding, we conduct extensive experiments on fundamental vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus offers a new perspective for vision Transformers.

Current problems and solutions: Transformer has become the de facto standard for handling long-range dependencies in NLP, relying on the self-attention mechanism to model global relationships in sequence data. With the emergence of ViT, the representative work on vision Transformers, building Transformer models on image patches has become the mainstream paradigm for vision Transformers. However, because patch sequences in visual data are long, the self-attention operation still suffers from high computational cost and spatial complexity in practical applications.

Recent works mainly try to improve the computational efficiency of vision Transformers by reducing the sequence length in various ways, including the following:

  • Local attention: Swin Transformer computes attention within fixed-size shifted windows and stacks multiple layers to approximate global modeling. This remains suboptimal because it carries over the CNN idea of approximating long-range dependencies by stacking layers
  • Pooling attention: PVT downsamples the feature map to reduce the sequence length. However, downsampling the key and value loses local detail, and the fixed downsampling ratio is implemented with a strided convolution whose stride equals its kernel size, so the model must be retrained whenever this configuration is adjusted
  • Channel attention: CoaT computes attention along the channel dimension, which may be less effective at modeling global spatial feature dependencies

A more effective and flexible variant of MHSA is proposed: hierarchical multi-head self-attention (H-MHSA). By decomposing MHSA, which directly computes the global similarity relationship, into multiple steps that each model similarity over short sequences at a different granularity, H-MHSA retains fine-grained information while keeping the efficiency of short-sequence computation. Moreover, the operations H-MHSA uses to shorten sequences are parameter-free, so it is flexible for downstream tasks and does not need to be re-pretrained when the configuration is adjusted. Specifically, H-MHSA consists of the following steps:

  • The patch tokens corresponding to the input query, key, and value are grouped into several non-overlapping grids
  • Attention is computed among the patches within each grid to capture local relationships and produce more discriminative local representations; this step uses a residual form
  • These small patches are merged into a larger tier of patch tokens, so that global dependencies can be modeled directly over this smaller number of coarse-grained tokens. In this computation, the key and value are compressed using average pooling.
  • Finally, features from the local and global levels are fused, yielding features with richer granularity (a minimal sketch of these steps is given below)
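
The following is a minimal, single-head PyTorch sketch of these steps, written from the description above. The grid size, the single-head simplification, and the tensor layout are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HMHSA(nn.Module):
    """Single-head hierarchical self-attention over a (B, H*W, C) token sequence."""

    def __init__(self, dim, grid_size=8):
        super().__init__()
        self.grid = grid_size
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W; H and W must be divisible by the grid size.
        B, N, C = x.shape
        G = self.grid
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Step 1: local attention inside non-overlapping G x G grids (residual form).
        def to_grids(t):
            t = t.reshape(B, H // G, G, W // G, G, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, G * G, C)

        attn = (to_grids(q) @ to_grids(k).transpose(-2, -1)) * self.scale
        local = attn.softmax(dim=-1) @ to_grids(v)               # (B * #grids, G*G, C)
        local = local.reshape(B, H // G, W // G, G, G, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        local = local + x                                         # residual connection

        # Step 2: global attention; key and value are compressed by average pooling.
        coarse = local.transpose(1, 2).reshape(B, C, H, W)
        coarse = F.avg_pool2d(coarse, G).flatten(2).transpose(1, 2)   # (B, N / G**2, C)
        _, kg, vg = self.qkv(coarse).chunk(3, dim=-1)
        attn = (q @ kg.transpose(-2, -1)) * self.scale
        global_feat = attn.softmax(dim=-1) @ vg                   # (B, N, C)

        # Step 3: fuse local and global features.
        return self.proj(local + global_feat)
```

For example, calling HMHSA(dim=64) on a (2, 1024, 64) token tensor with H = W = 32 returns a tensor of the same shape; a multi-head version would additionally split the channel dimension into heads in the usual way.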

The HAT-Net model is provided in several configurations of different sizes; the full configuration table can be found in the original paper.

Tutorial for adding a model as a backbone in a YOLOv5 project:

(1) Modify models/yolo.py in the YOLOv5 project: adapt the parse_model function and the _forward_once function of BaseModel (a sketch of the kind of change follows):

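The original post shows this change only as screenshots; the sketch below illustrates one workable adaptation of _forward_once, assuming the new backbone module returns a list of feature maps instead of a single tensor. It is not the post's exact code, and the profiling/visualization hooks of the original function are omitted for brevity.

```python
# Sketch of an adapted BaseModel._forward_once in models/yolo.py. Assumption: the
# HAT-Net wrapper returns a list of feature maps (e.g. strides 8/16/32).
def _forward_once(self, x, profile=False, visualize=False):
    y = []  # saved outputs, referenced by the 'from' fields of the yaml
    for m in self.model:
        if m.f != -1:  # input does not come only from the previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        if isinstance(x, list):
            # Backbone returned several feature maps: store each one so later layers
            # can reference it, and keep the last (deepest) map as the running tensor.
            # The 'from' indices in the yaml must be written to match this layout.
            y.extend(x)
            x = x[-1]
        else:
            y.append(x if m.i in self.save else None)
    return x
```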

(2) Create a new file HAT_Net.py in the models/backbone folder and add the model code (a simplified skeleton is sketched below):

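The full code appears only as screenshots in the original post, so below is a skeleton of what models/backbone/HAT_Net.py could contain: a wrapper class (here assumed to be named hat_net, and used with that name in the following steps) that stacks four stages at strides 4/8/16/32 and returns the last three feature maps. The Block placeholder and the channel/depth settings are illustrative; a real port would use the H-MHSA transformer blocks from the paper or the authors' repository.

```python
import torch.nn as nn


class Block(nn.Module):
    """Placeholder residual block standing in for a HAT-Net (H-MHSA + MLP) block."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )

    def forward(self, x):
        return x + self.body(x)


class hat_net(nn.Module):
    """Illustrative 4-stage backbone; returns the stride-8/16/32 feature maps."""

    def __init__(self, pretrained=False, dims=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            downsample = [nn.Conv2d(in_ch, dim, 3, stride=2, padding=1)]
            if i == 0:  # the first stage reduces resolution by 4 (two stride-2 convs)
                downsample.append(nn.Conv2d(dim, dim, 3, stride=2, padding=1))
            self.stages.append(nn.Sequential(*downsample, *[Block(dim) for _ in range(depth)]))
            in_ch = dim
        self.channel = list(dims[1:])  # output channels of the returned feature maps

    def forward(self, x):
        outs = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i > 0:  # skip the stride-4 stage
                outs.append(x)
        return outs  # [C3, C4, C5], consumed by the modified _forward_once
```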

(3) Import the model in models/yolo.py and modify the parse_model function accordingly (import the file first); an illustrative snippet follows:

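Again, the exact change was shown as screenshots; the snippet below is an assumption about one way to register a backbone that returns multiple feature maps inside parse_model.

```python
from models.backbone.HAT_Net import hat_net  # new import at the top of models/yolo.py

# ...inside parse_model(d, ch), in the loop over the yaml layer entries:
#
#     if m is hat_net:
#         m_ = m(*args)        # build the backbone module directly from the yaml args
#         c2 = m_.channel      # list of output channels, e.g. [128, 256, 512]
#         ch.extend(c2)        # one channel entry per returned feature map; the rest of
#                              # the loop's bookkeeping must also account for this case
#     elif m in {Conv, Bottleneck, SPPF, C3, ...}:
#         ...                  # existing YOLOv5 handling, unchanged
```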

(4) Create a new configuration file under the models directory: yolov5_hatnet.yaml (a fragment is sketched below):

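The yaml from the post is not reproduced here; the fragment below shows how the backbone entry might look, assuming the hat_net wrapper above. The head section and its 'from' indices must still be filled in to match the standard yolov5s head and the layout produced by the modified _forward_once.

```yaml
# models/yolov5_hatnet.yaml (illustrative fragment)
nc: 80                # number of classes
depth_multiple: 0.33
width_multiple: 0.50
anchors:
  - [10,13, 16,30, 33,23]        # P3/8
  - [30,61, 62,45, 59,119]       # P4/16
  - [116,90, 156,198, 373,326]   # P5/32

backbone:
  # [from, number, module, args]
  [[-1, 1, hat_net, [False]]]    # returns the stride-8/16/32 feature maps

# head: reuse the standard yolov5s head here, adjusting each 'from' index so that it
# points at the backbone feature maps stored by the modified _forward_once.
```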

(5) Run a verification: in models/yolo.py, pass the newly created yolov5_hatnet.yaml via the --cfg parameter and check that the model builds and prints its layer summary:

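For example, from the YOLOv5 project root (the path assumes the yaml was saved under models/):

```bash
python models/yolo.py --cfg models/yolov5_hatnet.yaml
```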
