
YOLO Algorithm Improvement Backbone Series: HAT-Net

Author: Nuist Object Detection

This paper aims to solve the problem of the high computational and spatial complexity of multi-head self-attention (MHSA) in ViT. To this end, we propose hierarchical multi-head self-attention (H-MHSA), a new method that computes self-attention in a hierarchical manner. Specifically, we first divide the input image into patches in the usual way, with each patch treated as a token. H-MHSA then learns the relationships among tokens within local patches and models them as local relationships. The small patches are then merged into larger patches, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attention features are aggregated to obtain features with strong representational ability. Since attention is computed over only a limited number of tokens at each step, the computational load is greatly reduced.

As a result, H-MHSA can effectively model global relationships among tokens without sacrificing fine-grained information. Using the H-MHSA module, we build a family of Transformer networks based on hierarchical attention, named HAT-Net. To demonstrate the superiority of HAT-Net for scene understanding, we conduct extensive experiments on fundamental vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus offers a new perspective for vision Transformers.

Current problems and solutions: Transformer has become the de facto standard for handling long-range dependencies in NLP, relying on the self-attention mechanism to model global relationships in sequence data. With the emergence of ViT, the representative work on vision Transformers, building Transformer models on image patches has become the mainstream paradigm for vision Transformers. However, because patch sequences in visual data are long, the self-attention operation still suffers from high computational cost and spatial complexity in practical applications.

Recent works mainly try to improve the computational efficiency of vision Transformers by reducing the sequence length in various ways, including the following:

  • Local attention: Swin Transformer computes attention within fixed-size shifted windows and stacks multiple layers to approximate global modeling. This remains suboptimal because it carries over the CNN idea of approximating long-range dependencies by stacking layers
  • Pooling attention: PVT downsamples the feature map to reduce the sequence length. However, downsampling the key and value loses local detail, and the fixed downsampling ratio is implemented with a strided convolution whose stride equals its kernel size, so the model must be retrained whenever this configuration is adjusted
  • Channel attention: CoaT computes attention along the channel dimension, which may be less effective at modeling global spatial feature dependencies

A more effective and flexible variant of MHSA is proposed: hierarchical multi-head self-attention (H-MHSA). By decomposing MHSA, which directly computes the global similarity relationship, into multiple steps that each model similarity over short sequences at a different granularity, H-MHSA retains fine-grained information while keeping the efficiency of short-sequence computation. Moreover, the operations H-MHSA uses to shorten sequences are parameter-free, so it is flexible for downstream tasks and does not need to be re-pretrained when the configuration is adjusted. Specifically, H-MHSA consists of the following steps:

  • The patch tokens corresponding to the input query, key, and value are grouped into several non-overlapping grids
  • Attention is computed among the patches within each grid to capture local relationships and produce more discriminative local representations; this step uses a residual form
  • These small patches are merged into a larger tier of patch tokens, so that global dependencies can be modeled directly over this smaller number of coarse-grained tokens. In this computation, the key and value are compressed using average pooling.
  • Finally, features from the local and global levels are fused, yielding features with richer granularity (a minimal sketch of these steps is given below)
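
The following is a minimal, single-head PyTorch sketch of these steps, written from the description above. The grid size, the single-head simplification, and the tensor layout are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HMHSA(nn.Module):
    """Single-head hierarchical self-attention over a (B, H*W, C) token sequence."""

    def __init__(self, dim, grid_size=8):
        super().__init__()
        self.grid = grid_size
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W; H and W must be divisible by the grid size.
        B, N, C = x.shape
        G = self.grid
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Step 1: local attention inside non-overlapping G x G grids (residual form).
        def to_grids(t):
            t = t.reshape(B, H // G, G, W // G, G, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, G * G, C)

        attn = (to_grids(q) @ to_grids(k).transpose(-2, -1)) * self.scale
        local = attn.softmax(dim=-1) @ to_grids(v)               # (B * #grids, G*G, C)
        local = local.reshape(B, H // G, W // G, G, G, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        local = local + x                                         # residual connection

        # Step 2: global attention; key and value are compressed by average pooling.
        coarse = local.transpose(1, 2).reshape(B, C, H, W)
        coarse = F.avg_pool2d(coarse, G).flatten(2).transpose(1, 2)   # (B, N / G**2, C)
        _, kg, vg = self.qkv(coarse).chunk(3, dim=-1)
        attn = (q @ kg.transpose(-2, -1)) * self.scale
        global_feat = attn.softmax(dim=-1) @ vg                   # (B, N, C)

        # Step 3: fuse local and global features.
        return self.proj(local + global_feat)
```

For example, calling HMHSA(dim=64) on a (2, 1024, 64) token tensor with H = W = 32 returns a tensor of the same shape; a multi-head version would additionally split the channel dimension into heads in the usual way.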

The HAT-Net model is provided in several configurations of different sizes; the full configuration table can be found in the original paper.

Tutorial for adding a model as a backbone in a YOLOv5 project:

(1) Modify models/yolo.py in the YOLOv5 project: adapt the parse_model function and the _forward_once function of BaseModel (a sketch of the kind of change follows):

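The original post shows this change only as screenshots; the sketch below illustrates one workable adaptation of _forward_once, assuming the new backbone module returns a list of feature maps instead of a single tensor. It is not the post's exact code, and the profiling/visualization hooks of the original function are omitted for brevity.

```python
# Sketch of an adapted BaseModel._forward_once in models/yolo.py. Assumption: the
# HAT-Net wrapper returns a list of feature maps (e.g. strides 8/16/32).
def _forward_once(self, x, profile=False, visualize=False):
    y = []  # saved outputs, referenced by the 'from' fields of the yaml
    for m in self.model:
        if m.f != -1:  # input does not come only from the previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        if isinstance(x, list):
            # Backbone returned several feature maps: store each one so later layers
            # can reference it, and keep the last (deepest) map as the running tensor.
            # The 'from' indices in the yaml must be written to match this layout.
            y.extend(x)
            x = x[-1]
        else:
            y.append(x if m.i in self.save else None)
    return x
```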

(2) Create a new file HAT_Net.py in the models/backbone folder and add the model code (a simplified skeleton is sketched below):

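The full code appears only as screenshots in the original post, so below is a skeleton of what models/backbone/HAT_Net.py could contain: a wrapper class (here assumed to be named hat_net, and used with that name in the following steps) that stacks four stages at strides 4/8/16/32 and returns the last three feature maps. The Block placeholder and the channel/depth settings are illustrative; a real port would use the H-MHSA transformer blocks from the paper or the authors' repository.

```python
import torch.nn as nn


class Block(nn.Module):
    """Placeholder residual block standing in for a HAT-Net (H-MHSA + MLP) block."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )

    def forward(self, x):
        return x + self.body(x)


class hat_net(nn.Module):
    """Illustrative 4-stage backbone; returns the stride-8/16/32 feature maps."""

    def __init__(self, pretrained=False, dims=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            downsample = [nn.Conv2d(in_ch, dim, 3, stride=2, padding=1)]
            if i == 0:  # the first stage reduces resolution by 4 (two stride-2 convs)
                downsample.append(nn.Conv2d(dim, dim, 3, stride=2, padding=1))
            self.stages.append(nn.Sequential(*downsample, *[Block(dim) for _ in range(depth)]))
            in_ch = dim
        self.channel = list(dims[1:])  # output channels of the returned feature maps

    def forward(self, x):
        outs = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i > 0:  # skip the stride-4 stage
                outs.append(x)
        return outs  # [C3, C4, C5], consumed by the modified _forward_once
```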

(3) Import the model in models/yolo.py and modify the parse_model function accordingly (import the file first); an illustrative snippet follows:

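Again, the exact change was shown as screenshots; the snippet below is an assumption about one way to register a backbone that returns multiple feature maps inside parse_model.

```python
from models.backbone.HAT_Net import hat_net  # new import at the top of models/yolo.py

# ...inside parse_model(d, ch), in the loop over the yaml layer entries:
#
#     if m is hat_net:
#         m_ = m(*args)        # build the backbone module directly from the yaml args
#         c2 = m_.channel      # list of output channels, e.g. [128, 256, 512]
#         ch.extend(c2)        # one channel entry per returned feature map; the rest of
#                              # the loop's bookkeeping must also account for this case
#     elif m in {Conv, Bottleneck, SPPF, C3, ...}:
#         ...                  # existing YOLOv5 handling, unchanged
```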

(4) Create a new configuration file under the models directory: yolov5_hatnet.yaml (a fragment is sketched below):

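The yaml from the post is not reproduced here; the fragment below shows how the backbone entry might look, assuming the hat_net wrapper above. The head section and its 'from' indices must still be filled in to match the standard yolov5s head and the layout produced by the modified _forward_once.

```yaml
# models/yolov5_hatnet.yaml (illustrative fragment)
nc: 80                # number of classes
depth_multiple: 0.33
width_multiple: 0.50
anchors:
  - [10,13, 16,30, 33,23]        # P3/8
  - [30,61, 62,45, 59,119]       # P4/16
  - [116,90, 156,198, 373,326]   # P5/32

backbone:
  # [from, number, module, args]
  [[-1, 1, hat_net, [False]]]    # returns the stride-8/16/32 feature maps

# head: reuse the standard yolov5s head here, adjusting each 'from' index so that it
# points at the backbone feature maps stored by the modified _forward_once.
```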

(5) Run a verification: in models/yolo.py, pass the newly created yolov5_hatnet.yaml via the --cfg parameter and check that the model builds and prints its layer summary:

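For example, from the YOLOv5 project root (the path assumes the yaml was saved under models/):

```bash
python models/yolo.py --cfg models/yolov5_hatnet.yaml
```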
