
YOLO Backbone Improvement Series: MogaNet

Convolutional neural networks (ConvNets) have long been the method of choice for computer vision. Inspired by the primate visual system, convolutional layers encode neighborhood correlations of the observed image under inductive biases such as local dense connectivity and translation equivariance. By stacking layers hierarchically, ConvNets obtain progressively larger receptive fields and are adept at recognizing underlying semantic patterns. However, the representations extracted by ConvNets have been shown to be strongly biased toward local texture, losing much of the global contextual information of the visual target. In contrast, by relaxing local inductive biases, ViT and its variants quickly surpassed ConvNets on a broad range of vision benchmarks. ViT's capability derives primarily from the self-attention mechanism, which enables long-range interactions regardless of topological distance. However, the quadratic complexity of self-attention limits ViT's computational efficiency and its applicability to fine-grained downstream tasks.

Still, existing methods share a representational bottleneck: naive implementations of self-attention or large-kernel convolutions hinder the modeling of discriminative contextual information and global interactions, leaving a cognitive gap between DNNs and the human visual system. According to feature integration theory, the human brain not only extracts local features but simultaneously aggregates them into a global percept, which is more compact and efficient than what DNNs achieve. To address this challenge, the authors study the representational capacity of DNNs from the perspective of feature-interaction complexity. They design a macro ConvNet framework with corresponding basic operations and develop a new family of ConvNets, Multi-Order Gated Aggregation Networks (MogaNet), that efficiently aggregates contextual information across multiple interaction complexities. Motivated by human vision, MogaNet introduces a multi-order feature aggregation module. The design encapsulates local perception and context aggregation in a unified spatial aggregation block, in which composite multi-order associations are efficiently aggregated and contextualized through a parallel gating mechanism.

On the channel side, because existing methods tend to incur high channel-wise information redundancy, a simple yet efficient channel aggregation block is designed; it adaptively redistributes the channels of the input features and significantly outperforms mainstream counterparts (e.g., SE modules) at a lower computational cost.

The overall framework of MogaNet is shown in the figure below. The architecture closely mirrors that of a typical Transformer network, with two core modules: spatial aggregation (in place of attention) and channel aggregation (in place of the FFN).

[Figure: MogaNet overall architecture]

Spatial aggregation is shown in the figure below. The blue part, called Feature Decomposition, is used to exclude trivial interactions. The Moga module combines multiple depthwise convolutions of different scales, which the authors interpret as multi-order gating.

[Figure: spatial aggregation module]
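The gated spatial aggregation described above can be sketched in PyTorch roughly as follows. This is a simplified illustration, not the paper's exact implementation: the channel split ratios, kernel sizes, and dilations are assumptions chosen only to show the multi-order structure and the parallel gating.

```python
import torch
import torch.nn as nn

class MultiOrderDWConv(nn.Module):
    """Sketch: split channels into three groups and apply depthwise convs
    with increasing dilation, so different channel groups capture
    low-, middle-, and high-order (larger-range) context."""
    def __init__(self, dim):
        super().__init__()
        self.c1 = dim // 8                      # low-order branch (ratio illustrative)
        self.c2 = dim // 4                      # middle-order branch
        self.c3 = dim - self.c1 - self.c2       # high-order branch
        self.dw0 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw1 = nn.Conv2d(self.c2, self.c2, 5, padding=4, dilation=2, groups=self.c2)
        self.dw2 = nn.Conv2d(self.c3, self.c3, 7, padding=9, dilation=3, groups=self.c3)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        x = self.dw0(x)
        y1 = x[:, :self.c1]                                 # keep as-is
        y2 = self.dw1(x[:, self.c1:self.c1 + self.c2])      # dilated 5x5
        y3 = self.dw2(x[:, self.c1 + self.c2:])             # dilated 7x7
        return self.proj(torch.cat([y1, y2, y3], dim=1))

class MogaBlockSketch(nn.Module):
    """A gating branch (1x1 conv) is multiplied element-wise with the
    multi-order context branch, then projected back."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, 1)
        self.value = MultiOrderDWConv(dim)
        self.proj = nn.Conv2d(dim, dim, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.proj(self.act(self.gate(x)) * self.act(self.value(x)))
```

The paddings are chosen so every branch preserves the spatial resolution, which lets the three channel groups be concatenated directly.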

Channel aggregation is shown in the figure below. The mainstream FFN consists of only two fully connected layers, so the authors add a channel-aggregation step, which acts on channels much as the spatial aggregation block acts on spatial positions.

[Figure: channel aggregation module]
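The idea can be sketched as an FFN augmented with a cheap redistribution step. This is a hedged approximation of the block: the expansion ratio, the single-channel compression, and the learned scale `sigma` are illustrative assumptions, not the verified paper settings.

```python
import torch
import torch.nn as nn

class ChannelAggregationFFN(nn.Module):
    """Sketch: a standard conv-FFN plus a lightweight channel-aggregation
    step that compresses the hidden features to one map and feeds the
    difference back, adaptively re-weighting channels at small cost."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.agg = nn.Conv2d(hidden, 1, 1)   # compress all hidden channels to one map
        self.sigma = nn.Parameter(1e-5 * torch.ones(1, hidden, 1, 1))
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        x = self.act(self.dw(self.fc1(x)))
        # adaptive channel redistribution: broadcast the aggregated map
        # back over all hidden channels
        x = x + self.sigma * (x - self.act(self.agg(x)))
        return self.fc2(x)
```

Compared with an SE module, the aggregation here is a single 1x1 convolution plus a broadcast subtraction, which is why the cost stays low.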

The configurations of the MogaNet model variants are as follows:

[Table: MogaNet model variants]

Tutorial for adding a model as a backbone in a YOLOv5 project:

(1) In models/yolo.py of the YOLOv5 project, modify the parse_model function and the _forward_once function of the BaseModel class

[Screenshot: modified parse_model and BaseModel._forward_once in models/yolo.py]
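The usual reason _forward_once needs changing is that a whole-backbone module returns several feature maps at once, while the stock loop expects one tensor per layer. The following self-contained sketch shows that pattern in isolation (DummyBackbone and forward_once are illustrative stand-ins, not YOLOv5 code): when a module emits a list, cache every scale so later layers can index them.

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Stand-in for a backbone that returns multiple feature maps."""
    def forward(self, x):
        return [x, x.mean(dim=(2, 3), keepdim=True)]  # two "scales"

def forward_once(layers, x):
    y = []                          # cache of intermediate outputs
    for m in layers:
        x = m(x)
        if isinstance(x, list):     # backbone emitted several feature maps
            y.extend(x)             # save each scale individually
            x = x[-1]               # continue from the deepest one
        else:
            y.append(x)
    return x, y

out, cache = forward_once([DummyBackbone()], torch.randn(1, 3, 8, 8))
```

The real edit applies the same `isinstance(x, list)` branch inside BaseModel._forward_once, keeping the rest of the loop intact.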

(2) Create a new moganet.py in the models/backbone directory and add the following code:

[Screenshot: moganet.py backbone code]
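A minimal sketch of what such a file looks like structurally, under the assumption that the backbone should hand the YOLOv5 neck three feature maps at strides 8/16/32. Plain conv blocks stand in for the real MogaNet stages here; the class name, channel widths, and stage layout are all illustrative.

```python
import torch
import torch.nn as nn

class MogaNetBackbone(nn.Module):
    """Hypothetical skeleton of models/backbone/moganet.py: a staged,
    progressively downsampling backbone that returns its last three
    feature maps for the detection neck."""
    def __init__(self, in_ch=3, dims=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Sequential(          # stride 2
            nn.Conv2d(in_ch, dims[0], 3, stride=2, padding=1),
            nn.BatchNorm2d(dims[0]),
            nn.SiLU(),
        )
        self.stages = nn.ModuleList()
        c = dims[0]
        for d in dims:                      # each stage halves resolution
            self.stages.append(nn.Sequential(
                nn.Conv2d(c, d, 3, stride=2, padding=1),
                nn.BatchNorm2d(d),
                nn.SiLU(),
            ))
            c = d

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats[-3:]   # P3/P4/P5 at strides 8, 16, 32
```

In a real port, each `nn.Sequential` stage would be replaced by a stack of MogaNet spatial- and channel-aggregation blocks.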

(3) Import the model in models/yolo.py and modify the parse_model function as follows (import the file first):

[Screenshot: parse_model modifications in models/yolo.py]
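The shape of that parse_model change can be sketched as a dispatch branch: when the yaml names the backbone class, instantiate it directly and record the channel width of each feature map it returns so the head layers can be wired to them. Everything here (the stand-in class, `out_channels`, `build_layer`) is illustrative; the real edit is an extra `elif` inside parse_model.

```python
import torch.nn as nn

class MogaNetBackbone(nn.Module):
    """Stand-in for the class imported from models/backbone/moganet.py."""
    out_channels = (64, 128, 256)   # channels of the three returned scales (illustrative)

def build_layer(m, args):
    """Sketch of the new branch added to parse_model()."""
    if m is MogaNetBackbone:
        module = m(*args)            # instantiate the whole backbone as one layer
        c2 = list(m.out_channels)    # channel widths for downstream wiring
    else:
        raise NotImplementedError(m)
    return module, c2

mod, c2 = build_layer(MogaNetBackbone, [])
```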

(4) Create a new configuration file under the models directory: yolov5_moganet.yaml

[Screenshot: yolov5_moganet.yaml configuration]
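A hedged sketch of what such a yaml might look like (all values illustrative): the entire backbone section collapses to a single entry for the new module, and the head then references the cached feature maps by index.

```yaml
# Hypothetical yolov5_moganet.yaml sketch -- values are illustrative
nc: 80                 # number of classes
depth_multiple: 0.33
width_multiple: 0.50
anchors:
  - [10, 13, 16, 30, 33, 23]       # P3/8
  - [30, 61, 62, 45, 59, 119]      # P4/16
  - [116, 90, 156, 198, 373, 326]  # P5/32

backbone:
  # [from, number, module, args]
  [[-1, 1, MogaNetBackbone, []]]

# The standard YOLOv5 PANet head follows, with each `from` index
# pointing at one of the three feature maps cached from the backbone.
```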

(5) Run a verification: in models/yolo.py, set the --cfg argument to the newly created yolov5_moganet.yaml and run the script

[Screenshot: model build verification output]
