
YOLO Backbone Improvement Series: MogaNet

Convolutional neural networks (ConvNets) have long been the method of choice for computer vision. Inspired by the primate visual system, convolutional layers encode neighborhood correlations of the observed image under inductive biases such as local dense connectivity and translation equivariance. By stacking layers hierarchically, ConvNets obtain progressively larger receptive fields and are adept at recognizing underlying semantic patterns. However, the representations extracted by ConvNets have been shown to be strongly biased toward local texture, losing much of the global contextual information of the visual target. In contrast, by relaxing local inductive biases, ViT and its variants quickly surpassed ConvNets on a broad range of vision benchmarks. ViT's capability derives primarily from the self-attention mechanism, which enables long-range interactions regardless of topological distance. However, the quadratic complexity of self-attention limits ViT's computational efficiency and its applicability to fine-grained downstream tasks.

Still, existing methods share a representational bottleneck: naive implementations of self-attention or large-kernel convolutions hinder the modeling of discriminative contextual information and global interactions, leaving a cognitive gap between DNNs and the human visual system. According to feature integration theory, the human brain not only extracts local features but simultaneously aggregates them into a global percept, which is more compact and efficient than what DNNs achieve. To address this challenge, the authors study the representational capacity of DNNs from the perspective of feature-interaction complexity. They design a macro ConvNet framework with corresponding basic operations and develop a new family of ConvNets, Multi-Order Gated Aggregation Networks (MogaNet), that efficiently aggregates contextual information across multiple interaction complexities. Motivated by human vision, MogaNet introduces a multi-order feature aggregation module. The design encapsulates local perception and context aggregation in a unified spatial aggregation block, in which composite multi-order associations are efficiently aggregated and contextualized through a parallel gating mechanism.

On the channel side, because existing methods tend to incur high channel-wise information redundancy, a simple yet efficient channel aggregation block is designed; it adaptively redistributes the channels of the input features and significantly outperforms mainstream counterparts (e.g., SE modules) at a lower computational cost.

The overall framework of MogaNet is shown in the figure below. The architecture closely mirrors that of a typical Transformer network, with two core modules: spatial aggregation (in place of attention) and channel aggregation (in place of the FFN).

[Figure: MogaNet overall architecture]

Spatial aggregation is shown in the figure below. The blue part, called Feature Decomposition, is used to exclude trivial interactions. The Moga module combines multiple depthwise convolutions of different scales, which the authors interpret as multi-order gating.

[Figure: spatial aggregation module]
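The gated spatial aggregation described above can be sketched in PyTorch roughly as follows. This is a simplified illustration, not the paper's exact implementation: the channel split ratios, kernel sizes, and dilations are assumptions chosen only to show the multi-order structure and the parallel gating.

```python
import torch
import torch.nn as nn

class MultiOrderDWConv(nn.Module):
    """Sketch: split channels into three groups and apply depthwise convs
    with increasing dilation, so different channel groups capture
    low-, middle-, and high-order (larger-range) context."""
    def __init__(self, dim):
        super().__init__()
        self.c1 = dim // 8                      # low-order branch (ratio illustrative)
        self.c2 = dim // 4                      # middle-order branch
        self.c3 = dim - self.c1 - self.c2       # high-order branch
        self.dw0 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw1 = nn.Conv2d(self.c2, self.c2, 5, padding=4, dilation=2, groups=self.c2)
        self.dw2 = nn.Conv2d(self.c3, self.c3, 7, padding=9, dilation=3, groups=self.c3)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        x = self.dw0(x)
        y1 = x[:, :self.c1]                                 # keep as-is
        y2 = self.dw1(x[:, self.c1:self.c1 + self.c2])      # dilated 5x5
        y3 = self.dw2(x[:, self.c1 + self.c2:])             # dilated 7x7
        return self.proj(torch.cat([y1, y2, y3], dim=1))

class MogaBlockSketch(nn.Module):
    """A gating branch (1x1 conv) is multiplied element-wise with the
    multi-order context branch, then projected back."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, 1)
        self.value = MultiOrderDWConv(dim)
        self.proj = nn.Conv2d(dim, dim, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.proj(self.act(self.gate(x)) * self.act(self.value(x)))
```

The paddings are chosen so every branch preserves the spatial resolution, which lets the three channel groups be concatenated directly.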

Channel aggregation is shown in the figure below. The mainstream FFN consists of only two fully connected layers, so the authors add a channel-aggregation step, which acts on channels much as the spatial aggregation block acts on spatial positions.

[Figure: channel aggregation module]
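The idea can be sketched as an FFN augmented with a cheap redistribution step. This is a hedged approximation of the block: the expansion ratio, the single-channel compression, and the learned scale `sigma` are illustrative assumptions, not the verified paper settings.

```python
import torch
import torch.nn as nn

class ChannelAggregationFFN(nn.Module):
    """Sketch: a standard conv-FFN plus a lightweight channel-aggregation
    step that compresses the hidden features to one map and feeds the
    difference back, adaptively re-weighting channels at small cost."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.agg = nn.Conv2d(hidden, 1, 1)   # compress all hidden channels to one map
        self.sigma = nn.Parameter(1e-5 * torch.ones(1, hidden, 1, 1))
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        x = self.act(self.dw(self.fc1(x)))
        # adaptive channel redistribution: broadcast the aggregated map
        # back over all hidden channels
        x = x + self.sigma * (x - self.act(self.agg(x)))
        return self.fc2(x)
```

Compared with an SE module, the aggregation here is a single 1x1 convolution plus a broadcast subtraction, which is why the cost stays low.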

The configurations of the MogaNet model variants are as follows:

[Table: MogaNet model variants]

Tutorial for adding a model as a backbone in a YOLOv5 project:

(1) In models/yolo.py of the YOLOv5 project, modify the parse_model function and the _forward_once function of the BaseModel class

[Screenshot: modified parse_model and BaseModel._forward_once in models/yolo.py]
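The usual reason _forward_once needs changing is that a whole-backbone module returns several feature maps at once, while the stock loop expects one tensor per layer. The following self-contained sketch shows that pattern in isolation (DummyBackbone and forward_once are illustrative stand-ins, not YOLOv5 code): when a module emits a list, cache every scale so later layers can index them.

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Stand-in for a backbone that returns multiple feature maps."""
    def forward(self, x):
        return [x, x.mean(dim=(2, 3), keepdim=True)]  # two "scales"

def forward_once(layers, x):
    y = []                          # cache of intermediate outputs
    for m in layers:
        x = m(x)
        if isinstance(x, list):     # backbone emitted several feature maps
            y.extend(x)             # save each scale individually
            x = x[-1]               # continue from the deepest one
        else:
            y.append(x)
    return x, y

out, cache = forward_once([DummyBackbone()], torch.randn(1, 3, 8, 8))
```

The real edit applies the same `isinstance(x, list)` branch inside BaseModel._forward_once, keeping the rest of the loop intact.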

(2) Create a new moganet.py in the models/backbone directory and add the following code:

[Screenshot: moganet.py backbone code]
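A minimal sketch of what such a file looks like structurally, under the assumption that the backbone should hand the YOLOv5 neck three feature maps at strides 8/16/32. Plain conv blocks stand in for the real MogaNet stages here; the class name, channel widths, and stage layout are all illustrative.

```python
import torch
import torch.nn as nn

class MogaNetBackbone(nn.Module):
    """Hypothetical skeleton of models/backbone/moganet.py: a staged,
    progressively downsampling backbone that returns its last three
    feature maps for the detection neck."""
    def __init__(self, in_ch=3, dims=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Sequential(          # stride 2
            nn.Conv2d(in_ch, dims[0], 3, stride=2, padding=1),
            nn.BatchNorm2d(dims[0]),
            nn.SiLU(),
        )
        self.stages = nn.ModuleList()
        c = dims[0]
        for d in dims:                      # each stage halves resolution
            self.stages.append(nn.Sequential(
                nn.Conv2d(c, d, 3, stride=2, padding=1),
                nn.BatchNorm2d(d),
                nn.SiLU(),
            ))
            c = d

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats[-3:]   # P3/P4/P5 at strides 8, 16, 32
```

In a real port, each `nn.Sequential` stage would be replaced by a stack of MogaNet spatial- and channel-aggregation blocks.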

(3) Import the model in models/yolo.py and modify the parse_model function as follows (import the file first):

[Screenshot: parse_model modifications in models/yolo.py]
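The shape of that parse_model change can be sketched as a dispatch branch: when the yaml names the backbone class, instantiate it directly and record the channel width of each feature map it returns so the head layers can be wired to them. Everything here (the stand-in class, `out_channels`, `build_layer`) is illustrative; the real edit is an extra `elif` inside parse_model.

```python
import torch.nn as nn

class MogaNetBackbone(nn.Module):
    """Stand-in for the class imported from models/backbone/moganet.py."""
    out_channels = (64, 128, 256)   # channels of the three returned scales (illustrative)

def build_layer(m, args):
    """Sketch of the new branch added to parse_model()."""
    if m is MogaNetBackbone:
        module = m(*args)            # instantiate the whole backbone as one layer
        c2 = list(m.out_channels)    # channel widths for downstream wiring
    else:
        raise NotImplementedError(m)
    return module, c2

mod, c2 = build_layer(MogaNetBackbone, [])
```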

(4) Create a new configuration file under the models directory: yolov5_moganet.yaml

[Screenshot: yolov5_moganet.yaml configuration]
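A hedged sketch of what such a yaml might look like (all values illustrative): the entire backbone section collapses to a single entry for the new module, and the head then references the cached feature maps by index.

```yaml
# Hypothetical yolov5_moganet.yaml sketch -- values are illustrative
nc: 80                 # number of classes
depth_multiple: 0.33
width_multiple: 0.50
anchors:
  - [10, 13, 16, 30, 33, 23]       # P3/8
  - [30, 61, 62, 45, 59, 119]      # P4/16
  - [116, 90, 156, 198, 373, 326]  # P5/32

backbone:
  # [from, number, module, args]
  [[-1, 1, MogaNetBackbone, []]]

# The standard YOLOv5 PANet head follows, with each `from` index
# pointing at one of the three feature maps cached from the backbone.
```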

(5) Run a verification: in models/yolo.py, set the --cfg argument to the newly created yolov5_moganet.yaml and run the script

[Screenshot: model build verification output]
