Abstract: Although CNN-based backbones have made significant progress on a variety of vision tasks, this paper proposes a simple convolution-free backbone for dense prediction tasks, the Pyramid Vision Transformer (PVT). Unlike ViT, which was designed specifically for image classification, PVT introduces a pyramid structure into the Transformer, enabling various downstream dense prediction tasks such as detection and segmentation. Compared with prior work, PVT has the following advantages: (1) unlike ViT, which produces low-resolution output at high computational and memory cost, PVT can be trained on dense partitions of the image to achieve high output resolution (which is important for dense prediction), and it uses a progressively shrinking pyramid to reduce the computation on large feature maps; (2) PVT inherits the advantages of both CNNs and Transformers, making it a general-purpose convolution-free backbone that can directly replace CNN-based backbones; and (3) extensive experiments show that PVT improves the performance of many downstream tasks, such as object detection and semantic/instance segmentation.
For example, with a comparable number of parameters, RetinaNet+PVT reaches 40.4 AP on COCO, while RetinaNet+ResNet50 reaches only 36.3 AP. The authors hope that PVT can become an alternative backbone for pixel-level prediction tasks and facilitate future research.
A CNN learns a hierarchical feature representation by stacking convolutional layers: as the number of layers increases, the receptive field grows, the number of channels increases, and the feature-map size shrinks; one or more task-specific networks are then attached to perform specific tasks.
As shown in Figure (b), the classic ViT is a columnar structure, essentially a stack of Transformer blocks. To apply the Transformer from NLP to vision, the customary practice is to split the image into a sequence of patches through gridding, with each patch typically 32 x 32 pixels.
As shown in Figure (c), the proposed Pyramid Vision Transformer (PVT) likewise first converts the image into a sequence of patches and, structurally, also learns a hierarchical representation, but the basic building block is replaced: convolution gives way to the attention module.
PVT and ViT are both pure Transformer models without any convolution operations; the main difference is that PVT introduces a feature pyramid structure. ViT uses a conventional Transformer whose input and output sequences have the same length, so the spatial resolution never changes. Due to resource limitations, ViT can only produce a relatively coarse feature map, built from patches of 16 x 16 or 32 x 32 pixels, which corresponds to a large output stride of 16 or 32. As a result, ViT is hard to apply directly to dense prediction tasks that require high-resolution features. PVT breaks this limitation by introducing a progressive shrinking pyramid, which can generate multi-scale feature maps like a traditional CNN backbone. In addition, a simple and effective attention layer, spatial-reduction attention (SRA), is designed to handle high-resolution feature maps while reducing computational complexity and memory consumption (see the sketch after the list below). In general, PVT has the following advantages over ViT:
(1) More flexible: feature maps with different resolutions and numbers of channels can be generated at different stages;
(2) More general: it can be easily plugged into the models of most downstream tasks;
(3) More compute- and memory-friendly: it can handle high-resolution feature maps.
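To make SRA concrete, here is a minimal PyTorch sketch of spatial-reduction attention. The class name and arguments are ours, not taken from the paper's code; the strided convolution that shrinks the keys and values before attention follows the mechanism the paper describes.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head self-attention where K and V are computed from a
    spatially reduced feature map, cutting cost by roughly sr_ratio**2."""
    def __init__(self, dim, num_heads=8, sr_ratio=1):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # strided conv shrinks the (H, W) grid before computing K and V
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N/sr**2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The query keeps the full resolution, so the output sequence length is unchanged; only the attention matrix shrinks, which is what makes high-resolution inputs affordable.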
PVT model variant configurations:
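For reference, the variant settings reported in the PVT paper can be summarized as a small Python table. Treat the exact numbers as something to verify against the official repository (https://github.com/whai362/PVT) before relying on them.

```python
# Approximate PVT variant settings (depths per stage differ; the embedding
# dims, head counts, and SR ratios are shared across variants).
PVT_VARIANTS = {
    "pvt_tiny":   dict(depths=[2, 2, 2, 2],  dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
    "pvt_small":  dict(depths=[3, 4, 6, 3],  dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
    "pvt_medium": dict(depths=[3, 4, 18, 3], dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
    "pvt_large":  dict(depths=[3, 8, 27, 3], dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
}
```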
Tutorial for adding the model as a backbone in a YOLOv5 project:
(1) In models/yolo.py of the YOLOv5 project, modify the parse_model function and the _forward_once method of BaseModel so that a backbone returning several feature maps is handled (a sketch of the _forward_once change follows):
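A minimal sketch of the _forward_once change, assuming PVT is module 0 and returns a list [P3, P4, P5]. The yaml 'from' indices then refer to positions in y, and every layer after the backbone shifts by two positions; profiling and visualization code from the stock method is omitted here.

```python
# models/yolo.py, BaseModel._forward_once - illustrative sketch.
def _forward_once(self, x, profile=False, visualize=False):
    y = []  # saved outputs
    for m in self.model:
        if m.f != -1:  # input taken from earlier layer(s), not the previous one
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        if isinstance(x, list):   # multi-scale backbone output
            y.extend(x)           # P3, P4, P5 stored at positions 0, 1, 2
            x = x[-1]             # feed the deepest map to the next layer
        else:
            y.append(x)           # sketch: keep everything (stock YOLOv5 keeps
                                  # only indices in self.save to spare memory)
    return x
```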
(2) Create a new pvt.py under the models/backbone directory and add the backbone code, along the lines of the sketch below:
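The original post's code block is not reproduced here; as a stand-in, a compact sketch of what models/backbone/pvt.py might contain is shown below. It builds a four-stage PVT from the SpatialReductionAttention class sketched earlier (assumed to be in scope) and returns the last three stage outputs for the detection neck. Patch-embedding details such as overlapping patches and positional encodings are simplified relative to the official implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: SRA + MLP, each with a residual connection."""
    def __init__(self, dim, num_heads, sr_ratio, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SpatialReductionAttention(dim, num_heads, sr_ratio)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, H, W):
        x = x + self.attn(self.norm1(x), H, W)
        x = x + self.mlp(self.norm2(x))
        return x

class PVT(nn.Module):
    """Simplified 4-stage PVT that returns the last three feature maps
    (strides 8/16/32), which is what a YOLOv5-style neck expects."""
    def __init__(self, in_ch=3, dims=(64, 128, 320, 512),
                 depths=(2, 2, 2, 2), heads=(1, 2, 5, 8), sr_ratios=(8, 4, 2, 1)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for i in range(4):
            stride = 4 if i == 0 else 2  # patch sizes 4, 2, 2, 2 as in PVT
            patch_embed = nn.Conv2d(ch, dims[i], kernel_size=stride, stride=stride)
            blocks = nn.ModuleList(
                Block(dims[i], heads[i], sr_ratios[i]) for _ in range(depths[i]))
            self.stages.append(nn.ModuleList([patch_embed, blocks, nn.LayerNorm(dims[i])]))
            ch = dims[i]

    def forward(self, x):
        outs = []
        for patch_embed, blocks, norm in self.stages:
            x = patch_embed(x)
            B, C, H, W = x.shape
            x = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
            for blk in blocks:
                x = blk(x, H, W)
            x = norm(x)
            x = x.transpose(1, 2).reshape(B, C, H, W)
            outs.append(x)
        return outs[1:]  # strides 8, 16, 32 for the detection neck
```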
(3) Import the model in models/yolo.py and modify the parse_model function as follows (import the file at the top first):
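A sketch of the parse_model change, again assuming the PVT class from the file above. The exact surrounding code differs between YOLOv5 versions, and the channel bookkeeping here records only the deepest output; if head rows compute channels from ch[f] for the P3/P4 maps (128/320), that logic needs the same treatment.

```python
# models/yolo.py - import the new backbone next to the other model imports:
from models.backbone.pvt import PVT

# Inside parse_model, where each yaml row is dispatched on the module type,
# add a branch for PVT (illustrative):
        elif m is PVT:
            c2 = 512   # channels of the deepest PVT output map (dims[-1] above)
            args = []  # PVT() is built with its defaults and consumes raw RGB input
```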
(4) Create a new configuration file under the models directory: yolov5_pvt.yaml (a sketch follows):
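A possible yolov5_pvt.yaml skeleton, modeled on the stock yolov5s.yaml. The backbone section is complete, but the head is only started here, because its 'from' indices depend on the bookkeeping chosen in step (1):

```yaml
# yolov5_pvt.yaml - illustrative sketch, modeled on the stock yolov5s.yaml
nc: 80                 # number of classes
depth_multiple: 1.0
width_multiple: 1.0
anchors:
  - [10,13, 16,30, 33,23]        # P3/8
  - [30,61, 62,45, 59,119]       # P4/16
  - [116,90, 156,198, 373,326]   # P5/32

backbone:
  # [from, number, module, args]
  # a single PVT module replaces the whole CNN backbone and emits P3/P4/P5
  [[-1, 1, PVT, []]]

head:
  # copy the head rows from yolov5s.yaml and remap every 'from' index:
  # with the step (1) change, the PVT maps sit at positions 0 (P3/8),
  # 1 (P4/16) and 2 (P5/32), and each later layer's position shifts by 2
  [[2, 1, SPPF, [512, 5]]]       # first head row operates on the P5 map
```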
(5) Run verification: execute models/yolo.py with the --cfg argument pointing to the newly created yolov5_pvt.yaml, as shown below.
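Assuming the stock YOLOv5 layout, where models/yolo.py accepts a --cfg flag, the build check looks like:

```bash
python models/yolo.py --cfg models/yolov5_pvt.yaml
```

If the integration is correct, YOLOv5 prints the parsed layer table and parameter counts without raising an error.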