Abstract: Although CNN-based backbones have made significant progress on a variety of vision tasks, this paper proposes a simple convolution-free backbone for dense prediction tasks, the Pyramid Vision Transformer (PVT). Unlike ViT, which was designed specifically for image classification, PVT introduces a pyramid structure into the Transformer, enabling various downstream dense prediction tasks such as detection and segmentation. Compared with prior work, PVT has the following advantages: (1) unlike ViT, which produces low-resolution output at high computational and memory cost, PVT can be trained on dense partitions of the image to achieve high output resolution (which is important for dense prediction), and it uses a progressively shrinking pyramid to reduce the computation on large feature maps; (2) PVT inherits the advantages of both CNNs and Transformers, making it a general-purpose convolution-free backbone that can directly replace CNN-based backbones; and (3) extensive experiments show that PVT improves the performance of many downstream tasks, such as object detection and semantic/instance segmentation.
For example, with a comparable number of parameters, RetinaNet+PVT reaches 40.4 AP on COCO, while RetinaNet+ResNet50 reaches only 36.3 AP. The authors hope that PVT can become an alternative backbone for pixel-level prediction tasks and facilitate future research.
A CNN learns a hierarchical feature representation by stacking convolutional layers: as the number of layers increases, the receptive field grows, the number of channels increases, and the feature-map size shrinks; one or more task-specific networks are then attached to perform specific tasks.
As shown in Figure (b), the classic ViT is a columnar structure, essentially a stack of Transformer blocks. To apply the Transformer from NLP to vision, the customary practice is to split the image into a sequence of patches through gridding, with each patch typically 32 x 32 pixels.
As shown in Figure (c), the proposed Pyramid Vision Transformer (PVT) likewise first converts the image into a sequence of patches and, structurally, also learns a hierarchical representation, but the basic building block is replaced: convolution gives way to the attention module.
PVT and ViT are both pure Transformer models without any convolution operations; the main difference is that PVT introduces a feature pyramid structure. ViT uses a conventional Transformer whose input and output sequences have the same length, so the spatial resolution never changes. Due to resource limitations, ViT can only produce a relatively coarse feature map, built from patches of 16 x 16 or 32 x 32 pixels, which corresponds to a large output stride of 16 or 32. As a result, ViT is hard to apply directly to dense prediction tasks that require high-resolution features. PVT breaks this limitation by introducing a progressive shrinking pyramid, which can generate multi-scale feature maps like a traditional CNN backbone. In addition, a simple and effective attention layer, spatial-reduction attention (SRA), is designed to handle high-resolution feature maps while reducing computational complexity and memory consumption (see the sketch after the list below). In general, PVT has the following advantages over ViT:
(1) More flexible: feature maps with different resolutions and numbers of channels can be generated at different stages;
(2) More general: it can be easily plugged into the models of most downstream tasks;
(3) More compute- and memory-friendly: it can handle high-resolution feature maps.
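To make SRA concrete, here is a minimal PyTorch sketch of spatial-reduction attention. The class name and arguments are ours, not taken from the paper's code; the strided convolution that shrinks the keys and values before attention follows the mechanism the paper describes.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head self-attention where K and V are computed from a
    spatially reduced feature map, cutting cost by roughly sr_ratio**2."""
    def __init__(self, dim, num_heads=8, sr_ratio=1):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # strided conv shrinks the (H, W) grid before computing K and V
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N/sr**2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The query keeps the full resolution, so the output sequence length is unchanged; only the attention matrix shrinks, which is what makes high-resolution inputs affordable.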
PVT model variant configurations:
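For reference, the variant settings reported in the PVT paper can be summarized as a small Python table. Treat the exact numbers as something to verify against the official repository (https://github.com/whai362/PVT) before relying on them.

```python
# Approximate PVT variant settings (depths per stage differ; the embedding
# dims, head counts, and SR ratios are shared across variants).
PVT_VARIANTS = {
    "pvt_tiny":   dict(depths=[2, 2, 2, 2],  dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
    "pvt_small":  dict(depths=[3, 4, 6, 3],  dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
    "pvt_medium": dict(depths=[3, 4, 18, 3], dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
    "pvt_large":  dict(depths=[3, 8, 27, 3], dims=[64, 128, 320, 512],
                       heads=[1, 2, 5, 8],   sr_ratios=[8, 4, 2, 1]),
}
```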
Tutorial for adding the model as a backbone in a YOLOv5 project:
(1) In models/yolo.py of the YOLOv5 project, modify the parse_model function and the _forward_once method of BaseModel so that a backbone returning several feature maps is handled (a sketch of the _forward_once change follows):
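A minimal sketch of the _forward_once change, assuming PVT is module 0 and returns a list [P3, P4, P5]. The yaml 'from' indices then refer to positions in y, and every layer after the backbone shifts by two positions; profiling and visualization code from the stock method is omitted here.

```python
# models/yolo.py, BaseModel._forward_once - illustrative sketch.
def _forward_once(self, x, profile=False, visualize=False):
    y = []  # saved outputs
    for m in self.model:
        if m.f != -1:  # input taken from earlier layer(s), not the previous one
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        if isinstance(x, list):   # multi-scale backbone output
            y.extend(x)           # P3, P4, P5 stored at positions 0, 1, 2
            x = x[-1]             # feed the deepest map to the next layer
        else:
            y.append(x)           # sketch: keep everything (stock YOLOv5 keeps
                                  # only indices in self.save to spare memory)
    return x
```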
(2) Create a new pvt.py under the models/backbone directory and add the backbone code, along the lines of the sketch below:
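The original post's code block is not reproduced here; as a stand-in, a compact sketch of what models/backbone/pvt.py might contain is shown below. It builds a four-stage PVT from the SpatialReductionAttention class sketched earlier (assumed to be in scope) and returns the last three stage outputs for the detection neck. Patch-embedding details such as overlapping patches and positional encodings are simplified relative to the official implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: SRA + MLP, each with a residual connection."""
    def __init__(self, dim, num_heads, sr_ratio, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SpatialReductionAttention(dim, num_heads, sr_ratio)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, H, W):
        x = x + self.attn(self.norm1(x), H, W)
        x = x + self.mlp(self.norm2(x))
        return x

class PVT(nn.Module):
    """Simplified 4-stage PVT that returns the last three feature maps
    (strides 8/16/32), which is what a YOLOv5-style neck expects."""
    def __init__(self, in_ch=3, dims=(64, 128, 320, 512),
                 depths=(2, 2, 2, 2), heads=(1, 2, 5, 8), sr_ratios=(8, 4, 2, 1)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for i in range(4):
            stride = 4 if i == 0 else 2  # patch sizes 4, 2, 2, 2 as in PVT
            patch_embed = nn.Conv2d(ch, dims[i], kernel_size=stride, stride=stride)
            blocks = nn.ModuleList(
                Block(dims[i], heads[i], sr_ratios[i]) for _ in range(depths[i]))
            self.stages.append(nn.ModuleList([patch_embed, blocks, nn.LayerNorm(dims[i])]))
            ch = dims[i]

    def forward(self, x):
        outs = []
        for patch_embed, blocks, norm in self.stages:
            x = patch_embed(x)
            B, C, H, W = x.shape
            x = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
            for blk in blocks:
                x = blk(x, H, W)
            x = norm(x)
            x = x.transpose(1, 2).reshape(B, C, H, W)
            outs.append(x)
        return outs[1:]  # strides 8, 16, 32 for the detection neck
```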
(3) Import the model in models/yolo.py and modify the parse_model function as follows (import the file at the top first):
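A sketch of the parse_model change, again assuming the PVT class from the file above. The exact surrounding code differs between YOLOv5 versions, and the channel bookkeeping here records only the deepest output; if head rows compute channels from ch[f] for the P3/P4 maps (128/320), that logic needs the same treatment.

```python
# models/yolo.py - import the new backbone next to the other model imports:
from models.backbone.pvt import PVT

# Inside parse_model, where each yaml row is dispatched on the module type,
# add a branch for PVT (illustrative):
        elif m is PVT:
            c2 = 512   # channels of the deepest PVT output map (dims[-1] above)
            args = []  # PVT() is built with its defaults and consumes raw RGB input
```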
(4) Create a new configuration file under the models directory: yolov5_pvt.yaml (a sketch follows):
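A possible yolov5_pvt.yaml skeleton, modeled on the stock yolov5s.yaml. The backbone section is complete, but the head is only started here, because its 'from' indices depend on the bookkeeping chosen in step (1):

```yaml
# yolov5_pvt.yaml - illustrative sketch, modeled on the stock yolov5s.yaml
nc: 80                 # number of classes
depth_multiple: 1.0
width_multiple: 1.0
anchors:
  - [10,13, 16,30, 33,23]        # P3/8
  - [30,61, 62,45, 59,119]       # P4/16
  - [116,90, 156,198, 373,326]   # P5/32

backbone:
  # [from, number, module, args]
  # a single PVT module replaces the whole CNN backbone and emits P3/P4/P5
  [[-1, 1, PVT, []]]

head:
  # copy the head rows from yolov5s.yaml and remap every 'from' index:
  # with the step (1) change, the PVT maps sit at positions 0 (P3/8),
  # 1 (P4/16) and 2 (P5/32), and each later layer's position shifts by 2
  [[2, 1, SPPF, [512, 5]]]       # first head row operates on the P5 map
```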
(5) Run verification: execute models/yolo.py with the --cfg argument pointing to the newly created yolov5_pvt.yaml, as shown below.
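Assuming the stock YOLOv5 layout, where models/yolo.py accepts a --cfg flag, the build check looks like:

```bash
python models/yolo.py --cfg models/yolov5_pvt.yaml
```

If the integration is correct, YOLOv5 prints the parsed layer table and parameter counts without raising an error.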