
A fully transformer-based object detector: anonymous ICLR paper unifies vision and detection

Author: Jishi Platform

Source: Machine Heart

Edited by: Jishi Platform

Transformer has made impressive progress on NLP tasks, and many studies have introduced it into computer vision. It is no exaggeration to say that Transformer is changing the landscape of computer vision, especially for recognition tasks. For example, DETR (Detection Transformer) was the first end-to-end learning system for object detection, while ViT (Vision Transformer) was the first image classification architecture based entirely on transformers. An anonymous paper accepted at ICLR 2022 integrates the two into ViDT (Vision and Detection Transformers) to build an effective and efficient object detector.

ViDT introduces a reconfigured attention module that extends the Swin Transformer into a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques to improve detection performance without adding significant computational load.

An evaluation on the Microsoft COCO benchmark shows that ViDT achieves the best AP-latency trade-off among existing fully transformer-based object detectors, and its high scalability lets large models reach 49.2 AP.

Paper address: https://openreview.net/pdf?id=w4cXZDDib1H

ViDT: Vision and Detection Transformers

The ViDT architecture is shown in Figure 2 (c) below:

  • First, ViDT introduces an improved attention mechanism called the Reconfigured Attention Module (RAM), which lets ViT variants handle the additional [DET] (detection) tokens alongside the [PATCH] (patch) tokens for object detection. With RAM, ViDT can turn the latest Swin Transformer backbone into a standalone object detector while keeping its linear-complexity local attention mechanism for high scalability;
  • Second, ViDT uses a lightweight, encoder-free neck architecture to reduce computational overhead while still allowing additional optimization techniques on the neck module. The neck encoder is unnecessary because RAM already extracts the fine-grained representation needed for object detection, namely the [DET] tokens. As a result, ViDT achieves better performance than its neck-free counterparts;
  • Finally, the study introduces a new concept of knowledge distillation via token matching, which transfers knowledge from a large model to a small one for additional performance gains without compromising detection efficiency.

RAM module

The study introduces a RAM module that decomposes the single global attention over [PATCH] and [DET] tokens into three distinct attentions: [PATCH] × [PATCH], [DET] × [DET], and [DET] × [PATCH] attention. As shown in Figure 3, by sharing the projection layers of the [DET] and [PATCH] tokens, all parameters of the Swin Transformer are reused while the three attention operations are performed.
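As a rough illustration of this decomposition, here is a minimal PyTorch-style sketch. The module and tensor names (`ReconfiguredAttention`, `patch_tokens`, `det_tokens`) are assumptions for illustration, not the authors' implementation; in the real model the [PATCH] × [PATCH] attention remains Swin's windowed local attention, and the [DET] × [PATCH] cross-attention is only activated where the paper specifies.

```python
import torch
import torch.nn as nn


class ReconfiguredAttention(nn.Module):
    """Sketch of RAM: one shared projection, three decomposed attentions.

    [PATCH] x [PATCH], [DET] x [DET], and [DET] x [PATCH] attention all reuse
    the same qkv/output projections, so no parameters are added on top of the
    Swin backbone. Plain global attention and a simple sum of the [DET]-side
    outputs are used here only to keep the sketch short.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # shared by [PATCH] and [DET] tokens
        self.proj = nn.Linear(dim, dim)      # shared output projection

    def _attend(self, q, k, v):
        B, Nq, C = q.shape
        h = self.num_heads
        q, k, v = (t.reshape(B, -1, h, C // h).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, C)
        return self.proj(out)

    def forward(self, patch_tokens, det_tokens):
        # One shared qkv projection for both token types.
        pq, pk, pv = self.qkv(patch_tokens).chunk(3, dim=-1)
        dq, dk, dv = self.qkv(det_tokens).chunk(3, dim=-1)

        # 1) [PATCH] x [PATCH] attention (windowed local attention in Swin).
        patch_out = self._attend(pq, pk, pv)
        # 2) [DET] x [DET] attention among the detection tokens.
        det_self = self._attend(dq, dk, dv)
        # 3) [DET] x [PATCH] cross-attention: [DET] queries read patch features.
        det_cross = self._attend(dq, pk, pv)
        return patch_out, det_self + det_cross
```

The point the sketch tries to convey is that a single qkv/output projection serves both token types, so equipping the Swin backbone with RAM adds no new attention parameters.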


Encoder-free neck structure

To take advantage of multi-scale feature maps, ViDT incorporates a multi-layer deformable transformer decoder. In the DETR family (Figure 2 (a)), the neck requires a transformer encoder to convert the features extracted by the backbone for image classification into features suitable for object detection; this encoder is computationally expensive because it involves [PATCH] × [PATCH] attention. ViDT, however, keeps only a single transformer decoder as its neck, because the Swin Transformer with RAM already extracts fine-grained features suitable for object detection as a standalone object detector. ViDT's neck structure is therefore computationally efficient.

The decoder receives two inputs from the Swin Transformer with RAM: (1) the [PATCH] tokens generated at each stage, and (2) the [DET] tokens generated at the final stage, as shown in the neck of Figure 2 (c). In each deformable transformer layer, [DET] × [DET] attention is performed first; then, for each [DET] token, multi-scale deformable attention is applied to produce a new [DET] token that aggregates information from the multi-scale feature maps.
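A minimal sketch of this per-layer flow is shown below; `deform_attn` stands in for a multi-scale deformable attention module in the style of Deformable DETR, and all class and argument names here are assumptions rather than the paper's code.

```python
import torch.nn as nn


class ViDTNeckDecoderLayer(nn.Module):
    """Sketch of one neck decoder layer: [DET] x [DET] self-attention,
    then multi-scale deformable cross-attention over the [PATCH] feature maps.
    `deform_attn` is assumed to be a multi-scale deformable attention module
    (as in Deformable DETR); this is not the authors' implementation.
    """

    def __init__(self, dim: int, num_heads: int, deform_attn: nn.Module):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.deform_attn = deform_attn          # samples a few keys per scale
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, det_tokens, multi_scale_patch_tokens, ref_points):
        # 1) [DET] x [DET] self-attention.
        q = k = det_tokens
        det_tokens = self.norm1(
            det_tokens + self.self_attn(q, k, det_tokens)[0])
        # 2) Multi-scale deformable cross-attention: each [DET] token gathers
        #    features from a small set of sampled points on every scale.
        det_tokens = self.norm2(
            det_tokens + self.deform_attn(
                det_tokens, multi_scale_patch_tokens, ref_points))
        # 3) Feed-forward network.
        return self.norm3(det_tokens + self.ffn(det_tokens))
```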


The deformable attention samples only a small set of key points from each feature map for every [DET] query:
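For reference, the multi-scale deformable attention of Deformable DETR, on which this decoder builds, has the following form (ViDT's exact notation may differ), with M attention heads, L feature levels, and K sampled points per level:

```latex
\mathrm{MSDeformAttn}\big(\mathbf{z}_q, \hat{\mathbf{p}}_q, \{\mathbf{x}^l\}_{l=1}^{L}\big)
  = \sum_{m=1}^{M} \mathbf{W}_m
    \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot
          \mathbf{W}'_m \, \mathbf{x}^l\!\big(\phi_l(\hat{\mathbf{p}}_q) + \Delta\mathbf{p}_{mlqk}\big) \Big]
```

Here z_q is the query (a [DET] token), p̂_q its normalized reference point, φ_l rescales the reference point to the l-th feature map, Δp_{mlqk} are learned sampling offsets, and the attention weights A_{mlqk} are normalized over l and k.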


Token matching knowledge distillation for object detection

Although large models have the capacity to achieve high performance, they can be computationally expensive in real-world use. Therefore, the study also proposes a simple knowledge-distillation method that transfers knowledge from a large ViDT model to a small one through token matching.

Matching all tokens at every layer is very inefficient during training, so the study matches only the tokens that contribute the most to predictions. Two sets of tokens are directly related: (1) P, the set of [PATCH] tokens used as multi-scale feature maps, generated by each stage of the body, and (2) D, the set of [DET] tokens generated by each decoding layer of the neck. The token-matching distillation loss is then:
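As a hedged sketch of such a token-matching loss (the coefficient λ_dl and the superscripts below are assumed names, not copied from the paper), each student token can be matched to the corresponding teacher token with an L2 distance:

```latex
\ell_{dl} = \lambda_{dl}\left(
    \frac{1}{|\mathcal{P}|}\sum_{i=1}^{|\mathcal{P}|}
      \big\lVert \mathbf{p}^{\text{small}}_{i} - \mathbf{p}^{\text{large}}_{i} \big\rVert_{2}
  + \frac{1}{|\mathcal{D}|}\sum_{j=1}^{|\mathcal{D}|}
      \big\lVert \mathbf{d}^{\text{small}}_{j} - \mathbf{d}^{\text{large}}_{j} \big\rVert_{2}
\right)
```

Here P and D are the two token sets described above, the superscripts distinguish the small (student) and large (teacher) ViDT models, and the teacher tokens are treated as fixed targets.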


Evaluation

Table 2 compares ViDT with DETR (ViT) and YOLOS in terms of AP, FPS, and other metrics, where DETR (ViT) has two variants: DETR and Deformable DETR.

The experimental results show that ViDT achieves the best trade-off between AP and FPS. Thanks to its high scalability, it performs well even with the roughly 100M-parameter Swin-base backbone, and its FPS is twice that of Deformable DETR at a similar AP. In addition, ViDT with 16M parameters reaches 40.4 AP, which is 6.3 AP and 12.6 AP higher than DETR (Swin-nano) and DETR (Swin-tiny), respectively.


Table 3 compares different spatial positional encodings on ViDT (w.o. neck). The results show that pre-addition yields a larger performance boost than post-addition, and sinusoidal encoding outperforms learnable encoding, so the 2D inductive bias of sinusoidal encoding is more helpful for object detection. In particular, pre-addition of sinusoidal encoding increases AP by 5.0 compared with not using any positional encoding.
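As a rough illustration of what pre-addition of a 2D sinusoidal encoding could look like, here is a small sketch; the interpretation (the fixed encoding is added to the [PATCH] tokens before they enter the cross-attention projections) and all names below are assumptions for illustration, not taken from the paper.

```python
import torch


def sine_positional_encoding_2d(h: int, w: int, dim: int,
                                temperature: float = 10000.0) -> torch.Tensor:
    """Fixed 2D sinusoidal encoding of shape (h*w, dim); dim must be divisible by 4."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    omega = temperature ** (torch.arange(dim // 4) / (dim // 4))

    def encode(coord: torch.Tensor) -> torch.Tensor:   # (h, w) -> (h*w, dim/2)
        angles = coord.flatten()[:, None] / omega[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=1)

    return torch.cat([encode(ys.float()), encode(xs.float())], dim=1)


# "Pre-addition": the encoding is added to the [PATCH] tokens before the
# key/value projections of the [DET] x [PATCH] cross-attention.
h, w, dim = 32, 32, 256
patch_tokens = torch.randn(1, h * w, dim)
patch_tokens = patch_tokens + sine_positional_encoding_2d(h, w, dim)[None]
```

A learnable variant would replace the fixed sine/cosine grid with trainable parameters of the same shape.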


Table 4 summarizes AP and FPS for different cross-attention selection strategies, where the Swin Transformer consists of four stages in total. Interestingly, as long as cross-attention is activated in the final stage, all strategies exhibit similar AP. Because features are extracted in a bottom-up manner across the stages, it is difficult to obtain useful information about target objects directly at the lower levels. Therefore, to obtain both high AP and high FPS, activating cross-attention only in the final stage is the best design choice, since the number of [PATCH] tokens there is smallest.


To thoroughly verify the effectiveness of the auxiliary decoding loss and iterative box refinement, the study also extends them to neck-free detectors such as YOLOS. Table 5 shows the performance of two neck-free detectors, YOLOS and ViDT (w.o. neck). The experimental results confirm that using the neck decoder in ViDT to improve object detection performance is justified.


The distillation results show that the larger the teacher model, the greater the benefit to the student model, and the larger the distillation coefficient, the better the performance. Knowledge distillation increases AP by 1.0-1.7 without affecting the student model's inference speed.


The researchers combine all the proposed components to achieve both high accuracy and high speed in object detection. As shown in Table 8, there are four components: (1) RAM, which extends the Swin Transformer into a standalone object detector; (2) the neck decoder, which exploits multi-scale features and two auxiliary techniques; (3) knowledge distillation from a large model; and (4) decoding-layer dropping to further speed up inference. The results show that with Swin-nano as the backbone, ViDT achieves 41.7 AP at a reasonable FPS with only 13M parameters. With Swin-tiny, it loses only 2.7 FPS while reaching 46.4 AP.