
CAE, a new paradigm for self-supervised learning: why is MIM better suited to downstream tasks than contrastive learning?

Heart of the Machine column

Author: Chen Xiaokang


Masked modeling methods are widely used in NLP, most notably in BERT. With the introduction and development of ViT, researchers have also tried to apply masked image modeling (MIM) to vision and have made progress. Before this, visual self-supervised algorithms were mainly designed along the lines of contrastive learning; MIM has opened new doors.

Researchers from Peking University, the University of Hong Kong, and Baidu have recently proposed a new MIM method called CAE. It completely separates the functions of "representation learning" and "solving the pretext task", so that the encoder learns better representations and thus generalizes better on downstream tasks.

Address of the paper: https://arxiv.org/abs/2202.03026

The study answered several questions:

1. In MIM methods, which part of the network structure learns the representation, and which part solves the pretext task?

2. Why do previous typical contrastive learning methods achieve only performance similar to supervised pre-training on downstream tasks (e.g., detection, segmentation)?

3. Why are MIM methods superior to current contrastive learning methods?

1. Background

MIM is a self-supervised representation learning algorithm. Its main idea is to divide the input image into patches, randomly mask a subset of them, and then predict some property of the masked region. The prediction target can be a token ID (BEiT) or RGB values (MAE). Through MIM, the encoder can learn a good representation, resulting in good generalization performance on downstream tasks.
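As a concrete illustration of the masking step, here is a minimal PyTorch sketch (the function name and shapes are assumptions for illustration, not code from any of the papers):

```python
import torch

def random_mask(batch_size: int, num_patches: int, num_masked: int):
    """Return boolean masks marking which patches are hidden from the encoder."""
    # Randomly permute patch indices per image and mark the first
    # `num_masked` of them as masked.
    noise = torch.rand(batch_size, num_patches)
    shuffled = noise.argsort(dim=1)
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    mask.scatter_(1, shuffled[:, :num_masked], True)
    return mask  # True = masked patch, False = visible patch

# A 224x224 image with 16x16 patches yields 14*14 = 196 patches.
mask = random_mask(batch_size=2, num_patches=196, num_masked=75)
```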

Recent MIM has two representative works: BEiT and MAE.

BEiT uses a single encoder to do two things: (1) learn a good image representation, and (2) solve the pretext task of predicting the token IDs of the masked patches. The potential of the encoder is therefore not fully tapped; only part of it is used to learn representations.

MAE uses an encoder-decoder architecture: the encoder encodes the visible patches, and the decoder takes the representations of the visible patches together with learnable mask tokens as input to predict the RGB values of the masked patches. However, MAE also updates the representations of the visible patches inside the decoder, so the decoder in fact takes on part of the representation-learning function as well.

Neither method fully exploits the potential of the encoder, which limits the quality of the representations learned during pre-training.

2. Context Autoencoder (CAE)

The core idea of the CAE design is to separate the two functions of "representation learning" and "solving the pretext task". The researchers want the encoder to be responsible only for representation learning during pre-training, and the decoder only for solving the pretext task, so that the encoder's potential is tapped as much as possible. CAE consists of four parts: (1) an encoder; (2) a latent contextual regressor; (3) a decoder; (4) an alignment module.


Random masking divides the input image into two parts, the visible patches and the masked patches. Specifically:

The encoder is a ViT model that learns the representations of the visible patches.

The latent contextual regressor predicts the representations of the masked patches. It consists of a series of cross-attention modules, where the queries are the representations of the masked patches and the keys and values are the representations of all patches. When computing query-key similarity, the method introduces a positional encoding for each patch. At this stage, the masked-patch representations are continually updated and become more accurate, while the visible-patch representations are not updated; the task of extracting image features is left entirely to the encoder.

The decoder takes only the predicted masked-patch representations and the corresponding positional encodings as input, and predicts certain properties of the masked patches, such as token IDs or RGB values. The study's experiments are similar to BEiT, using a DALL-E tokenizer to tokenize the input image and obtain the decoder's targets.

Latent representation alignment adds a constraint so that the output of the latent contextual regressor lies in the same encoding space as the output of the encoder. To obtain alignment targets, the method also feeds the masked patches of the image into the encoder to compute their representations; these serve as the learning targets. No gradients are computed through this branch.

Loss function. The loss consists of two parts: (1) supervision of the decoder's predictions, using a cross-entropy loss; (2) supervision of the alignment between the regressor's output and the encoder's representations of the masked patches, using an MSE loss. A schematic sketch of one pre-training step is given below.
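The following is a minimal, hypothetical PyTorch sketch of how the four parts might fit together in one CAE pre-training step; module internals are elided, and names such as `encoder`, `regressor`, and `decoder` are placeholders rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def cae_step(encoder, regressor, decoder, patches, mask, mask_queries, targets):
    """One pre-training step of CAE (schematic).

    patches: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked;
    mask_queries: learnable mask-token embeddings plus positional encodings;
    targets: (B, M) discrete token IDs of the masked patches (e.g., from a tokenizer).
    """
    B, N, D = patches.shape
    z_visible = encoder(patches[~mask].view(B, -1, D))   # encode visible patches only

    # Cross-attention regressor: queries are the mask tokens, keys/values are
    # the visible-patch representations; it predicts masked representations.
    z_masked_pred = regressor(queries=mask_queries, context=z_visible)

    # Alignment target: also encode the masked patches, without gradients.
    with torch.no_grad():
        z_masked_true = encoder(patches[mask].view(B, -1, D))

    logits = decoder(z_masked_pred)                      # predict e.g. token IDs

    loss_pred = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    loss_align = F.mse_loss(z_masked_pred, z_masked_true)
    return loss_pred + loss_align
```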

3. Analysis

3.1 CAE attends to the representation of every patch

CAE predicts the randomly sampled masked patches from the representations of the visible patches, which requires CAE to attend to the semantics of every patch. This differs from typical contrastive learning methods (e.g., MoCo v3, SimCLR), which focus only on the global semantics of the image and ignore details and non-subject regions (such as the background).

3.2 The output of the latent contextual regressor and the output of the encoder lie in the same encoding space

The study constrains the output of the latent contextual regressor to be as close as possible to the output of the encoder. In this way, the decoder makes predictions based on the encoding space learned by the encoder, the responsibility for extracting image features is entrusted entirely to the encoder, and the encoder is driven to learn good representations.

To verify this, the study trained a CAE with RGB values as the decoder target (RGB was chosen here because token IDs are difficult to visualize). At test time, the study fed all patches into the encoder, skipped the latent contextual regressor, and fed the encoder's output directly into the decoder to predict the RGB values of all patches. The figure below shows the predictions: the first row is the original image and the second row is the prediction. The researchers found that the image can be reconstructed using only the encoder and decoder, indicating that the output of the encoder and the output of the latent contextual regressor belong to the same encoding space.

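This check can be expressed in a few lines (a hedged sketch, assuming a CAE trained with RGB targets; names are hypothetical):

```python
import torch

@torch.no_grad()
def reconstruct_without_regressor(encoder, decoder, patches):
    """Feed all patches to the encoder and pass its output straight to the
    decoder, skipping the latent contextual regressor entirely."""
    z = encoder(patches)   # representations of all patches
    rgb = decoder(z)       # predicted RGB values per patch
    return rgb
```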

If the alignment constraint is not applied during training, reconstruction fails: as shown in the figure below, the output is garbled, indicating that the encoder's output and the latent contextual regressor's output do not lie in the same encoding space. This hurts the quality of the representations learned by the encoder, which is also verified in the ablation section.


3.3 CAE learns representations that distinguish between different categories of objects/stuff

CAE predicts the masked-patch regions from the representations of the visible patches, which requires CAE to understand the content of the visible patches well. For example, when people see a dog's head, they can predict the rest of its body; when they see a small piece of sky, they can also predict that the surrounding region is most likely sky as well. The researchers therefore believe that the representations learned by CAE can distinguish between different categories of objects/stuff. To verify this, they randomly sampled some images from the ADE20K dataset and fed them into the encoder. Because ADE20K provides a class label (150 classes) for each pixel, the study could use t-SNE to visualize the representations output by the encoder. As shown in the figure below, each color represents a category; the left plot is CAE, and the right plot is a randomly initialized encoder. CAE can effectively distinguish between different categories of objects/stuff (because it is pre-trained only on ImageNet-1K, the separation is not perfect), while a randomly initialized encoder cannot.

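A visualization of this kind can be sketched as follows (using scikit-learn's t-SNE; the per-patch labels and feature shapes are assumptions about the setup, not the authors' exact script):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_patch_features(features: np.ndarray, labels: np.ndarray):
    """features: (num_patches, dim) encoder outputs; labels: (num_patches,)
    semantic class of each patch (e.g., the majority pixel label within
    the patch, derived from ADE20K's per-pixel annotations)."""
    coords = TSNE(n_components=2, init="pca").fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=2)
    plt.savefig("tsne_patches.png", dpi=200)
```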

3.4 Why does typical contrastive learning achieve only results similar to supervised pre-training on downstream tasks?

In contrastive learning, random cropping is a very important data augmentation strategy. Typical contrastive learning methods, such as MoCo v3, maximize the global semantic similarity between two different crops of the same image and minimize the similarity between crops from different images.

Why does this work? The researchers first analyzed the nature of random cropping. The SimCLR paper notes that random cropping is a very important augmentation strategy in contrastive learning methods. In the ImageNet-1K dataset, the subject object mostly lies in the central region of the image, so a random crop has a high probability of containing it; in the example shown in the figure below, the several crops all essentially contain the subject object.

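For reference, the two views used in such methods are typically produced along these lines (a sketch with torchvision; the exact augmentation recipe varies by method):

```python
from torchvision import transforms

# Random resized crop is the key augmentation: because ImageNet subjects
# are mostly centered, two random crops of the same image usually both
# contain the subject object.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def two_views(image):
    return augment(image), augment(image)
```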

Extracting global semantics from different crops of the same image in effect learns the features of the subject object in the original image, which is why different crops of the same image can be made similar. In supervised pre-training, constrained by the image classification labels, the network also learns the features of the image's subject region. The knowledge learned is thus very similar to what contrastive learning acquires, so the two behave similarly on downstream tasks.

3.5 The difference between MIM and contrastive learning

MIM methods, such as CAE, predict the masked-patch regions from the representations of the visible patches. Under random masking, every patch of the image (including objects/stuff in background regions) may be involved, not just the subject region. In order to predict the masked patches, CAE learns a representation for every patch.

The study visualized the attention maps of CAE and MoCo v3. In the figure below, the first row is the original image, the second row is MoCo v3, and the third row is CAE. Red indicates higher attention values and blue lower ones. The region inside the blue boundary is obtained by sorting the attention values from largest to smallest and keeping the patches that together account for 50% of the total attention. MoCo v3's attention map responds mainly in the subject region of the image, while CAE takes almost all patches into account.

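The 50% cumulative-attention region can be computed along these lines (a sketch; `attn` is assumed to be one attention map flattened over patches):

```python
import torch

def top_attention_mask(attn: torch.Tensor, fraction: float = 0.5):
    """attn: (num_patches,) attention values for one map.
    Returns a boolean mask of the patches that, taken from largest to
    smallest, accumulate `fraction` of the total attention."""
    values, order = attn.sort(descending=True)
    cutoff = int((values.cumsum(0) <= fraction * attn.sum()).sum()) + 1
    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask[order[:cutoff]] = True
    return mask
```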

4. Experiments

The study ran experiments on ImageNet-1K using ViT-small and ViT-base. The input images have a resolution of 224 × 224 with a patch size of 16 × 16, so an image is divided into 14 × 14 = 196 patches. Each time, 75 patches are randomly masked.

4.1 Pre-training evaluation

Self-supervised learning widely uses linear probing to measure the quality of pre-trained representations: the encoder's parameters are frozen and a linear classifier is added to classify images. The researchers argue that linear probing is not suitable for MIM methods: because MIM methods learn a representation for each patch, the representation contains not only information about the subject object but also knowledge about the background and other content; it is multi-faceted and heterogeneous, and not suited to direct linear classification. The researchers therefore proposed a new evaluation protocol: attentive probing. The study adds a simple cross-attention module (without FFN) and a linear classifier on top of the frozen encoder, using the attention mechanism to dynamically select the information suited to image classification.
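A minimal sketch of such an attentive-probing head (one cross-attention layer without FFN plus a linear classifier; dimensions and names are assumptions, not the paper's exact module):

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Frozen-encoder probe: a single learnable query pools the patch
    representations via cross-attention (no FFN), then a linear classifier."""
    def __init__(self, dim: int, num_heads: int, num_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):             # (B, N, dim), encoder frozen
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.fc(pooled.squeeze(1))        # (B, num_classes)
```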

The study visualized the attention map of the cross-attention module used in the attentive probing stage and found that it focuses on the subject object.


The results of fine-tuning, linear probing, and attentive probing are shown in the table below.


The researchers found some interesting phenomena. (1) The linear probing and attentive probing results of the contrastive learning methods (MoCo v3, DINO) are similar. This shows that these methods already focus on the subject object of the image during pre-training and can classify images well without further dynamic selection, which is consistent with the earlier analysis of contrastive learning. (2) Attentive probing greatly improves over linear probing for MIM methods (e.g., CAE). This shows that MIM methods learn features for every patch, not just the image's subject object, so some selection is needed to facilitate image classification.

4.2 Ablation experiments

The study performed ablation experiments on the decoder and the alignment module, as shown in the table below. Adding the decoder alone improves the attentive probing results, but the improvement on downstream tasks (segmentation, detection) is not noticeable. Adding the alignment module significantly improves downstream performance, indicating that constraining the output of the encoder and the output of the latent contextual regressor to lie in the same encoding space is important and improves the quality of the representations the encoder learns.


4.3 Semantic Segmentation

The study ran semantic segmentation experiments on ADE20K. The network is UperNet, trained for 160K iterations with an input resolution of 512 × 512 and evaluated with a single-scale test. Contrastive learning methods and the supervised pre-training method (DeiT) achieve similar results, while CAE achieves significantly better ones. CAE also outperforms other MIM methods, indicating that the encoder is fully utilized during pre-training and the learned representations are better.


4.4 Object detection and instance segmentation

The study used the Mask R-CNN and Cascade R-CNN architectures for object detection and instance segmentation, training for 12 epochs with multi-scale augmentation and using only single-scale testing in the evaluation phase. The results mirror semantic segmentation: contrastive learning methods and supervised pre-training achieve similar (and poorer) results, while CAE performs better.


5. Summary

The study proposes CAE, whose design has two core points: (1) completely separating the two functions of "representation learning" and "solving the pretext task"; (2) making predictions for the masked patches in the representation space learned from the visible patches. Both points aim to drive the encoder to learn better representations and thus achieve good generalization on downstream tasks.

In addition, the study analyzed supervised pre-training, contrastive learning, and MIM methods, and concluded that contrastive learning and supervised pre-training focus primarily on the subject region of the image (such as objects in the ImageNet-1K label set), while MIM attends to all patches of the image, which is more conducive to downstream tasks.
