
Stronger than MAE: FAIR's new method MaskFeat refreshes multiple SOTA results with HOG

Selected from arXiv

Authors: Chen Wei et al.

Compiled by Machine Heart

The mask-and-predict approach could become a new paradigm in the field of computer vision.

Self-supervised pre-training has been remarkably successful in natural language processing, and one of its core ideas is the masked prediction task. Some time ago, Kaiming He's paper "Masked Autoencoders Are Scalable Vision Learners" proposed MAE, a simple and practical self-supervised learning scheme that applies the NLP-style mask-and-predict approach to visual problems. Now, a research team from Facebook AI Research (FAIR) has proposed MaskFeat, a new method for self-supervised visual pre-training.

Paper address: https://arxiv.org/pdf/2112.09133.pdf

MaskFeat first randomly masks a portion of the input sequence and then predicts the features of the masked regions. By studying five different types of features, the researchers found that the histogram of oriented gradients (HOG) is a particularly good feature descriptor, performing well in both accuracy and efficiency. They also observed that the local contrast normalization in HOG is critical to getting good results, which is consistent with earlier work using HOG for visual recognition.

This approach can learn rich visual knowledge and drive large transformer-based models. Without using any additional model weights or supervision, MaskFeat pre-trained on unlabeled video achieves an unprecedented 86.7% top-1 accuracy on Kinetics-400 with MViT-L. MaskFeat can also be extended to image input, where it achieves competitive results on ImageNet.

Method

The masked visual prediction task is designed to restore masked visual content. By modeling masked samples, the model achieves video understanding in the sense of recognizing objects' parts and motion. For example, to complete a masked image, the model must first identify the objects from the visible regions and must also know how objects typically look and move in order to fill in the missing regions.
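To make the setup concrete, the sketch below draws a random patch mask for a batch of samples. This is a hypothetical helper written in PyTorch, not code from the paper; MaskFeat masks space-time cubes of a video clip (or patches of an image), and the 40% mask ratio here is only an example value.

```python
import torch

# Hypothetical helper: draw a random boolean mask over patch tokens.
# MaskFeat masks space-time cubes of a video (or patches of an image);
# the 40% ratio here is just an example value, not the paper's setting.
def random_mask(batch: int, num_patches: int, mask_ratio: float = 0.4) -> torch.Tensor:
    """Return a (batch, num_patches) bool tensor, True = masked."""
    num_masked = int(num_patches * mask_ratio)
    scores = torch.rand(batch, num_patches)       # one random score per patch
    idx = scores.argsort(dim=1)[:, :num_masked]   # indices of patches to mask
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

mask = random_mask(batch=2, num_patches=196)
print(mask.sum(dim=1))  # 78 masked patches per sample
```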


A key component of the task is the prediction target. In natural language processing, masked language modeling uses a corpus tokenized over a predefined vocabulary as the target. In vision, however, the raw visual signal is continuous and high-dimensional, and there is no natural "vocabulary" available.

MaskFeat therefore proposes to predict features of the masked regions, using features extracted from the original, intact sample as supervision. The choice of target features greatly influences the properties of the pre-trained model; the study takes a broad view of what counts as a feature and mainly considers five different types of target features.
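In code, the objective can be sketched as follows: masked patch embeddings are replaced with a learnable mask token, the full sequence is encoded by a transformer, and a linear head regresses the target features, with the loss computed only at masked positions. This is a minimal sketch assuming a generic patch-token encoder; the class and argument names are hypothetical, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class MaskFeatObjective(nn.Module):
    """Minimal sketch of the masked feature prediction objective."""

    def __init__(self, encoder: nn.Module, embed_dim: int, target_dim: int):
        super().__init__()
        self.encoder = encoder                        # any (B, N, D) -> (B, N, D) transformer
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, target_dim)  # predicts target features (e.g. HOG)

    def forward(self, patch_tokens, targets, mask):
        # patch_tokens: (B, N, D) embedded patches of the input sample
        # targets:      (B, N, T) features extracted from the intact sample
        # mask:         (B, N) bool, True where a patch is masked
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(patch_tokens),
                        patch_tokens)                 # swap in the mask token
        pred = self.head(self.encoder(x))             # (B, N, T)
        return ((pred - targets) ** 2)[mask].mean()   # loss on masked positions only
```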


First, the researchers divided the target features into two groups: 1) single-stage targets that can be obtained directly, including pixel colors and HOG; and 2) two-stage targets extracted by a trained deep network. Since predicting two-stage targets effectively learns from a trained deep network (similar to model distillation), the extra computational cost of pre-training the teacher model and running its inference is unavoidable. The five types of features explored in the study are:

Pixel color;

Histogram of oriented gradients (HOG);

Discrete Variational Autoencoder (dVAE);

Deep features;

Pseudo-labels.

The study explored the pros and cons of these five types of features through a series of analyses. Although masked language modeling was originally designed to predict a categorical distribution over a predefined vocabulary, the analysis shows that the discretization used in BEiT is not required for visual signals. The results indicate that continuous unsupervised features and image descriptors are both good prediction targets; the former requires model distillation, while the latter incurs no extra computational overhead.


In addition, the researchers found that features from supervised training produce worse results, possibly because of the class-specific information present in such features, which is too global for local masked modeling. Overall, weighing performance against computational cost, the study chose HOG as the default feature for MaskFeat.

The histogram of oriented gradients (HOG) is a feature descriptor used for object detection in computer vision and image processing, first proposed in the 2005 CVPR paper "Histograms of Oriented Gradients for Human Detection".


HOG feature extraction proceeds as follows: the sample image is first divided into small pixel cells, and the range of gradient directions is evenly divided into several orientation bins; for each cell, a histogram of the gradient directions of all its pixels is accumulated over these bins, yielding a feature vector per cell. Adjacent cells are then grouped into blocks, and the cell vectors within a block are concatenated and contrast-normalized. The image is scanned block by block with a stride of one cell, and finally the features of all blocks are concatenated to obtain the complete descriptor.
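As an illustration, the snippet below computes such a descriptor with scikit-image's hog function. The parameter values (9 orientation bins, 8×8-pixel cells, 2×2-cell blocks, L2 block normalization) are common defaults for this pipeline, not necessarily the exact settings used in the paper.

```python
from skimage import data
from skimage.feature import hog

image = data.astronaut()[..., 0]   # any single-channel image

# 9 evenly divided orientation bins, 8x8-pixel cells, 2x2-cell blocks,
# per-block L2 contrast normalization (the local normalization the
# paper identifies as critical), scanned with a one-cell stride.
features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm='L2',
)
print(features.shape)  # concatenated block descriptors as one flat vector
```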

Video recognition experiments

The study compared MaskFeat with previous work on the Kinetics-400 (K400) dataset: MViT-L trained with MaskFeat achieved a new SOTA of 86.7% top-1 accuracy.


Transfer learning

To evaluate the transfer learning performance of the method on downstream tasks, the study fine-tuned the K400 pre-trained model on Kinetics-600 and Kinetics-700. As shown in Tables 3 and 4, MViT-L↑312, 40×3 achieved 88.3% top-1 accuracy on K600 and 80.4% on K700, both new SOTA results.


The study also fine-tuned the MViT-L↑312, 40×3 Kinetics model on AVA v2.2. Table 5 shows the mean average precision (mAP) of the MaskFeat model compared with existing methods: MaskFeat achieved an unprecedented 38.8 mAP in full-resolution testing, far exceeding all previous methods.


Interested readers can read the original text of the paper to learn more about the research details.
