
Working on multimodality and not caught up on the latest progress? The first survey of vision-language pre-training, written by the Institute of Automation, Chinese Academy of Sciences

Selected from arXiv

Authors: Feilong Chen et al.

Compiled by Machine Heart

Editor: Chen Ping

One article to catch up on the latest advances and new frontiers of vision-language pre-training.

Getting machines to react the way humans do has been a long-standing goal of AI research. To give machines the ability to perceive and think, researchers have pursued a series of related studies, such as face recognition, reading comprehension, and human-machine dialogue, which are used to train and evaluate machine intelligence in specific respects. Typically, domain experts build standard datasets by hand and then train and evaluate models on them. However, due to the limitations of the underlying techniques, training a strong model often requires a large amount of labeled data.

Pre-trained models based on the Transformer architecture alleviate this problem. They are first pre-trained with self-supervised learning on large-scale unlabeled data to learn a universal representation, and then achieve surprisingly strong results on downstream tasks after fine-tuning on only a small amount of manually labeled data. Since BERT was applied to NLP tasks, various pre-trained models have developed rapidly in unimodal domains, such as Vision Transformer (ViT) and Wav2Vec. A large body of work shows that they benefit downstream unimodal tasks and avoid training new models from scratch.

Like the unimodal domains, the multimodal domain also suffers from a scarcity of high-quality labeled data. A natural question is whether the above pre-training methods can be applied to multimodal tasks. Researchers have explored this question and made significant progress.

In this paper, researchers from the Institute of Automation, Chinese Academy of Sciences and the University of Chinese Academy of Sciences survey the latest advances and new frontiers of vision-language pre-training (VLP), including image-text pre-training and video-text pre-training. VLP learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, the model is expected to associate the word "dog" in text with the appearance of a dog in an image; in video-text pre-training, the model is expected to map objects/actions in text to objects/actions in video.


Address of the paper: https://arxiv.org/pdf/2202.09061.pdf

To achieve this, researchers need to carefully design VLP objectives and model architectures so that the model can mine the associations between different modalities.

To give readers a better and more comprehensive grasp of VLP, the survey first reviews recent progress from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. It then summarizes specific VLP models in detail, and finally discusses new frontiers of VLP. To the authors' knowledge, this is the first survey of the VLP field, and they hope it will shed light on future research in the area.

VLP Review

A review of recent VLP progress from five aspects

In terms of feature extraction: The paper describes how VLP models preprocess and represent images, videos, and text to obtain the corresponding features.

To take advantage of unimodal pre-trained models, VLP can randomly initialize a standard transformer encoder to generate visual or textual representations. On the visual side, VLP encodes patch features (ViT-PF) with pre-trained vision transformers such as ViT and DeiT; on the text side, it encodes text features with pre-trained text transformers such as BERT. For simplicity, the study refers to these transformers as Xformer.
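
As a rough illustration of this recipe, the sketch below (a minimal PyTorch example, not code from the survey) turns an image into ViT-style patch embeddings and text into token embeddings before they are passed to an "Xformer"; all sizes and module names are illustrative assumptions.

```python
# Illustrative sketch: images -> ViT-style patch embeddings, text -> token embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to a d_model vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        # A strided convolution is the standard trick for patchify + linear projection.
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                       # images: (B, 3, 224, 224)
        x = self.proj(images)                        # (B, d_model, 14, 14)
        return x.flatten(2).transpose(1, 2)          # (B, 196, d_model) patch tokens

class TextEmbedding(nn.Module):
    """Map token ids to embeddings; in practice these would come from a pre-trained BERT."""
    def __init__(self, vocab_size=30522, d_model=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids):                    # token_ids: (B, L)
        return self.tok(token_ids)                   # (B, L, d_model)

visual = PatchEmbedding()(torch.randn(2, 3, 224, 224))        # (2, 196, 768)
textual = TextEmbedding()(torch.randint(0, 30522, (2, 32)))   # (2, 32, 768)
```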

In terms of model architecture: The paper introduces VLP model architectures from two perspectives: (1) single-stream versus dual-stream architectures, from the viewpoint of multimodal fusion, and (2) encoder-only versus encoder-decoder designs, from the viewpoint of the overall architecture.

A single-stream architecture combines the text and visual features and feeds them into a single transformer block, as shown in Figure 1 (a) below. It fuses the multimodal inputs by applying attention over the combined sequence, and it is more parameter-efficient because both modalities share the same set of parameters.

A dual-stream architecture does not combine the text and visual features; instead, they are fed independently into two different transformer blocks, as shown in Figure 1 (b). The two transformer blocks do not share parameters. For higher performance, cross-attention (shown by the dashed lines in Figure 1 (b)) is used to enable cross-modal interaction. For higher efficiency, the cross-attention between the visual transformer block and the text transformer block can also be omitted.

Figure 1: Single-stream (a) and dual-stream (b) VLP architectures.
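
The contrast between the two fusion schemes can be sketched in a few lines of PyTorch; the layer choices, dimensions, and the single cross-attention step below are illustrative assumptions rather than the exact components of any particular VLP model.

```python
# Illustrative contrast of the two fusion schemes in Figure 1.
import torch
import torch.nn as nn

d_model, n_heads = 768, 12

# (a) Single stream: concatenate text and visual tokens and share one transformer,
#     so self-attention mixes the two modalities with a single set of parameters.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)

text  = torch.randn(2, 32, d_model)     # text token embeddings
patch = torch.randn(2, 196, d_model)    # visual patch embeddings
fused_single = single_stream(torch.cat([text, patch], dim=1))    # (2, 228, d_model)

# (b) Dual stream: separate transformer blocks per modality (no shared parameters),
#     with optional cross-attention (the dashed lines in Figure 1 (b)) for interaction.
text_block  = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
image_block = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
cross_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

t, v = text_block(text), image_block(patch)
# Text queries attend over visual keys/values; dropping this step gives the
# cheaper "no cross-attention" variant mentioned above.
t_fused, _ = cross_attn(query=t, key=v, value=v)                  # (2, 32, d_model)
```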

Many VLP models adopt an encoder-only architecture, in which the different modal representations are fed directly into an output layer. In contrast, other VLP models advocate a transformer encoder-decoder architecture, in which the different modal representations are first fed into a decoder and then into an output layer.
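
A minimal sketch of the two overall designs, with illustrative layer choices and sizes (not taken from any specific model):

```python
# Encoder-only vs. encoder-decoder, sketched with standard PyTorch modules.
import torch
import torch.nn as nn

d_model = 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, 12, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, 12, batch_first=True), num_layers=2)
output_layer = nn.Linear(d_model, 30522)          # e.g. a vocabulary-sized head

multimodal_tokens = torch.randn(2, 228, d_model)  # concatenated text + patch tokens
memory = encoder(multimodal_tokens)

# Encoder-only: the output layer sits directly on the encoder representations.
logits_enc_only = output_layer(memory)

# Encoder-decoder: a target sequence attends to the encoder memory before the head,
# which is convenient for generation tasks.
tgt = torch.randn(2, 20, d_model)                 # e.g. shifted caption embeddings
logits_enc_dec = output_layer(decoder(tgt, memory))
```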

In terms of pre-training objectives: The paper summarizes the objectives used to pre-train VLP models into four categories: completion, matching, temporal, and particular types.

Completion refers to reconstructing the masked elements from the unmasked parts. Take masked language modeling (MLM), first proposed by Taylor and made widely known as a pre-training task by BERT. MLM in VLP models is similar to MLM in pre-trained language models (PLMs), except that the masked text tokens can be predicted not only from the remaining text tokens but also from the visual tokens. As a rule of thumb, VLP models following BERT randomly mask each text input token with a masking rate of 15%, replacing a masked token with the special token [MASK] 80% of the time, with a random text token 10% of the time, and keeping the original token the remaining 10% of the time. However, in the paper "Should You Mask 15% in Masked Language Modeling?", Chen Danqi et al. found that under an efficient pre-training scheme they could mask 40-50% of the input text and obtain better downstream performance than with the default 15%.
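
A minimal sketch of this masking rule (the token-id constants are illustrative, and this is not the survey's code):

```python
# BERT-style masking: mask 15% of tokens; of those, 80% -> [MASK], 10% -> random, 10% -> keep.
import torch

MASK_ID, VOCAB_SIZE, MASK_RATE = 103, 30522, 0.15

def mask_tokens(token_ids):
    labels = token_ids.clone()
    # Choose which positions to mask (15% here; the cited paper by Chen Danqi et al.
    # argues that 40-50% can work better with an efficient pre-training scheme).
    masked = torch.rand_like(token_ids, dtype=torch.float) < MASK_RATE
    labels[~masked] = -100                          # ignored by cross-entropy (default ignore_index)

    inputs = token_ids.clone()
    rand = torch.rand_like(token_ids, dtype=torch.float)
    inputs[masked & (rand < 0.8)] = MASK_ID         # 80%: replace with [MASK]
    random_ids = torch.randint(VOCAB_SIZE, token_ids.shape)
    swap = masked & (rand >= 0.8) & (rand < 0.9)    # 10%: replace with a random token
    inputs[swap] = random_ids[swap]
    # remaining 10%: keep the original token unchanged
    return inputs, labels

inputs, labels = mask_tokens(torch.randint(0, VOCAB_SIZE, (2, 32)))
```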

Masked visual modeling (MVM), like MLM, samples visual (image or video) regions or patches and typically masks their visual features with a probability of 15%. The VLP model must reconstruct the masked visual features given the remaining visual features and all the text features.
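
A correspondingly minimal sketch of MVM as a feature-regression objective, under the assumption that the model's outputs are produced from the masked input (actual VLP models differ in how they define the reconstruction target):

```python
# Mask ~15% of patch features and regress the original features at the masked positions.
import torch
import torch.nn.functional as F

def mvm_loss(patch_feats, model_outputs, mask_rate=0.15):
    # patch_feats / model_outputs: (B, num_patches, d_model)
    masked = torch.rand(patch_feats.shape[:2]) < mask_rate   # (B, num_patches) boolean mask
    target = patch_feats[masked]                             # original visual features
    pred = model_outputs[masked]                             # features reconstructed by the model
    return F.mse_loss(pred, target)                          # simple regression objective
```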

Visual-language matching (VLM) is the most commonly used pre-training objective for aligning vision and language. In single-stream VLP models, the researchers use the representation of the special token [CLS] as the fused representation of the two modalities. In dual-stream VLP models, they concatenate the visual representation of the special visual token [CLS_V] and the text representation of the special text token [CLS_T] as the fused representation. The VLP model feeds the fused representation to a fully connected (FC) layer followed by a sigmoid function to predict a score between 0 and 1, where 0 indicates that the vision and language are mismatched and 1 indicates that they match. During training, the VLP model samples positive or negative pairs from the dataset at each step.
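
The scoring head described here can be sketched as follows; the dimensions and the binary cross-entropy training step are illustrative assumptions.

```python
# VLM head: fused representation -> FC layer -> sigmoid -> match score in [0, 1].
import torch
import torch.nn as nn

class VLMHead(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.fc = nn.Linear(d_model, 1)

    def forward(self, fused_rep):        # (B, d_model): [CLS] output in a single-stream model,
        return torch.sigmoid(self.fc(fused_rep)).squeeze(-1)  # or concatenated [CLS_V];[CLS_T]

head = VLMHead()
fused = torch.randn(4, 768)                       # e.g. two positive and two negative pairs
labels = torch.tensor([1., 1., 0., 0.])           # 1 = match, 0 = mismatch
loss = nn.functional.binary_cross_entropy(head(fused), labels)
```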

In terms of pre-training datasets: Most datasets for VLP are built by combining public datasets across multimodal tasks. Some of the mainstream corpora and their details are shown in Table 1 below.

Table 1: Mainstream pre-training corpora and their details.

In terms of downstream tasks: A wide variety of tasks require a fusion of visual and linguistic knowledge. This section of the paper describes the basic details and goals of such tasks and divides them into five categories: classification, regression, retrieval, generation, and other tasks, where the classification, regression, and retrieval tasks are also known as understanding tasks.

Classification tasks include visual question answering (VQA), visual reasoning and compositional question answering (GQA), video-language inference (VLI), natural language for visual reasoning (NLVR), visual commonsense reasoning (VCR), and more. In VQA, given an image or video as visual input, the model predicts the most appropriate answer from a candidate pool, so the task is usually treated as classification. GQA can be regarded as an upgraded version of VQA, designed to advance research on visual reasoning in natural scenes. In VLI, given a video clip with aligned subtitles as a premise, paired with a natural-language hypothesis based on the video content, the model must infer whether the hypothesis is contradicted by the given video clip.
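
Treating VQA as classification can be sketched as a small answer-pool head on top of the fused question-image representation; the layer choices are assumptions (3,129 is a commonly used answer-pool size for VQAv2).

```python
# VQA as classification: score the fused representation against a fixed pool of answers.
import torch
import torch.nn as nn

class VQAClassifier(nn.Module):
    def __init__(self, d_model=768, num_answers=3129):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model * 2), nn.GELU(), nn.Linear(d_model * 2, num_answers))

    def forward(self, fused_rep):          # (B, d_model) fused question + image vector
        return self.head(fused_rep)        # (B, num_answers) logits over the answer pool

logits = VQAClassifier()(torch.randn(2, 768))
answer = logits.argmax(dim=-1)             # index of the predicted answer
```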

Among regression tasks, multimodal sentiment analysis (MSA) aims to detect sentiment in videos using multimodal signals such as vision and language. It predicts the affective orientation of an utterance as a continuous intensity variable.

Among retrieval tasks, visual-language retrieval (VLR) requires understanding vision (image or video) and language together through an appropriate matching strategy. It includes two subtasks, visual-to-text retrieval and text-to-visual retrieval, where visual-to-text retrieval fetches the most relevant text description from a large pool of descriptions given a visual input, and vice versa.
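
Retrieval in a shared embedding space can be sketched as ranking by cosine similarity; the random embeddings below simply stand in for the outputs of the visual and text encoders.

```python
# Rank candidates in both directions by cosine similarity between embeddings.
import torch
import torch.nn.functional as F

visual_emb = F.normalize(torch.randn(5, 768), dim=-1)     # 5 images/videos
text_emb   = F.normalize(torch.randn(100, 768), dim=-1)   # pool of 100 descriptions

sim = visual_emb @ text_emb.t()                   # (5, 100) cosine similarities
top5_texts = sim.topk(k=5, dim=-1).indices        # visual-to-text retrieval
top5_visuals = sim.t().topk(k=5, dim=-1).indices  # text-to-visual retrieval
```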

Among generation tasks, visual captioning (VC) aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input. In addition, the paper describes other downstream tasks such as multimodal machine translation (MMT), vision-language navigation (VLN), and optical character recognition (OCR).

SOTA VLP models

Image-text VLP models. VisualBERT, known as the first image-text pre-trained model, uses Faster R-CNN to extract visual features, concatenates the visual features with the text embeddings, and feeds the concatenated features into a single transformer initialized with BERT. Many VLP models follow feature extraction and architectures similar to VisualBERT's while adjusting the pre-training objectives and pre-training datasets. Recently, VLMO leveraged image patch embeddings and text word embeddings, feeding the combined embeddings into a single transformer together with modality experts, and achieved impressive performance. METER explores how to use unimodal pre-trained models and proposes a dual-stream architecture to handle multimodal fusion, achieving SOTA performance on many downstream tasks.

Video-text VLP models. VideoBERT, known as the first video-text pre-trained model, extends the BERT model to handle video and text simultaneously. VideoBERT uses a pre-trained ConvNet and S3D to extract video features, concatenates them with text word embeddings, and feeds them into a transformer initialized with BERT. The ConvNet and S3D are frozen while VideoBERT is trained, which means the method is not end-to-end. More recently, inspired by ViT, Frozen and Region-Learner first process video clips into frames and obtain patch embeddings the way ViT processes each frame. Both optimize their models end-to-end and achieve SOTA performance.

Table 2 below summarizes more of the existing mainstream VLP models:

Table 2: Summary of existing mainstream VLP models.

Looking ahead, building on existing work, the researchers hope that VLP can be further developed in the following directions:

Incorporating acoustic information: previous multimodal pre-training studies mostly emphasize the joint modeling of language and vision while ignoring the information hidden in audio;

Knowledge learning and cognition: although existing VLP models have achieved remarkable performance, they essentially fit large-scale multimodal datasets; making VLP models more knowledgeable is important for future VLP;

Prompt tuning: by designing discrete or continuous prompts and using MLM for specific downstream tasks, these models can reduce the computational cost of fine-tuning a large number of parameters and bridge the gap between pre-training and fine-tuning.
