
BLIP-2: Saving Multimodal Training Costs by Freezing Pre-trained Vision and Language Model Parameters

Source丨Jishi Platform

Author丨Technology beast

Editor丨Jishi platform


Table of contents for this article

1 BLIP-2: Saving Multimodal Training Costs by Freezing Pre-trained Vision and Language Model Parameters (ICML 2023)

(from Salesforce; by the author team of ALBEF and BLIP)

1.1 Background and motivation

1.2 BLIP-2 architecture

1.3 Q-Former Training Step 1: Training Jointly with the Frozen Vision Encoder

1.4 Q-Former Training Step 2: Training Jointly with the Frozen Vision Encoder and Large Language Model

1.5 BLIP-2 pre-training method

1.6 Experimental results

1.7 Limitations of BLIP-2

TL;DR

BLIP-2 is a multimodal Transformer model that addresses the high computational cost of end-to-end training in previous Vision-Language Pre-training (VLP) models.

Intuitively, if off-the-shelf pre-trained vision and language models can be reused with their parameters frozen, a large amount of computation should be saved.

This is exactly what BLIP-2 does: it proposes an efficient vision-language pre-training method that builds on off-the-shelf frozen pre-trained image encoders and frozen large language models.

However, simply freezing the parameters of the pre-trained vision model and language model creates a problem: the visual feature space and the text feature space are hard to align. To solve this, BLIP-2 proposes a lightweight Querying Transformer (Q-Former) that is pre-trained in two stages: the first stage bootstraps vision-language representation learning from a frozen vision encoder, and the second stage bootstraps vision-to-language generative learning from a frozen large language model.

As a result, BLIP-2 achieves state-of-the-art performance on a variety of vision-language tasks while requiring significantly fewer trainable parameters.

1 BLIP-2: Saving Multimodal Training Costs by Freezing Pre-trained Vision and Language Model Parameters

Paper Title: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Paper Address:

https://arxiv.org/pdf/2301.12597.pdf

Code Address:

https://github.com/salesforce/LAVIS/tree/main/projects/blip2

Demo Address:

https://huggingface.co/spaces/Salesforce/BLIP2

1.1 Background and motivation

Research in Vision-Language Pre-training (VLP) has grown rapidly over the past few years, with researchers developing increasingly large-scale pre-training models that drive a variety of downstream tasks. However, because they rely on large models, large datasets, and end-to-end training, most state-of-the-art vision-language models incur high computational and economic costs during pre-training.

Multimodal research sits at the intersection of vision and language, so it is natural to expect that a vision-language model can be built from ready-made pre-trained unimodal models. To save computation during vision-language pre-training, the BLIP-2 proposed in this paper therefore leverages off-the-shelf pre-trained unimodal vision models and unimodal language models.

The benefit of this is that pre-trained vision models provide high-quality visual representations, while pre-trained language models, especially large language models (LLMs), provide powerful language generation and zero-shot transfer capabilities. To reduce computational cost and avoid catastrophic forgetting, the unimodal pre-trained models remain frozen during pre-training.

However, simply freezing the parameters of the pre-trained vision model and language model introduces a problem: the visual feature space and the text feature space are hard to align. The reason is that the LLM never saw any images during its unimodal pre-training, and the vision model never saw any text during its unimodal pre-training, which makes this alignment particularly challenging.

To solve this problem, BLIP-2 proposes a lightweight Querying Transformer (Q-Former), shown in Figure 1 below, which is pre-trained in two stages. Q-Former uses a set of learnable Query vectors to extract visual features from the frozen vision encoder and acts as an information bottleneck between the frozen vision encoder and the frozen LLM, passing only the key visual information on to the LLM. The first pre-training stage forces Q-Former to learn the visual representations most relevant to the text. In the second pre-training stage, vision-to-language generative learning is performed by connecting the output of Q-Former to the frozen LLM, so that the visual representations it outputs can be interpreted directly by the LLM. In this way, Q-Former can effectively leverage frozen pre-trained image models and language models.


Figure 1: The BLIP-2 model. The visual and text modalities are aligned by the proposed Q-Former

1.2 BLIP-2 architecture

BLIP-2 consists of a pre-trained vision model and a pre-trained language model, both with frozen parameters, plus the proposed trainable Q-Former. The frozen image encoder extracts visual features from the input image. The Q-Former architecture is composed of two Transformer submodules that share the same Self-Attention layers; this can be understood as the Self-Attention receiving two kinds of inputs, namely Queries and Text.

The first Transformer submodule is the Image Transformer, which interacts with the frozen image encoder to extract visual features. Its input is a set of learnable Queries, which first model dependencies among themselves through Self-Attention and then attend to the image features through Cross-Attention. Because the two Transformer submodules share the Self-Attention layers, the Queries can also interact with the text input.

The second Transformer submodule is the Text Transformer, which acts as both a text encoder and a text decoder.

Q-Former contains 188M parameters in total; its weights are initialized from BERT-Base, while the Cross-Attention layers are randomly initialized. The authors use 32 Queries, each with a dimension of 768. The Queries are trained with the pre-training objectives, which forces them to extract the visual information most relevant to the text.
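To make the structure more concrete, below is a minimal PyTorch sketch of a single Q-Former block under the assumptions stated above (32 learnable queries of dimension 768, Self-Attention shared between the query and text streams, Cross-Attention from the queries to the frozen image features). All module and variable names are illustrative and simplified, not the official LAVIS implementation (for example, the real model inserts Cross-Attention only in every other block and uses separate feed-forward layers per stream).

```python
import torch
import torch.nn as nn

class QFormerBlockSketch(nn.Module):
    """Minimal sketch of one Q-Former block (illustrative, not the LAVIS code)."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        # Shared Self-Attention over the concatenated [Queries; Text] sequence.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-Attention: only the queries read the frozen image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, text_emb, image_feats, attn_mask=None):
        # 1) Shared Self-Attention with a task-specific mask (see Figure 3).
        x = torch.cat([queries, text_emb], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask)[0]
        q, t = x[:, :queries.size(1)], x[:, queries.size(1):]
        # 2) Only the queries attend to the frozen image features.
        q = q + self.cross_attn(self.norm2(q), image_feats, image_feats)[0]
        # 3) Feed-forward (shared here for brevity).
        return q + self.ffn(self.norm3(q)), t + self.ffn(self.norm3(t))

# Example usage: 32 learnable queries of dimension 768, as described above.
block = QFormerBlockSketch()
queries = nn.Parameter(torch.randn(1, 32, 768) * 0.02)
text_emb = torch.randn(1, 8, 768)       # embedded text tokens (illustrative)
image_feats = torch.randn(1, 257, 768)  # frozen ViT features, assumed projected to 768
q_out, t_out = block(queries, text_emb, image_feats)
```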


Figure 2: The first step of Q-Former training. The architecture is composed of two Transformer submodules that share the same Self-Attention layers, which can be understood as the Self-Attention receiving two kinds of inputs, namely Queries and Text


Figure 3: Attention masks corresponding to the three objective functions

1.3 Q-Former Training Step 1: Training Jointly with the Frozen Vision Encoder

In the first step of Q-Former training, the authors connect Q-Former to the frozen image encoder and pre-train with image-text pairs. The goal of this step is to train Q-Former so that the Queries learn to extract the image information that best matches the text.

For the training objectives, the authors follow BLIP and jointly optimize three pre-training objectives that share the same input format and model parameters; each objective uses a different attention mask to control the interaction between Queries and Text.

1 Image-Text Contrastive Learning (ITC)

This is a classical objective function in multimodal pretraining, designed to align the representations of images and text to maximize their mutual information.

ITC is implemented by making positive samples (paired image-text pairs) as similar as possible and negative samples (unpaired image-text pairs) as dissimilar as possible. For Q-Former, the authors implement ITC by computing a contrastive loss between the output $Z$ of the Queries and the output $t$ of the [CLS] token of the Text Transformer. Because $Z$ contains the outputs of multiple Queries, the authors first compute the pairwise similarity between each Query output and $t$, and then take the highest one as the final image-text similarity. ITC's attention mask, shown on the far right of Figure 3, is the uni-modal Self-Attention mask: the Queries and the Text are not allowed to see each other (their mutual attention is masked out).
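A minimal sketch of the similarity computation just described: each of the 32 query outputs is compared with the text [CLS] feature, and the maximum over the queries is taken as the image-text similarity before a contrastive (InfoNCE-style) loss. The shapes, the temperature value, and the simple in-batch formulation are illustrative assumptions; $Z$ and $t$ are assumed to be already projected into a shared embedding space.

```python
import torch
import torch.nn.functional as F

def itc_loss_sketch(Z, t, temperature=0.07):
    """Image-text contrastive loss sketch (illustrative).

    Z: (B, 32, D) projected query outputs from the Image Transformer.
    t: (B, D)     projected [CLS] output of the Text Transformer.
    """
    Z = F.normalize(Z, dim=-1)
    t = F.normalize(t, dim=-1)
    # Pairwise similarity between every query of every image and every text,
    # then take the max over the 32 queries as the image-text similarity.
    sim = torch.einsum("iqd,jd->ijq", Z, t).max(dim=-1).values / temperature  # (B, B)
    targets = torch.arange(Z.size(0), device=Z.device)
    # Symmetric cross-entropy over in-batch negatives (a common simple setup).
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
```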

2 Image-Grounded Text Generation (ITG)

ITG trains Q-Former to generate a text description corresponding to a given input image.

To achieve this, there would normally be direct interaction between the vision encoder and the text decoder. The Q-Former architecture does not allow such direct interaction, so the Queries play the role of extracting the information needed to generate the text and passing it to the Text tokens through the shared Self-Attention layers. In short, the Queries must be able to extract visual features that capture all of the textual information. Accordingly, ITG's attention mask, shown in the middle of Figure 3, is the multi-modal causal Self-Attention mask: the Text tokens can see the Queries (which carry the visual information), while each Text token can only see the Text tokens before it (the standard setup for generative tasks). The Queries, however, are not allowed to see the Text and can only see each other. In addition, the authors replace the [CLS] token with a new [DEC] token as the first Text token to signal the decoding task.
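To illustrate the three mask types in Figure 3, here is a small helper that builds boolean masks over the concatenated [Queries; Text] sequence: bi-directional for ITM, multi-modal causal for ITG, and uni-modal for ITC. The convention (True = attention blocked) matches PyTorch's `attn_mask`, so such a mask could be passed to a block like the earlier sketch; the function name is illustrative.

```python
import torch

def build_attention_mask(num_queries, num_text, objective):
    """Return a boolean mask (True = attention blocked) over [Queries; Text]."""
    L = num_queries + num_text
    mask = torch.zeros(L, L, dtype=torch.bool)
    if objective == "itm":
        # Bi-directional: queries and text fully see each other.
        pass
    elif objective == "itg":
        # Multi-modal causal: text sees all queries and earlier text tokens;
        # queries see only other queries.
        mask[:num_queries, num_queries:] = True  # queries cannot see text
        causal = torch.triu(torch.ones(num_text, num_text, dtype=torch.bool), diagonal=1)
        mask[num_queries:, num_queries:] = causal  # causal mask within the text
    elif objective == "itc":
        # Uni-modal: queries and text cannot see each other at all.
        mask[:num_queries, num_queries:] = True
        mask[num_queries:, :num_queries] = True
    else:
        raise ValueError(objective)
    return mask

# Example: 32 queries and 8 text tokens under the ITG (generation) mask.
itg_mask = build_attention_mask(32, 8, "itg")
```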

3 Image-Text Matching (ITM)

This is a classical objective function in multimodal pre-training, designed to align the representations of images and text at a finer granularity to maximize their mutual information. ITM is a binary classification task that requires the model to predict whether an image-text pair is a positive sample (matched) or a negative sample (mismatched). ITM's attention mask, shown on the far left of Figure 3, is the bi-directional Self-Attention mask, which allows the Queries and the Text to see each other.

The output $Z$ of the Queries therefore captures multimodal information. Each output query embedding is fed into a two-class linear classifier to obtain a logit, and the logits are averaged to produce the final matching score.

The ITM objective function of BLIP-2 also uses the hard negative mining strategy in ALBEF.
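As a sketch of the ITM head described above, each of the 32 output query embeddings goes through a two-class linear classifier and the logits are averaged to give the match score. Hard negative mining (as in ALBEF) would be used to choose the mismatched pairs fed to this head; the class and variable names below are illustrative.

```python
import torch
import torch.nn as nn

class ITMHeadSketch(nn.Module):
    """Two-class linear classifier applied to every output query embedding."""

    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)  # (mismatch, match)

    def forward(self, Z):
        # Z: (B, 32, D) multimodal query outputs computed with the
        # bi-directional mask (queries and text see each other).
        logits = self.classifier(Z)          # (B, 32, 2)
        return logits.mean(dim=1)            # average over the 32 queries -> (B, 2)

itm_head = ITMHeadSketch()
score = itm_head(torch.randn(4, 32, 768))    # logits for 4 image-text pairs
```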

1.4 Q-Former Training Step 2: Training Jointly with the Frozen Vision Encoder and Large Language Model

In the generative pre-training stage, the authors connect Q-Former to an LLM with frozen parameters to take advantage of the LLM's text generation capability. First, the input image is fed into the frozen image encoder to obtain the image representation. The image representation and the Queries are then fed into Q-Former to obtain the Query output $Z$, which is projected by a fully connected layer to match the dimension of the LLM's text token embeddings and then fed into the LLM. This Query output contains the visual information and acts as a soft visual prompt for the LLM.

After the stage-1 training, the Queries have learned how to extract the image information that best matches the text, so they can effectively provide the most useful visual information to the LLM while filtering out irrelevant visual information. This reduces the burden on the LLM to learn vision-language alignment.

The authors tried two kinds of large language models: one based on a decoder-only architecture and one based on an encoder-decoder architecture. Decoder-only models are trained with the language modeling objective: the frozen LLM must generate the text conditioned on the visual representation provided by Q-Former. For encoder-decoder models, the text is split into two segments: the prefix is fed into the LLM's encoder together with the Query output, and the decoder is expected to generate the suffix.
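A minimal sketch of the stage-2 wiring for a decoder-only LLM: the query outputs are projected by a fully connected layer to the LLM's embedding size and prepended to the text embeddings as a soft visual prompt, while the LLM itself stays frozen. The dimensions and the freezing pattern are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftVisualPromptSketch(nn.Module):
    """Project Q-Former query outputs into the frozen LLM's embedding space."""

    def __init__(self, qformer_dim=768, llm_dim=2560):
        # llm_dim=2560 is an illustrative value (roughly an OPT-2.7B hidden size);
        # in practice it depends on the chosen LLM.
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, Z, text_embeds):
        # Z: (B, 32, qformer_dim) query outputs; text_embeds: (B, T, llm_dim)
        visual_prompt = self.proj(Z)                      # (B, 32, llm_dim)
        # Prepend the soft visual prompt to the text embeddings. A frozen
        # decoder-only LLM would then be trained with a language-modeling loss
        # on the text positions only (the 32 prompt positions are ignored).
        return torch.cat([visual_prompt, text_embeds], dim=1)

# Only Q-Former and this projection receive gradients; the LLM stays frozen,
# e.g. for p in llm.parameters(): p.requires_grad = False   (llm is hypothetical)
```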


Figure 4: Step 2 of Q-Former training

1.5 BLIP-2 pre-training method

Pre-training dataset

Following BLIP, the authors use the following six datasets, totaling about 129M images.

  • Conceptual Captions
  • SBU Captions
  • COCO
  • Visual Genome
  • Conceptual 12M (noisier; some of its images are no longer available)
  • A 115M-image subset of the LAION400M web dataset, whose paired text is noisier

The CapFilt method proposed in BLIP is used to create synthetic captions for the web images.

Pre-trained vision model and LLM

The vision models used are:

  1. CLIP[1] trained ViT-L/14
  2. EVA-CLIP[2] trained ViT-g/14

The LLMs used are:

  1. OPT[3]
  2. FlanT5[4]

1.6 Experimental results

Let's first look at the results on several zero-shot instructed image-to-text generation tasks, including visual knowledge reasoning, visual commonsense reasoning, visual conversation, and personalized image-to-text generation, as shown in Figures 5 and 6 below. The text instruction is simply appended after the visual prompt as input to the LLM.


Figure 5: Results of zero-shot instructed image-to-text generation


Figure 6: More results of zero-shot instructed image-to-text generation

Zero-Shot visual language task

Figure 7 below provides an overview of BLIP-2's performance on various zero-shot vision-language tasks. Compared with previous state-of-the-art models, BLIP-2 achieves better performance while requiring far fewer trainable parameters during vision-language pre-training.


Figure 7: BLIP-2 performance on a variety of Zero-Shot visual language tasks

Zero-shot VQA task

The authors performed a quantitative evaluation on the zero-shot VQA task. For OPT language models, the prompt is set to "Question: {} Answer:"; for FlanT5 models, the prompt is set to "Question: {} Short answer:". BLIP-2 achieves state-of-the-art results on the VQAv2 and GQA datasets, outperforming Flamingo80B on VQAv2 with 54x fewer trainable parameters. On the OK-VQA dataset, BLIP-2 does not beat Flamingo80B. Figure 8 also shows that stronger vision models and stronger language models both contribute to the performance gains.
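Concretely, the two prompt templates quoted above can be reproduced with a small helper like the following (the question text in the usage example is made up):

```python
def vqa_prompt(question: str, llm_family: str) -> str:
    """Build the zero-shot VQA prompt for the two LLM families used in BLIP-2."""
    if llm_family == "opt":
        return f"Question: {question} Answer:"
    if llm_family == "flan-t5":
        return f"Question: {question} Short answer:"
    raise ValueError(llm_family)

print(vqa_prompt("What is the man holding?", "opt"))
# -> "Question: What is the man holding? Answer:"
```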


Figure 8: Experimental results of the Zero-shot VQA task

The impact of visual language representation learning

The first stage of BLIP-2's representation learning pre-trains Q-Former to learn visual features related to the text, which reduces the burden on the LLM to learn vision-language alignment. Figure 9 below shows the impact of this first stage on generative learning. Without representation learning, both types of LLM perform much worse on zero-shot VQA. Moreover, OPT suffers from catastrophic forgetting, with performance dropping dramatically as training progresses.


Figure 9: Impact of visual language representation learning

Image caption experiment results

The image captioning task requires the model to generate a text description of an image's visual content; the experimental results are shown in Figure 10 below. The authors use the prompt "a photo of" as the initial input to the LLM and train the model to generate captions with the language modeling loss. The LLM is kept frozen during fine-tuning, while the parameters of the Q-Former and the image encoder are updated. Fine-tuning uses COCO, and evaluation is performed on both the COCO test set and the NoCaps validation set. BLIP-2 achieves significant improvements over existing methods on NoCaps, reaching state-of-the-art performance and showing strong generalization to out-of-domain images.
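As a rough sketch of this fine-tuning setup (LLM frozen, Q-Former and image encoder updated, captions generated from the prompt "a photo of" with a language modeling loss), the parameter-freezing step might look like the following. The `model.llm`, `model.qformer`, and `model.image_encoder` attribute names are hypothetical, used only for illustration, not the official LAVIS API.

```python
def setup_caption_finetuning(model):
    """Freeze the LLM; make the Q-Former and the image encoder trainable."""
    for p in model.llm.parameters():
        p.requires_grad = False          # LLM stays frozen during fine-tuning
    for module in (model.qformer, model.image_encoder):
        for p in module.parameters():
            p.requires_grad = True       # updated with the language modeling loss
    # Captions are then generated starting from the prompt "a photo of",
    # optimizing the language modeling loss on the caption tokens.
```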


Figure 10: Image caption experiment results

Visual question answering experiment results

Given annotated VQA data, the authors fine-tune the parameters of the Q-Former and the image encoder while keeping the LLM frozen. The VQA fine-tuning architecture is shown in Figure 11: the LLM receives the Q-Former output and the question as input and is expected to generate the answer. To extract image features more relevant to the question, the authors additionally feed the question into Q-Former, where it interacts with the learnable Queries through the shared Self-Attention layers.


Figure 11: VQA fine-tuned model architecture

The experimental results are shown in Figure 12 below. Among open-ended generation models, BLIP-2 achieves the best performance.


Figure 12: Visual question-and-answer experiment results

Image-text retrieval experiment results

Image-text retrieval does not involve language generation, so the authors directly fine-tune the stage-1 pre-trained model. They use the same objectives as in pre-training, namely ITC, ITM, and ITG, fine-tuning the Q-Former and the image encoder on COCO. Evaluation is performed on COCO and Flickr30K for both image-to-text retrieval and text-to-image retrieval. The results are shown in Figure 13: BLIP-2 achieves state-of-the-art zero-shot image-text retrieval performance, a significant improvement over existing methods.


Figure 13: Results of the image and text retrieval experiment

1.7 Limitations of BLIP-2

LLMs generally have in-context learning ability, but BLIP-2 does not show good results in the in-context VQA setting. The authors attribute this lack of in-context learning ability to the pre-training dataset: each sample contains only a single image-text pair, so the LLM cannot learn correlations among multiple image-text pairs within one sequence.

BLIP-2's image-to-text generation can also be unsatisfactory, possibly because of inaccurate knowledge in the LLM. At the same time, BLIP-2 inherits the risks of the frozen LLM, such as producing offensive language or propagating social bias. Possible remedies include instruction tuning and filtering harmful data from the training set.

References

  1. Learning Transferable Visual Models From Natural Language Supervision
  2. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
  3. OPT: Open Pre-trained Transformer Language Models
  4. Scaling Instruction-Finetuned Language Models
