
Author | Zou Yuexian
Compiled by | Victor
Editor | Twilight
Pre-trained models have attracted a lot of attention from academia and industry in natural language processing and computer vision. Models pre-trained on large-scale unsupervised data generalize very well, and fine-tuning on only a small amount of labeled data is enough to improve performance on the corresponding downstream task. But how is the research progressing? What questions still need to be explored?
On December 16, 2021, Zou Yuexian, Deputy Party Secretary, Professor, Doctoral Supervisor, and Director of the Modern Signal and Data Processing Laboratory at Peking University, gave a report titled "Evolution and Application of Vision-Language Pre-training Models" at the China Computer Conference (CNCC 2021) forum "Industry Talk: Business Applications and Technological Development Directions of Large-Scale Pre-training Models", discussing the controversies, latest progress, and research ideas around large-scale pre-training models, and offering an outlook on the future.
For example, she mentions:
"The 'vision-language' subtasks are very numerous, with their own datasets, which makes the progress of solving NLP tasks very fast, but the pre-trained model approach has a very big problem in the field of vision: the cost of data labeling is very high. The MSCOCO dataset only labeled 120,000 images, each giving 5 markers, for a total of $10.8W. ”
"The technical paths of several of the current mainstream VL-PTMs are similar, using a single Transformer architecture to model visual and text input; visual input is Region-of-Interests (Rols) or Patches, missing global or other high-level visual semantic information..."
The latter point shows that mainstream vision-language pre-training models have many limitations; as a result, when transferred to downstream tasks they suit classification tasks but not generation tasks.
The following is the full text of the speech, edited by AI Technology Review without changing the original meaning.
The title of today's talk is "Evolution and Application of Vision-Language Pre-training Models". It combines our team's work and my own reflections to explore current trends in the development of artificial intelligence. The presentation is divided into four parts: background introduction, vision-language pre-training models, vision-language pre-training models and applied research, and future prospects.
Artificial intelligence has been developing for more than 60 years. Since 2017, the Transformer and then BERT (2018) have been proposed in succession, opening a new chapter of big data, pre-training, and transfer learning; it is no exaggeration to call this a new era. Unlike the work of previous decades, much of which has reached settled conclusions, this field still needs to be explored in depth.
Taking natural language processing (NLP) as an example, its evolution is shown in the figure above. OpenAI released the first-generation GPT model in 2018, and in just a few years the "big model" has taken shape. "Big" here has two meanings: the amount of data used to train the model is large, and the number of parameters the model contains is large. China also has excellent work in this area; WuDao 2.0 (悟道2.0) reached the scale of trillions of parameters in 2021.
There is still some controversy around large-scale pre-training models. The main points of contention are:
1. What has the very large model actually learned? How can this be verified?
2. How can "knowledge" be transferred from the very large model to improve the performance of downstream tasks?
3. How should pre-training tasks, model architectures, and training methods be better designed?
4. Should one choose a single-modal pre-trained model or a multimodal pre-trained model?
Despite the controversy, it must be admitted that this "brute-force aesthetic" does have its merits: for example, Baidu's ERNIE 3.0 refreshed more than 50 NLP task benchmarks. Bear in mind that, in academia and industry, countless students and researchers rack their brains for a single SOTA, while large-scale pre-training models can "produce" SOTA results in batches. Conversely, more than 50 SOTA results also show that this is no accident.
At present, the academic community recognizes that the development of AI is inseparable from research on the human brain, so multimodal pre-training models, which integrate brain-inspired mechanisms with machine learning, have naturally become a focus of attention.
However, many mechanisms discovered by brain science have not yet been clarified, such as multi-layer abstraction, attention mechanisms, multimodal aggregation, multimodal compensation, multi-cue mechanisms, synergy mechanisms, and so on.
About 70% of the information humans receive comes from vision, and the remaining 20% to 30% depends on hearing and touch. In human intelligence, it is language that carries truly high-level semantics. For example, when the word "apple" is mentioned, the brain "pictures" an apple that can be eaten; when "Apple" is mentioned, the concept of an Apple-brand phone comes to mind.
Therefore, the brain's mechanism of "vision participating in auditory perception" and the cognitive mechanism of "consistency between visual concepts and language concepts" are among the reliable bases for adopting multimodal pre-training models in machine learning.
Is it feasible to develop vision-language models? A study by Renmin University of China showed that image-and-text data accounts for about 90% of big data on the Internet, while pure text accounts for only 10%. With such abundant data sources, vision-language pre-training models have also become a research hotspot in 2021.
Vision and Language is commonly abbreviated "VL". VL pre-trained models are designed to let machines handle tasks that involve understanding both visual content and text content. VL tasks can be divided into VL generation tasks and VL classification tasks.
The two types of tasks solve different problems and differ in difficulty. A VL generation task must not only understand the visual information but also generate the corresponding language description, involving both encoding and decoding, while a VL classification task only needs to understand the information. Obviously, generation tasks are harder.
The technical difficulty of VL generation tasks lies in understanding high-order visual semantics and establishing visual-text semantic associations. For example, the Video Captioning task needs to "summarize" the video content, and the Image Captioning task needs to generate a description of the image content.
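To make the encode-decode distinction concrete, here is a minimal, illustrative captioning model in PyTorch. It is not the speaker's system; the dimensions and module names are placeholders. A visual encoder projects region or patch features, and an autoregressive Transformer decoder generates the caption conditioned on them, whereas a classification task would only need the encoder side plus a small prediction head.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Illustrative encoder-decoder for image captioning (not the speaker's model)."""
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        # Stand-in visual encoder: project pre-extracted region/patch features
        # (e.g., RoIs or ViT patches) into the decoder's embedding space.
        self.visual_proj = nn.Linear(2048, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, caption_tokens):
        # visual_feats: (B, N_regions, 2048); caption_tokens: (B, T)
        memory = self.visual_proj(visual_feats)            # "encode" the visual input
        tgt = self.token_emb(caption_tokens)
        T = caption_tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)  # decode conditioned on vision
        return self.lm_head(hidden)                          # next-token logits (B, T, vocab)
```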
Currently, Visual Question Answering (VQA) is one of the popular VL classification tasks. It can be understood as: given an image, let the model answer an arbitrary natural-language question about it.
As shown in the image above (left), if you ask the machine "What is she eating?", the VL classification model understands the picture and then gives the correct answer, "hamburger".
Currently, there are many "vision-language" subtasks, each with its own dataset, such as VQA, VCR, NLVR2, and so on. We note that NLP tasks have developed rapidly thanks to large dataset support. For vision-language tasks, however, the performance of VL models has improved slowly because labeling large-scale datasets is extremely expensive.
Taking image captioning as an example, the MSCOCO dataset labeled only 120,000 images, with 5 captions per image, at a total cost of 10.8W dollars (about $108,000). As a result, different VL tasks rely on different model frameworks plus different labeled datasets, which are expensive to annotate, and performance has not yet met application requirements.
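As a back-of-the-envelope check on these figures (assuming "10.8W" means 10.8 × 10,000, i.e., roughly $108,000 in total; the per-unit costs below are derived, not quoted from the talk):

```python
# Rough labeling-cost arithmetic for the MSCOCO example above.
# Assumption: "10.8W" = 10.8 x 10,000 = ~$108,000 in total.
num_images = 120_000
captions_per_image = 5
total_cost_usd = 108_000

total_captions = num_images * captions_per_image   # 600,000 captions
print(total_cost_usd / total_captions)             # ~$0.18 per caption
print(total_cost_usd / num_images)                 # ~$0.90 per image
```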
Therefore, exploring new VL pre-training proxy tasks and reducing the dependence on data annotation is a very meaningful research direction. In 2019, the academic community began to work on VL-PTMs.
1
Evolution of visual-language pre-training models
Regarding VL pre-training models, there has been a lot of notable work since 2019, such as the "pioneering" ViLBERT, UNITER in 2020, and CLIP in 2021. Over time, the amount of data used by the models has grown, and their capabilities have become more and more impressive. The overall technical routes fall into two main categories: single-tower models and two-tower models.
UNITER was proposed by Microsoft in 2020. It used four proxy tasks to train the model, was tested on four downstream tasks, and obtained performance improvements. The studies above all follow the "pre-train then fine-tune" research paradigm.
In 2021, OpenAI released CLIP, a dual-stream (two-tower) framework, and its emergence was a shock. The principle is very simple: the model consists of two encoders, one for images and one for text, with the text encoder being a typical Transformer. The amazing thing is that the CLIP pre-trained model directly has zero-shot capabilities: OpenAI tested it on more than 20 classification tasks of different granularities and found that it has good zero-shot transfer performance and can learn fairly general visual representations.
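A minimal sketch of the zero-shot recipe that this kind of contrastively trained two-tower model enables: class names are turned into text prompts, and the class whose text embedding is closest to the image embedding wins. The prompt template, encoder interfaces, and temperature below are placeholders, not OpenAI's exact setup.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, text_encoder, temperature=0.01):
    """CLIP-style zero-shot classification: no task-specific training.

    image_emb:    (d,) embedding from an image encoder
    text_encoder: any callable mapping a list of strings to a (K, d) tensor
    Both encoders are assumed to have been pre-trained jointly with a contrastive loss.
    """
    prompts = [f"a photo of a {name}" for name in class_names]  # simple prompt template
    text_emb = text_encoder(prompts)                            # (K, d)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature             # cosine similarity per class
    return logits.softmax(dim=-1)                               # probability over classes
```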
2
Visual-language pre-training models and applied research
We analyzed the mainstream VL pre-training models above from six aspects: basic network structure, visual input, text input, mainstream datasets, training strategy, and downstream tasks.
The analysis found that the technical routes of mainstream VL-PTMs are similar:
1. They model visual and text input with a single Transformer architecture;
2. The visual input consists of Regions of Interest (RoIs) or patches, lacking global or other high-level visual semantic information;
3. The proxy tasks used are mostly BLM (bidirectional language modeling), S2SLM (unidirectional language modeling), ISPR (image-text matching), MOP (masked object prediction), and so on (see the sketch after this list).
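As an illustration of how such proxy tasks are typically wired on top of a single-stream encoder, here is a hypothetical sketch of two common heads, one for image-text matching and one for masked language modeling. The names and dimensions are placeholders, not any specific paper's code.

```python
import torch
import torch.nn as nn

class ProxyTaskHeads(nn.Module):
    """Illustrative heads for two common VL proxy tasks (not any specific model's code)."""
    def __init__(self, d_model=768, vocab_size=30522):
        super().__init__()
        self.itm_head = nn.Linear(d_model, 2)           # does the text match the image?
        self.mlm_head = nn.Linear(d_model, vocab_size)  # predict masked text tokens

    def forward(self, joint_hidden, cls_index=0):
        # joint_hidden: (B, L, d) output of a single-stream Transformer over
        # [CLS] + text tokens + visual regions/patches.
        itm_logits = self.itm_head(joint_hidden[:, cls_index])  # (B, 2)
        mlm_logits = self.mlm_head(joint_hidden)                # (B, L, vocab)
        return itm_logits, mlm_logits

# During pre-training, matching labels come from shuffling image-text pairs,
# and the masked-LM loss is computed only at randomly masked text positions.
```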
Therefore, the vision-language pre-training models proposed so far are better suited to transfer to downstream classification tasks such as VQA. For downstream generation tasks, such as image captioning, they are not well suited.
Our team also carried out exploratory research. The core idea is stacked Transformers with a self-attention mechanism, in which we propose a self-attention design that treats the visual and text modalities differently, i.e., uses separate Q/K/V projection parameters to model the visual and text modalities respectively.
At the same time, visual concept information is introduced to alleviate the visual-semantic gap. After verification, we found that our proposed VL-PTM, DIMBERT (2020), based on decoupling attention across modalities, can be applied to both classification tasks and generation tasks.
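A minimal, single-head sketch of the decoupled-attention idea described above: visual and text tokens get separate Q/K/V projection matrices but are still fused in one joint attention map. The dimensions and details are illustrative and are not the actual DIMBERT implementation.

```python
import torch
import torch.nn as nn

class ModalityDecoupledAttention(nn.Module):
    """Single-head sketch: separate Q/K/V projections per modality, joint attention."""
    def __init__(self, d_model=512):
        super().__init__()
        self.qkv_text = nn.Linear(d_model, 3 * d_model)    # text-specific projections
        self.qkv_vision = nn.Linear(d_model, 3 * d_model)  # vision-specific projections
        self.out = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, Lt, d); visual_tokens: (B, Lv, d)
        qt, kt, vt = self.qkv_text(text_tokens).chunk(3, dim=-1)
        qv, kv, vv = self.qkv_vision(visual_tokens).chunk(3, dim=-1)
        q = torch.cat([qt, qv], dim=1)                     # (B, Lt+Lv, d)
        k = torch.cat([kt, kv], dim=1)
        v = torch.cat([vt, vv], dim=1)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # joint attention across modalities
        fused = attn.softmax(dim=-1) @ v
        return self.out(fused)
```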
Compared with the SOTA of that year (2020), the DIMBERT model is smaller (it does not need a two-tower structure) and was pre-trained only on Conceptual Captions, giving it an advantage in the amount of data required; it reached SOTA on the downstream tasks we tested, and even without a decoder architecture it can be transferred to downstream generation tasks.
This work taught us two lessons:
1. From the perspective of information representation, visual information and text information require different representation methods; after all, text carries relatively higher-order semantic information.
2. It helps to introduce human high-order semantic information. Humans have very clear definitions of objects: apples are apples and pears are pears, so defining object attributes helps ease the semantic gap with linguistic information.
In October 2021, Facebook released work on VideoCLIP, a video-text VL pre-training model. As can be seen from this model, VideoCLIP is quite ambitious: it aims to need no task-specific training datasets for downstream tasks and no fine-tuning, performing zero-shot transfer directly from VideoCLIP.
Specifically, it combines contrastive learning with the Transformer framework in an attempt to build a joint visual-text pre-training model, with the hope of capturing finer-grained structure.
VideoCLIP's core work lies in constructing the contrastive-learning framework together with the training data samples: positive samples are pairs of video clips and matching text descriptions. In addition, by searching the neighborhood of a positive sample, hard negative samples are obtained, yielding video paired with non-matching text.
More specifically, the model uses a contrastive loss to learn fine-grained similarity between matching video-text pairs, so that semantically similar video and text representations are pulled closer together. The work is not outstanding in terms of methodological novelty, but the model's performance is surprising.
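For reference, a minimal sketch of the symmetric video-text contrastive (InfoNCE-style) objective that this kind of training uses. Hard-negative mining is only indicated in a comment; this is not VideoCLIP's actual code.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched video-text pairs.

    video_emb, text_emb: (B, d); row i of each is a matching pair.
    The other rows in the batch serve as negatives; mined hard negatives
    (as described above) would be appended to the batch before this call.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)        # video -> matching text
    loss_t2v = F.cross_entropy(logits.t(), targets)    # text -> matching video
    return 0.5 * (loss_v2t + loss_t2v)
```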
We believe VideoCLIP's research ideas can be pushed to an even finer granularity, and we propose a frame-level fine-grained video-text matching method.
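A sketch of one way frame-level fine-grained matching can be scored: each frame is compared with every text token, and the per-frame best matches are averaged into a single video-text score. The aggregation choice here is illustrative, not necessarily the team's exact formulation.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(frame_emb, token_emb):
    """Aggregate frame/token similarities into one video-text score.

    frame_emb: (B, F, d) per-frame video embeddings
    token_emb: (B, T, d) per-token text embeddings
    Each frame is scored by its best-matching token, then scores are averaged.
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    token_emb = F.normalize(token_emb, dim=-1)
    sim = torch.einsum("bfd,btd->bft", frame_emb, token_emb)  # (B, F, T) similarities
    per_frame = sim.max(dim=-1).values                        # best token per frame
    return per_frame.mean(dim=-1)                             # (B,) video-text score
```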
Experimental results show that fine-grained matching yields more accurate and complete spatial modeling capability. We ran video-retrieval recall tests on the ActivityNet dataset and found that, at every epoch, the pre-trained model based on our fine-grained matching strategy outperformed the one based on the global matching strategy. In addition, to reach the same performance, our fine-grained matching model trained about four times faster than the global matching method.
In summary, research on pre-trained models and cross-modal pre-trained models is well worth pursuing; whether in model structure, training strategy, or the design of pre-training tasks, there is still great potential.
In the future, the AI community may explore more modalities, such as multilingual text, motion, and audio; more downstream tasks, such as video captioning and video summarization; and more transfer learning mechanisms, such as parameter transfer, prompt learning, knowledge transfer, and so on.
Leifeng Network