
8.3K Stars!

Author: QbitAI

At the end of last June, we released the field's first survey of multimodal large language models, "A Survey on Multimodal Large Language Models", on arXiv. It systematically reviewed the progress and development directions of multimodal large language models, and has since gathered 120+ citations, while the accompanying open-source GitHub project has earned 8.3K stars. Since the paper's publication we have received many valuable comments from readers; thank you for your support!

Over the past year, we have witnessed the rapid development of multimodal large language models (MLLMs), represented by GPT-4V. In response, we have substantially upgraded the survey to provide a comprehensive picture of the field's current state and its potential future directions.


Context diagram of MLLM development

MLLMs grew out of the large language models (LLMs) that have attracted so much attention in recent years: they add multimodal information processing on top of the LLM's already strong generalization and reasoning abilities. Compared with earlier multimodal approaches, such as the discriminative paradigm represented by CLIP or the generative paradigm represented by OFA, the emerging MLLMs exhibit several distinguishing characteristics:

(1) Large model scale. An MLLM typically has billions of parameters, and more parameters bring more potential. (2) New training paradigms. To unlock the potential of this huge parameter count, MLLMs adopt new training paradigms such as multimodal pre-training and multimodal instruction fine-tuning, matched with corresponding dataset construction and evaluation methods.

Supported by these two properties, MLLMs exhibit emergent capabilities that previous multimodal models lacked, such as OCR-free mathematical reasoning over images, story writing from a given image, and understanding the deeper meaning of memes.


This survey covers the basic formulation of MLLMs, their extensions, and related research topics, including:

  • The basic components of MLLMs and related concepts, including architecture, training strategies, data, and evaluation;
  • Extensions of MLLMs, covering finer input/output granularity and support for more modalities, languages, and scenarios;
  • Related research topics of MLLMs, including multimodal hallucination, Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).

Architecture

For a typical MLLM that takes multimodal input and produces text output, the architecture generally consists of a modality encoder, a connector, and an LLM. To support outputs in more modalities (such as images, audio, and video), a generator is usually attached as well, as shown in the figure below:


MLLM architecture diagram

The modality encoder encodes the raw input (e.g., an image) into features, and the connector further processes those features into a form the LLM can readily consume, i.e., visual tokens. The LLM acts as the "brain", integrating this information to understand, reason, and generate responses. Taking Qwen-VL[1] as an example, the LLM "brain" has 7.7B parameters, about 80.2% of the total, followed by the visual encoder (1.9B, about 19.7%), while the connector has only 0.08B parameters.
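
To make this division of labor concrete, below is a minimal, self-contained sketch of the encoder-connector-LLM pipeline. The dimensions are toy values and the modules are placeholders (a single linear layer stands in for the vision encoder, one transformer layer for the LLM); it is not the actual Qwen-VL code, only an illustration of how visual tokens are concatenated with text embeddings.

```python
# A toy encoder-connector-LLM pipeline (illustrative dimensions, placeholder
# modules; not the real Qwen-VL architecture).
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000):
        super().__init__()
        # Modality encoder: maps raw patch features into visual features
        # (a single Linear stands in for a full ViT here).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Connector: projects visual features into the LLM embedding space,
        # producing "visual tokens".
        self.connector = nn.Linear(vision_dim, llm_dim)
        # LLM "brain": one transformer layer as a stand-in for a full LLM.
        self.llm = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                              batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_embeds):
        visual_feats = self.vision_encoder(image_patches)   # (B, N_img, vision_dim)
        visual_tokens = self.connector(visual_feats)         # (B, N_img, llm_dim)
        # The LLM reasons jointly over visual tokens and text embeddings.
        hidden = self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
        return self.lm_head(hidden)

model = ToyMLLM()
logits = model(torch.randn(1, 256, 768), torch.randn(1, 32, 1024))
print(logits.shape)  # torch.Size([1, 288, 32000])
```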

For the visual encoder, increasing the input image resolution is an effective way to improve performance. One approach is to raise the resolution directly, in which case the visual encoder must be further trained to handle the higher resolution, as in Qwen-VL[1]. Another is to slice a high-resolution image into multiple sub-images and feed each into the visual encoder at its native low resolution, indirectly increasing the effective input resolution, as in Monkey[2].
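
Below is a minimal sketch of the second, sub-image strategy under illustrative assumptions (a 448-pixel tile size, a simple grid layout, and a global thumbnail for context); real systems such as Monkey use their own tiling and resampling schemes.

```python
# Slicing a high-resolution image into fixed-size sub-images, plus a global
# thumbnail for overall context. Tile size and layout are illustrative.
from PIL import Image

def slice_image(img: Image.Image, tile_size: int = 448):
    """Split an image into tile_size x tile_size crops so each piece matches
    the visual encoder's native input resolution."""
    w, h = img.size
    tiles = []
    for top in range(0, h, tile_size):
        for left in range(0, w, tile_size):
            crop = img.crop((left, top, min(left + tile_size, w),
                             min(top + tile_size, h)))
            tiles.append(crop.resize((tile_size, tile_size)))
    thumbnail = img.resize((tile_size, tile_size))  # keeps the global view
    return tiles, thumbnail

tiles, thumb = slice_image(Image.new("RGB", (1344, 896)))
print(len(tiles))  # 6 sub-images of 448x448, each encoded separately
```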

Commonly used pre-trained LLMs include the LLaMA[3], Qwen[4], and InternLM[5] series. The first mainly supports English, while the latter two offer better bilingual (Chinese and English) support. In terms of performance, increasing the LLM's parameter count brings significant gains: LLaVA-NeXT[6], for instance, experimented with 7B/13B/34B LLMs and found that scaling up the LLM yields clear improvements across benchmarks, with zero-shot Chinese capability emerging at the 34B scale. Beyond directly increasing parameter counts, the recently popular mixture-of-experts (MoE) architecture offers a more efficient option: through sparse computation, model capacity can grow without increasing the number of parameters actually activated per token.
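
As an illustration of the sparse-computation idea behind MoE, here is a generic top-k routing sketch (not any specific MLLM's implementation): each token is processed by only a few of the available experts, so total capacity grows with the number of experts while per-token compute stays roughly constant.

```python
# Generic top-k sparse MoE routing: every token is sent to only top_k experts,
# so capacity scales with num_experts while per-token compute stays bounded.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)  # produces routing scores
        self.top_k = top_k

    def forward(self, x):                           # x: (num_tokens, dim)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```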

The connector matters somewhat less than the other two components. MM1[7], for example, found experimentally that the connector type matters less than the number of visual tokens (which determines how much visual information reaches the LLM) and the image resolution (which determines how much information enters the visual encoder).
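
For intuition, the sketch below contrasts two common connector styles under toy dimensions: a simple MLP projector that keeps one visual token per patch, and a query-based resampler that compresses the features into a fixed number of visual tokens (the quantity MM1 found to matter). Neither is any specific model's implementation.

```python
# Two connector styles: an MLP projector (one visual token per patch) vs. a
# query-based resampler (fixed number of visual tokens). Dimensions are toy.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, feats):        # (B, N_patches, vision_dim)
        return self.proj(feats)      # (B, N_patches, llm_dim): token count unchanged

class QueryResampler(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, feats):        # (B, N_patches, vision_dim)
        kv = self.kv_proj(feats)
        q = self.queries.expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out                   # (B, num_queries, llm_dim): compressed

feats = torch.randn(2, 576, 768)
print(MLPConnector()(feats).shape, QueryResampler()(feats).shape)
```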

Data & Training

MLLM training can be roughly divided into a pre-training stage, an instruction fine-tuning stage, and an alignment fine-tuning stage. In pre-training, large amounts of paired image-text data align visual information to the LLM's representation space, so that the LLM can "read" visual tokens. Instruction fine-tuning uses a wide variety of task data to improve performance on downstream tasks and the model's ability to understand and follow instructions. Alignment fine-tuning typically uses reinforcement learning techniques to align the model with human values or a specific need (e.g., fewer hallucinations).
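
The sketch below summarizes these three stages as a configuration. Which modules are frozen or trainable at each stage varies from paper to paper, so the choices shown here are illustrative only.

```python
# The three training stages as an illustrative configuration; freezing choices
# differ across papers and are shown here only as an example.
stages = {
    "pretraining": {
        "data": "large-scale image-text pairs (e.g., web captions)",
        "trainable": ["connector"],           # encoder and LLM often kept frozen
        "goal": "align visual features with the LLM representation space",
    },
    "instruction_tuning": {
        "data": "multi-task instruction data, plus some text-only dialogue",
        "trainable": ["connector", "llm"],
        "goal": "follow instructions and improve downstream-task performance",
    },
    "alignment_tuning": {
        "data": "preference data over model responses",
        "trainable": ["llm"],
        "goal": "align with human preferences, e.g., reduce hallucination",
    },
}

for name, cfg in stages.items():
    print(f"{name}: train {cfg['trainable']} on {cfg['data']} -> {cfg['goal']}")
```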

In the first stage, early work mainly used coarse-grained image-text pairs such as LAION-5B, derived from web images and their accompanying captions. Such data is large in scale (billions of pairs) but noisy, with short captions, which can hurt alignment quality. Later work explores alignment with cleaner, more text-rich data. For example, ShareGPT4V[8] uses detailed descriptions generated by GPT-4V for fine-grained alignment, which alleviates the insufficient-alignment problem to some extent and yields better performance. But because GPT-4V access is paid, this kind of data is usually small in scale (millions of pairs). Moreover, the limited scale means it carries limited world knowledge, such as whether the building in an image can be recognized as the Canton Tower; this kind of world knowledge usually resides in the large-scale, coarse-grained image-text pairs.

The fine-tuning data in the second stage can come from existing task datasets, such as VQA and OCR data, or from data generated by GPT-4V, such as question-answer pairs. While the latter can produce more complex and diverse instruction data, it also significantly increases cost. It is worth noting that the second training stage generally mixes in some plain-text conversation data, which can be seen as a form of regularization that preserves the LLM's original abilities and embedded knowledge.
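
A minimal sketch of that mixing step follows; the 10% text ratio and the record fields are purely illustrative assumptions, not a recipe from the survey.

```python
# Mixing multimodal instruction data with plain-text dialogue during stage 2;
# the 10% text ratio and the record fields are purely illustrative.
import random

def sample_mixed_batch(multimodal_data, text_only_data, batch_size=8, text_ratio=0.1):
    """Draw a batch in which roughly text_ratio of the samples are text-only,
    acting as a regularizer that preserves the LLM's original abilities."""
    batch = []
    for _ in range(batch_size):
        pool = text_only_data if random.random() < text_ratio else multimodal_data
        batch.append(random.choice(pool))
    return batch

multimodal = [{"image": f"img_{i}.jpg", "instruction": "...", "response": "..."}
              for i in range(100)]
text_only = [{"instruction": "...", "response": "..."} for _ in range(100)]
print(len(sample_mixed_batch(multimodal, text_only)))  # 8
```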

The third stage mainly uses preference data over model responses. Such data is often collected through manual annotation, which is costly. Recent work has turned to automated methods that rank preferences over responses from different models; Silkie[9], for example, calls GPT-4V to collect preference data.

Other technical directions

Beyond improving the model's basic capabilities (e.g., supported input/output forms, performance), there are several interesting questions and directions to explore. This survey covers multimodal hallucination, Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).

Research on multimodal hallucination focuses on model outputs that do not match the content of the input images. Visual and textual information are inherently heterogeneous, and aligning the two precisely is itself a considerable challenge. Increasing image resolution and improving training data quality are the two most intuitive ways to reduce multimodal hallucination, but its causes and remedies still need to be explored at a more fundamental level. For example, the effects of current visual tokenization methods, the multimodal alignment paradigm, and conflicts between multimodal data and the knowledge stored in the LLM all require further study.

Multimodal in-context learning is a few-shot technique: the model is prompted with a small number of question-answer examples to improve its few-shot performance. The key is getting the model to attend effectively to the given context and generalize the underlying problem pattern to new questions. Work represented by Flamingo[10] improves the model's ability to attend to context by training on image-text interleaved data. Research on multimodal in-context learning is still at an early stage and needs further exploration.
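
Here is a minimal sketch of how an M-ICL prompt can be assembled from a few interleaved image-question-answer demonstrations; the `<image:...>` placeholder convention and the helper function are illustrative assumptions, not a template taken from any particular model.

```python
# Assembling an M-ICL prompt from a few interleaved demonstrations; the
# "<image:...>" placeholder convention is an assumption for illustration.
def build_micl_prompt(demos, query_image, query_question):
    """demos: list of (image_id, question, answer) few-shot examples."""
    parts = []
    for image_id, question, answer in demos:
        parts.append(f"<image:{image_id}>\nQuestion: {question}\nAnswer: {answer}")
    parts.append(f"<image:{query_image}>\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)

demos = [("demo1.jpg", "How many apples are on the table?", "Three."),
         ("demo2.jpg", "What color is the car?", "Red.")]
print(build_micl_prompt(demos, "query.jpg", "What is the person holding?"))
```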

The basic idea of the multimodal chain of thought is to break a complex problem into simpler sub-problems, solve them separately, and then combine the results. Compared with text-only reasoning, multimodal reasoning involves more sources of information and more intricate logical relationships, making it considerably harder. Work in this area is still relatively scarce.
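
As a rough illustration, the sketch below builds a prompt that asks the model to decompose a visual question before answering; the wording is hypothetical, not a fixed M-CoT template from the survey.

```python
# A prompt that asks the model to decompose a visual question into sub-steps
# before answering; the wording is hypothetical, not a fixed M-CoT template.
def build_mcot_prompt(question):
    return (
        "<image>\n"
        f"Question: {question}\n"
        "Let's solve this step by step:\n"
        "1. Describe the objects and text in the image that are relevant.\n"
        "2. Break the question into simpler sub-questions and answer each one.\n"
        "3. Combine the intermediate answers into a final answer.\n"
        "Reasoning:"
    )

print(build_mcot_prompt("How much do the two items on the receipt cost in total?"))
```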

LLM-aided visual reasoning explores how to leverage the LLM's rich embedded knowledge and capabilities, together with external tools, to build visual reasoning systems for real-world problems. Rather than producing a single model through end-to-end training, this line of work generally focuses on extending and enhancing the LLM's capabilities in a training-free way to build a comprehensive system.
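
Below is a minimal sketch of such a training-free loop, with stub `llm` and `tools` interfaces that are purely hypothetical: the LLM first plans which vision tools to call, the tools supply evidence, and the LLM then reasons over that evidence to answer.

```python
# A training-free LAVR-style control loop with stub llm/tools interfaces
# (hypothetical); it only demonstrates plan -> gather evidence -> answer.
def lavr_answer(question, image, llm, tools):
    # 1. The LLM plans which vision tools to call (e.g., captioner, OCR).
    plan = llm(f"Question: {question}\nAvailable tools: {list(tools)}\n"
               "List the tools to call, one per line.")
    # 2. Run the requested tools on the image and collect their outputs.
    evidence = {name: tools[name](image)
                for name in plan.splitlines() if name in tools}
    # 3. The LLM reasons over the collected evidence to answer the question.
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")

# Stub tools and a stub LLM, just to show the control flow end to end:
stub_tools = {"caption": lambda img: "a paper receipt on a desk",
              "ocr": lambda img: "Total: $12.50"}
stub_llm = lambda prompt: ("caption\nocr" if "Available tools" in prompt
                           else "The total on the receipt is $12.50.")
print(lavr_answer("What is the total on the receipt?", None, stub_llm, stub_tools))
```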

Challenges and future directions

Reflecting on the current state of MLLM research, we summarize the challenges and possible future directions as follows:

  • Existing MLLMs have limited ability to handle long multimodal contexts, which makes tasks such as long-video understanding and image-text interleaved content understanding highly challenging. MLLMs represented by Gemini 1.5 Pro are setting off a wave of long-video understanding, while multimodal interleaved reading comprehension (i.e., long documents mixing images and text) remains largely unexplored and is likely to become the next research hotspot.
  • MLLMs still fall short at following complex instructions. For example, GPT-4V can understand complex instructions to generate question-answer pairs that even include reasoning, while other models are clearly weaker in this respect, leaving much room for improvement.
  • Research on in-context learning and chain-of-thought reasoning in MLLMs is still preliminary, and the corresponding capabilities remain weak; the underlying mechanisms and ways to improve them urgently need exploration.
  • Developing MLLM-based agents is a research hotspot. Realizing such applications requires comprehensive improvements in the model's perception, reasoning, and planning capabilities.
  • Safety issues. MLLMs are susceptible to deliberately crafted malicious attacks that induce biased or harmful answers, and research in this area is still lacking.
  • Current MLLM training usually unfreezes the LLM. Although some unimodal text data is mixed into training, there is still no systematic, in-depth study of whether large-scale multimodal and unimodal data help or harm each other when trained jointly.

Further reading

Link to paper: https://arxiv.org/pdf/2306.13549.pdf

Project Link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

— END —
