
8.3K Stars!

Author: QbitAI

At the end of last June, we released the field's first survey of multimodal large language models, "A Survey on Multimodal Large Language Models", on arXiv. It systematically reviewed the progress and development directions of multimodal large language models, and has since gathered 120+ citations, while the accompanying open-source GitHub project has earned 8.3K stars. Since the paper's publication we have received many valuable comments from readers; thank you for your support!

Over the past year, we have witnessed the rapid development of multimodal large language models (MLLMs), represented by GPT-4V. In response, we have substantially upgraded the survey to provide a comprehensive picture of the field's current state and its potential future directions.


Context diagram of MLLM development

MLLMs grew out of the large language models (LLMs) that have attracted so much attention in recent years: they add multimodal information processing on top of the LLM's already strong generalization and reasoning abilities. Compared with earlier multimodal approaches, such as the discriminative paradigm represented by CLIP or the generative paradigm represented by OFA, the emerging MLLMs exhibit several distinguishing characteristics:

(1) Large model scale. An MLLM typically has billions of parameters, and more parameters bring more potential. (2) New training paradigms. To unlock the potential of this huge parameter count, MLLMs adopt new training paradigms such as multimodal pre-training and multimodal instruction fine-tuning, matched with corresponding dataset construction and evaluation methods.

Supported by these two properties, MLLMs exhibit emergent capabilities that previous multimodal models lacked, such as OCR-free mathematical reasoning over images, story writing from a given image, and understanding the deeper meaning of memes.


This survey covers the basic formulation of MLLMs, their extensions, and related research topics, including:

  • The basic components of MLLMs and related concepts, including architecture, training strategies, data, and evaluation;
  • Extensions of MLLMs, covering finer input/output granularity and support for more modalities, languages, and scenarios;
  • Related research topics of MLLMs, including multimodal hallucination, Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).

Architecture

For a typical MLLM that takes multimodal input and produces text output, the architecture generally consists of a modality encoder, a connector, and an LLM. To support outputs in more modalities (such as images, audio, and video), a generator is usually attached as well, as shown in the figure below:


MLLM architecture diagram

The modality encoder encodes the raw input (e.g., an image) into features, and the connector further processes those features into a form the LLM can readily consume, i.e., visual tokens. The LLM acts as the "brain", integrating this information to understand, reason, and generate responses. Taking Qwen-VL[1] as an example, the LLM "brain" has 7.7B parameters, about 80.2% of the total, followed by the visual encoder (1.9B, about 19.7%), while the connector has only 0.08B parameters.
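
To make this division of labor concrete, below is a minimal, self-contained sketch of the encoder-connector-LLM pipeline. The dimensions are toy values and the modules are placeholders (a single linear layer stands in for the vision encoder, one transformer layer for the LLM); it is not the actual Qwen-VL code, only an illustration of how visual tokens are concatenated with text embeddings.

```python
# A toy encoder-connector-LLM pipeline (illustrative dimensions, placeholder
# modules; not the real Qwen-VL architecture).
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000):
        super().__init__()
        # Modality encoder: maps raw patch features into visual features
        # (a single Linear stands in for a full ViT here).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Connector: projects visual features into the LLM embedding space,
        # producing "visual tokens".
        self.connector = nn.Linear(vision_dim, llm_dim)
        # LLM "brain": one transformer layer as a stand-in for a full LLM.
        self.llm = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                              batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_embeds):
        visual_feats = self.vision_encoder(image_patches)   # (B, N_img, vision_dim)
        visual_tokens = self.connector(visual_feats)         # (B, N_img, llm_dim)
        # The LLM reasons jointly over visual tokens and text embeddings.
        hidden = self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
        return self.lm_head(hidden)

model = ToyMLLM()
logits = model(torch.randn(1, 256, 768), torch.randn(1, 32, 1024))
print(logits.shape)  # torch.Size([1, 288, 32000])
```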

For the visual encoder, increasing the input image resolution is an effective way to improve performance. One approach is to raise the resolution directly, in which case the visual encoder must be further trained to handle the higher resolution, as in Qwen-VL[1]. Another is to slice a high-resolution image into multiple sub-images and feed each into the visual encoder at its native low resolution, indirectly increasing the effective input resolution, as in Monkey[2].
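
Below is a minimal sketch of the second, sub-image strategy under illustrative assumptions (a 448-pixel tile size, a simple grid layout, and a global thumbnail for context); real systems such as Monkey use their own tiling and resampling schemes.

```python
# Slicing a high-resolution image into fixed-size sub-images, plus a global
# thumbnail for overall context. Tile size and layout are illustrative.
from PIL import Image

def slice_image(img: Image.Image, tile_size: int = 448):
    """Split an image into tile_size x tile_size crops so each piece matches
    the visual encoder's native input resolution."""
    w, h = img.size
    tiles = []
    for top in range(0, h, tile_size):
        for left in range(0, w, tile_size):
            crop = img.crop((left, top, min(left + tile_size, w),
                             min(top + tile_size, h)))
            tiles.append(crop.resize((tile_size, tile_size)))
    thumbnail = img.resize((tile_size, tile_size))  # keeps the global view
    return tiles, thumbnail

tiles, thumb = slice_image(Image.new("RGB", (1344, 896)))
print(len(tiles))  # 6 sub-images of 448x448, each encoded separately
```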

Commonly used pre-trained LLMs include the LLaMA[3], Qwen[4], and InternLM[5] series. The first mainly supports English, while the latter two offer better bilingual (Chinese and English) support. In terms of performance, increasing the LLM's parameter count brings significant gains: LLaVA-NeXT[6], for instance, experimented with 7B/13B/34B LLMs and found that scaling up the LLM yields clear improvements across benchmarks, with zero-shot Chinese capability emerging at the 34B scale. Beyond directly increasing parameter counts, the recently popular mixture-of-experts (MoE) architecture offers a more efficient option: through sparse computation, model capacity can grow without increasing the number of parameters actually activated per token.
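
As an illustration of the sparse-computation idea behind MoE, here is a generic top-k routing sketch (not any specific MLLM's implementation): each token is processed by only a few of the available experts, so total capacity grows with the number of experts while per-token compute stays roughly constant.

```python
# Generic top-k sparse MoE routing: every token is sent to only top_k experts,
# so capacity scales with num_experts while per-token compute stays bounded.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)  # produces routing scores
        self.top_k = top_k

    def forward(self, x):                           # x: (num_tokens, dim)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```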

The connector matters somewhat less than the other two components. MM1[7], for example, found experimentally that the connector type matters less than the number of visual tokens (which determines how much visual information reaches the LLM) and the image resolution (which determines how much information enters the visual encoder).
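
For intuition, the sketch below contrasts two common connector styles under toy dimensions: a simple MLP projector that keeps one visual token per patch, and a query-based resampler that compresses the features into a fixed number of visual tokens (the quantity MM1 found to matter). Neither is any specific model's implementation.

```python
# Two connector styles: an MLP projector (one visual token per patch) vs. a
# query-based resampler (fixed number of visual tokens). Dimensions are toy.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, feats):        # (B, N_patches, vision_dim)
        return self.proj(feats)      # (B, N_patches, llm_dim): token count unchanged

class QueryResampler(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, feats):        # (B, N_patches, vision_dim)
        kv = self.kv_proj(feats)
        q = self.queries.expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out                   # (B, num_queries, llm_dim): compressed

feats = torch.randn(2, 576, 768)
print(MLPConnector()(feats).shape, QueryResampler()(feats).shape)
```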

Data & Training

MLLM training can be roughly divided into a pre-training stage, an instruction fine-tuning stage, and an alignment fine-tuning stage. In pre-training, large amounts of paired image-text data align visual information to the LLM's representation space, so that the LLM can "read" visual tokens. Instruction fine-tuning uses a wide variety of task data to improve performance on downstream tasks and the model's ability to understand and follow instructions. Alignment fine-tuning typically uses reinforcement learning techniques to align the model with human values or a specific need (e.g., fewer hallucinations).
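
The sketch below summarizes these three stages as a configuration. Which modules are frozen or trainable at each stage varies from paper to paper, so the choices shown here are illustrative only.

```python
# The three training stages as an illustrative configuration; freezing choices
# differ across papers and are shown here only as an example.
stages = {
    "pretraining": {
        "data": "large-scale image-text pairs (e.g., web captions)",
        "trainable": ["connector"],           # encoder and LLM often kept frozen
        "goal": "align visual features with the LLM representation space",
    },
    "instruction_tuning": {
        "data": "multi-task instruction data, plus some text-only dialogue",
        "trainable": ["connector", "llm"],
        "goal": "follow instructions and improve downstream-task performance",
    },
    "alignment_tuning": {
        "data": "preference data over model responses",
        "trainable": ["llm"],
        "goal": "align with human preferences, e.g., reduce hallucination",
    },
}

for name, cfg in stages.items():
    print(f"{name}: train {cfg['trainable']} on {cfg['data']} -> {cfg['goal']}")
```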

In the first stage, early work mainly used coarse-grained image-text pairs such as LAION-5B, derived from web images and their accompanying captions. Such data is large in scale (billions of pairs) but noisy, with short captions, which can hurt alignment quality. Later work explores alignment with cleaner, more text-rich data. For example, ShareGPT4V[8] uses detailed descriptions generated by GPT-4V for fine-grained alignment, which alleviates the insufficient-alignment problem to some extent and yields better performance. But because GPT-4V access is paid, this kind of data is usually small in scale (millions of pairs). Moreover, the limited scale means it carries limited world knowledge, such as whether the building in an image can be recognized as the Canton Tower; this kind of world knowledge usually resides in the large-scale, coarse-grained image-text pairs.

The fine-tuning data in the second stage can come from existing task datasets, such as VQA and OCR data, or from data generated by GPT-4V, such as question-answer pairs. While the latter can produce more complex and diverse instruction data, it also significantly increases cost. It is worth noting that the second training stage generally mixes in some plain-text conversation data, which can be seen as a form of regularization that preserves the LLM's original abilities and embedded knowledge.
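
A minimal sketch of that mixing step follows; the 10% text ratio and the record fields are purely illustrative assumptions, not a recipe from the survey.

```python
# Mixing multimodal instruction data with plain-text dialogue during stage 2;
# the 10% text ratio and the record fields are purely illustrative.
import random

def sample_mixed_batch(multimodal_data, text_only_data, batch_size=8, text_ratio=0.1):
    """Draw a batch in which roughly text_ratio of the samples are text-only,
    acting as a regularizer that preserves the LLM's original abilities."""
    batch = []
    for _ in range(batch_size):
        pool = text_only_data if random.random() < text_ratio else multimodal_data
        batch.append(random.choice(pool))
    return batch

multimodal = [{"image": f"img_{i}.jpg", "instruction": "...", "response": "..."}
              for i in range(100)]
text_only = [{"instruction": "...", "response": "..."} for _ in range(100)]
print(len(sample_mixed_batch(multimodal, text_only)))  # 8
```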

The third stage mainly uses preference data over model responses. Such data is often collected through manual annotation, which is costly. Recent work has turned to automated methods that rank preferences over responses from different models; Silkie[9], for example, calls GPT-4V to collect preference data.

Other technical directions

Beyond improving the model's basic capabilities (e.g., supported input/output forms, performance), there are several interesting questions and directions to explore. This survey covers multimodal hallucination, Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).

Research on multimodal hallucination focuses on model outputs that do not match the content of the input images. Visual and textual information are inherently heterogeneous, and aligning the two precisely is itself a considerable challenge. Increasing image resolution and improving training data quality are the two most intuitive ways to reduce multimodal hallucination, but its causes and remedies still need to be explored at a more fundamental level. For example, the effects of current visual tokenization methods, the multimodal alignment paradigm, and conflicts between multimodal data and the knowledge stored in the LLM all require further study.

Multimodal in-context learning is a few-shot technique: the model is prompted with a small number of question-answer examples to improve its few-shot performance. The key is getting the model to attend effectively to the given context and generalize the underlying problem pattern to new questions. Work represented by Flamingo[10] improves the model's ability to attend to context by training on image-text interleaved data. Research on multimodal in-context learning is still at an early stage and needs further exploration.
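
Here is a minimal sketch of how an M-ICL prompt can be assembled from a few interleaved image-question-answer demonstrations; the `<image:...>` placeholder convention and the helper function are illustrative assumptions, not a template taken from any particular model.

```python
# Assembling an M-ICL prompt from a few interleaved demonstrations; the
# "<image:...>" placeholder convention is an assumption for illustration.
def build_micl_prompt(demos, query_image, query_question):
    """demos: list of (image_id, question, answer) few-shot examples."""
    parts = []
    for image_id, question, answer in demos:
        parts.append(f"<image:{image_id}>\nQuestion: {question}\nAnswer: {answer}")
    parts.append(f"<image:{query_image}>\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)

demos = [("demo1.jpg", "How many apples are on the table?", "Three."),
         ("demo2.jpg", "What color is the car?", "Red.")]
print(build_micl_prompt(demos, "query.jpg", "What is the person holding?"))
```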

The basic idea of the multimodal chain of thought is to break a complex problem into simpler sub-problems, solve them separately, and then combine the results. Compared with text-only reasoning, multimodal reasoning involves more sources of information and more intricate logical relationships, making it considerably harder. Work in this area is still relatively scarce.
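
As a rough illustration, the sketch below builds a prompt that asks the model to decompose a visual question before answering; the wording is hypothetical, not a fixed M-CoT template from the survey.

```python
# A prompt that asks the model to decompose a visual question into sub-steps
# before answering; the wording is hypothetical, not a fixed M-CoT template.
def build_mcot_prompt(question):
    return (
        "<image>\n"
        f"Question: {question}\n"
        "Let's solve this step by step:\n"
        "1. Describe the objects and text in the image that are relevant.\n"
        "2. Break the question into simpler sub-questions and answer each one.\n"
        "3. Combine the intermediate answers into a final answer.\n"
        "Reasoning:"
    )

print(build_mcot_prompt("How much do the two items on the receipt cost in total?"))
```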

LLM-aided visual reasoning explores how to leverage the LLM's rich embedded knowledge and capabilities, together with external tools, to build visual reasoning systems for real-world problems. Rather than producing a single model through end-to-end training, this line of work generally focuses on extending and enhancing the LLM's capabilities in a training-free way to build a comprehensive system.
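
Below is a minimal sketch of such a training-free loop, with stub `llm` and `tools` interfaces that are purely hypothetical: the LLM first plans which vision tools to call, the tools supply evidence, and the LLM then reasons over that evidence to answer.

```python
# A training-free LAVR-style control loop with stub llm/tools interfaces
# (hypothetical); it only demonstrates plan -> gather evidence -> answer.
def lavr_answer(question, image, llm, tools):
    # 1. The LLM plans which vision tools to call (e.g., captioner, OCR).
    plan = llm(f"Question: {question}\nAvailable tools: {list(tools)}\n"
               "List the tools to call, one per line.")
    # 2. Run the requested tools on the image and collect their outputs.
    evidence = {name: tools[name](image)
                for name in plan.splitlines() if name in tools}
    # 3. The LLM reasons over the collected evidence to answer the question.
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")

# Stub tools and a stub LLM, just to show the control flow end to end:
stub_tools = {"caption": lambda img: "a paper receipt on a desk",
              "ocr": lambda img: "Total: $12.50"}
stub_llm = lambda prompt: ("caption\nocr" if "Available tools" in prompt
                           else "The total on the receipt is $12.50.")
print(lavr_answer("What is the total on the receipt?", None, stub_llm, stub_tools))
```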

Challenges and future directions

Reflecting on the current state of MLLM research, we summarize the challenges and possible future directions as follows:

  • Existing MLLMs have limited ability to handle long multimodal contexts, which makes tasks such as long-video understanding and image-text interleaved content understanding highly challenging. MLLMs represented by Gemini 1.5 Pro are setting off a wave of long-video understanding, while multimodal interleaved reading comprehension (i.e., long documents mixing images and text) remains largely unexplored and is likely to become the next research hotspot.
  • MLLMs still fall short at following complex instructions. For example, GPT-4V can understand complex instructions to generate question-answer pairs that even include reasoning, while other models are clearly weaker in this respect, leaving much room for improvement.
  • Research on in-context learning and chain-of-thought reasoning in MLLMs is still preliminary, and the corresponding capabilities remain weak; the underlying mechanisms and ways to improve them urgently need exploration.
  • Developing MLLM-based agents is a research hotspot. Realizing such applications requires comprehensive improvements in the model's perception, reasoning, and planning capabilities.
  • Safety issues. MLLMs are susceptible to deliberately crafted malicious attacks that induce biased or harmful answers, and research in this area is still lacking.
  • Current MLLM training usually unfreezes the LLM. Although some unimodal text data is mixed into training, there is still no systematic, in-depth study of whether large-scale multimodal and unimodal data help or harm each other when trained jointly.

Further reading

Link to paper: https://arxiv.org/pdf/2306.13549.pdf

Project Link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

— END —
