
Multiple universities jointly build LAMM, an open-source community for multimodal language models

Author: Heart of the Machine Editorial Department

LAMM (Language-Assisted Multi-Modal) aims to provide the open-source academic community with a framework for multimodal instruction fine-tuning and evaluation. It includes a highly optimized training framework, a comprehensive evaluation system, and support for a variety of visual modalities.

Since the advent of ChatGPT, large language models (LLMs) have developed by leaps and bounds, and AI paradigms for human-computer interaction based on natural language have been widely adopted. However, human interaction with the world involves more than text: other modalities such as images and depth are equally important. Yet most current research on multimodal large language models (MLLMs) is closed-source, which is unfriendly to students at universities and most research institutions. Moreover, large language models are limited by the knowledge seen during training and often lack awareness of current events and the ability to perform complex reasoning; they can answer questions quickly but cannot "think deeply." AI agents are the key to this problem: they give LLMs the capacity for deep thought and complex decision-making, allowing them to develop into intelligent entities that are autonomous, reactive, proactive, and social. We believe the field of AI agents will give rise to more results that change how we live and work, and that it is an important evolutionary direction for large language models and multimodal large models.

Scholars from Beihang University, Fudan University, the University of Sydney, the Chinese University of Hong Kong (Shenzhen), and Shanghai Artificial Intelligence Laboratory have jointly launched one of the earliest open-source communities for multimodal language models: LAMM (Language-Assisted Multi-modal Model). The goal is to grow LAMM into a community ecosystem that supports research on MLLM training and evaluation, MLLM-driven agents, and more. As one of the earliest open-source projects in the field of multimodal large language models, LAMM aims to build an open research ecosystem in which every researcher and developer can build on the framework and contribute back to the community.

  • Project homepage: https://openlamm.github.io
  • Code address: https://www.github.com/OpenGVLab/LAMM

Here you can:

  • Train and evaluate MLLMs at minimal compute cost, with GPUs as modest as an RTX 3090 or V100, making it easy to get started with MLLM training and evaluation.
  • Build an MLLM-based embodied agent, using robots or game simulators to define tasks and generate data.
  • Extend MLLM applications to virtually any domain of expertise.

Open-source framework

The LAMM codebase provides a unified dataset format, component-based model design, and one-click distributed training, making it easy for users to get started and build their own multimodal language models.

  • Fine-tuning datasets use a standardized format compatible with different instruction sets. LAMM defines a standardized multimodal instruction fine-tuning data format that is directly compatible with commonly used datasets such as LLaVA, LAMM, and ShareGPT4V, so fine-tuning can be launched with one click (a sketch of what such a record might look like follows this list).
  • The component-based model design makes it easy to update and modify the architecture. A model in LAMM consists of a Vision Encoder, a Feature Projector, and a Language Model (LLM). LAMM currently supports modality encoders for images and point clouds, as well as pre-trained language models such as LLaMA/LLaMA2.
  • Training and evaluating MLLMs requires minimal compute. The LAMM repo integrates acceleration frameworks such as DeepSpeed, LightLLM, and FlashAttention to greatly reduce training cost; fine-tuning a 7B language model on four RTX 3090s or newer devices is already supported. LAMM also continually follows up on new large language models and optimization frameworks to advance the multimodal field.
  • Embodied AI agents can be built on top of MLLMs. Once a target task is defined with a robot or simulator and the corresponding instruction data is generated, an MLLM powered by LAMM can act as a capable agent for decision-making and analysis.
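
For illustration, here is a minimal sketch of what a record in such an instruction fine-tuning dataset might look like. The field names ("id", "image", "conversations") follow the common LLaVA-style convention and are assumptions for illustration rather than LAMM's authoritative schema; consult the repository for the exact format.

```python
import json

# A hypothetical instruction fine-tuning record in the LLaVA/LAMM style.
# Field names are illustrative assumptions; see the LAMM repo for the real schema.
sample_record = {
    "id": "000001",
    "image": "images/000001.jpg",  # path to the visual input (image, point cloud, ...)
    "conversations": [
        {"from": "human", "value": "<image>\nWhat objects are on the table?"},
        {"from": "gpt",   "value": "A laptop, a coffee mug, and a notebook."},
    ],
}

# Datasets in this style are typically stored as a JSON list of such records.
with open("instruction_data.json", "w", encoding="utf-8") as f:
    json.dump([sample_record], f, ensure_ascii=False, indent=2)
```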

For more details, please refer to the project homepage.

Multimodal large language model training and evaluation

A large number of recent works have demonstrated the ability of multimodal large language models (MLLMs) to understand and interact with visual content, as well as to solve more complex downstream tasks. Beyond the common image input, LAMM currently supports visual modalities such as point clouds, and users can add new encoders as needed. LAMM also supports the PEFT package for parameter-efficient fine-tuning and integrates tools such as FlashAttention and xFormers to further reduce compute cost, so that users can train MLLMs as cheaply as possible. For complex multi-task learning, LAMM additionally supports strategies such as MoE to unify multiple sets of fine-tuning parameters, further improving multi-task capability and yielding a more versatile MLLM.
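
As a concrete example of the kind of parameter-efficient fine-tuning this builds on, the following is a minimal sketch using the Hugging Face PEFT library to attach LoRA adapters to a LLaMA-style language model. It is a generic illustration rather than LAMM's actual training code; the model name and hyperparameters are placeholders.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (generic sketch,
# not LAMM's training code; model name and hyperparameters are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Attach low-rank adapters to the attention projections; only the adapter
# weights are trained, which keeps memory usage low enough for consumer
# GPUs such as the RTX 3090.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```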

However, due to the lack of a standardized, comprehensive evaluation framework, the capabilities and limitations of these models have not been fully explored; we still cannot say exactly what they can and cannot do. Existing benchmarking work mainly builds multimodal evaluation datasets for multimodal large models, evaluates only a subset of visual ability dimensions, or attempts to establish an evaluation framework but lacks scalability and comprehensiveness. Comprehensively evaluating each model and making fair, reliable comparisons across models therefore remains challenging. LAMM implements a highly scalable and flexible evaluation framework designed to provide reliable and comprehensive evaluation of multimodal large models.
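
The sketch below illustrates the general idea of such a composable evaluation pipeline: an evaluation dataset, an inference strategy, and a metric are assembled into a recipe and run end to end. All class and function names here are hypothetical placeholders used to illustrate the design, not the actual API of LAMM's evaluation framework.

```python
# Hypothetical sketch of a composable evaluation recipe (illustrative only;
# class names and signatures are not the framework's actual API).
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class EvalRecipe:
    scenario: Iterable[dict]                        # evaluation samples: visual input + question + answer
    inferencer: Callable[[object, dict], str]       # how the model is queried (direct answer, ranking, ...)
    metric: Callable[[List[str], List[str]], float] # how predictions are scored

def run_recipe(model, recipe: EvalRecipe) -> float:
    predictions, references = [], []
    for sample in recipe.scenario:
        predictions.append(recipe.inferencer(model, sample))
        references.append(sample["answer"])
    return recipe.metric(predictions, references)

# Example metric: exact-match accuracy.
def exact_match(preds: List[str], refs: List[str]) -> float:
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / max(len(refs), 1)
```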

For details, please refer to https://openlamm.github.io/paper_list/ChEF


A one-click, composable evaluation framework for multimodal language models

Some of the capabilities of multimodal models built with the LAMM framework are demonstrated below:

Q&A based on 2D image content:


Visual Q&A based on 3D point clouds:

An embodied agent driven by a multimodal large language model

A lot of recent work builds agents on the powerful reasoning and planning capabilities of large language models (LLMs); for example, Voyager and GITM in Minecraft use textual memory to plan the agent's actions. However, these works assume that the agent can obtain all the correct environmental perception information when planning and making decisions. They skip the perception stage entirely and ignore the influence of real-time first-person views on how an embodied agent plans its own actions, an assumption that does not hold in real life.

To enable an embodied agent to better perceive its surroundings in a complex open-world environment, we propose MP5, an MLLM-driven embodied agent characterized by visual perception and active perception. The visual perception module (whose main architecture is LAMM) allows MP5 to solve tasks it has never seen before, while active perception lets it actively gather environmental information in order to take appropriate actions. As a result, MP5 has open-ended perception capabilities, can provide perception results tailored to different purposes, and can complete long-horizon tasks involving complex environmental information.
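
To make the perceive-plan-act loop concrete, here is a heavily simplified sketch of how an MLLM-driven embodied agent might be wired together. The environment and model interfaces (env.reset, env.step, mllm.query) and the overall structure are illustrative assumptions, not MP5's actual implementation.

```python
# Simplified perceive-plan-act loop for an MLLM-driven embodied agent
# (illustrative assumptions only; not MP5's actual implementation).

def run_agent(env, mllm, task: str, max_steps: int = 100) -> bool:
    observation = env.reset()  # first-person frame from the simulator (hypothetical interface)
    for _ in range(max_steps):
        # Active perception: query the MLLM about the current view with a
        # task-specific prompt instead of assuming perfect world knowledge.
        perception = mllm.query(
            image=observation,
            prompt=f"Task: {task}. Describe the objects relevant to this task.",
        )

        # Planning: ask the MLLM to choose the next action given what it perceived.
        action = mllm.query(
            image=observation,
            prompt=f"Task: {task}. Perception: {perception}. "
                   f"What single action should be taken next?",
        )

        observation, done = env.step(action)  # execute the action in the world
        if done:
            return True
    return False
```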

Summary

Thanks to the powerful capabilities and broad application prospects of MLLMs, multimodal learning has reached a new stage. LAMM aims to build an open-source community that facilitates research on multimodal large models, and it open-sources all relevant resources, including data preparation, model training, and performance evaluation, to the community.

As one of the first teams to invest in multimodal language model research, we hope to continue developing the LAMM toolbox, providing a lightweight and easy-to-use multimodal research framework for the LAMM open-source ecosystem and working with the open-source community to enable more meaningful research.

The above content will continue to be open-sourced and updated on the LAMM homepage. Please follow our homepage and projects, and feel free to submit feedback and PRs to the LAMM codebase.
