
Domestic multimodal large model surges ahead! Yan Shuicheng leads the open-sourcing of Vitron

Author: Zhidx

Editor | ZeR0

On the road to artificial general intelligence, large models have taken another big step.

Zhidx reported on April 25 that recently, led by Professor Yan Shuicheng, teams from the Kunlun Wanwei 2050 Global Research Institute, the National University of Singapore, and Nanyang Technological University jointly released and open-sourced Vitron, a general-purpose pixel-level visual multimodal large language model.

Vitron tackles the image/video model fragmentation that has long plagued the large language model field. It supports a series of visual tasks from visual understanding to visual generation, from low level to high level, including comprehensive understanding, generation, segmentation, and editing of both static images and dynamic video content. It can handle complex visual tasks, excels at visual understanding and task execution, and supports continuous interaction with users, enabling flexible human-computer interaction.


Paper link: https://is.gd/aGu0VV

Open source code: https://github.com/SkyworkAI/Vitron

The model's functional support and key advantages across four vision-related task areas are summarized below:

[Image: overview of Vitron's functional support and key advantages across the four vision task areas]

This demonstrates the great potential of a more unified visual multimodal generalist model and lays the groundwork for the ultimate form of next-generation general-purpose vision models.

1. A unified multimodal large language model proposed to address key challenges in visual tasks

Building more general, powerful multimodal large language models (MLLMs) is seen as the path to artificial general intelligence (AGI).

In recent years, many models that excel at image understanding have emerged, such as BLIP-2, LLaVA, and MiniGPT-4, which extend pure-text large language models (LLMs) into MLLMs by introducing modules capable of visual perception. MLLMs focused on video understanding have also been released, including VideoChat, Video-LLaMA, and Video-LLaVA, among others.

Researchers are trying to further expand the capabilities of MLLMs along two main dimensions.

The first is to deepen MLLMs' visual understanding, moving from coarse instance-level understanding to fine-grained pixel-level understanding of images, thereby enabling visual region grounding capabilities, as in GLaMM, PixelLM, NExT-Chat, and MiniGPT-v2.

The second is to expand the range of visual functions MLLMs can support. Some research has begun to enable MLLMs not only to understand input visual signals but also to generate visual content as output. For example, models such as GILL and Emu can flexibly generate image content, while GPT4Video and NExT-GPT enable video generation.

At present, the AI community believes that visual MLLMs will inevitably develop toward greater unification and stronger capabilities.

There are key challenges along the way. For example, much of the existing work does not achieve visual understanding at the fine-grained pixel level, lacks unified support for both images and videos, or provides insufficient support across the various visual tasks, falling far short of a general-purpose large model.

The following table briefly summarizes the capabilities of some of the existing visual MLLM models:

[Table: capabilities of representative existing visual MLLMs]

The researchers argue that future MLLMs should be general-purpose large language models that cover a wider range of visual tasks and operations, providing unified support for all vision-related tasks and achieving "one for all" capability. This is essential for practical applications, especially in visual creation, which often involves a series of iterations and interactions.

For example, users typically start with text, turning an idea into visual content through text-to-image generation; they then refine the initial idea with fine-grained image editing to add more detail; next they generate video from the images to create dynamic content; and finally they complete the creation through several rounds of iterative interaction, such as video editing.

To bridge the above gaps, the research team proposed Vitron, a general-purpose pixel-level visual MLLM.

2. The Vitron architecture consists of three modules, and model training proceeds in three stages

Vitron adopts an architecture similar to existing MLLMs, consisting of three key components: 1) a front-end vision & language encoding module, 2) a central LLM for understanding and text generation, and 3) a back-end module for user response and invocation of visual manipulation modules.


▲ Vitron overall framework

  • Front-end module: vision-language encoding. To perceive image and video signals and support fine-grained user visual input, it integrates an image encoder, a video encoder, and a region box/sketch encoder.
  • Central module: core LLM. Vicuna (7B, v1.5) is adopted for understanding, reasoning, decision-making, and multi-round user interaction.
  • Back-end module: user response and module invocation. Using a text-centric invocation strategy, several off-the-shelf state-of-the-art (SoTA) image and video processing modules are integrated to decode and execute a range of terminal vision tasks from low level to high level. Adopting a text-centric module integration and invocation approach not only unifies the system but also ensures alignment efficiency and system scalability. A minimal sketch of this pipeline is shown after this list.
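To make the text-centric invocation idea concrete, here is a minimal Python sketch of such a three-part pipeline. All class, encoder, and module names are hypothetical illustrations, not the actual Vitron implementation.

```python
# Minimal sketch of a text-centric invocation pipeline (hypothetical names,
# not the actual Vitron code): front-end encoders feed the central LLM, and
# the LLM's textual output decides which back-end vision module to call.

class VisionLanguagePipeline:
    def __init__(self, encoders, llm, backend_modules):
        self.encoders = encoders                # dict: modality -> encoder callable
        self.llm = llm                          # central LLM (e.g. a Vicuna-7B wrapper)
        self.backend_modules = backend_modules  # dict: module name -> off-the-shelf vision module

    def run(self, user_text, inputs):
        # 1) Front end: encode each provided visual input (image, video, sketch/box).
        features = {name: self.encoders[name](x) for name, x in inputs.items()}

        # 2) Central LLM: understand the request and emit a text response that may
        #    name a back-end module and carry an invocation command.
        reply, module_name, command = self.llm(user_text, features)

        # 3) Back end: dispatch the command to the named vision module, if any.
        result = self.backend_modules[module_name](command) if module_name else None
        return reply, result
```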

Based on this architecture, Vitron is then trained and fine-tuned to achieve stronger visual understanding and task execution capabilities. Model training proceeds in three main stages.

Step 1: Vision-language alignment learning. Input visual and language features are mapped into a unified feature space so that the model can effectively understand incoming multimodal signals. This is a coarse-grained vision-language alignment that lets the system process incoming visual signals effectively as a whole. The researchers trained on existing datasets of image-caption pairs (CC3M), video-caption pairs (WebVid), and region-caption pairs (RefCOCO).
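Conceptually, this coarse-grained alignment amounts to learning a projection from each encoder's feature space into the LLM's embedding space, supervised by caption prediction on the paired data. Below is a minimal PyTorch sketch of such a projector; the dimensions and layer choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen encoder features (image/video/region) into the LLM's
    embedding space so captions can be predicted from visual tokens.
    Dimensions are illustrative, not Vitron's actual configuration."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        # returns:      (batch, num_patches, llm_dim), used as soft visual tokens
        return self.proj(vision_feats)

# During alignment training, only the projector is updated; the language-modeling
# loss on the paired caption provides the supervision.
projector = VisualProjector()
dummy_feats = torch.randn(2, 256, 1024)
visual_tokens = projector(dummy_feats)  # ready to prepend to the text embeddings
```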

Step 2: Fine-grained spatiotemporal visual grounding instruction tuning. The system relies on external modules to perform various pixel-level vision tasks, but the LLM itself has not undergone any fine-grained visual training, which would prevent the system from achieving true pixel-level visual understanding. To this end, the researchers propose fine-grained spatiotemporal visual grounding instruction tuning, whose core idea is to enable the LLM to locate the fine-grained spatial features of images and the specific temporal features of videos.
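As a concrete illustration of what such grounding supervision could look like, the hypothetical samples below express spatial targets as normalized bounding-box coordinates and temporal targets as frame ranges written directly in the answer text; the tag format is an assumption, not the template used in the paper.

```python
# Illustrative fine-grained grounding samples (hypothetical format).
# Spatial targets are written as normalized [x1, y1, x2, y2] boxes and
# temporal targets as frame-index ranges, both in plain text so the LLM
# can learn to generate them.

image_grounding_sample = {
    "instruction": "Locate the dog chasing the frisbee in the image.",
    "response": "The dog is at <box>[0.12, 0.40, 0.37, 0.82]</box>.",
}

video_grounding_sample = {
    "instruction": "When does the car enter the frame, and where is it?",
    "response": "The car appears between frames <time>[48, 96]</time> "
                "at <box>[0.55, 0.30, 0.90, 0.75]</box>.",
}
```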

Step 3: Output-side instruction tuning for module invocation. The second training stage gives the LLM and the front-end encoders the ability to understand vision at the pixel level. This final step, instruction tuning oriented toward module invocation, is designed to equip the system with the ability to execute commands precisely, allowing the LLM to generate appropriate and correct invocation text.

Since different terminal vision tasks may require different invocation commands, the researchers propose standardizing the LLM's response output into a structured text format, which includes:

  1. User response output: the reply delivered directly to the user's input.
  2. Module name: indicates the function or task module to be executed.
  3. Invocation command: the meta-instruction that triggers the task module.
  4. Region (optional output): specifies the fine-grained visual feature required by certain tasks, such as video tracking or visual editing, which the back-end modules need. For regions, based on the LLM's pixel-level understanding, a bounding box described by coordinates is output. A sketch of how such a structured response could be parsed is shown after this list.
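To illustrate why standardizing the response matters, a back end can recover the four fields with a small parser and route the call accordingly. The tag names and markup below are hypothetical; the paper specifies the four components but not this exact format.

```python
import re
from typing import Optional

def parse_structured_response(text: str) -> dict:
    """Parse a hypothetical four-field structured LLM response of the form:
    <reply>...</reply><module>...</module><command>...</command><region>...</region>
    The region field is optional; the tags are illustrative, not Vitron's exact format."""

    def _field(tag: str) -> Optional[str]:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None

    region = _field("region")
    return {
        "user_reply": _field("reply"),
        "module_name": _field("module"),
        "invocation_command": _field("command"),
        # Bounding box parsed into floats, e.g. "[0.55, 0.30, 0.90, 0.75]"
        "region": [float(x) for x in re.findall(r"[\d.]+", region)] if region else None,
    }

example = ("<reply>Sure, I will track the red car.</reply>"
           "<module>video_tracking</module>"
           "<command>track object in clip_01.mp4</command>"
           "<region>[0.55, 0.30, 0.90, 0.75]</region>")
print(parse_structured_response(example))
```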

3. Performance evaluated on four major visual tasks, demonstrating flexible human-computer interaction

Based on Vitron, the researchers conducted extensive experimental evaluations on 22 common benchmark datasets covering 12 image/video vision tasks. Vitron demonstrated outstanding performance across four major visual task clusters (segmentation, understanding, content generation, and editing), along with flexible human-machine interaction capabilities.

Some qualitative comparisons are illustrated below:

Visual Segmentation:


▲ Image referring segmentation results

Fine-grained visual understanding:


▲ Image referring comprehension results


▲Video QA results

Visual Generation:


▲ Text-to-image results


▲ Text-to-video results


▲ Image-to-video results

Visual Editing:


▲Image editing results

For more detailed experimental results and analysis, please refer to the paper.

Conclusion: Three directions for future exploration: system architecture, user interaction, and modality capability

Vitron shows unique advantages and potential in comprehensiveness, technical innovation, human-computer interaction, and practical application, helping to advance the development of multimodal large models and providing a new direction for future research on large vision models.

The Kunlun Wanwei 2050 Global Research Institute has been committed to building an outstanding scientific research institution for the future world, working with the scientific community to cross the "singularity", explore the unknown, and create a better future. The institute previously released and open-sourced AgentStudio, a digital agent R&D toolkit, and will continue to drive AI technology breakthroughs.

The jointly developed Vitron system demonstrates strong versatility, but some limitations remain.

The researchers listed three directions for further exploration in the future:

1. System architecture

The Vitron system still uses a semi-joint, semi-agent approach to invoking external tools. While this invocation-based approach makes it easy to extend and replace potential modules, it also means that the back-end modules in this pipelined structure do not participate in joint learning with the front end and the LLM core. This limitation is not conducive to end-to-end learning of the whole system, and it means that the performance ceiling on different vision tasks is bounded by the back-end modules.

Future work should integrate the various visual task modules into a unified unit. Achieving unified understanding and output for images and videos, while supporting generation and editing through a single generative paradigm, remains a challenge.

One promising approach is to adopt modality-persistent tokenization to improve the unification of the system across different inputs, outputs, and tasks.

2. User interactivity

Unlike previous models that focus on a single vision task (e.g., Stable Diffusion and SEEM), Vitron is designed to facilitate deep interaction between the LLM and users, similar to OpenAI's DALL-E series and Midjourney in industry. Achieving optimal user interaction is one of the core goals of this work.

Vitron leverages existing language-based LLMs, combined with appropriate instruction tuning, to achieve a certain level of interactivity. For example, the system can respond flexibly to whatever message a user enters and produce the corresponding visual action result, without requiring the user's input to exactly match the conditions expected by the back-end modules.

However, there is still much room for improvement in interactivity. For example, drawing inspiration from the closed-source Midjourney system, no matter what decision the LLM makes at each step, the system should actively provide feedback to the user to ensure that its actions and decisions remain consistent with the user's intent.

3. Modality capability

Currently, Vitron integrates a 7B Vicuna model, which may limit its ability to understand language, images, and video.

Future exploration could develop a comprehensive end-to-end system, for example by scaling up the model to achieve more thorough and comprehensive visual understanding. In addition, efforts should be made to enable the LLM to fully unify the understanding of image and video modalities.
