
Domestic multimodal large model surges ahead! Yan Shuicheng leads the open-sourcing of Vitron

Author: Zhidx

Editor | ZeR0

On the road to artificial general intelligence, large models have taken another big step.

Zhidx reported on April 25 that recently, led by Professor Yan Shuicheng, teams from the Kunlun Wanwei 2050 Global Research Institute, the National University of Singapore, and Nanyang Technological University jointly released and open-sourced Vitron, a general-purpose pixel-level visual multimodal large language model.

Vitron tackles the image/video model fragmentation that has long plagued the large language model field. It supports a series of visual tasks from visual understanding to visual generation, from low level to high level, including comprehensive understanding, generation, segmentation, and editing of both static images and dynamic video content. It can handle complex visual tasks, excels at visual understanding and task execution, and supports continuous interaction with users, enabling flexible human-computer interaction.


Paper link: https://is.gd/aGu0VV

Open source code: https://github.com/SkyworkAI/Vitron

The model's functional support and key advantages across four vision-related task areas are summarized below:

[Image: overview of Vitron's functional support and key advantages across the four vision task areas]

This demonstrates the great potential of a more unified visual multimodal generalist model and lays the groundwork for the ultimate form of next-generation general-purpose vision models.

1. A unified multimodal large language model proposed to address key challenges in visual tasks

Building more general, powerful multimodal large language models (MLLMs) is seen as the path to artificial general intelligence (AGI).

In recent years, many models that excel at image understanding have emerged, such as BLIP-2, LLaVA, and MiniGPT-4, which extend pure-text large language models (LLMs) into MLLMs by introducing modules capable of visual perception. MLLMs focused on video understanding have also been released, including VideoChat, Video-LLaMA, and Video-LLaVA, among others.

Researchers are trying to further expand the capabilities of MLLMs along two main dimensions.

The first is to deepen MLLMs' visual understanding, moving from coarse instance-level understanding to fine-grained pixel-level understanding of images, thereby enabling visual region grounding capabilities, as in GLaMM, PixelLM, NExT-Chat, and MiniGPT-v2.

The second is to expand the range of visual functions MLLMs can support. Some research has begun to enable MLLMs not only to understand input visual signals but also to generate visual content as output. For example, models such as GILL and Emu can flexibly generate image content, while GPT4Video and NExT-GPT enable video generation.

At present, the AI community believes that visual MLLMs will inevitably develop toward greater unification and stronger capabilities.

There are key challenges along the way. For example, much of the existing work does not achieve visual understanding at the fine-grained pixel level, lacks unified support for both images and videos, or provides insufficient support across the various visual tasks, falling far short of a general-purpose large model.

The following table briefly summarizes the capabilities of some of the existing visual MLLM models:

[Table: capabilities of representative existing visual MLLMs]

The researchers argue that future MLLMs should be general-purpose large language models that cover a wider range of visual tasks and operations, providing unified support for all vision-related tasks and achieving "one for all" capability. This is essential for practical applications, especially in visual creation, which often involves a series of iterations and interactions.

For example, users typically start with text, turning an idea into visual content through text-to-image generation; they then refine the initial idea with fine-grained image editing to add more detail; next they generate video from the images to create dynamic content; and finally they complete the creation through several rounds of iterative interaction, such as video editing.

To bridge the above gaps, the research team proposed Vitron, a general-purpose pixel-level visual MLLM.

2. The Vitron architecture consists of three modules, and model training proceeds in three stages

Vitron adopts an architecture similar to existing MLLMs, consisting of three key components: 1) a front-end vision & language encoding module, 2) a central LLM for understanding and text generation, and 3) a back-end module for user response and invocation of visual manipulation modules.


▲ Vitron overall framework

  • Front-end module: vision-language encoding. To perceive image and video signals and support fine-grained user visual input, it integrates an image encoder, a video encoder, and a region box/sketch encoder.
  • Central module: core LLM. Vicuna (7B, v1.5) is adopted for understanding, reasoning, decision-making, and multi-round user interaction.
  • Back-end module: user response and module invocation. Using a text-centric invocation strategy, several off-the-shelf state-of-the-art (SoTA) image and video processing modules are integrated to decode and execute a range of terminal vision tasks from low level to high level. Adopting a text-centric module integration and invocation approach not only unifies the system but also ensures alignment efficiency and system scalability. A minimal sketch of this pipeline is shown after this list.
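To make the text-centric invocation idea concrete, here is a minimal Python sketch of such a three-part pipeline. All class, encoder, and module names are hypothetical illustrations, not the actual Vitron implementation.

```python
# Minimal sketch of a text-centric invocation pipeline (hypothetical names,
# not the actual Vitron code): front-end encoders feed the central LLM, and
# the LLM's textual output decides which back-end vision module to call.

class VisionLanguagePipeline:
    def __init__(self, encoders, llm, backend_modules):
        self.encoders = encoders                # dict: modality -> encoder callable
        self.llm = llm                          # central LLM (e.g. a Vicuna-7B wrapper)
        self.backend_modules = backend_modules  # dict: module name -> off-the-shelf vision module

    def run(self, user_text, inputs):
        # 1) Front end: encode each provided visual input (image, video, sketch/box).
        features = {name: self.encoders[name](x) for name, x in inputs.items()}

        # 2) Central LLM: understand the request and emit a text response that may
        #    name a back-end module and carry an invocation command.
        reply, module_name, command = self.llm(user_text, features)

        # 3) Back end: dispatch the command to the named vision module, if any.
        result = self.backend_modules[module_name](command) if module_name else None
        return reply, result
```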

Based on this architecture, Vitron is then trained and fine-tuned to achieve stronger visual understanding and task execution capabilities. Model training proceeds in three main stages.

Step 1: Vision-language alignment learning. Input visual and language features are mapped into a unified feature space so that the model can effectively understand incoming multimodal signals. This is a coarse-grained vision-language alignment that lets the system process incoming visual signals effectively as a whole. The researchers trained on existing datasets of image-caption pairs (CC3M), video-caption pairs (WebVid), and region-caption pairs (RefCOCO).
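Conceptually, this coarse-grained alignment amounts to learning a projection from each encoder's feature space into the LLM's embedding space, supervised by caption prediction on the paired data. Below is a minimal PyTorch sketch of such a projector; the dimensions and layer choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen encoder features (image/video/region) into the LLM's
    embedding space so captions can be predicted from visual tokens.
    Dimensions are illustrative, not Vitron's actual configuration."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        # returns:      (batch, num_patches, llm_dim), used as soft visual tokens
        return self.proj(vision_feats)

# During alignment training, only the projector is updated; the language-modeling
# loss on the paired caption provides the supervision.
projector = VisualProjector()
dummy_feats = torch.randn(2, 256, 1024)
visual_tokens = projector(dummy_feats)  # ready to prepend to the text embeddings
```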

Step 2: Fine-grained spatiotemporal visual grounding instruction tuning. The system relies on external modules to perform various pixel-level vision tasks, but the LLM itself has not undergone any fine-grained visual training, which would prevent the system from achieving true pixel-level visual understanding. To this end, the researchers propose fine-grained spatiotemporal visual grounding instruction tuning, whose core idea is to enable the LLM to locate the fine-grained spatial features of images and the specific temporal features of videos.
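As a concrete illustration of what such grounding supervision could look like, the hypothetical samples below express spatial targets as normalized bounding-box coordinates and temporal targets as frame ranges written directly in the answer text; the tag format is an assumption, not the template used in the paper.

```python
# Illustrative fine-grained grounding samples (hypothetical format).
# Spatial targets are written as normalized [x1, y1, x2, y2] boxes and
# temporal targets as frame-index ranges, both in plain text so the LLM
# can learn to generate them.

image_grounding_sample = {
    "instruction": "Locate the dog chasing the frisbee in the image.",
    "response": "The dog is at <box>[0.12, 0.40, 0.37, 0.82]</box>.",
}

video_grounding_sample = {
    "instruction": "When does the car enter the frame, and where is it?",
    "response": "The car appears between frames <time>[48, 96]</time> "
                "at <box>[0.55, 0.30, 0.90, 0.75]</box>.",
}
```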

Step 3: Output-side instruction tuning for module invocation. The second training stage gives the LLM and the front-end encoders the ability to understand vision at the pixel level. This final step, instruction tuning oriented toward module invocation, is designed to equip the system with the ability to execute commands precisely, allowing the LLM to generate appropriate and correct invocation text.

Since different terminal vision tasks may require different invocation commands, the researchers propose standardizing the LLM's response output into a structured text format, which includes:

  1. User response output: the reply delivered directly to the user's input.
  2. Module name: indicates the function or task module to be executed.
  3. Invocation command: the meta-instruction that triggers the task module.
  4. Region (optional output): specifies the fine-grained visual feature required by certain tasks, such as video tracking or visual editing, which the back-end modules need. For regions, based on the LLM's pixel-level understanding, a bounding box described by coordinates is output. A sketch of how such a structured response could be parsed is shown after this list.
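To illustrate why standardizing the response matters, a back end can recover the four fields with a small parser and route the call accordingly. The tag names and markup below are hypothetical; the paper specifies the four components but not this exact format.

```python
import re
from typing import Optional

def parse_structured_response(text: str) -> dict:
    """Parse a hypothetical four-field structured LLM response of the form:
    <reply>...</reply><module>...</module><command>...</command><region>...</region>
    The region field is optional; the tags are illustrative, not Vitron's exact format."""

    def _field(tag: str) -> Optional[str]:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None

    region = _field("region")
    return {
        "user_reply": _field("reply"),
        "module_name": _field("module"),
        "invocation_command": _field("command"),
        # Bounding box parsed into floats, e.g. "[0.55, 0.30, 0.90, 0.75]"
        "region": [float(x) for x in re.findall(r"[\d.]+", region)] if region else None,
    }

example = ("<reply>Sure, I will track the red car.</reply>"
           "<module>video_tracking</module>"
           "<command>track object in clip_01.mp4</command>"
           "<region>[0.55, 0.30, 0.90, 0.75]</region>")
print(parse_structured_response(example))
```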

3. Performance evaluated on four major visual tasks, demonstrating flexible human-computer interaction

Based on Vitron, the researchers conducted extensive experimental evaluations on 22 common benchmark datasets covering 12 image/video vision tasks. Vitron demonstrated outstanding performance across four major visual task clusters (segmentation, understanding, content generation, and editing), along with flexible human-machine interaction capabilities.

Some qualitative comparisons are illustrated below:

Visual Segmentation:


▲ Image referring segmentation results

Fine-grained visual understanding:


▲ Image referring comprehension results


▲Video QA results

Visual Generation:


▲ Text-to-image results


▲ Text-to-video results


▲ Image-to-video results

Visual Editing:


▲Image editing results

For more detailed experimental results and analysis, please refer to the paper.

Conclusion: Three directions for future exploration: system architecture, user interaction, and modality capability

Vitron shows unique advantages and potential in comprehensiveness, technical innovation, human-computer interaction, and practical application, helping to advance the development of multimodal large models and providing a new direction for future research on large vision models.

The Kunlun Wanwei 2050 Global Research Institute has been committed to building an outstanding scientific research institution for the future world, working with the scientific community to cross the "singularity", explore the unknown, and create a better future. The institute previously released and open-sourced AgentStudio, a digital agent R&D toolkit, and will continue to drive AI technology breakthroughs.

The jointly developed Vitron system demonstrates strong versatility, but some limitations remain.

The researchers listed three directions for further exploration in the future:

1. System architecture

The Vitron system still uses a semi-joint, semi-agent approach to invoking external tools. While this invocation-based approach makes it easy to extend and replace potential modules, it also means that the back-end modules in this pipelined structure do not participate in joint learning with the front end and the LLM core. This limitation is not conducive to end-to-end learning of the whole system, and it means that the performance ceiling on different vision tasks is bounded by the back-end modules.

Future work should integrate the various visual task modules into a unified unit. Achieving unified understanding and output for images and videos, while supporting generation and editing through a single generative paradigm, remains a challenge.

One promising approach is to adopt modality-persistent tokenization to improve the unification of the system across different inputs, outputs, and tasks.

2. User interactivity

Unlike previous models that focus on a single vision task (e.g., Stable Diffusion and SEEM), Vitron is designed to facilitate deep interaction between the LLM and users, similar to OpenAI's DALL-E series and Midjourney in industry. Achieving optimal user interaction is one of the core goals of this work.

Vitron leverages existing language-based LLMs, combined with appropriate instruction tuning, to achieve a certain level of interactivity. For example, the system can respond flexibly to whatever message a user enters and produce the corresponding visual action result, without requiring the user's input to exactly match the conditions expected by the back-end modules.

However, there is still much room for improvement in interactivity. For example, drawing inspiration from the closed-source Midjourney system, no matter what decision the LLM makes at each step, the system should actively provide feedback to the user to ensure that its actions and decisions remain consistent with the user's intent.

3. Modality capability

Currently, Vitron integrates a 7B Vicuna model, which may limit its ability to understand language, images, and video.

Future exploration could develop a comprehensive end-to-end system, for example by scaling up the model to achieve more thorough and comprehensive visual understanding. In addition, efforts should be made to enable the LLM to fully unify the understanding of image and video modalities.
