
Scientists propose a novel tuning scheme to enhance the performance of multimodal large models on downstream multimodal tasks

Author: DeepTech

The advent of large language models, represented by ChatGPT, marks a new milestone in the field of AI.

At the same time, the development of multimodal large models that can process text, image, audio, and video data has given large language models "eyes" and "ears," turning them into comprehensive agents with multiple perceptual capabilities and strong knowledge understanding.

Because of their strong generalization and transferability, which enhance the multimodal understanding and generation capabilities of large models, multimodal large models have become a new track in AI development.

It is understood that current multimodal large models such as LLaVA typically follow a two-stage training paradigm.

The first stage is vision-language alignment: visual features are mapped into the language model's word embedding space through a static projector, allowing the large language model to understand visual content.
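
For readers unfamiliar with this setup, below is a minimal sketch (in PyTorch) of what stage one looks like: a static projector maps vision-encoder features into the LLM's word-embedding space so that the projected visual tokens can be concatenated with text embeddings. The dimensions and the two-layer MLP are common choices (e.g., in LLaVA-1.5) and are shown here as assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                 # e.g., CLIP ViT-L features, 7B-class LLM width

static_projector = nn.Sequential(                # fixed parameters, shared by all inputs
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

visual_feats = torch.randn(2, 576, vision_dim)   # (batch, image patches, vision_dim)
text_embeds = torch.randn(2, 32, llm_dim)        # embedded prompt tokens

visual_tokens = static_projector(visual_feats)   # project into the word-embedding space
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # sequence fed to the LLM
```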

The second stage is multimodal instruction fine-tuning: the large language model is fine-tuned on a constructed visual-language instruction set so that it can better respond to diverse user requests involving visual content.

Although these two stages are crucial, there has been relatively little research on the projector's structure and on tuning strategies for the large language model.

Existing methods still adopt multimodal large model architectures with static parameters, and this mode of sharing parameters across tasks is limiting when handling diverse multimodal tasks.

To overcome this limitation, a team of researchers from Zhejiang University, ShanghaiTech University, Chongqing University, Alibaba Group, and Harbin Institute of Technology proposed HyperLLaVA.

They use hypernetworks (HyperNetworks) together with adapters to build dynamic expert modules that adaptively generate parameters from the perceptual input. By integrating a static multimodal model architecture with these dynamically adjusted expert modules, the method achieves adaptive visual-text projection in the first stage and dynamic adjustment of the large language model's parameters in the second stage, effectively improving the generalization of multimodal large models across different downstream multimodal tasks.

Specifically:

First, in the vision-language alignment stage, the projector is decomposed into a static layer and a dynamic layer.

The parameters of the static layer remain fixed, while the parameters of the dynamic layer are generated on the fly from the input's visual features. The dynamic layer assists the static projector in input-aware adaptive feature modeling, flexibly converting visual features into text tokens and achieving fine-grained alignment of the visual-language semantic space.
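
As a rough illustration of this static-plus-dynamic design, the sketch below adds an input-conditioned layer on top of a static projector: a small hypernetwork turns the pooled visual features into low-rank weights that refine the projected tokens. The pooling, rank, and dimensions are assumptions made for the sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveProjector(nn.Module):
    """Static MLP projector plus a dynamic layer whose weights are generated per input."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, rank: int = 16):
        super().__init__()
        self.llm_dim, self.rank = llm_dim, rank
        # Static part: fixed parameters shared by all inputs (as in the original LLaVA projector).
        self.static = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Dynamic part: a hypernetwork emits low-rank adapter weights from pooled visual features.
        self.hyper = nn.Linear(vision_dim, 2 * llm_dim * rank)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        tokens = self.static(visual_feats)                    # static projection to LLM space
        ctx = visual_feats.mean(dim=1)                        # per-input conditioning vector
        w = self.hyper(ctx)
        w_down, w_up = w.split(self.llm_dim * self.rank, dim=-1)
        w_down = w_down.view(-1, self.llm_dim, self.rank)     # (batch, llm_dim, rank)
        w_up = w_up.view(-1, self.rank, self.llm_dim)         # (batch, rank, llm_dim)
        # Input-adaptive residual refinement of the projected visual tokens.
        return tokens + torch.bmm(torch.relu(torch.bmm(tokens, w_down)), w_up)
```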

Second, in the multimodal instruction fine-tuning stage, the large language model is equipped with a language expert that models dynamic parameters for the large language model's blocks.

That is, the intermediate output of the large language model is treated as implicit linguistic prior knowledge, guiding the language expert to generate unique parameters for each input.

The language expert can exploit similarities between samples across datasets while avoiding potential interference between samples within a dataset, thereby improving the flexibility and generalization of multimodal large models on downstream multimodal tasks.

In addition, the language expert can also serve as a parameter-efficient fine-tuning method for multimodal large models, achieving performance comparable to full fine-tuning.
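
A similarly hedged sketch of the language expert idea: the hidden states of an LLM block act as the implicit prior from which a hypernetwork generates per-input, low-rank adapter weights that modulate that block's output. If the LLM itself is kept frozen and only the expert is trained, this also behaves as a parameter-efficient alternative to full fine-tuning, as described above. The shapes, rank, and mean pooling are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LanguageExpert(nn.Module):
    """Input-conditioned adapter attached to a single LLM block."""
    def __init__(self, hidden_dim: int = 4096, rank: int = 16):
        super().__init__()
        self.hidden_dim, self.rank = hidden_dim, rank
        # Hypernetwork: maps pooled hidden states to low-rank adapter weights.
        self.hyper = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.GELU(),
            nn.Linear(256, 2 * hidden_dim * rank),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), the intermediate output of an LLM block
        ctx = hidden_states.mean(dim=1)                        # implicit linguistic prior
        w = self.hyper(ctx)
        w_down, w_up = w.split(self.hidden_dim * self.rank, dim=-1)
        w_down = w_down.view(-1, self.hidden_dim, self.rank)
        w_up = w_up.view(-1, self.rank, self.hidden_dim)
        delta = torch.bmm(torch.relu(torch.bmm(hidden_states, w_down)), w_up)
        return hidden_states + delta                           # residual, per-input modulation
```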

"We hope that the proposed HyperLLaVA can provide a more stable and flexible framework for multi-modal large model architectures, and promote the new boundaries of multi-modal multitasking capabilities." Zhang Wenqiao, a researcher at Zhejiang University's "Hundred Talents Program" who participated in the study, said.


Photo丨Zhang Wenqiao (Source: Zhang Wenqiao)

At present, the specific application of HyperLLaVA can be divided into the following two aspects.

First, in the general domain, the collaboration between HyperLLaVA's vision and language experts helps large models adapt to the subtle differences between multimodal inputs and, as a plug-and-play module, enhances the perception, cognition, and reasoning capabilities of existing general-purpose multimodal large models.

This further improves the performance of multimodal large models on general tasks such as mathematical reasoning, copywriting, and natural language translation.

Second, in vertical domains, HyperLLaVA's vision and language experts can absorb additional domain-specific visual and textual knowledge, compensating for the weak domain expertise of general-purpose large models. This lets data-driven and knowledge-driven approaches guide and reinforce each other, improving the expertise and credibility of multimodal large models during instruction fine-tuning in vertical domains.

For example, in finance, such a model could answer investors' questions and offer advice that helps them make sound investment decisions.

In law, it could help users with legal consultations and assist lawyers with legal affairs; in medicine, it could support doctors in diagnosis and treatment, reducing their workload.


Figure丨Compared to LLaVA, HyperLLaVA achieves superior performance in different multimodal large model benchmarks (Source: arXiv)

Recently, the related paper was published on the preprint platform arXiv under the title "HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models" [1].

Wenqiao Zhang from Zhejiang University, Tianwei Lin from ShanghaiTech University, and Jiang Liu from Chongqing University are the first authors, while Professors Yueting Zhuang and Juncheng Li from Zhejiang University, and Hao Jiang from Alibaba Group serve as corresponding authors.


Figure丨The related paper (Source: arXiv)

According to Zhang, the research began with a comprehensive evaluation of current multimodal large models.

"Although more and more research tends to adopt the Mixture of Experts (MoE) model, that is, to enhance the overall performance of the model by training specialized experts for different fields and drawing on ensemble learning strategies.

However, how to effectively match a specific corpus with the corresponding experts in the training process is still a tricky problem. He said.

In addition, as large model technology advances, a single static model shows clear limitations in handling multimodal, multi-task settings, and even mixture-of-experts models suffer from knowledge conflicts and forgetting between specific experts.

As a result, this fixed, static-parameter architecture may limit performance on different downstream tasks.

It was precisely the limitations of existing static multimodal large model architectures that sparked the group's interest in exploring dynamic strategies, laying the foundation for further research.

Then, during the conceptualization phase, the team closely followed the latest developments in the field and possible solutions, and conducted in-depth research on work related to multitask and multi-domain learning.

"Through extensive thinking and discussion based on the latest research results and literature, we came up with the initial concept of HyperLLaVA, a model that can use hyperparameter networks to dynamically generate visual and linguistic experts to adaptively adjust parameters." Zhang Wenqiao said.

After clarifying the research direction and methodology, the researchers began to work on the practical development and experimentation of HyperLLaVA.

They rigorously evaluated the initial prototype model, and then continuously optimized and iterated based on performance metrics and feedback.

It is understood that this iterative process was essential for pushing model performance to its limits and verifying the feasibility of practical application.

Subsequently, the improved model was tested in multiple benchmarks and real-world scenarios for extensive experimental validation to evaluate its performance and compare it with existing models.

In addition, they carried out a series of ablation experiments, explored the model's working principles in depth through comparative analysis, and documented in detail the research process, methodology, experimental results, and their interpretation.

Zhang Wenqiao said that during the research, when the team decided to use hypernetworks to enhance the performance of the vision and language experts, they first tried a large network structure, only to find that it made training of the multimodal large model uncontrollable and prevented the expected results from being achieved.

"According to our analysis, this is due to the scale of the generated network parameters that cannot be fitted with the training data." Zhang Wenqiao said.

Therefore, in many subsequent tests, they spent a lot of time and resources on debugging, but they could not achieve good results.

"We even abandoned the proposed plan for a while." Zhang Wenqiao said frankly.

However, in a fortuitous test, the team found that the model showed unexpected performance advantages and training stability in smaller dimensions.

This led them to combine a down- and up-sampling network structure to further control the scale of the generated network parameters, which ultimately improved the controllability and generalization of network training.
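
The sketch below illustrates one way such a down-/up-sampling bottleneck can cap the number of generated parameters: fixed, normally trained projections wrap a small generated core, so the hypernetwork only has to emit a bottleneck-by-bottleneck block per input instead of a full weight matrix. The specific dimensions are assumptions made for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BottleneckedDynamicLayer(nn.Module):
    """Dynamic layer whose generated core is kept small via fixed down-/up-sampling."""
    def __init__(self, feat_dim: int = 4096, context_dim: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.b = bottleneck
        self.down = nn.Linear(feat_dim, bottleneck)      # static, trained as usual
        self.up = nn.Linear(bottleneck, feat_dim)        # static, trained as usual
        # The hypernetwork only generates a bottleneck x bottleneck core
        # (64 * 64 = 4,096 values) instead of a feat_dim x feat_dim matrix (~16.8M values).
        self.hyper = nn.Linear(context_dim, bottleneck * bottleneck)

    def forward(self, feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim); context: (batch, context_dim)
        core = self.hyper(context).view(-1, self.b, self.b)   # per-input generated core
        hidden = torch.relu(self.down(feats))                  # (batch, seq_len, bottleneck)
        hidden = torch.bmm(hidden, core)                        # input-conditioned mixing
        return feats + self.up(hidden)                          # residual output
```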

In addition, the researchers observed that hypernetworks, as a dynamic adjustment mechanism, are to some extent similar to meta-learning.

This not only enhances the model's ability to transfer across domains, but also allows the model to exploit this cross-domain potential to adjust itself on the fly during training.

On the basis of this research, the research group will continue to pay attention to the latest technological advances in large models, explore how to further improve HyperLLaVA, and open up new powerful paradigms in the field of multimodal large models.

For example, at the architecture level, they plan to combine mixture-of-experts (MoE) techniques to train both general and task-specific vision/language experts, further improving the generalization of multimodal large models on downstream tasks through the collaboration and fusion of the two.

At the model scale level, they plan to collect larger-scale multimodal training instructions and train on larger base language models (such as 34B and 130B) to build more powerful general-purpose multimodal large models.

In terms of application demonstration, the work has seen preliminary deployment in the medical field, where multimodal instruction data built from medical images, medical knowledge graphs, and medical consultation databases enables fine-grained medical image analysis, basic consultation, diagnostic report generation, and other functions.

Resources:

1. Zhang, W., Lin, T., Liu, J., et al. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models. arXiv:2403.13447. https://doi.org/10.48550/arXiv.2403.13447

Operation/Typesetting: He Chenlong
