
Scientists propose a novel tuning scheme to enhance the performance of multimodal large models on downstream multimodal tasks

Author: DeepTech

The advent of large language models, represented by ChatGPT, marks a new milestone in the field of AI.

At the same time, the development of multimodal large models that can process text, image, audio, and video data has given large language models "eyes" and "ears," turning them into comprehensive agents with multiple perceptual capabilities and strong knowledge understanding.

Because of their strong generalization and transferability, which enhance the multimodal understanding and generation capabilities of large models, multimodal large models have become a new track in AI development.

It is understood that current multimodal large models such as LLaVA typically follow a two-stage training paradigm.

The first stage is vision-language alignment: visual features are mapped into the language model's word embedding space through a static projector, allowing the large language model to understand visual content.
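
For readers unfamiliar with this setup, below is a minimal sketch (in PyTorch) of what stage one looks like: a static projector maps vision-encoder features into the LLM's word-embedding space so that the projected visual tokens can be concatenated with text embeddings. The dimensions and the two-layer MLP are common choices (e.g., in LLaVA-1.5) and are shown here as assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                 # e.g., CLIP ViT-L features, 7B-class LLM width

static_projector = nn.Sequential(                # fixed parameters, shared by all inputs
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

visual_feats = torch.randn(2, 576, vision_dim)   # (batch, image patches, vision_dim)
text_embeds = torch.randn(2, 32, llm_dim)        # embedded prompt tokens

visual_tokens = static_projector(visual_feats)   # project into the word-embedding space
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # sequence fed to the LLM
```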

The second stage is multimodal instruction fine-tuning: the large language model is fine-tuned on a constructed visual-language instruction set so that it can better respond to diverse user requests involving visual content.

Although these two stages are crucial, there has been relatively little research on the projector's structure and on tuning strategies for the large language model.

Existing methods still adopt multimodal large model architectures with static parameters, and this mode of sharing parameters across tasks is limiting when handling diverse multimodal tasks.

To overcome this limitation, a team of researchers from Zhejiang University, ShanghaiTech University, Chongqing University, Alibaba Group, and Harbin Institute of Technology proposed HyperLLaVA.

They use hypernetworks (HyperNetworks) together with adapters to build dynamic expert modules that adaptively generate parameters from the perceptual input. By integrating a static multimodal model architecture with these dynamically adjusted expert modules, the method achieves adaptive visual-text projection in the first stage and dynamic adjustment of the large language model's parameters in the second stage, effectively improving the generalization of multimodal large models across different downstream multimodal tasks.

Specifically:

First, in the vision-language alignment stage, the projector is decomposed into a static layer and a dynamic layer.

The parameters of the static layer remain fixed, while the parameters of the dynamic layer are generated on the fly from the input's visual features. The dynamic layer assists the static projector in input-aware adaptive feature modeling, flexibly converting visual features into text tokens and achieving fine-grained alignment of the visual-language semantic space.
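
As a rough illustration of this static-plus-dynamic design, the sketch below adds an input-conditioned layer on top of a static projector: a small hypernetwork turns the pooled visual features into low-rank weights that refine the projected tokens. The pooling, rank, and dimensions are assumptions made for the sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdaptiveProjector(nn.Module):
    """Static MLP projector plus a dynamic layer whose weights are generated per input."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, rank: int = 16):
        super().__init__()
        self.llm_dim, self.rank = llm_dim, rank
        # Static part: fixed parameters shared by all inputs (as in the original LLaVA projector).
        self.static = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Dynamic part: a hypernetwork emits low-rank adapter weights from pooled visual features.
        self.hyper = nn.Linear(vision_dim, 2 * llm_dim * rank)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        tokens = self.static(visual_feats)                    # static projection to LLM space
        ctx = visual_feats.mean(dim=1)                        # per-input conditioning vector
        w = self.hyper(ctx)
        w_down, w_up = w.split(self.llm_dim * self.rank, dim=-1)
        w_down = w_down.view(-1, self.llm_dim, self.rank)     # (batch, llm_dim, rank)
        w_up = w_up.view(-1, self.rank, self.llm_dim)         # (batch, rank, llm_dim)
        # Input-adaptive residual refinement of the projected visual tokens.
        return tokens + torch.bmm(torch.relu(torch.bmm(tokens, w_down)), w_up)
```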

Second, in the multimodal instruction fine-tuning stage, the large language model is equipped with a language expert that models dynamic parameters for the large language model's blocks.

That is, the intermediate output of the large language model is treated as implicit linguistic prior knowledge, guiding the language expert to generate unique parameters for each input.

The language expert can exploit similarities between samples across datasets while avoiding potential interference between samples within a dataset, thereby improving the flexibility and generalization of multimodal large models on downstream multimodal tasks.

In addition, the language expert can also serve as a parameter-efficient fine-tuning method for multimodal large models, achieving performance comparable to full fine-tuning.
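
A similarly hedged sketch of the language expert idea: the hidden states of an LLM block act as the implicit prior from which a hypernetwork generates per-input, low-rank adapter weights that modulate that block's output. If the LLM itself is kept frozen and only the expert is trained, this also behaves as a parameter-efficient alternative to full fine-tuning, as described above. The shapes, rank, and mean pooling are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LanguageExpert(nn.Module):
    """Input-conditioned adapter attached to a single LLM block."""
    def __init__(self, hidden_dim: int = 4096, rank: int = 16):
        super().__init__()
        self.hidden_dim, self.rank = hidden_dim, rank
        # Hypernetwork: maps pooled hidden states to low-rank adapter weights.
        self.hyper = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.GELU(),
            nn.Linear(256, 2 * hidden_dim * rank),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), the intermediate output of an LLM block
        ctx = hidden_states.mean(dim=1)                        # implicit linguistic prior
        w = self.hyper(ctx)
        w_down, w_up = w.split(self.hidden_dim * self.rank, dim=-1)
        w_down = w_down.view(-1, self.hidden_dim, self.rank)
        w_up = w_up.view(-1, self.rank, self.hidden_dim)
        delta = torch.bmm(torch.relu(torch.bmm(hidden_states, w_down)), w_up)
        return hidden_states + delta                           # residual, per-input modulation
```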

"We hope that the proposed HyperLLaVA can provide a more stable and flexible framework for multi-modal large model architectures, and promote the new boundaries of multi-modal multitasking capabilities." Zhang Wenqiao, a researcher at Zhejiang University's "Hundred Talents Program" who participated in the study, said.


Photo丨Zhang Wenqiao (Source: Zhang Wenqiao)

At present, the specific application of HyperLLaVA can be divided into the following two aspects.

First, in the general domain, the collaboration between HyperLLaVA's vision and language experts helps large models adapt to the subtle differences between multimodal inputs and, as a plug-and-play module, enhances the perception, cognition, and reasoning capabilities of existing general-purpose multimodal large models.

This further improves the performance of multimodal large models on general tasks such as mathematical reasoning, copywriting, and natural language translation.

Second, in vertical domains, HyperLLaVA's vision and language experts can absorb additional domain-specific visual and textual knowledge, compensating for the weak domain expertise of general-purpose large models. This lets data-driven and knowledge-driven approaches guide and reinforce each other, improving the expertise and credibility of multimodal large models during instruction fine-tuning in vertical domains.

For example, in finance, such a model could answer investors' questions and offer advice that helps them make sound investment decisions.

In law, it could help users with legal consultations and assist lawyers with legal affairs; in medicine, it could support doctors in diagnosis and treatment, reducing their workload.


Figure丨Compared to LLaVA, HyperLLaVA achieves superior performance in different multimodal large model benchmarks (Source: arXiv)

Recently, the related paper was published on the preprint platform arXiv under the title "HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models" [1].

Wenqiao Zhang from Zhejiang University, Tianwei Lin from ShanghaiTech University, and Jiang Liu from Chongqing University are the first authors, while Professors Yueting Zhuang and Juncheng Li from Zhejiang University, and Hao Jiang from Alibaba Group serve as corresponding authors.


Figure丨The related paper (Source: arXiv)

According to Zhang, the research began with a comprehensive evaluation of current multimodal large models.

"Although more and more research tends to adopt the Mixture of Experts (MoE) model, that is, to enhance the overall performance of the model by training specialized experts for different fields and drawing on ensemble learning strategies.

However, how to effectively match a specific corpus with the corresponding experts in the training process is still a tricky problem. He said.

In addition, as large model technology advances, a single static model shows clear limitations in handling multimodal, multi-task settings, and even mixture-of-experts models suffer from knowledge conflicts and forgetting between specific experts.

As a result, this fixed, static-parameter architecture may limit performance on different downstream tasks.

It was precisely the limitations of existing static multimodal large model architectures that sparked the group's interest in exploring dynamic strategies, laying the foundation for further research.

Then, during the conceptualization phase, the team closely followed the latest developments in the field and possible solutions, and conducted in-depth research on work related to multitask and multi-domain learning.

"Through extensive thinking and discussion based on the latest research results and literature, we came up with the initial concept of HyperLLaVA, a model that can use hyperparameter networks to dynamically generate visual and linguistic experts to adaptively adjust parameters." Zhang Wenqiao said.

After clarifying the research direction and methodology, the researchers began to work on the practical development and experimentation of HyperLLaVA.

They rigorously evaluated the initial prototype model, and then continuously optimized and iterated based on performance metrics and feedback.

It is understood that this iterative process was essential for pushing model performance to its limits and verifying the feasibility of practical application.

Subsequently, the improved model was tested in multiple benchmarks and real-world scenarios for extensive experimental validation to evaluate its performance and compare it with existing models.

In addition, they carried out a series of ablation experiments, explored the model's working principles in depth through comparative analysis, and documented in detail the research process, methodology, experimental results, and their interpretation.

Zhang Wenqiao said that during the research, when the team decided to use hypernetworks to enhance the performance of the vision and language experts, they first tried a large network structure, only to find that it made training of the multimodal large model uncontrollable and prevented the expected results from being achieved.

"According to our analysis, this is due to the scale of the generated network parameters that cannot be fitted with the training data." Zhang Wenqiao said.

Therefore, in many subsequent tests, they spent a lot of time and resources on debugging, but they could not achieve good results.

"We even abandoned the proposed plan for a while." Zhang Wenqiao said frankly.

However, in a fortuitous test, the team found that the model showed unexpected performance advantages and training stability in smaller dimensions.

This led them to combine a down- and up-sampling network structure to further control the scale of the generated network parameters, which ultimately improved the controllability and generalization of network training.
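
The sketch below illustrates one way such a down-/up-sampling bottleneck can cap the number of generated parameters: fixed, normally trained projections wrap a small generated core, so the hypernetwork only has to emit a bottleneck-by-bottleneck block per input instead of a full weight matrix. The specific dimensions are assumptions made for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BottleneckedDynamicLayer(nn.Module):
    """Dynamic layer whose generated core is kept small via fixed down-/up-sampling."""
    def __init__(self, feat_dim: int = 4096, context_dim: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.b = bottleneck
        self.down = nn.Linear(feat_dim, bottleneck)      # static, trained as usual
        self.up = nn.Linear(bottleneck, feat_dim)        # static, trained as usual
        # The hypernetwork only generates a bottleneck x bottleneck core
        # (64 * 64 = 4,096 values) instead of a feat_dim x feat_dim matrix (~16.8M values).
        self.hyper = nn.Linear(context_dim, bottleneck * bottleneck)

    def forward(self, feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim); context: (batch, context_dim)
        core = self.hyper(context).view(-1, self.b, self.b)   # per-input generated core
        hidden = torch.relu(self.down(feats))                  # (batch, seq_len, bottleneck)
        hidden = torch.bmm(hidden, core)                        # input-conditioned mixing
        return feats + self.up(hidden)                          # residual output
```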

In addition, the researchers observed that hypernetworks, as a dynamic adjustment mechanism, are to some extent similar to meta-learning.

This not only enhances the model's ability to transfer across domains, but also allows the model to exploit this cross-domain potential to adjust itself on the fly during training.

On the basis of this research, the research group will continue to pay attention to the latest technological advances in large models, explore how to further improve HyperLLaVA, and open up new powerful paradigms in the field of multimodal large models.

For example, at the architecture level, they plan to combine mixture-of-experts (MoE) techniques to train both general and task-specific vision/language experts, further improving the generalization of multimodal large models on downstream tasks through the collaboration and fusion of the two.

At the model scale level, they plan to collect larger-scale multimodal training instructions and train on larger base language models (such as 34B and 130B) to build more powerful general-purpose multimodal large models.

In terms of application demonstration, the work has seen preliminary deployment in the medical field, where multimodal instruction data built from medical images, medical knowledge graphs, and medical consultation databases enables fine-grained medical image analysis, basic consultation, diagnostic report generation, and other functions.

Resources:

1. Zhang, W., Lin, T., Liu, J., et al. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models. arXiv:2403.13447. https://doi.org/10.48550/arXiv.2403.13447

Operation/Typesetting: He Chenlong
