Editor's note: Welcome to the "New in Research" column! This column brings together the latest innovations and research trends from Microsoft Research Asia. Here, you can quickly browse the institute's highlights, stay attuned to cutting-edge fields, and find advanced, practical open-source tools.
「 Quick Facts of this Issue 」
01 Med-VTAB: A Large-Scale Benchmark for Medical Visual Task Adaptation
02 Aligning Visual Models with Human Aesthetics: Algorithms and Evaluation
03 GLC: An ultra-low-bitrate image codec based on generative latent coding
04 MH-MoE: Multi-Head Mixture of Experts
1️⃣ Med-VTAB: A Large-Scale Benchmark for Medical Visual Task Adaptation
Paper link: https://arxiv.org/abs/2404.12876
In recent years, advances in deep learning have given a substantial boost to computer vision, especially with the introduction of Vision Transformers (ViTs). After pre-training on large-scale datasets, these models demonstrate excellent performance on a variety of vision tasks. By introducing specialized learnable layers or tokens, ViTs can be adapted to specific downstream tasks (a process known as visual task adaptation), opening new avenues for optimizing task-specific models. This adaptability allows pre-trained models to be fine-tuned to the nuances of a particular task, improving their performance and applicability.
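To make the idea concrete, below is a minimal PyTorch sketch of adapting a frozen transformer backbone with a few learnable prompt tokens and a task-specific head. The class, default sizes, and pooling choice are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class PromptTunedViT(nn.Module):
    """Adapt a frozen transformer encoder to a new task with a few learnable
    prompt tokens and a task head (hypothetical names, not the paper's code)."""

    def __init__(self, embed_dim=768, depth=12, num_prompts=8, num_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.encoder.parameters():          # backbone stays frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.head = nn.Linear(embed_dim, num_classes)  # task-specific head

    def forward(self, patch_tokens):                 # patch_tokens: (B, N, D)
        prompts = self.prompts.expand(patch_tokens.size(0), -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)
        feats = self.encoder(tokens)
        return self.head(feats.mean(dim=1))          # pool and classify

# Only the prompts and head are trained; the pre-trained backbone is untouched.
x = torch.randn(2, 196, 768)                         # e.g. 14x14 patch embeddings
logits = PromptTunedViT()(x)
```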
Despite these advances, visual task adaptation remains underexplored in the medical domain, especially across medical imaging modalities such as color images, X-rays, and CT scans. Medical imaging poses unique challenges, including heterogeneous data, the need for high accuracy, and the requirement that models generalize across different organs and diseases. In addition, the potential of visual task adaptation to systematically leverage existing knowledge from both medical and non-medical domains at scale has not been fully studied.
To fill this gap, the researchers introduced Med-VTAB, a comprehensive benchmark for visual task adaptation, in the hope of facilitating the exploration and evaluation of adaptation techniques in medical imaging. Med-VTAB covers 1.68 million medical images spanning 10 vital organs and 5 imaging modalities drawn from challenging real-world medical scenarios, making it one of the broadest benchmarks of its kind. The benchmark is intended to probe the effectiveness of visual task adaptation strategies and to study the scaling laws of medical image adaptation.
Figure 1: Overview of Med-VTAB, a large-scale benchmark for medical visual task adaptation
The researchers then investigated the relationship between the number of tunable parameters and model performance in medical prompt tuning, as well as how well adaptation generalizes from medical versus non-medical pre-training weights. They also examined the impact of patient ID distribution shifts on the performance of adapted models, an aspect of robustness to new patient data that matters in medical applications.
In addition to these explorations, the researchers proposed a new adaptation technique, the Gated Mixture-of-Experts Adapter (GMoE-Adapter). It leverages insights from both medical and general vision pre-training to achieve state-of-the-art performance on medical visual task adaptation, and it demonstrates the potential of hybrid strategies that combine domain-specific knowledge with broad, general-purpose learning from multiple sources.
Figure 2: The Gated Mixture-of-Experts Adapter (GMoE-Adapter) framework compared with the standard adapter and the MoE-Adapter
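As a rough illustration of the gated mixture-of-adapter-experts idea, the following PyTorch sketch mixes the outputs of several bottleneck adapters with a learned token-wise gate. The module names, expert count, and bottleneck size are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter typically inserted into a frozen block."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()

    def forward(self, x):
        return self.up(self.act(self.down(x)))

class GMoEAdapter(nn.Module):
    """Sketch of a gated mixture of adapter experts: a small gate mixes the
    outputs of several adapters (e.g. medical- vs. general-domain experts)."""
    def __init__(self, dim=768, num_experts=2, bottleneck=64):
        super().__init__()
        self.experts = nn.ModuleList([Adapter(dim, bottleneck) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                                   # x: (B, N, D) token features
        weights = torch.softmax(self.gate(x), dim=-1)       # (B, N, E) per-token gate
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, D, E)
        mixed = (outs * weights.unsqueeze(2)).sum(dim=-1)   # gate-weighted expert mix
        return x + mixed                                    # residual, as in adapters

y = GMoEAdapter()(torch.randn(2, 16, 768))
```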
Through the Med-VTAB benchmark and its study of adaptation strategies and scaling laws, this work sets a new standard for research on medical visual task adaptation. By emphasizing the importance of customized adaptation techniques and exploring novel adaptation methods, the researchers hope to improve diagnostic accuracy and patient outcomes.
2️⃣ Aligning Visual Models with Human Aesthetics: Algorithms and Evaluation
Paper link: https://arxiv.org/abs/2406.09397
Existing large vision-language models must be pre-trained on web-scale data. Such data is of uneven quality, and the resulting models often suffer from value-alignment problems. In text-to-image retrieval, these problems can appear as low aesthetic quality, failure to meet fine-grained requirements, and harmful biases. Because such issues are highly subjective, there has so far been no effective way to evaluate or improve them.
The researchers therefore chose one of the most subjective tasks, aesthetics, as a representative case for studying value alignment. Following existing definitions and studies, aesthetics can be divided into subjective aesthetic understanding (symbolism, culture, etc.) and objective visual appeal (color, resolution, saturation, etc.); other alignment tasks have a similar structure. The researchers found that human aesthetic understanding can be learned by large language models, since it is embedded in a vast body of literature and creative works, and further, that using a large language model to interpret and expand a user's query with aesthetic expectations can greatly improve aesthetic quality.
The researchers fairly evaluated large language models and aesthetic models under a variety of prompts, demonstrating both the effectiveness of large language models in supplying aesthetic understanding and the effectiveness and complementarity of the image priors contained in aesthetic models. To obtain an efficient end-to-end retrieval system, they then proposed a ranking-based reinforcement learning algorithm that fine-tunes the visual model while distilling knowledge from the large language model and the aesthetic model.
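The sketch below illustrates one way such a ranking-based distillation objective could look in PyTorch: the retrieval model's scores over candidate images are pushed toward the ranking implied by a teacher reward that blends an LLM judgment with an aesthetic-model score. The ListNet-style loss form, the 0.5/0.5 reward mix, and all names are illustrative assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def ranking_distillation_loss(student_scores, teacher_rewards, tau=1.0):
    """Push the retrieval model's score distribution over K candidates toward
    the ranking implied by a teacher reward (hypothetical, ListNet-style)."""
    teacher_probs = F.softmax(teacher_rewards / tau, dim=-1)   # soft target ranking
    student_logp = F.log_softmax(student_scores, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# Hypothetical usage: the reward blends an aesthetic score and an LLM preference.
student_scores = torch.randn(4, 16, requires_grad=True)   # retrieval model scores
aesthetic = torch.rand(4, 16)                              # aesthetic-model scores
llm_pref = torch.rand(4, 16)                               # LLM preference scores
loss = ranking_distillation_loss(student_scores, 0.5 * aesthetic + 0.5 * llm_pref)
loss.backward()
```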
For evaluation, the researchers first constructed an aesthetic preference dataset, HPIR, in which, given the subjectivity of aesthetics, each sample was labeled by 30 votes together with confidence levels. Using HPIR, they also validated the feasibility of GPT-4V as an aesthetic evaluator. The final experiments were verified jointly with HPIR, GPT-4V evaluation, and human evaluation: after aesthetic alignment and fine-tuning, the end-to-end retrieval model achieves results comparable to a multi-stage system that integrates a large language model and an aesthetic model, greatly simplifying high-quality retrieval systems and reducing maintenance cost and retrieval latency.
Figure 3: Distilling aesthetic understanding and visual priors from large language models and aesthetic models via reinforcement learning
3️⃣ GLC: An ultra-low-bitrate image codec based on generative latent coding
Paper link: https://openaccess.thecvf.com/content/CVPR2024/papers/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.pdf
At present, mainstream image codecs usually encode images directly in pixel space. However, pixel-level distortion metrics are not always consistent with human vision, especially in ultra-low-bitrate scenarios where coding distortion is severe. Achieving image coding that better matches human perception is therefore a key challenge.
Researchers at Microsoft Research Asia found that the features of a generative VQ-VAE, which offer higher subjective visual consistency, lower entropy, and greater robustness than raw pixels, are better suited to ultra-low-bitrate coding with high subjective quality. Based on this observation, they proposed GLC, a model that performs coding in the latent space of a generative VQ-VAE.
Figure 4: Visual quality of GLC compared with previous SOTA image codecs
Specifically, GLC first uses the VQ-VAE encoder to map images into generative latent features, then compresses these features with a transform-coding network, and finally reconstructs the decoded features into images with the VQ-VAE decoder. When coding an image, GLC does not use VQ quantization; instead it relies on a purpose-built transform-coding network for feature compression. This design not only improves GLC's compression ratio but also allows it to support variable-bitrate coding.
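A conceptual PyTorch sketch of this pipeline is shown below: a frozen VQ-VAE encoder maps pixels to generative latents, an analysis/synthesis transform pair compresses those latents, and the VQ-VAE decoder reconstructs the image. The placeholder convolutions, dimensions, and rounding-based quantization are assumptions; entropy coding and training details are omitted.

```python
import torch
import torch.nn as nn

class GLCSketch(nn.Module):
    """Conceptual sketch of coding in a generative VQ-VAE's latent space
    instead of pixel space. All module shapes here are placeholders."""
    def __init__(self, feat_dim=256, code_dim=64):
        super().__init__()
        # Stand-ins for the generative VQ-VAE encoder/decoder (frozen in practice).
        self.vq_encoder = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        self.vq_decoder = nn.ConvTranspose2d(feat_dim, 3, kernel_size=8, stride=8)
        # Transform coding applied to the latent features (no VQ at coding time).
        self.analysis = nn.Conv2d(feat_dim, code_dim, kernel_size=3, padding=1)
        self.synthesis = nn.Conv2d(code_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, img):
        feats = self.vq_encoder(img)       # pixels -> generative latent
        y = self.analysis(feats)           # analysis transform
        y_hat = torch.round(y)             # quantize; entropy coding omitted here
        feats_hat = self.synthesis(y_hat)  # synthesis transform
        return self.vq_decoder(feats_hat)  # latent -> reconstructed image

recon = GLCSketch()(torch.randn(1, 3, 256, 256))
```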
To further improve compression performance, GLC introduces a codebook-based prior for the side information in transform coding. Compared with a conventional factorized prior, this prior can encode stronger semantic information at a lower bit rate. During training, GLC also uses an auxiliary network that predicts the VQ indices of the original image from the decoded features, improving the semantic consistency between the decoded features and the original image.
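The auxiliary code-prediction idea can be sketched as a small classification head over the codebook, trained with cross-entropy against the VQ indices of the original image. The shapes and names below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: from the decoded latent features, predict the VQ codebook index that
# the original image's latent would map to, and penalize mismatches.
codebook_size, feat_dim = 1024, 256
index_head = nn.Conv2d(feat_dim, codebook_size, kernel_size=1)   # auxiliary predictor

decoded_feats = torch.randn(1, feat_dim, 32, 32)                 # features after decoding
target_indices = torch.randint(0, codebook_size, (1, 32, 32))    # VQ indices of original

logits = index_head(decoded_feats)                  # (1, codebook_size, 32, 32)
aux_loss = F.cross_entropy(logits, target_indices)  # encourages semantic consistency
```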
Experimental results show that GLC achieves the best compression performance under multiple test protocols, delivering high-quality image compression at an average of about 0.03 bits per pixel. Compared with MS-ILLM, a SOTA codec that compresses in pixel space, GLC saves more than 45% of the bits at the same FID. In addition, by exploiting its latent space, GLC can perform tasks such as image restoration and style transfer while compressing images.
4️⃣ MH-MoE: Multi-Head Mixture of Experts
Paper link: https://arxiv.org/abs/2404.15045
A reliable way to further improve the performance of high-capacity models such as large language models (LLMs) and large multimodal models (LMMs) is to scale them up by increasing the number of parameters. However, the sheer size of these models significantly slows inference and limits their practicality. In this context, the Sparse Mixture-of-Experts (SMoE) approach was proposed; it reduces computational cost and improves model scalability, but it still suffers from low expert activation and a limited ability to capture fine-grained semantics.
To alleviate these problems, researchers at Microsoft Research Asia proposed an efficient variant called the Multi-Head Mixture of Experts (MH-MoE). MH-MoE uses a multi-head mechanism to split each input token into multiple sub-tokens, assigns the sub-tokens to different expert networks for parallel processing, and then seamlessly reintegrates them into the original token form.
Figure 5: MH-MoE workflow on vision and language data
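Below is a simplified PyTorch sketch of the multi-head routing idea: each token is projected, split into sub-tokens, each sub-token is routed to one expert, and the expert outputs are merged back into a full token. Top-1 routing, the loop-based dispatch, and all hyperparameters are simplifications chosen for clarity, not the paper's actual layer.

```python
import torch
import torch.nn as nn

class MHMoESketch(nn.Module):
    """Sketch of a multi-head mixture-of-experts layer: split tokens into
    sub-tokens, route each sub-token to an expert, then merge back."""
    def __init__(self, dim=768, num_heads=4, num_experts=8):
        super().__init__()
        self.num_heads, self.sub_dim = num_heads, dim // num_heads
        self.split = nn.Linear(dim, dim)              # multi-head projection
        self.merge = nn.Linear(dim, dim)              # fuse sub-tokens back
        self.gate = nn.Linear(self.sub_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.sub_dim, 4 * self.sub_dim), nn.GELU(),
                          nn.Linear(4 * self.sub_dim, self.sub_dim))
            for _ in range(num_experts)])

    def forward(self, x):                             # x: (B, N, D)
        B, N, D = x.shape
        sub = self.split(x).view(B * N * self.num_heads, self.sub_dim)
        expert_idx = self.gate(sub).argmax(dim=-1)    # top-1 routing per sub-token
        out = torch.zeros_like(sub)
        for i, expert in enumerate(self.experts):     # dense per-expert dispatch
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(sub[mask])
        return self.merge(out.view(B, N, D))          # reassemble original tokens

y = MHMoESketch()(torch.randn(2, 16, 768))
```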
MH-MoE has the following advantages:
- Higher expert activation. As shown in Figure 6, SMoE leaves many experts unactivated (dark areas), while MH-MoE significantly increases expert usage, reaching 90.71% activation.
Figure 6: Sparse mixture-of-experts layer (left) and multi-head mixture-of-experts layer (right)
- Finer-grained understanding. As shown in Figure 7, MH-MoE assigns sub-tokens to a more diverse set of experts (bright areas), letting different experts focus on information from different representation spaces and ultimately enabling better fine-grained understanding.
Figure 7: MH-MoE's assignment of sub-tokens; bright areas are sub-tokens assigned to different experts, dark areas are sub-tokens assigned to the same expert
In addition, MH-MoE is simple to implement, is decoupled from other SMoE optimization methods, and can easily be integrated into other SMoE models to improve their performance.
The researchers also conducted extensive experiments on three pre-training tasks and their downstream tasks. The results show that the proposed method not only significantly improves the performance of mixture-of-experts networks on both upstream pre-training and downstream tasks, but also largely alleviates the problem of low expert activation, making the model more efficient.
For 25 years, Microsoft Research Asia has focused on scientific research and built up a wealth of cutting-edge technology.