
CVPR'24 Open Source!

Author: 3D Vision Workshop

Source: 3D Vision Workshop



Paper title: Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

Authors: Chaoqin Huang, Aofan Jiang, et al.

Affiliation: Shanghai Jiao Tong University

Paper link: https://arxiv.org/pdf/2403.12570.pdf

Code link: https://github.com/MediaBrain-SJTU/MVFA-AD

This paper introduces a new approach to anomaly detection in medical images that builds on a recently developed large-scale vision-language pre-trained model. Multiple residual adapters are integrated into the pre-trained visual encoder, and a multi-level, pixel-wise visual-language feature-alignment loss progressively refines visual features at different levels. This multi-level adaptation gives the model stronger generalization on medical images, handling unseen medical modalities and anatomical regions even in the zero-shot scenario. Experiments show that the proposed method significantly outperforms the current state of the art in medical anomaly detection, improving mean anomaly-classification AUC by 6.24% and 7.33% and anomaly-segmentation AUC by 2.03% and 2.37% in the zero-shot and few-shot settings, respectively.

This paper migrates a vision-language model from the natural-image domain to the medical-image domain for anomaly detection. After adapter training on medical datasets, the model performs accurate anomaly detection on unseen medical images. A multi-level feature-adaptation method transforms natural-image features into medical-image features, and pixel-level anomaly segmentation is realized through feature alignment. Experimental results show that the proposed method performs well on zero-/few-shot anomaly classification (AC) and anomaly segmentation (AS) tasks, demonstrating its application potential in medical image analysis.

The method refines CLIP's features for anomaly detection in the medical context by integrating multiple residual adapters into the pre-trained vision encoder and applying a multi-level, pixel-wise visual-language feature-alignment loss. On medical anomaly-detection benchmarks, the method is significantly superior to current state-of-the-art methods in the zero-shot and few-shot cases, improving anomaly-classification AUC by 6.24% and 7.33% and anomaly-segmentation AUC by 2.03% and 2.37%, respectively. The main contributions are a novel multi-level feature-adaptation framework and a demonstration of its ability to generalize anomaly detection in medical images.

Contributions of this paper:

  • A novel multi-level feature-adaptation framework is proposed, which, to the best of the authors' knowledge, is the first attempt to adapt a pre-trained vision-language model to medical AD in the zero-/few-shot setting.
  • Extensive experiments on a challenging benchmark for AD in medical images demonstrate the method's ability to generalize anomaly detection across different data modalities and anatomical regions.

This section describes how to adapt a vision-language model originally trained on natural images so that it performs anomaly detection in medical images. Leveraging annotated medical training datasets, the method transfers the model from the natural-image domain to one suited to medical images, and the model's ability to generalize to unseen cases is evaluated under zero-shot and few-shot protocols. Finally, a multi-level adaptation-and-comparison framework is proposed for anomaly detection in medical images.


This section describes the training procedure: a multi-level feature-adaptation framework that tunes a pre-trained natural-image vision-language model using minimal data and lightweight multi-level feature adapters. Adaptation is achieved at multiple feature levels by attaching learnable bottleneck linear layers to the visual branch of CLIP while keeping the original backbone unchanged. Specifically, the method consists of three feature adapters and one feature projector; the learned adapters are applied at different levels and guided by a multi-level, pixel-wise visual-language feature-alignment loss, steering the model toward identifying anomalies in medical images.
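To make the adapter idea concrete, here is a minimal, framework-agnostic NumPy sketch of a residual bottleneck adapter attached to frozen backbone features; the token count, feature dimension, bottleneck width, and ReLU activation are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_adapter(feats, w_down, w_up):
    """Residual bottleneck adapter on frozen backbone features:
    down-project, apply ReLU, up-project, then add the result back
    to the input, so the original CLIP features pass through
    unchanged plus a learned correction."""
    hidden = np.maximum(feats @ w_down, 0.0)  # (N, bottleneck), ReLU
    return feats + hidden @ w_up              # (N, D), residual add

# Hypothetical shapes: 196 patch tokens with 768-dim features,
# adapted through a 64-dim bottleneck.
feats = rng.standard_normal((196, 768))
w_down = rng.standard_normal((768, 64)) * 0.02
w_up = rng.standard_normal((64, 768)) * 0.02
adapted = bottleneck_adapter(feats, w_down, w_up)
```

Because the adapter is residual and the backbone stays frozen, only the small down- and up-projection matrices need to be learned at each feature level.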

In the test phase, to predict both image-level (AC) and pixel-level (AS) anomalies, the method adopts a two-branch multi-level feature-comparison architecture with a zero-shot branch and a few-shot branch. In the zero-shot branch, the test image is processed by MVFA to generate multi-level adapted features, which are compared against the text features; the few-shot branch additionally compares them with the multi-level features of the few available normal reference images.
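The zero-shot comparison step can be sketched as a cosine-similarity softmax over a normal/abnormal text-prompt pair; the two-prompt setup and the temperature value here are assumptions for illustration, not the paper's exact prompts:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def anomaly_scores(patch_feats, text_feats, temperature=0.07):
    """Compare L2-normalized patch features with text embeddings for
    a ['normal', 'abnormal'] prompt pair; the softmax over the two
    similarities gives a per-patch anomaly probability, which can be
    reshaped into a pixel-level anomaly map."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    logits = p @ t.T / temperature   # (N, 2) similarity logits
    return softmax(logits)[:, 1]     # (N,) probability of "abnormal"
```

Max-pooling or averaging the per-patch scores yields an image-level (AC) score, while the per-patch map itself serves the segmentation (AS) task.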


The experiments evaluate how well the method predicts anomalies at the image level and pixel level through multi-level feature adaptation and comparison. They cover datasets, competing methods and baselines, evaluation protocols, model configuration and training details, comparison with existing methods, and ablation studies.

  • Datasets: The BMAD medical anomaly-detection benchmark is used, covering five medical domains across six datasets: brain MRI, liver CT, retinal OCT, chest X-ray, and digital histopathology.
  • Competing methods and baselines: A variety of state-of-the-art AD methods are considered, including methods trained on all normal data, methods using only a few normal samples, and few-shot methods.
  • Evaluation protocol: AC and AS are assessed separately, using the area under the ROC curve (AUC) to measure performance.
  • Model configuration and training details: CLIP with a ViT-L/14 architecture is trained on 240-resolution input images using the Adam optimizer, with a learning rate of 1e-3 and a batch size of 16, for 50 epochs on an NVIDIA GeForce RTX 3090 GPU.
  • Comparison with existing methods: In the few-shot case, MVFA outperforms methods such as DRA, BGAD, and April-GAN, especially in AC, surpassing April-GAN, the winner of the VAND workshop challenge at CVPR 2023.
  • Ablation studies: Ablations of feature adaptation and feature alignment show that feature adaptation is essential for cross-modal generalization.
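For reference, the AUC metric used in the evaluation protocol can be computed from score ranks via the Mann-Whitney U statistic; this is a sketch in which ties receive ordinal rather than averaged ranks:

```python
import numpy as np

def auroc(labels, scores):
    """Area under the ROC curve: the probability that a randomly
    chosen anomalous sample scores higher than a randomly chosen
    normal one, computed from the rank sum of the positives."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort() + 1.0   # 1-based ordinal ranks
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# A perfectly separating scorer reaches 1.0; chance level is 0.5.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The same routine applies to both tasks: image-level labels and scores for AC, and flattened pixel-level masks and anomaly maps for AS.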

In summary, this article describes a method for applying pre-trained natural-domain vision-language models to medical anomaly detection. Through cross-domain generalization, the method applies to different medical imaging modalities and anatomical regions. A multi-level feature-adaptation scheme guides each adaptation step through visual-language alignment, realizing the transition from high-level semantics to pixel-level segmentation. Combined with a comparison-based anomaly-detection strategy, the method flexibly adapts to datasets with substantial modality and distribution differences. Experimental results show strong performance on zero-/few-shot AC and AS tasks, demonstrating the method's value for future research.


This article is for academic sharing only; in case of any infringement, please contact us to delete it.

3D Vision Workshop Exchange Group

We have established multiple communities in the 3D-vision field, covering 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: Image Classification/Segmentation, Object Detection, Medical Imaging, GAN, OCR, 2D Defect Detection, Remote Sensing Mapping, Super-Resolution, Face Detection, Behavior Recognition, Model Quantization/Pruning, Transfer Learning, Human Pose Estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, target tracking, a general autonomous-driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant: dddvision, with a note in the form research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to a group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grabbing, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGo-LOAM, Multimodal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, linear and surface structured light, hardware structured light scanners, drones, etc.
