
CVPR 2024|FairCLIP: The first multimodal medical visual language model fairness study


Author | Harvard-NYU team

Editor | ScienceAI

Fairness is a key concern in deep learning, especially in the medical field, where these models influence diagnosis and treatment decisions. Although fairness has been studied in the vision-only domain, the fairness of medical vision-language (VL) models remains unexplored due to the lack of medical VL datasets suitable for studying it.

To fill this research gap, we introduce the first fair vision-language medical dataset (FairVLMed), which provides detailed demographic attributes, ground-truth labels, and clinical notes for an in-depth examination of fairness in VL foundation models.

Using FairVLMed, we conduct a comprehensive fairness analysis of two widely used VL models (CLIP and BLIP2), pre-trained on both natural and medical image domains, across four different protected attributes.

Our results highlight significant biases across all VL models, with Asians, males, non-Hispanics, and Spanish speakers being the preferred groups for the protected attributes of race, gender, ethnicity, and language, respectively. To mitigate these biases, we propose FairCLIP, an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distribution of each demographic group.

As the first VL dataset for studying fairness, FairVLMed has the potential to foster machine learning models that are both ethically sound and clinically effective.

Here we share a CVPR 2024 paper from a research team at Harvard University and New York University: "FairCLIP: Harnessing Fairness in Vision-and-Language Learning".

In this work, we present a groundbreaking study on the fairness of large multimodal vision-language models. We collect the first large vision-language medical dataset with demographic identity attributes for studying fairness, and propose FairCLIP, a vision-language pre-training method that aims to improve fairness across groups (i.e., to bring the accuracy of different groups closer together).


Paper address: https://arxiv.org/pdf/2403.19949.pdf

Code Address: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP

Dataset website: https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k/

Dataset download link: https://drive.google.com/drive/u/1/folders/1bkeifigwOAfnsLvup9mJOSNeA3WsvA2l

Harvard-Ophthalmology-AI-Lab is committed to providing high-quality fairness datasets and will continue to release more.

Lab's Dataset Homepage: https://ophai.hms.harvard.edu/datasets/

Background:

In recent years, fairness has received increasing attention in the field of deep learning. It is especially important in the medical field, where deep learning models influence diagnosis and treatment decisions. Bias in these models related to factors such as race, gender, or socioeconomic status can lead to disparities in health care and adverse patient outcomes.

Therefore, ensuring that these models are unbiased is not only an ethical and legal imperative, but also a necessity for patient safety and medical equity. This makes fairness in medical computer vision a critical and urgent issue for the delivery of equitable health care.


Previous studies have identified biases in deep learning-based medical image models, with a focus on chest X-ray diagnosis. Unlike these vision-only models, the recent rise of vision-language (VL) foundation models has set new benchmarks across a wide range of task domains. However, despite the excellent performance of these VL models, their fairness remains unclear.

Given the biases of vision-only models and the human-authored nature of clinical reports, VL models may further exacerbate fairness issues. Therefore, as deep learning shifts toward multimodal foundation models, it is becoming increasingly critical to examine how the interaction between vision and text affects the fairness of algorithmic results. However, such investigations are currently limited by the lack of VL datasets containing comprehensive demographic information, and existing public VL datasets focus primarily on chest X-rays.

Previous studies have highlighted the challenges of using these datasets to study fairness, because their labels are automatically extracted from radiology reports, which may lead to inaccurate fairness conclusions due to label noise. In addition, because these datasets were not primarily designed for fairness research, they provide only a few demographic attributes, limiting the potential for comprehensive fairness studies across multiple dimensions. Furthermore, radiology reports focus primarily on direct observations of the imaging data and rarely contain additional patient-specific information; they therefore do not represent the majority of clinical texts, which limits their usefulness for studying the fairness of medical VL models.


To fill this research gap, we introduce the first vision-language medical dataset for studying fairness (FairVLMed), which provides detailed demographic attributes, ground-truth labels, and clinical notes to facilitate an in-depth examination of fairness within VL foundation models.

FairVLMed contains records from 10,000 patients, each paired with an SLO retinal image and a clinical medical report for diagnosing glaucoma, along with detailed protected attributes such as age, gender, race, ethnicity, preferred language, and marital status.

Unlike radiology reports, clinical medical reports in our dataset provide more detailed information, including not only image descriptions, but also rich non-imaging clinical information such as medications, non-imaging test results, and family history. Therefore, these clinical medical reports are more representative and more suitable for studying the fairness of the medical VL model.

Glaucoma affects millions of people around the world, and it exemplifies the need for fair diagnostic models. Timely detection is essential to avoid irreversible vision loss. However, many patients go undiagnosed due to the asymptomatic nature of the disease and barriers to eye care. Moreover, underdiagnosis is particularly prominent among ethnic minorities. For example, previous studies have shown that individuals in the Black community are 4.4 times more likely to have undiagnosed and untreated glaucoma than individuals in the White community, highlighting the importance of addressing such medical disparities.

Deep learning systems have significant potential to improve healthcare. However, before these deep learning systems can be clinically implemented, addressing potential equity issues is necessary to ensure equitable health care delivery.

In this work, we performed an extensive fairness analysis on FairVLMed using two widely used VL methods, namely CLIP and BLIP2. Our experimental results revealed significant differences in accuracy between various groups based on race, gender, ethnicity, and language.

To address these fairness issues, we introduce an optimal-transport-based approach named FairCLIP. FairCLIP enhances fairness by optimizing the Sinkhorn distance so that the overall sample feature distribution is aligned with the feature distribution of each demographic group.

Our main contributions can be summarized as follows:

  • We present the first fair vision-language medical dataset (FairVLMed), which provides detailed demographic attributes, ground-truth labels, and clinical notes for investigating the fairness of VL foundation models.
  • Using FairVLMed, we performed a comprehensive fairness analysis of two widely used VL models, CLIP and BLIP2, that were pre-trained in both the natural and medical domains across four different protected attributes.
  • Our results highlight significant biases across all VL models, with Asians, males, non-Hispanics, and Spanish speakers, respectively, being the preferred subgroups for the protected attributes of race, gender, ethnicity, and language.
  • We propose an optimal-transport-based method called FairCLIP, which significantly outperforms CLIP in terms of both performance and fairness.

How to obtain a large amount of paired vision-language medical data

The data in this study come from the Massachusetts Eye and Ear Hospital at Harvard Medical School and span the years 2015 to 2022. The study includes three types of data: (1) scanning laser ophthalmoscopy (SLO) fundus images, (2) demographic identity group information, and (3) de-identified clinical notes written by ophthalmologists that summarize the glaucoma diagnosis.

SLO fundus images are a valuable tool for assessing retinal damage caused by diseases such as glaucoma. Each SLO fundus image is associated with six demographic identity attributes: age, gender, race, ethnicity, preferred language, and marital status. The accompanying clinical notes vary in length, may describe the assessment, treatment plan, and diagnostic strategy in detail, and are expected to correspond to the visual semantics in the SLO fundus images.

Figure 1 shows two examples of SLO fundus images and clinical notes. Subjects were divided into a non-glaucoma category (normal visual function as measured by visual field (VF) testing: VF mean deviation ≥ -1 dB, with normal glaucoma hemifield test and pattern standard deviation (PSD) results) and a glaucoma category (abnormal visual function as measured by VF testing: VF mean deviation < -3 dB, with abnormal glaucoma hemifield test and PSD results).
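To make the labeling rule concrete, the sketch below encodes it as a small Python function. It is purely illustrative (not the authors' code); the argument names and the handling of borderline cases between the two criteria are assumptions.

```python
def label_subject(vf_mean_deviation_db: float,
                  ght_abnormal: bool,
                  psd_abnormal: bool) -> str:
    """Assign a diagnosis label from visual-field (VF) metrics, per the rule above."""
    if vf_mean_deviation_db >= -1.0 and not ght_abnormal and not psd_abnormal:
        return "non-glaucoma"
    if vf_mean_deviation_db < -3.0 and ght_abnormal and psd_abnormal:
        return "glaucoma"
    # Subjects falling between the two criteria are not covered by the quoted rule;
    # how they are handled is an open detail, so we mark them separately here.
    return "unlabeled"

print(label_subject(0.5, False, False))   # -> non-glaucoma
print(label_subject(-5.2, True, True))    # -> glaucoma
```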

De-identification of protected information

Original clinical notes may contain protected sensitive information such as the date of glaucoma diagnosis, the patient's name, phone number, email address, physical location, and institution. We de-identify this sensitive information through the following three steps.

First, we use Microsoft's Presidio tool to anonymize all clinical notes, replacing sensitive information with appropriate placeholders (e.g., PERSON NAME, PHONE NUMBER, LOCATION) in order to maintain the original sentence structure and coherence.

We then use rule-based matching to de-identify protected information (e.g., physical addresses) that Presidio does not fully recognize.

Finally, the de-identified clinical notes are further validated by four medical experts: each clinical note is checked by an expert, and any remaining sensitive information is manually replaced with the appropriate placeholder.
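A minimal sketch of the automated part of this pipeline (the first two steps) is shown below, using the open-source Presidio libraries plus a simple regex fallback. The placeholder strings, entity list, and regex pattern are illustrative assumptions rather than the authors' exact configuration.

```python
import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def deidentify(note: str) -> str:
    # Step 1: detect PII spans with Presidio and replace them with readable placeholders.
    results = analyzer.analyze(text=note, language="en")
    operators = {
        "PERSON": OperatorConfig("replace", {"new_value": "PERSON NAME"}),
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "PHONE NUMBER"}),
        "LOCATION": OperatorConfig("replace", {"new_value": "LOCATION"}),
        "DEFAULT": OperatorConfig("replace", {"new_value": "REDACTED"}),
    }
    text = anonymizer.anonymize(text=note, analyzer_results=results,
                                operators=operators).text
    # Step 2: rule-based pass for patterns Presidio may miss, e.g. street addresses.
    text = re.sub(r"\b\d{1,5}\s+\w+\s+(St|Ave|Rd|Blvd)\b", "LOCATION", text)
    return text  # Step 3 (manual expert review) happens outside the code.
```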

Data characteristics

The FairVLMed dataset includes 10,000 samples from 10,000 subjects. It is divided into 7,000 training samples, 1,000 validation samples, and 2,000 test samples.

The mean age in the dataset is 60.9 ± 16.2 years. The dataset includes samples from three major racial groups: Asian (819 samples), Black (1,491 samples), and White (7,690 samples). In terms of gender, women account for 56.3% of the subjects, and the rest are men. The ethnicity distribution is 90.6% non-Hispanic, 4.0% Hispanic, and 5.4% unspecified.

In terms of preferred language, 92.5% of subjects prefer English, 1.7% prefer Spanish, 0.8% prefer other languages, and 5.0% are unknown. In terms of marital status, 57.4% are married or partnered, 26.4% are single, 6.6% are divorced, 1.0% are legally separated, 6.1% are widowed, and 2.5% are unspecified. After de-identification, the clinical notes range from 11 to 332 words in length, with an average of 147 words.

[Figure: Overview of the FairCLIP framework]

FairCLIP: a method for improving the fairness of vision-language foundation models

As shown in the figure above, our proposed FairCLIP framework aims to improve fairness during the pre-training phase. This is achieved by minimizing the difference between the probability distribution of the vision-language feature correlations M_{I,i} over all samples and the corresponding distribution within each racial group (or group defined by another protected attribute).

[Equation: the FairCLIP fairness objective]

where d is a distance function and the underlying distributions are not computationally tractable. We therefore use batch-based empirical distributions in the equation, where B_a denotes the samples in the batch that belong to group a.
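Putting these pieces together, one way to write the resulting training objective in standard notation is sketched below; the weighting coefficient λ and the exact normalization are illustrative assumptions rather than the paper's verbatim formula:

$$ \min_{\theta}\; \mathcal{L}_{\mathrm{CLIP}}(\theta) \;+\; \lambda \sum_{a \in \mathcal{A}} d\big(\{M_{I,i}\}_{i \in B},\; \{M_{I,i}\}_{i \in B_a}\big), $$

where B is the current batch, B_a ⊆ B are the batch samples belonging to group a, A is the set of groups for the chosen protected attribute, and d is the distance discussed next.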

To optimize this objective, a straightforward approach is to minimize the Kullback–Leibler (KL) divergence between the two distributions. However, the KL divergence is asymmetric and does not satisfy the triangle inequality, so it is not a true distance metric. Instead, following prior work, we minimize the Sinkhorn distance between the two distributions. The Sinkhorn distance is a distance between probability distributions and an entropy-regularized variant of the Wasserstein distance. The Sinkhorn distance between two distributions is defined as:

[Equation: definition of the Sinkhorn distance]
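For reference, a standard form of the entropy-regularized optimal-transport (Sinkhorn) distance between two discrete distributions μ and ν is sketched below; the paper's exact notation and regularization term may differ slightly:

$$ W_{\epsilon}(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)} \sum_{i,j} \pi_{ij}\, c(x_i, y_j) \;+\; \epsilon \sum_{i,j} \pi_{ij} \log \pi_{ij}, $$

where Π(μ, ν) is the set of couplings whose marginals are μ and ν, c(·, ·) is a ground cost such as the squared Euclidean distance, and ε > 0 controls the strength of the entropic regularization; as ε → 0 the Sinkhorn distance approaches the Wasserstein distance.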

The Sinkhorn loss is added to the loss used by CLIP during the pre-training phase in order to optimize the fairness of CLIP.
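As an illustration only (not the authors' released code), the combined objective could be implemented with a batch-level Sinkhorn term such as the one provided by the third-party geomloss package; the correlation features, group ids, and the weight lambda_fair below are assumed names and values.

```python
import torch
from geomloss import SamplesLoss  # third-party Sinkhorn/OT losses for PyTorch

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def fairclip_loss(clip_loss: torch.Tensor,
                  correlations: torch.Tensor,   # (B, 1) image-text correlations M_{I,i}
                  group_ids: torch.Tensor,      # (B,) protected-attribute id per sample
                  lambda_fair: float = 1e-4) -> torch.Tensor:
    """CLIP loss plus a Sinkhorn term aligning each group with the whole batch."""
    fairness = correlations.new_zeros(())
    for g in group_ids.unique():
        group_corr = correlations[group_ids == g]
        if len(group_corr) > 1:  # Sinkhorn needs more than one sample per group
            fairness = fairness + sinkhorn(correlations, group_corr)
    return clip_loss + lambda_fair * fairness
```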

Experiments

We employ two evaluation strategies: linear probing and zero-shot transfer. For linear probing, we follow the official MAE implementation and train a linear classifier on the visual features of CLIP and BLIP2, respectively. As in MAE, we place a BatchNorm layer before the linear classifier and use the LARS optimizer with a base learning rate of 0.1, a weight decay of 0, and a batch size of 512. For zero-shot transfer, we use the same approach as the original CLIP paper.
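A minimal sketch of such a linear probe on frozen visual features is shown below. Because LARS is not part of core PyTorch, plain SGD is substituted here, and the feature dimension and class count are placeholder values.

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 2  # placeholder sizes for a frozen VL image encoder

# BatchNorm (without affine parameters, as in the MAE linear-probe recipe) + linear head.
probe = nn.Sequential(
    nn.BatchNorm1d(feat_dim, affine=False),
    nn.Linear(feat_dim, num_classes),
)

# The paper's setup uses LARS with base lr 0.1 and weight decay 0;
# plain SGD stands in here only to keep the sketch dependency-free.
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, weight_decay=0.0)
criterion = nn.CrossEntropyLoss()

features = torch.randn(512, feat_dim)           # frozen image features for one batch
labels = torch.randint(0, num_classes, (512,))  # glaucoma / non-glaucoma labels

optimizer.zero_grad()
loss = criterion(probe(features), labels)
loss.backward()
optimizer.step()
```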

[Table 2: linear-probing results across performance and fairness metrics]

Table 2 presents the linear-probing results, covering performance (AUC) and fairness (DPD, DEOdds, ES-AUC) metrics, and also reports group-wise AUC scores for the individual subgroups within each of the four protected attributes. In the subsequent analysis we focus primarily on the ES-AUC metric, as it captures overall performance as well as fairness, both of which matter for safety-critical medical applications. Table 2 shows the differences in VL performance across the protected attributes, as well as the impact of different VL pre-training domains (natural images vs. medical images) and VL pre-training methods (CLIP vs. BLIP2) on model performance and fairness.
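For readers unfamiliar with these fairness metrics, the sketch below computes the demographic parity difference (DPD) and per-group AUC values in the way such quantities are commonly defined; it is an illustration, not the paper's evaluation code, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_report(y_true, y_score, groups, threshold=0.5):
    """Demographic parity difference and per-group AUCs for a binary classifier."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    y_pred = (y_score >= threshold).astype(int)

    # DPD: largest gap in positive-prediction rate across groups.
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    dpd = max(rates) - min(rates)

    # Overall AUC and the AUC of each demographic group.
    overall_auc = roc_auc_score(y_true, y_score)
    group_auc = {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
                 for g in np.unique(groups)}
    return {"DPD": dpd, "overall_AUC": overall_auc, "group_AUC": group_auc}
```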

[Table 3: zero-shot transfer results for CLIP vs. FairCLIP]

Table 3 compares CLIP and FairCLIP under zero-shot transfer on two different architectures (ViT-B/16 and ViT-L/14) and four different protected attributes. Both CLIP and FairCLIP are fine-tuned on images and clinical notes without supervised information (i.e., without labels), and the resulting models are then evaluated on the classification task. CLIP shows significant differences in group AUC across race, gender, ethnicity, and language, suggesting bias in glaucoma detection. Overall, FairCLIP is significantly better than CLIP in terms of the fairness metrics (DPD, DEOdds) as well as the ES-AUC and AUC scores of the various demographic subgroups.
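For context, zero-shot transfer with a CLIP-style model scores an image against text prompts for each class. A minimal sketch using the open-source clip package is shown below; the prompt wording is an assumption rather than the paper's exact templates.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical class prompts; the exact templates used in the paper may differ.
prompts = ["a fundus photograph with glaucoma", "a fundus photograph without glaucoma"]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("slo_fundus.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(["glaucoma", "non-glaucoma"], probs[0].tolist())))
```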

[Table 5: end-to-end fine-tuning results]

Table 5 shows additional end-to-end fine-tuning results, further validating the effectiveness of FairCLIP. These empirical findings suggest that optimizing the distance between the overall sample distribution and the distributions of specific subgroups effectively improves fairness, indicating a promising direction for addressing and mitigating inherent biases.

[Table 4: linear probing with vision-only vs. vision+language features]

To examine the benefit of coupling image and text features, we performed linear probing of the BLIP2 pre-trained model using vision-only versus vision+language features. Table 4 reports the performance-fairness trade-off as measured by ES-AUC. We observe that multimodal features consistently improve the performance-fairness trade-off across all protected attributes except language. This highlights the VL model's effective use of clinical text features, with the most significant gains observed on the ethnicity attribute.

[Figure 3: ablation results on visual encoders and debiasing methods]

To investigate the impact of different visual encoders on the fairness of the BLIP2 model, we use two different pre-trained encoders: (1) CLIP, trained on the natural domain, and (2) PMC-CLIP, trained on the medical domain. The results in Figure 3b show that PMC-CLIP outperforms CLIP on all four protected attributes, with the most significant gains in the ethnicity subgroups. We note that medical-specific LLM summarizers and visual encoders consistently improve the performance-fairness trade-off of VL models, especially on the race attribute.

Beutel et al. introduced a fairness approach that uses an adversarial loss to prevent the model's representations from being predictive of sensitive attributes. This approach aims to ensure that the model predicts image labels without relying on sensitive attributes, thereby reducing bias in classification. Figure 3c shows the performance comparison between CLIP, CLIP with adversarial loss (CLIP w/Adv), and FairCLIP. CLIP with adversarial training (CLIP w/Adv) does not consistently outperform standard CLIP across all attributes. In contrast, FairCLIP consistently outperforms CLIP. This difference can be attributed to the inherent challenge of adversarial training in maintaining equivalent predictive accuracy for every attribute. FairCLIP, on the other hand, uses the Sinkhorn loss, which effectively encourages the distribution of all samples to be consistent with that of each group.
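For reference, the adversarial baseline described above is typically implemented with a gradient reversal layer, so that the feature encoder is trained to fool an attribute classifier. The sketch below is a generic illustration of that idea, not the exact configuration compared in Figure 3c; the head sizes and the weighting factor alpha are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

feat_dim, num_classes, num_groups = 512, 2, 4     # placeholder sizes
task_head = nn.Linear(feat_dim, num_classes)      # predicts glaucoma vs. non-glaucoma
adv_head = nn.Linear(feat_dim, num_groups)        # adversary predicts the protected attribute
ce = nn.CrossEntropyLoss()

def adversarial_objective(features, labels, attributes, alpha=1.0):
    # The task loss rewards correct diagnosis; the adversarial term, through the
    # reversed gradient, penalizes features that reveal the protected attribute.
    task_loss = ce(task_head(features), labels)
    adv_loss = ce(adv_head(grad_reverse(features, alpha)), attributes)
    return task_loss + adv_loss
```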

More results are presented in the supplementary material of the paper.


Summary

In view of the critical need for fairness in health care, we introduce the first vision-language medical dataset (FairVLMed) for investigating the fairness of medical VL foundation models.

Our comprehensive fairness analysis on FairVLMed reveals significant biases across all VL models. To address these biases, we propose FairCLIP, an optimal-transport-based approach that effectively balances performance and fairness.

