Reprinted from AI Technology Review
Compiled by | Wang Ye
Edited by | Twilight

This article is based on a perspective published in Nature Machine Intelligence. The author, Rohan Shad, is a postdoctoral researcher in the Hiesinger Lab in the Department of Cardiothoracic Surgery. He and his team build novel computer vision systems for cardiovascular imaging (echocardiography and cardiac MRI) and use transcriptomics and protein design to study the underlying mechanisms of heart disease and to design devices for patients with severe heart failure.
The perspective explores the unique challenges of high-dimensional clinical imaging data and highlights some of the technical and ethical considerations involved in developing machine learning systems that better reflect the high-dimensional nature of these imaging modalities. In addition, the authors argue that approaches addressing interpretability, uncertainty, and bias should be seen as core components of all clinical machine learning systems.
Link to the original article: https://www.nature.com/articles/s42256-021-00399-8
In 2018, the National Institutes of Health identified key areas for integrating AI into the future of medical imaging and laid the groundwork for research into image acquisition, algorithms, data standardization, and translational clinical decision support systems.
The report notes that although tremendous progress has been made over the past few years, data availability, the need for new computing architectures, and interpretable AI algorithms remain key issues.
In addition, issues such as data sharing, performance validation for regulatory approval, generalizability, and mitigation of unintended bias must be considered early in the development process, with translational goals in mind.
1. The main idea
Advances in computing power, deep learning architectures, and expert labeled datasets have spurred the development of medical imaging artificial intelligence (AI) systems.
However, translating AI systems into clinical practice is very challenging. A machine learning algorithm intended to shorten the time to a clinical decision can, once deployed, inadvertently delay patient care. Outside a controlled laboratory environment, the end users of an AI system must be able to control input quality, cope with issues such as network latency, and devise ways to integrate these systems into established clinical workflows.
Early attempts at translatable clinical machine learning have shown that designing systems that work well within a given clinical workflow requires extensive integration effort from the outset of algorithm development, because opportunities to iterate once a system is deployed are very limited.
With the proliferation of open-source machine learning software libraries and advances in computing performance, it is becoming increasingly easy for researchers to develop complex AI systems for specific clinical problems. Beyond detecting diagnostic features of disease, next-generation AI systems must account for systematic biases in the training data, alert end users more intuitively to the uncertainty inherent in predictions, and allow users to explore and interpret the mechanisms behind those predictions.
This perspective builds on these key priority areas to accelerate fundamental AI research in medicine. We outline the nuances of datasets and the specific architectural considerations for machine learning on high-dimensional medical imaging, and discuss the interpretability, uncertainty, and bias of these systems. Along the way, we provide a template for researchers interested in addressing some of the issues and challenges of building clinically translatable AI systems.
2. High-dimensional medical imaging data
We expect the demand for high-quality, "AI-ready" annotated medical datasets to remain unmet for the foreseeable future. Retrospectively assigning clinical ground-truth labels requires a significant amount of time from clinical experts, and there are significant hurdles to pooling multi-institutional data for public release. In addition to "diagnostic AI" trained on hard radiological ground-truth labels, "disease prediction AI" trained on potentially more complex composite clinical outcome targets is also needed. Prospective data collection with standardized image acquisition protocols and clinical ground-truth adjudication is a necessary step in building large-scale multicenter imaging datasets with paired clinical outcomes.
Large-scale multicenter imaging datasets create a number of privacy and liability issues related to potentially sensitive data embedded in the files. The Digital Imaging and Communications in Medicine (DICOM) standard is commonly used for workflow management when capturing, storing, and transferring medical images. Imaging files (stored as .dcm files or nested folder structures) contain pixel data and associated metadata. Numerous open-source and proprietary tools can help de-identify DICOM files by clearing metadata fields that may contain sensitive information; back-end hospital informatics frameworks, such as the Google Healthcare API, also support DICOM de-identification through "safe lists."
On the user-facing side, the MIRC Clinical Trials Processor anonymizer is a popular alternative, although it relies on some legacy software. Well-documented Python packages such as pydicom can also be used to process DICOM files before they are used or transferred to partner institutions. Imaging data can then be extracted and stored in a variety of machine-readable formats. These datasets can quickly become bulky and unwieldy, and while the details of data storage formats are beyond the scope of this perspective, a key consideration for medical imaging AI is the preservation of image resolution.
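As a rough illustration of this preprocessing step, the sketch below blanks a handful of example metadata fields with pydicom. The tag list is illustrative only; a real de-identification pipeline would follow the full DICOM confidentiality profiles and institutional policy rather than this short list.

```python
# Minimal sketch: clearing a handful of example DICOM metadata fields with pydicom.
# The tag list is illustrative only; real de-identification should follow the DICOM
# confidentiality profiles and institutional policy.
import pydicom

EXAMPLE_TAGS_TO_CLEAR = [
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "InstitutionName",
    "ReferringPhysicianName",
]

def scrub_dicom(in_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(in_path)
    for keyword in EXAMPLE_TAGS_TO_CLEAR:
        if hasattr(ds, keyword):
            setattr(ds, keyword, "")   # blank out the field
    ds.remove_private_tags()           # drop vendor-specific private tags
    ds.save_as(out_path)

# Example usage (paths are hypothetical):
# scrub_dicom("raw/scan0001.dcm", "deid/scan0001.dcm")
```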
One frequently mentioned drawback of automated de-identification methods or scripts is that protected health information can be "burned" into the image pixels themselves. Despite the DICOM standard, vendor implementations differ, making it difficult to write simple rules for tools such as the MIRC Clinical Trials Processor to mask regions where protected health information may be located. We recommend using a simple machine learning system to mask "burned-in" protected health information.
Take echocardiography as an example: the heart is visible only within a predefined scan region, so pixels outside that region can be masked. Another potential option is a machine learning-based optical character recognition tool that identifies and masks regions containing printed text. The DICOM tags themselves can be used to extract scan-level information and specific patterns of tags. For example, in echocardiography and cardiac magnetic resonance imaging (MRI), important scan-level information such as the acquisition frame rate and date, or the MRI sequence (T1/T2), can be easily extracted from the DICOM metadata.
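A minimal sketch of the simplest version of this idea is shown below: pixels outside an assumed echocardiography scan region are zeroed out so that any text burned into the image margins is removed. The region coordinates and array shapes are hypothetical placeholders; in practice they would be derived per vendor and device, or predicted by a small machine learning model as suggested above.

```python
# Minimal sketch: masking pixels outside an assumed echocardiography scan region so
# that any text "burned" into the image margins is removed. The region coordinates
# are hypothetical placeholders, not vendor-specific values.
import numpy as np

def mask_outside_scan_region(frames: np.ndarray,
                             row_range=(60, 560),
                             col_range=(100, 700)) -> np.ndarray:
    """frames: array of shape (num_frames, height, width); returns a masked copy."""
    masked = np.zeros_like(frames)
    r0, r1 = row_range
    c0, c1 = col_range
    masked[:, r0:r1, c0:c1] = frames[:, r0:r1, c0:c1]  # keep only the scan region
    return masked

# Example usage with random pixel data standing in for an echo clip:
clip = np.random.randint(0, 255, size=(32, 600, 800), dtype=np.uint8)
clean = mask_outside_scan_region(clip)
```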
Figure 1: Collaborative, cloud-based annotation workflow. Cloud-based tools can be used to generate expert-annotated datasets and to evaluate AI systems with clinical experts over a secure connection. The figure shows an example implementation using MD.ai, in which clinical experts perform various 2D measurements to evaluate cardiac function.
For research that involves benchmarking AI systems against clinicians, or curating large datasets with the help of clinical annotators, we recommend storing copies of scans in DICOM format. This allows deployment through scalable and easy-to-use cloud annotation tools. There are currently several solutions for distributing scan data for evaluation by clinical experts. Requirements may range from simple scan-level labels to detailed, domain-specific anatomical segmentation masks. At our institution, we deployed MD.ai (New York, New York), a cloud-based annotation system that natively processes DICOM files stored with institutionally approved cloud storage providers (Google Cloud Storage or Amazon AWS). Alternatives such as ePAD Lite (Stanford, California), which is free to use, offer similar features. Other advantages of the cloud-based annotation approach are that scans retain their original resolution and quality, real-time collaboration can simulate "team-based" clinical decision-making, and annotations and labels can be easily exported for downstream analysis. On top of that, many of these tools can be accessed remotely from any web browser and are extremely easy to operate, greatly improving the user experience and reducing the technical burden on clinical collaborators.
Finally, newer machine learning training paradigms, such as federated learning, may help circumvent many of the barriers associated with data sharing. Kaissis et al. reviewed the principles, security risks, and implementation challenges of federated learning. The main feature of this approach is that a local copy of the algorithm is trained at each institution, and the only information shared is what the neural network learns during training. At predetermined intervals, the information learned by each institution's algorithm (the trained weights) is pooled and redistributed, effectively learning from a large, multicenter dataset without the need to transmit or share any medical imaging data. This approach has, for example, been used to rapidly train algorithms to detect features of COVID-19 from chest computed tomography.
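The sketch below illustrates the weight-pooling step described above (often called federated averaging), assuming each institution trains a local copy of the same PyTorch model. Secure transport, encryption, and scheduling, which are essential in practice, are omitted.

```python
# Minimal sketch of the federated-averaging step: local copies of the same model are
# trained at each institution, and only their weights are pooled and redistributed.
# Secure transport, encryption and scheduling are omitted here.
import copy
import torch
import torch.nn as nn

def federated_average(site_models):
    """Average the parameters of locally trained models into one state dict."""
    state_dicts = [m.state_dict() for m in site_models]
    averaged = copy.deepcopy(state_dicts[0])
    for key in averaged:
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged

# Toy example: three "institutions" sharing the same small architecture.
sites = [nn.Linear(16, 2) for _ in range(3)]
global_weights = federated_average(sites)
for model in sites:
    model.load_state_dict(global_weights)  # redistribute the pooled weights
```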
Although there have been successful demonstrations of federated learning in medical imaging, a number of technical challenges remain before these methods can be applied in routine clinical use. Especially in the context of high-dimensional imaging machine learning systems, the network latency introduced by transmitting and updating trained weights across multiple participating centers becomes a fundamental rate-limiting step when training larger neural networks. Researchers must also ensure that the transmission of trained weights between participating institutions is secure and encrypted, which further increases network latency. In addition, when designing a study, curating the quality and consistency of the dataset can be extremely challenging without access to the source data, and many conceptually similar federated learning frameworks still assume some degree of access to the source data.
3. Compute architecture
Neural network architectures used in modern clinical machine learning are mainly derived from those optimized for large-scale photo or video recognition tasks. These architectures are robust even on challenging fine-grained classification tasks, where classes differ only subtly from one another (for example, dog breeds) rather than being visually distinct objects (airplanes versus dogs). With adequate pre-training on large datasets such as ImageNet, these "off-the-shelf" architectures outperform purpose-built fine-grained classifiers. Many of these architectures are available in popular machine learning frameworks such as TensorFlow and PyTorch. On top of that, these frameworks often provide ImageNet pre-trained weights for a variety of neural network architectures, allowing researchers to quickly repurpose them for specialized medical imaging tasks.
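As a minimal sketch of this repurposing step, the snippet below loads an ImageNet-pretrained ResNet-18 from a recent torchvision release and swaps its final layer for a hypothetical two-class medical imaging task; the class count and input sizes are illustrative assumptions.

```python
# Minimal sketch: reusing an ImageNet-pretrained "off-the-shelf" architecture from
# torchvision (recent versions) for a two-class medical imaging task by replacing
# the final layer. The two-class setup and input size are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. disease vs. normal

# From here the model is fine-tuned on the medical dataset as usual:
dummy_batch = torch.randn(4, 3, 224, 224)      # stand-in for preprocessed scans
logits = model(dummy_batch)
```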
Unfortunately, the vast majority of clinical imaging modalities are not simply static "images." For example, an echocardiogram is a two-dimensional (2D) ultrasound video of the heart. These "videos" can be acquired from a number of different view planes, allowing a more comprehensive assessment of the heart. CT and MRI scans can be thought of as stacks of 2D images that must be analyzed in sequence, or clinicians risk missing valuable relationships between structures along a given axis.
As a result, these imaging modalities are more similar to video, and decomposing them into individual images can result in the loss of spatial or temporal context. For example, analyzing each frame of a video as a separate image discards the temporal information between frames. Across various tasks involving echocardiography, CT, and MRI scans, video-based neural network algorithms are a considerable improvement over their 2D counterparts, but integrating multiple different view planes introduces additional dimensions that are difficult to incorporate into current frameworks.
Unlike the extensive libraries of pre-trained image-based networks, support for video algorithms is still limited. Researchers interested in deploying new architectures may need to perform the pre-training step themselves on large publicly available video datasets such as Kinetics and UCF101 (University of Central Florida 101 Action Recognition Dataset). In addition, the computational cost of training video networks can be orders of magnitude higher. While pre-training on large natural-image or video datasets is a recognized strategy for developing machine learning systems for clinical imaging, performance gains are not guaranteed: reports of performance improvements from pre-training are common, especially with smaller datasets, but the benefits diminish as the training dataset grows.
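For completeness, a minimal sketch of starting from a Kinetics-pretrained video architecture is shown below, using the r3d_18 model shipped with recent torchvision releases. The clip dimensions and two-class head are illustrative assumptions rather than a recommendation of this particular backbone.

```python
# Minimal sketch: starting from a Kinetics-pretrained 3D video architecture in
# torchvision (recent versions) and repurposing it for a two-class task on
# echo-like clips. Input shape and class count are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

# A video batch is (batch, channels, frames, height, width):
clip = torch.randn(2, 3, 16, 112, 112)
logits = model(clip)
```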
In the 2018 National Institutes of Health roadmap, the lack of architectures specific to medical imaging was identified as a key challenge. We go further and argue that how these architectures are trained will play an important role in translating these systems into practice. We believe the next generation of high-dimensional medical imaging AI will need to be trained on richer, more contextual targets rather than simple classification labels.
Today, most medical imaging AI systems focus on diagnosing a small number of diseases against a background of normal scans. A typical approach is to assign a numeric label when training these algorithms (disease: 1; normal: 0). This is very different from how clinical trainees learn to diagnose different diseases from imaging scans. To provide more "medical knowledge" than simple pre-training on natural images or videos, Taleb et al. proposed a series of novel self-supervised pre-training techniques using large unlabeled medical imaging datasets, designed to assist in the development of AI systems based on 3D medical imaging.
The neural network first learns to "describe" the input imaging scan by performing a set of "proxy tasks." For example, by having the network "recombine" input scan data like a jigsaw puzzle, it can be trained to "understand" which anatomical structures are consistent with one another across various pathological and physiological states. Pairing imaging scans with their radiology reports is another interesting strategy, and it has been applied quite successfully to chest X-ray AI systems.
In the same spirit of providing more nuanced clinical context and embedding more "knowledge" into neural networks, the text of the report is processed by state-of-the-art natural language machine learning algorithms, which are then used to train the vision model to better understand what makes various diseases "different." Most importantly, the authors showed that this approach could reduce the amount of labeled data needed for a given downstream classification task by up to two orders of magnitude. Thus, unlabeled imaging studies, alone or combined with paired text reports, can serve as the basis for effective pre-training; the network is then fine-tuned on smaller, high-quality ground-truth samples for specific supervised learning tasks.
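As a concrete, highly simplified illustration of the jigsaw-style proxy task mentioned above, the sketch below shuffles 2D patches of a slice and trains a tiny network to predict which permutation was applied. The 2x2 grid and toy CNN are simplifying assumptions; Taleb et al. describe far richer 3D variants.

```python
# Minimal sketch of a jigsaw-style proxy task: a 2D slice is cut into a 2x2 grid of
# patches, shuffled with one of 24 possible permutations, and a small network is
# trained to predict which permutation was applied. The tiny CNN and 2x2 grid are
# simplifying assumptions for illustration only.
import itertools
import random
import torch
import torch.nn as nn

PERMUTATIONS = list(itertools.permutations(range(4)))  # 24 orderings of 4 patches

def shuffle_patches(img: torch.Tensor):
    """img: (1, H, W) with H, W even. Returns shuffled image and permutation index."""
    _, h, w = img.shape
    patches = [img[:, i:i + h // 2, j:j + w // 2]
               for i in (0, h // 2) for j in (0, w // 2)]
    idx = random.randrange(len(PERMUTATIONS))
    order = PERMUTATIONS[idx]
    top = torch.cat([patches[order[0]], patches[order[1]]], dim=2)
    bottom = torch.cat([patches[order[2]], patches[order[3]]], dim=2)
    return torch.cat([top, bottom], dim=1), idx

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, len(PERMUTATIONS)),          # predict the permutation index
)

# One illustrative training step on a random "scan" slice:
img = torch.rand(1, 64, 64)
shuffled, target = shuffle_patches(img)
loss = nn.functional.cross_entropy(encoder(shuffled.unsqueeze(0)),
                                   torch.tensor([target]))
loss.backward()
```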
While these steps help adapt existing neural network architectures to medical imaging, designing new architectures for specific tasks requires expertise. The model architecture is analogous to the brain, while the trained weights (the mathematical functions optimized during training) are analogous to the mind. Advances in evolutionary search algorithms leverage machine learning methods to discover new architectures tailored to specific tasks, yielding architectures that are more efficient and performant than those designed by humans. These advances provide a unique opportunity to develop modality-specific architectures for medical imaging.
Training deep learning algorithms relies on graphics processing units (GPUs) to perform large-scale parallel matrix multiplication operations. Cloud computing "pay-as-you-go" GPU resources and the availability of consumer-grade GPUs with high memory capacity both help lower the barrier to entry for researchers interested in developing machine learning systems for medical imaging. Despite these advances, training complex modern network architectures on large video datasets requires multiple GPUs to run continuously for weeks.
Clinical research teams should note that while it may be feasible to train a single model on a relatively inexpensive computer, finding the right combination of settings for optimal performance almost always requires specialized hardware and compute clusters to return results within a reasonable time frame. Powerful layers of abstraction (for example, PyTorch Lightning) also allow research groups to establish internal standards and build their code in a modular form. With such a modular approach, neural network architectures and datasets can be swapped easily, helping to quickly repurpose systems designed for one clinical imaging modality for new use cases. This approach also helps extend the functionality of these systems by integrating subcomponents in new ways.
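A minimal sketch of this modular pattern is shown below: a PyTorch Lightning module that accepts any backbone, so the architecture or dataset can be swapped without touching the training logic. The backbone, loss, and learning rate are illustrative assumptions.

```python
# Minimal sketch of the modular layer-of-abstraction idea: a PyTorch Lightning module
# that accepts any backbone, so the architecture or dataset can be swapped without
# touching the training logic. Backbone, loss and learning rate are illustrative.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class ImagingClassifier(pl.LightningModule):
    def __init__(self, backbone: nn.Module, lr: float = 1e-4):
        super().__init__()
        self.backbone = backbone          # any image or video network ending in logits
        self.lr = lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.backbone(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# Swapping modalities is then just a matter of passing a different backbone:
# model = ImagingClassifier(backbone=my_video_network)
# pl.Trainer(max_epochs=10).fit(model, train_dataloaders=my_dataloader)
```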
4. Time-to-event analysis and uncertainty quantification
As medical AI systems move from "diagnostic" toward more "prognostic" applications, time-to-event predictions (rather than simple binary predictions) will become more relevant in clinical settings. Time-to-event analysis can predict the probability of an event as a function of time, whereas a binary classifier can only provide a prediction at a predetermined time point. Unlike binary classifiers, time-to-event analysis also accounts for censoring of the data, that is, patients who are lost to follow-up or who have not experienced the relevant event within the observation window. Survival analysis is common in clinical studies and is at the heart of developing evidence-based practice guidelines.
Extending traditional survival models with image- and video-based machine learning can provide powerful insights into the prognostic value of features in tissue slides or medical imaging scans. For example, extensions of the Cox proportional hazards loss function have been integrated into conventional neural network architectures, making it possible to predict cancer outcomes from histopathology slides alone. We do not advocate using such vision networks to dictate how care is delivered, but rather as a way of flagging cases in which clinicians may have missed features of advanced malignancy.
Incorporating time-to-event analysis will become increasingly important clinically, because detectable features of indolent or early-stage disease may progress rapidly after a period of time.
For example, retinal features diagnostic of macular degeneration often take years to manifest. Patients with early disease features may therefore be labeled "normal" when neural networks are trained to predict the risk of future complications of macular degeneration. Incorporating the concepts of survival and censoring may help training systems better separate normal individuals from those with mild, moderate, and rapidly progressing disease. Similarly, vision networks trained for time-to-event analysis could be used in lung cancer screening to help stratify risk according to the expected potential for aggressive spread. Key to this translational effort is having robust, well-validated deep learning extensions of Cox regression. Over the past few years, a large number of deep learning implementations of Cox models have been described: Kvamme et al. propose a series of proportional and non-proportional extensions of Cox models, and earlier survival methods such as DeepSurv and DeepHit have also been described (Figure 2).
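To make the idea concrete, the sketch below implements a basic Cox negative partial log-likelihood (Breslow-style, ignoring ties) that can sit on top of any network producing a scalar risk score. It is only an illustration; vetted libraries such as pycox should be preferred in practice.

```python
# Minimal sketch of a Cox proportional-hazards loss (negative partial log-likelihood,
# Breslow-style, ignoring tied event times) for a network that outputs a scalar
# log-risk score. Illustrative only; vetted implementations exist in libraries
# such as pycox.
import torch

def cox_partial_likelihood_loss(risk: torch.Tensor,
                                time: torch.Tensor,
                                event: torch.Tensor) -> torch.Tensor:
    """
    risk:  (N,) predicted log-risk scores from the network
    time:  (N,) follow-up or event times
    event: (N,) 1 if the event occurred, 0 if censored
    """
    order = torch.argsort(time, descending=True)     # longest follow-up first
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)      # log of risk-set sums
    per_event = (risk - log_cumsum) * event
    return -per_event.sum() / event.sum().clamp(min=1)

# Toy example with a linear "network" standing in for an imaging model:
net = torch.nn.Linear(8, 1)
x = torch.randn(32, 8)
time = torch.rand(32)
event = torch.randint(0, 2, (32,)).float()
loss = cox_partial_likelihood_loss(net(x).squeeze(-1), time, event)
loss.backward()
```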
Figure 2: Quantifying uncertainty in machine learning outputs.
As described by Sensoy et al., machine learning models trained using standard methods can be highly confident even when they are wrong. Left: as a digit is rotated through 180°, the system confidently changes its label from "1" to "7". Right: using a method that accounts for classification uncertainty, the system instead assigns an uncertainty score that can help alert clinicians to potentially erroneous predictions.
However, from an operational point of view, time-to-event predictions can be problematic. In the hypothetical example of lung cancer screening, a suspicious nodule on chest computed tomography might yield a prediction of median survival with or without appropriate therapeutic intervention. Clinicians would naturally want to know how confident the machine learning system is in its prediction for an individual patient. When unsure about a task, humans tend to tread carefully. Machine learning systems reflect this to some extent, in that their outputs are "class probabilities" or a "probability of being correct" in the range of 0 to 1. However, most medical imaging machine learning systems described in the current literature lack the implicit ability to say "I don't know" when the input data are out of range for the model. For example, a classifier trained to predict pneumonia from computed tomography is, by design, forced to provide an output (pneumonia or not pneumonia) even if the input image is a picture of a cat.
In their paper on uncertainty quantification in deep learning, Sensoy et al. address these problems with a series of loss functions that assign an "uncertainty score" to avoid confident but wrong predictions. During the translational phase of a project, the benefits of uncertainty quantification emerge when AI systems are deployed in environments where they work alongside human users. Confidence metrics were a key element of AlphaFold2, the protein structure prediction machine learning system that achieved unparalleled accuracy in the 14th Critical Assessment of protein Structure Prediction (CASP14) challenge, giving the DeepMind research team a way to gauge how much trust to place in the predictions being generated. Many implementations of uncertainty quantification methods are permissively licensed and compatible with commonly used machine learning frameworks. Incorporating uncertainty quantification may help improve the interpretability and reliability of high-stakes medical imaging machine learning systems and reduce the likelihood of automation bias.
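The sketch below illustrates the kind of uncertainty score used in evidential approaches such as that of Sensoy et al.: the network outputs non-negative "evidence" per class, and uncertainty is high when the total evidence is low. The tiny network is a stand-in, and the full evidential training loss is omitted for brevity.

```python
# Minimal sketch of the uncertainty score used in evidential approaches such as
# Sensoy et al.: the network outputs non-negative "evidence" per class, and
# uncertainty is K / (total evidence + K). The tiny network is an illustrative
# stand-in; the evidential training loss itself is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10

net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                    nn.Linear(128, NUM_CLASSES))

def predict_with_uncertainty(x: torch.Tensor):
    evidence = F.softplus(net(x))              # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet parameters
    strength = alpha.sum(dim=1, keepdim=True)
    probs = alpha / strength                   # expected class probabilities
    uncertainty = NUM_CLASSES / strength       # high when total evidence is low
    return probs, uncertainty.squeeze(1)

# After training with an evidential loss, an out-of-distribution input (here pure
# noise) should receive a high uncertainty score:
probs, u = predict_with_uncertainty(torch.rand(1, 1, 28, 28))
```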
5. Interpretable AI and risk of harm
In addition to quantifying the predictive performance of a machine learning system, it is often more interesting for the engineers who build these systems and the clinicians who use them to understand how the systems arrive at their conclusions. Saliency maps and class activation maps remain the de facto standard for explaining how machine learning algorithms make predictions.
Recent research by Adebayo et al. suggests that relying solely on the visual appearance of saliency maps can be misleading, even when they appear contextually relevant at first glance. In a series of extensive tests, they found that many popular methods for generating post hoc saliency maps do not derive real meaning from model weights, and are instead nearly indistinguishable from "edge detectors" (algorithms that simply highlight sharp transitions between pixel intensities). Moreover, even when these visualization methods do work, they reveal little beyond "where" the machine learning algorithm is looking. In many examples, the saliency maps for correct and incorrect predictions look almost identical. These drawbacks are even more pronounced when distinguishing the "diseased" state from the "normal" state requires attention to the same region of an image or video.
Figure 3: Misleading post hoc model interpretations.
a: experiments by Adebayo et al. comparing a model trained on the true labels of the MNIST dataset (top) with a model trained on random noise (bottom); under most visualization methods, the model trained on random noise still produces a circular shape. b: detection of echocardiographic view planes; an erroneous classification (top left) and a correct classification (top right) produce nearly identical saliency maps (bottom).
Clinicians should note that heat maps alone are not sufficient to explain the capabilities of an AI system, and caution must be exercised when trying to identify failure modes using the visual approaches shown above. A more elaborate approach might involve sequential occlusion testing, which evaluates model performance after intentionally obscuring the regions that clinicians use to make a diagnosis or prediction. The idea is very intuitive: if the algorithm is run on images in which a region known to be important for diagnosing a certain disease is obscured (for example, the left ventricle when diagnosing heart failure), a sharp decline in performance should be observed.
This helps confirm that the AI system is focusing on relevant regions. Particularly in the context of high-dimensional medical imaging studies, activation maps may provide unique insights into the relative importance of certain temporal phases of video-like imaging studies. For example, some diseases may show pathological features as the heart contracts, while others may require attention to what happens as the heart relaxes. Such experiments may often show that machine learning systems identify potentially informative features in regions of the image that clinicians have not traditionally used. Beyond gathering information about how these machine learning systems produce their outputs, rigorous visualization experiments may therefore provide a unique opportunity to learn biological insights from the systems being evaluated.
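A minimal sketch of such an occlusion test is shown below: a chosen region of the input is blanked out and the drop in the predicted class probability is measured. The model, target class, and region coordinates are illustrative assumptions.

```python
# Minimal sketch of an occlusion test: a region of the input (for example, the area
# containing the left ventricle) is blanked out and the drop in the model's predicted
# probability is measured. Model, class index and region are illustrative assumptions.
import torch

def occlusion_drop(model, image: torch.Tensor, region, target_class: int) -> float:
    """
    image:  (C, H, W) preprocessed input
    region: (row_start, row_end, col_start, col_end) area to blank out
    Returns the drop in predicted probability for target_class after occlusion.
    """
    model.eval()
    r0, r1, c0, c1 = region
    occluded = image.clone()
    occluded[:, r0:r1, c0:c1] = 0.0
    with torch.no_grad():
        p_orig = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
        p_occl = torch.softmax(model(occluded.unsqueeze(0)), dim=1)[0, target_class]
    return (p_orig - p_occl).item()

# A large positive drop suggests the model relies on the occluded anatomical region:
# drop = occlusion_drop(my_model, my_scan, region=(80, 160, 90, 170), target_class=1)
```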
On the other hand, deviation of the activations from clinically known important regions may indicate that the network is learning non-specific features, making it less likely to generalize well to other datasets.
The features a machine learning system learns may depend on the design of the architecture. More importantly, machine learning systems learn and perpetuate systemic inequities present in the training data and the objectives they are given. As healthcare AI systems continue to evolve toward predicting future disease, greater care must be taken to account for the large differences in access to health care and in outcomes across demographic groups.
In a recent review, Chen et al. provide an in-depth overview of potential sources of bias, from problem selection to the post-deployment phase. Here, we focus on potential solutions early in the development of machine learning systems. Some advocate methods for interpreting the otherwise "black-box" predictions of modern machine learning systems, while others advocate restricting practice to more interpretable models in the first place. An intermediate approach involves training black-box medical imaging neural networks while also incorporating structured data inputs when training the AI system as a whole.
This can be achieved by building "fusion networks," in which tabular data are incorporated into an image- or video-based neural network, or by other more advanced methods with the same basic goal (for example, autoencoders that generate low-dimensional representations of the combined data). Even without incorporating demographic inputs into high-dimensional vision networks, it is important for teams to audit their models by comparing performance across gender, ethnicity, geography, and income groups.
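The sketch below shows one simple way such a "fusion network" might be wired together: features from an imaging backbone are concatenated with tabular inputs before the final prediction layers. The backbone, feature sizes, and number of tabular variables are illustrative assumptions.

```python
# Minimal sketch of a "fusion network": an imaging backbone's features are
# concatenated with tabular (structured) inputs before the final prediction layers.
# The backbone, feature sizes and tabular dimension are illustrative assumptions.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, image_backbone: nn.Module, image_feat_dim: int,
                 tabular_dim: int, num_classes: int = 2):
        super().__init__()
        self.backbone = image_backbone
        self.head = nn.Sequential(
            nn.Linear(image_feat_dim + tabular_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, image, tabular):
        feats = self.backbone(image)                    # (batch, image_feat_dim)
        return self.head(torch.cat([feats, tabular], dim=1))

# Toy example: a small CNN feature extractor plus 6 tabular variables.
cnn = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = FusionNet(cnn, image_feat_dim=8, tabular_dim=6)
logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 6))
```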
Machine learning systems may inadvertently learn to perpetuate discrimination against minorities and people of color, so it is critical to look for such bias early in the model development process. Trust in machine learning systems is critical for broader adoption, and exploring how and why a particular feature or variable drives predictions, through a combination of saliency maps and model-agnostic feature-importance estimates, helps build that trust.
Another approach is to constrain the machine learning algorithm within the training logic itself, ensuring that the optimization steps control for the demographic variables of interest. This is similar to a multivariable regression model, in which the influence of a risk factor of interest can be studied independently of baseline demographic variables. From a technical perspective, this involves adding a penalty term to the loss within the training loop, bearing in mind the potential trade-off of slightly lower model performance. For example, Fairlearn is a popular toolkit for assessing the fairness of traditional machine learning models, and a constrained-optimization implementation based on Fairlearn's algorithms (FairTorch) has been developed, a promising exploratory attempt to integrate bias adjustment during training. There are also many open-source toolkits that help researchers determine the relative importance of different variables and input streams (image-based predictions, as well as variables such as gender and ethnicity). These techniques may allow the development of fairer machine learning systems and even uncover hidden biases that were not anticipated.
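As a rough illustration of the penalized-loss idea, the sketch below adds a demographic-parity-style penalty to a standard classification loss so that the mean predicted probability of the positive class is similar across two groups. This is a simplified stand-in; toolkits such as Fairlearn and FairTorch implement more principled constrained optimization.

```python
# Minimal sketch of the penalized-loss idea: a demographic-parity-style penalty is
# added to the usual loss so that the mean predicted probability of the positive
# class is similar across two demographic groups. Simplified illustration only.
import torch
import torch.nn.functional as F

def fairness_penalized_loss(logits: torch.Tensor,
                            labels: torch.Tensor,
                            group: torch.Tensor,
                            weight: float = 1.0) -> torch.Tensor:
    """
    logits: (N, 2) model outputs; labels: (N,) class labels
    group:  (N,) binary demographic indicator (0 or 1)
    """
    base = F.cross_entropy(logits, labels)
    p_pos = torch.softmax(logits, dim=1)[:, 1]
    if (group == 1).any() and (group == 0).any():
        gap = p_pos[group == 1].mean() - p_pos[group == 0].mean()
    else:
        gap = torch.zeros((), device=logits.device)   # batch contains only one group
    return base + weight * gap.abs()

# Toy usage inside a training loop:
logits = torch.randn(16, 2, requires_grad=True)
labels = torch.randint(0, 2, (16,))
group = torch.randint(0, 2, (16,))
loss = fairness_penalized_loss(logits, labels, group)
loss.backward()
```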
6. Summary
While computational architectures and the acquisition of high-quality data are key to building good models, effort is still needed to develop translatable machine learning systems for high-dimensional imaging modalities that better represent the "video" nature of the data. There is also a need to build in capabilities that address bias, uncertainty, and interpretability in the early stages of model development. Skepticism about AI in medical imaging is healthy and, in most cases, well founded.
We hope that meaningful steps toward improving the delivery of AI can be taken by establishing the capabilities that allow researchers to assess clinical performance, integration into hospital workflows, interactions with clinicians, and downstream risks of socio-demographic harm. We hope researchers will find this perspective useful, as it outlines the potential challenges that await them on the path to clinical deployment and can serve as a guide to addressing some of these issues.