On February 24 (Beijing time), Dr. Shen Wanxiang, the author of this work, will give an online talk on how to improve learning on biomedical data by enhancing data representation and using convolutional neural networks. See the end of the article for details.
The recognition ability of an AI model is determined mainly by data and algorithms. The AI field has long focused on algorithms to improve performance, while data has been explored far less. The data-centric approach to AI builds AI systems on high-quality data, primarily by ensuring that data representations clearly expose the intrinsic features the AI needs to learn.
This matters especially for small-sample, high-dimensional, unordered data in the biomedical field (such as disease omics data), where an appropriate data representation may improve an AI model far more than changes to the model algorithm itself.
Researchers from the National University of Singapore and the Shenzhen International Graduate School of Tsinghua University collaborated to develop AggMap, an unsupervised method that structures unordered data into feature maps represented as 3D images for subsequent AI learning. This greatly improves the learning efficiency of the model, especially in the analysis of omics data.
The authors say the work provides a useful learning paradigm that could be applied to data in other fields in the future.
The work, titled "AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks", was published in the internationally renowned bioinformatics journal Nucleic Acids Research on January 31, 2022.
The input to a convolutional neural network (CNN) is 3D image data: the pixels are spatially correlated and topologically linked in two dimensions, and the multiple channels of an RGB color image enrich the information.
Compared with CNNs, other machine learning methods such as SVM, KNN, RF, and DNN do not take the correlation between pixels (feature points) into account at the model input, so they do not perform as well as CNNs on image data. If the pixels of an image are randomly rearranged into a "snowflake map", the learning efficiency of a CNN model drops sharply. This shows that high-performance CNN models rely on the local spatial correlation of pixels. CNNs also perform better on multi-channel data than on single-channel data.
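The dependence on local spatial correlation can be illustrated with a small NumPy experiment. The sketch below uses synthetic smoothed-noise images as a stand-in for real images (an assumption of this example; it does not use MNIST), scrambles the pixels with one fixed permutation to produce "snowflake maps", and compares the correlation between horizontally adjacent pixels before and after scrambling. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": white noise smoothed with a 3x3 box filter so that
# neighboring pixels are correlated (toy stand-in for real image data).
noise = rng.normal(size=(200, 16, 16))
images = sum(
    np.roll(noise, (dy, dx), axis=(1, 2))
    for dy in (-1, 0, 1)
    for dx in (-1, 0, 1)
) / 9.0


def neighbor_correlation(imgs):
    """Pearson correlation between each pixel and its right-hand neighbor."""
    left = imgs[:, :, :-1].ravel()
    right = imgs[:, :, 1:].ravel()
    return float(np.corrcoef(left, right)[0, 1])


# One fixed permutation of pixel positions, applied to every image.
perm = rng.permutation(16 * 16)
snowflake = images.reshape(200, -1)[:, perm].reshape(200, 16, 16)

corr_original = neighbor_correlation(images)
corr_shuffled = neighbor_correlation(snowflake)
print(f"adjacent-pixel correlation, original:  {corr_original:.2f}")
print(f"adjacent-pixel correlation, scrambled: {corr_shuffled:.2f}")
```

In this toy setup the adjacent-pixel correlation should be high for the smooth images and close to zero after scrambling, which leaves a convolution kernel with no local structure to exploit.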
Against this background, the authors developed AggMap, a data-structuring algorithm for high-dimensional, small-sample data. It converts unordered 1D vectors into structured, image-like 3D data that serve as input to the convolutional neural network AggMapNet.
To verify AggMap's ability to structure data, the authors first randomly permuted the pixels of the MNIST handwritten digit images, scrambling each image into a "snowflake map", and then applied AggMap to the snowflake maps.
Surprisingly, the unsupervised AggMap, relying only on the intrinsic correlation between feature points, can rearrange these random pixels into images that are highly consistent with the originals:
Figure 1: AggMap reconstructs randomly permuted MNIST pixels. Org1: original black-and-white image with 1 channel; OrgRP1: "snowflake map" obtained by scrambling the original image; RPAgg1: image produced by unsupervised AggMap from the scrambled snowflake map; RPAgg5: multi-channel color image with 5 channels, produced by AggMap through additional clustering.
What exactly is unsupervised AggMap?
Humans can logically restore broken pictures from their fragments, as in solving jigsaw puzzles or restoring artifacts (Figure 2A). This ability relies on prior knowledge, learned through many fragment recovery experiences, for joining fragments according to their relevance and edge connections. However, although people can repair objects from larger fragments, they cannot reconstruct an image whose pixels are randomly arranged (for example, going from image "a" to image "b" in Figure 2B), because the original information of the image was completely lost when "b" was scrambled into "a".
Nevertheless, the image can be rearranged from "a" to "c" based on the similarity of the pixels (feature points, FPs) in "a". The new image "c" is more structured than "a", and even its patterns, such as the flowers, trunks, and leaves, are very close to those of the original image.
AggMap aims to aggregate and map unordered 1D feature points into structured 3D feature maps (Fmaps) in a self-supervised manner, mimicking the human ability to assemble jigsaw puzzles. This structuring process arranges unordered feature points into structured patterns so that CNNs can learn from unordered data more efficiently.
Figure 2: Example diagram of the reconstruction and structuring process. (A) Restore broken fragments to images of objects with a specific pattern. (B) Reconstruct and structure randomly arranged images into original images and structured images, respectively.
The authors therefore call AggMap a "jigsaw puzzle solver" for feature points: it assembles a set of high-dimensional, unordered feature points (FPs), like puzzle pieces, into a feature map with a specific pattern, using the intrinsic correlation and topological connectivity of the feature points themselves. AggMap solves this feature point puzzle through self-supervision.
Figure 3: Feature point puzzle solver AggMap, where each pixel block is equivalent to one feature point.
Specifically, the self-supervised AggMap structures feature points in the following steps (as shown in Figure 4):
Measure the distance and topological connectivity of the feature points from their correlation distance matrix (the correlation between feature points is measured across samples).
Embed the feature points in 2D based on this distance matrix.
Assign the 2D-embedded feature points (a 2D scatter) to the points of a regular 2D lattice.
Cluster the feature points with a hierarchical clustering algorithm to enrich the input information. The number of clusters is a hyperparameter, and each cluster becomes a separate channel, so distributing the feature points across channels yields 3D structured data.
To sum up, the self-supervised AggMap uses the idea behind UMAP to structure unordered feature points by learning the intrinsic structure of the data. Its proxy task is to minimize the difference between two weighted topological graphs, one built in the input data space and one in the 2D embedding space. AggMap thus uses manifold learning and hierarchical clustering to expose the topology and hierarchy of the feature points and generate a structured feature map.
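The fitting steps can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration under stated assumptions, not the authors' implementation: classical MDS stands in for the UMAP embedding, the lattice assignment is solved as a linear assignment problem, average linkage is used for the hierarchical clustering, and the names `fit_layout` and `transform` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist, squareform


def fit_layout(X, n_channels=3):
    """Learn a grid cell and a channel for every feature (column) of X."""
    n_features = X.shape[1]

    # Step 1: correlation distance between features, measured across samples.
    dist = 1.0 - np.corrcoef(X, rowvar=False)
    dist = np.clip((dist + dist.T) / 2.0, 0.0, 2.0)
    np.fill_diagonal(dist, 0.0)

    # Step 2: 2D embedding. ASSUMPTION: classical MDS replaces UMAP here.
    centering = np.eye(n_features) - 1.0 / n_features
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:2]
    coords = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

    # Step 3: assign the 2D scatter to a regular lattice, one cell per feature.
    # ASSUMPTION: solved as a minimum-cost linear assignment.
    side = int(np.ceil(np.sqrt(n_features)))
    gy, gx = np.mgrid[0:side, 0:side]
    grid = np.column_stack([gy.ravel(), gx.ravel()]).astype(float)
    span = coords.max(axis=0) - coords.min(axis=0)
    span[span == 0] = 1.0
    scaled = (coords - coords.min(axis=0)) / span * (side - 1)
    feat_idx, cell_idx = linear_sum_assignment(cdist(scaled, grid, "sqeuclidean"))
    rows = np.empty(n_features, dtype=int)
    cols = np.empty(n_features, dtype=int)
    rows[feat_idx] = cell_idx // side
    cols[feat_idx] = cell_idx % side

    # Step 4: hierarchical clustering; each cluster becomes one channel.
    tree = linkage(squareform(dist, checks=False), method="average")
    channels = fcluster(tree, t=n_channels, criterion="maxclust") - 1
    return rows, cols, channels, side


def transform(X, rows, cols, channels, side, n_channels):
    """Scatter each sample's feature values into a (side, side, C) feature map."""
    fmaps = np.zeros((X.shape[0], side, side, n_channels))
    fmaps[:, rows, cols, channels] = X
    return fmaps


# Toy usage on random data (hypothetical: 100 samples, 16 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 16))
rows, cols, channels, side = fit_layout(X_demo, n_channels=3)
fmaps = transform(X_demo, rows, cols, channels, side, n_channels=3)
print(fmaps.shape)
```

Separating `fit_layout` from `transform` means the layout can be learned once on one set of samples and then reused to convert other samples with the same features.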
Figure 4: Flowchart of the self-supervised AggMap fitting process.
Unsupervised AggMap reconstructs randomly scrambled MNIST images
To test AggMap's ability to structure features, the authors randomly scrambled the pixels of the MNIST data to generate "snowflake maps", which completely lose the pattern of the original images. AggMap was then used to structure the data starting from these pixel-scrambled snowflake maps.
AggMap's data structuring is essentially the minimization of a cross-entropy (CE) loss, that is, of the difference between the weighted graphs D and F. The authors ran 500 iterations to optimize the layout of the weighted graph F. The videos below show the process.
(For example code, see: https://github.com/shenwanxiang/bidd-aggmap/blob/master/paper/example/01_mnist_restore/MNIST-AggMap.ipynb)
Video 1: The dynamic process of AggMap reconstructing a pixel-scrambled MNIST image.
Video 2: The dynamic process of AggMap reconstructing pixel-scrambled MNIST images (second example). This video shows the digits 0-9 changing dynamically during reconstruction.
As the number of iterations (epochs) increases, the generated Fmap becomes more structured and eventually settles into a stable pattern as the loss converges.
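The loss being minimized can be made concrete with the fuzzy-set cross-entropy that UMAP uses to compare two weighted graphs. The sketch below is a minimal NumPy version, assuming `d` and `f` hold the edge weights (membership strengths in [0, 1]) of the input-space graph D and the embedded graph F for the same list of edges. The function name and the toy weights are hypothetical.

```python
import numpy as np


def fuzzy_cross_entropy(d, f, eps=1e-12):
    """Cross-entropy between two weighted graphs given as edge-weight arrays."""
    d = np.clip(np.asarray(d, dtype=float), eps, 1.0 - eps)
    f = np.clip(np.asarray(f, dtype=float), eps, 1.0 - eps)
    attract = d * np.log(d / f)                        # penalizes missing edges
    repel = (1.0 - d) * np.log((1.0 - d) / (1.0 - f))  # penalizes spurious edges
    return float(np.sum(attract + repel))


# Toy edge weights: a layout that matches D exactly and one that inverts it.
d_weights = np.array([0.9, 0.1, 0.5])
f_matching = d_weights.copy()
f_inverted = np.array([0.1, 0.9, 0.5])

ce_matching = fuzzy_cross_entropy(d_weights, f_matching)
ce_inverted = fuzzy_cross_entropy(d_weights, f_inverted)
print(ce_matching, ce_inverted)
```

The loss is zero when the two graphs agree on every edge and grows as the embedded graph places strongly connected feature points far apart or weakly connected ones close together.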
According to the authors, AggMap can restore randomly arranged MNIST images because, although the pixels have been randomly arranged (scrambled), the manifold structure of the MNIST pixel features has not changed completely (that is, the topology can still be approximated from their pairwise correlations), and this manifold structure can be approximated with a low-dimensional weighted graph.
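One way to see why the pairwise correlations survive scrambling: if the same pixel permutation is applied to every sample (an assumption of this sketch), the features are only relabeled, so the feature-by-feature correlation matrix is unchanged up to that relabeling. The check below uses random toy data rather than MNIST, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 samples, 64 features (think of a flattened 8x8 image).
X = rng.normal(size=(500, 64))
X[:, 1:] += 0.5 * X[:, :-1]  # give neighboring features some correlation

# Scramble the feature order with one fixed permutation for all samples.
perm = rng.permutation(X.shape[1])
X_scrambled = X[:, perm]

corr = np.corrcoef(X, rowvar=False)
corr_scrambled = np.corrcoef(X_scrambled, rowvar=False)

# The scrambled correlation matrix is the original one with rows and
# columns reordered by the same permutation.
same_structure = np.allclose(corr_scrambled, corr[np.ix_(perm, perm)])
print(same_structure)
```

Scrambling therefore discards only the pixel coordinates. The distances between feature points, which are all AggMap needs for its embedding, are still available.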
Although AggMap can roughly restore a randomized MNIST image to the original, it cannot restore a randomized F-MNIST image. MNIST images are curve-like data in which the correlations between pixels are not discrete but fairly evenly distributed, consistent with UMAP's assumption that the data are uniformly distributed.
The authors compared the cross-entropy (CE) loss and the Pearson correlation coefficient (PCC) for MNIST and F-MNIST during the graph layout optimization phase of AggMap feature restructuring. MNIST has a lower loss and a higher PCC, indicating that its 2D embedding is more similar to the topology of the original data. The final 2D embedding of the MNIST FPs is also more evenly distributed than that of the F-MNIST FPs.
Thus, AggMap can reconstruct a randomized MNIST image in part because, although the MNIST pixels are randomly displaced, the manifold structure between the pixels does not change completely and can be approximated by a low-dimensional weighted graph. The randomized F-MNIST images were reorganized into more compact patterns, with some local patches reverting to the original patches. As a result, AggMap can restructure a randomized F-MNIST image into a highly structured form even though it cannot fully recover the original image.
Because the correlation between feature points is measured across samples, different sample sizes can yield different correlation distances between feature points. Very small samples may not measure the intrinsic correlation of feature points accurately. AggMap has separate fit and transform phases, which makes it possible to measure the distances between feature points (that is, to pre-fit) on a large number of unlabeled samples.
The authors pre-fitted AggMap on 1/1000 (60 samples), 1/100 (600 samples), 1/10 (6,000 samples), 1/5 (12,000 samples), 1/2 (30,000 samples), and all 60,000 samples of the pixel-scrambled MNIST training set (Figure 5), and the resulting degree of structuring differed. The larger the sample used for the unsupervised fit, the more structured the generated feature map, and the closer the reconstructed image is to the true digit.
Figure 5: MNIST images (RPAgg1) reconstructed by AggMap pre-fitted on different numbers of randomly arranged images.
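The effect of sample size on the measured correlations can be reproduced on synthetic data. The sketch below is a toy illustration and not the MNIST experiment: it assumes a hypothetical one-factor model with a known correlation matrix and measures how far the estimated correlations are from the truth when 60, 600, or 6,000 samples are used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

# Hypothetical ground truth: every feature loads on one shared latent factor,
# so the true correlation between features i and j is loadings[i] * loadings[j].
loadings = rng.uniform(0.2, 0.9, size=n_features)
true_corr = np.outer(loadings, loadings)
np.fill_diagonal(true_corr, 1.0)


def draw_samples(n_samples):
    """Draw unit-variance features that follow the one-factor model."""
    factor = rng.normal(size=(n_samples, 1))
    noise = rng.normal(size=(n_samples, n_features))
    return factor * loadings + noise * np.sqrt(1.0 - loadings ** 2)


def correlation_error(n_samples):
    """Mean absolute error of the estimated feature correlation matrix."""
    estimated = np.corrcoef(draw_samples(n_samples), rowvar=False)
    return float(np.abs(estimated - true_corr).mean())


errors = {n: correlation_error(n) for n in (60, 600, 6000)}
for n, err in errors.items():
    print(f"{n:>5} samples: mean |error| = {err:.4f}")
```

The estimation error shrinks as more samples are used, which is consistent with fitting the feature distances on as many unlabeled samples as are available before transforming a small labeled set.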
Advantages of cluster-based multichannels
AggMap's data structuring focuses on two aspects: spatial correlation in 2D and multiple channels. To further improve the efficiency of the CNN-based model AggMapNet on unordered data, the authors use a cluster-based multichannel generation strategy.
The strategy clusters the feature points into C clusters, each of which is assigned to a separate channel.
In contrast to a single-channel map, a multichannel feature map is a non-stacked representation of the data. The larger C is, the more finely the feature points are separated, and the authors' experiments show that this significantly improves the CNN model AggMapNet.
Multichannel representation has clear advantages. As shown in Figure 6 for a cell cycle dataset (5 samples, each corresponding to one cell cycle, with expression levels for 5,162 genes that differ across samples), clustering into multiple channels (increasing the number of channels) makes it easy to pick out the genes specific to each cycle.
For high-dimensional features, this amounts to selective feature learning: it avoids the separate feature selection step of traditional methods and enables automated, multi-level learning.
Figure 6: AggMap restructuring of the cell cycle dataset CCTD-U.
The authors further tested the effect of multiple channels on model performance on different datasets, as shown in Figure 7. Compared with single-channel input (C=1), multi-channel input significantly improves model performance. More channels generally help, but too many channels can cause overfitting, so the number of channels is a hyperparameter to tune. Overall, the improvement from multiple channels is very significant.
Figure 7: The performance impact of the number of channels on the AggMapNet model.
Interpretability and application of AggMapNet
To enhance the interpretability of the CNN-based AggMapNet model, the authors also integrated two model-agnostic feature attribution methods: Shapley-explainer, based on kernel Shapley values, and Simply-explainer, based on simple feature substitution.
Although the kernel Shapley method rests on a solid game-theoretic foundation and is widely used for model-agnostic computation of feature importance, the authors point out two main problems in measuring feature importance with it. First, the computational complexity of global feature importance is exponential, so using Shapley-explainer on high-dimensional data is very time-consuming. Second, because the kernel Shapley method measures each feature's contribution to the model's predicted values (rather than to the true values), it may not fully capture the global relationship between features and the real outcomes.
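The exponential cost is visible in the exact Shapley formula, which averages a feature's marginal contribution over every subset of the other features. The sketch below computes exact Shapley values for a hypothetical three-feature value function `toy_value`. It is neither the kernel approximation used by Shapley-explainer nor code from the paper. For n features it performs n * 2^(n-1) marginal evaluations.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values by enumerating all subsets of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = value_fn(coalition | {i}) - value_fn(coalition)
                phi[i] += weight * gain
    return phi


# Hypothetical value function: additive effects plus an interaction between
# features 0 and 1 that only pays off when both are present.
WEIGHTS = [1.0, 2.0, 3.0]


def toy_value(coalition):
    bonus = 2.0 if {0, 1} <= coalition else 0.0
    return sum(WEIGHTS[i] for i in coalition) + bonus


shapley_values = exact_shapley(toy_value, 3)
print(shapley_values)
```

The interaction bonus is split equally between features 0 and 1, and the values sum to the difference between the full coalition and the empty one. With thousands of omics features this enumeration is infeasible, which is why sampling-based approximations or cheaper alternatives are needed.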
The authors developed Simply-explainer as an additional approach to interpreting AggMapNet models. It is designed to compute the global feature importance of high-dimensional omics features more quickly and to take into account the relationship between features and the true labels.
Figure 8: Simply-Explainer in AggMapNet calculates feature importance.
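The substitution idea can be sketched as follows: replace one feature at a time with a background value and record how much the loss against the true labels increases. This is a minimal illustration under stated assumptions, not the authors' Simply-explainer code. The column minimum as background value, the mean squared error loss, and the names `replacement_importance` and `toy_model` are all choices made for this example.

```python
import numpy as np


def replacement_importance(predict_fn, X, y, background=None):
    """Increase in MSE against the true labels when each feature is replaced."""
    X = np.asarray(X, dtype=float)
    if background is None:
        background = X.min(axis=0)  # ASSUMPTION: column minimum as background
    base_loss = np.mean((predict_fn(X) - y) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_replaced = X.copy()
        X_replaced[:, j] = background[j]  # substitute one feature at a time
        replaced_loss = np.mean((predict_fn(X_replaced) - y) ** 2)
        importance[j] = replaced_loss - base_loss
    return importance


def toy_model(X):
    """Hypothetical fitted model: uses features 0 and 1, ignores feature 2."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = toy_model(X_toy)  # labels the toy model predicts perfectly

importance = replacement_importance(toy_model, X_toy, y_toy)
print(importance)
```

This needs only one extra pass over the data per feature, so the cost grows linearly with the number of features, and the scores are tied to the true labels rather than only to the model's own predictions.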
The authors compared the explanations produced by Shapley-explainer and Simply-explainer. In the local interpretation of the MNIST recognition model, Simply-explainer shows higher PCC and SSIM values than Shapley-explainer.
In addition, in the global interpretation of the breast cancer diagnostic model, the global feature importance (GFI) values calculated by the two explainers are highly correlated. However, Simply-explainer has much lower computational complexity than Shapley-explainer, and its feature importance scores tend to be more dispersed, suggesting that Simply-explainer can be a competitive method for identifying key biomarkers.
The authors further used Simply-explainer, the interpretability module of AggMapNet, to identify key metabolites and proteins for COVID-19 severity prediction from high-dimensional proteome and metabolome data. These key metabolites and proteins are highly consistent with the literature.
These results suggest that Simply-explainer may be a better choice for revealing important features. It separates feature importance scores clearly and computes them very quickly, which makes it particularly well suited to revealing key biomarkers in high-dimensional datasets.
Figure 9: Comparison of the interpretation effects of Shapley-explainer and Simply-explainer in the MNIST recognition model in AggMapNet.
Discussion and conclusions
The main idea of the paper is to structure data with an unsupervised method and then learn from the structured data with a convolutional neural network. Together, the unsupervised AggMap and the supervised AggMapNet provide a learning pipeline for high-dimensional, unordered data.
In the unsupervised structuring step, optimizing for "local spatial correlation" and "multiple channels" significantly improves model performance, indicating that an appropriate data representation plays a major role in model learning.
AggMap can also be used in a transfer-learning fashion: the correlations between feature points are precomputed on a large number of unlabeled samples, and the fitted AggMap is then applied to a small labeled dataset to generate structured feature maps and improve the learning efficiency of the model.
This method is well suited to learning from high-dimensional, small-sample tabular data (in which each row is a sample and each column is a feature). AggMap/AggMapNet provides a useful learning paradigm that could be applied to data in other fields in the future.
Paper link: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac010/6517966
Code link: https://github.com/shenwanxiang/bidd-aggmap
On February 24 (Beijing time), in the latest Heart of the Machine online sharing session, Dr. Shen Wanxiang, the author of this work, will explain how to improve learning on biomedical data by enhancing data representation and using convolutional neural networks.
Talk topic: Data Representation and Convolutional Neural Networks in Omics Machine Learning
Guest speaker: Dr. Shen Wanxiang is currently studying at the School of Pharmacy of the National University of Singapore (NUS). He is about to graduate with a Ph.D. and join the NUS Department of Chemistry to work on AI-assisted drug design and synthesis. He has worked as a data scientist and visual algorithm researcher at AI organizations such as the Tsinghua Data Innovation Base D-lab and the Beijing Megvii Science and Technology Research Institute. His main research interests include machine-learning-based omics data mining and drug discovery, drug design, and the development of learning algorithms for biomedical data. He has published a number of interdisciplinary papers in international journals such as Nature Machine Intelligence and Nucleic Acids Research.
Time: 19:00-20:00 on February 24, Beijing time