Editor | Radish Peel
Deep learning (DL) has recently made unprecedented progress in a major computational biology challenge: the half-century-old problem of predicting protein structure.
In this review, researchers at Rice University discuss the latest advances, limitations, and future prospects of deep learning in five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference.
For each application area, they cover the main bottlenecks of DL approaches, such as the availability of training data, the scope of the problem, and the ability to leverage existing DL architectures in a new setting. Finally, they summarize the specific themes and general challenges that DL faces across the biosciences.
Titled "Current progress and open challenges for applying deep learning across the biosciences," the review was published in Nature Communications on April 1, 2022.

AlphaFold2's recent success in predicting the 3D structure of proteins from their sequences highlights one of the most effective applications of deep learning in computational biology to date. Deep learning (DL) allows complex models consisting of multiple layers of nonlinear computational units to learn representations of data at multiple levels of abstraction (Figure 1). DL's success across a wide range of applications has shown that its efficacy depends on developing specialized neural network architectures that capture important properties of the data, such as spatial locality (convolutional neural networks, CNNs), sequential structure (recurrent neural networks, RNNs), context dependencies (Transformers), and data distribution (autoencoders, AEs).
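To make the "spatial locality" idea behind CNNs concrete, here is a minimal sketch (not from the review) of a 1D convolution, the core operation CNNs use: each output value depends only on a small local window of the input, not on the whole signal.

```python
# Minimal sketch of a 1D convolution -- the operation CNNs use to
# exploit spatial locality. Kernel and signal values are toy data.

def conv1d(signal, kernel):
    """Slide `kernel` over `signal` ("valid" mode) and return dot products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# A difference kernel responds only where adjacent values change,
# i.e. each output depends on a local window of the input.
signal = [0, 0, 1, 1, 1, 0]
edges = conv1d(signal, [-1, 1])
print(edges)  # [0, 1, 0, 0, -1]
```

A trained CNN learns many such kernels from data rather than using hand-set weights, but the locality property illustrated here is the same.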
Figure 1 shows the six deep learning architectures most widely used in computational biology. This review focuses primarily on computational biology applications; for a fuller treatment of DL methods and architectures, the researchers recommend the paper by LeCun's team.
LeCun Team Paper: https://www.nature.com/articles/nature14539
These DL models have revolutionized speech recognition, visual object recognition, and object detection, and have recently played a key role in solving important problems in computational biology. Applications of deep learning in some areas of computational biology, such as functional biology, are growing, while in others, such as phylogenetics, they are still in their infancy.
Given the large differences in the acceptance of DL in different areas of computational biology, some key questions remain unanswered:
(1) What makes a domain a prime candidate for DL methods?
(2) What are the potential limitations of DL in computational biology applications?
(3) Which DL model is best suited for a particular application area of computational biology?
Figure 1: Overview of machine learning scenarios and common DL architectures.
In this review, the researchers aimed to address these fundamental questions from the perspective of computational biology. The answers, however, are highly task-specific and can only be given in the context of the corresponding application. Whalen's team has discussed the pitfalls of applying machine learning (ML) in genomics; the goal of this review, by contrast, is to provide insights into the impact of DL across five different areas. While DL has had significant success throughout the biosciences (e.g., DeepVariant, DeepARG, metagenomic binning, and lab-of-origin prediction), the review deliberately focuses on a few diverse and broad subtopics.
The researchers evaluated how DL improves on classical ML techniques across areas of computational biology that have seen varying degrees of success to date (Figure 2).
For each area, the limitations and opportunities for improvement of current approaches are explored and practical tips are included. They discussed five broad and distinct areas of computational biology: protein structure prediction, protein function prediction, genomic engineering, systems biology and data integration, and phylogenetic inference (Table 1).
These areas span a range of impact levels, from major paradigm shifts (AlphaFold2) to nascent DL applications (phylogenetic inference); together, they offer ample technical diversity for addressing the questions raised in this review.
The researchers ranked progress in each area into four tiers:
(i) paradigm shift (DL significantly outperforms other ML and classical methods and has broad implications);
(ii) significant success (DL performance is generally higher than that of other ML and classical methods);
(iii) moderate success (DL performance is generally comparable to other ML and classical methods);
(iv) minor success (DL methods are not widely adopted or perform poorly compared to other ML and classical methods).
Finally, common challenges facing DL in the biological sciences are discussed.
Figure 2: A summary view of the labeled and unlabeled datasets, and the architectures used in deep learning methods for computational biology.
Paradigm-shift successes of DL
Protein structure prediction
Protein structure prediction is arguably one of the most successful applications of deep learning in computational biology; this success is a paradigm shift. It is well known that the amino acid sequence of a protein determines its 3D structure, which in turn is directly related to its function (e.g., chemical reaction catalysis, signal transduction, scaffolding, etc.).
The history of the protein structure prediction problem dates back to John Kendrew's determination of the 3D structure of myoglobin in the 1950s, a milestone in biochemistry and structural biology. Since then, X-ray crystallography has become the gold-standard experimental method for protein structure determination, as well as a reference for validating computational models of protein structure prediction.
Given the high cost and technical limitations of X-ray crystallography, as well as the growing abundance of biological sequences following the Human Genome Project, predicting the 3D structure of proteins from their sequences has become the Everest of computational biology, a challenge widely referred to as the "protein folding problem." Initial efforts focused on biophysically accurate energy functions and knowledge-based statistical reasoning, but more recent progress has come from a greater focus on deep learning.
One of the key reasons for DL's recent success in this field is the large amount of unsupervised data in the form of multiple sequence alignments (MSAs), which make it possible to learn nonlinear representations of the evolutionary information encoded in proteins.
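As a toy illustration of the evolutionary signal an MSA carries, the sketch below (hypothetical, not from the review) computes per-column residue frequencies from a tiny alignment; conserved columns hint at functional or structural constraints, and DL models learn far richer, nonlinear versions of this signal.

```python
# Hypothetical sketch: per-column residue frequencies from a multiple
# sequence alignment (MSA). The alignment here is toy data.
from collections import Counter

def column_profile(msa):
    """Return, for each alignment column, a residue -> frequency map."""
    profiles = []
    for col in zip(*msa):  # iterate over columns of the alignment
        counts = Counter(r for r in col if r != "-")  # ignore gap characters
        total = sum(counts.values())
        profiles.append({r: c / total for r, c in counts.items()})
    return profiles

msa = ["MKT-", "MRT-", "MKSA"]
profiles = column_profile(msa)
print(profiles[0])  # column 0 is fully conserved: {'M': 1.0}
print(profiles[1])  # roughly two-thirds K, one-third R
```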
AlphaFold2's impact on the field of structural biology is undeniable; it successfully demonstrates the use of DL-based implementations for high-precision protein structure prediction. As highlighted by numerous early citations, this achievement is already driving and accelerating further developments in the field.
In addition, DeepMind has partnered with the European Molecular Biology Laboratory (EMBL) to create an open database of protein structures predicted by AlphaFold2. The database already covers 98.5% of human proteins, with at least 36% of amino acid residues predicted at high confidence.
Finally, DL-based approaches do not render experimental methods obsolete; rather, they can improve their accuracy and scope, as demonstrated by initial applications that solve challenging structures from X-ray crystallography and cryo-EM data. However, many caveats, restrictions, and open issues remain. In particular, while AlphaFold2 succeeds at predicting the static structure of proteins, many key insights into protein biological function come from their dynamic conformations. The dynamics of multi-protein interactions also remain an open challenge in the field. Going forward, it will be important to monitor the application of deep learning in these follow-on research areas.
Significant successes of DL
Protein function prediction
Predicting protein function is the natural next step after protein structure prediction. Protein function prediction involves mapping a protein of interest to a curated ontology, such as Gene Ontology (GO) terms spanning the biological process (BP), molecular function (MF), and cellular component (CC) categories.
Protein structure can convey a lot of information about these ontologies, but there is no direct mapping between the two, and the mapping is often very complex.
Although the number of protein sequences in the UniProtKB database has grown dramatically, the functional annotations of the vast majority of proteins remain partially or completely unknown. Limited and unbalanced training examples, a large output space of possible functions, and the hierarchical nature of GO labels are among the major bottlenecks of protein function annotation.
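The hierarchical nature of GO labels mentioned above can be made concrete with a small sketch. Under the assumption that GO terms form a directed acyclic graph and that an annotation implies all its ancestors (the "true-path rule"), a predictor's labels must be expanded upward through the hierarchy; the edges below are a simplified toy fragment, not the real ontology.

```python
# Sketch under assumption: GO annotations propagate to all ancestor
# terms (the "true-path rule"). The child -> parents edges below are a
# simplified toy fragment of the ontology.
PARENTS = {
    "GO:0003924": ["GO:0017111"],  # GTPase activity -> NTPase activity
    "GO:0017111": ["GO:0016787"],  # NTPase activity -> hydrolase activity
    "GO:0016787": ["GO:0003824"],  # hydrolase -> catalytic activity
    "GO:0003824": [],              # root-level term
}

def propagate(terms):
    """Expand a set of predicted GO terms with all of their ancestors."""
    closed, stack = set(), list(terms)
    while stack:
        t = stack.pop()
        if t not in closed:
            closed.add(t)
            stack.extend(PARENTS.get(t, []))
    return closed

print(sorted(propagate({"GO:0003924"})))  # all four terms in the chain
```

This expansion is one reason the output space is so large: predicting one specific term commits the model to every more general term above it.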
To overcome some of these problems, recent methods leverage features from different sources, including sequences, structures, interaction networks, scientific literature, homology, and domain information, often combining one or more DL architectures to handle different stages of the prediction task (such as feature representation, feature selection, and classification).
One of the most successful deep learning methods for this problem, DeepGO, uses a CNN to learn sequence-level embeddings and combines them with knowledge-graph embeddings of each protein obtained from protein-protein interaction (PPI) networks. DeepGO was one of the first DL-based models to outperform BLAST and earlier methods on functional annotation tasks across the three GO categories.
DeepGOPlus, an improved version of the tool, was among the best performers across the three GO categories in the CAFA3 challenge. DeepGOPlus uses convolutional filters of different sizes, each followed by separate max pooling, to learn dense feature representations of protein sequences encoded with a one-hot scheme. Studies have shown that combining the CNN's output with DIAMOND's homology-based predictions further improves prediction accuracy.
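The encoding-and-pooling step described above can be sketched in a few lines. This is a hedged illustration of the general idea (one-hot encoding, a convolutional filter, global max pooling), not DeepGOPlus itself; the filter weights are toy values chosen to fire on a specific motif.

```python
# Hedged sketch of the one-hot + convolution + global-max-pool idea
# used by DeepGOPlus-style models. Filter weights are toy values.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode a protein sequence as a list of 20-dim indicator vectors."""
    return [[1.0 if i == AA_INDEX[a] else 0.0 for i in range(20)] for a in seq]

def conv_max_pool(encoded, filt):
    """Apply one conv filter (shape k x 20) and return its max activation."""
    k = len(filt)
    scores = [
        sum(encoded[i + j][d] * filt[j][d] for j in range(k) for d in range(20))
        for i in range(len(encoded) - k + 1)
    ]
    return max(scores)

# A toy filter of width 2 that fires on the dipeptide motif "KR".
filt = [[0.0] * 20 for _ in range(2)]
filt[0][AA_INDEX["K"]] = 1.0
filt[1][AA_INDEX["R"]] = 1.0

enc = one_hot("MKRAG")
print(conv_max_pool(enc, filt))  # 2.0 -- the "KR" window matches both positions
```

A real model learns thousands of such filters of several widths and feeds the pooled activations into a classifier; the max pool makes the representation independent of where in the sequence the motif occurs.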
Unsupervised methods such as denoising autoencoders (DAEs) also help learn dense, robust, low-dimensional representations of proteins. Chicco's team developed a DAE-based protein representation for assigning missing GO annotations and showed a 6% to 36% improvement over non-DL methods on six different GO datasets. Miranda and Hu introduced stacked denoising autoencoders to learn more robust protein representations. Gligorijević's team described deepNF, which uses multimodal DAEs (MDAs) to extract features from multiple heterogeneous interaction networks and outperforms methods based on matrix factorization and linear regression. Methods for learning low-dimensional protein embeddings continue to evolve.
In addition to predicting Gene Ontology labels, studies have focused on several other task-specific functional categories, such as identifying specific enzyme functions and potential post-translational modification sites. These studies are a fundamental step toward engineering new proteins with specialized functions or modifying the efficacy of existing ones, as shown by DL's recent advances in enzyme engineering. Going forward, applying deep learning to engineer proteins tailored to specific functions could help improve the throughput of candidate proteins in drug discovery and other areas.
Beyond these specific architectures, function can also be classified using combinations of the above methods. Overall, prior results suggest that models integrating features from multimodal data types are more likely to succeed than models relying on a single data type.
Trends in the literature suggest that task-specific architectures can greatly enhance the feature representations of the respective data types. Future work in this direction is likely to combine DAEs and RNNs for sequence-based representations with graph convolutional networks (GCNs) for structure-based and PPI-based information. Combining these representations in hierarchical classifiers, such as multi-task DNNs with biologically informed regularization, could provide an interpretable and computationally feasible DL architecture for protein function prediction.
Genome Engineering
Biomedical engineering, particularly genome engineering, is an important area in biology, where DL models have been increasingly adopted.
The future of DL in this space is geared toward newer editing technologies such as CRISPR-Cas12a (Cpf1), base editors, and prime editors. Although these methods do not introduce double-strand breaks (DSBs), their efficiency is still being improved; indeed, DL has shown promise in predicting the activity of the adenine base editor (ABE), the cytosine base editor (CBE), and prime editor 2 (PE2) in human cells.
However, the challenge ahead lies in understanding these models. CRISPRLand is a recent framework that takes a first step toward interpreting and visualizing DL models in terms of higher-order interactions. Beyond interpretability, the researchers anticipate that methods providing uncertainty estimates for predicted outcomes will become more common in genome editing.
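One common way to obtain such uncertainty estimates is an ensemble: train several models and report the spread of their predictions alongside the mean. The sketch below is illustrative only; the "models" are stand-in functions rather than real editing-efficiency predictors.

```python
# Illustrative sketch (not from the review) of ensemble-based
# uncertainty: the spread across ensemble members serves as an
# uncertainty estimate. Models here are stand-in functions.
import statistics

def ensemble_predict(models, x):
    """Return (mean prediction, standard deviation across members)."""
    preds = [m(x) for m in models]
    return statistics.mean(preds), statistics.stdev(preds)

# Three stand-in "models" that roughly agree on this input.
models = [lambda x: 0.70, lambda x: 0.72, lambda x: 0.74]
mean, spread = ensemble_predict(models, x="ACGTACGT")
print(round(mean, 2), round(spread, 3))  # 0.72 0.02
```

A large spread flags inputs where the prediction should not be trusted, which is exactly the signal experimentalists need before committing to an edit.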
In addition, because cell type significantly affects the efficiency of CRISPR experiments, it is critical to account for distribution shift when deploying DL models in genome engineering. Integrating domain adaptation approaches to limit the impact of this shift is another important future direction.
Moderate successes of DL
Systems biology and data integration
Systems biology models complex biological processes from a holistic perspective to ultimately unravel the link between genotype and phenotype. The integration of disparate omics data is at the heart of bridging this gap, enabling powerful predictive models that have led to several recent breakthroughs, from basic biology to precision medicine.
Minor successes of DL
Phylogenetics
A phylogeny is an evolutionary tree that models the evolutionary history of a group of taxa. The phylogenetic inference problem involves constructing a phylogeny from data obtained from the taxa under study, usually molecular sequences.
Figure 3: Standard and deep learning methods for phylogenetic inference.
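To ground what the classical side of this comparison looks like, here is a minimal sketch of a distance-based method (UPGMA), one of the standard non-DL approaches to building a tree: repeatedly merge the two closest clusters of taxa. The pairwise distances are toy values, and real tools also estimate branch lengths, which this sketch omits.

```python
# Minimal sketch of UPGMA, a classical distance-based method for
# phylogenetic inference. Distances are toy values; branch lengths
# are omitted, and the tree is returned as nested tuples.

def upgma(taxa, dist):
    """dist maps frozenset({a, b}) -> distance; returns a nested-tuple tree."""
    clusters = {t: t for t in taxa}  # cluster name -> subtree built so far
    sizes = {t: 1 for t in taxa}
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        pair = min(
            (frozenset({a, b}) for a in clusters for b in clusters if a != b),
            key=lambda p: dist[p],
        )
        a, b = sorted(pair)
        merged = (clusters[a], clusters[b])
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            if c not in (a, b):
                d = (dist[frozenset({a, c})] * sizes[a]
                     + dist[frozenset({b, c})] * sizes[b]) / (sizes[a] + sizes[b])
                dist[frozenset({a + b, c})] = d
        for x in (a, b):
            del clusters[x]
        clusters[a + b] = merged
        sizes[a + b] = sizes[a] + sizes[b]
    return next(iter(clusters.values()))

# Toy pairwise distances between four taxa: A/B close, C/D close.
taxa = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 2, frozenset("AC"): 8, frozenset("AD"): 8,
    frozenset("BC"): 8, frozenset("BD"): 8, frozenset("CD"): 4,
}
print(upgma(taxa, dist))  # (('A', 'B'), ('C', 'D'))
```

Methods like this (and likelihood-based ones) operate on explicit evolutionary assumptions, which is part of why end-to-end DL replacements are hard to validate.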
Although DL has made notable initial strides in phylogenetics, given the challenges it is difficult to envision an end-to-end deep learning model that estimates phylogenetic trees directly from raw data in the near future. If one is developed, given its reliance on (possibly simulated) training data, its suitability for real biological sequences will need to be carefully verified before it can replace traditional phylogenetic methods.
General challenges of DL in the biological sciences
Not all applications of deep learning in computational biology have been equally successful. While DL has seen major success in areas such as protein structure prediction and genome editing, it faces significant hurdles in others, such as phylogenetic inference. The most common problems DL methods face stem from a lack of annotated data, the absence of ground truth in non-simulated datasets, serious mismatches between the training-data distribution and the real-world test distribution, difficulties in benchmarking and interpreting outcomes, and biases and ethical issues in datasets and models. In addition, as data and deep learning models grow, training efficiency has become a major bottleneck to progress.
Specifically, DL's success in different subfields of computational biology depends heavily on the availability and diversity of standardized supervised and unsupervised datasets, ML benchmarks with clear biological meaning, the computational nature of the problem, and the software engineering infrastructure for training DL models. The remaining challenges for DL in computational biology involve improving model interpretability, extracting actionable and humanly understandable insights, increasing efficiency and limiting training costs, and mitigating the growing ethical concerns around DL models; innovative solutions are emerging from the deep learning and computational biology communities.
Table 1: Common challenges faced when applying DL in computational biology, and potential solutions.
The review focused on two key areas for improvement: (i) interpretability and (ii) training efficiency.
Conclusion
All in all, while DL's success in areas such as protein structure prediction amounts to a paradigm shift, performance in other areas, such as function prediction, genome engineering, and multi-omics, is also improving rapidly relative to traditional methods. In still other fields, such as phylogenetics, classical computational methods currently retain the upper hand. Further DL advances applied to challenges in the biosciences will increasingly incorporate domain-specific biological knowledge while striving to improve interpretability and efficiency.
"ScienceAI" focuses on the intersection and integration of artificial intelligence with other cutting-edge technologies and basic sciences.