Edit | Radish peel
With the advent of high-performance computing (HPC), computational biology has become a rapidly innovating and fast-maturing scientific discipline. In recent years, machine learning has also benefited greatly from the practical application of HPC.
Researchers used Oak Ridge National Laboratory's (ORNL) Summit supercomputer, together with tools developed by Google's DeepMind and Georgia Tech, to speed up the accurate identification of protein structure and function across an organism's entire genome. The team recently published details of its high-performance computing toolkit and its deployment on Summit.
They propose a new HPC protocol that combines several machine learning methods for structure-based functional annotation of proteins on a genome-wide scale.
The protocol makes extensive use of deep learning and offers computational insights into best practices for training advanced deep learning models on high-throughput data such as proteomics data. The researchers demonstrated the methods the protocol currently supports and detailed the tasks it will take on in the future, including large-scale sequence comparisons using SAdLSA and prediction of protein tertiary structure using AlphaFold2.
Titled "High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function," the study was presented on November 15, 2021 at the 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) and was added to IEEE Xplore on December 27, 2021.
These powerful computational tools are a major leap toward solving a grand challenge in biology: translating the genetic code into meaningful functions. Proteins are a key component in solving this challenge. They are also at the heart of many scientific questions about the health of people, ecosystems and the planet. As the primary building blocks of cells, proteins drive almost every process necessary for life, from metabolism to immune defense to communication between cells.
"Structure determines function" is a motto in the field of protein research; the complex 3D shapes of proteins guide how they interact with other proteins to do the work of the cell.
Working out the structure and function of proteins from the long strings of nucleotides (A, C, T, and G) that make up DNA has long been a bottleneck in the life sciences, as researchers have relied on educated guesses and painstaking experiments to validate structures.
"We're now dealing with amounts of data comparable to what astrophysicists deal with, all because of the genome sequencing revolution," said Ada Sedova, a researcher at ORNL. "We want to be able to use high-performance computing to take that sequencing data and come up with useful inferences that narrow the field for experiments. We want to quickly answer questions such as: What does this protein do, and how does it affect the cell? How can we harness proteins to achieve goals such as making the chemicals, drugs and sustainable fuels we need, or engineering organisms that help mitigate the effects of climate change?"

Figure: Overview of SAdLSA, a deep learning algorithm for protein sequence alignment. (Source: paper)
The research team focused on organisms that are critical to the DOE mission. They modeled the complete proteomes (all the proteins encoded in an organism's genome) of four microbes, each with approximately 5,000 proteins. Two of these microbes have been found to produce important materials for making plastics. The other two are known to break down and transform metals. The structural data can inform new advances in synthetic biology and strategies to reduce the spread of pollutants such as mercury in the environment.
The team also generated models of the 24,000 proteins at work in peat moss. Peat moss plays a key role in storing large amounts of carbon in peat bogs, which hold more carbon than all the world's forests combined. The data could help scientists determine which genes matter most for enhancing peat moss's ability to absorb carbon and withstand climate change.
Accelerate scientific discoveries
To find genes that allow peat moss to tolerate elevated temperatures, ORNL scientists first compared its DNA sequences with those of the model organism Arabidopsis thaliana, a thoroughly studied plant species in the mustard family.
"Peat moss is separated from this model organism by about 515 million years of evolution," said Bryan Piatkowski, a Liane B. Russell Fellow at ORNL. "Even for plants more closely related to Arabidopsis, we don't have a lot of empirical evidence on how these proteins behave. There is only so much function we can infer by comparing nucleotide sequences with the model."
Being able to see protein structures adds another layer of information that could help scientists find the most promising gene candidates for experiments.
Piatkowski, for example, has been studying moss populations from Maine to Florida with the goal of identifying differences in their genes that may be adaptive to climate. He has a long list of genes that may regulate heat tolerance. Some gene sequences differ by only one nucleotide or, in the language of the genetic code, by only one letter.
"These protein structures will help us find out whether these nucleotide changes cause changes in protein function and, if so, how. Will these protein changes ultimately help plants survive extreme temperatures?" Piatkowski said.
Looking for sequence similarities to determine function is only part of the challenge. DNA sequences are translated into the amino acids that make up proteins. Through evolution, some sequences mutate over time, replacing one amino acid with another that has similar properties. These changes don't always lead to functional differences.
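The single-letter differences and similar-property substitutions described above can be illustrated with a minimal Python sketch. It uses the standard genetic code to translate short DNA sequences and labels a variant as synonymous, conservative, or non-conservative. The example sequences, the function names, and the coarse property grouping of amino acids are all assumptions made for illustration; they do not come from the study, and real analyses use much richer substitution models.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Coarse, illustrative grouping of amino acids by side-chain chemistry (an assumption).
PROPERTY_GROUPS = {
    "acidic": "DE",
    "basic": "KRH",
    "polar": "STNQCY",
    "nonpolar": "AVLIMFWPG",
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acids ('*' marks a stop)."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def group_of(amino_acid):
    """Return the property group of an amino acid, or 'stop' for '*'."""
    for name, members in PROPERTY_GROUPS.items():
        if amino_acid in members:
            return name
    return "stop"


def classify_variant(reference, variant):
    """Label a same-length variant by its effect on the encoded protein."""
    changes = [
        (ref_aa, var_aa)
        for ref_aa, var_aa in zip(translate(reference), translate(variant))
        if ref_aa != var_aa
    ]
    if not changes:
        return "synonymous"
    if all(group_of(ref_aa) == group_of(var_aa) for ref_aa, var_aa in changes):
        return "conservative"
    return "non-conservative"


reference = "ATGGACGCC"  # hypothetical sequence encoding Met-Asp-Ala
for variant in ("ATGGACGCT", "ATGGAGGCC", "ATGGTCGCC"):
    print(variant, translate(variant), classify_variant(reference, variant))
```

Each variant differs from the reference by one nucleotide, yet the three have different consequences for the protein. This is why sequence comparison alone leaves open whether function has changed.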
Until recently, scientists did not have the tools to reliably predict the structure of proteins based on genetic sequences. Applying these new deep learning tools is a game-changer.
While the structure and function of proteins still need to be confirmed by physical experiments and methods such as X-ray crystallography, deep learning is changing the paradigm by rapidly narrowing the vast field of candidate genes to the few most interesting ones for further study.
Revolutionary tool
One tool in the deep learning protocol is called Sequence Alignments from deep-Learning of Structural Alignments, or SAdLSA; it is trained in a similar way to other deep learning models that predict protein structure. Because it implicitly understands protein structure, SAdLSA can compare sequences even if they are only 10% similar.
"SAdLSA can detect distantly related proteins that may or may not have the same function," said Jerry Parks, ORNL computational chemist and group leader. "Combine that with AlphaFold, which provides a 3D structural model of the protein, and you can analyze the active site to determine which amino acids are doing the chemistry and how they contribute to that function."
Figure: Plans for large-scale deployment of SAdLSA. (Source: paper)
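To make the "10% similar" figure quoted for SAdLSA more concrete, here is a minimal Python sketch of percent sequence identity computed over an existing pairwise alignment. It is not SAdLSA and does not perform alignment. The function name, the toy sequences, and the choice to exclude gapped columns from the denominator are assumptions for illustration; published tools define identity in several slightly different ways.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical toy alignment: 8 ungapped columns, 6 of them identical.
print(percent_identity("MKT-AYIAKQ", "MRTLAY-AKE"))  # 75.0
```

At identities near 10%, only about one aligned position in ten matches, which is why a method that draws on structural information is needed to detect such distant relationships.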
The researchers presented a new HPC toolbox for protein function annotation using structure-based deep learning methods. They demonstrated large-scale deployment of inference with the SAdLSA deep-learning-based alignment method, and described the development of distributed training that uses multiple GPUs and Summit nodes and will be scaled up further to accommodate larger training datasets.
The researchers also reported on the restructuring and deployment of the AlphaFold structure prediction program using Singularity containers on Summit, and on prototype small genome-scale test cases run on PACE resources.
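The idea behind large-scale inference, comparing every query protein against every database entry across many GPUs, can be sketched as a simple work-partitioning step. The Python below is only an illustration under stated assumptions: the function name, the round-robin scheme, and the placeholder identifiers are hypothetical, and the article does not describe the scheduler or workflow tooling actually used on Summit.

```python
from itertools import product


def shard_pairs(queries, targets, num_workers):
    """Assign every (query, target) comparison to a worker in round-robin order."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards = [[] for _ in range(num_workers)]
    for index, pair in enumerate(product(queries, targets)):
        shards[index % num_workers].append(pair)
    return shards


# Hypothetical identifiers: 3 query proteins, 2 database entries, 4 workers (e.g. GPUs).
shards = shard_pairs(["q1", "q2", "q3"], ["t1", "t2"], num_workers=4)
for worker, work in enumerate(shards):
    print(worker, work)
```

Round-robin assignment keeps shard sizes within one pair of each other. A production workflow would probably also balance by sequence length, since longer proteins take more time to process.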
Figure: SAdLSA performance on the PDB70 database on Summit. (Source: paper)
The toolbox contains multiple methods for structure-based functional annotation. These will be used in protocols that generate annotations for large proteomes whose annotations are unknown or of low confidence. They can even help validate proteins of known function by predicting structural properties that give more detailed information about the catalytic mechanisms and metabolic pathways in which those proteins may be involved.
In future work, the researchers hope to build on the toolbox to support emerging tasks in bioinformatics, including large-scale prediction of the tertiary and quaternary structures of proteins, and to develop new protocols that use a variety of tools to produce high-confidence hypotheses that inform and guide bench experiments.
Paper link: https://ieeexplore.ieee.org/document/9652872/authors
Related: https://phys.org/news/2022-01-scientists-summit-supercomputer-deep-protein.html