
Google AI annotated 10% of known protein sequences in one go, outpacing a decade of human research

Reports from the Heart of the Machine

Editors: Zenan, Zhang Qian

Unlike AlphaFold, this time Google is exploring the use of deep learning to annotate proteins with functional labels.

Proteins are essential components of all cells and tissues in the human body, and nearly every important process in the body involves them.

There are billions of known protein sequences, but the function of about a third of them remains unknown. We urgently need to explore this uncharted territory, because these proteins are linked to important problems such as antimicrobial resistance and even climate change: penicillin, for example, is produced by protein-driven (enzymatic) reactions, and plant proteins help remove carbon dioxide from the atmosphere.

Recently, Google partnered with the European Bioinformatics Institute to develop ProtCNN, a technique that uses neural networks to reliably predict protein function, helping to shrink the remaining unexplored regions of the protein universe.

Google says the new method makes it possible to predict protein function and the functional effects of mutations, and to design proteins, with applications in drug discovery, enzyme design, and even understanding the origins of life.

Paper: Using deep learning to annotate the protein universe


Google's proposed method reliably predicts the function of more proteins, and it is fast, cheap, and easy to try. The research has increased the number of annotated protein sequences in the widely used Pfam database by nearly 10 percent, exceeding the database's growth over the entire past decade, and has predicted functions for 360 human proteins.


The Pfam database is a collection of protein families, each represented by a multiple sequence alignment and a profile hidden Markov model.

These results suggest that deep learning models will be a core component of future protein annotation tools.

Most people are more familiar with DeepMind's earlier algorithm for predicting protein structure, AlphaFold. Where AlphaFold shows us the shapes of these mysterious biological machines, the new research focuses on what the machines do and what they are for.

Biomedicine is an extremely active field of science, with more than 100,000 protein sequences added to global sequence databases every day. Unless they come with functional annotations, however, these entries are of very limited use to practitioners. Efforts are made to extract annotations from the literature, evaluating more than 60,000 papers each year, but the task is so time-consuming that only about 0.03% of publicly available protein sequences have been manually annotated.


Inferring protein function directly from the amino acid sequence is a problem the scientific community has studied for decades. Since the 1980s, methods such as BLAST have relied on pairwise sequence comparison, assuming that a query protein has the same function as a highly similar, already-annotated sequence. Signature-based approaches came later: the PROSITE database catalogued short amino acid "motifs" found in proteins with specific functions. A key advance in signature-based methods was the development of the profile hidden Markov model (pHMM). These models collapse an alignment of related protein sequences into a single model that assigns a likelihood score to new sequences, describing how well they match the aligned set.
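
To make the idea concrete, here is a minimal toy sketch of alignment-based annotation transfer, the principle behind BLAST-style methods (not the BLAST algorithm itself): a query is assigned the label of its most similar annotated sequence, subject to a similarity cutoff. The database contents, sequences, and threshold below are all hypothetical.

```python
# Toy sketch of alignment-based function transfer: annotate a query with
# the label of its most similar annotated sequence. This illustrates the
# principle behind BLAST-style annotation, not the BLAST algorithm itself.
from difflib import SequenceMatcher

# Hypothetical annotated database: sequence -> functional label.
annotated_db = {
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ": "ribosomal protein",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS": "nucleocapsid protein",
}

def transfer_annotation(query: str, min_similarity: float = 0.5):
    """Return the label of the most similar annotated sequence,
    or None if nothing passes the similarity threshold."""
    best_label, best_score = None, min_similarity
    for seq, label in annotated_db.items():
        score = SequenceMatcher(None, query, seq).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# A query one mutation away from the first database entry.
print(transfer_annotation("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"))
```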

Crucially, profile HMMs allow longer signatures and fuzzier matching, and they are currently used to maintain popular databases such as InterPro and Pfam. Later improvements have made these techniques more sensitive and computationally efficient, and their availability as web tools makes it easy for practitioners to integrate them into their workflows.

These computational modeling methods have had a great impact on the field. To date, however, a third of bacterial proteins still have no functional annotation. One reason is that current methods compare each query against each family's sequences or model completely independently, so they may fail to exploit features shared across different functional classes.

Expanding the set of annotated protein sequences requires remote homology detection, i.e., accurately classifying sequences that have low similarity to the training data. The benchmark set constructed in the new study contains 21,293 such sequences. The accuracy of ProtENN (an ensemble of ProtCNN models) across all classes, including these remote test sequences, is a key requirement for expanding coverage of the protein universe. To meet the challenge of inferring function from only a few examples, the authors used the sequence representations learned by the deep models to improve performance.


Model performance on the Pfam-seed dataset.


Architecture of ProtCNN. The central diagram shows the input (red), embedding (yellow), and prediction (green) networks, with the residual (ResNet) block architecture at left; the diagram at right shows how ProtREP uses the learned ProtCNN embeddings with a simple nearest-neighbor method. In this representation, each sequence corresponds to a point, and sequences from the same family typically lie closer together than sequences from different families.
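
For illustration, here is a minimal sketch of a ProtCNN-style model in PyTorch, loosely matching the caption's input/embedding/prediction structure and the 1,100-dimensional embedding mentioned below. The kernel sizes, dilations, block count, pooling choice, and family count are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a ProtCNN-like model, assuming PyTorch. Kernel sizes,
# dilations, and block count are illustrative, not the paper's values.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Dilated 1D convolutional residual block over sequence features."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.net(x)  # residual (skip) connection

class ProtCNNSketch(nn.Module):
    # 25 input channels cover the amino acid alphabet plus special tokens;
    # 17,929 is the Pfam v32.0 family count (illustrative here).
    def __init__(self, n_amino_acids=25, n_families=17929, embed_dim=1100):
        super().__init__()
        self.input_conv = nn.Conv1d(n_amino_acids, embed_dim, kernel_size=1)
        self.blocks = nn.Sequential(ResidualBlock(embed_dim, dilation=1),
                                    ResidualBlock(embed_dim, dilation=2))
        self.predict = nn.Linear(embed_dim, n_families)

    def forward(self, x):            # x: (batch, n_amino_acids, seq_len), one-hot
        h = self.blocks(self.input_conv(x))
        h = h.max(dim=-1).values     # pool over positions -> fixed-size embedding
        return self.predict(h)       # per-family logits

model = ProtCNNSketch()
logits = model(torch.zeros(2, 25, 120))  # two dummy sequences of length 120
print(logits.shape)                      # torch.Size([2, 17929])
```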

ProtCNN learns a 1,100-dimensional real-valued vector representation of each sequence, regardless of its (unaligned) length. For classification to be accurate, the representations within each family must be tightly packed together while different families remain well separated. To test whether this learned representation can accurately classify sequences from the smallest families, the authors built a method called ProtREP: they computed the average learned representation of each family over its training sequences, yielding a per-family representation, and then classified each held-out test sequence by finding its nearest family representation in the learned embedding space. At the same computational cost, ProtREP exceeds ProtCNN's accuracy on the clustered split.
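
The following is a minimal sketch of the ProtREP procedure just described, assuming per-sequence embeddings (e.g., from a ProtCNN-style model) are already available; random vectors stand in for real learned representations.

```python
# Minimal sketch of the ProtREP idea: average each family's embeddings,
# then classify a held-out sequence by its nearest family representation.
# Random vectors stand in for real learned representations.
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 1100

# Hypothetical training embeddings grouped by Pfam family.
train = {
    "PF00001": rng.normal(size=(40, embed_dim)),
    "PF00002": rng.normal(size=(3, embed_dim)),   # a tiny family: few-shot case
}

# Per-family representation: the mean of its training embeddings.
family_reps = {fam: embs.mean(axis=0) for fam, embs in train.items()}

def classify(query_emb: np.ndarray) -> str:
    """Assign the family whose mean representation is nearest (Euclidean)."""
    return min(family_reps,
               key=lambda fam: np.linalg.norm(query_emb - family_reps[fam]))

test_emb = rng.normal(size=embed_dim)
print(classify(test_emb))
```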


Combining ProtENN with a top-pick HMM (TPHMM) improves performance on the remote homology task. A simple combination of the TPHMM and ProtENN models reduced the error rate by 38.6 percent, raising accuracy on this data from 89.0 percent to 93.3 percent.
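
The article does not spell out the combination rule, so the sketch below assumes a simple confidence fallback: trust the HMM's call when its score clears a threshold, otherwise use the deep ensemble's prediction. The function name, score scale, and threshold are hypothetical.

```python
# Illustrative combination of an alignment-based model (TPHMM) with a deep
# ensemble (ProtENN). The exact combination rule used in the paper is not
# given here; this sketch assumes a simple confidence fallback.
def combined_prediction(tphmm_hit, tphmm_score, protenn_family,
                        score_threshold=25.0):
    """Prefer the HMM's family call when its bit score clears a threshold,
    otherwise fall back to the deep ensemble's prediction."""
    if tphmm_hit is not None and tphmm_score >= score_threshold:
        return tphmm_hit
    return protenn_family

print(combined_prediction("PF00069", 31.2, "PF07714"))  # -> PF00069
print(combined_prediction(None, 0.0, "PF07714"))        # -> PF07714
```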

To probe what the deep model learns about protein sequence data, the authors trained ProtCNNs on 80% of the unaligned sequences from Pfam-full and computed a similarity matrix over the learned amino acid representations.
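
A minimal sketch of that similarity analysis, assuming a matrix of learned per-amino-acid embedding vectors (random stand-ins here): cosine similarity between rows yields the similarity matrix.

```python
# Sketch of the amino-acid similarity analysis: cosine similarity between
# learned per-amino-acid embedding vectors. Random vectors stand in for the
# representations a trained ProtCNN would actually learn.
import numpy as np

rng = np.random.default_rng(1)
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")      # the 20 standard amino acids
E = rng.normal(size=(len(amino_acids), 32))     # hypothetical embeddings

# Normalize rows; the similarity matrix is then a single matrix product.
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
similarity = E_norm @ E_norm.T                  # (20, 20) cosine similarities

# e.g., how similar are the learned vectors for leucine and isoleucine?
print(similarity[amino_acids.index("L"), amino_acids.index("I")])
```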

The results show that ProtCNN learns meaningful representations of protein sequences that generalize to unseen parts of sequence space and can be used to predict and understand the properties of protein sequences. A further challenge is detecting protein domains and their positions within a protein sequence. This task resembles image segmentation, which is exactly what deep learning models are good at. Although ProtCNN is trained on individual domains, the study demonstrates that it can split a full-length sequence into domains using a simple sliding-window method, as sketched below.
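
Here is a toy sketch of such a sliding-window approach; `predict_family` is a hypothetical stand-in for a trained classifier, and the window and step sizes are arbitrary.

```python
# Toy sketch of sliding-window domain calling: classify each window with a
# per-window family classifier, then merge runs of identical calls into
# domain segments. `predict_family` stands in for a trained model.
def predict_family(window: str) -> str:
    # Stand-in for a trained ProtCNN-style classifier; here, a toy rule.
    return "PF_A" if window.count("K") > window.count("D") else "PF_B"

def call_domains(sequence: str, window: int = 20, step: int = 5):
    """Classify each window, then merge runs of identical calls."""
    calls = [(i, predict_family(sequence[i:i + window]))
             for i in range(0, len(sequence) - window + 1, step)]
    segments = []
    start, last, current = calls[0][0], calls[0][0], calls[0][1]
    for pos, fam in calls[1:]:
        if fam != current:
            segments.append((start, last + window, current))
            start, current = pos, fam
        last = pos
    segments.append((start, last + window, current))
    return segments  # list of (start, end, family) spans

# Two artificial "domains": a lysine-rich stretch then an aspartate-rich one.
print(call_domains("K" * 25 + "D" * 25))
```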

Even though it uses no sequence alignments at all, ProtCNN achieves excellent accuracy.
