Deep learning predicts protein-protein interactions

2022-01-15 12:25:47

Editor | Radish Peel

Professor Lenore Cowen of Tufts University and researchers at the Massachusetts Institute of Technology collaborated to design a structure-driven deep learning approach based on recent advances in neural language modeling. The team's deep learning model, called D-SCRIPT, can predict protein-protein interactions (PPIs) from primary amino acid sequences.

The researchers combined advances in neural language modeling and structure-driven design to develop D-SCRIPT, an interpretable and generalizable deep learning model that uses only the sequences of two proteins to predict whether they interact, and that maintains high accuracy with limited training data and across species.

Test results show that a D-SCRIPT model trained on 38,345 human PPIs significantly improves the functional characterization of Drosophila proteins compared with state-of-the-art methods. Evaluating the same model on protein complexes with known 3D structures, the researchers found that the inter-protein contact maps output by D-SCRIPT overlapped significantly with the ground truth.

The team applied D-SCRIPT to a genome-wide PPI screen in cattle (Bos taurus), focusing on rumen physiology, and identified functional gene modules associated with metabolism and immune response. The predicted interactions can then be used to make large-scale functional predictions, addressing the genome-to-phenome challenge, especially in species with little data.

The study, titled "D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions," was published in Cell Systems on September 17, 2021.

D-SCRIPT is an interpretable method for predicting PPIs from sequence. It takes a structure-based approach, computing the predicted score for a protein pair as the binding compatibility of the two proteins' respective structures. Because structures are more conserved over the course of evolution than sequences, a physical model of interaction like this can generalize well across species.
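The idea of scoring a pair through a predicted contact map can be illustrated with a deliberately simplified sketch. It is not the D-SCRIPT architecture: the random embeddings, the projection weights, the dot-product-plus-sigmoid contact rule, and the max pooling are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
import math
import random

def project(embedding, weights):
    """Linearly project each per-residue vector into a lower dimension."""
    return [
        [sum(x * w for x, w in zip(residue, column)) for column in weights]
        for residue in embedding
    ]

def contact_map(proj_a, proj_b):
    """Contact probability for every residue pair (assumed rule: sigmoid of a dot product)."""
    return [
        [1.0 / (1.0 + math.exp(-sum(x * y for x, y in zip(ra, rb)))) for rb in proj_b]
        for ra in proj_a
    ]

def interaction_score(cmap):
    """Pool the contact map into one score (assumed rule: maximum contact probability)."""
    return max(max(row) for row in cmap)

rng = random.Random(0)           # fixed seed so the toy example is reproducible
dim_in, dim_out = 8, 4           # hypothetical embedding sizes
# weights[k] holds the input coefficients that produce output dimension k
weights = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]

# Stand-ins for language-model embeddings of a 5-residue and a 7-residue protein
protein_a = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(5)]
protein_b = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(7)]

cmap = contact_map(project(protein_a, weights), project(protein_b, weights))
score = interaction_score(cmap)
print(len(cmap), len(cmap[0]), round(score, 3))
```

The point of the sketch is the shape of the computation: two per-residue representations produce a residues-by-residues map, and the single interaction score is derived from that map rather than predicted directly.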

Illustration: D-SCRIPT motivation and workflow. (Source: paper)

The intermediate contact map representation in the model is directly interpretable and can be used to validate predictions or to study protein-binding regions at residue resolution. D-SCRIPT thus joins a small but growing set of explainable deep learning approaches in computational biology. The team's modular design also makes it possible to inspect the model's output at different stages, and the researchers demonstrated that each layer captures incremental structural information.
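As a minimal sketch of how such a contact map might be inspected at the residue level, the snippet below ranks residue pairs by predicted contact probability. The matrix values are invented for illustration and `top_contacts` is a hypothetical helper, not part of the D-SCRIPT software.

```python
def top_contacts(cmap, k=3):
    """Return the k residue pairs (i, j, probability) with the highest predicted contact."""
    pairs = [
        (i, j, p)
        for i, row in enumerate(cmap)
        for j, p in enumerate(row)
    ]
    # Sort by probability, highest first; ties fall back to residue indices
    pairs.sort(key=lambda t: (-t[2], t[0], t[1]))
    return pairs[:k]

# Toy contact map: rows index residues of protein A, columns residues of protein B
toy_map = [
    [0.05, 0.10, 0.02],
    [0.08, 0.91, 0.40],
    [0.03, 0.75, 0.12],
]

for i, j, p in top_contacts(toy_map, k=2):
    print(f"residue A{i} - residue B{j}: {p:.2f}")
```

Pairs that rank highly here would be the candidate binding-interface residues to compare against a known structure.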

An advantage of sequence-based approaches such as D-SCRIPT is that input sequence data is almost always available, thanks to huge advances in low-cost genome sequencing. Compared with PIPR, the state-of-the-art deep learning method that also takes sequences as input, D-SCRIPT generalizes better across species; it is therefore more effective for accurate de novo PPI prediction on less-studied proteins in non-model organisms or flies.

Illustration: D-SCRIPT architecture. (Source: paper)

The researchers suspect that both D-SCRIPT's relative success across species and its weaker performance in within-species evaluations stem from the simplicity and degree of regularization of the model. These design choices improve D-SCRIPT's generality, steering it to learn general structural aspects of interaction rather than exploiting network structure or how often an individual protein appears as an interaction partner. For some tasks, however, a balance may need to be struck between the cross-species generalization of D-SCRIPT and the within-species specificity of other state-of-the-art methods. One future research direction is transfer learning, fine-tuning a pre-trained D-SCRIPT model on the target species; another is to combine it with association-based, graph-theoretic PPI prediction.

Diagram: Protein interaction networks in the rumen of cattle. (Source: paper)

It is worth noting that D-SCRIPT does not require multiple sequence alignments (MSAs). However, the pre-trained language model used in D-SCRIPT is trained on the MSA of the entire protein corpus, so its input representations implicitly capture some aspects of evolutionary conservation. Coevolution-based methods that use MSAs explicitly have previously been shown to be very effective at reconstructing contact maps and 3D structures of single proteins. Extending them to PPI prediction raises an additional challenge: determining the correct pairing of rows between the two proteins' MSAs.

In prokaryotic genomes, where conservation of gene order (synteny) can be highly informative, methods such as ComplexContact, EVcomplex, and GREMLIN have been shown to perform well and to provide residue-level details of interactions. However, these methods are less successful when extended to more complex eukaryotic genomes.

Illustration: D-SCRIPT embeddings represent structure and interaction. (Source: paper)

The researchers found that the need to compute MSAs is a performance bottleneck that makes eukaryotic genome-scale prediction with such methods infeasible, limiting the applicability of EVcomplex-like methods in this setting. Still, explicitly incorporating coevolutionary information could improve the accuracy of D-SCRIPT, and future work may explore ways to do so without sacrificing speed. Insights from related advances in predicting contact maps and structures of individual proteins could also be incorporated into the model architecture.

D-SCRIPT illustrates that learning the language of individual proteins, itself a highly successful deep learning effort, also helps decode the language of protein interactions. By leveraging Bepler and Berger's pre-trained language model, D-SCRIPT benefits indirectly from the rich data available on the 3D structures of individual proteins. In contrast, PPI prediction methods that are supervised directly on the 3D structures of protein complexes must contend with relatively small corpora when learning the physical mechanisms of interaction.

Illustration: D-SCRIPT predicts biologically meaningful contact maps. (Source: paper)

Scalable computational methods are urgently needed to infer gene function from sequence in non-model organisms. While the sequencing revolution has made genomes far more widely available, functional data are still lacking. PPI prediction with D-SCRIPT is fast enough to enable genome-scale screening. For example, the team was able to evaluate 50 million candidate PPIs for B. taurus on a single GPU over 8 days.
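To put those figures in perspective, the short calculation below converts them into an average throughput. It assumes the 8 days were continuous wall-clock time and treats both reported numbers as exact, which they are unlikely to be.

```python
candidate_pairs = 50_000_000       # candidate PPIs reported for B. taurus
days = 8                           # reported run time on a single GPU
seconds = days * 24 * 60 * 60      # 691,200 seconds, assuming continuous running

pairs_per_second = candidate_pairs / seconds
seconds_per_pair = seconds / candidate_pairs

print(f"{pairs_per_second:.1f} pairs per second")          # about 72.3
print(f"{seconds_per_pair * 1000:.1f} ms per pair")        # about 13.8
```

Under these assumptions the model scores roughly 72 protein pairs per second, or about 14 milliseconds per pair.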

A workflow built on D-SCRIPT, consisting of genome-scale PPI prediction followed by graph-theoretic analysis of the resulting PPI network to identify functional modules, can generate high-confidence, large-scale predictions of gene function; the team demonstrated this in a case study of the cattle rumen.
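The two-step workflow can be sketched roughly as follows. The protein names, scores, and the 0.5 threshold are invented for illustration, and connected components are used here only as a simple stand-in for the graph-theoretic module detection; the source does not specify the clustering method at this level of detail.

```python
from collections import defaultdict

def build_network(scored_pairs, threshold=0.5):
    """Keep only pairs whose predicted score meets the threshold; return an adjacency map."""
    graph = defaultdict(set)
    for a, b, score in scored_pairs:
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_modules(graph):
    """Group proteins into connected components (a simple stand-in for module detection)."""
    seen = set()
    modules = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        modules.append(sorted(component))
    return modules

# Hypothetical predicted scores for protein pairs
predictions = [
    ("P1", "P2", 0.9),
    ("P2", "P3", 0.8),
    ("P3", "P4", 0.1),   # below threshold, so no edge is added
    ("P4", "P5", 0.7),
]

modules = connected_modules(build_network(predictions))
print(modules)
```

Each resulting module is a candidate group of functionally related genes; in a real analysis, known annotations of some members would be used to propose functions for the rest.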

Such de novo PPI prediction is useful even in model organisms, such as nematodes, for which the known portion of the PPI network is still very sparse. In other organisms where some PPI data does exist, future work could combine that data with D-SCRIPT predictions. The researchers hope that its combination of broad applicability, cross-species accuracy, and speed will make D-SCRIPT a useful community resource for addressing the genome-to-phenome challenge.

Paper link: https://doi.org/10.1016/j.cels.2021.08.010

Related: https://www.eurekalert.org/news-releases/936669
