Three times the sensitivity, millions of protein pairs searched in seconds: Fudan and collaborators develop a new protein language model search method

Author: ScienceAI 2024-04-08 17:07:00

Editor | Radish peel

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared with structure-based search, detecting remote evolutionary relationships from sequences alone remains challenging.

A research team from Fudan University, Shandong University, and Shanghai Jiao Tong University proposed PLMSearch (Protein Language Model Search), a homologous protein search method that takes only sequences as input and can capture the remote homology information hidden behind them.

Like MMseqs2, PLMSearch can search millions of query-target protein pairs in seconds, while increasing sensitivity more than threefold, to a level comparable with current state-of-the-art structure search methods. In addition, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs, whose sequences differ but whose structures are similar.

The study, titled "PLMSearch: Protein language model powers accurate and fast sequence search for remote homology", was published in Nature Communications on March 30, 2024.

Homologous protein search is one of the core technologies in bioinformatics: it predicts the function and interactions of proteins by comparing protein sequences. Although search methods based on sequence similarity are widely used, they still struggle to identify remote evolutionary relationships. Structure search methods offer higher sensitivity, but the cost and complexity of obtaining protein structures limit where they can be applied.

Although protein language models (PLMs) have shown advantages in structure-related tasks, how to effectively utilize PLMs to achieve fast and accurate homology detection when dealing with large-scale datasets remains a challenge.

Methods that combine deep learning representations with sequence alignment algorithms improve accuracy, but they still face problems of computational efficiency and model generalization. Developing new methods that overcome these limitations is therefore of great significance for advancing research in bioinformatics and related fields.

Here, the research team from Fudan University, Shandong University, and Shanghai Jiao Tong University proposed PLMSearch, a tool that takes sequences as input and searches for homologous proteins using a protein language model together with Pfam sequence analysis, allowing it to mine the remote homology hidden behind sequences.

Diagram: PLMSearch overview. (Source: Paper)

PLMSearch consists of the following three parts:

(1) PfamClan pre-filters the candidates, keeping the protein pairs that share a domain from the same Pfam clan.

(2) SS-predictor (Structural Similarity predictor) uses embeddings generated by a protein language model to predict the similarity between all query-target pairs. Even without structures as input, PLMSearch does not lose much sensitivity, because the protein language model captures remote homology information from deep sequence embeddings. In addition, the SS-predictor is trained with structural similarity (TM-score) as the ground truth, which allows PLMSearch to produce reliable similarity estimates without a structure as input.

(3) PLMSearch sorts the pairs pre-filtered by PfamClan according to their predicted similarity and outputs the search results for each query protein accordingly. The protein pairs retrieved by PLMSearch are then aligned with PLMAlign to obtain alignment scores.
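
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the dictionary keys, the clan IDs and the `search` function are hypothetical, and plain cosine similarity stands in for the trained SS-predictor network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query, targets, top_k=10):
    """Rank targets for one query protein.

    query / targets are dicts with hypothetical keys:
    "id", "clans" (set of Pfam clan IDs), "emb" (per-protein embedding).
    """
    # Step 1: pre-filter, keeping targets that share a clan with the query.
    kept = [t for t in targets if query["clans"] & t["clans"]]
    # Step 2: predict similarity from embeddings. Cosine similarity is an
    # assumption standing in for the trained SS-predictor.
    scored = [(t["id"], cosine(query["emb"], t["emb"])) for t in kept]
    # Step 3: sort by predicted similarity, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data: clan IDs and embeddings are placeholders, not real annotations.
query = {"id": "q1", "clans": {"CL0001"}, "emb": [1.0, 0.0]}
targets = [
    {"id": "t1", "clans": {"CL0001"}, "emb": [0.6, 0.8]},
    {"id": "t2", "clans": {"CL0002"}, "emb": [1.0, 0.0]},
    {"id": "t3", "clans": {"CL0001", "CL0003"}, "emb": [1.0, 0.1]},
]
hits = search(query, targets)
```

In this toy run, `t2` never reaches the scoring step because it shares no clan with the query, even though its embedding is identical to the query's; the pre-filter decides which pairs are ranked at all.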

Search tests on SCOPe40-test and Swiss-Prot show that PLMSearch, like MMseqs2, can search millions of query-target protein pairs in a matter of seconds, but with a more than threefold increase in sensitivity. Its performance is comparable to current state-of-the-art structure search methods, especially on remote homology pairs. Compared with the other baseline methods, PLMSearch is one of the fastest and offers the best trade-off between accuracy and speed.

Illustration: PLMSearch achieves a sensitivity similar to that of structure search methods. (Source: Paper)

The team discussed at length the differences between search methods (e.g., PLMSearch) and alignment methods (e.g., pLM-BLAST and PLMAlign), noting that residue embedding-based alignment methods, such as PLMAlign and pLM-BLAST, have good sensitivity.

Currently, the main limitation of these alignment methods is the size of the target dataset they can handle. This shows up in two key ways:

(1) Residue embedding-based alignment requires storing all residue embeddings for every protein in the target dataset, whereas PLMSearch only needs to keep the per-protein embedding of each protein. The resulting size difference is more than three orders of magnitude, which poses a significant challenge when searching large datasets such as UniRef50, with its 53.6 million proteins.

(2) Residue embedding-based alignment determines the similarity of each protein pair through a pairwise global (or local) alignment, while PLMSearch can predict the similarity of millions of query-target pairs with a single forward pass of the SS-predictor network.
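
The contrast drawn in point (2) can be made concrete with a small sketch. It is illustrative only and rests on assumptions: a textbook Needleman-Wunsch recursion over dot products of residue embeddings stands in for embedding-based alignment (it is not the scoring used by pLM-BLAST or PLMAlign), and a plain dot product of per-protein embeddings stands in for the SS-predictor. All function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def global_alignment_score(res_a, res_b, gap=-1.0):
    """Needleman-Wunsch score over two lists of per-residue embeddings.

    The nested loops visit len(res_a) * len(res_b) cells, and this has to
    be repeated for every query-target pair.
    """
    rows, cols = len(res_a), len(res_b)
    prev = [j * gap for j in range(cols + 1)]
    for i in range(1, rows + 1):
        curr = [i * gap] + [0.0] * cols
        for j in range(1, cols + 1):
            match = prev[j - 1] + dot(res_a[i - 1], res_b[j - 1])
            curr[j] = max(match, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[cols]

def score_all_pairs(query_embs, target_embs):
    """One score per query-target pair from per-protein embeddings.

    The work per pair depends on the embedding size only, not on the
    sequence lengths. A dot product is an assumption, not the SS-predictor.
    """
    return [[dot(q, t) for t in target_embs] for q in query_embs]

# Two toy proteins of two residues each, with 2-dimensional embeddings.
protein_a = [[1.0, 0.0], [0.0, 1.0]]
protein_b = [[1.0, 0.0], [0.0, 1.0]]
alignment_score = global_alignment_score(protein_a, protein_b)

# Per-protein embeddings: one vector per protein, scored in one batch.
pair_scores = score_all_pairs([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The design point is where the cost sits: the alignment routine fills a table whose size grows with both sequence lengths, while the batch scorer touches each pair once with fixed-size vectors.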

It is important to note that PLMSearch only predicts the similarity of protein pairs and does not produce alignments. The combined PLMSearch + PLMAlign pipeline therefore keeps only the protein pairs with predicted similarity greater than 0.3 and aligns them with PLMAlign. This compensates for the limitation of PLMSearch and also avoids a large number of low-similarity, uninformative alignments, which keeps the pipeline efficient.
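
A minimal sketch of this gating step is shown below. The 0.3 threshold comes from the text above; everything else is an assumption made for illustration, including the `search_then_align` name, the tuple layout, and the `align_fn` callable that stands in for PLMAlign.

```python
def search_then_align(scored_pairs, align_fn, threshold=0.3):
    """Align only the pairs whose predicted similarity exceeds the threshold.

    scored_pairs: iterable of (query_id, target_id, predicted_similarity).
    align_fn: hypothetical callable(query_id, target_id) returning an
              alignment result; it stands in for PLMAlign here.
    """
    results = []
    for query_id, target_id, similarity in scored_pairs:
        if similarity > threshold:
            # The expensive alignment runs only for retained pairs.
            alignment = align_fn(query_id, target_id)
            results.append((query_id, target_id, similarity, alignment))
    return results

# Toy usage: record which pairs actually reach the aligner.
aligned_calls = []

def fake_align(query_id, target_id):
    """Placeholder aligner that only records that it was called."""
    aligned_calls.append((query_id, target_id))
    return "alignment of " + query_id + " vs " + target_id

scored = [("q1", "t1", 0.82), ("q1", "t2", 0.12), ("q1", "t3", 0.31), ("q1", "t4", 0.30)]
kept = search_then_align(scored, fake_align)
```

Only `t1` and `t3` are aligned here; `t2` and `t4` fall at or below the threshold, so the aligner is never invoked for them.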

Illustration: PLMSearch accurately detects remote homology pairs. (Source: Paper)

In the future, the researchers plan to explore the interaction between query and target residue embeddings, which should provide better global and local sequence alignment results.

In conclusion, the researchers believe that PLMSearch removes the low-sensitivity limitation of sequence search methods. Since sequences are easier to obtain and work with than structures, PLMSearch is expected to become a more convenient method for large-scale homologous protein search.

PLMSearch: https://dmiip.sjtu.edu.cn/PLMSearch

Link to paper: https://www.nature.com/articles/s41467-024-46808-5
