Predicting protein co-regulation and function, Harvard & MIT trained a genomic language model

Author: ScienceAI

Editor | Radish Peel

Deciphering the relationship between genes and their genomic context is fundamental to understanding and designing biological systems. Machine learning has shown potential in learning the underlying relationships behind sequence-structure-function paradigms from large protein sequence datasets.

Researchers at Harvard University and the Massachusetts Institute of Technology (MIT) trained a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes.

The gLM learns "contextualized" protein embeddings that capture the genomic context as well as the protein sequence itself, and that encode biologically meaningful and functionally relevant information (e.g., enzymatic function, taxonomy).

The study, titled "Genomic language model predicts protein co-regulation and function," was published in Nature Communications on April 3, 2024.

Evolutionary processes have created complex connections between the sequence, structure, and function of proteins that are critical to interpreting genomic data. While neural network (NN)-based protein structure prediction methods and unsupervised protein language models (pLMs) have made major advances, these models often ignore the interrelationships among proteins and their context in the genome.

In bacteria and archaea especially, evolutionary events such as horizontal gene transfer (HGT) have a significant impact on the organization and diversity of the genome. There is therefore a need for a method that captures the evolutionary links between genes, genomic context, and gene function. Existing attempts to model genomic information focus mainly on predicting gene function, while ignoring continuous representations of genes in a multidimensional space.

Recent studies such as GenSLM have attempted to learn genome-scale information through pre-training and fine-tuning, but no existing method combines pre-training across diverse biological lineages, rich and continuous gene representations, and the ability to process long fragments containing multiple genes in order to learn genomic context.

To bridge the gap between genomic context and gene sequence, structure, and function, researchers at Harvard University and MIT developed a genomic language model (gLM) that learns contextual representations of genes. The gLM takes pLM embeddings as input, which encode relational properties and structural information of the gene products.

Figure: Schematic diagram of gLM training and inference. (Source: Paper)

As in natural language processing, unsupervised masked language modeling lets a model learn the semantics and syntax of the underlying "language" by predicting masked tokens. Specifically, the gLM is based on a 19-layer Transformer architecture and is trained on millions of unlabeled metagenomic sequences with a masked language modeling objective: it learns to predict masked genes from their genomic context and, for each masked gene, outputs up to four different predictions together with their probabilities.
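The following is a minimal PyTorch sketch of this masked gene modeling setup: a Transformer encoder reads a sequence of per-gene pLM embeddings with some genes masked, and for each position outputs several candidate embeddings plus a probability for each. The dimensions, layer counts, masking scheme, and prediction heads here are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch of masked gene modeling over per-gene pLM embeddings.
# All hyperparameters and the projection/prediction heads are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedGeneModel(nn.Module):
    def __init__(self, plm_dim=1280, d_model=1280, n_layers=19, n_heads=10, n_options=4):
        super().__init__()
        self.proj_in = nn.Linear(plm_dim, d_model)           # project pLM embeddings into the model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_options = n_options
        # For each masked gene, predict several candidate embeddings plus a likelihood for each.
        self.head_emb = nn.Linear(d_model, n_options * plm_dim)
        self.head_logit = nn.Linear(d_model, n_options)

    def forward(self, gene_embs, mask):
        # gene_embs: (batch, n_genes, plm_dim) per-gene pLM embeddings for one sub-contig
        # mask:      (batch, n_genes) bool, True where the gene embedding is hidden
        x = gene_embs.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked genes
        h = self.encoder(self.proj_in(x))
        preds = self.head_emb(h).view(*h.shape[:2], self.n_options, -1)
        probs = self.head_logit(h).softmax(-1)               # confidence over the candidate predictions
        return preds, probs

# Toy usage: a "contig" of 30 genes, roughly 15% of them masked.
model = MaskedGeneModel()
genes = torch.randn(1, 30, 1280)
mask = torch.rand(1, 30) < 0.15
preds, probs = model(genes, mask)
print(preds.shape, probs.shape)   # (1, 30, 4, 1280), (1, 30, 4)
```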

Performance was evaluated with pseudo-accuracy and absolute accuracy metrics, focusing on the E. coli K-12 genome, with sub-fragments highly similar to it excluded from the training set. On this validation set, the gLM achieved a pseudo-accuracy of 71.9% and an absolute accuracy of 59.2%. It also learned a meaningful confidence metric: 75.8% of its high-confidence predictions were correct. This is a marked improvement over a bidirectional LSTM baseline trained on the same task and dataset (28% pseudo-accuracy and 15% absolute accuracy).
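As a rough illustration of how a pseudo-accuracy-style metric can be computed, the sketch below counts a prediction as correct when the masked gene's true embedding is the nearest neighbor (by Euclidean distance) of the predicted embedding among all genes in the same sub-contig. This nearest-neighbor criterion is an assumption made for illustration, not the paper's exact evaluation code.

```python
# A minimal sketch of a pseudo-accuracy-style metric, assuming a prediction counts as
# correct when the masked gene is the nearest neighbour of the predicted embedding
# among all genes in the sub-contig. Illustration only, not the paper's evaluation code.
import torch

def pseudo_correct(pred, labels, masked_idx):
    """pred:       (plm_dim,)         predicted embedding for one masked gene
       labels:     (n_genes, plm_dim) true pLM embeddings of every gene in the sub-contig
       masked_idx: int                position of the masked gene"""
    dists = torch.cdist(pred.unsqueeze(0), labels).squeeze(0)  # distance to every gene
    return int(dists.argmin().item() == masked_idx)            # 1 if the nearest gene is the masked one

# Toy usage
labels = torch.randn(30, 1280)
pred = labels[7] + 0.01 * torch.randn(1280)   # a prediction close to gene 7
print(pseudo_correct(pred, labels, 7))        # -> 1
```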

Figure: Validation accuracy curves for gLM (A) and biLSTM baseline (B). (Source: Paper)

The researchers also emphasized the importance of using pre-trained protein language model (pLM) representations as input: when these were replaced with one-hot amino acid encodings, model performance dropped to the level of random prediction (3% pseudo-accuracy and 0.02% absolute accuracy).

Figure: Homology of protein-protein interactions predicted by gLM. (Source: Paper)

Overall, gLM provides a promising way to study basic biology, and the researchers have proposed several future optimization directions:

First, the Transformer architecture has proven to scale efficiently, and increasing the number of model parameters and the size of the training dataset has been shown to greatly improve performance and versatility in both natural language and protein language processing. The team's model has about 1B parameters, at least an order of magnitude fewer than state-of-the-art pLMs; with further hyperparameter tuning and scaling, its performance should improve.

Second, the current model uses pLM embeddings to represent proteins in the input. These embeddings are obtained by averaging the residue-level hidden states across the entire protein sequence, so residue-specific information and the effects of synonymous mutations may be obscured. Future iterations of the model could use raw residue-level or codon-level embeddings as input, allowing it to model residue-to-residue co-evolutionary interactions between proteins and the impact of synonymous mutations on gene function.
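The mean-pooling step described above can be sketched as follows; the per-residue hidden states are simulated with random tensors so the example stays self-contained, whereas in practice they would come from a pre-trained pLM such as ESM2.

```python
# A minimal sketch of mean-pooling per-residue pLM hidden states into one protein-level
# embedding. Residue states are random placeholders; in practice they come from a pLM.
import torch

def mean_pool(residue_states, lengths):
    """residue_states: (batch, max_len, dim) per-residue hidden states (zero-padded)
       lengths:        (batch,) true sequence lengths
       returns:        (batch, dim) one averaged embedding per protein"""
    mask = torch.arange(residue_states.size(1))[None, :] < lengths[:, None]   # valid-residue mask
    summed = (residue_states * mask.unsqueeze(-1)).sum(dim=1)
    return summed / lengths.unsqueeze(-1)                                     # average over real residues only

# Toy usage: two proteins of lengths 120 and 85, padded to 120 residues.
states = torch.randn(2, 120, 1280)
lengths = torch.tensor([120, 85])
protein_embs = mean_pool(states, lengths)
print(protein_embs.shape)   # torch.Size([2, 1280])
```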

Third, reconstructing masked protein embeddings requires modeling a distribution over possible embeddings, which the current model approximates with a fixed number of predictions. Future work could improve on this with generative approaches such as diffusion or GAN models, potentially yielding better prediction accuracy and greater generalization to unseen datasets.

Fourth, adding non-protein modalities (e.g., non-coding regulatory elements) as inputs could also greatly improve gLM's representation of biological sequence data and allow it to learn protein function and regulation conditioned on these other modalities.

Fifth, the model is trained mainly on bacterial, archaeal, and viral genomes, so how the method can be applied to eukaryotic genomes, especially those with extensive intergenic regions, remains to be explored.

Figure: Linear probing of context-independent, context-only, and contextualized gene embeddings. (Source: Paper)

The researchers also point to future directions for applying gLM to advance biological research:

1. Feature-based transfer learning for predicting protein function (e.g., Gene Ontology [GO] terms), especially for proteins with limited sequence and structural homology; a minimal sketch of this idea is shown after the list.

2. Fine-tune gLM for protein-protein interactome prediction tasks.

3. Use gLM features to encode genomic context as an additional input for improved and contextualized protein structure prediction.
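
A minimal sketch of the feature-based transfer learning idea in point 1: a linear probe is trained on top of frozen, contextualized gene embeddings to predict multi-label function annotations (e.g., GO terms). The embeddings and labels below are random placeholders standing in for gLM outputs and curated annotations.

```python
# A minimal sketch of a linear probe on frozen contextualized gene embeddings for
# multi-label function prediction (e.g., GO terms). Data are random placeholders.
import torch
import torch.nn as nn

n_genes, emb_dim, n_go_terms = 1000, 1280, 50
embeddings = torch.randn(n_genes, emb_dim)                  # frozen contextualized embeddings (placeholder)
labels = (torch.rand(n_genes, n_go_terms) < 0.05).float()   # multi-label GO annotations (placeholder)

probe = nn.Linear(emb_dim, n_go_terms)                      # the only trainable component
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                            # multi-label classification loss

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(probe(embeddings), labels)
    loss.backward()
    optimizer.step()
print(f"final training loss: {loss.item():.4f}")
```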

In conclusion, genomic language models are a powerful tool for condensing important biological information from full metagenomic sequences in an unbiased way. Coupled with advances in long-read sequencing, the researchers expect the quality, quantity, and diversity of input data to improve dramatically. Genomic language modeling offers a path to bridging the gap between atomic structure and organismal function, bringing scientists closer to modeling biological systems and ultimately manipulating biology with precision (e.g., genome editing, synthetic biology).

Paper link: https://www.nature.com/articles/s41467-024-46947-9
