
Multi-functional RNA analysis: the Baidu team's RNA language model published in Nature Machine Intelligence

Editor | Carrot Core

Pretrained language models have shown promise in analyzing nucleotide sequences, but building a multi-purpose model that performs well across different tasks with a single set of pretrained weights remains challenging.

A team from Baidu's Big Data Lab (BDL) and Shanghai Jiao Tong University developed RNAErnie, an RNA-centric pre-trained model based on the Transformer architecture.

The researchers evaluated the model with seven datasets and five tasks, demonstrating the superiority of RNAErnie in both supervised and unsupervised learning.

RNAErnie outperformed the baseline with a 1.8% improvement in classification accuracy, a 2.2% increase in interaction prediction accuracy, and a 3.3% improvement in the structure prediction F1 score, demonstrating its robustness and adaptability.

The study, titled "Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning", was published in Nature Machine Intelligence on 13 May 2024.


RNA plays a key role in the central dogma of molecular biology, carrying genetic information from DNA to proteins.

RNA molecules play a vital role in a variety of cellular processes such as gene expression, regulation, and catalysis. Given the importance of RNA in biological systems, there is a growing need for efficient and accurate methods for the analysis of RNA sequences.

Traditional RNA sequence analysis relies on experimental techniques such as RNA sequencing and microarrays, but these methods are often costly, time-consuming, and require a large amount of RNA input.

To address these challenges, a team from Baidu BDL and Shanghai Jiao Tong University developed a pre-trained RNA language model: RNAErnie.

RNAErnie

The model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and contains multi-layer, multi-head Transformer blocks, each with a hidden-state dimension of 768. Pre-training was performed on an extensive corpus of approximately 23 million RNA sequences carefully selected from RNAcentral.

The proposed motif-aware pre-training strategy combines base-level masking, subsequence-level masking, and motif-level random masking, which effectively captures subsequence- and motif-level knowledge and enriches the representation of RNA sequences, as the sketch below illustrates.
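To make the three masking levels concrete, here is a minimal Python sketch. The toy tokenizer (one token per nucleotide), the motif list, the mask rate, and the span lengths are all illustrative assumptions, not details taken from the RNAErnie codebase.

```python
# Illustrative multi-level random masking; tokenizer, motifs and rates are assumptions.
import random

MASK = "[MASK]"
MOTIFS = ["GGAC", "AUUUA"]  # hypothetical motif vocabulary

def mask_sequence(seq: str, rate: float = 0.15, level: str = "base") -> list[str]:
    tokens = list(seq)
    n = len(tokens)
    if level == "base":
        # Base level: mask individual nucleotides independently.
        for i in range(n):
            if random.random() < rate:
                tokens[i] = MASK
    elif level == "subsequence":
        # Subsequence level: mask a contiguous stretch of 4-8 nucleotides.
        span = random.randint(4, 8)
        start = random.randrange(max(1, n - span))
        tokens[start:start + span] = [MASK] * span
    elif level == "motif":
        # Motif level: mask whole occurrences of known motifs.
        for motif in MOTIFS:
            start = seq.find(motif)
            if start != -1:
                tokens[start:start + len(motif)] = [MASK] * len(motif)
    return tokens

print(mask_sequence("AUGGACCUAUUUAGC", level="motif"))
```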

In addition, RNAErnie treats coarse-grained RNA types as special vocabulary tokens and appends the corresponding type token to the end of each RNA sequence during pre-training. In this way, the model can discern the distinctive characteristics of various RNA types, facilitating domain adaptation to a variety of downstream tasks.
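A minimal sketch of this type-token idea, assuming a small illustrative type set and a bracketed token format (the actual special-token syntax in RNAErnie may differ):

```python
# Illustrative: append a coarse-grained RNA type token to each sequence.
RNA_TYPES = {"miRNA", "lncRNA", "rRNA", "tRNA"}  # subset for illustration

def with_type_token(seq: str, rna_type: str) -> str:
    assert rna_type in RNA_TYPES
    # The type marker is registered in the model vocabulary as a special token.
    return f"{seq} [{rna_type}]"

print(with_type_token("AUGGCUACG", "miRNA"))  # -> 'AUGGCUACG [miRNA]'
```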


Figure: Model overview. (Source: Paper)

Specifically, the RNAErnie model consists of 12 Transformer layers. In the motif-aware pre-training phase, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database, using self-supervised learning with motif-aware multi-level random masking.
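For intuition, here is a rough PyTorch re-creation of the reported backbone shape (12 layers, hidden size 768, 512-token limit). The head count, feed-forward width, and vocabulary size are assumptions, and the original model was not necessarily implemented in PyTorch.

```python
# A sketch of a 12-layer, 768-dim Transformer encoder; hyperparameters partly assumed.
import torch
import torch.nn as nn

VOCAB_SIZE = 64   # nucleotides + special/type tokens; illustrative
MAX_LEN = 512     # sequences beyond this length are discarded (see Limitations)

class RNAEncoder(nn.Module):
    def __init__(self, d_model: int = 768, n_layers: int = 12, n_heads: int = 12):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(MAX_LEN, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.embed(token_ids) + self.pos(positions)
        return self.encoder(hidden)  # (batch, seq_len, 768)

model = RNAEncoder()
out = model(torch.randint(0, VOCAB_SIZE, (2, 128)))
print(out.shape)  # torch.Size([2, 128, 768])
```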


Figure: Motif-aware pre-training and type-guided fine-tuning strategies. (Source: Paper)

In the type-guided fine-tuning phase, RNAErnie first uses its output embeddings to predict likely coarse-grained RNA types, and then uses the predicted types as auxiliary information to fine-tune the model with task-specific heads.
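A minimal sketch of this two-stage head, reusing the RNAEncoder hidden states from the sketch above. The head shapes, the mean pooling, and feeding the predicted type distribution back by concatenation are illustrative assumptions about how the auxiliary signal could be wired in.

```python
# Illustrative type-guided head: predict the RNA type, then condition the task head on it.
import torch
import torch.nn as nn

N_TYPES = 20      # coarse-grained RNA types reported in the article
N_CLASSES = 13    # task-specific label count; illustrative

class TypeGuidedHead(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.type_head = nn.Linear(d_model, N_TYPES)
        self.task_head = nn.Linear(d_model + N_TYPES, N_CLASSES)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        pooled = hidden.mean(dim=1)            # (batch, 768) sequence embedding
        type_logits = self.type_head(pooled)   # stage 1: predict coarse-grained type
        type_probs = type_logits.softmax(dim=-1)
        # Stage 2: the predicted type distribution is auxiliary input to the task head.
        return self.task_head(torch.cat([pooled, type_probs], dim=-1))
```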

This approach enables the model to adapt to a wide range of RNA types and enhances its utility across diverse RNA analysis tasks.

More specifically, to accommodate distribution shifts between the pre-training dataset and the target domain, RNAErnie leverages domain adaptation to combine the pre-trained backbone with downstream modules in three neural architectures: frozen backbone with trainable heads (FBTH), trainable backbone with trainable heads (TBTH), and stacking for type-guided fine-tuning (STACK).

In this way, the proposed method can either optimize the backbone and task-specific heads end-to-end, or fine-tune task-specific heads on embeddings extracted from the frozen backbone, depending on the downstream application; the sketch below contrasts the first two setups.
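A short sketch of the FBTH/TBTH distinction, reusing the RNAEncoder and TypeGuidedHead classes from the sketches above; the optimizer choice and learning rates are illustrative, not the paper's settings.

```python
# Illustrative FBTH vs. TBTH setups; optimizer hyperparameters are assumptions.
import torch

backbone, head = RNAEncoder(), TypeGuidedHead()

# FBTH: freeze the backbone, train only the head on extracted embeddings.
for p in backbone.parameters():
    p.requires_grad = False
fbth_optim = torch.optim.AdamW(head.parameters(), lr=1e-3)

# TBTH: train backbone and head end to end (typically with a smaller learning rate).
for p in backbone.parameters():
    p.requires_grad = True
tbth_optim = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-5
)
```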

Performance evaluation


Figure: RNAErnie captures a multi-level ontology pattern. (Source: Paper)

The researchers evaluated the method, and the results showed that RNAErnie outperformed state-of-the-art methods across seven RNA sequence datasets covering more than 17,000 major RNA motifs, 20 RNA types, and 50,000 RNA sequences.


Figure: RNAErnie performance on RNA secondary structure prediction tasks using the ArchiveII600 and TS0 datasets. (Source: Paper)

RNAErnie was evaluated against 30 leading RNA sequence analysis methods, demonstrating its generalizability and robustness. The team used accuracy, precision, recall, F1 score, MCC, and AUC as evaluation metrics to ensure a fair comparison.
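For reference, all six reported metrics can be computed with scikit-learn; the toy labels and scores below are purely illustrative.

```python
# Illustrative computation of the six reported metrics on toy binary data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.6, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]  # threshold the scores at 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```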

To date, few studies have applied Transformer architectures enhanced with external knowledge to RNA sequence analysis. The RNAErnie framework integrates RNA sequence embeddings and self-supervised learning strategies from the ground up, yielding superior performance, interpretability, and generalization potential on downstream RNA tasks.

In addition, RNAErnie can be adapted to other tasks by modifying its outputs and supervision signals. RNAErnie is publicly available and is an effective tool for type-guided RNA analysis and advanced applications.

Limitations

While the RNAErnie model has innovated in RNA sequence analysis, it still faces some challenges.

First, the model is limited in the length of RNA sequences it can analyze: sequences longer than 512 nucleotides are discarded, potentially missing important structural and functional information. Chunking methods developed to process longer sequences can cause further loss of information about long-range interactions, as the sketch below illustrates.
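A minimal sketch of the kind of overlapping chunking such methods use to fit long sequences into a 512-token window; the overlap size is an illustrative assumption. Any interaction between positions that never share a window is invisible to the model, which is the information loss described above.

```python
# Illustrative overlapping chunking for sequences beyond the 512-nt window.
def chunk_sequence(seq: str, window: int = 512, overlap: int = 64) -> list[str]:
    step = window - overlap
    # Each chunk shares `overlap` nucleotides with the previous one.
    return [seq[i:i + window] for i in range(0, max(1, len(seq) - overlap), step)]

long_rna = "AUGC" * 300  # 1,200 nt, exceeds the 512-nt limit
print([len(c) for c in chunk_sequence(long_rna)])  # -> [512, 512, 304]
```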

Second, the scope of the study is narrow: it focuses only on the RNA domain and does not extend to tasks such as RNA-protein interaction prediction or binding-site identification. In addition, the model has difficulty accounting for three-dimensional RNA structural motifs, such as loops and junctions, which are critical for understanding RNA function.

Moreover, the existing post hoc architecture designs have potential limitations of their own.

Conclusion

Nonetheless, RNAErnie has great potential to advance RNA analysis. The model demonstrates its versatility and effectiveness as a universal solution in different downstream tasks.

In addition, the innovative strategies employed by RNAErnie are expected to enhance the performance of other pre-trained models in RNA analysis. These findings make RNAErnie a valuable asset, providing researchers with a powerful tool to unravel the complexities of RNA-related research.

Paper link: https://www.nature.com/articles/s42256-024-00836-4
