
Multi-functional RNA analysis: Baidu team's RNA language model published in Nature Machine Intelligence

Author: ScienceAI

Editor | Carrot Core

Pretrained language models have shown considerable promise in analyzing nucleotide sequences, but building a multifunctional model that performs well across different tasks with a single set of pretrained weights remains a challenge.

A team from Baidu's Big Data Lab (BDL) and Shanghai Jiao Tong University developed RNAErnie, an RNA-centric pre-trained model based on the Transformer architecture.

The researchers evaluated the model with seven datasets and five tasks, demonstrating the superiority of RNAErnie in both supervised and unsupervised learning.

RNAErnie outperformed the baseline with a 1.8% improvement in classification accuracy, a 2.2% increase in interaction prediction accuracy, and a 3.3% improvement in the structure prediction F1 score, demonstrating its robustness and adaptability.

The study, titled "Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning", was published in Nature Machine Intelligence on May 13, 2024.


RNA plays a key role in the central dogma of molecular biology, relaying genetic information from DNA to proteins.

RNA molecules play a vital role in a variety of cellular processes such as gene expression, regulation, and catalysis. Given the importance of RNA in biological systems, there is a growing need for efficient and accurate methods for the analysis of RNA sequences.

Traditional RNA sequence analysis relies on experimental techniques such as RNA sequencing and microarrays, but these methods are often costly, time-consuming, and require a large amount of RNA input.

To address these challenges, a team from Baidu BDL and Shanghai Jiao Tong University developed a pre-trained RNA language model: RNAErnie.

RNAErnie

The model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and contains multi-layer, multi-head Transformer blocks, each with a hidden-state dimension of 768. Pre-training was performed on an extensive corpus of approximately 23 million RNA sequences carefully selected from RNAcentral.
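As a rough illustration of the architecture just described, here is a minimal PyTorch sketch. The actual model is built on Baidu's ERNIE framework (PaddlePaddle), so the vocabulary size, head count, and all names below are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the stated architecture: 12 Transformer encoder layers
# with hidden size 768. The real RNAErnie is built on Baidu's ERNIE framework;
# vocabulary size and head count here are illustrative assumptions.
VOCAB_SIZE = 40   # assumption: A/U/C/G plus special and RNA-type tokens
MAX_LEN = 512     # the article notes inputs are capped at 512 nucleotides

class RNALanguageModel(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, hidden=768, layers=12, heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(MAX_LEN, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # masked-token prediction

    def forward(self, token_ids):                      # (batch, seq_len)
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        h = self.encoder(x)            # contextual embedding for every token
        return self.mlm_head(h)        # logits over the vocabulary
```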

The proposed motif-aware pre-training strategy combines base-level masking, subsequence-level masking, and motif-level random masking, which effectively captures subsequence- and motif-level knowledge and enriches the representation of RNA sequences.
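A hedged sketch of what such multi-level masking might look like follows; the sampling rules, masking rates, and span lengths are assumptions, since the paper specifies its own scheme.

```python
import random

MASK = "<mask>"

def mask_sequence(tokens, level, motif_spans=None, rate=0.15):
    """Illustrative sketch of the three masking granularities described above;
    the paper's exact sampling rules, rates, and span lengths may differ.
    tokens      -- single-nucleotide tokens, e.g. list("AUGC...")
    level       -- 'base', 'subsequence', or 'motif'
    motif_spans -- (start, end) index pairs of known motifs (assumed given)
    """
    out = list(tokens)
    if level == "base":                        # mask individual nucleotides
        for i in range(len(out)):
            if random.random() < rate:
                out[i] = MASK
    elif level == "subsequence":               # mask short contiguous spans
        span = 6                               # assumed span length
        for _ in range(max(1, int(rate * len(out) / span))):
            start = random.randrange(max(1, len(out) - span))
            for i in range(start, min(start + span, len(out))):
                out[i] = MASK
    elif level == "motif":                     # mask whole motif occurrences
        for start, end in motif_spans or []:
            if random.random() < 2 * rate:
                for i in range(start, end):
                    out[i] = MASK
    return out
```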

In addition, RNAErnie registers coarse-grained RNA types as special tokens in its vocabulary and appends the corresponding type token to the end of each RNA sequence during pre-training. In doing so, the model can discern the distinctive characteristics of various RNA types, facilitating domain adaptation to a variety of downstream tasks.
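Concretely, the type-token step might look like the following sketch; the token names are invented for illustration, as the paper defines its own special vocabulary.

```python
# Hedged sketch of appending a coarse-grained RNA-type token, as described
# above. The token names here are invented for illustration; the paper
# defines its own special vocabulary.
TYPE_TOKENS = {"miRNA": "<mirna>", "lncRNA": "<lncrna>", "tRNA": "<trna>"}

def add_type_token(seq_tokens, rna_type):
    # append the type marker to the end of the tokenized sequence
    return seq_tokens + [TYPE_TOKENS[rna_type]]

print(add_type_token(list("AUGGCUACG"), "miRNA"))
# ['A', 'U', 'G', 'G', 'C', 'U', 'A', 'C', 'G', '<mirna>']
```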


Figure: Model overview. (Source: Paper)

Specifically, the RNAErnie model consists of 12 Transformer layers. In the motif-aware pre-training phase, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database, using self-supervised learning with motif-aware multi-level random masking.


Figure: Motif-aware pre-training and type-guided fine-tuning strategies. (Source: Paper)

In the type-guided fine-tuning phase, RNAErnie first uses its output embeddings to predict likely coarse-grained RNA types, and then uses the predicted types as auxiliary information to fine-tune the model with task-specific heads.

This approach lets the model adapt to diverse RNA types and enhances its utility across a wide range of RNA analysis tasks.
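A minimal sketch of such a two-stage head is shown below, assuming a pooled sequence embedding and concatenation as the fusion step; both are assumptions, since the paper defines its own wiring.

```python
import torch
import torch.nn as nn

class TypeGuidedHead(nn.Module):
    """Illustrative two-stage head: predict a coarse-grained RNA type from a
    pooled backbone embedding, then condition the task head on that type.
    Pooling, fusion by concatenation, and the use of argmax here are
    assumptions; training could instead use ground-truth types."""
    def __init__(self, hidden=768, n_types=20, n_task_labels=2):
        super().__init__()
        self.type_clf = nn.Linear(hidden, n_types)     # stage 1: RNA type
        self.type_emb = nn.Embedding(n_types, hidden)  # type as auxiliary info
        self.task_head = nn.Linear(2 * hidden, n_task_labels)

    def forward(self, pooled):                         # pooled: (batch, hidden)
        type_logits = self.type_clf(pooled)
        pred_type = type_logits.argmax(dim=-1)         # predicted coarse type
        aux = self.type_emb(pred_type)
        task_logits = self.task_head(torch.cat([pooled, aux], dim=-1))
        return type_logits, task_logits
```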

More specifically, to accommodate distribution shifts between the pre-training dataset and the target domain, RNAErnie leverages domain adaptation to combine the pre-trained backbone with downstream modules in three neural architectures: a frozen backbone with trainable heads (FBTH), a trainable backbone with trainable heads (TBTH), and stacking for type-guided fine-tuning (STACK).

In this way, the proposed method can either optimize the backbone and task-specific heads end to end, or fine-tune task-specific heads on embeddings extracted from a frozen backbone, depending on the downstream application.
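The freeze/train distinction between these arrangements can be sketched as below; `backbone` and `head` stand for the pretrained encoder and a task head, and STACK is shown only schematically, since the paper defines its exact wiring.

```python
# Sketch of the three fine-tuning arrangements named above, assuming
# PyTorch-style modules. STACK is a schematic placeholder here.
def trainable_params(backbone, head, mode):
    if mode == "FBTH":        # frozen backbone, trainable head
        for p in backbone.parameters():
            p.requires_grad = False
        return list(head.parameters())
    if mode == "TBTH":        # trainable backbone and head, end to end
        return list(backbone.parameters()) + list(head.parameters())
    if mode == "STACK":       # type-guided stacking (schematic assumption)
        return list(backbone.parameters()) + list(head.parameters())
    raise ValueError(mode)

# usage (hypothetical):
# optimizer = torch.optim.AdamW(trainable_params(backbone, head, "FBTH"), lr=1e-4)
```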

Performance evaluation


Figure: RNAErnie captures a multi-level ontology pattern. (Source: Paper)

The researchers evaluated the method across seven RNA sequence datasets covering more than 17,000 major RNA motifs, 20 RNA types, and 50,000 RNA sequences, and the results showed that RNAErnie outperformed existing state-of-the-art methods.


Figure: RNAErnie performance on RNA secondary structure prediction tasks using the ArchiveII600 and TS0 datasets. (Source: Paper)

The evaluation included comparisons against 30 leading RNA sequence analysis methods, demonstrating the generalizability and robustness of RNAErnie. The team employed accuracy, precision, recall, F1 score, MCC, and AUC as evaluation metrics to ensure a fair comparison across methods.
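For reference, all six metrics the article lists can be computed with scikit-learn; the labels and scores below are made up purely to show the metric set, not data from the paper.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

# Hypothetical predictions, only to demonstrate the metric set listed above.
y_true  = [1, 0, 1, 1, 0, 1]
y_pred  = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]   # predicted probabilities for AUC

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```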

To date, few studies have applied Transformer architectures augmented with external knowledge to the analysis of RNA sequence data. The RNAErnie framework integrates RNA sequence embeddings and self-supervised learning strategies from the ground up, resulting in superior performance, interpretability, and generalization potential on downstream RNA tasks.

In addition, RNAErnie can be adapted to other tasks by modifying its outputs and supervision signals. RNAErnie is publicly available and serves as an effective tool for type-guided RNA analysis and advanced applications.

Limitations

While the RNAErnie model brings innovations to RNA sequence analysis, it still faces some challenges.

First, the model is limited in the length of RNA sequences it can analyze: sequences longer than 512 nucleotides are discarded, potentially missing important structural and functional information. Chunking methods developed to process longer sequences can cause further loss of information about long-range interactions.
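For illustration, a simple sliding-window chunking scheme might look like the following; the overlap parameter is an assumption, and, as noted above, interactions spanning chunk boundaries are still lost.

```python
def chunk_sequence(seq, max_len=512, overlap=64):
    """Hedged sketch of sliding-window chunking for sequences longer than
    the 512-nt input limit. The overlap value is an assumption; long-range
    interactions across chunk boundaries remain invisible to the model."""
    step = max_len - overlap
    return [seq[i:i + max_len] for i in range(0, max(1, len(seq) - overlap), step)]

chunks = chunk_sequence("AUGC" * 400)   # a 1,600-nt toy sequence
print([len(c) for c in chunks])         # [512, 512, 512, 256]
```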

Second, the scope of this study is narrow, focusing only on the RNA domain without extending to tasks such as RNA-protein interaction prediction or binding-site identification. In addition, the model has difficulty accounting for three-dimensional structural motifs of RNA, such as loops and junctions, which are critical for understanding RNA function.

Moreover, the existing post-hoc architecture designs have potential limitations.

Conclusion

Nonetheless, RNAErnie holds great potential for advancing RNA analysis. The model demonstrates its versatility and effectiveness as a general-purpose solution across diverse downstream tasks.

In addition, the innovative strategies employed by RNAErnie are expected to enhance the performance of other pre-trained models in RNA analysis. These findings make RNAErnie a valuable asset, providing researchers with a powerful tool to unravel the complexities of RNA-related research.

Paper link: https://www.nature.com/articles/s42256-024-00836-4
