
An updated large language foundation model for single-cell biology, pre-trained on more than 33 million cells

Author: ScienceAI

Editor | Violet

Not long ago, a research team at the University of Toronto released the first large language model for single-cell biology, scGPT, pre-trained on more than 10 million cells.

Now the team has released the first major update to scGPT, generatively pre-trained on more than 33 million cells.

Bo Wang, the paper's corresponding author and an assistant professor at the University of Toronto, tweeted: "Exciting scGPT update: with a lot of community attention since its release in April, we are pleased to announce the first major update to scGPT, the foundation model for single-cell multi-omics data."


The updated study, titled "scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI," was released on the bioRxiv preprint platform on July 2, 2023.


Paper: https://biorxiv.org/content/10.1101/2023.04.30.538439

Open source code and models: https://github.com/bowang-lab/scGPT

Detailed tutorial: https://scgpt.readthedocs.io/en/latest/

Several highlights of the scGPT update

Highlights of this update include:

  • Launch of the first GPT-style foundation model for single-cell multi-omics data, pre-trained on over 33 million cells of human cell atlas data.
  • A generalist approach that lets one model accomplish multiple single-cell analysis tasks, including multi-omics integration and perturbation prediction.
  • Use of learned attention weights and gene embeddings to discover gene-gene interactions specific to various conditions.
  • A scaling law: model performance continues to improve as the amount of pre-training data increases.
  • The scGPT model zoo (see the GitHub repository) now offers pre-trained foundation models for a variety of solid organs as well as a comprehensive pan-cancer model, so you can start exploring your data with the most appropriate base model.

One Twitter user commented: "Absolutely amazing... Good stuff!"


How is it done?

Here, the researchers build scGPT, a single-cell foundation model, by generative pre-training on more than 33 million cells, and introduce new techniques to address the methodological and engineering challenges of pre-training on large-scale single-cell omics data.

The updated scGPT model was pre-trained on more than 33 million cells, up from over 10 million in the previous release.

To handle data at this scale, the researchers use in-memory data structures that store hundreds of datasets and allow fast access. They establish a unified generative pre-training workflow tailored to non-sequential omics data and adapt the Transformer architecture to learn cell and gene representations simultaneously. In addition, they provide generic fine-tuning pipelines with task-specific objectives, designed to ease the application of the pre-trained model to a range of downstream tasks.
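To make the idea of tokenizing non-sequential expression data concrete, here is a minimal sketch in Python (not the actual scGPT pipeline): each cell's nonzero expression values are binned per cell into discrete value tokens that are paired with gene-ID tokens. The bin count and the toy gene vocabulary are assumptions for illustration.

```python
import numpy as np

def tokenize_cell(expr: np.ndarray, gene_ids: np.ndarray, n_bins: int = 51):
    """Turn one cell's expression vector into (gene token, value token) pairs.

    expr     : expression counts for the genes measured in this cell
    gene_ids : integer IDs of those genes in a fixed gene vocabulary
    n_bins   : number of discrete expression bins (an assumed hyperparameter)
    """
    nonzero = expr > 0
    expr, gene_ids = expr[nonzero], gene_ids[nonzero]
    # Bin the nonzero values per cell so that value tokens stay comparable
    # across cells with very different sequencing depths.
    edges = np.unique(np.quantile(expr, np.linspace(0.0, 1.0, n_bins)))
    value_tokens = np.digitize(expr, edges[1:-1])  # discrete bin indices
    return gene_ids, value_tokens

# Toy example with a 5-gene vocabulary
expr = np.array([0.0, 3.0, 1.0, 0.0, 7.0])
gene_ids = np.arange(5)
print(tokenize_cell(expr, gene_ids))
```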

This update incorporates community feedback and leverages the latest data released by CELLxGENE. The updated scGPT has larger pre-training data, more robust models, and an expanded range of application tasks.

The researchers retrieved more than 10.3 million human PBMC scRNA-seq profiles from the CELLxGENE portal for pre-training of the foundation model. A total of 65 datasets were collected from CELLxGENE by filtering on organism (Homo sapiens), tissue (blood, bone marrow), and disease.
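For readers who want to assemble a similar subset themselves, the sketch below shows one way to query the CELLxGENE Census with the cellxgene_census Python package. The filter expression and field values are assumptions based on the filters described above; the study itself collected 65 curated datasets from the CELLxGENE portal rather than necessarily using this exact query.

```python
import cellxgene_census

# A minimal sketch, assuming the cellxgene_census package; the filter fields
# ("tissue_general", "disease") and their values are assumptions here.
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter='tissue_general in ["blood", "bone marrow"] and disease == "normal"',
    )

print(adata)  # AnnData object with the matching cells and their metadata
```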


Data: https://cellxgene.cziscience.com/

The updated scGPT demonstrates the transformative potential of single-cell foundation models in three key areas.

  • First, scGPT is the first large-scale generative foundation model that enables transfer learning across a variety of downstream tasks. By achieving state-of-the-art performance in cell type annotation, genetic perturbation prediction, batch correction, and multi-omics integration, it demonstrates "universal pre-training, fine-tuning on demand" as a general recipe for single-cell omics applications. Notably, scGPT is the only foundation model that can integrate multiple single-cell omics, including scATAC-seq data.
  • Second, by comparing gene embeddings and attention weights between the fine-tuned model and the original pre-trained model, scGPT reveals valuable biological insights into gene-gene interactions specific to conditions such as cell type and perturbation state (see the sketch after this list).
  • Third, the observations reveal a scaling law: larger amounts of pre-training data produce better pre-trained embeddings and further improve downstream-task performance. This finding highlights the exciting prospect that the foundation model can be continuously improved as the sequencing data available to the research community grows.
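As a rough illustration of how attention weights can be turned into condition-specific gene-gene interaction scores, the sketch below averages captured attention maps over heads and layers and symmetrizes the result. The aggregation scheme and the helper's interface are assumptions for illustration, not the procedure reported in the paper.

```python
import torch

def gene_interaction_scores(attn_layers, gene_names):
    """Aggregate attention maps into a symmetric gene-gene score dictionary.

    attn_layers : list of tensors of shape (n_heads, n_genes, n_genes),
                  one per transformer layer, captured during a forward pass
    gene_names  : gene symbols in the same order as the token positions
    Averaging over heads and layers is one simple aggregation choice.
    """
    per_layer = torch.stack([layer.mean(dim=0) for layer in attn_layers])
    scores = per_layer.mean(dim=0)
    scores = 0.5 * (scores + scores.T)   # symmetrize: undirected interactions
    n = len(gene_names)
    return {(gene_names[i], gene_names[j]): scores[i, j].item()
            for i in range(n) for j in range(i + 1, n)}

# Toy usage: two layers, two heads, three genes
attn = [torch.rand(2, 3, 3), torch.rand(2, 3, 3)]
print(gene_interaction_scores(attn, ["CD3D", "CD8A", "MS4A1"]))
```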

Based on these findings, the adoption of pre-trained foundation models promises to greatly expand researchers' understanding of cell biology and lay a solid foundation for future discoveries. The release of the scGPT models and workflows is intended to enhance and accelerate research in these and other areas.

Updated scGPT: Pre-trained on over 33 million cells

scGPT, the first foundation model in the single-cell domain, adopts a generative pre-training approach. The core model consists of stacked transformer layers with multi-head attention that generate cell and gene embeddings simultaneously. scGPT has two phases: an initial general-purpose pre-training on large cell atlases, followed by fine-tuning on smaller datasets for specific applications (Figure 1A-C).
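The sketch below shows, in simplified PyTorch, what a transformer encoder that produces both gene embeddings and a cell embedding (via a reserved <cls> position) can look like. The dimensions, layer counts, and the <cls>-based pooling are illustrative assumptions, not the published scGPT configuration.

```python
import torch
import torch.nn as nn

class MiniCellTransformer(nn.Module):
    """Toy encoder: gene-ID + binned-expression tokens -> gene and cell embeddings."""

    def __init__(self, n_genes: int, n_bins: int, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes + 1, d_model)   # +1 for a <cls> token
        self.value_emb = nn.Embedding(n_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls_id = n_genes                                 # reserved <cls> gene ID

    def forward(self, gene_ids, value_bins):
        # Prepend a <cls> position whose output serves as the cell embedding.
        b = gene_ids.size(0)
        cls = torch.full((b, 1), self.cls_id, dtype=torch.long, device=gene_ids.device)
        genes = torch.cat([cls, gene_ids], dim=1)
        values = torch.cat([torch.zeros_like(cls), value_bins], dim=1)
        h = self.encoder(self.gene_emb(genes) + self.value_emb(values))
        cell_emb, gene_embs = h[:, 0], h[:, 1:]
        return cell_emb, gene_embs

# Toy usage: a batch of 2 cells, each with 16 gene tokens
model = MiniCellTransformer(n_genes=20000, n_bins=51)
cell_emb, gene_embs = model(torch.randint(0, 20000, (2, 16)),
                            torch.randint(0, 51, (2, 16)))
```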

In the pre-training phase, specially designed attention masks and a generative training pipeline are introduced to train scGPT in a self-supervised manner, jointly optimizing cell and gene representations. This technique accommodates the non-sequential nature of gene expression within an NLG-style framework originally built for sequence prediction.

During training, the model gradually learns to generate the gene expression of a cell conditioned on cell states or on gene-expression cues.
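Conceptually, this can be pictured as an iterative loop in which the model fills in unknown gene expression from a set of known cues, promoting its most confident predictions into the known set each round. The sketch below assumes a hypothetical model interface that returns per-gene bin logits; the promotion fraction is likewise an assumption.

```python
import torch

@torch.no_grad()
def iterative_generation(model, gene_ids, bins, known_mask, n_rounds: int = 3):
    """Conceptual sketch of iterative generative prediction of expression bins.

    `model` is a hypothetical callable returning logits of shape
    (n_genes, n_bins) given the current tokens and the known-gene mask.
    """
    bins, mask = bins.clone(), known_mask.clone()
    for _ in range(n_rounds):
        if mask.all():
            break
        logits = model(gene_ids, bins, mask)              # assumed interface
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf[mask] = float("-inf")                        # skip genes already known
        k = max(1, int(0.3 * (~mask).sum().item()))       # promote top 30% per round
        top = conf.topk(k).indices
        bins[top], mask[top] = pred[top], True
    return bins
```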

During the fine-tuning phase, the pre-trained model can be adapted to new datasets and specific tasks. The researchers provide flexible fine-tuning pipelines for a variety of important downstream tasks in single-cell research.
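As one example of such a fine-tuning pipeline, cell type annotation can be framed as attaching a classification head to the pre-trained encoder and training on a labeled reference dataset. The sketch below reuses the toy encoder interface from the earlier sketch and is an illustrative setup, not the packaged scGPT fine-tuning recipe.

```python
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    """Fine-tuning sketch: a linear classification head on a pre-trained encoder.

    `pretrained` is assumed to behave like the MiniCellTransformer sketched
    above, returning (cell embedding, per-gene embeddings).
    """
    def __init__(self, pretrained: nn.Module, d_model: int, n_cell_types: int):
        super().__init__()
        self.backbone = pretrained
        self.head = nn.Linear(d_model, n_cell_types)

    def forward(self, gene_ids, value_bins):
        cell_emb, _ = self.backbone(gene_ids, value_bins)
        return self.head(cell_emb)

# Typical fine-tuning step on an annotated reference dataset:
# clf = CellTypeClassifier(pretrained_model, d_model=128, n_cell_types=30)
# optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-4)
# loss = nn.functional.cross_entropy(clf(gene_ids, value_bins), labels)
# loss.backward(); optimizer.step()
```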


Figure 1: Overview of the scGPT model. (Source: Paper)

To collect diverse and extensive sequencing data for self-supervised pre-training of scGPT, the researchers gathered scRNA-seq data from 33 million human cells under normal (non-disease) conditions via the CELLxGENE collection (Figure 1D). This comprehensive dataset covers multiple cell types from 51 organs/tissues and 441 studies, providing a rich representation of cellular heterogeneity across the human body.

After pre-training, scGPT cell embeddings for 10% of the 33 million cells were visualized with UMAP (Figure 1E). The resulting UMAP plot shows good separation, with cell types in local regions and clusters accurately distinguished by different colors. Given that the dataset includes more than 400 studies, this demonstrates the pre-training's strong ability to mitigate technical batch effects.
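A visualization like this can be reproduced with standard scanpy calls once the cell embeddings are stored on an AnnData object; in the sketch below the file name, the embedding key "X_scgpt", and the "cell_type" column are assumed placeholders.

```python
import scanpy as sc

# Assumed input: an .h5ad file with the sampled cells, whose .obsm["X_scgpt"]
# holds pre-trained cell embeddings and whose .obs["cell_type"] holds labels.
adata = sc.read_h5ad("pretrain_subset_10pct.h5ad")

sc.pp.neighbors(adata, use_rep="X_scgpt")   # neighbor graph on the embeddings
sc.tl.umap(adata)                           # 2-D UMAP coordinates
sc.pl.umap(adata, color="cell_type", save="_scgpt_pretrained.png")
```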

The findings suggest that scGPT effectively extracts key biological insights about genes and cells. With further adaptation through transfer learning, scGPT can be optimized to achieve state-of-the-art performance in a variety of downstream tasks, including multi-batch integration, multi-omics integration, cell type annotation, genetic perturbation prediction, and gene network inference.

For the future, the researchers plan to pre-train on larger, more diverse datasets.

Reference: https://twitter.com/BoWang87/status/1676056025072320512
