
An updated large language foundation model for single-cell biology, pre-trained on more than 33 million cells

Author: ScienceAI

Editor | Violet

Not long ago, a research team at the University of Toronto released the first large language model for single-cell biology, scGPT, pre-trained on more than 10 million cells.

Now the team has released the first major update to scGPT, generatively pre-trained on more than 33 million cells.

Bo Wang, the paper's corresponding author and an assistant professor at the University of Toronto, tweeted: "Exciting scGPT update: with a lot of community attention since its release in April, we are pleased to announce the first major update to scGPT, the foundation model for single-cell multi-omics data."


The updated study, titled "scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI," was released on the bioRxiv preprint platform on July 2, 2023.


Paper: https://biorxiv.org/content/10.1101/2023.04.30.538439

Open source code and models: https://github.com/bowang-lab/scGPT

Detailed tutorial: https://scgpt.readthedocs.io/en/latest/

Several highlights of the scGPT update

Highlights of this update include:

  • Launch of the first GPT-style foundation model for single-cell multi-omics data, pre-trained on over 33 million cells of human cell atlas data.
  • A generalist approach that lets one model accomplish multiple single-cell analysis tasks, including multi-omics integration and perturbation prediction.
  • Use of learned attention weights and gene embeddings to discover gene-gene interactions specific to various conditions.
  • A scaling law: model performance continues to improve as the amount of pre-training data increases.
  • The scGPT model zoo (see the GitHub repository) now offers pre-trained foundation models for a variety of solid organs as well as a comprehensive pan-cancer model, so you can start exploring your data with the most appropriate base model.

One Twitter user commented: "Absolutely amazing... Good stuff!"


How is it done?

Here, the researchers build scGPT, a single-cell foundation model, by generative pre-training on more than 33 million cells, and introduce new techniques to address the methodological and engineering challenges of pre-training on large-scale single-cell omics data.

The updated scGPT model was pre-trained on more than 33 million cells, up from over 10 million in the previous release.

To handle data at this scale, the researchers use in-memory data structures that store hundreds of datasets and allow fast access. They establish a unified generative pre-training workflow tailored to non-sequential omics data and adapt the Transformer architecture to learn cell and gene representations simultaneously. In addition, they provide generic fine-tuning pipelines with task-specific objectives, designed to ease the application of the pre-trained model to a range of downstream tasks.
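To make the idea of tokenizing non-sequential expression data concrete, here is a minimal sketch in Python (not the actual scGPT pipeline): each cell's nonzero expression values are binned per cell into discrete value tokens that are paired with gene-ID tokens. The bin count and the toy gene vocabulary are assumptions for illustration.

```python
import numpy as np

def tokenize_cell(expr: np.ndarray, gene_ids: np.ndarray, n_bins: int = 51):
    """Turn one cell's expression vector into (gene token, value token) pairs.

    expr     : expression counts for the genes measured in this cell
    gene_ids : integer IDs of those genes in a fixed gene vocabulary
    n_bins   : number of discrete expression bins (an assumed hyperparameter)
    """
    nonzero = expr > 0
    expr, gene_ids = expr[nonzero], gene_ids[nonzero]
    # Bin the nonzero values per cell so that value tokens stay comparable
    # across cells with very different sequencing depths.
    edges = np.unique(np.quantile(expr, np.linspace(0.0, 1.0, n_bins)))
    value_tokens = np.digitize(expr, edges[1:-1])  # discrete bin indices
    return gene_ids, value_tokens

# Toy example with a 5-gene vocabulary
expr = np.array([0.0, 3.0, 1.0, 0.0, 7.0])
gene_ids = np.arange(5)
print(tokenize_cell(expr, gene_ids))
```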

This update incorporates community feedback and leverages the latest data released by CELLxGENE. The updated scGPT has larger pre-training data, more robust models, and an expanded range of application tasks.

The researchers retrieved more than 10.3 million human PBMC scRNA-seq profiles from the CELLxGENE portal for pre-training of the foundation model. A total of 65 datasets were collected from CELLxGENE by filtering on organism (Homo sapiens), tissue (blood, bone marrow), and disease.
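For readers who want to assemble a similar subset themselves, the sketch below shows one way to query the CELLxGENE Census with the cellxgene_census Python package. The filter expression and field values are assumptions based on the filters described above; the study itself collected 65 curated datasets from the CELLxGENE portal rather than necessarily using this exact query.

```python
import cellxgene_census

# A minimal sketch, assuming the cellxgene_census package; the filter fields
# ("tissue_general", "disease") and their values are assumptions here.
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter='tissue_general in ["blood", "bone marrow"] and disease == "normal"',
    )

print(adata)  # AnnData object with the matching cells and their metadata
```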


Data: https://cellxgene.cziscience.com/

The updated scGPT demonstrates the transformative potential of single-cell foundation models in three key areas.

  • First, scGPT is the first large-scale generative foundation model that enables transfer learning across a variety of downstream tasks. By achieving state-of-the-art performance in cell type annotation, genetic perturbation prediction, batch correction, and multi-omics integration, it demonstrates "universal pre-training, fine-tuning on demand" as a general recipe for single-cell omics applications. Notably, scGPT is the only foundation model that can integrate multiple single-cell omics, including scATAC-seq data.
  • Second, by comparing gene embeddings and attention weights between the fine-tuned model and the original pre-trained model, scGPT reveals valuable biological insights into gene-gene interactions specific to conditions such as cell type and perturbation state (see the sketch after this list).
  • Third, the observations reveal a scaling law: larger amounts of pre-training data produce better pre-trained embeddings and further improve downstream-task performance. This finding highlights the exciting prospect that the foundation model can be continuously improved as the sequencing data available to the research community grows.
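As a rough illustration of how attention weights can be turned into condition-specific gene-gene interaction scores, the sketch below averages captured attention maps over heads and layers and symmetrizes the result. The aggregation scheme and the helper's interface are assumptions for illustration, not the procedure reported in the paper.

```python
import torch

def gene_interaction_scores(attn_layers, gene_names):
    """Aggregate attention maps into a symmetric gene-gene score dictionary.

    attn_layers : list of tensors of shape (n_heads, n_genes, n_genes),
                  one per transformer layer, captured during a forward pass
    gene_names  : gene symbols in the same order as the token positions
    Averaging over heads and layers is one simple aggregation choice.
    """
    per_layer = torch.stack([layer.mean(dim=0) for layer in attn_layers])
    scores = per_layer.mean(dim=0)
    scores = 0.5 * (scores + scores.T)   # symmetrize: undirected interactions
    n = len(gene_names)
    return {(gene_names[i], gene_names[j]): scores[i, j].item()
            for i in range(n) for j in range(i + 1, n)}

# Toy usage: two layers, two heads, three genes
attn = [torch.rand(2, 3, 3), torch.rand(2, 3, 3)]
print(gene_interaction_scores(attn, ["CD3D", "CD8A", "MS4A1"]))
```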

Based on these findings, the adoption of pre-trained foundation models promises to greatly expand researchers' understanding of cell biology and lay a solid foundation for future discoveries. The release of the scGPT models and workflows is intended to enhance and accelerate research in these and other areas.

Updated scGPT: Pre-trained on over 33 million cells

scGPT, the first foundation model in the single-cell domain, adopts a generative pre-training approach. The core model consists of stacked transformer layers with multi-head attention that generate cell and gene embeddings simultaneously. scGPT has two phases: an initial general-purpose pre-training on large cell atlases, followed by fine-tuning on smaller datasets for specific applications (Figure 1A-C).
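The sketch below shows, in simplified PyTorch, what a transformer encoder that produces both gene embeddings and a cell embedding (via a reserved <cls> position) can look like. The dimensions, layer counts, and the <cls>-based pooling are illustrative assumptions, not the published scGPT configuration.

```python
import torch
import torch.nn as nn

class MiniCellTransformer(nn.Module):
    """Toy encoder: gene-ID + binned-expression tokens -> gene and cell embeddings."""

    def __init__(self, n_genes: int, n_bins: int, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes + 1, d_model)   # +1 for a <cls> token
        self.value_emb = nn.Embedding(n_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls_id = n_genes                                 # reserved <cls> gene ID

    def forward(self, gene_ids, value_bins):
        # Prepend a <cls> position whose output serves as the cell embedding.
        b = gene_ids.size(0)
        cls = torch.full((b, 1), self.cls_id, dtype=torch.long, device=gene_ids.device)
        genes = torch.cat([cls, gene_ids], dim=1)
        values = torch.cat([torch.zeros_like(cls), value_bins], dim=1)
        h = self.encoder(self.gene_emb(genes) + self.value_emb(values))
        cell_emb, gene_embs = h[:, 0], h[:, 1:]
        return cell_emb, gene_embs

# Toy usage: a batch of 2 cells, each with 16 gene tokens
model = MiniCellTransformer(n_genes=20000, n_bins=51)
cell_emb, gene_embs = model(torch.randint(0, 20000, (2, 16)),
                            torch.randint(0, 51, (2, 16)))
```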

In the pre-training phase, specially designed attention masks and a generative training pipeline are introduced to train scGPT in a self-supervised manner, jointly optimizing cell and gene representations. This technique accommodates the non-sequential nature of gene expression within an NLG-style framework originally built for sequence prediction.

During training, the model gradually learns to generate the gene expression of a cell conditioned on cell states or on gene-expression cues.
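Conceptually, this can be pictured as an iterative loop in which the model fills in unknown gene expression from a set of known cues, promoting its most confident predictions into the known set each round. The sketch below assumes a hypothetical model interface that returns per-gene bin logits; the promotion fraction is likewise an assumption.

```python
import torch

@torch.no_grad()
def iterative_generation(model, gene_ids, bins, known_mask, n_rounds: int = 3):
    """Conceptual sketch of iterative generative prediction of expression bins.

    `model` is a hypothetical callable returning logits of shape
    (n_genes, n_bins) given the current tokens and the known-gene mask.
    """
    bins, mask = bins.clone(), known_mask.clone()
    for _ in range(n_rounds):
        if mask.all():
            break
        logits = model(gene_ids, bins, mask)              # assumed interface
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf[mask] = float("-inf")                        # skip genes already known
        k = max(1, int(0.3 * (~mask).sum().item()))       # promote top 30% per round
        top = conf.topk(k).indices
        bins[top], mask[top] = pred[top], True
    return bins
```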

During the fine-tuning phase, the pre-trained model can be adapted to new datasets and specific tasks. The researchers provide flexible fine-tuning pipelines for a variety of important downstream tasks in single-cell research.
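As one example of such a fine-tuning pipeline, cell type annotation can be framed as attaching a classification head to the pre-trained encoder and training on a labeled reference dataset. The sketch below reuses the toy encoder interface from the earlier sketch and is an illustrative setup, not the packaged scGPT fine-tuning recipe.

```python
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    """Fine-tuning sketch: a linear classification head on a pre-trained encoder.

    `pretrained` is assumed to behave like the MiniCellTransformer sketched
    above, returning (cell embedding, per-gene embeddings).
    """
    def __init__(self, pretrained: nn.Module, d_model: int, n_cell_types: int):
        super().__init__()
        self.backbone = pretrained
        self.head = nn.Linear(d_model, n_cell_types)

    def forward(self, gene_ids, value_bins):
        cell_emb, _ = self.backbone(gene_ids, value_bins)
        return self.head(cell_emb)

# Typical fine-tuning step on an annotated reference dataset:
# clf = CellTypeClassifier(pretrained_model, d_model=128, n_cell_types=30)
# optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-4)
# loss = nn.functional.cross_entropy(clf(gene_ids, value_bins), labels)
# loss.backward(); optimizer.step()
```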


Figure 1: Overview of the scGPT model. (Source: Paper)

To collect diverse and extensive sequencing data for self-supervised pre-training of scGPT, the researchers gathered scRNA-seq data from 33 million human cells under normal (non-disease) conditions via the CELLxGENE collection (Figure 1D). This comprehensive dataset covers multiple cell types from 51 organs/tissues and 441 studies, providing a rich representation of cellular heterogeneity across the human body.

After pre-training, scGPT cell embeddings for 10% of the 33 million cells were visualized with UMAP (Figure 1E). The resulting UMAP plot shows good separation, with cell types in local regions and clusters accurately distinguished by different colors. Given that the dataset includes more than 400 studies, this demonstrates the pre-training's strong ability to mitigate technical batch effects.
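A visualization like this can be reproduced with standard scanpy calls once the cell embeddings are stored on an AnnData object; in the sketch below the file name, the embedding key "X_scgpt", and the "cell_type" column are assumed placeholders.

```python
import scanpy as sc

# Assumed input: an .h5ad file with the sampled cells, whose .obsm["X_scgpt"]
# holds pre-trained cell embeddings and whose .obs["cell_type"] holds labels.
adata = sc.read_h5ad("pretrain_subset_10pct.h5ad")

sc.pp.neighbors(adata, use_rep="X_scgpt")   # neighbor graph on the embeddings
sc.tl.umap(adata)                           # 2-D UMAP coordinates
sc.pl.umap(adata, color="cell_type", save="_scgpt_pretrained.png")
```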

The findings suggest that scGPT effectively extracts key biological insights about genes and cells. With further adaptation through transfer learning, scGPT can be optimized to achieve state-of-the-art performance in a variety of downstream tasks, including multi-batch integration, multi-omics integration, cell type annotation, genetic perturbation prediction, and gene network inference.

For the future, the researchers plan to pre-train on larger, more diverse datasets.

Reference: https://twitter.com/BoWang87/status/1676056025072320512
