跟着Cell學單細胞轉錄組分析(十二):轉錄因子分析

轉錄因子分析可以了解細胞異質性背後的基因調控網絡的異質性。轉錄因子分析也是單細胞轉錄組常見的分析内容，R語言分析一般采用的是SCENIC包，具體原理可參考兩篇文章。1、《SCENIC : single-cell regulatory networkinference and clustering》。2、《Ascalable SCENIC workflow for single-cell gene regulatory network analysis》。但是說在前頭，SCENIC的計算量超級大，非常耗費記憶體和時間，如非必要，不要用一般的電腦分析嘗試。可以借助伺服器完成分析，或者減少分析細胞數，再或者使用SCENIC的Python版本。這裡我們也是僅僅進行示範，資料沒有實際意義，人為減少了基因與細胞，然而就這也很費時間。重要的是看看流程。

首先開始前，需要做兩件事。第一毫無疑問是安裝和加載R包，需要的比較多，如果沒有請安裝。第二則是下載下傳基因注釋配套資料庫。

library(Seurat)
library(tidyverse)
library(foreach)
library(RcisTarget)
library(doParallel)
library(SCopeLoomR)
library(AUCell)
BiocManager::install(c("doMC", "doRNG"))
library(doRNG)
BiocManager::install("GENIE3")
library(GENIE3)
#if (!requireNamespace("devtools", quietly = TRUE)) 
devtools::install_github("aertslab/SCENIC") 
packageVersion("SCENIC")
library(SCENIC)
#這裡下載下傳人的
dbFiles <- c("https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-500bp-upstream-7species.mc9nr.feather",
             "https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-tss-centered-10kb-7species.mc9nr.feather")
for(featherURL in dbFiles)
{
  download.file(featherURL, destfile=basename(featherURL)) 
}

接着就是建構分析檔案。

#建構分析資料
exprMat <- as.matrix([email protected][email protected])#表達矩陣
exprMat[1:4,1:4]#檢視資料
cellInfo <- [email protected][,c("celltype","nCount_RNA","nFeature_RNA")]
colnames(cellInfo) <- c('CellType', 'nGene' ,'nUMI')
head(cellInfo)
table(cellInfo$CellType)
#建構scenicOptions對象，接下來的SCENIC分析都是基于這個對象的資訊生成的
scenicOptions <- initializeScenic(org = "hgnc", dbDir = "F:/cisTarget_databases", nCores = 1)

建構共表達網絡，最後一步很費時間。

# Co-expression network
genesKept <- geneFiltering(exprMat, scenicOptions)
exprMat.filtered <- exprMat[genesKept, ]
exprMat.filtered[1:4,1:4]
runCorrelation(exprMat.filtered, scenicOptions)
exprMat.filtered.log <- log2(exprMat.filtered + 1)
runGenie3(exprMat.filtered.log, scenicOptions)
#Using 676 TFs as potential regulators...
#Running GENIE3 part 1
#Running GENIE3 part 10
#Running GENIE3 part 2
#Running GENIE3 part 3
#Running GENIE3 part 4
#Running GENIE3 part 5
#Running GENIE3 part 6
#Running GENIE3 part 7
#Running GENIE3 part 8
#Running GENIE3 part 9
#Finished running GENIE3.
#Warning message:
#In runGenie3(exprMat.filtered.log, scenicOptions) :
#  Only 676 (37%) of the 1839 TFs in the database were found in the dataset. Do they use the same gene IDs?

建構基因調控網絡GRN并進行AUC評分。也是耗費時間的過程。運作完成的結果就是整個分析得到的内容，需要按照自己的目的去篩選。

# Build and score the GRN
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions)
exprMat_log <- log2(exprMat + 1)
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions,exprMat_log)
scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
saveRDS(scenicOptions, file = "int/scenicOptions.Rds")

以下是運作記錄

>scenicOptions <- runSCENIC_2_createRegulons(scenicOptions)
13:21  Step 2. Identifying regulons
tfModulesSummary:
                        [,1]
top5perTargetAndtop3sd     1
top5perTargetAndtop50      1
top1sdAndtop10perTarget    2
top50perTargetAndtop1sd    2
top50Andtop10perTarget     3
w0.005                    27
w0.005Andtop1sd          149
top50perTarget           174
top50Andtop3sd           236
top3sd                   436
top50                    436
w0.005Andtop50perTarget  500
top1sd                   523
top5perTarget            617
top10perTarget           671
w0.001                   676
13:21  RcisTarget: Calculating AUC
Scoring database:  [Source file: hg19-500bp-upstream-7species.mc9nr.feather]
Scoring database:  [Source file: hg19-tss-centered-10kb-7species.mc9nr.feather]
Not all characters in C:/Users/liuhl/Desktop/1.R could be decoded using CP936. To try a different encoding, choose "File | Reopen with Encoding..." from the main menu.17:17  RcisTarget: Adding motif annotation
Using BiocParallel...
Using BiocParallel...
Number of motifs in the initial enrichment: 1993247
Number of motifs annotated to the matching TF: 22290
17:38  RcisTarget: Pruning targets
19:04  Number of motifs that support the regulons: 12551
  Preview of motif enrichment saved as: output/Step2_MotifEnrichment_preview.html
There were 13 warnings (use warnings() to see them)
> exprMat_log <- log2(exprMat + 1)
> scenicOptions <- runSCENIC_3_scoreCells(scenicOptions,exprMat_log)
19:06  Step 3. Analyzing the network activity in each individual cell
  Number of regulons to evaluate on cells: 318
Biggest (non-extended) regulons: 
   ELF1 (1760g)
   ETS1 (1734g)
   FLI1 (1604g)
   ELK3 (1493g)
   POLR2A (1453g)
   CHD2 (1251g)
   ETV3 (1249g)
   ELK4 (1148g)
   TAF1 (974g)
   ERG (956g)
Quantiles for the number of genes detected by cell: 
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
   min     1%     5%    10%    50%   100% 
205.00 224.76 276.90 321.40 695.00 997.00 
Warning in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores,  :
  Using only the first 224.76 genes (aucMaxRank) to calculate the AUC.
19:07  Finished running AUCell.
19:07  Plotting heatmap...
19:07  Plotting t-SNEs...
Warning message:
In max(densCurve$y[nextMaxs]) : max裡所有的參數都不存在；回覆-Inf
> scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
Binary regulon activity: 207 TF regulons x 439 cells.
(299 regulons including 'extended' versions)
168 regulons are active in more than 1% (4.39) cells.
> saveRDS(scenicOptions, file = "int/scenicOptions.Rds")

每一步的分析結果SCENIC都會自動儲存在所建立的int和out檔案夾。接下來對結果進行可視化，這裡是随機選的，沒有生物學意義。實際情況是要根據自己的研究目的。

1、可視化轉錄因子與seurat細胞分群關聯

#regulons AUC
AUCmatrix <- readRDS("int/3.4_regulonAUC.Rds")
AUCmatrix <- [email protected]@[email protected]$AUC
AUCmatrix <- data.frame(t(AUCmatrix), check.names=F)
RegulonName_AUC <- colnames(AUCmatrix)
RegulonName_AUC <- gsub(' \\(','_',RegulonName_AUC)
RegulonName_AUC <- gsub('\\)','',RegulonName_AUC)
colnames(AUCmatrix) <- RegulonName_AUC
scRNAauc <- AddMetaData(immune, AUCmatrix)
[email protected]$integrated <- NULL
saveRDS(scRNAauc,'immuneauc.rds')

#二進制regulo AUC
BINmatrix <- readRDS("int/4.1_binaryRegulonActivity.Rds")
BINmatrix <- data.frame(t(BINmatrix), check.names=F)
RegulonName_BIN <- colnames(BINmatrix)
RegulonName_BIN <- gsub(' \\(','_',RegulonName_BIN)
RegulonName_BIN <- gsub('\\)','',RegulonName_BIN)
colnames(BINmatrix) <- RegulonName_BIN
scRNAbin <- AddMetaData(immune, BINmatrix)
[email protected]$integrated <- NULL
saveRDS(scRNAbin, 'immunebin.rds')

作圖使用Seurat中FeaturePlot函數。小提琴圖也是可以的。

FeaturePlot(scRNAauc, features='CEBPB_extended_1035g', label=T, reduction = 'umap')
FeaturePlot(scRNAbin, features='CEBPB_extended_1035g', label=T, reduction = 'umap')

跟着Cell學單細胞轉錄組分析(十二):轉錄因子分析

2、最常見的熱圖，選擇需要可視化的regulons。

library(pheatmap)
celltype = subset(cellInfo,select = 'CellType')
AUCmatrix <- t(AUCmatrix)
BINmatrix <- t(BINmatrix)
regulons <- c('CEBPB_extended_1035g','RUNX1_extended_200g',
                 'FOXC1_extended_100g','MYBL1_extended_112g',
                 'IRF1_extended_1785g',
                 'ELF1_1760g','ELF1_extended_2165g',
                 'IRF1_extended_1765g','ETS1_extended_2906g',
                 'YY1_extended_1453g','ATF3_extended_1022g',
                 'E2F4_extended_637g',
                 'KLF2_12g','HES1_13g',
                 'GATA3_20g','HOXB2_extended_362g',
                 'SOX4_extended_10g',
                 'RUNX3_extended_170g','EGR3_extended_23g',
                 'MXD4_extended_182g','HOXD9_extended_25g')
AUCmatrix <- AUCmatrix[rownames(AUCmatrix)%in%regulons,]
BINmatrix <- BINmatrix[rownames(BINmatrix)%in%regulons,]
pheatmap(AUCmatrix, show_colnames=F, annotation_col=celltype,
         width = 6, height = 5)
pheatmap(BINmatrix, show_colnames=F, annotation_col=celltype,
         color = colorRampPalette(colors = c("white","black"))(100),
         width = 6, height = 5)

跟着Cell學單細胞轉錄組分析(十二):轉錄因子分析

好了，以上是一個基本的流程示範，具體怎麼用這個結果，如何解讀，可以參考相關的高分文獻，了解分析原理，與自己的研究相結合。

更多精彩内容請通路我的個人公衆号《KS科研分享與服務》！

跟着Cell學單細胞轉錄組分析(十二):轉錄因子分析

繼續閱讀

資料挖掘中的隐私保護

資料挖掘研究内容和本質（轉）

資料挖掘分類技術

淺談資料挖掘評估技術

資料挖掘001

【R語言】GARCH模型的應用一、資料來源二、資料分析三、模型建立四、模型優化五、結論六、實作代碼七、參考資料

資料分析實戰20絕技

Python TensorFlow循環神經網絡RNN-LSTM神經網絡預測股票市場價格時間序列和MSE評估準确性

Matlab随機波動率SV、GARCH用MCMC馬爾可夫鍊蒙特卡羅方法分析匯率時間序列

Lagrange插值的R語言實作

從大資料看技術，為什麼天貓雙11是史上最大數字經濟節日

用Matlab搞計算機視覺是怎樣的體驗？

R語言-相關性分析函數

拓端tecdat|R語言彈性網絡Elastic Net正則化懲罰回歸模型交叉驗證可視化

線上教育巨頭多鄰國Duolingo入華一周年，中國市場馬力全開

在weka中內建自己的算法