NC單細胞文章複現(三)：Clustering

分享是一種态度

我們上次基于各種marker對1189個細胞進行分類，然而，僅基于marker對細胞進行分類可能是不精确的，特别是考慮到scRNA-seq資料的high dropout rate 。是以，在進行t-SNE降維之前，作者又進一步将細胞進行分類。

首先，因為對不同病人樣本的效應進行回歸分析，以確定得到的聚類結果不是患者樣本之間的異質性導緻的。此外，選擇了在細胞間分散度高的基因表達子集(平均表達量>0.1)，以增加簇的穩健性。首先計算經驗離散值（empirical dispersion value ）去拟合離散度-均值（dispersion parameter ）的關系，為每個基因選擇離散度參數（dispersion-mean relationship ）。

# 将得到的1189個細胞分類結果更新到sceset_final中
colData(sceset_final)$cell_types_markers <- cell_types_simple
pd_norm <- colData(sceset_final)
#使用monocle包對細胞類型進行無監督聚類方法，monocle_unsup_clust_plots是包裝好的一個函數，
HSMM_clustering_ct <- monocle_unsup_clust_plots(sceset_obj = sceset_final, mat_to_cluster = mat_norm,                                                       anno_colors = anno_colors, 
                                                name_in_phenodata = "cluster_allregr_disp", disp_extra = 1,                                                 save_plots = 0,
                                                path_plots = NULL, type_pats = "allpats", regress_pat = 1)


HSMM_clustering_ct$Cluster <- HSMM_clustering_ct$cluster_allregr_disp
table(HSMM_clustering_ct$Cluster)
> table(HSMM_clustering_ct$Cluster)

  1   2   3   4   5   6   7 
 64 141  95  74 202 559  54

複制

我們看到，此時可分為7簇，跟文中的9簇不一樣，作者這麼解釋：由于reduceDimension and clusterCells函數的更新，是以結果會有所不同。

# due to changes in Monocle's functions (reduceDimension and clusterCells), the resulting clustering is slightly different from
# the original clustering from the paper. for reproducibility, we read in the original cluster assignment

複制

為了更好地進行下面的學習，我們就拿已經處理好的original clustering跑下去看看。

#讀取已經處理的的original clustering
original_clustering <- readRDS(file="data/original_clustering.RDS")
HSMM_clustering_ct$Cluster <- original_clustering
table(HSMM_clustering_ct$Cluster)
> table(HSMM_clustering_ct$Cluster)

  1   2   3   4   5   6   7   8   9 
 52  18 156 197 144 119 129 328  46

複制

這時候就有9簇細胞，接下來需要對每1簇細胞再進行細分。

# 将得到的9簇細胞分類結果更新到sceset_final中
colData(sceset_final) <- cbind(colData(sceset_final), pData(HSMM_clustering_ct)[,c(96:104)])
colData(sceset_final)$cell_types_cl_all <- colData(sceset_final)$cell_types_markers
pd_norm <- colData(sceset_final)

#首先在每個簇中确定unknown and undecided cell types 
cells_to_assign <- list()
cells_to_reassign <- list()
mean_exprs <- list()
mean_reassign_exprs <- list()

# cluster1: 隻有52 epithelial cells
cluster_here <- 1
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
mean_exprs[[cluster_here]] <- NULL
mean_reassign_exprs[[cluster_here]] <- NULL
cells_to_assign[[cluster_here]] <- NULL
cells_to_reassign[[cluster_here]] <- NULL

#cluster2：确定clsters2的大緻細胞類型：3個epithelial 細胞, 13個macrophage細胞, 1個undecided細胞 and 1個unknown細胞
cluster_here <- 2
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])

> table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
   epithelial macrophage  undecided    unknown 
         3         13          1          1 
#計算undecided and unknown cells的免疫标志物的平均表達量
to_assign_cluster <- c("undecided", "unknown")
mean_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm, 
                                                      cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster)),
                                                      epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
mean_exprs[[cluster_here]]

#對于undecided 和 unknown cell，如果免疫标志物的平均表達量最高，指定為巨噬細胞。
cells_to_assign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster))[which(apply(mean_exprs[[cluster_here]], 1, function(x){if (x[2] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_assign[[cluster_here]]] <- "macrophage"
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])

#确定clusters2中3個上皮細胞（epithelial cells）免疫标志物的平均表達水準
mean_reassign_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm, 
                                                               cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "epithelial")),
                                                               epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)

#類似地，clusters2有3個上皮細胞（epithelial cells）免疫标志物的平均表達最強，也被指定為巨噬細胞。
mean_reassign_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"),                                                                  mat_expr = mat_norm, 
                                                               cells_pos = intersect(which(pd_norm$Cluster                                      == cluster_here), which(pd_norm$cell_types_markers %in% "epithelial")),
                                                               epithelial_markers = epithelial_markers,                                        immune_markers = immune_markers, other_markers = other_markers)
cells_to_reassign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "epithelial"))[which(apply(mean_reassign_exprs[[cluster_here]], 1, function(x){if (x[2] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_reassign[[cluster_here]]] <- "macrophage"
#是以，clusters2由18個巨噬細胞組成。
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])

## cluster3：46 epithelial, 86 stroma, 14 endothelial, 5 undecided and 5 unknown cells.

cluster_here <- 3
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
> table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])

endothelial  epithelial      stroma   undecided     unknown 
         14          46          86           5           5 
         
mean_exprs[[cluster_here]] <- NULL
mean_reassign_exprs[[cluster_here]] <- NULL
cells_to_assign[[cluster_here]] <- NULL
cells_to_reassign[[cluster_here]] <- NULL


# cluster 4: 将unknown and undecided cells進一步細分 
cluster_here <- 4
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
to_assign_cluster <- c("undecided", "unknown")
mean_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm, 
                                                      cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster)),
                                                      epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
# clusters3中unknown and undecided cells共有5個細胞的epithelial markers的平均表達最強，被指定為epithelial cells

cells_to_assign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% to_assign_cluster))[which(apply(mean_exprs[[cluster_here]], 1, function(x){if (x[1] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_assign[[cluster_here]]] <- "epithelial"
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])

# 确定個stroma cell是否高表達epithelial marker，若epithelial marker表達較高，則被指定為 epithelial cells
mean_reassign_exprs[[cluster_here]] <- compute_mean_expr_types(types = c("epithelial", "immune", "other"), mat_expr = mat_norm, 
                                                               cells_pos = intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "stroma")),
                                                               epithelial_markers = epithelial_markers, immune_markers = immune_markers, other_markers = other_markers)
# only re-assign the stroma cells if their epithelial expression is higher
cells_to_reassign[[cluster_here]] <- intersect(which(pd_norm$Cluster == cluster_here), which(pd_norm$cell_types_markers %in% "stroma"))[which(apply(mean_reassign_exprs[[cluster_here]], 1, function(x){if (x[1] >= max(x)) return(1) else {return(0)}}) == 1)]
pd_norm$cell_types_cl_all[cells_to_reassign[[cluster_here]]] <- "epithelial"
table(pd_norm$cell_types_cl_all[which(pd_norm$Cluster == cluster_here)])

cluster 5:46 epithelial, 4 stroma, 52 T, 3 B, 11 undecided and 28 unknown cells
cluster_here <- 5
table(pd_norm$cell_types_markers[which(pd_norm$Cluster == cluster_here)])
mean_exprs[[cluster_here]] <- NULL
mean_reassign_exprs[[cluster_here]] <- NULL
cells_to_assign[[cluster_here]] <- NULL
cells_to_reassign[[cluster_here]] <- NULL
........

複制

接下來的第6、7、8和9簇細胞，按照同樣的辦法細分，代碼太長了，我就不貼上來了。在這裡先小結一下，我們開始看到的是1189個細胞，然後将這些細胞用無聚類監督分析先分為9個簇，因為相似的細胞更容易成為一個簇，但是簇裡面的細胞類型我們已經基于文獻的markers分類好了，若簇裡面有多種細胞類型的細胞存在，那麼我們需要進一步鑒定他們跟簇裡面那群最大比例的細胞是不是就是同一類細胞（不然為何他們會聚集到了同一簇），還是說有自己的獨特身份？但是我們怎麼定義那個最大比例的細胞，作者将這個比例定為80%，舉個例子，在clusters4總共有197個細胞（185 epithelial，3 stroma, 4 undecided and 5 unknown cells），其中 epithelial細胞占了94%，超過了80%的比例，那麼其他12個細胞有可能“僞裝”了自己，其真實身份有可能是epithelial細胞，不然怎麼解釋他們跟epithelial細胞聚集到了一起呢？這時候就需要進一步确定其他12個細胞epithelial markers 平均表達量是否高于其他的markers,是的話他們的“廬山真面目”就是epithelial細胞，若沒有，那麼就是冤枉到它了，讓它保持原來的身份。而對于clusters3（ 46 epithelial, 86 stroma, 14 endothelial, 5 undecided and 5 unknown cells）來說，沒有1種細胞的比例超過80%，是以他們都可以保持自己的身份。我感覺這個過程還是蠻有趣的，在對細胞核實身份的同時，其實就是不斷接近客觀世界的過程。今天這些内容已經蠻多了，下一節我們再繼續學習。

往期回顧

OSCA單細胞資料分析筆記9—Clustering

scPhere——用地球儀來展示降維結果

2021第一期生信入門微信群答疑精選200題

這50個ggplot2現成圖表你居然沒有從頭到尾自己畫一遍

如果你對單細胞轉錄組研究感興趣，但又不知道如何入門，也許你可以關注一下下面的課程

生信爆款入門-2021第4期
資料挖掘（GEO,TCGA,單細胞）2021第4期
明碼标價之共享96線程384G記憶體伺服器