本文介绍Seurat进行单细胞聚类的流程。
1. 创建对象
首先需要读取数据,并创建Seurat对象,同时过滤掉在少于3个细胞中表达的基因以及检测到的feature少于200的细胞。
1 2 3 4 5
| mydata <- Read10X(data.dir = "D:/000 MyWork/000 MyProject/000 scVariants/sample1", gene.column = 1) mydata <- CreateSeuratObject(counts = mydata,project = "sample1", min.cells = 3,min.features = 200, assay = "RNA")
|
2. 数据质控
常用的质控指标包括:
- 每个细胞的唯一基因数目 feature number
- 低质量或空液泡往往只能检测到少量基因
- 双液泡(doublet)或多液泡(multiplets)会具有异常多的基因数目
- 每个细胞的总counts数(相当于每个细胞的测序深度)
- 线粒体基因占比
- 低质量或死细胞会具有异常高的线粒体基因表达
这里计算线粒体基因的表达比例,并且查看feature number、count number、线粒体基因占比的小提琴图。
1 2 3 4 5 6 7 8 9
| mydata[["percent.mt"]] <- PercentageFeatureSet(mydata,pattern = "^mt:") VlnPlot(object = mydata, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) VlnPlot(object = mydata, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), idents = NULL,assay = NULL,pt.size = 0,ncol = 3) mydata <- subset(mydata, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 5)
|
根据小提琴图的情况来决定过滤的标准,有个更好的质控方法是根据质控指标的分位数进行过滤,例如过滤掉 nFeature_RNA 上四分位数和下四分位数的细胞。但是有时候也会感觉标准略微严格,如果我想尽可能多地保留细胞,那也可以只过滤掉一些离群值。
3. normalization
质控之后,进行counts的normalization,默认使用 “LogNormalize” 的方法,即将每个基因的counts除以细胞总的counts数,乘上10,000,再进行对数转换。这种转换的目的?
1
| mydata <- NormalizeData(mydata,scale.factor = 10000)
|
Seurat提供了另外的normalization方法,通过 normalization.method 指定, 包括:
- “CLR”: centered log ratio transformation
- “RC”: equals to “LogNormalize” without log-transformation
4. 特征选择
鉴定高变基因(代表了细胞之间主要的生物学差异)进行后续分析,默认选择前2000个。
1 2 3 4 5 6 7 8
| mydata <- FindVariableFeatures(mydata, selection.method = "vst", nfeatures = 2000) top10 <- head(VariableFeatures(mydata), 10) plot1 <- VariableFeaturePlot(mydata) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0,ynudge = 0) plot1 plot2
|
5. 数据缩放
对数据进行缩放scaling,目的是使数据的均值为0,方差为1。
1 2 3 4
| all.genes <- rownames(mydata) mydata[["percent.mt"]] <- PercentageFeatureSet(mydata, pattern = "^mt:",assay="RNA") mydata <- ScaleData(mydata, features = all.genes,vars.to.regress = "percent.mt")
|
Seurat v4.0版本中新的标准化方法SCTransform,这个函数可以替代三个函数normalization、Scale、FindVariableFeatures的功能。
6. 线性降维
只用2000个高变基因进行PCA线性降维,根据ElbowPlot函数的拐点确定后续用于分析的PCA主成分数。
1 2 3 4 5 6 7
| mydata <- RunPCA(mydata, npcs = 20, verbose = FALSE) DimPlot(mydata, reduction = "pca") DimHeatmap(mydata, dims = 1:20, cells = 500, balanced = TRUE) mydata <- JackStraw(mydata, num.replicate = 100) mydata <- ScoreJackStraw(mydata, dims = 1:20) JackStrawPlot(mydata, dims = 1:20) ElbowPlot(mydata)
|
7. 细胞聚类
1 2 3 4
| mydata <- FindNeighbors(mydata, reduction = "pca", dims = 1:20) mydata <- FindClusters(mydata, resolution = 0.4) mydata <- RunUMAP(mydata, reduction = "pca", dims = 1:20) mydata <- RunTSNE(mydata,reduction = "pca",dims = 1:20)
|
8. UMAP/tSNE降维
1 2 3 4 5 6 7 8
| p1 <- DimPlot(mydata, reduction = "umap") p2 <- DimPlot(mydata, reduction = "umap", label = TRUE, repel = TRUE) p1+p2
p3 <- DimPlot(mydata, reduction = "tsne") p4 <- DimPlot(mydata, reduction = "tsne", label = TRUE, repel = TRUE) p3+p4
|
9. 鉴定细胞类群的marker
FindMarkers返回指定指定cluster之间的marker。
FindAllMarkers 可以一次寻找所有clusters的markers,但只返回上调的markers。
VlnPlot绘制表达情况的小提琴图;FeaturePlot绘制表达情况的细胞降维图。
1 2 3 4 5 6 7 8 9 10 11 12
| cluster5.markers <- FindMarkers(mydata, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) head(cluster5.markers, n = 5)
mydata.markers <- FindAllMarkers(mydata, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25 mydata.markers %>% group_by(cluster) %>% slice_max(n = 2, order_by = avg_log2FC)
VlnPlot(mydata, features = c("MS4A1", "CD79A")) FeaturePlot(mydata, features = c("MS4A1", "CD79A"))
|
10. 细胞注释
进行细胞注释,重命名cluster ID为细胞类型。
1 2 3 4
| new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono","NK", "DC", "Platelet") names(new.cluster.ids) <- levels(mydata) mydata <- RenameIdents(mydata, new.cluster.ids) DimPlot(mydata, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
|