In this exercise, we will analyze a scRNA-seq dataset from peripheral blood mononuclear cells (PBMC) generated with the 10x Genomics platform.
File > New File > Rscript
.File > Save As
.Session > Set Working Directory > To Source File Location
(by doing this, RStudio will consider the directory containing your code and data as the working directory).Let’s have a look at the format of the data we have just downloaded:
## [1] "hg19" "hg19/barcodes.tsv" "hg19/genes.tsv"
## [4] "hg19/matrix.mtx"
We have downloaded a directory called filtered_gene_bc_matrices, which contains a sub-directory names hg19 (i.e. the version of the human genome that was used to summarize the scRNA-seq data), which in turn contains three text files generated with the CellRanger pipeline: barcodes.tsv, genes.tsv, and matrix.mtx. These files contains the count expression values computed for each gene/UMI and cell (more information on these files can be found here).
In this exercise we will use the Seurat R toolkit to perform scRNA-seq analysis of the PBMC single-cell dataset. To this end, we need to load the Seurat package using the library()
function.
And, before we begin the analysis, let’s set the seed of the random number generator as indicated below to ensure reproducible plots and results:
We can use the Read10X()
function to read the CellRanger output files from the directory specified by the data.dir
parameter.
Let’s print to screen the count values of three genes (i.e. rows) in the first 10 cells (i.e. columns).
Little R note:: the c()
function can be used when we want to specific a list of numbers, strings, etc. For instance, c("A", "B", "C")
is used to list the first three letters of the alphabet, and c(1, 2, 3)
the first three non-negative integers –strings must be indicated in quotation marks, while numbers not. In R, a more compact form to specify a sequence of integers is through the use of “:”: the command 1:4
is equivalent to c(1, 2, 3, 4)
.
## 3 x 10 sparse Matrix of class "dgCMatrix"
##
## CD8A 1 . . . . . . 1 . .
## HLA-A 5 5 12 7 4 8 7 7 3 1
## MS4A1 . 6 . . . . . . 1 1
As the count matrix is very sparse, namely it contains a lot of 0’s (represented as dots), a sparse-matrix representation is used to reduce memory and computational time.
We can now save this data into a Seurat object, called pbmc
, which we will use throughout our analysis. The object will be used to store the raw data, normalized data, and all the results of our analyses (e.g. highly variable genes, PCA, clustering).
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
Let’s have a look at the pbmc
object we have just created:
## An object of class Seurat
## 32738 features across 2700 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
It contains count data for 32,738 genes across 2,700 cells.
The pbmc
object has a metadata slot containg, besides the cell type labels reported in the orig.ident
column (for now, all equal to the project name we specified), two covariates that are important for the quality control step:
nCount_RNA
: the number of total counts per barcode;nFeature_RNA
: the number of detected features (i.e. genes) per cell;## orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1 PBMC 2421 781
## AAACATTGAGCTAC-1 PBMC 4903 1352
## AAACATTGATCAGC-1 PBMC 3149 1131
## AAACCGTGCTTCCG-1 PBMC 2639 960
## AAACCGTGTATGCG-1 PBMC 981 522
## AAACGCACTGGTAC-1 PBMC 2164 782
We can add to this slot information of the percentage of mitochondrial RNA using the PercentageFeatureSet()
function. This function allows to calculate the percentage of all the counts belonging to a subset of the possible features for each cell; in this case, all genes whose name starts with “MT-” (specified by the pattern
argument).
Let’s visualize the distribution of our three QC variables with a violin plot. A violin plot depicts the distribution of numeric data for one or more groups using density curves.
We can also use a scatterplot to visualize the relationship between nCount_RNA
and percent.mt
:
or between nCount_RNA
and nFeature_RNA
:
Question: How can we make a scatterplot of
nFeature_RNA
vs.percent.mt
?
We now want to filter out cells that: - Have a very high number of detected genes (i.e. possible doublets) or very low number of detected genes (i.e. possible low-quality cells or empty droplets); - Have high mitochondrial counts (possible low-quality cells).
Precisely, based on the previous plots, we will only keep cells that have nFeature_RNA
in the 200-2,500 range and percent.mt
< 5%. To this end, we will use the subset()
function, which return subsets of data which meet specific conditions.
pbmc <- subset(pbmc,
subset = nFeature_RNA > 200 &
nFeature_RNA < 2500 &
percent.mt < 5)
print(pbmc)
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
Question: How many cells have we removed?
Note: in this and in most of the commands of this exercise, our Seurat object pbmc
is both the input and the output of our Seurat functions, meaning that the object is updated using the newly computed data. Depending on the function used, Seurat saves the new data in specific slots of the Seurat object (which has a quite complex structure, [discussed in this Wiki]](https://github.com/satijalab/seurat/wiki)).
We will normalize our count data using the NormalizeData()
function. By default, this function performs global-scaling normalization using the “LogNormalize” approach, which:
scale.factor
argument);Now that gene expression levels are comparable across cells, we can identify highly variable genes (HVG), i.e. genes that are highly expressed in some cells and lowly expressed in others, using the FindVariableFeatures()
function. For this analysis, we will only consider the top 2,000 most variable genes.
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 2000 variable features)
Let’s have a look at the list of HVG, which can be extracted from the pbmc
object using the VariableFeatures()
function. We will use the head
function to visualize only the top 10 ones:
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "FTL" "GNLY" "PF4" "FTH1"
## [9] "GNG11" "S100A8"
We can also save this list in a variable called top10
:
We can now make a scatterplot showing the relationship between the average expression and standardized variance for all genes, with HVG highlighted in red, using the VariableFeaturePlot()
function. By invoking the LabelPoints()
function on the plot we have just generated, we can then label the top-10 most variable genes saved in top10
by passing this list to the points
argument.
Question: How can we make the same plot but labelling the top 20 most variable genes?
top20 <- head(VariableFeatures(pbmc), 20)
plot2 <- VariableFeaturePlot(pbmc)
LabelPoints(plot = plot2,
points = top20,
repel = TRUE)
We finally apply a linear transformation, namely scaling, which is a standard pre-processing step prior to dimensional reduction techniques like PCA to ensure that all features have a similar range of values and, thus, equal “weight” in the analysis. We will use the ScaleData()
function, which by default:
We have to specify the features = rownames(pbmc)
argument to make sure that all genes are scaled and not only the highly variable ones, which is the default behaviour of this function.
We can now perform principal component analysis (PCA) on the scaled data using the RunPCA()
function. By default, only the data from the previously HVG are used as input.
The PCA results are saved into the pbmc
object. In the example below, we visualize the top 5 genes (positively or negatively) associated with the top 4 principal components (PC):
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1
## Negative: VIM, IL7R, S100A6, S100A8, IL32
Using the VizDimLoadings()
function, we can also visualize the loadings (positive or negative) of the top genes associated with the principal components of interest. To do so, we need to specify which type of dimensionality reduction results we want to access (reduction = "pca"
) and the dimensions we are interested in (here, principal components PC1 and PC2 as indicated by dims = 1:2
).
Question: Which is the gene with the largest negative association to PC1?
The top associated genes, and their loadings, can be also visualized in an heatmap using the DimHeatmap()
function. The cells
argument can be used to restrict the analysis to the most “extreme” cells to speed up the computation. In the resulting heatmap, both cells and genes are sorted according to their scores, allowing the detection of interesting or uninteresting patterns. These plots are useful, for instance, to identify PC strongly associated to cell cycle genes (e.g. TOP2A, MKI67) or to decide which PC to include in downstream analyses.
We can also plot more than one component, e.g. PC1 and PC20:
Question: What do you notice when comparing PC1 and PC20 heatmaps?
Question: How can we visualize the gene loadings of the first 9 PC using the
DimHeatmap()
function?
The next step of the analysis is clustering, which is performed on the single-cell PCA scores. To this end, we have to select the top PC that are needed to capture most of the data variance.
To decide how many PC to select, we can use an heuristic approach: we can plot the standard deviation or variance captured by each PC using the ElbowPlot()
function and (try to) identify an “elbow”, where the curve levels off, indicating that components to the right of this point add limited information. With the command below, we generate an elbow plot for the first 20 PC:
In this plot, we can see an elbow at about 10 PC, which we will use as a cut-off for downstream analysis.
Deciding how many PC to consider is not trivial and might require several iterations, for instance, to evaluate how the final results in terms of cell clustering and annotations are affected by different choices. In general, results are not heavily affected when more PC are considered (e.g. 15, 20), whereas can change drastically if a low number of PC is chosen (e.g. 3, 5).
Seurat uses a graph-based clustering approach based on the previously identified PC. Briefly, it builds namely a KNN graph based on the Euclidean distance between cells in PCA space. Then, it refines the edge weights between pairs of cells based on the overlap in their local neighborhoods. This step is performed using the FindNeighbors() function, which takes as input the previously defined dimensionality of the dataset (here, the first 10 PC).
Based on this graph, cells are then clustered using the Louvain algorithm (default in the FindClusters()
function) by performing modularity optimization;
The resolution
argument controls cluster granularity: set to 1.0 by default, can be increased (decreased) to obtain more (fewer) clusters. Values in the [0.4-1.2] rage are recommended for single-cell datasets of around 3,000 cells, but can be increased for larger datasets.
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95888
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8758
## Number of communities: 9
## Elapsed time: 0 seconds
The clusters are saved in the metadata slot (seurat_clusters
column) and can be accessed using the Idents()
function.
## orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5
## AAACATACAACCAC-1 PBMC 2421 781 3.0152829 0
## AAACATTGAGCTAC-1 PBMC 4903 1352 3.7935958 3
## AAACATTGATCAGC-1 PBMC 3149 1131 0.8891712 2
## AAACCGTGCTTCCG-1 PBMC 2639 960 1.7430845 5
## AAACCGTGTATGCG-1 PBMC 981 522 1.2232416 6
## AAACGCACTGGTAC-1 PBMC 2164 782 1.6635860 2
## seurat_clusters
## AAACATACAACCAC-1 0
## AAACATTGAGCTAC-1 3
## AAACATTGATCAGC-1 2
## AAACCGTGCTTCCG-1 5
## AAACCGTGTATGCG-1 6
## AAACGCACTGGTAC-1 2
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 0 3 2 5
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
We can now visualize the clustering results onto the 2D PCA plot using the DimPlot()
function:
To better visualize scRNA-seq data, we can use non-linear dimensionality reduction approaches like t-SNE or UMAP.
We can run the t-SNE algorithm considering only the first 10 PC using the RunTSNE()
function as shown in the following command:
Now that the t-SNE results have been saved into the pbmc
object, they can be plotted using again the DimPlot()
function and specifying reduction = "tsne"
.
Question: Which dimensionality reduction approach works better in separating cells belonging to different clusters, PCA or t-SNE?
Similarly, we can compute the UMAP using the RunUMAP()
function.
Question: How can we visualize the UMAP plot with single cells coloured according to clustering results?
The FeaturePlot()
function can be used to plot the expression of specific genes onto a dimensionality reduction plot. For instance, we can plot the normalized expression levels of gene CD3E, T cell-specific marker, onto the UMAP plot with the following command.
We can also plot more than one marker, e.g CD8A (CD8+ T cells), MS4A1 (B cells), GNLY (NK cells), and PPBP (platelets):
Question: By comparing this plot with the one coloured by clusters, which clusters coorespond to CD8+ T cells, B cells, NK cells, and platelets, respectively?
We can also visualize the distribution of the normalized expression levels for specific genes per cluster as violin plots (which we already made in the QC step):
Alternatively, we can use a visualization based on Rigeline plots (also called Joy plots):
Question: How can we make a Ridge plot for the NKG7 and IL7R genes?
Question: How can we make a violin plot for the NKG7 and IL7R genes?
The VlnPlot()
can be also used as we did at the beginning of this exercise to plot the number of detected genes (nFeature_RNA
), total counts (nCount_RNA
), and percentage of mitochondrial RNA (percent.mt
), but stratified by cluster.
Question: Which clusters of cells has (overall) the highest number of detected genes and total counts?
We can now perform differential expression analysis to identify cluster-specific markers.
The FindMarkers()
function can be used to identify markers of a single cluster compared to all other cells. By specifying the only.pos = TRUE
argument, we can look only at positive markers –otherwise, by default, both positive and negative ones are reported. The cluster of interest can be specified using the ident.1
parameter. Here, for instance, we identify the markers of cluster 3 (i.e. the one that expressed the B cell-specific marker MS4A1):
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## MS4A1 0.000000e+00 3.377357 0.855 0.053 0.000000e+00
## CD79A 0.000000e+00 4.309253 0.936 0.041 0.000000e+00
## CD79B 2.675038e-274 3.481243 0.916 0.142 8.757541e-270
## LINC00926 2.384127e-272 2.841851 0.564 0.009 7.805154e-268
## TCL1A 9.481762e-271 3.590674 0.622 0.022 3.104139e-266
MS4A1 is, indeed, among the top markers identified by this analysis.
With the same function, we can identify the top five markers that distinguish cluster 5 (ident.1 = 5
) from cluster 6 (ident.2 = 6
):
cluster5vs6.markers <- FindMarkers(pbmc,
ident.1 = 5,
ident.2 = 6,
only.pos = TRUE)
head(cluster5vs6.markers, 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## AIF1 1.416223e-54 4.924216 1 0.086 4.636432e-50
## LST1 1.023311e-53 4.773702 1 0.150 3.350114e-49
## COTL1 2.830334e-53 4.521067 1 0.186 9.265948e-49
## CST3 1.863082e-52 4.262893 1 0.257 6.099359e-48
## FTH1 1.816386e-50 3.683219 1 0.979 5.946484e-46
Question: Which are the top-three markers that distinguish cluster 7 from cluster 2?
cluster7vs2.markers <- FindMarkers(pbmc,
ident.1 = 7,
ident.2 = 2,
only.pos = TRUE)
head(cluster7vs2.markers, 3)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCER1A 3.161616e-83 3.925230 0.829 0.004 1.035050e-78
## HLA-DQA1 3.485526e-76 4.354738 0.943 0.026 1.141091e-71
## HLA-DQA2 6.399072e-71 2.975259 0.800 0.013 2.094928e-66
Alternatively, we could use the FindAllMarkers()
function, which systematically identifies markers specific for each cluster:
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## RPS12 1.647066e-146 0.7347797 1.000 0.991 5.392164e-142 0 RPS12
## RPS6 2.677487e-145 0.6854074 1.000 0.995 8.765556e-141 0 RPS6
## RPL32 8.402925e-143 0.6318857 0.999 0.995 2.750950e-138 0 RPL32
## RPS14 1.186082e-136 0.6377198 1.000 0.994 3.882994e-132 0 RPS14
## RPS27 3.122041e-136 0.7163892 0.999 0.992 1.022094e-131 0 RPS27
## RPS25 1.493222e-125 0.7562751 0.997 0.975 4.888509e-121 0 RPS25
We can save the results in our current working directory as a .csv file with the current command:
Let’s imagine we have some prior knowledge on cell type-specific markers, as indicated in the table below, that allows us to assign each cluster to a cell identity:
Cluster ID | Markers | Cell Type |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | IL7R, S100A4 | Memory CD4+ T |
2 | CD14, LYZ | CD14+ Mono |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
We can use the RenameIdents()
function to rename the clusters with more biologically meaningful names reflecting cell identities:
new.cluster.ids <- c("Naive CD4 T",
"Memory CD4 T",
"CD14+ Mono",
"B",
"CD8 T",
"FCGR3A+ Mono",
"NK",
"DC",
"Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc,
new.cluster.ids)
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## Naive CD4 T B CD14+ Mono FCGR3A+ Mono
## AAACCGTGTATGCG-1
## NK
## 9 Levels: Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK ... Platelet
We can now make UMAP plot colored according to the new cell identities. We can also add the cell names directly to the UMAP by specifying label = TRUE
, and remove the legend on the right by adding + NoLegend()
.
Finally, the DotPlot()
function can be used to visualize, for a set of genes specified through the features
parameter, how their expression levels change across clusters. In the dot plot, the size of the dot encodes the percentage of cells within a cluster, while the color encodes the average expression level across all cells within a cluster. In the example code below, + RotatedAxis()
is added to rotate the labels on the x-axis to make them more easily readable:
knownmarkers <- c("L7R", "CCR7", "CD14", "LYZ",
"IL7R", "S100A4", "MS4A1", "CD8A",
"FCGR3A", "MS4A7", "GNLY", "NKG7",
"FCER1A", "CST3", "PPBP")
DotPlot(pbmc,
features = knownmarkers) + RotatedAxis()