In this exercise, we will analyze a scRNA-seq dataset from peripheral blood mononuclear cells (PBMC) generated with the 10x Genomics platform.
File > New File > Rscript
.File > Save As
.Session > Set Working Directory > To Source File Location
(by doing this, RStudio will consider the directory containing your code
and data as the working directory).Let’s have a look at the format of the data we have just downloaded:
## [1] "hg19" "hg19/barcodes.tsv" "hg19/genes.tsv"
## [4] "hg19/matrix.mtx"
We have downloaded a directory called filtered_gene_bc_matrices, which contains a sub-directory names hg19 (i.e. the version of the human genome that was used to summarize the scRNA-seq data), which in turn contains three text files generated with the CellRanger pipeline: barcodes.tsv, genes.tsv, and matrix.mtx. These files contains the count expression values computed for each gene/UMI and cell (more information on these files can be found here).
In this exercise we will use the Seurat R toolkit to perform
scRNA-seq analysis of the PBMC single-cell dataset. To this end, we need
to load the Seurat package using the library()
function.
And, before we begin the analysis, let’s set the seed of the random number generator as indicated below to ensure reproducible plots and results:
We can use the Read10X()
function to read the CellRanger
output files from the directory specified by the data.dir
parameter.
Let’s print to screen the count values of three genes (i.e. rows) in the first 10 cells (i.e. columns).
Little R note:: the c()
function can be
used when we want to specific a list of numbers, strings, etc. For
instance, c("A", "B", "C")
is used to list the first three
letters of the alphabet, and c(1, 2, 3)
the first three
non-negative integers –strings must be indicated in quotation marks,
while numbers not. In R, a more compact form to specify a sequence of
integers is through the use of “:”: the command 1:4
is
equivalent to c(1, 2, 3, 4)
.
## 3 x 10 sparse Matrix of class "dgCMatrix"
##
## CD8A 1 . . . . . . 1 . .
## HLA-A 5 5 12 7 4 8 7 7 3 1
## MS4A1 . 6 . . . . . . 1 1
As the count matrix is very sparse, namely it contains a lot of 0’s (represented as dots), a sparse-matrix representation is used to reduce memory and computational time.
We can now save this data into a Seurat object, called
pbmc
, which we will use throughout our analysis. The object
will be used to store the raw data, normalized data, and all the results
of our analyses (e.g. highly variable genes, PCA, clustering).
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
Let’s have a look at the pbmc
object we have just
created:
## An object of class Seurat
## 32738 features across 2700 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
## 1 layer present: counts
It contains count data for 32,738 genes across 2,700 cells.
The pbmc
object has a metadata slot containg, besides
the cell type labels reported in the orig.ident
column (for
now, all equal to the project name we specified), two covariates that
are important for the quality control step:
nCount_RNA
: the number of total counts per
barcode;nFeature_RNA
: the number of detected features
(i.e. genes) per cell;## orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1 PBMC 2421 781
## AAACATTGAGCTAC-1 PBMC 4903 1352
## AAACATTGATCAGC-1 PBMC 3149 1131
## AAACCGTGCTTCCG-1 PBMC 2639 960
## AAACCGTGTATGCG-1 PBMC 981 522
## AAACGCACTGGTAC-1 PBMC 2164 782
We can add to this slot information of the percentage of
mitochondrial RNA using the PercentageFeatureSet()
function. This function allows to calculate the percentage of all the
counts belonging to a subset of the possible features for each cell; in
this case, all genes whose name starts with “MT-” (specified by the
pattern
argument).
Let’s visualize the distribution of our three QC variables with a violin plot. A violin plot depicts the distribution of numeric data for one or more groups using density curves.
We can also use a scatterplot to visualize the relationship between
nCount_RNA
and percent.mt
:
or between nCount_RNA
and nFeature_RNA
:
Question: How can we make a scatterplot of
nFeature_RNA
vs.percent.mt
?
We now want to filter out cells that: - Have a very high number of detected genes (i.e. possible doublets) or very low number of detected genes (i.e. possible low-quality cells or empty droplets); - Have high mitochondrial counts (possible low-quality cells).
Precisely, based on the previous plots, we will only keep cells that
have nFeature_RNA
in the 200-2,500 range
and percent.mt
< 5%. To this end, we
will use the subset()
function, which return subsets of
data which meet specific conditions.
pbmc <- subset(pbmc,
subset = nFeature_RNA > 200 &
nFeature_RNA < 2500 &
percent.mt < 5)
print(pbmc)
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
## 1 layer present: counts
Question: How many cells have we removed?
Note: in this and in most of the commands of this
exercise, our Seurat object pbmc
is both the input and the
output of our Seurat functions, meaning that the object is updated using
the newly computed data. Depending on the function used, Seurat saves
the new data in specific slots of the Seurat object (which has a quite
complex structure, discussed in this
Wiki).
We will normalize our count data using the
NormalizeData()
function. By default, this function
performs global-scaling normalization using the “LogNormalize” approach,
which:
scale.factor
argument);## Normalizing layer: counts
Now that gene expression levels are comparable across cells, we can
identify highly variable genes (HVG), i.e. genes that are highly
expressed in some cells and lowly expressed in others, using the
FindVariableFeatures()
function. For this analysis, we will
only consider the top 2,000 most variable genes.
## Finding variable features for layer counts
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 2000 variable features)
## 2 layers present: counts, data
Let’s have a look at the list of HVG, which can be extracted from the
pbmc
object using the VariableFeatures()
function. We will use the head
function to visualize only
the top 10 ones:
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "FTL" "GNLY" "PF4" "FTH1"
## [9] "GNG11" "S100A8"
We can also save this list in a variable called
top10
:
We can now make a scatterplot showing the relationship between the
average expression and standardized variance for all genes, with HVG
highlighted in red, using the VariableFeaturePlot()
function. By invoking the LabelPoints()
function on the
plot we have just generated, we can then label the top-10 most variable
genes saved in top10
by passing this list to the
points
argument.
Question: How can we make the same plot but labelling the top 20 most variable genes?
top20 <- head(VariableFeatures(pbmc), 20)
plot2 <- VariableFeaturePlot(pbmc)
LabelPoints(plot = plot2,
points = top20,
repel = TRUE)
We finally apply a linear transformation, namely scaling,
which is a standard pre-processing step prior to dimensional reduction
techniques like PCA to ensure that all features have a similar range of
values and, thus, equal “weight” in the analysis. We will use the
ScaleData()
function, which by default:
We have to specify the features = rownames(pbmc)
argument to make sure that all genes are scaled and not only the highly
variable ones, which is the default behaviour of this function.
We can now perform principal component analysis (PCA) on the scaled
data using the RunPCA()
function. By default, only the data
from the previously HVG are used as input.
The PCA results are saved into the pbmc
object. In the
example below, we visualize the top 5 genes (positively or negatively)
associated with the top 4 principal components (PC):
## PC_ 1
## Positive: AIF1, FCN1, FCER1G, CFD, SERPINA1
## Negative: MALAT1, IL7R, ACAP1, B2M, CTSW
## PC_ 2
## Positive: GZMA, CST7, PRF1, CTSW, B2M
## Negative: MS4A1, HLA-DQA1, HLA-DQB1, LINC00926, CD79B
## PC_ 3
## Positive: HLA-DQA1, HLA-DQB1, CD79B, HLA-DRB1, MS4A1
## Negative: SPARC, GNG11, NRGN, GP9, TUBB1
## PC_ 4
## Positive: IL7R, S100A6, MAL, FYB, GIMAP7
## Negative: HLA-DQA1, CD79B, HLA-DQB1, MS4A1, HLA-DRB1
Using the VizDimLoadings()
function, we can also
visualize the loadings (positive or negative) of the top genes
associated with the principal components of interest. To do so, we need
to specify which type of dimensionality reduction results we want to
access (reduction = "pca"
) and the dimensions we are
interested in (here, principal components PC1 and PC2 as indicated by
dims = 1:2
).
Question: Which is the gene with the largest negative association to PC1?
The top associated genes, and their loadings, can be also visualized
in an heatmap using the DimHeatmap()
function. The
cells
argument can be used to restrict the analysis to the
most “extreme” cells to speed up the computation. In the resulting
heatmap, both cells and genes are sorted according to their scores,
allowing the detection of interesting or uninteresting patterns. These
plots are useful, for instance, to identify PC strongly associated to
cell cycle genes (e.g. TOP2A, MKI67) or to decide which PC to include in
downstream analyses.
We can also plot more than one component, e.g. PC1 and PC20:
Question: What do you notice when comparing PC1 and PC20 heatmaps?
Question: How can we visualize the gene loadings of the first 9 PC using the
DimHeatmap()
function?
The next step of the analysis is clustering, which is performed on the single-cell PCA scores. To this end, we have to select the top PC that are needed to capture most of the data variance.
To decide how many PC to select, we can use an heuristic approach: we
can plot the standard deviation or variance captured by each PC using
the ElbowPlot()
function and (try to) identify an “elbow”,
where the curve levels off, indicating that components to the right of
this point add limited information. With the command below, we generate
an elbow plot for the first 20 PC:
In this plot, we can see an elbow at about 10 PC, which we will use as a cut-off for downstream analysis.
Deciding how many PC to consider is not trivial and might require several iterations, for instance, to evaluate how the final results in terms of cell clustering and annotations are affected by different choices. In general, results are not heavily affected when more PC are considered (e.g. 15, 20), whereas can change drastically if a low number of PC is chosen (e.g. 3, 5).
Seurat uses a graph-based clustering approach based on the previously identified PC. Briefly, it builds namely a KNN graph based on the Euclidean distance between cells in PCA space. Then, it refines the edge weights between pairs of cells based on the overlap in their local neighborhoods. This step is performed using the FindNeighbors() function, which takes as input the previously defined dimensionality of the dataset (here, the first 10 PC).
Based on this graph, cells are then clustered using the Louvain
algorithm (default in the FindClusters()
function) by
performing modularity optimization;
The resolution
argument controls cluster granularity: set
to 1.0 by default, can be increased (decreased) to obtain more (fewer)
clusters. Values in the [0.4-1.2] rage are recommended for single-cell
datasets of around 3,000 cells, but can be increased for larger
datasets.
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95828
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8575
## Number of communities: 9
## Elapsed time: 0 seconds
The clusters are saved in the metadata slot
(seurat_clusters
column) and can be accessed using the
Idents()
function.
## orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5
## AAACATACAACCAC-1 PBMC 2421 781 3.0152829 0
## AAACATTGAGCTAC-1 PBMC 4903 1352 3.7935958 3
## AAACATTGATCAGC-1 PBMC 3149 1131 0.8891712 0
## AAACCGTGCTTCCG-1 PBMC 2639 960 1.7430845 2
## AAACCGTGTATGCG-1 PBMC 981 522 1.2232416 6
## AAACGCACTGGTAC-1 PBMC 2164 782 1.6635860 0
## seurat_clusters
## AAACATACAACCAC-1 0
## AAACATTGAGCTAC-1 3
## AAACATTGATCAGC-1 0
## AAACCGTGCTTCCG-1 2
## AAACCGTGTATGCG-1 6
## AAACGCACTGGTAC-1 0
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 0 3 0 2
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
We can now visualize the clustering results onto the 2D PCA plot
using the DimPlot()
function:
To better visualize scRNA-seq data, we can use non-linear dimensionality reduction approaches like t-SNE or UMAP.
We can run the t-SNE algorithm considering only the first 10 PC using
the RunTSNE()
function as shown in the following
command:
Now that the t-SNE results have been saved into the pbmc
object, they can be plotted using again the DimPlot()
function and specifying reduction = "tsne"
.
Question: Which dimensionality reduction approach works better in separating cells belonging to different clusters, PCA or t-SNE?
Similarly, we can compute the UMAP using the RunUMAP()
function.
Question: How can we visualize the UMAP plot with single cells coloured according to clustering results?
The FeaturePlot()
function can be used to plot the
expression of specific genes onto a dimensionality reduction plot. For
instance, we can plot the normalized expression levels of gene CD3E, T
cell-specific marker, onto the UMAP plot with the following command.
We can also plot more than one marker, e.g CD8A (CD8+ T cells), MS4A1 (B cells), GNLY (NK cells), and PPBP (platelets):
Question: By comparing this plot with the one coloured by clusters, which clusters coorespond to CD8+ T cells, B cells, NK cells, and platelets, respectively?
We can also visualize the distribution of the normalized expression levels for specific genes per cluster as violin plots (which we already made in the QC step):
Alternatively, we can use a visualization based on Rigeline plots (also called Joy plots):
Question: How can we make a Ridge plot for the NKG7 and IL7R genes?
Question: How can we make a violin plot for the NKG7 and IL7R genes?
The VlnPlot()
can be also used as we did at the
beginning of this exercise to plot the number of detected genes
(nFeature_RNA
), total counts (nCount_RNA
), and
percentage of mitochondrial RNA (percent.mt
), but
stratified by cluster.
Question: Which clusters of cells has (overall) the highest number of detected genes and total counts?
We can now perform differential expression analysis to identify cluster-specific markers.
The FindMarkers()
function can be used to identify
markers of a single cluster compared to all other cells. By specifying
the only.pos = TRUE
argument, we can look only at positive
markers –otherwise, by default, both positive and negative ones are
reported. The cluster of interest can be specified using the
ident.1
parameter. Here, for instance, we identify the
markers of cluster 3 (i.e. the one that expressed the B cell-specific
marker MS4A1):
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## CD79A 0.000000e+00 6.761646 0.938 0.043 0.000000e+00
## MS4A1 0.000000e+00 5.720184 0.861 0.054 0.000000e+00
## CD79B 1.263315e-271 4.604475 0.917 0.143 4.135840e-267
## LINC00926 1.592090e-267 7.110855 0.563 0.010 5.212186e-263
## TCL1A 7.293058e-267 6.521049 0.622 0.023 2.387601e-262
MS4A1 is, indeed, among the top markers identified by this analysis.
With the same function, we can identify the top five markers that
distinguish cluster 5 (ident.1 = 5
) from cluster 6
(ident.2 = 6
):
cluster5vs6.markers <- FindMarkers(pbmc,
ident.1 = 5,
ident.2 = 6,
only.pos = TRUE)
head(cluster5vs6.markers, 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## AIF1 1.059333e-56 6.407410 1 0.090 3.468044e-52
## LST1 5.376647e-56 6.031304 1 0.146 1.760207e-51
## COTL1 1.699901e-55 5.514206 1 0.188 5.565136e-51
## CST3 7.645408e-55 5.026532 1 0.243 2.502954e-50
## FTH1 9.227829e-53 3.672740 1 0.979 3.021007e-48
Question: Which are the top-three markers that distinguish cluster 7 from cluster 2?
cluster7vs2.markers <- FindMarkers(pbmc,
ident.1 = 7,
ident.2 = 2,
only.pos = TRUE)
head(cluster7vs2.markers, 3)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCER1A 1.198904e-52 6.160796 0.778 0.027 3.924973e-48
## SERPINF1 3.650442e-46 5.817548 0.593 0.013 1.195082e-41
## CD1C 2.217186e-43 5.291828 0.667 0.025 7.258624e-39
Alternatively, we could use the FindAllMarkers()
function, which systematically identifies markers specific for each
cluster:
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## LDHB 1.335922e-119 1.233504 0.968 0.578 4.373542e-115 0 LDHB
## IL7R 2.990959e-105 1.681329 0.756 0.286 9.791803e-101 0 IL7R
## LTB 1.572874e-102 1.264563 0.973 0.615 5.149274e-98 0 LTB
## CD3D 5.120533e-96 1.128731 0.907 0.392 1.676360e-91 0 CD3D
## IL32 1.862734e-91 1.084271 0.912 0.433 6.098217e-87 0 IL32
## CD3E 5.886229e-87 1.190251 0.837 0.369 1.927034e-82 0 CD3E
We can save the results in our current working directory as a .csv file with the current command:
Let’s imagine we have some prior knowledge on cell type-specific markers, as indicated in the table below, that allows us to assign each cluster to a cell identity:
Cluster ID | Markers | Cell Type |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | IL7R, S100A4 | Memory CD4+ T |
2 | CD14, LYZ | CD14+ Mono |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
We can use the RenameIdents()
function to rename the
clusters with more biologically meaningful names reflecting cell
identities:
new.cluster.ids <- c("Naive CD4 T",
"Memory CD4 T",
"CD14+ Mono",
"B",
"CD8 T",
"FCGR3A+ Mono",
"NK",
"DC",
"Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc,
new.cluster.ids)
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## Naive CD4 T B Naive CD4 T CD14+ Mono
## AAACCGTGTATGCG-1
## NK
## 9 Levels: Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK ... Platelet
We can now make UMAP plot colored according to the new cell
identities. We can also add the cell names directly to the UMAP by
specifying label = TRUE
, and remove the legend on the right
by adding + NoLegend()
.
Finally, the DotPlot()
function can be used to
visualize, for a set of genes specified through the
features
parameter, how their expression levels change
across clusters. In the dot plot, the size of the dot encodes the
percentage of cells within a cluster, while the color encodes the
average expression level across all cells within a cluster. In the
example code below, + RotatedAxis()
is added to rotate the
labels on the x-axis to make them more easily readable:
knownmarkers <- c("IL7R", "CCR7", "CD14", "LYZ",
"S100A4", "MS4A1", "CD8A",
"FCGR3A", "MS4A7", "GNLY", "NKG7",
"FCER1A", "CST3", "PPBP")
DotPlot(pbmc,
features = knownmarkers) + RotatedAxis()