Initialization

In this exercise, we will analyze a scRNA-seq dataset from peripheral blood mononuclear cells (PBMC) generated with the 10x Genomics platform.

Data download

Create a folder called Ex_single_cell.
Download the raw PBMC single-cell data in the Ex_single_cell folder from this link.
Unzip the pbmc3k_filtered_gene_bc_matrices.tar.gz zipped folder.

RStudio initialization

Open RStudio.
Open a new script by selecting on the RStudio toolbar: File > New File > Rscript.
Save the script in the Ex_single_cell folder by selecting: File > Save As.
Set the working directory to Ex_single_cell by selecting: Session > Set Working Directory > To Source File Location (by doing this, RStudio will consider the directory containing your code and data as the working directory).

Let’s have a look at the format of the data we have just downloaded:

dir("filtered_gene_bc_matrices/",
   recursive = TRUE,
   include.dirs = TRUE)

## [1] "hg19"              "hg19/barcodes.tsv" "hg19/genes.tsv"   
## [4] "hg19/matrix.mtx"

We have downloaded a directory called filtered_gene_bc_matrices, which contains a sub-directory names hg19 (i.e. the version of the human genome that was used to summarize the scRNA-seq data), which in turn contains three text files generated with the CellRanger pipeline: barcodes.tsv, genes.tsv, and matrix.mtx. These files contains the count expression values computed for each gene/UMI and cell (more information on these files can be found here).

Data loading

In this exercise we will use the Seurat R toolkit to perform scRNA-seq analysis of the PBMC single-cell dataset. To this end, we need to load the Seurat package using the library() function.

library(Seurat)

And, before we begin the analysis, let’s set the seed of the random number generator as indicated below to ensure reproducible plots and results:

set.seed(1234)

We can use the Read10X() function to read the CellRanger output files from the directory specified by the data.dir parameter.

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

Let’s print to screen the count values of three genes (i.e. rows) in the first 10 cells (i.e. columns).

Little R note:: the c() function can be used when we want to specific a list of numbers, strings, etc. For instance, c("A", "B", "C") is used to list the first three letters of the alphabet, and c(1, 2, 3) the first three non-negative integers –strings must be indicated in quotation marks, while numbers not. In R, a more compact form to specify a sequence of integers is through the use of “:”: the command 1:4 is equivalent to c(1, 2, 3, 4).

pbmc.data[c("CD8A", "HLA-A", "MS4A1"), 1:10]

## 3 x 10 sparse Matrix of class "dgCMatrix"

##                           
## CD8A  1 .  . . . . . 1 . .
## HLA-A 5 5 12 7 4 8 7 7 3 1
## MS4A1 . 6  . . . . . . 1 1

As the count matrix is very sparse, namely it contains a lot of 0’s (represented as dots), a sparse-matrix representation is used to reduce memory and computational time.

We can now save this data into a Seurat object, called pbmc, which we will use throughout our analysis. The object will be used to store the raw data, normalized data, and all the results of our analyses (e.g. highly variable genes, PCA, clustering).

pbmc <- CreateSeuratObject(counts = pbmc.data,
                           project = "PBMC")

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

Let’s have a look at the pbmc object we have just created:

print(pbmc)

## An object of class Seurat 
## 32738 features across 2700 samples within 1 assay 
## Active assay: RNA (32738 features, 0 variable features)
##  1 layer present: counts

It contains count data for 32,738 genes across 2,700 cells.

The pbmc object has a metadata slot containg, besides the cell type labels reported in the orig.ident column (for now, all equal to the project name we specified), two covariates that are important for the quality control step:

nCount_RNA: the number of total counts per barcode;
nFeature_RNA: the number of detected features (i.e. genes) per cell;

head(pbmc@meta.data)

##                  orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1       PBMC       2421          781
## AAACATTGAGCTAC-1       PBMC       4903         1352
## AAACATTGATCAGC-1       PBMC       3149         1131
## AAACCGTGCTTCCG-1       PBMC       2639          960
## AAACCGTGTATGCG-1       PBMC        981          522
## AAACGCACTGGTAC-1       PBMC       2164          782

We can add to this slot information of the percentage of mitochondrial RNA using the PercentageFeatureSet() function. This function allows to calculate the percentage of all the counts belonging to a subset of the possible features for each cell; in this case, all genes whose name starts with “MT-” (specified by the pattern argument).

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, 
                                             pattern = "^MT-")

Quality control (QC)

Let’s visualize the distribution of our three QC variables with a violin plot. A violin plot depicts the distribution of numeric data for one or more groups using density curves.

VlnPlot(pbmc, 
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
        ncol = 3)

We can also use a scatterplot to visualize the relationship between nCount_RNA and percent.mt:

FeatureScatter(pbmc, 
               feature1 = "nCount_RNA", 
               feature2 = "percent.mt")

or between nCount_RNA and nFeature_RNA:

FeatureScatter(pbmc, 
               feature1 = "nCount_RNA", 
               feature2 = "nFeature_RNA")

Question: How can we make a scatterplot of nFeature_RNA vs. percent.mt?

FeatureScatter(pbmc, 
               feature1 = "nFeature_RNA", 
               feature2 = "percent.mt")

We now want to filter out cells that: - Have a very high number of detected genes (i.e. possible doublets) or very low number of detected genes (i.e. possible low-quality cells or empty droplets); - Have high mitochondrial counts (possible low-quality cells).

Precisely, based on the previous plots, we will only keep cells that have nFeature_RNA in the 200-2,500 range and percent.mt < 5%. To this end, we will use the subset() function, which return subsets of data which meet specific conditions.

pbmc <- subset(pbmc, 
               subset =  nFeature_RNA > 200 & 
                 nFeature_RNA < 2500 & 
                 percent.mt < 5)

print(pbmc)

## An object of class Seurat 
## 32738 features across 2638 samples within 1 assay 
## Active assay: RNA (32738 features, 0 variable features)
##  1 layer present: counts

Question: How many cells have we removed?

Note: in this and in most of the commands of this exercise, our Seurat object pbmc is both the input and the output of our Seurat functions, meaning that the object is updated using the newly computed data. Depending on the function used, Seurat saves the new data in specific slots of the Seurat object (which has a quite complex structure, discussed in this Wiki).

Analysis and visualization

Normalization, feature selection, and scaling

We will normalize our count data using the NormalizeData() function. By default, this function performs global-scaling normalization using the “LogNormalize” approach, which:

Divides gene counts by the total count depth in each cell;
Multiplies this by a (10,000 by default, can be changed with the scale.factor argument);
Log-transforms the result.

pbmc <- NormalizeData(pbmc)

## Normalizing layer: counts

Now that gene expression levels are comparable across cells, we can identify highly variable genes (HVG), i.e. genes that are highly expressed in some cells and lowly expressed in others, using the FindVariableFeatures() function. For this analysis, we will only consider the top 2,000 most variable genes.

pbmc <- FindVariableFeatures(pbmc, 
                             nfeatures = 2000)

## Finding variable features for layer counts

print(pbmc)

## An object of class Seurat 
## 32738 features across 2638 samples within 1 assay 
## Active assay: RNA (32738 features, 2000 variable features)
##  2 layers present: counts, data

Let’s have a look at the list of HVG, which can be extracted from the pbmc object using the VariableFeatures() function. We will use the head function to visualize only the top 10 ones:

head(VariableFeatures(pbmc), 10)

##  [1] "PPBP"   "LYZ"    "S100A9" "IGLL5"  "FTL"    "GNLY"   "PF4"    "FTH1"  
##  [9] "GNG11"  "S100A8"

We can also save this list in a variable called top10:

top10 <- head(VariableFeatures(pbmc), 10)

We can now make a scatterplot showing the relationship between the average expression and standardized variance for all genes, with HVG highlighted in red, using the VariableFeaturePlot() function. By invoking the LabelPoints() function on the plot we have just generated, we can then label the top-10 most variable genes saved in top10 by passing this list to the points argument.

plot1 <- VariableFeaturePlot(pbmc)
LabelPoints(plot = plot1,
            points = top10,
            repel = TRUE)

Question: How can we make the same plot but labelling the top 20 most variable genes?

top20 <- head(VariableFeatures(pbmc), 20)

plot2 <- VariableFeaturePlot(pbmc)
LabelPoints(plot = plot2,
            points = top20,
            repel = TRUE)

We finally apply a linear transformation, namely scaling, which is a standard pre-processing step prior to dimensional reduction techniques like PCA to ensure that all features have a similar range of values and, thus, equal “weight” in the analysis. We will use the ScaleData() function, which by default:

Shifts the expression of each gene, so that the mean expression across cells is 0 (i.e. removes the mean across cells);
Scales the expression of each gene, so that the variance across cells is 1 (i.e. divides by the variance across cells).

We have to specify the features = rownames(pbmc) argument to make sure that all genes are scaled and not only the highly variable ones, which is the default behaviour of this function.

pbmc <- ScaleData(pbmc, 
                  features = rownames(pbmc))

Dimensionality reduction

We can now perform principal component analysis (PCA) on the scaled data using the RunPCA() function. By default, only the data from the previously HVG are used as input.

pbmc <- RunPCA(pbmc)

The PCA results are saved into the pbmc object. In the example below, we visualize the top 5 genes (positively or negatively) associated with the top 4 principal components (PC):

print(pbmc[["pca"]], 
  dims = 1:4, 
  nfeatures = 5)

## PC_ 1 
## Positive:  AIF1, FCN1, FCER1G, CFD, SERPINA1 
## Negative:  MALAT1, IL7R, ACAP1, B2M, CTSW 
## PC_ 2 
## Positive:  GZMA, CST7, PRF1, CTSW, B2M 
## Negative:  MS4A1, HLA-DQA1, HLA-DQB1, LINC00926, CD79B 
## PC_ 3 
## Positive:  HLA-DQA1, HLA-DQB1, CD79B, HLA-DRB1, MS4A1 
## Negative:  SPARC, GNG11, NRGN, GP9, TUBB1 
## PC_ 4 
## Positive:  IL7R, S100A6, MAL, FYB, GIMAP7 
## Negative:  HLA-DQA1, CD79B, HLA-DQB1, MS4A1, HLA-DRB1

Using the VizDimLoadings() function, we can also visualize the loadings (positive or negative) of the top genes associated with the principal components of interest. To do so, we need to specify which type of dimensionality reduction results we want to access (reduction = "pca") and the dimensions we are interested in (here, principal components PC1 and PC2 as indicated by dims = 1:2).

VizDimLoadings(pbmc, 
              reduction = "pca",
              dims = 1:2)

Question: Which is the gene with the largest negative association to PC1?

The top associated genes, and their loadings, can be also visualized in an heatmap using the DimHeatmap() function. The cells argument can be used to restrict the analysis to the most “extreme” cells to speed up the computation. In the resulting heatmap, both cells and genes are sorted according to their scores, allowing the detection of interesting or uninteresting patterns. These plots are useful, for instance, to identify PC strongly associated to cell cycle genes (e.g. TOP2A, MKI67) or to decide which PC to include in downstream analyses.

DimHeatmap(pbmc, 
           reduction = "pca",
           dims = 1,
           cells = 500)

We can also plot more than one component, e.g. PC1 and PC20:

DimHeatmap(pbmc, 
           reduction = "pca",
           dims = c(1, 20),
           cells = 500)

Question: What do you notice when comparing PC1 and PC20 heatmaps?

Question: How can we visualize the gene loadings of the first 9 PC using the DimHeatmap() function?

DimHeatmap(pbmc, 
           reduction = "pca",
           dims = 1:6,
           cells = 500)

The next step of the analysis is clustering, which is performed on the single-cell PCA scores. To this end, we have to select the top PC that are needed to capture most of the data variance.

To decide how many PC to select, we can use an heuristic approach: we can plot the standard deviation or variance captured by each PC using the ElbowPlot() function and (try to) identify an “elbow”, where the curve levels off, indicating that components to the right of this point add limited information. With the command below, we generate an elbow plot for the first 20 PC:

ElbowPlot(pbmc,
          reduction = "pca",
          ndims = 20)

In this plot, we can see an elbow at about 10 PC, which we will use as a cut-off for downstream analysis.

Deciding how many PC to consider is not trivial and might require several iterations, for instance, to evaluate how the final results in terms of cell clustering and annotations are affected by different choices. In general, results are not heavily affected when more PC are considered (e.g. 15, 20), whereas can change drastically if a low number of PC is chosen (e.g. 3, 5).

Clustering

Seurat uses a graph-based clustering approach based on the previously identified PC. Briefly, it builds namely a KNN graph based on the Euclidean distance between cells in PCA space. Then, it refines the edge weights between pairs of cells based on the overlap in their local neighborhoods. This step is performed using the FindNeighbors() function, which takes as input the previously defined dimensionality of the dataset (here, the first 10 PC).

pbmc <- FindNeighbors(pbmc, 
                      dims = 1:10)

Based on this graph, cells are then clustered using the Louvain algorithm (default in the FindClusters() function) by performing modularity optimization;
The resolution argument controls cluster granularity: set to 1.0 by default, can be increased (decreased) to obtain more (fewer) clusters. Values in the [0.4-1.2] rage are recommended for single-cell datasets of around 3,000 cells, but can be increased for larger datasets.

pbmc <- FindClusters(pbmc, 
                     resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95828
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8575
## Number of communities: 9
## Elapsed time: 0 seconds

The clusters are saved in the metadata slot (seurat_clusters column) and can be accessed using the Idents() function.

 head(pbmc@meta.data)

##                  orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5
## AAACATACAACCAC-1       PBMC       2421          781  3.0152829               0
## AAACATTGAGCTAC-1       PBMC       4903         1352  3.7935958               3
## AAACATTGATCAGC-1       PBMC       3149         1131  0.8891712               0
## AAACCGTGCTTCCG-1       PBMC       2639          960  1.7430845               2
## AAACCGTGTATGCG-1       PBMC        981          522  1.2232416               6
## AAACGCACTGGTAC-1       PBMC       2164          782  1.6635860               0
##                  seurat_clusters
## AAACATACAACCAC-1               0
## AAACATTGAGCTAC-1               3
## AAACATTGATCAGC-1               0
## AAACCGTGCTTCCG-1               2
## AAACCGTGTATGCG-1               6
## AAACGCACTGGTAC-1               0

 head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                0                3                0                2 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

Dimensionality reduction for data visualization

We can now visualize the clustering results onto the 2D PCA plot using the DimPlot() function:

DimPlot(pbmc,
        reduction = "pca")

To better visualize scRNA-seq data, we can use non-linear dimensionality reduction approaches like t-SNE or UMAP.

We can run the t-SNE algorithm considering only the first 10 PC using the RunTSNE() function as shown in the following command:

pbmc <- RunTSNE(pbmc, 
                dims = 1:10)

Now that the t-SNE results have been saved into the pbmc object, they can be plotted using again the DimPlot() function and specifying reduction = "tsne".

DimPlot(pbmc, 
        reduction = "tsne")

Question: Which dimensionality reduction approach works better in separating cells belonging to different clusters, PCA or t-SNE?

Similarly, we can compute the UMAP using the RunUMAP() function.

pbmc <- RunUMAP(pbmc,
                dims = 1:10)

Question: How can we visualize the UMAP plot with single cells coloured according to clustering results?

  DimPlot(pbmc, 
          reduction = "umap")

Data visualization

The FeaturePlot() function can be used to plot the expression of specific genes onto a dimensionality reduction plot. For instance, we can plot the normalized expression levels of gene CD3E, T cell-specific marker, onto the UMAP plot with the following command.

FeaturePlot(pbmc, 
            features = "CD3E",
            reduction = "umap")

We can also plot more than one marker, e.g CD8A (CD8+ T cells), MS4A1 (B cells), GNLY (NK cells), and PPBP (platelets):

FeaturePlot(pbmc, 
            features = c("CD8A", "MS4A1", "GNLY", "PPBP"), 
            reduction = "umap")

Question: By comparing this plot with the one coloured by clusters, which clusters coorespond to CD8+ T cells, B cells, NK cells, and platelets, respectively?

We can also visualize the distribution of the normalized expression levels for specific genes per cluster as violin plots (which we already made in the QC step):

VlnPlot(pbmc,
        features = "CD3E")

Alternatively, we can use a visualization based on Rigeline plots (also called Joy plots):

RidgePlot(pbmc,
          features = "CD3E")

Question: How can we make a Ridge plot for the NKG7 and IL7R genes?

RidgePlot(pbmc,
          features = c("NKG7", "IL7R"))

Question: How can we make a violin plot for the NKG7 and IL7R genes?

VlnPlot(pbmc,
        features = c("NKG7", "IL7R"))

The VlnPlot() can be also used as we did at the beginning of this exercise to plot the number of detected genes (nFeature_RNA), total counts (nCount_RNA), and percentage of mitochondrial RNA (percent.mt), but stratified by cluster.

VlnPlot(pbmc,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"))

Question: Which clusters of cells has (overall) the highest number of detected genes and total counts?

Differential expression analysis

We can now perform differential expression analysis to identify cluster-specific markers.

The FindMarkers() function can be used to identify markers of a single cluster compared to all other cells. By specifying the only.pos = TRUE argument, we can look only at positive markers –otherwise, by default, both positive and negative ones are reported. The cluster of interest can be specified using the ident.1 parameter. Here, for instance, we identify the markers of cluster 3 (i.e. the one that expressed the B cell-specific marker MS4A1):

cluster3.markers <- FindMarkers(pbmc, 
                                ident.1 = 3,
                                only.pos = TRUE)
head(cluster3.markers, 5)

##                   p_val avg_log2FC pct.1 pct.2     p_val_adj
## CD79A      0.000000e+00   6.761646 0.938 0.043  0.000000e+00
## MS4A1      0.000000e+00   5.720184 0.861 0.054  0.000000e+00
## CD79B     1.263315e-271   4.604475 0.917 0.143 4.135840e-267
## LINC00926 1.592090e-267   7.110855 0.563 0.010 5.212186e-263
## TCL1A     7.293058e-267   6.521049 0.622 0.023 2.387601e-262

MS4A1 is, indeed, among the top markers identified by this analysis.

With the same function, we can identify the top five markers that distinguish cluster 5 (ident.1 = 5) from cluster 6 (ident.2 = 6):

cluster5vs6.markers <- FindMarkers(pbmc, 
                                   ident.1 = 5,
                                   ident.2 = 6,
                                   only.pos = TRUE)
head(cluster5vs6.markers, 5)

##              p_val avg_log2FC pct.1 pct.2    p_val_adj
## AIF1  1.059333e-56   6.407410     1 0.090 3.468044e-52
## LST1  5.376647e-56   6.031304     1 0.146 1.760207e-51
## COTL1 1.699901e-55   5.514206     1 0.188 5.565136e-51
## CST3  7.645408e-55   5.026532     1 0.243 2.502954e-50
## FTH1  9.227829e-53   3.672740     1 0.979 3.021007e-48

Question: Which are the top-three markers that distinguish cluster 7 from cluster 2?

cluster7vs2.markers <- FindMarkers(pbmc, 
                                   ident.1 = 7,
                                   ident.2 = 2,
                                   only.pos = TRUE)
head(cluster7vs2.markers, 3)

##                 p_val avg_log2FC pct.1 pct.2    p_val_adj
## FCER1A   1.198904e-52   6.160796 0.778 0.027 3.924973e-48
## SERPINF1 3.650442e-46   5.817548 0.593 0.013 1.195082e-41
## CD1C     2.217186e-43   5.291828 0.667 0.025 7.258624e-39

Alternatively, we could use the FindAllMarkers() function, which systematically identifies markers specific for each cluster:

pbmc.markers <- FindAllMarkers(pbmc, 
                               only.pos = TRUE)
head(pbmc.markers)

##              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
## LDHB 1.335922e-119   1.233504 0.968 0.578 4.373542e-115       0 LDHB
## IL7R 2.990959e-105   1.681329 0.756 0.286 9.791803e-101       0 IL7R
## LTB  1.572874e-102   1.264563 0.973 0.615  5.149274e-98       0  LTB
## CD3D  5.120533e-96   1.128731 0.907 0.392  1.676360e-91       0 CD3D
## IL32  1.862734e-91   1.084271 0.912 0.433  6.098217e-87       0 IL32
## CD3E  5.886229e-87   1.190251 0.837 0.369  1.927034e-82       0 CD3E

We can save the results in our current working directory as a .csv file with the current command:

write.csv(pbmc.markers,
          quote = FALSE,
          row.names = FALSE,
          file = "PBMC_DEGs.csv")

Manual cell-type classification

Let’s imagine we have some prior knowledge on cell type-specific markers, as indicated in the table below, that allows us to assign each cluster to a cell identity:

Cluster ID	Markers	Cell Type
0	IL7R, CCR7	Naive CD4+ T
1	IL7R, S100A4	Memory CD4+ T
2	CD14, LYZ	CD14+ Mono
3	MS4A1	B
4	CD8A	CD8+ T
5	FCGR3A, MS4A7	FCGR3A+ Mono
6	GNLY, NKG7	NK
7	FCER1A, CST3	DC
8	PPBP	Platelet

We can use the RenameIdents() function to rename the clusters with more biologically meaningful names reflecting cell identities:

new.cluster.ids <- c("Naive CD4 T",
                     "Memory CD4 T", 
                     "CD14+ Mono", 
                     "B", 
                     "CD8 T", 
                     "FCGR3A+ Mono",
                     "NK", 
                     "DC", 
                     "Platelet")
names(new.cluster.ids) <- levels(pbmc)

pbmc <- RenameIdents(pbmc, 
                     new.cluster.ids)

head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##      Naive CD4 T                B      Naive CD4 T       CD14+ Mono 
## AAACCGTGTATGCG-1 
##               NK 
## 9 Levels: Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK ... Platelet

We can now make UMAP plot colored according to the new cell identities. We can also add the cell names directly to the UMAP by specifying label = TRUE, and remove the legend on the right by adding + NoLegend().

DimPlot(pbmc, 
        reduction = "umap", 
        label = TRUE) + NoLegend()

Finally, the DotPlot() function can be used to visualize, for a set of genes specified through the features parameter, how their expression levels change across clusters. In the dot plot, the size of the dot encodes the percentage of cells within a cluster, while the color encodes the average expression level across all cells within a cluster. In the example code below, + RotatedAxis() is added to rotate the labels on the x-axis to make them more easily readable:

knownmarkers <- c("IL7R", "CCR7", "CD14", "LYZ",
                  "S100A4", "MS4A1", "CD8A",
                  "FCGR3A", "MS4A7", "GNLY", "NKG7",
                  "FCER1A", "CST3", "PPBP")

DotPlot(pbmc,
  features = knownmarkers) + RotatedAxis()

scRNA-seq analysis

Francesca Finotello, PhD

June 04, 2025