WM7 - Computational and Systems Biology

DGE analysis with DESeq2

Bulk RNAseq data was obtained from a colorectal cancer cell line Caco-2 which was treated with the MEK inhibitor. The treated cells should be compared with untreated control cells to determine the effects of the inhibitor on the transcriptional level.

The experiment was done with 3 replicates for each condition (treated, untreated).

1. Install required R packages

We first need to install some R packages which help us to run our differential gene expression analysis with DESeq2.

tidyverse

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. It includes the following packages:

ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.
stringr, for strings.
forcats, for factors.
lubridate, for date/times.

Other packages

conflicted: alternative conflict resolution
writexl: Excel file writing
BiocManager: a package manager to install and manage R Bioconductor pagages
DESeq2: a Bioconductor package for differential gene expression analysis
ggrepel: provides geoms for ggplot2 to repel overlapping text labels
EnhancedVolcano: a package for creating publication ready volcano plots
ComplexHeatmap: this package provides a highly flexible way to generate heatmaps and supports various annotation graphics.
clusterProfiler: provides a universal interface for gene functional annotation from a variety of sources
org.Hs.eg.db: Genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers.
ReactomePA: This package provides functions for pathway analysis based on REACTOME pathway database.

install.packages("conflicted")
install.packages("tidyverse")
install.packages("writexl")
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("EnhancedVolcano")
BiocManager::install("ComplexHeatmap")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("ReactomePA")
install.packages("ggrepel")

2. Load the required R packages

After installation of the packages we need to load them into our workspace.

library(conflicted)
library(tidyverse)
library(writexl)
library(DESeq2)
library(ggrepel)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(clusterProfiler)
library(org.Hs.eg.db)
library(ReactomePA)

We also need to set preferred libraries for conflicting functions:

conflicts_prefer(dplyr::filter)

## [conflicted] Will prefer dplyr::filter over any other package.

conflicts_prefer(dplyr::select)

## [conflicted] Will prefer dplyr::select over any other package.

3. Load the sample sheet and prepare meta data

We next need to prepare the meta data, which is the sample information which is stored in our samplesheet.csv file.

We can use the read_csv() function from the readr package and assign the contents of the sample sheet to a tibble object named sample_info

sample_info <- read_csv("your samplesheet file name here")

Toggle solution

sample_info <- read_csv("sampleSheet_deseq2.csv")

Now we can remove the fastq_1, fastq_2 and strandedness columns since we do not need this information. To do so we can use the dplyr::select() function.

sample_info <- sample_info |> select(-fastq_1, -fastq_2, -strandedness)

Our sample_info should now look as follows:

4. Load the raw count data

From the nf-core/rnaseq pipeline run we obtained a result file which contains the raw read counts of all samples over all genes.

salmon.merged.gene_counts.tsv

We can use the read_tsv() function from the readr package to read the file and assign it to count_mat

count_mat <- read_tsv("your salmon.merged.gene_counts.tsv here")

Toggle solution

count_mat <- read_tsv("salmon.merged.gene_counts.tsv")

Our count_mat looks as follows:

The first column gene_id contains the Ensembl Gene IDs which are unique. The second column gene_names contains the HGNC Gene Symbols. The remaining columns contain the raw counts in each sample for each gene (-Note: the values are not integers-).

We now need to remove the gene_name column from the count_mat and convert the gene_id column to rownames. As the count values are not integers we also need to convert them using the round() function.

But first we will save the mapping of gene_id to gene_name into a new tibble object named ensg_to_symbol for later use.

We can use the dplyr select() function to extract the two columns (gene_id, gene_name). We also want to remove pseudo autosomal regions, since we sum those up below and unify the homologous genes.

ensg_to_symbol <- count_mat |> 
  select(gene_id, gene_name) |> 
  filter(!str_detect(gene_id, "PAR_Y"))

The ensg_to_symbol map object looks as follow:

We will now sum up the counts for genes that are present multiple times but have different Ensembl gene IDs.

count_mat_sum <- count_mat |>
  group_by(gene_name) |>
  summarize(across(starts_with("CaCo"), sum))
head(count_mat_sum, 5)

Now we need to re-add the gene_id column which we lost in the previous step. To add it using the dplyr left_join() function together with the ens_to_symbol object where we have the mapping information. For multiple matches, we have a one-to-many mapping, we just keep the first occurring match. We use relocate() to move the gene_id column from the last positions to the position before gene_name.

count_mat <- count_mat_sum |>
  left_join(ensg_to_symbol, multiple="first") |>
  relocate(gene_id, .before = gene_name)

## Joining with `by = join_by(gene_name)`

head(count_mat, 5)

We then use again select() to remove the gene_name column and pass the output to column_to_rownames() and then further on to round()

count_mat <- count_mat |> select(-gene_name) |> column_to_rownames("gene_id") |> round()

5. Create a DESeqDataSet

From our sample_info and our count_mat we can now create a DESeqDataSet using the DESeq2 function DESeqDataSetFromMatrix() and assign it to dds.

We need to pass the following information to DESeqDataSetFromMatrix():

countData, which is our count matrix from above
colData, which is out sample information
design, which indicates how to model the samples, i.e. here we want to measure the effect of the condition (treated vs. untreated)

Note: The factor variable condition should be columns of colData

Examples for models:

simple model with no covariate to measure the effect of the condition

design = ~ condition

model with batch as covariate to measure the effect of the condition and controlling for batch differences

design = ~ batch + condition

We do not have different batches, so we do not need to control for batch effects

dds <- DESeqDataSetFromMatrix(countData = count_mat,
                              colData = sample_info,
                              design= ~ condition)

## converting counts to integer mode

DESeqDataSet information

dds

## class: DESeqDataSet 
## dim: 61255 6 
## metadata(1): version
## assays(1): counts
## rownames(61255): ENSG00000276861.1 ENSG00000273730.1 ...
##   ENSG00000284550.1 ENSG00000281780.1
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition

61255 genes
6 samples
2 conditions

6. Filter for low expressed genes

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.

Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads per sample condition.

We use the collapseReplicates() function from DESeq2 to get the counts per condition and check if any of the conditions has more than 10 counts.

pass_min_counts <- counts(collapseReplicates(dds, dds$condition)) > 10

keep <- as.data.frame(pass_min_counts) |>
  mutate(keep = ifelse( MEKi | control, TRUE, FALSE)) |>
  pull(keep)

head(pass_min_counts, 5)

##                    control  MEKi
## ENSG00000276861.1    FALSE FALSE
## ENSG00000273730.1    FALSE  TRUE
## ENSG00000202198.1     TRUE  TRUE
## ENSG00000121410.12    TRUE  TRUE
## ENSG00000268895.6     TRUE  TRUE

head(keep, 5)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE

Now we have our list of genes that have at least 10 reads in any of the two conditions and we use it to filter the dds

dds <- dds[keep,]

Filtered DESeqDataSet information

dds

## class: DESeqDataSet 
## dim: 16982 6 
## metadata(1): version
## assays(1): counts
## rownames(16982): ENSG00000273730.1 ENSG00000202198.1 ...
##   ENSG00000074755.15 ENSG00000036549.13
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition

16982 genes
6 samples
2 conditions

7. Save filtered raw counts as TSV and Excel file

Now we can save our filtered raw counts as files.

We may first the gene symbols to the count table again by joining the ensg_to_symbol map object.

To do so, we get the count matrix from the dds object using the counts() function and transform it to a tibble and join it using the dplyr left_join() function. We use relocate() to move the gene_name column from the last positions to the position after gene_id

filtered_counts <- counts(dds) |> 
  as_tibble(rownames = "gene_id") |>
  left_join(ensg_to_symbol) |>
  relocate(gene_name, .after = gene_id)

The filtered_counts map object looks as follow:

Now lets save the table as Excel file using the write_xlsx() function:

x = filtered_counts
path = filtered_counts.xlsx

and as TSV file using the write_tsv() function:

x = filtered_counts
file = filtered_counts.xlsx

write_xlsx(...)
write_tsv(...)

Toggle solution

write_xlsx(filtered_counts, path = "filtered_counts.xlsx")
write_tsv(filtered_counts, file = "filtered_counts.tsv")

We should now have the following filtered count files:

## [1] "filtered_counts.tsv"  "filtered_counts.xlsx"

8. Normalize the raw counts and save as files

To manually normalize the raw count data we may utilize the estimateSizeFactors. To pull out the normalized counts we again use counts() but specify the option normalized=TRUE.

dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)

Repeating the steps from above (7. Save filtered raw counts) we can add the gene symbols and save the normalized counts as well:

normalized_counts.xlsx
normalized_counts.tsv

write your own code here
first left_join(ensg_to_symbol) 
then save xlsx and tsv

Toggle solution

normalized_counts <- normalized_counts |> 
  as_tibble(rownames = "gene_id") |>
  left_join(ensg_to_symbol) |>
  relocate(gene_name, .after = gene_id)

write_xlsx(normalized_counts, path = "normalized_counts.xlsx")
write_tsv(normalized_counts, file = "normalized_counts.tsv")

We should now have the following normalized count files:

## [1] "normalized_counts.tsv"  "normalized_counts.xlsx"

9. Set reference condtion and run Differential expression analysis

We want to make sure that our reference (control) condition is control. To do so we use the relevel() function.

dds$condition = relevel(dds$condition, "control")

Finally run DESeq

dds <- DESeq(dds)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

10. Fetch the results

We can get the results table from the dds object using the results() function.

res <- results(dds)

The result table should look as follows:

## log2 fold change (MLE): condition MEKi vs control 
## Wald test p-value: condition MEKi vs control 
## DataFrame with 5 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat    pvalue
##                    <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000273730.1    1.91806      4.3844998  3.929677 1.1157406  0.264533
## ENSG00000202198.1   32.91620      0.0186264  0.368995 0.0504787  0.959741
## ENSG00000121410.12  10.59886      0.3110182  0.596268 0.5216079  0.601943
## ENSG00000268895.6   13.31781      0.0497901  0.528262 0.0942527  0.924908
## ENSG00000148584.16  12.02246      0.5314377  0.570464 0.9315890  0.351549
##                         padj
##                    <numeric>
## ENSG00000273730.1         NA
## ENSG00000202198.1   0.979662
## ENSG00000121410.12  0.765222
## ENSG00000268895.6   0.961928
## ENSG00000148584.16  0.555553

Lets get some summary information:

summary(res)

We may want to save result to an Excel file now. But first lets add the gene names again. For this the result needs to be converted into a dataframe first.

res <- as.data.frame(res) |>
  rownames_to_column(var = "gene_id") |>
  as_tibble() |>
  left_join(ensg_to_symbol) |>
  relocate(gene_name, .after = gene_id)

The res table should look as follows:

Now get all differential expressed genes with adjusted p-values below FDR < 0.1 and sort by the smallest p-value. To do so we use the filter() function to filter padj be < 0.1 and the arrange() function to order the rows by padj

res_significant <- res |>
  filter(padj < 0.1) |>
  arrange(padj)

The res_significant table should look as follows:

Save as Excel and TSV

write your own code here
check above for method to save xlsx and tsv
save full results and significant results

Toggle solution

write_xlsx(res, path = "DESeq_result_full.xlsx")
write_xlsx(res_significant, path = "DESeq_result_significant.xlsx")
write_tsv(res, file = "DESeq_result_full.tsv")
write_tsv(res_significant, file = "DESeq_result_significant.tsv")

We should now have the following DESeq2 result files:

## [1] "DESeq_result_full.tsv"           "DESeq_result_significant_fc.tsv"
## [3] "DESeq_result_significant.tsv"

## [1] "DESeq_result_full.xlsx"           "DESeq_result_significant_fc.xlsx"
## [3] "DESeq_result_significant.xlsx"

11. Generate a PCA plot

To see if the MEKi treatment had a global effect on our cells, we will run a PCA analysis and plot the first two principal components in two dimensions and visually identify differences or similarities.

We will first perform a variance stabilizing transformations (VST) to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low.

For downstream analyses – e.g. for visualization or clustering (PCA) – it is useful to work with transformed versions of the count data.

We use the vst() function and set the option blind=FALSE to perform the transformation.

vsd <- vst(dds, blind=FALSE)

Now we can use the vsd (variance stabilized data) and the plotPCA() function to generate a PCA plot.

Note: we may use the function geom_text_repel(aes(label=name),vjust=2)) from the ggrepel package to add labels to the data points. This annotation can be added using the + sign after plotPCA() which is producing a ggplot() object. See also: https://ggplot2.tidyverse.org/reference/ggplot.html

write your own code here
use the plotPCA function and ggrepel to create a
PCA plot of the samples.

The resulting plot should look similar to this here:

Toggle solution

print(plotPCA(vsd) + geom_text_repel(aes(label=name),vjust=3,hjust=1.5))

12. Volcano plot

A volcano plot is a type of scatterplot that shows statistical significance (p value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.

We will use the EnhancedVolcano() function to generate a volcano plot to visually summarize differences in the gene expression.

write your own code here
use the EnhancedVolcano() function to plot the data in DESeq res 

on the x axis you want to plot the log2FoldChange
on the y axis you want to plot the padj
set the pCutoff to 0.1
set the FCcutoff to 1

The resulting plot should look similar to this here:

Toggle solution

EnhancedVolcano(res,
                lab = res$gene_name,
                x = "log2FoldChange",
                y = "padj",
                pCutoff = 0.1,
                FCcutoff = 1,
                gridlines.major = FALSE,
                title = 'DESeq2 results',
                subtitle = "",
                legendLabels = c("NS", expression(Log[2] ~ FC), "p-adj", expression("p-adj" ~ and ~ log[2] ~ FC)),
                ylab = bquote(~-Log[10] ~ italic("p-adj"))
                )

As we can see there are some genes with extreme p-values on the y-axis and Log2foldChanges on the x-axis. This makes it difficult to display genes that are not as stongly regulated.

To better show those genes we may limit (xlim, ylim) the axes extend to -5,5 and 0, -log10(10e-10)

Toggle solution

EnhancedVolcano(res,
                lab = res$gene_name,
                x = "log2FoldChange",
                y = "padj",
                pCutoff = 0.1,
                FCcutoff = 1,
                gridlines.major = FALSE,
                xlim = c(-5, 5),
                ylim = c(0, -log10(10e-10)),
                title = 'DESeq2 results',
                subtitle = "",
                legendLabels = c("NS", expression(Log[2] ~ FC), "p-adj", expression("p-adj" ~ and ~ log[2] ~ FC)),
                ylab = bquote(~-Log[10] ~ italic("p-adj"))
                )

13. List of significant de genes with FC > 2

We now want to save genes that are differential expressed in the MEKi treated cells. We define the following cutoffs:

Foldchange > 2 : abs(log2FoldChange) > 1
p-adjusted < 0.1: padj < 0.1

We can use the dplyr filter() function to generate the filtered results.

write your own code here

res_sig_fc <- res |> filter .....

Note: you can use multiple filter criteria in the filter() function by
separating them with (,)

Toggle solution

res_sig_fc <- res |>
  filter(padj < 0.1, abs(log2FoldChange) > 1) |>
  arrange(padj)

The res_sig_fc table should look as follows:

Question:

How many genes did pass your filters?

We will also save this result as Excel and TSV file:

write your own code here

it should save res_sig_fc as DESeq_result_significant_fc.xlsx and DESeq_result_significant_fc.tsv

Toggle solution

write_xlsx(res_sig_fc, path = "DESeq_result_significant_fc.xlsx")
write_tsv(res_sig_fc, file = "DESeq_result_significant_fc.tsv")

We should now have the following DESeq2 result files:

## [1] "DESeq_result_significant_fc.tsv"

## [1] "DESeq_result_significant_fc.xlsx"

14. Heatmap of significant de genes with FC > 2

Now that we have a list de genes we can use the TPM values of the TOP 30 up and down regulated genes and plot them as heatmap.

To do so we will load the TPM data obtained from the nf-core/rnaseq pipeline.

salmon.merged.gene_tpm.tsv

We can use the read_tsv() function from the readr package to read the file and assign it to tpm_mat

tpm_mat <- read_tsv("your salmon.merged.gene_tpm.tsv here")

Toggle solution

tpm_mat <- read_tsv("salmon.merged.gene_tpm.tsv")

Our tpm_mat looks as follows:

As with the raw counts (se above) we will need to sum up the TPMs for genes that are present multiple times but have different Ensembl gene IDs.

tpm_mat_sum <- tpm_mat |>
  group_by(gene_name) |>
  summarize(across(starts_with("CaCo"), sum))
head(tpm_mat_sum, 5)

tpm_mat <- tpm_mat_sum |>
  left_join(ensg_to_symbol, multiple="first") |>
  relocate(gene_id, .before = gene_name)

## Joining with `by = join_by(gene_name)`

Our tpm_mat looks as follows:

We will now filter the TPM data to extract only the TOP 30 up/down regulated de genes (FC > 2, padj < 0.1). We will use the dplyr filter() and our de gene list from above (12.) from which we use the gene_id column as filter.

get the top 30 up/down regulated genes

We first select the top 15 genes based on the “log2FoldChange” column using top_n(15, log2FoldChange). Then, we use bind_rows() to combine the top 15 genes based on the highest “log2FoldChange” values with the bottom 15 genes based on the lowest “log2FoldChange” top_n(-15, log2FoldChange)

top_bottom_genes <- res_sig_fc |>
  top_n(15, log2FoldChange) |>
  bind_rows(
    res_sig_fc |>
      top_n(-15, log2FoldChange)
  )

Finally filter the tpm_mat

de_tpm <- tpm_mat |>
  filter(gene_id %in% top_bottom_genes$gene_id)

dim(de_tpm)

## [1] 30  8

Our de_tpm looks as follows:

For making the heatmap we now convert the de_tpm to a matrix

# gene_names <- de_tpm$gene_name
de_tpm_mat <- de_tpm |>
  select(-gene_id) |>
  column_to_rownames("gene_name") |>
  as.matrix()

To better visualize the expression differences between we the scale (z-score) the TPM rowwise over the samples.

de_tpm_scaled <- t(scale(t(de_tpm_mat), center = TRUE, scale = TRUE))

We now sort the de_tpm_scaled matrix according to our top/bottom gene sort order

de_tpm_scaled <- de_tpm_scaled[top_bottom_genes$gene_name,]

Now we use Heatmap() from ComplexHeatmap to produce the heatmap.

https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html

The result should be similar to the one shown here

Heatmap(de_tpm_scaled,
        rect_gp = gpar(col = "white", lwd = 1),
        column_split = sample_info$condition,
        row_split = c(rep("top", 15), rep("bottom", 15)),
        heatmap_legend_param = list(title = "z-score")
        )