WM2 - Bioinformatics

DGE analysis with DESeq2

For this example we use bulk RNAseq data obtained from a colorectal cancer cell line Caco-2 which was treated with the MEK inhibitor. The treated cells should be compared with untreated control cells to determine the effects of the inhibitor on the transcriptional level.

The experiment was done with 3 replicates for each condition (treated, untreated).

Data files

raw read count table: salmon.merged.gene_counts.tsv
TPM table: salmon.merged.gene_tpm.tsv

Questions:

Provide some more information about the Caco-2 celline.
What is the MEK inihbitor?

1. Install required R packages

We first need to install some R packages which help us to run our differential gene expression analysis with DESeq2.

tidyverse

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. It includes the following packages:

ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.
stringr, for strings.
forcats, for factors.
lubridate, for date/times.

Other packages

conflicted: alternative conflict resolution
writexl: Excel file writing
BiocManager: a package manager to install and manage R Bioconductor pagages
DESeq2: a Bioconductor package for differential gene expression analysis
ggrepel: provides geoms for ggplot2 to repel overlapping text labels
EnhancedVolcano: a package for createing publication ready volcano plots
ComplexHeatmap: this package provides a highly flexible way to generate heatmaps and supports various annotation graphics.

install.packages("conflicted")
install.packages("tidyverse")
install.packages("writexl")
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("EnhancedVolcano")
BiocManager::install("ComplexHeatmap")
install.packages("ggrepel")

2. Load the required R packages

After installation of the packages we need to load them into our workspace.

library(conflicted)
library(tidyverse)
library(writexl)
library(DESeq2)
library(ggrepel)
library(EnhancedVolcano)
library(ComplexHeatmap)

We also need to set preferred libraries for conflicting functions:

conflicts_prefer(dplyr::filter)

## [conflicted] Will prefer dplyr::filter over any other package.

conflicts_prefer(dplyr::select)

## [conflicted] Will prefer dplyr::select over any other package.

3. Load the sample sheet and prepare meta data

We next need to prepare the meta data, which is the sample information stored in a comma separated value file samplesheet.csv file. We can use a text editor to create the samplesheet.csv file. It consists of 2 columns (separated by commas), the first is named sample the second condition.

sample,condition
CaCo2_1,control
CaCo2_2,control
CaCo2_3,control
CaCo2_4,MEKi
CaCo2_5,MEKi
CaCo2_6,MEKi

As we can see, each sample is assigned a condition. We have 2 conditions control and MEKi and 3 replicates per condition.

We can use the read_csv() function from the readr package and assign the contents of the sample sheet to a tibble object named sample_info

sample_info <- read_csv("your samplesheet file name here")

Our sample_info should now look as follows:

4. Load the raw count data

The raw count data for the experiment was obtained from the nf-core/rnaseq pipeline. The pipeline uses the raw sequencing reads in fastq format and aligns them to a reference transcriptome and quantifies the number of aligned reads for each single gene. The result file contains the raw read counts for all samples over all genes in a tabulator separated value (TSV) table with genes as rows and samples as columns.

salmon.merged.gene_counts.tsv (see link above)

We can use the read_tsv() function from the readr package to read the file and assign it to count_mat

count_mat <- read_tsv("your salmon.merged.gene_counts.tsv here")

Our count_mat looks as follows:

To show the first 5 lines we can use the R head() command and pipe the contents of count_mat into it, e.g. count_mat |> head(5)

The first column gene_id contains the Ensembl Gene IDs which are unique. The second column gene_names contains the HGNC Gene Symbols. The remaining columns contain the raw counts in each sample for each gene (Note: the values are not integers).

We now need to remove the gene_name column from the count_mat and convert the gene_id column to rownames. As the count values are not integers we also need to convert them using the round() function.

But first we will save the mapping of gene_id to gene_name into a new tibble object named ensg_to_symbol for later use.

We can use the dplyr select() function to extract the two columns (gene_id, gene_name). We also want to remove pseudo autosomal regions, since we sum those up below and unify the homologous genes.

ensg_to_symbol <- count_mat |> 
  select(gene_id, gene_name) |> 
  filter(!str_detect(gene_id, "PAR_Y"))

The ensg_to_symbol map object looks as follow:

We will now sum up the counts for genes that are present multiple times but have different Ensembl gene IDs. This is done by first grouping the rows by gene_name and then summing the grouped gene counts up for each column that starts with CaCo2.

count_mat_sum <- count_mat |>
  group_by(gene_name) |>
  summarize(across(starts_with("CaCo"), sum))
head(count_mat_sum, 5)

Now we need to re-add the gene_id column which we lost in the previous step. To add it using the dplyr left_join() function together with the ens_to_symbol object where we have the mapping information. For multiple matches, we have a one-to-many mapping, we just keep the first occurring match. We use relocate() to move the gene_id column from the last positions to the position before gene_name.

count_mat <- count_mat_sum |>
  left_join(ensg_to_symbol, multiple="first") |>
  relocate(gene_id, .before = gene_name)

## Joining with `by = join_by(gene_name)`

head(count_mat, 5)

We then use again select() to remove the gene_name column and pass the output to column_to_rownames() and then further on to round()

count_mat <- count_mat |> select(-gene_name) |> column_to_rownames("gene_id") |> round()

Questions:

What data does the final count_mat contain?
What operation is a lef_join() function performing?
Are the raw counts comparable between the different samples? (explain your answer)

5. Create a DESeqDataSet

From our sample_info and our count_mat we can now create a DESeqDataSet using the DESeq2 function DESeqDataSetFromMatrix() and assign it to dds.

We need to pass the following information to DESeqDataSetFromMatrix():

countData, which is our count matrix from above
colData, which is out sample information
design, which indicates how to model the samples, i.e. here we want to measure the effect of the condition (treated vs. untreated)

Note: The factor variable condition should be columns of colData

Examples for models:

simple model with no covariate to measure the effect of the condition

design = ~ condition

model with batch as covariate to measure the effect of the condition and controlling for batch differences. In such a case the samplesheet.csv would need an additional column named batch, which asigns a batch name to each sample.

design = ~ batch + condition

We do not have different batches, so we do not need to control for batch effects

dds <- DESeqDataSetFromMatrix(countData = count_mat,
                              colData = sample_info,
                              design= ~ condition)

## converting counts to integer mode

DESeqDataSet information

dds

## class: DESeqDataSet 
## dim: 61267 6 
## metadata(1): version
## assays(1): counts
## rownames(61267): ENSG00000276861.1 ENSG00000273730.1 ...
##   ENSG00000284550.1 ENSG00000281780.1
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition

Questions:

How many genes, sample and conditions are in your dds object?

6. Filter for low expressed genes

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.

Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads per sample condition.

We use the collapseReplicates() function from DESeq2 to get the counts per condition and check if any of the conditions has more than 10 counts.

pass_min_counts <- counts(collapseReplicates(dds, dds$condition)) > 10

keep <- as.data.frame(pass_min_counts) |>
  mutate(keep = ifelse( MEKi | control, TRUE, FALSE)) |>
  pull(keep)

head(pass_min_counts, 5)

##                    control  MEKi
## ENSG00000276861.1    FALSE FALSE
## ENSG00000273730.1    FALSE  TRUE
## ENSG00000202198.1     TRUE  TRUE
## ENSG00000121410.12    TRUE  TRUE
## ENSG00000268895.6     TRUE  TRUE

head(keep, 5)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE

Now we have our list of genes that have at least 10 reads in any of the two conditions and we use it to filter the dds

dds <- dds[keep,]

Filtered DESeqDataSet information

dds

## class: DESeqDataSet 
## dim: 17666 6 
## metadata(1): version
## assays(1): counts
## rownames(17666): ENSG00000273730.1 ENSG00000202198.1 ...
##   ENSG00000074755.15 ENSG00000036549.13
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition

Questions:

Briefly explain how the above code works for obtainig the filtered DESeqDataSet works!
How many genes remain in the data set?

7. Save filtered raw counts as TSV and Excel file

Now we can save our filtered raw counts as files.

We may first add the gene symbols to the count table again by joining the ensg_to_symbol map object.

To do so, we get the count matrix from the dds object using the counts() function and transform it to a tibble and join it using the dplyr left_join() function. We use relocate() to move the gene_name column from the last positions to the positin after gene_id

filtered_counts <- counts(dds) |> 
  as_tibble(rownames = "gene_id") |>
  left_join(ensg_to_symbol) |>
  relocate(gene_name, .after = gene_id)

The filtered_counts map object looks as follow:

Now lets save the table as Excel file using the write_xlsx() function:

x = filtered_counts
path = filtered_counts.xlsx

and as TSV file using the write_tsv() function:

x = filtered_counts
file = filtered_counts.xlsx

write_xlsx(...)
write_tsv(...)

We should now have the following filtered count files:

## [1] "filtered_counts.tsv"  "filtered_counts.xlsx"

8. Normalize the raw counts and save as files

To manually normalize the raw count data we may utilize the estimateSizeFactors. To pull out the normalized counts we again use counts() but specify the option normalized=TRUE.

dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)

Repeating the steps from above (7. Save filtered raw counts) we can add the gene symbols and save the normalized counts as well:

normalized_counts.xlsx
normalized_counts.tsv

write your own code here
first left_join(ensg_to_symbol) 
then save xlsx and tsv

We should now have the following normalized count files:

## [1] "normalized_counts.tsv"  "normalized_counts.xlsx"

Questions:

Why do we normalize the counts?
Which normalization method is DESeq2 using?

9. Set reference condtion and run Differential expression analysis

We want to make sure that our reference (control) condition is control. To do so we use the relevel() function.

dds$condition = relevel(dds$condition, "control")

Finally run DESeq

dds <- DESeq(dds)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

10. Fetch the results

We can get the results table from the dds object using the results() function.

res <- results(dds)

The result table should look as follows:

## log2 fold change (MLE): condition MEKi vs control 
## Wald test p-value: condition MEKi vs control 
## DataFrame with 5 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat    pvalue
##                    <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000273730.1    3.01099      5.0302493  3.672561 1.3696844  0.170785
## ENSG00000202198.1   33.79762      0.0315941  0.368231 0.0857996  0.931626
## ENSG00000121410.12  10.79336      0.2597592  0.603469 0.4304430  0.666873
## ENSG00000268895.6   13.31326      0.0518552  0.540735 0.0958977  0.923602
## ENSG00000148584.16  12.70348      0.5465669  0.562510 0.9716566  0.331221
##                         padj
##                    <numeric>
## ENSG00000273730.1         NA
## ENSG00000202198.1   0.965237
## ENSG00000121410.12  0.810786
## ENSG00000268895.6   0.961562
## ENSG00000148584.16  0.540260

Lets get some summary information:

summary(res)

Questions:

How many genes, in numbers and percent, are up or down regulated?
How many outliers were found?
How many genes with low counts wer found?

We may want to save result to an Excel file now. But first lets add the gene names again. For this the result needs to be converted into a dataframe first.

res <- as.data.frame(res) |>
  rownames_to_column(var = "gene_id") |>
  as_tibble() |>
  left_join(ensg_to_symbol) |>
  relocate(gene_name, .after = gene_id)

The res table should look as follows:

Get all significant differentially expressed genes

Now get all differential expressed genes with adjusted p-values below FDR < 0.1 and sort by the smallest p-value. To do so we use the filter() function to filter padj be < 0.1 and the arrange() function to order the rows by padj

res_significant <- res |>
  filter(padj < 0.1) |>
  arrange(padj)

The res_significant table should look as follows:

Questions:

How many significant genes did we get?
How many of them are up or down regulated?

Save as Excel and TSV

write your own code here
check above for method to save xlsx and tsv
save full results and significant results

We should now have the following DESeq2 result files:

## [1] "DESeq_result_full.tsv"           "DESeq_result_significant_fc.tsv"
## [3] "DESeq_result_significant.tsv"

## [1] "DESeq_result_full.xlsx"           "DESeq_result_significant_fc.xlsx"
## [3] "DESeq_result_significant.xlsx"

11. Generate a PCA plot

To see if the MEKi treatment had a global effect on our cells, we will run a PCA analysis and plot the first two principal components in two dimensions and visually identify differences or similarities.

We will first perform a variance stabilizing transformations (VST) to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low.

For downstream analyses – e.g. for visualization or clustering (PCA) – it is useful to work with transformed versions of the count data.

We use the vst() function and set the option blind=FALSE to perform the transformation.

vsd <- vst(dds, blind=FALSE)

Now we can use the vsd (variance stabilized data) and the plotPCA() function to generate a PCA plot.

Note: we may use the function geom_text_repel(aes(label=name),vjust=2)) from the ggrepel package to add labels to the data points. This annotation can be added using the + sign after plotPCA() which is producing a ggplot() object. See also: https://ggplot2.tidyverse.org/reference/ggplot.html

write your own code here
use the plotPCA function and ggrepel to create a
PCA plot of the samples.

The resulting plot should look similar to this here:

## using ntop=500 top features by variance

Questions:

Interpret and discuss the PCA plot.