Bulk RNAseq data was obtained from a colorectal cancer cell line Caco-2 which was treated with the MEK inhibitor. The treated cells should be compared with untreated control cells to determine the effects of the inhibitor on the transcriptional level.
The experiment was done with 3 replicates for each condition (treated, untreated).
We first need to install some R packages which help us to run our differential gene expression analysis with DESeq2.
tidyverse
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. It includes the following packages:
Other packages
install.packages("conflicted")
install.packages("tidyverse")
install.packages("writexl")
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("EnhancedVolcano")
BiocManager::install("ComplexHeatmap")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("ReactomePA")
install.packages("ggrepel")
After installation of the packages we need to load them into our workspace.
library(conflicted)
library(tidyverse)
library(writexl)
library(DESeq2)
library(ggrepel)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(clusterProfiler)
library(org.Hs.eg.db)
library(ReactomePA)
We also need to set preferred libraries for conflicting functions:
conflicts_prefer(dplyr::filter)
## [conflicted] Will prefer dplyr::filter over any other package.
conflicts_prefer(dplyr::select)
## [conflicted] Will prefer dplyr::select over any other package.
We next need to prepare the meta data, which is the sample
information which is stored in our samplesheet.csv
file.
We can use the read_csv() function from the
readr package and assign the contents of the sample sheet
to a tibble object named sample_info
sample_info <- read_csv("your samplesheet file name here")
sample_info <- read_csv("sampleSheet_deseq2.csv")
Now we can remove the fastq_1, fastq_2 and
strandedness columns since we do not need this information.
To do so we can use the dplyr::select() function.
sample_info <- sample_info |> select(-fastq_1, -fastq_2, -strandedness)
Our sample_info should now look as follows:
From the nf-core/rnaseq pipeline run we obtained a result file which contains the raw read counts of all samples over all genes.
salmon.merged.gene_counts.tsvWe can use the read_tsv() function from the
readr package to read the file and assign it to
count_mat
count_mat <- read_tsv("your salmon.merged.gene_counts.tsv here")
count_mat <- read_tsv("salmon.merged.gene_counts.tsv")
Our count_mat looks as follows:
The first column gene_id contains the Ensembl
Gene IDs which are unique. The second column
gene_names contains the HGNC Gene Symbols.
The remaining columns contain the raw counts in each sample for each
gene (-Note: the values are not integers-).
We now need to remove the gene_name column from the
count_mat and convert the gene_id column to
rownames. As the count values are not integers we also need to convert
them using the round() function.
But first we will save the mapping of gene_id to
gene_name into a new tibble object named
ensg_to_symbol for later use.
select() function to extract the
two columns (gene_id, gene_name). We also want
to remove pseudo autosomal regions, since we sum those up below and
unify the homologous genes.ensg_to_symbol <- count_mat |>
select(gene_id, gene_name) |>
filter(!str_detect(gene_id, "PAR_Y"))
The ensg_to_symbol map object looks as follow:
We will now sum up the counts for genes that are present multiple times but have different Ensembl gene IDs.
count_mat_sum <- count_mat |>
group_by(gene_name) |>
summarize(across(starts_with("CaCo"), sum))
head(count_mat_sum, 5)
Now we need to re-add the gene_id column which we lost
in the previous step. To add it using the dplyr left_join()
function together with the ens_to_symbol object where we
have the mapping information. For multiple matches, we have a
one-to-many mapping, we just keep the first occurring match. We use
relocate() to move the gene_id column from the
last positions to the position before gene_name.
count_mat <- count_mat_sum |>
left_join(ensg_to_symbol, multiple="first") |>
relocate(gene_id, .before = gene_name)
## Joining with `by = join_by(gene_name)`
head(count_mat, 5)
select() to remove the
gene_name column and pass the output to
column_to_rownames() and then further on to
round()count_mat <- count_mat |> select(-gene_name) |> column_to_rownames("gene_id") |> round()
From our sample_info and our count_mat we
can now create a DESeqDataSet using the DESeq2 function
DESeqDataSetFromMatrix() and assign it to
dds.
We need to pass the following information to
DESeqDataSetFromMatrix():
countData, which is our count matrix from abovecolData, which is out sample informationdesign, which indicates how to model the samples,
i.e. here we want to measure the effect of the condition (treated
vs. untreated)Note: The factor variable condition should
be columns of colData
Examples for models:
design = ~ condition
batch as covariate to measure the effect of
the condition and controlling for batch differencesdesign = ~ batch + condition
dds <- DESeqDataSetFromMatrix(countData = count_mat,
colData = sample_info,
design= ~ condition)
## converting counts to integer mode
DESeqDataSet information
dds
## class: DESeqDataSet
## dim: 61255 6
## metadata(1): version
## assays(1): counts
## rownames(61255): ENSG00000276861.1 ENSG00000273730.1 ...
## ENSG00000284550.1 ENSG00000281780.1
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.
Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads per sample condition.
We use the collapseReplicates() function from DESeq2 to
get the counts per condition and check if any of the conditions has more
than 10 counts.
pass_min_counts <- counts(collapseReplicates(dds, dds$condition)) > 10
keep <- as.data.frame(pass_min_counts) |>
mutate(keep = ifelse( MEKi | control, TRUE, FALSE)) |>
pull(keep)
head(pass_min_counts, 5)
## control MEKi
## ENSG00000276861.1 FALSE FALSE
## ENSG00000273730.1 FALSE TRUE
## ENSG00000202198.1 TRUE TRUE
## ENSG00000121410.12 TRUE TRUE
## ENSG00000268895.6 TRUE TRUE
head(keep, 5)
## [1] FALSE TRUE TRUE TRUE TRUE
Now we have our list of genes that have at least 10 reads in any of
the two conditions and we use it to filter the dds
dds <- dds[keep,]
Filtered DESeqDataSet information
dds
## class: DESeqDataSet
## dim: 16982 6
## metadata(1): version
## assays(1): counts
## rownames(16982): ENSG00000273730.1 ENSG00000202198.1 ...
## ENSG00000074755.15 ENSG00000036549.13
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition
Now we can save our filtered raw counts as files.
We may first the gene symbols to the count table again by joining the
ensg_to_symbol map object.
To do so, we get the count matrix from the dds object
using the counts() function and transform it to a tibble
and join it using the dplyr left_join() function. We use
relocate() to move the gene_name column from
the last positions to the position after gene_id
filtered_counts <- counts(dds) |>
as_tibble(rownames = "gene_id") |>
left_join(ensg_to_symbol) |>
relocate(gene_name, .after = gene_id)
The filtered_counts map object looks as follow:
Now lets save the table as Excel file using the
write_xlsx() function:
filtered_countsfiltered_counts.xlsxand as TSV file using the write_tsv() function:
filtered_countsfiltered_counts.xlsxwrite_xlsx(...)
write_tsv(...)
write_xlsx(filtered_counts, path = "filtered_counts.xlsx")
write_tsv(filtered_counts, file = "filtered_counts.tsv")
We should now have the following filtered count files:
## [1] "filtered_counts.tsv" "filtered_counts.xlsx"
To manually normalize the raw count data we may utilize the
estimateSizeFactors. To pull out the normalized counts we
again use counts() but specify the option
normalized=TRUE.
dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
Repeating the steps from above (7. Save filtered raw counts) we can add the gene symbols and save the normalized counts as well:
write your own code here
first left_join(ensg_to_symbol)
then save xlsx and tsv
normalized_counts <- normalized_counts |>
as_tibble(rownames = "gene_id") |>
left_join(ensg_to_symbol) |>
relocate(gene_name, .after = gene_id)
write_xlsx(normalized_counts, path = "normalized_counts.xlsx")
write_tsv(normalized_counts, file = "normalized_counts.tsv")
We should now have the following normalized count files:
## [1] "normalized_counts.tsv" "normalized_counts.xlsx"
We want to make sure that our reference (control) condition is
control. To do so we use the relevel()
function.
dds$condition = relevel(dds$condition, "control")
Finally run DESeq
dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
We can get the results table from the dds object using
the results() function.
res <- results(dds)
The result table should look as follows:
## log2 fold change (MLE): condition MEKi vs control
## Wald test p-value: condition MEKi vs control
## DataFrame with 5 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000273730.1 1.91806 4.3844998 3.929677 1.1157406 0.264533
## ENSG00000202198.1 32.91620 0.0186264 0.368995 0.0504787 0.959741
## ENSG00000121410.12 10.59886 0.3110182 0.596268 0.5216079 0.601943
## ENSG00000268895.6 13.31781 0.0497901 0.528262 0.0942527 0.924908
## ENSG00000148584.16 12.02246 0.5314377 0.570464 0.9315890 0.351549
## padj
## <numeric>
## ENSG00000273730.1 NA
## ENSG00000202198.1 0.979662
## ENSG00000121410.12 0.765222
## ENSG00000268895.6 0.961928
## ENSG00000148584.16 0.555553
Lets get some summary information:
summary(res)
We may want to save result to an Excel file now. But first lets add
the gene names again. For this the result needs to be converted into a
dataframe first.
res <- as.data.frame(res) |>
rownames_to_column(var = "gene_id") |>
as_tibble() |>
left_join(ensg_to_symbol) |>
relocate(gene_name, .after = gene_id)
The res table should look as follows:
Now get all differential expressed genes with adjusted p-values below
FDR < 0.1 and sort by the smallest p-value. To do so we use the
filter() function to filter padj be < 0.1
and the arrange() function to order the rows by
padj
res_significant <- res |>
filter(padj < 0.1) |>
arrange(padj)
The res_significant table should look as follows:
Save as Excel and TSV
write your own code here
check above for method to save xlsx and tsv
save full results and significant results
write_xlsx(res, path = "DESeq_result_full.xlsx")
write_xlsx(res_significant, path = "DESeq_result_significant.xlsx")
write_tsv(res, file = "DESeq_result_full.tsv")
write_tsv(res_significant, file = "DESeq_result_significant.tsv")
We should now have the following DESeq2 result files:
## [1] "DESeq_result_full.tsv" "DESeq_result_significant_fc.tsv"
## [3] "DESeq_result_significant.tsv"
## [1] "DESeq_result_full.xlsx" "DESeq_result_significant_fc.xlsx"
## [3] "DESeq_result_significant.xlsx"
To see if the MEKi treatment had a global effect on our cells, we will run a PCA analysis and plot the first two principal components in two dimensions and visually identify differences or similarities.
We will first perform a variance stabilizing transformations (VST) to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low.
For downstream analyses – e.g. for visualization or clustering (PCA) – it is useful to work with transformed versions of the count data.
We use the vst() function and set the option
blind=FALSE to perform the transformation.
vsd <- vst(dds, blind=FALSE)
Now we can use the vsd (variance stabilized data) and
the plotPCA() function to generate a PCA plot.
Note: we may use the function
geom_text_repel(aes(label=name),vjust=2)) from the
ggrepel package to add labels to the data points. This
annotation can be added using the + sign after
plotPCA() which is producing a ggplot()
object. See also: https://ggplot2.tidyverse.org/reference/ggplot.html
write your own code here
use the plotPCA function and ggrepel to create a
PCA plot of the samples.
The resulting plot should look similar to this here:
print(plotPCA(vsd) + geom_text_repel(aes(label=name),vjust=3,hjust=1.5))
A volcano plot is a type of scatterplot that shows statistical significance (p value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.
We will use the EnhancedVolcano() function to generate a
volcano plot to visually summarize differences in the gene
expression.
write your own code here
use the EnhancedVolcano() function to plot the data in DESeq res
on the x axis you want to plot the log2FoldChange
on the y axis you want to plot the padj
set the pCutoff to 0.1
set the FCcutoff to 1
The resulting plot should look similar to this here:
EnhancedVolcano(res,
lab = res$gene_name,
x = "log2FoldChange",
y = "padj",
pCutoff = 0.1,
FCcutoff = 1,
gridlines.major = FALSE,
title = 'DESeq2 results',
subtitle = "",
legendLabels = c("NS", expression(Log[2] ~ FC), "p-adj", expression("p-adj" ~ and ~ log[2] ~ FC)),
ylab = bquote(~-Log[10] ~ italic("p-adj"))
)
As we can see there are some genes with extreme p-values on the y-axis and Log2foldChanges on the x-axis. This makes it difficult to display genes that are not as stongly regulated.
To better show those genes we may limit (xlim, ylim) the axes extend to -5,5 and 0, -log10(10e-10)
EnhancedVolcano(res,
lab = res$gene_name,
x = "log2FoldChange",
y = "padj",
pCutoff = 0.1,
FCcutoff = 1,
gridlines.major = FALSE,
xlim = c(-5, 5),
ylim = c(0, -log10(10e-10)),
title = 'DESeq2 results',
subtitle = "",
legendLabels = c("NS", expression(Log[2] ~ FC), "p-adj", expression("p-adj" ~ and ~ log[2] ~ FC)),
ylab = bquote(~-Log[10] ~ italic("p-adj"))
)
We now want to save genes that are differential expressed in the MEKi treated cells. We define the following cutoffs:
We can use the dplyr filter() function to generate the
filtered results.
write your own code here
res_sig_fc <- res |> filter .....
Note: you can use multiple filter criteria in the filter() function by
separating them with (,)
res_sig_fc <- res |>
filter(padj < 0.1, abs(log2FoldChange) > 1) |>
arrange(padj)
The res_sig_fc table should look as follows:
Question:
We will also save this result as Excel and TSV file:
write your own code here
it should save res_sig_fc as DESeq_result_significant_fc.xlsx and DESeq_result_significant_fc.tsv
write_xlsx(res_sig_fc, path = "DESeq_result_significant_fc.xlsx")
write_tsv(res_sig_fc, file = "DESeq_result_significant_fc.tsv")
We should now have the following DESeq2 result files:
## [1] "DESeq_result_significant_fc.tsv"
## [1] "DESeq_result_significant_fc.xlsx"
Now that we have a list de genes we can use the TPM values of the TOP 30 up and down regulated genes and plot them as heatmap.
To do so we will load the TPM data obtained from the nf-core/rnaseq pipeline.
salmon.merged.gene_tpm.tsvWe can use the read_tsv() function from the
readr package to read the file and assign it to
tpm_mat
tpm_mat <- read_tsv("your salmon.merged.gene_tpm.tsv here")
tpm_mat <- read_tsv("salmon.merged.gene_tpm.tsv")
Our tpm_mat looks as follows:
As with the raw counts (se above) we will need to sum up the TPMs for genes that are present multiple times but have different Ensembl gene IDs.
tpm_mat_sum <- tpm_mat |>
group_by(gene_name) |>
summarize(across(starts_with("CaCo"), sum))
head(tpm_mat_sum, 5)
Now we need to re-add the gene_id column which we lost
in the previous step. To add it using the dplyr left_join()
function together with the ens_to_symbol object where we
have the mapping information. For multiple matches, we have a
one-to-many mapping, we just keep the first occurring match. We use
relocate() to move the gene_id column from the
last positions to the position before gene_name.
tpm_mat <- tpm_mat_sum |>
left_join(ensg_to_symbol, multiple="first") |>
relocate(gene_id, .before = gene_name)
## Joining with `by = join_by(gene_name)`
Our tpm_mat looks as follows:
We will now filter the TPM data to extract only the TOP 30 up/down
regulated de genes (FC > 2, padj < 0.1). We will use the dplyr
filter() and our de gene list from above (12.) from which
we use the gene_id column as filter.
We first select the top 15 genes based on the “log2FoldChange” column
using top_n(15, log2FoldChange). Then, we use
bind_rows() to combine the top 15 genes based on the
highest “log2FoldChange” values with the bottom 15 genes based on the
lowest “log2FoldChange” top_n(-15, log2FoldChange)
top_bottom_genes <- res_sig_fc |>
top_n(15, log2FoldChange) |>
bind_rows(
res_sig_fc |>
top_n(-15, log2FoldChange)
)
tpm_matde_tpm <- tpm_mat |>
filter(gene_id %in% top_bottom_genes$gene_id)
dim(de_tpm)
## [1] 30 8
Our de_tpm looks as follows:
For making the heatmap we now convert the de_tpm to a
matrix
# gene_names <- de_tpm$gene_name
de_tpm_mat <- de_tpm |>
select(-gene_id) |>
column_to_rownames("gene_name") |>
as.matrix()
To better visualize the expression differences between we the scale (z-score) the TPM rowwise over the samples.
de_tpm_scaled <- t(scale(t(de_tpm_mat), center = TRUE, scale = TRUE))
We now sort the de_tpm_scaled matrix according to our
top/bottom gene sort order
de_tpm_scaled <- de_tpm_scaled[top_bottom_genes$gene_name,]
Now we use Heatmap() from ComplexHeatmap to produce the
heatmap.
https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html
The result should be similar to the one shown here
Heatmap(de_tpm_scaled,
rect_gp = gpar(col = "white", lwd = 1),
column_split = sample_info$condition,
row_split = c(rep("top", 15), rep("bottom", 15)),
heatmap_legend_param = list(title = "z-score")
)