For this example we use bulk RNAseq data obtained from a colorectal cancer cell line Caco-2 which was treated with the MEK inhibitor. The treated cells should be compared with untreated control cells to determine the effects of the inhibitor on the transcriptional level.
The experiment was done with 3 replicates for each condition (treated, untreated).
Data files
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm.tsv
Questions:
We first need to install some R packages which help us to run our differential gene expression analysis with DESeq2.
tidyverse
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. It includes the following packages:
Other packages
install.packages("conflicted")
install.packages("tidyverse")
install.packages("writexl")
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("EnhancedVolcano")
BiocManager::install("ComplexHeatmap")
install.packages("ggrepel")
After installation of the packages we need to load them into our workspace.
library(conflicted)
library(tidyverse)
library(writexl)
library(DESeq2)
library(ggrepel)
library(EnhancedVolcano)
library(ComplexHeatmap)
We also need to set preferred libraries for conflicting functions:
conflicts_prefer(dplyr::filter)
## [conflicted] Will prefer dplyr::filter over any other package.
conflicts_prefer(dplyr::select)
## [conflicted] Will prefer dplyr::select over any other package.
We next need to prepare the meta data, which is the sample
information stored in a comma separated value file
samplesheet.csv
file. We can use a text editor to create
the samplesheet.csv
file. It consists of 2 columns
(separated by commas), the first is named sample
the second
condition
.
sample,condition
CaCo2_1,control
CaCo2_2,control
CaCo2_3,control
CaCo2_4,MEKi
CaCo2_5,MEKi
CaCo2_6,MEKi
As we can see, each sample is assigned a condition. We have 2
conditions control
and MEKi
and 3 replicates
per condition.
We can use the read_csv()
function from the
readr
package and assign the contents of the sample sheet
to a tibble
object named sample_info
sample_info <- read_csv("your samplesheet file name here")
Our sample_info
should now look as follows:
The raw count data for the experiment was obtained from the nf-core/rnaseq pipeline. The pipeline uses the raw sequencing reads in fastq format and aligns them to a reference transcriptome and quantifies the number of aligned reads for each single gene. The result file contains the raw read counts for all samples over all genes in a tabulator separated value (TSV) table with genes as rows and samples as columns.
salmon.merged.gene_counts.tsv
(see link above)We can use the read_tsv()
function from the
readr
package to read the file and assign it to
count_mat
count_mat <- read_tsv("your salmon.merged.gene_counts.tsv here")
Our count_mat
looks as follows:
To show the first 5 lines we can use the R head()
command and pipe the contents of count_mat
into it,
e.g. count_mat |> head(5)
The first column gene_id
contains the Ensembl
Gene IDs which are unique. The second column
gene_names
contains the HGNC Gene Symbols.
The remaining columns contain the raw counts in each sample for each
gene (Note: the values are not integers).
We now need to remove the gene_name
column from the
count_mat
and convert the gene_id
column to
rownames. As the count values are not integers we also need to convert
them using the round()
function.
But first we will save the mapping of gene_id
to
gene_name
into a new tibble object named
ensg_to_symbol
for later use.
select()
function to extract the
two columns (gene_id
, gene_name
). We also want
to remove pseudo autosomal regions, since we sum those up below and
unify the homologous genes.ensg_to_symbol <- count_mat |>
select(gene_id, gene_name) |>
filter(!str_detect(gene_id, "PAR_Y"))
The ensg_to_symbol
map object looks as follow:
We will now sum up the counts for genes that are present multiple
times but have different Ensembl gene IDs. This is done by first
grouping the rows by gene_name
and then summing the grouped
gene counts up for each column that starts with CaCo2
.
count_mat_sum <- count_mat |>
group_by(gene_name) |>
summarize(across(starts_with("CaCo"), sum))
head(count_mat_sum, 5)
Now we need to re-add the gene_id
column which we lost
in the previous step. To add it using the dplyr left_join()
function together with the ens_to_symbol
object where we
have the mapping information. For multiple matches, we have a
one-to-many mapping, we just keep the first occurring match. We use
relocate()
to move the gene_id
column from the
last positions to the position before gene_name
.
count_mat <- count_mat_sum |>
left_join(ensg_to_symbol, multiple="first") |>
relocate(gene_id, .before = gene_name)
## Joining with `by = join_by(gene_name)`
head(count_mat, 5)
select()
to remove the
gene_name
column and pass the output to
column_to_rownames()
and then further on to
round()
count_mat <- count_mat |> select(-gene_name) |> column_to_rownames("gene_id") |> round()
Questions:
count_mat
contain?lef_join()
function
performing?From our sample_info
and our count_mat
we
can now create a DESeqDataSet
using the DESeq2 function
DESeqDataSetFromMatrix()
and assign it to
dds
.
We need to pass the following information to
DESeqDataSetFromMatrix()
:
countData
, which is our count matrix from abovecolData
, which is out sample informationdesign
, which indicates how to model the samples,
i.e. here we want to measure the effect of the condition (treated
vs. untreated)Note: The factor variable condition
should
be columns of colData
Examples for models:
design = ~ condition
batch
as covariate to measure the effect of
the condition and controlling for batch differences. In such a case the
samplesheet.csv
would need an additional column named
batch
, which asigns a batch name to each sample.design = ~ batch + condition
dds <- DESeqDataSetFromMatrix(countData = count_mat,
colData = sample_info,
design= ~ condition)
## converting counts to integer mode
DESeqDataSet information
dds
## class: DESeqDataSet
## dim: 61267 6
## metadata(1): version
## assays(1): counts
## rownames(61267): ENSG00000276861.1 ENSG00000273730.1 ...
## ENSG00000284550.1 ENSG00000281780.1
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition
Questions:
dds
object?While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.
Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads per sample condition.
We use the collapseReplicates()
function from DESeq2 to
get the counts per condition and check if any of the conditions has more
than 10 counts.
pass_min_counts <- counts(collapseReplicates(dds, dds$condition)) > 10
keep <- as.data.frame(pass_min_counts) |>
mutate(keep = ifelse( MEKi | control, TRUE, FALSE)) |>
pull(keep)
head(pass_min_counts, 5)
## control MEKi
## ENSG00000276861.1 FALSE FALSE
## ENSG00000273730.1 FALSE TRUE
## ENSG00000202198.1 TRUE TRUE
## ENSG00000121410.12 TRUE TRUE
## ENSG00000268895.6 TRUE TRUE
head(keep, 5)
## [1] FALSE TRUE TRUE TRUE TRUE
Now we have our list of genes that have at least 10 reads in any of
the two conditions and we use it to filter the dds
dds <- dds[keep,]
Filtered DESeqDataSet information
dds
## class: DESeqDataSet
## dim: 17666 6
## metadata(1): version
## assays(1): counts
## rownames(17666): ENSG00000273730.1 ENSG00000202198.1 ...
## ENSG00000074755.15 ENSG00000036549.13
## rowData names(0):
## colnames(6): CaCo2_1 CaCo2_2 ... CaCo2_5 CaCo2_6
## colData names(2): sample condition
Questions:
Now we can save our filtered raw counts as files.
We may first add the gene symbols to the count table again by joining
the ensg_to_symbol
map object.
To do so, we get the count matrix from the dds
object
using the counts()
function and transform it to a tibble
and join it using the dplyr left_join()
function. We use
relocate()
to move the gene_name
column from
the last positions to the positin after gene_id
filtered_counts <- counts(dds) |>
as_tibble(rownames = "gene_id") |>
left_join(ensg_to_symbol) |>
relocate(gene_name, .after = gene_id)
The filtered_counts
map object looks as follow:
Now lets save the table as Excel file using the
write_xlsx()
function:
filtered_counts
filtered_counts.xlsx
and as TSV file using the write_tsv()
function:
filtered_counts
filtered_counts.xlsx
write_xlsx(...)
write_tsv(...)
We should now have the following filtered count files:
## [1] "filtered_counts.tsv" "filtered_counts.xlsx"
To manually normalize the raw count data we may utilize the
estimateSizeFactors
. To pull out the normalized counts we
again use counts()
but specify the option
normalized=TRUE
.
dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
Repeating the steps from above (7. Save filtered raw counts) we can add the gene symbols and save the normalized counts as well:
write your own code here
first left_join(ensg_to_symbol)
then save xlsx and tsv
We should now have the following normalized count files:
## [1] "normalized_counts.tsv" "normalized_counts.xlsx"
Questions:
We want to make sure that our reference (control) condition is
control
. To do so we use the relevel()
function.
dds$condition = relevel(dds$condition, "control")
Finally run DESeq
dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
We can get the results table from the dds
object using
the results()
function.
res <- results(dds)
The result table should look as follows:
## log2 fold change (MLE): condition MEKi vs control
## Wald test p-value: condition MEKi vs control
## DataFrame with 5 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000273730.1 3.01099 5.0302493 3.672561 1.3696844 0.170785
## ENSG00000202198.1 33.79762 0.0315941 0.368231 0.0857996 0.931626
## ENSG00000121410.12 10.79336 0.2597592 0.603469 0.4304430 0.666873
## ENSG00000268895.6 13.31326 0.0518552 0.540735 0.0958977 0.923602
## ENSG00000148584.16 12.70348 0.5465669 0.562510 0.9716566 0.331221
## padj
## <numeric>
## ENSG00000273730.1 NA
## ENSG00000202198.1 0.965237
## ENSG00000121410.12 0.810786
## ENSG00000268895.6 0.961562
## ENSG00000148584.16 0.540260
Lets get some summary information:
summary(res)
Questions:
We may want to save result to an Excel file now. But first lets add
the gene names again. For this the result needs to be converted into a
dataframe
first.
res <- as.data.frame(res) |>
rownames_to_column(var = "gene_id") |>
as_tibble() |>
left_join(ensg_to_symbol) |>
relocate(gene_name, .after = gene_id)
The res
table should look as follows:
Get all significant differentially expressed genes
Now get all differential expressed genes with adjusted p-values below
FDR < 0.1 and sort by the smallest p-value. To do so we use the
filter()
function to filter padj
be < 0.1
and the arrange()
function to order the rows by
padj
res_significant <- res |>
filter(padj < 0.1) |>
arrange(padj)
The res_significant
table should look as follows:
Questions:
Save as Excel and TSV
write your own code here
check above for method to save xlsx and tsv
save full results and significant results
We should now have the following DESeq2 result files:
## [1] "DESeq_result_full.tsv" "DESeq_result_significant_fc.tsv"
## [3] "DESeq_result_significant.tsv"
## [1] "DESeq_result_full.xlsx" "DESeq_result_significant_fc.xlsx"
## [3] "DESeq_result_significant.xlsx"
To see if the MEKi treatment had a global effect on our cells, we will run a PCA analysis and plot the first two principal components in two dimensions and visually identify differences or similarities.
We will first perform a variance stabilizing transformations (VST) to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low.
For downstream analyses – e.g. for visualization or clustering (PCA) – it is useful to work with transformed versions of the count data.
We use the vst()
function and set the option
blind=FALSE
to perform the transformation.
vsd <- vst(dds, blind=FALSE)
Now we can use the vsd
(variance stabilized data) and
the plotPCA()
function to generate a PCA plot.
Note: we may use the function
geom_text_repel(aes(label=name),vjust=2))
from the
ggrepel
package to add labels to the data points. This
annotation can be added using the +
sign after
plotPCA()
which is producing a ggplot()
object. See also: https://ggplot2.tidyverse.org/reference/ggplot.html
write your own code here
use the plotPCA function and ggrepel to create a
PCA plot of the samples.
The resulting plot should look similar to this here:
## using ntop=500 top features by variance
Questions: