For this example we use bulk RNAseq data obtained from a colorectal cancer cell line Caco-2 which was treated with the MEK inhibitor. The treated cells should be compared with untreated control cells to determine the effects of the inhibitor on the transcriptional level.
The experiment was done with 3 replicates for each condition (treated, untreated).
A volcano plot is a type of scatterplot that shows statistical significance (p value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.
We will use the EnhancedVolcano()
function to generate a
volcano plot to visually summarize differences in the gene
expression.
write your own code here
use the EnhancedVolcano() function to plot the data in DESeq res
on the x axis you want to plot the log2FoldChange
on the y axis you want to plot the padj
set the pCutoff to 0.1
set the FCcutoff to 1
limit (xlim, ylim) the axis extend to -3,3 and 0, -log10(10e-06)
The resulting plot should look similar to this here:
Questions:
We now want to save genes that are differential expressed in the MEKi treated cells. We define the following cutoffs:
We can use the dplyr filter()
function to generate the
filtered results.
write your own code here
res_sig_fc <- res |> filter .....
Note: you can use multiple filter criteria in the filter() function by
separating them with (,)
The res_sig_fc
table should look as follows:
We will also save this result as Excel and TSV file:
write your own code here
it should save res_sig_fc as DESeq_result_significant_fc.xlsx and DESeq_result_significant_fc.tsv
We should now have the following DESeq2 result files:
## [1] "DESeq_result_significant_fc.tsv"
## [1] "DESeq_result_significant_fc.xlsx"
Questions:
Now that we have a list de genes we can use the TPM values of the TOP 30 up and down regulated genes and plot them as heatmap.
To do so we will load the TPM data obtained from the nf-core/rnaseq pipeline.
salmon.merged.gene_tpm.tsv
(see link above)We can use the read_tsv()
function from the
readr
package to read the file and assign it to
tpm_mat
Our tpm_mat
looks as follows:
As with the raw counts (see above) we will need to sum up the TPMs for genes that are present multiple times but have different Ensembl gene IDs.
tpm_mat_sum <- tpm_mat |>
group_by(gene_name) |>
summarize(across(starts_with("CaCo"), sum))
head(tpm_mat_sum, 5)
Now we need to re-add the gene_id
column which we lost
in the previous step. To add it using the dplyr left_join()
function together with the ens_to_symbol
object where we
have the mapping information. For multiple matches, we have a
one-to-many mapping, we just keep the first occouring match. We use
relocate()
to move the gene_id
column from the
last positions to the position before gene_name
.
tpm_mat <- tpm_mat_sum |>
left_join(ensg_to_symbol, multiple="first") |>
relocate(gene_id, .before = gene_name)
## Joining with `by = join_by(gene_name)`
Our tpm_mat
looks as follows:
We will now filter the TPM data to extract only the TOP 30 up/down
regulated de genes (FC > 2, padj < 0.1). We will use the dplyr
filter()
and our de gene list from above (12.) from which
we use the gene_id
column as filter.
We first select the top 15 genes based on the “log2FoldChange” column
using top_n(15, log2FoldChange)
. Then, we use
bind_rows()
to combine the top 15 genes based on the
highest “log2FoldChange” values with the bottom 15 genes based on the
lowest “log2FoldChange” top_n(-15, log2FoldChange)
top_bottom_genes <- res_sig_fc |>
top_n(15, log2FoldChange) |>
bind_rows(
res_sig_fc |>
top_n(-15, log2FoldChange)
)
tpm_mat
de_tpm <- tpm_mat |>
filter(gene_id %in% top_bottom_genes$gene_id)
dim(de_tpm)
## [1] 30 8
Our de_tpm
looks as follows:
For making the heatmap we now convert the de_tpm
to a
matrix
# gene_names <- de_tpm$gene_name
de_tpm_mat <- de_tpm |>
select(-gene_id) |>
column_to_rownames("gene_name") |>
as.matrix()
To better visualize the expression differences between we the scale (z-score) the TPM rowwise over the samples.
de_tpm_scaled <- t(scale(t(de_tpm_mat), center = TRUE, scale = TRUE))
Now we use Heatmap()
from ComplesHeatmap to produce the
heatmap.
https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html
The result should be similar to the one shown here
Questions:
Let’s make a volcano plot that highlights significant 2 fold regulated MAPK pathway genes
Download the corresponding gene set from the MsigDB
in TSV format and use the GENE_SYMBOLS
to highlight them in
the volcano plot.
The table looks like below. It has no column names, therefore we need
to use the col_names = FALSE
option in `read_tsv()``
So can use read_tsv()
function to read the geneset data
and filter()
get the row with the GENE_SYMBOLS
in which genes are stored as a comma-separated list of gene symbols. To
create a character vector of gene elements we use
str_split_1(",")
to split the string at the commas.
mapk_genes <- read_tsv("./KEGG_MAPK_SIGNALING_PATHWAY.v2023.2.Hs.tsv", col_names = FALSE) |>
filter(X1 == "GENE_SYMBOLS") |>
pull(X2) |>
str_split_1(",")
Now let’s get the significant MAPK signaling pathway genes:
mapk_genes_sig <- res_sig_fc |>
filter(gene_name %in% mapk_genes) |>
pull(gene_name)
Questions:
The plot should look similar to the following: