WM2 - Bioinformatics

DGE analysis with DESeq2

For this example we use bulk RNAseq data obtained from a colorectal cancer cell line Caco-2 which was treated with the MEK inhibitor. The treated cells should be compared with untreated control cells to determine the effects of the inhibitor on the transcriptional level.

The experiment was done with 3 replicates for each condition (treated, untreated).

12. Volcano plot

A volcano plot is a type of scatterplot that shows statistical significance (p value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.

We will use the EnhancedVolcano() function to generate a volcano plot to visually summarize differences in the gene expression.

write your own code here
use the EnhancedVolcano() function to plot the data in DESeq res 

on the x axis you want to plot the log2FoldChange
on the y axis you want to plot the padj
set the pCutoff to 0.1
set the FCcutoff to 1
limit (xlim, ylim) the axis extend to -3,3 and 0, -log10(10e-06)

The resulting plot should look similar to this here:

Questions:

Describe and discuss the volcano plot.

13. List of significant de genes with FC > 2

We now want to save genes that are differential expressed in the MEKi treated cells. We define the following cutoffs:

Foldchange > 2 : abs(log2FoldChange) > 1
p-adjusted < 0.1: padj < 0.1

We can use the dplyr filter() function to generate the filtered results.

write your own code here

res_sig_fc <- res |> filter .....

Note: you can use multiple filter criteria in the filter() function by
separating them with (,)

The res_sig_fc table should look as follows:

We will also save this result as Excel and TSV file:

write your own code here

it should save res_sig_fc as DESeq_result_significant_fc.xlsx and DESeq_result_significant_fc.tsv

We should now have the following DESeq2 result files:

## [1] "DESeq_result_significant_fc.tsv"

## [1] "DESeq_result_significant_fc.xlsx"

Questions:

How many genes are > 2 fold regulated and significant (FDR < 0.1)?

14. Heatmap of significant de genes with FC > 2

Now that we have a list de genes we can use the TPM values of the TOP 30 up and down regulated genes and plot them as heatmap.

To do so we will load the TPM data obtained from the nf-core/rnaseq pipeline.

salmon.merged.gene_tpm.tsv (see link above)

We can use the read_tsv() function from the readr package to read the file and assign it to tpm_mat

Our tpm_mat looks as follows:

As with the raw counts (see above) we will need to sum up the TPMs for genes that are present multiple times but have different Ensembl gene IDs.

tpm_mat_sum <- tpm_mat |>
  group_by(gene_name) |>
  summarize(across(starts_with("CaCo"), sum))
head(tpm_mat_sum, 5)

Now we need to re-add the gene_id column which we lost in the previous step. To add it using the dplyr left_join() function together with the ens_to_symbol object where we have the mapping information. For multiple matches, we have a one-to-many mapping, we just keep the first occouring match. We use relocate() to move the gene_id column from the last positions to the position before gene_name.

tpm_mat <- tpm_mat_sum |>
  left_join(ensg_to_symbol, multiple="first") |>
  relocate(gene_id, .before = gene_name)

## Joining with `by = join_by(gene_name)`

Our tpm_mat looks as follows:

We will now filter the TPM data to extract only the TOP 30 up/down regulated de genes (FC > 2, padj < 0.1). We will use the dplyr filter() and our de gene list from above (12.) from which we use the gene_id column as filter.

get the top 30 up/down regulated genes

We first select the top 15 genes based on the “log2FoldChange” column using top_n(15, log2FoldChange). Then, we use bind_rows() to combine the top 15 genes based on the highest “log2FoldChange” values with the bottom 15 genes based on the lowest “log2FoldChange” top_n(-15, log2FoldChange)

top_bottom_genes <- res_sig_fc |>
  top_n(15, log2FoldChange) |>
  bind_rows(
    res_sig_fc |>
      top_n(-15, log2FoldChange)
  )

Finally filter the tpm_mat

de_tpm <- tpm_mat |>
  filter(gene_id %in% top_bottom_genes$gene_id)

dim(de_tpm)

## [1] 30  8

Our de_tpm looks as follows:

For making the heatmap we now convert the de_tpm to a matrix

# gene_names <- de_tpm$gene_name
de_tpm_mat <- de_tpm |>
  select(-gene_id) |>
  column_to_rownames("gene_name") |>
  as.matrix()

To better visualize the expression differences between we the scale (z-score) the TPM rowwise over the samples.

de_tpm_scaled <- t(scale(t(de_tpm_mat), center = TRUE, scale = TRUE))

Now we use Heatmap() from ComplesHeatmap to produce the heatmap.

https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html

The result should be similar to the one shown here

Questions:

What type of clustering is used?
Discuss and interpret the heatmap.

15. Volcano plot that highlights the MAPK pathway genes

Let’s make a volcano plot that highlights significant 2 fold regulated MAPK pathway genes

Download the corresponding gene set from the MsigDB in TSV format and use the GENE_SYMBOLS to highlight them in the volcano plot.

The table looks like below. It has no column names, therefore we need to use the col_names = FALSE option in `read_tsv()``

So can use read_tsv() function to read the geneset data and filter() get the row with the GENE_SYMBOLS in which genes are stored as a comma-separated list of gene symbols. To create a character vector of gene elements we use str_split_1(",") to split the string at the commas.

mapk_genes <- read_tsv("./KEGG_MAPK_SIGNALING_PATHWAY.v2023.2.Hs.tsv", col_names = FALSE) |>
  filter(X1 == "GENE_SYMBOLS") |>
  pull(X2) |> 
  str_split_1(",")

Now let’s get the significant MAPK signaling pathway genes:

mapk_genes_sig <- res_sig_fc |>
  filter(gene_name %in% mapk_genes) |>
  pull(gene_name)

Questions:

Explain the code you use to get the significant MAPK signaling pathway genes.
Which genes of the pathway are up or down requlated?

The plot should look similar to the following: