Gene expression analysis with clustering algorithms

In this exercise (Task 1), we will explore how to retrieve gene expression data from the Gene Expression Omnibus (GEO). We will extract key information about the dataset and reformat the expression values to make them compatible with Genesis.

This task will be carried out using the R programming language, following the workflow outlined below.

1. Install required R packages

We first need to install some R packages which help us to run our gene expression analysis.

tidyverse

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. It includes the following packages:

  • ggplot2, for data visualisation.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.
  • stringr, for strings.
  • forcats, for factors.
  • lubridate, for date/times.

Other packages

  • conflicted: alternative conflict resolution
  • writexl: Excel file writing
  • BiocManager: a package manager to install and manage R Bioconductor packages
  • GEOquery: provides functions to retrieve GEO data
  • DT: provides interactive tables
install.packages("your package name here")
Toggle solution
install.packages("tidyverse")
install.packages("conflicted")
install.packages("writexl")
install.packages("BiocManager")

BiocManager::install("GEOquery")
BiocManager::install("DT") 


2. Load the required R packages

After installation of the packages we need to load them into our workspace.

library(your package)
Toggle solution
library(conflicted)
library(tidyverse)
library(writexl)
library(GEOquery)
library(DT)


We also need to set preferred libraries for conflicting functions:

conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)

3. Load the genexpression dataset

We can use the getGEO() function from the GEOQuery package to retrieve the GDS2034 dataset.

# fetch directly from the GEO server
gds <- getGEO(GEO = "your GEO ID here", GSEMatrix = FALSE)

#  alternatively we can load a GEO file from the local computer
gds <- getGEO(filename=system.file("your GEO file here", package="GEOquery"))
Toggle solution
# fetch directly from the GEO server
gds <- getGEO(GEO = "GDS2034", GSEMatrix = FALSE)


Now, gds contains the R data structure (of class GDS) that represents the GDS2034 entry from GEO.

4. Query the genexpression dataset (GDS)

The GEOquery data structure GDS consists of a metadata header Meta and a GEODataTable. The GEODataTable has two simple parts, a Columns part which describes the column headers on the Table part. There is also a show method for each class. For example, using the gds from above:

# Look at gds metadata (first 3 entries):
head(Meta(gds), 3)
## $channel_count
## [1] "2"
## 
## $dataset_id
## [1] "GDS2034" "GDS2034" "GDS2034" "GDS2034"
## 
## $description
## [1] "Expression profiling of prostate cancer cell line LNCaP treated for up to 8 hours with the synthetic androgen R1881. LNCaP cells grow slowly in medium devoid of steroids but resume growth upon the addition of androgens. The study aims to identify direct targets of the androgen receptor."
## [2] "2 h"                                                                                                                                                                                                                                                                                            
## [3] "4 h"                                                                                                                                                                                                                                                                                            
## [4] "6 h"                                                                                                                                                                                                                                                                                            
## [5] "8 h"

We can list all Meta fields by using the names() function

names(Meta(gds))
##  [1] "channel_count"            "dataset_id"              
##  [3] "description"              "email"                   
##  [5] "feature_count"            "institute"               
##  [7] "name"                     "order"                   
##  [9] "platform"                 "platform_organism"       
## [11] "platform_technology_type" "pubmed_id"               
## [13] "ref"                      "reference_series"        
## [15] "sample_count"             "sample_id"               
## [17] "sample_organism"          "sample_type"             
## [19] "title"                    "type"                    
## [21] "update_date"              "value_type"              
## [23] "web_link"

To access specific information we can directly query it from the Meta structure by using the field name.

E.g. to get sample_ids use;

Meta(gds)[["sample_id"]]
## [1] "GSM91874,GSM91875" "GSM91879,GSM92195" "GSM92196,GSM92197"
## [4] "GSM92198,GSM92199"


Provide the following information:

  • description, what was studied with the experiment?
  • sample_id, ids of the samples
  • sample_count, how many samples are included in the dataset
  • feature_count, how many features (genes) were measured
  • value_type, discuss what does it mean
  • Reference to the publication

We can get the description for the data columns by using the Columns() function and use datatable() to format the result:

# Look at Column descriptions:
column_description <- Columns(gds)
datatable(column_description)

Explanation:

  • In R we use <- to assign values/object to a variable, e.g.: column_description <- Columns(gds)
  • The datatable() function can be used to display taublar data as formatted table, e.g. : datatable(column_description)`

Generate a column description, discuss and report the sample description.


To inspect the expression data table we may use the Table() function.

# Look at data associated with the GDS:
# but restrict to only first 5 rows, for brevity
Table(gds)[1:5,]

Describe the structure of the gene expression data table.


Assign the data table to a variable named dataTable for further manipulation.

dataTable <- Table(gds)

4. Caclulate the mean expression values

As observed in the sample description, the dataset includes multiple time points, with several replicates for each time point.

Calculate the mean log2fold change values over the replicate pairs for the 4 time points in the dataTable :

We use the sample_id information from Meta to get the paired replicates

# assign a variable to the sample_id pairs
sample_ids <- Meta(gds)[["sample_id"]]

sample_ids
## [1] "GSM91874,GSM91875" "GSM91879,GSM92195" "GSM92196,GSM92197"
## [4] "GSM92198,GSM92199"


We see that there are 4 sample pairs, one for each time point. The samples of a pair are separated by a ,. Now we can use , as delimiter and split the sample pair into a list of vectors with two elements, which are time points with 2 sample_ids.

# Split the sample pairs string into individual pairs
sample_pairs <- str_split(sample_ids, ",")    # split at ","

sample_pairs
## [[1]]
## [1] "GSM91874" "GSM91875"
## 
## [[2]]
## [1] "GSM91879" "GSM92195"
## 
## [[3]]
## [1] "GSM92196" "GSM92197"
## 
## [[4]]
## [1] "GSM92198" "GSM92199"

We can now use sample_pairs to compute the average expression value for each time point (sample pair). This is done by iterating over the individual time points using a for loop.

Here is how it works:

Loop Over Column Pairs

for (pair in column_pairs) {...}

  • This for loop iterates over sample_pairs, which is a list where each element contains a pair of column names (sample_ids).
  • In each iteration the current element is assinge to the variable pair
  • Each pair is a vector with two elements, representing two column names.


Steps within the for(...) {...} loop:

1. Assign Column Names

col1 <- pair[1] # assign first column
col2 <- pair[2] # assign second column
  • The first element of pair is assigned to col1 (e.g., “GSM91874”).
  • The second element is assigned to col2 (e.g., “GSM91875”).
  • These represent the two columns whose average values we want to calculate.


2. Create a New Column Name

new_col_name <- paste(col1, col2, "avg", sep = "_")

  • The paste() function concatenates col1, col2, and “avg”, separated by underscores (_).

  • If col1 = "GSM91874" and col2 = "GSM91875", then:

    new_col_name will be "GSM91874_GSM91875_avg"


3. Compute the Row-Wise Mean and Add to Data Table

  dataTable <- dataTable |>
    mutate(!!(new_col_name) := rowMeans(select(dataTable, all_of(c(col1, col2))), na.rm = TRUE))

Breaking It Down:

  • dataTable <- dataTable ...: This part assigns the modified dataset (after calculation) back to dataTable. It means you’re updating the dataTable object with the results of the operation.

  • |> (Pipe operator): The |> is a pipe operator. It’s used to send the result of one operation to the next. So in this case, it’s sending the dataTable data to the mutate() function.

  • mutate(): Adds a new column to dataTable

  • !!(new_col_name) := ...:

    • !! (bang-bang operator) is used to dynamically create a column with the name stored in new_col_name.
  • rowMeans(select(dataTable, all_of(c(col1, col2))), na.rm = TRUE):

    • select(dataTable, all_of(c(col1, col2))): Selects the two columns from dataTable.
    • rowMeans(..., na.rm = TRUE): Computes the average of the two selected columns for each row, ignoring NA values.

So, this step creates a new column in dataTable that contains the row-wise mean of the two selected columns.

Toggle solution

Taken together:

# Calculate the average values for each paired column
for (pair in sample_pairs) {
  # Extract the two column names from the pair
  col1 <- pair[1]  
  col2 <- pair[2]  
  
  # Create a meaningful new column name for the average values
  new_col_name <- paste0(col1, "_", col2, "_avg")  
  
  # Compute the row-wise average of the two selected columns
  dataTable <- dataTable |>
    mutate(!!(new_col_name) := rowMeans(select(dataTable, all_of(c(col1, col2))), na.rm = TRUE))
}

# show the first 5 rows, using the head() function
head(dataTable, 5)


5. Reformat the gene expression dataset

To format the expression data table for further analysis, we need to modify and remove certain columns.

1. Removing unneeded columns

We remove the columns containing expression values for individual replicates, retaining only the mean expression values (columns that end with “_avg”) for each time point along with the ID_REF and IDENTIFIER columns.

We can use the dplyr select() function to select specific columns. The ends_with("string") function matches all column names that end with “string” (what ever string is specified):

# use select() to retain `ID_REF`, `IDENTIFIER` and columns ending with "_avg"
dataTable_formated <- dataTable |>
  select(col1, col2, ...)

# show the first 5 rows
head(dataTable_formated, 5)
Toggle solution
# use select() to retain `ID_REF`, `IDENTIFIER` and columns ending with "_avg"
dataTable_formated <- dataTable |>
  select(ID_REF, IDENTIFIER, ends_with("_avg"))

# show the first 5 rows
head(dataTable_formated, 5)


2. Renaming columns

We rename the ID_REF, IDENTIFIER and gene expression columns as follows:

Original column name New column name
ID_REF UNIQID
IDENTIFIER NAME
GSM91874_GSM91875_avg 2h
GSM91879_GSM92195_avg 4h
GSM92196_GSM92197_avg 6h
GSM92198_GSM92199_avg 8h

We use the dplyr rename() function to rename columns by specifying "new name" = "old name", ... as arguments.

dataTable <- dataTable_filtered |>
  rename(
    "new name" = "old name",
    ...
  )
Toggle solution
# use dplyr rename() to rename columns
dataTable_formated <- dataTable_formated |>
  rename(
    "UNIQID" = "ID_REF",
    "NAME" = "IDENTIFIER",
    "2h" = "GSM91874_GSM91875_avg",
    "4h" = "GSM91879_GSM92195_avg",
    "6h" = "GSM92196_GSM92197_avg",
    "8h" = "GSM92198_GSM92199_avg"
    )

# show the first 5 rows
head(dataTable_formated, 5)


6. Filter the gene expression dataset

Finally, we filter the dataset to keep only genes with expression values available for all four time points. Additionally, we focus on genes that are differential expressed in at least one time point, using a threshold of 1.8-fold change (either up or down) to identify differential expression.

1. Remove missing all values

We use the dplyr drop_na() function to remove rows with any missing values (NA or NaN).

Toggle solution
# filter out `NaN`
dataTable_filtered <- dataTable_formated |>
  drop_na()

# show the first 5 rows
head(dataTable_filtered, 5)


Questions:

  • How many rows were removed?
  • How many genes are left?

Hint: Use the dplyr count() function to count rows



2. Keep only differential expressed genes

We use the dplyr filter() function to retain only genes that exhibit at least a 1.8-fold change in expression in at least one of the four timepoints.

  • Note: the value_type is log2 ratio
# keep only genes the are diffential expressed
dataTable_filtered <- dataTable_filtered |>
  filter(
    if_any(ends_with("h"), ~abs(.) >= log2(1.8))
  )

# show the first 5 rows
head(dataTable_filtered, 5)

Breaking it down:

  • filter(): The filter() function from the dplyr package is used to keep only the rows that meet specific conditions (in this case, genes with significant fold changes).

    The function checks each row and keeps it only if it meets the condition you specify.

  • if_any(): The if_any() function checks if any column in each row meets a condition. This is useful when we want to filter based on multiple columns, not just one.

  • ends_with("h"): This checks if a column name ends with the letter "h". It’s used to select only the columns that represent expression values at specific time points, like 2h, 4h, etc.

  • ~abs(.) >= log2(1.8) This is the condition being applied to the selected columns (those that end with “h”).

    • The . refers to the value of the column in that row.
    • abs(.) takes the absolute value of the column value (so we don’t care if it’s a positive or negative fold change).
    • log2(1.8) computes the logarithmic value (base 2) of 1.8. This is the threshold for the fold change; a gene is kept if its fold change is at least 1.8 times (in either direction), which in log2 is greater than or equal to 0.847.

Questions:

  • How many genes are left?
  • How many genes are up-regulated in at least one time point?
  • How many genes are down-regulated in at least one time point?

Hint:

# use count() to count all rows
dataTable_filtered |> count()

# use filter to select up/down regulated genes and count() to count rows
# filter(...) we can adapt the filter we used before
dataTable_filtered |> filter(...) |> count()

7. Save the final table

We can use the write_tsv() and write_xlsx() functions to save the final data in dataTable_filtered to a tab separated value (TSV) and an Excel file.

write_tsv(dataTable_filtered, file = "GDS2034_final.tsv")
write_xlsx(dataTable_filtered, path = "GDS2034_final.xlsx")