In this exercise (Task 1), we will explore how to retrieve gene expression data from the Gene Expression Omnibus (GEO). We will extract key information about the dataset and reformat the expression values to make them compatible with Genesis.
This task will be carried out using the R programming language, following the workflow outlined below.
We first need to install some R packages which help us to run our gene expression analysis.
tidyverse
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. It includes the following packages:
Other packages
install.packages("your package name here")
install.packages("tidyverse")
install.packages("conflicted")
install.packages("writexl")
install.packages("BiocManager")
BiocManager::install("GEOquery")
BiocManager::install("DT")
After installation of the packages we need to load them into our workspace.
library(your package)
library(conflicted)
library(tidyverse)
library(writexl)
library(GEOquery)
library(DT)
We also need to set preferred libraries for conflicting functions:
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
We can use the getGEO()
function from the
GEOQuery
package to retrieve the GDS2034
dataset.
# fetch directly from the GEO server
gds <- getGEO(GEO = "your GEO ID here", GSEMatrix = FALSE)
# alternatively we can load a GEO file from the local computer
gds <- getGEO(filename=system.file("your GEO file here", package="GEOquery"))
# fetch directly from the GEO server
gds <- getGEO(GEO = "GDS2034", GSEMatrix = FALSE)
Now, gds
contains the R data structure (of class GDS)
that represents the GDS2034 entry from GEO.
The GEOquery data structure GDS consists of a metadata header
Meta
and a GEODataTable. The GEODataTable has two simple
parts, a Columns
part which describes the column headers on
the Table
part. There is also a show method for each class.
For example, using the gds
from above:
# Look at gds metadata (first 3 entries):
head(Meta(gds), 3)
## $channel_count
## [1] "2"
##
## $dataset_id
## [1] "GDS2034" "GDS2034" "GDS2034" "GDS2034"
##
## $description
## [1] "Expression profiling of prostate cancer cell line LNCaP treated for up to 8 hours with the synthetic androgen R1881. LNCaP cells grow slowly in medium devoid of steroids but resume growth upon the addition of androgens. The study aims to identify direct targets of the androgen receptor."
## [2] "2 h"
## [3] "4 h"
## [4] "6 h"
## [5] "8 h"
We can list all Meta
fields by using the
names()
function
names(Meta(gds))
## [1] "channel_count" "dataset_id"
## [3] "description" "email"
## [5] "feature_count" "institute"
## [7] "name" "order"
## [9] "platform" "platform_organism"
## [11] "platform_technology_type" "pubmed_id"
## [13] "ref" "reference_series"
## [15] "sample_count" "sample_id"
## [17] "sample_organism" "sample_type"
## [19] "title" "type"
## [21] "update_date" "value_type"
## [23] "web_link"
To access specific information we can directly query it from the
Meta
structure by using the field name.
E.g. to get sample_ids
use;
Meta(gds)[["sample_id"]]
## [1] "GSM91874,GSM91875" "GSM91879,GSM92195" "GSM92196,GSM92197"
## [4] "GSM92198,GSM92199"
Provide the following information:
We can get the description for the data columns by using the
Columns()
function and use datatable()
to
format the result:
# Look at Column descriptions:
column_description <- Columns(gds)
datatable(column_description)
Explanation:
<-
to assign values/object to a
variable, e.g.: column_description <- Columns(gds)
datatable()
function can be used to display taublar
data as formatted table, e.g. : datatable(column_description)`Generate a column description, discuss and report the sample description.
To inspect the expression data table we may use the
Table()
function.
# Look at data associated with the GDS:
# but restrict to only first 5 rows, for brevity
Table(gds)[1:5,]
Describe the structure of the gene expression data table.
Assign the data table to a variable named dataTable
for
further manipulation.
dataTable <- Table(gds)
As observed in the sample description, the dataset includes multiple time points, with several replicates for each time point.
Calculate the mean log2fold change values over the replicate
pairs for the 4 time points in the dataTable
:
We use the sample_id
information from Meta
to get the paired replicates
# assign a variable to the sample_id pairs
sample_ids <- Meta(gds)[["sample_id"]]
sample_ids
## [1] "GSM91874,GSM91875" "GSM91879,GSM92195" "GSM92196,GSM92197"
## [4] "GSM92198,GSM92199"
We see that there are 4 sample pairs, one for each time point. The
samples of a pair are separated by a ,
. Now we can use
,
as delimiter and split the sample pair into a list of
vectors with two elements, which are time points with 2 sample_ids.
# Split the sample pairs string into individual pairs
sample_pairs <- str_split(sample_ids, ",") # split at ","
sample_pairs
## [[1]]
## [1] "GSM91874" "GSM91875"
##
## [[2]]
## [1] "GSM91879" "GSM92195"
##
## [[3]]
## [1] "GSM92196" "GSM92197"
##
## [[4]]
## [1] "GSM92198" "GSM92199"
We can now use sample_pairs
to compute the average
expression value for each time point (sample pair). This is done by
iterating over the individual time points using a for
loop.
Loop Over Column Pairs
for (pair in column_pairs) {...}
sample_pairs
, which is a
list where each element contains a pair of column names
(sample_ids
).pair
pair
is a vector with two elements, representing
two column names.Steps within the for(...) {...}
loop:
1. Assign Column Names
col1 <- pair[1] # assign first column
col2 <- pair[2] # assign second column
col1
(e.g.,
“GSM91874”).col2
(e.g.,
“GSM91875”).2. Create a New Column Name
new_col_name <- paste(col1, col2, "avg", sep = "_")
The paste()
function concatenates col1
,
col2
, and “avg
”, separated by underscores
(_
).
If col1 = "GSM91874"
and
col2 = "GSM91875"
, then:
new_col_name
will be
"GSM91874_GSM91875_avg"
3. Compute the Row-Wise Mean and Add to Data Table
dataTable <- dataTable |>
mutate(!!(new_col_name) := rowMeans(select(dataTable, all_of(c(col1, col2))), na.rm = TRUE))
Breaking It Down:
dataTable <- dataTable ...
: This part assigns the
modified dataset (after calculation) back to dataTable
. It
means you’re updating the dataTable
object with the results
of the operation.
|>
(Pipe operator): The |>
is a
pipe operator. It’s used to send the result of one operation to the
next. So in this case, it’s sending the dataTable
data to
the mutate()
function.
mutate()
: Adds a new column to
dataTable
!!(new_col_name) := ...
:
!!
(bang-bang operator) is used to dynamically create a
column with the name stored in new_col_name
.rowMeans(select(dataTable, all_of(c(col1, col2))), na.rm = TRUE)
:
select(dataTable, all_of(c(col1, col2)))
: Selects the
two columns from dataTable
.rowMeans(..., na.rm = TRUE)
: Computes the average of
the two selected columns for each row, ignoring NA values.So, this step creates a new column in dataTable
that
contains the row-wise mean of the two selected columns.
Taken together:
# Calculate the average values for each paired column
for (pair in sample_pairs) {
# Extract the two column names from the pair
col1 <- pair[1]
col2 <- pair[2]
# Create a meaningful new column name for the average values
new_col_name <- paste0(col1, "_", col2, "_avg")
# Compute the row-wise average of the two selected columns
dataTable <- dataTable |>
mutate(!!(new_col_name) := rowMeans(select(dataTable, all_of(c(col1, col2))), na.rm = TRUE))
}
# show the first 5 rows, using the head() function
head(dataTable, 5)
To format the expression data table for further analysis, we need to modify and remove certain columns.
1. Removing unneeded columns
We remove the columns containing expression values for individual
replicates, retaining only the mean expression values (columns that end
with “_avg
”) for each time point along with the
ID_REF
and IDENTIFIER
columns.
We can use the dplyr select()
function to select
specific columns. The ends_with("string")
function matches
all column names that end with “string
” (what ever string
is specified):
# use select() to retain `ID_REF`, `IDENTIFIER` and columns ending with "_avg"
dataTable_formated <- dataTable |>
select(col1, col2, ...)
# show the first 5 rows
head(dataTable_formated, 5)
# use select() to retain `ID_REF`, `IDENTIFIER` and columns ending with "_avg"
dataTable_formated <- dataTable |>
select(ID_REF, IDENTIFIER, ends_with("_avg"))
# show the first 5 rows
head(dataTable_formated, 5)
2. Renaming columns
We rename the ID_REF
, IDENTIFIER
and gene
expression columns as follows:
Original column name | New column name |
---|---|
ID_REF | UNIQID |
IDENTIFIER | NAME |
GSM91874_GSM91875_avg | 2h |
GSM91879_GSM92195_avg | 4h |
GSM92196_GSM92197_avg | 6h |
GSM92198_GSM92199_avg | 8h |
We use the dplyr rename()
function to rename columns by
specifying "new name" = "old name", ...
as arguments.
dataTable <- dataTable_filtered |>
rename(
"new name" = "old name",
...
)
# use dplyr rename() to rename columns
dataTable_formated <- dataTable_formated |>
rename(
"UNIQID" = "ID_REF",
"NAME" = "IDENTIFIER",
"2h" = "GSM91874_GSM91875_avg",
"4h" = "GSM91879_GSM92195_avg",
"6h" = "GSM92196_GSM92197_avg",
"8h" = "GSM92198_GSM92199_avg"
)
# show the first 5 rows
head(dataTable_formated, 5)
Finally, we filter the dataset to keep only genes with expression values available for all four time points. Additionally, we focus on genes that are differential expressed in at least one time point, using a threshold of 1.8-fold change (either up or down) to identify differential expression.
1. Remove missing all values
We use the dplyr drop_na()
function to remove rows with
any missing values (NA or NaN).
# filter out `NaN`
dataTable_filtered <- dataTable_formated |>
drop_na()
# show the first 5 rows
head(dataTable_filtered, 5)
Questions:
Hint: Use the dplyr count()
function to count rows
2. Keep only differential expressed genes
We use the dplyr filter()
function to retain only genes
that exhibit at least a 1.8-fold change in expression
in at least one of the four timepoints.
value_type
is
log2 ratio
# keep only genes the are diffential expressed
dataTable_filtered <- dataTable_filtered |>
filter(
if_any(ends_with("h"), ~abs(.) >= log2(1.8))
)
# show the first 5 rows
head(dataTable_filtered, 5)
Breaking it down:
filter()
: The filter()
function from
the dplyr package is used to keep only the rows that meet specific
conditions (in this case, genes with significant fold changes).
The function checks each row and keeps it only if it meets the condition you specify.
if_any()
: The if_any()
function checks
if any column in each row meets a condition. This is useful when we want
to filter based on multiple columns, not just one.
ends_with("h")
: This checks if a column name ends
with the letter "h"
. It’s used to select only the columns
that represent expression values at specific time points, like 2h, 4h,
etc.
~abs(.) >= log2(1.8)
This is the condition being
applied to the selected columns (those that end with “h”).
.
refers to the value of the column in that
row.abs(.)
takes the absolute value of the column value (so
we don’t care if it’s a positive or negative fold change).log2(1.8)
computes the logarithmic value (base 2) of
1.8. This is the threshold for the fold change; a gene is kept if its
fold change is at least 1.8 times (in either direction), which in log2
is greater than or equal to 0.847.Questions:
Hint:
# use count() to count all rows
dataTable_filtered |> count()
# use filter to select up/down regulated genes and count() to count rows
# filter(...) we can adapt the filter we used before
dataTable_filtered |> filter(...) |> count()
We can use the write_tsv()
and write_xlsx()
functions to save the final data in dataTable_filtered
to a
tab separated value (TSV) and an Excel file.
write_tsv(dataTable_filtered, file = "GDS2034_final.tsv")
write_xlsx(dataTable_filtered, path = "GDS2034_final.xlsx")