Differential Gene Expression
Can we cross the volcano?
Here is a a fundamental overview of high-throughput differential gene expression analysis (RNA-seq)
For this assignment you will be taking RNA-seq reads off the sequencer, and determining what genes are expressed higher in treatment group A compared to treatments group B. Why would someone want to do this? This can tell you something about the physiological response to a “treatment”, which generally speaking could be anything from environment, disease, developmental stage, tissue, species…
Generate a plot and table of differentially expressed genes.
Software
For this assignment we will be using kallisto (bash), DESeq2 (r).
Installing kallisto
Navigate through to a terminal and create directory in your home directory named programs
jovyan@jupyter-sr320:~$ pwd
/home/jovyan
jovyan@jupyter-sr320:~$ mkdir programs
Grab (wget
) the program from site listed above.
jovyan@jupyter-sr320:~$ cd programs/
jovyan@jupyter-sr320:~/programs$ wget https://github.com/pachterlab/kallisto/releases/download/v0.46.1/kallisto_linux-v0.46.1.tar.gz
Uncompress the file.
jovyan@jupyter-sr320:~/programs$ cd kallisto
jovyan@jupyter-sr320:~/programs/kallisto$ ls
kallisto license.txt README.md test
jovyan@jupyter-sr320:~/programs/kallisto$ ./kallisto
kallisto 0.46.1
Usage: kallisto <CMD> [arguments] ..
Where <CMD> can be one of:
index Builds a kallisto index
quant Runs the quantification algorithm
bus Generate BUS files for single-cell data
pseudo Runs the pseudoalignment step
merge Merges several batch runs
h5dump Converts HDF5-formatted results to plaintext
inspect Inspects and gives information about an index
version Prints version information
cite Prints citation information
Running kallisto <CMD> without arguments prints usage information for <CMD>
Commit early and often but ignore
files that are larger that 100 MB (or you will likely lose everything since prior commit).
You can use Git’s built-in hooks to automatically ignore files larger than 100 MB. Here are the steps to follow:
Create a new file in the .git/hooks/
directory of your repository called pre-commit
.
Add the following code to the pre-commit
file:
#!/bin/bash
# Maximum file size (in bytes)
max_file_size=104857600
# Find all files larger than max_file_size and add them to the .gitignore file
find . -type f -size +$max_file_sizec -exec echo "{}" >> .gitignore \;
This code sets the max_file_size variable to 100 MB and then uses the find command to locate all files in the repository that are larger than the specified max_file_size. The exec option of the find command appends the name of each file that matches the criteria to the .gitignore file.
Save the pre-commit file and make it executable by running the following command:
chmod +x .git/hooks/pre-commit
With these changes, whenever you run a git commit command, Git will first execute the pre-commit hook, which will automatically add any files larger than 100 MB to the .gitignore file. This will prevent Git from tracking these files in the repository going forward.
This might also work - git reset --mixed HEAD~1
Running kallisto
kallisto is already installed on raven (/home/shared/kallisto/kallisto
).
When accessing raven off-campus you have to use Husky OnNet
Downloading reference
This code grabs the Pacific oyster fasta file of genes and does so ignoring the fact that gannet does not have a security certificate to authenticate (--insecure
). This is usually not recommended however we know the server.
```{bash}
cd ../data
curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna
```
Creating index can take some time
This code is indexing the file rna.fna
while also renaming it as cgigas_roslin_rna.index
.
```{bash}
/home/shared/kallisto/kallisto \
-i \
index \
../data/cgigas_roslin_rna.index
../data/rna.fna```
Downloading sequence reads
Sequence reads are on a public server at https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/
Sample | SampleID |
D-control | D54 |
D-control | D55 |
D-control | D56 |
D-control | D57 |
D-control | D58 |
D-control | D59 |
D-control | M45 |
D-control | M46 |
D-control | M48 |
D-control | M49 |
D-control | M89 |
D-control | M90 |
D-desiccation | N48 |
D-desiccation | N49 |
D-desiccation | N50 |
D-desiccation | N51 |
D-desiccation | N52 |
D-desiccation | N53 |
D-desiccation | N54 |
D-desiccation | N55 |
D-desiccation | N56 |
D-desiccation | N57 |
D-desiccation | N58 |
D-desiccation | N59 |
This code uses recursive feature of wget
(see this weeks’ reading) to get all 24 files. Additionally as with curl
above we are ignoring the fact there is not security certificate with --no-check-certificate
```{bash}
cd ../data
wget --recursive --no-parent --no-directories \
\
--no-check-certificate '*.fastq.gz' \
--accept
https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/```
The next chunk first creates a subdirectory
Then performs the following steps:
The xargs
command in Unix-like systems is used to build and execute command lines from standard input. It’s often combined with other commands to perform complex operations. In your example, xargs
is used twice in a pipeline that starts with the find
command. Here’s a breakdown of what each part of the command does:
find ../data/*fastq.gz
:- This command finds all files in the
../data/
directory (and its subdirectories) with names ending in*fastq.gz
.
- This command finds all files in the
| xargs basename -s _L001_R1_001.fastq.gz
:The output of
find
(paths to.fastq.gz
files) is piped (|
) toxargs
, which then applies thebasename -s _L001_R1_001.fastq.gz
command to each path.basename
is used to strip the directory and suffix from filenames. The-s
option specifies a suffix to remove.In this case,
basename
removes the directory path and the suffix_L001_R1_001.fastq.gz
from each filename.
| xargs -I{} /home/shared/kallisto/kallisto quant -i ../data/cgigas_roslin_rna.index -o ../output/kallisto_01/{} -t 4 --single -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz
:The output from the previous
xargs
(which are now the modified filenames) is piped to anotherxargs
command.-I{}
is used to specify a replacement string{}
. This string is replaced by each input line (filename) in the subsequent command.The command
/home/shared/kallisto/kallisto quant...
is executed for each input line, with{}
being replaced by the input filename (without path and specific suffix).This part of the command runs the
kallisto quant
program for RNA sequence quantification, using various options and input files. The{}
placeholder is replaced by the current filename (from the previous steps) in two places: for the output directory and for the input.fastq.gz
file.
In summary, this command sequence finds .fastq.gz
files, modifies their names by removing paths and a specific suffix, and then runs a kallisto quant
command on each file, directing the output to a specific directory and using certain program options. This is a common pattern in bioinformatics workflows, where operations need to be applied to multiple files in an automated manner.
-t 4
: Use 4 threads for the computation.--single -l 100 -s 10
: Specify that the input file contains single-end reads (–single), with an average read length of 100 (-l 100) and a standard deviation of 10 (-s 10).
```{bash}
mkdir ../output/kallisto_01
find ../data/*fastq.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
-i ../data/cgigas_roslin_rna.index \
quant \
-o ../output/kallisto_01/{} \
-t 4 -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz
--single ```
This command runs the abundance_estimates_to_matrix.pl
script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files.
The specific options and arguments used in the command are as follows:
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl
: Run the abundance_estimates_to_matrix.pl script from Trinity.--est_method kallisto
: Specify that the abundance estimates were generated using kallisto.--gene_trans_map none
: Do not use a gene-to-transcript mapping file.--out_prefix ../output/kallisto_01
: Use ../output/kallisto_01 as the output directory and prefix for the gene expression matrix file.--name_sample_by_basedir
: Use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix.
- And then there are the kallisto abundance files to use as input for creating the gene expression matrix.
```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
\
--est_method kallisto --gene_trans_map none \
--out_prefix ../output/kallisto_01 \
--name_sample_by_basedir \
\
../output/kallisto_01/D54_S145/abundance.tsv \
../output/kallisto_01/D56_S136/abundance.tsv \
../output/kallisto_01/D58_S144/abundance.tsv \
../output/kallisto_01/M45_S140/abundance.tsv \
../output/kallisto_01/M48_S137/abundance.tsv \
../output/kallisto_01/M89_S138/abundance.tsv \
../output/kallisto_01/D55_S146/abundance.tsv \
../output/kallisto_01/D57_S143/abundance.tsv \
../output/kallisto_01/D59_S142/abundance.tsv \
../output/kallisto_01/M46_S141/abundance.tsv \
../output/kallisto_01/M49_S139/abundance.tsv \
../output/kallisto_01/M90_S147/abundance.tsv \
../output/kallisto_01/N48_S194/abundance.tsv \
../output/kallisto_01/N50_S187/abundance.tsv \
../output/kallisto_01/N52_S184/abundance.tsv \
../output/kallisto_01/N54_S193/abundance.tsv \
../output/kallisto_01/N56_S192/abundance.tsv \
../output/kallisto_01/N58_S195/abundance.tsv \
../output/kallisto_01/N49_S185/abundance.tsv \
../output/kallisto_01/N51_S186/abundance.tsv \
../output/kallisto_01/N53_S188/abundance.tsv \
../output/kallisto_01/N55_S190/abundance.tsv \
../output/kallisto_01/N57_S191/abundance.tsv
../output/kallisto_01/N59_S189/abundance.tsv```
Running DESeq2
This code performs differential expression analysis to identify differentially expressed genes (DEGs) between a control condition and a desiccated condition.
First, it reads in a count matrix of isoform counts generated by Kallisto, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.
Next, it creates a data.frame
containing information about the experimental conditions and sets row names to match the column names in the count matrix. It uses this information to create a DESeqDataSet
object, which is then passed to the DESeq()
function to fit a negative binomial model and estimate dispersions. The results()
function is used to extract the results table, which is ordered by gene/transcript ID.
The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.
The below code could be a single script (or single chunk). I like separating to assist in troubleshooting and check output at various steps.
Load packages:
```{r}
library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)
```
Might need to install first eg
```{r}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("DESeq2")
BiocManager```
Read in count matrix
```{r}
<- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
countmatrix rownames(countmatrix) <- countmatrix$X
<- countmatrix[,-1]
countmatrix head(countmatrix)
```
Round integers up to hole numbers for further analysis:
```{r}
<- round(countmatrix, 0)
countmatrix str(countmatrix)
```
Get DEGs based on Desiccation
This code snippet is part of a typical analysis pipeline using the DESeq2 package in R, which is widely used for differential gene expression analysis based on count data (like RNA-Seq or other forms of count data). Let’s break down the code:
deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))), type=factor(rep("single-read", 24)))
This line creates a data frame named
deseq2.colData
.The data frame has two columns:
condition
andtype
.The
condition
column is made up of 24 elements, where the first 12 elements are the word “control” and the next 12 are “desicated”. These are likely to represent two experimental conditions, andfactor
indicates that these are categorical variables.The
type
column consists of 24 repetitions of the term “single-read”, which might indicate the type of sequencing or experiment performed.Overall, this suggests an experiment with 24 samples, where 12 are control and 12 are under some desiccation treatment, and all are single-read type experiments.
rownames(deseq2.colData) <- colnames(data)
This line sets the row names of the
deseq2.colData
data frame to be the same as the column names of another data frame or matrix calleddata
.This is important for matching the metadata (in
deseq2.colData
) to the actual count data (presumably indata
), ensuring that each column of count data has the corresponding experimental condition and type information.
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix, colData = deseq2.colData, design = ~ condition)
This line creates a DESeqDataSet object, which is the core data structure used in DESeq2 for managing count data alongside its corresponding experimental metadata.
countData = countmatrix
specifies that the actual count data is contained in a matrix calledcountmatrix
.colData = deseq2.colData
attaches the metadata we created earlier to the count data. This metadata includes information about the experimental conditions and types for each sample.design = ~ condition
defines the model design for the differential expression analysis, indicating that the primary variable of interest for differential expression is thecondition
(i.e., control vs. desicated).
In summary, this code is setting up for a differential expression analysis comparing two conditions (control and desiccated) across 24 samples (assumed to be single-read type) using the DESeq2 package. The DESeqDataSet
object created will be used in subsequent steps of the analysis to perform normalization, differential expression testing, and other statistical evaluations.
```{r}
<- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))),
deseq2.colData type=factor(rep("single-read", 24)))
rownames(deseq2.colData) <- colnames(data)
<- DESeqDataSetFromMatrix(countData = countmatrix,
deseq2.dds colData = deseq2.colData,
design = ~ condition)
```
```{r}
<- DESeq(deseq2.dds)
deseq2.dds <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
deseq2.res ```
```{r}
head(deseq2.res)
```
```{r}
<- vst(deseq2.dds, blind = FALSE)
vsd plotPCA(vsd, intgroup = "condition")
```
```{r}
# Select top 50 differentially expressed genes
<- results(deseq2.dds)
res <- res[order(res$padj), ]
res_ordered <- row.names(res_ordered)[1:50]
top_genes
# Extract counts and normalize
<- counts(deseq2.dds, normalized = TRUE)
counts <- counts[top_genes, ]
counts_top
# Log-transform counts
<- log2(counts_top + 1)
log_counts_top
# Generate heatmap
pheatmap(log_counts_top, scale = "row")
```
```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
```
```{r}
<- deseq2.res
tmp # The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
main="DEG Dessication (pval <= 0.05)",
xlab="mean of normalized counts",
ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
<- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
tmp.sig points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```
```{r}
write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)
```