Differential Gene Expression
Can we cross the volcano?
For this assignment you will be taking RNA-seq reads off the sequencer, and determining what genes are expressed higher in treatment group A compared to treatments group B. Why would someone want to do this? This can tell you something about the physiological response to a “treatment”, which generally speaking could be anything from environment, disease, developmental stage, tissue, species…
As opposed to last week these files will be a little larger and compute effort will increase. It is good to pause here and decide what platform(s) you might want to use for this assignment.
Software
For this assignment we will be using kallisto (bash), DESeq2 (r).
Installing kallisto
Navigate through to a terminal and create directory in your home directory named programs
jovyan@jupyter-sr320:~$ pwd
/home/jovyan
jovyan@jupyter-sr320:~$ mkdir programs
Grab (wget
) the program from site listed above.
jovyan@jupyter-sr320:~$ cd programs/
jovyan@jupyter-sr320:~/programs$ wget https://github.com/pachterlab/kallisto/releases/download/v0.46.1/kallisto_linux-v0.46.1.tar.gz
Uncompress the file.
jovyan@jupyter-sr320:~/programs$ cd kallisto
jovyan@jupyter-sr320:~/programs/kallisto$ ls
kallisto license.txt README.md test
jovyan@jupyter-sr320:~/programs/kallisto$ ./kallisto
kallisto 0.46.1
Usage: kallisto <CMD> [arguments] ..
Where <CMD> can be one of:
index Builds a kallisto index
quant Runs the quantification algorithm
bus Generate BUS files for single-cell data
pseudo Runs the pseudoalignment step
merge Merges several batch runs
h5dump Converts HDF5-formatted results to plaintext
inspect Inspects and gives information about an index
version Prints version information
cite Prints citation information
Running kallisto <CMD> without arguments prints usage information for <CMD>
Running kallisto
kallisto is already installed on raven (/home/shared/kallisto/kallisto
).
Downloading reference
This code grabs the Pacific oyster fasta file of genes and does so ignoring the fact that gannet does not have a security certificate to authenticate (--insecure
). This is usually not recommended however we know the server.
```{bash}
cd ../data
curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna
```
This code is indexing the file rna.fna
while also renaming it as cgigas_roslin_rna.index
.
```{bash}
/home/shared/kallisto/kallisto \
-i \
index \
../data/cgigas_roslin_rna.index
../data/rna.fna```
Downloading sequence reads
Sequence reads are on a public server at https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/
Sample | SampleID |
D-control | D54 |
D-control | D55 |
D-control | D56 |
D-control | D57 |
D-control | D58 |
D-control | D59 |
D-control | M45 |
D-control | M46 |
D-control | M48 |
D-control | M49 |
D-control | M89 |
D-control | M90 |
D-desiccation | N48 |
D-desiccation | N49 |
D-desiccation | N50 |
D-desiccation | N51 |
D-desiccation | N52 |
D-desiccation | N53 |
D-desiccation | N54 |
D-desiccation | N55 |
D-desiccation | N56 |
D-desiccation | N57 |
D-desiccation | N58 |
D-desiccation | N59 |
This code uses recursive feature of wget
(see this weeks’ reading) to get all 24 files. Additionally as with curl
above we are ignoring the fact there is not security certificate with --no-check-certificate
cd ../data
wget --recursive --no-parent --no-directories \
\
--no-check-certificate '*.fastq.gz' \
--accept https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/
The next chunk first creates a subdirectory
Then performs the following steps:
- Uses the
find
utility to search for all files in the../data/
directory that match the pattern*fastq.gz
.
- Uses the
basename
command to extract the base filename of each file (i.e., the filename without the directory path), and removes the suffix_L001_R1_001.fastq.gz
. - Runs the kallisto
quant
command on each input file, with the following options:
-i ../data/cgigas_roslin_rna.index
: Use the kallisto index file located at../data/cgigas_roslin_rna.index
.-o ../output/kallisto_01/{}
: Write the output files to a directory called../output/kallisto_01/
with a subdirectory named after the base filename of the input file (the {} is a placeholder for the base filename).-t 40
: Use 40 threads for the computation.--single -l 100 -s 10
: Specify that the input file contains single-end reads (–single), with an average read length of 100 (-l 100) and a standard deviation of 10 (-s 10).- The input file to process is specified using the {} placeholder, which is replaced by the base filename from the previous step.
```{bash}
mkdir ../output/kallisto_01
find ../data/*fastq.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
-i ../data/cgigas_roslin_rna.index \
quant \
-o ../output/kallisto_01/{} \
-t 40 -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz
--single ```
This command runs the abundance_estimates_to_matrix.pl
script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files.
The specific options and arguments used in the command are as follows:
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl
: Run the abundance_estimates_to_matrix.pl script from Trinity.--est_method kallisto
: Specify that the abundance estimates were generated using kallisto.--gene_trans_map none
: Do not use a gene-to-transcript mapping file.--out_prefix ../output/kallisto_01
: Use ../output/kallisto_01 as the output directory and prefix for the gene expression matrix file.--name_sample_by_basedir
: Use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix.
- And then there are the kallisto abundance files to use as input for creating the gene expression matrix.
```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
\
--est_method kallisto --gene_trans_map none \
--out_prefix ../output/kallisto_01 \
--name_sample_by_basedir \
\
../output/kallisto_01/D54_S145/abundance.tsv \
../output/kallisto_01/D56_S136/abundance.tsv \
../output/kallisto_01/D58_S144/abundance.tsv \
../output/kallisto_01/M45_S140/abundance.tsv \
../output/kallisto_01/M48_S137/abundance.tsv \
../output/kallisto_01/M89_S138/abundance.tsv \
../output/kallisto_01/D55_S146/abundance.tsv \
../output/kallisto_01/D57_S143/abundance.tsv \
../output/kallisto_01/D59_S142/abundance.tsv \
../output/kallisto_01/M46_S141/abundance.tsv \
../output/kallisto_01/M49_S139/abundance.tsv \
../output/kallisto_01/M90_S147/abundance.tsv \
../output/kallisto_01/N48_S194/abundance.tsv \
../output/kallisto_01/N50_S187/abundance.tsv \
../output/kallisto_01/N52_S184/abundance.tsv \
../output/kallisto_01/N54_S193/abundance.tsv \
../output/kallisto_01/N56_S192/abundance.tsv \
../output/kallisto_01/N58_S195/abundance.tsv \
../output/kallisto_01/N49_S185/abundance.tsv \
../output/kallisto_01/N51_S186/abundance.tsv \
../output/kallisto_01/N53_S188/abundance.tsv \
../output/kallisto_01/N55_S190/abundance.tsv \
../output/kallisto_01/N57_S191/abundance.tsv
../output/kallisto_01/N59_S189/abundance.tsv```
Running DESeq2
This code performs differential expression analysis to identify differentially expressed genes (DEGs) between a control condition and a desiccated condition.
First, it reads in a count matrix of isoform counts generated by Kallisto, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.
Next, it creates a data.frame
containing information about the experimental conditions and sets row names to match the column names in the count matrix. It uses this information to create a DESeqDataSet
object, which is then passed to the DESeq()
function to fit a negative binomial model and estimate dispersions. The results()
function is used to extract the results table, which is ordered by gene/transcript ID.
The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.
Load packages:
```{r}
library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)
```
Might need to install first eg
```{r}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("DESeq2")
BiocManager```
Read in count matrix
```{r}
<- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
countmatrix rownames(countmatrix) <- countmatrix$X
<- countmatrix[,-1]
countmatrix head(countmatrix)
```
Round integers up to hole numbers for further analysis:
```{r}
<- round(countmatrix, 0)
countmatrix str(countmatrix)
```
Get DEGs based on Desication
```{r}
<- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))),
deseq2.colData type=factor(rep("single-read", 24)))
rownames(deseq2.colData) <- colnames(data)
<- DESeqDataSetFromMatrix(countData = countmatrix,
deseq2.dds colData = deseq2.colData,
design = ~ condition)
```
```{r}
<- DESeq(deseq2.dds)
deseq2.dds <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
deseq2.res ```
```{r}
head(deseq2.res)
```
```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
```
```{r}
<- deseq2.res
tmp # The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
main="DEG Dessication (pval <= 0.05)",
xlab="mean of normalized counts",
ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
<- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
tmp.sig points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```
```{r}
write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)
```