Differential Gene Expression

Can we cross the volcano?

Note

See screen recording here for some context though note below code is now revised.

For this assignment you will be taking RNA-seq reads off the sequencer, and determining what genes are expressed higher in treatment group A compared to treatments group B. Why would someone want to do this? This can tell you something about the physiological response to a “treatment”, which generally speaking could be anything from environment, disease, developmental stage, tissue, species…

Assignment

Generate a plot and table of differentially expressed genes.

As opposed to last week these files will be a little larger and compute effort will increase. It is good to pause here and decide what platform(s) you might want to use for this assignment.

Software

For this assignment we will be using kallisto (bash), DESeq2 (r).

Installing kallisto

Just for reference as will be using raven for assignment (recommended).

Navigate through to a terminal and create directory in your home directory named programs

jovyan@jupyter-sr320:~$ pwd
/home/jovyan
jovyan@jupyter-sr320:~$ mkdir programs

Grab (wget) the program from site listed above.

jovyan@jupyter-sr320:~$ cd programs/
jovyan@jupyter-sr320:~/programs$ wget https://github.com/pachterlab/kallisto/releases/download/v0.46.1/kallisto_linux-v0.46.1.tar.gz

Uncompress the file.

jovyan@jupyter-sr320:~/programs$ cd kallisto
jovyan@jupyter-sr320:~/programs/kallisto$ ls
kallisto  license.txt  README.md  test
jovyan@jupyter-sr320:~/programs/kallisto$ ./kallisto 
kallisto 0.46.1

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    bus           Generate BUS files for single-cell data 
    pseudo        Runs the pseudoalignment step 
    merge         Merges several batch runs 
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

GitHub file size limit

Commit early and often but ignore files that are larger that 100 MB (or you will likely lose everything since prior commit).

You can use Git’s built-in hooks to automatically ignore files larger than 100 MB. Here are the steps to follow:

Create a new file in the .git/hooks/ directory of your repository called pre-commit.

Add the following code to the pre-commit file:

#!/bin/bash

# Maximum file size (in bytes)
max_file_size=104857600

# Find all files larger than max_file_size and add them to the .gitignore file
find . -type f -size +$max_file_sizec -exec echo "{}" >> .gitignore \;

This code sets the max_file_size variable to 100 MB and then uses the find command to locate all files in the repository that are larger than the specified max_file_size. The exec option of the find command appends the name of each file that matches the criteria to the .gitignore file.

Save the pre-commit file and make it executable by running the following command:

chmod +x .git/hooks/pre-commit

With these changes, whenever you run a git commit command, Git will first execute the pre-commit hook, which will automatically add any files larger than 100 MB to the .gitignore file. This will prevent Git from tracking these files in the repository going forward.

This might also work - git reset --mixed HEAD~1

Running kallisto

kallisto is already installed on raven (/home/shared/kallisto/kallisto).

Important

When accessing raven off-campus you have to use Husky OnNet

Downloading reference

This code grabs the Pacific oyster fasta file of genes and does so ignoring the fact that gannet does not have a security certificate to authenticate (--insecure). This is usually not recommended however we know the server.

```{bash}
cd ../data
curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna
```

Note

Creating index can take some time

This code is indexing the file rna.fna while also renaming it as cgigas_roslin_rna.index.

```{bash}
/home/shared/kallisto/kallisto \
index -i \
../data/cgigas_roslin_rna.index \
../data/rna.fna
```

Downloading sequence reads

Sequence reads are on a public server at https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/

Sample	SampleID
D-control	D54
D-control	D55
D-control	D56
D-control	D57
D-control	D58
D-control	D59
D-control	M45
D-control	M46
D-control	M48
D-control	M49
D-control	M89
D-control	M90
D-desiccation	N48
D-desiccation	N49
D-desiccation	N50
D-desiccation	N51
D-desiccation	N52
D-desiccation	N53
D-desiccation	N54
D-desiccation	N55
D-desiccation	N56
D-desiccation	N57
D-desiccation	N58
D-desiccation	N59

This code uses recursive feature of wget (see this weeks’ reading) to get all 24 files. Additionally as with curl above we are ignoring the fact there is not security certificate with --no-check-certificate

cd ../data 
wget --recursive --no-parent --no-directories \
--no-check-certificate \
--accept '*.fastq.gz' \
https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/

The next chunk first creates a subdirectory

Then performs the following steps:

Uses the find utility to search for all files in the ../data/ directory that match the pattern *fastq.gz.
Uses the basename command to extract the base filename of each file (i.e., the filename without the directory path), and removes the suffix _L001_R1_001.fastq.gz.
Runs the kallisto quant command on each input file, with the following options:

-i ../data/cgigas_roslin_rna.index: Use the kallisto index file located at ../data/cgigas_roslin_rna.index.
-o ../output/kallisto_01/{}: Write the output files to a directory called ../output/kallisto_01/ with a subdirectory named after the base filename of the input file (the {} is a placeholder for the base filename).
-t 40: Use 40 threads for the computation.
--single -l 100 -s 10: Specify that the input file contains single-end reads (–single), with an average read length of 100 (-l 100) and a standard deviation of 10 (-s 10).
The input file to process is specified using the {} placeholder, which is replaced by the base filename from the previous step.

```{bash}
mkdir ../output/kallisto_01

find ../data/*fastq.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/cgigas_roslin_rna.index \
-o ../output/kallisto_01/{} \
-t 40 \
--single -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz
```

This command runs the abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files.

The specific options and arguments used in the command are as follows:

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl: Run the abundance_estimates_to_matrix.pl script from Trinity.
--est_method kallisto: Specify that the abundance estimates were generated using kallisto.
--gene_trans_map none: Do not use a gene-to-transcript mapping file.
--out_prefix ../output/kallisto_01: Use ../output/kallisto_01 as the output directory and prefix for the gene expression matrix file.
--name_sample_by_basedir: Use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix.
And then there are the kallisto abundance files to use as input for creating the gene expression matrix.

```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/D54_S145/abundance.tsv \
    ../output/kallisto_01/D56_S136/abundance.tsv \
    ../output/kallisto_01/D58_S144/abundance.tsv \
    ../output/kallisto_01/M45_S140/abundance.tsv \
    ../output/kallisto_01/M48_S137/abundance.tsv \
    ../output/kallisto_01/M89_S138/abundance.tsv \
    ../output/kallisto_01/D55_S146/abundance.tsv \
    ../output/kallisto_01/D57_S143/abundance.tsv \
    ../output/kallisto_01/D59_S142/abundance.tsv \
    ../output/kallisto_01/M46_S141/abundance.tsv \
    ../output/kallisto_01/M49_S139/abundance.tsv \
    ../output/kallisto_01/M90_S147/abundance.tsv \
    ../output/kallisto_01/N48_S194/abundance.tsv \
    ../output/kallisto_01/N50_S187/abundance.tsv \
    ../output/kallisto_01/N52_S184/abundance.tsv \
    ../output/kallisto_01/N54_S193/abundance.tsv \
    ../output/kallisto_01/N56_S192/abundance.tsv \
    ../output/kallisto_01/N58_S195/abundance.tsv \
    ../output/kallisto_01/N49_S185/abundance.tsv \
    ../output/kallisto_01/N51_S186/abundance.tsv \
    ../output/kallisto_01/N53_S188/abundance.tsv \
    ../output/kallisto_01/N55_S190/abundance.tsv \
    ../output/kallisto_01/N57_S191/abundance.tsv \
    ../output/kallisto_01/N59_S189/abundance.tsv
```

Running DESeq2

This code performs differential expression analysis to identify differentially expressed genes (DEGs) between a control condition and a desiccated condition.

First, it reads in a count matrix of isoform counts generated by Kallisto, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.

Next, it creates a data.frame containing information about the experimental conditions and sets row names to match the column names in the count matrix. It uses this information to create a DESeqDataSet object, which is then passed to the DESeq() function to fit a negative binomial model and estimate dispersions. The results() function is used to extract the results table, which is ordered by gene/transcript ID.

The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.

Note

The below code could be a single script (or single chunk). I like separating to assist in troubleshooting and check output at various steps.

Load packages:

```{r}
library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)
```

Might need to install first eg

```{r}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")
```

Read in count matrix

```{r}
countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
```

Round integers up to hole numbers for further analysis:

```{r}
countmatrix <- round(countmatrix, 0)
str(countmatrix)
```

Get DEGs based on Desication

```{r}
deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))), 
                             type=factor(rep("single-read", 24)))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)
```

```{r}
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
```

```{r}
head(deseq2.res)
```

```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
```

```{r}
tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Dessication  (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```

```{r}
write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)
```

--- title: "Differential Gene Expression" subtitle: "Can we cross the volcano?" format: html: code-fold: false code-tools: true code-copy: true highlight-style: github code-overflow: wrap --- ::: callout-note See screen recording [here](https://sr320.github.io/course-fish546-2023/lectures/02-rna-seq.html) for some context though note below code is now revised. ::: For this assignment you will be taking RNA-seq reads off the sequencer, and determining what genes are expressed higher in treatment group A compared to treatments group B. Why would someone want to do this? This can tell you something about the physiological response to a "treatment", which generally speaking could be anything from environment, disease, developmental stage, tissue, species... ::: callout-important ## Assignment Generate a plot and table of differentially expressed genes. ::: As opposed to last week these files will be a little larger and compute effort will increase. It is good to pause here and decide what platform(s) you might want to use for this assignment. # Software For this assignment we will be using [kallisto](https://pachterlab.github.io/kallisto/) (bash), DESeq2 (r). # Installing kallisto ::: callout-note ## Just for reference as will be using raven for assignment (recommended). ::: Navigate through to a terminal and create directory in your home directory named `programs` ``` bash jovyan@jupyter-sr320:~$ pwd /home/jovyan jovyan@jupyter-sr320:~$ mkdir programs ``` Grab (`wget`) the program from site listed above. ``` bash jovyan@jupyter-sr320:~$ cd programs/ jovyan@jupyter-sr320:~/programs$ wget https://github.com/pachterlab/kallisto/releases/download/v0.46.1/kallisto_linux-v0.46.1.tar.gz ``` Uncompress the file. ``` bash jovyan@jupyter-sr320:~/programs$ cd kallisto jovyan@jupyter-sr320:~/programs/kallisto$ ls kallisto license.txt README.md test jovyan@jupyter-sr320:~/programs/kallisto$ ./kallisto kallisto 0.46.1 Usage: kallisto <CMD> [arguments] .. Where <CMD> can be one of: index Builds a kallisto index quant Runs the quantification algorithm bus Generate BUS files for single-cell data pseudo Runs the pseudoalignment step merge Merges several batch runs h5dump Converts HDF5-formatted results to plaintext inspect Inspects and gives information about an index version Prints version information cite Prints citation information Running kallisto <CMD> without arguments prints usage information for <CMD> ``` ::: callout-warning ## GitHub file size limit Commit early and often but `ignore` files that are larger that 100 MB (or you will likely lose everything since prior commit). You can use Git's built-in hooks to automatically ignore files larger than 100 MB. Here are the steps to follow: Create a new file in the `.git/hooks/` directory of your repository called `pre-commit`. Add the following code to the `pre-commit` file: ``` bash #!/bin/bash # Maximum file size (in bytes) max_file_size=104857600 # Find all files larger than max_file_size and add them to the .gitignore file find . -type f -size +$max_file_sizec -exec echo "{}" >> .gitignore \; ``` This code sets the max_file_size variable to 100 MB and then uses the find command to locate all files in the repository that are larger than the specified max_file_size. The exec option of the find command appends the name of each file that matches the criteria to the .gitignore file. Save the pre-commit file and make it executable by running the following command: ``` bash chmod +x .git/hooks/pre-commit ``` With these changes, whenever you run a git commit command, Git will first execute the pre-commit hook, which will automatically add any files larger than 100 MB to the .gitignore file. This will prevent Git from tracking these files in the repository going forward. This might also work - `git reset --mixed HEAD~1` ::: # Running kallisto kallisto is already installed on raven (`/home/shared/kallisto/kallisto`). ::: callout-important When accessing raven off-campus you have to use [Husky OnNet](https://itconnect.uw.edu/tools-services-support/networks-connectivity/husky-onnet/) ::: ## Downloading reference This code grabs the Pacific oyster fasta file of genes and does so ignoring the fact that *gannet* does not have a security certificate to authenticate (`--insecure`). This is usually not recommended however we know the server. ``` {{bash}} cd ../data curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna ``` ::: callout-note Creating index can take some time ::: This code is indexing the file `rna.fna` while also renaming it as `cgigas_roslin_rna.index`. ``` {{bash}} /home/shared/kallisto/kallisto \ index -i \ ../data/cgigas_roslin_rna.index \ ../data/rna.fna ``` ## Downloading sequence reads Sequence reads are on a public server at https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/ | | | |---------------|----------| | Sample | SampleID | | D-control | D54 | | D-control | D55 | | D-control | D56 | | D-control | D57 | | D-control | D58 | | D-control | D59 | | D-control | M45 | | D-control | M46 | | D-control | M48 | | D-control | M49 | | D-control | M89 | | D-control | M90 | | D-desiccation | N48 | | D-desiccation | N49 | | D-desiccation | N50 | | D-desiccation | N51 | | D-desiccation | N52 | | D-desiccation | N53 | | D-desiccation | N54 | | D-desiccation | N55 | | D-desiccation | N56 | | D-desiccation | N57 | | D-desiccation | N58 | | D-desiccation | N59 | This code uses recursive feature of `wget` (see this weeks' reading) to get all 24 files. Additionally as with `curl` above we are ignoring the fact there is not security certificate with `--no-check-certificate` ``` bash cd ../data wget --recursive --no-parent --no-directories \ --no-check-certificate \ --accept '*.fastq.gz' \ https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/ ``` The next chunk first creates a subdirectory Then performs the following steps: 1. Uses the `find` utility to search for all files in the `../data/` directory that match the pattern `*fastq.gz`.\ 2. Uses the `basename` command to extract the base filename of each file (i.e., the filename without the directory path), and removes the suffix `_L001_R1_001.fastq.gz`. 3. Runs the kallisto `quant` command on each input file, with the following options: - `-i ../data/cgigas_roslin_rna.index`: Use the kallisto index file located at `../data/cgigas_roslin_rna.index`. - `-o ../output/kallisto_01/{}`: Write the output files to a directory called `../output/kallisto_01/` with a subdirectory named after the base filename of the input file (the {} is a placeholder for the base filename). - `-t 40`: Use 40 threads for the computation. - `--single -l 100 -s 10`: Specify that the input file contains single-end reads (--single), with an average read length of 100 (-l 100) and a standard deviation of 10 (-s 10). - The input file to process is specified using the {} placeholder, which is replaced by the base filename from the previous step. ``` {{bash}} mkdir ../output/kallisto_01 find ../data/*fastq.gz \ | xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \ quant -i ../data/cgigas_roslin_rna.index \ -o ../output/kallisto_01/{} \ -t 40 \ --single -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz ``` This command runs the `abundance_estimates_to_matrix.pl` script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files. The specific options and arguments used in the command are as follows: - `perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl`: Run the abundance_estimates_to_matrix.pl script from Trinity. - `--est_method kallisto`: Specify that the abundance estimates were generated using kallisto. - `--gene_trans_map none`: Do not use a gene-to-transcript mapping file. - `--out_prefix ../output/kallisto_01`: Use ../output/kallisto_01 as the output directory and prefix for the gene expression matrix file. - `--name_sample_by_basedir`: Use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix.\ - And then there are the kallisto abundance files to use as input for creating the gene expression matrix. ``` {{bash}} perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map none \ --out_prefix ../output/kallisto_01 \ --name_sample_by_basedir \ ../output/kallisto_01/D54_S145/abundance.tsv \ ../output/kallisto_01/D56_S136/abundance.tsv \ ../output/kallisto_01/D58_S144/abundance.tsv \ ../output/kallisto_01/M45_S140/abundance.tsv \ ../output/kallisto_01/M48_S137/abundance.tsv \ ../output/kallisto_01/M89_S138/abundance.tsv \ ../output/kallisto_01/D55_S146/abundance.tsv \ ../output/kallisto_01/D57_S143/abundance.tsv \ ../output/kallisto_01/D59_S142/abundance.tsv \ ../output/kallisto_01/M46_S141/abundance.tsv \ ../output/kallisto_01/M49_S139/abundance.tsv \ ../output/kallisto_01/M90_S147/abundance.tsv \ ../output/kallisto_01/N48_S194/abundance.tsv \ ../output/kallisto_01/N50_S187/abundance.tsv \ ../output/kallisto_01/N52_S184/abundance.tsv \ ../output/kallisto_01/N54_S193/abundance.tsv \ ../output/kallisto_01/N56_S192/abundance.tsv \ ../output/kallisto_01/N58_S195/abundance.tsv \ ../output/kallisto_01/N49_S185/abundance.tsv \ ../output/kallisto_01/N51_S186/abundance.tsv \ ../output/kallisto_01/N53_S188/abundance.tsv \ ../output/kallisto_01/N55_S190/abundance.tsv \ ../output/kallisto_01/N57_S191/abundance.tsv \ ../output/kallisto_01/N59_S189/abundance.tsv ``` # Running DESeq2 This code performs differential expression analysis to identify differentially expressed genes (DEGs) between a control condition and a desiccated condition. First, it reads in a count matrix of isoform counts generated by Kallisto, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers. Next, it creates a `data.frame` containing information about the experimental conditions and sets row names to match the column names in the count matrix. It uses this information to create a `DESeqDataSet` object, which is then passed to the `DESeq()` function to fit a negative binomial model and estimate dispersions. The `results()` function is used to extract the results table, which is ordered by gene/transcript ID. The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called "DEGlist.tab". ::: callout-note The below code could be a single script (or single chunk). I like separating to assist in troubleshooting and check output at various steps. ::: Load packages: ``` {{r}} library(DESeq2) library(tidyverse) library(pheatmap) library(RColorBrewer) library(data.table) ``` Might need to install first eg ``` {{r}} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2") ``` ### Read in count matrix ``` {{r}} countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` Round integers up to hole numbers for further analysis: ``` {{r}} countmatrix <- round(countmatrix, 0) str(countmatrix) ``` ## Get DEGs based on Desication ``` {{r}} deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))), type=factor(rep("single-read", 24))) rownames(deseq2.colData) <- colnames(data) deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix, colData = deseq2.colData, design = ~ condition) ``` ``` {{r}} deseq2.dds <- DESeq(deseq2.dds) deseq2.res <- results(deseq2.dds) deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] ``` ``` {{r}} head(deseq2.res) ``` ``` {{r}} # Count number of hits with adjusted p-value less then 0.05 dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]) ``` ``` {{r}} tmp <- deseq2.res # The main plot plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray", main="DEG Dessication (pval <= 0.05)", xlab="mean of normalized counts", ylab="Log2 Fold Change") # Getting the significant points and plotting them again so they're a different color tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ] points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red") # 2 FC lines abline(h=c(-1,1), col="blue") ``` ``` {{r}} write.table(tmp.sig, "../output/DEGlist.tab", row.names = T) ```