RNAs-seq
Counting on expression
Screen Recording
Reading
Opportunities in Functional Genomics: A Primer on Lab and Computational Aspects
Textbook- Retrieving Bioinformatics Data 109-124
Objectives
Learn how to take raw RNA sequence data and perform differential gene expression analysis.
Introduction
RNA sequencing (RNA-seq) is a powerful technology used to analyze gene expression levels across different conditions or tissues. The raw data generated from RNA-seq experiments is typically stored in fastq files, which contain millions of short sequencing reads that need to be preprocessed before performing differential gene expression analysis. In this tutorial, we will use Kallisto, a fast and accurate transcript quantification tool, to analyze RNA-seq data and identify differentially expressed genes.
On Thursday Dr. Ariana Huffmyer will provide a lecture on on RNA-seq that will provide both important experimental considerations but should also stimulated ideas on what some folks will be doing for their research project. Those lectures slides are available here
Sequence file format
One common sequence product is a large amount of “short-reads”. This often refers to single-end or paired-end reads with a link of ~ 150bp. When data comes off of a sequencer it is often in FASTQ format. This file format contains both sequence and quality informaiton.
A FASTQ file has four line-separated fields per sequence:
- Field 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Field 2 is the raw sequence letters.
- Field 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
- Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.
https://en.wikipedia.org/wiki/FASTQ_format
See also page 341 of the textbook.
Quality Control
The first step in analyzing RNA-seq data is to perform quality control checks on the raw fastq files. This step is crucial to ensure that the data is of high quality and can be accurately quantified. One popular tool for quality control is FastQC, which generates various quality metrics such as per-base sequence quality, adapter contamination, and GC content.
To perform quality control using FastQC, run the following command:
fastqc input.fastq
This will generate a HTML report that can be viewed in a web browser.
Another popular quality control program is fastp
. Here is a very nice tutorial on using fastp
Transcript Quantification (Pseudo - alignment)
After quality control, the next step is to quantify the transcript expression levels using Kallisto. Kallisto uses a pseudoalignment approach, which is much faster than traditional alignment-based methods and does not require a reference genome. Kallisto generates transcript-level abundance estimates that can be used for differential expression analysis.
To quantify the transcript expression levels using Kallisto, run the following command:
kallisto quant -i index_file -o output_folder input.fastq
Replace index_file with the path to the Kallisto index file for the reference transcriptome, and replace output_folder with the name of the output folder.
Transcript Quantification (Alignment)
Benefits to using HISAT2
for alignments:
Fast.
Can detect exon/intron junctions (i.e. alternative isoform splice sites).
For RNA-Seq, HISAT2
alignments are frequently followed up using StringTie
for transcript assembly and quantitation of splice variants.
General usage:
Build a
HISAT2
reference sequence index:# Create Hisat2 exons tab file "${programs_array[hisat2_exons]}" \ "${transcripts_gtf}" \ > "${exons}" # Create Hisat2 splice sites tab file "${programs_array[hisat2_splice_sites]}" \ "${transcripts_gtf}" \ > "${splice_sites}" # Build Hisat2 reference index using splice sites and exons "${programs_array[hisat2_build]}" \ "${genome_fasta}" \ "${genome_index_name}" \ "${exons}" \ --exon "${splice_sites}" \ --ss "${threads}" \ -p > hisat2-build_stats.txt 2
Perform alignment(s):
# Hisat2 alignments "${programs_array[hisat2]}" \ "${genome_index_name}" \ -x "${fastq_list_R1}" \ -1 "${fastq_list_R2}" \ -2 "${sample_name}".sam \ -S "${threads}" \ --threads > "${sample_name}"-hisat2_stats.txt 2 # Sort SAM files, convert to BAM, and index ${programs_array[samtools_view]} \ "${threads}" \ -@ "${sample_name}".sam \ -Su | ${programs_array[samtools_sort]} - \ "${threads}" \ -@ "${sample_name}".sorted.bam -o ${programs_array[samtools_index]} "${sample_name}".sorted.bam # Delete unneccessary index files rm "${genome_index_name}"*.ht2 # Delete unneeded SAM files rm ./*.sam
See links in the “use cases” section below for full-fledged scripts and advanced usage (e.g. assigning read groups to alignment files (SAM) for improved downstream handling/accessiblity).
Differential Expression Analysis (Kallisto)
A final step is to perform differential expression analysis using Sleuth, which is a tool for analyzing Kallisto output files. Sleuth performs hypothesis testing using a statistical model based on the negative binomial distribution and generates various output files, including a table of differentially expressed genes.
To perform differential expression analysis using Sleuth, run the following commands:
library(sleuth)
<- sleuth_prep("output_folder", ~ condition)
so <- sleuth_fit(so)
so <- sleuth_lrt(so, "condition_treatment vs condition_control")
so <- sleuth_results(so, "condition_treatment vs condition_control")
results write.table(results, "results_table.txt", sep="\t")
Replace condition with the name of the column in the sample information file that contains the condition labels, and replace condition_treatment and condition_control with the names of the conditions to be compared.
This will perform differential expression analysis and generate a table of differentially expressed genes in the file results_table.txt.
Differential Expression Analysis (Hisat)
You’ll need to quantify gene expression levels from the BAM files generated by Hisat. This can be done using a tool such as featureCounts or HTSeq, which counts the number of reads that overlap with each gene in the reference genome.
To account for differences in sequencing depth and other technical factors, it’s important to normalize the gene expression counts before conducting differential gene expression analysis. This can be done using a tool such as DESeq2, which normalizes the data using the median-of-ratios method.
Now you’re ready to conduct differential gene expression analysis! This involves comparing the gene expression levels between two or more experimental groups to identify genes that are differentially expressed. You can do this using a tool such as DESeq2 or edgeR, which use statistical models to identify differentially expressed genes based on the normalized gene expression counts.
DESeq2
User Guides - Analyzing RNA-seq data with DESeq2