05.3-Apul-lncRNA-discovery
Steven Roberts 03 November, 2023
1 Run HiSat on RNA-seq
1.1 Grab Trimmed RNA-seq Reads
wget -r \
--no-parent \
--no-directories \
-P ../data/fastq/ "*fastq.gz" https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/ -A
1.2 Genome
cd ../data
curl -O http://gannet.fish.washington.edu/seashell/snaps/GCF_013753865.1_Amil_v2.1_genomic.fna
1.3 HiSat
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
\
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf > ../output/05.32-lncRNA-discovery/m_exon.tab
head ../output/05.32-lncRNA-discovery/m_exon.tab
## NC_058066.1 1961 2118 -
## NC_058066.1 15360 15663 +
## NC_058066.1 18710 18774 -
## NC_058066.1 21000 21092 -
## NC_058066.1 22158 22442 +
## NC_058066.1 23084 23220 -
## NC_058066.1 24770 25066 +
## NC_058066.1 26652 26912 +
## NC_058066.1 27071 27358 +
## NC_058066.1 27658 27951 +
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
\
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf > ../output/05.32-lncRNA-discovery/m_splice_sites.tab
head ../output/05.32-lncRNA-discovery/m_splice_sites.tab
## NC_058066.1 2118 18710 -
## NC_058066.1 15663 22158 +
## NC_058066.1 18774 21000 -
## NC_058066.1 21092 23084 -
## NC_058066.1 22442 24770 +
## NC_058066.1 25066 26652 +
## NC_058066.1 26912 27071 +
## NC_058066.1 27358 27658 +
## NC_058066.1 27951 28247 +
## NC_058066.1 28534 29196 +
/home/shared/hisat2-2.2.1/hisat2-build \
\
../data/GCF_013753865.1_Amil_v2.1_genomic.fna \
../data/GCF_013753865.1_Amil_v2.1.index \
--exon ../output/05.32-lncRNA-discovery/m_exon.tab \
--ss ../output/05.32-lncRNA-discovery/m_splice_sites.tab \
-p 40 \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf > ../output/05.32-lncRNA-discovery/hisat2-build_stats.txt 2
find ../data/fastq/*gz
## ../data/fastq/RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \
| xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
echo {}
## RNA-ACR-140
## RNA-ACR-145
## RNA-ACR-150
## RNA-ACR-173
## RNA-ACR-178
find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \
| xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
\
/home/shared/hisat2-2.2.1/hisat2 \
-x ../data/GCF_013753865.1_Amil_v2.1.index \
-p 20 \
-1 ../data/fastq/{}-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz \
-2 ../data/fastq/{}-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz -S ../output/05.32-lncRNA-discovery/{}.sam
47710408 reads; of these:
47710408 (100.00%) were paired; of these:
27056517 (56.71%) aligned concordantly 0 times
19229695 (40.31%) aligned concordantly exactly 1 time
1424196 (2.99%) aligned concordantly >1 times
----
27056517 pairs aligned concordantly 0 times; of these:
1316579 (4.87%) aligned discordantly 1 time
----
25739938 pairs aligned 0 times concordantly or discordantly; of these:
51479876 mates make up the pairs; of these:
41152624 (79.94%) aligned 0 times
9290398 (18.05%) aligned exactly 1 time
1036854 (2.01%) aligned >1 times
56.87% overall alignment rate
42864294 reads; of these:
42864294 (100.00%) were paired; of these:
23649791 (55.17%) aligned concordantly 0 times
17703533 (41.30%) aligned concordantly exactly 1 time
1510970 (3.53%) aligned concordantly >1 times
----
23649791 pairs aligned concordantly 0 times; of these:
1250605 (5.29%) aligned discordantly 1 time
----
22399186 pairs aligned 0 times concordantly or discordantly; of these:
44798372 mates make up the pairs; of these:
35369621 (78.95%) aligned 0 times
8218798 (18.35%) aligned exactly 1 time
1209953 (2.70%) aligned >1 times
58.74% overall alignment rate
43712298 reads; of these:
43712298 (100.00%) were paired; of these:
30355359 (69.44%) aligned concordantly 0 times
12131465 (27.75%) aligned concordantly exactly 1 time
1225474 (2.80%) aligned concordantly >1 times
----
30355359 pairs aligned concordantly 0 times; of these:
848325 (2.79%) aligned discordantly 1 time
----
29507034 pairs aligned 0 times concordantly or discordantly; of these:
59014068 mates make up the pairs; of these:
51031599 (86.47%) aligned 0 times
6695839 (11.35%) aligned exactly 1 time
1286630 (2.18%) aligned >1 times
41.63% overall alignment rate
47501524 reads; of these:
47501524 (100.00%) were paired; of these:
27827915 (58.58%) aligned concordantly 0 times
18298532 (38.52%) aligned concordantly exactly 1 time
1375077 (2.89%) aligned concordantly >1 times
----
27827915 pairs aligned concordantly 0 times; of these:
1321896 (4.75%) aligned discordantly 1 time
----
26506019 pairs aligned 0 times concordantly or discordantly; of these:
53012038 mates make up the pairs; of these:
42654805 (80.46%) aligned 0 times
9174277 (17.31%) aligned exactly 1 time
1182956 (2.23%) aligned >1 times
55.10% overall alignment rate
42677752 reads; of these:
42677752 (100.00%) were paired; of these:
25646587 (60.09%) aligned concordantly 0 times
15654238 (36.68%) aligned concordantly exactly 1 time
1376927 (3.23%) aligned concordantly >1 times
----
25646587 pairs aligned concordantly 0 times; of these:
1105089 (4.31%) aligned discordantly 1 time
----
24541498 pairs aligned 0 times concordantly or discordantly; of these:
49082996 mates make up the pairs; of these:
38229072 (77.89%) aligned 0 times
9393012 (19.14%) aligned exactly 1 time
1460912 (2.98%) aligned >1 times
55.21% overall alignment rate
1.4 convert to bams
for samfile in ../output/05.32-lncRNA-discovery/*.sam; do
bamfile="${samfile%.sam}.bam"
sorted_bamfile="${samfile%.sam}.sorted.bam"
# Convert SAM to BAM
/home/shared/samtools-1.12/samtools view -bS "$samfile" > "$bamfile"
# Sort BAM
/home/shared/samtools-1.12/samtools sort "$bamfile" -o "$sorted_bamfile"
# Index sorted BAM
/home/shared/samtools-1.12/samtools index "$sorted_bamfile"
done
2 StringTie
StringTie uses the sorted BAM files to assemble transcripts for each sample, outputting them as GTF (Gene Transfer Format) files. And then merges all individual GTF assemblies into a single merged GTF file. This step extracts transcript information and merges GTFs from all samples–an important step in creating a canonical list of lncRNAs across all samples included in the pipeline.
find ../output/05.32-lncRNA-discovery/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
\
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 8 \
-G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-o ../output/05.32-lncRNA-discovery/{}.gtf ../output/05.32-lncRNA-discovery/{}.sorted.bam
wc -l ../output/05.32-lncRNA-discovery/RNA*.gtf
ls ../output/05.32-lncRNA-discovery/RNA*.gtf
## 373617 ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf
## 322109 ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf
## 340380 ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf
## 367972 ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf
## 426804 ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
## 1830882 total
## ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
Merges all individual GTF assemblies into a single merged GTF file.
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
\
--merge \
-G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf *.gtf ../output/05.32-lncRNA-discovery/
wc -l ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
head ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
## 726445 ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
## # /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie --merge -G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff -o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
## # StringTie version 2.2.1
## NC_058066.1 StringTie transcript 345 1190 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
## NC_058066.1 StringTie exon 345 1190 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";
## NC_058066.1 StringTie transcript 1962 23221 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522";
## NC_058066.1 StringTie exon 1962 2119 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "1"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522";
## NC_058066.1 StringTie exon 18711 18775 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522";
## NC_058066.1 StringTie exon 21001 21093 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "3"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522";
## NC_058066.1 StringTie exon 23085 23221 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "4"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522";
## NC_058066.1 StringTie transcript 3278 4416 1000 . . gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";
3 GFFcompare
/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
\
-r ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-G \
-o ../output/05.32-lncRNA-discovery/gffcompare_merged ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
ls ../output/05.32-lncRNA-discovery/gffcompare_merged*
## ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf
## ../output/05.32-lncRNA-discovery/gffcompare_merged.loci
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stats
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.refmap
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.tmap
## ../output/05.32-lncRNA-discovery/gffcompare_merged.tracking
head -3 ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf
## NC_058066.1 StringTie transcript 9330 48114 . + . transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; gene_name "LOC114963508"; xloc "XLOC_000001"; ref_gene_id "gene-LOC114963508"; cmp_ref "rna-XM_029342736.2"; class_code "j"; tss_id "TSS1";
## NC_058066.1 StringTie exon 9330 9651 . + . transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "1";
## NC_058066.1 StringTie exon 12840 12890 . + . transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "2";
4 Filter
Filters the combined GTF output from GFFcompare to select only the lines representing “transcripts” and excluding lines starting with “#” (these are lines in the output format from GFFcompare that don’t contain transcript information). This step further filters for a class code of “u”, and keep only those with lengths greater than 199 bases. The “u’ class code from the GFFcompare step is for”unknown” transcripts, that is those that are not previously annotated in our reference GFF as protein coding. The size filter of +200nt is a common filtering step for isolating lncRNAs.
awk '$3 == "transcript" && $1 !~ /^#/ {print}' \
| grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf
head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf
## NC_058066.1 StringTie transcript 468619 469943 . + . transcript_id "MSTRG.40.1"; gene_id "MSTRG.40"; xloc "XLOC_000020"; class_code "u"; tss_id "TSS38";
## NC_058066.1 StringTie transcript 1153399 1165634 . + . transcript_id "MSTRG.120.1"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1 StringTie transcript 1153404 1165634 . + . transcript_id "MSTRG.120.2"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1 StringTie transcript 1153410 1165634 . + . transcript_id "MSTRG.120.3"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1 StringTie transcript 1154207 1155609 . + . transcript_id "MSTRG.120.4"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS96";
## NC_058066.1 StringTie transcript 1155787 1165634 . + . transcript_id "MSTRG.120.5"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS97";
## NC_058066.1 StringTie transcript 1222445 1227226 . + . transcript_id "MSTRG.126.1"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98";
## NC_058066.1 StringTie transcript 1222539 1227226 . + . transcript_id "MSTRG.126.2"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98";
## NC_058066.1 StringTie transcript 1256163 1257382 . + . transcript_id "MSTRG.136.3"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103";
## NC_058066.1 StringTie transcript 1256163 1264306 . + . transcript_id "MSTRG.136.2"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103";
5 Bedtools
Extracts the sequence data from the $FASTA
reference file based on the coordinates from the filtered GTF. The resulting sequences represent potential lncRNA candidates.
/home/shared/bedtools2/bin/fastaFromBed \
\
-fi ../data/GCF_013753865.1_Amil_v2.1_genomic.fna \
-bed ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf \
-fo ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta -split -name
fgrep -c ">" ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta
head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta
6 CPC2
Initializes a conda environment (Anaconda) and runs CPC2, a software to predict whether a transcript is coding or non-coding. The results are saved to the $OUTPUT_DIR. CPC2 uses ORF (Open Reading Frame) Analysis, Isometric Feature Mapping (Isomap), Sequence Homology, RNA Sequence Features, and Quality of Translation to assess coding potential and flag any transcripts we would want to exclude using the FASTA generated in the previous step.
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \
\
-i ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta -o ../output/05.32-lncRNA-discovery/Apul_CPC2
wc -l head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
## wc: head: No such file or directory
## 16695 ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
## 16695 total
## #ID transcript_length peptide_length Fickett_score pI ORF_integrity coding_probability label
## transcript::NC_058066.1:468618-469943 1325 73 0.3891 11.48389835357666 1 0.0882804 noncoding
## transcript::NC_058066.1:1153398-1165634 12236 134 0.3118 9.09901943206787 1 0.295457 noncoding
## transcript::NC_058066.1:1153403-1165634 12231 134 0.3118 9.09901943206787 1 0.295457 noncoding
## transcript::NC_058066.1:1153409-1165634 12225 134 0.3118 9.09901943206787 1 0.295457 noncoding
## transcript::NC_058066.1:1154206-1155609 1403 66 0.37373 8.802658271789554 1 0.0495107 noncoding
## transcript::NC_058066.1:1155786-1165634 9848 134 0.29479 9.09901943206787 1 0.241898 noncoding
## transcript::NC_058066.1:1222444-1227226 4782 62 0.32077 9.30048313140869 1 0.0302275 noncoding
## transcript::NC_058066.1:1222538-1227226 4688 62 0.32077 9.30048313140869 1 0.0302275 noncoding
## transcript::NC_058066.1:1256162-1257382 1220 41 0.34867000000000004 9.098761558532715 1 0.0140974 noncoding
#Filter Filters the CPC2 results to get only noncoding transcripts (using the class “noncoding” from the CPC2 results) and extracts their IDs and matches these IDs with the sequences from the previous step to generate a GTF of long noncoding transcripts.
Matches these IDs with the sequences from the previous step to generate a GTF of noncoding transcripts.
awk '$8 == "noncoding" {print $1}' ../output/05.32-lncRNA-discovery/Apul_CPC2.txt > ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt
grep -Fwf ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta > ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
wc -l ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
head ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
## 16122 ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
## >transcript::NC_058066.1:468618-469943
## >transcript::NC_058066.1:1153398-1165634
## >transcript::NC_058066.1:1153403-1165634
## >transcript::NC_058066.1:1153409-1165634
## >transcript::NC_058066.1:1154206-1155609
## >transcript::NC_058066.1:1155786-1165634
## >transcript::NC_058066.1:1222444-1227226
## >transcript::NC_058066.1:1222538-1227226
## >transcript::NC_058066.1:1256162-1257382
## >transcript::NC_058066.1:1256162-1264306
7 De-duplicate GTF
Removes duplicate entries from the GTF file. This results in a list of lncRNAs with transcript IDs that show scaffold and base range. awk ‘!seen\[\$0\]++’ $OUTPUT_DIR/apul_merged_final_lncRNAs.gtf > $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf
8 Reformat to BED
Converts the GTF format to BED (Browser Extensible Data) format, a commonly used format for representing genomic intervals. This allows us to use bedtools to generate a FASTA of lncRNAs. awk -F”:|-” ‘{print $3 “ $4” $5}’ $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf > $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed
9 BEDTools
This extracts sequences from the reference FASTA based on the BED file’s coordinates. The resulting sequences represent the final set of non-redundant lncRNAs. $BEDTOOLS_PATH/bedtools
getfasta -fi $FASTA -bed $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed -fo $OUTPUT_DIR/apul_bedtools_lncRNAs.fasta -name ```
The results of this pipleine provide the user with GTF files containing transcript IDs (scaffold and sequence range), BED files displaying the same transcript IDs in BED format for use with BEDtools, and FASTA files containing the sequence of identified lncRNAs.