Apul lncRNA Discovery

deepdive
R
lncrna
e5
Author
Affiliation

Steven Roberts

Published

November 3, 2023

05.3-Apul-lncRNA-discovery

Steven Roberts 03 November, 2023

1 Run HiSat on RNA-seq

1.1 Grab Trimmed RNA-seq Reads


wget -r \
--no-directories --no-parent \
-P ../data/fastq/ \
-A "*fastq.gz" https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/

1.2 Genome

cd ../data

curl -O http://gannet.fish.washington.edu/seashell/snaps/GCF_013753865.1_Amil_v2.1_genomic.fna

1.3 HiSat

/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
> ../output/05.32-lncRNA-discovery/m_exon.tab
head ../output/05.32-lncRNA-discovery/m_exon.tab
## NC_058066.1  1961    2118    -
## NC_058066.1  15360   15663   +
## NC_058066.1  18710   18774   -
## NC_058066.1  21000   21092   -
## NC_058066.1  22158   22442   +
## NC_058066.1  23084   23220   -
## NC_058066.1  24770   25066   +
## NC_058066.1  26652   26912   +
## NC_058066.1  27071   27358   +
## NC_058066.1  27658   27951   +

/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
> ../output/05.32-lncRNA-discovery/m_splice_sites.tab
head ../output/05.32-lncRNA-discovery/m_splice_sites.tab
## NC_058066.1  2118    18710   -
## NC_058066.1  15663   22158   +
## NC_058066.1  18774   21000   -
## NC_058066.1  21092   23084   -
## NC_058066.1  22442   24770   +
## NC_058066.1  25066   26652   +
## NC_058066.1  26912   27071   +
## NC_058066.1  27358   27658   +
## NC_058066.1  27951   28247   +
## NC_058066.1  28534   29196   +
/home/shared/hisat2-2.2.1/hisat2-build \
../data/GCF_013753865.1_Amil_v2.1_genomic.fna \
../data/GCF_013753865.1_Amil_v2.1.index \
--exon ../output/05.32-lncRNA-discovery/m_exon.tab \
--ss ../output/05.32-lncRNA-discovery/m_splice_sites.tab \
-p 40 \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
2> ../output/05.32-lncRNA-discovery/hisat2-build_stats.txt
find ../data/fastq/*gz
## ../data/fastq/RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \
| xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
echo {}
## RNA-ACR-140
## RNA-ACR-145
## RNA-ACR-150
## RNA-ACR-173
## RNA-ACR-178
find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \
| xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../data/GCF_013753865.1_Amil_v2.1.index \
-p 20 \
-1 ../data/fastq/{}-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz \
-2 ../data/fastq/{}-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz \
-S ../output/05.32-lncRNA-discovery/{}.sam
47710408 reads; of these:
  47710408 (100.00%) were paired; of these:
    27056517 (56.71%) aligned concordantly 0 times
    19229695 (40.31%) aligned concordantly exactly 1 time
    1424196 (2.99%) aligned concordantly >1 times
    ----
    27056517 pairs aligned concordantly 0 times; of these:
      1316579 (4.87%) aligned discordantly 1 time
    ----
    25739938 pairs aligned 0 times concordantly or discordantly; of these:
      51479876 mates make up the pairs; of these:
        41152624 (79.94%) aligned 0 times
        9290398 (18.05%) aligned exactly 1 time
        1036854 (2.01%) aligned >1 times
56.87% overall alignment rate
42864294 reads; of these:
  42864294 (100.00%) were paired; of these:
    23649791 (55.17%) aligned concordantly 0 times
    17703533 (41.30%) aligned concordantly exactly 1 time
    1510970 (3.53%) aligned concordantly >1 times
    ----
    23649791 pairs aligned concordantly 0 times; of these:
      1250605 (5.29%) aligned discordantly 1 time
    ----
    22399186 pairs aligned 0 times concordantly or discordantly; of these:
      44798372 mates make up the pairs; of these:
        35369621 (78.95%) aligned 0 times
        8218798 (18.35%) aligned exactly 1 time
        1209953 (2.70%) aligned >1 times
58.74% overall alignment rate
43712298 reads; of these:
  43712298 (100.00%) were paired; of these:
    30355359 (69.44%) aligned concordantly 0 times
    12131465 (27.75%) aligned concordantly exactly 1 time
    1225474 (2.80%) aligned concordantly >1 times
    ----
    30355359 pairs aligned concordantly 0 times; of these:
      848325 (2.79%) aligned discordantly 1 time
    ----
    29507034 pairs aligned 0 times concordantly or discordantly; of these:
      59014068 mates make up the pairs; of these:
        51031599 (86.47%) aligned 0 times
        6695839 (11.35%) aligned exactly 1 time
        1286630 (2.18%) aligned >1 times
41.63% overall alignment rate
47501524 reads; of these:
  47501524 (100.00%) were paired; of these:
    27827915 (58.58%) aligned concordantly 0 times
    18298532 (38.52%) aligned concordantly exactly 1 time
    1375077 (2.89%) aligned concordantly >1 times
    ----
    27827915 pairs aligned concordantly 0 times; of these:
      1321896 (4.75%) aligned discordantly 1 time
    ----
    26506019 pairs aligned 0 times concordantly or discordantly; of these:
      53012038 mates make up the pairs; of these:
        42654805 (80.46%) aligned 0 times
        9174277 (17.31%) aligned exactly 1 time
        1182956 (2.23%) aligned >1 times
55.10% overall alignment rate
42677752 reads; of these:
  42677752 (100.00%) were paired; of these:
    25646587 (60.09%) aligned concordantly 0 times
    15654238 (36.68%) aligned concordantly exactly 1 time
    1376927 (3.23%) aligned concordantly >1 times
    ----
    25646587 pairs aligned concordantly 0 times; of these:
      1105089 (4.31%) aligned discordantly 1 time
    ----
    24541498 pairs aligned 0 times concordantly or discordantly; of these:
      49082996 mates make up the pairs; of these:
        38229072 (77.89%) aligned 0 times
        9393012 (19.14%) aligned exactly 1 time
        1460912 (2.98%) aligned >1 times
55.21% overall alignment rate

1.4 convert to bams

for samfile in ../output/05.32-lncRNA-discovery/*.sam; do
  bamfile="${samfile%.sam}.bam"
  sorted_bamfile="${samfile%.sam}.sorted.bam"
  
  # Convert SAM to BAM
  /home/shared/samtools-1.12/samtools view -bS "$samfile" > "$bamfile"
  
  # Sort BAM
  /home/shared/samtools-1.12/samtools sort "$bamfile" -o "$sorted_bamfile"
  
  # Index sorted BAM
  /home/shared/samtools-1.12/samtools index "$sorted_bamfile"
done

2 StringTie

StringTie uses the sorted BAM files to assemble transcripts for each sample, outputting them as GTF (Gene Transfer Format) files. And then merges all individual GTF assemblies into a single merged GTF file. This step extracts transcript information and merges GTFs from all samples–an important step in creating a canonical list of lncRNAs across all samples included in the pipeline.

find ../output/05.32-lncRNA-discovery/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 8 \
-G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-o ../output/05.32-lncRNA-discovery/{}.gtf \
../output/05.32-lncRNA-discovery/{}.sorted.bam
wc -l ../output/05.32-lncRNA-discovery/RNA*.gtf
ls ../output/05.32-lncRNA-discovery/RNA*.gtf
##    373617 ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf
##    322109 ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf
##    340380 ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf
##    367972 ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf
##    426804 ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
##   1830882 total
## ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf

Merges all individual GTF assemblies into a single merged GTF file.

/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge \
-G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf \
../output/05.32-lncRNA-discovery/*.gtf

wc -l ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
head ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
## 726445 ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
## # /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie --merge -G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff -o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
## # StringTie version 2.2.1
## NC_058066.1  StringTie   transcript  345 1190    1000    .   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
## NC_058066.1  StringTie   exon    345 1190    1000    .   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
## NC_058066.1  StringTie   transcript  1962    23221   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    1962    2119    1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "1"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    18711   18775   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    21001   21093   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "3"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    23085   23221   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "4"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   transcript  3278    4416    1000    .   .   gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";

3 GFFcompare

/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
-r ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-G \
-o ../output/05.32-lncRNA-discovery/gffcompare_merged \
../output/05.32-lncRNA-discovery/stringtie_merged.gtf
ls ../output/05.32-lncRNA-discovery/gffcompare_merged*
## ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf
## ../output/05.32-lncRNA-discovery/gffcompare_merged.loci
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stats
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.refmap
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.tmap
## ../output/05.32-lncRNA-discovery/gffcompare_merged.tracking
head -3 ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf
## NC_058066.1  StringTie   transcript  9330    48114   .   +   .   transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; gene_name "LOC114963508"; xloc "XLOC_000001"; ref_gene_id "gene-LOC114963508"; cmp_ref "rna-XM_029342736.2"; class_code "j"; tss_id "TSS1";
## NC_058066.1  StringTie   exon    9330    9651    .   +   .   transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "1";
## NC_058066.1  StringTie   exon    12840   12890   .   +   .   transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "2";

4 Filter

Filters the combined GTF output from GFFcompare to select only the lines representing “transcripts” and excluding lines starting with “#” (these are lines in the output format from GFFcompare that don’t contain transcript information). This step further filters for a class code of “u”, and keep only those with lengths greater than 199 bases. The “u’ class code from the GFFcompare step is for”unknown” transcripts, that is those that are not previously annotated in our reference GFF as protein coding. The size filter of +200nt is a common filtering step for isolating lncRNAs.

awk '$3 == "transcript" && $1 !~ /^#/ {print}' \
../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf
head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf
## NC_058066.1  StringTie   transcript  468619  469943  .   +   .   transcript_id "MSTRG.40.1"; gene_id "MSTRG.40"; xloc "XLOC_000020"; class_code "u"; tss_id "TSS38";
## NC_058066.1  StringTie   transcript  1153399 1165634 .   +   .   transcript_id "MSTRG.120.1"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1  StringTie   transcript  1153404 1165634 .   +   .   transcript_id "MSTRG.120.2"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1  StringTie   transcript  1153410 1165634 .   +   .   transcript_id "MSTRG.120.3"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1  StringTie   transcript  1154207 1155609 .   +   .   transcript_id "MSTRG.120.4"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS96";
## NC_058066.1  StringTie   transcript  1155787 1165634 .   +   .   transcript_id "MSTRG.120.5"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS97";
## NC_058066.1  StringTie   transcript  1222445 1227226 .   +   .   transcript_id "MSTRG.126.1"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98";
## NC_058066.1  StringTie   transcript  1222539 1227226 .   +   .   transcript_id "MSTRG.126.2"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98";
## NC_058066.1  StringTie   transcript  1256163 1257382 .   +   .   transcript_id "MSTRG.136.3"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103";
## NC_058066.1  StringTie   transcript  1256163 1264306 .   +   .   transcript_id "MSTRG.136.2"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103";

5 Bedtools

Extracts the sequence data from the $FASTA reference file based on the coordinates from the filtered GTF. The resulting sequences represent potential lncRNA candidates.

/home/shared/bedtools2/bin/fastaFromBed \
-fi ../data/GCF_013753865.1_Amil_v2.1_genomic.fna \
-bed ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf \
-fo ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta \
-name -split
fgrep -c ">" ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta
head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta

6 CPC2

Initializes a conda environment (Anaconda) and runs CPC2, a software to predict whether a transcript is coding or non-coding. The results are saved to the $OUTPUT_DIR. CPC2 uses ORF (Open Reading Frame) Analysis, Isometric Feature Mapping (Isomap), Sequence Homology, RNA Sequence Features, and Quality of Translation to assess coding potential and flag any transcripts we would want to exclude using the FASTA generated in the previous step.

eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \
-i ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta \
-o ../output/05.32-lncRNA-discovery/Apul_CPC2
wc -l head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
## wc: head: No such file or directory
##   16695 ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
##   16695 total
## #ID  transcript_length   peptide_length  Fickett_score   pI  ORF_integrity   coding_probability  label
## transcript::NC_058066.1:468618-469943    1325    73  0.3891  11.48389835357666   1   0.0882804   noncoding
## transcript::NC_058066.1:1153398-1165634  12236   134 0.3118  9.09901943206787    1   0.295457    noncoding
## transcript::NC_058066.1:1153403-1165634  12231   134 0.3118  9.09901943206787    1   0.295457    noncoding
## transcript::NC_058066.1:1153409-1165634  12225   134 0.3118  9.09901943206787    1   0.295457    noncoding
## transcript::NC_058066.1:1154206-1155609  1403    66  0.37373 8.802658271789554   1   0.0495107   noncoding
## transcript::NC_058066.1:1155786-1165634  9848    134 0.29479 9.09901943206787    1   0.241898    noncoding
## transcript::NC_058066.1:1222444-1227226  4782    62  0.32077 9.30048313140869    1   0.0302275   noncoding
## transcript::NC_058066.1:1222538-1227226  4688    62  0.32077 9.30048313140869    1   0.0302275   noncoding
## transcript::NC_058066.1:1256162-1257382  1220    41  0.34867000000000004 9.098761558532715   1   0.0140974   noncoding

#Filter Filters the CPC2 results to get only noncoding transcripts (using the class “noncoding” from the CPC2 results) and extracts their IDs and matches these IDs with the sequences from the previous step to generate a GTF of long noncoding transcripts.

Matches these IDs with the sequences from the previous step to generate a GTF of noncoding transcripts.

awk '$8 == "noncoding" {print $1}' ../output/05.32-lncRNA-discovery/Apul_CPC2.txt > ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt

grep -Fwf ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta > ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
wc -l ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
head ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
## 16122 ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
## >transcript::NC_058066.1:468618-469943
## >transcript::NC_058066.1:1153398-1165634
## >transcript::NC_058066.1:1153403-1165634
## >transcript::NC_058066.1:1153409-1165634
## >transcript::NC_058066.1:1154206-1155609
## >transcript::NC_058066.1:1155786-1165634
## >transcript::NC_058066.1:1222444-1227226
## >transcript::NC_058066.1:1222538-1227226
## >transcript::NC_058066.1:1256162-1257382
## >transcript::NC_058066.1:1256162-1264306

7 De-duplicate GTF

Removes duplicate entries from the GTF file. This results in a list of lncRNAs with transcript IDs that show scaffold and base range. awk ‘!seen\[\$0\]++’ $OUTPUT_DIR/apul_merged_final_lncRNAs.gtf > $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf

8 Reformat to BED

Converts the GTF format to BED (Browser Extensible Data) format, a commonly used format for representing genomic intervals. This allows us to use bedtools to generate a FASTA of lncRNAs. awk -F”:|-” ‘{print $3 “ $4” $5}’ $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf > $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed

9 BEDTools

This extracts sequences from the reference FASTA based on the BED file’s coordinates. The resulting sequences represent the final set of non-redundant lncRNAs. $BEDTOOLS_PATH/bedtools
getfasta -fi $FASTA -bed $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed -fo $OUTPUT_DIR/apul_bedtools_lncRNAs.fasta -name ```

The results of this pipleine provide the user with GTF files containing transcript IDs (scaffold and sequence range), BED files displaying the same transcript IDs in BED format for use with BEDtools, and FASTA files containing the sequence of identified lncRNAs.