05.3-Apul-lncRNA-discovery

Steven Roberts 03 November, 2023

1 Run HiSat on RNA-seq
2 StringTie
3 GFFcompare
4 Filter
5 Bedtools
6 CPC2
7 De-duplicate GTF
8 Reformat to BED
9 BEDTools

1 Run HiSat on RNA-seq

1.1 Grab Trimmed RNA-seq Reads


wget -r \
--no-directories --no-parent \
-P ../data/fastq/ \
-A "*fastq.gz" https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/

1.2 Genome

cd ../data

curl -O http://gannet.fish.washington.edu/seashell/snaps/GCF_013753865.1_Amil_v2.1_genomic.fna

1.3 HiSat

/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
> ../output/05.32-lncRNA-discovery/m_exon.tab

head ../output/05.32-lncRNA-discovery/m_exon.tab

## NC_058066.1  1961    2118    -
## NC_058066.1  15360   15663   +
## NC_058066.1  18710   18774   -
## NC_058066.1  21000   21092   -
## NC_058066.1  22158   22442   +
## NC_058066.1  23084   23220   -
## NC_058066.1  24770   25066   +
## NC_058066.1  26652   26912   +
## NC_058066.1  27071   27358   +
## NC_058066.1  27658   27951   +


/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
> ../output/05.32-lncRNA-discovery/m_splice_sites.tab

head ../output/05.32-lncRNA-discovery/m_splice_sites.tab

## NC_058066.1  2118    18710   -
## NC_058066.1  15663   22158   +
## NC_058066.1  18774   21000   -
## NC_058066.1  21092   23084   -
## NC_058066.1  22442   24770   +
## NC_058066.1  25066   26652   +
## NC_058066.1  26912   27071   +
## NC_058066.1  27358   27658   +
## NC_058066.1  27951   28247   +
## NC_058066.1  28534   29196   +

/home/shared/hisat2-2.2.1/hisat2-build \
../data/GCF_013753865.1_Amil_v2.1_genomic.fna \
../data/GCF_013753865.1_Amil_v2.1.index \
--exon ../output/05.32-lncRNA-discovery/m_exon.tab \
--ss ../output/05.32-lncRNA-discovery/m_splice_sites.tab \
-p 40 \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
2> ../output/05.32-lncRNA-discovery/hisat2-build_stats.txt

find ../data/fastq/*gz

## ../data/fastq/RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz
## ../data/fastq/RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz

find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \
| xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
echo {}

## RNA-ACR-140
## RNA-ACR-145
## RNA-ACR-150
## RNA-ACR-173
## RNA-ACR-178

find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \
| xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../data/GCF_013753865.1_Amil_v2.1.index \
-p 20 \
-1 ../data/fastq/{}-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz \
-2 ../data/fastq/{}-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz \
-S ../output/05.32-lncRNA-discovery/{}.sam

47710408 reads; of these:
  47710408 (100.00%) were paired; of these:
    27056517 (56.71%) aligned concordantly 0 times
    19229695 (40.31%) aligned concordantly exactly 1 time
    1424196 (2.99%) aligned concordantly >1 times
    ----
    27056517 pairs aligned concordantly 0 times; of these:
      1316579 (4.87%) aligned discordantly 1 time
    ----
    25739938 pairs aligned 0 times concordantly or discordantly; of these:
      51479876 mates make up the pairs; of these:
        41152624 (79.94%) aligned 0 times
        9290398 (18.05%) aligned exactly 1 time
        1036854 (2.01%) aligned >1 times
56.87% overall alignment rate
42864294 reads; of these:
  42864294 (100.00%) were paired; of these:
    23649791 (55.17%) aligned concordantly 0 times
    17703533 (41.30%) aligned concordantly exactly 1 time
    1510970 (3.53%) aligned concordantly >1 times
    ----
    23649791 pairs aligned concordantly 0 times; of these:
      1250605 (5.29%) aligned discordantly 1 time
    ----
    22399186 pairs aligned 0 times concordantly or discordantly; of these:
      44798372 mates make up the pairs; of these:
        35369621 (78.95%) aligned 0 times
        8218798 (18.35%) aligned exactly 1 time
        1209953 (2.70%) aligned >1 times
58.74% overall alignment rate
43712298 reads; of these:
  43712298 (100.00%) were paired; of these:
    30355359 (69.44%) aligned concordantly 0 times
    12131465 (27.75%) aligned concordantly exactly 1 time
    1225474 (2.80%) aligned concordantly >1 times
    ----
    30355359 pairs aligned concordantly 0 times; of these:
      848325 (2.79%) aligned discordantly 1 time
    ----
    29507034 pairs aligned 0 times concordantly or discordantly; of these:
      59014068 mates make up the pairs; of these:
        51031599 (86.47%) aligned 0 times
        6695839 (11.35%) aligned exactly 1 time
        1286630 (2.18%) aligned >1 times
41.63% overall alignment rate
47501524 reads; of these:
  47501524 (100.00%) were paired; of these:
    27827915 (58.58%) aligned concordantly 0 times
    18298532 (38.52%) aligned concordantly exactly 1 time
    1375077 (2.89%) aligned concordantly >1 times
    ----
    27827915 pairs aligned concordantly 0 times; of these:
      1321896 (4.75%) aligned discordantly 1 time
    ----
    26506019 pairs aligned 0 times concordantly or discordantly; of these:
      53012038 mates make up the pairs; of these:
        42654805 (80.46%) aligned 0 times
        9174277 (17.31%) aligned exactly 1 time
        1182956 (2.23%) aligned >1 times
55.10% overall alignment rate
42677752 reads; of these:
  42677752 (100.00%) were paired; of these:
    25646587 (60.09%) aligned concordantly 0 times
    15654238 (36.68%) aligned concordantly exactly 1 time
    1376927 (3.23%) aligned concordantly >1 times
    ----
    25646587 pairs aligned concordantly 0 times; of these:
      1105089 (4.31%) aligned discordantly 1 time
    ----
    24541498 pairs aligned 0 times concordantly or discordantly; of these:
      49082996 mates make up the pairs; of these:
        38229072 (77.89%) aligned 0 times
        9393012 (19.14%) aligned exactly 1 time
        1460912 (2.98%) aligned >1 times
55.21% overall alignment rate

1.4 convert to bams

for samfile in ../output/05.32-lncRNA-discovery/*.sam; do
  bamfile="${samfile%.sam}.bam"
  sorted_bamfile="${samfile%.sam}.sorted.bam"
  
  # Convert SAM to BAM
  /home/shared/samtools-1.12/samtools view -bS "$samfile" > "$bamfile"
  
  # Sort BAM
  /home/shared/samtools-1.12/samtools sort "$bamfile" -o "$sorted_bamfile"
  
  # Index sorted BAM
  /home/shared/samtools-1.12/samtools index "$sorted_bamfile"
done

2 StringTie

StringTie uses the sorted BAM files to assemble transcripts for each sample, outputting them as GTF (Gene Transfer Format) files. And then merges all individual GTF assemblies into a single merged GTF file. This step extracts transcript information and merges GTFs from all samples–an important step in creating a canonical list of lncRNAs across all samples included in the pipeline.

find ../output/05.32-lncRNA-discovery/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 8 \
-G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-o ../output/05.32-lncRNA-discovery/{}.gtf \
../output/05.32-lncRNA-discovery/{}.sorted.bam

wc -l ../output/05.32-lncRNA-discovery/RNA*.gtf
ls ../output/05.32-lncRNA-discovery/RNA*.gtf

##    373617 ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf
##    322109 ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf
##    340380 ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf
##    367972 ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf
##    426804 ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
##   1830882 total
## ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf
## ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf

Merges all individual GTF assemblies into a single merged GTF file.

/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge \
-G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf \
../output/05.32-lncRNA-discovery/*.gtf


wc -l ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
head ../output/05.32-lncRNA-discovery/stringtie_merged.gtf

## 726445 ../output/05.32-lncRNA-discovery/stringtie_merged.gtf
## # /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie --merge -G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff -o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf
## # StringTie version 2.2.1
## NC_058066.1  StringTie   transcript  345 1190    1000    .   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
## NC_058066.1  StringTie   exon    345 1190    1000    .   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
## NC_058066.1  StringTie   transcript  1962    23221   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    1962    2119    1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "1"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    18711   18775   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    21001   21093   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "3"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   exon    23085   23221   1000    -   .   gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "4"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; 
## NC_058066.1  StringTie   transcript  3278    4416    1000    .   .   gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";

3 GFFcompare

/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
-r ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \
-G \
-o ../output/05.32-lncRNA-discovery/gffcompare_merged \
../output/05.32-lncRNA-discovery/stringtie_merged.gtf

ls ../output/05.32-lncRNA-discovery/gffcompare_merged*

## ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf
## ../output/05.32-lncRNA-discovery/gffcompare_merged.loci
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stats
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.refmap
## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.tmap
## ../output/05.32-lncRNA-discovery/gffcompare_merged.tracking

head -3 ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf

## NC_058066.1  StringTie   transcript  9330    48114   .   +   .   transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; gene_name "LOC114963508"; xloc "XLOC_000001"; ref_gene_id "gene-LOC114963508"; cmp_ref "rna-XM_029342736.2"; class_code "j"; tss_id "TSS1";
## NC_058066.1  StringTie   exon    9330    9651    .   +   .   transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "1";
## NC_058066.1  StringTie   exon    12840   12890   .   +   .   transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "2";

4 Filter

Filters the combined GTF output from GFFcompare to select only the lines representing “transcripts” and excluding lines starting with “#” (these are lines in the output format from GFFcompare that don’t contain transcript information). This step further filters for a class code of “u”, and keep only those with lengths greater than 199 bases. The “u’ class code from the GFFcompare step is for”unknown” transcripts, that is those that are not previously annotated in our reference GFF as protein coding. The size filter of +200nt is a common filtering step for isolating lncRNAs.

awk '$3 == "transcript" && $1 !~ /^#/ {print}' \
../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf

head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf

## NC_058066.1  StringTie   transcript  468619  469943  .   +   .   transcript_id "MSTRG.40.1"; gene_id "MSTRG.40"; xloc "XLOC_000020"; class_code "u"; tss_id "TSS38";
## NC_058066.1  StringTie   transcript  1153399 1165634 .   +   .   transcript_id "MSTRG.120.1"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1  StringTie   transcript  1153404 1165634 .   +   .   transcript_id "MSTRG.120.2"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1  StringTie   transcript  1153410 1165634 .   +   .   transcript_id "MSTRG.120.3"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95";
## NC_058066.1  StringTie   transcript  1154207 1155609 .   +   .   transcript_id "MSTRG.120.4"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS96";
## NC_058066.1  StringTie   transcript  1155787 1165634 .   +   .   transcript_id "MSTRG.120.5"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS97";
## NC_058066.1  StringTie   transcript  1222445 1227226 .   +   .   transcript_id "MSTRG.126.1"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98";
## NC_058066.1  StringTie   transcript  1222539 1227226 .   +   .   transcript_id "MSTRG.126.2"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98";
## NC_058066.1  StringTie   transcript  1256163 1257382 .   +   .   transcript_id "MSTRG.136.3"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103";
## NC_058066.1  StringTie   transcript  1256163 1264306 .   +   .   transcript_id "MSTRG.136.2"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103";

5 Bedtools

Extracts the sequence data from the $FASTA reference file based on the coordinates from the filtered GTF. The resulting sequences represent potential lncRNA candidates.

/home/shared/bedtools2/bin/fastaFromBed \
-fi ../data/GCF_013753865.1_Amil_v2.1_genomic.fna \
-bed ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf \
-fo ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta \
-name -split

fgrep -c ">" ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta
head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta

6 CPC2

Initializes a conda environment (Anaconda) and runs CPC2, a software to predict whether a transcript is coding or non-coding. The results are saved to the $OUTPUT_DIR. CPC2 uses ORF (Open Reading Frame) Analysis, Isometric Feature Mapping (Isomap), Sequence Homology, RNA Sequence Features, and Quality of Translation to assess coding potential and flag any transcripts we would want to exclude using the FASTA generated in the previous step.

eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \
-i ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta \
-o ../output/05.32-lncRNA-discovery/Apul_CPC2

wc -l head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt

## wc: head: No such file or directory
##   16695 ../output/05.32-lncRNA-discovery/Apul_CPC2.txt
##   16695 total
## #ID  transcript_length   peptide_length  Fickett_score   pI  ORF_integrity   coding_probability  label
## transcript::NC_058066.1:468618-469943    1325    73  0.3891  11.48389835357666   1   0.0882804   noncoding
## transcript::NC_058066.1:1153398-1165634  12236   134 0.3118  9.09901943206787    1   0.295457    noncoding
## transcript::NC_058066.1:1153403-1165634  12231   134 0.3118  9.09901943206787    1   0.295457    noncoding
## transcript::NC_058066.1:1153409-1165634  12225   134 0.3118  9.09901943206787    1   0.295457    noncoding
## transcript::NC_058066.1:1154206-1155609  1403    66  0.37373 8.802658271789554   1   0.0495107   noncoding
## transcript::NC_058066.1:1155786-1165634  9848    134 0.29479 9.09901943206787    1   0.241898    noncoding
## transcript::NC_058066.1:1222444-1227226  4782    62  0.32077 9.30048313140869    1   0.0302275   noncoding
## transcript::NC_058066.1:1222538-1227226  4688    62  0.32077 9.30048313140869    1   0.0302275   noncoding
## transcript::NC_058066.1:1256162-1257382  1220    41  0.34867000000000004 9.098761558532715   1   0.0140974   noncoding

#Filter Filters the CPC2 results to get only noncoding transcripts (using the class “noncoding” from the CPC2 results) and extracts their IDs and matches these IDs with the sequences from the previous step to generate a GTF of long noncoding transcripts.

Matches these IDs with the sequences from the previous step to generate a GTF of noncoding transcripts.

awk '$8 == "noncoding" {print $1}' ../output/05.32-lncRNA-discovery/Apul_CPC2.txt > ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt

grep -Fwf ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta > ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf

wc -l ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
head ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf

## 16122 ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf
## >transcript::NC_058066.1:468618-469943
## >transcript::NC_058066.1:1153398-1165634
## >transcript::NC_058066.1:1153403-1165634
## >transcript::NC_058066.1:1153409-1165634
## >transcript::NC_058066.1:1154206-1155609
## >transcript::NC_058066.1:1155786-1165634
## >transcript::NC_058066.1:1222444-1227226
## >transcript::NC_058066.1:1222538-1227226
## >transcript::NC_058066.1:1256162-1257382
## >transcript::NC_058066.1:1256162-1264306

7 De-duplicate GTF

Removes duplicate entries from the GTF file. This results in a list of lncRNAs with transcript IDs that show scaffold and base range. awk ‘!seen\[\$0\]++’ $OUTPUT_DIR/apul_merged_final_lncRNAs.gtf > $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf

8 Reformat to BED

Converts the GTF format to BED (Browser Extensible Data) format, a commonly used format for representing genomic intervals. This allows us to use bedtools to generate a FASTA of lncRNAs. awk -F”:|-” ‘{print $3 “ $4” $5}’ $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf > $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed

9 BEDTools

This extracts sequences from the reference FASTA based on the BED file’s coordinates. The resulting sequences represent the final set of non-redundant lncRNAs. $BEDTOOLS_PATH/bedtools
getfasta -fi $FASTA -bed $OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed -fo $OUTPUT_DIR/apul_bedtools_lncRNAs.fasta -name ```

The results of this pipleine provide the user with GTF files containing transcript IDs (scaffold and sequence range), BED files displaying the same transcript IDs in BED format for use with BEDtools, and FASTA files containing the sequence of identified lncRNAs.

--- title: "Apul lncRNA Discovery" description: "Coral long non-coding RNA discovery and analysis" author: - name: Steven Roberts url: orcid: 0000-0001-8302-1138 affiliation: Professor, UW - School of Aquatic and Fishery Sciences affiliation-url: https://robertslab.info date: 11-03-2023 categories: [deepdive, R, lncRNA, e5] # self-defined categories #citation: #url: # self-defined image: http://gannet.fish.washington.edu/seashell/snaps/Monosnap_deep-dive__RStudio_Server_2023-11-03_06-26-50.png # finding a good image draft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready! format: html: code-fold: true code-tools: true code-copy: true highlight-style: github code-overflow: wrap --- 05.3-Apul-lncRNA-discovery ================ Steven Roberts 03 November, 2023 - <a href="#1-run-hisat-on-rna-seq" id="toc-1-run-hisat-on-rna-seq">1 Run HiSat on RNA-seq</a> - <a href="#11-grab-trimmed-rna-seq-reads" id="toc-11-grab-trimmed-rna-seq-reads">1.1 Grab Trimmed RNA-seq Reads</a> - <a href="#12-genome" id="toc-12-genome">1.2 Genome</a> - <a href="#13-hisat" id="toc-13-hisat">1.3 HiSat</a> - <a href="#14-convert-to-bams" id="toc-14-convert-to-bams">1.4 convert to bams</a> - <a href="#2-stringtie" id="toc-2-stringtie">2 StringTie</a> - <a href="#3-gffcompare" id="toc-3-gffcompare">3 GFFcompare</a> - <a href="#4-filter" id="toc-4-filter">4 Filter</a> - <a href="#5-bedtools" id="toc-5-bedtools">5 Bedtools</a> - <a href="#6-cpc2" id="toc-6-cpc2">6 CPC2</a> - <a href="#7-de-duplicate-gtf" id="toc-7-de-duplicate-gtf">7 De-duplicate GTF</a> - <a href="#8-reformat-to-bed" id="toc-8-reformat-to-bed">8 Reformat to BED</a> - <a href="#9-bedtools" id="toc-9-bedtools">9 BEDTools</a> # 1 Run HiSat on RNA-seq ## 1.1 Grab Trimmed RNA-seq Reads ``` bash wget -r \ --no-directories --no-parent \ -P ../data/fastq/ \ -A "*fastq.gz" https://gannet.fish.washington.edu/Atumefaciens/20230519-E5_coral-fastqc-fastp-multiqc-RNAseq/A_pulchra/trimmed/ ``` ## 1.2 Genome ``` bash cd ../data curl -O http://gannet.fish.washington.edu/seashell/snaps/GCF_013753865.1_Amil_v2.1_genomic.fna ``` ## 1.3 HiSat ``` bash /home/shared/hisat2-2.2.1/hisat2_extract_exons.py \ ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \ > ../output/05.32-lncRNA-discovery/m_exon.tab ``` ``` bash head ../output/05.32-lncRNA-discovery/m_exon.tab ``` ## NC_058066.1 1961 2118 - ## NC_058066.1 15360 15663 + ## NC_058066.1 18710 18774 - ## NC_058066.1 21000 21092 - ## NC_058066.1 22158 22442 + ## NC_058066.1 23084 23220 - ## NC_058066.1 24770 25066 + ## NC_058066.1 26652 26912 + ## NC_058066.1 27071 27358 + ## NC_058066.1 27658 27951 + ``` bash /home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \ ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \ > ../output/05.32-lncRNA-discovery/m_splice_sites.tab ``` ``` bash head ../output/05.32-lncRNA-discovery/m_splice_sites.tab ``` ## NC_058066.1 2118 18710 - ## NC_058066.1 15663 22158 + ## NC_058066.1 18774 21000 - ## NC_058066.1 21092 23084 - ## NC_058066.1 22442 24770 + ## NC_058066.1 25066 26652 + ## NC_058066.1 26912 27071 + ## NC_058066.1 27358 27658 + ## NC_058066.1 27951 28247 + ## NC_058066.1 28534 29196 + ``` bash /home/shared/hisat2-2.2.1/hisat2-build \ ../data/GCF_013753865.1_Amil_v2.1_genomic.fna \ ../data/GCF_013753865.1_Amil_v2.1.index \ --exon ../output/05.32-lncRNA-discovery/m_exon.tab \ --ss ../output/05.32-lncRNA-discovery/m_splice_sites.tab \ -p 40 \ ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \ 2> ../output/05.32-lncRNA-discovery/hisat2-build_stats.txt ``` ``` bash find ../data/fastq/*gz ``` ## ../data/fastq/RNA-ACR-140-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-140-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-145-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-145-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-150-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-150-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-173-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-173-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-178-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz ## ../data/fastq/RNA-ACR-178-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz ``` bash find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \ | xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \ echo {} ``` ## RNA-ACR-140 ## RNA-ACR-145 ## RNA-ACR-150 ## RNA-ACR-173 ## RNA-ACR-178 ``` bash find ../data/fastq/*R2_001.fastp-trim.20230519.fastq.gz \ | xargs basename -s -S1-TP2_R2_001.fastp-trim.20230519.fastq.gz | xargs -I{} \ /home/shared/hisat2-2.2.1/hisat2 \ -x ../data/GCF_013753865.1_Amil_v2.1.index \ -p 20 \ -1 ../data/fastq/{}-S1-TP2_R1_001.fastp-trim.20230519.fastq.gz \ -2 ../data/fastq/{}-S1-TP2_R2_001.fastp-trim.20230519.fastq.gz \ -S ../output/05.32-lncRNA-discovery/{}.sam ``` 47710408 reads; of these: 47710408 (100.00%) were paired; of these: 27056517 (56.71%) aligned concordantly 0 times 19229695 (40.31%) aligned concordantly exactly 1 time 1424196 (2.99%) aligned concordantly >1 times ---- 27056517 pairs aligned concordantly 0 times; of these: 1316579 (4.87%) aligned discordantly 1 time ---- 25739938 pairs aligned 0 times concordantly or discordantly; of these: 51479876 mates make up the pairs; of these: 41152624 (79.94%) aligned 0 times 9290398 (18.05%) aligned exactly 1 time 1036854 (2.01%) aligned >1 times 56.87% overall alignment rate 42864294 reads; of these: 42864294 (100.00%) were paired; of these: 23649791 (55.17%) aligned concordantly 0 times 17703533 (41.30%) aligned concordantly exactly 1 time 1510970 (3.53%) aligned concordantly >1 times ---- 23649791 pairs aligned concordantly 0 times; of these: 1250605 (5.29%) aligned discordantly 1 time ---- 22399186 pairs aligned 0 times concordantly or discordantly; of these: 44798372 mates make up the pairs; of these: 35369621 (78.95%) aligned 0 times 8218798 (18.35%) aligned exactly 1 time 1209953 (2.70%) aligned >1 times 58.74% overall alignment rate 43712298 reads; of these: 43712298 (100.00%) were paired; of these: 30355359 (69.44%) aligned concordantly 0 times 12131465 (27.75%) aligned concordantly exactly 1 time 1225474 (2.80%) aligned concordantly >1 times ---- 30355359 pairs aligned concordantly 0 times; of these: 848325 (2.79%) aligned discordantly 1 time ---- 29507034 pairs aligned 0 times concordantly or discordantly; of these: 59014068 mates make up the pairs; of these: 51031599 (86.47%) aligned 0 times 6695839 (11.35%) aligned exactly 1 time 1286630 (2.18%) aligned >1 times 41.63% overall alignment rate 47501524 reads; of these: 47501524 (100.00%) were paired; of these: 27827915 (58.58%) aligned concordantly 0 times 18298532 (38.52%) aligned concordantly exactly 1 time 1375077 (2.89%) aligned concordantly >1 times ---- 27827915 pairs aligned concordantly 0 times; of these: 1321896 (4.75%) aligned discordantly 1 time ---- 26506019 pairs aligned 0 times concordantly or discordantly; of these: 53012038 mates make up the pairs; of these: 42654805 (80.46%) aligned 0 times 9174277 (17.31%) aligned exactly 1 time 1182956 (2.23%) aligned >1 times 55.10% overall alignment rate 42677752 reads; of these: 42677752 (100.00%) were paired; of these: 25646587 (60.09%) aligned concordantly 0 times 15654238 (36.68%) aligned concordantly exactly 1 time 1376927 (3.23%) aligned concordantly >1 times ---- 25646587 pairs aligned concordantly 0 times; of these: 1105089 (4.31%) aligned discordantly 1 time ---- 24541498 pairs aligned 0 times concordantly or discordantly; of these: 49082996 mates make up the pairs; of these: 38229072 (77.89%) aligned 0 times 9393012 (19.14%) aligned exactly 1 time 1460912 (2.98%) aligned >1 times 55.21% overall alignment rate ## 1.4 convert to bams ``` bash for samfile in ../output/05.32-lncRNA-discovery/*.sam; do bamfile="${samfile%.sam}.bam" sorted_bamfile="${samfile%.sam}.sorted.bam" # Convert SAM to BAM /home/shared/samtools-1.12/samtools view -bS "$samfile" > "$bamfile" # Sort BAM /home/shared/samtools-1.12/samtools sort "$bamfile" -o "$sorted_bamfile" # Index sorted BAM /home/shared/samtools-1.12/samtools index "$sorted_bamfile" done ``` # 2 StringTie StringTie uses the sorted BAM files to assemble transcripts for each sample, outputting them as GTF (Gene Transfer Format) files. And then merges all individual GTF assemblies into a single merged GTF file. This step extracts transcript information and merges GTFs from all samples–an important step in creating a canonical list of lncRNAs across all samples included in the pipeline. ``` bash find ../output/05.32-lncRNA-discovery/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 8 \ -G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \ -o ../output/05.32-lncRNA-discovery/{}.gtf \ ../output/05.32-lncRNA-discovery/{}.sorted.bam ``` ``` bash wc -l ../output/05.32-lncRNA-discovery/RNA*.gtf ls ../output/05.32-lncRNA-discovery/RNA*.gtf ``` ## 373617 ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf ## 322109 ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf ## 340380 ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf ## 367972 ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf ## 426804 ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf ## 1830882 total ## ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf ## ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf ## ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf ## ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf ## ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf Merges all individual GTF assemblies into a single merged GTF file. ``` bash /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ --merge \ -G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \ -o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf \ ../output/05.32-lncRNA-discovery/*.gtf ``` ``` bash wc -l ../output/05.32-lncRNA-discovery/stringtie_merged.gtf head ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ``` ## 726445 ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ## # /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie --merge -G ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff -o ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-140.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-145.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-150.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-173.gtf ../output/05.32-lncRNA-discovery/RNA-ACR-178.gtf ## # StringTie version 2.2.1 ## NC_058066.1 StringTie transcript 345 1190 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; ## NC_058066.1 StringTie exon 345 1190 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; ## NC_058066.1 StringTie transcript 1962 23221 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; ## NC_058066.1 StringTie exon 1962 2119 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "1"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; ## NC_058066.1 StringTie exon 18711 18775 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "2"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; ## NC_058066.1 StringTie exon 21001 21093 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "3"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; ## NC_058066.1 StringTie exon 23085 23221 1000 - . gene_id "MSTRG.2"; transcript_id "rna-XR_003825913.2"; exon_number "4"; gene_name "LOC114963522"; ref_gene_id "gene-LOC114963522"; ## NC_058066.1 StringTie transcript 3278 4416 1000 . . gene_id "MSTRG.3"; transcript_id "MSTRG.3.1"; # 3 GFFcompare ``` bash /home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \ -r ../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff \ -G \ -o ../output/05.32-lncRNA-discovery/gffcompare_merged \ ../output/05.32-lncRNA-discovery/stringtie_merged.gtf ``` ``` bash ls ../output/05.32-lncRNA-discovery/gffcompare_merged* ``` ## ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf ## ../output/05.32-lncRNA-discovery/gffcompare_merged.loci ## ../output/05.32-lncRNA-discovery/gffcompare_merged.stats ## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.refmap ## ../output/05.32-lncRNA-discovery/gffcompare_merged.stringtie_merged.gtf.tmap ## ../output/05.32-lncRNA-discovery/gffcompare_merged.tracking ``` bash head -3 ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf ``` ## NC_058066.1 StringTie transcript 9330 48114 . + . transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; gene_name "LOC114963508"; xloc "XLOC_000001"; ref_gene_id "gene-LOC114963508"; cmp_ref "rna-XM_029342736.2"; class_code "j"; tss_id "TSS1"; ## NC_058066.1 StringTie exon 9330 9651 . + . transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "1"; ## NC_058066.1 StringTie exon 12840 12890 . + . transcript_id "MSTRG.5.2"; gene_id "MSTRG.5"; exon_number "2"; # 4 Filter Filters the combined GTF output from GFFcompare to select only the lines representing “transcripts” and excluding lines starting with “\#” (these are lines in the output format from GFFcompare that don’t contain transcript information). This step further filters for a class code of “u”, and keep only those with lengths greater than 199 bases. The “u’ class code from the GFFcompare step is for”unknown” transcripts, that is those that are not previously annotated in our reference GFF as protein coding. The size filter of +200nt is a common filtering step for isolating lncRNAs. ``` bash awk '$3 == "transcript" && $1 !~ /^#/ {print}' \ ../output/05.32-lncRNA-discovery/gffcompare_merged.annotated.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf ``` ``` bash head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf ``` ## NC_058066.1 StringTie transcript 468619 469943 . + . transcript_id "MSTRG.40.1"; gene_id "MSTRG.40"; xloc "XLOC_000020"; class_code "u"; tss_id "TSS38"; ## NC_058066.1 StringTie transcript 1153399 1165634 . + . transcript_id "MSTRG.120.1"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95"; ## NC_058066.1 StringTie transcript 1153404 1165634 . + . transcript_id "MSTRG.120.2"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95"; ## NC_058066.1 StringTie transcript 1153410 1165634 . + . transcript_id "MSTRG.120.3"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS95"; ## NC_058066.1 StringTie transcript 1154207 1155609 . + . transcript_id "MSTRG.120.4"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS96"; ## NC_058066.1 StringTie transcript 1155787 1165634 . + . transcript_id "MSTRG.120.5"; gene_id "MSTRG.120"; xloc "XLOC_000058"; class_code "u"; tss_id "TSS97"; ## NC_058066.1 StringTie transcript 1222445 1227226 . + . transcript_id "MSTRG.126.1"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98"; ## NC_058066.1 StringTie transcript 1222539 1227226 . + . transcript_id "MSTRG.126.2"; gene_id "MSTRG.126"; xloc "XLOC_000059"; class_code "u"; tss_id "TSS98"; ## NC_058066.1 StringTie transcript 1256163 1257382 . + . transcript_id "MSTRG.136.3"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103"; ## NC_058066.1 StringTie transcript 1256163 1264306 . + . transcript_id "MSTRG.136.2"; gene_id "MSTRG.136"; xloc "XLOC_000062"; class_code "u"; tss_id "TSS103"; # 5 Bedtools Extracts the sequence data from the `$FASTA` reference file based on the coordinates from the filtered GTF. The resulting sequences represent potential lncRNA candidates. ``` bash /home/shared/bedtools2/bin/fastaFromBed \ -fi ../data/GCF_013753865.1_Amil_v2.1_genomic.fna \ -bed ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.gtf \ -fo ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta \ -name -split ``` ``` bash fgrep -c ">" ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta head ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta ``` # 6 CPC2 Initializes a conda environment (Anaconda) and runs CPC2, a software to predict whether a transcript is coding or non-coding. The results are saved to the \$OUTPUT_DIR. CPC2 uses ORF (Open Reading Frame) Analysis, Isometric Feature Mapping (Isomap), Sequence Homology, RNA Sequence Features, and Quality of Translation to assess coding potential and flag any transcripts we would want to exclude using the FASTA generated in the previous step. ``` bash eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \ -i ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta \ -o ../output/05.32-lncRNA-discovery/Apul_CPC2 ``` ``` bash wc -l head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt head ../output/05.32-lncRNA-discovery/Apul_CPC2.txt ``` ## wc: head: No such file or directory ## 16695 ../output/05.32-lncRNA-discovery/Apul_CPC2.txt ## 16695 total ## #ID transcript_length peptide_length Fickett_score pI ORF_integrity coding_probability label ## transcript::NC_058066.1:468618-469943 1325 73 0.3891 11.48389835357666 1 0.0882804 noncoding ## transcript::NC_058066.1:1153398-1165634 12236 134 0.3118 9.09901943206787 1 0.295457 noncoding ## transcript::NC_058066.1:1153403-1165634 12231 134 0.3118 9.09901943206787 1 0.295457 noncoding ## transcript::NC_058066.1:1153409-1165634 12225 134 0.3118 9.09901943206787 1 0.295457 noncoding ## transcript::NC_058066.1:1154206-1155609 1403 66 0.37373 8.802658271789554 1 0.0495107 noncoding ## transcript::NC_058066.1:1155786-1165634 9848 134 0.29479 9.09901943206787 1 0.241898 noncoding ## transcript::NC_058066.1:1222444-1227226 4782 62 0.32077 9.30048313140869 1 0.0302275 noncoding ## transcript::NC_058066.1:1222538-1227226 4688 62 0.32077 9.30048313140869 1 0.0302275 noncoding ## transcript::NC_058066.1:1256162-1257382 1220 41 0.34867000000000004 9.098761558532715 1 0.0140974 noncoding \#Filter Filters the CPC2 results to get only noncoding transcripts (using the class “noncoding” from the CPC2 results) and extracts their IDs and matches these IDs with the sequences from the previous step to generate a GTF of long noncoding transcripts. Matches these IDs with the sequences from the previous step to generate a GTF of noncoding transcripts. ``` bash awk '$8 == "noncoding" {print $1}' ../output/05.32-lncRNA-discovery/Apul_CPC2.txt > ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt grep -Fwf ../output/05.32-lncRNA-discovery/Apul_noncoding_transcripts_ids.txt ../output/05.32-lncRNA-discovery/Apul_lncRNA_candidates.fasta > ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf ``` ``` bash wc -l ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf head ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf ``` ## 16122 ../output/05.32-lncRNA-discovery/Apul_lncRNAs.gtf ## >transcript::NC_058066.1:468618-469943 ## >transcript::NC_058066.1:1153398-1165634 ## >transcript::NC_058066.1:1153403-1165634 ## >transcript::NC_058066.1:1153409-1165634 ## >transcript::NC_058066.1:1154206-1155609 ## >transcript::NC_058066.1:1155786-1165634 ## >transcript::NC_058066.1:1222444-1227226 ## >transcript::NC_058066.1:1222538-1227226 ## >transcript::NC_058066.1:1256162-1257382 ## >transcript::NC_058066.1:1256162-1264306 # 7 De-duplicate GTF Removes duplicate entries from the GTF file. This results in a list of lncRNAs with transcript IDs that show scaffold and base range. awk ‘!seen$$\$0$$++’ \$OUTPUT_DIR/apul_merged_final_lncRNAs.gtf \> \$OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf # 8 Reformat to BED Converts the GTF format to BED (Browser Extensible Data) format, a commonly used format for representing genomic intervals. This allows us to use bedtools to generate a FASTA of lncRNAs. awk -F”:\|-” ‘{print \$3 “ \$4” \$5}’ \$OUTPUT_DIR/apul_deduplicated_final_lncRNAs.gtf \> \$OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed # 9 BEDTools This extracts sequences from the reference FASTA based on the BED file’s coordinates. The resulting sequences represent the final set of non-redundant lncRNAs. \$BEDTOOLS_PATH/bedtools getfasta -fi \$FASTA -bed \$OUTPUT_DIR/apul_deduplicated_final_lncRNAs.bed -fo \$OUTPUT_DIR/apul_bedtools_lncRNAs.fasta -name \`\`\` The results of this pipleine provide the user with GTF files containing transcript IDs (scaffold and sequence range), BED files displaying the same transcript IDs in BED format for use with BEDtools, and FASTA files containing the sequence of identified lncRNAs.