I started with a BED file from discovery (see below) and used it to count the number of lncRNAs in each sample from the Time Series data (Apul).
Converted BED to GTF
awk 'BEGIN{OFS="\t"; count=1} {printf "%s\t.\tlncRNA\t%d\t%d\t.\t+\t.\tgene_id \"lncRNA_%03d\";\n", $1, $2, $3, count++;}' /home/shared/8TB_HDD_03/sr320/github/deep-dive-expression/D-Apul/output/10.1-Apul-lncRNA/Apul_lncRNA.bed \
> ../output/05-Apul-lncRNA/lncRNAs.gtf
==> /home/shared/8TB_HDD_03/sr320/github/deep-dive-expression/D-Apul/output/10.1-Apul-lncRNA/Apul_lncRNA.bed <==
ntLink_0 84514 93551
ntLink_0 15627 19151
ntLink_0 23443 23874
ntLink_1 7484 9525
ntLink_1 51265 51766
ntLink_2 217051 217761
ntLink_2 330551 331392
ntLink_3 56721 62176
ntLink_3 76352 81676
ntLink_3 95705 100623
GTF
ntLink_0 . lncRNA 84514 93551 . + . gene_id "lncRNA_001";
ntLink_0 . lncRNA 15627 19151 . + . gene_id "lncRNA_002";
ntLink_0 . lncRNA 23443 23874 . + . gene_id "lncRNA_003";
ntLink_1 . lncRNA 7484 9525 . + . gene_id "lncRNA_004";
ntLink_1 . lncRNA 51265 51766 . + . gene_id "lncRNA_005";
ntLink_2 . lncRNA 217051 217761 . + . gene_id "lncRNA_006";
Feature Count
/home/shared/subread-2.0.5-Linux-x86_64/bin/featureCounts \
\
-T 42 \
-a ../output/05-Apul-lncRNA/lncRNAs.gtf \
-o ../output/05-Apul-lncRNA/counts.txt \
-t lncRNA \
-g gene_id \
-p ../data/*sorted.bam
Count Matrix
https://github.com/urol-e5/timeseries_molecular/blob/main/D-Apul/output/05-Apul-lncRNA/counts.txt
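To get a quick look at the resulting matrix, something like the sketch below works (a rough Python sketch, assuming pandas is installed and run next to the counts.txt written above). featureCounts writes one leading comment line, then a header row with Geneid, Chr, Start, End, Strand, Length, and one column per BAM.

import pandas as pd

# Read the featureCounts table, skipping the leading "# Program:featureCounts ..." comment line.
counts = pd.read_csv('../output/05-Apul-lncRNA/counts.txt', sep='\t', comment='#')

# Drop the annotation columns to leave gene IDs versus per-sample counts.
count_matrix = counts.drop(columns=['Chr', 'Start', 'End', 'Strand', 'Length']).set_index('Geneid')
print(count_matrix.shape)
print(count_matrix.head())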
Before I did that, lncRNAs first had to be discovered. This was done with the deep-dive expression samples (n = 5).
Discovery
StringTie
StringTie uses the sorted BAM files to assemble transcripts for each sample, outputting them as GTF (Gene Transfer Format) files. It then merges all of the individual GTF assemblies into a single merged GTF file. This step extracts transcript information and merges the GTFs from all samples, an important step in creating a canonical list of lncRNAs across all samples included in the pipeline.
find ../output/07-Apul-Hisat/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
\
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 42 \
-G ../data/Apulcra-genome.gff \
-o ../output/10.1-Apul-lncRNA/{}.gtf ../output/07-Apul-Hisat/{}.sorted.bam
Merges all individual GTF assemblies into a single merged GTF file.
This is used to create a non-redundant set of transcripts after running StringTie separately on multiple RNA-Seq datasets.
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
\
--merge \
-G ../data/Apulcra-genome.gff \
-o ../output/10.1-Apul-lncRNA/stringtie_merged.gtf ../output/10.1-Apul-lncRNA/RNA*.gtf
GFFcompare
https://ccb.jhu.edu/software/stringtie/gffcompare.shtml
/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
\
-r ../data/Apulcra-genome.gff \
-o ../output/10.1-Apul-lncRNA/gffcompare_merged ../output/10.1-Apul-lncRNA/stringtie_merged.gtf
Filter
Filters the combined GTF output from GFFcompare to select only the lines representing transcripts, excluding lines starting with "#" (header lines in the GFFcompare output that don't contain transcript information). This step then keeps transcripts with class codes "u", "x", "i", or "y" and retains only those with lengths greater than 199 bases. The "u" class code from the GFFcompare step marks "unknown" (intergenic) transcripts, i.e., those not previously annotated in our reference GFF as protein coding. The 200 nt minimum length is a common filtering step for isolating lncRNAs.
awk '$3 == "transcript" && $1 !~ /^#/' ../output/10.1-Apul-lncRNA/gffcompare_merged.annotated.gtf \
| grep 'class_code "u"\|class_code "x"\|class_code "i"\|class_code "y"' \
| awk '($5 - $4 > 199) || ($4 - $5 > 199)' \
> ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.gtf
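As a quick sanity check on the class-code filtering, one can tally how many transcripts fall in each class code before filtering. This is a rough Python sketch; the path and the standard class_code attribute format are taken from the GFFcompare step above.

import re
from collections import Counter

codes = Counter()
with open('../output/10.1-Apul-lncRNA/gffcompare_merged.annotated.gtf') as gtf:
    for line in gtf:
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'transcript':
            continue
        match = re.search(r'class_code "(\S+)"', fields[8])
        if match:
            codes[match.group(1)] += 1

print(codes.most_common())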
Bedtools
Extracts the sequence data from the genome FASTA reference file (Apulcra-genome.fa) based on the coordinates from the filtered GTF. The resulting sequences represent potential lncRNA candidates.
/home/shared/bedtools2/bin/fastaFromBed \
\
-fi ../data/Apulcra-genome.fa \
-bed ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.gtf \
-fo ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.fasta -split -name
fgrep -c ">" ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.fasta
head ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.fasta
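To double-check that the greater-than-199 bp length filter behaved as expected, a small sketch like this (plain Python, no extra dependencies; path from the bedtools step above) reports the number of candidates and the shortest one:

lengths = []
current = 0
with open('../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.fasta') as fasta:
    for line in fasta:
        if line.startswith('>'):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line.strip())
if current:
    lengths.append(current)

if lengths:
    print(f'{len(lengths)} candidates; shortest is {min(lengths)} nt')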
CPC2
Initializes a conda environment (Anaconda) and runs CPC2, a tool that predicts whether a transcript is coding or non-coding. The results are saved to the output directory. CPC2 assesses coding potential from intrinsic sequence features, including ORF (Open Reading Frame) length and integrity, the Fickett TESTCODE score, and the isoelectric point of the putative peptide, and flags any transcripts we would want to exclude, using the FASTA generated in the previous step.
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \
\
-i ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.fasta -o ../output/10.1-Apul-lncRNA/Apul_CPC2
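Before filtering, it can be worth summarizing how many transcripts CPC2 called coding versus noncoding. A rough sketch, assuming the standard tab-separated CPC2 output whose last column is the coding/noncoding label used by the awk filter below:

from collections import Counter

labels = Counter()
with open('../output/10.1-Apul-lncRNA/Apul_CPC2.txt') as cpc2:
    cpc2.readline()  # skip the header line (assumed present)
    for line in cpc2:
        fields = line.rstrip('\n').split('\t')
        if fields and fields[-1]:
            labels[fields[-1]] += 1

print(labels)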
Filter
Filters the CPC2 results to keep only noncoding transcripts (those labeled "noncoding" in the CPC2 output), extracts their IDs, and matches these IDs to the sequences from the previous step to generate a FASTA of long noncoding transcripts.
awk '$8 == "noncoding" {print $1}' ../output/10.1-Apul-lncRNA/Apul_CPC2.txt > ../output/10.1-Apul-lncRNA/Apul_noncoding_transcripts_ids.txt
Subsetting FASTA
/home/shared/samtools-1.12/samtools faidx ../output/10.1-Apul-lncRNA/Apul_lncRNA_candidates.fasta \
-r ../output/10.1-Apul-lncRNA/Apul_noncoding_transcripts_ids.txt \
> ../output/10.1-Apul-lncRNA/Apul_lncRNA.fasta
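A quick sanity check on the subsetting (a small Python sketch; paths from the two steps above): the number of sequences in the subsetted FASTA should match the number of noncoding IDs requested.

with open('../output/10.1-Apul-lncRNA/Apul_noncoding_transcripts_ids.txt') as ids:
    n_ids = sum(1 for line in ids if line.strip())

with open('../output/10.1-Apul-lncRNA/Apul_lncRNA.fasta') as fasta:
    n_seqs = sum(1 for line in fasta if line.startswith('>'))

print(f'{n_seqs} sequences written for {n_ids} noncoding IDs')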
Getting genome feature track
# Open the input file and the output file
with open('../output/10.1-Apul-lncRNA/Apul_noncoding_transcripts_ids.txt', 'r') as infile, open('../output/10.1-Apul-lncRNA/Apul_lncRNA.bed', 'w') as outfile:
    # Process each line in the input file
    for line in infile:
        # Remove 'transcript::' and then split the line by ':' to extract the relevant parts
        parts = line.strip().replace('transcript::', '').split(':')
        chromosome = parts[0]
        # Split the position part by '-' to get start and end positions
        start, end = parts[1].split('-')

        # BED format requires the start position to be 0-based
        # Convert the start position to 0-based by subtracting 1
        start = str(int(start) - 1)

        # Write the chromosome, start, and end positions to the output file
        # Separate each field with a tab character
        outfile.write(f'{chromosome}\t{start}\t{end}\n')

# After running this script, 'Apul_lncRNA.bed' will contain the converted data in BED format.
- Note that when you take this BED file and make a GTF from it, you get -1 starts that make the program fail; this is, of course, the 0-based/1-based coordinate issue, which will eventually need to be looked into.
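For reference, below is a hypothetical sketch (not part of the pipeline above) of how that offset could be handled when writing the GTF: add 1 to the BED start, since BED is 0-based and half-open while GTF is 1-based and inclusive, and clamp at 1 so no feature starts at 0 or below. Paths and the lncRNA_### naming mirror the awk step at the top of this post.

# Hypothetical sketch of the 0-based to 1-based fix; not run as part of this pipeline.
with open('../output/10.1-Apul-lncRNA/Apul_lncRNA.bed', 'r') as bed, \
     open('../output/05-Apul-lncRNA/lncRNAs.gtf', 'w') as gtf:
    for count, line in enumerate(bed, start=1):
        chrom, start, end = line.split()[:3]
        gtf_start = max(int(start) + 1, 1)  # BED start is 0-based; GTF starts at 1
        gtf.write(f'{chrom}\t.\tlncRNA\t{gtf_start}\t{end}\t.\t+\t.\tgene_id "lncRNA_{count:03d}";\n')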