Went in a few directions but will try to layout properly here. Experiment - 3 populations, 4 pooled (n=5) libraires - thus 20 individuals represented per population.
Data as provided
Combining
Here is an example from one species
find /home/shared/16TB_HDD_01/valentina/data/Quilhua/barcode01 -name "*.fastq.gz" -print0 | xargs -0 zcat | gzip > ../output/02-RNAseq/Qui_barcode01_combined.fastq.gz
find /home/shared/16TB_HDD_01/valentina/data/Quilhua/barcode01_1 -name "*.fastq.gz" -print0 | xargs -0 zcat | gzip > ../output/02-RNAseq/Qui_barcode01_1_combined.fastq.gz
find /home/shared/16TB_HDD_01/valentina/data/Quilhua/barcode02_1 -name "*.fastq.gz" -print0 | xargs -0 zcat | gzip > ../output/02-RNAseq/Qui_barcode02_1_combined.fastq.gz
find /home/shared/16TB_HDD_01/valentina/data/Quilhua/barcode12 -name "*.fastq.gz" -print0 | xargs -0 zcat | gzip > ../output/02-RNAseq/Qui_barcode12_combined.fastq.gz
Align
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
# Set number of threads
threads=42
# Set paths
ref_genome="../data/Och_HapA_assembly.fa"
input_dir="../output/02-RNAseq"
output_dir="../output/02-RNAseq"
# Loop over each FASTQ file
for fq in ${input_dir}/*.fastq.gz; do
# Extract sample name (e.g., Pum_barcode02 from Pum_barcode02_combined.fastq.gz)
sample=$(basename "$fq" .fastq.gz)
echo "Processing $sample..."
# Run minimap2 and samtools sort
minimap2 -ax splice -k14 --MD -t $threads "$ref_genome" "$fq" | \
samtools sort -@ $threads -o "${output_dir}/${sample}.sorted.bam"
# Index the sorted BAM
samtools index "${output_dir}/${sample}.sorted.bam"
done
Stringtie
mkdir -p ../output/02-RNAseq/stringtie_out
for bam in ../output/02-RNAseq/*combined.sorted.bam; do
sample=$(basename "$bam" _combined.sorted.bam)
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie "$bam" \
-L \
-o ../output/02-RNAseq/stringtie_out/${sample}.gtf \
-p 32 \
-v
done
ls ../output/02-RNAseq/stringtie_out/*.gtf > ../output/02-RNAseq/stringtie_out/mergelist.txt
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie --merge \
-L \
-G ../data/GN.gene.gtf \
-o ../output/02-RNAseq/stringtie_out/merged.gtf \
../output/02-RNAseq/stringtie_out/mergelist.txt
mkdir -p ../output/02-RNAseq/stringtie_quant
for bam in ../output/02-RNAseq/*combined.sorted.bam; do
sample=$(basename "$bam" _combined.sorted.bam)
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie "$bam" \
-L \
-e \
-B \
-G ../output/02-RNAseq/stringtie_out/merged.gtf \
-o ../output/02-RNAseq/stringtie_quant/${sample}.gtf
done
python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
\
-i ../output/02-RNAseq/stringtie_quant/samplelist.txt \
-g ../output/02-RNAseq/stringtie_quant/gene_count_matrix.csv -t ../output/02-RNAseq/stringtie_quant/transcript_count_matrix.csv
Next?
Normalize, QC?, Annotate (many are novel genes)