Here we take the first set of time series methylation data to CpG level percent methylation.
Data was trimmed
To reproduce the trimming and quality control (QC) of Acropora pulchra Whole Genome Bisulfite Sequencing (WGBS) data, obtain the raw FastQ files and use fastp to trim 25bp from the 5’ end while removing adapters, polyG, and polyA sequences; if errors occur, attempt repair with BBTools’ repair.sh before re-trimming. Evaluate trimmed reads with FastQC, then summarize results using MultiQC. The complete workflow, including commands and scripts, is available in the project repository, and output files can be accessed here.
Reads were aligned
To align Acropora pulchra WGBS reads using Bismark on the Hyak cluster, trimmed FastQ files were prepared, and a fastq_pairs.txt file was generated. The alignment process was initiated by submitting a SLURM array job using sbatch –array=1-$(wc -l < ../output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC/fastq_pairs.txt) 02.20-D-Apul-WGBS-alignment-SLURM-job.sh, which allowed each node to process a unique FastQ pair. The job script (02.20-D-Apul-WGBS-alignment-SLURM_array-bismark.sh) configured directories, allocated computational resources (bismark_threads=5), extracted FastQ filenames, and executed Bismark with Bowtie2 using the parameters –score_min L,0,-0.6 –non_directional –gzip -p 5. The SLURM script (02.20-D-Apul-WGBS-alignment-SLURM-job.sh) defined job parameters (#SBATCH –cpus-per-task=20 –mem=100G –time=72:00:00) and executed the containerized Bismark workflow via Apptainer.
Taking BAMS to tables
Deduplicated
find /gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark/*.bam | \
xargs -n 1 basename -s _R1_001.fastp-trim_bismark_bt2_pe.bam | \
parallel -j 8 deduplicate_bismark \
\
--bam \
--paired \
--output_dir ../output/15-Apul-bismark /gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark//{}_R1_001.fastp-trim_bismark_bt2_pe.bam
Sorted
find ../output/15-Apul-bismark/*deduplicated.bam | \
xargs basename -s _R1_001.fastp-trim_bismark_bt2_pe.deduplicated.bam | \
xargs -I{} samtools \
--threads 24 \
sort \
../output/15-Apul-bismark/{}_R1_001.fastp-trim_bismark_bt2_pe.deduplicated.bam -o ../output/15-Apul-bismark/{}.sorted.bam
Methylation Extraction
find ../output/15-Apul-bismark/*deduplicated.bam | xargs -n 1 -I{} \
--bedGraph --counts --comprehensive --merge_non_CpG \
bismark_methylation_extractor --buffer_size 75% --output ../output/15-Apul-bismark "{}" --multicore 24
Methylation call
find ../output/15-Apul-bismark/*deduplicated.bismark.cov.gz | \
xargs -n 1 basename -s _pe.deduplicated.bismark.cov.gz | \
parallel -j 24 coverage2cytosine \
\
--genome_folder /gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/data/ \
-o ../output/15-Apul-bismark/{} \
--merge_CpG \
--zero_based ../output/15-Apul-bismark/{}_pe.deduplicated.bismark.cov.gz
Rename cov files
# Read the CSV file and create an associative array - hand edited 3 filenames first
declare -A sample_map
while IFS=, read -r col1 sample_name col3 col4 azenta_sample_name col6; do
sample_map["$azenta_sample_name"]="$sample_name"
done < ../../M-multi-species/data/e5_DNA_Azenta_metadata.csv
cd ../output/15.5-Apul-bismark/
# Iterate over all *.cov files and rename them
for file in *.cov; do
# Extract the portion before the first underscore
prefix="${file%%_*}"
if [[ -n "${sample_map[$prefix]}" ]]; then
new_file="${file//$prefix/${sample_map[$prefix]}}"
mv "$file" "$new_file"
fi
done
Made bedgraph
cd ../output/15.5-Apul-bismark/
for f in *merged_CpG_evidence.cov
do
STEM=$(basename "${f}" _R1_001.fastp-trim_bismark_bt2.CpG_report.merged_CpG_evidence.cov)
cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
> "${STEM}"_10x.bedgraph
done
Created modified tables
for file in ../output/15.5-Apul-bismark/*10x.bedgraph; do
awk '{print "CpG_"_$1"_"$2, $4}' "$file" > "${file%.bedgraph}_processed.txt"
done
Now there are files for each of the Apul samples @ https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/