Apul Time Series Methylation

It’s in the name…
e5
methylation
coral
bismark
Author
Affiliation

Steven Roberts

Published

February 8, 2025

Here we take the first set of time series methylation data to CpG level percent methylation.

Data was trimmed

To reproduce the trimming and quality control (QC) of Acropora pulchra Whole Genome Bisulfite Sequencing (WGBS) data, obtain the raw FastQ files and use fastp to trim 25bp from the 5’ end while removing adapters, polyG, and polyA sequences; if errors occur, attempt repair with BBTools’ repair.sh before re-trimming. Evaluate trimmed reads with FastQC, then summarize results using MultiQC. The complete workflow, including commands and scripts, is available in the project repository, and output files can be accessed here.

Reads were aligned

To align Acropora pulchra WGBS reads using Bismark on the Hyak cluster, trimmed FastQ files were prepared, and a fastq_pairs.txt file was generated. The alignment process was initiated by submitting a SLURM array job using sbatch –array=1-$(wc -l < ../output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC/fastq_pairs.txt) 02.20-D-Apul-WGBS-alignment-SLURM-job.sh, which allowed each node to process a unique FastQ pair. The job script (02.20-D-Apul-WGBS-alignment-SLURM_array-bismark.sh) configured directories, allocated computational resources (bismark_threads=5), extracted FastQ filenames, and executed Bismark with Bowtie2 using the parameters –score_min L,0,-0.6 –non_directional –gzip -p 5. The SLURM script (02.20-D-Apul-WGBS-alignment-SLURM-job.sh) defined job parameters (#SBATCH –cpus-per-task=20 –mem=100G –time=72:00:00) and executed the containerized Bismark workflow via Apptainer.

Taking BAMS to tables

Deduplicated

find /gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark/*.bam | \
xargs -n 1 basename -s _R1_001.fastp-trim_bismark_bt2_pe.bam | \
parallel -j 8 deduplicate_bismark \
--bam \
--paired \
--output_dir ../output/15-Apul-bismark \
/gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark//{}_R1_001.fastp-trim_bismark_bt2_pe.bam

Sorted

find ../output/15-Apul-bismark/*deduplicated.bam | \
xargs basename -s _R1_001.fastp-trim_bismark_bt2_pe.deduplicated.bam | \
xargs -I{} samtools \
sort --threads 24 \
../output/15-Apul-bismark/{}_R1_001.fastp-trim_bismark_bt2_pe.deduplicated.bam \
-o ../output/15-Apul-bismark/{}.sorted.bam

Methylation Extraction

find ../output/15-Apul-bismark/*deduplicated.bam | xargs -n 1 -I{} \
bismark_methylation_extractor --bedGraph --counts --comprehensive --merge_non_CpG \
--multicore 24 --buffer_size 75% --output ../output/15-Apul-bismark "{}"

Methylation call

find ../output/15-Apul-bismark/*deduplicated.bismark.cov.gz | \
xargs -n 1 basename -s _pe.deduplicated.bismark.cov.gz | \
parallel -j 24 coverage2cytosine \
--genome_folder /gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/data/ \
-o ../output/15-Apul-bismark/{} \
--merge_CpG \
--zero_based \
../output/15-Apul-bismark/{}_pe.deduplicated.bismark.cov.gz

Rename cov files

# Read the CSV file and create an associative array - hand edited 3 filenames first
declare -A sample_map
while IFS=, read -r col1 sample_name col3 col4 azenta_sample_name col6; do
  sample_map["$azenta_sample_name"]="$sample_name"
done < ../../M-multi-species/data/e5_DNA_Azenta_metadata.csv

cd ../output/15.5-Apul-bismark/

# Iterate over all *.cov files and rename them
for file in *.cov; do
  # Extract the portion before the first underscore
  prefix="${file%%_*}"
  if [[ -n "${sample_map[$prefix]}" ]]; then
    new_file="${file//$prefix/${sample_map[$prefix]}}"
    mv "$file" "$new_file"
  fi
done

Made bedgraph

cd ../output/15.5-Apul-bismark/

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" _R1_001.fastp-trim_bismark_bt2.CpG_report.merged_CpG_evidence.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
  > "${STEM}"_10x.bedgraph
done

Created modified tables

for file in ../output/15.5-Apul-bismark/*10x.bedgraph; do
    awk '{print "CpG_"_$1"_"$2, $4}' "$file" > "${file%.bedgraph}_processed.txt"
done

Now there are files for each of the Apul samples @ https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/