This is the write-up of the code/18 series in project-lake-trout. We already had differential methylation (DMRs/DMCs) and differential presence/absence variants (PAVs) for the lean vs siscowet lake trout comparison on the GCF_016432855.1 (SaNama_1.0) reference. What we did not have was any sense of which genes those signals fall in, what those genes do, and which ecotype phenotypes they could plausibly touch. Step 18 closes that gap — it does no new variant calling, it just asks where the existing signals land and reasons about consequences.
The whole plan, with the per-step status notes, lives in code/18-diff-annotation-phenotype-plan.md.
The caveat that shapes everything
The reference genome is a problem we have to lead with, not bury. GCF_016432855.1 was assembled from a female gynogenetic doubled haploid of the Seneca Lake (NY) brood stock — a lean morphotype. So every ecotype-asymmetric result is lean-reference biased: siscowet diverges more from the reference, which inflates apparent siscowet-specific PAVs and reduces siscowet CpG coverage exactly in the regions where the ecotypes differ most. Two consequences I kept in front of me the whole way:
- A “siscowet-specific deletion” is enriched for lean-present / siscowet-absent sequence — it is a candidate, not a confirmed loss in siscowet.
- Siscowet-specific vs lean-specific counts are not magnitude-comparable. The least-biased classes are siscowet-specific deletions and lean-specific insertions.
Everything below is framed as prioritized hypotheses, not findings.
Step 0 — Build a gene function table
The repo had gene coordinates and symbols but no functional annotation, so that was the one new data dependency. Rather than reach for an ortholog-mapping tool first, I pulled the NCBI RefSeq annotation for the assembly directly:
*_genomic.gff.gz— gene models and transcriptproduct=names (the join backbone)*_gene_ontology.gaf.gz— GO annotations keyed on NCBI GeneID
code/18-build-gene-function-table.py parses both and writes analyses/18-annotation/gene_function_table.tsv, keyed on the same gene-XXX IDs used in the gene BED so every downstream join is trivial. It covers all 46,359 genes — 9,738 with a real symbol, 46,231 with a product name, and 34,367 with at least one GO term. Because GO came straight from the GAF, the planned eggNOG/DIAMOND ortholog pass turned out to be unnecessary for first-pass coverage. The script QCs itself against the gene BED to confirm the join key matches.
Steps 1–2 — Assign DMRs, DMCs, and PAVs to genes
code/18.1-assign-features-to-genes.py does the positional assignment, pure stdlib (no bedtools), joining onto the Step 0 table. Two design decisions matter:
- Proximity classes, not just overlap. Regulatory methylation often sits in promoters, so each feature–gene pair is classified
promoter(TSS ± 2 kb, strand-aware) >exon>intron>upstream ≤5 kb>downstream ≤5 kb, with the gene body ± 5 kb as the outer window. - Lead with the stringent PAV set. The
stringent.siscowet_specific.deletions.bedset (3,465 deletions >100 bp, all-4-vs-none) is the high-confidence signal; the lenient lean/siscowet sets are carried only as a lower-confidence per-gene burden table, with the reference-bias caveat written into the file.
Outputs: dmr_gene_assignments.tsv, dmc_gene_assignments.tsv, pav_gene_assignments.tsv, and pav_gene_burden.tsv. Headline numbers — 149/302 DMRs within 5 kb of a gene (88 landing in promoters), and 1,543/3,465 stringent deletions near a gene, 54 of them overlapping an exon.
Step 3 — Integrate the layers + expression
code/18.2-integrate-candidates.py merges the methylation and PAV assignments into one ranked candidate table (integrated_candidate_genes.tsv, 2,036 genes) and flags convergent genes hit by both a DMR and a stringent siscowet-specific deletion — the strongest candidates (there are 4). A few judgment calls baked in:
- Liver RNAseq (
whole_tx_table.csv) is joined ongene_idfor expression context, but it is from a separate parasite study with different individuals and one tissue, so it is carried as orthogonal support, never confirmation. - Repetitive ncRNA biotypes (rRNA/snRNA/snoRNA/tRNA) get a
cautionflag and are deprioritized in the ranked view, because promoter “DMRs” there are usually rDNA-array methylation artifacts rather than gene regulation. - A simple additive
rank_scorerewards convergence, promoter DMRs, exonic deletions, expression, and methylation→expression concordance.
Step 4 — GO enrichment
code/18.3-go-enrichment.py runs a self-contained hypergeometric over-representation analysis on the DMR set, the stringent PAV set, and their union. I deliberately used the local GO annotations (from Step 0) with full DAG propagation (go-basic.obo, is_a + part_of, true-path rule) and BH-FDR within each set — an ortholog-based tool like g:Profiler would have dropped most of the symbol-less LOC genes.
The one robust result is calcium ion transport / channel enrichment in the PAV set (FDR ≈ 3e-3). The DMR enrichment is a histone + znf883 tandem-cluster artifact, and lipid/growth terms are only suggestive (FDR > 0.1). The README and the script both flag the confounders explicitly: gene-length bias (long channel genes accumulate deletions by chance), reference bias, IEA-only annotations, and gene clustering. KEGG was deferred — GO is the substantive result and KEGG needs KO mapping we don’t have yet.
Step 5 — Phenotype synthesis
The narrative report is code/18-diff-annotation-phenotype.Rmd (rendered: 18-diff-annotation-phenotype.html). It walks per-axis evidence blocks — lipid/energy, body-shape/growth/muscle, calcium/sensory-neural, immune, and a chromatin block flagged as artifact — each anchored to the morphometric phenotype data in data/measurements.xlsx, with the full causality and reference-bias caveats. The cleanest takeaways:
- A small convergent set (led by a
znf883paralog), partly reflecting repetitive mapping, so flagged rather than asserted. - The calcium-transport/channel GO signal in the PAV set (with the gene-length caveat).
- Individual lipid genes (
angptl5,mogat2) and a high-effect muscle-development promoter DMR (rbm24b, hyper-methylated in siscowet, Δ33%) that line up with the known lipid- and body-shape differences between the ecotypes.
None of these are confirmed. They are prioritized hypotheses for follow-up on ecotype-specific (hifiasm) assemblies and in matched tissue — the per-ecotype track is the needed complement to a lean-reference-only analysis.
Reproducing
Everything is plain Python 3 (stdlib only) plus a base-R Rmd; no external bioinformatics dependencies. From the repo root, after dropping the three NCBI RefSeq files into analyses/18-annotation/raw/ (the GFF, the GAF, and go-basic.obo):
python3 code/18-build-gene-function-table.py # Step 0 -> gene_function_table.tsv
python3 code/18.1-assign-features-to-genes.py # Steps 1-2 -> *_gene_assignments.tsv
python3 code/18.2-integrate-candidates.py # Step 3 -> integrated_candidate_genes.tsv
python3 code/18.3-go-enrichment.py # Step 4 -> go_enrichment_{dmr,pav,union}.tsv
# then knit code/18-diff-annotation-phenotype.Rmd for the Step 5 report + figureEach script prints its own QC to stderr (gene counts, join-key checks, hit counts, FDR tallies), so a successful run is self-verifying.