Liftoff Annotation of Lean & Siscowet Lake Trout Ecotype Assemblies

Overview

Quick write-up of the Liftoff-only structural annotation of the two ecotype-specific PacBio HiFi assemblies (lean, siscowet). This is Track A of code/20-ecotype-genomes.md; the full executable workflow is in code/20.1-liftoff-annotation.Rmd. This post just records what was run and what came out.

Both assemblies come from the same species as the annotated NCBI reference GCF_016432855.1 (SaNama_1.0), so rather than predicting genes de novo I transferred the reference annotation directly onto each assembly with Liftoff (Shumate & Salzberg 2021). Lifted genes inherit the existing functional annotation in analyses/18-annotation/gene_function_table.tsv through the shared gene-XXX ID, so no new functional computation is needed.

No repeat masking or de novo prediction (BRAKER3) here; that’s the optional Track B for later.

Outputs are large and not committed to GitHub. They live on Gannet: https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/project-lake-trout/output/20.1-liftoff-annotation/

Inputs

Input	Path
Lean assembly	`data/genome/pb-hifiasm-lean-assembly.fa`
Siscowet assembly	`data/genome/pb-hifiasm-siscowet-assembly.fa`
Reference FASTA	`data/genome/ref-GCF_016432855.1/GCF_016432855.1_SaNama_1.0_genomic.fna`
Reference GFF	`data/genome/ref-GCF_016432855.1/genomic.gff`
Functional table	`analyses/18-annotation/gene_function_table.tsv`

Tools: NCBI datasets, liftoff + minimap2 (liftoff_env conda env), gffread 0.12.7, samtools 1.12. Run with 40 threads.

Workflow

1. Download the reference annotation

The reference FASTA and GFF3 for GCF_016432855.1 (SaNama_1.0) are the source from which Liftoff transfers the annotation. This step runs once and is skipped if the files are already present.

cd "${ref_dir}"
"${datasets}" download genome accession "${ref_accession}" \
  --exclude-rna \
  --exclude-protein \
  --exclude-genomic-cds \
  --filename "${ref_accession}.zip"
unzip -o "${ref_accession}.zip"
cp ncbi_dataset/data/"${ref_accession}"/*_genomic.fna "${ref_fasta}"
cp ncbi_dataset/data/"${ref_accession}"/genomic.gff   "${ref_gff}"
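
The "skipped if already present" behaviour is not visible in the commands above. A minimal sketch of such a guard, assuming the reference counts as present when both files exist and are non-empty. The helper name `ref_is_present` is hypothetical, and the actual check in code/20.1-liftoff-annotation.Rmd may differ.

```bash
# Hypothetical helper: succeeds only if both reference files exist and are non-empty.
ref_is_present() {
  [ -s "$1" ] && [ -s "$2" ]
}

# Demo on throwaway paths so the sketch runs without touching real data.
demo_dir=$(mktemp -d)
demo_fasta="${demo_dir}/ref_genomic.fna"
demo_gff="${demo_dir}/genomic.gff"

if ref_is_present "$demo_fasta" "$demo_gff"; then
  echo "reference present, skipping download"
else
  echo "reference missing, would run the datasets download step"
fi
```

In the real workflow the `else` branch would hold the `datasets download` / `unzip` / `cp` commands shown above, called with `"${ref_fasta}"` and `"${ref_gff}"`.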

2. Run Liftoff per ecotype

For each ecotype the reference annotation is transferred onto the assembly. Key flags:

-g source GFF, -p threads
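-polish re-aligns exons where the lift-over lost a start or stop codon or introduced an in-frame stop codon (writes <out>_polished)
-copies looks for additional gene copies (expansions) in the target; -sc 0.95 requires at least 95% exon/CDS sequence identity for a copy to be reported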
-u records reference genes that could not be lifted (ecotype-specific loss / PAV candidates)

cd "${output_dir}"
samtools faidx "${lean_fasta}"
"${liftoff}" \
  "${lean_fasta}" \
  "${ref_fasta}" \
  -g "${ref_gff}" \
  -o lean.liftoff.gff3 \
  -u lean.unmapped_features.txt \
  -dir lean_intermediate \
  -m "${minimap2}" \
  -p "${threads}" \
  -polish \
  -copies -sc 0.95 \
  > lean.liftoff.log 2>&1

The siscowet run is identical with siscowet.* substituted in.
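
Since the two runs differ only in the ecotype name, they can be driven from one loop. A sketch under stated assumptions: the assembly paths follow the `pb-hifiasm-<eco>-assembly.fa` pattern from the Inputs table, `liftoff` and `minimap2` are on `PATH` unless overridden, and the `DRY_RUN` switch and `run_liftoff` function are hypothetical additions so the command can be inspected before anything is executed. The `cd` and `samtools faidx` steps are left out for brevity.

```bash
# Paths from the Inputs table (relative to the repo root; use absolute paths
# if running from the output directory).
ref_fasta="data/genome/ref-GCF_016432855.1/GCF_016432855.1_SaNama_1.0_genomic.fna"
ref_gff="data/genome/ref-GCF_016432855.1/genomic.gff"
liftoff="${liftoff:-liftoff}"      # assumption: on PATH unless overridden
minimap2="${minimap2:-minimap2}"   # assumption: on PATH unless overridden
threads="${threads:-40}"
DRY_RUN="${DRY_RUN-1}"             # default: print only; set DRY_RUN= (empty) to execute

run_liftoff() {
  eco="$1"
  # ${DRY_RUN:+echo} expands to "echo" in dry-run mode and to nothing otherwise.
  ${DRY_RUN:+echo} "$liftoff" \
    "data/genome/pb-hifiasm-${eco}-assembly.fa" \
    "$ref_fasta" \
    -g "$ref_gff" \
    -o "${eco}.liftoff.gff3" \
    -u "${eco}.unmapped_features.txt" \
    -dir "${eco}_intermediate" \
    -m "$minimap2" \
    -p "$threads" \
    -polish \
    -copies -sc 0.95
}

for eco in lean siscowet; do
  if [ -n "$DRY_RUN" ]; then
    run_liftoff "$eco"                               # just print the command
  else
    run_liftoff "$eco" > "${eco}.liftoff.log" 2>&1   # real run, logged per ecotype
  fi
done
```

With the default dry-run setting this prints one full Liftoff command line per ecotype, which makes it easy to confirm that every output name carries the right prefix.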

3. Derive genes-only BED + protein FASTA

-polish writes *.gff3_polished; prefer that when present. A genes-only BED is built to match the repo’s data/...genes.bed convention (chrom start end gene-ID 0 strand), and proteins are extracted with gffread.

cd "${output_dir}"
LEAN_GFF=lean.liftoff.gff3_polished
[ -s "$LEAN_GFF" ] || LEAN_GFF=lean.liftoff.gff3
awk 'BEGIN{OFS="\t"} $3=="gene"{
  id=""; n=split($9,a,";");
  for(i=1;i<=n;i++){ if(a[i]~/^ID=/){ sub(/^ID=/,"",a[i]); id=a[i] } }
  print $1, $4-1, $5, id, 0, $7
}' "$LEAN_GFF" > lean.liftoff.genes.bed
"${gffread}" -y lean.liftoff.proteins.faa -g "${lean_fasta}" "$LEAN_GFF"

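The polished-file fallback and the GFF-to-BED coordinate shift from this step can be checked on a toy file. GFF3 coordinates are 1-based and inclusive while BED is 0-based and half-open, which is why only the start is decremented. The sketch below wraps the same logic in two hypothetical functions (`pick_gff`, `gff_genes_to_bed`) and runs them on a made-up two-line GFF3; the contig and gene names are invented. It splits on tabs explicitly, a small change from the command above that keeps column 9 intact when attributes contain spaces.

```bash
# Echo <prefix>.liftoff.gff3_polished if it exists and is non-empty, else <prefix>.liftoff.gff3.
pick_gff() {
  if [ -s "$1.liftoff.gff3_polished" ]; then
    echo "$1.liftoff.gff3_polished"
  else
    echo "$1.liftoff.gff3"
  fi
}

# Gene features only -> BED6 (chrom, start-1, end, gene ID, 0, strand).
gff_genes_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" } $3 == "gene" {
    id = ""; n = split($9, a, ";")
    for (i = 1; i <= n; i++) { if (a[i] ~ /^ID=/) { sub(/^ID=/, "", a[i]); id = a[i] } }
    print $1, $4 - 1, $5, id, 0, $7
  }' "$1"
}

# Toy input in a scratch directory (all names hypothetical).
demo_dir=$(mktemp -d)
printf 'ctg1\tLiftoff\tgene\t101\t200\t.\t+\t.\tID=gene-demo1;Name=demo1\n'        >  "$demo_dir/lean.liftoff.gff3"
printf 'ctg1\tLiftoff\tmRNA\t101\t200\t.\t+\t.\tID=rna-demo1;Parent=gene-demo1\n' >> "$demo_dir/lean.liftoff.gff3"

gff=$(pick_gff "$demo_dir/lean")   # no polished file here, so the plain GFF3 is used
gff_genes_to_bed "$gff" > "$demo_dir/lean.liftoff.genes.bed"
cat "$demo_dir/lean.liftoff.genes.bed"   # ctg1  100  200  gene-demo1  0  +  (tab-separated)

# Sanity check: one BED line per gene feature.
n_gene=$(awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' "$gff")
n_bed=$(awk 'END { print NR }' "$demo_dir/lean.liftoff.genes.bed")
if [ "$n_gene" -eq "$n_bed" ]; then
  echo "gene count matches BED line count: $n_bed"
fi
```

The same count comparison is a cheap check to run on the real lean and siscowet files after the BED is written.
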
4. Inherit functional annotation

Lifted gene IDs are the reference gene-XXX IDs, which are the join key in the existing gene_function_table.tsv. A left join carries symbol / product / GO onto each ecotype’s lifted genes with no new computation.

func <- readr::read_tsv(func_table, show_col_types = FALSE)
bed_cols <- c("chrom", "start", "end", "gene_id", "score", "strand")
genes <- readr::read_tsv(bed_path, col_names = bed_cols, show_col_types = FALSE)
out <- genes %>%
  dplyr::select(gene_id, eco_chrom = chrom, eco_start = start,
                eco_end = end, eco_strand = strand) %>%
  dplyr::left_join(func, by = "gene_id")
readr::write_tsv(out, out_path)
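
For a quick check outside R, the same left join can be approximated with awk, along with a count of lifted genes that find no row in the function table. This is only a sketch on made-up data: it assumes the function table has a header row with `gene_id` in the first column (the real column layout is not shown in this post), and the function name `left_join_function` is hypothetical. Unmatched IDs could arise, for example, if extra copies reported by `-copies` carry a modified ID.

```bash
demo_dir=$(mktemp -d)
# Hypothetical miniature function table: header row, gene_id in column 1.
printf 'gene_id\tsymbol\tproduct\n'           >  "$demo_dir/func.tsv"
printf 'gene-demo1\tdemo1\tdemo protein 1\n'  >> "$demo_dir/func.tsv"
printf 'gene-demo2\tdemo2\tdemo protein 2\n'  >> "$demo_dir/func.tsv"
# Hypothetical genes-only BED; gene-demo3 has no row in the function table.
printf 'ctg1\t100\t200\tgene-demo1\t0\t+\n'   >  "$demo_dir/genes.bed"
printf 'ctg2\t50\t90\tgene-demo3\t0\t-\n'     >> "$demo_dir/genes.bed"

# Left join: keep every BED gene, append annotation columns or NA placeholders.
left_join_function() {   # usage: left_join_function <func.tsv> <genes.bed>
  awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR {
      if (FNR == 1) {                 # header: build one NA per annotation column
        for (i = 2; i <= NF; i++) na = na (i > 2 ? OFS : "") "NA"
      } else {                        # data row: store columns 2..NF under the gene ID
        rest = $2
        for (i = 3; i <= NF; i++) rest = rest OFS $i
        ann[$1] = rest
      }
      next
    }
    { print $4, $1, $2, $3, $6, (($4 in ann) ? ann[$4] : na) }' "$1" "$2"
}

left_join_function "$demo_dir/func.tsv" "$demo_dir/genes.bed" > "$demo_dir/joined.tsv"
cat "$demo_dir/joined.tsv"

# Count lifted genes whose ID is absent from the function table.
n_unmatched=$(awk -F'\t' 'NR == FNR { if (FNR > 1) seen[$1] = 1; next }
                          !($4 in seen) { n++ } END { print n + 0 }' \
              "$demo_dir/func.tsv" "$demo_dir/genes.bed")
echo "lifted genes without a function row: $n_unmatched"
```

The output has no header; columns follow the order used in the R chunk (gene ID, ecotype chrom, start, end, strand, then the function columns).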

Results

Liftoff transferred the SaNama_1.0 annotation onto both ecotype assemblies, leaving only about 1–1.6% of reference features unmapped.

Metric	Lean	Siscowet
Lifted genes	79,535	82,691
mRNAs	67,707	71,789
Proteins extracted	67,888	72,050
Unmapped reference features	1,294	1,073
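
One plausible way to recount these metrics from the output files is sketched below. How the figures in the table were actually produced is not stated here, so the definitions are assumptions: gene and mRNA counts are taken from column 3 of the GFF3, proteins from FASTA header lines, and unmapped features from non-empty lines in the `-u` file. The function name `count_metrics` and the demo files are hypothetical.

```bash
count_metrics() {   # usage: count_metrics <gff3> <proteins.faa> <unmapped.txt>
  genes=$(awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' "$1")
  mrnas=$(awk -F'\t' '$3 == "mRNA" { n++ } END { print n + 0 }' "$1")
  prots=$(awk '/^>/ { n++ } END { print n + 0 }' "$2")
  unmapped=$(awk 'NF { n++ } END { print n + 0 }' "$3")
  printf 'Lifted genes\t%s\n' "$genes"
  printf 'mRNAs\t%s\n' "$mrnas"
  printf 'Proteins extracted\t%s\n' "$prots"
  printf 'Unmapped reference features\t%s\n' "$unmapped"
}

# Toy inputs (all names made up): 2 genes, 1 mRNA, 1 protein, 1 unmapped ID.
demo_dir=$(mktemp -d)
{
  printf '##gff-version 3\n'
  printf 'ctg1\tLiftoff\tgene\t101\t200\t.\t+\t.\tID=gene-demo1\n'
  printf 'ctg1\tLiftoff\tmRNA\t101\t200\t.\t+\t.\tID=rna-demo1;Parent=gene-demo1\n'
  printf 'ctg2\tLiftoff\tgene\t51\t90\t.\t-\t.\tID=gene-demo2\n'
} > "$demo_dir/demo.gff3"
printf '>rna-demo1\nMKT\n' > "$demo_dir/demo.faa"
printf 'gene-demo9\n'      > "$demo_dir/demo.unmapped.txt"

count_metrics "$demo_dir/demo.gff3" "$demo_dir/demo.faa" "$demo_dir/demo.unmapped.txt"
```

On the real data the three arguments would be, for example, `lean.liftoff.gff3_polished`, `lean.liftoff.proteins.faa`, and `lean.unmapped_features.txt`.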

The unmapped feature lists are the immediate ecotype-specific loss / PAV candidates and feed directly into the differential-PAV checks.
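
A first pass at differential candidates is a set comparison of the two unmapped lists. The sketch below assumes each `*.unmapped_features.txt` file holds one feature ID per line, and uses made-up IDs in a scratch directory. A feature unmapped in only one ecotype is just a candidate: it may reflect an assembly gap rather than a true loss.

```bash
LC_ALL=C; export LC_ALL   # byte-wise ordering so sort and comm agree

demo_dir=$(mktemp -d)
# Hypothetical unmapped lists, one feature ID per line.
printf 'gene-demoA\ngene-demoB\n' > "$demo_dir/lean.unmapped_features.txt"
printf 'gene-demoC\ngene-demoB\n' > "$demo_dir/siscowet.unmapped_features.txt"

# comm needs sorted, de-duplicated input.
sort -u "$demo_dir/lean.unmapped_features.txt"     > "$demo_dir/lean.sorted"
sort -u "$demo_dir/siscowet.unmapped_features.txt" > "$demo_dir/siscowet.sorted"

# -23: only in the first file; -13: only in the second; -12: in both.
comm -23 "$demo_dir/lean.sorted" "$demo_dir/siscowet.sorted" > "$demo_dir/unmapped_lean_only.txt"
comm -13 "$demo_dir/lean.sorted" "$demo_dir/siscowet.sorted" > "$demo_dir/unmapped_siscowet_only.txt"
comm -12 "$demo_dir/lean.sorted" "$demo_dir/siscowet.sorted" > "$demo_dir/unmapped_both.txt"

for f in unmapped_lean_only unmapped_siscowet_only unmapped_both; do
  printf '%s: %s\n' "$f" "$(tr '\n' ' ' < "$demo_dir/$f.txt")"
done
```

Features unmapped in both ecotypes are less likely to be ecotype-specific and are worth separating from the two "only" lists.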

Deliverables

Per ecotype, written to output/20.1-liftoff-annotation/. Each file can be downloaded from the Gannet directory linked in the Overview by appending the file name, with {eco} replaced by lean or siscowet.

File	Description
`{eco}.liftoff.gff3_polished`	Lifted gene models on the ecotype assembly
`{eco}.liftoff.genes.bed`	Genes-only BED
`{eco}.liftoff.proteins.faa`	Predicted proteins
`{eco}.liftoff.gene_function_table.tsv`	Lifted genes joined to symbol/product/GO
`{eco}.unmapped_features.txt`	Reference genes not lifted (PAV / loss candidates)
`{eco}.liftoff.log`	Full Liftoff run log
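
The per-ecotype download URLs follow directly from the Gannet directory and the file names in the table. A small sketch that prints them; the function name `deliverable_urls` is hypothetical, and nothing is downloaded unless the output is piped to a downloader.

```bash
base_url="https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/project-lake-trout/output/20.1-liftoff-annotation"

# Print one URL per deliverable for the given ecotype (lean or siscowet).
deliverable_urls() {
  eco="$1"
  for suffix in \
      liftoff.gff3_polished \
      liftoff.genes.bed \
      liftoff.proteins.faa \
      liftoff.gene_function_table.tsv \
      unmapped_features.txt \
      liftoff.log; do
    printf '%s/%s.%s\n' "$base_url" "$eco" "$suffix"
  done
}

deliverable_urls lean
```

To fetch the files, the list could be piped to a downloader, for example `deliverable_urls siscowet | xargs -n 1 curl -O`, assuming `curl` is installed and the server is reachable.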

Each output also has a sibling .md5 checksum file in the same directory.
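
After downloading, the checksums can be verified in one pass. The sketch assumes the sibling files are named `<file>.md5`, are in the standard `md5sum` output format, and refer to file names relative to their own directory; none of that is confirmed in this post. It also assumes GNU `md5sum` is available. The function name `verify_md5` and the demo file are hypothetical.

```bash
# Check every *.md5 file in a directory; returns non-zero if any check fails.
verify_md5() {   # usage: verify_md5 <dir>
  (
    cd "$1" || exit 1
    status=0
    for sum in *.md5; do
      [ -e "$sum" ] || continue          # no .md5 files present
      md5sum -c "$sum" || status=1
    done
    exit "$status"
  )
}

# Demo: create a small file plus its checksum in a scratch directory, then verify.
demo_dir=$(mktemp -d)
printf 'ctg1\t100\t200\tgene-demo1\t0\t+\n' > "$demo_dir/lean.liftoff.genes.bed"
( cd "$demo_dir" && md5sum lean.liftoff.genes.bed > lean.liftoff.genes.bed.md5 )

if verify_md5 "$demo_dir"; then
  echo "all checksums match"
else
  echo "checksum mismatch detected"
fi
```

Running the check from inside the directory matters because `md5sum -c` resolves the file names stored in each checksum file relative to the current directory.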

Next steps

These per-ecotype gene models feed the ecotype-coordinate version of code/13.3-hifiasm-differential-methylation-plan.md and let the PAV results in code/15-diff-pav.py be checked against per-ecotype gene models. Optional Track B (repeat masking + BRAKER3 de novo prediction) remains for later.

References

Shumate, A. & Salzberg, S.L. (2021). Liftoff: accurate mapping of gene annotations. Bioinformatics 37(12):1639–1643.

Author: Steven Roberts (ORCID 0000-0001-8302-1138), Professor, UW - School of Aquatic and Fishery Sciences, https://robertslab.info. Date: 06-18-2026. Categories: Genomics, Computing. AI use: Level 2 (AI-assisted drafting or coding).