Liftoff Annotation of Lean & Siscowet Lake Trout Ecotype Assemblies

Transferring the SaNama_1.0 annotation onto two PacBio HiFi ecotype assemblies with Liftoff
Genomics
Computing
Author
Affiliation

Steven Roberts

Published

June 18, 2026

AI Use Level 2: AI-assisted drafting or coding

Overview

Quick write-up of the Liftoff-only structural annotation of the two ecotype-specific PacBio HiFi assemblies (lean, siscowet). This is Track A of code/20-ecotype-genomes.md; the full executable workflow is in code/20.1-liftoff-annotation.Rmd. This post just records what was run and what came out.

Both assemblies are the same species as the annotated NCBI reference GCF_016432855.1 (SaNama_1.0), so rather than predicting genes de novo I transferred the reference annotation directly onto each assembly with Liftoff (Shumate & Salzberg 2021). Lifted genes inherit the existing functional annotation in analyses/18-annotation/gene_function_table.tsv through the shared gene-XXX ID — no new functional computation needed.

No repeat masking or de novo prediction (BRAKER3) here; that’s the optional Track B for later.

Outputs are large and not committed to GitHub. They live on Gannet: https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/project-lake-trout/output/20.1-liftoff-annotation/

Inputs

Input Path
Lean assembly data/genome/pb-hifiasm-lean-assembly.fa
Siscowet assembly data/genome/pb-hifiasm-siscowet-assembly.fa
Reference FASTA data/genome/ref-GCF_016432855.1/GCF_016432855.1_SaNama_1.0_genomic.fna
Reference GFF data/genome/ref-GCF_016432855.1/genomic.gff
Functional table analyses/18-annotation/gene_function_table.tsv

Tools: NCBI datasets, liftoff + minimap2 (liftoff_env conda env), gffread 0.12.7, samtools 1.12. Run with 40 threads.

Workflow

1. Download the reference annotation

The reference FASTA and GFF3 for GCF_016432855.1 (SaNama_1.0) are the source annotation Liftoff transfers from. Run once; skipped if already present.

cd "${ref_dir}"
"${datasets}" download genome accession "${ref_accession}" \
  --exclude-rna \
  --exclude-protein \
  --exclude-genomic-cds \
  --filename "${ref_accession}.zip"
unzip -o "${ref_accession}.zip"
cp ncbi_dataset/data/"${ref_accession}"/*_genomic.fna "${ref_fasta}"
cp ncbi_dataset/data/"${ref_accession}"/genomic.gff   "${ref_gff}"

2. Run Liftoff per ecotype

For each ecotype the reference annotation is transferred onto the assembly. Key flags:

  • -g source GFF, -p threads
  • -polish re-aligns to fix gene/exon boundaries (writes <out>_polished)
  • -copies -sc 0.95 reports additional gene copies (expansions) at ≥95% identity
  • -u records reference genes that could not be lifted (ecotype-specific loss / PAV candidates)
cd "${output_dir}"
samtools faidx "${lean_fasta}"
"${liftoff}" \
  "${lean_fasta}" \
  "${ref_fasta}" \
  -g "${ref_gff}" \
  -o lean.liftoff.gff3 \
  -u lean.unmapped_features.txt \
  -dir lean_intermediate \
  -m "${minimap2}" \
  -p "${threads}" \
  -polish \
  -copies -sc 0.95 \
  > lean.liftoff.log 2>&1

The siscowet run is identical with siscowet.* substituted in.

3. Derive genes-only BED + protein FASTA

-polish writes *.gff3_polished; prefer that when present. A genes-only BED is built to match the repo’s data/...genes.bed convention (chrom start end gene-ID 0 strand), and proteins are extracted with gffread.

cd "${output_dir}"
LEAN_GFF=lean.liftoff.gff3_polished
[ -s "$LEAN_GFF" ] || LEAN_GFF=lean.liftoff.gff3
awk 'BEGIN{OFS="\t"} $3=="gene"{
  id=""; n=split($9,a,";");
  for(i=1;i<=n;i++){ if(a[i]~/^ID=/){ sub(/^ID=/,"",a[i]); id=a[i] } }
  print $1, $4-1, $5, id, 0, $7
}' "$LEAN_GFF" > lean.liftoff.genes.bed
"${gffread}" -y lean.liftoff.proteins.faa -g "${lean_fasta}" "$LEAN_GFF"

4. Inherit functional annotation

Lifted gene IDs are the reference gene-XXX IDs, which are the join key in the existing gene_function_table.tsv. A left join carries symbol / product / GO onto each ecotype’s lifted genes with no new computation.

func <- readr::read_tsv(func_table, show_col_types = FALSE)
bed_cols <- c("chrom", "start", "end", "gene_id", "score", "strand")
genes <- readr::read_tsv(bed_path, col_names = bed_cols, show_col_types = FALSE)
out <- genes %>%
  dplyr::select(gene_id, eco_chrom = chrom, eco_start = start,
                eco_end = end, eco_strand = strand) %>%
  dplyr::left_join(func, by = "gene_id")
readr::write_tsv(out, out_path)

Results

Liftoff transferred the SaNama_1.0 annotation onto both ecotype assemblies with very low loss (~1–1.6% of reference features unmapped).

Metric Lean Siscowet
Lifted genes 79,535 82,691
mRNAs 67,707 71,789
Proteins extracted 67,888 72,050
Unmapped reference features 1,294 1,073

The unmapped feature lists are the immediate ecotype-specific loss / PAV candidates and feed directly into the differential-PAV checks.

Deliverables

Per ecotype, written to output/20.1-liftoff-annotation/. Replace the placeholder URLs below with the final Gannet links.

File Description Link
{eco}.liftoff.gff3_polished Lifted gene models on the ecotype assembly lean · siscowet
{eco}.liftoff.genes.bed Genes-only BED lean · siscowet
{eco}.liftoff.proteins.faa Predicted proteins lean · siscowet
{eco}.liftoff.gene_function_table.tsv Lifted genes joined to symbol/product/GO lean · siscowet
{eco}.unmapped_features.txt Reference genes not lifted (PAV / loss candidates) lean · siscowet
{eco}.liftoff.log Full Liftoff run log lean · siscowet

Each output also has a sibling .md5 checksum file in the same directory.

Next steps

These per-ecotype gene models feed the ecotype-coordinate version of code/13.3-hifiasm-differential-methylation-plan.md and let the PAV results in code/15-diff-pav.py be checked against per-ecotype gene models. Optional Track B (repeat masking + BRAKER3 de novo prediction) remains for later.

References

  • Shumate, A. & Salzberg, S.L. (2021). Liftoff: accurate mapping of gene annotations. Bioinformatics 37(12):1639–1643.