func <- readr::read_tsv(func_table, show_col_types = FALSE)
bed_cols <- c("chrom", "start", "end", "gene_id", "score", "strand")
genes <- readr::read_tsv(bed_path, col_names = bed_cols, show_col_types = FALSE)
out <- genes %>%
dplyr::select(gene_id, eco_chrom = chrom, eco_start = start,
eco_end = end, eco_strand = strand) %>%
dplyr::left_join(func, by = "gene_id")
readr::write_tsv(out, out_path)Overview
Quick write-up of the Liftoff-only structural annotation of the two ecotype-specific PacBio HiFi assemblies (lean, siscowet). This is Track A of code/20-ecotype-genomes.md; the full executable workflow is in code/20.1-liftoff-annotation.Rmd. This post just records what was run and what came out.
Both assemblies are the same species as the annotated NCBI reference GCF_016432855.1 (SaNama_1.0), so rather than predicting genes de novo I transferred the reference annotation directly onto each assembly with Liftoff (Shumate & Salzberg 2021). Lifted genes inherit the existing functional annotation in analyses/18-annotation/gene_function_table.tsv through the shared gene-XXX ID — no new functional computation needed.
No repeat masking or de novo prediction (BRAKER3) here; that’s the optional Track B for later.
Outputs are large and not committed to GitHub. They live on Gannet: https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/project-lake-trout/output/20.1-liftoff-annotation/
Inputs
| Input | Path |
|---|---|
| Lean assembly | data/genome/pb-hifiasm-lean-assembly.fa |
| Siscowet assembly | data/genome/pb-hifiasm-siscowet-assembly.fa |
| Reference FASTA | data/genome/ref-GCF_016432855.1/GCF_016432855.1_SaNama_1.0_genomic.fna |
| Reference GFF | data/genome/ref-GCF_016432855.1/genomic.gff |
| Functional table | analyses/18-annotation/gene_function_table.tsv |
Tools: NCBI datasets, liftoff + minimap2 (liftoff_env conda env), gffread 0.12.7, samtools 1.12. Run with 40 threads.
Workflow
1. Download the reference annotation
The reference FASTA and GFF3 for GCF_016432855.1 (SaNama_1.0) are the source annotation Liftoff transfers from. Run once; skipped if already present.
cd "${ref_dir}"
"${datasets}" download genome accession "${ref_accession}" \
--exclude-rna \
--exclude-protein \
--exclude-genomic-cds \
--filename "${ref_accession}.zip"
unzip -o "${ref_accession}.zip"
cp ncbi_dataset/data/"${ref_accession}"/*_genomic.fna "${ref_fasta}"
cp ncbi_dataset/data/"${ref_accession}"/genomic.gff "${ref_gff}"2. Run Liftoff per ecotype
For each ecotype the reference annotation is transferred onto the assembly. Key flags:
-gsource GFF,-pthreads-polishre-aligns to fix gene/exon boundaries (writes<out>_polished)-copies -sc 0.95reports additional gene copies (expansions) at ≥95% identity-urecords reference genes that could not be lifted (ecotype-specific loss / PAV candidates)
cd "${output_dir}"
samtools faidx "${lean_fasta}"
"${liftoff}" \
"${lean_fasta}" \
"${ref_fasta}" \
-g "${ref_gff}" \
-o lean.liftoff.gff3 \
-u lean.unmapped_features.txt \
-dir lean_intermediate \
-m "${minimap2}" \
-p "${threads}" \
-polish \
-copies -sc 0.95 \
> lean.liftoff.log 2>&1The siscowet run is identical with siscowet.* substituted in.
3. Derive genes-only BED + protein FASTA
-polish writes *.gff3_polished; prefer that when present. A genes-only BED is built to match the repo’s data/...genes.bed convention (chrom start end gene-ID 0 strand), and proteins are extracted with gffread.
cd "${output_dir}"
LEAN_GFF=lean.liftoff.gff3_polished
[ -s "$LEAN_GFF" ] || LEAN_GFF=lean.liftoff.gff3
awk 'BEGIN{OFS="\t"} $3=="gene"{
id=""; n=split($9,a,";");
for(i=1;i<=n;i++){ if(a[i]~/^ID=/){ sub(/^ID=/,"",a[i]); id=a[i] } }
print $1, $4-1, $5, id, 0, $7
}' "$LEAN_GFF" > lean.liftoff.genes.bed
"${gffread}" -y lean.liftoff.proteins.faa -g "${lean_fasta}" "$LEAN_GFF"4. Inherit functional annotation
Lifted gene IDs are the reference gene-XXX IDs, which are the join key in the existing gene_function_table.tsv. A left join carries symbol / product / GO onto each ecotype’s lifted genes with no new computation.
Results
Liftoff transferred the SaNama_1.0 annotation onto both ecotype assemblies with very low loss (~1–1.6% of reference features unmapped).
| Metric | Lean | Siscowet |
|---|---|---|
| Lifted genes | 79,535 | 82,691 |
| mRNAs | 67,707 | 71,789 |
| Proteins extracted | 67,888 | 72,050 |
| Unmapped reference features | 1,294 | 1,073 |
The unmapped feature lists are the immediate ecotype-specific loss / PAV candidates and feed directly into the differential-PAV checks.
Deliverables
Per ecotype, written to output/20.1-liftoff-annotation/. Replace the placeholder URLs below with the final Gannet links.
| File | Description | Link |
|---|---|---|
{eco}.liftoff.gff3_polished |
Lifted gene models on the ecotype assembly | lean · siscowet |
{eco}.liftoff.genes.bed |
Genes-only BED | lean · siscowet |
{eco}.liftoff.proteins.faa |
Predicted proteins | lean · siscowet |
{eco}.liftoff.gene_function_table.tsv |
Lifted genes joined to symbol/product/GO | lean · siscowet |
{eco}.unmapped_features.txt |
Reference genes not lifted (PAV / loss candidates) | lean · siscowet |
{eco}.liftoff.log |
Full Liftoff run log | lean · siscowet |
Each output also has a sibling .md5 checksum file in the same directory.
Next steps
These per-ecotype gene models feed the ecotype-coordinate version of code/13.3-hifiasm-differential-methylation-plan.md and let the PAV results in code/15-diff-pav.py be checked against per-ecotype gene models. Optional Track B (repeat masking + BRAKER3 de novo prediction) remains for later.
References
- Shumate, A. & Salzberg, S.L. (2021). Liftoff: accurate mapping of gene annotations. Bioinformatics 37(12):1639–1643.