From Scaffolds to Chromosomes: GBResequence Workflows Explained

Introduction

GBResequence is a workflow-focused approach for reorganizing and ordering genomic scaffolds into chromosome-scale assemblies by leveraging conserved synteny, reference genomes, long-range linkage data, and manual curation. This article outlines core concepts, typical input data, stepwise workflows, common tools, validation strategies, and practical tips for producing reliable chromosome-level sequences from scaffolded assemblies.

Typical Inputs and Outputs

Inputs
- Draft assembly scaffolds (FASTA)
- Reference genome(s) from a related species, or a high-quality assembly of the same species
- Long-range linking data: Hi-C, optical maps, linked reads, genetic maps
- Gene annotation or conserved marker sets (e.g., BUSCO loci)
Outputs
- Ordered and oriented pseudochromosomes (FASTA)
- AGP files describing scaffold placement
- Updated annotation liftover or coordinate maps
- Assembly metadata and quality reports

Core Concepts

Scaffold ordering: Determining relative positions of scaffolds along chromosomes using synteny or linkage information.
Orientation: Assigning forward/reverse orientation per scaffold.
Gap sizing: Estimating inter-scaffold gap lengths; gaps of unknown length are usually represented by a fixed-length run of Ns (100 by convention in AGP files).
Conflict resolution: Reconciling discrepancies between different linkage sources.
Manual curation: Visual inspection (e.g., contact maps) to correct misjoins or misassemblies.

Step-by-Step GBResequence Workflow

Preparation
- Assess assembly quality: Run QUAST for contiguity metrics and BUSCO for gene-space completeness.
- Index files: Create sequence indexes (samtools faidx, bwa index) and build mapping indices for long reads or Hi-C aligners.
Reference alignment and synteny mapping
- Whole-genome alignment: Use minimap2, nucmer (MUMmer) or lastz to align scaffolds to reference.
- Synteny blocks: Extract syntenic regions using tools like Satsuma, SyMAP, or MCScanX.
- Preliminary ordering: Generate scaffold order proposals from synteny chains.
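
The preliminary ordering step can be sketched as follows. This is a simplified heuristic, not the method any particular tool uses. It assumes alignments in minimap2's PAF format (produced when `-a` is omitted), assigns each scaffold to the reference sequence with the most aligned bases, orients it by majority strand, and orders scaffolds by the length-weighted mean of their target midpoints. The function name `propose_order` is hypothetical.

```shell
#!/bin/sh
# Sketch: propose scaffold order and orientation from PAF alignments.
# PAF columns used: 1=query name, 5=strand, 6=target name,
# 8=target start, 9=target end, 11=alignment block length.
propose_order() {
  awk -F'\t' '
    {
      key = $1 SUBSEP $6
      bases[key] += $11                      # aligned bases per scaffold/target pair
      if ($5 == "+") plus[key] += $11; else minus[key] += $11
      mid[key] += (($8 + $9) / 2) * $11      # length-weighted sum of target midpoints
    }
    END {
      # For each scaffold, keep the target with the most aligned bases.
      for (key in bases) {
        split(key, p, SUBSEP)
        if (bases[key] > best[p[1]]) { best[p[1]] = bases[key]; chrom[p[1]] = p[2] }
      }
      for (s in chrom) {
        key = s SUBSEP chrom[s]
        strand = (plus[key] >= minus[key]) ? "+" : "-"
        printf "%s\t%d\t%s\t%s\n", chrom[s], mid[key] / bases[key], s, strand
      }
    }
  ' "$@" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}

# Output columns: target, approximate position, scaffold, orientation.
# Example (hypothetical file name): propose_order aln.paf
```

Scaffolds with alignments split across several reference sequences are exactly the cases worth checking by hand, since this heuristic silently picks the majority target.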
Long-range data integration
- Hi-C contact maps: Map Hi-C reads (Juicer/BWA), build contact matrices (Juicer, HiCExplorer), and inspect them in Juicebox.
- Optical maps/linked reads: Align maps to scaffolds and derive ordering constraints.
- Genetic maps: Use marker positions to anchor scaffolds to linkage groups.
- Combine constraints: Use ALLMAPS to merge multiple maps into one ordering, or RagTag (the successor to RaGOO) to reconcile reference-guided scaffolding with other evidence.
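
Before merging evidence, it can help to list where the sources disagree. The sketch below is only a consistency check, not the weighted merging that tools like ALLMAPS perform. It assumes a hypothetical three-column table (scaffold, evidence source, assigned chromosome) that you would have to derive from each data type yourself; the function name `flag_conflicts` is likewise made up for illustration.

```shell
#!/bin/sh
# Sketch: flag scaffolds whose chromosome assignment differs between evidence sources.
# Input (TAB-separated): scaffold, source label, assigned chromosome.
flag_conflicts() {
  awk -F'\t' '
    {
      if (!($1 in first)) { first[$1] = $3; order[++n] = $1; status[$1] = "consistent" }
      else if ($3 != first[$1]) status[$1] = "conflict"
      sources[$1] = sources[$1] (sources[$1] == "" ? "" : ",") $2 "=" $3
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%s\t%s\n", order[i], status[order[i]], sources[order[i]]
    }
  ' "$@"
}

# Small demonstration with made-up assignments.
printf 'scaf1\tsynteny\tchr1\nscaf1\thic\tchr1\nscaf2\tsynteny\tchr2\nscaf2\tgenetic_map\tchr3\n' | flag_conflicts
```

Scaffolds reported as `conflict` are candidates for manual review in the contact map rather than automatic placement.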
Scaffolding and gap handling
- Scaffold placement: Run chosen scaffolder to produce pseudochromosomes; insert Ns for estimated gaps.
- Orientation checks: Validate orientation with contact maps and alignments.
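
To make the placement step concrete, here is a minimal sketch of what a scaffolder does once order and orientation are decided: concatenate scaffolds, reverse-complementing those placed on the minus strand, with a fixed run of Ns between neighbours. The function name `build_pseudochromosome`, its argument order, and the two-column placement table are assumptions for illustration. The pure-awk string handling is far too slow for real genomes and is meant only to show the logic.

```shell
#!/bin/sh
# Sketch: join ordered, oriented scaffolds into one pseudochromosome.
# $1 = output sequence name
# $2 = scaffold FASTA
# $3 = placement table in final order (TAB-separated: scaffold, orientation)
# $4 = gap length in Ns (defaults to 100)
build_pseudochromosome() {
  awk -v name="$1" -v gap="${4:-100}" '
    function revcomp(s,   i, c, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        out = out ((c in comp) ? comp[c] : "N")   # unknown symbols become N
      }
      return out
    }
    BEGIN {
      FS = "\t"
      n = split("A C G T N a c g t n", from, " ")
      split("T G C A N t g c a n", to, " ")
      for (i = 1; i <= n; i++) comp[from[i]] = to[i]
    }
    NR == FNR {                        # first file: read FASTA into memory
      if ($0 ~ /^>/) { id = substr($0, 2); sub(/[ \t].*/, "", id) }
      else seq[id] = seq[id] $0
      next
    }
    {                                  # second file: placement rows, in order
      s = ($2 == "-") ? revcomp(seq[$1]) : seq[$1]
      if (FNR > 1) for (i = 0; i < gap; i++) chr = chr "N"
      chr = chr s
    }
    END { print ">" name; print chr }
  ' "$2" "$3"
}

# Example (hypothetical file names):
#   build_pseudochromosome chr1 scaffolds.fa chr1_placement.tsv 100 > chr1.fa
```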
Polishing and correction
- Polish sequence: Correct base errors with short reads (Pilon) or long reads (Racon, Medaka).
- Break misjoins: Based on Hi-C/optical map evidence, split incorrect joins and re-run ordering if needed.
Annotation lift-over
- Coordinate translation: Use Liftoff, CrossMap, or custom scripts to transfer gene models to new coordinates.
- Functional checks: Verify presence and integrity of conserved genes (BUSCO).
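
The coordinate arithmetic behind a lift-over can be sketched as below. This only illustrates the offset and strand logic for features that sit wholly inside one placed scaffold; it is not how Liftoff or CrossMap work internally. The function name `lift_features` and both table layouts are assumptions. Coordinates are 1-based and inclusive.

```shell
#!/bin/sh
# Sketch: translate feature coordinates from scaffold space to chromosome space.
# $1    = placement table (TAB-separated):
#         chromosome, chromosome start of scaffold, scaffold, scaffold length, orientation
# stdin = features (TAB-separated): scaffold, start, end, strand
lift_features() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { chrom[$3] = $1; off[$3] = $2; len[$3] = $4; ori[$3] = $5; next }
    !($1 in chrom) { print "unplaced: " $1 > "/dev/stderr"; next }
    ori[$1] == "+" {                   # forward placement: shift by the offset
      print chrom[$1], off[$1] + $2 - 1, off[$1] + $3 - 1, $4
      next
    }
    {                                  # reverse placement: mirror and flip strand
      strand = ($4 == "+") ? "-" : ($4 == "-" ? "+" : $4)
      print chrom[$1], off[$1] + len[$1] - $3, off[$1] + len[$1] - $2, strand
    }
  ' "$1" -
}

# Example (hypothetical file names):
#   lift_features placement.tsv < features.tsv > features.chr.tsv
```

For a scaffold of length 10 placed in reverse at chromosome position 111, a plus-strand feature at 2..4 maps to 117..119 on the minus strand.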
Validation and quality control
- Contact map inspection: Visualize Hi-C maps in Juicebox for chromosomal patterns.
- Assembly metrics: Report N50, BUSCO scores, misassembly counts (QUAST).
- Biological checks: Confirm karyotype consistency and expected chromosome counts where available.
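
As a quick sanity check alongside QUAST, N50 can be computed directly from sequence lengths. The sketch below assumes one length per line on standard input, for example the second column of a `samtools faidx` index; the function name `n50` is made up for illustration.

```shell
#!/bin/sh
# Sketch: N50 is the length L such that sequences of length >= L
# together cover at least half of the total assembly length.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        run += len[i]
        if (run * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Example (hypothetical file name):
#   cut -f2 assembly.fasta.fai | n50
```

Comparing the value before and after scaffolding gives a fast indication of how much contiguity the ordering step added.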
Documentation and release
- Produce AGP: Describe scaffold-to-chromosome placements.
- Metadata: Record data sources, tool versions, parameters, and decision notes for curation steps.
- Publish: Submit assembly and annotation to appropriate repositories (GenBank/ENA) following their submission standards.
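
The AGP step can be illustrated with a short sketch that turns a placement table into AGP 2.1 records, inserting a 100 bp gap of unknown size (type `U`) between neighbouring scaffolds. The function name `write_agp` and the four-column input layout are assumptions. The linkage evidence defaults to `align_genus`, which suits reference-guided joins; for Hi-C joins `proximity_ligation` would be the matching value.

```shell
#!/bin/sh
# Sketch: emit AGP 2.1 lines from an ordered placement table.
# stdin = TAB-separated rows in placement order:
#         chromosome, scaffold, scaffold length, orientation
# $1    = linkage evidence for gaps (defaults to align_genus)
write_agp() {
  awk -F'\t' -v OFS='\t' -v evidence="${1:-align_genus}" '
    BEGIN { gap = 100; print "##agp-version 2.1" }
    {
      if ($1 != cur) { cur = $1; pos = 1; part = 0 }   # new chromosome
      else {
        # gap record between consecutive scaffolds on the same chromosome
        print cur, pos, pos + gap - 1, ++part, "U", gap, "scaffold", "yes", evidence
        pos += gap
      }
      # component record: the whole scaffold, in the given orientation
      print cur, pos, pos + $3 - 1, ++part, "W", $2, 1, $3, $4
      pos += $3
    }
  '
}

# Small demonstration with made-up scaffolds.
printf 'chr1\tsB\t10\t-\nchr1\tsA\t20\t+\nchr2\tsC\t5\t+\n' | write_agp
```

Submission repositories validate AGP files strictly, so output from a sketch like this should still be checked with an AGP validator before release.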

Common Tools and When to Use Them

Alignment and synteny: minimap2, nucmer (MUMmer), lastz, Satsuma, MCScanX
Scaffolding and ordering: ALLMAPS, RaGOO, RagTag, ARKS/LINKS
Hi-C processing: Juicer, HiC-Pro, HiCExplorer, Juicebox
Polishing: Pilon, Racon, Medaka
Validation: QUAST, BUSCO, REAPR, KAT
Annotation liftover: Liftoff, CrossMap

Practical Tips and Pitfalls

Prefer multiple evidence types: Combine synteny and Hi-C for robust ordering.
Beware reference bias: A distant reference can cause lineage-specific rearrangements to be misplaced or masked; validate reference-guided joins against contact maps.
Document manual changes: Keep logs and make decisions reproducible.
Conservative gap sizing: Use Ns rather than speculative base-level filling unless supported by reads.
Iterate: Order → validate → correct → polish in multiple rounds.

Example Minimal Command Sequence (conceptual)

bash
# Assess completeness (choose the BUSCO lineage that matches your organism)
busco -i assembly.fasta -l embryophyta_odb10 -m genome -o busco
# Align to reference
minimap2 -ax asm5 ref.fa assembly.fasta > aln.sam
# Order with RagTag (synteny-based)
ragtag.py scaffold ref.fa assembly.fasta -o ragtag_out
# Hi-C mapping (Juicer pipeline) and visual check in Juicebox
# Polish with Racon/Pilon as needed

Conclusion

GBResequence workflows convert scaffold-level assemblies into chromosome-scale pseudochromosomes by integrating reference-guided synteny, long-range linkage data, iterative correction, and thorough validation. Combining complementary data types and documenting decisions yields high-confidence assemblies suitable for downstream analyses such as comparative genomics, population studies, and functional annotation.
