From Scaffolds to Chromosomes: GBResequence Workflows Explained

Introduction

GBResequence is a workflow-focused approach for reorganizing and ordering genomic scaffolds into chromosome-scale assemblies by leveraging conserved synteny, reference genomes, long-range linkage data, and manual curation. This article outlines core concepts, typical input data, stepwise workflows, common tools, validation strategies, and practical tips for producing reliable chromosome-level sequences from scaffolded assemblies.

Typical Inputs and Outputs

  • Inputs
    • Draft assembly scaffolds (FASTA)
    • Reference genome(s) from a related species, or a high-quality assembly of the same species
    • Long-range linking data: Hi-C, optical maps, linked reads, genetic maps
    • Gene annotation or conserved marker sets (e.g., BUSCO loci)
  • Outputs
    • Ordered and oriented pseudochromosomes (FASTA)
    • AGP files describing scaffold placement
    • Updated annotation liftover or coordinate maps
    • Assembly metadata and quality reports
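
To make the AGP output concrete, here is a minimal Python sketch that writes AGP-style records for one pseudochromosome. The scaffold names, lengths, and the `agp_lines` helper are hypothetical, and the sketch assumes gaps of unknown size (component type U, 100 bp) with Hi-C as linkage evidence; a real submission should be checked against the AGP specification with a validator.

```python
UNKNOWN_GAP = 100  # AGP convention for a gap of unknown size (type U)

def agp_lines(chrom, placements, evidence="proximity_ligation"):
    """Yield tab-separated AGP records for one pseudochromosome.

    placements: ordered list of (scaffold_id, scaffold_length, orientation).
    """
    pos = 1    # AGP coordinates are 1-based and inclusive
    part = 1
    for i, (scaf, length, orient) in enumerate(placements):
        if i > 0:
            # Gap record between consecutive scaffolds.
            gap = [chrom, pos, pos + UNKNOWN_GAP - 1, part, "U",
                   UNKNOWN_GAP, "scaffold", "yes", evidence]
            yield "\t".join(map(str, gap))
            pos += UNKNOWN_GAP
            part += 1
        # Sequence record: the whole scaffold is placed in the given orientation.
        seq = [chrom, pos, pos + length - 1, part, "W", scaf, 1, length, orient]
        yield "\t".join(map(str, seq))
        pos += length
        part += 1

# Hypothetical layout: two scaffolds on chr1, the second reversed.
records = list(agp_lines("chr1", [("scaf1", 4000, "+"), ("scaf3", 7500, "-")]))
print("\n".join(records))
```

Each scaffold becomes a W line and each join a U line, so the file records both where every scaffold sits and which joins are inferred rather than sequenced.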

Core Concepts

  • Scaffold ordering: Determining relative positions of scaffolds along chromosomes using synteny or linkage information.
  • Orientation: Assigning forward/reverse orientation per scaffold.
  • Gap sizing: Estimating inter-scaffold gap lengths; when the size is unknown, the gap is usually written as a fixed-length run of Ns (100 by AGP convention).
  • Conflict resolution: Reconciling discrepancies between different linkage sources.
  • Manual curation: Visual inspection (e.g., contact maps) to correct misjoins or misassemblies.
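
The ordering and orientation concepts above can be sketched in a few lines of Python. This is an illustrative simplification, not the algorithm of any particular scaffolder: it assumes alignment hits have already been parsed into tuples (the `hits` list and `order_and_orient` function are hypothetical), assigns each scaffold to the reference chromosome with the most aligned bases, orients it by majority strand, and orders scaffolds by the median reference position of their hits.

```python
from collections import defaultdict
from statistics import median

# Hypothetical alignment hits: (scaffold, ref_chrom, ref_start, ref_end, strand)
hits = [
    ("scaf3", "chr1", 5_000, 9_000, "-"),
    ("scaf1", "chr1", 100, 4_000, "+"),
    ("scaf3", "chr1", 9_500, 12_000, "-"),
    ("scaf3", "chr1", 12_100, 12_600, "+"),
    ("scaf2", "chr2", 200, 7_000, "+"),
]

def order_and_orient(hits):
    """Return {ref_chrom: [(scaffold, orientation), ...]} in reference order."""
    # Aligned bases per scaffold, split by reference chromosome and strand.
    bases = defaultdict(lambda: defaultdict(lambda: {"+": 0, "-": 0}))
    midpoints = defaultdict(lambda: defaultdict(list))
    for scaf, chrom, start, end, strand in hits:
        bases[scaf][chrom][strand] += end - start
        midpoints[scaf][chrom].append((start + end) / 2)

    placed = defaultdict(list)
    for scaf, per_chrom in bases.items():
        # Assign the scaffold to the chromosome with the most aligned bases.
        chrom = max(per_chrom, key=lambda c: sum(per_chrom[c].values()))
        strands = per_chrom[chrom]
        # Majority vote by aligned bases decides the orientation.
        orientation = "+" if strands["+"] >= strands["-"] else "-"
        position = median(midpoints[scaf][chrom])
        placed[chrom].append((position, scaf, orientation))

    # Sort by median reference position to obtain the scaffold order.
    return {chrom: [(s, o) for _, s, o in sorted(items)]
            for chrom, items in placed.items()}

layout = order_and_orient(hits)
print(layout)
```

Weighting the vote by aligned bases keeps a short inverted hit (such as the 500 bp "+" hit on scaf3) from flipping a scaffold that is otherwise clearly reversed.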

Step-by-Step GBResequence Workflow

  1. Preparation
    • Assess assembly quality: Run QUAST for contiguity metrics and BUSCO for gene-space completeness.
    • Index files: Create sequence indexes (samtools faidx, bwa index) and build mapping indices for long reads or Hi-C aligners.
  2. Reference alignment and synteny mapping
    • Whole-genome alignment: Use minimap2, nucmer (MUMmer) or lastz to align scaffolds to reference.
    • Synteny blocks: Extract syntenic regions using tools like Satsuma, SyMAP, or MCScanX.
    • Preliminary ordering: Generate scaffold order proposals from synteny chains.
  3. Long-range data integration
    • Hi-C contact maps: Map Hi-C reads (Juicer/BWA), build contact matrices (Juicer, HiCExplorer), and view them in Juicebox.
    • Optical maps/linked reads: Align maps to scaffolds and derive ordering constraints.
    • Genetic maps: Use marker positions to anchor scaffolds to linkage groups.
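    • Combine constraints: Use ALLMAPS to merge genetic, optical, and synteny maps into a consensus ordering; reference-guided scaffolders such as RagTag (the successor to RaGOO) supply the synteny-based ordering.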
  4. Scaffolding and gap handling
    • Scaffold placement: Run chosen scaffolder to produce pseudochromosomes; insert Ns for estimated gaps.
    • Orientation checks: Validate orientation with contact maps and alignments.
  5. Polishing and correction
    • Polish sequence: Correct base errors with Pilon (short reads), or with Racon or Medaka (long reads; Medaka targets Oxford Nanopore data).
    • Break misjoins: Based on Hi-C/optical map evidence, split incorrect joins and re-run ordering if needed.
  6. Annotation lift-over
    • Coordinate translation: Use Liftoff, CrossMap, or custom scripts to transfer gene models to new coordinates.
    • Functional checks: Verify presence and integrity of conserved genes (BUSCO).
  7. Validation and quality control
    • Contact map inspection: Visualize Hi-C maps in Juicebox for chromosomal patterns.
    • Assembly metrics: Report N50, BUSCO scores, misassembly counts (QUAST).
    • Biological checks: Confirm karyotype consistency and expected chromosome counts where available.
  8. Documentation and release
    • Produce AGP: Describe scaffold-to-chromosome placements.
    • Metadata: Record data sources, tool versions, parameters, and decision notes for curation steps.
    • Publish: Submit assembly and annotation to appropriate repositories (GenBank/ENA) following their submission standards.
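
As a small illustration of the contiguity metric reported in the validation step, the following Python sketch computes N50 from a list of sequence lengths. The `n50` function and the example lengths are hypothetical; QUAST reports the same statistic along with many others, so this is only meant to show what the number represents.

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

# Hypothetical lengths before and after joining two scaffolds with a 100 bp gap.
before = [4000, 7500, 6800]
after = [4000 + 100 + 7500, 6800]
print(n50(before), n50(after))
```

Comparing N50 before and after scaffolding shows the gain in contiguity, but the metric says nothing about correctness, which is why the workflow pairs it with BUSCO and contact-map inspection.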

Common Tools and When to Use Them

  • Alignment and synteny: minimap2, nucmer (MUMmer), lastz, Satsuma, MCScanX
  • Scaffolding and ordering: ALLMAPS, RaGOO, RagTag, ARKS/LINKS
  • Hi-C processing: Juicer, HiC-Pro, HiCExplorer, Juicebox
  • Polishing: Pilon, Racon, Medaka
  • Validation: QUAST, BUSCO, REAPR, KAT
  • Annotation liftover: Liftoff, CrossMap

Practical Tips and Pitfalls

  • Prefer multiple evidence types: Combine synteny and Hi-C for robust ordering.
  • Beware reference bias: Ordering against a distant reference can impose the reference's structure and hide lineage-specific rearrangements; validate with contact maps.
  • Document manual changes: Keep logs and make decisions reproducible.
  • Conservative gap sizing: Use Ns rather than speculative base-level filling unless supported by reads.
  • Iterate: Order → validate → correct → polish in multiple rounds.
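
One way to act on the advice to combine evidence types is to compare the layouts proposed by two sources and flag disagreements for manual review. The Python sketch below assumes each layout has already been reduced to a mapping from scaffold to chromosome and orientation, with chromosome names reconciled between sources; the `find_conflicts` function and the example data are hypothetical.

```python
def find_conflicts(synteny, hic):
    """Compare two layouts given as {scaffold: (chromosome, orientation)}.

    Returns {scaffold: kind} where kind is "chromosome" or "orientation".
    Scaffolds placed by only one source are ignored.
    """
    conflicts = {}
    for scaf in sorted(synteny.keys() & hic.keys()):
        s_chrom, s_orient = synteny[scaf]
        h_chrom, h_orient = hic[scaf]
        if s_chrom != h_chrom:
            conflicts[scaf] = "chromosome"
        elif s_orient != h_orient:
            conflicts[scaf] = "orientation"
    return conflicts

# Hypothetical layouts from a reference-guided run and a Hi-C run.
synteny_layout = {"scaf1": ("chr1", "+"), "scaf2": ("chr2", "+"), "scaf3": ("chr1", "-")}
hic_layout = {"scaf1": ("chr1", "+"), "scaf2": ("chr1", "+"), "scaf3": ("chr1", "+")}

flagged = find_conflicts(synteny_layout, hic_layout)
print(flagged)
```

Flagged scaffolds are candidates for inspection in the contact map rather than automatic correction, since either source may be the one in error.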

Example Minimal Command Sequence (conceptual)

```bash
# Assess completeness (choose the lineage dataset for your organism)
busco -i assembly.fasta -l embryophyta_odb10 -m genome -o busco

# Align scaffolds to the reference
minimap2 -ax asm5 ref.fa assembly.fasta > aln.sam

# Order and orient with RagTag (synteny-based)
ragtag.py scaffold ref.fa assembly.fasta -o ragtag_out

# Hi-C mapping (Juicer pipeline) and visual check in Juicebox
# Polish with Racon/Pilon as needed
```

Conclusion

GBResequence workflows convert scaffold-level assemblies into chromosome-scale pseudochromosomes by integrating reference-guided synteny, long-range linkage data, iterative correction, and thorough validation. Combining complementary data types and documenting decisions yields high-confidence assemblies suitable for downstream analyses such as comparative genomics, population studies, and functional annotation.
