From Scaffolds to Chromosomes: GBResequence Workflows Explained
Introduction
GBResequence is a workflow-focused approach for reorganizing and ordering genomic scaffolds into chromosome-scale assemblies by leveraging conserved synteny, reference genomes, long-range linkage data, and manual curation. This article outlines core concepts, typical input data, stepwise workflows, common tools, validation strategies, and practical tips for producing reliable chromosome-level sequences from scaffolded assemblies.
Typical Inputs and Outputs
- Inputs
  - Draft assembly scaffolds (FASTA)
  - Reference genome(s) from a related species, or a high-quality assembly of the same species
  - Long-range linking data: Hi-C, optical maps, linked reads, genetic maps
  - Gene annotation or conserved marker sets (e.g., BUSCO loci)
- Outputs
  - Ordered and oriented pseudochromosomes (FASTA)
  - AGP files describing scaffold placement
  - Updated annotation liftover or coordinate maps
  - Assembly metadata and quality reports
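To make the AGP output concrete, here is a minimal sketch of how scaffold placements might be written as AGP 2.1 records. The names `Placement` and `agp_lines`, the scaffold names, and the choice of `align_genus` as linkage evidence are assumptions for illustration; gaps of unknown size are written as type `U` with the conventional length of 100.

```python
from dataclasses import dataclass
from typing import List

# AGP convention: gaps of unknown size ('U') are recorded with length 100.
UNKNOWN_GAP = 100


@dataclass
class Placement:
    """One scaffold placed on a pseudochromosome (hypothetical helper type)."""
    scaffold: str
    length: int
    orientation: str  # "+" or "-"


def agp_lines(chrom: str, placements: List[Placement],
              evidence: str = "align_genus") -> List[str]:
    """Return tab-separated AGP records for scaffolds joined by unknown-size gaps."""
    lines = []
    pos = 1   # 1-based start of the next part on the object
    part = 1  # running part number
    for i, p in enumerate(placements):
        if i > 0:
            # Gap record: object, beg, end, part, 'U', length, type, linkage, evidence
            gap_end = pos + UNKNOWN_GAP - 1
            fields = [chrom, pos, gap_end, part, "U",
                      UNKNOWN_GAP, "scaffold", "yes", evidence]
            lines.append("\t".join(map(str, fields)))
            pos = gap_end + 1
            part += 1
        # Component record: object, beg, end, part, 'W', id, comp_beg, comp_end, orient
        end = pos + p.length - 1
        fields = [chrom, pos, end, part, "W",
                  p.scaffold, 1, p.length, p.orientation]
        lines.append("\t".join(map(str, fields)))
        pos = end + 1
        part += 1
    return lines


if __name__ == "__main__":
    demo = [Placement("scaf_7", 1000, "+"), Placement("scaf_2", 500, "-")]
    print("\n".join(agp_lines("chr1", demo)))
```

For joins supported by Hi-C rather than by alignment to a reference, `proximity_ligation` would be the more appropriate evidence value.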
Core Concepts
- Scaffold ordering: Determining relative positions of scaffolds along chromosomes using synteny or linkage information.
- Orientation: Assigning forward/reverse orientation per scaffold.
- Gap sizing: Estimating inter-scaffold gap lengths (when the true size is unknown, the gap is usually represented by a fixed-length run of Ns).
- Conflict resolution: Reconciling discrepancies between different linkage sources.
- Manual curation: Visual inspection (e.g., contact maps) to correct misjoins or misassemblies.
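The ordering, orientation, and gap-sizing concepts above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method of any particular scaffolder: each scaffold is assigned to the reference chromosome where it has the most aligned bases, positioned at the length-weighted mean midpoint of its hits, oriented by majority strand, and joined with a fixed-size run of Ns. The `Hit` record and both function names are hypothetical, and every hit is assumed to have a positive length.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One scaffold-to-reference alignment (hypothetical record type)."""
    scaffold: str
    chrom: str
    ref_start: int
    ref_end: int
    strand: str  # "+" or "-"


Layout = Dict[str, List[Tuple[str, str]]]  # chrom -> [(scaffold, orientation)]


def order_and_orient(hits: Iterable[Hit]) -> Layout:
    """Assign each scaffold a chromosome, a position, and an orientation."""
    hits = list(hits)
    # Aligned bases per scaffold per reference chromosome.
    per_chrom = defaultdict(lambda: defaultdict(int))
    for h in hits:
        per_chrom[h.scaffold][h.chrom] += h.ref_end - h.ref_start

    placed = defaultdict(list)
    for scaffold, totals in per_chrom.items():
        best = max(totals, key=totals.get)  # chromosome with most aligned bases
        strand_bases = defaultdict(int)
        weighted_mid = 0.0
        for h in hits:
            if h.scaffold == scaffold and h.chrom == best:
                span = h.ref_end - h.ref_start
                strand_bases[h.strand] += span
                weighted_mid += span * (h.ref_start + h.ref_end) / 2
        position = weighted_mid / totals[best]
        orientation = max(strand_bases, key=strand_bases.get)
        placed[best].append((position, scaffold, orientation))

    # Sort scaffolds along each chromosome by their estimated position.
    return {chrom: [(s, o) for _, s, o in sorted(items)]
            for chrom, items in placed.items()}


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def build_pseudochromosome(seqs: Dict[str, str],
                           placements: List[Tuple[str, str]],
                           gap: int = 100) -> str:
    """Join scaffolds in order, reverse-complementing '-' ones, with N gaps."""
    parts = []
    for name, orientation in placements:
        seq = seqs[name]
        if orientation == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        parts.append(seq)
    return ("N" * gap).join(parts)
```

Real scaffolders use chained alignments, quality filters, and confidence scores rather than a simple majority vote, so a layout produced this way is only a starting proposal for validation.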
Step-by-Step GBResequence Workflow
- Preparation
  - Assess assembly quality: Run QUAST for contiguity metrics and BUSCO for gene-space completeness.
  - Index files: Create sequence indexes (samtools faidx, bwa index) and build mapping indices for long reads or Hi-C aligners.
- Reference alignment and synteny mapping
  - Whole-genome alignment: Use minimap2, nucmer (MUMmer), or lastz to align scaffolds to the reference.
  - Synteny blocks: Extract syntenic regions using tools like Satsuma, SyMAP, or MCScanX.
  - Preliminary ordering: Generate scaffold order proposals from synteny chains.
- Long-range data integration
  - Hi-C contact maps: Map Hi-C reads (Juicer, which runs BWA), build contact matrices (Juicer or HiCExplorer), and view them in Juicebox.
  - Optical maps/linked reads: Align maps to scaffolds and derive ordering constraints.
  - Genetic maps: Use marker positions to anchor scaffolds to linkage groups.
  - Combine constraints: Merge ordering evidence with ALLMAPS (genetic, physical, and synteny maps) or with RagTag's merge command (multiple scaffolding AGP files, optionally weighted by Hi-C). RaGOO, which appears in older protocols, has been superseded by RagTag.
- Scaffolding and gap handling
  - Scaffold placement: Run the chosen scaffolder to produce pseudochromosomes; insert Ns for estimated gaps.
  - Orientation checks: Validate orientation with contact maps and alignments.
- Polishing and correction
  - Polish sequence: Correct base errors with a polisher matched to the read type, typically Pilon for short reads, Racon for long reads, and Medaka for Oxford Nanopore reads.
  - Break misjoins: Based on Hi-C/optical map evidence, split incorrect joins and re-run ordering if needed.
- Annotation lift-over
  - Coordinate translation: Use Liftoff, CrossMap, or custom scripts to transfer gene models to the new coordinates.
  - Functional checks: Verify presence and integrity of conserved genes (BUSCO).
- Validation and quality control
  - Contact map inspection: Visualize Hi-C maps in Juicebox and check for the expected chromosomal patterns.
  - Assembly metrics: Report N50, BUSCO scores, and misassembly counts (QUAST).
  - Biological checks: Confirm that the pseudochromosome count matches the known karyotype where cytogenetic data are available.
- Documentation and release
  - Produce AGP: Describe scaffold-to-chromosome placements.
  - Metadata: Record data sources, tool versions, parameters, and decision notes for curation steps.
  - Publish: Submit assembly and annotation to appropriate repositories (GenBank/ENA) following their submission standards.
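The N50 metric reported during validation can be computed directly from the index that samtools faidx writes, whose first two tab-separated columns are sequence name and length. The sketch below is a minimal illustration; the function names are hypothetical, and QUAST remains the better choice for a full report.

```python
from typing import Iterable, List


def lengths_from_fai(lines: Iterable[str]) -> List[int]:
    """Extract sequence lengths from .fai lines (column 2 is the length)."""
    lengths = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        lengths.append(int(line.split("\t")[1]))
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # File names are placeholders for the draft and scaffolded assembly indexes.
    for path in ("assembly.fasta.fai", "ragtag_out/ragtag.scaffold.fasta.fai"):
        try:
            with open(path) as handle:
                print(path, n50(lengths_from_fai(handle)))
        except FileNotFoundError:
            print(path, "not found")
```

Comparing the value before and after scaffolding gives a quick sanity check, though a higher N50 alone says nothing about whether the joins are correct.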
Common Tools and When to Use Them
- Alignment and synteny: minimap2, nucmer (MUMmer), lastz, Satsuma, MCScanX
- Scaffolding and ordering: ALLMAPS, RagTag (the successor to RaGOO), ARKS/LINKS
- Hi-C processing: Juicer, HiC-Pro, HiCExplorer, Juicebox
- Polishing: Pilon, Racon, Medaka
- Validation: QUAST, BUSCO, REAPR, KAT
- Annotation liftover: Liftoff, CrossMap
Practical Tips and Pitfalls
- Prefer multiple evidence types: Combine synteny and Hi-C for robust ordering.
- Beware reference bias: A distant reference can impose its own structure on the assembly and hide lineage-specific rearrangements, so validate reference-guided joins with contact maps.
- Document manual changes: Keep logs and make decisions reproducible.
- Conservative gap sizing: Use Ns rather than speculative base-level filling unless supported by reads.
- Iterate: Order → validate → correct → polish in multiple rounds.
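One simple way to act on the first two tips is to compare the scaffold order proposed by synteny with the order proposed by Hi-C and flag every reference-guided join that the Hi-C order does not support. The sketch below is an assumption-laden illustration: it treats each order as a plain list of scaffold names for one chromosome, ignores orientation, and uses hypothetical function names.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple


def adjacencies(order: Sequence[str]) -> Set[FrozenSet[str]]:
    """Unordered neighbour pairs implied by a linear scaffold order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def unsupported_joins(synteny_order: Sequence[str],
                      hic_order: Sequence[str]) -> List[Tuple[str, str]]:
    """Joins in the synteny-based order that are absent from the Hi-C order."""
    supported = adjacencies(hic_order)
    return [(a, b)
            for a, b in zip(synteny_order, synteny_order[1:])
            if frozenset((a, b)) not in supported]


if __name__ == "__main__":
    synteny = ["s1", "s2", "s3", "s4"]
    hic = ["s1", "s2", "s4", "s3"]
    for a, b in unsupported_joins(synteny, hic):
        print(f"inspect join {a} / {b} in the contact map")
```

Flagged joins are candidates for manual inspection rather than automatic rejection, since either evidence source can be wrong.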
Example Minimal Command Sequence (conceptual)
```bash
# Assess completeness (choose the lineage dataset that matches your taxon)
busco -i assembly.fasta -l embryophyta_odb10 -m genome -o busco

# Align scaffolds to the reference
minimap2 -ax asm5 ref.fa assembly.fasta > aln.sam

# Order and orient with RagTag (synteny-based)
ragtag.py scaffold ref.fa assembly.fasta -o ragtag_out

# Hi-C mapping (Juicer pipeline) and visual check in Juicebox
# Polish with Racon/Pilon as needed
```
Conclusion
GBResequence workflows convert scaffold-level assemblies into chromosome-scale pseudochromosomes by integrating reference-guided synteny, long-range linkage data, iterative correction, and thorough validation. Combining complementary data types and documenting decisions yields high-confidence assemblies suitable for downstream analyses such as comparative genomics, population studies, and functional annotation.