CUGI Pilot Sequencing/Assembly Projects
  Christopher Saski
Sequencing the Cacao Genome: 3 Megabases at a Time
  Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome
  IBM in silico assembly project
    – Testing the assembly pipeline
Sequencing the Cacao Genome: 3 Megabases at a Time
  Combination of:
    – “Old School Genomics”: BAC libraries, physical mapping, and clone-by-clone sequencing
    – Roche 454 Titanium and FLX de novo sequencing
  Key:
    – A eukaryotic genome has not yet been accurately assembled with NGS alone
    – Reduce assembly complexity
3 Megabase segments Rounsley et al., 2009
Advantages
  Reduce assembly complexity
  Limit number of sequencing libraries
  Prioritize critical genomic regions
  Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer
  Flexibility
    – Start slow with minimal investment
    – Could redesign strategy to reduce sequence runs
Strategy Components
  Integrated Physical/Genetic framework
  Pool development and sequencing:
    – BAC-end
    – Titanium 454 (paired/non-paired)
    – Draft sequence
  Assembly and integration:
    – Newbler
    – Celera (CABOG)
Cacao Integrated Physical/Genetic Framework
  Represents ~29X coverage (3 BAC libraries)
  Assembled into a small number of large contigs
  Suggests reasonable levels of heterozygosity
  Manageable amounts of repetitive sequence
  220 anchored genetic markers spanning 10 linkage groups
    – Marker order resembles the recombination-derived order
Pool Development
  Select contiguous BAC clones from the MTP (minimum tiling path)
  Pools will contain clones
    – 20-30kb overlap
  Complete Cacao MTP will require pools
  Repetitive-type regions:
    – BAC-end sequence and physical map data serve as a predictive tool
  Modify pools accordingly
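The slide does not say how clones are assigned to pools. As a rough sketch only, contiguous MTP clones could be grouped greedily until each pool spans about 3 Mbp. The clone tuple layout, the `build_pools` name, and the `target_span` default are all assumptions made for illustration, and the clones are assumed to lie on a single contig.

```python
from typing import List, Tuple

# Hypothetical record: (clone name, start bp, end bp) on the physical map.
Clone = Tuple[str, int, int]


def build_pools(mtp: List[Clone], target_span: int = 3_000_000) -> List[List[Clone]]:
    """Greedily group contiguous MTP clones until each pool spans ~target_span bp."""
    pools: List[List[Clone]] = []
    current: List[Clone] = []
    for clone in sorted(mtp, key=lambda c: c[1]):
        current.append(clone)
        # Span covered by the pool so far (assumes one contiguous contig).
        span = max(c[2] for c in current) - current[0][1]
        if span >= target_span:
            pools.append(current)
            current = []
    if current:
        pools.append(current)  # last, possibly shorter, pool
    return pools
```

With toy clones of 150 kb that overlap their neighbours by 25 kb, each full pool ends up holding 24 clones.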
Pool Development
  Estimate contig size using Consensus Band (CB) algorithm
  Example: Cacao chloroplast (cp) genome is 160,604bp
    – Hybridization revealed a cp-containing contig, estimated at ~160 kb by the CB algorithm
  Purified pool DNA can be produced at CUGI
    – Treat with ATP-dependent DNase
Sequencing
  3 Levels of Sequence:
    – Paired BAC-end sequence at 20 kb increments
    – End sequencing of pool members
    – 454 sequencing of BAC pools
        Paired 3.5X-5.1X coverage (Roche 454/FLX)
        Non-paired 17X-26X coverage (Titanium)
454 Runs: Whole Genome
  454 Titanium non-paired
    – 26X coverage/pool
    – 4 pools per slide (up to 150 pools total)
    – Up to 38 slide runs
  454 FLX paired-end (3kb)
    – 5X coverage/pool
    – 16 pools per slide (up to 150 pools total)
    – Up to 10 slide runs total
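The run counts on this slide follow from dividing the pool total by the pools per slide and rounding up. The short sketch below reproduces that arithmetic. The function names are illustrative, and the 3 Mbp pool size used for the base-count estimate is taken from the project title rather than from this slide.

```python
import math


def slide_runs(total_pools: int, pools_per_slide: int) -> int:
    """Number of slide runs needed to sequence every pool once."""
    return math.ceil(total_pools / pools_per_slide)


def bases_per_pool(coverage: float, pool_size_bp: int) -> int:
    """Raw bases needed to reach a given fold coverage of one pool."""
    return int(coverage * pool_size_bp)


print(slide_runs(150, 4))              # Titanium non-paired: 38
print(slide_runs(150, 16))             # FLX paired-end: 10
print(bases_per_pool(26, 3_000_000))   # 78000000 bases per Titanium pool
```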
Assembly/Curation of 3Mbp Segment
  Preprocessing
    – Filter reads to remove:
        Paired-end reads that did not contain both ends
        BAC vector
        E. coli (host DNA)
  Newbler Assembler (Roche)
  Celera Assembler (CABOG)
    – Improvements in homopolymer calls and heterogeneous read length issues
    – Recently shown to give N50 contig sizes double those of Newbler for human (50% repetitive) and microbial genomes
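The preprocessing filters listed above can be sketched as follows. This is a toy stand-in, not the pipeline's actual screening step: real vector and host screening is normally alignment-based, whereas this version only looks for exact shared k-mers. The `pairs` layout (a name mapped to `left` and `right` sequences) and all function names are hypothetical.

```python
from typing import Dict, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All k-mers of seq (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_pairs(pairs: Dict[str, Dict[str, str]],
                 contaminants: Dict[str, str],
                 k: int = 20) -> Dict[str, Dict[str, str]]:
    """Keep pairs that have both ends and share no k-mer with vector or host."""
    bad: Set[str] = set()
    for seq in contaminants.values():
        seq = seq.upper()
        bad |= kmers(seq, k) | kmers(revcomp(seq), k)
    kept = {}
    for name, ends in pairs.items():
        left, right = ends.get("left"), ends.get("right")
        if not left or not right:
            continue  # pair is missing one end
        if kmers(left.upper(), k) & bad or kmers(right.upper(), k) & bad:
            continue  # looks like BAC vector or E. coli host sequence
        kept[name] = ends
    return kept
```

Exact k-mer matching will miss contaminant reads that carry sequencing errors, so it is only meant to show the shape of the filter.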
Assembly/Curation of 3Mbp Segment
  Assembly at various depths (5X, 10X, 15X)
    – Determine optimal sequencing coverage
  Utilize available data to scaffold contigs:
    – BAC end sequences every 20kb
    – Genetic marker sequences
    – RNA-seq clusters
    – Arabidopsis-cacao synteny
  Draft Sequence (2X)
    – Augment approach by covering regions missed by clones
    – Assist in selecting MTP
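One way to prepare the 5X, 10X, and 15X assemblies is to subsample the full read set down to each target depth before running the assembler. The sketch below assumes reads are plain strings and that depth is total read bases divided by region size. The function name and the random-subsampling approach are illustrative, not taken from the slides.

```python
import random
from typing import List


def subsample_to_depth(reads: List[str], region_size: int,
                       depth: float, seed: int = 0) -> List[str]:
    """Randomly pick reads until their total length reaches depth * region_size."""
    rng = random.Random(seed)
    shuffled = reads[:]          # leave the caller's list untouched
    rng.shuffle(shuffled)
    target_bases = depth * region_size
    picked: List[str] = []
    total = 0
    for read in shuffled:
        if total >= target_bases:
            break
        picked.append(read)
        total += len(read)
    return picked
```

Each subset can then be assembled separately and the resulting contig statistics compared to choose a coverage level.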
Assembly/Curation of 3Mbp Segment
  Deliverable will be a pseudomolecule sequence for the 3Mbp region
    – Gaps will be strings of N
  Assess and employ lab-based gap filling strategies
  Make every attempt to close gaps
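A minimal sketch of building such a pseudomolecule, assuming the contigs are already ordered and oriented and that gap sizes come from the scaffolding evidence. The `default_gap` length used when a gap size is unknown is an arbitrary placeholder, not a value from the slides.

```python
from typing import List, Optional


def build_pseudomolecule(contigs: List[str],
                         gap_sizes: List[Optional[int]],
                         default_gap: int = 100) -> str:
    """Join ordered contigs, filling each gap with a run of N."""
    if len(gap_sizes) != max(len(contigs) - 1, 0):
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts: List[str] = []
    for i, contig in enumerate(contigs):
        parts.append(contig)
        if i < len(contigs) - 1:
            size = gap_sizes[i]
            # Unknown gap sizes (None) fall back to the placeholder length.
            parts.append("N" * (default_gap if size is None else size))
    return "".join(parts)


print(build_pseudomolecule(["ACGT", "GGCC", "TTAA"], [3, None], default_gap=5))
# ACGTNNNGGCCNNNNNTTAA
```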
Assembly Validation and Correction
  In silico virtual digest of scaffold sequence, compared to physical map restriction fragments
    – Draft sequence integration (DSI) via FPC
  Integrate and visualize physical map, 3 Mbp segments, and draft sequence
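The virtual digest comparison can be illustrated with a small sketch. The slides do not name the fingerprinting enzyme, so the HindIII site (A^AGCTT) in the example is an assumption, as are the function names and the 5% size tolerance. The sequence is treated as linear, and this is not how FPC itself performs the comparison.

```python
import re
from typing import List


def virtual_digest(seq: str, site: str, cut_offset: int) -> List[int]:
    """Fragment lengths after cutting a linear sequence at every site occurrence."""
    # Lookahead so that overlapping site occurrences are all found.
    cuts = [m.start() + cut_offset
            for m in re.finditer("(?=" + re.escape(site) + ")", seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def matched_fraction(predicted: List[int], observed: List[int],
                     tolerance: float = 0.05) -> float:
    """Fraction of predicted fragments with an observed band within tolerance."""
    if not predicted:
        return 0.0
    hits = sum(
        any(abs(p - o) <= tolerance * p for o in observed)
        for p in predicted
    )
    return hits / len(predicted)


# Assumed enzyme: HindIII, which cuts A^AGCTT (offset 1 into the site).
print(virtual_digest("GGGAAGCTTCCCCAAGCTTTT", "AAGCTT", 1))  # [4, 10, 7]
```

A low matched fraction for a scaffold would flag it for closer inspection against the physical map.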
Sequence/Assembly Pipeline
IBM in silico Sequences
  IBM will provide a set of sequences that mimic the pilot cacao sequences
    – Input error
        Indels, homopolymer calls, nucleotide substitutions
  Simulated data to test pipeline:
    – Physical map
    – Simulated BAC end sequences
    – Simulated pseudo-reads from pooled BACs
    – EST clusters
    – Indicate reference species for syntenic comparisons
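How IBM generates the simulated errors is not described here. Purely as an illustration of the three error types named above, a toy error model might look like the following. The rates, the function name, and the rule that a homopolymer run gains or loses exactly one base are all assumptions.

```python
import random
from itertools import groupby

BASES = "ACGT"


def add_errors(seq: str, sub_rate: float = 0.0, indel_rate: float = 0.0,
               homopolymer_rate: float = 0.0, seed: int = 0) -> str:
    """Return seq with simulated substitutions, indels, and homopolymer miscalls."""
    rng = random.Random(seed)
    out = []
    for base, run in groupby(seq):
        n = len(list(run))
        # Homopolymer miscall: a run of 2+ identical bases gains or loses one.
        if n >= 2 and rng.random() < homopolymer_rate:
            n += rng.choice((-1, 1))
        for _ in range(n):
            r = rng.random()
            if r < sub_rate:
                # Substitution: replace with a different base.
                out.append(rng.choice([b for b in BASES if b != base]))
            elif r < sub_rate + indel_rate:
                if rng.random() < 0.5:
                    continue                    # deletion
                out.append(base)
                out.append(rng.choice(BASES))   # insertion after this base
            else:
                out.append(base)
    return "".join(out)
```

Running the pipeline on reads with known, injected errors makes it possible to check how many of them the assembly corrects.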
Pilot Project Budget
  BAC-end sequencing (30K BACs), 20Kb increments – $206,
  Assembly/curation/validation of cacao 3Mbp – $16,
  Assembly of IBM in-silico derived sequences – $15,400.00
ESTIMATED Budget – Whole Genome Assembly
  Assembly, curation, validation of , 3Mbp segments – $147,
  Automated structural/functional annotation – $8,800.00
Acknowledgements
  USDA-ARS
  Mars Inc.
  Dr. Alex Feltus
  Stephen Ficklin
  Dr. Keith Murphy
  Dr. Margaret Staton