CUGI Pilot Sequencing/Assembly Projects Christopher Saski.


1 CUGI Pilot Sequencing/Assembly Projects Christopher Saski

2 Sequencing the Cacao Genome: 3 Megabases at a Time
Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome
IBM in silico assembly project
– Testing the assembly pipeline

3 Sequencing the Cacao Genome: 3 Megabases at a Time
Combination of:
– “Old School Genomics”: BAC libraries, physical mapping, and clone-by-clone sequencing
– Roche 454 Titanium and FLX de novo sequencing
Key:
– No eukaryotic genome has yet been accurately assembled with NGS alone
– Reduce assembly complexity

4 3 Megabase segments Rounsley et al., 2009

5 Advantages
Reduce assembly complexity
Limit number of sequencing libraries
Prioritize critical genomic regions
Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer
Flexibility
– Start slow with minimal investment
– Could redesign strategy to reduce sequence runs

6 Strategy Components
Integrated Physical/Genetic framework
Pool development and sequencing:
– BAC-end
– Titanium 454 (paired/non-paired)
– Draft sequence
Assembly and integration:
– Newbler
– Celera (CABOG)

7 Cacao Integrated Physical/Genetic Framework
Represents ~29X coverage (3 BAC libraries)
Assembled into a small number of large contigs
Suggests reasonable levels of heterozygosity
Manageable amounts of repetitive sequence
220 anchored genetic markers spanning 10 linkage groups
– Marker order resembles the recombination-derived order

8 Pool Development
Select contiguous BAC clones from the minimum tiling path (MTP)
Pools will contain 25-30 clones
– 20-30 kb overlap
Complete cacao MTP will require 120-150 pools
Repetitive-type regions:
– BAC-end sequence and physical map data serve as a predictive tool
Modify pools accordingly
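A rough way to see why 25-30 overlapping clones correspond to a segment of about 3 Mbp is to add up the tiling path. The sketch below is illustrative only: the 130 kb mean insert size is an assumption (the slides do not state the insert size), and `pool_span_bp` is a hypothetical helper name.

```python
def pool_span_bp(n_clones, insert_bp, overlap_bp):
    """Approximate genomic span of a tiling path of overlapping clones.

    Each clone after the first contributes its insert minus the overlap
    it shares with the previous clone.
    """
    if n_clones < 1:
        raise ValueError("a pool needs at least one clone")
    return n_clones * insert_bp - (n_clones - 1) * overlap_bp


# ASSUMPTION: 130 kb mean insert size (not given in the slides).
# Clone counts and overlaps are the ranges quoted on the slide.
low = pool_span_bp(25, 130_000, 30_000)   # fewest clones, largest overlap
high = pool_span_bp(30, 130_000, 20_000)  # most clones, smallest overlap
print(low, high)
```

Under that assumed insert size, the two extremes bracket the 3 Mbp target.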

9 Pool Development
Estimate contig size using the Consensus Band (CB) algorithm
Example: cacao chloroplast (cp) genome is 160,604 bp
– Hybridization revealed a cp-containing contig, estimated at ~160 kb by the CB algorithm
Purified pool DNA can be produced at CUGI
– Treat with ATP-dependent DNase

10 Sequencing
3 Levels of Sequence:
– Paired BAC-end sequence, 20 kb increments
– End sequencing of pool members
– 454 sequencing of BAC pools
  Paired: 3.5X-5.1X coverage (Roche 454/FLX)
  Non-paired: 17X-26X coverage (Titanium)
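Average coverage is total sequenced bases divided by the size of the target, so the coverage figures above translate directly into read counts. A minimal sketch follows; the 400 bp mean read length is an assumption chosen for illustration, not a number from the slides, and `reads_needed` is a hypothetical helper.

```python
def reads_needed(target_bp, coverage, mean_read_len):
    """Reads required so that reads * mean_read_len >= coverage * target_bp."""
    total_bases = target_bp * coverage
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bases // mean_read_len)


# ASSUMPTION: 400 bp mean read length for a non-paired Titanium run.
print(reads_needed(3_000_000, 26, 400))
```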

11 454 Runs: Whole Genome
454 Titanium non-paired
– 26X coverage/pool
– 4 pools per slide (up to 150 pools total)
Up to 38 slide runs
454 FLX paired-end (3 kb)
– 5X coverage/pool
– 16 pools per slide (up to 150 pools total)
Up to 10 slide runs total
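The slide-run totals follow from dividing the pool count by the pools per slide and rounding up, since a partly filled slide still costs a run. A short check of that arithmetic (`slide_runs` is a hypothetical helper name):

```python
import math


def slide_runs(total_pools, pools_per_slide):
    """Number of slide runs needed; a partially filled slide counts as a run."""
    return math.ceil(total_pools / pools_per_slide)


print(slide_runs(150, 4))   # Titanium non-paired, 4 pools per slide
print(slide_runs(150, 16))  # FLX paired-end, 16 pools per slide
```

With the upper bound of 150 pools this gives 38 and 10 runs, matching the figures on the slide.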

12 Assembly/Curation of 3 Mbp Segment
Preprocessing
– Filter reads to remove:
  Paired-end reads that did not contain both ends
  BAC vector
  E. coli (host DNA)
Newbler Assembler (Roche)
Celera Assembler (CABOG)
– Improvements in homopolymer calls and heterogeneous read length issues
– Recently shown to give N50 contig sizes double those of Newbler
  Human (50% repetitive) and microbes
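N50 contig size, the metric used above to compare assemblers, is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Illustrative lengths only (not data from the project).
print(n50([80, 70, 50, 40, 30, 20]))  # total 290; 80 + 70 = 150 >= 145
```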

13 Assembly/Curation of 3 Mbp Segment
Assembly at various depths (5X, 10X, 15X)
– Determine optimal sequencing coverage
Utilize available data to scaffold contigs:
– BAC end sequences every 20 kb
– Genetic marker sequences
– RNA-seq clusters
– Arabidopsis-cacao synteny
– Draft sequence (2X)
  Augment approach by covering regions missed by clones
  Assist in selecting MTP
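One simple way to prepare assemblies at several depths is to randomly subsample the full read set in proportion to the target coverage. The sketch below assumes reads are held in a list and that the full data set is at a known depth (26X is used here as an example); it is not the project's actual procedure, and `subsample_reads` is a hypothetical helper.

```python
import random


def subsample_reads(reads, full_coverage, target_coverage, seed=0):
    """Randomly keep a fraction of reads so depth drops to target_coverage."""
    if target_coverage > full_coverage:
        raise ValueError("target coverage exceeds available coverage")
    keep = round(len(reads) * target_coverage / full_coverage)
    # A seeded generator makes the subsample reproducible.
    return random.Random(seed).sample(reads, keep)


reads = [f"read_{i}" for i in range(1000)]  # placeholder read identifiers
for depth in (5, 10, 15):
    subset = subsample_reads(reads, full_coverage=26, target_coverage=depth)
    print(depth, len(subset))
```

Each subset would then be assembled separately and the resulting contig statistics compared to choose a working depth.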

14 Assembly/Curation of 3 Mbp Segment
Deliverable will be a pseudomolecule sequence for the 3 Mbp region
– Gaps will be strings of N
Assess and employ lab-based gap-filling strategies
Make every attempt to close gaps
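A pseudomolecule of this kind can be pictured as the ordered, oriented contigs joined end to end with runs of N marking each gap. A minimal sketch, assuming the contigs are already ordered and oriented and using a fixed 100 bp gap as a placeholder (the slides do not specify a gap length):

```python
def build_pseudomolecule(ordered_contigs, gap_len=100):
    """Join contigs in order, writing each gap as a string of N.

    gap_len is a placeholder; real gap sizes would come from the
    physical map or paired-end distances where available.
    """
    return ("N" * gap_len).join(ordered_contigs)


print(build_pseudomolecule(["ACGT", "GG", "TTA"], gap_len=3))
```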

15 Assembly Validation and Correction
In silico virtual digest of scaffold sequence, compared to physical map restriction fragments
– Draft sequence integration (DSI) via FPC
Integrate and visualize physical map, 3 Mbp segments, and draft sequence
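The idea of a virtual digest can be sketched as follows: cut the scaffold sequence at every occurrence of a restriction site, list the fragment sizes, and count how many have a counterpart among the observed fingerprint bands within some tolerance. This is an illustration only and not the FPC implementation; the EcoRI site (GAATTC, cut after the G) and the tolerance are assumptions, since the slides do not name the enzyme.

```python
def virtual_digest(seq, site="GAATTC", cut_offset=1):
    """Return fragment lengths from cutting seq at every occurrence of site."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def count_matches(predicted, observed, tolerance):
    """Greedily pair predicted fragments with observed bands within tolerance."""
    remaining = sorted(observed)
    matches = 0
    for size in sorted(predicted):
        for i, band in enumerate(remaining):
            if abs(band - size) <= tolerance:
                matches += 1
                del remaining[i]  # each observed band is used at most once
                break
    return matches


fragments = virtual_digest("AAAGAATTCTTTTGAATTCCC")
print(fragments)
print(count_matches(fragments, observed=[4, 11, 20], tolerance=1))
```

A low match count for a scaffold would flag it for closer inspection against the physical map.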

16 Sequence/Assembly Pipeline

17 IBM in silico Sequences
IBM will provide a set of sequences that mimic the pilot cacao sequences
– Input error
  Indels, homopolymer calls, nucleotide substitutions
Simulated data to test pipeline:
– Physical map
– Simulated BAC end sequences
– Simulated pseudo-reads from pooled BACs
– EST clusters
– Indicate reference species for syntenic comparisons
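To make the three error types concrete, the sketch below injects substitutions, single-base indels, and homopolymer length miscalls into a sequence at user-chosen rates. It is a toy model written for illustration, not IBM's simulator; the function name, the rates, and the error model are all assumptions.

```python
import random


def add_errors(seq, sub_rate, indel_rate, homopolymer_rate, seed=0):
    """Return seq with simulated sequencing errors (toy model)."""
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(seq):
        base = seq[i]
        j = i
        while j < len(seq) and seq[j] == base:
            j += 1
        run = j - i  # length of the homopolymer run starting at i
        # Homopolymer miscall: report the run one base too long or too short.
        if run >= 2 and rng.random() < homopolymer_rate:
            run += rng.choice((-1, 1))
        for _ in range(run):
            r = rng.random()
            if r < sub_rate:
                # Substitution: replace with a different nucleotide.
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif r < sub_rate + indel_rate:
                if rng.random() < 0.5:
                    continue  # deletion: drop this base
                out.append(base)
                out.append(rng.choice("ACGT"))  # insertion after this base
            else:
                out.append(base)
        i = j
    return "".join(out)


print(add_errors("ACGTTTGACCA" * 3, sub_rate=0.02, indel_rate=0.02,
                 homopolymer_rate=0.2))
```

With all three rates set to zero the function returns its input unchanged, which is a convenient sanity check before feeding simulated reads to a pipeline.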

18 Pilot Project Budget
BAC-end sequencing (30K BACs), 20 kb increments
– $206,605.00
Assembly/curation/validation of cacao 3 Mbp
– $16,720.00
Assembly of IBM in silico derived sequences
– $15,400.00

19 ESTIMATED Budget, Whole Genome Assembly
Assembly, curation, validation of 130-150 3 Mbp segments
– $147,620.00
Automated structural/functional annotation
– $8,800.00

20 Acknowledgements
USDA-ARS
Mars Inc.
Dr. Alex Feltus
Stephen Ficklin
Dr. Keith Murphy
Dr. Margaret Staton

