The Changing Face of Sequencing


The Changing Face of Sequencing: Strategies for de novo sequencing of complex genomes

Quick Review
- BACs
- Whole Genome Shotgun (WGS)

First, some history
- 2000: Arabidopsis (BAC)
- 2005: Rice (BAC & WGS)
- 2006: Poplar (WGS)
- 2007: Grapevine (WGS)
- 2008: Maize (BAC)
- 2008: Papaya (WGS)
- 2009: Sorghum (WGS)

BAC-based vs WGS

BAC-by-BAC, pros:
- Simpler, more accurate assembly
- Localized sequence
- Easily distributed
- Can be targeted to regions

BAC-by-BAC, cons:
- Requires physical map
- Labor intensive
- Expensive (more libraries)
- Slower

WGS, pros:
- Physical map not needed, but helps
- Logistically simple
- Low library costs
- Rapid

WGS, cons:
- Complex assembly
- Harder to localize sequence
- Requires centralized assembly
- Whole genome or nothing

What made WGS possible?
- Long, high-quality Sanger reads (700-800 bp)
- Paired-end libraries with a range of insert sizes: 3 kb, 8-10 kb, 40 kb fosmids
- Assemblers tailored to these data types
- Still not guaranteed: the public maize project went BAC by BAC

NGS changes all the rules
- Quantity, not quality, is now the focus
  - New platforms generate huge quantities of data
  - Read length and paired ends (PEs) initially limited de novo applications
- Rapid cycle of improvements
  - No time for standard approaches to spread beyond genome centers before the next cycle begins
  - Third-party software is sometimes slow to catch up
- The cost model has changed
  - Library construction used to be a minor component of cost
  - The unit used to be 96 or 384 reads
- The choice is now more complex than BAC vs WGS

One size does not fit all
- Every project has individual needs
- A monolithic reference genome is rarely needed now
- How bad are the repeat structures? Is it important to get them right?
- How important is it to anchor all the sequence to a genome location?
- What other genome data can be leveraged?

BACs and NGS: the problem

Pre-NGS, to sequence a BAC:
- Make 1 sequencing library: ~$50-100
- Sequence two 384-well plates of clones: ~$750
- Result: ~6x coverage

With NGS, to sequence a BAC with 454:
- Make 1 sequencing library: ~$300
- Sequence 1/8 plate of 454: ~$1,000
- Result: ~600x coverage

Too expensive, and too much coverage.
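
The coverage figures on this slide follow from the usual relation coverage = (number of reads x read length) / target size. The sketch below reproduces them under stated assumptions: a BAC insert of 100 kb and a 750 bp Sanger read (the midpoint of the 700-800 bp range quoted earlier) are assumptions, and the 454 read count is back-calculated from the ~600x figure rather than taken from the slides.

```python
def fold_coverage(num_reads, read_len_bp, target_bp):
    """Average depth: total sequenced bases divided by target size."""
    return num_reads * read_len_bp / target_bp


def reads_for_coverage(coverage, read_len_bp, target_bp):
    """Number of reads needed to reach a given average depth."""
    return coverage * target_bp / read_len_bp


BAC_SIZE_BP = 100_000   # assumption: typical BAC insert size
SANGER_READ_BP = 750    # assumption: midpoint of 700-800 bp
READ_454_BP = 400       # ~400 bp Titanium reads, as quoted in the talk

# Two 384-well plates of Sanger reads on one BAC.
sanger_cov = fold_coverage(2 * 384, SANGER_READ_BP, BAC_SIZE_BP)

# Reads implied by ~600x coverage of one BAC at 400 bp per read.
reads_454 = reads_for_coverage(600, READ_454_BP, BAC_SIZE_BP)

# How many BACs could share that 1/8 plate at ~20x each.
bacs_at_20x = 600 / 20

print(f"Sanger coverage: {sanger_cov:.2f}x")
print(f"454 reads implied by 600x: {reads_454:.0f}")
print(f"BACs that could share the run at 20x: {bacs_at_20x:.0f}")
```

The last number is the motivation for pooling: the same 454 output spread over roughly 30 BACs would still give each about 20x.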

New BAC-based approaches
- One library per BAC is cost-prohibitive
- Map-based BAC pooling:
  - Retains some of the assembly benefits of BACs
  - Reduces library costs compared with BAC-by-BAC
  - If pools are contiguous, retains the genome localization benefits

BAC pooling strategy (Chr3 short arm). From Rounsley et al. (2009).
1. Select FPC contigs on the short arm.
2. Select overlapping BACs and bin them into 3 Mb pools.
3. Pyrosequence the BAC pools (~20x 454 Titanium reads, ~400 bp each) and assemble the raw sequences into contigs for each pool.
4. Organize the contigs into scaffolds using 454 FLX paired-end sequences (~250 bp each), giving scaffolds for each pool.
5. Use BAC ends for very long scaffolds: generate superscaffolds using BAMBUS and BAC end sequences. These superscaffolds span pool boundaries.

Results: Chr3S of Oryza barthii
- 6 x 3 Mb BAC pools
- 1 Titanium run, 0.5 FLX run
- ~$12k in reagents
- Contig N50: 14.3 kb
- Scaffold N50: 370.9 kb
- Scaffold N50 after BAC ends: 3,165.1 kb
- Nucleotide accuracy: 2.2 errors per 10 kb
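
For readers unfamiliar with the N50 metric quoted above, here is a minimal sketch of how it is commonly computed: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative, not taken from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First length at which the cumulative sum covers half the assembly.
        if running * 2 >= total:
            return length


# Toy example: total is 300, so half is 150.
# 100 alone is not enough; 100 + 80 = 180 crosses the halfway point.
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))
```

Because N50 only reflects contiguity, it says nothing about correctness, which is why the slide reports nucleotide accuracy alongside it.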

2D pooling: an alternative to contiguous BAC pools
- Place ordered clones in plates
- Make 1 library from each row and 1 library from each column
- Identify the reads from each individual clone by sequence overlap, then assemble each clone
- The assembly unit is reduced to roughly a single BAC
- Library cost per clone drops as the grid grows:
  - 10x10: 100 clones, 20 libraries
  - 50x50: 2500 clones, 100 libraries
- A 3D grid lowers the cost even further:
  - 10x10x10: 1000 clones, 30 libraries
  - 20x20x20: 8000 clones, 60 libraries
- Repeats may misbehave, but one can choose to ignore them
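
The library counts for grid pooling come from simple arithmetic: a grid with `side` positions along each of `dims` axes holds `side ** dims` clones but needs only `side * dims` libraries (one per row, column, and so on). A small sketch, with a hypothetical function name:

```python
def pooling_design(side, dims):
    """Return (clones, libraries) for a grid pooling layout.

    side: number of positions along each axis
    dims: number of axes (2 for rows/columns, 3 for a 3D grid)
    """
    clones = side ** dims
    libraries = side * dims
    return clones, libraries


for side, dims in [(10, 2), (50, 2), (10, 3), (20, 3)]:
    clones, libraries = pooling_design(side, dims)
    label = "x".join([str(side)] * dims)
    print(f"{label}: {clones} clones, {libraries} libraries, "
          f"{clones / libraries:.1f} clones per library")
```

The clones-per-library ratio is the quantity that improves with larger or higher-dimensional grids.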

The ideal
- One barcoded library per BAC clone
- Sequence all clones from the BAC library in one combined, barcoded pool
- BUT: currently not cost-effective. Individual DNA preps for thousands of BAC clones are costly

Is WGS with NGS feasible yet?
- 400 bp reads, plus 4 kb and 20 kb insert PE protocols
- Success may be species and goal dependent:
  - Arabidopsis (Roche & Ecker): small, low repeat content. 21 kb contig N50; 2.6 Mb scaffold N50
  - Cassava (Roche & JGI): 800 Mb, lots of repeats. 5.3 kb contig N50; 180 kb scaffold N50. Missing half of the genome (the repetitive half)

WGS with Solexa/Illumina
- Improved read lengths and PE protocols
- Improved third-party assemblers, e.g. SOAPdenovo, Velvet
- Cucumber genome (BGI):
  - 300 Mb genome
  - 50x coverage with 50 bp PE
  - 5 kb contig N50, 60 kb scaffold N50
  - Much better when mixed with 4x Sanger
  - Missing half of the genome (repeats)
- Panda genome (BGI):
  - 3 Gb genome
  - 50x coverage with 75 bp PE
  - 300 kb contig N50 (?)
- Big question: what is the misassembly rate?

Building contigs from overlapping clones
- 5 overlapping BAC clones form a small contig
- Cut with a restriction enzyme (R.E.)
- Overlapping BACs share common fragments

Building contigs from overlapping clones (continued)
- Measure fragment lengths: overlapping BACs will share fragments of the same size
- Or make a sequencing library and sequence from each cut site: overlapping BACs will share the sequence tags next to each cut site

A BAC-WGS hybrid? Whole genome profiling by Keygene

A: Solexa-based BAC map
- Construct a BAC library and array it into 2D pools
- Cut with a restriction enzyme and make 1 library per pool
- Generate sequence from the libraries
- Deconvolute the pools to identify the Solexa reads from each BAC
- Build a map from overlaps
- The map has a short sequence tag every 1-2 kb in the genome

B: WGS sequencing with Solexa
- Assemble short contigs (high stringency)
- Use the above map to locate each contig in the genome
- The map can identify misassemblies

C: Result
- High-quality map-based genome at a fraction of the cost
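
The deconvolution step can be illustrated with a toy example. The sketch below is not Keygene's method; it only shows the simplest case, in which a tag seen in exactly one row pool and exactly one column pool is assigned to the clone at that grid position. Tags seen in several rows or columns (repeats, or tags shared by overlapping clones) are set aside as ambiguous here, whereas a real pipeline would have to resolve them. All names and data are hypothetical.

```python
from collections import defaultdict


def deconvolute(row_pools, col_pools):
    """Assign tags to grid positions from 2D pool membership.

    row_pools / col_pools: dict mapping pool id -> set of tags seen in it.
    Returns (assigned, ambiguous): assigned maps tag -> (row, col).
    """
    rows_of = defaultdict(set)
    cols_of = defaultdict(set)
    for row, tags in row_pools.items():
        for tag in tags:
            rows_of[tag].add(row)
    for col, tags in col_pools.items():
        for tag in tags:
            cols_of[tag].add(col)

    assigned, ambiguous = {}, set()
    for tag in set(rows_of) | set(cols_of):
        rows, cols = rows_of.get(tag, set()), cols_of.get(tag, set())
        if len(rows) == 1 and len(cols) == 1:
            # Unique row and column: the tag belongs to that one clone.
            assigned[tag] = (next(iter(rows)), next(iter(cols)))
        else:
            ambiguous.add(tag)
    return assigned, ambiguous


# Toy 2x2 grid: tag t2 occurs in clones (A,1) and (B,2).
row_pools = {"A": {"t1", "t2", "t3"}, "B": {"t2", "t4", "t5"}}
col_pools = {1: {"t1", "t2", "t4"}, 2: {"t2", "t3", "t5"}}
assigned, ambiguous = deconvolute(row_pools, col_pools)
print(sorted(assigned.items()))
print(sorted(ambiguous))
```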

Simulation of tag-based map building
- Rice: 372 Mb, 12 chromosomes
- Simulate a 10x BAC library: 28,600 clones
- Cut the sequence of each clone with HindIII
- Simulate a short read sequence from each site: 2.2 million sequence tags
- Build a map from these tags (overlapping clones share tags)
- 33 contigs built (fewer than 3 contigs per chromosome)
- Only 1 misassembly!
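
A minimal sketch of the idea behind this simulation: find each HindIII site (AAGCTT) in a clone, take a short tag on either side of it, and treat clones that share tags as overlapping. The tag length, the tiny hand-built sequence, and the function names are assumptions for illustration; for simplicity both flanking tags are taken from the forward strand. The first lines also check the library arithmetic: a 10x library of a 372 Mb genome in 28,600 clones implies a mean insert of about 130 kb.

```python
import re

HINDIII_SITE = "AAGCTT"


def site_tags(clone_seq, tag_len):
    """Collect the tags flanking every HindIII site in a clone sequence."""
    tags = set()
    for match in re.finditer(HINDIII_SITE, clone_seq):
        start, end = match.start(), match.end()
        if start - tag_len >= 0:               # upstream flank fits in clone
            tags.add(clone_seq[start - tag_len:start])
        if end + tag_len <= len(clone_seq):    # downstream flank fits in clone
            tags.add(clone_seq[end:end + tag_len])
    return tags


def shared_tags(clone_a, clone_b, tag_len):
    """Tags common to two clones; a non-empty result suggests overlap."""
    return site_tags(clone_a, tag_len) & site_tags(clone_b, tag_len)


# Mean insert size implied by the slide's numbers.
mean_insert_bp = 10 * 372_000_000 / 28_600
print(f"Implied mean insert: {mean_insert_bp:,.0f} bp")

# Toy genome with two HindIII sites and three clones cut from it.
genome = "CCCCGGGG" + HINDIII_SITE + "ACGTTGCA" + HINDIII_SITE + "TTTTAAAA"
clone_a = genome[0:24]    # contains the first site only
clone_b = genome[6:36]    # contains both sites, overlaps clone_a
clone_c = genome[26:36]   # contains no complete site

print(shared_tags(clone_a, clone_b, tag_len=4))
print(shared_tags(clone_a, clone_c, tag_len=4))
```

In the toy data, `clone_a` and `clone_b` share the tag downstream of the first site, while `clone_c` shares nothing with `clone_a`.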

So you want to sequence a genome?

Lots of choices to make:
- BACs or WGS?
- Which NGS technology?
- Single end or paired end?
- What size paired ends?
- What depth of coverage from each?

How do you pick?
- Do lots of testing of strategies: $$$$$
- Guess: free
- Copy what someone else did: free
- Make an educated guess based on simulation

How to decide on a strategy? Simulating genome sequencing

"Plantagora": Plant Genome Assembly Simulation Platform
- Use existing genomes to simulate sequencing reads
- Combine reads in many combinations
- Assemble
- Score the results with meaningful metrics
- Report results on a web site
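
To make the first step concrete, here is a minimal sketch of simulating reads from an existing genome. It is not Plantagora's implementation, which the slides do not describe; it assumes error-free, single-end reads sampled uniformly from the forward strand, and the function name and parameters are hypothetical.

```python
import random


def simulate_reads(genome, read_len, coverage, rng):
    """Sample error-free reads uniformly until the target depth is reached."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    num_reads = int(coverage * len(genome) / read_len)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


rng = random.Random(0)  # fixed seed so the run is repeatable
toy_genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(toy_genome, read_len=50, coverage=5, rng=rng)
print(len(reads), "reads of", len(reads[0]), "bp")
```

Varying `read_len` and `coverage`, and mixing read sets from several such calls, mirrors the "combine reads in many combinations" step before assembly and scoring.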

Summary
- No longer BACs vs WGS
- Different ways of using BACs:
  - Linear pooling
  - 2D pooling
  - BACs for map, WGS for sequence
- WGS works on the easy parts of the genome
- Simulation is valuable in evaluating strategies