1. Data production 2. General outline of assembly strategy

Presentation transcript:

1. Data production
2. General outline of assembly strategy

Original plan
►454
►SOLiD
►WGP (Keygene): new sequence-based physical map
  |Due date: July 15

Developments
►US to join 454 data production
►Spain to prepare 4-5 kb mate-pair library and run
►Titanium throughput lower than specs (500 Mb/run):
  | Mb /run
►Effect of clonality/redundancy apparent in 454 data:
  |~11% in shotgun library
  |~13% in 3 kb library
  |~30% in 20 kb library
►Roche/454 offered to prepare additional paired-end libraries
►New recommendations for coverage given by Roche/454:

New recommendations Roche/454
►Libraries per genome size
  |3 kb: 1 library for every 250 Mb of your genome
  |8 kb: 1 library for every 100 Mb (or 250 Mb) of your genome
  |20 kb: 1 library for every 100 Mb of your genome
►Sequencing per library
  |3 kb: 2 Titanium runs per library, 3X coverage
  |8 kb: 1 Titanium run per library, 2X coverage
  |20 kb: 0.5 Titanium runs per library, 1.5-2X coverage
  |15X shotgun reads
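The recommendations above amount to simple arithmetic, so a minimal Python sketch may help when dividing the work. The function name `plan_paired_end` and the 950 Mb genome size are assumptions for illustration (the slide gives no genome size), and for the 8 kb libraries the sketch uses the 100 Mb figure rather than the alternative 250 Mb.

```python
import math

# Roche/454 recommendations as listed on the slide:
# insert size -> (genome Mb covered per library, Titanium runs per library).
# For 8 kb the slide gives "100 Mb (or 250 Mb)"; 100 Mb is assumed here.
RECOMMENDATIONS = {
    "3kb": (250, 2.0),
    "8kb": (100, 1.0),
    "20kb": (100, 0.5),
}


def plan_paired_end(genome_mb):
    """Return {insert size: (libraries, total runs)} for a genome of genome_mb Mb."""
    plan = {}
    for insert, (mb_per_library, runs_per_library) in RECOMMENDATIONS.items():
        # Round up: a partial stretch of genome still needs a whole library.
        libraries = math.ceil(genome_mb / mb_per_library)
        plan[insert] = (libraries, libraries * runs_per_library)
    return plan


# Hypothetical genome size of 950 Mb (an assumption, not stated on the slide).
for insert, (libraries, runs) in plan_paired_end(950).items():
    print(f"{insert}: {libraries} libraries, {runs} Titanium runs")
```

Under that assumed genome size the sketch gives four 3 kb libraries and ten each of the 8 kb and 20 kb libraries.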

Paired-end library production by Roche/454
►Q
  |3 kb libraries: 4
  |20 kb libraries: 4
►Q (currently being produced: ready ~beginning of August)
  |8 kb libraries: 10 (US)
  |20 kb libraries: 6 (Italy & France)
  |40 kb libraries: 4 (US)

NL sequencing of Q libraries
►shotgun libraries (home made):
  |total: 19 runs
►3 kb:
  |lib1: 4.0 runs; lib2: runs; lib3: 0.25 runs; lib4: 0.25 runs
  |total: runs
  |libraries also shipped to Italy and France
►20 kb:
  |lib1: 5.75 runs; lib2: 1 run; lib3: 0.25 runs; lib4: 0.25 runs
  |total: 7.25 runs
  |libraries also shipped to Italy and France

Typical output of a 454 Ti run
Basecalling software bug: a new version of the basecaller may yield an additional ~25 bases/read!

Calculations (1)
Low-end specs; corrections for clonality/redundancy

Calculations (2)
NL sequencing of Q libraries
►shotgun libraries (home made):
  |total: 19 runs = 5.9 Gb = 6.2X
  |recommended = 15X
►3 kb:
  |total: runs = 1.7 Gb (nonredundant) * 50% = 0.9X paired ends
  |recommended = 3X
►20 kb:
  |total: 7.25 runs = 1.5 Gb (nonredundant) * 50% = 0.8X paired ends
  |recommended = 1.5-2X
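The coverage figures above can be reproduced with a small Python sketch. The genome size of 0.95 Gb is an assumption inferred from "5.9 Gb = 6.2X"; it is not stated on the slide. The `usable_fraction` parameter stands in for the slide's 50% factor on paired-end libraries, whose exact meaning is also an assumption here.

```python
# Assumed genome size in Gb, inferred from "5.9 Gb = 6.2X" (not stated on the slide).
GENOME_GB = 0.95


def coverage(gb_sequenced, genome_gb=GENOME_GB, usable_fraction=1.0):
    """Fold coverage from nonredundant Gb, scaled by the fraction counted as usable."""
    return gb_sequenced * usable_fraction / genome_gb


def shortfall(current_x, recommended_x):
    """Additional fold coverage still needed to reach the recommendation."""
    return max(0.0, recommended_x - current_x)


shotgun = coverage(5.9)
pe_3kb = coverage(1.7, usable_fraction=0.5)
pe_20kb = coverage(1.5, usable_fraction=0.5)

print(f"shotgun: {shotgun:.1f}X, still needed: {shortfall(shotgun, 15):.1f}X")
print(f"3 kb paired ends: {pe_3kb:.1f}X, still needed: {shortfall(pe_3kb, 3):.1f}X")
print(f"20 kb paired ends: {pe_20kb:.1f}X, still needed: {shortfall(pe_20kb, 1.5):.1f}X")
```

The same two functions can be reused for the "who has to do how much additional sequencing" question once the per-library run counts are known.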

To be calculated today!
►Who has to do how much additional sequencing from which libraries?

SOLiD data production
►NL / Applied Biosystems

SOLiD data production
►Applied Biosystems offered to prepare an additional 10 kb mate-pair library
  |currently running in Italy
►Spain produces 4-5 kb mate-pair library
►Discussion:
  |do we need an additional 7 kb mate-pair library, to be prepared by UK?

Additional data
►~4 million shotgun Sanger reads from Selected BAC Mixture (SBM data, Kazusa)
  |currently being put on a hard disk, which will be shipped to the Netherlands this week
►400,000 BAC ends (200,000 pairs)
►200,000 fosmid ends (100,000 pairs)
  |additional 200K reads will be produced (?)
►~36% euchromatic sequence (70 Mb)
►WGP: sequence-based physical map

1. Data production
2. General outline of assembly strategy

Strategy overview
1. Create assembly-validation set
2. Filter raw data
3. De novo assembly of 454 & SBM data
4. Consolidate 454/SBM assemblies
5. Integrate SOLiD data into 454/SBM assembly
6. Scaffold using BAC and fosmid ends
7. Map scaffolds to physical map

Strategy overview
►Release of assembly to SOL Sequencing Consortium: November
  |Annotation by iTAG
►Public release of data (under ENCODE guidelines): December 2009

Strategy in detail
1: Create assembly-validation set
►Input: Sanger BAC contigs from SGN
►Output: selected high-quality subset of large Sanger BAC contigs
Discussion:
►We might be able to use the same pipeline for BAC selection as is being developed for potato (by Erwin Datema)
►Coordinator/specific tasks/division of labor: single location, single person: NL
►Deadline: August 1

2: Filter raw data (1)
►Input: raw sequence data
►Output: clean reads, ready for assembly
►Discussion: Should the input data be filtered in advance? If so, what criteria should be used? Should all countries use the same filtering, or can everyone experiment with different settings and filters and contribute their best data set?
►Possible filter criteria: repeats, contamination (human, vectors, local sources of contamination, mitochondrion/chloroplast), duplicates (redundancy & clonality)
►How exactly will the high repeat content influence the assembly? Can we include repeats in the assembly from the start, or should we remove them to reduce complexity (and will this influence the final assembly quality)?

2: Filter raw data (2)
►Coordinator/specific tasks/division of labor: single location (filtering for local sources of contamination probably has to be done locally, because not everyone may be willing or allowed to share 'local' sequences)
►Deadline: September 5
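As one illustration of the "duplicates (redundancy & clonality)" criterion, here is a minimal Python sketch that keeps only the first read seen for each distinct read start. The function name `drop_prefix_duplicates` and the 50-base prefix length are assumptions, and treating reads with identical starts as clonal duplicates is a deliberate simplification; it is not the consortium's filtering method.

```python
def drop_prefix_duplicates(reads, prefix_len=50):
    """Keep the first read for each distinct sequence prefix.

    reads: iterable of (name, sequence) tuples.
    prefix_len: number of leading bases compared (hypothetical default).
    """
    seen = set()
    kept = []
    for name, seq in reads:
        # Case-insensitive comparison on the start of the read only.
        key = seq[:prefix_len].upper()
        if key in seen:
            continue  # treated as a clonal/redundant copy
        seen.add(key)
        kept.append((name, seq))
    return kept


example = [
    ("read1", "ACGTACGTAA"),
    ("read2", "acgtacgtTT"),  # same first 8 bases as read1
    ("read3", "TTTTACGTAA"),
]
print(drop_prefix_duplicates(example, prefix_len=8))
```

Comparing only a prefix tolerates reads of different lengths from the same fragment, but it would also discard genuinely independent reads that happen to start at the same position, so the threshold deserves the same discussion as the other filter settings.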

3: De novo assembly of 454 & SBM data (1)
►Input: (filtered) 454 and SBM reads
►Output: 5-10 different assemblies
Discussion:
►Explore different assembly methods, parameter settings, etc.
  |Newbler, CABOG, other?
►Should these assemblies already be validated against the validation set, or will this happen during the next step?
►What criteria should an assembly meet, and how should the quality of the assemblies be assessed? Should we define these? Statistics like the number of contigs/scaffolds, N50 size, etc.?
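For comparing the candidate assemblies, the N50 statistic mentioned above can be computed as follows. This is a minimal Python sketch using the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length); the function name `n50` is ours.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(contigs), sum(contigs), n50(contigs))
```

N50 rewards contiguity only, so it is best reported alongside contig/scaffold counts and a check against the validation set rather than on its own.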

3: De novo assembly of 454 & SBM data (2)
Discussion:
►How should unassembled reads be treated? These would include repetitive reads, singleton reads (and very small contigs?), erroneous reads, etc.
►Should all data (assembled or not) be available in the end for possible downstream use?
►Do we want to do a de novo assembly of the SOLiD data? If so, should we assemble it standalone or in a hybrid fashion with 454 & SBM?
►Coordinator/specific tasks/division of labor: Assembly in one location, or distributed over countries? In the latter case, how to divide the labor? In our opinion multiple people could contribute to this step.
►Deadline:

4: Consolidate multiple 454/SBM assemblies into a single best product (1)
►Input: 5-10 assembled data sets
►Output: single best, validated assembly of 454 and SBM data
Discussion:
►Reconcile and merge various assemblies (from step 3) into a single best assembly
►The assembly must be validated against the validation set (from step 1): all BAC contigs must be present in the assembly
►Compare and validate assemblies (e.g. amosvalidate) and assess error rates among different assemblies

4: Consolidate multiple 454/SBM assemblies into a single best product (2)
Discussion:
►What are the quality criteria? Which data make it into the best assembly? How should conflicts between the assemblies be resolved?
►Can we already use the physical map for some quality assessment?
►Coordinator/specific tasks/division of labor: Consolidation should happen in a single location
►Deadline:

5: Add SOLiD data to 454/SBM assembly (1)
►Input: SOLiD reads and single best 454/SBM assembly (from step 4)
►Output: single best 454/SBM/SOLiD assembly
Discussion:
►De novo assembly of SOLiD data?
►Use SOLiD reads to fix possible base errors and homopolymer tracts in the 454/SBM assembly
►Gap filling and extension using unassembled SOLiD/454/SBM reads and read pairs

5: Add SOLiD data to 454/SBM assembly (2)
Discussion:
►Coordinator/specific tasks/division of labor: De novo assembly can possibly be done by multiple people
►Consolidation and/or mapping (incl. gap filling) on 454/SBM assembly should happen at a single location
►Deadline:

6: Scaffold using BAC and fosmid ends
►Input: clone ends and single best 454/SBM/SOLiD assembly
►Output: single best 454/SBM/SOLiD/clone-end assembly
Discussion:
►Strict selection on clone ends to select non-duplicated reads that have a paired-end read
►Newbler can handle paired fosmid ends but not BAC ends (limit on spacing of paired ends)
►Coordinator/specific tasks/division of labor: Single location?
►Deadline:
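The strict clone-end selection described above could look roughly like this in Python. The record layout `(clone_id, direction, sequence)` with directions `"F"` and `"R"`, and the rule that a duplicated sequence disqualifies the whole clone, are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def select_clone_end_pairs(reads):
    """Keep clones that have exactly one forward and one reverse end read,
    neither of which has a sequence that occurs more than once in the input.

    reads: list of (clone_id, direction, sequence); direction is "F" or "R".
    Returns {clone_id: {"F": sequence, "R": sequence}}.
    """
    seq_counts = Counter(seq for _, _, seq in reads)
    ends = defaultdict(list)
    for clone_id, direction, seq in reads:
        ends[clone_id].append((direction, seq))

    pairs = {}
    for clone_id, clone_ends in ends.items():
        directions = sorted(direction for direction, _ in clone_ends)
        if directions != ["F", "R"]:
            continue  # missing mate, or more than one read for an end
        if any(seq_counts[seq] > 1 for _, seq in clone_ends):
            continue  # at least one end is a duplicated read
        pairs[clone_id] = dict(clone_ends)
    return pairs


example = [
    ("bacA", "F", "ACGTAC"), ("bacA", "R", "TTGACA"),
    ("bacB", "F", "GGGCCC"),                           # no mate
    ("bacC", "F", "CATCAT"), ("bacC", "R", "ACGGTT"),
    ("bacD", "F", "CATCAT"), ("bacD", "R", "TGCATG"),  # shares an end with bacC
]
print(sorted(select_clone_end_pairs(example)))
```

Exact sequence identity is a stand-in for whatever duplicate definition is agreed on in the filtering step.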

7: Map scaffolds to physical map
►Input: physical map and single best 454/SBM/SOLiD/CE assembly
►Output: draft of tomato genome
Discussion:
►Should this be done incrementally with mapping of the clone ends? How to handle contradictions between steps 6 and 7?
►Coordinator/specific tasks/division of labor: Coordinated by and executed in NL (Wageningen)
►Deadline:

To be settled today
►Time frame
  |July - October 2009
  |Timing of deliverables
►Practical issues:
  |Division of labor
►Share all 454 data with assembly team from 454 Life Sciences (Jim Knight)?

Strategy overview
Task                                            Partners          Due date
1. Create assembly-validation set               NL
2. Filter raw data                              NL, Fr, It
3. De novo assembly of 454 & SBM data           NL, US, Fr, It
4. Consolidate 454/SBM assemblies               NL
5. Integrate SOLiD data into 454/SBM assembly   It, Sp
6. Scaffold using BAC and fosmid ends           NL
7. Map scaffolds to physical map                NL