Sequencing the Maize (B73) Genome


Sequencing the Maize (B73) Genome Maize Genome Sequencing Consortium Genome Sequencing Center

The Team
WU Genome Sequencing Center (R. Wilson, PI)
  Bob Fulton, Pat Minx, Sandy Clifton
Arizona Genomics Institute (R. Wing)
Cold Spring Harbor Laboratory
  D. Ware, L. Stein, R. McCombie, R. Martienssen
Iowa State University (P. Schnable & S. Aluru)
The Maize research community

The Plan

Progress as of 9/30/06

Agenda
9:00 – 9:15 Introductions and Project Overview (Rick Wilson)
9:15 – 10:15 Plans and Progress – WU/AGI/CSHL/ISU Project
  Map and Tile Path Selection (Rod Wing)
  Library Construction and Production (Lucinda Fulton)
  Sequence Improvement (Bob Fulton, Dick McCombie, Rod Wing)
  Data Submission (Joanne Nelson)
  Annotation and Data Display (Doreen Ware)
  Outreach (Rick Wilson)
10:15 – 10:30 Break
10:30 – 11:00 Plans and Progress – DOE Project (Dan Rokhsar)
11:00 – 11:30 Future Plans and Collaborations
  Pat Schnable (by phone) - retrotransposons
11:30 – Noon Executive Session
Noon – 1:00 Working Lunch and Discussion
1:00 Depart for Airport

BAC-by-BAC Strategy to Sequence the Maize Genome Maize B73 Genome (2300 Mb) BAC library construction (Hind III, EcoR I/MboI ; 27X deep ; 150kb avg. insert) Genetic Anchoring in silico, overgo hybridization Fingerprinting ~460,000 BACs BAC End Sequencing ~800,000 BAC physical maps (HICF & Agarose) FPC databases (Agarose and HICF) STC database Choose a seed BAC Shotgun sequencing and finishing STC database search, FP comparison Determine minimum overlap BACs Complete maize genome sequence

Map Summary
Total Assembled Contigs: 721
  Equal to 2,150 Mb, 93.5% coverage of 2300 Mb genome
Anchored: 421 ctgs, 86.1% of the genome
  Average anchored contig size: 4.7 Mb
Unanchored: 300 ctgs, 7.4% coverage
  Average unanchored contig size: 0.56 Mb
  189 of the 300 unanchored contigs contain fewer than 10 clones
Largest anchored contig: 22.9 Mb in Chr9
Largest unanchored contig: 6.7 Mb
Total FPC Markers: 25,924
  STS markers: 9,129
  Overgo Markers: 14,877
  Anchored markers: 1918
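The coverage figures in the map summary can be cross-checked with a few lines of arithmetic. The sketch below recomputes the percentages from the contig counts and the rounded average contig sizes quoted on the slide, so it can only be expected to agree with the slide to within about 0.1 percentage points. All variable and function names are illustrative.

```python
# Figures copied from the Map Summary slide.
GENOME_MB = 2300
ASSEMBLED_MB = 2150


def percent_of_genome(size_mb, genome_mb=GENOME_MB):
    """Return size_mb as a percentage of the genome size."""
    return 100.0 * size_mb / genome_mb


# Total assembled coverage (slide: 93.5%).
assembled_pct = percent_of_genome(ASSEMBLED_MB)

# Anchored and unanchored coverage, estimated as
# contig count x rounded average contig size (slide: 86.1% and 7.4%).
anchored_pct = percent_of_genome(421 * 4.7)
unanchored_pct = percent_of_genome(300 * 0.56)

# The three marker categories add up to the total FPC marker count.
marker_total = 9129 + 14877 + 1918
contig_total = 421 + 300

print(f"assembled:  {assembled_pct:.1f}%")
print(f"anchored:   {anchored_pct:.1f}%")
print(f"unanchored: {unanchored_pct:.1f}%")
print(f"markers:    {marker_total}, contigs: {contig_total}")
```

The small differences in the anchored and unanchored percentages come from using the rounded average sizes rather than the exact contig lengths, which the slide does not give.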

MTP Selection
Seed BACs: 4000, done
Mega Contig: 197, done
Clone Walking from Seed BACs: 2,800 done; in progress
Total clones picked = 6,997
On track to deliver 1000 clones/month until maize MTP is complete

Flowchart for MTP picking and Library Construction Clone selection (combine seed BAC and BAC end sequences with fingerprinting and trace files) Clone picking (Resource Center) GenBank BAC end sequence database MTP sequencing Seed BAC database Library DNA production Library DNA production DNA shearing Hfq sequencing MTP BAC end database Clone verification Clone shipping Continue shotgun library construction at WashU

Seed BAC Walking
- In the agarose and HICF maps, select large clones next to the seed BAC
- Blastn search of BAC end sequences against seed BAC sequences
- Check blastn alignment for candidate clones
- Check trace file for dye blob
- Check the Sulston score in HICF map for overlap
- Check agarose fingerprints to avoid overlap with large bands
- Choose walking clone

Minimum Tile Path Pipeline
- BAC end sequences of potential BACs are BLASTed against the seed BACs
- Results are classified based on location on the FPC
- A table of filtered BLAST results is created for each BAC, with links to CMap and GBrowse
- BLAST results are imported into CMap and GBrowse with additional information such as trace files and FPCs
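The BLAST filtering step might look roughly like the sketch below. This is not the consortium's pipeline code: it assumes the hits are in the standard 12-column BLAST tabular format (query, subject, percent identity, alignment length, ..., subject start, subject end, ...), and the identity and length thresholds, the overlap estimate, and all names are hypothetical. The real selection also checks trace files, Sulston scores and agarose fingerprints, which are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    bes_id: str    # BAC end sequence (query)
    seed_id: str   # seed BAC (subject)
    pident: float  # percent identity
    length: int    # alignment length
    sstart: int    # hit start on the seed BAC
    send: int      # hit end on the seed BAC


def parse_blast_tab(lines):
    """Parse 12-column BLAST tabular lines into Hit records."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[8]), int(f[9])))
    return hits


def rank_walking_candidates(hits, seed_len, min_pident=98.0, min_len=400):
    """Rank BAC ends by a rough estimate of overlap with the seed BAC.

    Thresholds are illustrative assumptions. Without orientation data,
    the overlap is taken as the distance from the hit to the nearer
    seed end, which is an optimistic estimate.
    """
    ranked = []
    for h in hits:
        if h.pident < min_pident or h.length < min_len:
            continue
        lo, hi = sorted((h.sstart, h.send))
        overlap = min(hi, seed_len - lo + 1)
        ranked.append((overlap, h.bes_id))
    return sorted(ranked)


example = [
    "bes_A\tseed1\t99.5\t601\t2\t0\t1\t601\t140000\t140600\t0.0\t1100",
    "bes_B\tseed1\t99.0\t501\t4\t0\t1\t501\t120000\t120500\t0.0\t900",
    "bes_C\tseed1\t90.0\t600\t60\t0\t1\t600\t149000\t149599\t1e-50\t500",
]
candidates = rank_walking_candidates(parse_blast_tab(example), seed_len=150000)
print(candidates)
```

Sorting by estimated overlap puts the clone with the smallest overlap first, which matches the goal of determining minimum-overlap BACs; the low-identity hit is dropped before ranking.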

Minimum Tile Path Pipeline Usage A table of alignments between the seed BAC and the BAC end sequences contains links to CMap and GBrowse. CMap displays the FPC data for the seed BAC and the potential next BACs. GBrowse provides an alignment of the BES with the seed sequence and displays the trace data.

Blast Results Table

Maize Production Sequencing Shotgun of 19,000 BACs Fosmid End Sequencing of 1 Million Reads BAC End Sequencing of 220,000 clones

Maize BAC shotgun BAC DNA received from AGI or prepared at the GSC Small Scale Library Construction Production Sequencing - 1,536 reads/project Automated Shotgun_done

To date 3,106 BAC clones are shotgun_done

Maize Fosmid Sequencing
- Fosmid trays 0001 to 0471 were received from the Messing lab
- Initial QC was fine, but the bulk shipment has failed to grow
- Stamping results of the original trays show no growth
- 85 fosmid ligations, which represent ~250,000 clones, were received from the Messing lab; plating is underway
- GSC fosmid library construction has been completed and represents 1M clones
- Expected completion date is November of this year.

Maize BAC End Sequencing
- BAC end sequencing will be completed next week
- Total of 440,000 reads from two different libraries
- Pass rate of 75% with an average read length of 600 bases
- Paired end read rate is ~70%
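As a quick sanity check on these numbers, the expected yield can be worked out directly. The sketch assumes that the 440,000 reads are the attempted reads (two ends for each of the 220,000 clones in the production plan) and that the 600-base average applies to passing reads; neither assumption is stated on the slide, and the names are illustrative.

```python
CLONES = 220_000          # BAC clones to be end-sequenced (production plan)
ENDS_PER_CLONE = 2
PASS_RATE = 0.75          # fraction of reads passing QC
AVG_READ_LENGTH = 600     # bases, assumed to apply to passing reads

attempted_reads = CLONES * ENDS_PER_CLONE
passing_reads = round(attempted_reads * PASS_RATE)
total_bases = passing_reads * AVG_READ_LENGTH

print(f"attempted reads: {attempted_reads:,}")
print(f"passing reads:   {passing_reads:,}")
print(f"BES sequence:    {total_bases / 1e6:.0f} Mb")
```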

Sequence Improvement Pipeline
- Shotgun_done triggers the prefinishing pipeline
- Initial identification of “do finish” regions
- Manual sorting and use of autoedit (Gordon) to break apart misassembly.
- Autofinish (Gordon) used to choose directed reactions for all gaps and regions of low quality in “do finish” regions
- Reassembly and 2nd iteration of prefinishing pipeline
- Final identification of “do finish” regions and handoff to finishing pipeline

Clone Improvement through the Prefinishing Pipeline

Coverage (green) Spanning Plasmids End

EST sequence GSS sequence Do Finish Repeat Tags

Alignment with cDNA read pairs Alignment with End Sequences

Future Plans for Improved Throughput Automated Shotgun-done status assigning Overlap Evaluation at Prefinishing Addition of Fosmid End Pairs at Prefinishing Direct Sequencing for Unspanned Gaps Additional Finishing Staff Hired at all 3 Centers

Maize clone submissions
clone status              submission keywords
shotgun complete          HTGS_PHASE1; HTGS_FULLTOP
2 rounds of prefinish     HTGS_PHASE1; HTGS_PREFIN
in finishing              HTGS_PHASE1; HTGS_ACTIVEFIN
finished                  HTGS_PHASE1; HTGS_IMPROVED
Query GenBank by keywords:
  zea mays[ORGN] AND HTGS_PREFIN[KYWD] AND WUGSC[CNTR]
  zea mays[ORGN] AND HTGS_IMPROVED[KYWD] AND WUGSC[CNTR]
Restrict by date range:
  zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09[PDAT]
  zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09/26:2006/10/03[PDAT]
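Query strings of this form are easy to generate programmatically. The helper below is a minimal sketch that only builds the search string from the field tags shown on the slide ([ORGN], [CNTR], [KYWD], [PDAT]); the function and dictionary names are hypothetical, and it does not contact GenBank.

```python
# Clone status -> second HTGS keyword, as listed on the slide.
STATUS_KEYWORD = {
    "shotgun complete": "HTGS_FULLTOP",
    "2 rounds of prefinish": "HTGS_PREFIN",
    "in finishing": "HTGS_ACTIVEFIN",
    "finished": "HTGS_IMPROVED",
}


def htgs_query(keyword, center="WUGSC", organism="zea mays", pdat=None):
    """Build a GenBank search string for one HTGS keyword.

    pdat may be a month ("2006/09") or a range
    ("2006/09/26:2006/10/03"), following the slide's examples.
    """
    terms = [f"{organism}[ORGN]", f"{center}[CNTR]", f"{keyword}[KYWD]"]
    if pdat:
        terms.append(f"{pdat}[PDAT]")
    return " AND ".join(terms)


print(htgs_query(STATUS_KEYWORD["shotgun complete"], pdat="2006/09"))
print(htgs_query(STATUS_KEYWORD["finished"]))
```

The resulting strings can be pasted into the GenBank search box, or passed to whatever Entrez client is in use.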

HTGS_IMPROVED submissions
Pick a clonename, any clonename -
  DEFINITION Zea mays chromosome 4 clone CH201-11H16; ZMMBBc0011H16
  Center project name: Z_AF-11H16
Improved sequence is annotated on submission record
Where possible, contigs have been ordered and oriented based on read pairing, and these regions are designated as scaffolds.
Small contigs (<2kb) that don’t represent a clone end, don’t contain improved sequence, or are not part of a scaffold are removed from the final submission.
Contigs are screened for bacterial contamination

FEATURES             Location/Qualifiers
     source          1..173904
                     /organism="Zea mays"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4577"
                     /chromosome="unknown"
                     /clone="CH201-112C8; ZMMBBc0112C08"
     misc_feature    1..51940
                     /note="scaffold_name:Scaffold1"
     misc_feature    1..36440
                     /note="assembly_name:Contig245 clone_end:left vector_side:T7"
     gap             36441..36540
                     /estimated_length=unknown
     misc_feature    36541..51940
                     /note="assembly_name:Contig240"
     misc_feature    51941..129231
                     /note="scaffold_name:Scaffold2"
     gap             51941..52040
     misc_feature    52041..59371
                     /note="assembly_name:Contig250"
     ...........
     misc_feature    120342..122491
                     /note="Improved sequence."
     misc_feature    128142..129231
     misc_feature    129232..139656
                     /note="scaffold_name:Scaffold3"
     .....
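The scaffold annotations in a record like this can be pulled out with a small parser. The sketch below is a hand-rolled illustration that handles only simple `start..end` locations and `/note` qualifiers, using an excerpt of the features shown on the slide; for real GenBank files a full parser (for example Biopython's SeqIO) would be the more robust choice. Function names are hypothetical.

```python
import re

FEATURE_RE = re.compile(r"^(\S+)\s+(\d+)\.\.(\d+)$")
NOTE_RE = re.compile(r'^/note="(.*)"$')

# Excerpt of the feature table from the slide.
EXCERPT = """\
     misc_feature    1..51940
                     /note="scaffold_name:Scaffold1"
     misc_feature    1..36440
                     /note="assembly_name:Contig245 clone_end:left vector_side:T7"
     gap             36441..36540
                     /estimated_length=unknown
     misc_feature    36541..51940
                     /note="assembly_name:Contig240"
     misc_feature    51941..129231
                     /note="scaffold_name:Scaffold2"
     misc_feature    129232..139656
                     /note="scaffold_name:Scaffold3"
"""


def parse_features(text):
    """Return a list of features with key, start, end and optional note."""
    features = []
    for raw in text.splitlines():
        line = raw.strip()
        m = FEATURE_RE.match(line)
        if m:
            features.append({"key": m.group(1), "start": int(m.group(2)),
                             "end": int(m.group(3)), "note": None})
            continue
        n = NOTE_RE.match(line)
        if n and features:
            # Attach the note to the most recent feature line.
            features[-1]["note"] = n.group(1)
    return features


def scaffold_lengths(features):
    """Map scaffold name to its length in bases (inclusive coordinates)."""
    lengths = {}
    for f in features:
        if f["note"] and f["note"].startswith("scaffold_name:"):
            name = f["note"].split(":", 1)[1]
            lengths[name] = f["end"] - f["start"] + 1
    return lengths


features = parse_features(EXCERPT)
print(scaffold_lengths(features))
```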

GenBank
1005 HTGS_FULLTOP
254 PREFIN_DONE
1532 ACTIVE_FIN
357 HTGS_IMPROVED

Ongoing work at CSHL BAC Annotations Levels Data Analysis Display Project Management Collaborations

BAC Data Analysis
Ensembl Pipeline
3 inclusive phases of annotation
- Level I: Display BAC information
- Level II: Sequence-based annotations
- Level III: Integrative annotations
Shiran Pasternak, Apurva Narechania, Joshua Stein

Application of Mathematical Repeat Analysis
- Identifies novel repeats w/o dependence on curation.
- Based on frequency of 20-mers in JGI WGS sequence
- Correlates with presence of retroelements.
- Can modulate threshold to optimize application.
AC148169
Statistics are based on FgenesH exons from High Confidence Gene Models (aligning nearly end to end with a peptide in nraa), which are classified as TE or WH.
Apurva Narechania, Joshua Stein
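The idea of flagging repeats by k-mer frequency can be illustrated with a toy example. This is a simplified sketch, not the tool used by the project (the collaborations slide credits Vmatch for mathematical repeats): it counts k-mers on one strand only, ignores reverse complements, and holds all counts in memory, which would not scale to whole-genome shotgun data. The function names and default threshold are assumptions.

```python
from collections import Counter


def count_kmers(seqs, k=20):
    """Count every k-mer occurring in a collection of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def repeat_mask(seq, counts, k=20, threshold=10):
    """Mark each base covered by a k-mer seen at least `threshold` times.

    Raising or lowering the threshold trades sensitivity for
    specificity, as the slide notes.
    """
    seq = seq.upper()
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= threshold:
            for j in range(i, i + k):
                mask[j] = True
    return mask


# Toy data with k=4 so the example stays readable.
reads = ["ACGTACGTAC"] * 3 + ["TTTTGGGGCC"]
kmer_counts = count_kmers(reads, k=4)
mask = repeat_mask("GGACGTGG", kmer_counts, k=4, threshold=3)
print("".join("R" if m else "." for m in mask))
```

In the toy data only the k-mer ACGT is frequent in the reads, so only the four bases it covers in the target are flagged.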

Retroelement Annotation
- Collaboration with Jeff Bennetzen and Philip SanMiguel
- Classify retroelement families
- Current list covers ~68% of genome
- Ten most prevalent account for ~80% of retroelement sequences: Ji, huck, opie, zeon, cinful, prem1, grande, xilon, gyma, giepum
- Goal is to visualize the history of transpositions
Retro classes are color coded. LTRs were identified and are shown in a separate track (for clarity, the track is compacted and labels are turned off). Numbers below each glyph show the start/end position of the alignment relative to the coordinates of the retro sequence, expressed as a percent of its length. So if the retro sequence were 1000 nt long, then 0-15 would represent positions 1 to 150 of that sequence, and 45-63 would represent positions 450 to 630. This allows us to reconstruct the giepum element as once being a full-length, intact element into which a full-length ji element and opie elements have inserted.
Statistics on genome coverage are based on pilot 100 BACs
Giepum element interrupted by ji and opie in AC148166
Joshua Stein
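The percent-of-length labels can be converted back to sequence positions with a one-line calculation. The helper below is an illustrative sketch that reproduces the slide's worked example; the function name is hypothetical, and clamping a 0% start to position 1 is an assumption made so that the result matches the "position 1 to 150" wording.

```python
def percent_span_to_positions(start_pct, end_pct, length):
    """Convert a percent-of-length span to 1-based positions.

    A start of 0% is reported as position 1, matching the slide's
    example for a 1000 nt element.
    """
    start = max(1, round(start_pct * length / 100))
    end = round(end_pct * length / 100)
    return start, end


print(percent_span_to_positions(0, 15, 1000))
print(percent_span_to_positions(45, 63, 1000))
```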

Whole Genome Alignments
Wobble Aware Bulk Aligner (WABA)*
TIGR Transcripts Rice WABA alignments Maize
Distinguishes between:
- low-similarity regions (grey)
- high-similarity regions (medium blue)
- high-similarity regions w/ wobble-base mismatch of coding regions (green)
*Kent, WJ & Zahler, A.M. (2000). Genome Res. 10:1115-25
Joshua Stein

Whole Genome Alignments
BLASTZ* with AXTCHAIN** & CHAINNET**
- Sensitive gapped BLAST algorithm designed for aligning long sequences.
- Accommodates long gaps & overlapping gaps, inversions, translocations, & duplications
*Schwartz, S et al. (2003). Genome Res. 13:103-7
**Kent, WJ, et al. (2003). PNAS 100:11484-11489
Example of BLASTZ(net) display in Ensembl. Cg is chicken, Cf is dog, and the reference genome is human. I chose these because the primates were so close to human that essentially you got a solid pink line across the screen (not very interesting).
http://www.ensembl.org/Homo_sapiens/contigview?c=1:18913781.5;w=1000000;bottom=%7Cbump_gallus_gallus_blastz_net_match%3Aoff

www.maizesequence.org Sequenced BAC FPC Contig Virtual Bin Core Bin Marker Chromosome Synteny Views Main Navigation bar is accessible from every page Contains multiple entry points to the genome

MapView Displays statistics by chromosome and provides entry points based on a single chromosome

CytoView Provides detailed information on features anchored to the FPC map. The side bar highlights the location on the chromosome and provides page-specific functionality, including data export. The detailed view is customizable; tracks can be added or removed by users. Features have drop-down menus that contain general information as well as internal and external links.

ContigView This view is based on BAC coordinates and displays annotation levels II and III. The header contains the clone name in the physical map, the GenBank accession, and chromosome and FPC contig information. The detailed view offers semantic zooming, is customizable, and provides links to other views and information resources.

SyntenyView

Upcoming Features
Feature: Release
Notification System: October 2006
BlastView: December 2006
BAC Annotation
  Level II: January, 2007
  Level III annotation: April, 2007
WG alignments: June, 2007
BioMart: January, 2007
NSF collaborations
  TwinScan annotations: March, 2007
  Maize Optical Map: July, 2007
  Full-length cDNAs: December, 2007
Notification System: users are notified
- When a region of interest is updated
- When markers are aligned to a specific sequence
January, 2007

Hardware Environments
Software
- Developed locally
- Managed with source control
- Frequent releases to staging environment
- Quarterly production releases
Data
- Timed analysis on staging environment
- Mirrored weekly on production
The two maize machines (cannon and ascutney) are Dell PowerEdge 2850s (2U rackmount). They have:
- 8 GB of memory
- 1.5 TB of redundant disk space
- Dual 3.8 GHz Intel 64-bit processors
The machines are backed up using Atempo's Time Navigator backup software. The backups are performed over a gigabit switched network and the data is written to SDLT tape housed in an Overland Storage tape library. Full backups are performed every four weeks and incrementals three times per week.
42 IBM blade servers, each containing:
- 2 dual-core AMD 2.0 GHz 64-bit CPUs (84 CPUs, 168 cores total)
- 4 GB memory
- 73 GB SCSI disk drive
Doreen has 3 BCs, each with 14 blades. Each blade contains two 2.0 GHz dual-core Opteron 64s, with 4 x 1 GB memory and a single 73 GB SFF SCSI disk. These are integrated into the HPCC. Overall there are 33+3 BCs in the HPCC, all quad GigE connected to a central Cisco 6509 core. There are 2 management nodes and 2 storage nodes. The storage nodes are 32-bay Aberdeen units filled with 500 GB SATA II drives in RAID6 configurations. How best to use the space, how any backups will be done, etc. is yet to be determined; it depends on user requests and usage, but overall there will be ~20 TB available.
Shiran Pasternak, Apurva Narechania

Quality Assurance
Unit-testing framework
- Binary assertions
- Failure report and automatic notification
Software Quality Control: e.g., code retrieves correct data from the database
Data Quality Control: e.g., clone in GenBank record exists in FPC map
Shiran Pasternak
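A data quality check of the kind described (every clone in a GenBank record should also exist in the FPC map) could be written as a binary assertion in a unit test. The sketch below uses Python's standard unittest module purely for illustration; the slides do not say which framework the project uses, and the in-memory clone sets stand in for what would really be database lookups.

```python
import unittest

# Hypothetical stand-ins for database queries; clone names are taken
# from the submission slides for illustration only.
FPC_MAP_CLONES = {"ZMMBBc0011H16", "ZMMBBc0112C08"}
GENBANK_RECORD_CLONES = ["ZMMBBc0011H16", "ZMMBBc0112C08"]


def clones_missing_from_map(record_clones, map_clones):
    """Return GenBank clone names that are absent from the FPC map."""
    return sorted(set(record_clones) - set(map_clones))


class DataQualityControl(unittest.TestCase):
    def test_genbank_clones_exist_in_fpc_map(self):
        missing = clones_missing_from_map(GENBANK_RECORD_CLONES, FPC_MAP_CLONES)
        # Binary assertion: the test fails if any clone is missing,
        # and the message lists the offenders for the failure report.
        self.assertEqual(missing, [], f"clones absent from FPC map: {missing}")
```

Saved as a module, the check can be run with `python -m unittest`, and a scheduler or CI job could forward failures as the automatic notification.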

Project Management Mantis Bug Tracker Manage tasks using priorities, severities, and resource allocations Automated submission of issues using feedback form Generation of progress reports

Project Management Wiki Enhances group communication Meeting notes, flowcharts, specification documents Maintains history of specifications and design decisions Seamless editing

Collaborations
- MaizeGDB (Iowa State University, University of Missouri): C. Lawrence
- Maize Optical Map (University of Wisconsin): D. Schwartz
- Maize Transposon Annotation (University of Georgia, Purdue): J. Bennetzen, P. San Miguel
- Ensembl (EBI): E. Birney
- Vmatch for Mathematical Repeats (University of Hamburg): S. Kurtz
- Maize Full Length cDNA project (Arizona Genomics Institute): Y. Yu
- TwinScan (Danforth Plant Science Center): B. Barbazuk