Presentation on theme: "Sequencing the Maize (B73) Genome"— Presentation transcript:
1 Sequencing the Maize (B73) Genome
Maize Genome Sequencing Consortium
Genome Sequencing Center
2 The Team
- WU Genome Sequencing Center (R. Wilson, PI): Bob Fulton, Pat Minx, Sandy Clifton
- Arizona Genomics Institute (R. Wing)
- Cold Spring Harbor Laboratory: D. Ware, L. Stein, R. McCombie, R. Martienssen
- Iowa State University (P. Schnable & S. Aluru)
- The maize research community
6 Agenda
9:00 – 9:15 Introductions and Project Overview (Rick Wilson)
9:15 – 10:15 Plans and Progress – WU/AGI/CSHL/ISU Project
- Map and Tile Path Selection (Rod Wing)
- Library Construction and Production (Lucinda Fulton)
- Sequence Improvement (Bob Fulton, Dick McCombie, Rod Wing)
- Data Submission (Joanne Nelson)
- Annotation and Data Display (Doreen Ware)
- Outreach (Rick Wilson)
10:15 – 10:30 Break
10:30 – 11:00 Plans and Progress – DOE Project (Dan Rokhsar)
11:00 – 11:30 Future Plans and Collaborations
- Pat Schnable (by phone) - retrotransposons
11:30 – Noon Executive Session
Noon – 1:00 Working Lunch and Discussion
1: Depart for Airport
7 BAC-by-BAC Strategy to Sequence the Maize Genome
- Maize B73 genome (2300 Mb)
- BAC library construction (HindIII, EcoRI/MboI; 27X deep; 150 kb avg. insert)
- Genetic anchoring: in silico, overgo hybridization
- Fingerprinting: ~460,000 BACs
- BAC end sequencing: ~800,000
- BAC physical maps (HICF & agarose)
- FPC databases (agarose and HICF)
- STC database
- Choose a seed BAC
- Shotgun sequencing and finishing
- STC database search, FP comparison
- Determine minimum overlap BACs
- Complete maize genome sequence
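The library parameters on this slide can be cross-checked with a back-of-envelope calculation. The sketch below assumes the usual relation depth = clones × average insert / genome size; the function names are illustrative and not part of the project's software.

```python
GENOME_MB = 2300      # maize B73 genome size, from the slide
AVG_INSERT_KB = 150   # average BAC insert size, from the slide


def clones_for_depth(depth, genome_mb=GENOME_MB, insert_kb=AVG_INSERT_KB):
    """Number of clones such that clones * insert = depth * genome."""
    return depth * genome_mb * 1000 / insert_kb


def depth_from_clones(n_clones, genome_mb=GENOME_MB, insert_kb=AVG_INSERT_KB):
    """Clone coverage (X) implied by a clone count, assuming a uniform insert size."""
    return n_clones * insert_kb / (genome_mb * 1000)


print(clones_for_depth(27))         # clones implied by a 27X library
print(depth_from_clones(460_000))   # depth implied by ~460,000 fingerprinted BACs
```

Under these assumptions a 27X library corresponds to about 414,000 clones, the same order as the ~460,000 fingerprinted BACs listed on the slide.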
8 Map Summary
- Total assembled contigs: 721, equal to 2,150 Mb, 93.5% coverage of the 2300 Mb genome
- Anchored: 421 ctgs, 86.1% of the genome; average anchored contig size: 4.7 Mb
- Unanchored: 300 ctgs, 7.4% coverage; average unanchored contig size: 0.56 Mb
- 189 of the 300 unanchored contigs have fewer than 10 clones
- Largest anchored contig: 22.9 Mb in Chr9
- Largest unanchored contig: 6.7 Mb
- Total FPC markers: 25,924
- STS markers: 9,129
- Overgo markers: 14,877
- Anchored markers: 1918
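The coverage and average-size figures on this slide follow from the contig counts and the 2300 Mb genome size. A minimal sketch of that arithmetic, with illustrative function names:

```python
GENOME_MB = 2300.0  # genome size used on the slide


def pct_of_genome(mb, genome_mb=GENOME_MB):
    """Percent of the genome represented by `mb` megabases."""
    return 100.0 * mb / genome_mb


def mean_contig_mb(pct, n_contigs, genome_mb=GENOME_MB):
    """Average contig size when `n_contigs` contigs cover `pct` percent of the genome."""
    return pct / 100.0 * genome_mb / n_contigs


print(round(pct_of_genome(2150), 1))         # total map coverage in percent
print(round(mean_contig_mb(86.1, 421), 2))   # average anchored contig size, Mb
print(round(mean_contig_mb(7.4, 300), 2))    # average unanchored contig size, Mb
```

The anchored and unanchored percentages (86.1 + 7.4) add up to the 93.5% total, and 421 + 300 gives the 721 assembled contigs. The unanchored average computed this way is about 0.57 Mb, slightly above the slide's 0.56 Mb, presumably because the slide's percentages are rounded.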
9 MTP Selection
- Seed BACs: 4000, done
- Mega Contig: 197, done
- Clone walking from seed BACs: 2,800 done; in progress
- Total clones picked = 6,997
- On track to deliver 1000 clones/month until the maize MTP is complete
10 Flowchart for MTP picking and Library Construction
- Clone selection (combine seed BAC and BAC end sequences with fingerprinting and trace files)
- Clone picking (Resource Center)
- GenBank BAC end sequence database
- MTP sequencing
- Seed BAC database
- Library DNA production
- Library DNA production
- DNA shearing
- Hfq sequencing
- MTP BAC end database
- Clone verification
- Clone shipping
- Continue shotgun library construction at WashU
11 Seed BAC Walking
- In the agarose and HICF maps, select large clones next to the seed BAC
- Blastn search of BAC end sequences against seed BAC sequences
- Check blastn alignment for candidate clones
- Check trace file for dye blobs
- Check the Sulston score in the HICF map for overlap
- Check agarose fingerprints to avoid overlap with large bands
- Choose walking clone
12 Minimum Tile Path Pipeline
- BAC end sequences of potential BACs are BLASTed against the seed BACs
- Results are classified based on location on the FPC
- A table of filtered BLAST results is created for each BAC, with links to CMap and GBrowse
- BLAST results are imported into CMap and GBrowse with additional information such as trace files and FPCs
13 Minimum Tile Path Pipeline Usage
- A table of alignments between the seed BAC and the BAC end sequences contains links to CMap and GBrowse.
- CMap displays the FPC data for the seed BAC and the potential next BACs.
- GBrowse provides an alignment of the BES with the seed sequence and displays the trace data.
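The filtering and classification step of this pipeline might look roughly like the sketch below. It assumes BLAST results in the standard 12-column tabular format. The identity, length and end-window thresholds, the `Hit` class and the clone names are all hypothetical; the slides do not state the actual criteria.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    bes_id: str       # BAC end sequence name (query)
    identity: float   # percent identity
    length: int       # alignment length
    s_start: int      # alignment start on the seed BAC (subject)
    s_end: int        # alignment end on the seed BAC (subject)


def parse_tabular(lines):
    """Parse BLAST 12-column tabular output, skipping blank and comment lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[0], float(f[2]), int(f[3]), int(f[8]), int(f[9])))
    return hits


def filter_hits(hits, min_identity=98.0, min_length=400):
    """Keep only strong alignments (thresholds are assumptions)."""
    return [h for h in hits if h.identity >= min_identity and h.length >= min_length]


def classify(hit, seed_len, end_window=30_000):
    """Label a hit by where it falls on the seed BAC (window size is an assumption)."""
    lo, hi = sorted((hit.s_start, hit.s_end))  # minus-strand hits have start > end
    if hi <= end_window:
        return "left_end"
    if lo > seed_len - end_window:
        return "right_end"
    return "internal"


example = [
    "BES_A\tSEED1\t99.5\t650\t3\t0\t1\t650\t141000\t141649\t0.0\t1200",
    "BES_B\tSEED1\t99.0\t600\t6\t0\t1\t600\t70000\t70599\t0.0\t1100",
    "BES_C\tSEED1\t91.0\t300\t27\t0\t1\t300\t145000\t145299\t1e-90\t350",
    "BES_D\tSEED1\t99.8\t700\t1\t0\t700\t1\t5700\t5001\t0.0\t1290",
]
for hit in filter_hits(parse_tabular(example)):
    print(hit.bes_id, classify(hit, seed_len=150_000))
```

Hits near either end of the seed BAC are the natural candidates for walking clones. In the pipeline described above, the fingerprint and trace-file checks would then be applied before a clone is chosen.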
19 Maize Fosmid Sequencing
- Fosmid trays 0001 to 0471 were received from the Messing lab
- Initial QC was fine, but the bulk shipment has failed to grow
- Stamping results of the original trays show no growth
- 85 fosmid ligations, representing ~250,000 clones, were received from the Messing lab; plating is underway
- GSC fosmid library construction has been completed and represents 1M clones
- Expected completion date is November of this year
21 Maize BAC End Sequencing
- BAC end sequencing will be completed next week
- Total of 440,000 reads from two different libraries
- Pass rate of 75% with an average read length of 600 bases
- Paired end read rate is ~70%
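These read statistics imply a rough amount of sampled sequence. The sketch below assumes the 600-base average applies to passing reads only, which the slide does not state explicitly.

```python
TOTAL_READS = 440_000   # from the slide
PASS_RATE = 0.75        # from the slide
AVG_READ_LEN = 600      # bases; assumed to describe passing reads
GENOME_MB = 2300        # maize B73 genome size used elsewhere in the deck


def passing_reads(total=TOTAL_READS, pass_rate=PASS_RATE):
    """Number of reads that pass quality filtering."""
    return round(total * pass_rate)


def sampled_mb(n_reads, avg_len=AVG_READ_LEN):
    """Total sequence in Mb carried by `n_reads` reads of average length `avg_len`."""
    return n_reads * avg_len / 1e6


n_pass = passing_reads()
mb = sampled_mb(n_pass)
print(n_pass, mb, round(100 * mb / GENOME_MB, 1))  # reads, Mb, percent of genome
```

Under that assumption the passing reads number 330,000 and carry about 198 Mb, close to 9% of the genome sampled as BAC end sequence.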
22 Sequence Improvement Pipeline
- Shotgun_done triggers the prefinishing pipeline
- Initial identification of “do finish” regions
- Manual sorting and use of autoedit (Gordon) to break apart misassemblies
- Autofinish (Gordon) used to choose directed reactions for all gaps and regions of low quality in “do finish” regions
- Reassembly and 2nd iteration of prefinishing pipeline
- Final identification of “do finish” regions and handoff to finishing pipeline
23 Clone Improvement through the Prefinishing Pipeline
27 Alignment with cDNA read pairs; Alignment with End Sequences
28 Future Plans for Improved Throughput
- Automated Shotgun-done status assignment
- Overlap evaluation at prefinishing
- Addition of fosmid end pairs at prefinishing
- Direct sequencing for unspanned gaps
- Additional finishing staff hired at all 3 centers
29 Maize clone submissions
clone status             submission keywords
shotgun complete         HTGS_PHASE1; HTGS_FULLTOP
2 rounds of prefinish    HTGS_PHASE1; HTGS_PREFIN
in finishing             HTGS_PHASE1; HTGS_ACTIVEFIN
finished                 HTGS_PHASE1; HTGS_IMPROVED
Query GenBank by keywords:
zea mays[ORGN] AND HTGS_PREFIN[KYWD] AND WUGSC[CNTR]
zea mays[ORGN] AND HTGS_IMPROVED[KYWD] AND WUGSC[CNTR]
Restrict by date range:
zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09[PDAT]
zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09/26:2006/10/03[PDAT]
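The same queries can be assembled programmatically. The sketch below only builds the query string and an NCBI E-utilities ESearch URL; it does not contact NCBI. The helper names and the choice of `db="nuccore"` are assumptions, and whether every field tag from the slide is still accepted by the current Entrez is not verified here.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Clone status -> second submission keyword, as listed on the slide.
STATUS_KEYWORD = {
    "shotgun complete": "HTGS_FULLTOP",
    "2 rounds of prefinish": "HTGS_PREFIN",
    "in finishing": "HTGS_ACTIVEFIN",
    "finished": "HTGS_IMPROVED",
}


def build_term(status, center="WUGSC", organism="zea mays", pdat=None):
    """Compose an Entrez query term in the style shown on the slide."""
    parts = [
        f"{organism}[ORGN]",
        f"{center}[CNTR]",
        f"{STATUS_KEYWORD[status]}[KYWD]",
    ]
    if pdat:
        parts.append(f"{pdat}[PDAT]")  # a single date or a start:end range
    return " AND ".join(parts)


def esearch_url(term, db="nuccore"):
    """URL-encode the term into an ESearch request (database name is an assumption)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = build_term("shotgun complete", pdat="2006/09/26:2006/10/03")
print(term)
print(esearch_url(term))
```

Keeping the status-to-keyword mapping in one dictionary means a status name from the tracking system can be turned into the matching GenBank query without retyping keywords.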
30 HTGS_IMPROVED submissions
Pick a clone name, any clone name:
DEFINITION Zea mays chromosome 4 clone CH201-11H16; ZMMBBc0011H16
Center project name: Z_AF-11H16
- Improved sequence is annotated on the submission record
- Where possible, contigs have been ordered and oriented based on read pairing, and these regions are designated as scaffolds
- Small contigs (<2 kb) that don’t represent a clone end, don’t contain improved sequence, or are not part of a scaffold are removed from the final submission
- Contigs are screened for bacterial contamination
35 Application of Mathematical Repeat Analysis
- Identifies novel repeats without dependence on curation
- Based on frequency of 20-mers in JGI WGS sequence
- Correlates with presence of retroelements
- Threshold can be modulated to optimize application
Figure: AC148169
Statistics are based on FgenesH exons from High Confidence Gene Models (aligning nearly end to end with a peptide in nraa), which are classified as TE or WH.
Apurva Narechania, Joshua Stein
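The idea of frequency-based ("mathematical") repeat detection can be illustrated with a small sketch: count k-mers in a read set, then flag positions of a target sequence covered by k-mers whose count meets a threshold. The project lists Vmatch as its tool for this; the plain in-memory counter below is a stand-in for illustration only, ignores reverse complements, and uses hypothetical function names and a hypothetical threshold.

```python
from collections import Counter


def count_kmers(reads, k=20):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def repeat_mask(seq, counts, k=20, threshold=10):
    """Return a per-base boolean mask: True where a high-frequency k-mer overlaps."""
    seq = seq.upper()
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= threshold:
            for j in range(i, i + k):
                mask[j] = True
    return mask


# Toy example with k=4 so the behaviour is easy to follow.
reads = ["ACGTACGTACGT"] * 3
counts = count_kmers(reads, k=4)
mask = repeat_mask("TTTTACGTGGGG", counts, k=4, threshold=9)
print("".join("R" if m else "." for m in mask))
```

Raising or lowering `threshold` corresponds to the slide's point about modulating the threshold: a lower value masks more sequence, a higher value restricts masking to the most abundant k-mers.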
36 Retroelement Annotation
- Collaboration with Jeff Bennetzen and Philip SanMiguel
- Classify retroelement families
- Current list covers ~68% of genome
- Ten most prevalent account for ~80% of retroelement sequences: ji, huck, opie, zeon, cinful, prem1, grande, xilon, gyma, giepum
- Goal is to visualize the history of transpositions
Retro classes are color coded. LTRs were identified and are shown in a separate track (for clarity the track is compacted and labels are turned off). Numbers below each glyph show the start/end position of the alignment relative to the coordinates of the retro sequence, expressed as a percent of length. So if the retro sequence were 1000 nt long, then 0-15 would represent positions 1 to 150 of that sequence and 45-63 would represent positions 450 to 630. This allows us to reconstruct the giepum element as once being a full-length, intact element into which a full-length ji element and opie elements have inserted.
Statistics on genome coverage are based on the pilot 100 BACs.
Figure: giepum element interrupted by ji and opie in AC148166
Joshua Stein
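The percent-of-length labels can be converted back to element coordinates with simple arithmetic. The sketch below follows the slide's 1000 nt example; the function name is illustrative, and the 45-63 label is inferred from the slide's "position 450 to 630" rather than quoted from it.

```python
def pct_range_to_positions(length, start_pct, end_pct):
    """Convert a percent-of-length label to 1-based positions on the element.

    A start of 0% is reported as position 1, matching the slide's example.
    """
    start = max(1, length * start_pct // 100)
    end = length * end_pct // 100
    return start, end


print(pct_range_to_positions(1000, 0, 15))   # label 0-15 on a 1000 nt element
print(pct_range_to_positions(1000, 45, 63))  # label 45-63 on the same element
```

Reading the labels of adjacent fragments in order (for example one fragment ending near the percentage where the next begins) is what allows an interrupted element to be recognized as a single ancestral insertion.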
37 Whole Genome Alignments
Wobble Aware Bulk Aligner (WABA)*
Figure labels: TIGR Transcripts; Rice; WABA alignments; Maize
Distinguishes between:
- low-similarity regions (grey)
- high-similarity regions (medium blue)
- high-similarity regions w/ wobble-base mismatch of coding regions (green)
*Kent, WJ & Zahler, A.M. (2000). Genome Res. 10:
Joshua Stein
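The "wobble" signal refers to mismatches concentrated at the third codon position, which is typical of conserved coding sequence. WABA itself models this with a hidden Markov model. The sketch below is a much simpler illustration of the signal: it tallies mismatches by codon position for two gap-free, in-frame aligned sequences. The function names and example sequences are hypothetical.

```python
def mismatches_by_codon_position(a, b):
    """Count mismatches at codon positions 1, 2 and 3 of an in-frame, gap-free alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i, (x, y) in enumerate(zip(a.upper(), b.upper())):
        if x != y:
            counts[i % 3] += 1
    return counts


def wobble_fraction(a, b):
    """Fraction of all mismatches that fall on the third codon position."""
    counts = mismatches_by_codon_position(a, b)
    total = sum(counts)
    return counts[2] / total if total else 0.0


print(mismatches_by_codon_position("GCTGAACTG", "GCCGAGCTA"))
print(wobble_fraction("GCTGAACTG", "GCCGAGCTA"))
```

A high wobble fraction in an otherwise similar region is the pattern the green category on the slide is meant to highlight.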
38 Whole Genome Alignments
BLASTZ* with AXTCHAIN** & CHAINNET**
- Sensitive gapped BLAST algorithm designed for aligning long sequences
- Accommodates long gaps & overlapping gaps, inversions, translocations, & duplications
*Schwartz, S et al. (2003). Genome Res. 13:103-7
**Kent, WJ, et al. (2003). PNAS 100:
Example of BLASTZ(net) display in Ensembl. Cg is chicken, Cf is dog, and the reference genome is human. I chose these because the primates were so close to human that essentially you got a solid pink line across the screen (not very interesting).
39 Sequenced BAC, FPC Contig, Virtual Bin, Core Bin Marker, Chromosome, Synteny Views
- Main navigation bar is accessible from every page
- Contains multiple entry points to the genome
40 MapView
Displays statistics by chromosome and provides entry points based on a single chromosome.
41 CytoView
- Provides detailed information on features anchored to the FPC map.
- The side bar highlights the location on the chromosome and provides page-specific functionality, including data export.
- The detailed view is customizable; tracks can be added or removed by users.
- Features have drop-down menus that contain general information as well as internal and external links.
42 ContigView
- This view is based on BAC coordinates and displays annotation levels II and III.
- The header contains the clone name in the physical map, the GenBank accession, and chromosome and FPC contig information.
- The detailed view offers semantic zooming, is customizable, and provides links to other views and information resources.
44 Upcoming Features
- Release: October 2006
- BlastView: December 2006
- BAC annotation
  - Level II: January, 2007
  - Level III annotation: April, 2007
  - WG alignments: June, 2007
- BioMart: January, 2007
- NSF collaborations
  - TwinScan annotations: March, 2007
  - Maize Optical Map: July, 2007
  - Full-length cDNAs: December, 2007
- Notification System: January, 2007
  - Users are notified when a region of interest is updated
  - Users are notified when markers are aligned to a specific sequence
45 Hardware Environments
Software
- Developed locally
- Managed with source control
- Frequent releases to staging environment
- Quarterly production releases
Data
- Timed analysis on staging environment
- Mirrored weekly on production
The two maize machines (cannon and ascutney) are Dell PowerEdge 2850s (2U rackmount). They have:
- 8 GB of memory
- 1.5 TB of redundant disk space
- Dual 3.8 GHz Intel 64-bit processors
The machines are backed up using Atempo's Time Navigator backup software. The backups are performed over a gigabit switched network and the data is written to SDLT tape housed in an Overland Storage tape library. Full backups are performed every four weeks and incrementals three times per week.
42 IBM blade servers, each containing:
- 2 dual-core AMD 2.0 GHz 64-bit CPUs (84 CPUs, 168 cores total)
- 4 GB memory
- 73 GB SCSI disk drive
Doreen has 3 BC, each with 14 blades. Each blade contains two 2.0 GHz dual-core Opteron 64s, with 4 x 1 GB memory and a single 73 GB SFF SCSI disk. These are integrated into the HPCC. Overall there are 33+3 BCs in the HPCC, all quad GigE connected to a central Cisco 6509 core. There are 2 management nodes and 2 storage nodes. The storage nodes are 32-bay Aberdeen units filled with 500 GB SATA II drives in RAID6 configurations. How best to use the space, how any backups will be done, etc. is yet to be determined; it depends on user requests and usage, but overall there will be ~20 TB available.
Shiran Pasternak, Apurva Narechania
46 Quality Assurance
- Unit-testing framework
- Binary assertions
- Failure report and automatic notification
- Software quality control: e.g., code retrieves correct data from the database
- Data quality control: e.g., clone in GenBank record exists in FPC map
Shiran Pasternak
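A data quality check of the kind named on this slide (a clone in a GenBank record exists in the FPC map) could be expressed as a binary assertion in a unit test. The sketch below uses Python's standard `unittest` module with in-memory stand-ins. The clone name comes from the HTGS_IMPROVED slide, while the contig label, data structures and helper function are hypothetical; the slides do not say which framework or language the project used.

```python
import unittest

# Hypothetical stand-ins for the FPC map and the GenBank submissions.
FPC_MAP = {"ZMMBBc0011H16": "ctg_example"}   # clone name -> FPC contig (label is made up)
GENBANK_CLONES = ["ZMMBBc0011H16"]           # clone names parsed from GenBank records


def clones_missing_from_fpc(genbank_clones, fpc_map):
    """Return the GenBank clone names that have no entry in the FPC map."""
    return [clone for clone in genbank_clones if clone not in fpc_map]


class DataQualityTest(unittest.TestCase):
    def test_genbank_clones_exist_in_fpc_map(self):
        # Binary assertion: the check either passes or reports the offending clones.
        self.assertEqual(clones_missing_from_fpc(GENBANK_CLONES, FPC_MAP), [])
```

Saved as a module, this can be run with `python -m unittest`. A failure lists the missing clone names, which is the kind of report that could feed the automatic notification mentioned above.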
47 Project Management: Mantis Bug Tracker
- Manage tasks using priorities, severities, and resource allocations
- Automated submission of issues using feedback form
- Generation of progress reports
48 Project Management: Wiki
- Enhances group communication
- Meeting notes, flowcharts, specification documents
- Maintains history of specifications and design decisions
- Seamless editing
49 Collaborations
- MaizeGDB (Iowa State University, University of Missouri): C. Lawrence
- Maize Optical Map (University of Wisconsin): D. Schwartz
- Maize Transposon Annotation (University of Georgia, Purdue): J. Bennetzen, P. San Miguel
- Ensembl (EBI): E. Birney
- Vmatch for Mathematical Repeats (University of Hamburg): S. Kurtz
- Maize Full Length cDNA project (Arizona Genomics Institute): Y. Yu
- TwinScan (Danforth Plant Science Center): B. Barbazuk