Novel multi-platform next generation assembly methods for mammalian genomes


The Baylor College of Medicine, the Australian Government and the University of Connecticut teamed up (under the KanGO consortium) as part of an international effort to generate a de novo genome sequence for a marsupial model species, the tammar wallaby (M. eugenii). This important model organism belongs to the metatherian mammals and harbors unique life-history traits and genome features. Genome sequencing was done at several institutions, using several technologies, over several years. We developed a pipeline to integrate long- and short-read sequences and existing assemblies using well-known mapping and assembly tools such as Bowtie and Phrap.

Overview of the Data

| Technology | Institution | Read Length | # Reads | # Bases |
|---|---|---|---|---|
| Sanger | BCM | Avg. 915 bp | 9,924,136 | 9,088,748,[...] |
| 454 | AGRF, UCONN | Avg. 160 bp | 1,530,592 | 275,951,386 |
| Illumina | AGRF | 100 bp | 271,875,064 | 27,187,506,400 |
| SOLiD | BCM | 25 bp | 710,427,490 | 18,471,114,740 |

All reads used in the reassembly were "paired". Read orientation was not considered.

The SOLiD reads had an insert size of [...] kb. The Illumina reads had insert sizes of 3 and 8 kb in roughly equal proportions. The 454 reads had insert sizes of 8, 12, 20, and 30 kb, with the majority split between 20 and 30 kb.

Figure: "paired" read overview (read, insert, read).

Local Assembly Pipeline

Pipeline stages: Initial Assembly → Map Short Reads → Scaffold Assembly → Final Mapping → Gap Re-estimation → Quality Calling → Finished Assembly.

Initial Assembly: Sanger reads were assembled using Atlas (BCM). Initial scaffolding was done using SOLiD mate pairs.

Map Short Reads: Short reads are mapped to the scaffolds using Bowtie. There are two cases: orphaned, when one mate is unmapped, and complete, when both mates are mapped. Reads that map to multiple locations (non-unique) are not considered.

Figure key: 454, Illumina, Sanger, unmapped read; dotted line = estimated distance.
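The pair classification described for the mapping step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `classify_pair` and the representation of alignments as lists of `(contig, position)` tuples are assumptions made for the example, and parsing of real Bowtie output is left out.

```python
def classify_pair(hits_1, hits_2):
    """Classify a read pair from the alignments of its two mates.

    hits_1, hits_2: lists of (contig, position) alignments for each mate
    (a hypothetical representation chosen for this sketch).
    """
    # Any mate with more than one location is non-unique: drop the pair.
    if len(hits_1) > 1 or len(hits_2) > 1:
        return "discarded"
    # Both mates uniquely mapped.
    if len(hits_1) == 1 and len(hits_2) == 1:
        return "complete"
    # Exactly one mate mapped, the other unmapped.
    if len(hits_1) + len(hits_2) == 1:
        return "orphaned"
    # Neither mate mapped.
    return "unmapped"


pairs = {
    "p1": ([("ctg1", 100)], [("ctg2", 40)]),
    "p2": ([("ctg1", 500)], []),
    "p3": ([("ctg1", 10), ("ctg3", 77)], [("ctg2", 5)]),
}
labels = {name: classify_pair(a, b) for name, (a, b) in pairs.items()}
```

Only the "complete" and "orphaned" pairs would be passed on to the later stages; treating a pair as discarded when either mate is non-unique is one reading of the rule above.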

Pipeline continued

Scaffold Assembly: Feed the contigs and the complete and orphaned pairs to Phrap and re-assemble.

Final Mapping: Map all data to the final contigs.

Gap Re-estimation: Re-estimate all gap distances in the scaffolds using complete pairs mapped to different contigs.

Quality Calling: Map all reads and calculate a quality for each base.

Output: contigs, quality scores, scaffolds (AGP file).

Local Assembly Close-up

Screenshot taken from CodonCode Aligner, which uses Phrap to map reads. Red and blue denote orientation.

A: Two contigs may be fused using short reads.

B: A contig will be extended with short reads at one end.

C: Single-nucleotide and other small errors are corrected using the short reads' higher coverage.

We are confident in these changes because, even at < 10x coverage, each read is either uniquely mapped itself or has its approximate position supported by a uniquely mapped read.

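Assembly Comparison and Validation

Validation against RIKEN BACs (cell values not preserved in this transcript). Columns: Total Reads, Total Bases, Recovered Reads, Recovered Bases. Rows: 1.1 (original): BAC, FOSMID; (updated): BAC, FOSMID.

Assembly statistics (cell values not preserved in this transcript). Columns: Category, Meug 1.0, Meug 1.1, Meug 1.2. Rows: Contigs (10^6), N50 (10^3), Bases (10^6), scaffolds, Gaps (10^6). The only surviving cells are in the Gaps row: NA539614 (column boundaries lost).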
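For readers unfamiliar with the N50 statistic used in the comparison, here is a minimal sketch of the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of all assembled bases. The function name `n50` and the example lengths are illustrative and are not taken from the wallaby assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        # First contig at which the cumulative length reaches half the total.
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


example = n50([2, 3, 4, 5, 6])  # total 20; 6 + 5 = 11 >= 10, so N50 is 5
```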

Gap Estimation Methods

The gap between two contigs is estimated using an expectation-maximization (EM) algorithm.

Step 1 (Maximization): Compute the gap estimate x, letting the mean insert length of the N pairs equal μ (the initial value is the library average).

Step 2 (Sampling): Given x and the lengths of the contigs, sample μ from completely mapped reads spanning gaps.

The two steps are repeated until the estimated parameters no longer change.
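The alternation between the gap estimate x and the effective mean μ can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: insert sizes are assumed to be normally distributed, and the sampling step is replaced by a closed-form truncated-normal mean (only inserts long enough to bridge the gap, and short enough to fit on the two contigs, can yield a spanning pair). All names, including `estimate_gap`, `min_span` and `observed`, are hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_mean(mu, sigma, lo, hi):
    """Mean of a Normal(mu, sigma) restricted to the interval [lo, hi]."""
    a = (lo - mu) / sigma
    b = (hi - mu) / sigma
    mass = normal_cdf(b) - normal_cdf(a)
    if mass < 1e-12:
        # Assumption: fall back to the library mean if the interval is empty.
        return mu
    return mu + sigma * (normal_pdf(a) - normal_pdf(b)) / mass


def estimate_gap(observed, lib_mean, lib_sd, len_a, len_b,
                 min_span=0.0, tol=1e-6, max_iter=200):
    """Iteratively estimate the gap between two contigs.

    observed: for each spanning pair, the number of insert bases that lie
    on the two contigs (so insert = observed + gap).
    """
    obar = sum(observed) / len(observed)
    mu = lib_mean  # initial value is the library average
    x = None
    for _ in range(max_iter):
        # Step 1: gap estimate given the current effective mean.
        x_new = mu - obar
        if x is not None and abs(x_new - x) < tol:
            return x_new
        x = x_new
        # Step 2: effective mean of inserts able to span a gap of size x.
        mu = truncated_mean(lib_mean, lib_sd, x + min_span, x + len_a + len_b)
    return x
```

When the gap is small relative to the library mean, the truncation has almost no effect and the result is close to the naive estimate (library mean minus mean observed length). For gaps approaching the insert size, only the longer inserts can span, so the corrected estimate is larger than the naive one.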

Gap Estimation Results

Simulation study of EM algorithm accuracy (table of results by contig length and gap size; values not preserved in this transcript).

When using libraries with different insert sizes and standard deviations, it is necessary to bundle the estimates. The following is an example of how two libraries are bundled, where n_x = number of reads in library x, e_x = gap estimate from library x, and s_x = standard deviation of library x.
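The bundling formula itself did not survive in this transcript. As a stand-in, the sketch below combines per-library estimates by inverse-variance weighting, a common way to bundle estimates of differing precision. This is an assumption and may differ from the formula on the original slide. Each library's estimate is weighted by n_x / s_x^2, that is, by the inverse of its squared standard error. The function name `bundle_estimates` is hypothetical.

```python
import math


def bundle_estimates(libs):
    """Combine per-library gap estimates.

    libs: list of (n_reads, estimate, lib_sd) tuples, one per library.
    Returns (bundled_estimate, bundled_sd).
    Assumed formula: e = sum(n_x * e_x / s_x^2) / sum(n_x / s_x^2).
    """
    weights = [n / (s * s) for n, _, s in libs]
    total = sum(weights)
    estimate = sum(w * e for w, (_, e, _) in zip(weights, libs)) / total
    return estimate, 1.0 / math.sqrt(total)


# Two illustrative libraries: 100 pairs at sd 300 and 400 pairs at sd 600
# carry equal weight, so the bundled estimate is their midpoint.
est, sd = bundle_estimates([(100, 1000.0, 300.0), (400, 1200.0, 600.0)])
```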

Conclusion

This method is a viable way of improving existing draft genomes with short-read technologies at limited (< 10x depth) coverage. It is robust and easily parallelized, so it is practical for large mammalian genomes.

Future Work

Better results may be obtained through multiple iterations. Re-scaffolding of the contigs should be done between iterations. A contig-aware assembly algorithm could improve local assembly performance.