Accurate Assembly of Maize BACs
Patrick S. Schnable, Srinivas Aluru
Iowa State University

Motivation
Maize genome is more complex than previously sequenced genomes
– Many high-copy, long, highly conserved repeats
– Genome contains many NIPs (Nearly Identical Paralogs: low-copy genes that are expressed and >98% identical; Emrich et al., 2007) (= CNPs and CNV)
Hence, assembling this genome presents new challenges
Are existing assembly programs up to the task?

Evidence of Assembly Errors
Wash U noticed examples of collapse of repeats
ISU identified examples of NIP collapse

Terms
[Figure labels: AC, AT, GC; B73, Mo17]
SNP: single nucleotide polymorphism between alleles of a single gene
Paramorphism (PM): a single nucleotide substitution between paralogs
Nearly Identical Paralogs (NIPs): paralogous sequences with >98% identity

Paramorphisms Provide Evidence of NIPs

Frequency of NIPs
Conservatively, ~1% of maize genes have NIPs (Emrich et al., 2007)
Inspection of assembled BACs reveals NIP clusters
But we also detect examples of NIP collapse
CNPs/CNV are associated with adaptive evolution in humans (Perry et al., Nat. Genetics, 2007)

BAC Assembly, Example 1
MAGI3.1 ID: MAGI_18749 (Emrich et al., 2007)
BAC ID: CH C17
Paramorphic Sites: C/T (1,175), C/T (1,293), C/T (1,359)
CH C17: gi| |gb|AC (152,054 bp)
GenBank 56,57255, bp

BAC Assembly, Example 1 - Site #1
BAC ID: CH C17 GI: GB: AC ,054 bp
MAGI_18749 Paramorphic Site #1: C/T (1,175)
2 C vs 2 T
[Figure labels: Consensus Base, Paramorphic Site #1]
2/7 assembled BACs known to contain NIPs exhibit evidence of NIP collapse (conservative)

Traditional Assembly
Sequence alignments between reads are identified
Construct contigs
– Start at a good alignment
– Extend ends of contig one sequence at a time
Clone pair information is used to scaffold contigs after contig construction.
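The traditional procedure above can be illustrated with a minimal sketch. It assumes error-free reads on a single strand and uses exact suffix/prefix matches in place of real sequence alignment; the function names `overlap` and `greedy_contig` and the `min_len` threshold are hypothetical and are not taken from any particular assembler.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_contig(reads: list[str], min_len: int = 3) -> str:
    """Seed a contig at the best overlap, then extend its ends one read at a time."""
    if len(reads) < 2:
        return reads[0] if reads else ""
    # Identify "alignments" between all ordered pairs of reads (toy stand-in for real alignment).
    scores = {(i, j): overlap(reads[i], reads[j], min_len)
              for i, j in permutations(range(len(reads)), 2)}
    # Start at a good alignment: the pair with the longest overlap.
    (i, j), best = max(scores.items(), key=lambda kv: kv[1])
    if best == 0:
        return reads[0]
    contig = reads[i] + reads[j][best:]
    unused = set(range(len(reads))) - {i, j}
    # Extend either end of the contig with the best-overlapping unused read.
    while unused:
        candidates = []
        for k in unused:
            candidates.append((overlap(contig, reads[k], min_len), k, "right"))
            candidates.append((overlap(reads[k], contig, min_len), k, "left"))
        n, k, side = max(candidates)
        if n == 0:
            break  # no remaining read overlaps either end
        contig = contig + reads[k][n:] if side == "right" else reads[k] + contig[n:]
        unused.discard(k)
    return contig


print(greedy_contig(["ACGATTAC", "TTACAATA", "AATAGGTT"]))  # ACGATTACAATAGGTT
```

Because the sketch commits to the locally best overlap and never consults clone pair information, two nearly identical copies of a sequence would be merged into one, which is the kind of collapse the slides describe.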

Our Approach
Integrate clone pair data into contig assembly process
Model sequence alignments & clone pairs as a graph.
First, construct an alignment graph
Sequence reads are nodes
A black edge is drawn between a pair of nodes if there is a valid sequence alignment

Clone Pair Informed Assembly
Second, introduce two additional types of edges into the graph
Clone pair edges (red)
Path edges (green)
A path edge exists between two nodes if: they are close together in the graph AND their clone pairs are also close together
Identifies assembly-relevant sequence alignments
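One way to read the path-edge rule is sketched below. The slides do not define "close together", so this sketch assumes it means "within `limit` hops along black (alignment) edges"; read orientation and insert-size constraints are ignored. The names `hops`, `path_edges`, `black`, and `mate` are hypothetical.

```python
from collections import deque


def hops(black: dict, src: str, limit: int) -> dict:
    """Breadth-first search over black edges; returns {node: distance} within `limit` hops of src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue  # do not expand beyond the hop limit
        for v in black.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_edges(black: dict, mate: dict, limit: int = 2) -> set:
    """Green edges: pairs of reads that are close in the graph AND whose mates are also close."""
    green = set()
    for u in black:
        if u not in mate:
            continue
        near_u = hops(black, u, limit)
        near_mate = hops(black, mate[u], limit)
        for v in near_u:
            if v != u and v in mate and mate[v] in near_mate:
                green.add(frozenset((u, v)))
    return green


# Black edges (symmetric adjacency). a1-x1 stands for a spurious, repeat-induced alignment.
black = {"a1": {"a2", "x1"}, "a2": {"a1"}, "x1": {"a1"},
         "b1": {"b2"}, "b2": {"b1"}, "y1": set()}
# Red edges: each read maps to the read from the other end of the same clone.
mate = {"a1": "b1", "b1": "a1", "a2": "b2", "b2": "a2", "x1": "y1", "y1": "x1"}

green = path_edges(black, mate, limit=1)
print(sorted(sorted(edge) for edge in green))  # [['a1', 'a2'], ['b1', 'b2']]
```

In this toy graph the a1-a2 alignment earns a green edge because the mates b1 and b2 also align, while a1-x1 does not because the mate of x1 lies elsewhere.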

Repeat Example

Our Approach
Series of graph transformations to ensure black edges (sequence alignments) represent correct genomic overlaps, and resolve entries into and exits out of repeats.
– Use clone pairs to validate alignments in repeat regions if the corresponding mate pairs are anchored to unique regions and exhibit alignment.
– Use paramorphisms to break spurious alignments due to NIPs.
– Use clone pairs to match entries into and exits out of repeats.
– Use clone pairs and validated alignments to guide contigs.
– Use graph min-cuts to find correct assignment of reads to the complementary strands.
– Use graph reductions and visualization for further analysis.

Example: Use Paramorphisms to Break Spurious Alignments
GTCT A CAG
GTCT C CAG
GTCT A CAG
GTCT C CAG
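The four reads in this example can be used for a small sketch of the idea. It assumes the reads are already aligned to common coordinates, and it treats a column as paramorphic when at least two different bases are each supported by `min_support` reads (a guess at how one might separate paramorphisms from isolated sequencing errors, in the spirit of the "2 C vs 2 T" site shown earlier). The function names and the threshold are hypothetical.

```python
from collections import Counter


def paramorphic_columns(aligned: list[str], min_support: int = 2) -> list[int]:
    """Columns where two or more distinct bases are each seen in at least `min_support` reads."""
    columns = []
    for pos in range(len(aligned[0])):
        counts = Counter(read[pos] for read in aligned)
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            columns.append(pos)
    return columns


def break_spurious(aligned: list[str], edges: set, min_support: int = 2) -> set:
    """Keep only alignment edges whose two reads agree at every paramorphic column."""
    columns = paramorphic_columns(aligned, min_support)
    return {(i, j) for i, j in edges
            if all(aligned[i][p] == aligned[j][p] for p in columns)}


reads = ["GTCTACAG", "GTCTCCAG", "GTCTACAG", "GTCTCCAG"]
all_pairs = {(i, j) for i in range(len(reads)) for j in range(i + 1, len(reads))}

print(paramorphic_columns(reads))                 # [4]
print(sorted(break_spurious(reads, all_pairs)))   # [(0, 2), (1, 3)]
```

The surviving edges split the reads into an "A" group and a "C" group, so the two paralogous copies would be assembled separately rather than collapsed.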

Three Random Stage 3 BACs
Shotgun sequences extracted from GenBank and trimmed
Table columns: Name, Reads, Post Trim, Corrupt Quality Info
273D N H

273D22
Annotate paths via walking through the graph. Make use of three levels of pointers:
– Black edges: show what steps are available
– Green edges: indicate the best path
– Red edges: indicate our final destination

273D22: Incorrect Contiging
[Figure labels: Contig 0, Contig 1]
Contig 1 is a small contig in the finished BAC that contains sequences that should be attached to the end of Contig 0.

273D22: Missing Scaffold

306N19: Mis-assembly
[Figure labels: Contig 3, Contig 5, Contig 0, Contig 4, Contig 3]

306N19: Complex Repeat

D396H10: Missed Scaffolding
[Figure labels: Contig 6, Contig 8, Contig 5]

D396H10: Missed Scaffolding
[Figure labels: Contig 7, Contig 2, Contig 3]

Identifying Assembly Errors ???

273D22: Weak Link not Corroborated by Clone Pairs
[Figure label: Contig 3]

Conclusions & Future Directions
Discovered misassembled regions in all three randomly chosen BACs
– Conclusions supported by multiple lines of evidence (clone pair + overlap)
– Mis-assemblies (e.g., repeat-induced knots; collapsed repeats & NIPs) and missed scaffolding
Benefits of our approach
– Can provide better assemblies
  Can navigate through repeats
  Can correctly assemble NIPs
– With development could output contigs and perform scaffolding in one step
– Could provide refined finishing advice
– Could include a community-accessible visualization of assembled BAC contigs and supporting data (confidence levels)
Longer term
– Our assembly approach could be applied to whole genome assembly of maize and other complex genomes
– Could incorporate paired next generation sequencing data (e.g. 454, Solexa, SOLiD)
Needed research
– Random collection of finished BACs (truth)
– Develop algorithms for navigating paths through the graph
– Accurately construct final contigs that contain multiple copies of repeats
– Create BAC re-assembly pipeline (inform finishing efforts in future sequencing projects)
– Scale approach to whole genome level