Problems of Genome Assembly
James Yorke and Aleksey Zimin
University of Maryland, College Park



Whole-genome shotgun (WGS) sequencing
- Multiple copies of the DNA
- Fragments of [...],000 bases
- No information is retained on which part of the DNA the fragments came from.

WGS sequencing: fragments
- The sequencing machine reads bases on the ends of the fragments, producing pairs of reads.
- The fragment sizes are known up to ±10-20%.
Figure: a fragment with one read of the pair at each end (CAAGCTGAT... and ...GTTTGGAAC) and unknown sequence between them.

The mathematical problem
- We start with millions of pairs of reads, [...] bases each.
- Multiple copies of DNA provide multiple coverage by reads.
- The problem of genome assembly is to recover the original sequence of bases of the genome (as much as possible).

Assembling a jigsaw puzzle 1
- The task of assembly becomes the task of assembling a giant jigsaw puzzle.
- We look for reads whose sequences suggest that they came from the same place in the genome:

    AGTGATTAGATGATAGTAGA
            ||||||||||||
            GATGATAGTAGAGGATAGATTTA

Assembling a jigsaw puzzle 2
- Then we put "overlapping" reads together. The first three rows are reads; the last row is the resulting "contig":

    AGTGATTAGATGATAGTAGA
           AGATGATAGTAGAGATAGATAGACC
                         ATAGATAGACCACTCATCATAC
    AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC
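The overlap-and-merge step on the two jigsaw slides can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: reads are error-free, they are given in left-to-right order, and all share one orientation. The function names and the `min_overlap` parameter are hypothetical and are not taken from any assembler.

```python
def overlap_length(left, right, min_overlap=5):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(reads, min_overlap=5):
    """Merge reads, already sorted left to right, into a single contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError("read does not overlap the contig built so far")
        contig += read[size:]  # append only the part that extends the contig
    return contig


# Reads copied from the slides.
slide_reads = [
    "AGTGATTAGATGATAGTAGA",
    "AGATGATAGTAGAGATAGATAGACC",
    "ATAGATAGACCACTCATCATAC",
]
contig = merge_in_order(slide_reads)
print(contig)
```

Real assemblers must also find the order of the reads, consider reverse complements, and tolerate sequencing errors, which is where most of the difficulty lies.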

Assembling a jigsaw puzzle 3
- We use read pairing information to order and orient contigs to produce scaffolds, the final product of assembly.
Figure: contigs linked by pairs of reads belonging to the same fragment of DNA.

Difficulties in assembly
- Sequencing errors: two reads that came from the same place in the genome often have mismatching sequences:

    AGTGATTAGATCATAGTAGAG
             || |||||||||
             ATGATAGTAGAGGATAGAT

- Repetitive DNA (~5-20% of human DNA is repetitive):

    TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
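One common way to cope with sequencing errors is to accept overlaps that contain a small number of mismatches. The sketch below illustrates that idea on the two reads from the slide. The function name and the `min_overlap` and `max_mismatches` thresholds are assumptions made for this example, not settings of any particular assembler.

```python
def overlap_with_mismatches(left, right, min_overlap=8, max_mismatches=1):
    """Longest suffix/prefix overlap that tolerates a few substitutions.

    Returns (overlap_length, mismatch_count), or (0, 0) if nothing qualifies.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-size:], right[:size]))
        if mismatches <= max_mismatches:
            return size, mismatches
    return 0, 0


read_a = "AGTGATTAGATCATAGTAGAG"
read_b = "ATGATAGTAGAGGATAGAT"

# With one substitution allowed, the 12-base overlap from the slide is found.
print(overlap_with_mismatches(read_a, read_b, max_mismatches=1))
# With exact matching only, the same pair of reads shows no usable overlap.
print(overlap_with_mismatches(read_a, read_b, max_mismatches=0))
```

Loosening the mismatch threshold recovers true overlaps but also makes it easier to join reads from different copies of a repeat, so the two difficulties on this slide pull in opposite directions.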

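Repeat regions may cause omissions
Figure: a genome region A R B R C, where R is a repeated sequence, is assembled as A R C. The two copies of R are merged and the sequence B between them is omitted.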

Erroneous duplications
- Two recently published assemblies of the cow genome: UMD2 and BosTau4.
- Segmental duplications were a central theme of the BosTau4 genome paper.
- The UMD2 assembly had many fewer duplications.
- We examined the duplications with > 99.5% identity and length > 5000 bp that have one copy in the UMD2 assembly and two copies in BosTau4.
- Each base in the genome is covered by 6 reads, on average.
- A way to judge which assembly is correct is to compute the average read coverage for these regions.

Examining read coverage reveals errors
Figure: read coverage of the examined regions. The thick solid vertical line is placed at the coverage at which a region is as likely to have two copies as it is to have one.
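To make the "as likely to have two copies as one" threshold concrete, here is a small worked sketch. It assumes, purely for illustration, that the number of reads covering a base follows a Poisson distribution with mean 6 for a single-copy region and mean 12 for a region that is really two copies collapsed into one, and that both hypotheses are equally likely a priori. The slides do not state which model was actually used.

```python
import math


def poisson_log_pmf(k, mean):
    """Log of the Poisson probability of observing coverage k (k may be real)."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def crossover_coverage(single_copy_mean):
    """Coverage at which one-copy and two-copy Poisson likelihoods are equal.

    Setting k*ln(m) - m equal to k*ln(2m) - 2m and solving gives k = m / ln 2.
    """
    double_copy_mean = 2 * single_copy_mean
    return (double_copy_mean - single_copy_mean) / math.log(
        double_copy_mean / single_copy_mean
    )


threshold = crossover_coverage(6.0)
print(round(threshold, 2))  # about 8.66 reads per base
```

Under these assumptions, a region whose average coverage is well above roughly 8.7 is more plausibly two copies in the genome, and a region well below it is more plausibly a single copy.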

Next Gen vs. Sanger Sequencing
Sanger sequencing for a mammalian (~3 Gbp) genome:
- Expensive: $50M for a mammalian genome
- Large amount of DNA required
- We get [...] bp reads, all with mate pairs
Illumina and 454 sequencing for the same genome:
- Inexpensive: as low as $25K (Illumina) or $1M (454) for a mammalian genome
- Small amount of DNA required (e.g. one insect)
- Only 100 or 400 bp reads, some with mate pairs
Assembly is a much harder problem now.

Difficulties in de novo assembly of Illumina and 454 data
- Reads are short, so high coverage is needed, which places demanding requirements on the software and computer hardware.
- Error patterns in the reads:
  - substitution errors in Illumina reads
  - homopolymer errors (unable to tell AAAA from AAA) in 454 reads
- Biased coverage by Illumina reads depending on the GC content
- Unreliable mate pairs (figure: a reported mate pair and what it could actually be)
- Assembly techniques have a much larger impact now.

NGS Assemblers
- New assemblers developed for different kinds of NGS data:
  - Newbler for 454 data
  - SOAPdenovo, Velvet, ABySS, ALLPATHS, and others for Illumina data
- We use the open source Celera Assembler (CA), currently supported by the J. Craig Venter Institute bioinformatics team.
- CA is capable of assembling mixed data sets.

Assembly quality varies significantly with the software used
Example 1: Argentine ant assembly comparison.
Both assemblies used the same 75 bp Illumina reads, unmated and in 3 kb and 8 kb mate pairs.

|                      | SOAPdenovo | CA 5.4     | Improvement |
| Sequence in assembly | 137 Mbp    | 171 Mbp    | 25%         |
| N50 Scaffold size    | 139 bp     | 386,149 bp | 3000 times  |
| N50 Contig size      | 139 bp     | 3,367 bp   | 24 times    |

Assembly quality varies significantly with the software used
Example 2: Pogonomyrmex barbatus, the Red Harvester Ant, assembly comparison (454 data).
Both assemblies used the same 454 data in 3 kb mate pairs, 8 kb mate pairs and shotgun reads.

|                      | Newbler | CA 5.3   | Improvement |
| Sequence in assembly | 194 Mbp | 220 Mbp  | 13%         |
| N50 Scaffold size    | 47 Kbp  | 794 Kbp  | 17 times    |
| N50 Contig size      | 2 Kbp   | 12 Kbp   | 6 times     |
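The comparisons above rely on the N50 statistic. As a reminder of what it measures, the sketch below uses the usual definition: the largest length L such that contigs (or scaffolds) of length at least L contain at least half of all assembled bases. The function name and the toy lengths are made up for illustration and are not data from the slides.

```python
def n50(lengths):
    """N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembled bases
            return length


# Toy example: 54 bases in total, and 10 + 9 + 8 = 27 reaches half of them.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Because N50 rewards long pieces regardless of whether they are correct, it is best read together with checks such as the read-coverage analysis shown earlier.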

Benefits of combining 454 and Illumina data
Example 3: Argentine ant assembly comparison. Illumina data and 454 data were assembled with Celera Assembler 5.4.
- 45x Illumina coverage, 15x 454 coverage
- Unmated reads, 3 kb and 8 kb mate pairs
Table: the columns are Illumina only, 454 only, and Illumina and 454; the rows are Sequence in assembly (Mbp), N50 Scaffold size (Kbp), and N50 Contig size (Kbp). Of the values, only the fragment ",459" in the N50 Scaffold size row survives.

Post-assembly steps
- Assemblers output scaffolds: ordered and oriented collections of contigs.
- Scaffolds typically are much smaller than chromosomes and may contain large-scale errors.
- Some mate pair linking information remains unused by assemblers.
- Marker maps, i.e. collections of short sequences whose positions on the chromosomes are known, can be used to position the contigs on the chromosomes.

UMD Chromosome builder
- Uses contigs, mate pairs and markers, discarding unreliable scaffold information.
- Mapping steps:
  - Use mate pairs to orient contigs.
  - Use markers and mate pairs to assign oriented contigs to the chromosomes.
  - Compute the position of each contig on the chromosome as the best least-squares fit to the available mate pair and marker data.
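The least-squares placement step can be illustrated with a toy model. The sketch below assumes each marker says "contig i starts near position p" and each mate pair link says "contig j starts about d bases after contig i", then minimizes the sum of squared disagreements by solving the normal equations. The constraint format, the function names and the numbers are hypothetical. This is not the UMD Chromosome builder code, and it needs at least one marker so that the system has a unique solution.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def place_contigs(n_contigs, markers, mate_links):
    """Least-squares contig positions from markers and mate pair links.

    markers:    list of (i, p)    meaning x[i] should be close to p
    mate_links: list of (i, j, d) meaning x[j] - x[i] should be close to d
    """
    ata = [[0.0] * n_contigs for _ in range(n_contigs)]  # normal matrix A^T A
    atb = [0.0] * n_contigs                              # right-hand side A^T b
    for i, p in markers:
        ata[i][i] += 1.0
        atb[i] += p
    for i, j, d in mate_links:
        ata[i][i] += 1.0
        ata[j][j] += 1.0
        ata[i][j] -= 1.0
        ata[j][i] -= 1.0
        atb[i] -= d
        atb[j] += d
    return solve_linear(ata, atb)


# Two markers and two links that disagree by 300 bases in total.
positions = place_contigs(
    3,
    markers=[(0, 0.0), (2, 3300.0)],
    mate_links=[(0, 1, 1000.0), (1, 2, 2000.0)],
)
print([round(p, 1) for p in positions])  # [75.0, 1150.0, 3225.0]
```

In this toy case the 300-base disagreement is spread evenly, 75 bases per constraint, which is the behaviour one expects from an unweighted least-squares fit.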

Computing contig orientations
An orientation problem.
Figure: three contigs A, B and C.


Computing contig orientations
- An orientation problem in matrix form: M is a 3 x 3 matrix whose rows and columns are indexed by the contigs A, B and C (the entries are not preserved in this transcript).
- Compute y, the eigenvector corresponding to the largest eigenvalue of M. The signs of the components of y give a recipe for flipping the contigs to achieve consistent orientations.

Computing contig orientations
- The eigenvector of M corresponding to the largest (Frobenius-Perron) eigenvalue, 2, is y = (0.5774, [...], [...]).
- sign(y) = (1, 1, -1); that is, the solution is to flip contig C.
- Final matrix of orientations = diag(sign(y)) * M * diag(sign(y)) (the resulting entries are not preserved in this transcript).
- Flipping C is the correct solution!
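The eigenvector recipe can be tried on a small example. The matrix entries did not survive in this transcript, so the matrix below is an assumed reconstruction: zero diagonal, +1 where two contigs are linked in consistent orientations, and -1 where the links imply opposite orientations. It agrees with the eigenvalue 2 and with sign(y) = (1, 1, -1) quoted on the slide, but it remains a guess. Power iteration is used here only to keep the sketch dependency-free.

```python
import math


def dominant_eigenvector(matrix, iterations=200):
    """Unit eigenvector for the algebraically largest eigenvalue.

    Power iteration is run on matrix + n*I. For entries in [-1, 1] with a zero
    diagonal this shift makes every eigenvalue positive, so the largest
    eigenvalue of the original matrix is also the dominant one.
    """
    n = len(matrix)
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        nxt = [
            sum(matrix[i][j] * vec[j] for j in range(n)) + n * vec[i]
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    if vec[0] < 0:  # fix the overall sign so the first contig is never flipped
        vec = [-x for x in vec]
    return vec


# Assumed orientation matrix for contigs A, B, C (not taken from the slide).
M = [
    [0, 1, -1],
    [1, 0, -1],
    [-1, -1, 0],
]

y = dominant_eigenvector(M)
signs = [1 if x >= 0 else -1 for x in y]
# diag(signs) * M * diag(signs): orientation links after applying the flips.
flipped = [[signs[i] * M[i][j] * signs[j] for j in range(3)] for i in range(3)]

print([round(x, 4) for x in y])
print(signs)    # [1, 1, -1]: flip contig C
print(flipped)  # every off-diagonal entry is now +1
```

With this assumed matrix, all links become consistent after flipping C, which matches the conclusion on the slide. For larger problems with conflicting links, the signs of the eigenvector give a heuristic rather than a guaranteed optimum.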

Conclusions
- Genome assembly is a difficult problem that has gotten harder because of Next Gen Sequencing data.
- Assembly techniques have a large impact on the quality of the assembly.
- The output of the assembler is not the final assembly; extensive post-processing is required to produce chromosome sequences.