A New Approach to Fragment Assembly in DNA Sequencing. Fei Wu, April 24, 2006.



2 Preface
Introduce the author
The background of the paper
The history of DNA Sequencing

3 Traditional DNA Sequencing
Shake DNA
Shear DNA into millions of small fragments
Read 500–700 nucleotides at a time from the small fragments (Sanger method)

4 Fragment Assembly
Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring").
Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem.

5 Shortest Superstring Problem
Problem: Given a set of strings, find a shortest string that contains all of them
Input: Strings s1, s2, ..., sn
Output: A string s that contains all strings s1, s2, ..., sn as substrings, such that the length of s is minimized
Complexity: NP-complete
Note: this formulation does not take into account sequencing errors
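
The problem statement above can be made concrete with a small exhaustive search. This is only a sketch for a handful of strings, not the method of the paper; the function names `overlap` and `shortest_superstring` are hypothetical, and the running time grows factorially with the number of strings, in line with the problem's hardness.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Try every ordering and merge neighbours with maximal overlap."""
    # Remove duplicates and strings already contained in another string.
    unique = list(dict.fromkeys(strings))
    unique = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not unique:
        return ""
    best = None
    for order in permutations(unique):
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


reads = ["ACC", "CCT", "CTG"]
print(shortest_superstring(reads))
```

Like the slide's formulation, the sketch assumes error-free reads.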

6 Reducing SSP to TSP
Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.
Example: aaaggcatcaaatctaaaggcatcaaa (its 12-letter prefix aaaggcatcaaa is also its suffix)
Construct a graph with n vertices representing the n strings s1, s2, ..., sn.
Insert edges of length overlap(si, sj) between vertices si and sj.
Find the shortest path which visits every vertex exactly once. This is the Traveling Salesman Problem (TSP), which is also NP-complete.
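
The overlap graph described above can be sketched in a few lines. The names `overlap`, `overlap_graph`, and `best_path` are hypothetical, and the example reads are made up for illustration. The sketch maximizes the total overlap along a path through all vertices, which minimizes the superstring length; reading this as a shortest-path problem assumes the edge weights are negated.

```python
from itertools import permutations


def overlap(si, sj):
    """Longest proper prefix of sj that matches a suffix of si."""
    for k in range(min(len(si), len(sj)) - 1, 0, -1):
        if si[-k:] == sj[:k]:
            return k
    return 0


def overlap_graph(strings):
    """Edge (i, j) carries overlap(strings[i], strings[j]) for i != j."""
    n = len(strings)
    return {(i, j): overlap(strings[i], strings[j])
            for i in range(n) for j in range(n) if i != j}


def best_path(strings):
    """Brute-force the vertex order with maximum total overlap."""
    graph = overlap_graph(strings)
    best_order, best_total = None, -1
    for order in permutations(range(len(strings))):
        total = sum(graph[(a, b)] for a, b in zip(order, order[1:]))
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total


reads = ["ATGC", "GCAT", "CATT"]
print(overlap_graph(reads))
print(best_path(reads))
```

The brute-force search visits every ordering of the vertices, so it is only usable on toy inputs.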

7 de Bruijn graph
Properties
If n = 1, then the condition for any two vertices forming an edge holds vacuously, and hence all the vertices are connected, forming a total of m^2 edges.
Each vertex has exactly m incoming and m outgoing edges.
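
Both properties can be checked on small cases. The sketch below assumes, as the wording of the slide suggests, that m is the alphabet size and n is the length of each vertex label, and that an edge u -> v exists when the last n - 1 symbols of u equal the first n - 1 symbols of v. The function name `de_bruijn_graph` is hypothetical.

```python
from collections import Counter
from itertools import product


def de_bruijn_graph(alphabet, n):
    """All length-n strings as vertices; u -> v if u[1:] == v[:-1]."""
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    edges = [(u, v) for u in vertices for v in vertices if u[1:] == v[:-1]]
    return vertices, edges


# n = 1: the edge condition compares two empty strings, so it always holds.
vertices, edges = de_bruijn_graph("ACGT", 1)
print(len(vertices), len(edges))

# n = 2: every vertex still has m = 4 incoming and 4 outgoing edges.
vertices, edges = de_bruijn_graph("ACGT", 2)
out_degree = Counter(u for u, _ in edges)
in_degree = Counter(v for _, v in edges)
print(set(out_degree.values()), set(in_degree.values()))
```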

8 Sequencing by Hybridization

9 l-mer (l-tuple) composition
Spectrum(s, l): the unordered multiset of all (n – l + 1) l-mers in a string s of length n
The order of individual elements in Spectrum(s, l) does not matter
For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
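
A minimal sketch of the spectrum as a multiset, using Python's `collections.Counter`. The function name `spectrum` is hypothetical. Because a `Counter` ignores insertion order, the three listings above compare equal.

```python
from collections import Counter


def spectrum(s, l):
    """Multiset of the (n - l + 1) l-mers of s, with n = len(s)."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


spec = spectrum("TATGGTGC", 3)
print(sorted(spec.elements()))

# Different listing orders describe the same multiset.
a = Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
b = Counter(["ATG", "GGT", "GTG", "TAT", "TGC", "TGG"])
c = Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"])
print(spec == a == b == c)
```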

10 SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S
(Figure: graph on the vertices AT, GT, CG, CA, GC, TG, GG.)
The path visits every EDGE exactly once
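
The construction above can be sketched with Hierholzer's algorithm, a standard way to find an Eulerian path; whether the slides use this particular algorithm is not stated. The function names are hypothetical. The input is assumed to be the eight 3-mers of ATGGCGTGCA, which includes TGG: without that 3-mer the vertex GG has no incoming edge and no Eulerian path exists.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Return a list of (l-1)-mer vertices that uses every l-mer edge once."""
    graph = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((x for x in nodes if out_deg[x] - in_deg[x] == 1),
                 kmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("no Eulerian path uses every l-mer")
    return path


def spell(path):
    """Concatenate the path into the reconstructed sequence."""
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(spell(eulerian_path(S)))
```

The sketch returns one valid reconstruction; which one depends on the order in which edges are stored.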

11 S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } corresponds to two different paths:
ATGGCGTGCA
ATGCGTGGCA
(Figure: the two Eulerian paths drawn on the vertices AT, TG, GC, GG, GT, CG, CA.)
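
The ambiguity shown on this slide can be reproduced by enumerating every sequence whose 3-mer composition equals the spectrum. This is a backtracking sketch for toy inputs; the function name `reconstructions` is hypothetical, and the input is assumed to be the eight 3-mers shared by the two sequences on the slide.

```python
from collections import Counter


def reconstructions(kmers):
    """All sequences that use each l-mer in kmers exactly once."""
    k = len(kmers[0])
    remaining = Counter(kmers)
    total = len(kmers)
    results = set()

    def extend(seq, used):
        if used == total:
            results.add(seq)
            return
        suffix = seq[-(k - 1):]
        for base in "ACGT":
            kmer = suffix + base
            if remaining[kmer] > 0:
                remaining[kmer] -= 1
                extend(seq + base, used + 1)
                remaining[kmer] += 1

    for start in set(kmers):
        remaining[start] -= 1
        extend(start, 1)
        remaining[start] += 1
    return sorted(results)


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(reconstructions(S))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both results start at AT and end at CA; they differ only in which branch is taken first at the vertex TG.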

12 Error Correction or Data Corruption
The EULER algorithm sometimes introduces errors.
It does so as a side effect of reducing the complexity of the de Bruijn graph.
Reduction of the de Bruijn graph eliminates false edges.
For example, in the N. meningitidis sequencing project, orphan elimination corrects errors and introduces 1452 errors.

13 Observations of the EULER

14 Conclusions
Finishing is a bottleneck in large-scale DNA sequencing.
EULER has excellent scaling potential.
The complexity of EULER is mainly determined by the number of tangles rather than by the number of repeats or the length of the genomes.

RESULTS AND DISCUSSION
The general performance of SEA on the benchmark
Prediction ambiguity improves alignment quality
Alignment quality versus local structure prediction ambiguity

CONCLUSION

Any Questions?
