DNA Fragment Assembly CIS 667 Spring 2004 February 18.


Objectives
- The problem: DNA Fragment Assembly
  - The ideal case
  - The complications
- Models
  - The Shortest Common Superstring
  - Reconstruction
  - Multicontig
- A greedy algorithm
- Heuristics

The Problem
Assumption: we know the length of the target sequence approximately.
The problem: given a set of fragments from a DNA molecule, we want to deduce the whole sequence of the DNA.
- We determine only one of the strands of the original molecule.

The ideal case
Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT
2. Total length: about 10 bp
Output:
    _ _ A C C G T _ _
    _ _ _ _ C G T G C
    T T A C _ _ _ _ _
    _ T A C C G T _ _
    -----------------
    T T A C C G T G C   (consensus by majority of votes)
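The majority vote on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `majority_consensus` and the use of `_` as the padding symbol are assumptions made for the sketch, and ties in a column are broken arbitrarily (by first occurrence).

```python
from collections import Counter

def majority_consensus(layout, pad="_"):
    """Column-wise majority vote over a padded layout (all rows have equal length)."""
    width = len(layout[0])
    consensus = []
    for col in range(width):
        # Count only real bases; padding does not vote.
        votes = Counter(row[col] for row in layout if row[col] != pad)
        consensus.append(votes.most_common(1)[0][0] if votes else pad)
    return "".join(consensus)

# Layout of the ideal case: ACCGT, CGTGC, TTAC, TACCGT.
ideal = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TACCGT__"]
print(majority_consensus(ideal))

# Same layout with the substitution error TGCCGT in the last fragment.
with_error = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TGCCGT__"]
print(majority_consensus(with_error))
```

In the second layout the erroneous G is outvoted by two A's in its column, which is why the substitution does not change the consensus.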

Complications
1. Real problem instances are very large
2. Errors
   - substitutions
   - insertions
   - deletions
   - chimeras
3. Unknown orientation of the fragments
4. Repeated regions cause ambiguity in sequencing
5. Lack of coverage causes gaps

Errors: Substitution
Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TGCCGT (substitution)
2. Total length: about 10 bp
Output:
    _ _ A C C G T _ _
    _ _ _ _ C G T G C
    T T A C _ _ _ _ _
    _ T G C C G T _ _
    -----------------
    T T A C C G T G C   (consensus by majority of votes)

Errors: Insertion
Input:
1. The set of fragments: ACCGT, CAGTGC (insertion), TTAC, TACCGT
2. Total length: about 10 bp
Output:
    _ _ A C C * G T _ _
    _ _ _ _ C A G T G C
    T T A C _ * _ _ _ _
    _ T A C C * G T _ _
    -------------------
    T T A C C * G T G C   (consensus by majority of votes)

Errors: Chimeras
Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
2. Total length: about 10 bp
Output:
    _ _ A C C G T _ _
    _ _ _ _ C G T G C
    T T A C _ _ _ _ _
    _ T A C C G T _ _
    -----------------
    T T A C C G T G C   (consensus)

    T T A _ _ _ T G C   (chimera)
A chimera arises when two regular fragments from distinct parts of the target molecule join end to end.
Remedy: recognize them before use!

Repeated Regions
- Unknown orientation with no errors
- Unknown orientation with errors
- Repeated regions cause ambiguity:
    PXQXRXS
    PXRXQXS

Direct repeat
More complex are inverted repeats: repeated regions on opposite strands.
    PXQYRXSY
    PXSYRXQY

Lack of coverage causes formation of gaps
- Compute the mean coverage: add up the lengths of all the fragments and divide by the target length.
- Insufficient coverage is remedied by sampling more fragments.
How many fragments do I need? Assume:
- all fragments have the same length l
- t is the safe overlap: two fragments count as linked only if they overlap by at least t bases
- n is the number of fragments
- T is the target length
Apparent contigs: p = n e^{-n(l-t)/T}
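The two quantities on this slide are easy to evaluate numerically. The sketch below is a direct transcription of the mean-coverage rule and of the formula p = n e^{-n(l-t)/T}; the function names and the sample numbers are assumptions chosen only for illustration.

```python
import math

def mean_coverage(fragment_lengths, target_length):
    """Sum of fragment lengths divided by the target length."""
    return sum(fragment_lengths) / target_length

def expected_apparent_contigs(n, l, t, T):
    """p = n * exp(-n (l - t) / T) for n fragments of length l, safe overlap t, target length T."""
    return n * math.exp(-n * (l - t) / T)

# The four fragments of the ideal case on a target of about 10 bp.
print(mean_coverage([5, 5, 4, 6], 10))

# Hypothetical project: target of 10,000 bp, fragments of 500 bp, safe overlap of 50 bp.
for n in (20, 50, 100, 200):
    print(n, round(expected_apparent_contigs(n, 500, 50, 10_000), 2))
```

With few fragments almost every fragment is its own contig; as n grows, the exponential term takes over and the expected number of apparent contigs falls.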

Shortest Common Superstring
Input: a collection F of strings
Output: a shortest possible string S such that, for every f ∈ F, S is a superstring of f.
Example: F = {ATG, TGC, GCC}, S = ATGCC
Question: is it the shortest?
Observe: u = ATG and v = GCC overlap in G, and TGC is a substring of the result.
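For a collection this small, the question "is it the shortest?" can be answered by brute force. The sketch below tries every ordering of the fragments and merges neighbours on their longest exact overlap; it is exponential in the number of fragments and is meant only as an illustration. The helper names `overlap` and `shortest_common_superstring` are assumptions of this sketch.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def shortest_common_superstring(fragments):
    """Exhaustive search over all orderings (only feasible for a handful of fragments)."""
    # Fragments contained in another fragment never lengthen the superstring.
    frags = sorted(f for f in set(fragments)
                   if not any(f != g and f in g for g in fragments))
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            s += nxt[overlap(prev, nxt):]   # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ATG", "TGC", "GCC"]))
```

For the slide's example the search confirms that no common superstring is shorter than ATGCC.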

Shortest Common Superstring
Is it a good problem?
Advantages:
- The problem finds the PERFECT superstring
- Good for most ideal cases
Disadvantages:
- The problem does not deal with errors
- It is good only in some ideal cases: even with no errors and known orientation, it fails in the presence of repeats. Repeated identical copies get absorbed in the search for the SHORTEST superstring, which produces an assembly of uneven coverage
- It does not consider lack of coverage or the size of the target
- It is NP-hard

Reconstruction
Objective: we want to consider errors and unknown orientation.
Substring edit distance: d_s(a, b) = min_{s ∈ S(b)} d(a, s), where S(b) is the set of substrings of b and d is the edit distance.
- one unit is charged for each insertion, deletion, or substitution
- no charge for deletions at the extremities of the second sequence
Example: u = CGATGT, v = AACTAATGTGC
    _ _ C G A * T G T _ _
    A A C T A A T G T G C
d_s(u, v) = 2
A string f is an approximate substring of S at error level ε (between 0 and 1) when d_s(f, S) ≤ ε |f|.
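The substring edit distance can be computed with the usual edit-distance table, changed so that skipping characters at either end of the second sequence is free. The following is a minimal sketch under that reading of the definition; the function names are assumptions of this example.

```python
def substring_edit_distance(a, b):
    """Minimum edit distance between a and any substring of b."""
    # Row 0: the empty prefix of a matches anywhere in b at no cost (free start in b).
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        # Column 0: a[:i] against the empty string costs i deletions.
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,   # match or substitution
                          prev[j] + 1,          # character of a left unmatched
                          curr[j - 1] + 1)      # extra character of b inside the match
        prev = curr
    # Free end in b: take the best value anywhere in the last row.
    return min(prev)

def is_approximate_substring(f, S, eps):
    """True when d_s(f, S) <= eps * |f|."""
    return substring_edit_distance(f, S) <= eps * len(f)

print(substring_edit_distance("CGATGT", "AACTAATGTGC"))
```

On the slide's example the function returns 2, matching the alignment shown (one substitution and one space).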

Reconstruction
Input: a collection F of strings and an error tolerance ε with 0 ≤ ε ≤ 1
Output: a shortest possible string S such that, for every f ∈ F, min(d_s(f, S), d_s(f̄, S)) ≤ ε |f|, where f̄ is the reverse complement of f
Advantage:
- It takes into account errors and unknown orientation
Disadvantages:
- It is an NP-hard problem
- It does not model repeats
- It does not consider lack of coverage or the size of the target
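The output condition can be checked mechanically for a candidate string S: each fragment, or its reverse complement, must be an approximate substring of S. The sketch below only verifies a candidate; it does not search for the shortest one. The names `reverse_complement` and `satisfies_reconstruction` are assumptions of this example, and the distance routine is a compact version of the substring edit distance defined above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    """Complement every base, then reverse the string."""
    return f.translate(COMPLEMENT)[::-1]

def substring_edit_distance(a, b):
    """Minimum edit distance between a and any substring of b."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)

def satisfies_reconstruction(S, F, eps):
    """Check min(d_s(f, S), d_s(rc(f), S)) <= eps * |f| for every fragment f."""
    return all(
        min(substring_edit_distance(f, S),
            substring_edit_distance(reverse_complement(f), S)) <= eps * len(f)
        for f in F
    )

# GCACG is the reverse complement of CGTGC, so it fits the consensus in reverse orientation.
print(satisfies_reconstruction("TTACCGTGC", ["ACCGT", "GCACG", "TTAC"], 0.0))
```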

Multicontig
Objective: we want to consider internal linkage.
No special assumptions except:
- for known orientation, a fragment and its reverse complement are not both present in the collection.
We want to have good linkage (overlap between fragments):
- An overlap is a link if it is not (properly) contained in a bigger fragment
- The smallest size of a link in a layout is called the weakest link
- A layout is a t-contig if its weakest link is at least of size t
- We partition F into the minimum number of collections that each admit a t-contig

Multicontig
Idea: let's partition F into the minimum number of t-contigs!
Example: F = {GTAG, TAATG, TGTAA}
- for t = 3: F1 = {TAATG, TGTAA} and F2 = {GTAG}
- for t = 2 we have two solutions:
  1. F1 = {TAATG, TGTAA} and F2 = {GTAG}
  2. F1 = {TAATG, TGTAA} and F2 = {GTAG}
  (the partition is the same; F1 admits two layouts, TGTAA before TAATG with link TAA, or TAATG before TGTAA with link TG)
- for t = 1 we have the desired solution (the minimum): F1 = {TAATG, TGTAA, GTAG}
For errors, we use the consensus of the multi-alignment and insist that the edit distance of the fragments be small.
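The example can be checked by computing the weakest link of a given layout. The sketch below makes two simplifying assumptions that are not part of the model: overlaps are exact (no errors), and every overlap between consecutive fragments in the layout counts as a link. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def weakest_link(layout):
    """Smallest overlap between consecutive fragments of an ordered layout."""
    links = [overlap(x, y) for x, y in zip(layout, layout[1:])]
    return min(links) if links else None   # a single fragment has no links

def is_t_contig(layout, t):
    """A layout is a t-contig if its weakest link has size at least t."""
    w = weakest_link(layout)
    return w is None or w >= t

print(weakest_link(["TGTAA", "TAATG"]))          # link TAA
print(weakest_link(["TAATG", "TGTAA"]))          # link TG
print(weakest_link(["TGTAA", "TAATG", "GTAG"]))  # links TAA and G
```

The three calls correspond to the cases t = 3, t = 2 and t = 1 of the example: only with t = 1 do all three fragments fit in a single contig.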

Multicontig
Input: a collection F of strings, an integer t ≥ 0, and an error tolerance ε with 0 ≤ ε ≤ 1
Output: a partition of F into the minimum number of subcollections C_i, 1 ≤ i ≤ k, such that every C_i admits a t-contig with an ε-consensus
Advantages:
- It takes into account errors and unknown orientation
- It takes into account the internal linkage of the fragments: the answer is formed by several contigs
Disadvantages:
- It is an NP-hard problem even in the simplest case of no errors and known orientation: it contains as a special case finding a Hamiltonian path in a restricted class of graphs
- It has no provision for using information on the approximate size of the target

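Overlap Multigraph
t-overlap between a and b: suffix(a, t) = prefix(b, t). Equivalently, removing the last t characters of a and then appending b gives the same string as appending to a the string b without its first t characters; or, a without its first |a| - t characters equals b without its last |b| - t characters.
[Figure: overlap multigraph on the fragments CTAAAG, TACGG, GGACAG, GCCC, with edges labeled by overlap length (2, 1, ...)]
Idea: why don't we search for a path in the overlap multigraph that covers all the vertices?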
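Building the multigraph amounts to testing the condition suffix(a, t) = prefix(b, t) for every ordered pair of fragments and every t. A minimal sketch, assuming exact overlaps and assuming that t ranges from 0 up to one less than the shorter fragment's length (the function name and the edge representation are choices made for this example):

```python
def overlap_multigraph(fragments):
    """Return one edge (a, b, t) for every t such that suffix(a, t) == prefix(b, t)."""
    edges = []
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i == j:
                continue
            for t in range(min(len(a), len(b))):
                # t = 0 always holds: the empty suffix equals the empty prefix.
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

fragments = ["CTAAAG", "TACGG", "GGACAG", "GCCC"]
for a, b, t in overlap_multigraph(fragments):
    if t > 0:
        print(f"{a} -> {b}: {t}")
```

A pair can contribute several edges, which is what makes this a multigraph: TACGG and GGACAG overlap in both G and GG.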

Theoretical results
Theorem: the total length of A (the set of fragments) is ||A|| = w(P) + |S(P)|, where
- ||A|| = Σ_{a ∈ A} |a|
- w(P) is the weight of the path P
- |S(P)| is the length of the superstring derived from P
To convince yourself, read the proof in the book.
Other theoretical results: looking for the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.
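The identity ||A|| = w(P) + |S(P)| can be checked on a small path. The sketch below assumes that each edge of the path uses the longest exact overlap between consecutive fragments and that A is the list of fragments on the path; the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def path_weight_and_superstring(path):
    """Return (w(P), S(P)) for a path given as an ordered list of fragments."""
    weight, superstring = 0, path[0]
    for prev, nxt in zip(path, path[1:]):
        k = overlap(prev, nxt)
        weight += k
        superstring += nxt[k:]   # each overlap saves exactly k characters
    return weight, superstring

A = ["ATG", "TGC", "GCC"]
w, S = path_weight_and_superstring(A)
print(sum(len(a) for a in A), "=", w, "+", len(S))
```

Since ||A|| is fixed, every unit of path weight is a character saved in the superstring, which is why maximizing w(P) minimizes |S(P)|.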

The Greedy Methodology
No polynomial-time algorithm is known for NP-hard problems, but we can look for approximate solutions in reasonable time.
To apply a greedy methodology:
1. The problem must show optimal substructure
   - A problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems
2. The optimal solution is reached by taking the best "local" choice

Overlap Graph
An overlap graph keeps, for each pair of fragments, only the edge of maximum weight.
[Figure: overlap graph on the fragments CTAAAG, TACGA, GACA, ACCC, with edges labeled by overlap length (2, 1, ...)]

The Greedy Algorithm
input: weighted directed graph OG(F) with n vertices
output: Hamiltonian path in OG(F)
    // Initialize
    for i ← 1 to n do
        in[i] ← 0        // how many selected edges enter i
        out[i] ← 0       // how many selected edges exit i
        MakeSet(i)
    // Process
    sort the edges by weight, heaviest first
    for each edge (f, g) in this order do
        // test for acceptance
        if in[g] = 0 and out[f] = 0 and FindSet(f) ≠ FindSet(g) then
            select (f, g)
            in[g] ← 1
            out[f] ← 1
            Union(FindSet(f), FindSet(g))
        if there is only one component then break
    return selected edges
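The pseudocode translates almost line by line into Python. The version below is a sketch, not a reference implementation: it builds the overlap graph from exact longest overlaps, uses a small array-based union-find in place of MakeSet/FindSet/Union, and breaks ties between edges of equal weight by generation order. All names are choices made for this example.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_hamiltonian_path(fragments):
    """Select edges heaviest first, keeping in-degree <= 1, out-degree <= 1 and no cycles."""
    n = len(fragments)
    parent = list(range(n))              # MakeSet(i) for every vertex

    def find(x):                         # FindSet with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_in, has_out = [False] * n, [False] * n
    edges = [(overlap(fragments[f], fragments[g]), f, g)
             for f in range(n) for g in range(n) if f != g]
    edges.sort(key=lambda e: -e[0])      # heaviest first; the sort is stable

    selected = []
    for w, f, g in edges:
        if not has_out[f] and not has_in[g] and find(f) != find(g):
            selected.append((fragments[f], fragments[g], w))
            has_out[f] = has_in[g] = True
            parent[find(f)] = find(g)    # Union
            if len(selected) == n - 1:   # only one component left
                break
    return selected

for edge in greedy_hamiltonian_path(["CTAAAG", "TACGA", "GACA", "ACCC"]):
    print(edge)
```

On the four fragments of the figure, the algorithm first takes the only edge of weight 2 (TACGA to GACA) and then completes the path with edges of weight 1.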

A graph where "greedy" fails
F = {GCAAAG, AGTA, TACGA}
[Figure: overlap graph on GCAAAG, TACGA, AGTA]
We order the edges by weight:
    (AGAT, GCAAAG) = 3
    (GCAAAG, AGTA) = 2
    (AGTA, TACGA) = 2
The algorithm first chooses (AGAT, GCAAAG) = 3 and is then forced to select an edge of weight 0 to complete the path. Instead, the solution should be
    (GCAAAG, AGTA) = 2
    (AGTA, TACGA) = 2
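The same failure pattern (one edge of weight 3 that blocks two edges of weight 2) can be reproduced with a fragment set whose overlaps are easy to check by hand. The set used below, GCC, ATGC, TGCAT, is an assumption chosen for this sketch and is not the slide's own example. The code compares the best path over all orderings with the best path that contains the heaviest edge, which is the edge a greedy selection commits to first.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def path_weight(path):
    """Sum of the overlaps between consecutive fragments."""
    return sum(overlap(x, y) for x, y in zip(path, path[1:]))

def contains_edge(path, a, b):
    """True if b immediately follows a somewhere in the path."""
    return any(x == a and y == b for x, y in zip(path, path[1:]))

fragments = ["GCC", "ATGC", "TGCAT"]           # illustrative set, not from the slide
paths = list(permutations(fragments))

best = max(paths, key=path_weight)
greedy_like = max((p for p in paths if contains_edge(p, "ATGC", "TGCAT")),
                  key=path_weight)

print("optimal:", best, path_weight(best))
print("with the heaviest edge:", greedy_like, path_weight(greedy_like))
```

Once the edge ATGC to TGCAT (weight 3) is selected, both edges of weight 2 are excluded and the path can only be completed with an edge of weight 0, for a total of 3 against an optimum of 4.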

Observations
Locally optimal decisions do not always work. Can we do any better? Use some heuristics.
Issues:
- Scoring
- Coverage
- Linkage

Heuristics
Scoring
- Uniformity is good, variability is bad.
- Compute the entropy of a column: the entropy is a measure of the chaos in a column. There are 5 possible characters: A, T, C, G, and space.
      E = -Σ_c p_c log p_c
  E = 0 if p_c = 1 for one character; E = log 5 if each p_c = 1/5.
- To measure uniformity, we want a low entropy per column.
Coverage
- Minimum, maximum, or mean coverage
- If the coverage reaches 0 for a column i, we do not have a connected layout
- If several columns have zero coverage, any permutation of the intervening regions (the contigs) is acceptable
- Coverage gives confidence in the consensus
Linkage
- High coverage with no links is not good. Overlap is required.
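The column entropy is straightforward to compute from the character frequencies. A minimal sketch, assuming natural logarithms and a column given as a string over A, T, C, G and space (the function name is hypothetical):

```python
import math
from collections import Counter

def column_entropy(column):
    """E = -sum_c p_c log p_c over the characters that occur in the column."""
    n = len(column)
    return sum(-(count / n) * math.log(count / n)
               for count in Counter(column).values())

print(column_entropy("AAAAA"))   # perfectly uniform column
print(column_entropy("AAAAT"))   # one dissenting base
print(column_entropy("ATCG "))   # all five characters equally likely
print(math.log(5))
```

Characters that do not occur contribute nothing to the sum, so the two extreme cases on the slide come out as 0 and log 5.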

More Observations
Locally optimal decisions do not always work. Can we do any better? Use some heuristics.
Assembly in practice consists of:
1. Finding overlaps
2. Building the layout
3. Computing the consensus
Advantages:
- We treat each problem separately.
Disadvantages:
- It becomes difficult to understand the relationship between the input and the final output.

Heuristics
Finding overlaps
- Use a dynamic programming approach with a scoring system such as:
  - 1 for matches
  - -1 for mismatches
  - -2 for spaces
- Do not charge for spaces after the first sequence or before the second one.
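This scoring scheme corresponds to a semiglobal alignment in which the end of the first sequence is aligned with the beginning of the second. The sketch below is one way to write that recurrence; the function name and the default parameter values (taken from the slide) are the only assumptions.

```python
def overlap_alignment_score(a, b, match=1, mismatch=-1, space=-2):
    """Best score for aligning a suffix of a with a prefix of b."""
    # Row 0: characters of b placed before the start of a are charged as spaces.
    prev = [space * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0 stays 0: leading characters of a (spaces before b) are free.
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diagonal, prev[j] + space, curr[j - 1] + space)
        prev = curr
    # Last row: trailing characters of b (spaces after a) are free.
    return max(prev)

print(overlap_alignment_score("ACCGT", "CGTGC"))
print(overlap_alignment_score("TTAC", "TACCGT"))
```

Both calls use fragments from the ideal case; in each, the best overlap is an exact match of three bases, so the score is 3.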

Heuristics
Ordering fragments
- There is no algorithm that is simple and general enough.
Considerations:
- Use the set DF = F ∪ F̄, where F̄ contains the reverse complements of the fragments in F.
- If f = uv and g = wx, then ḡ = x̄ w̄ and f̄ = v̄ ū.
- If the end of f is approximately the same as the beginning of g (v ≈ w), we can expect that whatever criterion is used to assess the similarity between f and g will also apply to their reverse complements: the end of ḡ is then approximately the same as the beginning of f̄.
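For exact overlaps the symmetry between a pair of fragments and their reverse complements can be checked directly: if f overlaps g, then the reverse complement of g overlaps the reverse complement of f by the same amount. A small sketch, with hypothetical function names:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    """Complement every base, then reverse the string."""
    return f.translate(COMPLEMENT)[::-1]

def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def doubled_fragment_set(F):
    """DF: every fragment together with its reverse complement."""
    return list(F) + [reverse_complement(f) for f in F]

f, g = "ACCGT", "CGTGC"
print(overlap(f, g), overlap(reverse_complement(g), reverse_complement(f)))
print(doubled_fragment_set([f, g]))
```

This is why both strands are built at the same time when ordering fragments over DF: every edge between f and g has a mirror edge between the reverse complements.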

Finding overlaps
- Finding a good ordering of the overlapping fragments means finding a directed path in the overlap graph
- Both strands are constructed simultaneously
- Contained fragments are not essential in the path
- A disconnected graph indicates lack of coverage
- The presence of cycles indicates repeats
- Unusually high coverage indicates possible repeats
- The presence of reverse-complement cycles indicates inverted repeats

Alignment and Consensus
Use the minimal sum of the distances. Suppose we have f → g → h:
    f = CATAGTC
    g = TAACTAT
    h = AGACTATCC
Two semiglobal alignments for f and g are:
    First:    C A T A G T C _ _ _
              _ _ T A A _ C T A T
    Second:   C A T A G T C _ _ _
              _ _ T A _ A C T A T
Layout with the second alignment:
    f   C A T A G T C _ _ _ _ _
    g   _ _ T A _ A C T A T _ _
    h   _ _ _ A G A C T A T C C
    S   C A T A G A C T A T C C
- If we use the second alignment: d_s(f, S) = 1, d_s(g, S) = 1, d_s(h, S) = 0, and d_s(f, S) + d_s(g, S) + d_s(h, S) = 2.
- If we use the first alignment and A is chosen for column 6: d_s(f, S) = 1, d_s(g, S) = 2, d_s(h, S) = 0, and d_s(f, S) + d_s(g, S) + d_s(h, S) = 3.
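The two sums can be reproduced by counting, for each row of a layout, the columns in which the row disagrees with the consensus. This column count is a simplification of d_s that happens to give the same values here; it assumes the consensus has no space columns. In the sketch, `_` marks padding outside a fragment and `-` marks a space inside it; both symbols and the function names are choices made for this example.

```python
def row_distance(row, consensus, pad="_"):
    """Count columns, within the span of the fragment, where the row differs from the consensus."""
    start = len(row) - len(row.lstrip(pad))   # first column covered by the fragment
    end = len(row.rstrip(pad))                # one past the last covered column
    return sum(1 for r, c in zip(row[start:end], consensus[start:end]) if r != c)

def sum_of_distances(layout, consensus):
    return sum(row_distance(row, consensus) for row in layout)

S = "CATAGACTATCC"
second = ["CATAGTC_____",    # f
          "__TA-ACTAT__",    # g, second alignment
          "___AGACTATCC"]    # h
first = ["CATAGTC_____",     # f
         "__TAA-CTAT__",     # g, first alignment
         "___AGACTATCC"]     # h

print(sum_of_distances(second, S))
print(sum_of_distances(first, S))
```

The second alignment is preferred because it gives the smaller sum, even though both alignments of f and g look equally good when the pair is considered in isolation.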

A Linked List of Bases
Sometimes we know what is best only later. Is there a structure that helps us?
- Use a linked list of bases
  - matched bases are unified in one node
  - unmatched bases are left separate
- Technique: traverse this graph in topological order.
[Figure: linked list of bases; node labels: G, T, C, A, T, A, C, T, A, T, A]

Conclusions
- The models fail to address all the issues involved in the problem
- The real problem is effectively NP-hard
- Approximation gives us some help but fails in some cases
- Heuristics help, and the problem is broken into 3 smaller problems:
  1. finding overlaps
  2. building the layout
  3. computing the consensus
Are we sure there is nothing else to do?
- Next week we will look at the smaller problem of comparing only two sequences instead of many. Will we find something better?