DNA Fragment Assembly CIS 667 Spring 2004 February 18.

1 DNA Fragment Assembly CIS 667 Spring 2004 February 18

2 Objectives The problem: DNA fragment assembly (the ideal case, the complications). Models: the Shortest Common Superstring, Reconstruction, Multicontig. A greedy algorithm. Heuristics.

3 The Problem Assumption: we know the approximate length of the target sequence. The problem: given a set of fragments from a DNA molecule, we want to deduce the whole sequence of the DNA. We determine only one of the strands of the original molecule.

4 The ideal case Input: 1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT 2. Total length: approximately 10 bp Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T A C C G T _ _
T T A C C G T G C (consensus by majority of votes)
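The majority vote over the columns of the layout can be sketched in Python. This is a minimal illustration, not code from the lecture: the function name `consensus` is made up here, `_` marks positions a fragment does not cover, and every column is assumed to be covered by at least one fragment.

```python
from collections import Counter

def consensus(layout):
    """Return the majority base of each column, ignoring '_' padding."""
    result = []
    for column in zip(*layout):
        votes = Counter(base for base in column if base != "_")
        # most_common(1) gives the base with the highest count in this column
        result.append(votes.most_common(1)[0][0])
    return "".join(result)

# Layout rows copied from the slide, one string per fragment.
layout = [
    "__ACCGT__",
    "____CGTGC",
    "TTAC_____",
    "_TACCGT__",
]
print(consensus(layout))  # TTACCGTGC
```

The same vote also absorbs an isolated substitution error, because the two correct fragments in that column outvote the wrong base.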

5 Complications 1. Real problem instances are very large. 2. Errors: substitutions, insertions, deletions, chimeras. 3. Unknown orientation of the fragments. 4. Repeated regions cause ambiguity in sequencing. 5. Lack of coverage causes gaps.

6 Errors: Substitution Input: 1. The set of fragments: ACCGT, CGTGC, TTAC, TGCCGT (substitution) 2. Total length: 10 bp Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T G C C G T _ _
T T A C C G T G C (consensus by majority of votes)

7 Errors: Insertion Input: 1. The set of fragments: ACCGT, CAGTGC (insertion), TTAC, TACCGT 2. Total length: 10 bp Output:
_ _ A C C * G T _ _
_ _ _ _ C A G T G C
T T A C _ * _ _ _ _
_ T A C C * G T _ _
T T A C C * G T G C (consensus by majority of votes)

8 Errors: Chimeras Input: 1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC 2. Total length: 10 bp Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T A C C G T _ _
T T A C C G T G C (consensus)
T T A _ _ _ T G C (chimera)
A chimera arises when two regular fragments from distinct parts of the target molecule join end to end. Remedy: recognize them before use!

9 Repeated Regions Unknown orientation with no errors. Unknown orientation with errors. Repeated regions cause ambiguity: with three copies of a repeat X, the targets PXQXRXS and PXRXQXS can yield the same set of fragments.

10 Direct repeat With two interleaved direct repeats X and Y, PXQYRXSY and PXSYRXQY are both possible targets. More complex are inverted repeats: repeated regions on opposite strands.

11 Lack of coverage causes formation of gaps. To compute the mean coverage, add up the lengths of all the fragments and divide by the target length. Insufficient coverage is remedied by sampling more fragments. How many fragments do I need? Assume: all fragments have the same length l; t is the safe overlap (fragments must overlap in at least t bases); n is the number of fragments; T is the target length. Apparent contigs: $p = n e^{-n(l-t)/T}$
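Both quantities on this slide are easy to evaluate numerically. The sketch below is illustrative only: the function names are made up, the fragment lengths come from the ideal-case example, and the values passed to `expected_contigs` are hypothetical numbers chosen to show the call, not data from the lecture.

```python
import math

def mean_coverage(fragment_lengths, target_length):
    """Total number of sequenced bases divided by the target length."""
    return sum(fragment_lengths) / target_length

def expected_contigs(n, l, t, T):
    """Apparent contigs p = n * exp(-n (l - t) / T) for n fragments of
    length l, safe overlap t, and target length T."""
    return n * math.exp(-n * (l - t) / T)

# Fragments ACCGT, CGTGC, TTAC, TACCGT over a target of about 10 bp.
print(mean_coverage([5, 5, 4, 6], 10))  # 2.0

# Hypothetical project: 100 fragments of 500 bp, 50 bp safe overlap, 10 kbp target.
print(expected_contigs(n=100, l=500, t=50, T=10_000))
```

Increasing n while keeping l, t, and T fixed shows how sampling more fragments first raises and then lowers the number of apparent contigs.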

12 Shortest Common Superstring Input: a collection F of strings. Output: a shortest possible string S such that, for every $f \in F$, S is a superstring of f. Example: F = {ATG, TGC, GCC}, S = ATGCC. Question: is it the shortest? Observe: u = ATG and v = GCC overlap in G, and TGC is a substring of S.
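For a collection this small, the question "is it the shortest?" can be checked by brute force. The sketch below tries every ordering of the fragments and merges neighbours on their longest suffix-prefix overlap. It assumes that no fragment is a substring of another, and the function names are illustrative; the factorial running time makes it useless beyond a handful of fragments.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_superstring(fragments):
    """Exhaustive search over all orderings (assumes no contained fragments)."""
    best = None
    for order in permutations(fragments):
        s = order[0]
        for nxt in order[1:]:
            s += nxt[overlap(s, nxt):]   # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_superstring(["ATG", "TGC", "GCC"]))  # ATGCC
```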

13 Shortest Common Superstring Is it a good model? Advantages: the problem asks for the perfect (shortest) superstring; it is good for ideal cases. Disadvantages: it does not deal with errors; it is good only in ideal cases with no errors and known orientation; it fails in the presence of repeats, because repeated identical copies get absorbed in the search for the shortest superstring, which produces an assembly of uneven coverage; it does not consider lack of coverage or the size of the target; it is NP-hard.

14 Reconstruction Objective: we want to consider errors and unknown orientation. Substring edit distance: $d_s(a,b) = \min_{s \in S(b)} d(a,s)$, where S(b) is the set of substrings of b and d is the edit distance. One unit is charged for each insertion, deletion, or substitution; deletions at the extremities of the second sequence are not charged. Example: u = CGATGT, v = AACTAATGTGC
_ _ C G A * T G T _ _
A A C T A A T G T G C
$d_s(u,v) = 2$. A string f is an approximate substring of S at error level $\varepsilon$ (between 0 and 1) when $d_s(f,S) \le \varepsilon |f|$.
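The substring edit distance can be computed with the usual edit-distance table, changed so that skipping a prefix or a suffix of the second string costs nothing. The following is a minimal sketch under that reading of the definition; the function names are illustrative.

```python
def substring_edit_distance(a, b):
    """Minimum edit distance between a and any substring of b."""
    prev = [0] * (len(b) + 1)            # row 0: skipping a prefix of b is free
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)         # a[:i] against an empty substring costs i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,   # match or substitution
                         prev[j] + 1,          # character of a left unmatched
                         cur[j - 1] + 1)       # character of b left unmatched
        prev = cur
    return min(prev)                     # skipping a suffix of b is free

def is_approximate_substring(f, s, eps):
    """True when d_s(f, s) <= eps * |f|."""
    return substring_edit_distance(f, s) <= eps * len(f)

print(substring_edit_distance("CGATGT", "AACTAATGTGC"))  # 2
```

With the slide's example, `is_approximate_substring("CGATGT", "AACTAATGTGC", 0.4)` holds because 2 is at most 0.4 times 6.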

15 Reconstruction Input: a collection F of strings and an error tolerance $\varepsilon$ with $0 \le \varepsilon \le 1$. Output: a shortest possible string S such that, for every $f \in F$, we have $\min(d_s(f,S), d_s(\bar{f},S)) \le \varepsilon |f|$, where $\bar{f}$ is the reverse complement of f. Advantage: it takes into account errors and unknown orientation. Disadvantages: it is an NP-hard problem; it does not model repeats; it does not consider lack of coverage or the size of the target.

16 Multicontig Objective: we want to consider internal linkage. No special assumptions except that, for known orientation, a fragment and its reverse complement are not both present in the collection. We want good linkage (overlap between fragments): an overlap is a link if it is not (properly) contained in a bigger fragment; the smallest link in a layout is called its weakest link; a layout is a t-contig if its weakest link has size at least t; we partition F into the minimum number of collections that each admit a t-contig.

17 Multicontig Idea: let's partition F into the minimum number of t-contigs! Example: F = {GTAG, TAATG, TGTAA}. For t = 3: F1 = {TAATG, TGTAA} and F2 = {GTAG}. For t = 2 we have two solutions, which share the partition F1 = {TAATG, TGTAA}, F2 = {GTAG} and differ in the layout of F1: 1. TGTAA followed by TAATG (overlap TAA); 2. TAATG followed by TGTAA (overlap TG). For t = 1 we have the desired solution (the minimum): F1 = {TAATG, TGTAA, GTAG}. For errors, we use the consensus of the multi-alignment and insist that the edit distance between each fragment and the consensus be small.

18 Multicontig Input: a collection F of strings, an integer $t \ge 0$, and an error tolerance $\varepsilon$ with $0 \le \varepsilon \le 1$. Output: a partition of F into the minimum number of subcollections $C_i$, $1 \le i \le k$, such that every $C_i$ admits a t-contig with an $\varepsilon$-consensus. Advantages: it takes into account errors and unknown orientation; it takes into account the internal linkage of the fragments; the answer is formed by several contigs. Disadvantages: it is an NP-hard problem even in the simplest case of no errors and known orientation (it contains as a special case finding a Hamiltonian path in a restricted class of graphs); it has no provision to use information on the approximate size of the target.

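19 Overlap Multigraph t-overlap: a and b have a t-overlap when suffix(a, t) = prefix(b, t); equivalently, a followed by b without its first t characters equals a without its last t characters followed by b, or a without its first |a| - t characters equals b without its last |b| - t characters. Idea: why don't we search for a path in the overlap multigraph that covers all the vertices? [Figure: overlap multigraph on the fragments CTAAAG, TACGG, GGACAG, GCCC; edge weights 2 and 1 shown]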
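The t-overlap test translates directly into string slicing. The sketch below lists every proper overlap between ordered pairs of fragments, so a pair may contribute several edges, which is what makes the structure a multigraph. The function names are illustrative and zero-weight edges are left out for brevity.

```python
def t_overlaps(a, b):
    """All t >= 1 (shorter than both strings) with suffix(a, t) == prefix(b, t)."""
    return [t for t in range(1, min(len(a), len(b))) if a[-t:] == b[:t]]

def overlap_multigraph(fragments):
    """Edges (a, b, t): one edge for every t-overlap from a to b."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a != b:
                edges.extend((a, b, t) for t in t_overlaps(a, b))
    return edges

fragments = ["CTAAAG", "TACGG", "GGACAG", "GCCC"]
for a, b, t in overlap_multigraph(fragments):
    print(f"{a} -> {b} (weight {t})")
```

TACGG and GGACAG, for instance, are joined by two edges, one of weight 1 (G) and one of weight 2 (GG).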

27 Theoretical results Theorem: the total length of A (the set of fragments) is $\|A\| = w(P) + |S(P)|$, where $\|A\| = \sum_{a \in A} |a|$, w(P) is the weight of the Hamiltonian path P, and |S(P)| is the length of the superstring derived from P. To convince yourself, read the proof in the book. Other theoretical results: looking for the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.
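The identity can be checked on the earlier example F = {ATG, TGC, GCC}. The sketch below is only a numerical check on one path, not a proof; it walks the path, adds up the overlaps used, and builds the corresponding superstring. The function names are illustrative.

```python
def overlap(a, b):
    """Longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def path_weight_and_superstring(path):
    """Return (w(P), S(P)) for a path given as an ordered list of fragments."""
    weight, s = 0, path[0]
    for prev, nxt in zip(path, path[1:]):
        k = overlap(prev, nxt)
        weight += k
        s += nxt[k:]          # each overlap removes k characters from the total
    return weight, s

path = ["ATG", "TGC", "GCC"]
w, s = path_weight_and_superstring(path)
total = sum(len(a) for a in path)
print(total, w, len(s))       # 9 4 5, and 9 = 4 + 5
```

Since the total length is fixed by the input, maximizing w(P) and minimizing |S(P)| are the same task.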

28 The Greedy Methodology No polynomial-time algorithm is known for NP-hard problems, but we can look for approximate solutions in reasonable time. To apply a greedy methodology: 1. the problem must show optimal substructure (a problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems); 2. the optimal solution must be reachable by taking the best "local" choice.

29 Overlap Graph An overlap graph has only the edges of maximum weight. [Figure: overlap graph on the fragments CTAAAG, TACGA, GACA, ACCC; edge weights 2 and 1 shown] The Greedy Algorithm
input: weighted digraph OG(F) with n vertices
output: Hamiltonian path in OG(F)
// Initialize
for i ← 1 to n do
    in[i] ← 0      // how many selected edges enter i
    out[i] ← 0     // how many selected edges exit i
    MakeSet(i)
// Process
sort the edges by weight, heaviest first
for each edge (f, g) in this order do
    // test for acceptance
    if in[g] = 0 and out[f] = 0 and FindSet(f) ≠ FindSet(g) then
        select (f, g)
        in[g] ← 1
        out[f] ← 1
        Union(FindSet(f), FindSet(g))
    if there is only one component then break
return selected edges
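A runnable Python version of the pseudocode might look as follows. It is a sketch under a few assumptions that the slide does not spell out: edge weights are the longest proper suffix-prefix overlaps, zero-weight edges are included so that a Hamiltonian path always exists, ties are broken by the order in which edges are generated, and the union-find structure is a plain parent array. The helper `superstring` is an illustrative addition.

```python
def overlap(a, b):
    """Longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_path(fragments):
    n = len(fragments)
    parent = list(range(n))                  # MakeSet(i) for every vertex

    def find(i):                             # FindSet with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    has_in = [False] * n                     # a selected edge enters i
    has_out = [False] * n                    # a selected edge exits i
    edges = sorted(
        ((overlap(fragments[f], fragments[g]), f, g)
         for f in range(n) for g in range(n) if f != g),
        key=lambda e: -e[0],                 # heaviest first, stable on ties
    )
    selected = []
    for w, f, g in edges:
        if not has_in[g] and not has_out[f] and find(f) != find(g):
            selected.append((f, g, w))
            has_in[g] = has_out[f] = True
            parent[find(f)] = find(g)        # Union
            if len(selected) == n - 1:       # only one component left
                break
    return selected

def superstring(fragments, selected):
    """Spell out the superstring of the selected Hamiltonian path."""
    nxt = {f: (g, w) for f, g, w in selected}
    start = (set(range(len(fragments))) - {g for _, g, _ in selected}).pop()
    s, cur = fragments[start], start
    while cur in nxt:
        cur, w = nxt[cur]
        s += fragments[cur][w:]
    return s

fragments = ["CTAAAG", "TACGA", "GACA", "ACCC"]
path = greedy_path(fragments)
print(superstring(fragments, path))          # TACGACACCCTAAAG
```

On the four fragments of this slide the selected edges have total weight 4, so the superstring has 19 - 4 = 15 characters.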

30 A graph where "greedy" fails F = {GCAAAG, AGTA, TACGA} [Figure: overlap graph on GCAAAG, TACGA, AGTA] We order the edges by weight: (AGAT, GCAAAG) = 3, (GCAAAG, AGTA) = 2, (AGTA, TACGA) = 2. The algorithm first chooses (AGAT, GCAAAG) = 3 and is then forced to select an edge of weight 0 to complete the path. Instead, the solution should be (GCAAAG, AGTA) = 2, (AGTA, TACGA) = 2.
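The same failure pattern can be reproduced in a few lines of Python. The fragment set below is a hypothetical one chosen for this sketch (it is not the set on the slide): it has one edge of weight 3 that blocks two edges of weight 2. The greedy routine is written as path merging, which accepts the same kind of edges as the union-find formulation, and the exhaustive search serves only as a reference.

```python
from itertools import permutations

def overlap(a, b):
    """Longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge_in_order(order):
    """Superstring obtained by overlapping consecutive fragments."""
    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        s += nxt[overlap(prev, nxt):]
    return s

def greedy_superstring(fragments):
    paths = [[f] for f in fragments]         # every fragment starts as its own path
    while len(paths) > 1:
        # heaviest edge from the tail of one path to the head of another
        w, i, j = max(
            ((overlap(p[-1], q[0]), i, j)
             for i, p in enumerate(paths)
             for j, q in enumerate(paths) if i != j),
            key=lambda e: e[0],
        )
        merged = paths[i] + paths[j]
        paths = [p for k, p in enumerate(paths) if k not in (i, j)] + [merged]
    return merge_in_order(paths[0])

def optimal_superstring(fragments):
    """Exhaustive reference, feasible only for very small inputs."""
    return min((merge_in_order(p) for p in permutations(fragments)), key=len)

fragments = ["ATGC", "TGCAT", "GCC"]         # hypothetical example
print(len(greedy_superstring(fragments)), len(optimal_superstring(fragments)))
```

Greedy takes the edge ATGC to TGCAT (weight 3) and must finish with a zero-weight edge, giving 9 characters, whereas the path TGCAT, ATGC, GCC uses two edges of weight 2 and gives 8.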

31 Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. Issues: scoring, coverage, linkage.

32 Heuristics Scoring: uniformity is good, variability is bad. Compute the entropy of each column; the entropy measures the chaos in a column. There are 5 possible characters: A, T, C, G, and space. $E = -\sum_c p_c \log p_c$, so E = 0 if $p_c = 1$ for one character and $E = \log 5$ if every $p_c = 1/5$. To measure uniformity we want a low entropy per column. Coverage: consider the minimum, maximum, or average coverage; if the coverage reaches 0 for a column i, we do not have a connected layout; if several columns have zero coverage, any permutation of the intervening regions (the contigs) is acceptable; coverage gives confidence in the consensus. Linkage: high coverage with no links is not good; overlap is required.
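The entropy score and the per-column coverage can both be sketched in a few lines of Python. The function names are illustrative, the natural logarithm is an arbitrary choice of base, and `_` is assumed to mark positions outside a fragment (as opposed to a space inside it, which counts as a fifth character for the entropy).

```python
import math
from collections import Counter

def column_entropy(column):
    """E = -sum_c p_c log p_c over the characters present in one column."""
    total = len(column)
    return sum(-(c / total) * math.log(c / total)
               for c in Counter(column).values())

def coverage(layout):
    """Number of fragments covering each column ('_' = not covered)."""
    return [sum(1 for base in column if base != "_") for column in zip(*layout)]

print(column_entropy("AAAA"))            # uniform column: entropy 0
print(column_entropy("ATCG "))           # all five characters once: log 5

layout = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TACCGT__"]
print(coverage(layout))                  # [1, 2, 3, 3, 3, 3, 3, 1, 1]
```

A column with coverage 0 would split the layout into separate contigs, which is the connectivity check described above.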

33 More Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. Assembly in practice consists of: 1. finding overlaps; 2. building the layout; 3. computing the consensus. Advantage: we treat each problem separately. Disadvantage: it becomes difficult to understand the relationship between the input and the final output.

34 Heuristics Finding overlaps: use a dynamic programming approach with a scoring system such as 1 for matches, -1 for mismatches, and -2 for spaces. Do not charge for spaces after the first sequence or before the second one.
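This scoring scheme can be written as a small dynamic program. In the sketch below the first column of the table is zero (characters of the first sequence that precede the second one are free) and the answer is the best value in the last row (characters of the second sequence that follow the first one are free). The function name and the default parameters are illustrative.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning a suffix of a with a prefix of b."""
    prev = [gap * j for j in range(len(b) + 1)]   # b may not start before a
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)                  # skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return max(prev)                              # the rest of b is free

print(overlap_score("ACCGT", "CGTGC"))  # 3, from the exact overlap CGT
```

Because mismatches and spaces are allowed at a cost, the same routine also finds overlaps between fragments that contain sequencing errors.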

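35 Heuristics Ordering Fragments There is no algorithm that is both simple and general enough. Considerations: use the set $DF = F \cup \bar{F}$, where $\bar{F}$ holds the reverse complements of the fragments in F. If $f = uv$ overlaps $g = wx$ (written $f \to g$), then $\bar{g} = \bar{x}\bar{w}$ overlaps $\bar{f} = \bar{v}\bar{u}$ (that is, $\bar{g} \to \bar{f}$). If the end of f is approximately the same as the beginning of g, we can expect that whatever criterion is used to assess the similarity between f and g, the same criterion will apply to their reverse complements.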
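The symmetry between an overlap and the overlap of the reverse complements can be checked directly. This is a minimal sketch with illustrative function names; it uses exact suffix-prefix overlaps rather than an approximate similarity criterion.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]

def overlap(a, b):
    """Longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def doubled_set(fragments):
    """DF: the fragments together with their reverse complements."""
    return set(fragments) | {reverse_complement(f) for f in fragments}

f, g = "TACGA", "GACA"
print(overlap(f, g))                                          # 2
print(overlap(reverse_complement(g), reverse_complement(f)))  # 2
```

Every overlap from f to g shows up again, with the same length, as an overlap from the reverse complement of g to the reverse complement of f.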

36 Finding overlaps Finding a good ordering of the overlapping fragments means finding a directed path in the overlap graph. Both strands are constructed simultaneously. Contained fragments are not essential in the path. A disconnected graph indicates lack of coverage. The presence of cycles indicates repeats. Unusually high coverage indicates possible repeats. The presence of reverse-complement cycles indicates inverted repeats.

37 Alignment and Consensus Use the minimal sum of the distances. Suppose we have f → g → h with f = CATAGTC, g = TAACTAT, h = AGACTATCC. Two semiglobal alignments for f and g are:
First alignment:
C A T A G T C _ _ _
_ _ T A A _ C T A T
Second alignment:
C A T A G T C _ _ _
_ _ T A _ A C T A T
Layout with h, using the second alignment, and its consensus S:
C A T A G T C _ _ _ _ _
_ _ T A _ A C T A T _ _
_ _ _ A G A C T A T C C
C A T A G A C T A T C C (S)
If we use the second alignment, $d_s(f,S) = 1$, $d_s(g,S) = 1$, $d_s(h,S) = 0$, and $d_s(f,S) + d_s(g,S) + d_s(h,S) = 2$. If we use the first alignment and A is chosen for column 6, $d_s(f,S) = 1$, $d_s(g,S) = 2$, $d_s(h,S) = 0$, and $d_s(f,S) + d_s(g,S) + d_s(h,S) = 3$.
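The two sums can be reproduced by counting, for each row of a layout, the columns that disagree with the consensus. The sketch below assumes the alignment is already given: `.` marks columns outside a fragment (not charged) and `-` marks a space inside it. The count is the cost of that particular alignment, which coincides with the substring edit distance in this example. The names are illustrative.

```python
def distance_in_layout(row, consensus):
    """Columns where an aligned row disagrees with the consensus."""
    return sum(1 for r, c in zip(row, consensus) if r != "." and r != c)

def sum_of_distances(rows, consensus):
    return sum(distance_in_layout(row, consensus) for row in rows)

consensus = "CATAGACTATCC"
h = "...AGACTATCC"
first = ["CATAGTC.....", "..TAA-CTAT..", h]    # first alignment of f and g
second = ["CATAGTC.....", "..TA-ACTAT..", h]   # second alignment of f and g

print(sum_of_distances(first, consensus))   # 3
print(sum_of_distances(second, consensus))  # 2
```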

38 A Linked List of Bases Sometimes we know what is best only later. Is there a structure that helps us? Use a linked list of bases: matched bases are unified in one node; unmatched bases are left separate. Technique: traverse this graph in topological order. [Figure: graph of linked base nodes]

39 Conclusions The models fail to address all the issues involved in the problem. The real problem is effectively NP-hard. Approximation gives us some help but fails in some cases. Heuristics help, and the problem is broken into three smaller problems: 1. finding overlaps; 2. building the layout; 3. computing the consensus. Are we sure there is nothing else to do? Next week we will look at the smaller problem of comparing only two sequences instead of many. Will we find something better?

