Fragment Assembly 7/30/2019.


1 Fragment Assembly 7/30/2019

2 Introduction
Fragments are typically 200-700 bp long.
The “target” string is about 30k-100k bp long.
Problem: given a set of fragments, reconstruct the target.

3 Introduction
Assembly is a multiple alignment of the fragments that ignores spaces at the ends.
The alignment is called the “layout”.
The output is called the “consensus sequence”.
It is an optimization problem.

4 Complications
Base-call errors:
Substitution errors [p 107]
Insertion errors (possibly from the host sequence) [p 108, fig 4.3]
Deletion errors [fig 4.4]
Majority voting (or some other form of optimization) resolves them.
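As a rough illustration of majority voting over a layout, the sketch below takes the most frequent character in each column. It assumes the layout is already given as equal-width rows, with ' ' where a fragment does not reach and '-' for a gap inside a fragment; the function name `consensus` and this encoding are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def consensus(layout):
    """Majority vote per column of an aligned layout (hypothetical encoding).

    ' ' means the fragment does not cover the column; '-' is a gap
    inside a fragment. Columns won by '-' contribute nothing.
    """
    width = max(len(row) for row in layout)
    out = []
    for col in range(width):
        votes = Counter(
            row[col] for row in layout if col < len(row) and row[col] != ' '
        )
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != '-':
            out.append(winner)
    return ''.join(out)

# The second fragment carries a substitution error in column 3 (A for T).
layout = [
    "ACGTA   ",
    " CGAACG ",
    "  GTACGG",
]
print(consensus(layout))
```

An inserted base shows up as a column where most fragments have '-', so the vote drops it. Ties are not handled specially here.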

5 Complications
Chimeras:
Two non-contiguous fragments get joined as a single fragment [p 109, fig 4.5].
They need to be weeded out in a preprocessing step.
Like chimeras, contaminant fragments (possibly from the host) need to be filtered out as well.

6 Complications
Unknown orientation:
Fragments may come from either strand.
If a fragment comes from the opposite strand, its reverse complement must be in the target string.
Consequence: try both the forward and the reverse-complement orientation of each fragment (2^n trials in the worst case, for n fragments) [p 109, fig 4.6].
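A minimal sketch of the orientation problem, assuming an uppercase ACGT alphabet. `reverse_complement` and `orientations` are illustrative names; the enumeration only demonstrates the 2^n growth and is not meant as a practical strategy.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]

def orientations(fragments):
    """Every forward/reverse-complement assignment: 2^n tuples for n fragments."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)

fragments = ["AACG", "GGT", "TTA"]
print(reverse_complement("AACG"))
print(sum(1 for _ in orientations(fragments)))
```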

7 Complications
Repeats:
Regions (superstrings of some fragments) may repeat in a target.
Problem 1: under approximate alignment, which copy do the fragments really come from? [p 110, fig 4.7]
Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9]
Inverted repeats: a repeat of the reverse complement [fig 4.10]

8 Complications
Insufficient coverage:
The chance of covering the target increases with redundancy (a heuristic: sample fragments totalling 8 times the target length).
A gap that remains uncovered after many fragments are aligned is unlikely to be covered by further sampling, so random sampling is not a good solution here.

9 Complications
Insufficient coverage:
With insufficient coverage you get multiple “contigs”, not one contig.
A “t-contig” is a contig in which linked pairs of fragments overlap by at least t.
Expected number of contigs: [p 112, formula 4.1]
Lower t means fewer contigs (more aligned segments) but a weaker consensus.
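The slide cites formula 4.1 without reproducing it. The sketch below assumes the standard Lander-Waterman estimate, n * exp(-n(l - t)/T) for n fragments of length l sampled from a target of length T with minimum overlap t. That choice is an assumption and may differ in notation or detail from the cited formula; the function names are illustrative.

```python
import math

def coverage(n, l, T):
    """Average redundancy: total fragment length divided by target length."""
    return n * l / T

def expected_contigs(n, l, T, t):
    """Assumed Lander-Waterman estimate of the expected number of contigs."""
    return n * math.exp(-n * (l - t) / T)

# Hypothetical numbers: 160 fragments of 500 bp over a 10,000 bp target.
print(coverage(160, 500, 10_000))
print(expected_contigs(160, 500, 10_000, t=20))
```

Under this estimate a smaller t gives a smaller expected contig count, which matches the trade-off stated on the slide.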

10 Reconstruction
The shortest common superstring is not necessarily the best solution.
Fig 4.12 vs Fig 4.13 (p 115/116)

11 Reconstruction
The superstring is to be reconstructed out of fragments.
This is an alignment problem with no end penalty.
d_s is the edit distance score without end penalty: d_s(f, S) is the minimum of the edit distance d over the substrings of S.
Fig 4.14 (p 117): best-aligned subsequence matching.
Note: a matched character is charged 0, a mismatch 1, and a gap 2; this is a “distance” rather than a “similarity”.
We will write d for d_s.

12 Reconstruction
f is an approximate substring of S at error level e when d(f, S) <= e|f|.
e = 0 means no error is allowed.
e > 0 allows insertion, deletion, and substitution errors.
Both f and its reverse complement f- should be tried for a match.
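A sketch of the distance without end penalty and of the error-level test, using the costs stated in the slides (match 0, mismatch 1, gap 2). The names `substring_distance` and `is_approximate_substring` are illustrative. The dynamic program is the usual one for approximate substring matching, in which starting and ending anywhere in S is free.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    return fragment.translate(COMPLEMENT)[::-1]

def substring_distance(f, s, mismatch=1, gap=2):
    """Minimum edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)      # the empty prefix of f may start anywhere in s
    for i in range(1, len(f) + 1):
        cur = [prev[0] + gap]      # f[:i] against nothing: i gaps
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            cur.append(min(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return min(prev)               # f may end anywhere in s

def is_approximate_substring(f, s, e):
    """True when f or its reverse complement fits s within e * |f| errors."""
    best = min(substring_distance(f, s),
               substring_distance(reverse_complement(f), s))
    return best <= e * len(f)

print(substring_distance("ACG", "TTAGGTT"))
print(is_approximate_substring("ACG", "TTCGTTT", 0.0))
```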

13 Reconstruction: Problem
Input: a set F of substrings and an error level e.
Output: the shortest possible string S such that, for all f in F, min(d(f, S), d(f-, S)) <= e|f|, where f- is the reverse complement of f.

14 Reconstruction: Multicontig
How much overlap do we require between strings?
Ideally, every column of the layout L, from 1 through |L|, should contain a single character.
Fig 4.4 (p 118): t-contigs for t = 3, 2, 1
There is a trade-off between t and the number of t-contigs.

15 Reconstruction: Multicontig
S is an e-consensus sequence for a multicontig, with 0 <= e <= 1, when the edit distance satisfies d(f, S) <= e|f|.
Multicontig problem:
Input: set F, integer t >= 0, error level 0 <= e <= 1
Output: a minimum partition of F in which each part C_i is a t-contig with an e-consensus

16 Reconstruction: Overlap Multi-graph
Nodes are the fragments.
A directed arc is labelled with the length t of an overlap between its nodes: the t-suffix of the source equals the t-prefix of the destination.
There are arcs between all pairs of nodes, but no self-loops.
Fig 4.15 (p 121): example
Length of the superstring created from a path = total length of all fragments involved - total weight (overlap) along the path.
A maximum-weight Hamiltonian path is what we are looking for in this graph: it gives the superstring with maximum overlap.
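To make the length relation concrete, the sketch below computes exact suffix-prefix overlaps and spells out the superstring of a path. It keeps only the longest overlap per ordered pair, a simplification of the multigraph, which has one arc per overlap length. The function names and the three fragments are made up for illustration.

```python
def overlap(a, b):
    """Longest t such that the t-suffix of a equals the t-prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def path_weight(path):
    """Sum of overlaps between consecutive fragments on the path."""
    return sum(overlap(a, b) for a, b in zip(path, path[1:]))

def superstring(path):
    """Glue the fragments together, writing each overlap only once."""
    result = path[0]
    for prev, nxt in zip(path, path[1:]):
        result += nxt[overlap(prev, nxt):]
    return result

path = ["ACCGT", "CGTGC", "TGCAA"]
s = superstring(path)
# Superstring length = total fragment length - total overlap on the path.
print(s, len(s), sum(map(len, path)) - path_weight(path))
```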

17 Reconstruction
Fragments that are substrings of other fragments in the set are noise: remove them.
Draw the OMG of the substring-free set of fragments.
A shortest common superstring always corresponds to a Hamiltonian path in this graph.

18 Reconstruction: OMG
Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Ham. path P such that S(P) is contained in S.
Fragments are strictly ordered over S: the order of their left endpoints equals the order of their right endpoints (otherwise one would be a substring of another).
The path in the OMG follows the same order of fragments as in S.
S may contain extra garbage material, so S(P) is within S.

19 Reconstruction: OMG
If S is a shortest common superstring, then S cannot be longer than S(P), so S = S(P).
In other words, a shortest common superstring of the fragment set F corresponds to a Ham. path in the OMG of the substring-free collection F’.

20 Reconstruction: OMG
Think of an algorithm for weeding out substrings from F.
Also weed out multi-edges by keeping only the largest-weight edge between any pair of nodes.
If the weight of an edge is below a threshold t, treat the weight as 0.
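One straightforward answer to the exercise is a quadratic scan from longest to shortest fragment. The sketch ignores reverse complements and uses Python's built-in substring test; `remove_substrings` is an illustrative name, and faster approaches exist for large fragment sets.

```python
def remove_substrings(fragments):
    """Keep only fragments that are not contained in another fragment."""
    # Longest first, so any container is seen before what it contains;
    # the secondary key makes the order deterministic.
    unique = sorted(set(fragments), key=lambda f: (-len(f), f))
    kept = []
    for f in unique:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept

print(remove_substrings(["ACCGT", "CCG", "CGTGC", "ACCGT", "GTG"]))
```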

21 Reconstruction: OMG
Greedy algorithm to draw a Ham. path (p 125):
Collect edges from largest to smallest weight, subject to:
(1) no cycle is formed (union-find);
(2) the indegree of each node is <= 1 (the first node has 0);
(3) the outdegree of each node is <= 1 (the last node has 0).
[It does not return the Ham. path. Can you modify it to return the Ham. path?]
The algorithm is NOT optimal; example (p 126): it returns weight 3, while the optimal weight is 4.
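The three conditions can be sketched as follows, with one possible way of turning the selected edges into an ordered path. This is a sketch under assumptions, not the listing from p 125: it uses exact overlaps, keeps one edge per ordered pair (including weight-0 edges, so a full path always exists), and all names are illustrative.

```python
def overlap(a, b):
    """Longest t such that the t-suffix of a equals the t-prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_hamiltonian_path(fragments):
    """Return (order of fragment indices, total overlap) chosen greedily."""
    n = len(fragments)
    edges = sorted(
        ((overlap(fragments[i], fragments[j]), i, j)
         for i in range(n) for j in range(n) if i != j),
        key=lambda e: e[0], reverse=True)

    parent = list(range(n))            # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    has_in, has_out = [False] * n, [False] * n
    succ, total = {}, 0
    for w, i, j in edges:
        if has_out[i] or has_in[j]:    # degree conditions (2) and (3)
            continue
        ri, rj = find(i), find(j)
        if ri == rj:                   # condition (1): edge would close a cycle
            continue
        parent[ri] = rj
        has_out[i] = has_in[j] = True
        succ[i] = j
        total += w
        if len(succ) == n - 1:
            break

    # Read the path off the edges, starting at the node with no incoming edge.
    order = [next(i for i in range(n) if not has_in[i])]
    while order[-1] in succ:
        order.append(succ[order[-1]])
    return order, total

print(greedy_hamiltonian_path(["TGCAA", "ACCGT", "CGTGC"]))
```

On this small input the greedy choice happens to be optimal. As the slide notes, that is not guaranteed in general.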

22 Reconstruction: OMG
Subinterval: a fragment that can be embedded within another one in the set.
A subinterval-free and repeat-free graph that is connected at level t has a Ham. path that generates the target string.

23 Reconstruction: OMG
If a repeat exists in the original string, then the graph will have a cycle.
False positive: substrings from two different portions of the target have a t-overlap.
If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p 129). Proof by contradiction: otherwise the subinterval-free fragments could be totally ordered.

24 Reconstruction: OMG
If there are no repeats in a subinterval-free graph, then there exists a unique Ham. path.
If a cycle exists, it may not come from a repeat.

25 Reconstruction: OMG
Example 4.6 (p 130): the greedy algorithm finds the wrong string, but the Ham. path finds the correct one.
Greedy does not care about linkage (it optimizes total overlap, aiming at the shortest common superstring).
The Ham. path approach accepts any t-overlap connection: it cares about linkage only.

26 Parameters in aligning for fragment assembly
Score on a column: traditionally {0, -1, -2} in sum-of-pairs.
Entropy: sum over every character c (the alphabet plus the space) of -p_c log p_c, where p_c is the probability of c in the column.
All characters the same: p_c = 1, entropy = 0.
For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5.
Entropy measures uniformity alone, which makes it a better metric.
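The two extreme cases on the slide can be checked with a short sketch. It assumes p_c is estimated as the relative frequency of c in the column and uses the natural logarithm; `column_entropy` is an illustrative name.

```python
import math
from collections import Counter

def column_entropy(column):
    """Sum over the characters c in the column of -p_c * log(p_c)."""
    total = len(column)
    return sum(-(k / total) * math.log(k / total)
               for k in Counter(column).values())

print(column_entropy("aaaaa"))               # all characters the same
print(column_entropy("atcg-"), math.log(5))  # all five different
```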

27 Parameters in aligning for fragment assembly
Coverage: by how many fragments is each column “covered”? (Average, min, max)
This is different from the concept of t-overlap.
If a column (of the target) is covered by 0 fragments, then the layout is disconnected.
Expecting coverage > 1 for all columns conflicts with the requirement of a subinterval-free collection.
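A small sketch of per-column coverage, assuming each fragment's position in the layout is known as a half-open (start, end) interval over the target columns. That representation and the name `column_coverage` are assumptions made for illustration.

```python
def column_coverage(intervals, target_length):
    """Number of fragments covering each column of the target."""
    cov = [0] * target_length
    for start, end in intervals:
        for col in range(start, end):
            cov[col] += 1
    return cov

cov = column_coverage([(0, 5), (1, 7), (2, 8)], 8)
print(cov, min(cov), max(cov), sum(cov) / len(cov))

# A column with coverage 0 means the layout is disconnected.
print(min(column_coverage([(0, 3), (5, 8)], 8)) == 0)
```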

28 Parameters in aligning for fragment assembly
Coverage is not enough; we also need good linkage. Example: p 133
The Ham. path algorithm takes care of that.

29 Steps in assembly: Step 1: Overlap finding
Approximate overlaps (deletions, insertions, and replacements allowed) are found by a semi-global DP algorithm with an appropriate end-gap penalty, run pairwise over the fragments and their reverse complements.
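One way to realise the semi-global alignment for a single ordered pair is sketched below. It scores the best alignment of some suffix of a against some prefix of b, charging nothing for the unused start of a or the unused end of b. The similarity scores (match +1, mismatch -1, gap -2) and the name `overlap_score` are assumptions; the slides do not fix them.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning a suffix of a with a prefix of b."""
    # Row 0: a prefix of b aligned against nothing costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0]                      # skipping a[:i] entirely is free
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return max(prev)                   # the rest of b is free as well

print(overlap_score("ACCGT", "CGTGC"))
```

In a full pipeline this would be evaluated for every ordered pair drawn from the fragments and their reverse complements, keeping only pairs whose score clears a chosen threshold.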

30 Steps in assembly: Step 2:
Construct the overlap graph over (F union F-bar) for the fragment set F (after eliminating substrings?).
Construct a Hamiltonian path in this graph.
Cycles and unbalanced coverage may indicate repeats.

31 Steps in assembly: Step 3:
Fine-tune the multiple alignment to get a consensus target.
Manual or algorithmic.
Examples in p

