Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.


1 Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department

2 What Is Computational Biology? [G. Lancia] “Study of mathematical and computational problems of modeling biological processes in the cell, removing experimental errors from genomic data, interpreting the data and providing theories about their biological relations” Multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, …

3 5 Steps to Solving CB Problems
1. Understand biological problem
2. Represent biological data as mathematical objects (strings, sets, graphs, permutations, …), map biological relations into mathematical relations, and formulate the biological question as optimization or feasibility problem
3. Study computational complexity: Polynomial? NP-hard?
4. Develop efficient algorithms
–If in P, find fast and memory efficient exact algorithms
–If NP-hard, find practical exact algorithms and/or algorithms with provable approximation guarantees
5. Validate algorithms on biological data

4 Outline Shortest Superstring Sequencing by Hybridization PCR Primer Selection

5 Shotgun Sequencing

6 Shortest Superstring
Given: set of strings $s_1, s_2, \ldots, s_n$
Find: shortest string $s$ containing each $s_i$ as a substring
Example:
–Set of strings: 000, 001, 010, 011, 100, 101, 110, 111
–Superstring: 0001110100
NP-hard [Maier&Storer 77]
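The example can be checked mechanically. The short Python sketch below uses a hypothetical helper name, `is_superstring`, that does not come from the slides. It confirms that the 10-character string contains all eight 3-bit strings. Since a string of length m has only m - 2 windows of length 3, no superstring of these eight strings can be shorter than 10.

```python
def is_superstring(candidate, strings):
    """Return True if every string occurs as a contiguous substring of candidate."""
    return all(s in candidate for s in strings)


strings = ["000", "001", "010", "011", "100", "101", "110", "111"]
candidate = "0001110100"

print(is_superstring(candidate, strings))  # True
print(len(candidate))                      # 10
```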

7 Greedy Merging Algorithm
-$S = \{s_1, s_2, \ldots, s_n\}$
-While $|S| > 1$ do
  -Find $s, t$ in $S$ with longest overlap
  -$S = (S \setminus \{s, t\}) \cup \{\, s \text{ overlapped with } t \text{ to the maximum extent} \,\}$
-Output final string
Approximation factor no better than 2:
–$s_1 = ab^k$, $s_2 = b^k c$, $s_3 = b^{k+1}$
–Greedy output (all three overlaps tie at $k$; suppose $s_1$ and $s_2$ are merged first): $ab^k c\, b^{k+1}$, length $= 2k+3$
–Optimum: $ab^{k+1}c$, length $= k+3$
Open problem: prove that the greedy superstring is always at most twice as long as the optimum
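The greedy loop can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code. The names `overlap_len` and `greedy_superstring` are hypothetical, ties are broken by scan order (an assumption), and strings contained in other strings are dropped up front (also an assumption).

```python
def overlap_len(s, t):
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with the longest overlap until one string remains."""
    # Assumption: remove duplicates and strings contained in another string.
    pool = [s for s in dict.fromkeys(strings)
            if not any(s != t and s in t for t in strings)]
    while len(pool) > 1:
        best = None  # (overlap length, index of s, index of t)
        for i, s in enumerate(pool):
            for j, t in enumerate(pool):
                if i != j:
                    k = overlap_len(s, t)
                    if best is None or k > best[0]:  # first pair wins ties
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]
        pool = [x for idx, x in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


k = 3
bad_case = ["a" + "b" * k, "b" * k + "c", "b" * (k + 1)]
result = greedy_superstring(bad_case)
print(len(result), len("a" + "b" * (k + 1) + "c"))  # greedy length vs. optimum length
```

With this tie-breaking rule the bad instance from the slide yields a string of length 2k+3, while the optimum has length k+3. A different tie-breaking rule could find the optimum on this instance.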

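8 Overlap & Prefix of 2 strings
Overlap of s and t: longest suffix of s that is a prefix of t
Prefix of s and t: s after removing overlap(s,t)
$s = a_1 a_2 a_3 \ldots a_{|s|-k+1} \ldots a_{|s|}$
$t = b_1 \ldots b_k \ldots b_{|t|}$
With $k = |\mathrm{overlap}(s,t)|$: $\mathrm{prefix}(s,t) = a_1 \ldots a_{|s|-k}$ and $\mathrm{overlap}(s,t) = a_{|s|-k+1} \ldots a_{|s|} = b_1 \ldots b_k$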
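These two definitions translate directly into code. The sketch below is illustrative only. It assumes that only proper overlaps are considered (shorter than both strings), which matches the usual assumption that no input string is a substring of another.

```python
def overlap(s, t):
    """Longest proper suffix of s that is also a prefix of t (may be empty)."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return s[-k:]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]


s, t = "abcde", "cdefg"
print(overlap(s, t))     # cde
print(prefix(s, t))      # ab
print(prefix(s, t) + t)  # abcdefg, the merge of s and t
```

Note that prefix(s,t) followed by t is exactly s overlapped with t to the maximum extent.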

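9 Lower Bound on OPT
With the strings numbered in the order in which they occur in the optimal superstring:
$\mathrm{OPT} = \mathrm{prefix}(s_1,s_2) \ldots \mathrm{prefix}(s_{n-1},s_n)\, \mathrm{prefix}(s_n,s_1)\, \mathrm{overlap}(s_n,s_1)$
The total length of the prefix factors is the cost of the tour $1 \to 2 \to \ldots \to n$ in the prefix graph, so OPT is at least the cost of a minimum tour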

10 The Cycle Cover Algorithm
Computing TSP in the prefix graph is NP-hard
Key idea: lower-bound OPT using a min-weight cycle cover
For every cycle $c = (i_1 \to i_2 \to \ldots \to i_l \to i_1)$, $\sigma(c) := \mathrm{prefix}(s_{i_1},s_{i_2}) \ldots \mathrm{prefix}(s_{i_l},s_{i_1})\, s_{i_1}$ is a superstring of $s_{i_1}, \ldots, s_{i_l}$
Cycle cover algorithm:
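A minimal Python sketch of one way to turn a min-weight cycle cover into a superstring. It assumes the steps are: build the prefix graph, find a minimum-weight cycle cover (self-loops allowed), and concatenate σ(c) over its cycles. All function names are hypothetical. The cycle cover is found by brute force over permutations, which is only viable for a handful of strings. A practical version would solve an assignment problem instead.

```python
from itertools import permutations


def overlap_len(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0


def prefix(s, t):
    return s[:len(s) - overlap_len(s, t)]


def min_cycle_cover(strings):
    """Brute-force min-weight cycle cover; returns a list of index cycles."""
    n = len(strings)
    # Edge weight i -> j in the prefix graph is |prefix(s_i, s_j)|.
    w = [[len(prefix(strings[i], strings[j])) for j in range(n)] for i in range(n)]
    # A cycle cover is a permutation succ, where succ[i] is the successor of i.
    succ = min(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    cycles, seen = [], set()
    for start in range(n):
        if start not in seen:
            cycle, i = [], start
            while i not in seen:
                seen.add(i)
                cycle.append(i)
                i = succ[i]
            cycles.append(cycle)
    return cycles


def sigma(cycle, strings):
    """prefix(s_i1,s_i2) ... prefix(s_il,s_i1) followed by s_i1."""
    l = len(cycle)
    parts = [prefix(strings[cycle[j]], strings[cycle[(j + 1) % l]]) for j in range(l)]
    return "".join(parts) + strings[cycle[0]]


def cycle_cover_superstring(strings):
    return "".join(sigma(c, strings) for c in min_cycle_cover(strings))


bits = ["000", "001", "010", "011", "100", "101", "110", "111"]
result = cycle_cover_superstring(bits)
print(result, len(result))
```

The output contains every input string, but it is generally longer than the optimum because each cycle contributes a full copy of its first string.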

11 The Cycle Cover Algorithm
Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The cycle cover algorithm gives a factor 4 approximation.
Length of output is $\mathrm{wt}(C) + \sum_i |r_i|$, where $r_i$ is a “representative” string from cycle $c_i$
$\mathrm{wt}(C) \le \mathrm{OPT}$
- If $r_i$ is no longer than $\mathrm{wt}(c_i)$ $\Rightarrow$ output within factor 2 of optimum!
- $r_i$ can be much longer than $\mathrm{wt}(c_i)$ (periodic strings!)
- It can be shown that $\sum_i |r_i| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$ $\Rightarrow$ factor 4

12 Improved Algorithm
Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The improved algorithm gives a factor 3 approximation.
The proof uses the fact that the greedy algorithm achieves at least ½ of the optimum compression.
Current best approximation factor is 2.596 [Breslauer, Jiang, Jiang 97]

13 Sequencing by Hybridization
Exploits parallel hybridization in DNA arrays
All $4^k$ probes of a certain length $k$ ($k$ = 8 to 10) are synthesized on the array
Target DNA hybridizes at locations containing probes complementary to its $k$-substrings
Sequencing by Hybridization (SBH) Problem: Reconstruct target DNA given its $k$-length substrings (spectrum)

14 Mathematical Formulation of SBH
SBH is a special case of the shortest superstring problem: the solution corresponds to a Hamiltonian path (NP-hard to find) in the “prefix length = 1” graph
[Pevzner 89] SBH is equivalent to finding an Eulerian path (easy to find in linear time) in the following graph:
–Vertices are all (k-1)-tuples
–Directed edge from (k-1)-tuple u to (k-1)-tuple v iff there is a k-length string in the spectrum whose first k-1 symbols match u and last k-1 symbols match v
Choose the right mathematical abstraction!
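The Eulerian-path formulation can be sketched as follows. This is an illustrative Python version using Hierholzer's algorithm. The function name `sbh_reconstruct` is hypothetical, and the sketch assumes an error-free spectrum given as a list with one entry per k-mer occurrence. When several Eulerian paths exist it returns one of them, so the reconstruction is not necessarily the original target.

```python
from collections import defaultdict


def sbh_reconstruct(spectrum):
    """Return a string whose k-mers are exactly those in spectrum, or raise ValueError."""
    k = len(spectrum[0])
    graph = defaultdict(list)                      # (k-1)-tuple -> successor (k-1)-tuples
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]                 # one edge u -> v per k-mer
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # An Eulerian path starts where out-degree exceeds in-degree by one, if such a vertex exists.
    start = next((x for x in sorted(nodes) if outdeg[x] - indeg[x] == 1), spectrum[0][:-1])
    stack, path = [start], []
    while stack:                                   # Hierholzer's algorithm, iterative form
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    seq = path[0] + "".join(v[-1] for v in path[1:])
    # Validate: the k-mers of the result must match the spectrum exactly.
    if sorted(seq[i:i + k] for i in range(len(seq) - k + 1)) != sorted(spectrum):
        raise ValueError("spectrum does not admit an Eulerian path")
    return seq


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(sbh_reconstruct(spectrum))
```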

15 Polymerase Chain Reaction …

16 Primer Selection Problem
[Figure: the i-th amplification locus flanked by forward primer $f_i$ and reverse primer $r_i$ on the 5'/3' strands, with lengths marked L+x]
Given:
–Pairs of forward/reverse sequences for the n amplification loci
–Primer length k and amplification upper bound L
Find: Minimum set of primers S of length k such that, for each amplification locus, there are two primers in S hybridizing to the forward and reverse sequences within a distance of L of each other

17 Previous Work
[Pearson et al. 96] Logarithmic approximation factor using greedy set cover algorithm for a formulation that does not distinguish between forward and reverse primers
–Similar formulations used by [Linhart&Shamir’02, Souvenir et al.’03]
–To enforce bound of L on amplification length must truncate forward and reverse sequences to length L/2
[Fernandes&Skiena’02] model primer selection as a minimum multicolored subgraph problem:
–Vertices are candidate primers
–Add edge colored by color i between primers u and v if they hybridize to i-th forward and reverse sequences within a distance of L
–Find minimum size set of vertices inducing edges of all colors
–No non-trivial approximation factor proposed
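The greedy set cover idea behind the first formulation can be sketched as below. This is a simplified illustration, not the method of any cited paper. It assumes that a primer hybridizes to a sequence exactly when it occurs in it as a substring, it ignores complementarity and the L/2 truncation, and the name `greedy_primer_cover` is hypothetical.

```python
def greedy_primer_cover(sequences, k):
    """Greedy set cover: choose k-mers until every sequence contains a chosen k-mer."""
    covers = {}  # candidate primer -> indices of the sequences it occurs in
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            covers.setdefault(seq[i:i + k], set()).add(idx)
    uncovered = set(range(len(sequences)))
    chosen = []
    while uncovered:
        # Pick the primer covering the most still-uncovered sequences (ties: alphabetical).
        best = max(sorted(covers), key=lambda p: len(covers[p] & uncovered), default=None)
        gained = covers[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError("some sequence has no candidate primer of length k")
        chosen.append(best)
        uncovered -= gained
    return chosen


sequences = ["ACGTAC", "TTACGA", "GGGACG", "CCCCCC"]
print(greedy_primer_cover(sequences, 3))  # ['ACG', 'CCC']
```

In the toy input, ACG occurs in the first three sequences, so it is picked first, and CCC is then needed for the last one.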

18 Improved Approximations [Konwar, M, Russell, Shvartsman 04]
Logarithmic approximation factor using “potential function” greedy for the bounded amplification length primer selection problem
$O(L \ln n)$ approximation factor based on randomized rounding for the minimum multicolored subgraph problem of [Fernandes&Skiena’02]

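20 Key Lemma
If $r$ and $r'$ are representative strings from cycles $c$ and $c'$, then $|\mathrm{overlap}(r,r')| < \mathrm{wt}(c) + \mathrm{wt}(c')$
Proof idea: if $|\mathrm{overlap}(r,r')| \ge \mathrm{wt}(c) + \mathrm{wt}(c')$, then the prefix strings of $c$ and $c'$ generate the same periodic string $\Rightarrow$ a single cycle covers the strings in both $c$ and $c'$ $\Rightarrow$ the cycle cover is not minimal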

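21 Proof of Factor 4
Length of output $= \mathrm{wt}(C) + \sum_i |r_i|$
Numbering the $r_i$’s in order of leftmost occurrence in OPT and using the Lemma:
$\sum_i |r_i| \le \mathrm{OPT} + \sum_i |\mathrm{overlap}(r_i, r_{i+1})| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$
$\mathrm{wt}(C) \le \mathrm{OPT}$ $\Rightarrow$ Length of output $\le 4 \times \mathrm{OPT}$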

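22 Improved Algorithm Analysis
Notation: $\tau$ is the output of the improved algorithm, and $\mathrm{OPT}_\sigma$ is the length of the shortest superstring of $\sigma(c_i)$, $i = 1, \ldots, k$
Observation 1: The greedy algorithm is known to achieve at least ½ of the optimum compression, i.e.,
$\sum_i |\sigma(c_i)| - |\tau| \ge \tfrac{1}{2} \left( \sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma \right)$
$\Rightarrow |\tau| - \mathrm{OPT}_\sigma \le \tfrac{1}{2} \left( \sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma \right)$
Observation 2: By numbering the $\sigma(c_i)$’s in order of leftmost occurrence in $\mathrm{OPT}_\sigma$ and using the key Lemma again,
$\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma = \sum_i |\mathrm{overlap}(\sigma(c_i), \sigma(c_{i+1}))| \le 2\,\mathrm{wt}(C)$
$\Rightarrow |\tau| - \mathrm{OPT}_\sigma \le \mathrm{wt}(C)$
Observation 3: $\mathrm{OPT}_\sigma \le \mathrm{OPT} + \mathrm{wt}(C)$
$\Rightarrow |\tau| \le \mathrm{OPT} + 2\,\mathrm{wt}(C) \le 3\,\mathrm{OPT}$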

