Combinatorial Optimization Problems in Computational Biology
Ion Mandoiu, CSE Department

What Is Computational Biology?
- "Study of mathematical and computational problems of modeling biological processes in the cell, removing experimental errors from genomic data, interpreting the data and providing theories about their biological relations" [G. Lancia]
- Multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, …

5 Steps to Solving CB Problems
1. Understand the biological problem
2. Represent biological data as mathematical objects (strings, sets, graphs, permutations, …), map biological relations into mathematical relations, and formulate the biological question as an optimization or feasibility problem
3. Study computational complexity: polynomial? NP-hard?
4. Develop efficient algorithms
   - If in P, find fast and memory-efficient exact algorithms
   - If NP-hard, find practical exact algorithms and/or algorithms with provable approximation guarantees
5. Validate algorithms on biological data

Outline
- Shortest Superstring
- Sequencing by Hybridization
- PCR Primer Selection

Shotgun Sequencing

Shortest Superstring
- Given: a set of strings $s_1, s_2, \dots, s_n$
- Find: a shortest string $s$ containing each $s_i$ as a substring
- Example: set of strings 000, 001, 010, 011, 100, 101, 110, 111
- Superstring:
- Complexity: NP-hard [Maier & Storer 77]

Greedy Merging Algorithm
- $S = \{s_1, s_2, \dots, s_n\}$
- While $|S| > 1$ do
  - Find $s, t$ in $S$ with the longest overlap
  - $S = (S \setminus \{s, t\}) \cup \{\,s$ overlapped with $t$ to the maximum extent$\,\}$
- Output the final string

Approximation factor no better than 2:
- $s_1 = ab^k$, $s_2 = b^k c$, $s_3 = b^{k+1}$
- Greedy output: $ab^k c\,b^{k+1}$, length $2k+3$ (all three pairwise overlaps tie at $k$; greedy may merge $s_1$ and $s_2$ first, after which $s_3$ overlaps nothing)
- Optimum: $ab^{k+1}c$, length $k+3$

Open problem: prove that the greedy superstring is always at most twice as long as the optimum.
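The greedy merging loop can be sketched in Python as follows. This is a minimal illustration, not the slide author's code: the names `overlap` and `greedy_superstring` are hypothetical, ties are broken by taking the first best pair found, and the input is assumed to contain no string that is a substring of another.

```python
def overlap(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with the longest overlap (first pair wins ties)."""
    pool = list(dict.fromkeys(strings))  # drop duplicates, keep input order
    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, index of s, index of t)
        for i, s in enumerate(pool):
            for j, t in enumerate(pool):
                if i != j:
                    k = overlap(s, t)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]  # s overlapped with t to the maximum extent
        pool = [x for idx, x in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0]
```

With this tie-breaking rule, the input `["abbb", "bbbc", "bbbb"]` (the case k = 3 of the bad example) merges the first two strings first and ends with a string of length 2k+3, while the optimum `abbbbc` has length k+3.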

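Overlap & Prefix of 2 strings
- Overlap of $s$ and $t$: longest suffix of $s$ that is a prefix of $t$
- Prefix of $s$ and $t$: $s$ after removing $\mathrm{overlap}(s,t)$
- With $s = a_1 a_2 \dots a_{|s|}$, $t = b_1 b_2 \dots b_{|t|}$ and an overlap of length $k$:
  - $\mathrm{overlap}(s,t) = a_{|s|-k+1} \dots a_{|s|} = b_1 \dots b_k$
  - $\mathrm{prefix}(s,t) = a_1 \dots a_{|s|-k}$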
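The two definitions translate directly into code. The sketch below is illustrative only; the function names are hypothetical and the strings are assumed to be distinct.

```python
def overlap(s, t):
    """Longest suffix of s that is also a prefix of t (empty string if none)."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return s[len(s) - k:]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]
```

By construction `prefix(s, t) + overlap(s, t) == s`, and `prefix(s, t) + t` is the shortest string that starts with `s` and ends with `t`.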

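Lower Bound on OPT
- Prefix graph: complete directed graph on the strings, with edge $i \to j$ of weight $|\mathrm{prefix}(s_i, s_j)|$
- Numbering the strings in order of their leftmost occurrence in OPT:
  $\mathrm{OPT} = \mathrm{prefix}(s_1,s_2) \cdots \mathrm{prefix}(s_{n-1},s_n)\,\mathrm{prefix}(s_n,s_1)\,\mathrm{overlap}(s_n,s_1)$
- Hence the length of OPT is at least the cost of the tour $1 \to 2 \to \dots \to n \to 1$ in the prefix graph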
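The decomposition can be checked numerically for any fixed ordering of the strings: merging them in order with maximal overlaps gives a string whose length equals the tour cost plus the length of the closing overlap. The sketch below uses hypothetical helper names and assumes distinct strings, none a substring of another.

```python
def overlap_len(s, t):
    """Length of the longest suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def merge_in_order(strings):
    """prefix(s1,s2) prefix(s2,s3) ... prefix(s_{n-1},s_n) followed by s_n."""
    out = ""
    for s, t in zip(strings, strings[1:]):
        out += s[:len(s) - overlap_len(s, t)]
    return out + strings[-1]


def tour_cost(strings):
    """Cost of the tour 1 -> 2 -> ... -> n -> 1 in the prefix graph."""
    n = len(strings)
    return sum(
        len(strings[i]) - overlap_len(strings[i], strings[(i + 1) % n])
        for i in range(n)
    )
```

For the ordering `["abc", "bca", "cab"]` the merged string is `abcab`, the tour cost is 3, and the closing overlap of `cab` with `abc` has length 2, so the two sides agree.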

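The Cycle Cover Algorithm
- Computing an optimal TSP tour in the prefix graph is NP-hard
- Key idea: lower-bound OPT using a minimum-weight cycle cover
- For every cycle $c = (i_1 \to i_2 \to \dots \to i_l \to i_1)$, the string $\sigma(c) := \mathrm{prefix}(s_{i_1},s_{i_2}) \cdots \mathrm{prefix}(s_{i_l},s_{i_1})\, s_{i_1}$ is a superstring of $s_{i_1}, \dots, s_{i_l}$
- Cycle cover algorithm: find a minimum-weight cycle cover $C = \{c_1, \dots, c_k\}$ of the prefix graph and output $\sigma(c_1) \cdots \sigma(c_k)$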

The Cycle Cover Algorithm
- Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The cycle cover algorithm gives a factor 4 approximation.
- Length of output is $\mathrm{wt}(C) + \sum_i |r_i|$, where $r_i$ is a "representative" string from cycle $c_i$
- $\mathrm{wt}(C) \le \mathrm{OPT}$
- If each $r_i$ is no longer than $\mathrm{wt}(c_i)$ $\Rightarrow$ output within factor 2 of optimum!
- But $r_i$ can be much longer than $\mathrm{wt}(c_i)$ (periodic strings!)
- It can be shown that $\sum_i |r_i| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$ $\Rightarrow$ factor 4
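A small sketch of the cycle cover idea follows. It makes several assumptions that are not in the slides: the minimum-weight cycle cover is found by brute force over all permutations (usable only for a handful of strings; a polynomial-time version would solve an assignment problem instead), self-loops are allowed and use the longest proper self-overlap, and the representative of each cycle is simply its lowest-indexed string. All names are hypothetical.

```python
from itertools import permutations


def overlap(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def cycle_cover_superstring(strings):
    """Return (superstring, wt(C)) for a brute-force minimum-weight cycle cover."""
    n = len(strings)
    # w[i][j] = |prefix(s_i, s_j)|, the weight of edge i -> j in the prefix graph
    w = [[len(s) - overlap(s, t) for t in strings] for s in strings]
    # Every permutation p encodes a cycle cover with edges i -> p[i]
    succ = min(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    weight = sum(w[i][succ[i]] for i in range(n))
    pieces, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        alpha, i = "", start
        while i not in seen:  # walk the cycle once, collecting prefixes
            seen.add(i)
            alpha += strings[i][:w[i][succ[i]]]
            i = succ[i]
        pieces.append(alpha + strings[start])  # sigma(c) = prefixes + representative
    return "".join(pieces), weight
```

On `["abbb", "bbbc", "bbbb"]` the cover found has weight 6, equal to the length of the optimum `abbbbc`, and the output has length wt(C) plus the two representative lengths, which is within the factor 4 guarantee.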

Improved Algorithm
- Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The improved algorithm gives a factor 3 approximation.
- The proof uses the fact that the greedy algorithm achieves at least ½ of the optimum compression.
- The best approximation factor cited here is due to [Breslauer, Jiang, Jiang 97].

Sequencing by Hybridization
- Exploits parallel hybridization in DNA arrays
- All $4^k$ probes of a certain length $k$ ($k$ = 8 to 10) are synthesized on the array
- Target DNA hybridizes at locations containing probes complementary to its $k$-substrings
- Sequencing by Hybridization (SBH) problem: reconstruct the target DNA given its $k$-length substrings (the spectrum)

Mathematical Formulation of SBH
- SBH is a special case of the shortest superstring problem: a solution corresponds to a Hamiltonian path (NP-hard to find) in the "prefix length = 1" graph
- [Pevzner 89] SBH is equivalent to finding an Eulerian path (easy to find in linear time) in the following graph:
  - Vertices are all $(k-1)$-tuples
  - There is a directed edge from $(k-1)$-tuple $u$ to $(k-1)$-tuple $v$ iff the spectrum contains a $k$-length string whose first $k-1$ symbols match $u$ and whose last $k-1$ symbols match $v$
- Choose the right mathematical abstraction!
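The Eulerian-path formulation can be sketched in a few lines. This is an illustrative sketch under assumptions that are not in the slides: the spectrum is error-free and given as a list of k-mers with multiplicities, only the (k-1)-tuples that actually occur are used as vertices, and the path is found with Hierholzer's algorithm. The function names are hypothetical.

```python
from collections import defaultdict


def spectrum(dna, k):
    """All k-length substrings of dna, sorted (an idealized, error-free spectrum)."""
    return sorted(dna[i:i + k] for i in range(len(dna) - k + 1))


def reconstruct(kmers):
    """Spell one Eulerian path in the graph whose edges are the given k-mers."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]  # first and last k-1 symbols
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one
    start = next((u for u, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:  # Hierholzer's algorithm, iterative form
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("spectrum does not admit an Eulerian path")
    return path[0] + "".join(v[-1] for v in path[1:])
```

An Eulerian path need not be unique, so the reconstructed string may differ from the original target while having exactly the same spectrum; for example, `ATGGCGTGCA` and `ATGCGTGGCA` share the same 3-mers.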

Polymerase Chain Reaction …

Primer Selection Problem
[Figure: the i-th amplification locus with forward primer $f_i$ and reverse primer $r_i$, 5'/3' strand ends, and lengths labeled L+x]
- Given:
  - Pairs of forward/reverse sequences for the $n$ amplification loci
  - Primer length $k$ and amplification upper bound $L$
- Find: a minimum set $S$ of primers of length $k$ such that, for each amplification locus, there are two primers in $S$ hybridizing to the forward and reverse sequences within a distance of $L$ of each other

Previous Work
- [Pearson et al. 96] Logarithmic approximation factor using the greedy set cover algorithm, for a formulation that does not distinguish between forward and reverse primers
  - Similar formulations used by [Linhart & Shamir '02, Souvenir et al. '03]
  - To enforce the bound of $L$ on amplification length, forward and reverse sequences must be truncated to length $L/2$
- [Fernandes & Skiena '02] model primer selection as a minimum multicolored subgraph problem:
  - Vertices are candidate primers
  - Add an edge of color $i$ between primers $u$ and $v$ if they hybridize to the $i$-th forward and reverse sequences within a distance of $L$
  - Find a minimum-size set of vertices inducing edges of all colors
  - No non-trivial approximation factor proposed
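The greedy set cover idea behind the first formulation can be sketched as follows. This is a simplified illustration, not the cited authors' method: "hybridizes to" is modeled as an exact substring match, every k-mer occurring in some sequence is a candidate primer, forward and reverse sequences are not distinguished, and ties are broken alphabetically. The function name is hypothetical.

```python
def greedy_primer_cover(sequences, k):
    """Greedy set cover: repeatedly pick the k-mer occurring in the most uncovered sequences."""
    covers = {}  # candidate primer -> indices of the sequences containing it
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            covers.setdefault(seq[i:i + k], set()).add(idx)
    uncovered = set(range(len(sequences)))
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(covers),
                   key=lambda p: len(covers[p] & uncovered),
                   default=None)
        if best is None or not covers[best] & uncovered:
            raise ValueError("some sequence is shorter than k")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen
```

For `["ACGTAC", "TTACGA", "GGACGT"]` with k = 3, the single primer `ACG` occurs in all three sequences, so one greedy step suffices.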

Improved Approximations [Konwar, Mandoiu, Russell, Shvartsman 04]
- Logarithmic approximation factor using a "potential function" greedy algorithm for the bounded amplification length primer selection problem
- $O(L \ln n)$ approximation factor based on randomized rounding for the minimum multicolored subgraph problem of [Fernandes & Skiena '02]

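Key Lemma
- If $r$ and $r'$ are representative strings from cycles $c$ and $c'$, then $|\mathrm{overlap}(r,r')| < \mathrm{wt}(c) + \mathrm{wt}(c')$
- Proof idea: if $|\mathrm{overlap}(r,r')| \ge \mathrm{wt}(c) + \mathrm{wt}(c')$, then $\alpha^\infty = (\alpha')^\infty$ $\Rightarrow$ $\alpha^\infty$ covers the strings in both $c$ and $c'$ $\Rightarrow$ the cycle cover is not minimal
- Here $\alpha$ and $\alpha'$ denote the concatenated prefix strings around $c$ and $c'$, so $|\alpha| = \mathrm{wt}(c)$ and $|\alpha'| = \mathrm{wt}(c')$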

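Proof of Factor 4
- Length of output $= \mathrm{wt}(C) + \sum_i |r_i|$
- Numbering the $r_i$'s in order of leftmost occurrence in OPT and using the Key Lemma:
  $\sum_i |r_i| \le \mathrm{OPT} + \sum_i |\mathrm{overlap}(r_i, r_{i+1})| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$
- $\mathrm{wt}(C) \le \mathrm{OPT}$ $\Rightarrow$ length of output $\le \mathrm{OPT} + 3\,\mathrm{wt}(C) \le 4 \times \mathrm{OPT}$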

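Improved Algorithm Analysis
Notation: $\tau$ is the output of the improved algorithm, obtained by running greedy on the strings $\sigma(c_i)$, $i = 1, \dots, k$; $\mathrm{OPT}_\sigma$ is the length of the shortest superstring of the $\sigma(c_i)$.
- Observation 1: The greedy algorithm is known to achieve at least ½ of the optimum compression, i.e.,
  $\sum_i |\sigma(c_i)| - |\tau| \ge \tfrac{1}{2}\left(\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma\right)$
  $\Rightarrow$ $|\tau| - \mathrm{OPT}_\sigma \le \tfrac{1}{2}\left(\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma\right)$
- Observation 2: By numbering the $\sigma(c_i)$'s in order of leftmost occurrence in $\mathrm{OPT}_\sigma$ and using the Key Lemma again,
  $\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma = \sum_i |\mathrm{overlap}(\sigma(c_i), \sigma(c_{i+1}))| \le 2\,\mathrm{wt}(C)$
  $\Rightarrow$ $|\tau| - \mathrm{OPT}_\sigma \le \mathrm{wt}(C)$
- Observation 3: $\mathrm{OPT}_\sigma \le \mathrm{OPT} + \mathrm{wt}(C)$
  $\Rightarrow$ $|\tau| \le \mathrm{OPT} + 2\,\mathrm{wt}(C) \le 3\,\mathrm{OPT}$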