Effective Heuristics for NP-Hard Problems Arising in Molecular Biology Richard M. Karp Bangalore, January 5, 2011.



NP-Hard Problems The P vs. NP problem: Is finding a solution to a combinatorial search problem as easy as checking a solution? The answer is expected to be “No.” NP-hard problems: solvable in polynomial time only if P=NP. General belief: solving an NP-hard problem requires exponential time in the worst case.

Understanding NP-Hard Problems Through Worst-Case Analysis Exact solution methods: exponential running time in the worst case. Polynomial-time approximation algorithms for optimization problems, yielding a worst-case upper bound on the ratio between the cost of an approximate solution and the cost of an optimal solution. Unfortunately, these guaranteed approximation ratios are unrealistically high.

Probabilistic Analysis and Heuristics In probabilistic analysis, problem instances are drawn from simple probability distributions. Often one can prove excellent performance on the average. However, the probability distributions may not correspond to real-life instances. Heuristics are typically evaluated empirically on examples drawn from, or representative of, real-life instances. Heuristics are often “unreasonably effective,” for reasons not well understood.

Famous Unreasonably Effective Heuristics Large traveling-salesman problems can be solved by quick tour construction methods, local improvement methods or cutting plane methods. Local improvement methods find near-optimal solutions to graph bisection problems. Huge satisfiability problems are routinely solved rapidly by branch-and-bound methods. The greedy set cover algorithm typically gives solutions within a few percent of optimal.
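As an illustration of the last item, here is a minimal sketch of the textbook greedy set cover rule: repeatedly pick the subset that covers the most still-uncovered elements. The function name and the tiny instance are made up for this example and do not come from the talk.

```python
def greedy_set_cover(universe, subsets):
    """Return indices of subsets chosen greedily until `universe` is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset covering the most still-uncovered elements
        # (ties go to the lowest index).
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical instance, for illustration only.
universe = range(1, 8)
subsets = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5}, {2, 7}]
cover = greedy_set_cover(universe, subsets)
print(cover)
```

On this instance the greedy choice happens to use three subsets, which is also the minimum possible here, since no two of the subsets cover all seven elements.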

NP-Hard Problems Arising in Molecular Biology and Genetics Genome sequencing; global alignment of multiple genomes; identifying siblings, cousins, second cousins, etc. through comparison of genomes; finding protein modules containing specified types of proteins; computational discovery of dysregulated pathways in human diseases.

Patterns of Inheritance In each region of the genome, each individual has two haplotypes, one inherited from each parent. A haplotype is a sequence of alleles. The haplotype inherited from a parent is a mosaic of segments inherited from the parent’s two haplotypes. Recombination occurs at the boundaries between segments. In a pedigree graph the vertices are individuals and the edges represent parent-child relations.

Reconstructing Pedigrees Given the haplotypes of individuals in the current generation, we wish to reconstruct the pedigree that gave rise to that generation and chart the flow of alleles.

Assumptions of a Generative Model Monogamy. Layered structure: each individual and its mate lie in generation g, have parents in generation g-1, and have children in generation g+1. Generation 1 is the founding generation. The number of children of each couple is drawn from a Poisson distribution with mean 2. In each haplotype, sites of recombination occur according to a Poisson process with known rate.
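The two random ingredients of this model can be sketched with the Python standard library. This is only an illustration under assumptions not stated in the slides: haplotypes are lists of alleles indexed by integer position, the recombination rate is per unit of position, and the starting parental haplotype is chosen uniformly. All function names are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def recombination_sites(length, rate, rng):
    """Sites of a Poisson process with the given rate on the interval [0, length)."""
    sites = []
    position = rng.expovariate(rate)       # exponential gaps give a Poisson process
    while position < length:
        sites.append(position)
        position += rng.expovariate(rate)
    return sites


def child_haplotype(hap_a, hap_b, rate, rng):
    """Mosaic of a parent's two haplotypes, switching source at each recombination site."""
    sites = recombination_sites(len(hap_a), rate, rng)
    source = rng.randrange(2)              # assumption: either haplotype may come first
    child, next_site = [], 0
    for i in range(len(hap_a)):
        while next_site < len(sites) and sites[next_site] <= i:
            source = 1 - source            # a recombination occurred before position i
            next_site += 1
        child.append((hap_a, hap_b)[source][i])
    return child


rng = random.Random(2011)
n_children = sample_poisson(2.0, rng)      # children of one couple, mean 2
print(n_children, "".join(child_haplotype(list("AAAAAAAAAAAA"), list("BBBBBBBBBBBB"), 0.2, rng)))
```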

Working Backwards We construct the pedigree generation by generation, working backwards from the current generation. It suffices to determine, in each generation, which individuals are siblings. Two alleles are identical by descent (IBD) if they are inherited from the same allele in the founding generation. To test whether two individuals in generation g are likely to be siblings, we observe the amount of IBD between their descendants in the current generation.

Inferring Siblinghood Problem: determine which individuals in generation g are siblings. Using IBD, we construct a compatibility graph with a vertex for each individual in generation g, and edges indicating pairs of individuals that are likely to be siblings on the basis of the IBD of their descendants. Problem: Infer the siblinghood graph from the compatibility graph.

Inferring Siblinghood Because of the monogamy assumption, the siblinghood graph must be a union of disjoint cliques. Problem: Given a compatibility graph C, determine the “closest” siblinghood graph S. The algorithm maintains a partition of the vertices of C. The parts of the partition are called quasi-cliques. The score of a partition is A times the number of edges of C whose end points lie in the same quasi-clique, minus the number of non-edges of C whose end points lie in the same quasi-clique. We seek a partition of maximum score.
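The score can be computed directly from its definition. The sketch below assumes the compatibility graph is given as a list of undirected edges and the partition as a list of vertex lists; the function name and the five-vertex example graph are hypothetical.

```python
from itertools import combinations


def partition_score(edges, partition, A):
    """A * (#edges of C inside a part) - (#non-edges of C inside a part)."""
    edge_set = {frozenset(e) for e in edges}
    score = 0.0
    for part in partition:
        for u, v in combinations(part, 2):
            # Each within-part pair either earns A (an edge) or costs 1 (a non-edge).
            score += A if frozenset((u, v)) in edge_set else -1.0
    return score


# Hypothetical compatibility graph: a triangle {0, 1, 2} and an edge {3, 4},
# plus one spurious edge (2, 3) between the two groups.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (2, 3)]
two_parts = partition_score(edges, [[0, 1, 2], [3, 4]], A=1.0)   # 4 inside edges, 0 inside non-edges
one_part = partition_score(edges, [[0, 1, 2, 3, 4]], A=1.0)      # 5 inside edges, 5 inside non-edges
print(two_parts, one_part)
```

Pairs whose end points lie in different parts contribute nothing, so the spurious edge (2, 3) is simply ignored by the two-part partition.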

Justifying the Scoring Function Assumptions: The compatibility graph C is obtained by randomly perturbing the siblinghood graph S. S is a random union of disjoint cliques with sizes uniformly distributed between 1 and a parameter t. If u is adjacent to v in S, then u is adjacent to v in C with probability p; if u is not adjacent to v in S, then u is adjacent to v in C with probability q, where q < p. Under these assumptions, maximizing the score produces a siblinghood graph of maximum conditional probability given C.
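The slide does not give the value of A. The following sketch shows one way a weight of this form arises from the likelihood of C given S alone, assuming pairs are perturbed independently and ignoring any contribution from the prior on S. The symbols e_in and n_in (within-part edges and non-edges of C) are introduced here for illustration.

```latex
% Likelihood of C given a candidate S (independent pairs):
\log \Pr[C \mid S]
  = e_{\mathrm{in}} \log p + n_{\mathrm{in}} \log (1-p)
  + e_{\mathrm{out}} \log q + n_{\mathrm{out}} \log (1-q).

% Subtract the S-independent constant obtained by treating every pair as "out":
\log \Pr[C \mid S] - \mathrm{const}
  = e_{\mathrm{in}} \log \frac{p}{q} \;-\; n_{\mathrm{in}} \log \frac{1-q}{1-p}.

% Since q < p, both logarithms are positive; dividing by the second gives
A \, e_{\mathrm{in}} - n_{\mathrm{in}},
\qquad
A = \frac{\log (p/q)}{\log \bigl( (1-q)/(1-p) \bigr)} .
```

Under these assumptions, the partition maximizing the score also maximizes the likelihood of C.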

Heuristic Algorithm The heuristic algorithm creates an initial partition by greedily constructing disjoint quasi-cliques. It then performs the following local operations to improve the score: Move a vertex; Extract a vertex; Split a quasi-clique; Merge two quasi-cliques; Restructure two quasi-cliques adjacent to a vertex v; Dynamic Programming: given a chain of quasi-cliques, make an optimal simultaneous move of a small set of vertices from each quasi-clique in the chain to its successor quasi-clique.
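The two simplest local operations, moving a vertex to another quasi-clique and extracting it into a part of its own, can be sketched as a plain hill-climbing loop. This is not the speaker's implementation: the greedy initialization and the split, merge, restructure and dynamic-programming steps are omitted, and the names `move_gain` and `local_search_moves` are hypothetical. The graph is assumed to be a dict mapping each vertex to the set of its neighbours in C.

```python
def move_gain(adj, parts, part_of, v, target, A):
    """Change in score if vertex v moves from its current part to parts[target]."""
    def contribution(members):
        # Score of the pairs (v, u) for u in `members`, u != v.
        return sum(A if u in adj[v] else -1.0 for u in members if u != v)
    return contribution(parts[target]) - contribution(parts[part_of[v]])


def local_search_moves(adj, parts, A):
    """Apply single-vertex moves until no move strictly increases the score."""
    parts = [set(p) for p in parts]
    part_of = {v: i for i, p in enumerate(parts) for v in p}
    improved = True
    while improved:
        improved = False
        for v in adj:
            if all(parts):                      # keep an empty part so v can be extracted
                parts.append(set())
            gains = [move_gain(adj, parts, part_of, v, t, A) for t in range(len(parts))]
            best = max(range(len(parts)), key=gains.__getitem__)
            if gains[best] > 1e-12:             # only accept strict improvements
                parts[part_of[v]].discard(v)
                parts[best].add(v)
                part_of[v] = best
                improved = True
    return [p for p in parts if p]


# Hypothetical graph: triangle {0, 1, 2}, edge {3, 4}, spurious edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
result = local_search_moves(adj, [[0], [1], [2], [3], [4]], A=1.0)
print(result)
```

Each accepted move strictly increases the score, so the loop terminates, but only at a local optimum; the richer operations in the list above exist to escape such optima.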

Performance of the Algorithm Typically the algorithm produces a partition with a slightly higher score than the “true” partition from which the compatibility graph was generated by perturbation. However, the fraction of vertices placed in the “correct” partition lies between 93% and 98%, depending on the fraction of edges deleted from cliques and the fraction of edges added between cliques in creating the compatibility graph C from the siblinghood graph S.

The Colorful Subgraph Problem Input: A graph G and an assignment of a color to each vertex. Find, if one exists, a connected subgraph H containing exactly one vertex of each color. Optimization version: Minimize a × (number of extra vertices) + b × (number of omitted colors). The problem is NP-hard, even on planar graphs.

Interpretation In the protein-protein interaction (PPI) graph of a species the vertices represent proteins and the edges represent pairs of physically interacting proteins. Given a connected set X of proteins performing a regulatory function in species A, we seek a similar connected set of proteins in species B. The color of each protein in species B indicates its similarity to a particular protein in X.

Dynamic Programming For each vertex v and set of colors S, determine whether there is a tree containing exactly one vertex of each color in S, no vertices of any other color, and containing vertex v. The computation is recursive, running through sets S in order of increasing cardinality. The running time is of order n · 3^k, where n is the number of vertices and k is the number of colors.
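A minimal sketch of a table of this kind, using the standard subset-splitting recurrence: a tree at v with color set S exists if S splits into two disjoint parts, one realized by a tree at v and the other by a tree at a neighbour of v. The slide does not spell out the recurrence, so this is an assumed reconstruction. Color sets are bitmasks, and masks are visited in increasing numeric order, which also guarantees that every proper subset is processed first. The function name is hypothetical.

```python
def colorful_tree_table(adj, color, k):
    """reach[v][S] is True iff some tree contains v and exactly one vertex of each color in S.

    adj: dict vertex -> iterable of neighbours; color: dict vertex -> int in range(k);
    S is a bitmask over the k colors.
    """
    reach = {v: [False] * (1 << k) for v in adj}
    for v in adj:
        reach[v][1 << color[v]] = True            # the single-vertex tree {v}
    for mask in range(1, 1 << k):                 # proper submasks are numerically smaller
        for v in adj:
            own = 1 << color[v]
            if not mask & own or mask == own:
                continue
            rest = mask ^ own
            sub = rest
            while sub:                            # non-empty subsets of mask minus v's color
                # A tree at v with colors mask^sub, joined by an edge (v, u)
                # to a tree at a neighbour u with colors sub.
                if reach[v][mask ^ sub] and any(reach[u][sub] for u in adj[v]):
                    reach[v][mask] = True
                    break
                sub = (sub - 1) & rest
    return reach


# Hypothetical example: a path 0 - 1 - 2 with three distinct colors.
adj = {0: [1], 1: [0, 2], 2: [1]}
color = {0: 0, 1: 1, 2: 2}
table = colorful_tree_table(adj, color, k=3)
print(table[0][0b111])
```

A subgraph with one vertex of every color exists exactly when some vertex has a True entry for the full color set. The total work over all pairs (S, subset of S) is proportional to 3^k, multiplied here by a factor for scanning neighbours.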

Integer Programming plus Constraint Generation We may assume that the desired connected subgraph is a tree T. Variables: x(i) = 1 iff vertex i is included in T; y(e) = 1 iff edge e is included in T. Constraints: Exactly one vertex of each color is included; Exactly n-1 edges are included in T; If an edge is included, then its endpoints are included; For each set of colors X, the number of edges of T connecting two vertices with colors in X is at most |X| - 1.

Performance An implementation of integer programming plus constraint generation solves typical instances with 100 vertices in less than a minute. Using a heuristic not yet implemented, one can solve typical instances with 100 vertices by hand in minutes.

Heuristic Algorithmic Strategy Repeat: (1) Delete vertices with frequent colors; (2) In the remaining graph, select a minimal set of connected components covering all infrequent colors; (3) Insert a minimal set of vertices with frequent colors to restore connectedness and cover all colors.

Example on Grid W E L C O M E T O T H E W E B S I T E O F T H E A N N U A L S Y M P O S I U M O N C O M B I N A T O R I A L P A T T E R N M A T

After One Iteration W e B S F t H N N U a L S Y P S I U N C o m B I N R I L P R N

After Two Iterations W E B F T H n A l s Y P I u C O M B