Comp. Genomics Recitation 10 Clustering and analysis of microarrays.



Exercise 1: A microarray that contains probes for all N metabolic enzymes of the bacterium D.Angerous was used for the following time-series experiment: the bacterial population was exposed to a drug, and gene expression was measured every hour for M hours. The expression values are discretized to {-1,0,1}.

Exercise 1: Find the longest expression pattern that is common to at least k enzymes. Each enzyme may start the pattern at a different time. [Table: discretized expression of enzymes E1 to E6 at time points T1 to T7, with k=3.]

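Solution: Treat each expression vector as a string over {-1,0,1}. Build a generalized suffix tree of the N strings in O(MN) time. Find the longest k-common substring, i.e. the deepest node (by string depth) whose subtree contains leaves from at least k different strings.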
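A minimal sketch of the k-common-substring step. It does not build a generalized suffix tree; it swaps in a simpler technique (binary search on the pattern length plus a dictionary of windows), which is slower than O(MN) but returns the same answer. The function name and the example vectors are hypothetical.

```python
def longest_k_common_substring(vectors, k):
    """Longest contiguous pattern that occurs in at least k of the vectors."""

    def common_pattern(length):
        # Count, for every window of this length, how many vectors contain it.
        counts = {}
        for vec in vectors:
            windows = {tuple(vec[i:i + length])
                       for i in range(len(vec) - length + 1)}
            for pattern in windows:
                counts[pattern] = counts.get(pattern, 0) + 1
        for pattern, count in counts.items():
            if count >= k:
                return pattern
        return None

    # If a pattern of length L is k-common, so is its prefix of length L-1,
    # so the feasible lengths form a prefix and binary search applies.
    best = ()
    lo, hi = 1, max((len(vec) for vec in vectors), default=0)
    while lo <= hi:
        mid = (lo + hi) // 2
        pattern = common_pattern(mid)
        if pattern is not None:
            best, lo = pattern, mid + 1
        else:
            hi = mid - 1
    return list(best)


# Hypothetical discretized expression vectors, one per enzyme.
example = [
    [0, 1, 1, -1, 0],
    [1, 1, -1, 0, 0],
    [-1, 0, 1, 1, -1],
    [0, 0, 0, 0, 0],
]
print(longest_k_common_substring(example, 3))
```

Because every enzyme is scanned at all offsets, the pattern may start at a different time point in each enzyme, as the exercise requires.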

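Exercise 2: Expression of N genes was measured under a certain condition using a microarray. No discretization was performed. Give a polynomial time algorithm for clustering these genes into exactly k clusters. The objective function, to be minimized, is the sum over all clusters of the distance between the lowest-expression and highest-expression genes in the cluster.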

Pictorially: [Figure: genes G1 to G6 placed along the expression-level axis.] If {G3,G4,G5} is a cluster, its contribution to the objective function is d(G3,G5).

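Solution: Sort the genes by expression level and create a weighted directed graph in which every gene is a node. The edge from i to j has weight d(i,j-1) if i’s expression is lower than j’s (otherwise ∞): taking this edge means that one cluster runs from i to j-1 and the next cluster starts at j. [Figure: genes G1 to G6 along the expression-level axis.] The path in the graph that corresponds to the clustering {G1,G2}, {G3,G4,G5}, {G6} is G1 → G3 → G6. The value of the objective function is d(G1,G2)+d(G3,G5)+0.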

Solution, next: Find the shortest path that visits exactly k nodes, using dynamic programming. Start from k because if l<k, P_l(k-1)=∞.
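The shortest-path idea can be sketched as a dynamic program over the genes sorted by expression level. This is a sketch under stated assumptions, not the slides' exact recurrence: it assumes d is the absolute difference of expression levels, that each cluster contributes the distance between its lowest and highest gene (as on the pictorial slide), and that the total is minimized. The names `cluster_sorted_expression`, `best` and `cut` are hypothetical.

```python
import math


def cluster_sorted_expression(levels, k):
    """Split the sorted levels into exactly k contiguous clusters,
    minimizing the sum of (max - min) over the clusters."""
    x = sorted(levels)
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of genes")

    # best[j][i]: minimal cost of clustering the first i genes into j clusters.
    # cut[j][i]: index where the last of those j clusters starts.
    best = [[math.inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                # The last cluster is x[s..i-1]; its cost is its range.
                candidate = best[j - 1][s] + (x[i - 1] - x[s])
                if candidate < best[j][i]:
                    best[j][i] = candidate
                    cut[j][i] = s

    # Walk the cut points backwards to recover the clusters.
    clusters = []
    i = n
    for j in range(k, 0, -1):
        s = cut[j][i]
        clusters.append(x[s:i])
        i = s
    clusters.reverse()
    return best[k][n], clusters


# Hypothetical expression levels for six genes.
cost, clusters = cluster_sorted_expression([1, 2, 10, 11, 13, 30], 3)
print(cost, clusters)
```

The three nested loops give O(kN^2) time after sorting, which is polynomial in N and k.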

Exercise 3: A microarray experiment with N genes and M conditions was conducted. Describe a polynomial algorithm that determines whether the genes can be clustered into 2 clusters such that the maximum distance d(Gi,Gj) in each cluster is < W.

Illustration: [Figure: genes G1, G2, G3, G4 with W=2.]

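Solution: Create a graph with a node for every gene. Add an edge (i,j) if d(i,j) ≥ W, i.e. if genes i and j may not share a cluster. A legal 2-clustering exists exactly when the resulting graph is bipartite. To check this, run BFS in every connected component: if you discover an edge (u,v) to a gray node and the depths of u and v are both even or both odd, answer “no”.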
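A minimal sketch of this check, assuming the pairwise distances are given as a symmetric matrix. Since the exercise requires all distances inside a cluster to be strictly below W, the sketch treats d(i,j) ≥ W as a conflict edge. The function name `two_clusters` is hypothetical.

```python
from collections import deque


def two_clusters(dist, W):
    """Return a list assigning every gene to cluster 0 or 1 so that all
    distances inside a cluster are < W, or None if this is impossible."""
    n = len(dist)
    # Conflict graph: an edge joins two genes that may not share a cluster.
    conflict = [[j for j in range(n) if j != i and dist[i][j] >= W]
                for i in range(n)]

    side = [None] * n
    for start in range(n):           # BFS in every connected component
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in conflict[u]:
                if side[v] is None:
                    side[v] = 1 - side[u]   # depth parity flips along an edge
                    queue.append(v)
                elif side[v] == side[u]:
                    return None             # odd cycle: not bipartite
    return side


def line_distances(points):
    """Helper for the example: absolute differences of 1-D positions."""
    return [[abs(a - b) for b in points] for a in points]


print(two_clusters(line_distances([0, 1, 3, 4]), 2))
print(two_clusters(line_distances([0, 2, 4]), 2))
```

Building the conflict graph takes O(N^2) distance lookups and the BFS is linear in the size of the graph, so the whole test is polynomial.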

Solution Not Bipartite

Exercise 4: We are given a microarray with N genes and M experiments. We want to cluster the genes into k clusters such that the distance between any two genes that belong to the same cluster is < W. Can you give a polynomial algorithm that solves this problem?

Solution: Probably not. More specifically, if we could solve this problem in polynomial time, we could solve a large class of problems that are widely believed to be unsolvable in polynomial time.

Solution: How can we show that we can probably not find a solution in polynomial time? We will take a problem for which this has already been shown. We will construct a polynomial time reduction from it to our problem. So, if our problem could be solved efficiently, the “hard” problem could also be solved efficiently.

Graph description: The following graph can describe our problem. [Figure: graph on the genes G1 to G6.] There’s an edge (Gi,Gj) if the distance between Gi and Gj is less than W.

Graph description Clustering with k=3:

3COL (3-Colorability): Given a graph G, can we color its vertices with 3 different colors such that no two adjacent nodes have the same color?

Comparing the problems: What is common to both these problems? In both we “cluster” the nodes. What are the differences? First, in 3COL there are only 3 clusters instead of k. Second, the elements that belong to the same group in 3COL must not have edges between them.

Reduction: Now that we understand the differences, we can take a graph G that is an input to 3COL and transform it into a graph G’ and a constant k that are the input to the k-clustering problem. We assume that we have a polynomial k-clustering algorithm, apply it to (G’,k), and translate the solution back to 3COL.

Reduction: Given the first difference that we noted, what should be the value of k? We set k to 3, i.e. the algorithm should find exactly 3 clusters. How do we change G to get G’? G’ is the complement of G: it has the same nodes, and (u,v) is an edge of G’ exactly when it is not an edge of G.

Example

Proof: Suppose that G is 3-colorable. Let V1, V2, V3 be the three color classes, i.e. the sets of nodes that receive the same color. There are no edges of G between any pair of nodes in V1, so in G’ every pair of nodes in V1 is connected, and therefore V1 forms a legal cluster in G’. Similarly, the nodes of V2 and V3 form clusters. Since V1 ∪ V2 ∪ V3 contains all the nodes, all the genes are clustered in the 3 corresponding clusters.

Proof, second direction: Suppose that G’ has a clustering into 3 legal clusters. These clusters correspond to 3 node sets in G such that within each set there are no edges between pairs of nodes. Therefore, assigning a different color to every set is a 3-coloring of G.
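The reduction and both directions of the proof can be illustrated with a small sketch. It assumes graphs are given as a node count plus a list of undirected edges over nodes 0..n-1; all function names are hypothetical, and the 5-cycle is only an example input.

```python
from itertools import combinations


def complement_edges(n, edges):
    """Edges of G': every pair of distinct nodes that is NOT an edge of G."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(range(n), 2)
            if frozenset((u, v)) not in present]


def is_proper_coloring(edges, color):
    """No edge of G joins two nodes of the same color."""
    return all(color[u] != color[v] for u, v in edges)


def is_legal_clustering(n, edges_prime, cluster):
    """Every two nodes in the same cluster are joined by an edge of G'
    (i.e. the corresponding genes are at distance < W)."""
    present = {frozenset(e) for e in edges_prime}
    return all(frozenset((u, v)) in present
               for u, v in combinations(range(n), 2)
               if cluster[u] == cluster[v])


# Example input to 3COL: a cycle on 5 nodes.
n = 5
g_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
g_prime = complement_edges(n, g_edges)   # computed in O(n^2) time

labels = [0, 1, 0, 1, 2]
# The same labelling is a proper 3-coloring of G and a legal clustering of G'.
print(is_proper_coloring(g_edges, labels),
      is_legal_clustering(n, g_prime, labels))
```

Building G’ inspects each pair of nodes once, so the transformation itself is polynomial, which is what the reduction needs.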

HW 2, question 5: A uniform lifted alignment is a lifted alignment in which, at each level of the tree, all strings are lifted either from the right child or from the left child. Prove that the optimal uniform lifted alignment has cost at most twice the cost of the optimal alignment tree. Give a polynomial algorithm to find the optimal uniform lifted alignment.

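HW 2, question 5: Uniform lifted alignment, proof: Assume we have the optimal tree T*. Transform it in the following way: to assign the strings at level k, consider the two options (lift every node at that level from its left child, or every node from its right child) and pick the one with the minimal sum.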

HW 2 – question 5 – cont’d: Assign each ‘non-zero’ edge (T,S) to a path in the optimal tree: the path from leaf (T) to node (S*). Together, these paths cover all edges of the tree.

HW 2 – question 5 – cont’d: By the triangle inequality: D(S,T) ≤ D(S,S*) + D(S*,T). By the choice of left/right: Σ_S [D(S,S*) + D(S*,T)] ≤ Σ_S [D(S*,T) + D(S*,T)] = Σ_S 2·D(S*,T). => A one-sided tree with cost at most twice the optimal.

HW 2 – question 5 – cont’d: Algorithm: Preprocess the pairwise sequence distances. Try all the different left/right assignments for each level, and pick the minimal one. Running time (n sequences of length m): Preprocessing: O(m^2 n^2). For a tree of height h there are 2^h different assignments. Calculating the cost of one tree: O(n). Total: O(m^2 n^2 + 2^h · n), which is polynomial when h = O(log n), e.g. for a balanced tree.
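A minimal sketch of the enumeration step, under assumptions that go beyond the slides: the tree is a complete binary tree whose leaves, in left-to-right order, are the n = 2^h sequences, and the pairwise distances are already preprocessed into a matrix `dist`. The function name and the example distances are hypothetical.

```python
from itertools import product


def uniform_lifted_alignment(dist):
    """Try all 2^h left/right choices (one per level) and return
    (minimal cost, choice), where choice[l] is 0 for 'lift from the left
    child' and 1 for 'lift from the right child' at level l above the leaves."""
    n = len(dist)
    h = n.bit_length() - 1
    if n < 1 or (1 << h) != n:
        raise ValueError("this sketch assumes the number of leaves is a power of two")

    best_cost, best_choice = None, None
    for choice in product((0, 1), repeat=h):
        labels = list(range(n))      # labels[i]: leaf sequence assigned to node i
        cost = 0
        for side in choice:          # process one level at a time, bottom-up
            parents = []
            for i in range(0, len(labels), 2):
                left, right = labels[i], labels[i + 1]
                lifted = left if side == 0 else right
                # One of the two child edges has cost zero.
                cost += dist[lifted][left] + dist[lifted][right]
                parents.append(lifted)
            labels = parents
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


# Hypothetical preprocessed distances between four sequences.
points = [0, 1, 3, 10]
example_dist = [[abs(a - b) for b in points] for a in points]
print(uniform_lifted_alignment(example_dist))
```

Each assignment touches every internal node once, so the enumeration costs O(2^h · n) on top of the preprocessing, matching the running time stated above.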