Comp. Genomics Recitation 10 Clustering and analysis of microarrays.

1 Comp. Genomics Recitation 10 Clustering and analysis of microarrays

2 Exercise 1 A microarray that contains probes for all N metabolic enzymes of the bacterium D. Angerous was used in the following time-series experiment: the bacterial population was exposed to a drug, and gene expression was measured every hour for M hours. The expression values are discretized to {-1, 0, 1}.

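3 Exercise 1 Find the longest expression pattern that is common to at least k enzymes. Each enzyme may start the pattern at a different time. [Table: discretized expression of enzymes E1 to E6 at time points T1 to T7; most cell values did not survive extraction. The rows that remain legible are E4 = 0000000 and E5 = 1101111, listed in the column order T7 ... T1.] k = 3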

4 Solution Treat each expression vector as a string. Create a generalized suffix tree in O(MN). Find the longest k-common substring.
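
The task on this slide can be illustrated without building a suffix tree. The sketch below is a simpler and slower substitute, not the O(MN) method the slide names: it binary-searches the pattern length and counts, for each length, how many vectors contain each window. The function name `longest_k_common_pattern` and the sample vectors are assumptions made for illustration.

```python
def longest_k_common_pattern(vectors, k):
    """Return one longest contiguous pattern found in at least k vectors.

    Not the suffix-tree method: a binary search over the pattern length,
    valid because every prefix of a k-common pattern is itself k-common.
    """
    def common_of_length(length):
        counts = {}
        for vec in vectors:
            # distinct windows of this length, counted once per vector
            seen = {tuple(vec[i:i + length])
                    for i in range(len(vec) - length + 1)}
            for pattern in seen:
                counts[pattern] = counts.get(pattern, 0) + 1
        hits = [p for p, c in counts.items() if c >= k]
        return min(hits) if hits else None  # deterministic tie-break

    lo, hi = 1, max((len(v) for v in vectors), default=0)
    best = ()
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_of_length(mid)
        if found is not None:
            best, lo = found, mid + 1   # try a longer pattern
        else:
            hi = mid - 1                # too long, shrink
    return best


# Hypothetical discretized time series for four enzymes.
example = [(1, 0, -1, 1), (0, -1, 1, 1), (1, 1, 0, -1), (0, 0, 0, 0)]
pattern = longest_k_common_pattern(example, k=3)
```

Because windows are matched at any offset, each enzyme may start the pattern at a different time, as the exercise requires.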

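5 Exercise 2 The expression of N genes was measured under a certain condition using a microarray. No discretization was performed. Give a polynomial-time algorithm for clustering these genes into exactly k clusters. The objective function (the formula is missing from the transcript; the following slides indicate that it is to be minimized and that each cluster contributes its largest within-cluster distance) is the sum over the k clusters C of max d(Gi, Gj) over Gi, Gj in C.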

6 Pictorially [Figure: genes G1, G2, G3, G4, G5, G6 placed along an expression-level axis] If {G3, G4, G5} is a cluster, its contribution to the objective function is d(G3, G5).

7 Solution Create a weighted directed graph: every gene is a node, and the edge from i to j has weight d(i, j-1) if i's expression is lower than j's (otherwise ∞). With the genes indexed in order of increasing expression, the edge i → j stands for the cluster {Gi, ..., Gj-1}. [Figure: G1, G2, G3, G4, G5, G6] The path in the graph that corresponds to this clustering is G1 → G3 → G6. The value of the objective function is d(G1, G2) + d(G3, G5) + 0.

8 Solution Next: find the shortest path that visits exactly k nodes. Dynamic programming: start from k, because if l < k then P_l(k-1) = ∞.
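
The shortest-path idea can be sketched as a dynamic program over the sorted expression levels. This is a minimal sketch under two assumptions that the transcript does not state explicitly: d(a, b) = |a - b|, and each cluster contributes the distance between its lowest and highest member. The name `cluster_1d` and the sample values are hypothetical.

```python
def cluster_1d(levels, k):
    """Split expression levels into exactly k clusters, contiguous in
    sorted order, minimizing the sum of (max - min) over the clusters.
    Assumes d(a, b) = |a - b|. Returns (cost, clusters)."""
    xs = sorted(levels)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of genes")
    inf = float("inf")
    # best[c][j]: minimal cost of splitting the first j values into c clusters
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):          # last cluster is xs[i:j]
                cost = best[c - 1][i] + (xs[j - 1] - xs[i])
                if cost < best[c][j]:
                    best[c][j], back[c][j] = cost, i
    # walk the back-pointers to recover the clusters
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        clusters.append(xs[i:j])
        j = i
    return best[k][n], clusters[::-1]


# Hypothetical expression levels for six genes.
cost, clusters = cluster_1d([1, 2, 10, 11, 12, 30], k=3)
```

The three nested loops give O(k N^2) time after sorting, which is polynomial as the exercise asks. Each table entry corresponds to a shortest path with a fixed number of nodes ending at a given gene.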

9 Exercise 3 A microarray experiment with N genes and M conditions was conducted. Describe a polynomial algorithm that determines whether the genes can be clustered into 2 clusters such that the maximum distance d(Gi, Gj) within each cluster is < W.

10 Illustration [Figure: four genes G1, G2, G3, G4 with 0/1 expression values; W = 2]

11 Solution Create a graph with a node for every gene. Add an edge (i, j) if d(i, j) ≥ W, i.e. if genes i and j may not share a cluster. The genes can be split into 2 valid clusters exactly when this graph is bipartite. To check: run BFS; if you discover an edge (u, v) to a gray node and the depths of u and v are both even or both odd, answer "no".
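
A minimal sketch of this check follows. It joins two genes when their distance is not below W, matching the "< W" requirement of the exercise, and 2-colors the resulting conflict graph by BFS. The function name `two_clusterable` and the one-dimensional sample data are assumptions for illustration; any distance function can be passed in.

```python
from collections import deque


def two_clusterable(genes, dist, w):
    """True iff the genes split into 2 clusters with every within-cluster
    distance below w (the conflict graph is bipartite)."""
    n = len(genes)
    # conflict graph: i and j cannot share a cluster when d(i, j) >= w
    adj = [[j for j in range(n)
            if j != i and dist(genes[i], genes[j]) >= w]
           for i in range(n)]
    side = [None] * n                 # BFS depth parity of each node
    for start in range(n):            # handle every connected component
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if side[v] is None:
                    side[v] = 1 - side[u]
                    queue.append(v)
                elif side[v] == side[u]:
                    return False      # edge between equal parities: odd cycle
    return True


def abs_dist(a, b):
    return abs(a - b)


# Hypothetical one-dimensional expression values.
ok = two_clusterable([0, 1, 5, 6], abs_dist, 3)
bad = two_clusterable([0, 5, 10], abs_dist, 3)
```

Building the graph takes O(N^2) distance evaluations and the BFS is linear in the graph size, so the whole test is polynomial.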

12 Solution Not Bipartite

13 Exercise 4 We are given a microarray with N genes and M experiments. We want to cluster the genes into k clusters such that the distance between any two genes in the same cluster is < W. Can you give a polynomial algorithm that solves this problem?

14 Solution Probably not. More specifically, if we could solve this problem in polynomial time, we could solve a large class of problems that are widely believed to be unsolvable in polynomial time.

15 Solution How can we show that we probably cannot find a solution in polynomial time? We take a problem for which this has already been shown. We construct a polynomial-time reduction from it to our problem. So, if our problem could be solved efficiently, the "hard" problem could also be solved efficiently.

16 Graph description The following graph can describe our problem: [Figure: a graph on the nodes G1, G2, G3, G4, G5, G6] There is an edge (Gi, Gj) if the distance between Gi and Gj is less than W.

17 Graph description Clustering with k=3:

18 3COL 3-Colorability: Given a graph G, can we color its vertices with 3 different colors such that no two adjacent nodes have the same color?

19 Comparing the problems What is common to both these problems? In both we "cluster" the nodes. What are the differences? First, in 3COL there are only 3 clusters instead of k. Second, the elements that belong to the same group in 3COL must not have edges between them, whereas in the clustering problem every two elements of a cluster must be joined by an edge.

20 Reduction Now that we understand the differences, we can take a graph G that is an input to 3COL and transform it into a graph G' and a constant k that are the input to the k-clustering problem. We assume that we have a polynomial k-clustering algorithm, apply it to (G', k), and translate the solution back to 3COL.

21 Reduction Given the first difference that we noted, what should be the value of k? We set k to 3, i.e. the algorithm should find exactly 3 clusters. How do we change G to get G'? G' is the complement of G: it has the same nodes, and an edge exactly where G has none.

22 Example

23 Proof ⇒ Suppose that G is 3-colorable. Let V1, V2, V3 be the three color classes, i.e. the sets of nodes that receive the same color. There are no edges in G between any pair of nodes in V1, so every such pair is joined in G', and therefore V1 forms a legal cluster in G'. Similarly, the nodes of V2 and V3 form clusters. Since V1 ∪ V2 ∪ V3 contains all the nodes, all the genes are clustered in the 3 corresponding clusters.

24 Proof, second direction ⇐ Suppose that G' admits a clustering into 3 legal clusters. These clusters correspond to 3 node sets in G such that within each set there are no edges between pairs of nodes. Therefore, assigning a different color to every set is a 3-coloring.
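
The two directions of the proof can be checked mechanically on small graphs. The sketch below builds the complement graph G' and decides 3COL by brute force through the clustering view. It only illustrates the correspondence and is not a polynomial algorithm. All function names and the sample graphs are assumptions made for illustration.

```python
from itertools import combinations, product


def complement(n, edges):
    """Edges of G' on nodes 0..n-1: every pair that is NOT an edge of G."""
    present = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(range(n), 2)} - present


def is_coloring(assign, edges):
    """No edge of G joins two nodes with the same label."""
    return all(assign[u] != assign[v] for u, v in edges)


def is_clustering(assign, cluster_edges):
    """Every two nodes with the same label are joined by an edge of G'."""
    n = len(assign)
    return all(frozenset((u, v)) in cluster_edges
               for u, v in combinations(range(n), 2)
               if assign[u] == assign[v])


def three_colorable(n, edges):
    """Decide 3COL by exhaustive search over 3-clusterings of G'."""
    g_prime = complement(n, edges)
    return any(is_clustering(a, g_prime)
               for a in product(range(3), repeat=n))


# Hypothetical inputs: a 5-cycle (3-colorable) and K4 (not 3-colorable).
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
k4 = list(combinations(range(4), 2))
```

For every labeling of the nodes, `is_coloring` on G and `is_clustering` on G' agree, which is exactly the content of the two proof directions.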

25 HW 2 question 5 Uniform lifted alignment: a lifted alignment in which, at each level, all strings are lifted either from the right child or from the left child. Prove that the optimal uniform lifted alignment has cost at most twice the cost of the optimal tree alignment. Give a polynomial algorithm to find the optimal uniform lifted alignment.

26 HW 2 question 5 Uniform lifted alignment, proof: Assume we had the optimal tree T*. Transform it in the following way: to assign the strings at level k, consider: Pick the minimal sum.

27 HW 2 – question 5 – cont’d Assign each ‘non-zero’ edge (T, S) to a path in the optimal tree: the path from leaf (T) to node (S*). [Figure: node S (S*) and leaf T] Together, these paths cover all edges of the tree.

28 HW 2 – question 5 – cont’d By triangle inequality: D(S, T) ≤ D(S, S*) + D(S*, T). [Figure: node S (S*) and leaf T] By choice of left/right: Σ_S [D(S, S*) + D(S*, T)] ≤ Σ_S [D(S*, T) + D(S*, T)] = Σ_S 2 D(S*, T) ⇒ a one-sided tree with cost at most twice the optimal.

29 HW 2 – question 5 – cont’d Algorithm: Preprocess pairwise sequence distances. Try all assignments of left/right to the levels, and pick the minimal one. Running time (n sequences of length m): Preprocessing: O(m^2 n^2). Height h: 2^h different assignments. Cost calculation for one tree: O(n). Total: O(m^2 n^2 + 2^h n).
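
The enumeration step can be sketched as follows, under assumptions the slide does not state: the tree is a complete binary tree whose leaves are given left to right, and `dist` is any pairwise distance with dist(x, x) = 0 (a Hamming distance stands in for the precomputed alignment distances here). The names `best_uniform_lifting` and `hamming` are hypothetical.

```python
from itertools import product


def best_uniform_lifting(leaves, dist):
    """leaves: leaf labels of a complete binary tree, left to right
    (their number must be a power of two). dist(a, b): pairwise distance.
    Returns (cost, choices) with choices[l] = 0 (lift from the left child)
    or 1 (lift from the right child), listed from the level just above
    the leaves up to the root."""
    n = len(leaves)
    if n == 0 or n & (n - 1):
        raise ValueError("number of leaves must be a power of two")
    height = n.bit_length() - 1
    best = (float("inf"), ())
    for choices in product((0, 1), repeat=height):   # 2^h assignments
        labels, cost = list(leaves), 0
        for side in choices:
            parents = []
            for i in range(0, len(labels), 2):
                lifted = labels[i + side]
                # the edge to the lifted child costs 0; pay for the other
                cost += dist(lifted, labels[i + 1 - side])
                parents.append(lifted)
            labels = parents
        best = min(best, (cost, choices))
    return best


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


# Hypothetical leaf sequences.
cost, choices = best_uniform_lifting(["AAB", "AAA", "ABB", "BBB"], hamming)
```

Each assignment is evaluated in O(n) distance lookups, in line with the per-tree cost given on the slide.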

