Clustering (Part II) 11/26/07. Spectral Clustering.


1 Clustering (Part II) 11/26/07

2 Spectral Clustering

3 Represent data similarity by a graph

4 Represent data similarity by a graph. For example: connect two data points if their similarity is greater than a threshold; weight each edge inversely proportional to distance.

5 Similarity Matrix. Each edge is weighted by the similarity between two data points. The similarity matrix contains all the weights: W = (w_ij).
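
The graph construction above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `similarity_graph`, the choice of similarity = 1/distance, and the threshold value are not from the slides.

```python
import numpy as np

def similarity_graph(X, threshold=0.5, eps=1e-12):
    """Build a symmetric weight matrix W = (w_ij) from data points X (n x d)."""
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumed similarity: inversely proportional to distance.
    S = 1.0 / np.maximum(dist, eps)
    # Connect two points only if their similarity exceeds the threshold.
    W = np.where(S > threshold, S, 0.0)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

# Two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W = similarity_graph(X)
```

Here only the two close pairs end up connected; the far-apart pairs fall below the threshold and get weight zero.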

6 Spectral Clustering. Mincut: minimize the cut size, where cut size = total weight of cut edges. (Chris Ding)

7 Spectral Clustering. Mincut: minimize the cut size, where cut size = total weight of cut edges. Constraint on sizes: for example, |A| = |B|. (Chris Ding)

8 Why this might be useful

9 K-means clustering (k=2) cannot separate red from green.

10 Why this might be useful K-means clustering (k=2) cannot separate red from green. Spectral clustering separates the two groups naturally.

11 Partition into two clusters. Graph Laplacian: minimize the following objective [equation not preserved in the transcript]. Allow q_i to take not just discrete but also continuous values. The solution is given by the eigenvectors of L = D - W.
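
The objective on this slide is not preserved in the transcript. One standard formulation that is consistent with the surrounding slides (an assumption, not recovered from the slide itself) encodes the partition as q_i = +1 for i in A and q_i = -1 for i in B, with D the diagonal degree matrix d_ii = sum_j w_ij:

```latex
J(q) \;=\; \operatorname{cut}(A,B)
     \;=\; \frac{1}{4} \sum_{i<j} w_{ij}\,(q_i - q_j)^2
     \;=\; \frac{1}{4}\, q^{T} (D - W)\, q ,
\qquad q_i \in \{-1, +1\}.
```

Each cut edge contributes (q_i - q_j)^2 = 4 and each uncut edge contributes 0, so J is the total weight of cut edges. Relaxing q_i to real values turns the minimization into an eigenvector problem for L = D - W.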

12 Properties of graph Laplacian. L is positive semidefinite: y^T L y >= 0 for any y. The first eigenvector is q_1 = (1,…,1), with eigenvalue λ_1 = 0. The second eigenvector q_2 is the desired solution. A smaller λ_2 means a better partitioning.
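
These properties can be checked numerically on a toy graph. The sketch below assumes a small hand-built weight matrix (two triangles joined by one weak edge); the helper name `graph_laplacian` is hypothetical.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix."""
    D = np.diag(W.sum(axis=1))
    return D - W

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge (2,3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

L = graph_laplacian(W)
# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
lambda1, lambda2 = eigvals[0], eigvals[1]
q2 = eigvecs[:, 1]  # second eigenvector: its signs separate the two triangles
```

On this graph, λ_1 is zero up to rounding, all eigenvalues are non-negative, and the entries of q2 have one sign on the first triangle and the opposite sign on the second.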

13 Convert q to partition. Method 1: A = {i | q_i < 0}. But this does not satisfy the size constraint.

14 Convert q to partition. Method 1: A = {i | q_i < 0}. But this does not satisfy the size constraint. J is not changed if q_i is replaced by q_i + c. Method 2: A = {i | q_i < c}. Find c so that |A| = |B|.
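
The two thresholding methods can be compared on a small vector. This is a sketch under assumptions: the example values of q are made up, and taking c as the median of q is one simple way to get |A| = |B| when n is even and the entries are distinct.

```python
import numpy as np

# Hypothetical relaxed solution q (for example, a second eigenvector).
q = np.array([-0.6, -0.5, 0.1, 0.2, 0.3, 0.5])

# Method 1: threshold at zero. Cluster sizes are whatever the signs give.
A_sign = np.flatnonzero(q < 0)
B_sign = np.flatnonzero(q >= 0)

# Method 2: threshold at c, chosen here as the median so both sides are equal in size.
c = np.median(q)
A_bal = np.flatnonzero(q < c)
B_bal = np.flatnonzero(q >= c)
```

With these values, thresholding at zero gives clusters of sizes 2 and 4, while thresholding at the median gives 3 and 3. Ties at the median would need an extra rule to keep the split balanced.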

15 Partition into more than two clusters. Recursively apply the 2-cluster procedure, or use higher-order eigenvectors.

16 More general size constraints Ratio cut Normalized cut Min-max cut

17 Solution. Ratio cut: 2nd eigenvector of L = D - W. Normalized cut: solution is an eigenvector of [equation not preserved in the transcript]. Min-max cut: solution is an eigenvector of [equation not preserved in the transcript].

18 A simple example

19 More than 2 clusters Ratio cut Normalized cut Min-max Cut

20 Solution. The solution lies in the subspace spanned by the first k eigenvectors.
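
A small sketch of the k-eigenvector idea, using an idealized graph with three disconnected components (an assumption chosen so the result is exact). Each node is embedded as a row of the first k eigenvectors of L = D - W; nodes in the same component get identical rows, so grouping equal rows recovers the clusters. On real data the rows are only approximately equal, and a clustering step such as k-means on the rows is commonly used instead of the exact grouping shown here.

```python
import numpy as np

# Idealized graph with k = 3 disconnected components: {0,1,2}, {3,4}, {5,6}.
n, k = 7, 3
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (5, 6)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
U = eigvecs[:, :k]                    # row i is the spectral embedding of node i

# Group nodes whose embedded rows coincide (up to rounding error).
labels = -np.ones(n, dtype=int)
next_label = 0
for i in range(n):
    if labels[i] >= 0:
        continue
    same = np.linalg.norm(U - U[i], axis=1) < 1e-8
    labels[same] = next_label
    next_label += 1
```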

21 Applications Lymphoma Cancer (Alizadeh et al. 2000) 4025 genes in total 900 genes selected by variable selection methods. (Chris Ding)

22 Affinity Propagation

23 Main Idea. Data points can be exemplars (cluster centers) or non-exemplars (other data points). Messages are passed between exemplars (centroids) and non-exemplar data points. The total number of clusters is found automatically by the algorithm.

24 Responsibility r(j,k). A non-exemplar data point j informs each candidate exemplar k whether k is suitable for j to join as a member. [Figure: data point j and candidate exemplar k]

25 Availability a(j,k). A candidate exemplar k informs other data points j whether it is a good exemplar. [Figure: data point j and candidate exemplar k]

26 Self-availability a(k,k). A candidate exemplar k evaluates for itself whether it is a good exemplar. [Figure: data point j and candidate exemplar k]

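27 An iterative procedure: update r(j,k). [Figure: data point j sends r(j,k) to candidate exemplar k; the update uses the availabilities a(j,k’) of competing candidates k’ and the similarity between j and k]

28 An iterative procedure: update a(j,k). [Figure: candidate exemplar k sends a(j,k) to data point j; the update uses the responsibilities r(j’,k) from other data points j’]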

29 An iterative procedure Update a(k, k)
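
The update equations are not preserved in the transcript. The sketch below uses the update rules as published by Frey and Dueck (2007); the damping factor, iteration count, function name, and toy data are assumptions made for illustration. Here S[j, k] is the similarity between j and k, and the diagonal S[k, k] holds the "preference" for k to be an exemplar.

```python
import numpy as np

def affinity_propagation(S, damping=0.8, n_iter=300):
    """Minimal affinity propagation on a similarity matrix S (n x n)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(j, k)
    A = np.zeros((n, n))  # availabilities a(j, k)
    idx = np.arange(n)
    for _ in range(n_iter):
        # r(j,k) = s(j,k) - max over k' != k of [a(j,k') + s(j,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(j,k) = min(0, r(k,k) + sum over j' not in {j,k} of max(0, r(j',k)))
        # a(k,k) = sum over j' != k of max(0, r(j',k))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]       # keep r(k,k) unclipped
        col = Rp.sum(axis=0)
        A_new = col[None, :] - Rp        # drop j's own contribution
        diag = A_new[idx, idx].copy()    # self-availability is not clipped at 0
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new

    # Point k is an exemplar when a(k,k) + r(k,k) > 0.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels

# Toy 1-D data: two well-separated groups of three points.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = -(x[:, None] - x[None, :]) ** 2
n = len(x)
preference = np.median(S[~np.eye(n, dtype=bool)])  # a common default choice
np.fill_diagonal(S, preference)
exemplars, labels = affinity_propagation(S)
```

Each entry of `labels` is the index of the exemplar that the point is assigned to; the number of clusters is not given as an input but falls out of the message passing and the preference values.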

30 Step-by-step affinity propagation

31 Applications. Multi-exon gene detection in mouse. Expression levels at different exons within a gene are coregulated across tissue types. 37 mouse tissues involved. 12 tiling arrays. (Frey et al. 2005)

32 Biclustering

33 Motivation. [Figure: gene expression matrix, genes by conditions] 1D approach: to identify condition clusters, all genes are used. But probably only a few genes are differentially expressed.

34 Motivation. [Figure: gene expression matrix, genes by conditions] 1D approach: to identify gene clusters, all conditions are used. But a set of genes may only be expressed under a few conditions.

35 Motivation. [Figure: gene expression matrix, genes by conditions] Biclustering objective: to isolate genes that are co-expressed under a specific set of conditions.

36 Coupled Two-Way Clustering An iterative procedure involving the following two steps. –Within a cluster of conditions, search for gene clusters. –Using features from a cluster of genes, search for condition clusters. (Getz et al. 2001)

37 SAMBA – A bipartite graph model. V = Genes, U = Conditions. (Tanay et al. 2002)

38 SAMBA – A bipartite graph model. V = Genes, U = Conditions. E = “respond” = differential expression. (Tanay et al. 2002)

39 SAMBA – A bipartite graph model. V = Genes, U = Conditions. E = “respond” = differential expression. Cluster = subgraph (U’, V’, E’) = subset of coregulated genes V’ in conditions U’. (Tanay et al. 2002)

40 SAMBA -- algorithm. Goal: find the “heaviest” subgraphs H = (U’, V’, E’). (Tanay et al. 2002)

41 SAMBA -- algorithm. Goal: find the “heavy” subgraphs H = (U’, V’, E’). [Figure: subgraph with a missing edge] (Tanay et al. 2002)

42 SAMBA -- algorithm. p_{u,v}: probability of an edge expected at random. p_c: probability of an edge within a cluster. Compute a weight score for H = (U’, V’, E’). (Tanay et al. 2002)

43 SAMBA -- algorithm. Finding the heaviest subgraph is an NP-hard problem. Use a polynomial-time algorithm to search for minima efficiently. H = (U’, V’, E’). (Tanay et al. 2002)

44 Significance of weight. Let H = (U’, V’, E’) be a subgraph. Fix U’ and randomly select a new V” with the same size as V’. The weights of the resulting subgraphs (U’, V”, E”) give a background distribution. Estimate the p-value by comparing log L(H) with the background distribution.
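
The resampling procedure can be sketched as follows. Several things here are assumptions for illustration: the weight is a simple log-likelihood-ratio score with a single constant background probability `p_uv` for all gene-condition pairs, the values of `p_c` and `p_uv` are arbitrary, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def subgraph_weight(E, genes, conds, p_c=0.9, p_uv=0.2):
    """Log-likelihood-ratio weight of the subgraph on the given genes and conditions.

    E is a 0/1 matrix (genes x conditions); an entry of 1 is an edge.
    """
    block = E[np.ix_(genes, conds)]
    n_edges = block.sum()
    n_missing = block.size - n_edges
    return (n_edges * np.log(p_c / p_uv)
            + n_missing * np.log((1 - p_c) / (1 - p_uv)))

def empirical_p_value(E, genes, conds, n_samples=1000, seed=0):
    """Fix the conditions U', resample gene sets V'' of the same size as V'."""
    rng = np.random.default_rng(seed)
    observed = subgraph_weight(E, genes, conds)
    background = np.array([
        subgraph_weight(
            E, rng.choice(E.shape[0], size=len(genes), replace=False), conds)
        for _ in range(n_samples)
    ])
    # Add-one correction keeps the estimate away from exactly zero.
    return (1 + np.sum(background >= observed)) / (1 + n_samples)

# Synthetic example: sparse random edges plus one planted dense bicluster.
rng = np.random.default_rng(1)
E = (rng.random((100, 20)) < 0.1).astype(int)
E[:10, :5] = 1
genes, conds = np.arange(10), np.arange(5)
p = empirical_p_value(E, genes, conds)
```

For the planted bicluster, random gene sets almost never reach the observed weight, so the estimated p-value is close to its minimum of 1/(n_samples + 1).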

45 Model evaluation The p-value distribution for the top candidate clusters. If biological classification data are available, evaluate the purity of class membership within each bicluster.

46 Reading List. von Luxburg 2006 (pages 1-12): a tutorial on spectral clustering. Frey and Dueck 2007: affinity propagation. Tanay et al. 2002: SAMBA for biclustering.

