Clustering by soft-constraint affinity propagation: applications to gene-expression data. Michele Leone, Sumedha and Martin Weigt. Bioinformatics, 2007.


1 Clustering by soft-constraint affinity propagation: applications to gene-expression data. Michele Leone, Sumedha and Martin Weigt. Bioinformatics, 2007

2 Outline
– Introduction
– The Algorithm and Method Analysis
– Experimental results
– Discussion

3 Introduction
Affinity Propagation (AP) seeks to identify each cluster by one of its elements, called the exemplar.
– Each point in the cluster refers to this exemplar.
– Each exemplar is required to refer to itself as a self-exemplar.
However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.

4 Introduction
Some drawbacks of Affinity Propagation:
– The hard constraint in AP relies strongly on cluster-shape regularity.
– All information about the internal structure and the hierarchical merging/dissociation of clusters is lost.
– AP has robustness limitations.
– AP forces each exemplar to point to itself.

5 Introduction
How to improve it? The hard constraint is that exemplars must be self-exemplars. We relax this hard constraint by introducing a finite penalty term for each constraint violation.

6 The Algorithm and Method Analysis
– The Soft-Constraint Affinity Propagation (SCAP) equations.
– Efficient implementation of the algorithm.
– Extracting cluster signatures.

7 The SCAP equations
We write the constraint attached to a given data point as a two-case function [equation not captured in the transcript]. The first case assigns a penalty if the data point is chosen as exemplar by some other data point without being a self-exemplar.
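The two cases described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' formula: the function name `constraint_weights`, the argument `penalty`, and the convention that the second case has weight 1 are assumptions made for the example.

```python
def constraint_weights(exemplar, penalty):
    """Return one constraint weight per data point.

    exemplar[mu] is the index of the exemplar chosen by data point mu.
    A point gets `penalty` (assumed multiplicative factor) if some other
    point chose it as exemplar while it is not a self-exemplar; otherwise
    it gets the neutral weight 1.0 (assumed).
    """
    n = len(exemplar)
    chosen_by_other = [False] * n
    for mu, c in enumerate(exemplar):
        if c != mu:
            chosen_by_other[c] = True
    return [
        penalty if chosen_by_other[mu] and exemplar[mu] != mu else 1.0
        for mu in range(n)
    ]


# Point 0 chooses 1, point 1 chooses 2, point 2 chooses itself:
# only point 1 violates the self-exemplar constraint.
weights = constraint_weights([1, 2, 2], penalty=0.5)
```

Under these assumptions, setting `penalty=0.0` reproduces a hard constraint: any violation gives a zero weight.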

8 The SCAP equations
The penalty represents a compromise between minimizing the cost function and searching for compact clusters. We then introduce a positive real-valued parameter weighing the relative importance of cost minimization with respect to the constraints.

9 The SCAP equations
We can then define the probability of an arbitrary clustering [equation not captured in the transcript]. Original AP is recovered in the hard-constraint limit, since any violated constraint then sets the probability to zero.
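As a rough illustration of how a cost term and soft constraints can combine into a clustering probability, the sketch below computes an unnormalized log-weight. The functional form (a weighting parameter `beta` multiplying the summed similarities, plus one `log(penalty)` per violated constraint) is an assumption for this example, as are all names; the slide's own equation is not available in the transcript.

```python
import math


def log_weight(S, exemplar, beta, penalty):
    """Unnormalized log-probability of a clustering (assumed form).

    S[mu][nu] is the similarity between points mu and nu,
    exemplar[mu] is the exemplar chosen by mu,
    beta weighs the similarity gain against the constraints,
    penalty in [0, 1] is the factor paid per violated constraint.
    """
    n = len(exemplar)
    gain = sum(S[mu][c] for mu, c in enumerate(exemplar))
    # A point violates its constraint if someone else chose it
    # as exemplar while it does not choose itself.
    chosen = {c for mu, c in enumerate(exemplar) if c != mu}
    violations = sum(1 for mu in range(n) if mu in chosen and exemplar[mu] != mu)
    if violations == 0:
        return beta * gain
    if penalty == 0.0:
        return float("-inf")  # hard constraint: probability zero
    return beta * gain + violations * math.log(penalty)


S = [[-1, -2, -9], [-2, -1, -3], [-9, -3, -1]]
soft = log_weight(S, [1, 2, 2], beta=1.0, penalty=0.5)
hard = log_weight(S, [1, 2, 2], beta=1.0, penalty=0.0)
```

With `penalty=0.0` the chained assignment 0 → 1 → 2 gets weight zero, which matches the statement that a violated hard constraint sets the probability to zero.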

10 The SCAP equations
For a general penalty, the optimal clustering can be determined by maximizing the marginal probabilities for all data points [equation not captured in the transcript].

11 The SCAP equations
Under a further assumption, we obtain the SCAP equations, from which the exemplar of any data point can be computed [equations not captured in the transcript].

12 The SCAP equations
Compared to the original AP, SCAP amounts to an additional threshold on the self-availabilities and the self-responsibilities. For a small enough value of the penalty parameter, [expression not captured in the transcript] holds in many cases, and the self-responsibility is substituted with a thresholded version. In the hard-constraint limit, the original AP equations are recovered.

13 The SCAP equations
This means that variables are discouraged from being self-exemplars beyond a given threshold, even when another data point is already pointing at them.

14 Efficient implementation
The iterative solution: [algorithm steps not captured in the transcript]

15 Efficient implementation
Differences from the original AP:
– Step 3 is formulated as a sequential update.
– The original AP used a damped parallel update.
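A sequential update of this kind might look like the following sketch. It uses the standard AP responsibility and availability updates and adds a clipping threshold on the self-availability and the self-responsibility, as the slides describe in words. The threshold name `p_tilde`, the exact placement of the clipping, the fixed number of sweeps, and the function name `scap_sketch` are all assumptions; the deck's own update equations are not available in the transcript.

```python
def scap_sketch(S, p_tilde, n_sweeps=50):
    """Sequential AP-style message passing with a soft threshold (assumed form).

    S is an n x n similarity matrix (list of lists, n >= 2) whose diagonal
    holds the self-similarities. Returns the chosen exemplar index per point.
    """
    n = len(S)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r[i][k]
    a = [[0.0] * n for _ in range(n)]  # availabilities a[i][k]
    for _ in range(n_sweeps):
        for i in range(n):  # sequential: one data point at a time, no damping
            # Responsibilities sent by point i to every candidate exemplar k.
            for k in range(n):
                best_other = max(a[i][j] + S[i][j] for j in range(n) if j != k)
                r[i][k] = S[i][k] - best_other
            # Availabilities of every candidate exemplar k for point i.
            for k in range(n):
                support = sum(
                    max(0.0, r[j][k]) for j in range(n) if j != i and j != k
                )
                if k == i:
                    # Self-availability is capped at the threshold.
                    a[i][i] = min(p_tilde, support)
                else:
                    # Self-responsibility of k is bounded below by -threshold.
                    a[i][k] = min(0.0, max(-p_tilde, r[k][k]) + support)
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]


# Toy data: two groups of points on a line, similarity = negative distance.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
S = [[-abs(x - y) for y in xs] for x in xs]
for i in range(len(xs)):
    S[i][i] = -1.0  # self-similarity, chosen arbitrarily for the example
exemplars = scap_sketch(S, p_tilde=1.0)
```

In this sketch a very large `p_tilde` makes both clipping operations inactive, so the updates reduce to undamped sequential AP.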

16 Extracting cluster signatures
Only a few components carry useful information about the cluster structure; these are called cluster signatures. We assume the similarity between two data points to be additive in single-gene contributions [equation not captured in the transcript].

17 Extracting cluster signatures
Having found a clustering given by the exemplar selection, we can calculate the similarity of a cluster C, defined as a connected component of the directed graph of exemplar choices, as a sum over single-gene contributions [equations not captured in the transcript].

18 Extracting cluster signatures
We then compare this to random exemplar choices, which are characterized by their mean and variance [equations not captured in the transcript].

19 Extracting cluster signatures
The relevance of a gene can be scored by a quantity [equation not captured in the transcript] that measures the distance of the actual single-gene contribution from the distribution of random exemplar mappings. Genes are ranked according to this score, and the highest-ranking genes are considered a cluster signature.
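The ranking procedure described across these slides can be sketched as a per-gene z-score. The sketch makes several assumptions that are not in the transcript: the single-gene similarity is taken as a negative squared difference, a random exemplar mapping lets each cluster member pick an exemplar uniformly and independently among all data points, and the names `gene_scores`, `members` and `exemplar` are hypothetical.

```python
import math


def gene_scores(X, members, exemplar):
    """Score each gene for one cluster (assumed z-score form).

    X[mu][g] is the expression of gene g in sample mu,
    members lists the sample indices in the cluster,
    exemplar[mu] is the exemplar chosen by sample mu.
    """
    n = len(X)
    n_genes = len(X[0])

    def s(g, mu, nu):
        # Assumed single-gene similarity contribution.
        return -(X[mu][g] - X[nu][g]) ** 2

    scores = []
    for g in range(n_genes):
        # Actual contribution of gene g to the cluster similarity.
        actual = sum(s(g, mu, exemplar[mu]) for mu in members)
        # Mean and variance under uniformly random exemplar choices.
        mean = 0.0
        var = 0.0
        for mu in members:
            vals = [s(g, mu, nu) for nu in range(n)]
            m = sum(vals) / n
            mean += m
            var += sum(v * v for v in vals) / n - m * m
        scores.append((actual - mean) / math.sqrt(var) if var > 0 else 0.0)
    return scores


# Gene 0 separates samples {0, 1} from {2, 3}; gene 1 does not.
X = [[0.0, 1.0], [0.1, 3.0], [5.0, 1.1], [5.1, 2.9]]
scores = gene_scores(X, members=[0, 1], exemplar=[0, 0, 2, 2])
ranking = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
```

In this toy example the gene that distinguishes the cluster from the rest of the data receives the higher score and would lead the signature.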

20 Experimental results
Iris data
Brain cancer data
Other benchmark cancer data
– Lymphoma cancer data
– SRBCT cancer data
– Leukemia

21 Iris data
Three clusters: setosa, versicolor, virginica.
Four features for 150 flowers:
– sepal length
– sepal width
– petal length
– petal width

22 Iris data
Experimental results:
– Affinity Propagation: 16 errors.
– SCAP: 9 errors, with the Manhattan distance as the similarity measure.
On increasing the value of [symbol not captured in the transcript], the clusters for versicolor and virginica merge with each other, reflecting the fact that they are closer to each other than to setosa.
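A similarity based on the Manhattan distance can be sketched as the negative sum of absolute feature differences, which is also additive over features. The sign convention and the function name `manhattan_similarity` are assumptions; the slides do not state how the distance is turned into a similarity or how the self-similarities are set.

```python
def manhattan_similarity(X):
    """Pairwise similarity as negative Manhattan (L1) distance (assumed convention).

    X is a list of feature vectors; the diagonal comes out as zero and
    would be overwritten by whatever self-similarity the method requires.
    """
    n = len(X)
    return [
        [-sum(abs(u - v) for u, v in zip(X[i], X[j])) for j in range(n)]
        for i in range(n)
    ]


# Two flowers described by (petal length, petal width), values made up.
S = manhattan_similarity([[1.4, 0.2], [4.7, 1.4]])
```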

23 Brain cancer data
Five diagnosis types for 42 patients:
– 10 medulloblastoma
– 10 malignant glioma
– 10 atypical teratoid/rhabdoid tumors
– 4 normal cerebella
– 8 primitive neuroectodermal tumors (PNET)

24 Brain cancer data
Clustering with AP [figure; parameter setting not captured in the transcript]: there are three well-distinguishable clusters. The lowest number of errors is obtained with five clusters.

25 Brain cancer data
Clustering with SCAP [figure not captured in the transcript]: SCAP identifies four clusters with 8 errors.

26 Brain cancer data
The eight errors are due to misclassifications of the fifth diagnosis (PNET). We use the procedure to extract cluster signatures in the case of four clusters. Nos. 34 to 41 are the fifth diagnosis.

27 Other benchmark cancer data
Lymphoma cancer data
– Three diagnoses for 62 patients.
SRBCT cancer data
– Four expression diagnosis patterns for 63 samples.
Leukemia
– Two diagnoses for 72 samples.

28 Other benchmark cancer data
Lymphoma cancer data
– AP: 3 errors with 3 clusters.
– SCAP: 1 error with 3 clusters.
SRBCT cancer data
– AP: 22 errors with 5 clusters.
– SCAP: 7 errors with 4 clusters.
Leukemia
– AP: 4 errors with 2 clusters.
– SCAP: 2 errors with 2 clusters.
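The slides do not say how an "error" is counted. One common convention, assumed here purely for illustration, is to label each cluster by its majority diagnosis and count the samples that disagree with it. The function name `count_errors` and the toy labels are hypothetical.

```python
from collections import Counter


def count_errors(cluster_ids, labels):
    """Count samples whose label differs from their cluster's majority label.

    This counting rule is an assumed convention, not taken from the slides.
    """
    by_cluster = {}
    for cid, lab in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, []).append(lab)
    errors = 0
    for labs in by_cluster.values():
        majority_size = Counter(labs).most_common(1)[0][1]
        errors += len(labs) - majority_size
    return errors


# Five samples in two clusters; one sample sits in the "wrong" cluster.
n_errors = count_errors([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

Note that under this rule the error count alone rewards splitting data into many small clusters, which is why the tables above report the number of clusters alongside it.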

29 Discussion
If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail. SCAP is more efficient than AP, particularly for noisy, irregularly organized data, and thus in biological applications involving microarray data. The cluster structure can be efficiently probed.

