
1 Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

2 Outline Introduction Methods Experiment Discussion Conclusion

3 Introduction Many clustering algorithms depend on a similarity or distance measure to quantify the degree of association between expression profiles. This measure is a key factor in the successful identification of relationships between genes and gene networks.

4 Introduction In general, many clustering algorithms use the Euclidean distance or the Pearson correlation coefficient as the default similarity measure. However, these measures are sensitive to noise and outliers.

5 Introduction How can this be improved? By evaluating the mutual information (MI) between gene-expression patterns.

6 Methods Similarity measures The implementation of mutual information Assessment of clustering quality

7 Similarity measures Consider two vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$. Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. Pearson correlation coefficient: $r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$.
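As a minimal sketch of these two classical measures (assuming NumPy; the function names are illustrative, not from the paper):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
```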

8 Similarity measures The MI measure requires the expression patterns to be represented as discrete random variables. Given two discrete random variables X and Y with respective ranges and probability distribution functions, the mutual information is: $I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$.

9 Mutual Information The MI is always non-negative. A zero MI indicates that the patterns are statistically independent. The MI treats each expression level equally, regardless of its actual value, and is thus less biased by outliers.
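A small illustration of the definition and these properties (a sketch assuming NumPy; `mutual_information` is a hypothetical helper, not code from the paper):

```python
import numpy as np

def mutual_information(p_xy):
    """MI (in bits) of a discrete joint distribution p_xy[i, j] = P(X=i, Y=j)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(X)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P(Y)
    nz = p_xy > 0                           # convention: 0 * log 0 = 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent -> 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # deterministic -> 1.0
```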

10 The implementation of mutual information We use a two-dimensional histogram to approximate the joint probability density function of two expression patterns. We use the same number of bins for all expression patterns. The number of bins should be moderate, so that the relative frequencies give good estimates of the probability function. [Figure: two-dimensional histogram; the axes are the expression patterns of x and y.]

11 The implementation of mutual information The joint probabilities are then estimated by the corresponding relative frequencies of expression values in each bin of the two-dimensional histogram. The number of bins is often chosen heuristically.
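Putting the two slides together, a histogram-based MI estimate might look like the following sketch (not the paper's code; it uses equal-width bins via `np.histogram2d` for brevity, whereas the paper's discussion favours equal-probability bins):

```python
import numpy as np

def mi_estimate(x, y, bins=10):
    """Estimate MI between two expression patterns from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()            # joint relative frequencies
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))
```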

12 Assessment of clustering quality When the true solution is unknown, we often use the homogeneity and the separation functions to determine the quality of a clustering solution.

13 Assessment of clustering quality Consider a set of N elements divided into k clusters. Denote by $X_i$ the expression pattern of element $i$ and by $\bar{X}_{c(i)}$ the mean expression pattern of its cluster $c(i)$.

14 Homogeneity The homogeneity is the average similarity of each element to its cluster pattern: $H_{ave} = \frac{1}{N} \sum_{i=1}^{N} S\!\left(X_i, \bar{X}_{c(i)}\right)$, where $S(\cdot,\cdot)$ represents a given similarity measure.

15 Separation The separation is the size-weighted average similarity between cluster patterns: $S_{ave} = \frac{1}{\sum_{i \ne j} N_i N_j} \sum_{i \ne j} N_i N_j \, S\!\left(\bar{X}_i, \bar{X}_j\right)$, where $N_i$ and $N_j$ are the numbers of elements in clusters i and j, and $\bar{X}_i$ and $\bar{X}_j$ are the mean expression patterns of clusters i and j.
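A sketch of both scores under the definitions above (names such as `sim`, `centroids`, and `sizes` are illustrative; `sim` can be any similarity measure, e.g. the Pearson correlation or an MI estimate):

```python
import numpy as np

def homogeneity(X, labels, centroids, sim):
    """Average similarity of each element to its own cluster pattern."""
    return float(np.mean([sim(X[i], centroids[labels[i]]) for i in range(len(X))]))

def separation(centroids, sizes, sim):
    """Size-weighted average similarity between distinct cluster patterns."""
    num = den = 0.0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            w = sizes[i] * sizes[j]
            num += w * sim(centroids[i], centroids[j])
            den += w
    return num / den
```

With a similarity measure, higher homogeneity and lower separation indicate a better solution, matching the interpretation on the next slide.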

16 Assessment of clustering quality High homogeneity implies that elements in the same cluster are very similar to each other. Low separation implies that elements from different clusters are very dissimilar to each other.

17 Experiment Experiment 1: Robustness of compared distance measures. Experiment 2: Comparison of known clustering algorithms by the MI measure.

18 Robustness of compared distance measures Evaluate the performance of the three distance measures on clustering solutions with a known number of clustering errors. How are clustering errors generated? By transferring samples from their true cluster to an erroneous one.
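One plausible way to generate such controlled errors (an assumption about the procedure, not code from the paper):

```python
import numpy as np

def perturb_labels(true_labels, n_errors, n_clusters, seed=0):
    """Introduce a known number of clustering errors by moving randomly
    chosen elements from their true cluster to a different one."""
    rng = np.random.default_rng(seed)
    labels = np.array(true_labels, copy=True)
    moved = rng.choice(len(labels), size=n_errors, replace=False)
    for i in moved:
        wrong = [c for c in range(n_clusters) if c != labels[i]]
        labels[i] = rng.choice(wrong)
    return labels
```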

19 Robustness of compared distance measures The smaller the number of errors in a solution, the better its homogeneity and separation scores should be, and vice versa. It is expected that the scores of groups of different quality will differ significantly from each other. This is what "robustness" means for these distance measures.

20 Robustness of compared distance measures The datasets: [Table: the datasets used in the experiment.]

21 Robustness of compared distance measures Experimental results: The MI outperforms the Pearson correlation and the Euclidean distance.

22 Robustness of compared distance measures Experimental results: The MI outperforms the Pearson correlation and the Euclidean distance.

23 Robustness of compared distance measures Experimental results: The MI-based homogeneity and separation scores are better than those based on the Pearson correlation or the Euclidean distance.

24 Robustness of compared distance measures For any number of clustering errors higher than one, the obtained MI-based scores are statistically more significant than the Pearson-based or Euclidean-based scores. Therefore, using MI-based scores results in a smaller type-II error (false-negative rate) than the other distance measures when evaluating the quality of a clustering solution.

25 Comparison of known clustering algorithms by the MI measure In this experiment, we compare the effectiveness of several known clustering algorithms. The four compared algorithms: – K-means – Self-Organizing Maps (SOM) – CLICK – sIB (an MI-based clustering algorithm)

26 Comparison of known clustering algorithms by the MI measure The dataset: the yeast cell-cycle dataset – 72 experimental conditions. – Transcript levels vary periodically within the cell cycle. Spellman et al. assumed that these expression patterns can be correlated to five different profiles (the G1, S, G2, M, and M/G1 stages).

27 Comparison of known clustering algorithms by the MI measure Experimental results: sIB has higher homogeneity and lower separation.

28 Comparison of known clustering algorithms by the MI measure Experimental results: sIB obtains better homogeneity and separation scores than the other algorithms.

29 Comparison of known clustering algorithms by the MI measure Experimental results: However, sIB ranks worst when the Pearson-correlation-based score is used!

30 Comparison of known clustering algorithms by the MI measure Once the solutions are evaluated by a different distance measure, the ranking obtained is almost the opposite of the MI-based ranking.

31 Discussion In the first experiment, we show the statistical superiority of the average MI-based measure, independently of the selected clustering algorithm. In the second experiment, we show that the use of different distance measures can yield very different results when evaluating the solutions of known clustering algorithms.

32 Discussion The use of equal-probability bins to estimate the MI score provides considerable protection against outliers, since the contributions of all expression values within a bin to this estimate are identical, regardless of their actual values.
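For illustration, equal-probability (quantile) binning can be sketched as follows (assuming NumPy; a generic sketch, not the paper's implementation):

```python
import numpy as np

def equal_probability_bins(x, n_bins):
    """Discretize a pattern into bins holding roughly equal numbers of values,
    so an extreme outlier merely lands in the outermost bin."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Use the interior edges only; bin indices then range over 0..n_bins-1.
    return np.digitize(x, edges[1:-1])
```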

33 Conclusion The MI measure is a generalized measure of statistical dependence in the data and is reasonably immune to missing data and outliers. The selection of a proper distance measure can be more important than the choice of the clustering algorithm itself.

