Evaluation of gene-expression clustering via mutual information distance measure
Ido Priness, Oded Maimon and Irad Ben-Gal
BMC Bioinformatics, 2007

Outline
– Introduction
– Methods
– Experiment
– Discussion
– Conclusion

Introduction
Many clustering algorithms depend on a similarity or distance measure to quantify the degree of association between expression profiles. This measure is a key factor in the successful identification of relationships between genes and gene networks.

Introduction
In general, many clustering algorithms use the Euclidean distance or the Pearson correlation coefficient as the default similarity measure. However, these measures are sensitive to noise and outliers.

Introduction
How can this be improved? By evaluating the mutual information (MI) between gene expression patterns.

Methods
– Similarity measures
– The implementation of mutual information
– Assessment of clustering quality

Similarity measures
Consider two vectors $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$.
Euclidean distance: $d_E(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Pearson correlation coefficient: $r(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
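As a concrete illustration (a minimal sketch, not the authors' code), these two baseline measures can be computed with NumPy as follows; the function names are ours:

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two expression patterns."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression patterns."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))
```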

Similarity measures
The mutual information requires the expression patterns to be represented by discrete random variables. Given two discrete random variables X and Y with respective ranges $\Omega_X$, $\Omega_Y$ and probability distribution functions, the mutual information is:
$MI(X, Y) = \sum_{x \in \Omega_X} \sum_{y \in \Omega_Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}$

Mutual Information
The MI is always non-negative. A zero MI indicates that the patterns are statistically independent, i.e., they do not follow any kind of dependence. The MI treats each expression level equally, regardless of its actual value, and is therefore less biased by outliers.

The implementation of mutual information
We use a two-dimensional histogram to approximate the joint probability density function of two expression patterns. We use the same number of bins for all expression patterns. The number of bins should be moderate, so as to allow good estimates of the probability function.
[Figure: two-dimensional histogram with axes "expression pattern of x" and "expression pattern of y"]

The implementation of mutual information
The joint probabilities are then estimated by the corresponding relative frequencies of the expression values in each bin of the two-dimensional histogram. The number of bins is often chosen heuristically.
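The estimation procedure can be sketched as follows, using NumPy's `histogram2d` for the two-dimensional histogram. Note that `histogram2d` uses equal-width bins by default, whereas the paper's discussion later favors equal-probability bins; the bin count of 10 is illustrative only:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram-based MI estimate (in bits) between two expression patterns."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint relative frequencies
    px = pxy.sum(axis=1, keepdims=True)   # marginal distribution of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal distribution of y
    nz = pxy > 0                          # skip empty bins: 0 * log(0) := 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])))
```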

Assessment of clustering quality
When the true solution is unknown, we often use the homogeneity and the separation functions to determine the quality of a clustering solution.

Assessment of clustering quality
Consider a set of N elements divided into k clusters. Denote by $X_i$ the expression pattern of element $i$, and by $C(X_i)$ the expression pattern of its cluster.

Homogeneity
The homogeneity is:
$H_{ave} = \frac{1}{N} \sum_{i=1}^{N} S(X_i, C(X_i))$
where $S$ represents a given similarity measure.

Separation
The separation is:
$S_{ave} = \frac{1}{\sum_{i \neq j} N_i N_j} \sum_{i \neq j} N_i N_j \, S(C_i, C_j)$
where $N_i$ and $N_j$ are the numbers of elements in clusters $i$ and $j$, and $C_i$, $C_j$ are the expression patterns of those clusters.
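A minimal sketch of both scores, assuming each cluster's expression pattern is the mean of its members and `similarity` is any of the measures above (the helper names are ours, not the paper's):

```python
import numpy as np

def homogeneity(X, labels, similarity):
    """Average similarity between each element and its cluster's pattern."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    centroids = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
    return float(np.mean([similarity(x, centroids[c]) for x, c in zip(X, labels)]))

def separation(X, labels, similarity):
    """Size-weighted average similarity between patterns of distinct clusters."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    cs = np.unique(labels)
    centroids = {c: X[labels == c].mean(axis=0) for c in cs}
    sizes = {c: int(np.sum(labels == c)) for c in cs}
    num = den = 0.0
    for i, ci in enumerate(cs):
        for cj in cs[i + 1:]:             # each unordered pair once
            w = sizes[ci] * sizes[cj]
            num += w * similarity(centroids[ci], centroids[cj])
            den += w
    return num / den
```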

Assessment of clustering quality
High homogeneity implies that elements in the same cluster are very similar to each other. Low separation implies that elements from different clusters are very dissimilar to each other. With a similarity measure, a good solution therefore combines high homogeneity with low separation.

Experiment
– Experiment 1: Robustness of compared distance measures.
– Experiment 2: Comparison of known clustering algorithms by the MI measure.

Robustness of compared distance measures
Evaluate the performance of the three distance measures on clustering solutions with a known number of clustering errors. How are clustering errors generated? By transferring samples from their true cluster to an erroneous one.
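A sketch of this perturbation step under the stated scheme (the function and parameter names are illustrative, not the authors'):

```python
import numpy as np

def inject_errors(true_labels, n_errors, seed=None):
    """Copy a true clustering and move n_errors elements to a wrong cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(true_labels).copy()
    clusters = np.unique(labels)
    moved = rng.choice(len(labels), size=n_errors, replace=False)
    for i in moved:
        wrong = clusters[clusters != labels[i]]  # every cluster but the true one
        labels[i] = rng.choice(wrong)
    return labels
```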

Robustness of compared distance measures
The smaller the number of errors in a solution, the better its homogeneity and separation scores should be, and vice versa. It is expected that the scores of groups of different quality will differ significantly from each other; a distance measure that preserves this distinction is considered robust.
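To make "differ significantly" concrete: one illustrative way to compare two score groups is a two-sample t-test. This is an assumption on our part, since the slide does not name the exact test used; the numbers below are made up for the example:

```python
import numpy as np
from scipy import stats

# Hypothetical scores: homogeneity of solutions with 2 vs. 5 injected errors.
rng = np.random.default_rng(0)
scores_2_errors = rng.normal(0.80, 0.02, size=30)
scores_5_errors = rng.normal(0.74, 0.02, size=30)
t_stat, p_value = stats.ttest_ind(scores_2_errors, scores_5_errors)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")  # a small p-value: the groups differ
```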

Robustness of compared distance measures
The datasets: [table of the gene-expression datasets used, shown in the original slide]

Robustness of compared distance measures
Experimental results: [figures shown in the original slides] The MI outperforms the Pearson correlation and the Euclidean distance.

Robustness of compared distance measures
Experimental results: [figure shown in the original slide] The homogeneity and separation scores of the MI-based measure are better than those of the Pearson correlation or the Euclidean distance.

Robustness of compared distance measures
For any number of clustering errors higher than one, the obtained MI-based scores are statistically more significant than the Pearson-based or Euclidean-based scores. Therefore, the use of MI-based scores results in a smaller type-II error (false negative) in comparison to the other distance measures when used to evaluate the quality of a clustering solution.

Comparison of known clustering algorithms by the MI measure
In this experiment, we compare the effectiveness of several known clustering algorithms. The four compared algorithms:
– K-means
– Self-Organizing Maps (SOM)
– CLICK
– sIB (an MI-based clustering algorithm)

Comparison of known clustering algorithms by the MI measure
The dataset: the yeast cell-cycle dataset.
– 72 experimental conditions.
– Transcript levels vary periodically within the cell cycle.
Spellman et al. assumed that these expression patterns can be correlated to five different profiles (the G1, S, G2, M and M/G1 stages).

Comparison of known clustering algorithms by the MI measure
Experimental results: [figure shown in the original slide] The sIB achieves higher homogeneity and lower separation.

Comparison of known clustering algorithms by the MI measure
Experimental results: [figure shown in the original slide] The sIB obtains better homogeneity and separation scores than the other algorithms.

Comparison of known clustering algorithms by the MI measure
Experimental results: [figure shown in the original slide] However, sIB ranks worst when the Pearson-correlation-based score is used!

Comparison of known clustering algorithms by the MI measure
Once the solutions are evaluated by a different distance measure, the obtained ranking is almost the opposite of the MI-based ranking.

Discussion
In the first experiment, we show the statistical superiority of the average MI-based measure independently of the selected clustering algorithm. In the second experiment, we show that the use of different distance measures can yield very different results when evaluating the solutions of known clustering algorithms.

Discussion
The use of equal-probability bins to estimate the MI score provides considerable protection against outliers, since the contributions of all expression values within a bin to this estimation are identical, regardless of their actual values.
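A minimal sketch of such equal-probability (equal-frequency) binning, assuming quantile-based bin edges; this is an illustrative implementation, not the authors' code:

```python
import numpy as np

def equal_probability_bins(x, n_bins=10):
    """Assign each value to a quantile-based bin of roughly equal occupancy.

    An outlier falls into the top (or bottom) bin like any other extreme
    value, so its actual magnitude cannot distort the MI estimate.
    """
    x = np.asarray(x, dtype=float)
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))  # drop tied edges
    return np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
```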

Conclusion
The MI measure is a generalized measure of statistical dependence in the data and is reasonably immune to missing data and outliers. The selection of a proper distance measure can be at least as important as the choice of the clustering algorithm itself.