Mutual Information Mathematical Biology Seminar 23.5.2005.


Mutual Information Mathematical Biology Seminar

1. Information Theory. Information and uncertainty are terms which describe any process that selects one or more objects from a set of objects.

Information Theory. Symbols {A, B, C}: uncertainty = 3. Symbols {1, 2}: uncertainty = 2. Combined symbols {A1, A2, B1, B2, C1, C2}: uncertainty = 6. In general, Uncertainty = log(M), where M = the number of symbols; the log makes uncertainties add rather than multiply.
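The additivity claimed above can be checked directly in a few lines (a minimal sketch; the function name `uncertainty` is ours, not from the slides):

```python
import math

# Uncertainty of choosing one symbol from an alphabet of M equally
# likely symbols is log(M); using base 2 gives the answer in bits.
def uncertainty(num_symbols):
    return math.log2(num_symbols)

# Three letters {A, B, C} and two digits {1, 2}: the combined alphabet
# {A1, ..., C2} has 3 * 2 = 6 symbols, and the log turns the product
# of alphabet sizes into a sum of uncertainties.
u_letters = uncertainty(3)
u_digits = uncertainty(2)
u_combined = uncertainty(6)
assert abs(u_combined - (u_letters + u_digits)) < 1e-12
```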

Information Theory. A low-probability outcome leaves us very surprised (it carries much information); a high-probability outcome leaves us not surprised (it carries little).

Entropy (self-information). For a discrete random variable X with probability distribution p(x), the entropy H(X) = -∑ p(x) log p(x) is a measure of the uncertainty (information content) of the random variable: how certain we are of the outcome.
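The definition above translates directly into code (a sketch; the function name is ours):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum p(x) log2 p(x) of a discrete
    distribution p, given as a list of probabilities (in bits)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# A fair coin carries exactly 1 bit of uncertainty; a biased coin less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```

Note that a uniform distribution over M outcomes gives exactly log2(M), the maximum mentioned on the next slide.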

Entropy – properties: entropy is maximized by a uniform distribution, where H(X) = log(M) for M outcomes.

Joint Entropy. H(X,Y) = -∑ p(x,y) log p(x,y), a measure of the combined uncertainty of X and Y.

Conditional Entropy. H(X|Y) = -∑ p(x,y) log p(x|y), a measure of the uncertainty remaining about X when Y is known.

Mutual Information. MI(X,Y) = H(X) - H(X|Y): the reduction of uncertainty of one variable due to knowing the other, or the amount of information one variable contains about the other. Mutual Information – properties:
 MI(X,Y) ≥ 0.
 MI(X,Y) = 0 only when X and Y are independent: H(X|Y) = H(X).
 MI(X,X) = H(X) - H(X|X) = H(X), so entropy is self-information.
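Using the identity MI(X,Y) = H(X) + H(Y) - H(X,Y), mutual information can be computed from a joint probability table (a sketch; function names are ours):

```python
import math

def entropy(p):
    """Shannon entropy of a list of probabilities, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(joint):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) from a joint probability table,
    given as a list of rows (one row per value of X)."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    pxy = [p for row in joint for p in row]     # flattened joint
    return entropy(px) + entropy(py) - entropy(pxy)

# Independent variables: joint = product of marginals -> MI = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly dependent: X determines Y -> MI = H(X) = 1 bit.
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```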

2. Applications:
 Clustering algorithms
 Clustering quality

Clustering algorithms. Motivation: MI's capability to measure a general dependence among random variables. Use MI as a similarity measure, and minimize the statistical correlation among clusters, in contrast to distance-based algorithms, which minimize the total variance within clusters.

Clustering algorithms. Two methods: 1. Mutual-information based – MI, PMI. 2. Combined mutual-information and distance-based – MIK, MIF.

MI – mutual information minimization, using the grouping property:
1. Compute a proximity matrix based on pairwise mutual information; assign n clusters such that each cluster contains exactly one object.
2. Find the two closest clusters i and j.
3. Create a new cluster (ij) by combining i and j.
4. Delete the rows/columns with indices i and j from the proximity matrix, and add one row/column containing the proximities between cluster (ij) and all other clusters.
5. If the number of clusters is still > 2, go to step 2; otherwise join the two remaining clusters and stop.
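The agglomerative loop above can be sketched as follows. This is an illustration only: it takes a precomputed MI-based distance matrix as input, and uses average linkage as a stand-in for the grouping-property MI update between merged clusters:

```python
def mi_hierarchical_clustering(dist, k=2):
    """Agglomerative clustering over a precomputed matrix `dist` of
    MI-based distances (smaller = more mutually informative), merged
    down to k clusters. Sketch: average linkage replaces the exact
    recomputation of MI between merged clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # create cluster (ij)
        del clusters[b]                          # drop the merged-in cluster
    return clusters

# Toy symmetric distance matrix: objects 0,1 are close, 2,3 are close.
dist = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
print(mi_hierarchical_clustering(dist, k=2))  # [[0, 1], [2, 3]]
```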

PMI – threshold based on pairwise mutual information:
1. Start with the first gene and group with it the genes that have the smallest mutual-information-based distance to it, repeating until no gene can be added without surpassing the threshold. Then start with the second gene and repeat the same procedure (with all genes available again). Repeat for all genes.
2. Select the largest candidate cluster.
3. Repeat steps 1 and 2 until K clusters are obtained.
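The three steps can be sketched like this (an illustration, not the paper's implementation; `dist` is again an assumed precomputed MI-based distance matrix):

```python
def pmi_clustering(dist, threshold, k):
    """Threshold-based PMI clustering sketch: grow a candidate cluster
    around every remaining gene by adding genes in order of increasing
    MI-based distance to the seed, stopping at the threshold; keep the
    largest candidate, remove its genes, and repeat k times."""
    remaining = set(range(len(dist)))
    clusters = []
    for _ in range(k):
        best_candidate = []
        for seed in remaining:
            candidate = [seed]
            others = sorted(remaining - {seed}, key=lambda g: dist[seed][g])
            for g in others:
                if dist[seed][g] <= threshold:
                    candidate.append(g)
                else:
                    break  # adding g would surpass the threshold
            if len(candidate) > len(best_candidate):
                best_candidate = candidate
        clusters.append(best_candidate)
        remaining -= set(best_candidate)
        if not remaining:
            break
    return clusters

# Same toy matrix as before: 0,1 close together, 2,3 close together.
dist = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
print(pmi_clustering(dist, threshold=2, k=2))  # [[0, 1], [2, 3]]
```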

PMI threshold – two choices:
1. The mean of the distances of all gene pairs.
2. Chosen empirically.
An optimal solution can be sought with a simulated annealing algorithm (optimization) over a cost function.

Combined methods. Euclidean distance captures positive (linear) correlation; mutual information captures nonlinear correlation. With a small data sample size, combined algorithms help.
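The difference between the two measures can be seen on a toy example: y = x² on a symmetric range has zero linear correlation but is perfectly dependent. The plug-in MI estimator below is our own simple illustration, not the paper's estimator:

```python
import math
from collections import Counter

def entropy_counts(counts, n):
    """Entropy of an empirical distribution given by a Counter."""
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def discrete_mi(xs, ys):
    """Plug-in MI estimate MI = H(X) + H(Y) - H(X,Y) for two
    discrete sequences of equal length."""
    n = len(xs)
    hx = entropy_counts(Counter(xs), n)
    hy = entropy_counts(Counter(ys), n)
    hxy = entropy_counts(Counter(zip(xs, ys)), n)
    return hx + hy - hxy

# y = x^2 on a symmetric range: the covariance (numerator of Pearson
# correlation) is exactly 0, yet y is a deterministic function of x,
# so MI detects the dependence.
xs = list(range(-5, 6))
ys = [x * x for x in xs]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
print(cov)                      # 0.0 -> no linear correlation
print(discrete_mi(xs, ys) > 0)  # True -> dependence detected by MI
```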

MIF – a combined metric of MI and fuzzy-membership distance. The objective function combines the two terms using a weight factor and normalization constants.

Performance on simulated data: 8 clustering algorithms. Measure of performance: percentage of points placed into correct clusters. Variable: the sample size (M) is changed.

Performance on simulated data. Results (1):
1. The MI method outperforms the Fuzzy, K-means, linkage, biclustering and PMI methods.
2. MIF gives the best clustering accuracy.
3. MIK has similar performance to MI.
4. MI-based clustering methods become more accurate as the sample size increases.

Performance on simulated data 2: different numbers of genes (N), with M = 30. The data are generated according to:
Results (2), in addition to the previous results:
1. Performance degrades as the number of genes increases.
2. The degree of degradation depends on the distributions governing the data.

Experimental Analysis. Clustering genes based on the similarity of their expression patterns in a limited set of experiments. Genes with similar expression patterns are more likely to have similar biological function (though this does not necessarily provide the best possible grouping). Higher entropy for a gene means that its expression data are more randomly distributed; the higher the MI between genes, the more likely it is that they have a biological relationship.

Experimental Analysis: 579 genes from 26 human glioma surgical tissue samples; 526 genes remain after filtering out genes with insufficient variability.

Glioma. Gliomas are tumors that can be found in various parts of the brain. They arise from the support cells of the brain, the glial cells.

[Figure: binary cluster profiles, Fuzzy K-means vs. MIF]

Experimental Analysis. Results (Fuzzy vs. MIF): two small clusters were broken out from the Fuzzy clusters. While the number of genes changed is small, the error decrease is significant (from 2.013 down to 1.084).

Experimental Analysis. Results: the results are the same for MIK and Fuzzy; compared with MIF and MIK, MI and PMI give different results.

Applications:
 Clustering algorithms
 Clustering quality

Clustering quality. What choice of number of clusters generally yields the most information about gene function (where function is known)?
 9 different algorithms, 2 databases, 4 data sets.
 A table of 6300 genes × 2000 attributes.
 A contingency table for each cluster-attribute pair.

Clustering quality. Calculate the entropies of the clustering C and of each attribute A_j, and the total MI between the cluster result C and all the attributes as the sum of the pairwise terms: MI_total = ∑_j MI(C, A_j).

1. How does MI change? Given 3000 genes and 30 clusters, perform random swaps: the cluster sizes are held fixed, but the degree of correlation within the clusters is slowly destroyed.

Results:
1. MI decreases.
2. MI converges to a non-zero value.

2. Score the partition:
1. Compute MI for the clustered data, MI_real.
2. Compute MI for clusterings obtained by random swaps, repeating until a distribution of MI_rand values is obtained.
3. Compute the z-score: z = (MI_real − mean(MI_rand)) / std(MI_rand).
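The three steps can be sketched as below. This is an illustration: `mi_func` stands in for whatever MI score is used (e.g. the total MI between clustering and attributes), and is supplied by the caller:

```python
import random
import statistics

def zscore_cluster_quality(mi_real, labels, attributes, mi_func,
                           n_rand=100, seed=0):
    """Z-score validation sketch: compare the MI of the real clustering
    (mi_real) against MIs of clusterings obtained by random label swaps,
    which preserve cluster sizes while destroying within-cluster
    correlation. `mi_func(labels, attributes)` is a caller-supplied
    MI score; both the name and signature are illustrative."""
    rng = random.Random(seed)
    rand_mis = []
    for _ in range(n_rand):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # random swaps: sizes kept, structure lost
        rand_mis.append(mi_func(shuffled, attributes))
    mu = statistics.mean(rand_mis)
    sigma = statistics.stdev(rand_mis)
    return (mi_real - mu) / sigma
```

A clustering strongly related to the attributes yields a real MI far above the randomized distribution, hence a large positive z-score.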

A large z-score means a greater distance from random, i.e. clustering results more significantly related to gene function. Results:
1. Low cluster numbers perform better.
2. Clustering algorithms which produce a nonuniform cluster-size distribution perform better.

Conclusion – Advantages (1):
 A very simple and natural hierarchical clustering algorithm (as MI estimates become better, the results should improve as well).
 Optimal results when the sample size is large.
 MI is a proximity measure that also recognizes negatively and nonlinearly correlated data, so it is more general for modeling relationships between genes.
 MI is not biased by outliers; Euclidean distance is more easily distorted when variables are not uniformly distributed.

Conclusion – Advantages (2):
 Expression levels can be modeled to include measurement noise.

Conclusion – Disadvantages:
 In general, it is not easy to estimate MI (for example, for continuous random variables).
 The performance degrades substantially as the number of genes increases.

Conclusion. It is not entirely accurate to treat each condition as an independent observation; each point is significant. There are analyses on datasets that do not miss any non-linear correlations. It is more accurate as a validation method.