Paper: Large-Scale Clustering of cDNA-Fingerprinting Data Presented by: Srilatha Bhuvanapalli 04-01-2004 INFS 795 – Special Topics in Data Mining.

OUTLINE
- OBJECTIVE
- K-MEANS METHOD
- INTRODUCTION
- EXPERIMENTAL REQUIREMENTS
- HOW DO WE MEASURE SIMILARITY ???
- SIMULATION SETUP / DATA SET
- METHODS
- CLUSTERING ALGORITHM
- VALIDATION OF CLUSTERING RESULTS
- CONCLUSION

OBJECTIVE To illustrate a clustering method for gene-expression data based on a sequential K-means algorithm, applied to a cDNA library from human peripheral blood dendritic cells. The method is a partitioning algorithm with heuristic modifications that determines the number of cluster centroids from the data itself and finds a "good" partition according to these centroids.

K-Means METHOD To start, we must determine the number of clusters. Next, every point is placed in the cluster corresponding to its nearest centroid. Because the centroids move as points are assigned, some observations may end up closer to the centroid of a cluster they are not a member of. A second sweep through the data "corrects" the first, moving each observation into the cluster whose centroid it is closest to; the centroids are updated as members leave or join a cluster. These correction sweeps are repeated until no observation changes clusters during a sweep.
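A minimal sketch of this sweep-and-correct procedure in Python (hypothetical function and variable names, plain NumPy; not the authors' implementation):

```python
import numpy as np

def sequential_kmeans(points, k, seed=0):
    """K-means with correction sweeps: reassign points one at a time,
    updating the affected centroids as members leave/join, until a full
    sweep produces no changes."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    # First sweep: assign every point to its nearest centroid.
    labels = np.argmin(
        np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2), axis=1
    )
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            old = labels[i]
            new = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
            if new != old:
                labels[i] = new
                changed = True
                # Recompute the two affected centroids from their members.
                for c in (old, new):
                    members = points[labels == c]
                    if len(members):
                        centroids[c] = members.mean(axis=0)
    return labels, centroids
```

Usage: `labels, cents = sequential_kmeans(np.random.rand(100, 5), k=3)`.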

INTRODUCTION The task of the clustering procedure is to classify the clone fingerprints according to a well-defined pairwise similarity measure, grouping similar fingerprints together and separating dissimilar ones. What is a fingerprint ??? Fingerprint: each clone is described by a characteristic vector of numerical values, which is called its fingerprint.

EXPERIMENTAL REQUIREMENTS Ability to handle data sets on the order of hundreds of thousands of high-dimensional data points in an acceptable amount of time. Ability to handle partial information in the form of missing values. The algorithm should be robust enough to cope with experimental noise, because high-throughput data is usually generated within a production pipeline that involves many different steps and is therefore somewhat error prone.

How do we measure Similarity ??? Euclidean Distance Metric: distance between the clones; the shorter the distance, the more similar they are. Mutual Information Similarity Measure: takes into account the total matched signals. Consider, for example, the following fingerprints (individual signals are binarized for simplicity):

x1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
x2 = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
x3 = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
x4 = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

The Euclidean distance of x1 and x2, as well as of x3 and x4, is 1. The mutual information of x1 and x2 is 1, whereas that of x3 and x4 is 4, because it takes into account the number of matches. Another disadvantage of Euclidean distance is that very uninformative fingerprints get a high pairwise similarity merely because of the absence of high signal values, even if they come from totally different DNA sequences; mutual information in this case is very low. By taking into consideration only those signals that are present in both vectors, mutual information weights the data implicitly. In other words, missing values are treated as they should be: they are ignored.
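A small sketch reproducing this comparison (here mutual similarity is approximated simply by counting shared positive signals, following the slide's binarized simplification; this is not the paper's full estimator):

```python
import numpy as np

x1 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
x2 = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
x3 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
x4 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def shared_matches(a, b):
    # Count positions where both fingerprints show a signal.
    return int(np.sum((a == 1) & (b == 1)))

print(euclidean(x1, x2), euclidean(x3, x4))            # 1.0 1.0 -> indistinguishable
print(shared_matches(x1, x2), shared_matches(x3, x4))  # 1 4     -> x3/x4 more similar
```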

How do we measure Similarity ??? - Contd... Relative Mutual Information Coefficient (RMIC): to quantify the quality of the calculated partitions, the authors introduce a novel measure of cluster validity based on a modified version of mutual information. RMIC can be interpreted as the amount of information that the calculated clustering contains about the true clustering.

SIMULATION SETUP / DATA SET The human cDNA library under analysis is derived from human peripheral blood dendritic cells. Dendritic cells have a key role in the immune system through their ability to present antigen. From a clustering point of view, the complexity of the library is interesting: as these cells are specialized to certain biological processes, the number of different genes is estimated at ~15,000, although no exact figure is available. To test the algorithm on experimental data, 15 different cDNA clones were partially sequenced, identified, and afterwards hybridized to the entire cDNA library, and a total set of 2029 cDNA clones was extracted that give strong positive signals with one of these genes. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.

METHODS We assume the series of hybridization signals for each clone to be independent of each other. To allow mutual information measurement, we discretize the signals into a finite number K of intervals. For two clone fingerprints, $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$, similarity can be measured by the mutual information

$H(x; y) = \sum_{i,j=1}^{K} \frac{n^{xy}_{ij}}{n^{xy}} \log_2 \frac{n^{xy}_{ij}\, n^{xy}}{n^x_i\, n^y_j}$

where $n^{xy}_{ij}$ is the number of pairs in which x falls into interval i and y falls into interval j, $n^x_i$ and $n^y_j$ are the respective marginal frequencies of x and y, and $n^{xy}$ is the number of pairs in which signals are present in both vectors.

METHODS - Contd Mutual information can be interpreted as the amount of information that each of the signals detects about the other. It tends to zero if x and y are independent (no correlation) and is maximal if they are identical (perfect correlation). $H(x) = -\sum_{i=1}^{K} \frac{n^x_i}{n^{xy}} \log_2 \frac{n^x_i}{n^{xy}}$ is the entropy of x, and H(y) is the entropy of y. The similarity score s is the mutual information normalized by these entropies and scaled by a factor t(x, y):

$s(x, y) = \frac{2\, H(x; y)}{H(x) + H(y)}\; t(x, y)$

Because the mutual information never exceeds either entropy, the range of s lies within the interval [0, 1]. It is 1 if both signal series are perfectly correlated and 0 if there is no correlation. t(x, y) is proportional to the number of pairs in the diagonal of the K × K contingency table; it is low in case of anticorrelation.
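A rough sketch of this similarity computation in Python (hypothetical helper, assuming signals are already discretized into K intervals and missing values are encoded as negative; not the authors' code):

```python
import numpy as np

def mi_similarity(x, y, K):
    """Normalized mutual-information similarity between two discretized
    fingerprints. Positions where either signal is missing (< 0) are ignored."""
    present = (x >= 0) & (y >= 0)
    x, y = x[present], y[present]
    n_xy = len(x)
    # K x K contingency table of joint interval counts.
    table = np.zeros((K, K))
    for i, j in zip(x, y):
        table[i, j] += 1
    p = table / n_xy
    px, py = p.sum(axis=1), p.sum(axis=0)
    # Mutual information H(x;y).
    mask = p > 0
    mi = np.sum(p[mask] * np.log2(p[mask] / np.outer(px, py)[mask]))
    # Entropies H(x), H(y).
    hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log2(py[py > 0]))
    # Diagonal factor t(x,y): fraction of pairs on the table diagonal.
    t = np.trace(table) / n_xy
    if hx + hy == 0:
        return 0.0
    return float(2 * mi / (hx + hy) * t)
```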

CLUSTERING ALGORITHM

To allow the algorithm to find the clusters from the data itself, two threshold parameters γ and δ (γ >= δ) are introduced: γ is the minimal admissible similarity for merging two cluster centroids, and δ corresponds to the maximal admissible similarity of a data point to a cluster centroid. Cluster centroids are initialized with weights = 1 and pairwise similarities < γ.
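A simplified sketch of such a threshold-driven sequential clustering loop (reading δ as the assignment threshold below which a point founds a new cluster, and treating centroids as weighted running means; illustrative only, not the paper's exact procedure):

```python
import numpy as np

def sequential_cluster(points, similarity, gamma, delta):
    """Grow clusters from the data itself: a point joins its most similar
    centroid if similar enough, otherwise it founds a new cluster; centroid
    pairs whose similarity reaches gamma are merged."""
    centroids, weights = [], []
    for p in points:
        sims = [similarity(p, c) for c in centroids]
        if sims and max(sims) >= delta:
            k = int(np.argmax(sims))
            # Weighted running-mean update of the winning centroid.
            centroids[k] = (weights[k] * centroids[k] + p) / (weights[k] + 1)
            weights[k] += 1
        else:
            centroids.append(p.astype(float))
            weights.append(1)
        # Merge any pair of centroids that became too similar.
        merged = True
        while merged:
            merged = False
            for a in range(len(centroids)):
                for b in range(a + 1, len(centroids)):
                    if similarity(centroids[a], centroids[b]) >= gamma:
                        w = weights[a] + weights[b]
                        centroids[a] = (weights[a] * centroids[a]
                                        + weights[b] * centroids[b]) / w
                        weights[a] = w
                        del centroids[b], weights[b]
                        merged = True
                        break
                if merged:
                    break
    return centroids, weights
```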

VALIDATION OF CLUSTERING Assume a data structure of N data points, $x_1, \ldots, x_N$, in which the true clustering, T, is known. Let $t_{ij} = 1$ if $x_i$ and $x_j$ belong to the same cluster and $t_{ij} = 0$ otherwise ($1 \le i, j \le N$). For a calculated clustering, C, define similarly $c_{ij} = 1$ if $x_i$ and $x_j$ belong to the same cluster and $c_{ij} = 0$ otherwise ($1 \le i, j \le N$). To measure clustering quality, we evaluate the $2 \times 2$ contingency table with entries $N_{kl} = \#\{(i, j) : t_{ij} = l,\; c_{ij} = k,\; 1 \le i, j \le N\}$, $0 \le k, l \le 1$, in which $N_{k\cdot}$ and $N_{\cdot l}$ are the respective marginal frequencies. Columns correspond to the true clustering, and rows correspond to the calculated clustering.

VALIDATION OF CLUSTERING - Contd As a measure of quality for a given true clustering, T, and a given calculated clustering, C, we introduce the RMIC:

$\mathrm{RMIC}(C, T) = \mathrm{sgn}(N_{00} N_{11} - N_{01} N_{10}) \cdot \frac{H(C; T)}{H(T)}$

in which $\mathrm{sgn}(y) = 1$ if $y \ge 0$, $\mathrm{sgn}(y) = -1$ if $y < 0$, and H(C;T) and H(T) are defined as before with the number K of intervals equal to 2. RMIC can be interpreted as the amount of information that the calculated clustering contains about the true clustering. The sign factor is necessary to filter out anticorrelation: RMIC is negative if more pairs are clustered incorrectly than correctly. From this definition, the range of RMIC lies within the interval [-1, 1]. It is 1 in the case of perfect correlation of C and T and tends to smaller values as the partitions become less similar; in the case of anticorrelation it tends to negative values.
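A sketch of this validation computation (cluster labels as integer ids; the sign factor taken as the determinant of the 2 × 2 pair table, as reconstructed above):

```python
import numpy as np
from itertools import combinations

def rmic(calc_labels, true_labels):
    """Relative Mutual Information Coefficient between a calculated
    clustering C and the true clustering T, via pair co-membership."""
    n = len(true_labels)
    N = np.zeros((2, 2))  # rows: calculated c_ij, columns: true t_ij
    for i, j in combinations(range(n), 2):
        c = int(calc_labels[i] == calc_labels[j])
        t = int(true_labels[i] == true_labels[j])
        N[c, t] += 1
    p = N / N.sum()
    pc, pt = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    h_ct = np.sum(p[mask] * np.log2(p[mask] / np.outer(pc, pt)[mask]))  # H(C;T)
    h_t = -np.sum(pt[pt > 0] * np.log2(pt[pt > 0]))                     # H(T)
    sign = 1.0 if N[0, 0] * N[1, 1] - N[0, 1] * N[1, 0] >= 0 else -1.0
    return float(sign * h_ct / h_t) if h_t > 0 else 0.0
```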

RESULTS Clustering is tested against three error parameters: (A) false-positive rate, (B) false-negative rate, and (C) cDNA length variation. False-positive and false-negative rates are measured as percentages, and length variation is measured in base pairs. Clustering quality is calculated according to two different quality measures: (broken lines) the relative Minkowski metric, which is low if clustering quality is good and high if it is bad; (solid lines) the RMIC, which is high if clustering quality is good and low if it is bad.

RESULTS - Contd

Diversity Index: to evaluate the splitting of individual gene clusters numerically, we calculate a diversity index using entropy. Given that gene $g_i$ is present in the library with $N_i$ copies, and given that these copies are split across K different clusters with frequencies $n_1, \ldots, n_K$ ($n_1 + \cdots + n_K = N_i$), the diversity is the normalized entropy

$d(g_i) = -\frac{1}{\log N_i} \sum_{j=1}^{K} \frac{n_j}{N_i} \log \frac{n_j}{N_i}$

The diversity is maximal [$d(g_i) = 1$] if all copies belong to different clusters; it is minimal [$d(g_i) = 0$] if all copies belong to the same cluster.
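A small sketch of this index (normalized entropy as reconstructed above; undefined for a single-copy gene, which the sketch guards against):

```python
import numpy as np

def diversity_index(cluster_counts):
    """Entropy-based diversity of one gene's copies across clusters.
    cluster_counts: copies of the gene found in each cluster, e.g. [3, 1, 1]."""
    counts = np.asarray([c for c in cluster_counts if c > 0], dtype=float)
    total = counts.sum()
    if total <= 1:
        return 0.0  # a single copy cannot be split across clusters
    p = counts / total
    return float(-np.sum(p * np.log(p)) / np.log(total))

print(diversity_index([1, 1, 1, 1]))  # 1.0: every copy in a different cluster
print(diversity_index([4]))           # 0.0: all copies in the same cluster
```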

RESULTS - Contd

The subset of 2029 control clones is used to compare the mutual information similarity measure with Euclidean distance and Pearson correlation. Given two vectors $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$, the Pearson correlation is given by

$r(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$

The comparison table reports, for each metric, the number of clusters, the number of singletons, the RMIC, and the Minkowski score:

Metric              | No. of Clusters | No. of Singletons | RMIC | Minkowski
Euclidean           |                 |                   |      |
Pearson Correlation |                 |                   |      |
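For completeness, a one-line check of this formula against NumPy's built-in correlation (illustrative):

```python
import numpy as np

def pearson(x, y):
    # Centered dot product over the product of centered norms.
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

x, y = np.random.rand(50), np.random.rand(50)
assert np.isclose(pearson(x, y), np.corrcoef(x, y)[0, 1])
```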

CONCLUSION Sequential K-means has proved an efficient way of clustering the given data set. Further improvements can be made to the pairwise clone-fingerprint similarity. One shortcoming of the measure is that all pairwise events are weighted equally, regardless of their information content. For example, the event that both clones fail to match a given probe gets the same weight as the event that both clones do match, although the latter information is far more important than the former. This can be taken into account by using a weighted form of mutual information in which the events are weighted; the weights would then depend on the specific set of probes and the clones under analysis.
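One hypothetical way to realize such a weighting (purely illustrative; the paper does not specify the scheme): scale each cell of the joint contingency table by an event weight before computing the mutual-information sum, for example down-weighting the uninformative both-absent cell.

```python
import numpy as np

def weighted_mi(table, weights):
    """Mutual information over a K x K joint-count table, with each cell (i, j)
    scaled by weights[i, j] before normalization. The weight matrix is
    hypothetical, e.g. weights[0, 0] < 1 to down-weight the both-absent event."""
    w = table * weights
    p = w / w.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / np.outer(px, py)[mask])))
```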