Microarray Data Analysis: Clustering and Validation Measures. Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo (6/1/2015).


1 Microarray Data Analysis: Clustering and Validation Measures. Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo, Italy

2 What we want (typically): Gene Expression Matrix. Group functionally related genes together. Basic Axiom of Computational Biology: Guilt by Association. A high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… Not always. (Figure labels: Clustering; Expression Levels; Genes)

3 What we want (typically): Clustering Solution

4 Limitations in the Analysis Process

5 Limitations: Microarray Technology. “MIAME, we have a problem”, Robert Shields, Trends in Genetics, 2006 –…no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself –A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak

6 Limitations: Visualization Tools. One of those two clusters is random noise… Which one?

7 Limitations: Statistics. “Towards sound epistemological foundations of statistical methods for high-dimensional biology”, T. Mehta et al., Nature Genetics, 2004 –Many papers in omic research describe the development or application of statistical methods; many of those are questionable

8 Overview of the Remaining Part: Clustering as a three-step process; Internal Validation Techniques; External Validation Techniques; Experiments; One-stop-shop software systems; Some Issues I Really Had to Talk About

9 Cluster Analysis as a Three Step Process

10 What is clustering? Group similar objects together. E1E2E3E4 Gene 1-2+2 Gene 2+8+30+4 Gene 3-4+5+4-2 Gene 4+4+3 Clustering genes; clustering experiments

11 What is Clustering? Goal: partition the observations {x_i} so that –C(i) = C(j) if x_i and x_j are “similar” –C(i) ≠ C(j) if x_i and x_j are “dissimilar”. Natural questions: –What is a cluster? –How do I choose a good similarity function? –How do I choose a good algorithm? APPLICATION and DATA DEPENDENT –How many clusters are REALLY present in the data?

12 What’s a Cluster? No rigorous definition. Subjective. Scale/resolution dependent (e.g. hierarchy)

13 Step One: Choose a good similarity function –Euclidean Distance: captures magnitude and pattern of expression, i.e., direction –Correlation functions: capture pattern of expression, i.e., direction –Etc…
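To make the difference between the two families of functions concrete, here is a minimal sketch in plain Python. The two expression profiles are hypothetical, and Pearson correlation is used as one common example of a correlation function; the slide does not say which one is meant.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical profiles: gene_b follows the same pattern as gene_a
# at ten times the magnitude.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]

dist = euclidean_distance(gene_a, gene_b)   # large, because magnitudes differ
corr = pearson_correlation(gene_a, gene_b)  # close to 1, because patterns agree
```

The Euclidean distance treats the two genes as far apart, while the correlation treats them as essentially identical, which is the magnitude versus pattern distinction made on the slide.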

14 Step Two: Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize –Compactness: intra-cluster variation is small. They like well-separated or spherical clusters but fail on more complex cluster shapes. K-means, Average Link Hierarchical Clustering –Connectedness: neighboring items should share the same cluster. Robust with respect to cluster shapes, but fail when separation in the data is poor. Single Link Hierarchical Clustering, CAST, CLICK –Spatial Separation: poor performer by itself, usually coupled with other criteria. Simulated Annealing, Tabu Search

15 Step Three: An index that tells us how many clusters are really present in the data. Consistency/Uniformity: more likely to be 2 than 3; more likely to be 2 than 36? (depends, what if each circle represents 1000 objects?)

16 Step Three: An index that tells us: Separability, increasing confidence to be 2

17 Step Three: An index that tells us: Separability, increasing confidence to be 2

18 Step Three: An index that tells us: Separability, increasing confidence to be 2

19 Step Three: An index that is –independent of cluster “volume”? –independent of cluster size? –independent of cluster shape? –sensitive to outliers? –etc… Theoretically sound: Gap Statistics. Data driven and validated: many

20 Internal Validation Measures: How many clusters are really present in the data? Assess cluster quality. Internal: no external knowledge about the dataset is given

21 The Basic Scheme: Given an index F, a function of a clustering solution, and a black box producing clustering solutions with k=2,…,m clusters, compute F on each solution to decide which k is best

22 Internal Validation Measures: Within-Cluster Sum of Squares [Folklore]; Gap Statistics [Tibshirani, Walther, Hastie 2001]; FOM [Yeung, Haynor, Ruzzo 2001]; Consensus Clustering [Monti et al., 2003]; Etc…

23 Within-Cluster Sum of Squares (figure labels: x_i, x_j)

24 Within-Cluster Sum of Squares: Measure of compactness of clusters
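The formula itself did not survive in this transcript. As a rough sketch, assuming the common definition in which W_k is the sum, over all clusters, of the squared Euclidean distances between each point and its cluster mean, the index can be computed as follows. The function name and the toy points are illustrative only.

```python
def within_cluster_sum_of_squares(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        total += sum((p[j] - centroid[j]) ** 2 for p in members for j in range(dim))
    return total

# Hypothetical data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
w_two = within_cluster_sum_of_squares(points, [0, 0, 1, 1])  # natural split
w_one = within_cluster_sum_of_squares(points, [0, 0, 0, 0])  # everything together
```

A compact solution gives a small value: here the natural two-cluster split scores far lower than the single cluster.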

25 Using W_k to determine the number of clusters. Idea of L-Curve Method: use the k corresponding to the “elbow” (the most significant increase in goodness-of-fit)
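The elbow is usually read off a plot by eye. One possible way to automate that reading, offered here only as an illustrative heuristic and not as the method used in the talk, is to pick the k where the W_k curve bends most sharply, measured by its discrete second difference. The curve values below are hypothetical.

```python
def elbow_by_second_difference(w_values, k_min=1):
    """Return the k at which the W_k curve bends most sharply.

    w_values[i] is W_k for k = k_min + i. The bend at an interior k is the
    second difference W_{k-1} - 2*W_k + W_{k+1}.
    """
    if len(w_values) < 3:
        raise ValueError("need at least three values to locate an elbow")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(w_values) - 1):
        bend = w_values[i - 1] - 2 * w_values[i] + w_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_min + i, bend
    return best_k

# Hypothetical W_k values for k = 1..6: steep drop up to k = 3, flat afterwards.
w_curve = [1000.0, 600.0, 250.0, 220.0, 200.0, 185.0]
elbow_k = elbow_by_second_difference(w_curve)
```

On noisy curves this heuristic can be unstable, which is one reason the slides move on to indices with a reference model.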

26 Example: Yeast Cell Cycle Dataset, 698 genes and 72 conditions. Five functional classes: the gold solution. Algorithm: K-means with Average Link input (initialization) and Euclidean distance. We want to know how many clusters are predicted by W_k, with K-means as an “oracle”

27 Example

28 Problems with the Use of W_k: No reference clustering solution to compare against, i.e., no model. The values of W_k are not normalized and therefore cannot be compared. In a nutshell: we get values of W_k but we do not quite know how far we are from randomness. Gap Statistics takes care of those problems

29 The Gap Statistics: Based on solid statistical work for the 1-D case, i.e., the objects to be clustered are scalars; takes care of the problems outlined for W_k. Extended to work in higher dimensions: no theory. Validated experimentally

30 Sample Uniformly and at Random: 1. Align with feature axes (data-geometry independent). (Figure labels: Observations; Bounding Box (aligned with feature axes); Monte Carlo Simulations)
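A minimal sketch of this sampling step, assuming the reference data set has as many points as the observations and that each coordinate is drawn independently and uniformly between the observed minimum and maximum of that feature. The function name and the toy observations are hypothetical.

```python
import random

def sample_in_bounding_box(points, rng):
    """Draw len(points) points uniformly from the axis-aligned bounding box."""
    dim = len(points[0])
    lows = [min(p[j] for p in points) for j in range(dim)]
    highs = [max(p[j] for p in points) for j in range(dim)]
    return [
        tuple(rng.uniform(lows[j], highs[j]) for j in range(dim))
        for _ in points
    ]

rng = random.Random(0)  # seeded so the sketch is repeatable
observations = [(0.0, 5.0), (2.0, 7.0), (4.0, 6.0)]
reference = sample_in_bounding_box(observations, rng)
```

Each call yields one Monte Carlo sample; the Gap Statistic repeats this B times.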

31 Computation of the Gap Statistic (formulas as in [Tibshirani, Walther, Hastie 2001])
for b = 1 to B: compute Monte Carlo sample X_{1b}, X_{2b}, …, X_{nb} (n is # obs.)
for k = 1 to K:
  cluster the observations into k groups and compute log W_k
  for b = 1 to B: cluster the M.C. sample into k groups and compute log W_{kb}
  compute $\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W_{kb} - \log W_k$
  compute sd(k), the s.d. of {log W_{kb}}, b = 1,…,B
  set the total s.e. $s_k = \mathrm{sd}(k)\sqrt{1 + 1/B}$
find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
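The procedure on this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the tiny k-means stands in for the clustering “oracle”, W_k is taken as the within-cluster sum of squares around cluster means, and Gap(k), the total s.e. and the stopping rule follow Tibshirani, Walther and Hastie (2001). All function names and the toy data are hypothetical.

```python
import math
import random

def kmeans_labels(points, k, rng, iterations=20):
    """Tiny Lloyd-style k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def log_wk(points, labels):
    """log of the within-cluster sum of squares (assumes it is positive)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = tuple(sum(v) / len(members) for v in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return math.log(total)

def uniform_reference(points, rng):
    """One Monte Carlo sample drawn uniformly from the bounding box."""
    lows = [min(v) for v in zip(*points)]
    highs = [max(v) for v in zip(*points)]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points]

def gap_statistic(points, k_max, n_refs, rng):
    """Return (estimated k, list of Gap(k) for k = 1..k_max)."""
    refs = [uniform_reference(points, rng) for _ in range(n_refs)]
    gaps, errors = [], []
    for k in range(1, k_max + 1):
        log_w = log_wk(points, kmeans_labels(points, k, rng))
        ref_logs = [log_wk(r, kmeans_labels(r, k, rng)) for r in refs]
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)                    # Gap(k)
        errors.append(sd * math.sqrt(1 + 1 / n_refs))    # s_k
    for k in range(1, k_max):
        # gaps[k - 1] is Gap(k); gaps[k] and errors[k] belong to k + 1.
        if gaps[k - 1] >= gaps[k] - errors[k]:
            return k, gaps
    return k_max, gaps

# Hypothetical data: two small, well-separated groups of points.
rng = random.Random(0)
blob = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
points = blob + [(x + 100.0, y) for x, y in blob]
k_hat, gaps = gap_statistic(points, k_max=4, n_refs=10, rng=rng)
```

Because the same reference samples are reused for every k, the Gap values are comparable across k, which is exactly what the raw W_k values lack.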

32 Example: The same experimental setting as for the Within-Cluster Sum of Squares. We want to know whether the Gap Statistics predicts 5 clusters, with K-means as an “oracle”

33 Example

34 Figure of Merit: A purely experimental approach, designed and validated specifically for microarray data

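35 FOM (figure: expression matrix with experiments 1…m and genes 1…n; one experiment e is singled out; the genes are grouped into clusters C_1, …, C_i, …, C_k; R(g,e) is the entry for gene g and experiment e)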

36 FOM
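The FOM formula is not preserved in this transcript. The sketch below assumes the leave-one-condition-out definition of Yeung, Haynor and Ruzzo (2001): for each experiment e, cluster the genes on all other experiments, then measure the root mean squared deviation of R(g,e) from the mean of its cluster, and sum these values over all e. The two clustering functions are hypothetical stand-ins for a real algorithm such as K-means.

```python
import math

def figure_of_merit(matrix, k, cluster_fn):
    """Aggregate FOM: sum over left-out conditions e of FOM(e, k)."""
    n_genes, n_conditions = len(matrix), len(matrix[0])
    total = 0.0
    for e in range(n_conditions):
        # Cluster the genes using every condition except e.
        reduced = [row[:e] + row[e + 1:] for row in matrix]
        labels = cluster_fn(reduced, k)
        # Measure how well each cluster agrees on the left-out condition e.
        squared_error = 0.0
        for c in set(labels):
            values = [matrix[g][e] for g in range(n_genes) if labels[g] == c]
            mean = sum(values) / len(values)
            squared_error += sum((v - mean) ** 2 for v in values)
        total += math.sqrt(squared_error / n_genes)
    return total

def sign_clusters(rows, k):
    """Hypothetical stand-in: split genes by the sign of their mean level."""
    return [0 if sum(row) / len(row) < 0 else 1 for row in rows]

def one_cluster(rows, k):
    """Hypothetical stand-in: put every gene in the same cluster."""
    return [0] * len(rows)

# Hypothetical matrix: rows are genes, columns are experiments.
matrix = [
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0],
    [-2.0, -2.0, -2.0],
    [-2.0, -2.0, -2.0],
]
fom_two = figure_of_merit(matrix, 2, sign_clusters)
fom_one = figure_of_merit(matrix, 1, one_cluster)
```

A low FOM means the clusters found without experiment e still predict the expression levels in e, so the curve of FOM against k is read like the W_k curve.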

37 Example: Same experimental setting as for the Within-Cluster Sum of Squares. We want to know whether FOM indicates 5 clusters in the data set, with K-means as an “oracle”. Hint: look for the elbow in the FOM plot, exactly as for the W_k curve.

38 Example

39 External Validation Measures: Given two partitions of the same dataset, how close are they? Assess the quality of a partition against a given gold standard. External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of Biology, the elements in a cluster must be biologically correlated, i.e., belong to the same functional group of genes

40 Some External Validation Measures: The two partitions must have the same number of classes –Jaccard Index –Minkowski score –Rand Index [Rand 71]. The two partitions can have a different number of classes –The Adjusted Rand Index [Hubert and Arabie 85] –The F measure [van Rijsbergen 79]

41 Some External Validation Measures: Problem with the mentioned indexes: –What is their expected value? In very intuitive terms, if one blindly picks two partitions among the possible partitions of the data, what value of the index should we expect? This is the same problem we had with W_k, which the Gap Statistics addresses.

42 The Adjusted Rand Index: It takes as input two partitions, not necessarily having the same number of classes. –Value 1, its maximum, means perfect agreement –The expected value of the index, i.e., its value on two randomly correlated partitions, is zero. Note 1: the index may take negative values. Note 2: the same property is not shared by the other mentioned indexes, including its relative, the Rand Index –The index must be maximized –We will see some of its uses later

44 Adjusted Rand index: Compare clusters to classes. Consider the number of pairs of objects: same class and same cluster: a; same class and different cluster: c; different class and same cluster: b; different class and different cluster: d

45 Example (Adjusted Rand): Closed form in the paper by Handl et al. (supplementary material)
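As an illustration of the pair counting, the sketch below computes a, b, c and d for two labelings and derives both the Rand Index and the Adjusted Rand Index from them. It uses the pair-count form ARI = 2(ad − bc) / ((a+b)(b+d) + (a+c)(c+d)), which is equivalent to the Hubert and Arabie definition; it is not taken from the Handl et al. supplementary material, and the labelings are hypothetical.

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count object pairs by agreement between class and cluster labels."""
    a = b = c = d = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1   # same class, same cluster
        elif same_class:
            c += 1   # same class, different cluster
        elif same_cluster:
            b += 1   # different class, same cluster
        else:
            d += 1   # different class, different cluster
    return a, b, c, d

def rand_index(a, b, c, d):
    """Fraction of pairs on which the two partitions agree."""
    return (a + d) / (a + b + c + d)

def adjusted_rand_index(a, b, c, d):
    """Rand Index corrected for chance, in its pair-count form."""
    denominator = (a + b) * (b + d) + (a + c) * (c + d)
    if denominator == 0:
        return 1.0  # both partitions are the same trivial partition
    return 2 * (a * d - b * c) / denominator

# Hypothetical gold standard (classes) and clustering solution (clusters).
classes = [0, 0, 1, 1]
clusters = [0, 0, 1, 2]
counts = pair_counts(classes, clusters)
ri = rand_index(*counts)
ari = adjusted_rand_index(*counts)
```

The clustering has three clusters against two classes, which the Adjusted Rand Index handles without any special treatment, and its value is noticeably lower than the plain Rand Index on the same pair counts.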

46 Some Experiments, or on the Need of Benchmark Data Sets

47 How Do I Pick: Distance and Similarity Functions, given algorithm and data set; Algorithm, given data set; Internal Validation Measures, given data set

48 Different Distances, Same Algorithm and Implementation (k-means)

49 Same Distance, Two Different Implementations of the Same Algorithm: not all k-means are equal

50 Performance of Different Algorithms: Precision. Method (clusters, Adjusted Rand): Max K-means Random (5, 0.44); Min K-means Random (5, 0.49); Cast (5, 0.529); K-means Avlink (5, 0.508); Avlink (5, 0.559); Click (8, 0.51)

51 Performance of Different Indexes: Precision

52 Performance of Different Indexes: Precision

53 Performance of Different Indexes: Time. Measure (time in ms): W_k (157672); FOM (3695437); Gap MC (28082500); Gap P (26468125)

54 Performance Evaluation: Which conclusions can one draw from the shown experiments? –Some indication of which distance, algorithm and measure to pick. A much more extensive analysis is needed, with well-designed benchmark datasets

55 Performance Evaluation: Benchmark data sets –Hard to design, in particular for microarrays –Worth the trouble (see Tompa et al., Nature Biotechnology, 2005)

56 One-Stop-Shop Systems for Analysis of Microarray Data

57 MIDAS and MEV: Filtering and data normalization tools; Clustering algorithms (K-means, CAST); Validation measures (FOM); Statistical analysis tools

58 Click and Expander: Data normalization and filtering; Clustering algorithms (in particular Click); Biclustering algorithms; Validation methods; Statistical and visualization tools

59 Visualization Methods for Statistical Analysis of Microarray Data: A system that combines statistical methods and data visualization. Synoptic views and limited navigation on the data are supported

60 Some Issues I Should Have Talked About: Issue 25: Over-expression and Under-expression of genes –Problem: one gene subject to “normal” conditions; same gene subject to “different” conditions. –Question: Are the measured expression levels different? –Significance analysis of microarray data: quite a bit of work; see for instance SAM, http://www-stat.stanford.edu/~tibs/SAM/

61 Advertisement: Second Lipari International Summer School in Bioinformatics and Computational Biology. Where and when: Lipari Island, Italy, June 14-21, 2008. Theme: Biological Networks: Evolution, Interaction and Computation. More info at http://lipari.cs.unict.it/LipariSchool/Bio/index.php

62 Conclusions: Data analysis for microarrays (and not only) is a complicated interactive process with no clear-cut recipe. Reliable tools, or knowledge of their limitations, are a must. GOOD LUCK!!!

