Presentation on theme: "Cluster Validation" (University at Buffalo, The State University of New York) - Presentation transcript:

1 Cluster Validation
Cluster validation: assess the quality and reliability of clustering results.
Why validation?
- To avoid finding clusters formed by chance
- To compare clustering algorithms
- To choose clustering parameters, e.g., the number of clusters in the K-means algorithm

2 Clusters found in Random Data
(Figure: a set of random points and the clusterings of them found by K-means, DBSCAN, and complete link.)

3 Aspects of Cluster Validation
- External index: compare the clustering results to ground truth (externally known results).
- Internal index: evaluate the quality of clusters without reference to external information, using only the data.
- Reliability of clusters: determine, within a statistical framework, the confidence level at which the clusters are not formed by chance.

4 Comparing to Ground Truth
Notation:
- N: number of objects in the data set;
- P = {P_1, ..., P_m}: the set of "ground truth" clusters;
- C = {C_1, ..., C_n}: the set of clusters reported by a clustering algorithm.
The "incidence matrices":
- both are N x N, with rows and columns corresponding to objects;
- P_ij = 1 if O_i and O_j belong to the same "ground truth" cluster in P, P_ij = 0 otherwise;
- C_ij = 1 if O_i and O_j belong to the same cluster in C, C_ij = 0 otherwise.
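A minimal sketch of the incidence-matrix construction, assuming cluster memberships are given as integer label vectors (the helper name and the example labels below are illustrative, not from the slides):

```python
import numpy as np

def incidence_matrix(labels):
    """Return the N x N incidence matrix of a label vector:
    entry (i, j) is 1 if objects i and j share a cluster label, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

# Hypothetical ground-truth partition P and clustering result C for N = 6 objects.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]

P = incidence_matrix(truth)
C = incidence_matrix(found)
```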

5 External Index
A pair of data objects (O_i, O_j) falls into one of the following categories:
- SS: C_ij = 1 and P_ij = 1 (agree);
- DD: C_ij = 0 and P_ij = 0 (agree);
- SD: C_ij = 1 and P_ij = 0 (disagree);
- DS: C_ij = 0 and P_ij = 1 (disagree).
Rand index: the fraction of agreeing pairs; it may be dominated by DD.
Jaccard coefficient: ignores the DD pairs.
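The two index formulas appear as images on the original slide; their standard definitions are:

```latex
\text{Rand} = \frac{SS + DD}{SS + SD + DS + DD}, \qquad
\text{Jaccard} = \frac{SS}{SS + SD + DS}
```

A quick computation of both, continuing the P and C sketch above:

```python
# Counts over the unordered pairs i < j of the incidence matrices P and C.
iu = np.triu_indices_from(C, k=1)
c, p = C[iu], P[iu]

SS = int(np.sum((c == 1) & (p == 1)))
DD = int(np.sum((c == 0) & (p == 0)))
SD = int(np.sum((c == 1) & (p == 0)))
DS = int(np.sum((c == 0) & (p == 1)))

rand = (SS + DD) / (SS + SD + DS + DD)
jaccard = SS / (SS + SD + DS)
```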

6 Internal Index
"Ground truth" may be unavailable, so use only the data to measure cluster quality:
- Measure the "homogeneity" and "separation" of clusters, e.g., with SSE (the sum of squared errors).
- Calculate the correlation between the clustering result and the distance matrix.

7 Sum of Squared Error
Homogeneity is measured by the within-cluster sum of squares (WSS), which is exactly the objective function of K-means.
Separation is measured by the between-cluster sum of squares (BSS), where |C_i| is the size of cluster i and m is the centroid of the whole data set.
BSS + WSS = constant, and a larger number of clusters tends to result in a smaller WSS.
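The two formulas are shown as images on the original slide; the standard forms, with m_i the centroid of cluster C_i and m the overall centroid, are:

```latex
\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2, \qquad
\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2
```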

8 Sum of Squared Error
(Worked example, shown as a figure: the SSE decomposition into WSS and BSS for points on a number line labeled 1 to 5, computed for K = 1, K = 2 with centroids m1 and m2, and K = 4; m denotes the centroid of the whole data set.)
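A small numerical sketch of the same decomposition, using an illustrative one-dimensional data set rather than the exact numbers from the figure:

```python
import numpy as np

def wss_bss(points, labels):
    """Within- and between-cluster sums of squares for a 1-D data set."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    m = points.mean()                      # centroid of the whole data set
    wss = bss = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        m_i = cluster.mean()               # centroid of cluster i
        wss += np.sum((cluster - m_i) ** 2)
        bss += len(cluster) * (m - m_i) ** 2
    return wss, bss

data = [1, 2, 3, 4, 5]                     # illustrative points
print(wss_bss(data, [0, 0, 0, 0, 0]))      # K = 1: all WSS, zero BSS
print(wss_bss(data, [0, 0, 0, 1, 1]))      # K = 2: WSS + BSS equals the K = 1 total
```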

9 Sum of Squared Error
SSE can also be used to estimate the number of clusters: plot SSE against K and look for a pronounced knee (elbow) in the curve.
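A hedged sketch of that elbow heuristic using scikit-learn's KMeans; the blob data set and the range of K are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: five well-separated blobs.
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# KMeans.inertia_ is the within-cluster sum of squares (WSS / SSE).
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)}

for k, v in sse.items():
    print(f"K={k:2d}  SSE={v:10.1f}")   # look for the knee in this curve
```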

10 Internal Measures: SSE
(Figure: the SSE curve for a more complicated data set, together with the SSE of the clusters found using K-means.)

11 Correlation with Distance Matrix
- Distance matrix: D_ij is the distance between objects O_i and O_j.
- Incidence matrix: C_ij = 1 if O_i and O_j belong to the same cluster, C_ij = 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be calculated.
A high correlation (in magnitude) indicates good clustering; with a distance matrix the correlation is negative, because within-cluster pairs have small distances.
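A minimal sketch of this check, assuming scikit-learn and SciPy are available; the blob data set is illustrative, not the data from the slides:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data and clustering.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

D = squareform(pdist(X))                                 # n x n distance matrix
C = (labels[:, None] == labels[None, :]).astype(float)   # incidence matrix
iu = np.triu_indices_from(D, k=1)                        # the n(n-1)/2 distinct pairs

r, _ = pearsonr(D[iu], C[iu])
print(r)   # strongly negative when clusters are compact and well separated
```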

12 Correlation with Distance Matrix
Given the distance matrix D = {d_11, d_12, ..., d_nn} and the incidence matrix C = {c_11, c_12, ..., c_nn}, the correlation r between D and C is given by the formula shown on the slide (as an image); a reconstruction follows.
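A hedged reconstruction of that formula, assuming it is the standard Pearson correlation over the n(n-1)/2 distinct entries, with d-bar and c-bar the means of those entries:

```latex
r = \frac{\sum_{i<j} (d_{ij} - \bar{d})\,(c_{ij} - \bar{c})}
         {\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2}\;\sqrt{\sum_{i<j} (c_{ij} - \bar{c})^2}}
```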

13 Measuring Cluster Validity Via Correlation
(Figure: correlation of the incidence and proximity matrices for the K-means clusterings of two data sets; Corr = -0.9235 for one and Corr = -0.5810 for the other.)

14 Clusters found in Random Data (revisited)
(Figure: the same random points and the K-means, DBSCAN, and complete-link clusterings shown earlier.)

15 Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to the cluster labels and inspect it visually: a good clustering shows high-similarity blocks along the diagonal, one per cluster. A sketch of this visualization follows.
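A minimal sketch of the reordering and visualization, assuming scikit-learn and matplotlib; the similarity transform 1/(1 + distance) is an illustrative choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Illustrative data and clustering.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Similarity from distance, then reorder rows and columns by cluster label.
S = 1.0 / (1.0 + pairwise_distances(X))
order = np.argsort(labels)
S_ordered = S[np.ix_(order, order)]

plt.imshow(S_ordered, cmap="viridis")
plt.colorbar(label="similarity")
plt.title("Similarity matrix ordered by cluster label")
plt.show()
```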

16 Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
(Figure: ordered similarity matrix for the K-means clustering of random data.)

17 Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
(Figure: ordered similarity matrix for the complete-link clustering of random data.)

18 Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
(Figure: ordered similarity matrix for the DBSCAN clustering of random data.)

19 Reliability of Clusters
We need a framework to interpret any validity measure: for example, if our evaluation measure has the value 10, is that good, fair, or poor?
Statistics provides such a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

20 Statistical Framework for SSE
Example: compare an observed SSE of 0.005 against the SSE of three clusters found in random data.
(Figure: histogram of the SSE over 500 sets of 100 random data points whose x and y values are distributed over the range 0.2 to 0.8; the observed SSE = 0.005 is marked for comparison.)
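A hedged sketch of how such a null distribution and an empirical p-value could be generated, assuming uniform random points and scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Null distribution of SSE: 500 random data sets of 100 points in [0.2, 0.8]^2,
# each clustered into three clusters with K-means.
null_sse = []
for _ in range(500):
    X = rng.uniform(0.2, 0.8, size=(100, 2))
    null_sse.append(KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_)

observed = 0.005   # the SSE from the slide's example
# Empirical p-value: how often random data does at least as well as the observed SSE.
p_value = np.mean(np.array(null_sse) <= observed)
print(p_value)
```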

21 Statistical Framework for Correlation
Correlation of the incidence and distance matrices for the K-means clusterings of the two data sets shown earlier: Corr = -0.9235 and Corr = -0.5810.
(Figure: correlation histogram for random data, against which the observed values can be judged.)

22 Hypergeometric Distribution
Given that M of the N genes in the data set are associated with term T, if we randomly draw n genes from the data set, what is the probability that exactly m of the selected n genes will be associated with T?
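The probability mass function is shown as an image on the slide; the standard hypergeometric form, in the slide's notation, is:

```latex
P(X = m) = \frac{\binom{M}{m}\binom{N - M}{n - m}}{\binom{N}{n}}
```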

23 P-Value
Based on the hypergeometric distribution, the probability of having fewer than m of the selected genes associated with T can be calculated by summing the probabilities of drawing 0, 1, ..., m-1 such genes. The p-value for over-representation is one minus this sum, i.e. the probability of observing m or more associated genes by chance:
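A hedged reconstruction of the missing formula, followed by a minimal computational sketch using scipy.stats.hypergeom (the gene counts are illustrative; note that SciPy's argument order is k, population size, number of tagged genes, sample size):

```latex
p = P(X \ge m) = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N - M}{n - i}}{\binom{N}{n}}
```

```python
from scipy.stats import hypergeom

# Illustrative numbers: N genes in total, M of them annotated with term T,
# and a cluster of n genes of which m carry the annotation.
N, M, n, m = 10000, 200, 50, 8

# Survival function at m - 1 gives P(X >= m) under the hypergeometric null.
p_value = hypergeom.sf(m - 1, N, M, n)
print(p_value)
```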

