
1 AMCS/CS229: Machine Learning Clustering 2 Xiangliang Zhang King Abdullah University of Science and Technology

2 Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

3 The quality of Clustering
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
For cluster analysis, the analogous question is: how do we evaluate the goodness of the resulting clusters?
But clusters are in the eye of the beholder! Then why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

4 Measures of Cluster Validity
Numerical measures applied to judge various aspects of cluster validity can be classified into two types:
- External Index: used to measure the extent to which cluster labels match externally supplied class labels. Examples: purity, Normalized Mutual Information (NMI).
- Internal Index: used to measure the goodness of a clustering structure without respect to external information. Examples: Sum of Squared Errors (SSE), cophenetic correlation coefficient, silhouette coefficient.

5 Cluster Validity: External Index
The class labels are externally supplied (q classes).
Purity: larger purity values indicate better clustering solutions.
Purity of each cluster C_r of size n_r: purity(C_r) = (1/n_r) * max_i n_r^i, where n_r^i is the number of objects in C_r that belong to class i.
Purity of the entire clustering: Purity = sum_r (n_r / n) * purity(C_r), i.e., a weighted average over the clusters, where n is the total number of objects.

6 Cluster Validity: External Index
Purity: worked example (figure in the original slide).
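As a hedged illustration of the purity computation above, here is a minimal Python/NumPy sketch; the toy labels are made up for illustration and do not come from the slides.

```python
import numpy as np

def purity(class_labels, cluster_labels):
    """Purity: each cluster is credited with the size of its majority class."""
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]        # classes of the points in cluster c
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                               # count of the majority class in this cluster
    return total / len(class_labels)

# Hypothetical toy labels: 3 classes, 3 clusters over 10 points
y_class   = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_cluster = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print(purity(y_class, y_cluster))   # (2 + 3 + 3) / 10 = 0.8
```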

7 Cluster Validity: External Index
The class labels are externally supplied (q classes).
NMI (Normalized Mutual Information): a common normalization divides the mutual information by the average of the two entropies, NMI(Omega, C) = I(Omega; C) / [(H(Omega) + H(C)) / 2], where I is the mutual information between the clustering Omega and the class labels C, and H is entropy.

8 Cluster Validity: External Index
NMI (Normalized Mutual Information): larger NMI values indicate better clustering solutions (worked example in the original slide).
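A minimal sketch of computing NMI in Python with scikit-learn; the library choice and the toy labels (the same hypothetical ones as in the purity sketch) are assumptions, not part of the slides.

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical toy labels (same as in the purity sketch)
y_class   = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_cluster = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# By default scikit-learn normalizes I(Omega; C) by the arithmetic mean of H(Omega) and H(C)
print(normalized_mutual_info_score(y_class, y_cluster))   # larger is better, 1.0 = perfect match
```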

9 Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE (Sum of Squared Errors): SSE = sum_r sum_{x in C_r} ||x - m_r||^2, where m_r is the centroid (mean) of cluster C_r.
SSE (or average SSE) is good for comparing two clustering results.
SSE curves w.r.t. various K can also be used to estimate the number of clusters.
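A minimal sketch of producing an SSE-vs-K curve with k-means (scikit-learn and the synthetic data are assumptions for illustration; KMeans.inertia_ is exactly the within-cluster SSE).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration: three well-separated blobs in 2-D
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# SSE for a range of K; KMeans.inertia_ is the within-cluster sum of squared errors
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# SSE always decreases as K grows; look for the "elbow" where the decrease flattens
# to get a rough estimate of the number of clusters.
```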

10 Internal Measures: Cophenetic correlation coefficient
Cophenetic correlation coefficient: a measure of how faithfully a dendrogram preserves the pairwise distances between the original data points. It can also be used to compare two hierarchical clusterings of the data.
(Figure: a small dendrogram over points A-F, with D-F merging at 0.71, A-B at 1.00, E joining at 1.41 and C at 2.50.)
Compute the correlation coefficient between Dist (the original pairwise distances) and CP (the cophenetic distances read off the dendrogram).
Matlab functions: cophenet
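The slide points to Matlab's cophenet; a roughly equivalent Python sketch with SciPy is shown below (the synthetic data and the choice of average linkage are assumptions for illustration).

```python
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)   # illustrative data

dist = pdist(X)                       # "Dist": original pairwise distances
Z = linkage(X, method='average')      # hierarchical clustering (the dendrogram)

# cophenet returns the correlation between the original distances and the
# cophenetic distances ("CP") implied by the dendrogram
c, coph_dists = cophenet(Z, dist)
print(c)   # close to 1 means the dendrogram preserves the pairwise distances well
```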

11 Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

12 Internal Measures: Cohesion and Separation
Cluster cohesion measures how closely related the objects within a cluster are, e.g., the within-cluster SSE or the sum of the weights of all links within the cluster.
Cluster separation measures how distinct or well-separated a cluster is from other clusters, e.g., the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
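For the prototype-based (SSE) view of these two quantities, here is a minimal Python sketch (scikit-learn and the synthetic data are assumptions, not part of the slides). It also illustrates that, with squared Euclidean distances, within-cluster SSE (cohesion) and between-cluster sum of squares (separation) add up to the total sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)             # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

overall_mean = X.mean(axis=0)
wss = 0.0   # cohesion: squared distances of points to their own cluster centroid
bss = 0.0   # separation: squared distances of centroids to the overall mean, weighted by cluster size
for c in np.unique(labels):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    wss += ((members - centroid) ** 2).sum()
    bss += len(members) * ((centroid - overall_mean) ** 2).sum()

tss = ((X - overall_mean) ** 2).sum()
print(wss, bss, wss + bss, tss)   # wss + bss equals the total sum of squares
```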

13 Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation.
For an individual point i:
- Calculate a = average distance of i to the points in its own cluster
- Calculate b = min (average distance of i to the points in another cluster)
The silhouette coefficient for the point is then s(i) = (b - a) / max(a, b).
- Typically between 0 and 1 (it can be negative if a > b).
- The closer to 1, the better.
The average silhouette width can be calculated for a cluster or for a whole clustering.
Matlab functions: silhouette
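The slide points to Matlab's silhouette; a minimal scikit-learn sketch of the per-point coefficients and the average silhouette width (synthetic data assumed for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)             # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_per_point = silhouette_samples(X, labels)   # s(i) = (b - a) / max(a, b) for each point
s_avg = silhouette_score(X, labels)           # average silhouette width of the whole clustering
print(s_avg)
```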

14 Determine the number of clusters by the Silhouette Coefficient
Compare different clusterings by their average silhouette values (figure values partly missing in the source):
- K=4: mean(silh) = (not shown)
- K=3: mean(silh) = (not shown)
- K=5: mean(silh) = 0.527
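A minimal sketch of this selection procedure (scikit-learn and the synthetic data are assumptions; the slide's own numbers come from a different, unspecified dataset): fit k-means for a range of K and keep the K with the largest average silhouette.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # illustrative data

best_k, best_s = None, -1.0
for k in range(2, 8):                                          # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_score(X, labels)
    print(k, s)
    if s > best_s:
        best_k, best_s = k, s
print("chosen K:", best_k)
```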

15 Determine the number of clusters
1. Select the number of clusters K as the one maximizing the average silhouette value of all points
2. Optimize an objective criterion, e.g., the gap statistic of the decrease of SSE w.r.t. K (see the sketch after this list)
3. Model-based methods: optimize a global criterion (e.g., the maximum likelihood of the data)
4. Use clustering methods that do not require setting K, e.g., DBSCAN
5. Prior knowledge.....
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning
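For item 2, a much-simplified sketch of the gap statistic (Tibshirani et al.): compare log(SSE) on the data against log(SSE) on uniform reference data drawn from the data's bounding box. The library choice, number of reference sets, and data are all assumptions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # illustrative data

def sse(data, k):
    """Within-cluster sum of squared errors of a k-means clustering."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

B = 10                                     # number of uniform reference datasets
lo, hi = X.min(axis=0), X.max(axis=0)      # bounding box of the data
for k in range(1, 8):
    log_wk = np.log(sse(X, k))
    ref = [np.log(sse(rng.uniform(lo, hi, size=X.shape), k)) for _ in range(B)]
    gap = np.mean(ref) - log_wk            # Gap(K): large when K captures real structure
    print(k, gap)

# The original rule picks the smallest K with Gap(K) >= Gap(K+1) - s_{K+1},
# where s_{K+1} accounts for the simulation standard error.
```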

16 Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

17 Clustering vs. Classification
Clustering is unsupervised: groups are discovered from the data without predefined class labels. Classification is supervised: a model is learned from labeled training data to predict the class of new objects.
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

18 Problems and Challenges
Considerable progress has been made in scalable clustering methods:
- Partitioning: k-means, k-medoids, CLARANS
- Hierarchical: BIRCH, ROCK, CHAMELEON
- Density-based: DBSCAN, OPTICS, DenClue
- Grid-based: STING, WaveCluster, CLIQUE
- Model-based: EM, SOM
- Spectral clustering
- Affinity Propagation
- Frequent pattern-based: bi-clustering, pCluster
Current clustering techniques do not address all the requirements adequately; clustering is still an active area of research.
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

19 Cluster Analysis
Open issues in clustering:
1. Clustering quality evaluation
2. How to decide the number of clusters?

20 What you should know
- What is clustering?
- How does k-means work?
- What is the difference between k-means and k-medoids?
- What is the EM algorithm? How does it work?
- What is the relationship between k-means and EM?
- How to define inter-cluster similarity in hierarchical clustering? What kinds of options do you have?
- How does DBSCAN work?
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

21 What you should know
- What are the advantages and disadvantages of DBSCAN?
- How to evaluate clustering results?
- How is the number of clusters usually decided?
- What are the main differences between clustering and classification?
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

