© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: inter-cluster distances are maximized; intra-cluster distances are minimized]
Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
[Figure: clustering precipitation in Australia]
Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same set of points interpreted as two clusters, four clusters, or six clusters]
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
Partitional Clustering
[Figure: original points and a partitional clustering of them]
Hierarchical Clustering
[Figures: traditional hierarchical clustering with its traditional dendrogram; non-traditional hierarchical clustering with its non-traditional dendrogram]
Clustering Algorithms
• K-means and its variants
• Density-based clustering
• Hierarchical clustering
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
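The basic algorithm described above can be sketched in a few lines of NumPy: pick K initial centroids, assign each point to its closest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. This is an illustrative sketch (the function name, seed handling, and iteration cap are choices of this example, not from the slides).

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable; the algorithm has converged
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch assumes no cluster ever becomes empty; production implementations also handle that case.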
K-means: Example
[Figures: successive iterations of K-means on a sample data set]
Importance of Choosing Initial Centroids
[Figures: for the same original points, one choice of initial centroids leads to the optimal clustering, another to a sub-optimal clustering]
Problems with Selecting Initial Points
• If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.
  – Chance is relatively small when K is large
  – If clusters are the same size, n, then the probability is

        P = (K! n^K) / (Kn)^K = K! / K^K

  – For example, if K = 10, then probability = 10!/10^10 ≈ 0.00036
  – Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
  – Consider an example of five pairs of clusters
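The slide's estimate is easy to verify: with K equal-sized clusters, the probability that K randomly chosen initial centroids land one in each cluster reduces to K!/K^K.

```python
import math

def prob_one_centroid_per_cluster(k):
    # P = K! * n^K / (Kn)^K = K! / K^K  (the n's cancel)
    return math.factorial(k) / k**k

print(round(prob_one_centroid_per_cluster(10), 5))  # 0.00036
```

So even for only 10 clusters, fewer than 4 in 10,000 random initializations start with one centroid per cluster.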
Solutions to Initial Centroids Problem
• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than K initial centroids and then select among these initial centroids
  – Select the most widely separated
• Postprocessing
Limitations of K-means
• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters)]
Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters)]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters)]
Overcoming K-means Limitations
• One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together afterwards.
[Figure: original points vs. K-means clusters]
Overcoming K-means Limitations
[Figures: using many clusters on the differing-density and non-globular examples; original points vs. K-means clusters]
Clustering Algorithms
• K-means and its variants
• Density-based clustering
• Hierarchical clustering
Density-based Clustering
[Figures: examples of density-based clusters]
Density-based Clustering: DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
      These are points that are in the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
[Figure: points labeled as core, border, and noise for a given Eps and MinPts]
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
  – Connect with an edge all pairs of core points that are within Eps of each other.
  – Make each group of connected core points into a separate cluster.
  – Assign each border point to one of the clusters of its associated core points.
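A compact sketch of these steps in NumPy: mark the core points, grow each cluster by following chains of connected core points, attach border points along the way, and leave everything else labeled as noise (-1). The sketch uses the common ">= MinPts" convention for core points (the slide says "more than"); function and variable names are illustrative.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dist <= eps                   # Eps-neighborhoods (include self)
    core = neighbors.sum(axis=1) >= min_pts   # core points: dense neighborhoods
    labels = np.full(n, -1)                   # -1 = noise until proven otherwise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            for k in np.nonzero(neighbors[j])[0]:
                if labels[k] == -1:
                    labels[k] = cluster       # core or border point joins
                    if core[k]:
                        stack.append(k)       # only core points keep expanding
        cluster += 1
    return labels
```

Points still labeled -1 at the end are exactly the noise points the slide says to eliminate.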
Example: DBSCAN
• First point is selected; all points that are density-reachable from this point form a new cluster
• Second point is selected; a second cluster is formed
• Third point is selected and a cluster is formed
DBSCAN: Core, Border and Noise Points
[Figure: original points and their point types (core, border, noise) for Eps = 10, MinPts = 4]
When DBSCAN Works Well
[Figure: original points and the resulting clusters]
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figures: original points clustered with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
• Varying densities
• High-dimensional data
DBSCAN: Determining Eps and MinPts
• Idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have their k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearest neighbor
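The sorted k-dist curve described above is straightforward to compute: for each point take the distance to its k-th nearest neighbor, then sort those distances. A sharp "knee" in the resulting curve is a candidate value for Eps. Pure-NumPy sketch; the function name is illustrative.

```python
import numpy as np

def sorted_kdist(points, k):
    # Pairwise distances; sorting each row puts the point itself (distance 0)
    # in column 0, so column k is the distance to the k-th nearest neighbor.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dist.sort(axis=1)
    kdist = dist[:, k]
    return np.sort(kdist)[::-1]   # largest (noise-like) distances first
```

Plotting the returned array (e.g. with matplotlib) and reading off the distance at the knee gives Eps; k itself becomes MinPts.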
Clustering Algorithms
• K-means and its variants
• Density-based clustering
• Hierarchical clustering
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
      Start with the points as individual clusters
      At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
      Start with one, all-inclusive cluster
      At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
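The numbered steps above transcribe almost directly into code. This sketch uses single link (MIN) as the cluster proximity and stops at a requested number of clusters instead of running down to one; it is an illustrative O(n^3) version, not an optimized reference implementation.

```python
import numpy as np

def agglomerative_min(points, num_clusters=1):
    # Step 2: start with each point as its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]
    # Step 1: the point-to-point proximity (distance) matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    while len(clusters) > num_clusters:          # steps 3-6: repeat until done
        # Step 4: find the two closest clusters under MIN (single-link) proximity.
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 5: merge them; cluster-level proximities are recomputed on demand.
        clusters[a] += clusters.pop(b)
    return clusters
```

Swapping `min` for `max` or a mean in the inner loop gives complete-link or group-average clustering, which is exactly the sense in which the proximity definition distinguishes the algorithms.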
Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: points p1..p5 and their proximity matrix]
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1..C5 and their proximity matrix]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1..C5 and their proximity matrix, with C2 and C5 highlighted]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C2 U C5, C3, C4; the proximity-matrix entries involving the merged cluster C2 U C5 are unknown (“?”)]
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
[Figures: the proximity matrix for points p1..p5, with each definition of cluster similarity illustrated in turn]
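The first four proximity definitions (MIN, MAX, group average, distance between centroids) can be computed for two clusters of points in a few lines of NumPy; the function name is illustrative.

```python
import numpy as np

def cluster_distances(A, B):
    # All pairwise distances between points of cluster A and cluster B.
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        "MIN (single link)": pair.min(),       # closest pair of points
        "MAX (complete link)": pair.max(),     # farthest pair of points
        "Group Average": pair.mean(),          # average over all pairs
        "Centroid": np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)),
    }
```

MIN tends to handle non-elliptical shapes but is sensitive to noise; MAX is less sensitive to noise but tends to break large clusters, which is one reason different linkage choices define different algorithms.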
Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling different sized clusters and convex shapes
  – Breaking large clusters
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: a good clustering produces sharp diagonal blocks in the reordered similarity matrix]
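Reordering the similarity matrix by cluster label is a one-liner once the labels are known: sort the points by label and index the matrix by that order. The similarity function here (1 / (1 + Euclidean distance)) is an illustrative choice, not prescribed by the slides.

```python
import numpy as np

def ordered_similarity(points, labels):
    # Group points of the same cluster next to each other.
    order = np.argsort(labels, kind="stable")
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    sim = 1.0 / (1.0 + dist)                  # high similarity = small distance
    return sim[np.ix_(order, order)]          # reorder both rows and columns
```

Displaying the result (e.g. `plt.imshow(ordered_similarity(X, labels))`) shows bright diagonal blocks when the clustering is good, and a washed-out matrix when it is not.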
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: reordered similarity matrix for DBSCAN on random data]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: reordered similarity matrix for K-means on random data]
Using Similarity Matrix for Cluster Validation
[Figure: reordered similarity matrix for DBSCAN]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
(Algorithms for Clustering Data, Jain and Dubes)