© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

1 What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure labels: Inter-cluster distances are maximized; Intra-cluster distances are minimized]

2 Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
[Figure: Clustering precipitation in Australia]

3 Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same points shown as Two Clusters, Four Clusters, and Six Clusters]

4 Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree

5 Partitional Clustering
[Figure: Original Points; A Partitional Clustering]

6 Hierarchical Clustering
[Figure: Traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Hierarchical Clustering; Non-traditional Dendrogram]

7 Clustering Algorithms
• K-means and its variants
• Density-based clustering
• Hierarchical clustering

8 K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
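The slide calls the basic algorithm very simple but does not spell it out. A minimal sketch of the usual assign-then-update loop follows. It assumes Euclidean distance and randomly sampled initial centroids. The function name `kmeans` and the parameters `init`, `max_iter`, and `seed` are illustrative and do not come from the slides.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: unless `init` is given, start from k distinct random points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

For example, `kmeans(points, 2)` on two well-separated groups of 2-D tuples should give each group its own label. The result in general depends on the initial centroids.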

9 K-means: Example

10 K-means: Example

11 Importance of Choosing Initial Centroids …
[Figure: Original Points; Optimal Clustering; Sub-optimal Clustering]

12 Importance of Choosing Initial Centroids …

13 Importance of Choosing Initial Centroids …

14 Problems with Selecting Initial Points
• If there are K ‘real’ clusters, then the chance of selecting one centroid from each cluster is small.
  – Chance is relatively small when K is large
  – If clusters are the same size, n, then $P = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$
  – For example, if K = 10, then probability = $10!/10^{10} \approx 0.00036$
  – Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
  – Consider an example of five pairs of clusters
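The K = 10 figure can be checked directly. The short sketch below evaluates K!/K^K, assuming equal-sized clusters as the slide does. The function name is illustrative.

```python
from math import factorial

def prob_one_centroid_per_cluster(k):
    """Chance that k randomly chosen centroids land in k distinct equal-sized clusters."""
    # K! orderings put exactly one centroid in each cluster, out of K^K equally likely placements.
    return factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, prob_one_centroid_per_cluster(k))
```

The value falls quickly as K grows, which is why the slide says the chance is small when K is large.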

15 Solutions to Initial Centroids Problem
• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids and then select among these initial centroids
  – Select most widely separated
• Postprocessing
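One possible reading of "select most widely separated" is a greedy farthest-first pick over a larger candidate set. The sketch below is offered under that assumption and is not the authors' stated procedure. The function name and the choice to start from the first candidate are illustrative.

```python
import math

def most_widely_separated(candidates, k):
    """Greedily pick k candidates so that each new pick is far from those already chosen."""
    chosen = [candidates[0]]  # assumption: start from the first candidate
    while len(chosen) < k:
        # Pick the candidate whose nearest already-chosen centroid is farthest away.
        nxt = max(candidates, key=lambda c: min(math.dist(c, s) for s in chosen))
        chosen.append(nxt)
    return chosen
```

A pick like this tends to favor outliers, so it is usually applied to a sample of the data and not to the full set.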

16 Limitations of K-means
• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.

17 Limitations of K-means: Differing Sizes
[Figure: Original Points; K-means (3 Clusters)]

18 Limitations of K-means: Differing Density
[Figure: Original Points; K-means (3 Clusters)]

19 Limitations of K-means: Non-globular Shapes
[Figure: Original Points; K-means (2 Clusters)]

20 Overcoming K-means Limitations
One solution is to use many clusters. K-means then finds parts of clusters, which need to be put together afterwards.
[Figure: Original Points; K-means Clusters]

21 Overcoming K-means Limitations
[Figure: Original Points; K-means Clusters]

22 Overcoming K-means Limitations
[Figure: Original Points; K-means Clusters]

23 Clustering Algorithms
• K-means and its variants
• Density-based clustering
• Hierarchical clustering

24 Density-based clustering

25 Density-based clustering

26 Density-based clustering: DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has at least a specified number of points (MinPts) within Eps
    ▪ These are points that are at the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
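The three point types can be sketched as a small classifier. It assumes Euclidean distance and the common convention that a point is core when at least MinPts points, itself included, lie within Eps. The function name `classify_points` is illustrative.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' for the given Eps and MinPts."""
    n = len(points)
    # Eps-neighborhood of each point (assumption: the point itself is included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    kinds = []
    for i in range(n):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in neighbors[i]):
            kinds.append("border")  # not dense itself, but near a core point
        else:
            kinds.append("noise")
    return kinds
```

This brute-force neighborhood search is quadratic in the number of points, which is fine for illustration only.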

27 DBSCAN: Core, Border, and Noise Points

28 DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
  – Put an edge between all core points that are within Eps of each other.
  – Make each group of connected core points into a separate cluster.
  – Assign each border point to one of its associated clusters.
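The steps above can be sketched as follows. The sketch assumes Euclidean distance, counts a point as core when at least MinPts points (itself included) lie within Eps, and gives a border point the cluster of the first core neighbor it finds. The function name `dbscan` and the use of -1 for noise are illustrative choices.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow one cluster: follow edges between core points within Eps of each other.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if is_core[q] and labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    # Assign each border point to the cluster of one neighboring core point.
    for i in range(n):
        if not is_core[i]:
            for q in neighbors[i]:
                if is_core[q]:
                    labels[i] = labels[q]
                    break
    return labels
```

Points that are neither core nor within Eps of a core point keep the label -1, which corresponds to eliminating the noise points.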

29 Example: DBSCAN
• First point is selected
• All points that are density-reachable from this point form a new cluster
• Second point is selected
• Second cluster is formed
• Third point selected and cluster is formed

30 DBSCAN: Core, Border and Noise Points
[Figure: Original Points; Point types: core, border and noise (Eps = 10, MinPts = 4)]

31 When DBSCAN Works Well
• Resistant to noise
• Can handle clusters of different shapes and sizes
[Figure: Original Points; Clusters]

32 When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: Original Points; (MinPts=4, Eps=9.75); (MinPts=4, Eps=9.92)]

33 DBSCAN: Determining Eps and MinPts
• Idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have the k-th nearest neighbor at farther distance
• So, plot sorted distance of every point to its k-th nearest neighbor
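The values for such a plot can be computed with a short sketch like the one below, which assumes Euclidean distance and does not count a point as its own neighbor. The function name is illustrative, and plotting is left out.

```python
import math

def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    dists = []
    for i, p in enumerate(points):
        # Distances to all other points (the point itself is excluded).
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(others[k - 1])
    return sorted(dists)
```

Plotted against rank, these values stay roughly flat for points inside clusters and rise sharply for noise. The distance at which the rise begins is a candidate for Eps, with k as MinPts.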

34 Clustering Algorithms
• K-means and its variants
• Density-based clustering
• Hierarchical clustering

35 Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

36 Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

37 Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    ▪ Start with the points as individual clusters
    ▪ At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
  – Divisive:
    ▪ Start with one, all-inclusive cluster
    ▪ At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time

38 Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
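The loop above can be sketched in a few lines. For brevity this sketch recomputes cluster proximities from the points on every pass and does not maintain an explicit proximity matrix. It also assumes MIN (single-link) proximity with Euclidean distance. The function name and the `k` stopping parameter are illustrative.

```python
import math

def agglomerative(points, k=1):
    """Merge the two closest clusters until k remain; returns (clusters, merges)."""
    clusters = [[i] for i in range(len(points))]  # each point starts as its own cluster
    merges = []

    def proximity(a, b):
        # Assumption: MIN (single link), the smallest pairwise distance.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters (ties go to the lowest indices).
        d, i, j = min(
            (proximity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges
```

The `merges` list records the sequence of merges and their distances, which is the information a dendrogram displays.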

39 Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: points p1 to p5 and their Proximity Matrix]

40 Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1 to C5 and their Proximity Matrix]

41 Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 and their Proximity Matrix]

42 After Merging
• The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C2 ∪ C5, C3, C4 and their Proximity Matrix, with the entries for C2 ∪ C5 marked “?”]

43 How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
[Figure: points p1 to p5 and their Proximity Matrix; “Similarity?” between two clusters]
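The first four definitions can be written as small functions over two clusters, each given as a list of coordinate tuples. This is a sketch that assumes Euclidean distance as the proximity. The function names are illustrative.

```python
import math

def min_link(a, b):
    """MIN: distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def max_link(a, b):
    """MAX: distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)

def group_average(a, b):
    """Group Average: mean of all pairwise distances between the two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance Between Centroids: distance between the two cluster means."""
    ca = tuple(sum(c) / len(a) for c in zip(*a))
    cb = tuple(sum(c) / len(b) for c in zip(*b))
    return math.dist(ca, cb)
```

On the same pair of clusters the four measures generally differ, with MIN never larger than Group Average and Group Average never larger than MAX.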

48 Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling clusters of different sizes and non-convex shapes
  – Breaking large clusters

49 Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters

50 Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually.
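A minimal sketch of the reordering step follows. The slide does not say how similarity is computed, so the sketch assumes similarity = 1 - d/d_max with Euclidean distance d. The function name is illustrative, and the visual inspection (for example as a heat map) is left out.

```python
import math

def ordered_similarity_matrix(points, labels):
    """Return (order, matrix): point indices sorted by cluster label and the
    similarity matrix with rows and columns in that order."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    # Assumption: scale distances by the largest pairwise distance.
    d_max = max(math.dist(p, q) for p in points for q in points) or 1.0
    matrix = [
        [1 - math.dist(points[i], points[j]) / d_max for j in order]
        for i in order
    ]
    return order, matrix
```

With well-separated clusters the reordered matrix shows blocks of high similarity along the diagonal.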

51 Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: DBSCAN]

52 Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: K-means]

53 Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN]

54 Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes

