© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]

Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
[Figure: Clustering precipitation in Australia]

Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree

Hierarchical Clustering
[Figure panels: Traditional Hierarchical Clustering; Non-traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Dendrogram]

K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
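
The slide does not spell out the basic algorithm, so the following is only a minimal sketch of the usual assign-then-update loop, under stated assumptions: initial centroids are drawn at random from the data, distance is Euclidean, and iteration stops when assignments no longer change. The function name `kmeans` and its parameters are illustrative, not taken from the slides.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means sketch. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: pick K distinct data points at random as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this small, well-separated data set the two groups are recovered; on harder data the result depends on the initial centroids.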

Problems with Selecting Initial Points
• If there are K ‘real’ clusters, then the chance of selecting one centroid from each cluster is small.
  – The chance is especially small when K is large
  – If the clusters are all the same size, n, then P = (ways to select one centroid from each cluster) / (ways to select K centroids) = K! n^K / (Kn)^K = K! / K^K
  – For example, if K = 10, then probability = 10!/10^10 = 0.00036
  – Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they won’t
  – Consider an example of five pairs of clusters
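
The K = 10 figure can be checked numerically. This small sketch assumes K equal-sized clusters and K centroids drawn independently at random, in which case the probability of landing one centroid in each cluster is K!/K^K, consistent with the 10!/10^10 example above. The function name is illustrative.

```python
import math

def prob_one_centroid_per_cluster(k):
    """Probability that k random centroids fall in k distinct equal-sized clusters."""
    return math.factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, prob_one_centroid_per_cluster(k))
```

The value drops quickly as K grows, which is why a single random initialization is unlikely to be a good one for large K.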

Solutions to Initial Centroids Problem
• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than K initial centroids and then select among these initial centroids
  – Select the most widely separated
• Postprocessing
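
One way to read "select more than K initial centroids, then select the most widely separated" is a greedy farthest-first pass over a random candidate pool. The slides do not prescribe a selection rule, so the greedy rule, the function name, and the `n_candidates` parameter below are all assumptions made for illustration.

```python
import math
import random

def widely_separated_centroids(points, k, n_candidates, seed=0):
    """Draw n_candidates > k points, then greedily keep k that are far apart."""
    rng = random.Random(seed)
    candidates = rng.sample(points, n_candidates)
    chosen = [candidates[0]]  # start from an arbitrary candidate
    while len(chosen) < k:
        # Keep the candidate whose nearest already-chosen centroid is farthest away.
        best = max(candidates,
                   key=lambda c: min(math.dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
chosen = widely_separated_centroids(points, k=2, n_candidates=6)
print(chosen)
```

Because farthest-first selection favors extreme points, it can pick outliers as centroids; that is one reason the slide also lists other remedies.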

Limitations of K-means
• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.

Overcoming K-means Limitations
[Figure panels: Original Points; K-means Clusters]
• One solution is to use many clusters. K-means then finds parts of the natural clusters, and these parts need to be put together afterward.

Density-based clustering: DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
    ▪ These are points that are at the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
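
The three point types can be illustrated with a brute-force sketch. Two conventions here are assumptions rather than something the slide fixes: a point counts as its own neighbor, and a point is treated as core when it has at least `min_pts` points within `eps` (conventions differ on "more than" versus "at least"). The function name is illustrative.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Neighborhood of each point: indices within distance eps (including itself).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not dense itself, but near a core point
        else:
            kinds.append("noise")
    return kinds

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
kinds = classify_points(points, eps=1.0, min_pts=3)
print(kinds)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```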

DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
  – Connect with an edge all core points that are less than Eps from each other.
  – Make each group of connected core points into a separate cluster.
  – Assign each border point to one of its associated clusters.
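
The steps above can be sketched as a brute-force procedure: find core points, group connected core points with a flood fill, then attach border points. As assumptions for this sketch, a point counts as its own neighbor, "core" means at least `min_pts` points within `eps`, noise is labeled -1, and a border point joins the cluster of the first core neighbor found. None of these names or tie-breaking choices come from the slides.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n  # everything starts as noise
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Flood fill: core points within eps of each other share a cluster.
        labels[i] = cluster
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in neighbors[cur]:
                if is_core[j] and labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    # Border points take the cluster of the first core point in their neighborhood.
    for i in range(n):
        if not is_core[i]:
            for j in neighbors[i]:
                if is_core[j]:
                    labels[i] = labels[j]
                    break
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8),
          (20, 20), (20, 21), (21, 20)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
# [0, 0, 0, 0, 0, -1, 1, 1, 1]
```

The neighbor search here is quadratic in the number of points; it is meant to show the logic, not to scale.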

Example: DBSCAN
• First point is selected
• All points that are density-reachable from this point form a new cluster
• Second point is selected
• Second cluster is formed
• Third point is selected and its cluster is formed

DBSCAN: Determining Eps and MinPts
• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have the kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
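
The values behind such a plot can be computed directly. This sketch returns the sorted kth-nearest-neighbor distances using Euclidean distance and a brute-force search; the function name is illustrative, and plotting is left out so the example has no dependencies.

```python
import math

def sorted_k_distances(points, k):
    """Distance from every point to its kth nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
k_dists = sorted_k_distances(points, k=2)
print(k_dists)
```

Here the four clustered points share a small 2nd-neighbor distance while the isolated point sits far above them; a sharp rise of this kind in the sorted curve suggests a candidate value for Eps, with MinPts taken as k.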

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    ▪ Start with the points as individual clusters
    ▪ At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    ▪ Start with one, all-inclusive cluster
    ▪ At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time

Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
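
The merge loop can be sketched in a few lines. To stay short, this version departs from the slide in two labeled ways: it recomputes cluster proximities from the raw points at every step instead of maintaining and updating a proximity matrix, and it fixes the cluster proximity to MIN (single link), which is only one of the possible definitions. The function name and the `k` stopping parameter are illustrative.

```python
import math

def agglomerative(points, k=1):
    """Single-link agglomerative clustering sketch.

    Returns (merges, clusters); clusters hold indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # each point starts as a cluster
    merges = []

    def proximity(a, b):
        # MIN / single link: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the two closest clusters (recomputed here, not read from a matrix).
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda xy: proximity(clusters[xy[0]], clusters[xy[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges, clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges, clusters = agglomerative(points, k=2)
print(clusters)
```

The recorded `merges` list is the information a dendrogram would display; stopping at `k=1` yields the full hierarchy.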

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 and their proximity matrix]

After Merging
• The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C2 U C5, C3, and C4; the proximity matrix entries involving C2 U C5 are marked “?”]

How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
[Figure: points p1 to p5 and their proximity matrix]
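
The first four definitions can be written down concretely. This sketch assumes proximity is measured as Euclidean distance (so smaller means closer) and that each cluster is a list of coordinate tuples; the function names are illustrative.

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def min_link(a, b):
    return min(pairwise(a, b))       # MIN: closest pair of points

def max_link(a, b):
    return max(pairwise(a, b))       # MAX: farthest pair of points

def group_average(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)           # mean over all cross-cluster pairs

def centroid_distance(a, b):
    ca = [sum(c) / len(a) for c in zip(*a)]
    cb = [sum(c) / len(b) for c in zip(*b)]
    return math.dist(ca, cb)         # distance between the cluster means

a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
print(min_link(a, b), max_link(a, b), group_average(a, b), centroid_distance(a, b))
```

For these two clusters MIN and the centroid distance are both 3, MAX is the diagonal distance, and the group average falls between them.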

Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling different sized clusters and convex shapes
  – Breaking large clusters

Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters

Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
