
1 Marcus Sampaio DSC/UFCG

2 Lecture Notes for Chapter 8
Introduction to Data Mining by Tan, Steinbach, Kumar
Cluster Analysis: Basic Concepts and Algorithms

3 What Is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized

4 Applications of Cluster Analysis
Description
– Group related stocks with similar price fluctuations
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)

5 What Is Not Cluster Analysis?
Supervised classification
– Has class label information
Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
Results of a query
– Groupings are the result of an external specification
Graph partitioning
– Some mutual relevance and synergy, but the areas are not identical

6 Notion of a Cluster Can Be Ambiguous
How many clusters?
[Figure: the same points shown as two clusters, four clusters, and six clusters]

7 Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Note: hierarchical clustering is outside the scope of this course

8 Partitional Clustering
[Figure: original points and a partitional clustering of them]

9 Types of Clusters
– Well-separated clusters
– Center-based clusters
– Contiguous clusters
– Density-based clusters
– Shared property or conceptual clusters
– Clusters described by an objective function

10 Types of Clusters: Well-Separated
Well-separated clusters
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
[Figure: 3 well-separated clusters]

11 Types of Clusters: Center-Based
Center-based clusters
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster
[Figure: 4 center-based clusters]

12 Types of Clusters: Contiguity-Based
Contiguous clusters (nearest neighbor or transitive)
– A cluster is a set of points such that a point in a cluster is transitively closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
[Figure: 8 contiguous clusters]

13 Types of Clusters: Density-Based
Density-based clusters
– A cluster is a dense region of points, separated from other regions of high density by low-density regions
– Used when the clusters are irregular or intertwined, and when noise and outliers are present
[Figure: 6 density-based clusters]

14 Types of Clusters: Conceptual Clusters
Shared property or conceptual clusters
– Finds clusters that share some common property or represent a particular concept
[Figure: 2 overlapping circles]

15 Points as Representations of Instances
How do we represent an instance as a geometric point?
– Points P1 and P2 are closer to each other than either is to P3

  Name      Dept  Course
  Marcus    DSC   Computer Science      (P1)
  Cláudio   DSC   Computer Science      (P2)
  Péricles  DEE   Electric Engineering  (P3)
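One common way to turn such categorical records into geometric points is one-hot encoding: one 0/1 coordinate per (attribute, value) pair. The slides do not name an encoding, so this Python sketch is an assumption:

# One-hot encode categorical attributes so instances become numeric points.
# With this encoding, P1 and P2 share every attribute value and coincide,
# while P3 differs in both attributes, matching the intuition on the slide.
instances = [
    {"dept": "DSC", "course": "Computer Science"},      # P1 (Marcus)
    {"dept": "DSC", "course": "Computer Science"},      # P2 (Cláudio)
    {"dept": "DEE", "course": "Electric Engineering"},  # P3 (Péricles)
]

# Collect the possible values of each attribute, in a fixed order.
attributes = sorted(instances[0])
values = {a: sorted({inst[a] for inst in instances}) for a in attributes}

def to_point(inst):
    """Map an instance to a 0/1 vector, one coordinate per (attribute, value)."""
    return [1.0 if inst[a] == v else 0.0 for a in attributes for v in values[a]]

points = [to_point(inst) for inst in instances]
print(points)  # P1 == P2, and both differ from P3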

16 WEKA
– Represents each instance as a point, with one coordinate per attribute
– Reports the most frequent attribute values in each cluster (the mode, for nominal attributes)

17 Types of Clusters: Objective Function
Clusters defined by an objective function
– Finds clusters that minimize or maximize an objective function
– Enumerating all possible ways of dividing the points into clusters and evaluating the 'goodness' of each potential set of clusters with the given objective function is NP-hard
– Objectives can be global or local
  Hierarchical clustering algorithms typically have local objectives
  Partitional algorithms typically have global objectives

18 Map the clustering problem to a different domain and solve a related problem in that domain
– The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
– Clustering is then equivalent to breaking the graph into connected components, one for each cluster
– The goal is to minimize the edge weight between clusters and maximize the edge weight within clusters

19
Proximity Function        | Centroid | Objective Function
Manhattan (L1)            | median   | Minimize the sum of the L1 distances of each object to its cluster centroid
Squared Euclidean (L2²)   | mean     | Minimize the sum of the squared L2 distances of each object to its cluster centroid
Cosine                    | mean     | Maximize the sum of the cosine similarities of each object to its cluster centroid
Bregman divergence        | mean     | Minimize the sum of the Bregman divergences of each object to its cluster centroid

20 Characteristics of the Input Data Are Important
Type of proximity or density measure
– This is a derived measure, but central to clustering
Sparseness
– Dictates type of similarity
– Adds to efficiency
Attribute type
– Dictates type of similarity
Type of data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
Dimensionality
Noise and outliers
Type of distribution

21 Measuring Clustering Performance
Notice that clustering is a descriptive model, not a predictive model (see Introduction)
– Performance metrics are different from those of supervised classification
We will see the SSE metric later
Other metrics
– Precision
– Recall
– Entropy
– Purity

22 Clustering Algorithms
K-means and one variant (bisecting K-means)

23 K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid → a centroid-based objective function
The number of clusters, K, must be specified
The basic algorithm is very simple

24 K-means Clustering – Details
Proximity function
– Squared Euclidean (L2²)
Type of centroid: mean
– Example of a centroid (mean)
  A cluster containing the three points (1,1), (2,3), and (6,2)
  – Centroid = ((1+2+6)/3, (1+3+2)/3) = (3,2)
A problem of minimization
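A minimal K-means sketch in Python, using the squared Euclidean proximity function and mean centroids just described. The function name, random initialization, and iteration cap are illustrative assumptions, not taken from the slides:

import random

def kmeans(points, k, iterations=100, seed=0):
    """Basic K-means: random initial centroids, then alternate assignment
    and centroid recomputation until convergence (or the iteration limit)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids chosen randomly
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

# The centroid example from the slide: mean of (1,1), (2,3), (6,2) is (3,2).
centroids, _ = kmeans([(1, 1), (2, 3), (6, 2)], k=1)
print(centroids)  # [(3.0, 2.0)]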

25 K-means always converges to a solution
– It reaches a state in which no points are shifting from one cluster to another, and hence the centroids don't change
Initial centroids are often chosen randomly
– The clusters produced vary from one run to another
Complexity is O(n * K * I * d)
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

26 The actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the sum of squared errors (SSE)
– They optimize the SSE for specific choices of the centroids and clusters, rather than for all possible choices

27 Two Different K-means Clusterings
[Figure: the original points, with a sub-optimal clustering and an optimal clustering]


30 Evaluating K-means Clusters
The most common measure is the Sum of Squared Errors (SSE)
– For each point, the error is the distance to the nearest centroid
– To get the SSE, we square these errors and sum them
– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
  One can show that m_i corresponds to the center (mean) of the cluster
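The formula the slide refers to, written out in the textbook's notation:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)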

31 Step 3 forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids
Step 4 recomputes the centroids so as to further minimize the SSE
Given two different sets of clusters, we can choose the one with the smallest error

32 Step 1: Choosing Initial Centroids
When random initialization of centroids is used, different runs of K-means typically produce different total SSEs
– The resulting clusters are often poor
In the next slides, we provide another example of initial centroids, using the same data as in the former example
– Now the solution is suboptimal: the minimum-SSE clustering is not found, i.e., the solution is only a local optimum

33 Step 1: Choosing Initial Centroids
[Figure: K-means iterations starting from poorly chosen initial centroids]


35 How can the problem of choosing good initial centroids be overcome?
Three approaches
– Multiple runs
– A variant of K-means that is less susceptible to initialization problems: bisecting K-means
– Using postprocessing to "fix up" the set of clusters produced

36 Multiple Runs of K-means
Given two different sets of clusters produced by two different runs of K-means, we prefer the one with the smallest SSE
– The centroids of that clustering are a better representation of the points in their clusters
The technique
– Perform multiple runs, each with a different set of randomly chosen initial centroids
– Select the clustering with the minimum SSE
May not work very well
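A sketch of the multiple-runs technique, reusing the kmeans function from the earlier sketch; the sse helper and function names are illustrative assumptions:

def sse(centroids, clusters):
    """Sum of squared Euclidean distances of each point to its centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for c, cl in zip(centroids, clusters)
        for p in cl
    )

def kmeans_multiple_runs(points, k, runs=10):
    """Run K-means several times from different random initial centroids
    and keep the clustering with the smallest SSE."""
    best = None
    for seed in range(runs):
        centroids, clusters = kmeans(points, k, seed=seed)
        err = sse(centroids, clusters)
        if best is None or err < best[0]:
            best = (err, centroids, clusters)
    return best  # (smallest SSE, its centroids, its clusters)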

37 Reducing the SSE with Postprocessing
An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K
– The per-cluster SSEs become smaller, and then the total SSE also becomes smaller
However, in many cases we would like to improve the SSE without increasing the number of clusters

38 One strategy that decreases the total SSE by increasing the number of clusters
– Split a cluster: the cluster with the largest SSE is usually chosen
One strategy that decreases the number of clusters while trying to minimize the increase in total SSE
– Merge two clusters: the clusters with the closest centroids are typically chosen

39 Bisecting K-means
Bisecting K-means algorithm
– Variant of K-means that can produce a partitional or a hierarchical clustering

 1: Initialize the list of clusters to contain the cluster consisting of all points
 2: repeat
 3:   Remove a cluster from the list of clusters
 4:   {Perform several "trial" bisections of the chosen cluster}
 5:   for i = 1 to number of trials do
 6:     Bisect the selected cluster using basic 2-means
 7:   end for
 8:   Select the two clusters from the bisection with the lowest total SSE
 9:   Add these two clusters to the list of clusters
10: until the list of clusters contains K clusters
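A Python sketch of the algorithm above, reusing the kmeans and sse helpers from the earlier sketches. Splitting the cluster with the largest SSE is just one of the selection criteria listed on the next slide:

def centroid(cl):
    """Mean of a cluster's points, coordinate-wise."""
    return tuple(sum(xs) / len(xs) for xs in zip(*cl))

def bisecting_kmeans(points, k, trials=5):
    """Bisecting K-means: repeatedly split one cluster with basic 2-means,
    keeping the trial bisection with the lowest total SSE."""
    clusters = [list(points)]  # start with a single cluster of all points
    while len(clusters) < k:
        # Choose the cluster to split: here, the one with the largest SSE.
        # (Assumes the chosen cluster has at least two points.)
        chosen = max(clusters, key=lambda cl: sse([centroid(cl)], [cl]))
        clusters.remove(chosen)
        # Perform several trial bisections; keep the pair with lowest SSE.
        best = None
        for seed in range(trials):
            cents, parts = kmeans(chosen, 2, seed=seed)
            err = sse(cents, parts)
            if best is None or err < best[0]:
                best = (err, parts)
        clusters.extend(cl for cl in best[1] if cl)  # drop empty splits
    return clusters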

40 There are a number of different ways to choose which cluster to split at each step
– The largest cluster
– The cluster with the largest SSE
– A criterion based on both size and SSE
One can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm

41 Bisecting K-means Example

42 Limitations of K-means
K-means has problems when clusters have differing
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains outliers

43 Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters)]

44 Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters)]

45 Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters)]

46 Overcoming K-means Limitations
One solution is to use many clusters: K-means then finds parts of the true clusters, which must afterwards be put together
[Figure: original points vs. K-means clusters]

47 [Figure: original points vs. K-means clusters]

48 [Figure: original points vs. K-means clusters]

49 Running WEKA's SimpleKMeans
=== Run information ===
Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    evaluate on training data

=== Model and evaluation on training set ===


51 kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 26.0

Cluster centroids:
Cluster 0
  Mean/Mode: sunny mild high FALSE yes
  Std Devs:  N/A N/A N/A N/A N/A
Cluster 1
  Mean/Mode: overcast cool normal TRUE yes
  Std Devs:  N/A N/A N/A N/A N/A

Clustered Instances
0   10 ( 71%)
1    4 ( 29%)


53 The weather.symbolic data set
 0  sunny,hot,high,FALSE,no
 1  sunny,hot,high,TRUE,no
 2  overcast,hot,high,FALSE,yes
 3  rainy,mild,high,FALSE,yes
 4  rainy,cool,normal,FALSE,yes
 5  rainy,cool,normal,TRUE,no
 6  overcast,cool,normal,TRUE,yes
 7  sunny,mild,high,FALSE,no
 8  sunny,cool,normal,FALSE,yes
 9  rainy,mild,normal,FALSE,yes
10  sunny,mild,normal,TRUE,yes
11  overcast,mild,high,TRUE,yes
12  overcast,hot,normal,FALSE,yes
13  rainy,mild,high,TRUE,no

54 DBSCAN
DBSCAN is a density-based algorithm
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
  These are points that are in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point
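A minimal sketch of this point classification (function and parameter names are illustrative; following the textbook's convention, a point's Eps-neighborhood here includes the point itself and the core test uses >= MinPts):

def region_query(points, p, eps):
    """All points within distance Eps of p (including p itself)."""
    return [q for q in points
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def classify(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    labels = {}
    core = {p for p in points if len(region_query(points, p, eps)) >= min_pts}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in region_query(points, p, eps)):
            labels[p] = "border"  # not core, but near a core point
        else:
            labels[p] = "noise"
    return labels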

55 DBSCAN: Core, Border, and Noise Points

56 DBSCAN Algorithm
– Eliminate noise points
– Perform clustering on the remaining points
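A sketch of this two-step view, building on the classify helper above: drop the noise points, then grow each cluster from an unassigned core point, pulling in every core or border point reachable within Eps. This is a simplified reading of DBSCAN, not the textbook's exact pseudocode:

def dbscan(points, eps, min_pts):
    """Cluster = connected component of core points (within Eps of each
    other), plus the border points in their neighborhoods."""
    labels = classify(points, eps, min_pts)
    points = [p for p in points if labels[p] != "noise"]  # eliminate noise
    assignment, next_id = {}, 0
    for p in points:
        if labels[p] != "core" or p in assignment:
            continue
        # Grow a new cluster from this unassigned core point.
        assignment[p], stack = next_id, [p]
        while stack:
            q = stack.pop()
            if labels[q] != "core":
                continue  # border points join but do not expand the cluster
            for r in region_query(points, q, eps):
                if r not in assignment:
                    assignment[r] = next_id
                    stack.append(r)
        next_id += 1
    return assignment  # maps each non-noise point to a cluster id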

57 DBSCAN: Core, Border, and Noise Points
[Figure: original points, and the point types (core, border, noise) at Eps = 10, MinPts = 4]

58 When DBSCAN Works Well
[Figure: original points and the clusters found]
– Resistant to noise
– Can handle clusters of different shapes and sizes

59 When DBSCAN Does NOT Work Well
[Figure: original points, and the clusterings at (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
– Varying densities
– High-dimensional data

60 DBSCAN: Determining Eps and MinPts
The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
Noise points have their k-th nearest neighbor at a farther distance
So, plot the sorted distance of every point to its k-th nearest neighbor
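A sketch of computing the data behind that plot; Eps is then read off where the sorted curve rises sharply. The helper name is an assumption:

import math

def sorted_kth_distances(points, k):
    """For every point, the distance to its k-th nearest neighbor,
    sorted in increasing order (the data behind a k-dist plot)."""
    dists = []
    for p in points:
        # Distances from p to all other points (excluded by identity).
        d = sorted(math.dist(p, q) for q in points if q is not p)
        dists.append(d[k - 1])  # k-th nearest neighbor distance
    return sorted(dists)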

61 Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters

62 Clusters Found in Random Data
[Figure: random points, and the clusters found in them by K-means, DBSCAN, and complete link]

63 Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters

64 Correlation of incidence and proximity matrices for the K-means clusterings of two data sets
[Figure: the two data sets with their clusterings; Corr = -0.9235 and Corr = -0.5810]
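A sketch of the measure behind those numbers, under the textbook's definitions as I read them: the proximity matrix holds pairwise distances, the incidence matrix holds 1 when two points fall in the same cluster, and the two are compared entry-wise with Pearson correlation. With distances as proximities, a good clustering yields a strongly negative correlation, as in the figure:

import math

def matrix_correlation(points, labels):
    """Pearson correlation between the entries of the proximity matrix
    (pairwise distances) and the incidence matrix (1 if same cluster)."""
    prox, inc = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # the upper triangle suffices, by symmetry
            prox.append(math.dist(points[i], points[j]))
            inc.append(1.0 if labels[i] == labels[j] else 0.0)

    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    return corr(prox, inc)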

65 WEKA Clustering Algorithms
– SimpleKMeans
– MakeDensityBasedClusterer
– Cobweb
– EM
– FarthestFirst

