Slide 4 What is Clustering? It is an unsupervised learning method (no predefined classes; in our buy_computer example, there is no 'no' or 'yes' label). Imagine you are given a set of data objects for analysis; unlike in classification, the class label of each example is not known. Clustering is the process of grouping the data into classes or clusters so that examples within a cluster have high similarity to one another but are very dissimilar to examples in other clusters. Dissimilarities are assessed based on the attribute values describing the examples. Often, distance measures are used.
Slide 5 Clustering Note: you do not know which type each star is; the stars are unlabelled. You just use the information given in the attributes (or features) of each star.
Slide 6 EE3J2 Data Mining Structure of data Typical real data is not uniformly distributed: it has structure. Variables might be correlated, and the data might be grouped into natural 'clusters'. The purpose of cluster analysis is to find this underlying structure automatically.
Slide 7 Data Structures Clustering algorithms typically operate on either: Data matrix – represents n objects (a.k.a. examples, e.g. persons) with p variables (e.g. age, height, gender, etc.); it is n examples × p variables. Dissimilarity matrix – stores a collection of distances between examples; d(x,y) = difference or dissimilarity between examples x and y. How can dissimilarity d(x,y) be assessed?
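The relationship between the two structures can be sketched in code: given an n × p data matrix, we can derive the n × n dissimilarity matrix from it. This is a minimal sketch assuming numeric variables and Euclidean distance (one common choice; other dissimilarity measures work the same way); the example values are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """data: list of n examples, each a list of p numeric variables.
    Returns the n x n matrix of pairwise Euclidean distances."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist  # the matrix is symmetric: d(x,y) = d(y,x)
    return d

# Hypothetical data matrix: 3 examples with p = 2 variables (e.g. age, height)
points = [[25, 170], [30, 175], [25, 170]]
D = dissimilarity_matrix(points)
```

Note that identical examples (rows 0 and 2 above) get dissimilarity 0, and the diagonal is always 0.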
Slide 8 EE3J2 Data Mining Clusters and centroids In other words: if we assume that the clusters are spherical, then they are determined by their centres. The cluster centres are called centroids. How many centroids do we need? Where should we put them?
Slide 9 Measuring dissimilarity (or similarity) To measure similarity, often a distance function d is used. It measures "dissimilarity" between pairs of objects x and y: a small distance d(x,y) means objects x and y are more similar; a large distance d(x,y) means they are less similar.
Slide 10 Properties of the distance function So, a function d(x,y) defined on pairs of points x and y is called a distance (d) if it satisfies: d(x,y) ≥ 0: distance is a non-negative number. d(x,x) = 0: the distance of an object to itself is 0. d(x,y) = d(y,x) for all points x and y (d is symmetric). d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality).
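The four axioms above can be checked numerically for a concrete distance function. Here is a small sketch using Euclidean distance (introduced on the next slide) on a few sample points; the points themselves are illustrative, not from the slides.

```python
import math

def d(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert d(x, y) >= 0                      # non-negativity
assert d(x, x) == 0                      # distance of an object to itself is 0
assert d(x, y) == d(y, x)                # symmetry
assert d(x, z) <= d(x, y) + d(y, z)      # triangle inequality
```

Passing these checks on a few points does not prove the axioms in general, of course; for Euclidean distance they hold for all points.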
Slide 11 Euclidean Distance The most popular distance measure is Euclidean distance. If x = (x_1, x_2, …, x_N) and y = (y_1, y_2, …, y_N) then: d(x,y) = sqrt( (x_1 − y_1)² + (x_2 − y_2)² + … + (x_N − y_N)² ). This corresponds to the standard notion of distance in Euclidean space.
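The formula translates directly into code. A minimal sketch:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt( (x_1 - y_1)^2 + ... + (x_N - y_N)^2 )"""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# A 3-4-5 right triangle: the distance between the two points is 5
euclidean((1, 2), (4, 6))   # → 5.0
```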
Slide 12 EE3J2 Data Mining Distortion Distortion is a measure of how well a set of centroids models a set of data. Suppose we have: data points y_1, y_2, …, y_T and centroids c_1, …, c_M. For each data point y_t, let c_i(t) be the closest centroid. In other words: d(y_t, c_i(t)) = min_m d(y_t, c_m).
Slide 13 EE3J2 Data Mining Distortion The distortion for the centroid set C = {c_1, …, c_M} is defined by: Dist(C) = Σ_t d(y_t, c_i(t)). In other words, the distortion is the sum of distances between each data point and its nearest centroid. The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised.
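The definition can be sketched directly: for each data point, take the distance to its nearest centroid, and sum these over all points. The sample points and centroids below are hypothetical.

```python
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distortion(data, centroids):
    """Dist(C) = sum over t of d(y_t, c_i(t)), where c_i(t) is the
    centroid nearest to data point y_t."""
    return sum(min(dist(y, c) for c in centroids) for y in data)

data = [(0, 0), (1, 0), (10, 0)]
centroids = [(0.5, 0), (10, 0)]
distortion(data, centroids)   # 0.5 + 0.5 + 0.0 = 1.0
```

Moving a centroid changes the distortion, which is what gives the clustering task below its objective: choose centroid positions that make this sum as small as possible.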
Slide 14 The K-Means Clustering Method Given k, the k-means algorithm is implemented in 4 steps: 1. Partition the objects into k non-empty subsets. 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. mean point, of the cluster). 3. Assign each object to the cluster with the nearest seed point. 4. Go back to Step 2; stop when no more reassignments occur.
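The four steps above can be sketched as a short loop. This is a minimal illustration, not a production implementation: it assumes numeric data, uses a random initial partition, and assumes no cluster ever becomes empty during the iterations (real implementations must handle that case).

```python
import math
import random

def kmeans(data, k, seed=0):
    rng = random.Random(seed)
    # Step 1: partition the objects into k non-empty subsets
    # (here: a balanced random assignment of cluster labels)
    assign = [i % k for i in range(len(data))]
    rng.shuffle(assign)
    while True:
        # Step 2: compute centroids as the mean point of each cluster
        # (assumes every cluster is non-empty)
        centroids = []
        for j in range(k):
            members = [data[i] for i in range(len(data)) if assign[i] == j]
            centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # Step 3: assign each object to the nearest centroid
        new_assign = [min(range(k), key=lambda j: math.dist(x, centroids[j]))
                      for x in data]
        # Step 4: go back to Step 2; stop when no assignment changes
        if new_assign == assign:
            return centroids, assign
        assign = new_assign

# Two well-separated groups of points (hypothetical data)
centroids, labels = kmeans([(0, 0), (0, 1), (10, 10), (10, 12)], 2)
```

On this data the loop converges in a few iterations, with the two nearby pairs ending up in the same clusters.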
Slide 15 15 The K-Means Clustering Method Example
Slide 16 Lets watch an animation! http://r.yihui.name/stat/multivariate_stat/kmeans/index.htm
Slide 17 K-means Clustering Suppose that we have decided how many centroids we need - denote this number by K Suppose that we have an initial estimate of suitable positions for our K centroids K-means clustering is an iterative procedure for moving these centroids to reduce distortion
Slide 18 K-means clustering - notation Suppose there are T data points, denoted by y_1, y_2, …, y_T. Suppose that the initial K centroids are denoted by c_1^0, …, c_K^0. One iteration of K-means clustering will produce a new set of centroids c_1^1, …, c_K^1 such that the distortion of the new set is no larger than that of the old set.
Slide 19 K-means clustering (1) For each data point y_t, let c_i(t) be the closest centroid. In other words: d(y_t, c_i(t)) = min_m d(y_t, c_m). Now, for each centroid c_k^0, define: Y_k^0 = { y_t : i(t) = k }. In other words, Y_k^0 is the set of data points which are closer to c_k^0 than to any other centroid.
Slide 20 K-means clustering (2) Now define a new k-th centroid c_k^1 by: c_k^1 = (1 / |Y_k^0|) Σ_{y ∈ Y_k^0} y, where |Y_k^0| is the number of samples in Y_k^0. In other words, c_k^1 is the average value of the samples which were closest to c_k^0.
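One iteration, in the notation of the last two slides, can be sketched as: form the sets Y_k^0 of points whose nearest centroid is c_k^0, then average each set to get the new centroid c_k^1. The sample points below are hypothetical, and the sketch assumes no Y_k^0 is empty.

```python
import math

def one_iteration(data, centroids):
    # Y_k^0: the set of data points whose nearest centroid is c_k^0
    Y = [[] for _ in centroids]
    for y in data:
        k = min(range(len(centroids)), key=lambda m: math.dist(y, centroids[m]))
        Y[k].append(y)
    # c_k^1 = (1 / |Y_k^0|) * (sum of the points in Y_k^0), i.e. their mean
    return [tuple(sum(v) / len(Yk) for v in zip(*Yk)) for Yk in Y]

new_centroids = one_iteration([(0, 0), (2, 0), (9, 0)], [(1.0, 0.0), (8.0, 0.0)])
# → [(1.0, 0.0), (9.0, 0.0)]
```

Here the first two points fall in Y_1^0 (mean (1.0, 0.0)) and the third point alone in Y_2^0, pulling the second centroid from (8.0, 0.0) to (9.0, 0.0).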
Slide 21 K-means clustering (3) Now repeat the same process starting with the new centroids c_1^1, …, c_K^1 to create a new set of centroids c_1^2, …, c_K^2, and so on, until the process converges. Each new set of centroids has distortion no larger than the previous set.
Slide 22 Comments on the K-Means Method Strength: Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n. Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weakness: Applicable only when a mean is defined (then what about categorical data?). Need to specify k, the number of clusters, in advance. Unable to handle noisy data and outliers. Not suitable for discovering clusters with non-convex shapes.