Download presentation

Presentation is loading. Please wait.

Published byGiancarlo Hammitt Modified over 2 years ago

1
COMP24111 Machine Learning K-means Clustering Ke Chen

2
COMP24111 Machine Learning 2 Outline Introduction K-means Algorithm Example and Exercise How K-means partitions? K-means Demo Relevant Issues Cluster Validity Conclusion

3
COMP24111 Machine Learning 3 Introduction Partitioning Clustering Approach –a typical clustering analysis approach via iteratively partitioning training data set to learn a partition of the given data space –learning a partition on a data set to produce several non-empty clusters (usually, the number of clusters given in advance) –in principle, optimal partition achieved via minimising the sum of squared distance to its “representative object” in each cluster e.g., Euclidean distance

4
COMP24111 Machine Learning 4 Introduction Given a K, find a partition of K clusters to optimise the chosen partitioning criterion (cost function) o global optimum: exhaustively search all partitions The K-means algorithm: a heuristic method o K-means algorithm (MacQueen’67): each cluster is represented by the centre of the cluster and the algorithm converges to stable centriods of clusters. o K-means algorithm is the simplest partitioning method for clustering analysis and widely used in data mining applications.

5
COMP24111 Machine Learning 5 K-means Algorithm Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation: Initialisation: set seed points (randomly) 1)Assign each object to the cluster with the nearest seed point measured with a specific distance metric 2)Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster) 3)Go back to Step 1), stop when no more new assignment or membership in each cluster no longer change

6
COMP24111 Machine Learning 6 Problem Example Suppose we have 4 types of medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K=2 group of medicine. MedicineWeightpH- Index A11 B21 C43 D54 AB C D

7
COMP24111 Machine Learning 7 Example Step 1: Use initial seed points for partitioning Assign each object to the cluster with the nearest seed point Euclidean distance D C A B

8
COMP24111 Machine Learning 8 Example Step 2: Compute new centroids of the current partition Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

9
COMP24111 Machine Learning 9 Example Step 2: Renew membership based on new centroids Compute the distance of all objects to the new centroids Assign the membership to objects

10
COMP24111 Machine Learning 10 Example Step 3: Repeat the first two steps until its convergence Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

11
COMP24111 Machine Learning 11 Example Step 3: Repeat the first two steps until its convergence Compute the distance of all objects to the new centroids Stop due to no new assignment Membership in each cluster no longer change

12
COMP24111 Machine Learning 12 Exercise For the medicine data set, use K-means with the Manhattan distance metric for clustering analysis by setting K=2 and initialising seeds as C 1 = A and C 2 = C. Answer three questions as follows: 1.How many steps are required for convergence? 2.What are memberships of two clusters after convergence? 3.What are centroids of two clusters after convergence? Medicine WeightpH- Index A11 B21 C43 D54 AB C D

13
COMP24111 Machine Learning 13 How K-mean partitions? When K centroids are set/fixed, they partition the whole data space into K mutually exclusive subspaces to form a partition. A partition amounts to a Changing positions of centroids leads to a new partitioning. Voronoi Diagram

14
COMP24111 Machine Learning 14 K-means Demo 1.User set up the number of clusters they’d like. (e.g. k=5)

15
COMP24111 Machine Learning 15 K-means Demo 1.User set up the number of clusters they’d like. (e.g. K=5) 2.Randomly guess K cluster Center locations

16
COMP24111 Machine Learning 16 K-means Demo 1.User set up the number of clusters they’d like. (e.g. K=5) 2.Randomly guess K cluster Center locations 3.Each data point finds out which Center it’s closest to. (Thus each Center “owns” a set of data points)

17
COMP24111 Machine Learning 17 K-means Demo 1.User set up the number of clusters they’d like. (e.g. K=5) 2.Randomly guess K cluster centre locations 3.Each data point finds out which centre it’s closest to. (Thus each Center “owns” a set of data points) 4.Each centre finds the centroid of the points it owns

18
COMP24111 Machine Learning 18 K-means Demo 1.User set up the number of clusters they’d like. (e.g. K=5) 2.Randomly guess K cluster centre locations 3.Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points) 4.Each centre finds the centroid of the points it owns 5.…and jumps there

19
COMP24111 Machine Learning 19 K-means Demo 1.User set up the number of clusters they’d like. (e.g. K=5) 2.Randomly guess K cluster centre locations 3.Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points) 4.Each centre finds the centroid of the points it owns 5.…and jumps there 6.…Repeat until terminated!

20
COMP24111 Machine Learning 20 K-means Demo

21
COMP24111 Machine Learning 21 Relevant Issues Efficient in computation –O(tKn), where n is number of objects, K is number of clusters, and t is number of iterations. Normally, K, t << n. Local optimum –sensitive to initial seed points –converge to a local optimum that may be unwanted solution Other problems –Need to specify K, the number of clusters, in advance –Unable to handle noisy data and outliers (K-Medoids algorithm) –Not suitable for discovering clusters with non-convex shapes –Applicable only when mean is defined, then what about categorical data? (K-mode algorithm)

22
COMP24111 Machine Learning 22 Cluster Validity With different initial conditions, the K-means algorithm may result in different partitions for a given data set. Which partition is the “best” one for the given data set? In theory, no answer to this question as there is no ground- truth available in unsupervised learning (ill-posed problem!) Nevertheless, there are several cluster validity criteria to assess the quality of clustering analysis from different perspectives. Prior Knowledge might be required to design/select a cluster validity criterion.

23
COMP24111 Machine Learning 23 Cluster validity criterion Example –Between-cluster distance (D B ): the distance between means of two clusters –Within-cluster distance (D W ): the averaging distance among data points and the mean in a specific cluster –A large ratio of D B : D W suggests good compactness inside clusters and good separability among different clusters! DBDB DWDW DWDW Cluster Validity

24
COMP24111 Machine Learning 24 Conclusion K-means algorithm is a simple yet popular method for clustering analysis Its performance is determined by initialisation and appropriate distance measure There are several variants of K-means to overcome its weaknesses –K-Medoids: resistance to noise and/or outliers –K-Modes: extension to categorical data clustering analysis –CLARA: dealing with large data sets –Mixture models (EM algorithm): handling uncertainty of clusters

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google