Data Mining – Algorithms: K Means Clustering


1 Data Mining – Algorithms: K Means Clustering
Chapter 4, Section 4.8

2 K Means Clustering
K is the number of clusters
K must be specified in advance (option or parameter to the algorithm)
Develops "cluster centers" (centroids)
Starts with random center points
Puts each instance into the "closest" cluster, based on Euclidean distance (see the sketch below)
Creates new centers based on the instances included in each cluster
Refines iteratively until no instance changes cluster
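A minimal Python sketch (not from the slides) of the assignment step just described: each instance goes to the cluster whose centroid is closest by Euclidean distance. The function names are illustrative.

import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_cluster(instance, centroids):
    """Index of the centroid nearest to the given instance."""
    distances = [euclidean(instance, c) for c in centroids]
    return distances.index(min(distances))

# Example: two centroids, one instance
centroids = [[0.0, 0.0], [10.0, 10.0]]
print(closest_cluster([1.0, 2.0], centroids))   # 0 – closer to the first centroid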

3 Example See bankrawnumericKMeansVersion2.xls

4 Pseudo-code for K Means Clustering
Loop through K times
    current centroid = randomly generate values for each attribute
Done = False
All instances' cluster = none
WHILE not Done
    Total distance = 0
    Done = True
    For each instance
        instance's previous cluster = instance's cluster
        measure Euclidean distance to each centroid
        find smallest distance and assign instance to that cluster
        if new cluster != previous cluster then Done = False
        add smallest distance to total distance
    Report total distance
    For each cluster
        loop through attributes
            loop through instances assigned to cluster
                update totals
            calculate average for attribute for cluster – producing new centroid
END WHILE
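A runnable Python sketch of the pseudo-code above (illustrative only, not the book's or WEKA's code). Centroids are seeded with random values per attribute and refined until no instance changes cluster; k_means is a hypothetical name, and restricting the random seeds to each attribute's observed range is an implementation choice of this sketch.

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(instances, k, seed=None):
    rng = random.Random(seed)
    n_attrs = len(instances[0])
    # Loop through K times: randomly generate values for each attribute
    # (here constrained to the observed range of each attribute)
    lows = [min(row[j] for row in instances) for j in range(n_attrs)]
    highs = [max(row[j] for row in instances) for j in range(n_attrs)]
    centroids = [[rng.uniform(lows[j], highs[j]) for j in range(n_attrs)]
                 for _ in range(k)]

    clusters = [None] * len(instances)    # all instances' cluster = none
    done = False
    while not done:                       # WHILE not Done
        total_distance = 0.0
        done = True
        for i, inst in enumerate(instances):
            previous = clusters[i]
            distances = [euclidean(inst, c) for c in centroids]
            smallest = min(distances)
            clusters[i] = distances.index(smallest)
            if clusters[i] != previous:   # any change means another pass is needed
                done = False
            total_distance += smallest
        print("total distance:", total_distance)   # Report total distance

        # Recompute each centroid as the attribute-wise average of its members
        for c in range(k):
            members = [inst for inst, cl in zip(instances, clusters) if cl == c]
            if members:                   # leave an empty cluster's centroid alone
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centroids

# Example with two obvious groups of points
data = [[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]]
print(k_means(data, k=2, seed=1))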

5 K Means Clustering
Simple and effective
The minimum reached is a local minimum
No guarantee that the total Euclidean distance is a global minimum
Final clusters are quite sensitive to the initial (random) cluster centers
This is true for all practical clustering techniques (since they are greedy hill climbers)
Common to run the algorithm several times and manually choose the best final result, i.e. the one with the smallest total Euclidean distance (see the sketch below)
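A hedged sketch of the "run several times and keep the best" idea, assuming the k_means() and euclidean() sketches from the previous slide are available; best_of_n_runs and total_distance are hypothetical helpers.

def total_distance(instances, clusters, centroids):
    """Sum of each instance's Euclidean distance to its assigned centroid."""
    return sum(euclidean(inst, centroids[c]) for inst, c in zip(instances, clusters))

def best_of_n_runs(instances, k, n_runs=10):
    """Run k_means with several random seeds and keep the smallest total distance."""
    best = None
    for seed in range(n_runs):
        clusters, centroids = k_means(instances, k, seed=seed)
        d = total_distance(instances, clusters, centroids)
        if best is None or d < best[0]:
            best = (d, clusters, centroids)
    return best   # (smallest total distance, its clusters, its centroids)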

6 Let’s run WEKA on this …

7 WEKA – Take-Home
Number of iterations: 2
Within cluster sum of squared errors:
Cluster centroids:
    Cluster 0   Mean/Mode:   Std Devs:
    Cluster 1   Mean/Mode:   Std Devs:
Clustered Instances  ( 25%)  ( 75%)
This was with the default k = 2 (2 clusters)
It only had to loop twice
The sum of Euclidean distances is shown
Means (and SDs) for each attribute, for each cluster, are shown
The number of instances in each cluster is shown
You can visualize the clusters (right click on the result list) <DO>
You can change the number of clusters generated <DO>
You can change the random seed to see how the results differ <DO>
WEKA doesn't give you a list of which instance is in which cluster – but you can add one to the arff file via the Preprocess tab – Filters.Unsupervised.Attribute.AddCluster (a rough analogue is sketched below)
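WEKA's AddCluster filter appends each instance's cluster as a new attribute. As a rough Python analogue, for intuition only (not WEKA itself, and assuming the earlier k_means() sketch and its example data), the idea is just to tack the cluster label onto each row:

def add_cluster_attribute(instances, clusters):
    """Append a 'clusterN' label to each instance, mimicking what AddCluster produces."""
    return [row + ["cluster%d" % c] for row, c in zip(instances, clusters)]

clusters, centroids = k_means(data, k=2, seed=1)   # from the earlier sketch
for row in add_cluster_attribute(data, clusters):
    print(row)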

8 Numeric Attributes
Simple K Means is designed for numeric attributes
For nominal attributes, the similarity measurement has to be all or nothing
The centroid uses the mode instead of the mean (see the sketch below)
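A minimal sketch of the nominal-attribute handling described above (illustrative names, not WEKA's implementation): per-attribute distance is all or nothing, and a cluster's "centroid" value for a nominal attribute is the most common value (mode) among its members.

from collections import Counter

def nominal_distance(a, b):
    """0 if the two nominal values match, 1 otherwise (all or nothing)."""
    return 0 if a == b else 1

def nominal_centroid_value(values):
    """Mode of the nominal values assigned to a cluster."""
    return Counter(values).most_common(1)[0][0]

print(nominal_distance("male", "male"))              # 0
print(nominal_distance("male", "female"))            # 1
print(nominal_centroid_value(["yes", "no", "yes"]))  # "yes"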

9 End Section 4.8

