A Genetic Algorithm Approach to K-Means Clustering


1 A Genetic Algorithm Approach to K-Means Clustering
Craig Stanek CS401 November 17, 2004

2 What Is Clustering? “partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: Each cluster has instances that are very similar (or “near”) to each other, and The instances in each cluster are very different (or “far away”) from the instances in the other clusters” --Alex A. Freitas, “Data Mining and Knowledge Discovery with Evolutionary Algorithms”

3 Why Cluster? Segmentation and Differentiation

4 Why Cluster? Outlier Detection

5 Why Cluster? Classification

6 K-Means Clustering
1. Specify K clusters
2. Randomly initialize K "centroids"
3. Assign each data instance to the closest cluster, according to its distance from each centroid
4. Recalculate the cluster centroids
5. Repeat steps (3) and (4) until no data instance moves to a different cluster
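The steps above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the author's code; `data` is assumed to be an n-by-d array of instances.

```python
import numpy as np

def k_means(data, k, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k distinct data instances as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = None
    while True:
        # Step 3: assign each instance to the nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once no instance changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its cluster
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = data[labels == i].mean(axis=0)
```

Because the initial centroids are sampled at random, different seeds can converge to different local optima, which is exactly the drawback the next slide raises.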

7 Drawbacks of K-Means Algorithm
Converges to a local rather than a global optimum
Sensitive to the initial choice of centroids
K must be chosen a priori
Minimizes intra-cluster distance but does not consider inter-cluster distance

8 Problem Statement
Can a Genetic Algorithm approach do better than the standard K-means algorithm?
Is there an alternative fitness measure that can take into account both intra-cluster similarity and inter-cluster differentiation?
Can a GA be used to find the optimum number of clusters for a given data set?

9 Representation of Individuals
Randomly generated number of clusters
Medoid-based integer string (each gene is a distinct data instance)
Example: 58 113 162 23 244
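Under this encoding, creating an individual amounts to sampling a random number of distinct instance indices to serve as medoids. A minimal sketch (the function name and the K bounds are illustrative assumptions, not taken from the slides):

```python
import random

def random_individual(n_instances, k_min=2, k_max=10, rng=random):
    # Chromosome length (number of clusters) is itself chosen at random,
    # matching the slide's "randomly generated number of clusters".
    k = rng.randint(k_min, k_max)
    # Each gene is the index of a distinct data instance acting as a medoid.
    return rng.sample(range(n_instances), k)
```

Sampling without replacement guarantees the genes are distinct, as the representation requires.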

10 Genetic Algorithm Approach
Why Medoids?

13 Recombination
Parent #1: 36 108 82
Parent #2: 5 80 147
Child #1: 5 82 80
Child #2: 36 108 147
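Because a medoid chromosome is an unordered set of distinct instance indices, crossover has to preserve distinctness within each child. One plausible sketch is a pool-and-resample operator, pooling both parents' genes and drawing each child from the pool; the slide does not spell out the exact mechanics, so this is an assumption.

```python
import random

def recombine(p1, p2, rng=random):
    # Pool the parents' distinct genes, then draw each child (without
    # replacement) from the pool, keeping the parents' lengths.
    pool = list(set(p1) | set(p2))
    child1 = rng.sample(pool, min(len(p1), len(pool)))
    child2 = rng.sample(pool, min(len(p2), len(pool)))
    return child1, child2
```

Sampling without replacement from the pooled gene set mirrors the slide's example, where each child mixes medoids from both parents and never repeats a medoid.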

14 Fitness Function
Let r_ij represent the jth data instance of the ith cluster and M_i be the medoid of the ith cluster.
Let X = sum over i, j of d(r_ij, M_i)   (total intra-cluster distance)
Let Y = sum over i < k of d(M_i, M_k)   (total distance between cluster medoids)
Fitness = Y / X
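A sketch of this ratio fitness, assuming Euclidean distance, with X taken as the summed distance of each instance to its nearest medoid and Y as the summed pairwise distances between medoids (both summands are assumptions about the exact form on the slide):

```python
import numpy as np
from itertools import combinations

def fitness(data, medoid_idx):
    medoids = data[medoid_idx]
    # Distance from every instance to every medoid
    dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    # X: each instance contributes its distance to its own (nearest) medoid
    x = dists.min(axis=1).sum()
    # Y: pairwise distances between the medoids themselves
    y = sum(np.linalg.norm(medoids[a] - medoids[b])
            for a, b in combinations(range(len(medoids)), 2))
    return y / x
```

Maximizing Y / X rewards tight clusters (small X) that are also far apart (large Y), which is the combined intra/inter-cluster criterion the problem statement asks for.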

15 Experimental Setup
Iris Plant Data (UCI Repository)
150 data instances
4 dimensions
Known classifications: 3 classes, 50 instances of each

16 Experimental Setup Iris Data Set

18 Standard K-Means vs. Medoid-Based EA

                  K-Means   Medoid-Based EA
Total Trials         30            30
Avg. Correct       120.1         134.9
Avg. % Correct     80.1%         89.9%
Min. Correct         77           133
Max. Correct        134           135
Avg. Fitness       78.94         84.00

19 Standard K-Means Clustering
Iris Data Set

20 Medoid-Based EA Iris Data Set

21 Standard Fitness EA vs. Proposed Fitness EA

                  Standard   Proposed
Total Trials         30         30
Avg. Correct       134.9      134.0
Avg. % Correct     89.9%      89.3%
Min. Correct        133        134
Max. Correct        135        135
Avg. Generations    82.7       24.9

22 Fixed vs. Variable Number of Clusters EA

                     Fixed   Variable
Total Trials           30       30
Avg. Correct         134.0
Avg. % Correct       89.3%
Min. Correct          134
Max. Correct
Avg. # of Clusters      3        7

23 Variable Number of Clusters EA
Iris Data Set

24 Conclusions
GA better at obtaining a globally optimal solution
Proposed fitness function shows promise
Difficulty letting the GA determine the "correct" number of clusters on its own

25 Future Work
Other data sets
Alternative fitness functions
Scalability
GA comparison to simulated annealing

