# K-Means Clustering Algorithm Mining Lab. 2004 10 27.


Content: Clustering / K-Means / K-Means vs. EM

Clustering (1/2) What is clustering? Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other and share certain properties, e.g. customer segmentation. Clustering vs. classification: classification is supervised learning, while clustering is unsupervised learning with no target variable to be predicted.

Clustering (2/2) Categorization of clustering methods. Partitioning methods: K-Means / K-medoids / PAM / CLARA / CLARANS. Hierarchical methods: CURE / CHAMELEON / BIRCH. Density-based methods: DBSCAN / OPTICS. Grid-based methods: STING / CLIQUE / WaveCluster. Model-based methods: EM / COBWEB / Bayesian / Neural. Model-based clustering is also called statistical or probability-based clustering.

K-Means (1) Algorithm. Step 0: Select K objects as the initial centroids. Step 1 (Assignment): For each object, compute its distance to each of the K centroids and assign it to the cluster whose centroid is closest. Step 2 (New Centroids): Compute a new centroid for each cluster as the mean of its members. Step 3 (Convergence): Stop if the change in the centroids is less than the selected convergence criterion; otherwise repeat from Step 1.
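The steps above can be sketched directly in code. This is an illustrative implementation (the function name, tolerance, and random initialization scheme are my own choices, not from the slides):

```python
import numpy as np

def k_means(X, k, tol=1e-4, max_iter=100, seed=0):
    """Plain K-Means following Steps 0-3 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 0: select k objects from the data as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1 (Assignment): Euclidean distance from each object to each
        # centroid; assign each object to the nearest one.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (New Centroids): mean of each cluster's members
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3 (Convergence): stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

For example, `k_means(np.array([[4.,4],[3,4],[4,2],[0,2],[1,1],[1,0]]), 2)` separates the three upper-right points from the three lower-left ones.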

K-Means (2) Simple example. Starting from randomly chosen centroids over the input data, the algorithm alternates assignment and new-centroid steps, checking for convergence after each pass, until the assignments stop changing.

K-Means (3) Weakness: sensitivity to outliers (noise).

K-Means (4) Worked calculation (K = 2). Without the outlier, the data (4,4), (3,4), (4,2), (0,2), (1,1), (1,0) converge after two iterations to the clusters {(3,4), (4,4), (4,2)} and {(0,2), (1,1), (1,0)}. With an outlier (100,0) added to the same data, the algorithm instead converges to {(0,2), (1,1), (1,0), (3,4), (4,4), (4,2)} and {(100,0)}: the outlier captures a centroid all for itself and collapses the remaining points into a single cluster.
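The slide's calculation can be reproduced with a small script. This is a sketch under the assumption that the initial centroids are the data points (4,4) and (1,1); other starting points give the same final partitions here:

```python
import numpy as np

def run_kmeans(X, init):
    """Tiny K-Means loop; `init` fixes the starting centroids."""
    centroids = np.asarray(init, dtype=float)
    for _ in range(100):
        # Assign each point to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids as cluster means.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

# The six clean points from the slide.
clean = np.array([[4, 4], [3, 4], [4, 2], [0, 2], [1, 1], [1, 0]], dtype=float)
labels_clean = run_kmeans(clean, init=[[4, 4], [1, 1]])

# The same data with the outlier (100, 0) appended.
noisy = np.vstack([clean, [[100, 0]]])
labels_noisy = run_kmeans(noisy, init=[[4, 4], [1, 1]])
```

On the clean data the labels split 3/3 as on the slide; with the outlier, the six normal points end up sharing one label while (100,0) takes the other cluster alone.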

K-Means (5) Comparison with EM. K-Means: hard clustering; an instance belongs to exactly one cluster; based on Euclidean distance; not robust to outliers or differing value ranges. EM: soft clustering; an instance belongs to several clusters, each with a membership probability; based on probability density; can handle both numeric and nominal attributes. For example, for an instance I, K-Means assigns I wholly to cluster C1, whereas EM might assign I to C1 with probability 0.7 and to C2 with probability 0.3.
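The hard/soft contrast can be made concrete with a few lines. As an illustration (not from the slides), take the two cluster centroids from the earlier example, a point between them, and compute EM-style responsibilities assuming isotropic unit-variance Gaussian components with equal priors:

```python
import numpy as np

# Hypothetical centroids: the means of the two clusters from the worked
# example, plus a point x lying between them.
c = np.array([[11/3, 10/3], [2/3, 1.0]])
x = np.array([2.0, 3.0])

# Hard assignment (K-Means): the single nearest centroid wins.
d2 = ((x - c) ** 2).sum(axis=1)   # squared Euclidean distances
hard = d2.argmin()

# Soft assignment (EM-style): responsibilities from equal-prior,
# unit-variance Gaussians, i.e. a softmax of -d^2 / 2.
w = np.exp(-d2 / 2)
soft = w / w.sum()
```

`hard` picks exactly one cluster, while `soft` is a probability vector over both clusters that sums to 1; the nearer centroid still gets the larger share, but the other cluster keeps nonzero membership.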