Clustering (1/2) Clustering ? Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other, they share certain properties. e.g Customer Segmentation. Clustering vs. Classification Supervised Learning Unsupervised Learning Not target variable to be predicted.
K-Means (1) Algorithm Step 0 : Select K objects as initial centroids. Step 1 : (Assignment) For each object compute distances to k centroids. Assign each object to the cluster to which it is the closest. Step 2 : (New Centroids) Compute a new centroid for each cluster. Step 3: (Converage) Stop if the change in the centroids is less than the selected covergence criterion. Otherwise repeat Step 1.
K-Means (2) simple example Random Centroids Assignment New Centroids & (Check) Assignment New Centroids & (check) AssignmentCentroids & (check) Input Data
K-Means (5) comparison with EM K-Means Hard Clustering. A instance belong to only one Cluster. Based on Euclidean distance. Not Robust on outlier, value range. EM Soft Clustering. A instance belong to several clusters with membership probability. Based on density probability. Can handle both numeric and nominal attributes. I C1 C2 I C1 C2 0.7 0.3