Clustering Categorical Data

Name: Clustering Categorical Data
Uploaded: 2017-07-29T08:42:17+00:00
Duration: PTM4S53
Channel: Kathleen Harper
Description: Clustering Categorical Data

Clustering Categorical Data
Pasi Fränti

K-means clustering

Definitions and data
Set of N data points: X = {x1, x2, …, xN}
Partition of the data: P = {p1, p2, …, pM}
Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}

Distance and cost function
Euclidean distance of data vectors:
Mean square error:
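The two formulas are not spelled out in the slide text. As a minimal sketch, assuming the standard definitions (Euclidean distance as the square root of the summed squared coordinate differences, and mean square error as the average squared distance from each point to the centroid of its cluster), they could be computed as follows. The function names and the per-point `labels` list are illustrative assumptions, not taken from the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_square_error(points, centroids, labels):
    """Average squared distance from each point to its assigned centroid.

    labels[i] is the index of the centroid that points[i] belongs to
    (an assumed encoding of the partition).
    """
    total = sum(euclidean(x, centroids[j]) ** 2 for x, j in zip(points, labels))
    return total / len(points)

# Hypothetical toy data: two points near (0, 1), one point at (4, 0).
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
centroids = [(0.0, 1.0), (4.0, 0.0)]
labels = [0, 0, 1]
mse = mean_square_error(points, centroids, labels)
```

Some texts additionally divide the error by the number of dimensions; that normalization is omitted here.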

Clustering result as partition
Partition of data
Cluster prototypes
Illustrated by Voronoi diagram
Illustrated by convex hulls

Duality of partition and centroids
Partition of data: partition by nearest prototype mapping
Cluster prototypes: centroids as prototypes
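The duality above is what a k-means iteration alternates between: given prototypes, the partition follows from nearest prototype mapping; given a partition, the prototypes follow as centroids. A minimal sketch of one such iteration is shown here. The helper names and the empty-cluster policy are assumptions made for illustration.

```python
def nearest_prototype(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans_step(points, centroids):
    """One iteration: partition by nearest prototype, then recompute centroids."""
    labels = [nearest_prototype(x, centroids) for x in points]
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == j]
        # Keep the old prototype if a cluster ends up empty (an assumed policy).
        new_centroids.append(centroid(members) if members else old)
    return labels, new_centroids

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels, centroids = kmeans_step(points, [(1.0, 1.0), (8.0, 1.0)])
```

Repeating `kmeans_step` until the labels stop changing gives the usual k-means loop.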

Categorical data

Categorical clustering
Three attributes: director, actor, genre
t1 (Godfather II): Coppola, De Niro, Crime
t2 (Good Fellas): Scorsese
t3 (Vertigo): Hitchcock, Stewart, Thriller
t4 (N by NW): Grant
t5 (Bishop's Wife): Koster, Comedy
t6 (Harvey):

Categorical clustering
Sample 2-d data: color and shape
[Figure: Model A, Model B, Model C]

Hamming Distance (Binary and categorical data)
Number of different attribute values.
Distance of ( ) and ( ) is 2.
Distance between (toned) and (roses) is 3.
3-bit binary cube: 100->011 has distance 3 (red path); 010->111 has distance 2 (blue path).
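The examples above can be reproduced with a few lines of code. This is a minimal sketch; the function name `hamming` is chosen here for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for u, v in zip(a, b) if u != v)

print(hamming("toned", "roses"))  # 3
print(hamming("100", "011"))      # 3
print(hamming("010", "111"))      # 2
```

The same function works on tuples of categorical attribute values, since it only compares positions for equality.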

K-means variants
Methods: k-modes, k-medoids
Histogram-based methods: k-distributions, k-histograms, k-populations, k-representatives

Entropy-based cost functions
Category utility:
Entropy of data set:
Entropies of the clusters relative to the data:
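The slide's formulas are not reproduced in the text, so the following is only a sketch of one common reading. It assumes that attributes are treated independently (data set entropy is the sum of per-attribute Shannon entropies) and that cluster entropies are weighted by relative cluster size. All function names are illustrative, and category utility is not covered.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def dataset_entropy(rows):
    """Sum of per-attribute entropies (assumes independent attributes)."""
    return sum(attribute_entropy(col) for col in zip(*rows))

def expected_cluster_entropy(rows, labels):
    """Cluster entropies weighted by relative cluster size."""
    n = len(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    return sum(len(m) / n * dataset_entropy(m) for m in clusters.values())

# Hypothetical toy data with two attributes: color and shape.
rows = [("red", "circle"), ("red", "circle"),
        ("blue", "square"), ("blue", "square")]
h_data = dataset_entropy(rows)
h_clusters = expected_cluster_entropy(rows, [0, 0, 1, 1])
```

With this toy data the whole set has 2 bits of entropy, while the two pure clusters have none, so the partition removes all of the uncertainty.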

Iterative algorithms

K-modes clustering Distance function

K-modes clustering Prototype of cluster
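In k-modes, objects are compared by counting mismatching attributes, and the prototype of a cluster is the vector of per-attribute modes. A minimal sketch under those definitions follows; the data and function names are hypothetical.

```python
from collections import Counter

def matching_distance(x, y):
    """Simple matching dissimilarity: count of attributes whose values differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_vector(rows):
    """Most frequent value of each attribute (tie-breaking is left to Counter)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Hypothetical cluster with two attributes: color and shape.
cluster = [("red", "circle"), ("red", "square"), ("blue", "circle")]
prototype = mode_vector(cluster)
```

Unlike a medoid, the mode vector need not coincide with the object that is closest to all others, because each attribute is decided separately.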

K-medoids clustering Prototype of cluster
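Vector with minimal total distance to every other vector.
Vectors: A C E, B C F, B D G
Pairwise distances: (A C E, B C F) = 2; (B C F, B D G) = 2; (A C E, B D G) = 3
Total distances: A C E: 2+3=5; B C F: 2+2=4; B D G: 2+3=5
Medoid: B C F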
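The medoid of the three vectors on this slide can be checked directly. This is a minimal sketch that assumes Hamming distance between vectors, which matches the pairwise values 2, 2 and 3 given on the slide. The function names are illustrative.

```python
def hamming(a, b):
    """Number of attributes on which two vectors differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

def medoid(vectors):
    """Return the vector with the smallest total distance to the others,
    together with the list of total distances."""
    totals = [sum(hamming(v, w) for w in vectors) for v in vectors]
    best = min(range(len(vectors)), key=totals.__getitem__)
    return vectors[best], totals

vectors = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
m, totals = medoid(vectors)
print(m, totals)  # ('B', 'C', 'F') [5, 4, 5]
```

Computing all pairwise distances costs quadratic time in the cluster size, which is the main practical drawback of medoids compared with modes.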

K-medoids Example

K-medoids Calculation

K-histograms
[Figure: example histogram with D 2/3, F 1/3]
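A histogram prototype stores, for each attribute, the relative frequency of every value in the cluster. The sketch below uses hypothetical data chosen so that one attribute yields the frequencies D 2/3 and F 1/3. The object-to-histogram distance shown (one minus the frequency of the object's value, summed over attributes) is one possible choice and is an assumption, not something the slide text confirms.

```python
from collections import Counter

def histograms(rows):
    """Relative frequency of each value, one histogram per attribute."""
    n = len(rows)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*rows)]

def distance_to_histograms(x, hists):
    """Sum over attributes of (1 - relative frequency of x's value)."""
    return sum(1.0 - h.get(v, 0.0) for v, h in zip(x, hists))

# Hypothetical cluster of three objects with two attributes.
cluster = [("B", "D"), ("B", "D"), ("B", "F")]
hists = histograms(cluster)
```

Here `hists[1]` holds D with frequency 2/3 and F with frequency 1/3, and an object whose values never occur in the cluster gets the maximum distance, equal to the number of attributes.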

K-distributions Cost function with ε addition

Example of cluster allocation Change of entropy

Problem of non-convergence

Results with Census dataset

Literature
Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm, IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), , March, 2007.
ACE: K. Chen and L. Liu, The "Best k" for entropy-based categorical data clustering, Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp , Berkeley, USA, 2005.
ROCK: S. Guha, R. Rastogi and K. Shim, "Rock: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp , 200x.
K-medoids: L. Kaufman and P.J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John Wiley & Sons, New York, 1990.
K-modes: Z. Huang, Extensions to k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp , 1998.
K-distributions: Z. Cai, D. Wang and L. Jiang, K-Distributions: A New Algorithm for Clustering Categorical Data, Int. Conf. on Intelligent Computing (ICIC 2007), pp , Qingdao, China, 2007.
K-histograms: Zengyou He, Xiaofei Xu, Shengchun Deng and Bin Dong, K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset, CoRR, abs/cs/ ,
