Clustering (1) Outline: clustering; similarity measures; hierarchical clustering; model-based clustering. Figures are from the book Data Clustering by Gan et al.
Clustering Objects in a cluster should: share closely related properties; have small mutual distances; be clearly distinguishable from objects not in the same cluster. A cluster should be a densely populated region surrounded by relatively empty regions. Compact cluster: can be represented by a center. Chained cluster: higher-order structures.
Clustering Types of clustering:
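Similarity measures A metric distance function $d$ should satisfy: nonnegativity, $d(x, y) \ge 0$; identity, $d(x, y) = 0$ if and only if $x = y$; symmetry, $d(x, y) = d(y, x)$; and the triangle inequality, $d(x, z) \le d(x, y) + d(y, z)$.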
Similarity measures Similarity function:
Similarity measures From a dataset, Distance matrix: Similarity matrix:
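As a minimal sketch of how a distance matrix is built from a dataset, assuming Euclidean distance and a small made-up dataset (the names `euclidean`, `distance_matrix`, and `points` are illustrative and not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(data, dist=euclidean):
    """Return the n x n matrix D with D[i][j] = dist(data[i], data[j])."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy dataset: three points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points)
```

For a metric distance the matrix is symmetric with zeros on the diagonal. A similarity matrix can be built the same way by passing a similarity function in place of `dist`.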
Some similarity measures for continuous data Euclidean distance; Manhattan distance; Manhattan segmental distance (using only part of the dimensions).
Some similarity measures for continuous data Maximum distance (sup distance). Minkowski distance: $d(x, y) = \left( \sum_{j} |x_j - y_j|^R \right)^{1/R}$. This is the general case: R = 2 gives the Euclidean distance; R = 1, the Manhattan distance; R = ∞, the maximum distance.
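The special cases of the Minkowski distance can be checked numerically. The sketch below is illustrative only: the function names are made up, and the averaging over the selected dimensions in `manhattan_segmental` is an assumed normalization rather than something stated on the slides.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r (r >= 1); r = math.inf gives the maximum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

def manhattan_segmental(x, y, dims):
    """Manhattan distance over the dimensions in dims only.

    Assumption: the sum is divided by the number of dimensions used.
    """
    return sum(abs(x[j] - y[j]) for j in dims) / len(dims)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)        # |3| + |4|
euclid = minkowski(x, y, 2)           # sqrt(9 + 16)
maximum = minkowski(x, y, math.inf)   # max(|3|, |4|)

u, v = (1.0, 2.0, 3.0), (2.0, 4.0, 9.0)
segmental = manhattan_segmental(u, v, dims=(0, 1))  # third dimension ignored
```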
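Some similarity measures for continuous data Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$, where $\Sigma$ is the covariance matrix of the data. It is invariant under nonsingular linear transformations $x \mapsto Cx$, where C is any nonsingular d × d matrix. The new covariance matrix is $C \Sigma C^\top$.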
Some similarity measures for continuous data The Mahalanobis distance does not change under such a transformation.
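The invariance can be verified in a few lines. Writing $\Sigma$ for the covariance matrix of the original data and assuming each point is mapped as $x \mapsto Cx$ with $C$ nonsingular (the column-vector convention is an assumption about notation), the squared distance after the transformation is:

```latex
\begin{aligned}
d^2(Cx, Cy) &= (Cx - Cy)^\top \left( C \Sigma C^\top \right)^{-1} (Cx - Cy) \\
            &= (x - y)^\top C^\top \left( C^\top \right)^{-1} \Sigma^{-1} C^{-1} C \, (x - y) \\
            &= (x - y)^\top \Sigma^{-1} (x - y) \;=\; d^2(x, y).
\end{aligned}
```

The second line uses $(C \Sigma C^\top)^{-1} = (C^\top)^{-1} \Sigma^{-1} C^{-1}$, which requires $C$ to be invertible.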
Some similarity measures for categorical data In one dimension: Simple matching distance for multi-dimensions: Taking category frequency into account:
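A minimal sketch of the simple matching distance, assuming the usual definition in which each attribute contributes 0 when the two categories agree and 1 when they differ (the function name and the example records are made up for illustration):

```python
def simple_matching_distance(x, y):
    """Number of attributes on which two categorical records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical records with three categorical attributes each.
a = ("red", "small", "round")
b = ("red", "large", "round")
c = ("blue", "large", "square")

d_ab = simple_matching_distance(a, b)  # differ only in the second attribute
d_ac = simple_matching_distance(a, c)  # differ in every attribute
```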
Some similarity measures for categorical data For more general definitions of similarity, define: Number of matches: Number of matches to NA (? denotes a missing value here): Number of non-matches:
Some example similarity measures for categorical data
Some similarity measures for categorical data Binary feature vectors: Define: S is the number of occurrences of the case.
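As an illustration for binary feature vectors, the sketch below counts the four possible cases and derives two widely used coefficients from them. It assumes that the counts are indexed by the pair of values, so that `s10` is the number of positions where the first vector is 1 and the second is 0. The function names are hypothetical, and the value returned by `jaccard_coefficient` for two all-zero vectors is a convention chosen here.

```python
def binary_counts(x, y):
    """Return (s11, s10, s01, s00) for two equal-length 0/1 vectors."""
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return s11, s10, s01, s00

def simple_matching_coefficient(x, y):
    """Fraction of positions where the two vectors agree (1-1 or 0-0)."""
    s11, s10, s01, s00 = binary_counts(x, y)
    return (s11 + s00) / (s11 + s10 + s01 + s00)

def jaccard_coefficient(x, y):
    """Like simple matching, but joint absences (0-0) are ignored."""
    s11, s10, s01, _ = binary_counts(x, y)
    denom = s11 + s10 + s01
    return s11 / denom if denom else 1.0  # assumed convention for all-zero input

u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1]
counts = binary_counts(u, v)
smc = simple_matching_coefficient(u, v)
jac = jaccard_coefficient(u, v)
```

The two coefficients differ only in whether a shared absence counts as evidence of similarity, which matters when 1s are rare.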
Some similarity measures for mixed-type data General similarity coefficient by Gower:
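A rough sketch of a Gower-style coefficient, assuming its commonly stated form: a weighted average of per-attribute similarities, where a numeric attribute contributes 1 - |x_k - y_k| / R_k (R_k being the range of attribute k), a categorical attribute contributes 1 on a match and 0 otherwise, and an attribute with a missing value gets weight 0. The function name, the `kinds` and `ranges` arguments, and the use of `None` for missing values are all illustrative choices.

```python
def gower_similarity(x, y, kinds, ranges):
    """Gower-style similarity for mixed-type records.

    kinds[k] is "num" or "cat"; ranges[k] is the (nonzero) value range of
    numeric attribute k and is ignored for categorical attributes.
    None marks a missing value.
    """
    total = 0.0
    weight = 0.0
    for xk, yk, kind, rk in zip(x, y, kinds, ranges):
        if xk is None or yk is None:
            continue  # weight 0: this attribute cannot be compared
        if kind == "num":
            s = 1.0 - abs(xk - yk) / rk
        else:
            s = 1.0 if xk == yk else 0.0
        total += s
        weight += 1.0
    if weight == 0:
        raise ValueError("no comparable attributes")
    return total / weight

# Hypothetical records: (age, color, flag); the flag is missing in the first.
kinds = ("num", "cat", "cat")
ranges = (50.0, None, None)
p = (30.0, "red", None)
q = (40.0, "red", "yes")
s_pq = gower_similarity(p, q, kinds, ranges)  # (0.8 + 1.0) / 2
```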
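Similarity measures Similarity between clusters. Mean-based distance: the distance between the mean vectors of the two clusters. Nearest neighbor distance: the smallest distance between a point in one cluster and a point in the other.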
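Similarity measures Farthest neighbor distance: the largest distance between a point in one cluster and a point in the other. Average neighbor distance: the average distance over all pairs consisting of one point from each cluster.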
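The four between-cluster distances can be compared side by side. This is a minimal sketch assuming Euclidean distance between points; the function names and the two toy clusters are made up for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_vector(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[j] for p in cluster) / n for j in range(len(cluster[0])))

def mean_based(c1, c2):
    return euclidean(mean_vector(c1), mean_vector(c2))

def nearest_neighbor(c1, c2):
    return min(euclidean(p, q) for p in c1 for q in c2)

def farthest_neighbor(c1, c2):
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_neighbor(c1, c2):
    return sum(euclidean(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

# Hypothetical clusters on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]

d_mean = mean_based(A, B)         # between (0.5, 0) and (5, 0)
d_near = nearest_neighbor(A, B)   # from (1, 0) to (4, 0)
d_far = farthest_neighbor(A, B)   # from (0, 0) to (6, 0)
d_avg = average_neighbor(A, B)    # mean of 4, 6, 3, 5
```

The nearest neighbor distance is never larger than the average neighbor distance, which in turn is never larger than the farthest neighbor distance.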
Hierarchical clustering Agglomerative: build tree by joining nodes; Divisive: build tree by dividing groups of objects.
Hierarchical clustering Example data:
Hierarchical clustering Single linkage: find the distance between any two nodes by nearest neighbor distance.
Hierarchical clustering Single linkage:
Hierarchical clustering Complete linkage: find the distance between any two nodes by farthest neighbor distance. Average linkage: find the distance between any two nodes by average distance.
Hierarchical clustering Comments: Hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height. The complete linkage method favors compact, ball-shaped clusters; the single linkage method favors chain-shaped clusters; average linkage is somewhere in between.
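The difference between the linkage rules can be seen on a tiny example. The sketch below is a naive agglomerative procedure, not an efficient or standard implementation: it assumes Euclidean distance, recomputes all pairwise cluster distances at every step, and stops when `k` clusters remain, which corresponds to cutting the tree at the height that leaves `k` branches. All names and the data are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Each rule reduces the list of point-to-point distances between two clusters.
LINKAGES = {
    "single": min,                            # nearest neighbor
    "complete": max,                          # farthest neighbor
    "average": lambda ds: sum(ds) / len(ds),  # average neighbor
}

def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until k remain; return lists of point indices."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [euclidean(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 1D data with slowly growing gaps: 1.0, 1.1, 1.2, 1.3, 1.9.
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.5,)]
single = agglomerative(points, 2, "single")
complete = agglomerative(points, 2, "complete")
```

On this data single linkage chains the first five points together and leaves the last one alone, while complete linkage splits the line into a group of four and a group of two.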
Model-based clustering Impose certain model assumptions on potential clusters; try to optimize the fit between data and model. The data is viewed as coming from a mixture of probability distributions; each of the distributions represents a cluster.
Model-based clustering For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point i is from cluster j is: Classification likelihood approach: find cluster assignments and parameters that maximize
Model-based clustering Mixture likelihood approach: The most commonly used method is the EM algorithm. It iterates between soft cluster assignment and parameter estimation.
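For reference, the Gaussian component density and the two likelihood criteria are commonly written as follows. The symbols are assumptions and are not taken from the slides: $f_j$ is the density of component $j$ with parameters $\theta_j = (\mu_j, \Sigma_j)$, $\gamma_i$ is the cluster label of point $i$, $\tau_j$ are the mixing proportions, $k$ is the number of clusters, and $d$ is the dimension.

```latex
\begin{aligned}
f_j(x_i \mid \mu_j, \Sigma_j) &= (2\pi)^{-d/2} \, |\Sigma_j|^{-1/2}
  \exp\!\left\{ -\tfrac{1}{2} (x_i - \mu_j)^\top \Sigma_j^{-1} (x_i - \mu_j) \right\}, \\
L_C(\theta_1, \dots, \theta_k;\ \gamma_1, \dots, \gamma_n \mid x)
  &= \prod_{i=1}^{n} f_{\gamma_i}(x_i \mid \theta_{\gamma_i}), \\
L_M(\theta_1, \dots, \theta_k;\ \tau_1, \dots, \tau_k \mid x)
  &= \prod_{i=1}^{n} \sum_{j=1}^{k} \tau_j \, f_j(x_i \mid \theta_j),
  \qquad \tau_j \ge 0,\quad \sum_{j=1}^{k} \tau_j = 1.
\end{aligned}
```

The classification likelihood assigns each point to exactly one component, whereas the mixture likelihood averages over components, which is what makes soft assignments natural.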
Model-based clustering EM algorithm in the simplest case: a two-component Gaussian mixture in 1D
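A minimal sketch of EM for this simplest case, assuming the textbook update rules for a two-component 1D Gaussian mixture. The function and parameter names, the fixed number of iterations, the small variance floor, and the toy data are all illustrative choices rather than part of the slides.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(xs, mu1, mu2, var1=1.0, var2=1.0, weight2=0.5, n_iter=100):
    """EM for a two-component 1D Gaussian mixture.

    weight2 is the mixing proportion of component 2. Returns
    (mu1, mu2, var1, var2, weight2) after n_iter iterations.
    """
    n = len(xs)
    floor = 1e-6  # assumed safeguard against a variance collapsing to zero
    for _ in range(n_iter):
        # E-step: soft assignment, i.e. responsibility of component 2 for each point.
        g = []
        for x in xs:
            p1 = (1.0 - weight2) * normal_pdf(x, mu1, var1)
            p2 = weight2 * normal_pdf(x, mu2, var2)
            g.append(p2 / (p1 + p2))
        # M-step: responsibility-weighted means, variances and mixing proportion.
        w2 = sum(g)
        w1 = n - w2
        mu1 = sum((1.0 - gi) * x for gi, x in zip(g, xs)) / w1
        mu2 = sum(gi * x for gi, x in zip(g, xs)) / w2
        var1 = max(sum((1.0 - gi) * (x - mu1) ** 2 for gi, x in zip(g, xs)) / w1, floor)
        var2 = max(sum(gi * (x - mu2) ** 2 for gi, x in zip(g, xs)) / w2, floor)
        weight2 = w2 / n
    return mu1, mu2, var1, var2, weight2

# Hypothetical data: five points near -2 and five points near 3.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.8, 3.1, 2.9]
mu1, mu2, var1, var2, weight2 = em_two_gaussians(xs, mu1=-1.0, mu2=1.0)
```

With well-separated groups such as these, the estimated means settle near the two group averages. With overlapping groups or poor starting values, EM may converge to a different local optimum, so several initializations are usually tried.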
Model-based clustering Gaussian cluster models.
Model-based clustering Common assumptions: From 1 to 4, the model becomes more flexible, but more parameters need to be estimated, so the fit may become less stable.