Clustering (1) Clustering Similarity measure Hierarchical clustering


1 Clustering (1)
Clustering; Similarity measure; Hierarchical clustering; Model-based clustering. Figures from the book Data Clustering by Gan et al.

2 Clustering
Objects in a cluster should: share closely related properties; have small mutual distances; be clearly distinguishable from objects not in the same cluster.
A cluster should be a densely populated region surrounded by relatively empty regions.
Compact cluster: can be represented by a center. Chained cluster: higher-order structures.

3 Clustering

4 Clustering Types of clustering:

5 Similarity measures A metric distance function should satisfy
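The conditions themselves did not survive in the transcript. The standard textbook axioms for a metric d are sketched below as an assumption about what the slide listed; the symbols x, y, z are illustrative.

```latex
\begin{align*}
d(x, y) &\ge 0 && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y && \text{(identity)} \\
d(x, y) &= d(y, x) && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{align*}
```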

6 Similarity measures Similarity function:

7 Similarity measures From a dataset, Distance matrix: Similarity matrix:
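A minimal sketch of how a distance matrix could be built from a dataset. The function names `euclidean` and `distance_matrix` and the sample points are illustrative assumptions, not taken from the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(data, dist=euclidean):
    """Return the symmetric n x n matrix of pairwise distances."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # A distance is symmetric, so fill both triangles at once.
            matrix[i][j] = matrix[j][i] = dist(data[i], data[j])
    return matrix

# Hypothetical sample data: three points on a line through the origin.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points)
```

A similarity matrix has the same shape; only the pairwise function changes.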

8 Some similarity measures for continuous data
Euclidean distance; Manhattan distance; Manhattan segmental distance (using only part of the dimensions)

9 Some similarity measures for continuous data
Maximum distance (sup distance). Minkowski distance: this is the general case. R=2 gives the Euclidean distance; R=1, the Manhattan distance; R=∞, the maximum distance.
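The special cases of the Minkowski distance can be checked with a short sketch. The function name `minkowski` and the sample vectors are assumptions for illustration; the parameter `r` plays the role of R on the slide.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the maximum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: only the largest coordinate difference remains.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Hypothetical sample vectors with coordinate differences 3, 4 and 0.
x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
manhattan = minkowski(x, y, 1)
euclid = minkowski(x, y, 2)
maximum = minkowski(x, y, math.inf)
```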

10 Some similarity measures for continuous data
Mahalanobis distance. It is invariant under nonsingular linear transformations, where C is any nonsingular d × d matrix. The new covariance matrix is

11 Some similarity measures for continuous data
The Mahalanobis distance doesn’t change
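The invariance can be checked numerically. The sketch below is restricted to two dimensions so that the matrix inverse can be written out by hand; the data, the matrix `C`, and all function names are illustrative assumptions. It assumes each point is mapped to C times the point, in which case the covariance becomes C Σ Cᵀ.

```python
import math

def covariance_2d(data):
    """Sample covariance matrix (2 x 2) of a list of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between x and y for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T cov^{-1} (x - y).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

def transform(C, p):
    """Apply the 2 x 2 matrix C to the point p."""
    return (C[0][0] * p[0] + C[0][1] * p[1], C[1][0] * p[0] + C[1][1] * p[1])

# Hypothetical data and a nonsingular matrix (determinant 5.5).
data = [(1.0, 2.0), (2.0, 1.0), (4.0, 5.0), (6.0, 3.0), (7.0, 8.0)]
C = [[2.0, 1.0], [0.5, 3.0]]

before = mahalanobis_2d(data[0], data[3], covariance_2d(data))
moved = [transform(C, p) for p in data]
after = mahalanobis_2d(moved[0], moved[3], covariance_2d(moved))
```

Up to floating-point rounding, `before` and `after` agree, whereas the Euclidean distance between the same two points would change under `C`.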

12 Some similarity measures for categorical data
In one dimension: Simple matching distance for multi-dimensions: Taking category frequency into account:
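A minimal sketch of the simple matching distance, assuming the usual definition: 0 for a matching attribute, 1 for a mismatch, summed over dimensions. The frequency-weighted variant is not shown because its formula did not survive in the transcript. The function name and the sample records are hypothetical.

```python
def simple_matching_distance(x, y):
    """Count the attributes on which two categorical records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical records with three categorical attributes each.
a = ("red", "small", "round")
b = ("red", "large", "round")
c = ("blue", "large", "square")

d_ab = simple_matching_distance(a, b)
d_ac = simple_matching_distance(a, c)
```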

13 Some similarity measures for categorical data
For more general definitions of similarity, define: Number of matches: Number of matches to NA (? means missing here): Number of non-matches:

14 Some example similarity measures for categorical data

15 Some similarity measures for categorical data
Binary feature vectors: Define: S is the number of occurrences of the case.
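For binary vectors the four cases are (1,1), (1,0), (0,1) and (0,0), and S counts how often each occurs. The sketch below tallies them and derives two well-known coefficients from the counts. Which coefficients the slides tabulate is not recoverable from the transcript, so the choice of the simple matching and Jaccard coefficients is an assumption, as are the function names.

```python
def binary_counts(x, y):
    """Tally the four (x_k, y_k) cases over all positions."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, b in zip(x, y):
        counts[(a, b)] += 1
    return counts

def simple_matching_coefficient(x, y):
    """Fraction of positions where the two vectors agree (1-1 or 0-0)."""
    s = binary_counts(x, y)
    return (s[(1, 1)] + s[(0, 0)]) / len(x)

def jaccard_coefficient(x, y):
    """Like simple matching, but joint absences (0-0) are ignored."""
    s = binary_counts(x, y)
    denom = s[(1, 1)] + s[(1, 0)] + s[(0, 1)]
    # Assumption: two all-zero vectors are treated as identical.
    return s[(1, 1)] / denom if denom else 1.0

# Hypothetical vectors: S11 = 2, S10 = 1, S01 = 1, S00 = 1.
x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
counts = binary_counts(x, y)
```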

16 Some similarity measures for categorical data

17 Some similarity measures for mixed-type data
General similarity coefficient by Gower:
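A simplified sketch of Gower's coefficient, covering only numeric and nominal attributes: each attribute contributes a score in [0, 1], attributes with a missing value get weight 0, and the result is the weighted average. The function signature, the `"num"`/`"cat"` labels, and the use of `None` for missing values are assumptions made for this example.

```python
def gower_similarity(x, y, kinds, ranges):
    """Gower similarity of two mixed-type records.

    kinds[k] is "num" or "cat"; ranges[k] is the observed range of a
    numeric attribute (ignored for categorical ones). None marks a
    missing value.
    """
    total, weight = 0.0, 0
    for a, b, kind, rng in zip(x, y, kinds, ranges):
        if a is None or b is None:
            continue  # weight 0: the attribute cannot be compared
        if kind == "num":
            score = 1.0 - abs(a - b) / rng
        else:
            score = 1.0 if a == b else 0.0
        total += score
        weight += 1
    if weight == 0:
        raise ValueError("no attribute is observed in both records")
    return total / weight

# Hypothetical records: (age, colour, answer); age ranges over 50 units.
x = (30.0, "red", None)
y = (40.0, "red", "yes")
s = gower_similarity(x, y, ("num", "cat", "cat"), (50.0, None, None))
```

Here the numeric attribute scores 0.8, the colour scores 1, and the third attribute is skipped, so the similarity is the mean of the two comparable scores.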

18 Similarity measures Similarity between clusters Mean-based distance (between mean vectors): Nearest neighbor

19 Similarity measures Farthest neighbor Average neighbor
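The four between-cluster distances can be sketched as follows, assuming Euclidean distance between individual points. The function names and the two small clusters are illustrative assumptions.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_vector(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def mean_based(A, B):
    """Distance between the two cluster means."""
    return euclidean(mean_vector(A), mean_vector(B))

def nearest_neighbor(A, B):
    """Smallest distance over all cross-cluster pairs."""
    return min(euclidean(a, b) for a, b in product(A, B))

def farthest_neighbor(A, B):
    """Largest distance over all cross-cluster pairs."""
    return max(euclidean(a, b) for a, b in product(A, B))

def average_neighbor(A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(euclidean(a, b) for a, b in product(A, B)) / (len(A) * len(B))

# Hypothetical clusters: two vertical pairs, three units apart.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
```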

20 Hierarchical clustering
Agglomerative: build tree by joining nodes; Divisive: build tree by dividing groups of objects.

21 Hierarchical clustering

22 Hierarchical clustering
Example data:

23 Hierarchical clustering
Single linkage: find the distance between any two nodes by nearest neighbor distance.

24 Hierarchical clustering
Single linkage:

25 Hierarchical clustering
Complete linkage: find the distance between any two nodes by farthest neighbor distance. Average linkage: find the distance between any two nodes by average distance.
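A naive sketch of agglomerative clustering with a selectable linkage. It repeatedly merges the two closest clusters until a requested number of clusters remains, which stands in for cutting the tree at a height. The function name, the `LINKAGES` table, and the sample data are assumptions; the brute-force search is for illustration, not efficiency.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Each linkage reduces the list of cross-cluster point distances to one number.
LINKAGES = {
    "single": min,                             # nearest neighbor
    "complete": max,                           # farthest neighbor
    "average": lambda ds: sum(ds) / len(ds),   # average neighbor
}

def agglomerative(points, num_clusters, linkage="single"):
    """Return clusters (lists of point indices) after bottom-up merging."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [euclidean(points[a], points[b])
                      for a in clusters[i] for b in clusters[j]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Join the closest pair into one node and drop the absorbed cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 1-D data with two well-separated groups.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
single = agglomerative(points, 2, "single")
complete = agglomerative(points, 2, "complete")
```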

26 Hierarchical clustering
Comments: Hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height; Complete linkage method favors compact, ball-shaped clusters; single linkage method favors chain-shaped clusters; average linkage is somewhere in between.

27 Model-based clustering
Impose certain model assumptions on potential clusters; try to optimize the fit between data and model. The data is viewed as coming from a mixture of probability distributions; each of the distributions represents a cluster.

28 Model-based clustering
For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point i is from cluster j is:
Classification likelihood approach: find cluster assignments and parameters that maximize

29 Model-based clustering
Mixture likelihood approach: The most commonly used method is the EM algorithm. It iterates between soft cluster assignment and parameter estimation.
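The formulas for the Gaussian density and the two likelihoods were lost in the transcript. The forms below are the ones commonly given in the model-based clustering literature and are an assumption about what the slides showed. The notation is illustrative: n points x_i in d dimensions, k clusters with parameters θ_j = (μ_j, Σ_j), cluster labels γ_i, and mixing proportions τ_j.

```latex
\begin{align*}
f_j(x_i \mid \mu_j, \Sigma_j)
  &= (2\pi)^{-d/2} \, |\Sigma_j|^{-1/2}
     \exp\!\left( -\tfrac{1}{2} (x_i - \mu_j)^{T} \Sigma_j^{-1} (x_i - \mu_j) \right) \\
L_C(\theta_1, \dots, \theta_k; \gamma_1, \dots, \gamma_n \mid x)
  &= \prod_{i=1}^{n} f_{\gamma_i}(x_i \mid \theta_{\gamma_i}) \\
L_M(\theta_1, \dots, \theta_k; \tau_1, \dots, \tau_k \mid x)
  &= \prod_{i=1}^{n} \sum_{j=1}^{k} \tau_j \, f_j(x_i \mid \theta_j),
  \qquad \tau_j \ge 0, \quad \sum_{j=1}^{k} \tau_j = 1
\end{align*}
```

In the classification likelihood each point contributes only through its assigned cluster, whereas the mixture likelihood sums over all clusters, which is what leads to soft assignments in EM.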

30 Model-based clustering
EM algorithm in the simplest case: two component Gaussian in 1D
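A minimal sketch of EM for a two-component Gaussian mixture in one dimension. The initialisation (extreme data values as starting means, overall variance for both components), the fixed iteration count, the variance floor, and all names are assumptions made for this example.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Return (pi, mu1, var1, mu2, var2); pi is the weight of component 2."""
    n = len(data)
    mu1, mu2 = min(data), max(data)          # crude starting means
    mean = sum(data) / n
    var1 = var2 = sum((x - mean) ** 2 for x in data) / n
    pi = 0.5
    for _ in range(iterations):
        # E-step: soft assignment, the responsibility of component 2.
        resp = []
        for x in data:
            p1 = (1.0 - pi) * normal_pdf(x, mu1, var1)
            p2 = pi * normal_pdf(x, mu2, var2)
            resp.append(p2 / (p1 + p2))
        # M-step: weighted means, variances and mixing proportion.
        w2 = sum(resp)
        w1 = n - w2
        mu1 = sum((1.0 - g) * x for g, x in zip(resp, data)) / w1
        mu2 = sum(g * x for g, x in zip(resp, data)) / w2
        var1 = sum((1.0 - g) * (x - mu1) ** 2 for g, x in zip(resp, data)) / w1
        var2 = sum(g * (x - mu2) ** 2 for g, x in zip(resp, data)) / w2
        # Floor the variances so a component cannot collapse onto one point.
        var1, var2 = max(var1, 1e-6), max(var2, 1e-6)
        pi = w2 / n
    return pi, mu1, var1, mu2, var2

# Hypothetical data: five points near 1 and five points near 5.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 4.8, 5.1]
pi, mu1, var1, mu2, var2 = em_two_gaussians(data)
```

On this well-separated sample the estimated means settle close to the two group averages and the mixing weight close to one half.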

31 Model-based clustering

32 Model-based clustering

33 Model-based clustering
Gaussian cluster models.

34 Model-based clustering
Common assumptions: From 1 to 4, the model becomes more flexible, but more parameters need to be estimated, so the fit may become less stable.

