Published by Benjamin Jefferson. Modified over 6 years ago.

1
Clustering (1)
Outline: clustering; similarity measures; hierarchical clustering; model-based clustering.
Figures are from the book Data Clustering by Gan et al.

2
Clustering
Objects in a cluster should share closely related properties, have small mutual distances, and be clearly distinguishable from objects not in the same cluster.
A cluster should be a densely populated region surrounded by relatively empty regions.
Compact cluster: can be represented by a center.
Chained cluster: higher-order structures.

4
Clustering
The process of clustering

5
Clustering
Types of clustering:

6
Similarity measures
A distance function should satisfy:
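The conditions themselves did not survive in this transcript. As a hedged reconstruction, the properties usually required of a distance function are the standard metric axioms below. The symbols d, x, y, z are assumed notation, not taken from the slide.

```latex
\begin{align*}
d(x, y) &\ge 0 && \text{(nonnegativity)} \\
d(x, x) &= 0 && \text{(reflexivity)} \\
d(x, y) &= d(y, x) && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{align*}
```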

7
Similarity measures
Similarity function:

8
Similarity measures
From a dataset of n objects, one can form a distance matrix and a similarity matrix of pairwise values.
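A minimal sketch of building a pairwise distance matrix from a dataset. The function names and the toy data are illustrative assumptions, and Euclidean distance is used only as an example of a distance function.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two equal-length numeric vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(data, dist=euclidean):
    # Entry [i][j] holds the distance between object i and object j
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Toy dataset (hypothetical): three points in the plane
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(data)
```

The resulting matrix is symmetric with zeros on the diagonal, so in practice only the upper or lower triangle needs to be stored.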

9
Similarity measures
Euclidean distance
Manhattan distance
Manhattan segmental distance (using only part of the dimensions)
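The three distances named here can be sketched as follows. The formulas are not in the transcript, so this follows the usual definitions. Dividing the segmental distance by the number of selected dimensions is an assumption, and `dims` is a hypothetical parameter name for the subset of dimensions used.

```python
import math

def euclidean(x, y):
    # Square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def manhattan_segmental(x, y, dims):
    # Manhattan distance restricted to the dimensions in `dims`,
    # averaged over the number of dimensions used (assumed normalization)
    return sum(abs(x[j] - y[j]) for j in dims) / len(dims)
```

For example, `manhattan_segmental((0, 0, 10), (3, 4, 0), [0, 1])` ignores the third coordinate entirely.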

10
Similarity measures
Maximum distance (sup distance)
Minkowski distance: this is the general case. R = 2 gives the Euclidean distance; R = 1, the Manhattan distance; R = ∞, the maximum distance.
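A short sketch of the Minkowski distance and its three special cases. The function name and the parameter name `r` (the slide's R) are illustrative choices.

```python
import math

def minkowski(x, y, r):
    # Absolute coordinate differences between the two vectors
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: the largest coordinate difference
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan_d = minkowski(x, y, 1)         # r = 1
euclidean_d = minkowski(x, y, 2)         # r = 2
maximum_d = minkowski(x, y, math.inf)    # r = infinity
```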

11
Similarity measures
Mahalanobis distance
It is invariant under non-singular transformations.
The new covariance matrix is:

12
Similarity measures
The Mahalanobis distance does not change.
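The formulas for this argument are missing from the transcript. A sketch of the standard derivation, under assumed notation: Σ is the covariance matrix of the data, C is any non-singular matrix, and y = Cx is the transformed point.

```latex
\begin{align*}
d_M(x_1, x_2)^2 &= (x_1 - x_2)^{\top} \Sigma^{-1} (x_1 - x_2), \\
y = Cx \;&\Rightarrow\; \Sigma_y = C \Sigma C^{\top}, \\
d_M(y_1, y_2)^2 &= (y_1 - y_2)^{\top} \Sigma_y^{-1} (y_1 - y_2) \\
&= (x_1 - x_2)^{\top} C^{\top} (C^{\top})^{-1} \Sigma^{-1} C^{-1} C \, (x_1 - x_2) \\
&= (x_1 - x_2)^{\top} \Sigma^{-1} (x_1 - x_2) = d_M(x_1, x_2)^2 .
\end{align*}
```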

13
Similarity measures
Chord distance: the length of the chord joining the two normalized points within a hypersphere of radius one.
Geodesic distance: the length of the shorter arc connecting the two normalized data points on the surface of the hypersphere of unit radius.
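Both distances can be sketched directly from these descriptions: normalize each point to unit length, then measure either the straight chord or the arc between them. The function names are illustrative, and the code assumes neither vector is all zeros.

```python
import math

def _normalize(x):
    # Scale a vector to unit length (undefined for the zero vector)
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def chord_distance(x, y):
    # Euclidean distance between the two normalized points
    u, v = _normalize(x), _normalize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def geodesic_distance(x, y):
    # Angle between the normalized points = arc length on the unit sphere
    u, v = _normalize(x), _normalize(y)
    cos_theta = sum(a * b for a, b in zip(u, v))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.acos(cos_theta)
```

For two orthogonal vectors the chord distance is the square root of 2 and the geodesic distance is a quarter circle, π/2.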

15
Similarity measures
Categorical data:
In one dimension:
Simple matching distance:
Taking category frequency into account:
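A minimal sketch of the simple matching distance, assuming the usual definition: each attribute contributes 0 when the two categories agree and 1 when they differ, and the contributions are summed. The frequency-weighted variant is not shown because its formula is not recoverable from the transcript.

```python
def simple_matching_distance(x, y):
    # Count the attributes on which the two objects disagree
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical categorical records
d = simple_matching_distance(("red", "small", "round"),
                             ("red", "large", "round"))
```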

16
Similarity measures
For more general definitions of similarity, define:
Number of matches:
Number of matches to NA (? means missing here):
Number of non-matches:

18
Binary feature vectors: Define: S is the number of occurrences of the case.
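The definition on this slide is incomplete in the transcript. The sketch below assumes the common reading: for two binary vectors, count how often each of the four cases (1,1), (1,0), (0,1), (0,0) occurs. The Jaccard and simple matching coefficients are shown as two well-known measures built from these counts; they are examples, not necessarily the ones listed in the slides.

```python
def binary_counts(x, y):
    # counts[(a, b)] = number of positions where x has a and y has b
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, b in zip(x, y):
        counts[(a, b)] += 1
    return counts

def jaccard(x, y):
    # Shared 1s divided by positions where at least one vector has a 1
    s = binary_counts(x, y)
    denom = s[(1, 1)] + s[(1, 0)] + s[(0, 1)]
    return s[(1, 1)] / denom if denom else 0.0  # 0.0 is a convention here

def simple_matching(x, y):
    # Fraction of positions where the two vectors agree
    s = binary_counts(x, y)
    return (s[(1, 1)] + s[(0, 0)]) / len(x)

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
```

The two coefficients differ in how they treat joint absences: Jaccard ignores the (0,0) case, while simple matching counts it as agreement.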

19
Similarity measures

20
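Similarity measures
Mixed-type data: general similarity coefficient by Gower:
$S(x, y) = \sum_k w(x_k, y_k)\, s(x_k, y_k) \,/\, \sum_k w(x_k, y_k)$
For quantitative attributes, $s(x_k, y_k) = 1 - |x_k - y_k| / R_k$ ($R_k$ is the range of attribute k); $w(x_k, y_k) = 1$ if neither is missing.
For binary attributes, $s(x_k, y_k) = 1$ if $x_k = 1$ and $y_k = 1$; $w(x_k, y_k) = 1$ if $x_k = 1$ or $y_k = 1$.
For nominal attributes, $s(x_k, y_k) = 1$ if $x_k = y_k$; $w(x_k, y_k) = 1$ if neither is missing.
In all other cases, $s$ and $w$ are 0.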
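A sketch of Gower's coefficient, assuming its standard form: a weighted average of per-attribute scores, where the weight is 0 whenever a comparison is not meaningful. The function signature, the `kinds` labels, the use of `None` for missing values, and the 0.0 return when nothing is comparable are all illustrative assumptions.

```python
def gower_similarity(x, y, kinds, ranges):
    # kinds[k] is "quantitative", "binary", or "nominal"
    # ranges[k] is the range R of quantitative attribute k
    num = 0.0
    den = 0.0
    for k, kind in enumerate(kinds):
        a, b = x[k], y[k]
        if a is None or b is None:
            continue  # weight 0 when either value is missing
        if kind == "quantitative":
            s, w = 1.0 - abs(a - b) / ranges[k], 1.0
        elif kind == "binary":
            s = 1.0 if (a == 1 and b == 1) else 0.0
            w = 1.0 if (a == 1 or b == 1) else 0.0  # joint absence is ignored
        elif kind == "nominal":
            s, w = (1.0 if a == b else 0.0), 1.0
        else:
            raise ValueError("unknown attribute kind: " + str(kind))
        num += w * s
        den += w
    # Undefined when no attribute is comparable; 0.0 is an assumed fallback
    return num / den if den else 0.0

kinds = ["quantitative", "binary", "nominal"]
ranges = {0: 10.0}
sim = gower_similarity((2.0, 1, "a"), (7.0, 0, "a"), kinds, ranges)
```

In this example the quantitative attribute scores 0.5, the binary attribute scores 0, and the nominal attribute scores 1, each with weight 1.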

21
Similarity measures
Similarity between clusters
Mean-based distance:
Nearest neighbor
Farthest neighbor
Average neighbor
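The four between-cluster distances can be sketched as below, assuming Euclidean distance between points and clusters given as lists of points. The function names are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    # Coordinate-wise mean of the points in a cluster
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def mean_based(A, B):
    # Distance between the two cluster means
    return euclidean(centroid(A), centroid(B))

def nearest_neighbor(A, B):
    # Smallest distance over all cross-cluster pairs
    return min(euclidean(a, b) for a in A for b in B)

def farthest_neighbor(A, B):
    # Largest distance over all cross-cluster pairs
    return max(euclidean(a, b) for a in A for b in B)

def average_neighbor(A, B):
    # Mean distance over all cross-cluster pairs
    return sum(euclidean(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
```

For any pair of clusters, the nearest-neighbor distance is at most the average, which is at most the farthest-neighbor distance.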

22
Hierarchical clustering
Agglomerative: build the tree by joining nodes.
Divisive: build the tree by dividing groups of objects.

23
Hierarchical clustering
Example data:

24
Hierarchical clustering
Single linkage: find the distance between any two nodes by nearest neighbor distance.

25
Hierarchical clustering
Single linkage:

26
Hierarchical clustering
Complete linkage: find the distance between any two nodes by farthest neighbor distance.
Average linkage: find the distance between any two nodes by average distance.
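A naive agglomerative sketch showing how the linkage rules differ only in how pairwise distances between two nodes are combined. This is an illustration under assumptions (Euclidean distance, brute-force search, hypothetical function names and toy data), not an efficient implementation.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Each linkage turns the list of cross-cluster distances into one number
LINKAGES = {
    "single": min,                              # nearest neighbor
    "complete": max,                            # farthest neighbor
    "average": lambda ds: sum(ds) / len(ds),    # average neighbor
}

def agglomerative(points, linkage="single"):
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # start: one point per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [euclidean(points[a], points[b])
                      for a in clusters[i] for b in clusters[j]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Record which two clusters were joined and at what height
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Toy one-dimensional data (hypothetical)
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
single_heights = [m[2] for m in agglomerative(points, "single")]
complete_heights = [m[2] for m in agglomerative(points, "complete")]
```

The merge heights are the levels at which the tree could later be cut. On this data the two linkages agree on the first two merges and diverge once multi-point clusters are compared.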

27
Hierarchical clustering
Comments:
Hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height.
The complete linkage method favors compact, ball-shaped clusters; the single linkage method favors chain-shaped clusters; average linkage is somewhere in between.

28
Model-based clustering
Impose certain model assumptions on potential clusters and try to optimize the fit between the data and the model.
The data are viewed as coming from a mixture of probability distributions; each distribution represents a cluster.

29
Model-based clustering
For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point i is from cluster j is:
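The formula is missing from the transcript. Presumably it is the multivariate normal density, written here under assumed notation: x_i is a p-dimensional data point, and μ_j and Σ_j are the mean vector and covariance matrix of cluster j.

```latex
\[
f_j(x_i \mid \mu_j, \Sigma_j)
  = \frac{1}{(2\pi)^{p/2} \, |\Sigma_j|^{1/2}}
    \exp\!\left( -\tfrac{1}{2} (x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j) \right)
\]
```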

30
Model-based clustering
Given the number of clusters, we try to maximize the likelihood, which involves the probability that each observation belongs to cluster j.
The most commonly used method is the EM algorithm. It iterates between soft cluster assignment and parameter estimation.
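The alternation between soft assignment and parameter estimation can be sketched for the simplest case, a one-dimensional Gaussian mixture. Everything here is an illustrative assumption: the function names, the fixed iteration count, the unit starting variances, the variance floor, and the toy data. It has no protection against numerical underflow, which a practical implementation would need.

```python
import math

def normal_pdf(x, mu, var):
    # Density of a 1-D Gaussian with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_means, n_iter=50):
    k = len(init_means)
    mus = list(init_means)
    variances = [1.0] * k        # assumed starting variances
    weights = [1.0 / k] * k      # equal starting mixing proportions
    for _ in range(n_iter):
        # E-step: soft assignment of each point to each cluster
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, mus[j], variances[j])
                 for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate parameters from the soft assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed variance
    return weights, mus, variances

# Toy data (hypothetical): two well-separated groups around 0 and 10
data = [0.0, 0.2, -0.2, 0.1, -0.1, 10.0, 10.2, 9.8, 10.1, 9.9]
weights, mus, variances = em_gmm_1d(data, init_means=[1.0, 9.0])
```

On data this well separated, the soft assignments become nearly hard after the first iteration. With overlapping clusters the result depends more strongly on the starting values.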

32
Model-based clustering
Gaussian cluster models. Common assumptions:
From 1 to 4, the model becomes more flexible, but more parameters need to be estimated, so the fit may become less stable.
