Clustering (1) Clustering Similarity measure Hierarchical clustering Model-based clustering Figures from the book Data Clustering by Gan et al.


1 Clustering (1): clustering; similarity measure; hierarchical clustering; model-based clustering. Figures from the book Data Clustering by Gan et al.

2 Objects in a cluster should: share closely related properties; have small mutual distances; be clearly distinguishable from objects not in the same cluster. A cluster should be a densely populated region surrounded by relatively empty regions. Compact cluster: can be represented by a center. Chained cluster: higher-order structures. Clustering

3

4 The process of clustering Clustering

5 Types of clustering: Clustering

6 A distance function should satisfy Similarity measures
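
For reference, the conditions usually required of a distance function are the standard metric axioms. The notation ($d$ for the distance, $x$, $y$, $z$ for data points) is an assumption chosen for illustration, not taken from the slide:

```latex
\begin{align*}
d(x, y) &\ge 0 && \text{(nonnegativity)} \\
d(x, x) &= 0 && \text{(reflexivity)} \\
d(x, y) &= d(y, x) && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{align*}
```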

7 Similarity function: Similarity measures
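
One common set of conditions for a similarity function $s$ is sketched below. This is an assumption about the intended definition: conventions vary, and some texts do not require the upper bound of 1.

```latex
\begin{align*}
0 \le s(x, y) &\le 1 \\
s(x, x) &= 1 \\
s(x, y) &= s(y, x)
\end{align*}
```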

8 From a dataset, Distance matrix: Similarity matrix: Similarity measures
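
A minimal sketch of building both matrices from a small dataset. The Euclidean distance and the conversion $s = 1/(1+d)$ are assumptions made for illustration; the slide does not say which distance or which distance-to-similarity mapping is intended, and the function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(data, dist=euclidean):
    """n x n matrix whose (i, j) entry is dist(data[i], data[j])."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]


def similarity_matrix(data, dist=euclidean):
    """Similarity matrix via s = 1 / (1 + d); this mapping is an assumption."""
    return [[1.0 / (1.0 + d) for d in row] for row in distance_matrix(data, dist)]


data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(data)
S = similarity_matrix(data)
```

Both matrices are symmetric; the distance matrix has zeros on its diagonal and the similarity matrix has ones.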

9 Euclidean distance. Manhattan distance. Manhattan segmental distance (using only part of the dimensions). Similarity measures

10 Maximum distance (sup distance). Minkowski distance: this is the general case. R=2 gives the Euclidean distance; R=1, the Manhattan distance; R=∞, the maximum distance. Similarity measures
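
The three special cases can be checked with a short sketch of the Minkowski distance. The function name and the parameter `r` (standing in for R on the slide) are hypothetical choices for illustration.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length points.

    r = 1 gives the Manhattan distance, r = 2 the Euclidean distance,
    and r = math.inf the maximum (sup) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
manhattan = minkowski(x, y, 1)         # |3| + |4| + |0|
euclid = minkowski(x, y, 2)            # sqrt(9 + 16 + 0)
maximum = minkowski(x, y, math.inf)    # max(3, 4, 0)
```

For any pair of points the value does not increase as `r` grows, so the maximum distance is the smallest of the three and the Manhattan distance the largest.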

11 Mahalanobis distance. It is invariant under nonsingular linear transformations. The new covariance matrix is Similarity measures
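
Written out in standard notation, the Mahalanobis distance and the covariance matrix after a nonsingular linear transformation $x \mapsto Cx$ would be as follows. The symbols $x$, $y$, $\Sigma$, $\Sigma'$, and $C$ are assumptions here, not taken from the slide.

```latex
d(x, y) = \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)},
\qquad
\Sigma' = C \, \Sigma \, C^{\top}
```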

12 The Mahalanobis distance doesn’t change Similarity measures
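
A short derivation of why the distance is unchanged, assuming the data are transformed by a nonsingular matrix $C$ so that the covariance matrix $\Sigma$ becomes $C \Sigma C^{\top}$ (the symbols are illustrative assumptions):

```latex
\begin{align*}
d'(Cx, Cy)^2
  &= (Cx - Cy)^{\top} \left( C \Sigma C^{\top} \right)^{-1} (Cx - Cy) \\
  &= (x - y)^{\top} C^{\top} \left( C^{\top} \right)^{-1} \Sigma^{-1} C^{-1} C \, (x - y) \\
  &= (x - y)^{\top} \Sigma^{-1} (x - y) \\
  &= d(x, y)^2
\end{align*}
```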

13 Chord distance: the length of the chord joining the two normalized points within a hypersphere of radius one. Geodesic distance: the length of the shorter arc connecting the two normalized data points on the surface of the hypersphere of unit radius. Similarity measures
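
Both distances follow directly from the verbal definitions: normalize each point to unit length, then measure either the straight chord or the arc between them. The sketch below assumes nonzero vectors and Euclidean normalization; the function names are hypothetical.

```python
import math


def _normalize(x):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def chord_distance(x, y):
    """Length of the straight chord between the two normalized points."""
    u, v = _normalize(x), _normalize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def geodesic_distance(x, y):
    """Length of the shorter arc between the two normalized points."""
    u, v = _normalize(x), _normalize(y)
    cos_angle = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle)


x, y = (1.0, 0.0), (0.0, 2.0)
chord = chord_distance(x, y)      # orthogonal directions: sqrt(2)
arc = geodesic_distance(x, y)     # orthogonal directions: pi / 2
```

On the unit hypersphere the arc length is simply the angle between the two directions, so the geodesic distance is never smaller than the chord distance.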

14

15 Categorical data: In one dimension: Simple matching distance: Taking category frequency into account: Similarity measures
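
A minimal sketch of the simple matching distance, assuming the usual definition: each dimension contributes 0 when the two categories agree and 1 when they differ, and the contributions are summed. The frequency-weighted variant is not shown because its formula is not given in the text. The function name is hypothetical.

```python
def simple_matching_distance(x, y):
    """Number of dimensions in which two categorical records differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


x = ("red", "small", "round")
y = ("red", "large", "round")
d = simple_matching_distance(x, y)  # the records differ only in the second attribute
```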

16 For more general definitions of similarity, define: Number of matches: Number of matches to NA (? means missing here): Number of non-matches: Similarity measures

17

18 Binary feature vectors: Define: S is the number of occurrences of the case.
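
For two binary vectors, the four possible cases per position are (1,1), (1,0), (0,1), and (0,0), and similarity coefficients are built from how often each case occurs. The sketch below counts them and computes two well-known coefficients as examples, the Jaccard coefficient and the simple matching coefficient. The names `s11`, `s10`, `s01`, `s00` are illustrative assumptions, and the text does not say which coefficients the slides list.

```python
def binary_counts(x, y):
    """Count occurrences of each (x_k, y_k) case over all positions."""
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return s11, s10, s01, s00


def jaccard(x, y):
    """Jaccard coefficient: joint presences over positions with any presence."""
    s11, s10, s01, _ = binary_counts(x, y)
    denom = s11 + s10 + s01
    # Returning 0.0 for two all-zero vectors is an arbitrary convention.
    return s11 / denom if denom else 0.0


def simple_matching(x, y):
    """Simple matching coefficient: agreeing positions over all positions."""
    s11, s10, s01, s00 = binary_counts(x, y)
    return (s11 + s00) / (s11 + s10 + s01 + s00)


x = (1, 1, 0, 0, 1)
y = (1, 0, 0, 1, 1)
counts = binary_counts(x, y)
jac = jaccard(x, y)
smc = simple_matching(x, y)
```

The Jaccard coefficient ignores joint absences (the 0,0 case), while the simple matching coefficient counts them as agreement.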

19 Similarity measures

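20 Mixed-type data: General similarity coefficient by Gower: $S(x, y) = \sum_k w(x_k, y_k)\, s(x_k, y_k) \,/\, \sum_k w(x_k, y_k)$. For quantitative attributes, $s(x_k, y_k) = 1 - |x_k - y_k| / R_k$ ($R_k$ is the range of attribute $k$), and $w(x_k, y_k) = 1$ if neither value is missing. For binary attributes, $s(x_k, y_k) = 1$ if $x_k = 1$ and $y_k = 1$, and 0 otherwise; $w(x_k, y_k) = 1$ if $x_k = 1$ or $y_k = 1$, and 0 otherwise. For nominal attributes, $s(x_k, y_k) = 1$ if $x_k = y_k$, and 0 otherwise; $w(x_k, y_k) = 1$ if neither value is missing. In every case the weight is 0 when a value is missing. Similarity measures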
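
A sketch of Gower's coefficient for records with mixed attribute types, assuming the usual form: a weighted average of per-attribute scores, where the weight is 0 for missing values and for binary attributes that are absent in both records. The function name, the `kinds` and `ranges` arguments, and the use of `None` for a missing value are hypothetical choices for illustration.

```python
def gower_similarity(x, y, kinds, ranges):
    """Gower similarity between two mixed-type records.

    kinds[k] is "quant", "binary", or "nominal".
    ranges[k] is the range R_k for quantitative attributes (ignored otherwise).
    A value of None marks a missing entry.
    """
    num = 0.0
    den = 0.0
    for k, kind in enumerate(kinds):
        a, b = x[k], y[k]
        if a is None or b is None:
            continue  # weight 0: attribute cannot be compared
        if kind == "quant":
            w = 1.0
            s = 1.0 - abs(a - b) / ranges[k]
        elif kind == "binary":
            w = 1.0 if (a == 1 or b == 1) else 0.0
            s = 1.0 if (a == 1 and b == 1) else 0.0
        elif kind == "nominal":
            w = 1.0
            s = 1.0 if a == b else 0.0
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        num += w * s
        den += w
    if den == 0.0:
        raise ValueError("no comparable attributes")
    return num / den


kinds = ("quant", "binary", "nominal", "binary")
ranges = (10.0, None, None, None)
x = (2.0, 1, "a", 0)
y = (4.0, 1, "b", 0)
g = gower_similarity(x, y, kinds, ranges)  # (0.8 + 1 + 0) / 3
```

In the example the last binary attribute is 0 in both records, so it gets weight 0 and drops out of the average.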

21 Similarity between clusters. Mean-based distance; nearest neighbor; farthest neighbor; average neighbor. Similarity measures
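
The four between-cluster distances can be sketched as follows, assuming Euclidean distance between points and the usual definitions: distance between cluster means, smallest pairwise distance, largest pairwise distance, and mean pairwise distance. All function names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))


def pairwise(c1, c2):
    """All distances between a point of c1 and a point of c2."""
    return [euclidean(p, q) for p in c1 for q in c2]


def mean_based(c1, c2):
    return euclidean(centroid(c1), centroid(c2))


def nearest_neighbor(c1, c2):
    return min(pairwise(c1, c2))


def farthest_neighbor(c1, c2):
    return max(pairwise(c1, c2))


def average_neighbor(c1, c2):
    ds = pairwise(c1, c2)
    return sum(ds) / len(ds)


c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]
d_mean = mean_based(c1, c2)
d_near = nearest_neighbor(c1, c2)
d_far = farthest_neighbor(c1, c2)
d_avg = average_neighbor(c1, c2)
```

The average-neighbor distance always lies between the nearest-neighbor and farthest-neighbor distances.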

22 Hierarchical clustering Agglomerative: build tree by joining nodes; Divisive: build tree by dividing groups of objects.

23 Example data: Hierarchical clustering

24 Single linkage: find the distance between any two nodes by nearest neighbor distance. Hierarchical clustering

25 Single linkage: Hierarchical clustering

26 Complete linkage: find the distance between any two nodes by farthest neighbor distance. Average linkage: find the distance between any two nodes by average distance. Hierarchical clustering

27 Comments: Hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height. The complete linkage method favors compact, ball-shaped clusters; the single linkage method favors chain-shaped clusters; average linkage is somewhere in between. Hierarchical clustering
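
A naive agglomerative sketch that supports the three linkages discussed above. It is meant to illustrate the idea only: it recomputes all pairwise distances at every step, uses Euclidean distance, and stops when `k` clusters remain instead of building a full tree and cutting it at a height. The function and parameter names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Each linkage reduces the list of point-to-point distances to one number.
LINKAGES = {
    "single": min,                               # nearest neighbor
    "complete": max,                             # farthest neighbor
    "average": lambda ds: sum(ds) / len(ds),     # average neighbor
}


def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Returns the clusters (lists of point indices) and the merge history
    as (members_a, members_b, distance) tuples.
    """
    link = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [euclidean(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = link(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters, merges


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
single_clusters, single_merges = agglomerative(points, 2, "single")
complete_clusters, _ = agglomerative(points, 2, "complete")
average_clusters, _ = agglomerative(points, 2, "average")
```

On well-separated groups like these, all three linkages recover the same two clusters; they differ mainly on elongated or touching groups, where single linkage tends to chain points together.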

28 Model-based clustering Impose certain model assumptions on potential clusters; try to optimize the fit between data and model. The data is viewed as coming from a mixture of probability distributions; each of the distributions represents a cluster.

29 For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point i is from cluster j is: Model-based clustering
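
For a Gaussian cluster, this likelihood is the multivariate normal density evaluated at the data point. In standard notation (the symbols $x_i$, $\mu_j$, $\Sigma_j$, and the dimension $p$ are assumptions for illustration):

```latex
f_j(x_i \mid \mu_j, \Sigma_j)
  = \frac{1}{(2\pi)^{p/2} \, |\Sigma_j|^{1/2}}
    \exp\!\left( -\tfrac{1}{2} (x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j) \right)
```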

30 Given the number of clusters, we try to maximize the likelihood of the mixture, where the mixing proportion of cluster j is the probability that an observation belongs to cluster j. The most commonly used method is the EM algorithm, which iterates between soft cluster assignment and parameter estimation. Model-based clustering
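
The alternation between soft assignment and parameter estimation can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions, not a production implementation: the data are one-dimensional, the starting values are supplied by the caller, a fixed number of iterations replaces a convergence test, and a small floor keeps variances positive. All names are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture, started from the given parameters."""
    k = len(mus)
    mus, variances, weights = list(mus), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: soft assignment of every point to every cluster.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate variance
    return mus, variances, weights, resp


data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
mus, variances, weights, resp = em_gmm_1d(
    data, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
# Hard labels: the cluster with the largest responsibility for each point.
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
```

The responsibilities in `resp` are the soft assignments; taking the largest one per point turns them into a hard clustering.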

31

32 Gaussian cluster models. Common assumptions: From 1 to 4, the model becomes more flexible, but more parameters need to be estimated, so the fit may become less stable. Model-based clustering

