
1 Clustering Algorithms: k-means, Hierarchic Agglomerative Clustering (HAC), …, BIRCH, Association Rule Hypergraph Partitioning (ARHP), Categorical clustering (CACTUS, STIRR), …, STC, QDC

2 Hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix:
1. Start by assigning each item to its own cluster.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
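The steps above can be sketched in a few lines of Python. This is a naive illustration only, assuming 2-D points, Euclidean distance, and single-link cluster distance; all names are illustrative.

```python
import math

def hac(items, dist=math.dist):
    # Step 1: each item starts in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(items))]
    merges = []  # record of merges: a crude textual dendrogram
    # Steps 2-4: merge the closest pair until a single cluster remains.
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance between the closest members
                d = min(dist(items[a], items[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
for left, right, d in hac(points):
    print(left, right, round(d, 2))
```

With four points the loop performs exactly three merges; the two tight pairs merge first, then the two resulting clusters merge at a much larger distance.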

3 Iwona Białynicka-Birula - Clustering Web Search Results Agglomerative hierarchical clustering

4 Clustering result: dendrogram

5 AHC variants
Various ways of calculating cluster similarity:
single-link (minimum)
complete-link (maximum)
group-average (average)

6 Strengths and weaknesses
Can find clusters of arbitrary shapes.
Single link suffers from a chaining effect.
Complete link is sensitive to outliers.
High computational complexity and space requirements.

7 Data Clustering: K-means
Partitional clustering
Requires the number of clusters k in advance

8 K-means
1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group with the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
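The four steps above can be sketched as follows. This is a minimal illustration, assuming Euclidean points and centroids initialized by sampling from the data; all names and parameters are illustrative.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial centroids
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                       # step 2: assign to closest
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        # step 3: recompute each centroid as the mean of its group
        # (keep the old centroid if a group ends up empty)
        new = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:                   # step 4: stop when stable
            break
        centroids = new
    return centroids, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(pts, 2)
print(centroids)
```

On two well-separated blobs like the example data, the algorithm converges to one centroid per blob regardless of which points are sampled as seeds.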

9 Example by Andrew W. Moore

10

11 K-means

12 K-means clustering (k=3)

13 Strengths and weaknesses
Only applicable to data sets where the notion of the mean is defined.
Need to know the number of clusters K in advance.
Sensitive to outliers.
Sensitive to the initial seeds.
Not suitable for discovering clusters that are not hyper-ellipsoids (e.g. an L shape).

14

15

16 Single-pass threshold
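Single-pass threshold clustering assigns each item, in one pass, to the first existing cluster whose centroid lies within a similarity threshold, or starts a new cluster otherwise. A minimal sketch, assuming Euclidean points; the names and the threshold value are illustrative, and the result depends on input order:

```python
import math

def single_pass(points, threshold):
    clusters = []  # each entry is [centroid, members]
    for p in points:
        for c in clusters:
            if math.dist(p, c[0]) <= threshold:
                c[1].append(p)
                # recompute the centroid as the mean of the members
                c[0] = tuple(sum(xs) / len(c[1]) for xs in zip(*c[1]))
                break
        else:
            # no cluster is close enough: start a new one
            clusters.append([p, [p]])
    return clusters

print(len(single_pass([(0, 0), (0.5, 0), (10, 10)], 2.0)))
```

Because each point is examined exactly once, the pass is linear in the number of points (times the number of clusters), which is why this scheme appears in online and web-scale settings.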

17 Document Clustering: k-means
k-means: distance-based flat clustering
Advantages: linear time complexity; works relatively well in low-dimensional space
Drawbacks: distance computation in high-dimensional space; the centroid vector may not summarize the cluster documents well; the initial k clusters affect the quality of the clusters
0. Input: D := {d1, d2, …, dn}; k := the number of clusters
1. Select k document vectors as the initial centroids of k clusters
2. Repeat
3. Select one vector d from the remaining documents
4. Compute similarities between d and the k centroids
5. Put d in the closest cluster and recompute the centroid
6. Until the centroids don't change
7. Output: k clusters of documents
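Step 4 needs a similarity between a document vector and a centroid; cosine similarity over bag-of-words vectors is the usual choice in document clustering. A minimal sketch, assuming raw term-frequency vectors; the example documents are illustrative:

```python
import math
from collections import Counter

def cosine(u, v):
    # u, v are term-frequency dicts (bag-of-words document vectors)
    num = sum(u[t] * v[t] for t in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

docs = ["web search results", "search engine results", "fruit salad recipe"]
vecs = [Counter(d.split()) for d in docs]
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two documents share terms and score well above zero, while documents with no terms in common score exactly zero, which is what makes cosine similarity a usable "closeness" measure for the assignment step.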

18 Document Clustering: HAC
Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering
Advantages: produces better-quality clusters; works relatively well in low-dimensional space
Drawbacks: distance computation in high-dimensional space; quadratic time complexity
0. Input: D := {d1, d2, …, dn}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3. Merge the most similar two clusters, K and L, to form a new cluster KL
4. Compute similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or the specified number of clusters)
6. Output: dendrogram of clusters

