The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP 790-90 Research Seminar BCB 713 Module Spring 2011 Wei Wang.


1 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP 790-90 Research Seminar BCB 713 Module Spring 2011 Wei Wang

2 COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Hierarchical Clustering
Group data objects into a tree of clusters.
[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}. Divisive clustering (DIANA) runs through the same tree in the opposite direction, also labeled Step 0 to Step 4.]

3 AGNES (Agglomerative Nesting)
Initially, each object is its own cluster. Clusters are merged step by step until all objects form a single cluster. In the single-link approach, each cluster is represented by all of the objects in it, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.
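The merging loop described on this slide can be sketched as follows. This is a minimal, quadratic-or-worse illustration and not the course's implementation; the function names, the use of Euclidean distance, and the sample points are assumptions made for the example.

```python
from itertools import combinations
from math import dist

def single_link_distance(a, b):
    """Distance of the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)

def agnes_single_link(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(p,) for p in points]  # initially, each object is a cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_link_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical sample data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
history = agnes_single_link(points)
for left, right, d in history:
    print(left, "+", right, "at distance", round(d, 3))
```

Each entry of `history` records one merge together with the distance at which it happened, which is the information a dendrogram plots.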

4 Dendrogram
A dendrogram shows how clusters are merged hierarchically. It decomposes the data objects into a multi-level nested partitioning (a tree of clusters). A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
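Cutting a dendrogram can be illustrated with a small union-find sketch. It assumes each merge is recorded as `(a, b, height)`, where `a` and `b` are indices of any one object from each of the two merged clusters; that encoding, the function name, and the sample merges are hypothetical choices made for the example.

```python
def cut_dendrogram(n_objects, merges, cut_height):
    """Return the clusters left after discarding merges above cut_height."""
    parent = list(range(n_objects))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, height in merges:
        if height <= cut_height:           # keep only merges below the cut
            parent[find(a)] = find(b)

    groups = {}
    for obj in range(n_objects):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())         # each connected component is a cluster

# Hypothetical dendrogram over five objects a..e (indices 0..4).
merges = [(0, 1, 1.0), (3, 4, 1.5), (2, 3, 2.5), (0, 2, 4.0)]
print(cut_dendrogram(5, merges, 2.0))  # [[0, 1], [2], [3, 4]]
print(cut_dendrogram(5, merges, 3.0))  # [[0, 1], [2, 3, 4]]
```

Raising the cut level yields fewer, larger clusters; lowering it yields more, smaller ones.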

5 DIANA (DIvisive ANAlysis)
Initially, all objects are in one cluster. Clusters are split step by step until each cluster contains only one object.

6 Distance Measures
Four measures of the distance between two clusters: minimum distance, maximum distance, mean distance, and average distance. Notation: m is the mean of a cluster, C is a cluster, and n is the number of objects in a cluster.
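The formulas for these four measures are not present in the transcript. The block below gives the standard textbook definitions, written with the slide's notation (m, C, n); the subscripts i and j and the points p and q are assumed names, and the match with the original slide is an assumption.

```latex
\begin{align*}
d_{\min}(C_i, C_j) &= \min_{p \in C_i,\; q \in C_j} \lVert p - q \rVert \\
d_{\max}(C_i, C_j) &= \max_{p \in C_i,\; q \in C_j} \lVert p - q \rVert \\
d_{\mathrm{mean}}(C_i, C_j) &= \lVert m_i - m_j \rVert \\
d_{\mathrm{avg}}(C_i, C_j) &= \frac{1}{n_i\, n_j} \sum_{p \in C_i} \sum_{q \in C_j} \lVert p - q \rVert
\end{align*}
```

The minimum distance corresponds to the single-link approach used by AGNES, and the maximum distance to complete link.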

7 Challenges of Hierarchical Clustering Methods
Merge and split points are hard to choose, and a merge or split is never undone, so these decisions are critical. The methods do not scale well: $O(n^2)$. What is the bottleneck when the data can't fit in memory? One response is to integrate hierarchical clustering with other techniques, as in BIRCH, CURE, CHAMELEON, and ROCK.

8 BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies. The CF (Clustering Feature) tree is a hierarchical data structure that summarizes object information. Clustering objects → clustering the leaf nodes of the CF tree.

9 Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS), where N is the number of data points, $LS = \sum_{i=1}^{N} X_i$, and $SS = \sum_{i=1}^{N} X_i^2$.
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190)).
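The example on this slide can be checked with a few lines of Python. The function name `clustering_feature` is an assumption for the sketch; SS is kept per dimension, as in the slide's example.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with LS and SS computed per dimension."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)       # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)  # sum of squares
    return n, ls, ss

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, (16, 30), (54, 190))
```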

10 CF-tree in BIRCH
A clustering feature summarizes the statistics of a subcluster: its 0th, 1st, and 2nd moments. It registers the crucial measurements for computing clusters and uses storage efficiently. A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A nonleaf node has descendants, or "children", and stores the sums of the CFs of its children.
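Because N, LS, and SS are all sums, the CF of a nonleaf node can be obtained by adding its children's CFs component by component, without revisiting the data. A minimal sketch, with assumed function names and the five example points split into two hypothetical child subclusters:

```python
def merge_cf(cf_a, cf_b):
    """Add two clustering features component by component."""
    n_a, ls_a, ss_a = cf_a
    n_b, ls_b, ss_b = cf_b
    return (
        n_a + n_b,
        tuple(x + y for x, y in zip(ls_a, ls_b)),
        tuple(x + y for x, y in zip(ss_a, ss_b)),
    )

def centroid(cf):
    """Mean of the subcluster, recovered from N and LS alone."""
    n, ls, _ = cf
    return tuple(x / n for x in ls)

cf_left = (2, (5, 10), (13, 52))     # points (3,4), (2,6)
cf_right = (3, (11, 20), (41, 138))  # points (4,5), (4,7), (3,8)
parent = merge_cf(cf_left, cf_right)
print(parent)            # (5, (16, 30), (54, 190))
print(centroid(parent))  # (3.2, 6.0)
```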

11 CF Tree
[Figure: a CF tree with B = 7 and L = 6. The root and the non-leaf nodes hold entries of the form (CF i, child i); the leaf nodes hold CF entries and are chained together by prev and next pointers.]

12 Parameters of a CF-tree
Branching factor: the maximum number of children. Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes.
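The threshold test can be evaluated from a CF alone. The sketch below assumes the usual definition of diameter as the root of the average squared pairwise distance, which gives D^2 = (2·N·SS − 2·|LS|^2) / (N·(N − 1)) with SS summed over dimensions; the function names and the threshold value 3.0 are hypothetical.

```python
from math import sqrt

def diameter(cf):
    """Diameter of a subcluster computed from its CF = (N, LS, SS)."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ss_total = sum(ss)                   # sum of squared norms of the points
    ls_norm_sq = sum(x * x for x in ls)  # squared norm of the linear sum
    d_sq = (2 * n * ss_total - 2 * ls_norm_sq) / (n * (n - 1))
    return sqrt(max(d_sq, 0.0))          # guard against rounding below zero

def can_absorb(cf, point, threshold):
    """Would the leaf entry stay within the threshold after adding point?"""
    n, ls, ss = cf
    candidate = (
        n + 1,
        tuple(a + x for a, x in zip(ls, point)),
        tuple(a + x * x for a, x in zip(ss, point)),
    )
    return diameter(candidate) <= threshold

cf = (5, (16, 30), (54, 190))
print(round(diameter(cf), 4))
print(can_absorb(cf, (3, 6), 3.0))    # True: a nearby point fits
print(can_absorb(cf, (20, 20), 3.0))  # False: a distant point would need a new entry
```

A smaller threshold produces more, tighter leaf entries and therefore a larger tree.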

13 BIRCH Clustering
Phase 1: scan the database to build an initial in-memory CF tree, a multi-level compression of the data that tries to preserve its inherent clustering structure. Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.

14 Pros & Cons of BIRCH
Pros: linear scalability; good clustering with a single scan; quality can be further improved by a few additional scans. Cons: handles only numeric data; sensitive to the order of the data records.

15 Drawbacks of Square Error Based Methods
One representative per cluster: good only for convex-shaped clusters of similar size and density. The number of clusters k is a parameter: good only if k can be reasonably estimated.

16 CURE: the Ideas
Each cluster has c representatives. Choose c well-scattered points in the cluster and shrink them towards the mean of the cluster by a fraction α. The representatives capture the physical shape and geometry of the cluster. Merge the closest two clusters, where the distance between two clusters is the distance between their two closest representatives.
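The three ideas on this slide (scattered points, shrinking, and representative-based distance) can be sketched as below. The farthest-point heuristic used to pick "well scattered" points is an assumption, since the slide does not say how they are chosen; the function names, Euclidean distance, and sample clusters are also hypothetical.

```python
from math import dist

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def scattered_points(points, c):
    """Pick up to c points greedily, each as far as possible from those chosen."""
    points = list(dict.fromkeys(points))  # drop exact duplicates, keep order
    center = mean(points)
    chosen = [max(points, key=lambda p: dist(p, center))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max(
            (p for p in points if p not in chosen),
            key=lambda p: min(dist(p, q) for q in chosen),
        ))
    return chosen

def representatives(points, c, alpha):
    """Shrink the scattered points towards the cluster mean by a fraction alpha."""
    center = mean(points)
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, center))
        for p in scattered_points(points, c)
    ]

def cluster_distance(reps_a, reps_b):
    """Distance between the two closest representatives."""
    return min(dist(p, q) for p in reps_a for q in reps_b)

# Hypothetical clusters: two squares of side 2, ten units apart.
cluster_a = [(0, 0), (0, 2), (2, 0), (2, 2)]
cluster_b = [(10, 0), (10, 2), (12, 0), (12, 2)]
reps_a = representatives(cluster_a, c=4, alpha=0.5)
reps_b = representatives(cluster_b, c=4, alpha=0.5)
print(sorted(reps_a))
print(cluster_distance(reps_a, reps_b))
```

With alpha = 0 the representatives stay on the cluster boundary, and with alpha = 1 they all collapse onto the mean, which reduces to one representative per cluster.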

