
Published by Denisse Asbury. Modified about 1 year ago.

1
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
By Tian Zhang, Raghu Ramakrishnan, and Miron Livny
Presented by Vladimir Jelić 3218/10

2
What is Data Clustering?
– A cluster is a closely packed group: a collection of data objects that are similar to one another and are treated collectively as a group.
– Data clustering is the partitioning of a dataset into clusters.

3
Data Clustering
– Helps understand the natural grouping or structure in a dataset
– Given a large set of multidimensional data:
  – The data space is usually not uniformly occupied
  – Clustering identifies the sparse and the crowded regions
  – Helps visualization

4
Some Clustering Applications
– Biology: building groups of genes with related patterns
– Marketing: partitioning the population of consumers into market segments
– Web: dividing WWW pages into genres
– Image segmentation: for object recognition
– Land use: identifying areas of similar land use from satellite images

5
Clustering Problems
– Today many datasets are too large to fit into main memory
– For such datasets the dominating cost of a clustering algorithm is I/O, because disk seek times are orders of magnitude higher than RAM access times

6
Previous Work
Two classes of clustering algorithms:
– Probability-based. Examples: COBWEB and CLASSIT
– Distance-based. Examples: KMEANS, KMEDOIDS, and CLARANS

7
Previous Work: COBWEB
– Uses a probabilistic approach to make decisions
– Clusters are represented with a probabilistic description
– Probabilistic representations of clusters are expensive
– Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data

8
Previous Work: KMeans
– Distance-based approach, so a distance measure must be defined between any two instances
– Sensitive to instance order
– Instances must be stored in memory
– All instances must be initially available
– May have exponential run time in the worst case

9
Previous Work: CLARANS
– Also a distance-based approach, so a distance measure must be defined between any two instances
– The computational complexity of CLARANS is about O(n^2)
– Sensitive to instance order
– Ignores the fact that not all data points in the dataset are equally important

10
Contributions of BIRCH
– Each clustering decision is made without scanning all data points
– BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes
– BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)

11
Background Knowledge (1)
Given a cluster of instances, we define its:
– Centroid
– Radius
– Diameter
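
As a sketch of these three definitions, the block below writes them in the form commonly given for BIRCH. The symbols are assumptions not shown on the slide: $\vec{X}_i$ ($i = 1, \dots, N$) are the $N$ points of the cluster, $\vec{X}_0$ is the centroid, $R$ the radius, and $D$ the diameter.

```latex
\begin{align*}
\vec{X}_0 &= \frac{1}{N}\sum_{i=1}^{N} \vec{X}_i
  && \text{(centroid)} \\
R &= \left( \frac{1}{N}\sum_{i=1}^{N}
      \lVert \vec{X}_i - \vec{X}_0 \rVert^2 \right)^{1/2}
  && \text{(radius: average distance from the points to the centroid)} \\
D &= \left( \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N}
      \lVert \vec{X}_i - \vec{X}_j \rVert^2 \right)^{1/2}
  && \text{(diameter: average pairwise distance within the cluster)}
\end{align*}
```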

12
Background Knowledge (2)
Distance measures between two clusters:
– Centroid Manhattan distance
– Centroid Euclidean distance
– Average inter-cluster distance
– Average intra-cluster distance
– Variance increase
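
A sketch of how these five measures are usually written for BIRCH follows. The notation is an assumption, not taken from the slide: cluster 1 holds the points $\vec{X}_i$ for $i = 1, \dots, N_1$ with centroid $\vec{X}_{01}$, cluster 2 holds $\vec{X}_j$ for $j = N_1 + 1, \dots, N_1 + N_2$ with centroid $\vec{X}_{02}$, $\vec{X}_0$ is the centroid of the merged cluster, and the superscript $(k)$ selects the $k$-th of $d$ dimensions. Some presentations define the variance increase with an extra square root; this sketch uses the plain difference of summed squared deviations.

```latex
\begin{align*}
D_0 &= \lVert \vec{X}_{01} - \vec{X}_{02} \rVert_2
  && \text{(centroid Euclidean distance)} \\
D_1 &= \sum_{k=1}^{d} \left| \vec{X}_{01}^{(k)} - \vec{X}_{02}^{(k)} \right|
  && \text{(centroid Manhattan distance)} \\
D_2 &= \left( \frac{1}{N_1 N_2}
      \sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2}
      \lVert \vec{X}_i - \vec{X}_j \rVert^2 \right)^{1/2}
  && \text{(average inter-cluster distance)} \\
D_3 &= \left( \frac{1}{(N_1+N_2)(N_1+N_2-1)}
      \sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2}
      \lVert \vec{X}_i - \vec{X}_j \rVert^2 \right)^{1/2}
  && \text{(average intra-cluster distance)} \\
D_4 &= \sum_{k=1}^{N_1+N_2} \lVert \vec{X}_k - \vec{X}_0 \rVert^2
      - \sum_{i=1}^{N_1} \lVert \vec{X}_i - \vec{X}_{01} \rVert^2
      - \sum_{j=N_1+1}^{N_1+N_2} \lVert \vec{X}_j - \vec{X}_{02} \rVert^2
  && \text{(variance increase)}
\end{align*}
```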

13
Clustering Features (CF)
– The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the dataset.
– Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS and SS are defined below.

14
Clustering Feature (CF)
Given N d-dimensional data points in a cluster, {Xi} where i = 1, 2, …, N:
CF = (N, LS, SS)
– N is the number of data points in the cluster
– LS is the linear sum of the N data points
– SS is the square sum of the N data points
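
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `cf_from_points`, `centroid`, and `radius` are hypothetical, and SS is kept as a per-dimension vector to match the worked example in this presentation (the original paper sums the squares over all dimensions into a single scalar). The sketch also shows why the triple is sufficient: the centroid and radius can be recovered from (N, LS, SS) without revisiting the points.

```python
from math import sqrt


def cf_from_points(points):
    """Build CF = (N, LS, SS) for a non-empty list of d-dimensional points."""
    d = len(points[0])
    n = len(points)
    ls = tuple(sum(p[k] for p in points) for k in range(d))       # linear sum
    ss = tuple(sum(p[k] ** 2 for p in points) for k in range(d))  # square sum
    return n, ls, ss


def centroid(cf):
    """Centroid = LS / N."""
    n, ls, _ = cf
    return tuple(v / n for v in ls)


def radius(cf):
    """Radius from the CF alone: sqrt(sum(SS)/N - ||LS/N||^2)."""
    n, _, ss = cf
    c = centroid(cf)
    variance = sum(ss) / n - sum(v * v for v in c)
    return sqrt(max(variance, 0.0))  # clamp tiny negative rounding errors


cf = cf_from_points([(1, 2), (3, 4)])
print(cf)            # (2, (4, 6), (10, 20))
print(centroid(cf))  # (2.0, 3.0)
print(radius(cf))    # about 1.414, i.e. sqrt(2)
```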

15
CF Additivity Theorem (1)
If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

16
CF Additivity Theorem (2)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
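
The additivity theorem can be checked against this example with a short sketch. The helper names `cf_from_points` and `merge_cf` are hypothetical, and the split of the five points into a two-point and a three-point sub-cluster is an arbitrary choice made only for illustration.

```python
def cf_from_points(points):
    """Build CF = (N, LS, SS) with per-dimension LS and SS, as in the example."""
    d = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(d))
    ss = tuple(sum(p[k] ** 2 for p in points) for k in range(d))
    return len(points), ls, ss


def merge_cf(cf1, cf2):
    """CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (
        n1 + n2,
        tuple(a + b for a, b in zip(ls1, ls2)),
        tuple(a + b for a, b in zip(ss1, ss2)),
    )


# Two disjoint sub-clusters that together hold the five example points.
cf1 = cf_from_points([(3, 4), (2, 6)])
cf2 = cf_from_points([(4, 5), (4, 7), (3, 8)])

merged = merge_cf(cf1, cf2)
print(merged)  # (5, (16, 30), (54, 190))

# Merging the CFs gives the same result as summarizing all points directly.
assert merged == cf_from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
```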

17
Properties of CF-Tree
– Each non-leaf node has at most B entries
– Each leaf node has at most L CF entries, each of which satisfies the threshold T (a limit on the radius or diameter of the subcluster)
– Node size is determined by the dimensionality of the data space and the input parameter P (page size)

18
CF Tree Insertion
– Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
– Modifying the leaf: test whether the closest leaf entry can absorb the new entry without violating the threshold. If it cannot, add a new entry to the leaf; if the leaf has no room for it, split the leaf node
– Modifying the path: update the CF information up the path to the root
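
The "modifying the leaf" step can be sketched as follows. This is a simplified illustration under stated assumptions, not the full algorithm: it handles a single leaf (a plain Python list of CF tuples), it omits the tree descent and the path update, it uses the radius as the threshold test and centroid Euclidean distance as the metric, and the names `insert_into_leaf`, `threshold`, and `max_entries` are hypothetical.

```python
from math import sqrt


def merge_cf(a, b):
    """Additivity: add the two CF triples component-wise."""
    return (
        a[0] + b[0],
        tuple(x + y for x, y in zip(a[1], b[1])),
        tuple(x + y for x, y in zip(a[2], b[2])),
    )


def centroid(cf):
    return tuple(v / cf[0] for v in cf[1])


def radius(cf):
    c = centroid(cf)
    return sqrt(max(sum(cf[2]) / cf[0] - sum(v * v for v in c), 0.0))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def insert_into_leaf(leaf, point, threshold, max_entries):
    """Insert a point into one leaf; return True if the leaf now needs a split."""
    new_cf = (1, tuple(point), tuple(x * x for x in point))
    if leaf:
        # Find the closest existing entry by centroid distance.
        i = min(range(len(leaf)), key=lambda k: sq_dist(centroid(leaf[k]), point))
        candidate = merge_cf(leaf[i], new_cf)
        if radius(candidate) <= threshold:
            leaf[i] = candidate  # absorbed without violating the threshold
            return False
    leaf.append(new_cf)          # start a new subcluster entry
    return len(leaf) > max_entries


leaf = []
results = [
    insert_into_leaf(leaf, p, threshold=1.0, max_entries=2)
    for p in [(0, 0), (0.5, 0), (10, 10), (20, 20)]
]
print(results)    # [False, False, False, True]
print(len(leaf))  # 3 entries, one more than allowed, so the leaf must split
```

In a full CF tree the `True` result would trigger a node split, and the CF entries on the path back to the root would then be updated.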

19
Example of the BIRCH Algorithm
[Figure: a CF tree whose root points to leaf nodes LN1, LN2, and LN3, which hold subclusters sc1 to sc7; a new subcluster sc8 is inserted.]

20
Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split into LN1’ and LN1”.
[Figure: root with leaf nodes LN1’, LN1”, LN2, and LN3 over subclusters sc1 to sc8.]

21
Merge Operation in BIRCH
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.
[Figure: root with non-leaf nodes NLN1 and NLN2 above leaf nodes LN1’, LN1”, LN2, and LN3 over subclusters sc1 to sc8.]

22
Merge Operation in BIRCH
Assume that the subclusters are numbered according to the order of formation.
[Figure: root with leaf nodes LN1 and LN2, which hold subclusters sc1 to sc6.]

23
Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN2 is split into LN2’ and LN2”.
[Figure: root with leaf nodes LN1, LN2’, and LN2” over subclusters sc1 to sc6.]

24
Merge Operation in BIRCH
LN2’ and LN1 will be merged, and the newly formed node will be split immediately into LN3’ and LN3”.
[Figure: root with leaf nodes LN2”, LN3’, and LN3” over subclusters sc1 to sc6.]

25
BIRCH Clustering Algorithm (1)
– Phase 1: Scan all data and build an initial in-memory CF tree
– Phase 2: Condense the tree to a desirable size by building a smaller CF tree
– Phase 3: Global clustering
– Phase 4: Cluster refining; this phase is optional and requires more passes over the data to refine the results

26
BIRCH Clustering Algorithm (2)

27
BIRCH – Phase 1
– Start with an initial threshold and insert points into the tree
– If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then continue with the remaining points
– A good initial threshold is important but hard to figure out
– Outlier removal: outliers are removed when the tree is rebuilt
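
The rebuild loop described above can be sketched as follows. This is a heavily simplified illustration under stated assumptions: a flat list of CF tuples stands in for the CF tree, "running out of memory" is modeled by a hypothetical `max_entries` limit, the threshold is simply multiplied by a hypothetical `growth` factor (the slide does not say how the new threshold is chosen), and outlier removal is omitted.

```python
from math import sqrt


def merge_cf(a, b):
    return (
        a[0] + b[0],
        tuple(x + y for x, y in zip(a[1], b[1])),
        tuple(x + y for x, y in zip(a[2], b[2])),
    )


def centroid(cf):
    return tuple(v / cf[0] for v in cf[1])


def radius(cf):
    c = centroid(cf)
    return sqrt(max(sum(cf[2]) / cf[0] - sum(v * v for v in c), 0.0))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def absorb(entries, cf, threshold):
    """Merge cf into the closest entry if the radius allows it, else append it."""
    if entries:
        c = centroid(cf)
        i = min(range(len(entries)), key=lambda k: sq_dist(centroid(entries[k]), c))
        candidate = merge_cf(entries[i], cf)
        if radius(candidate) <= threshold:
            entries[i] = candidate
            return
    entries.append(cf)


def phase1(points, threshold, max_entries, growth=2.0):
    """Summarize points into at most max_entries CFs, raising the threshold as needed."""
    if threshold <= 0 or max_entries < 1 or growth <= 1:
        raise ValueError("need threshold > 0, max_entries >= 1, growth > 1")
    entries = []
    for p in points:
        absorb(entries, (1, tuple(p), tuple(x * x for x in p)), threshold)
        while len(entries) > max_entries:      # "out of memory"
            threshold *= growth                # coarser subclusters
            old, entries = entries, []
            for cf in old:                     # reinsert old entries, not raw points
                absorb(entries, cf, threshold)
    return entries, threshold


points = [(0, 0), (0.1, 0), (10, 10), (10.1, 10), (20, 20), (20.1, 20)]
entries, final_threshold = phase1(points, threshold=0.1, max_entries=3)
print(len(entries), final_threshold)  # 3 0.1
```

Because the rebuild reinserts CF entries rather than raw points, the data itself is still scanned only once.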

28
BIRCH – Phase 2
– Optional
– The algorithm used in Phase 3 sometimes performs well only within a certain input size range, so Phase 2 prepares the tree for Phase 3
– BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones

29
BIRCH – Phase 3
– Problems after Phase 1:
  – Input order affects the results
  – Splitting is triggered by node size
– Phase 3:
  – Cluster all leaf entries by their CF values using an existing algorithm
  – Algorithm used here: agglomerative hierarchical clustering
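
A minimal sketch of agglomerative clustering applied directly to CF entries follows. It assumes centroid Euclidean distance as the merge criterion and a user-supplied target of `k` clusters; both are illustrative choices, and the names `agglomerate` and `k` are hypothetical. The point of the sketch is that, thanks to CF additivity, clusters can be merged without touching the original data points.

```python
def merge_cf(a, b):
    return (
        a[0] + b[0],
        tuple(x + y for x, y in zip(a[1], b[1])),
        tuple(x + y for x, y in zip(a[2], b[2])),
    )


def centroid(cf):
    return tuple(v / cf[0] for v in cf[1])


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def agglomerate(entries, k):
    """Repeatedly merge the two CFs with the closest centroids until k remain."""
    clusters = list(entries)
    while len(clusters) > max(k, 1):
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        merged = merge_cf(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Four leaf entries given as (N, LS, SS); centroids: (0,0), (1,1), (10,10), (11,10).
leaf_entries = [
    (2, (0, 0), (0, 0)),
    (1, (1, 1), (1, 1)),
    (2, (20, 20), (200, 200)),
    (1, (11, 10), (121, 100)),
]
result = agglomerate(leaf_entries, k=2)
print(sorted(result))  # [(3, (1, 1), (1, 1)), (3, (31, 30), (321, 300))]
```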

30
BIRCH – Phase 4
– Optional
– Makes additional passes over the dataset and reassigns each data point to the closest centroid from Phase 3
– Recalculates the centroids and redistributes the items
– Always converges (no matter how many times Phase 4 is repeated)
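
One refinement pass can be sketched as below: each point is assigned to its closest centroid, and the centroids are then recalculated from the assigned points. The function name `refine_once` is hypothetical, Euclidean distance is assumed, and a centroid that attracts no points is simply left unchanged in this sketch.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def refine_once(points, centroids):
    """One pass: reassign points to the closest centroid, then recompute centroids."""
    d = len(centroids[0])
    sums = [[0.0] * d for _ in centroids]
    counts = [0] * len(centroids)
    labels = []
    for p in points:
        j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        labels.append(j)
        counts[j] += 1
        for k in range(d):
            sums[j][k] += p[k]
    new_centroids = [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centroids[j])
        for j in range(len(centroids))
    ]
    return labels, new_centroids


points = [(0, 0), (2, 0), (10, 0), (12, 0)]
labels, centroids = refine_once(points, [(3, 0), (9, 0)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.0, 0.0), (11.0, 0.0)]

# A second pass changes nothing here, so the refinement has converged.
assert refine_once(points, centroids) == (labels, centroids)
```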

31
Conclusions (1)
– BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets
– Scans the whole dataset only once (the optional Phase 4 needs additional passes)
– Handles outliers better
– Superior to other algorithms in stability and scalability

32
Conclusions (2)
– Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node doesn’t always correspond to what a user may consider a natural cluster.
– Moreover, if the clusters are not spherical in shape, BIRCH doesn’t perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

33
References
– T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD '96.
– J. Oberst. Efficient Data Clustering and How to Groom Fast-Growing Trees.
– P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining.
