
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies A hierarchical clustering method. It introduces two concepts: the clustering feature (CF) and the clustering feature tree (CF tree). These structures help the clustering method achieve good speed and scalability in large databases.

2 Motivation  Major weakness of agglomerative clustering methods: they do not scale well (time complexity of at least O(n²), where n is the total number of objects) and they can never undo what was done in a previous step.  BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): incrementally constructs a CF (clustering feature) tree, a hierarchical data structure that summarizes cluster information, and finds a good clustering with a single scan of the data; multi-phase clustering then improves the quality with a few additional scans.

3 Summarized Info for a Single Cluster  Given a cluster with N objects, three summary statistics are defined: the centroid, the radius, and the diameter (see the definitions below).
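The formulas for these quantities appeared only as images on the original slide; the standard definitions from the BIRCH paper, with X_i denoting the i-th object of the cluster, are:

    X_0 = \frac{\sum_{i=1}^{N} X_i}{N}                                         % centroid
    R   = \sqrt{\frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N}}                        % radius: average distance from the members to the centroid
    D   = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)}}    % diameter: average pairwise distance within the cluster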

4 Summarized Info for Two Clusters  Given two clusters with N1 and N2 objects respectively, three inter-cluster measures are defined: the centroid Euclidean distance, the centroid Manhattan distance, and the average inter-cluster distance (see the definitions below).
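These distance formulas were likewise image-only on the original slide; the standard definitions (D0, D1, D2 in the BIRCH paper), where X_{01} and X_{02} are the two centroids and X_i, Y_j are the members of the first and second cluster, are:

    D_0 = \sqrt{(X_{01} - X_{02})^2}                                            % centroid Euclidean distance
    D_1 = \sum_{k=1}^{d} \lvert X_{01}^{(k)} - X_{02}^{(k)} \rvert              % centroid Manhattan distance
    D_2 = \sqrt{\frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (X_i - Y_j)^2}{N_1 N_2}}   % average inter-cluster distance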

5 Clustering Feature (CF)  CF = (N, LS, SS), where: N = |C| is the number of data points; LS = ∑i=1..N Oi = (∑i Vi1, ∑i Vi2, …, ∑i Vid) is the linear sum of the N data points Oi = (Vi1, …, Vid); SS = (∑i Vi1², …, ∑i Vid²) is the square sum of the N data points.

6 Example of a Clustering Feature Vector  For the five data points (3,4), (2,6), (4,5), (4,7), (3,8): N = 5, LS = (16, 30), SS = (54, 190), so CF = (5, (16, 30), (54, 190)).


8 CF Additive Theorem  Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).  If we merge C1 with C2, the CF for the merged cluster C is CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).  Why CF? A CF is all that is needed to compute the summarized info for a single cluster (centroid, radius, diameter) and for two clusters (the inter-cluster distances), and by the additive theorem CFs can be maintained incrementally as points and sub-clusters are merged.

9 Example of Clustering Feature Vector Additivity  Cluster 1: data points (3,4), (2,6), (4,5), (4,7), (3,8), giving CF1 = (5, (16, 30), (54, 190)).  Cluster 2: data points (6,2), (7,2), (7,4), (8,4), (8,5), giving CF2 = (5, (36, 17), (262, 65)).  Merged cluster: CF = CF1 + CF2 = (10, (52, 47), (316, 255)).
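As a quick check of the arithmetic above, here is a small Python sketch (not from the slides; numpy and the helper names are my own, illustrative choices) that computes the CF vectors of the two example clusters and verifies the additive theorem:

    import numpy as np

    def clustering_feature(points):
        # CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum
        pts = np.asarray(points, dtype=float)
        return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

    def merge_cf(cf1, cf2):
        # Additive theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

    def centroid_and_radius(cf):
        # The single-cluster summaries from slide 3 can be derived from the CF alone
        n, ls, ss = cf
        centroid = ls / n
        radius = np.sqrt(ss.sum() / n - (centroid ** 2).sum())
        return centroid, radius

    c1 = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
    c2 = [(6, 2), (7, 2), (7, 4), (8, 4), (8, 5)]
    cf1 = clustering_feature(c1)            # (5, [16. 30.], [ 54. 190.])
    cf2 = clustering_feature(c2)            # (5, [36. 17.], [262.  65.])
    print(merge_cf(cf1, cf2))               # (10, [52. 47.], [316. 255.])
    print(clustering_feature(c1 + c2))      # same CF, computed directly from all ten points
    print(centroid_and_radius(cf1))         # centroid [3.2 6.], radius = 1.6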

10 Clustering Feature Tree (CFT)  A clustering feature tree (CFT) is an alternative representation of the data set: Each non-leaf node is a cluster comprising the sub-clusters that correspond to its entries (at most B entries per non-leaf node). Each leaf node is a cluster comprising the sub-clusters that correspond to its entries (at most L entries per leaf node), and each such sub-cluster's diameter is at most the threshold T; the larger T is, the smaller the CFT. Each node must fit in a memory page.

11 Example of a CF Tree (figure, B = 7, L = 6)  The root and the non-leaf nodes hold entries of the form [CFi, childi], where childi points to a child node and CFi summarizes the sub-cluster rooted at that child. The leaf nodes hold CF entries only and carry prev/next pointers that link all leaves into a chain.

12 BIRCH Phase 1  Phase 1 scans the data points and builds the in-memory CFT. For each incoming data point d:  Starting from the root, traverse down the tree, choosing the closest child at each level, until a leaf node is reached.  Search for the closest entry Li in that leaf node.  If d can be absorbed into Li without violating the diameter threshold T, update the CF vector of Li; otherwise, if the leaf node has space for a new entry, insert d as a new entry; otherwise split the leaf node.  Once d is inserted, update the CF vectors of all nodes along the path back to the root; if a split occurred, a new entry must be inserted into the parent node, which may cause further splits up the tree. A minimal sketch of this insertion procedure follows.
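The slides give no code for Phase 1; the following is a minimal, self-contained Python sketch of CF-tree insertion. It is a simplified illustration, not the paper's implementation: the class and function names, the farthest-pair split heuristic, and the default B, L, T values are my own illustrative choices, and the paper's outlier handling and merge-after-split refinement are omitted.

    import math

    class CF:
        """Clustering feature (N, LS, SS) with per-dimension linear and square sums."""
        def __init__(self, n, ls, ss):
            self.n, self.ls, self.ss = n, list(ls), list(ss)

        def add(self, other):
            self.n += other.n
            self.ls = [a + b for a, b in zip(self.ls, other.ls)]
            self.ss = [a + b for a, b in zip(self.ss, other.ss)]

        def centroid(self):
            return [x / self.n for x in self.ls]

        def diameter_if_merged(self, other):
            # Diameter D of the merged cluster, from CFs only:
            # sum_{i,j} ||Xi - Xj||^2 = 2*N*sum(SS) - 2*||LS||^2
            n = self.n + other.n
            ls = [a + b for a, b in zip(self.ls, other.ls)]
            ss = [a + b for a, b in zip(self.ss, other.ss)]
            sq = 2 * n * sum(ss) - 2 * sum(x * x for x in ls)
            return math.sqrt(max(sq, 0.0) / (n * (n - 1))) if n > 1 else 0.0

    class Node:
        def __init__(self, leaf, capacity):
            self.leaf = leaf                  # capacity is L for leaves, B for non-leaf nodes
            self.capacity = capacity
            self.entries = []                 # list of [CF, child]; child is None in a leaf

    def point_cf(p):
        return CF(1, [float(x) for x in p], [float(x) ** 2 for x in p])

    def closest(entries, cf):
        c = cf.centroid()
        return min(range(len(entries)), key=lambda i: math.dist(entries[i][0].centroid(), c))

    def summarize(entries):
        d = len(entries[0][0].ls)
        total = CF(0, [0.0] * d, [0.0] * d)
        for e_cf, _ in entries:
            total.add(e_cf)
        return total

    def split(node):
        # Pick the two farthest-apart entries as seeds and redistribute the rest by closeness.
        cents = [e[0].centroid() for e in node.entries]
        pairs = [(i, j) for i in range(len(cents)) for j in range(i + 1, len(cents))]
        si, sj = max(pairs, key=lambda p: math.dist(cents[p[0]], cents[p[1]]))
        halves = (Node(node.leaf, node.capacity), Node(node.leaf, node.capacity))
        for k, entry in enumerate(node.entries):
            near_left = math.dist(cents[k], cents[si]) <= math.dist(cents[k], cents[sj])
            halves[0 if near_left else 1].entries.append(entry)
        return [[summarize(h.entries), h] for h in halves]

    def insert(node, cf, T):
        """Insert one point's CF; return two replacement entries if `node` split, else None."""
        if node.leaf:
            if node.entries:
                i = closest(node.entries, cf)
                if node.entries[i][0].diameter_if_merged(cf) <= T:
                    node.entries[i][0].add(cf)        # absorb into the closest sub-cluster
                    return None
            node.entries.append([cf, None])           # otherwise start a new sub-cluster
        else:
            i = closest(node.entries, cf)
            result = insert(node.entries[i][1], cf, T)
            if result is None:
                node.entries[i][0].add(cf)            # update the CF along the path to the root
            else:
                node.entries[i:i + 1] = result        # child split: replace its entry with two
        if len(node.entries) > node.capacity:
            return split(node)
        return None

    def birch_phase1(points, B=7, L=6, T=1.5):
        root = Node(leaf=True, capacity=L)
        for p in points:
            result = insert(root, point_cf(p), T)
            if result is not None:                    # root split: the tree grows one level taller
                root = Node(leaf=False, capacity=B)
                root.entries = result
        return root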

13 Example of the BIRCH Algorithm (figure)  The root points to leaf nodes LN1, LN2, and LN3, which hold the sub-clusters sc1–sc7; a new sub-cluster sc8 is to be inserted.

14 Merge Operation in BIRCH (figure)  If the branching factor of a leaf node cannot exceed 3, then inserting sc8 forces LN1 to split into LN1' and LN1''.

15 Merge Operation in BIRCH (figure)  If the branching factor of a non-leaf node cannot exceed 3, then the root itself must be split (into non-leaf nodes NLN1 and NLN2) and the height of the CF tree increases by one.

16 Merge Operation in BIRCH (figure)  The root points to leaf nodes LN1 and LN2, which hold sub-clusters sc1–sc6; assume that the sub-clusters are numbered according to their order of formation.

17 (figure)  If the branching factor of a leaf node cannot exceed 3, then LN2 is split into LN2' and LN2''.

18 (figure)  LN2' and LN1 will then be merged, and the newly formed node is split immediately into LN3' and LN3''.

19 Cases that Trouble BIRCH (figure)  The objects are numbered in order of arrival; assume that the distance between objects 1 and 2 exceeds the diameter threshold, so they seed two separate sub-clusters (Subcluster 1 and Subcluster 2) even though later arrivals may show that they belong together.

20 Order Dependence  Incremental clustering algorithms such as BIRCH suffer from order dependence.  As the previous example demonstrates, the split and merge operations can alleviate order dependence to a certain extent.  In the example that demonstrates the merge operation, the split and merge operations together improve the clustering quality.  However, order dependence cannot be completely avoided: if no new objects had arrived to form subcluster 6, the clustering quality would not be satisfactory.

21 Several Issues with the CF Tree  The number of entries in a CFT node is limited by the page size, so a node may not correspond to a natural cluster: Two sub-clusters that should be in one cluster may be split across nodes. Two sub-clusters that should not be in one cluster may be kept in the same node (depending on input order and data skew).  Sensitivity to skewed input order: A data point may end up in a leaf node where it does not belong. If a data point is inserted twice at different times, the two copies may end up in two distinct leaf nodes.

22 Remaining BIRCH Phases  Global clustering: Apply an existing clustering algorithm (e.g., AGNES) to the sub-clusters at the leaf nodes; each sub-cluster may be treated as a single point (its centroid) and the clustering performed on these points; the clusters produced are closer to the data distribution pattern.  Cluster refinement: Redistribute (re-label) all data points with respect to the clusters produced by global clustering; this phase may be repeated to improve cluster quality. A sketch of these two phases follows.
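A hedged Python sketch of these two phases, assuming the leaf sub-clusters from Phase 1 are available as (N, LS, SS) triples of numpy arrays and that there are at least k of them; scipy's agglomerative clustering stands in for AGNES, and the function names and parameters are illustrative, not from the paper or the slides:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def global_clustering(leaf_cfs, k):
        """Cluster the leaf sub-clusters, each represented by its centroid LS / N."""
        centroids = np.array([ls / n for n, ls, _ss in leaf_cfs])
        dendrogram = linkage(centroids, method="average")          # AGNES-style agglomerative step
        labels = fcluster(dendrogram, t=k, criterion="maxclust")   # cut the dendrogram into k clusters
        # Merge the CFs of the sub-clusters assigned to each global cluster (additive theorem)
        global_centroids = []
        for c in range(1, k + 1):
            members = [i for i, lab in enumerate(labels) if lab == c]
            n = sum(leaf_cfs[i][0] for i in members)
            ls = sum(leaf_cfs[i][1] for i in members)
            global_centroids.append(ls / n)
        return np.array(global_centroids)

    def refine(points, global_centroids):
        """Cluster refinement: re-label every original data point by its nearest global centroid."""
        pts = np.asarray(points, dtype=float)
        sq_dist = ((pts[:, None, :] - global_centroids[None, :, :]) ** 2).sum(axis=2)
        return sq_dist.argmin(axis=1)

Treating each sub-cluster as a single unweighted point follows the simplification mentioned on the slide; a more faithful version would weight each sub-cluster by the number of points it summarizes.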

23 Data Set (figure)  The synthetic data set used for evaluation: cluster centers are arranged on a grid of 10 rows and 10 columns (100 clusters).

24 BIRCH and CLARANS comparison

25 BIRCH and CLARANS comparison

26 BIRCH and CLARANS comparison

27 BIRCH and CLARANS comparison  Result visualization: Each cluster is represented as a circle; the circle's center is the cluster centroid, its radius is the cluster radius, and the number of points in the cluster is labeled inside the circle.  BIRCH results: The BIRCH clusters are very similar to the actual clusters; the maximal and average distances between the centroids of an actual cluster and the corresponding BIRCH cluster are 0.17 and 0.07, respectively; the number of points in a BIRCH cluster differs by no more than 4% from that of the corresponding actual cluster; the radii of the BIRCH clusters are close to those of the actual clusters.

28  CLARANS results: The pattern of cluster-center locations is distorted; the number of data points in a CLARANS cluster differs by as much as 57% from that of the actual cluster; the cluster radii vary from 1.15 to 1.94 around an average of 1.44.

29 Performance plots (figures)  BIRCH performance on the base workload w.r.t. time, data set, and input order.  CLARANS performance on the base workload w.r.t. time, data set, and input order.  BIRCH performance on the base workload w.r.t. time, diameter, and input order.  CLARANS performance on the base workload w.r.t. time, diameter, and input order.

30 BIRCH and CLARANS comparison  Parameters: D: average diameter; smaller means better cluster quality. Time: time to cluster the data set (in seconds). Order ('o'): points in the same cluster are placed together in the input data.  Results: BIRCH took less than 50 seconds to cluster the 100,000 data points of each data set (on an HP 9000 workstation with 80K memory), and the ordering of the data points had no impact on BIRCH. CLARANS is at least 15 times slower than BIRCH, and when the data points are ordered, CLARANS's performance degrades further.