BIRCH: Is It Good for Databases? A review of BIRCH: An Efficient Data Clustering Method for Very Large Databases by Tian Zhang, Raghu Ramakrishnan and Miron Livny.


BIRCH: Is It Good for Databases?
A review of "BIRCH: An Efficient Data Clustering Method for Very Large Databases" by Tian Zhang, Raghu Ramakrishnan and Miron Livny
Daniel Chang
ICS624 Spring 2011, Lipyeow Lim, University of Hawaii at Manoa

Clustering in general
Clustering can be thought of as a kind of data mining problem.
The C in BIRCH is for clustering.
  ◦ Authors claim that it is suitable for large databases.
BIRCH performs some clustering in a single pass for data sets larger than memory allows.
  ◦ Reduces IO cost.
  ◦ Noise in the form of outliers is handled.
    ▪ What is noise in terms of data in a database?

Clustering some data
In a large set of multidimensional data, the space is not uniformly occupied.
Clustering identifies groups of data points that share some measurable similarity.
The problem is finding a partition of the data that minimizes a distance-based measure.
It's further complicated by database-related constraints of memory and IO.

Other approaches
Probability-based approach
  ◦ Assumes statistical independence
  ◦ Large overhead in computation and storage
Distance-based approach
  ◦ Assumes all data points are given in advance and can be continually scanned
  ◦ Global examination of data
  ◦ Local minima
    ▪ High sensitivity to starting partition

CLARANS
Based on randomized search
Cluster is represented by its medoid
  ◦ Most centrally located data point
Clustering is accomplished by searching a graph
Not IO efficient
May not find the real local minimum
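To make the medoid idea concrete, here is a minimal sketch of picking the most centrally located point of a cluster. It is not CLARANS itself (there is no randomized graph search here); the function name `medoid` and the sample points are assumptions chosen for illustration.

```python
import math


def medoid(points):
    """Return the point whose total distance to all other points is smallest."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)

    return min(points, key=total_distance)


# One far-away point pulls a mean toward it, but the medoid stays an actual data point.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
center = medoid(cluster)
```

Computing a medoid this way compares every pair of points, which hints at why medoid-based methods get expensive on large data sets.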

What's special about BIRCH?
Incrementally maintains clusters.
  ◦ IO is reduced significantly
Treats data in terms of densities of data points instead of individual data points.
Outliers are rejected.
The clustering takes place in memory.
It can perform useful clustering in a single read of the data.
How effective is this for a database application?

BIRCH's trees
The key to BIRCH is the CF tree.
  ◦ A CF tree consists of Clustering Features arranged in a height-balanced tree (not a binary tree: each node can hold many entries).
  ◦ Clustering Features or CF vectors
    ▪ Summarize subsets of data in terms of the number of data points, the linear sum of the data points and the square sum of the data points.
    ▪ They do not include the individual data points.
    ▪ How is this useful for a database?
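The CF summary can be sketched in a few lines. The class below is an illustration, not the authors' code: the name `CF`, its field names, and the choice to keep the square sum as a single scalar are assumptions. It shows the property that makes the summary useful: two CFs can be merged by adding their components, and the centroid and radius can be recovered without the original points.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int        # number of data points summarized
    ls: tuple     # linear sum of the points, one value per dimension
    ss: float     # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        """Additivity: the CF of a union is the component-wise sum."""
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root mean squared distance of the points from the centroid."""
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


# Two points, (1, 1) and (3, 1), summarized without being stored.
summary = CF.from_point((1.0, 1.0)).merge(CF.from_point((3.0, 1.0)))
```

Here `summary.centroid()` is `(2.0, 1.0)` and `summary.radius()` is `1.0`, both computed from three stored values rather than from the points.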

CF tree
Self-balancing
Parameters: branching factor B and threshold T
Each node has to fit in a page of size P.
Tree size is determined by T (a larger T gives a smaller tree).
Nonleaf nodes contain at most B entries.
The sizes of leaf and nonleaf entries are determined by the data dimensionality d.
Clustering happens through building the tree.
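As a rough illustration of how page size and dimensionality bound the number of entries per node, the sketch below computes capacities under assumed byte sizes. Every size here is an assumption made for the example (8 bytes per stored number, 8 bytes per pointer, two sibling pointers per leaf node); the function name `node_capacities` is hypothetical and the slides do not give these figures.

```python
def node_capacities(page_size, d, value_bytes=8, pointer_bytes=8):
    """Return (max entries per nonleaf node, max entries per leaf node).

    Assumed layout, for illustration only:
      - a CF entry stores N, a d-dimensional linear sum, and a scalar square sum
      - a nonleaf entry is a CF plus one child pointer
      - a leaf node reserves two pointers for its neighbouring leaves
    """
    cf_bytes = value_bytes * (d + 2)
    nonleaf_entry = cf_bytes + pointer_bytes
    nonleaf_capacity = page_size // nonleaf_entry
    leaf_capacity = (page_size - 2 * pointer_bytes) // cf_bytes
    return nonleaf_capacity, leaf_capacity


# A 1024-byte page with 2-d data under the assumed sizes above.
capacities = node_capacities(1024, 2)
```

Raising d makes every entry larger, so fewer entries fit in a page and the tree becomes deeper for the same number of subclusters.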

Building the tree
Identify the leaf.
If the subcluster can be added to the leaf, then add it.
Otherwise, split the node.
  ◦ Recursively, determine the node to split.
Merge if possible, since splits are dependent on page size.
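The "add it if it fits" step can be sketched for a single leaf. This is a simplified illustration under stated assumptions: a CF is a plain `(n, linear_sum, square_sum)` tuple, the fit test compares the merged radius against the threshold, and node splitting and merging are left out entirely. The names `insert_point`, `merged`, `centroid` and `radius` are hypothetical.

```python
import math


def merged(cf_a, cf_b):
    n1, ls1, ss1 = cf_a
    n2, ls2, ss2 = cf_b
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)


def centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)


def radius(cf):
    n, _, ss = cf
    c2 = sum(c * c for c in centroid(cf))
    return math.sqrt(max(ss / n - c2, 0.0))


def insert_point(leaf, point, threshold):
    """Absorb the point into the closest subcluster if it still fits, else start a new one."""
    new = (1, tuple(point), sum(x * x for x in point))
    if leaf:
        closest = min(
            range(len(leaf)),
            key=lambda i: math.dist(centroid(leaf[i]), point),
        )
        candidate = merged(leaf[closest], new)
        if radius(candidate) <= threshold:
            leaf[closest] = candidate
            return
    # A full implementation would split the node here when the leaf is out of space.
    leaf.append(new)


leaf = []
for p in [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]:
    insert_point(leaf, p, threshold=0.5)
```

With a threshold of 0.5 the four points end up in two subclusters of two points each; a smaller threshold would produce more, tighter subclusters.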

Overview of BIRCH

After the tree is built in Phase 1
No IO operations are needed
  ◦ Clusters can be refined by clustering subclusters
Outliers are eliminated
  ◦ Authors claim greater accuracy
  ◦ How does this improve DB applications?
A tree is an ordered structure

Not everything is perfect
The input order gets skewed because of the page size restriction
Phase 3 clusters all the leaf nodes in a global way
Subclusters are treated as single points
Or CF vectors can be used
This reduces the problem space significantly
But what detail is lost as a result?
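A minimal sketch of the global step, assuming each leaf subcluster is a `(n, linear_sum, square_sum)` tuple and that the closest pair of centroids is merged repeatedly until k clusters remain. This simple agglomerative loop stands in for whatever global algorithm is actually used; the names `global_cluster`, `merge` and `centroid` and the sample values are assumptions.

```python
import math


def centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)


def merge(cf_a, cf_b):
    n1, ls1, ss1 = cf_a
    n2, ls2, ss2 = cf_b
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)


def global_cluster(subclusters, k):
    """Merge the two closest subclusters (by centroid) until only k remain."""
    clusters = list(subclusters)
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        combined = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(combined)
    return clusters


# Three subclusters: two near the origin, one around (10, 10).
subclusters = [
    (2, (0.0, 0.0), 2.0),
    (2, (2.0, 0.0), 4.0),
    (4, (40.0, 40.0), 804.0),
]
result = global_cluster(subclusters, k=2)
```

The loop only ever sees three summaries instead of eight points, which is the reduction in problem space; what it cannot see is how the points inside each subcluster are arranged.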

Control flow of Phase 1

CF tree rebuilding

Refinements
Phase 4 can clean up the clusters as much as desired
Outliers are written to disk if disk is available.
  ◦ Not all detail is lost
  ◦ Efficiency is reduced because of IO

In practical terms
Threshold T needs to be configured
  ◦ Different data sets are going to have different optimal thresholds

Testing
Synthetic data (2-d, K clusters)
  ◦ Independent normal distribution
  ◦ Grid
    ▪ Cluster centers placed on a sqrt(K) * sqrt(K) grid
  ◦ Sine
    ▪ Cluster centers arranged in a sine curve
  ◦ Random
    ▪ Cluster centers are placed randomly
  ◦ Noise is added
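The grid pattern can be reproduced with a short generator. This is a sketch under assumptions: the function name `grid_dataset`, its parameters (`spacing`, `sigma`, `seed`) and the values used are made up for illustration, and the noise step is left out.

```python
import math
import random


def grid_dataset(k, points_per_cluster, spacing, sigma, seed=0):
    """Generate 2-d points around K centers laid out on a sqrt(K) x sqrt(K) grid.

    Each coordinate is drawn from an independent normal distribution
    centered on the cluster center.
    """
    side = math.isqrt(k)
    if side * side != k:
        raise ValueError("k must be a perfect square for the grid pattern")
    rng = random.Random(seed)
    points = []
    for row in range(side):
        for col in range(side):
            cx, cy = col * spacing, row * spacing
            for _ in range(points_per_cluster):
                points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
    return points


# Nine clusters of ten points each on a 3 x 3 grid.
data = grid_dataset(k=9, points_per_cluster=10, spacing=4.0, sigma=0.5)
```

Swapping the two loops that place `cx, cy` for a sine curve or random positions would give the other two patterns.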

Data generation parameters

BIRCH parameters

Data set 1 compared to CLARANS

Scalability w.r.t. K

BIRCH summary
Incremental single-pass IO
Optimizes use of memory
  ◦ Outliers can be written to disk
Extremely fast tree structure
  ◦ Inherent ordering
Refinements only address subclusters
Accurate clustering results
Dependent upon parameter setting
Better than CLARANS

Open Questions
How well does clustering work for DBs?
Can BIRCH really be used for database applications?
  ◦ What are the data dependencies for BIRCH to be effective?
  ◦ The authors claim that BIRCH is "suitable" for very large databases
  ◦ None of their testing reflected an actual database application
  ◦ Therefore, BIRCH has theoretical potential but requires additional testing to be truly considered suitable for databases