BIRCH, presented by Bahare Hajihashemi and Atefeh Rahimi

BIRCH
Presented by: Bahare Hajihashemi and Atefeh Rahimi
Advisor: Dr. Vahidipoor
December 2017

BIRCH
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies. It is an agglomerative clustering method designed for clustering large amounts of numerical data.
What does BIRCH try to solve? Most existing algorithms do not consider the case where the dataset is too large to fit in main memory, and they do not concentrate on minimizing the number of scans of the dataset, so I/O costs are very high.
The complexity of BIRCH is O(n), where n is the number of objects to be clustered.

BIRCH: Key Components
Clustering Feature (CF): a summary of the statistics for a given cluster (the 0th, 1st, and 2nd moments of the cluster, from the statistical point of view). It is used to compute centroids and to measure the compactness of and distance between clusters.
CF tree: a height-balanced tree with two parameters: the number of entries in each node, and the threshold on the diameter of the entries in a leaf node. Leaf nodes are connected via prev and next pointers.

Clustering Feature (CF): CF = (N, LS, SS)
N: the number of data points
LS: the linear sum of the N points
SS: the square sum of the N points
Example: Cluster 1 = {(2,5), (3,2), (4,3)}
CF1 = ⟨3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)⟩ = ⟨3, (9,10), (29,38)⟩
Cluster 2 has CF2 = ⟨3, (35,36), (417,440)⟩
Merging Cluster 1 and Cluster 2 into Cluster 3:
CF3 = CF1 + CF2 = ⟨3+3, (9+35, 10+36), (29+417, 38+440)⟩ = ⟨6, (44,46), (446,478)⟩
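The CF arithmetic on this slide can be checked with a small sketch (the helper names are hypothetical; SS follows the slide's per-dimension convention):

```python
def cf(points):
    # Clustering Feature (N, LS, SS): count, per-dimension linear sum,
    # and per-dimension square sum of the points.
    dim = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dim))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dim))
    return (len(points), ls, ss)

def merge(a, b):
    # Additivity: the CF of a merged cluster is the component-wise sum.
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

cf1 = cf([(2, 5), (3, 2), (4, 3)])   # the slide's Cluster 1
cf2 = (3, (35, 36), (417, 440))      # the slide's Cluster 2
cf3 = merge(cf1, cf2)
```

Running this reproduces the slide's values: cf1 is (3, (9, 10), (29, 38)) and cf3 is (6, (44, 46), (446, 478)).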

Properties of the Clustering Feature
A CF entry is a summary of the statistics of a cluster, i.e., a compact representation of that cluster. A CF entry holds sufficient information to calculate the centroid, radius, diameter, and many other distance measures. The additivity theorem allows us to merge sub-clusters incrementally: the CF of a merged cluster is the sum of the CFs of its parts.

Distance Measures
Given a cluster of N data points:
Centroid: the mean of the points, LS/N.
Radius: the average distance from the points of the cluster to its centroid.
Diameter: the square root of the average squared distance between all pairs of points in the cluster.
All three can be computed directly from the CF entry (N, LS, SS).
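A sketch of how these three measures fall out of a CF entry (helper names are hypothetical; SS is per-dimension as on the earlier slide, so it is summed to get the scalar square sum):

```python
import math

def centroid(cf):
    n, ls, _ = cf
    return tuple(s / n for s in ls)

def radius(cf):
    # R^2 = sum(SS)/N - ||LS/N||^2: mean squared distance to the centroid.
    n, ls, ss = cf
    return math.sqrt(sum(ss) / n - sum((s / n) ** 2 for s in ls))

def diameter(cf):
    # D^2 = (2*N*sum(SS) - 2*||LS||^2) / (N*(N-1)):
    # average squared distance over all ordered pairs of points.
    n, ls, ss = cf
    return math.sqrt((2 * n * sum(ss) - 2 * sum(s * s for s in ls))
                     / (n * (n - 1)))

cf1 = (3, (9, 10), (29, 38))   # Cluster 1 from the CF example
print(centroid(cf1))           # (3.0, 3.333...)
print(radius(cf1))             # ~1.4907
print(diameter(cf1))           # ~2.5820
```

Note that no raw data points are needed: the CF triple alone suffices, which is exactly why BIRCH can keep whole clusters in memory as single entries.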

CF Tree
B = branching factor: the maximum number of children of a non-leaf node
T = threshold on the diameter (or radius) of the clusters stored in the leaves
L = maximum number of entries in a leaf node
Each CF entry in a parent node is the sum of the CF entries of that entry's child node. The CF tree is an in-memory, height-balanced tree.
[Diagram: a root level of entries CF1, CF2, …, CFk, each pointing to a first-level node with entries CF1, CF2, …, CFk, and so on down to the leaves]

CF Tree Insertion
Start at the root. Find the CF entry in the current node closest to the new data point, move to that child, and repeat until the closest leaf entry is found.
At the leaf: if the point can be accommodated in that cluster without violating the threshold T, update the entry. Otherwise, add a new entry for the point; if this violates the limit L on leaf entries, split the leaf. If the parent node is also full, split it too, and so on up the tree.
Finally, update the CF entries on the path from the root to the leaf to account for the new point.
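The leaf-level core of this procedure (absorb the point into the closest entry if the threshold allows, else start a new entry) can be sketched as follows. This is a simplified, hypothetical version: no tree descent, node splitting, or prev/next pointers.

```python
import math

T = 1.5   # threshold on leaf-entry radius
L = 4     # maximum entries per leaf (unused in this flat sketch)

def cf_of(point):
    # CF of a single point: (1, LS = point, SS = squared coordinates)
    return (1, tuple(point), tuple(x * x for x in point))

def merge(a, b):
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(0.0, sum(ss) / n - sum((s / n) ** 2 for s in ls)))

def insert(leaf, point):
    pcf = cf_of(point)
    if leaf:
        # find the entry whose centroid is closest to the point
        i = min(range(len(leaf)),
                key=lambda j: math.dist(point,
                                        [s / leaf[j][0] for s in leaf[j][1]]))
        merged = merge(leaf[i], pcf)
        # absorb the point if the entry stays within threshold T
        if radius(merged) <= T:
            leaf[i] = merged
            return
    # otherwise start a new entry; a full CF tree would split the
    # leaf whenever len(leaf) exceeded L
    leaf.append(pcf)

leaf = []
for p in [(0, 0), (0.5, 0), (10, 10)]:
    insert(leaf, p)
# the two nearby points share one entry; (10, 10) starts a second one
```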

BIRCH Algorithm Overview
Phase 1: load the data into memory by building a CF tree (initial CF tree).
Phase 2 (optional): condense the tree into a desirable range by building a smaller CF tree.
Phase 3: global clustering, yielding good clusters.
Phase 4 (optional, offline): cluster refining, yielding better clusters.
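scikit-learn's Birch estimator implements this same pipeline (CF-tree build plus global clustering of the leaf entries), so a quick end-to-end usage sketch looks like the following (assuming scikit-learn is installed; the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch

# Two obviously separated groups of points
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)

# threshold plays the role of T, branching_factor the role of B,
# and n_clusters drives the Phase-3 global clustering of leaf entries
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
```

The two nearby pairs of points end up in the same cluster as each other and in different clusters from one another.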

BIRCH Algorithm: Phase 1
Choose an initial value for the threshold T and insert the data points one by one into the tree, following the insertion algorithm above.
If, during this process, the size of the CF tree exceeds the available memory, increase the threshold and convert the partially built tree into a new, smaller tree.
Repeat until the entire dataset has been scanned and a full tree is built. Outliers can be set aside during this phase (outlier handling).

BIRCH Algorithm: Phases 2, 3, and 4
Phase 2 is a bridge between Phase 1 and Phase 3: it builds a smaller CF tree by further increasing the threshold.
Phase 3 applies a global clustering algorithm to the sub-clusters given by the leaf entries of the CF tree, improving clustering quality.
Phase 4 re-scans the entire dataset to label each data point with its final cluster, and also handles outliers.

BIRCH: The Idea by Example
[Diagram: data objects 1–6 on the left; the tree under construction on the right]
Object 1 is inserted into a leaf node as Cluster 1.
If Cluster 1 would become too large (not compact) by adding object 2, split the cluster.

BIRCH: The Idea by Example
The leaf node now has two entries: entry 1 points to Cluster 1 (object 1) and entry 2 points to Cluster 2 (object 2).

BIRCH: The Idea by Example
Object 3 arrives: entry 1 is the closest to object 3. If Cluster 1 would become too large by adding object 3, split the cluster.

BIRCH: The Idea by Example
The leaf node now has three entries: Cluster 1 (object 1), Cluster 3 (object 3), and Cluster 2 (object 2).

BIRCH: The Idea by Example
Object 4 arrives: entry 3 is the closest to object 4. Cluster 2 remains compact when object 4 is added, so object 4 joins Cluster 2.

BIRCH: The Idea by Example
Object 5 arrives: entry 2 is the closest to object 5. Cluster 3 would become too large by adding object 5, so split Cluster 3? BUT there is a limit to the number of entries a node can have, so the node itself must be split.

BIRCH: The Idea by Example
After the split, the root is a non-leaf node with entry 1 and entry 2, each pointing to a leaf node. One leaf holds entries 1.1 and 1.2, pointing to Cluster 1 and Cluster 3; the other holds entries 2.1 and 2.2, pointing to Cluster 4 (the new cluster containing object 5) and Cluster 2.

BIRCH: The Idea by Example
Object 6 arrives: entry 1.2 is the closest to object 6. Cluster 3 remains compact when object 6 is added, so object 6 joins Cluster 3.

Thanks for your attention! Q&A?