The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP 790-90 Research Seminar BCB 713 Module Spring 2011 Wei Wang.

Hierarchical Clustering
Group data objects into a tree of clusters.
[Figure: agglomerative clustering (AGNES) merges clusters bottom-up, Step 0 to Step 4 (a, b, c, d, e merged into ab, de, cde, and finally abcde), while divisive clustering (DIANA) splits them top-down, Step 4 back to Step 0.]

AGNES (Agglomerative Nesting)
Initially, each object is its own cluster.
Clusters are merged step by step until all objects form a single cluster.
Single-link approach: each cluster is represented by all of its objects, and the similarity between two clusters is the similarity of the closest pair of data points belonging to different clusters.

Dendrogram
Shows how clusters are merged hierarchically.
Decomposes the data objects into a multi-level nested partitioning (a tree of clusters).
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
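A minimal sketch (using SciPy; the data and the choice of two clusters are illustrative assumptions, not part of the slides) of single-link, AGNES-style clustering and of cutting the resulting dendrogram:

```python
# Single-link agglomerative clustering of a few 2-D points with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two visually separated groups.
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# Single link: cluster distance = distance of the closest pair of points.
Z = linkage(X, method="single")

# "Cut" the dendrogram at the level where exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```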

DIANA (DIvisive ANAlysis)
Initially, all objects are in one cluster.
Clusters are split step by step until each cluster contains only one object.
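A hedged sketch (not code from the slides) of a single DIANA-style split: the object with the largest average dissimilarity seeds a splinter group, and objects that are closer on average to the splinter group than to the remainder are moved into it.

```python
import numpy as np

def diana_split(D, members):
    """One DIANA-style split of `members` (indices into distance matrix D).
    Returns (splinter_group, remainder). Sketch only; assumes len(members) >= 2."""
    members = list(members)
    # Seed the splinter group with the object farthest (on average) from the rest.
    avg_diss = [D[i, [j for j in members if j != i]].mean() for i in members]
    seed = members[int(np.argmax(avg_diss))]
    splinter, rest = [seed], [i for i in members if i != seed]

    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            d_rest = D[i, [j for j in rest if j != i]].mean()
            d_spl = D[i, splinter].mean()
            if d_spl < d_rest:        # closer to the splinter group: move it over
                splinter.append(i)
                rest.remove(i)
                moved = True
    return splinter, rest
```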

Distance Measures
For clusters $C_i$ and $C_j$ with means $m_i$, $m_j$ and sizes $n_i$, $n_j$:
Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} |p - q|$
Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} |p - q|$
Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$
Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} |p - q|$
(m: mean of a cluster; C: a cluster; n: the number of objects in a cluster)
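A small illustration (not from the slides) computing the four measures for two clusters of 2-D points:

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Return (d_min, d_max, d_mean, d_avg) between two clusters of points."""
    Ci, Cj = np.asarray(Ci, float), np.asarray(Cj, float)
    # Pairwise Euclidean distances |p - q| for every p in Ci and q in Cj.
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)
    d_min = pair.min()
    d_max = pair.max()
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # distance of the means
    d_avg = pair.mean()                                         # average over all pairs
    return d_min, d_max, d_mean, d_avg

print(cluster_distances([[0, 0], [0, 1]], [[3, 0], [4, 0]]))
```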

Challenges of Hierarchical Clustering Methods
Hard to choose merge/split points, and merges/splits are never undone, so these decisions are critical.
Does not scale well: O(n²) in the number of objects. What is the bottleneck when the data cannot fit in memory?
One remedy is to integrate hierarchical clustering with other techniques, as in BIRCH, CURE, CHAMELEON, and ROCK.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CF (Clustering Feature) tree: a hierarchical data structure that summarizes information about the objects.
Clustering the objects then reduces to clustering the leaf nodes of the CF tree.

Clustering Feature Vector
CF = (N, LS, SS)
N: number of data points
LS = $\sum_{i=1}^{N} X_i$ (linear sum of the points)
SS = $\sum_{i=1}^{N} X_i^2$ (square sum of the points)
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
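A small sketch (not code from the slides) that computes a CF vector and reproduces the example above:

```python
import numpy as np

def clustering_feature(points):
    """Return the BIRCH clustering feature CF = (N, LS, SS) of a set of points."""
    P = np.asarray(points, float)
    N = len(P)
    LS = P.sum(axis=0)           # linear sum, one entry per dimension
    SS = (P ** 2).sum(axis=0)    # square sum, one entry per dimension
    return N, LS, SS

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, array([16., 30.]), array([ 54., 190.]))
```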

CF-tree in BIRCH
A clustering feature summarizes the statistics of a subcluster: its 0th, 1st, and 2nd moments. It registers the crucial measurements for computing clusters and uses storage efficiently.
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
A nonleaf node has descendants ("children") and stores the sums of the CFs of its children.
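CF vectors are additive, which is what lets a nonleaf node store the sum of its children's CFs, and statistics such as the centroid and radius can be derived from a CF alone. A hedged sketch under the usual definitions (not code from the slides):

```python
import numpy as np

def cf_merge(cf1, cf2):
    """CF additivity: the CF of the union of two disjoint subclusters."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n

def cf_radius(cf):
    """Average distance of the points to the centroid, from the CF alone."""
    n, ls, ss = cf
    return np.sqrt(max((ss.sum() - (ls ** 2).sum() / n) / n, 0.0))
```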

CF Tree
[Figure: the root and nonleaf nodes hold entries (CF_i, child_i) pointing to their children; leaf nodes hold CF entries and are chained to their prev/next leaves; in the example the branching factor is B = 7 and each leaf holds at most L = 6 entries.]

Parameters of a CF-tree
Branching factor: the maximum number of children of a nonleaf node.
Threshold T: the maximum diameter of the subclusters stored at the leaf nodes.
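A hedged sketch (Euclidean distances and the helper names are assumptions, not code from the slides) of the leaf-level decision BIRCH makes when a new point arrives: absorb it into the closest leaf entry if that entry's diameter stays within the threshold T, otherwise start a new entry.

```python
import numpy as np

def cf_diameter(cf):
    """Average pairwise distance within a subcluster, derived from its CF alone."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return np.sqrt(max(2 * (n * ss.sum() - (ls ** 2).sum()) / (n * (n - 1)), 0.0))

def insert_into_leaf(leaf_entries, point, T):
    """Absorb `point` into the closest CF entry if the diameter stays <= T,
    otherwise append a new single-point entry. `leaf_entries` is a list of CFs."""
    x = np.asarray(point, float)
    new_cf = (1, x, x ** 2)
    if leaf_entries:
        # Closest entry by distance between centroids (LS / N) and the point.
        i = min(range(len(leaf_entries)),
                key=lambda k: np.linalg.norm(leaf_entries[k][1] / leaf_entries[k][0] - x))
        n, ls, ss = leaf_entries[i]
        merged = (n + 1, ls + x, ss + x ** 2)
        if cf_diameter(merged) <= T:
            leaf_entries[i] = merged
            return
    leaf_entries.append(new_cf)   # may trigger a leaf split if it exceeds L (not shown)
```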

BIRCH Clustering
Phase 1: scan the database to build an initial in-memory CF tree, a multi-level compression of the data that tries to preserve its inherent clustering structure.
Phase 2: apply an arbitrary clustering algorithm to the leaf nodes of the CF tree.
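For reference (not part of the slides), scikit-learn ships a BIRCH implementation whose parameters correspond to the threshold and branching factor above; the toy data here is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three blobs of 2-D points.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

# threshold limits the size of leaf subclusters; branching_factor caps children per node.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly 100 points per cluster
```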

Pros and Cons of BIRCH
Scales linearly: it finds a good clustering with a single scan, and the quality can be further improved by a few additional scans.
It can handle only numeric data and is sensitive to the order of the data records.

Drawbacks of Square-Error-Based Methods
One representative per cluster: good only for convex-shaped clusters of similar size and density.
The number of clusters k is an input parameter: good only if k can be reasonably estimated.

CURE: the Ideas
Each cluster has c representatives: choose c well-scattered points in the cluster and shrink them towards the mean of the cluster by a fraction α.
The representatives capture the physical shape and geometry of the cluster.
Repeatedly merge the two closest clusters, where the distance between two clusters is the distance between their two closest representatives.
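A hedged sketch (the function names and the greedy farthest-point selection are assumptions, not code from the slides) of how a cluster's representatives could be chosen and shrunk, and how the cluster distance is then measured:

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points (greedy farthest-point selection),
    then shrink them towards the cluster mean by a fraction alpha."""
    P = np.asarray(points, float)
    mean = P.mean(axis=0)
    # Start from the point farthest from the mean, then repeatedly add the
    # point farthest from the representatives chosen so far.
    reps = [P[np.argmax(np.linalg.norm(P - mean, axis=1))]]
    while len(reps) < min(c, len(P)):
        d = np.min([np.linalg.norm(P - r, axis=1) for r in reps], axis=0)
        reps.append(P[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (mean - reps)   # shrink towards the mean

def cluster_distance(reps_a, reps_b):
    """CURE merges the two clusters whose closest representatives are nearest."""
    return min(np.linalg.norm(a - b) for a in reps_a for b in reps_b)
```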