Hierarchical Clustering


Hierarchical Clustering: Agglomerative Approach (bottom-up)
Initialization: each object starts as its own cluster.
Iteration: merge the two clusters that are most similar to each other.
Stop when all objects are merged into a single cluster.
[Figure: five objects a, b, c, d, e merged step by step (Step 0 through Step 4) into a single cluster.]
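The bottom-up loop above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation; the 1-D points and the absolute-difference distance are made up for the example, and single-link is used as the merge criterion.

```python
# Agglomerative (bottom-up) clustering sketch: start with singleton
# clusters, repeatedly merge the closest pair (single-link), and record
# the merge order until one cluster remains.

def agglomerative(points, dist):
    clusters = [[p] for p in points]   # initialization: each object is a cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] += clusters.pop(j)  # merge the two most similar clusters
    return merges

merges = agglomerative([0.0, 0.2, 1.0, 5.0], dist=lambda a, b: abs(a - b))
# the closest pair (0.0, 0.2) is merged first; the distant point 5.0 joins last
```

With n objects the loop always performs exactly n - 1 merges, which is why the result can be drawn as a binary tree (the dendrogram discussed below).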

Hierarchical Clustering: Divisive Approach (top-down)
Initialization: all objects start in one cluster.
Iteration: select a cluster and split it into two sub-clusters.
Stop when each leaf cluster contains only one object.
[Figure: the single cluster {a, b, c, d, e} split step by step (Step 4 down to Step 0) into singleton clusters.]

Dendrogram
A binary tree that shows how clusters are merged or split hierarchically. Each node of the tree is a cluster; each leaf node is a singleton cluster.

Dendrogram
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

How to Merge Clusters?
To decide which clusters to merge, we need to measure the distance between clusters. Four common measures:
Single-link
Complete-link
Average-link
Centroid distance
Hint: the distance between clusters is usually defined on the basis of the distance between individual objects.

How to Define Inter-Cluster Distance: Single-Link
The distance between two clusters is the distance between the closest pair of data objects belonging to different clusters.

How to Define Inter-Cluster Distance: Complete-Link
The distance between two clusters is the distance between the farthest pair of data objects belonging to different clusters.

How to Define Inter-Cluster Distance: Average-Link
The distance between two clusters is the average distance over all pairs of data objects belonging to different clusters.

How to Define Inter-Cluster Distance: Centroid Distance
The distance between two clusters is the distance between the means of the clusters: d(Ci, Cj) = d(mi, mj), where mi and mj are the means of Ci and Cj.
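Since all four measures just aggregate the pairwise object distances differently, they are easy to state as code. A hedged sketch, with two made-up 1-D clusters and absolute difference as the object distance:

```python
# The four inter-cluster distance definitions, for two small 1-D clusters.

def pairwise(c1, c2):
    """All object-to-object distances across the two clusters."""
    return [abs(a - b) for a in c1 for b in c2]

def single_link(c1, c2):    # closest pair across clusters
    return min(pairwise(c1, c2))

def complete_link(c1, c2):  # farthest pair across clusters
    return max(pairwise(c1, c2))

def average_link(c1, c2):   # average over all cross-cluster pairs
    return sum(pairwise(c1, c2)) / (len(c1) * len(c2))

def centroid_dist(c1, c2):  # distance between the cluster means
    return abs(sum(c1) / len(c1) - sum(c2) / len(c2))

c1, c2 = [1.0, 2.0], [4.0, 6.0]
# single-link: |2-4| = 2;  complete-link: |1-6| = 5
# average-link: (3 + 5 + 2 + 4) / 4 = 3.5;  centroid: |1.5 - 5| = 3.5
```

Average-link and centroid distance coincide in this example only because the two 1-D clusters do not overlap (every cross-pair distance is b - a, so the average equals the difference of the means); in general the two measures differ.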

An Example of the Agglomerative Hierarchical Clustering Algorithm
For the following data set, the single-link and complete-link algorithms produce different clustering results.
[Figure: six data points, labeled 1 through 6.]

Result of the Single-Link algorithm
[Figure: the six points clustered by single-link, with the corresponding dendrogram.]
Result of the Complete-Link algorithm
[Figure: the same six points clustered by complete-link, with the corresponding dendrogram.]

Hierarchical Clustering: Comparison
[Figure: the same six points clustered by single-link, complete-link, average-link, and centroid distance; each method yields a different nesting of clusters.]
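The contrast can be reproduced with a small experiment: on a chain of evenly spaced points, single-link chains everything together, while complete-link splits the chain into compact pieces. The data, the tie-breaking (Python's min takes the first minimum), and the flat cut at k = 2 are all illustrative choices, not part of the lecture's example.

```python
# Compare single-link and complete-link merging on the same data.

def cluster(points, k, link):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # merge the pair of clusters closest under the chosen linkage
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return sorted(tuple(sorted(c)) for c in clusters)

d = lambda a, b: abs(a - b)
single   = lambda c1, c2: min(d(a, b) for a in c1 for b in c2)
complete = lambda c1, c2: max(d(a, b) for a in c1 for b in c2)

chain = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cluster(chain, 2, single)    # chaining: one long cluster plus a leftover point
cluster(chain, 2, complete)  # two compact pieces of the chain
```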

Compare Dendrograms
[Figure: dendrograms for points 1-6 under single-link, complete-link, average-link, and centroid distance; the merge orders differ across methods.]

Effect of Bias towards Spherical Clusters
[Figure: the same data set cut into 2 clusters by single-link and by complete-link, illustrating complete-link's bias towards spherical clusters.]

Strength of Single-Link
[Figure: original points and the two clusters found by single-link.]
Single-link can handle non-globular shapes.

Limitations of Single-Link
[Figure: original points and the two clusters found by single-link.]
Single-link is sensitive to noise and outliers.

Strength of Complete-Link
[Figure: original points and the two clusters found by complete-link.]
Complete-link is less susceptible to noise and outliers.

Which Distance Measure is Better?
Each method has advantages and disadvantages; the best choice is application-dependent. Single-link and complete-link are the most common methods.
Single-link: can find irregular-shaped clusters, but is sensitive to outliers and suffers from the so-called chaining effect.
Complete-link, average-link, and centroid distance: robust to outliers, but tend to break large clusters and prefer spherical clusters.

Limitation of Complete-Link, Average-Link, and Centroid Distance
The complete-link, average-link, and centroid distance methods tend to break large clusters.

AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990).
Implemented in statistical analysis packages, e.g., S+.
Uses the single-link method: merges the nodes that have the least dissimilarity.
Eventually all objects belong to a single cluster.

UPGMA
UPGMA: Unweighted Pair-Group Method Average.
Merge strategy: the average-link approach. The distance between two clusters is the average distance between pairs of objects belonging to the two different clusters:
d_avg(Ci, Cj) = (1 / (ni * nj)) * Σ_{p ∈ Ci} Σ_{q ∈ Cj} d(p, q)
where ni and nj are the numbers of objects in clusters Ci and Cj.

TreeView
UPGMA is used to order the objects. The color intensity represents expression level; a large patch of similar color indicates a cluster.
Eisen MB et al. Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
http://rana.lbl.gov/EisenSoftware.htm
http://genome-www.stanford.edu/serum/fig2cluster.html

DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990).
Implemented in statistical analysis packages, e.g., S+.
Proceeds in the inverse order of AGNES: eventually each object forms a cluster on its own.

DIANA, Explored
First, all of the objects form one cluster. The cluster is split according to some principle, such as the minimum Euclidean distance between the closest neighboring objects in the cluster. The splitting process repeats until each new cluster contains a single object or a termination condition is met.

Splitting Process of DIANA
Initialization:
1. Choose the object Oh that is most dissimilar to the other objects in C.
2. Let C1 = {Oh}, C2 = C - C1.
[Figure: cluster C split into the splinter group C1 and the remainder C2.]

Splitting Process of DIANA (Cont'd)
Iteration:
3. For each object Oi in C2, compute a score Di that tells whether Oi is closer to C1 or to the other objects in C2 (the average dissimilarity of Oi to the rest of C2 minus its average dissimilarity to C1).
4. Choose the object Ok with the greatest score Dk.
5. If Dk > 0, move Ok from C2 to C1, and repeat steps 3-5.
6. Otherwise, stop the splitting process.
[Figure: objects migrating from C2 to C1 one at a time.]
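The splitting steps can be sketched directly in Python. This is a hedged illustration: the score Di is taken as the average dissimilarity of Oi to the rest of C2 minus its average dissimilarity to C1, the points are made-up 1-D values, and duplicates are assumed absent (the sketch uses != to exclude an object from its own group).

```python
# One DIANA split of a cluster C into a splinter group C1 and remainder C2.

def avg_dist(o, group):
    return sum(abs(o - x) for x in group) / len(group)

def diana_split(C):
    # Initialization: seed C1 with the object most dissimilar to the rest.
    seed = max(C, key=lambda o: avg_dist(o, [x for x in C if x != o]))
    C1, C2 = [seed], [x for x in C if x != seed]
    # Iteration: move the object with the greatest positive D score,
    # one at a time, until no object is closer to C1 than to C2.
    while len(C2) > 1:
        scores = {o: avg_dist(o, [x for x in C2 if x != o]) - avg_dist(o, C1)
                  for o in C2}
        ok = max(scores, key=scores.get)
        if scores[ok] <= 0:        # no object prefers C1: stop splitting
            break
        C2.remove(ok)
        C1.append(ok)
    return C1, C2

C1, C2 = diana_split([0.0, 0.1, 0.2, 5.0, 5.1])
# the two far-away points end up together in the splinter group C1
```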

Discussion on Hierarchical Approaches
Strengths:
Do not need the number of clusters k as input.
Weaknesses:
Do not scale well; time complexity is at least O(n²), where n is the total number of objects.
Can never undo a merge or split once it is done.
Integration of hierarchical with distance-based clustering:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
CURE (1998): selects well-scattered points from a cluster, then shrinks them towards the cluster center by a specified fraction.
CHAMELEON (1999): hierarchical clustering using dynamic modeling.

How to Derive Clusters from a Dendrogram: Use Global Thresholds
Homogeneity within clusters:
Diameter(C) ≤ MaxD, or
Avg(sim(Oi, Oj)) ≥ θ for Oi, Oj ∈ C.
Separation between clusters:
Inter-cluster distance ≥ δ (measured by single-link, complete-link, ...).
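A distance-threshold cut can be emulated without building the full dendrogram: keep merging (single-link here) only while the closest pair of clusters is within the threshold MaxD. The points, threshold, and distance below are illustrative choices.

```python
# Derive flat clusters with a global distance threshold (single-link merging).

def slink(c1, c2, dist):
    return min(dist(a, b) for a in c1 for b in c2)

def cut_by_threshold(points, max_d, dist):
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: slink(clusters[ij[0]], clusters[ij[1]], dist))
        if slink(clusters[i], clusters[j], dist) > max_d:
            break                  # cutting level reached: stop merging
        clusters[i] += clusters.pop(j)
    return clusters

groups = cut_by_threshold([0.0, 0.4, 0.8, 5.0, 5.3], max_d=1.0,
                          dist=lambda a, b: abs(a - b))
# two clusters: {0.0, 0.4, 0.8} and {5.0, 5.3}
```

Raising max_d above 4.2 here would let the final merge happen and return a single cluster, which is exactly the effect of cutting the dendrogram at a higher level.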

Minimum Similarity Threshold
[Figure: an interactive minimum-similarity-threshold control applied to a dendrogram. From "Interactively Exploring Hierarchical Clustering Results", Seo et al., 2002.]

How to Derive Clusters from a Dendrogram: Ask Users to Derive Clusters
Example: TreeView.
Flexible when users need different cluster granularity for different parts of the data (coarse granularity in some regions, fine granularity in others).
Inconvenient when the data set is large.