Formal Foundations of Clustering Margareta Ackerman Work with Shai Ben-David, Simina Branzei, and David Loker

The Theory-Practice Gap
Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields apply clustering to gain a first understanding of the structure of large data sets.

The Theory-Practice Gap
"While the interest in and application of cluster analysis has been rising rapidly, the abstract nature of the tool is still poorly understood." (Wright, 1973)
"There has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model." (Kleinberg, 2002)
Both statements still apply today.

Inherent Obstacles: Clustering is Ill-Defined
Clustering aims to assign data into groups of similar items. Beyond that, there is very little consensus on the definition of clustering.

Inherent Obstacles
Clustering is inherently ambiguous:
- There may be multiple reasonable clusterings.
- There is usually no ground truth.
There are many clustering algorithms with different (often implicit) objective functions, and different algorithms have radically different input-output behaviour.

Differences in Input/Output Behavior of Clustering Algorithms (figure)

Clustering Algorithm Selection
There is a wide variety of clustering algorithms, which can produce very different clusterings. How should a user decide which algorithm to use for a given application?

Clustering Algorithm Selection
Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc. There is inadequate emphasis on input-output behaviour.

Our Framework for Algorithm Selection
We propose a framework that lets a user utilize prior knowledge to select an algorithm: identify properties that distinguish between the input-output behaviour of different clustering paradigms. The properties should be:
1) Intuitive and "user-friendly"
2) Useful for distinguishing clustering algorithms

Our Framework for Algorithm Selection
In essence, our goal is to understand fundamental differences between clustering methods, and to convey them formally, clearly, and as simply as possible.

Previous Work
Axiomatic perspective:
- Impossibility result: Kleinberg (NIPS, 2003)
- Consistent axioms for quality measures: Ackerman & Ben-David (NIPS, 2009)
- Axioms in the weighted setting: Wright (Pattern Recognition, 1973)

Previous Work
Characterizations of Single Linkage:
- Partitional setting: Bosagh Zadeh and Ben-David (UAI, 2009)
- Hierarchical setting: Jardine and Sibson (Mathematical Taxonomy, 1971); Carlsson and Memoli (JMLR, 2010)
Characterizations of Linkage-Based Clustering:
- Partitional setting: Ackerman, Ben-David, and Loker (COLT, 2010)
- Hierarchical setting: Ackerman & Ben-David (IJCAI, 2011)

Previous Work
Classifications of clustering methods:
- Fischer and Van Ness (Biometrika, 1971)
- Ackerman, Ben-David, and Loker (NIPS, 2010)

What's Left To Be Done?
Despite much work on clustering properties, some basic questions remain unanswered. Consider some of the most popular clustering methods: k-means, single linkage, average linkage, etc. What are the advantages of k-means over other methods? Previous classifications are missing key properties.

Our Contributions (at a high level)
We identify three fundamental categories that clearly delineate some essential differences between common clustering methods. The strength of these categories is in their simplicity, and we hope they give insight into core differences between popular clustering methods. To define the categories, we first present the weighted clustering setting.

Outline
- Formal framework
- Categories and classification
- A result from each category
- Conclusions and future work

Weighted Clustering
Every element is associated with a real-valued weight, representing its mass or importance. This generalizes the notion of element duplication. Algorithm design, particularly the design of approximation algorithms, is often done in this framework.

Other Reasons to Add Weight: An Example
Apply clustering to facility allocation, such as the placement of police stations in a new district. The distribution of stations should enable quick access to most areas in the district. Accessibility of different institutions to a station may have varying importance, and the weighted setting provides a convenient way to prioritize certain landmarks.

Algorithms in the Weighted Clustering Setting
Traditional clustering algorithms can be readily translated into the weighted setting by considering their behavior on data containing element duplicates.

Formal Setting
For a finite domain set X, a weight function w: X → R+ defines the weight of every element.
For a finite domain set X, a distance function d: X × X → R+ ∪ {0} defines the distance between domain points.

Formal Setting: Partitional Clustering Algorithm
(X, d) denotes unweighted data; (w[X], d) denotes weighted data.
A partitional algorithm maps
Input: (w[X], d, k)
to
Output: a k-partition (k-clustering) of X.

Formal Setting: Hierarchical Clustering Algorithm
A hierarchical algorithm maps
Input: (w[X], d)
to
Output: a dendrogram of X.
A dendrogram of (X, d) is a strictly binary tree whose leaves correspond to elements of X.
A clustering C appears in A(w[X], d) if its clusters are in the dendrogram.

Our Contributions
- We utilize the weighted framework to identify three fundamental categories describing how algorithms respond to weight.
- We classify traditional algorithms according to these categories.
- We fully characterize when different algorithms react to weight.

Towards Basic Categories: Range(X, d)
Partitional: Range(A(X, d, k)) = {C | ∃ w s.t. C = A(w[X], d, k)}, the set of clusterings that A outputs on (X, d) over all possible weight functions.
Hierarchical: Range(A(X, d)) = {D | ∃ w s.t. D = A(w[X], d)}, the set of dendrograms that A outputs on (X, d) over all possible weight functions.
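To make the Range definition concrete, here is a minimal, hypothetical sketch (not from the paper): it lower-bounds Range(A(X, d, k)) by running a weighted partitional algorithm under a handful of candidate weight functions and collecting the distinct outputs. The toy algorithm heavy_singletons is invented purely for illustration and ignores d; it exists only to show a case where more than one output is reachable.

```python
def empirical_range(A, X, d, k, weightings):
    """Collect the distinct clusterings A(w[X], d, k) produced over a set of
    candidate weight functions. This only lower-bounds the true Range, which
    quantifies over ALL weight functions."""
    outputs = set()
    for w in weightings:
        clustering = A(X, w, d, k)
        # Canonicalize: a clustering is a set of clusters, each a set of points.
        outputs.add(frozenset(frozenset(c) for c in clustering))
    return outputs

# A toy, purely illustrative algorithm (not from the paper): the k-1 heaviest
# points become singleton clusters and everything else forms one cluster.
def heavy_singletons(X, w, d, k):
    heavy = sorted(X, key=lambda x: w[x], reverse=True)[:k - 1]
    rest = [x for x in X if x not in heavy]
    return [[x] for x in heavy] + [rest]

X = [0.0, 1.0, 5.0, 6.0]
d = lambda a, b: abs(a - b)
weightings = [{x: 1.0 for x in X},
              {0.0: 1.0, 1.0: 1.0, 5.0: 9.0, 6.0: 1.0}]
print(len(empirical_range(heavy_singletons, X, d, 2, weightings)))  # 2: responds to weight
```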

Outline
- Formal framework
- Categories and classification
- A result from each category
- Conclusions and future work

Categories: Weight Robust
A is weight-robust if for all (X, d), |Range(X, d)| = 1. A never responds to weight.

Categories: Weight Sensitive
A is weight-sensitive if for all (X, d), |Range(X, d)| > 1. A always responds to weight.

Categories: Weight Considering
An algorithm A is weight-considering if:
1) There exists (X, d) where |Range(X, d)| = 1.
2) There exists (X, d) where |Range(X, d)| > 1.
A responds to weight on some data sets, but not on others.

Summary of Categories
Range(A(X, d)) = {C | ∃ w such that A(w[X], d) = C} (partitional) or {D | ∃ w such that A(w[X], d) = D} (hierarchical).
- Weight-robust: for all (X, d), |Range(X, d)| = 1.
- Weight-sensitive: for all (X, d), |Range(X, d)| > 1.
- Weight-considering: there exists (X, d) where |Range(X, d)| = 1, and there exists (X, d) where |Range(X, d)| > 1.

Connecting To Applications
In the facility allocation example above, a weight-sensitive algorithm may be preferred. In phylogeny, where sampling procedures can be highly biased, some degree of weight robustness may be desired. The desired category depends on the application.

Classification
Weight Robust: Min Diameter, K-center (partitional); Single Linkage, Complete Linkage (hierarchical)
Weight Sensitive: K-means, k-medoids, k-median, min-sum (partitional); Ward's Method, Bisecting K-means (hierarchical)
Weight Considering: Ratio Cut (partitional); Average Linkage (hierarchical)
For the weight-considering algorithms, we fully characterize when they are sensitive to weight.

Outline
- Formal framework
- Categories and classification
- A result from each category
- Classification of heuristics
- Conclusions and future work

Classification
Weight Robust: Min Diameter, K-center (partitional); Single Linkage, Complete Linkage (hierarchical)
Weight Sensitive: K-means, k-medoids, k-median, min-sum (partitional); Ward's Method, Bisecting K-means (hierarchical)
Weight Considering: Ratio Cut (partitional); Average Linkage (hierarchical)

Zooming Into: Weight-Sensitive Algorithms
We show that k-means is weight-sensitive.
A is weight-separable if for any data set (X, d) and any subset S of X with at most k points, there exists a weight function w such that A(w[X], d, k) separates all points of S.
Fact: Every algorithm that is weight-separable is also weight-sensitive.

K-means is Weight-Sensitive
Theorem: k-means is weight-sensitive.
Proof: We show that k-means is weight-separable. Consider any (X, d) and any S ⊂ X with at most k points. Increase the weight of the points in S until each belongs to a distinct cluster.
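The following sketch illustrates the proof idea on a toy one-dimensional data set. It is an illustration, not the paper's construction: it assumes the standard weighted k-means objective (weights act as point masses) and uses brute-force search over assignments rather than Lloyd's heuristic. Making the two right-hand points sufficiently heavy forces the optimal 2-clustering to separate them.

```python
from itertools import product

def weighted_kmeans_cost(clusters, w):
    """Weighted k-means cost: sum over clusters of w(x) * (x - weighted mean)^2."""
    cost = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        total_w = sum(w[x] for x in cluster)
        mean = sum(w[x] * x for x in cluster) / total_w
        cost += sum(w[x] * (x - mean) ** 2 for x in cluster)
    return cost

def best_weighted_kmeans(X, w, k):
    """Exhaustively search all assignments of the points in X to k non-empty
    clusters and return a minimum-cost clustering (fine for tiny examples)."""
    best, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(X)):
        clusters = [tuple(x for x, l in zip(X, labels) if l == j) for j in range(k)]
        if any(len(c) == 0 for c in clusters):
            continue
        cost = weighted_kmeans_cost(clusters, w)
        if cost < best_cost:
            best, best_cost = clusters, cost
    return best

X = [0.0, 1.0, 10.0, 11.0]

# Uniform weights: the usual clustering {0, 1} vs {10, 11}.
uniform = {x: 1.0 for x in X}
print(best_weighted_kmeans(X, uniform, k=2))

# Make S = {10, 11} very heavy: the optimal 2-clustering now puts 10 and 11
# in different clusters, illustrating weight-separability of k-means.
heavy = {0.0: 1.0, 1.0: 1.0, 10.0: 1000.0, 11.0: 1000.0}
print(best_weighted_kmeans(X, heavy, k=2))
```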

Zooming Into: Weight-Considering Algorithms
We show that Average Linkage is weight-considering, and we characterize the precise conditions under which it is sensitive to weight.
Recall: An algorithm A is weight-considering if:
1) There exists (X, d) where |Range(X, d)| = 1.
2) There exists (X, d) where |Range(X, d)| > 1.

Average Linkage
Average Linkage is a hierarchical algorithm. It starts by creating a leaf for every element, and then repeatedly merges the "closest" clusters using the following linkage function: the average weighted distance between clusters.
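Below is a sketch of weighted average linkage, assuming the linkage function is the weight-weighted mean of pairwise distances between clusters (which is what element duplication yields); the merge loop is a simplified illustration rather than the exact procedure analyzed in the paper.

```python
import itertools

def avg_linkage(A, B, w, d):
    """Weighted average linkage: the weighted mean pairwise distance between
    clusters A and B (equivalent to treating weight as element duplication)."""
    num = sum(w[a] * w[b] * d(a, b) for a in A for b in B)
    den = sum(w[a] for a in A) * sum(w[b] for b in B)
    return num / den

def average_linkage_dendrogram(X, w, d):
    """Repeatedly merge the closest pair of clusters; return the merge history."""
    clusters = [frozenset([x]) for x in X]
    merges = []
    while len(clusters) > 1:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_linkage(clusters[ij[0]], clusters[ij[1]], w, d))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return merges

# Example usage on hypothetical toy data: points on a line with unit weights.
X = [0.0, 1.0, 3.0, 7.0]
w = {x: 1.0 for x in X}
d = lambda a, b: abs(a - b)
print(average_linkage_dendrogram(X, w, d))
```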

Average Linkage is Weight-Considering
There exist (X, d) where |Range(X, d)| = 1: the same dendrogram is output for every weight function. (Illustrated on a four-point example A, B, C, D.)

Average Linkage is Weight-Considering
There exist (X, d) where |Range(X, d)| > 1. (Illustrated on a five-point example A, B, C, D, E with distances involving a small ϵ, where different weight functions yield different dendrograms.)

When is Average Linkage Sensitive to Weight?
We showed that Average Linkage is weight-considering. Can we show when it is sensitive to weight? We provide a complete characterization of when Average Linkage is sensitive to weight, and when it is not.

Nice Clustering
A clustering is nice if every point is closer to all points within its cluster than to all other points. (The slides illustrate examples of clusterings that are and are not nice.)
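A small sketch of a checker for the niceness condition on a finite clustering, using a hypothetical list-of-clusters representation and an arbitrary distance function d.

```python
def is_nice(clustering, d):
    """Check whether a clustering (a list of lists of points) is 'nice':
    every point is strictly closer to all points in its own cluster than
    to any point in a different cluster."""
    for cluster in clustering:
        others = [y for other in clustering if other is not cluster for y in other]
        for x in cluster:
            within = [d(x, y) for y in cluster if y != x]
            between = [d(x, z) for z in others]
            if within and between and max(within) >= min(between):
                return False
    return True

# Example usage on 1-D points (hypothetical toy data).
d = lambda a, b: abs(a - b)
print(is_nice([[0.0, 1.0], [5.0, 6.0]], d))   # True: nice
print(is_nice([[0.0, 4.0], [5.0, 6.0]], d))   # False: 4.0 is closer to 5.0 than to 0.0
```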

Characterizing When Average Linkage is Sensitive to Weight
A dendrogram is nice if all of its clusterings are nice.
Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.

Characterizing When Average Linkage is Sensitive to Weight: Proof
Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.
Proof: We show that:
1) If there is a nice dendrogram for (X, d), then Average Linkage outputs it.
2) If a clustering that is not nice appears in the dendrogram AL(w[X], d) for some w, then |Range(AL(X, d))| > 1.

Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)
Lemma: If there is a nice dendrogram for (X, d), then Average Linkage outputs it.
Proof sketch:
1) Assume that (w[X], d) has a nice dendrogram.
2) Main idea: show that every nice clustering of the data appears in AL(w[X], d).
3) For that, we show that each cluster in a nice clustering is formed by the algorithm.

Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)
Given a nice clustering C, it can be shown that for any clusters C_i and C_j of C, any disjoint subsets Y and Z of C_i, and any subset W of C_j, Y and Z are closer (i.e., have a smaller linkage value) than Y and W. This implies that C appears in the dendrogram.

Characterizing When Average Linkage Responds to Weight: Proof (cont.)
Lemma: If a clustering C that is not nice appears in AL(w[X], d) for some w, then |Range(AL(X, d))| > 1.
Proof: Since C is not nice, there exist points x, y, and z such that x and y belong to the same cluster in C, x and z belong to different clusters, and yet d(x, z) < d(x, y). If x, y, and z are made sufficiently heavier than all other points, then x and z will be merged before x and y, so C will not be formed.

Characterizing When Average Linkage is Sensitive to Weight
Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.
Average Linkage is robust to weight whenever there is a dendrogram of (X, d) consisting of only nice clusterings, and it is sensitive to weight otherwise.

Zooming Into: Weight-Robust Algorithms
These algorithms are invariant to element duplication. For example, Min-Diameter returns a clustering that minimizes the length of the longest within-cluster edge. As this quantity is not affected by the number of points (or the weight) at any location, Min-Diameter is weight-robust.
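As a sketch of why Min-Diameter is weight-robust: its objective, the longest within-cluster distance, takes no weight argument at all. The brute-force solver below is a hypothetical illustration for tiny inputs, not an efficient implementation.

```python
from itertools import product

def diameter_cost(clusters, d):
    """Min-Diameter objective: the longest within-cluster distance.
    Weights never enter this computation, which is why Min-Diameter
    is weight-robust."""
    return max((d(x, y) for c in clusters for x in c for y in c if x != y),
               default=0.0)

def min_diameter(X, d, k):
    """Brute-force Min-Diameter clustering for tiny data sets."""
    best, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(X)):
        clusters = [[x for x, l in zip(X, labels) if l == j] for j in range(k)]
        if any(not c for c in clusters):
            continue
        cost = diameter_cost(clusters, d)
        if cost < best_cost:
            best, best_cost = clusters, cost
    return best

# Example usage on hypothetical toy data.
X = [0.0, 1.0, 5.0, 6.0]
d = lambda a, b: abs(a - b)
print(min_diameter(X, d, k=2))
```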

Outline
- Introduce framework
- Present categories and classification
- Show several results from different categories
- Conclusions and future work

Conclusions
- We introduced three basic categories describing how algorithms respond to weights.
- We characterize the precise conditions under which algorithms respond to weights.
- The same results apply in the non-weighted setting for data duplicates.
- This classification can be used to help select clustering algorithms for specific applications.

Future Directions
- Capture differences between objective functions similar to k-means (e.g., k-median, k-medoids, min-sum).
- Show bounds on the size of the Range of weight-considering and weight-sensitive methods.
- Analyze clustering algorithms for categorical data.
- Analyze clustering algorithms with a noise bucket.
- Identify properties that are significant for specific clustering applications (some previous work in this direction by Ackerman, Brown, and Loker (ICCABS, 2012)).