CLUSTERABILITY: A THEORETICAL STUDY. Margareta Ackerman, joint work with Shai Ben-David.

The theory-practice gap. Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields apply clustering to gain a first understanding of the structure of large data sets. Yet there is distressingly little theoretical understanding of clustering.

Inherent obstacles. Clustering is not well defined: there is a wide variety of clustering tasks, with different (often implicit) measures of quality. In most practical clustering tasks there is no clear ground truth to evaluate a solution by (in contrast with classification, where a held-out labeled set can be used to evaluate the classifier).

Common solutions. Objective utility functions: sum of in-cluster distances, average distances to center points, cut weight, spectral objectives, etc. (Shmoys, Charikar, Meyerson, von Luxburg, ...); analyze the computational complexity of the resulting discrete optimization problems. Restricted sets of distributions ("generative models"): e.g., mixtures of Gaussians [Dasgupta '99], [Vempala '03], [Kannan et al. '04], [Achlioptas, McSherry '05]; recover the parameters of the model generating the data. Many more...

Quest for a general theory. What can we say independently of any specific algorithm, specific objective function, or specific generative data model? Axioms of clustering, e.g., clustering functions [Puzicha, Hofmann, Buhmann '00], [Kleinberg '02]; axioms of clustering-quality measures [Ackerman and Ben-David '08]. Find axioms that define clustering.

Why study clusterability? Even if a data set has no meaningful structure, any clustering algorithm will find some partition of the data set. Clusterability aims to determine whether a data set can be meaningfully clustered. Notions of clusterability quantify the degree of clustered structure in a data set.

Our contributions. We set out to explore notions of clusterability, compare them, and find patterns of similarity. Computational complexity of clustering: data sets that are more clusterable are computationally easier to cluster well. Hardness: determining whether a data set is clusterable is usually NP-hard. Comparison: notions of clusterability are pairwise inconsistent.

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Definitions and notation. A k-clustering of X is a k-partition of X. A loss function (or cost function) L takes a clustering and outputs a real number; examples include the k-median and k-means costs. Let OPT_{L,k}(X) denote the minimal loss over all k-clusterings of X: OPT_{L,k}(X) = min{ L(C) : C is a k-clustering of X }.
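
To make the notation concrete, here is a minimal Python sketch (not from the talk) that evaluates L(C) and OPT_{L,k}(X) on a toy data set, assuming the k-means loss and an exhaustive search over k-clusterings; all function names are ours.

import itertools
import numpy as np

def kmeans_loss(X, labels, k):
    # L(C): sum of squared distances from each point to its cluster's mean.
    loss = 0.0
    for j in range(k):
        cluster = X[labels == j]
        if len(cluster) > 0:
            loss += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return loss

def opt_loss(X, k):
    # OPT_{L,k}(X): minimal loss over all k-clusterings, by exhaustive search (tiny X only).
    best = np.inf
    for assignment in itertools.product(range(k), repeat=len(X)):
        if len(set(assignment)) == k:   # keep only partitions into k non-empty clusters
            best = min(best, kmeans_loss(X, np.array(assignment), k))
    return best

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(opt_loss(X, 2))   # approx. 0.01: the two nearby pairs form the optimal 2-clustering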

Definitions and notation. A clustering C = {X_1, X_2, ..., X_k} is center-based if there are centers c_i in X_i such that every point in X_i is closer to c_i than to any other center.
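
As a small illustration (our own sketch, assuming Euclidean distance), the clustering that a set of centers induces on X assigns each point to its nearest center and is therefore center-based by construction:

import numpy as np

def induce_clustering(X, centers):
    # Assign each point of X to its nearest center (ties broken by lowest center index).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)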

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Better clusterability implies that data is easier to cluster well. In most formulations, clustering is an NP-hard problem (e.g., k-means and k-median clustering, correlation clustering, etc.). When a data set has no significant clustered structure, there is no sense in clustering it; clustering makes sense only on data sets that have meaningful clustering structure. We show that the more clusterable a data set is, the easier it is to cluster well: clustering is hard only when there isn't sufficient clustering structure in the data set.

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Center Perturbation Clusterability: a high-level definition. Call a clustering optimal when it has optimal cost under some loss function L. If a clustering looks like the optimal clustering, we might expect its cost to be near-optimal as well (in well-clusterable data sets). A data set X is CP-clusterable if all clusterings that are structurally similar to an optimal clustering of X have near-optimal cost.

Center Perturbation Clusterability. Two center-based clusterings C and C' are ε-close if there exist centers c_1, c_2, ..., c_k of C and c'_1, c'_2, ..., c'_k of C' such that |c_i - c'_i| ≤ ε for every i. Definition: A data set X is (ε,δ)-CP clusterable (for k) if for every clustering C of X that is ε-close to some optimal k-clustering of X, L(C) ≤ (1+δ)·OPT_{L,k}(X).
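
A minimal sketch of the ε-closeness test (our own code; we assume the existential quantifier over centers ranges over all pairings of the two center sets):

import itertools
import numpy as np

def are_eps_close(centers_C, centers_Cp, eps):
    # True if the two center sets can be paired off so that every pair is within eps.
    k = len(centers_C)
    for perm in itertools.permutations(range(k)):
        if all(np.linalg.norm(centers_C[i] - centers_Cp[perm[i]]) <= eps for i in range(k)):
            return True
    return False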

Good CP-clusterability implies that it is easy to cluster well. Theorem: Given a data set X in R^m that is (rad(X)/sqrt(l), δ)-CP clusterable (for k), there is an algorithm that runs in time polynomial in n and outputs a k-clustering C such that L(C) ≤ (1+δ)·OPT_{L,k}(X). This result holds for any loss function whose optimal clusterings are center-based.

Proof that CP-clusterability implies that it is easy to cluster well. Let an l-sequence denote a collection of l elements of X (not necessarily distinct). Algorithm 1: maintain the best clustering C seen so far; for each k-tuple of l-sequences of X, let S be the centers of mass of the l-sequences and let C_temp be the clustering that S induces on X; if L(C_temp) < L(C), set C := C_temp. Return C.
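
A runnable sketch of Algorithm 1 (assuming the k-means loss; the function names and the toy implementation are ours):

import itertools
import numpy as np

def kmeans_loss(X, labels, k):
    # L(C): sum of squared distances from each point to its cluster's mean.
    loss = 0.0
    for j in range(k):
        cluster = X[labels == j]
        if len(cluster) > 0:
            loss += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return loss

def induce_clustering(X, centers):
    # Assign each point of X to its nearest center.
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def algorithm1(X, k, l):
    # Try every k-tuple of l-sequences of X as candidate centers of mass
    # and keep the induced clustering of minimal loss.
    n = len(X)
    l_sequences = list(itertools.combinations_with_replacement(range(n), l))
    best_loss, best_labels = np.inf, None
    for tup in itertools.product(l_sequences, repeat=k):
        centers = np.array([X[list(seq)].mean(axis=0) for seq in tup])
        labels = induce_clustering(X, centers)
        cur = kmeans_loss(X, labels, k)
        if cur < best_loss:
            best_loss, best_labels = cur, labels
    return best_labels, best_loss

For fixed k and l there are O(n^{lk}) candidate tuples, which is why the search runs in time polynomial in n, matching the theorem's running-time claim.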

Proof that CP-clusterability implies that it is easy to cluster well (continued). Theorem [Maurey, 1981]: for any fixed l ≥ 1 and each x in the convex hull of X in R^m, there exist x_1, x_2, ..., x_l in X such that the average of the x_i's is at most rad(X)/sqrt(l) away from x. By Maurey's result, there is a clustering C' examined by Algorithm 1 that is rad(X)/sqrt(l)-close to an optimal clustering of X. Since Algorithm 1 selects the minimal-loss clustering among those it examines, L(C) ≤ L(C'). Since X is (rad(X)/sqrt(l), δ)-CP clusterable, L(C) ≤ L(C') ≤ (1+δ)·OPT_{L,k}(X).

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Worst Pair Ratio Clusterability: preliminaries. The width of a clustering is the maximum distance between points in the same cluster (over all clusters). The split of a clustering is the minimum distance between points in different clusters. Introduced by Epter, Krishnamoorthy, and Zaki in 1999. Definition: the Worst Pair Ratio of X (w.r.t. k) is WPR_k(X) = max{ split(C)/width(C) : C is a k-clustering of X }.
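
A brute-force sketch of split, width, and WPR_k (our own code, exhaustive and therefore suitable only for tiny data sets; clusterings of zero width are skipped for simplicity):

import itertools
import numpy as np

def width(X, labels):
    # Maximum distance between two points in the same cluster (over all clusters).
    return max((np.linalg.norm(X[i] - X[j])
                for i, j in itertools.combinations(range(len(X)), 2)
                if labels[i] == labels[j]), default=0.0)

def split(X, labels):
    # Minimum distance between two points in different clusters.
    return min(np.linalg.norm(X[i] - X[j])
               for i, j in itertools.combinations(range(len(X)), 2)
               if labels[i] != labels[j])

def worst_pair_ratio(X, k):
    # WPR_k(X): best split/width ratio over all k-clusterings (exhaustive search).
    best = 0.0
    for assignment in itertools.product(range(k), repeat=len(X)):
        if len(set(assignment)) == k:
            labels = np.array(assignment)
            w = width(X, labels)
            if w > 0:   # skip degenerate clusterings, e.g. all-singleton clusters
                best = max(best, split(X, labels) / w)
    return best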

Better WPR-clusterability implies that it is easier to cluster well. Theorem: Given a data set X on n elements where WPR_k(X) > 1, we can find a k-clustering of X with the maximal split-over-width ratio in time O(n^2 log n).
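
The slides do not spell out the algorithm. One natural candidate, offered here only as an assumption on our part, is single linkage: when a clustering's split exceeds its width, its clusters coincide with the components obtained by stopping the closest-pair merges at k clusters. A sketch:

import itertools
import numpy as np

def single_linkage_k_clusters(X, k):
    # Kruskal-style single linkage: repeatedly merge the closest pair of components
    # until exactly k components remain (equivalently, cut the k-1 longest MST edges).
    n = len(X)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                   for i, j in itertools.combinations(range(n), 2))
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    roots = {find(i) for i in range(n)}
    relabel = {r: c for c, r in enumerate(sorted(roots))}
    return np.array([relabel[find(i)] for i in range(n)])

Sorting the O(n^2) pairwise distances dominates the running time, giving O(n^2 log n), which matches the bound stated in the theorem.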

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Separability clusterability. Separability measures how much is gained in the transition from k-1 to k clusters. In the original definition, L is the k-means loss. Introduced by Ostrovsky, Rabani, Schulman, and Swamy in 2006. Definition: a data set X is (k,ε)-separable if OPT_{L,k}(X) ≤ ε·OPT_{L,k-1}(X).
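
A brute-force sketch of the separability ratio (our own code, using the k-means loss and exhaustive search; suitable only for tiny data sets):

import itertools
import numpy as np

def opt_kmeans(X, k):
    # Brute-force OPT_{k-means,k}(X) over all k-clusterings (tiny X only).
    best = np.inf
    for assignment in itertools.product(range(k), repeat=len(X)):
        if len(set(assignment)) == k:
            labels = np.array(assignment)
            loss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                       for j in range(k))
            best = min(best, loss)
    return best

def separability(X, k):
    # The smallest eps such that X is (k, eps)-separable: OPT_k / OPT_{k-1}.
    return opt_kmeans(X, k) / opt_kmeans(X, k - 1)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(separability(X, 2))   # a tiny value: two well-separated pairs are highly separable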

Better separability implies that it is easier to cluster well. Theorem [Theorem 4.13, Ostrovsky et al. 2006]: given a (k, ε²)-separable data set X in R^m, for small enough ε, we can find a k-clustering with k-means loss at most (1-ε²)/(1-37ε²)·OPT_{k-means,k}(X), with probability at least 1-O(ε²), in time O(nm).

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Variance Ratio: preliminaries. Introduced by Zhang in 2001. The within-cluster variance W(C) measures how spread out points are around their own cluster's mean; the between-cluster variance B(C) measures how spread out the cluster means are around the overall mean of X. Definition: the Variance Ratio of a data set X (for k) is Var_k(X) = max{ B(C)/W(C) : C is a k-clustering of X }.
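
A brute-force sketch of the Variance Ratio (our own code). The slide's formulas for W and B did not survive the transcript, so we assume the standard weighted definitions W(C) = (1/n) Σ_i Σ_{x in X_i} ||x - μ_i||² and B(C) = (1/n) Σ_i |X_i|·||μ_i - μ||²; with this normalization W(C) + B(C) equals the total variance of X.

import itertools
import numpy as np

def within_between(X, labels, k):
    # W(C) and B(C): weighted within- and between-cluster variance.
    n, mu = len(X), X.mean(axis=0)
    W = B = 0.0
    for j in range(k):
        cluster = X[labels == j]
        if len(cluster) > 0:
            mu_j = cluster.mean(axis=0)
            W += ((cluster - mu_j) ** 2).sum() / n
            B += len(cluster) * ((mu_j - mu) ** 2).sum() / n
    return W, B

def variance_ratio(X, k):
    # Var_k(X): best B(C)/W(C) over all k-clusterings (exhaustive; tiny X only).
    best = 0.0
    for assignment in itertools.product(range(k), repeat=len(X)):
        if len(set(assignment)) == k:
            W, B = within_between(X, np.array(assignment), k)
            if W > 0:   # skip clusterings with zero within-cluster variance
                best = max(best, B / W)
    return best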

Better VR-clusterability implies that it is easier to cluster well. Proof: we can show that VR_2(X) = 1/S_2(X); the result then follows from a theorem by Ostrovsky et al. for 2-clusterings. Theorem [Theorem 3.5, Ostrovsky et al. 2006]: given a (2, ε²)-separable data set X in R^m, we can find a 2-clustering with k-means loss within a 1+Θ(ε²) factor of the optimal, with probability at least 1-O(ε²), in time O(nm).

Summary of notions of clusterability. Center Perturbation: whenever a clustering is structurally similar to the optimal clustering, its cost is near-optimal. Separability: loss of the optimal k-clustering divided by the loss of the optimal (k-1)-clustering. Variance Ratio: between-cluster variance divided by within-cluster variance. Worst Pair Ratio: split divided by width.

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Comparing notions of clusterability. An arrow from notion A to notion B indicates that good clusterability by notion A implies good clusterability by notion B. [The slide shows the implication diagram over Worst Pair Ratio, Variance Ratio, Separability, and Center Perturbation.] No two notions are equivalent.

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Computational complexity of determining the clusterability value. What is the computational complexity of determining the clusterability of a data set? It is NP-hard to determine whether a data set is (k,ε)-separable. It is NP-hard to find the Variance Ratio of a data set. If a data set is well-clusterable by WPR, then the WPR can be found in polynomial time.

Outline: definitions and notation; clusterability and the complexity of clustering; a new notion of clusterability (CP-clusterability); Worst Pair Ratio (Epter, Krishnamoorthy, and Zaki, 1999); Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006); Variance Ratio (Zhang, 2001); comparison of notions of clusterability; the hardness of determining clusterability; summary and future work.

Summary. We initiate a study of clusterability and introduce a new notion of clusterability. We show that the three previously proposed notions and the new notion of clusterability are pairwise distinct. For each of these notions, better clusterability implies that it is easier to cluster well. Determining the degree of clusterability is usually NP-hard.

Future work. Property: the more clusterable a data set is, the easier it is, computationally, to find a near-optimal clustering of the data. Does this property hold for other natural notions of clusterability? Can clusterability be axiomatized? Can it be shown that the above property holds for all reasonable notions of clusterability?

Appendix: Variance Ratio does not imply the other notions. Select a data set X_1 with arbitrarily poor (k-1)-clusterability according to any other notion B (Separability, Worst Pair Ratio, or Center Perturbation). Create a data set X_2 by taking X_1 and adding a single point very far away, which forms its own cluster in the optimal k-clustering. X_2 can have arbitrarily good Variance Ratio, yet clusterability is the same on X_1 and X_2 by notion B.
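
A quick numerical illustration of this construction (our own toy example with k = 2, reusing the brute-force Variance Ratio sketched earlier):

import itertools
import numpy as np

def variance_ratio(X, k):
    # Best B(C)/W(C) over all k-clusterings, by exhaustive search (tiny X only).
    n, mu = len(X), X.mean(axis=0)
    best = 0.0
    for assignment in itertools.product(range(k), repeat=n):
        if len(set(assignment)) != k:
            continue
        labels = np.array(assignment)
        W = B = 0.0
        for j in range(k):
            cluster = X[labels == j]
            mu_j = cluster.mean(axis=0)
            W += ((cluster - mu_j) ** 2).sum() / n
            B += len(cluster) * ((mu_j - mu) ** 2).sum() / n
        if W > 0:
            best = max(best, B / W)
    return best

X1 = np.array([[0.0], [1.0], [2.0], [3.0]])          # evenly spread: no clear 2-cluster structure
X2 = np.vstack([X1, [[1000.0]]])                     # add one point very far away
print(variance_ratio(X1, 2), variance_ratio(X2, 2))  # the second ratio is far larger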

Thinking about clusterability: Are these data sets clusterable? Clusters come in different shapes and sizes.

What happens with noise, outliers, and “structureless” data?