A Discriminative Framework for Clustering via Similarity Functions Maria-Florina Balcan Carnegie Mellon University Joint with Avrim Blum and Santosh Vempala

2 Brief Overview of the Talk
Supervised Learning (learning from labeled data): good theoretical models (PAC, SLT, Kernels & Similarity fns).
Clustering (learning from unlabeled data): lack of good unified models; vague, difficult to reason about at a general technical level.
Our work: fix the problem with a PAC-style framework.

3 Clustering: Learning from Unlabeled Data
S: a set of n objects (e.g., documents). There exists a ground-truth clustering: each x in S has a label l(x) in {1,…,t} (its topic, e.g., [sports], [fashion]).
Goal: a clustering h of low error, where err(h) = min_σ Pr_{x~S}[σ(h(x)) ≠ l(x)], minimizing over permutations σ of the cluster labels.
Problem: unlabeled data only! But we have a Similarity Function!
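The error measure above compares h to the target up to a relabeling of the clusters. A minimal sketch of that computation (brute force over all label permutations, so only sensible for small t; the function name is mine):

```python
from itertools import permutations

def clustering_error(h_labels, true_labels, t):
    """err(h): fraction of points misclassified under the best matching
    (permutation sigma) of h's cluster labels to the true labels."""
    n = len(true_labels)
    best = 1.0
    for sigma in permutations(range(t)):  # try every relabeling of h's clusters
        mistakes = sum(1 for hx, lx in zip(h_labels, true_labels) if sigma[hx] != lx)
        best = min(best, mistakes / n)
    return best

# Example: 6 points, t = 2 clusters.
print(clustering_error([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], t=2))  # 0.0: same partition
print(clustering_error([0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0], t=2))  # ~0.17: one point misplaced
```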

4 Clustering: Learning from Unlabeled Data
There exists a ground-truth clustering for S, i.e., each x in S has l(x) in {1,…,t} (e.g., [sports], [fashion]).
Protocol: Input: S and a similarity function K. Output: a clustering of small error.
The similarity function K has to be related to the ground truth.

5 Clustering: Learning from Unlabeled Data
Fundamental Question: what natural properties of a similarity function would be sufficient to allow one to cluster well?

6 Contrast with Standard Approaches: Clustering Theoretical Frameworks
Approximation algorithms. Input: graph or embedding into R^d. Analyze algorithms that optimize various criteria over edges; score algorithms by approximation ratios.
Mixture models. Input: embedding into R^d. Score algorithms by error rate; strong probabilistic assumptions.
Our approach. Input: graph or similarity info. Score algorithms by error rate; no strong probabilistic assumptions. Discriminative, not generative. Much better when the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web-search results by category.

7 What natural properties of a similarity function would be sufficient to allow one to cluster well?
A condition that trivially works: K(x,y) > 0 for all x, y with l(x) = l(y), and K(x,y) < 0 for all x, y with l(x) ≠ l(y).

8 What natural properties of a similarity function would be sufficient to allow one to cluster well?
A more natural property: all x more similar to all y in their own cluster than to any z in any other cluster.
Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data! E.g., take K(x,x') = 1 within a subtopic (soccer, tennis, Lacoste, Gucci), K(x,x') = 0.5 within a topic (sports, fashion), and K(x,x') = 0 across topics: both the two-cluster clustering {sports, fashion} and the four-cluster clustering {soccer, tennis, Lacoste, Gucci} satisfy the property.

9 Relax Our Goals 1. Produce a hierarchical clustering s.t. correct answer is approximately some pruning of it.

10 Relax Our Goals
1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it. [Tree: all topics -> sports, fashion -> soccer, tennis, Lacoste, Gucci.]
2. Produce a list of clusterings s.t. at least one has low error. Trade off the strength of the assumption against the size of the list.
Obtain a rich, general model.

11 Strict Separation Property
Property: all x more similar to all y in their own cluster than to any z in any other cluster.
Sufficient for hierarchical clustering (if K is symmetric).
Algorithm: Single-Linkage; merge the "parts" whose maximum similarity is highest (sketched below). [Tree: all topics -> sports, fashion -> soccer, tennis, Lacoste, Gucci.]
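A minimal sketch of Single-Linkage under Strict Separation (assumptions mine: K is given as a symmetric n x n similarity matrix, and the hierarchy is returned as the list of all sets ever formed):

```python
def single_linkage_tree(K):
    """Repeatedly merge the two current parts whose maximum cross-similarity
    is highest; under Strict Separation the target clustering is a pruning
    of the returned tree."""
    n = len(K)
    parts = [frozenset([i]) for i in range(n)]   # current clusters
    tree = list(parts)                           # all nodes of the hierarchy
    while len(parts) > 1:
        a, b = max(((p, q) for i, p in enumerate(parts) for q in parts[i + 1:]),
                   key=lambda pq: max(K[x][y] for x in pq[0] for y in pq[1]))
        parts = [p for p in parts if p not in (a, b)] + [a | b]
        tree.append(a | b)
    return tree

# Example: K with two obvious groups {0,1} and {2,3}.
K = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
print(single_linkage_tree(K))
```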

12 Strict Separation Property
Property: all x more similar to all y in their own cluster than to any z in any other cluster.
Theorem: using Single-Linkage, we can construct a tree s.t. the ground-truth clustering is a pruning of the tree.
Incorporating approximation assumptions in our model: if one uses a c-approximation algorithm for an objective f (e.g., k-median, k-means) to minimize the error rate, the implicit assumption is that clusterings within a factor c of optimal are ε-close to the target. Then most points (a 1-O(ε) fraction) satisfy Strict Separation, and we can still cluster well in the tree model.

13 Stability Property
Property: for all clusters C, C', all A ⊂ C, A' ⊆ C': K(A,C−A) > K(A,A'), where K(A,A') denotes the average attraction (average similarity) between A and A'. I.e., neither A nor A' is more attracted to the other than to the rest of its own cluster.
Sufficient for hierarchical clustering. Single linkage fails, but average linkage works.
Algorithm: Average Linkage; merge the "parts" whose average similarity is highest.
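A minimal sketch of Average Linkage under the same assumptions as the single-linkage sketch above (symmetric n x n similarity matrix K); only the merge criterion changes, from maximum to average cross-similarity:

```python
def average_linkage_tree(K):
    """Repeatedly merge the two current parts with the highest *average*
    cross-similarity; under the Stability property the target clustering
    is a pruning of the returned tree."""
    n = len(K)
    parts = [frozenset([i]) for i in range(n)]
    tree = list(parts)

    def avg_sim(a, b):
        return sum(K[x][y] for x in a for y in b) / (len(a) * len(b))

    while len(parts) > 1:
        a, b = max(((p, q) for i, p in enumerate(parts) for q in parts[i + 1:]),
                   key=lambda pq: avg_sim(*pq))
        parts = [p for p in parts if p not in (a, b)] + [a | b]
        tree.append(a | b)
    return tree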

14 Stability Property
(K(A,A'): average attraction between A and A'.)
Theorem: using Average Linkage, we can construct a tree s.t. the ground-truth clustering is a pruning of the tree.
Analysis: all current "parts" stay laminar w.r.t. the target clustering. A failure would mean merging P1, P2 with P1 ⊂ C and P2 ∩ C = ∅. But then there must exist P3 ⊂ C s.t. K(P1,P3) ≥ K(P1,C−P1), and by the stability property K(P1,C−P1) > K(P1,P2); so Average Linkage would have merged P1 with P3 instead of P2. Contradiction.

15 Stability Property
For all C, C', all A ⊂ C, A' ⊆ C': K(A,C−A) > K(A,A') (K(A,A'): average attraction between A and A').
Average Linkage breaks down if K is not symmetric. Instead, run a "Boruvka-inspired" algorithm (sketched below):
– Each current cluster C_i points to argmax_{C_j} K(C_i,C_j).
– Merge the directed cycles.
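A minimal sketch (data layout and helper names are my assumptions, not the paper's) of one round of this "Boruvka-inspired" procedure for an asymmetric similarity K: each current cluster points to the cluster it is most attracted to on average, and the directed cycles that form are merged. Repeating rounds until one cluster remains yields a hierarchy.

```python
def boruvka_style_round(parts, K):
    """parts: list of frozensets of point indices; K: asymmetric similarity matrix.
    Returns the parts after one round of cycle merging."""
    def avg_sim(a, b):
        return sum(K[x][y] for x in a for y in b) / (len(a) * len(b))

    # each part points to the other part it is most attracted to
    point_to = {i: max((j for j in range(len(parts)) if j != i),
                       key=lambda j: avg_sim(parts[i], parts[j]))
                for i in range(len(parts))}

    merged, used = [], set()
    for i in range(len(parts)):
        if i in used:
            continue
        path, cur = [], i
        while cur not in path and cur not in used:   # follow pointers until repetition
            path.append(cur)
            cur = point_to[cur]
        if cur in path:                               # found a directed cycle: merge it
            cycle = path[path.index(cur):]
            merged.append(frozenset().union(*(parts[j] for j in cycle)))
            used.update(cycle)
    # parts not on any cycle are left unchanged this round
    merged += [parts[i] for i in range(len(parts)) if i not in used]
    return merged
```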

16 Unified Model for Clustering
[Diagram: properties P_1, …, P_n of the similarity function w.r.t. the ground-truth clustering, paired with algorithms A_1, …, A_m.]
Question 1: given a property of the similarity function w.r.t. the ground-truth clustering, what is a good algorithm?

17 Unified Model for Clustering
[Same diagram as on slide 16.]
Question 2: given the algorithm, what property of the similarity function w.r.t. the ground-truth clustering should the expert aim for?

18 Other Examples of Properties and Algorithms
Average Attraction Property: E_{x' ∈ C(x)}[K(x,x')] > E_{x' ∈ C'}[K(x,x')] + γ for all C' ≠ C(x). Not sufficient for hierarchical clustering, but can produce a small list of clusterings (sampling-based algorithm). Upper bound on list size: t^{O((t/γ²) log(t/γ))}; lower bound: t^{O(1/γ)}.
Stability of Large Subsets Property: for all clusters C, C', for all A ⊆ C, A' ⊆ C' with |A|+|A'| ≥ sn, neither A nor A' is more attracted to the other than to the rest of its own cluster. Sufficient for hierarchical clustering: find the hierarchy using a multi-stage learning-based algorithm.
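A minimal sketch in the spirit of the sampling-based list algorithm for Average Attraction (the paper's exact sampling and parameter choices differ; names and parameters here are mine): enumerate all labelings of a small sample and, for each labeling, assign every point to the label whose sampled representatives it is most similar to on average. Under the property, some labeling of the sample is roughly consistent with the ground truth, so at least one of the t^sample_size clusterings in the list has low error.

```python
import random
from itertools import product

def list_clusterings(points, K, t, sample_size):
    """K(x, y) -> similarity; returns a list of candidate clusterings (label lists)."""
    sample = random.sample(points, sample_size)
    candidates = []
    for labeling in product(range(t), repeat=sample_size):   # all t^sample_size labelings
        clustering = []
        for x in points:
            def avg_to(label):
                # average similarity of x to the sampled points carrying this label
                pts = [s for s, l in zip(sample, labeling) if l == label]
                return sum(K(x, s) for s in pts) / len(pts) if pts else float("-inf")
            clustering.append(max(range(t), key=avg_to))
        candidates.append(clustering)
    return candidates
```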

19 Stability of Large Subsets Property
Property: for all C, C', all A ⊂ C, A' ⊆ C' with |A|+|A'| ≥ sn, K(A,C−A) > K(A,A').
Algorithm:
1) Generate a list L of candidate clusters (average attraction algorithm). Ensure that any ground-truth cluster is f-close to one in L.
2) For every pair (C, C') in L s.t. all three parts (C ∩ C', C \ C', C' \ C) are large: if K(C ∩ C', C \ C') ≥ K(C ∩ C', C' \ C), then throw out C'; else throw out C. (A sketch of this test appears below.)
3) Clean and hook up the surviving clusters into a tree.
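A minimal sketch (helper names and the size parameter are mine) of the pairwise test in step 2: when two candidate clusters overlap and all three parts are large, keep whichever cluster the intersection is more attracted to on average.

```python
def prune_pair(C, Cp, K, min_size):
    """C, Cp: sets of points; K(x, y) -> similarity.
    Returns the candidate to throw out, or None if the three parts are not all large."""
    inter, c_only, cp_only = C & Cp, C - Cp, Cp - C
    if min(len(inter), len(c_only), len(cp_only)) < min_size:
        return None                     # parts not all large: leave both candidates alone

    def avg_sim(A, B):
        return sum(K(x, y) for x in A for y in B) / (len(A) * len(B))

    # if the intersection is at least as attracted to C\Cp as to Cp\C, C "wins"
    return Cp if avg_sim(inter, c_only) >= avg_sim(inter, cp_only) else C
```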

20 Stability of Large Subsets
Property: for all C, C', all A ⊂ C, A' ⊆ C' with |A|+|A'| ≥ sn, K(A,C−A) > K(A,A') + γ.
Theorem: if s = O(ε²/k²) and f = O(ε²γ/k²), then the algorithm produces a tree s.t. the ground truth is ε-close to a pruning.

21 The Inductive Setting
Inductive setting: draw a sample S from the instance space X, cluster S (in the list or tree model), then insert new points as they arrive.
Many of our algorithms extend naturally to this setting. To get polynomial time for stability of all subsets, we need to argue that sampling preserves stability. [AFKK]
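A minimal sketch (assumptions mine) of the inductive step: once the sample S has been clustered, a new point is placed into the existing cluster it is most attracted to on average, without re-clustering the sample.

```python
def insert_new_point(x, sample_clusters, K):
    """sample_clusters: list of lists of sampled points; K(x, y) -> similarity.
    Returns the index of the cluster x is assigned to."""
    def avg_sim(cluster):
        return sum(K(x, s) for s in cluster) / len(cluster)
    return max(range(len(sample_clusters)), key=lambda i: avg_sim(sample_clusters[i]))
```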

22 Similarity Functions for Clustering: Summary
Main conceptual contributions: natural conditions on K to be useful for clustering; for a robust theory, relax the objective (hierarchy, list); a general model that parallels PAC, SLT, and Learning with Kernels and Similarity Functions in supervised classification.
Technically most difficult aspects: algorithms for stability of large subsets and for ν-strict separation; algorithms and analysis for the inductive setting.

24 Properties Summary
Property | Model, Algorithm | Clustering Complexity
Strict Separation | Hierarchical, linkage based | Θ(2^t)
Stability, all subsets (Weak, Strong, etc.) | Hierarchical, linkage based | Θ(2^t)
Average Attraction (Weighted) | List, sampling based & NN | t^{O(t/γ²)}
Stability of Large Subsets (SLS) | Hierarchical, complex algorithm (running time t^{O(t/γ²)}) | Θ(2^t)
ν-strict separation | Hierarchical | Θ(2^t)
(2,ε) k-median | Special case of ν-strict separation, ν = 3ε |

25 Algorithm sketch:
1. For each x and m, generate the cluster of the m most similar points to x. Delete any cluster of size < 4νn.
2. If two clusters C, C' overlap like this [figure: overlapping C, C' with a region marked > 2νn]: have each y in the intersection choose based on its median similarity to C−C' vs. C'−C.
Thm: if there exists a bad set B of ≤ νn points s.t. S' = S−B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ≤ ν.

26 Algorithm sketch:
1. For each x and m, generate the cluster of the m most similar points to x. Delete any cluster of size < 4νn.
3. If two clusters C, C' overlap like this [figure: overlapping C, C' with a region marked < 2νn]: have each y in the symmetric difference C Δ C' choose in or out based on its (νn+1)st most similar point in C ∩ C' or in S − (C ∪ C').
Thm: if there exists a bad set B of ≤ νn points s.t. S' = S−B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ≤ ν.

27 Algorithm sketch:
1. For each x and m, generate the cluster of the m most similar points to x. Delete any cluster of size < 4νn.
4. If two clusters C, C' overlap like this [figure: overlapping C, C' with regions marked < 2νn and > 2νn]: have each y in C−C' choose in or out based on its (νn+1)st most similar point in C ∩ C' or in S − (C ∪ C').
Thm: if there exists a bad set B of ≤ νn points s.t. S' = S−B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ≤ ν.

28 Algorithm sketch:
1. For each x and m, generate the cluster of the m most similar points to x. Delete any cluster of size < 4νn.
5. Then argue that this never hurts the correct clusters (w.r.t. S') and that each step makes progress.
Thm: if there exists a bad set B of ≤ νn points s.t. S' = S−B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ≤ ν.
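A minimal sketch (assumptions mine) of steps 1 and 2 above: candidate clusters are the m most similar points to each x (with small candidates deleted), and an overlap is resolved by letting each point in the intersection vote with its median similarity to the two sides. The overlap-resolution sketch assumes both differences C−C' and C'−C are nonempty.

```python
import statistics

def candidate_clusters(points, K, nu):
    """Step 1: for every x and m, the m points most similar to x form a candidate;
    candidates of size < 4*nu*n are deleted."""
    n = len(points)
    cands = []
    for x in points:
        ranked = sorted(points, key=lambda y: K(x, y), reverse=True)
        for m in range(1, n + 1):
            c = frozenset(ranked[:m])
            if len(c) >= 4 * nu * n:
                cands.append(c)
    return cands

def resolve_overlap(C, Cp, K):
    """Step 2: each y in C ∩ Cp keeps membership in whichever side (C−Cp vs Cp−C)
    it has higher median similarity to; returns the two adjusted clusters."""
    inter, c_only, cp_only = C & Cp, C - Cp, Cp - C
    new_C, new_Cp = set(c_only), set(cp_only)
    for y in inter:
        med_c = statistics.median(K(y, z) for z in c_only)
        med_cp = statistics.median(K(y, z) for z in cp_only)
        (new_C if med_c >= med_cp else new_Cp).add(y)
    return frozenset(new_C), frozenset(new_Cp)
```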