
1 Correlation Clustering. Shuchi Chawla, Carnegie Mellon University. Joint work with Nikhil Bansal and Avrim Blum.

2 Natural Language Processing: Co-reference Analysis
- To understand an article automatically, we need to figure out which entities are one and the same.
- Is "his" in the second line the same person as "The secretary" in the first line?

3 Other real-world clustering problems
- Web document clustering: given a bunch of documents, classify them into salient topics.
- Computer vision: distinguish boundaries between different objects and the background in a picture.
- Research communities: given data on research papers, divide researchers into communities by co-authorship.
- Authorship (Citeseer/DBLP): given the authors of documents, figure out which authors are really the same person.

4 Traditional Approaches to Clustering
- Approximation algorithms: k-means, k-median, k-min-sum
- Matrix methods: spectral clustering
- AI techniques: EM, single-linkage, classification algorithms

5 Issues with traditional approaches
- Dependence on an underlying metric: objective functions are meaningful only on a metric (e.g. k-means), and some algorithms work only for specific metrics (such as Euclidean).
- Problem: here there is no well-defined "similarity metric", and beliefs may be inconsistent.

6 Issues with traditional approaches
- Fixed number of clusters / known topics: objectives are meaningless without a prespecified number of clusters; e.g. for k-means or k-median, if k is unspecified, it is best to put every item in its own cluster.
- Problem: the number of clusters is usually unknown, and there are no predefined topics; it is desirable to figure them out as part of the algorithm.

7 Issues with traditional approaches
- No clean notion of the "quality" of a clustering: approximation guarantees do not directly translate into how many items have been grouped wrongly.
- Reliance on a generative model, e.g. data arising from a mixture of Gaussians: such methods typically don't work well with fuzzy boundaries.
- Problem: boundaries are fuzzy, and how to cluster may depend on the given set of objects.

8 Cohen, McCallum and Richman's idea
- "Learn" a similarity function based on context: f(x,y) = amount of similarity between x and y. Not necessarily a metric!
1. Use labeled data to train up this function.
2. Classify all pairs with the learned function.
3. Find the clustering that agrees most with the function.
- The problem is divided into two separate phases; we deal with the second phase (see the sketch below).
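
A minimal sketch of this two-phase pipeline, in Python. The helper names (learn, solve) are illustrative placeholders for the learning phase and the clustering phase, neither of which is specified here:

    # Hypothetical two-phase pipeline; names are illustrative, not from the talk.
    def cluster_with_learned_similarity(items, labeled_pairs, learn, solve):
        f = learn(labeled_pairs)                         # phase 1: train f(x, y)
        signs = {(x, y): (+1 if f(x, y) > 0.5 else -1)   # classify all pairs
                 for i, x in enumerate(items) for y in items[i + 1:]}
        return solve(items, signs)                       # phase 2: cluster to agree with f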

9 Cohen, McCallum and Richman's idea: "learn" a similarity measure based on context. [Figure: a graph on the entities "Mr. Rumsfield", "The secretary", "his", "he", "Saddam Hussein", with edges labeled strong similarity or strong dissimilarity.]

10 A good clustering. A consistent clustering has positive edges inside clusters and negative edges between clusters. [Figure: the same entity graph, grouped into clusters consistent with every edge.]

11 Inconsistencies or "mistakes". A consistent clustering has positive edges inside clusters and negative edges between clusters; edges violating this are mistakes. [Figure: the same entity graph with a clustering and its mistakes marked.]

12 A good clustering. For some sign patterns there is no consistent clustering at all! Goal: find the most consistent clustering. [Figure: the entity graph with edge signs that no clustering can fully satisfy; the best clustering shown still makes mistakes.]

13 Compared to traditional approaches…
- We do not have to specify k: the number of clusters can range from 1 to n.
- No condition on weights: they can be arbitrary.
- Clean notion of the quality of a clustering: the number of examples where the clustering differs from f.
- If a good (perfect) clustering exists, it is easy to find.

14 From a Machine Learning perspective
- Noise removal: there is some true classification function f, but there are a few errors in the data; we want to find the true function.
- Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited expressivity.

15 Correlation Clustering
- Given a graph with positive (similar) and negative (dissimilar) edges, find the most consistent clustering.
- NP-hard [Bansal, Blum, C, FOCS'02].
- Two natural objectives:
  Maximize agreements: (# of +ve edges inside clusters) + (# of -ve edges between clusters).
  Minimize disagreements: (# of +ve edges between clusters) + (# of -ve edges inside clusters).
- Equivalent at optimality, but different in terms of approximation. (Both objectives are rendered as code below.)
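
As a concrete rendering of the two objectives, here is a small Python sketch that scores a clustering against the edge signs; the representation (a dict of signed pairs and a vertex-to-cluster map) is an assumption made for illustration:

    def agreements_and_disagreements(signs, cluster_of):
        # signs maps each pair (u, v) with u < v to +1 (similar) or -1 (dissimilar);
        # cluster_of maps each vertex to its cluster id.
        agree = disagree = 0
        for (u, v), s in signs.items():
            same = cluster_of[u] == cluster_of[v]
            if (s > 0) == same:
                agree += 1      # +ve inside a cluster, or -ve between clusters
            else:
                disagree += 1   # +ve between clusters, or -ve inside a cluster
        return agree, disagree

Since agree + disagree equals the fixed total number of edges, the two objectives coincide at the optimum but behave differently under approximation.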

16 Overview of results
Minimizing disagreements:
  Unweighted complete graph: O(1) [Bansal Blum C '02] (this talk); 4 [Charikar et al '03]
  Weighted general graph: O(log n) [Charikar et al '03] [Demaine et al '03] [Emmanuel et al '03]
  APX-hardness for the weighted case [Bansal Blum C '02]; constant lower bounds for both cases [Charikar et al '03]
Maximizing agreements:
  Unweighted complete graph: PTAS [Bansal Blum C '02]
  Weighted general graphs: 0.7664 [Charikar et al '03]; 0.7666 [Swamy '04]
  Constant lower bound for the weighted case [Charikar et al '03]

17 Minimizing Disagreements [Bansal, Blum, C, FOCS'02]
- Goal: approximately minimize the number of "mistakes".
- Assumption: the graph is unweighted and complete.
- A lower bound on OPT: erroneous triangles. Consider a triangle with two + edges and one - edge (an "erroneous triangle"): any clustering disagrees with at least one of its edges.
- If there are several edge-disjoint erroneous triangles, any clustering makes a mistake on each one, so D_opt ≥ the maximum fractional packing of erroneous triangles. (An enumeration of erroneous triangles is sketched below.)
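
To make the lower bound concrete, this Python sketch enumerates erroneous triangles in a complete signed graph; the sign(u, v) interface returning +1 or -1 is an assumption made for illustration:

    from itertools import combinations

    def erroneous_triangles(vertices, sign):
        # List triangles with exactly two + edges and one - edge; any
        # clustering must make at least one mistake on each of them.
        bad = []
        for a, b, c in combinations(vertices, 3):
            s = sorted(sign(x, y) for x, y in ((a, b), (b, c), (a, c)))
            if s == [-1, 1, 1]:
                bad.append((a, b, c))
        return bad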

18 Using the lower bound: δ-clean clusters
- Relating erroneous triangles to mistakes: in special cases we can "charge off" disagreements to erroneous triangles, losing only a constant factor.
- "Clean" clusters: each vertex has few disagreements incident on it, where "few" is relative to the size of the cluster.
- A vertex with few incident disagreements is "good", one with many is "bad"; a cluster is clean when all of its vertices are good.

19 Using the lower bound: δ-clean clusters
- δ-clean clusters: each vertex in cluster C has fewer than δ|C| positive mistakes and fewer than δ|C| negative mistakes.
- Such clusters have a high density of positive edges, so we can easily spot them in the graph, and their disagreements can be charged to erroneous triangles.
- Possible solution: find a δ-clean clustering and charge its disagreements to erroneous triangles.
- Caveat: it may not exist. (A direct check of δ-cleanness is sketched below.)
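
A direct check of the definition, using the same illustrative sign-function interface as before; the thresholds follow the slide's "fewer than δ|C|":

    def is_delta_clean(C, all_vertices, sign, delta):
        # C is delta-clean if every vertex of C has fewer than delta*|C|
        # negative edges inside C (negative mistakes) and fewer than
        # delta*|C| positive edges leaving C (positive mistakes).
        k = len(C)
        for u in C:
            neg_in = sum(1 for w in C if w != u and sign(u, w) < 0)
            pos_out = sum(1 for w in all_vertices
                          if w not in C and sign(u, w) > 0)
            if neg_in >= delta * k or pos_out >= delta * k:
                return False
        return True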

20 Using the lower bound: δ-clean clusters
- Caveat: a δ-clean clustering may not exist.
- An almost-δ-clean clustering: all clusters are either δ-clean or contain a single node. An almost-δ-clean clustering always exists, trivially (all singletons).
- We show: there exists an almost-δ-clean clustering, OPT(δ), that is almost as good as OPT, and its nice structure helps us find it easily.

21 OPT(δ): clean or singleton
- Imaginary procedure, starting from the optimal clustering: if a cluster has few (at most a δ fraction) bad vertices, remove them from the cluster.
- The new cluster is O(δ)-clean, and the few new mistakes increase the total by at most a 1/δ factor.

22 OPT(δ): all clusters are δ-clean or singletons
- Imaginary procedure, continued: if a cluster has many (more than a δ fraction) bad vertices, break the whole cluster into singletons.
- The new singleton clusters add few new mistakes; mistakes increase by at most a 1/δ² factor. (The whole procedure is sketched in code below.)
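
A sketch of the imaginary procedure from the last two slides, in Python; the bad-vertex test mirrors the δ-clean definition above, and the exact thresholds are illustrative rather than the paper's constants:

    def make_almost_clean(opt_clusters, all_vertices, sign, delta):
        # Turn a clustering (an iterable of vertex sets) into an
        # almost-delta-clean one: peel bad vertices off clusters that
        # have few of them, break up clusters that have many.
        def is_bad(u, C):
            k = len(C)
            neg_in = sum(1 for w in C if w != u and sign(u, w) < 0)
            pos_out = sum(1 for w in all_vertices
                          if w not in C and sign(u, w) > 0)
            return neg_in >= delta * k or pos_out >= delta * k

        result = []
        for C in opt_clusters:
            bad = {u for u in C if is_bad(u, C)}
            if len(bad) <= delta * len(C):       # few bad nodes: remove them
                if C - bad:
                    result.append(C - bad)
                result.extend({u} for u in bad)
            else:                                # many bad nodes: break up
                result.extend({u} for u in C)
        return result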

23 Our algorithm
- Goal: find nearly clean clusters.
1. Pick an arbitrary vertex v; let C be v together with the positive (green) neighbors of v.
2. Remove any bad vertices from C.
3. Add vertices that are good with respect to C.
4. Output C and recurse on the remaining graph.
5. If C is empty for all choices of v, output the remaining vertices as singletons.
(A sketch in code follows.)
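
A minimal Python sketch of steps 1-5, assuming an unweighted complete signed graph given through the same sign(u, v) interface; the goodness thresholds below are illustrative placeholders, not the paper's exact constants:

    def cautious_clustering(vertices, sign, delta=0.25):
        remaining = set(vertices)

        def is_good(u, C):
            # good w.r.t. C: most of u's positive edges stay inside C
            pos_in = sum(1 for w in C if w != u and sign(u, w) > 0)
            pos_out = sum(1 for w in remaining
                          if w not in C and w != u and sign(u, w) > 0)
            return pos_in >= (1 - delta) * len(C) and pos_out <= delta * len(C)

        clusters = []
        while remaining:
            grown = None
            for v in remaining:
                # 1. start from v and its positive neighbours
                C = {v} | {w for w in remaining if w != v and sign(v, w) > 0}
                # 2. remove bad vertices from C
                C = {u for u in C if is_good(u, C)}
                # 3. add vertices that are good w.r.t. C
                C |= {u for u in remaining if u not in C and C and is_good(u, C)}
                if C:
                    grown = C
                    break
            if grown is None:
                # 5. no cluster survives for any v: output singletons
                clusters.extend({u} for u in remaining)
                remaining = set()
            else:
                # 4. output C and recurse on the remaining graph
                clusters.append(grown)
                remaining -= grown
        return clusters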

24 Finding clean clusters
- The clusters found by the algorithm are O(δ)-clean and can be matched up against the clusters of OPT(δ).
- Charging off mistakes: (1) mistakes among clean clusters are charged to erroneous triangles; (2) mistakes among singletons are no more than the corresponding mistakes in OPT(δ).
- Together this gives a constant factor approximation.

25 Maximizing Agreements
- It is easy to obtain a 2-approximation: if #(positive edges) > #(negative edges), put everything in one cluster; otherwise output n singleton clusters.
- Either way we get at least half the edges correct, and the maximum possible score is the total number of edges, so this is a 2-approximation. (A sketch follows.)
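
In code, this baseline is only a few lines; the signed-pair dict is the same illustrative representation used earlier:

    def max_agree_baseline(vertices, signs):
        # One big cluster if positive edges are the majority, otherwise all
        # singletons; either choice gets at least half of the edges right.
        n_pos = sum(1 for s in signs.values() if s > 0)
        if n_pos >= len(signs) - n_pos:
            return {v: 0 for v in vertices}               # everything together
        return {v: i for i, v in enumerate(vertices)}     # all singletons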

26 Maximizing Agreements
- The maximum possible score is ½n².
- Goal: obtain an additive approximation of εn².
- Standard approach: draw a small sample, guess a partition of the sample, and compute the partition of the remainder accordingly (sketched below).
- The running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.
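
A Python sketch of that standard approach, under the illustrative assumption that the number of clusters is capped at k; enumerating all partitions of the sample is what makes the running time exponential:

    import random
    from itertools import product

    def sample_based_max_agree(vertices, sign, k, s):
        # Guess every k-way partition of a random sample of size s, extend
        # each guess greedily to the remaining vertices, keep the best.
        # The parameters k and s come from the analysis; they are inputs here.
        vertices = list(vertices)
        sample = random.sample(vertices, s)
        rest = [v for v in vertices if v not in sample]
        best, best_agree = None, -1
        for guess in product(range(k), repeat=s):
            clustering = dict(zip(sample, guess))
            for v in rest:
                # place v where it agrees most with the sampled vertices
                clustering[v] = max(range(k), key=lambda c: sum(
                    (1 if sign(v, u) > 0 else -1)
                    for u in sample if clustering[u] == c))
            agree = sum(1 for i, u in enumerate(vertices)
                        for w in vertices[i + 1:]
                        if (sign(u, w) > 0) == (clustering[u] == clustering[w]))
            if agree > best_agree:
                best, best_agree = clustering, agree
        return best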

27 Experimental Results [Wellner, McCallum '03] (% accuracy of classification)

                             Dataset 1   Dataset 2   Dataset 3
    Best-previous-match        90.98       88.83       70.41
    Single-link-threshold      91.65       88.90       60.83
    Correlation clustering     93.96       91.59       73.42
    % error reduction over
    previous best                 28          24          10

28 Future Directions
- A better combinatorial approximation.
- A good "iterative" approximation: upon few changes to the graph, quickly recompute a good clustering.
- Maximizing correlation: (# of agreements) - (# of disagreements). A log-approximation is known; can we get a constant factor approximation?

29 Questions?

30 Future Directions
- Clustering with small clusters: given that all clusters in OPT have size at most k, find a good approximation. Is this NP-hard? This is different from finding the best clustering with small clusters, without a guarantee on OPT.
- Clustering with few clusters: given that OPT has at most k clusters, find an approximation.
- Maximizing correlation: (# of agreements) - (# of disagreements). Can we get a constant factor approximation?

31 Lower Bounding Idea: Erroneous Triangles
- If there are several edge-disjoint erroneous triangles, any clustering makes a mistake on each one, so D_opt ≥ the maximum fractional packing of erroneous triangles.
- Example: on vertices 1-5, the triangles (1,2,3) and (1,4,5) are edge-disjoint erroneous triangles. [Figure: a 5-vertex graph whose + and - edges contain these two triangles; the clustering shown makes 3 mistakes.] (A greedy packing is sketched below.)
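
The packing itself can be built greedily, as in this sketch that consumes the output of the erroneous_triangles helper above; the size of any edge-disjoint packing lower-bounds the optimal number of disagreements:

    def greedy_edge_disjoint_packing(triangles):
        # Greedily select erroneous triangles sharing no edge; on the
        # example above this can pick (1,2,3) and (1,4,5), certifying
        # that every clustering makes at least 2 mistakes.
        used, packing = set(), []
        for a, b, c in triangles:
            edges = {frozenset(e) for e in ((a, b), (b, c), (a, c))}
            if not edges & used:
                packing.append((a, b, c))
                used |= edges
        return packing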

32 Open Problems
- Clustering with small clusters: in most applications clusters are very small. Given that all clusters in OPT have size at most k, find a good approximation. This is different from finding the best clustering with small clusters, without a guarantee on OPT.
- An optimal solution for unweighted graphs? A possible approach: any two vertices in the same cluster in OPT are neighbors or share a common neighbor, so we can find a list of O(n·2^k) clusters that contains all of OPT's clusters; when k is small, there are only polynomially many choices to pick from.

33 Open Problems
- Clustering with few clusters: given that OPT has at most k clusters, find an approximation.
- Consensus clustering: given a "sum" of k clusterings, find the best "consensus" clustering. There is an easy 2-approximation; can we get a PTAS?
- Maximizing correlation: (# of agreements) - (# of disagreements). The bad case is when the # of disagreements is a constant fraction of the total weight. Charikar and Wirth obtained a constant factor approximation. Can we get a PTAS in unweighted graphs?

34 Overview of results
Min Disagree:
  Unweighted (complete) graphs: O(1) [Bansal Blum C 02]; 4 [Charikar Guruswami Wirth 03]; lower bound 29/28 [CGW 03]
  Weighted graphs: O(log n) [CGW 03] [Immorlica Demaine 03] [Emanuel Fiat 03]; APX-hard [CGW 03]
Max Agree:
  Unweighted (complete) graphs: PTAS [Bansal Blum C 02]
  Weighted graphs: 1.3048 [CGW 03]; 1.3044 [Swamy 04]; lower bound 1.0087 [CGW 03]

35 Typical characteristics
- No well-defined "similarity metric"; there may be inconsistencies in beliefs.
- The number of clusters is unknown.
- No predefined topics; it is desirable to figure them out as part of the algorithm.
- Fuzzy boundaries: how to cluster may depend on the given set of objects.

