Published by Cynthia Chase. Modified over 8 years ago.
1
Data reduction for weighted and outlier-resistant clustering
Leonard J. Schulman (Caltech), joint with Dan Feldman (MIT)
2
Talk outline
Clustering-type problems:
  k-median
  weighted k-median
  k-median with m outliers (small m)
  k-median with penalty (clustering with many outliers)
  k-line median
Unifying framework: tame loss functions
Core-sets, a.k.a. ε-approximations
Common existence proof and algorithm
5
Voronoi regions have spherical boundaries
7
k-Median with penalty
8
k-Median with penalty: good for outliers
2-median clustering of a data set:
Same data set plus an outlier:
Now cluster with h-robust loss function:
10
Related work and our results
Table columns: Problem, Approx., Time, Reference
Surviving entries: 4; Charikar et al. SODA’01; Our Result; O(1); K. Chen SODA’08; Har-Peled FSTTCS’06; F, Fiat, Sharir FOCS’06
11
What qualifies as a “tame” loss function?
Why are all these problems in the same paper? In each case the objective function is a suitably tame “loss function”.
The loss in representing a point p by a center c is:
  k-median: D(p) = dist(p,c)
  Weighted k-median: D(p) = w · dist(p,c)
  Robust k-median: D(p) = min{h, dist(p,c)}
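The three loss functions above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample points, and the threshold h = 5 are assumptions made for the example, and w is taken to be a weight attached to the center.

```python
import math

def k_median_loss(p, c):
    # Plain k-median: the loss is the distance from p to its center c.
    return math.dist(p, c)

def weighted_loss(p, c, w):
    # Weighted k-median: the distance is scaled by a weight w (assumed here
    # to belong to the center c).
    return w * math.dist(p, c)

def robust_loss(p, c, h):
    # Robust k-median: no point ever pays more than the threshold h.
    return min(h, math.dist(p, c))

def clustering_cost(points, centers, loss):
    # Each point is charged the loss to its cheapest center.
    return sum(min(loss(p, c) for c in centers) for p in points)

# Hypothetical data: two nearby points and one far outlier, one center.
points = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0)]
centers = [(0.0, 0.0)]

plain_cost = clustering_cost(points, centers, k_median_loss)
robust_cost = clustering_cost(
    points, centers, lambda p, c: robust_loss(p, c, h=5.0)
)
print(plain_cost, robust_cost)
```

With the plain loss the outlier dominates the cost; with the robust loss its contribution is capped at h, which is why the penalty version tolerates outliers.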
12
Log-Log Lipschitz (LgLgLp) condition on the loss function
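As a rough numerical illustration, and assuming the usual formulation of the condition (a nondecreasing D with D(Δ·x) ≤ Δ^r · D(x) for all x > 0 and Δ ≥ 1, for some fixed r), the sketch below spot-checks the inequality on a grid. The name `looks_log_log_lipschitz`, the grid, and the threshold h = 5 are illustrative choices. Passing a grid check is evidence, not a proof.

```python
def looks_log_log_lipschitz(D, r, xs, deltas, rel_tol=1e-9):
    # Spot-check D(delta * x) <= delta**r * D(x) on a finite grid.
    # The inequality itself is an ASSUMED formulation of the LgLgLp condition.
    for x in xs:
        for delta in deltas:
            if D(delta * x) > (delta ** r) * D(x) * (1.0 + rel_tol):
                return False
    return True

xs = [0.01 * i for i in range(1, 2001)]          # x in (0, 20]
deltas = [1.0 + 0.25 * j for j in range(0, 41)]  # delta in [1, 11]

distance = lambda x: x            # k-median loss as a function of distance
robust = lambda x: min(5.0, x)    # h-robust loss with h = 5
squared = lambda x: x * x         # k-means-style loss

print(looks_log_log_lipschitz(distance, 1, xs, deltas))
print(looks_log_log_lipschitz(robust, 1, xs, deltas))
print(looks_log_log_lipschitz(squared, 1, xs, deltas))
print(looks_log_log_lipschitz(squared, 2, xs, deltas))
```

Under this formulation the distance and the capped distance pass with r = 1, while the squared distance needs r = 2.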
13
Many examples of LgLgLp loss functions: Robust M-estimators in Statistics
figure: Z. Zhang
14
Classic Data Reduction
15
Same notion for LgLgLp loss functions
16
k-clustering core-set for loss D
17
Weighted-k-clustering core-set for loss D
Handling arbitrary-weight centers is the “hard part”
18
Our main technical result
For every LgLgLp loss function D on a metric space and every set P of n points, there is a weighted-(D,k)-core-set S of size $|S| = O(\log^2 n)$.
(In more detail: $|S| = (d\,k^{O(k)}/\varepsilon^2)\,\log^2 n$ in $\mathbb{R}^d$. For finite metrics, $d = \log n$.)
S can be computed in time $O(n)$.
19
Sensitivity [Langberg and S, SODA’11]
The sensitivity of a point $p \in P$ determines how important it is to include p in a core-set:
$s(p) = \max_C \dfrac{D_W(p,C)}{\sum_{q \in P} D_W(q,C)}$
Why this works:
If s(p) is small, then p has many “surrogates” in the data, and we can take any one of them for the core-set.
If s(p) is large, then there is some C for which p alone contributes a significant fraction of the loss, so we need to include p in any core-set.
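To make the definition concrete, here is a brute-force sketch that evaluates the ratio "loss of p divided by total loss" over a finite family of candidate queries and keeps the maximum. Two simplifications are assumptions of this example: the unweighted distance loss stands in for D_W, and the queries are restricted to single centers placed at input points, so the result is only a lower bound on the true sensitivity (which maximizes over all queries C).

```python
import math
from itertools import combinations

def loss_to_query(p, centers):
    # Loss of p under query C: distance to the nearest center in C.
    return min(math.dist(p, c) for c in centers)

def sensitivities(points, queries):
    # s[i] = max over the given queries of loss(p_i, C) / sum_q loss(q, C).
    s = [0.0] * len(points)
    for C in queries:
        losses = [loss_to_query(p, C) for p in points]
        total = sum(losses)
        if total == 0:
            continue  # every point sits on a center; the ratio is undefined
        for i, loss in enumerate(losses):
            s[i] = max(s[i], loss / total)
    return s

# Hypothetical 1-D data: three clustered points and one outlier.
points = [(0.0,), (1.0,), (2.0,), (100.0,)]
# Candidate queries: k = 1, with the center at each input point in turn.
queries = [list(C) for C in combinations(points, 1)]

s = sensitivities(points, queries)
print(s)
```

The outlier receives by far the largest value, matching the intuition above: it has no surrogate among the other points, so a core-set must contain it.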
20
Total sensitivity
The total sensitivity T(P) is the sum of the sensitivities of all the points:
$T(P) = \sum_{p \in P} s(p)$
The total sensitivity of the problem is the maximum of T(P) over all input sets P.
Total sensitivity ~ n: cannot have small core-sets.
Total sensitivity constant or polylog: there may exist small core-sets.
21
Small total sensitivity ⇒ small core-set
22
Small total sensitivity ⇒ small core-set
23
The main thing we need to do in order to produce a small core-set for weighted-k-median:
For each $p \in P$, compute a good upper bound on s(p) in amortized O(1) time per point. (The upper bounds should be good enough that $\sum_{p \in P} s(p)$ is small.)
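One standard way to turn sensitivity upper bounds into a small weighted sample is importance sampling: draw points with probability proportional to their bounds and reweight each draw by the inverse of its probability, so that the weighted loss is an unbiased estimate of the full loss for every query. The sketch below assumes this is the construction in play. The function names, the 1-D data, and the bound values are hypothetical.

```python
import random

def sample_coreset(points, bounds, m, rng):
    # Draw m points i.i.d. with probability proportional to the sensitivity
    # upper bounds; weight each draw by 1 / (m * probability).
    total = sum(bounds)
    probs = [b / total for b in bounds]
    picks = rng.choices(range(len(points)), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in picks]

def weighted_loss_1d(coreset, center):
    # Weighted 1-median loss of the sample for a single center on the line.
    return sum(w * abs(p - center) for p, w in coreset)

# Hypothetical data: 200 clustered points plus one outlier at 1000.
points = [float(i % 10) for i in range(200)] + [1000.0]
# Hypothetical bounds: small for clustered points, large for the outlier.
bounds = [1.0 / 200] * 200 + [1.0]

rng = random.Random(0)
coreset = sample_coreset(points, bounds, m=50, rng=rng)

exact = sum(abs(p - 3.0) for p in points)
estimate = weighted_loss_1d(coreset, 3.0)
print(exact, estimate)
```

The smaller the sum of the bounds, the smaller the weights' variance for a given sample size, which is the sense in which small total sensitivity leads to a small core-set.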
24
Algorithm for computing sensitivities
Recursive-Robust-Median(P,k)
Input: A set P of n points in a metric space; an integer $k \ge 1$.
Output: A subset $Q \subseteq P$ of $\Omega(n/k^k)$ points.
We prove that any two points in Q can serve as each other’s surrogates w.r.t. any query. Hence each point $p \in Q$ has sensitivity $s(p) \le O(1/|Q|)$.
Outer loop: Call Recursive-Robust-Median(P,k), then set P := P − Q. Repeat until P is empty.
Total sensitivity bound: $T \le O(\#\text{ calls to Recursive-Robust-Median}) \le O(k^k \log n)$.
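The outer loop and the resulting bound can be sketched as follows. The inner routine here is a hypothetical stand-in, not Recursive-Robust-Median itself: it just returns a fixed fraction of the remaining points, which is enough to show why the number of calls is logarithmic in n and why each call adds O(1) to the total. The parameter `fraction` plays the role of the 1/k^k factor, and the constant `c` hides the O(·).

```python
import math

def stand_in_surrogate_set(P, fraction):
    # HYPOTHETICAL stand-in for Recursive-Robust-Median: returns the first
    # ceil(fraction * |P|) points instead of a real surrogate set.
    size = max(1, math.ceil(fraction * len(P)))
    return P[:size]

def sensitivity_bounds(P, fraction, c=1.0):
    # Outer loop: peel off a set Q, give each of its points the bound
    # c / |Q|, remove Q from P, and repeat until P is empty.
    remaining = list(P)
    bounds = {}
    calls = 0
    while remaining:
        Q = stand_in_surrogate_set(remaining, fraction)
        for p in Q:
            bounds[p] = c / len(Q)
        chosen = set(Q)
        remaining = [p for p in remaining if p not in chosen]
        calls += 1
    return bounds, calls

# Points are represented by hashable ids 0..1023 for this illustration.
bounds, calls = sensitivity_bounds(range(1024), fraction=0.5)
total = sum(bounds.values())
print(calls, total)
```

Each call contributes |Q| · c/|Q| = c to the sum, so the total of the bounds equals c times the number of calls; with a constant fraction removed per call, that number grows like log n.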
25
The algorithm to find the $\Omega(n)$-size set Q:
26
Recursive-Robust-Median: illustration
27
Recursive-Robust-Median: illustration
28
A detail
Actually it’s more complicated than described, because we can’t afford to look for a $(1+\varepsilon)$-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant).
Instead, look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points.
Linear-time algorithm from [F, Langberg STOC’11].
29
High-level intuition for the correctness of Recursive-Robust-Median
Consider any p in the “output” set Q.
If for all queries C, D(p,C) is small, then p has low sensitivity.
If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center $c \in C$ and are closer to each other than to c; so they are surrogates.
30
Thank you
31
appendices
32
Many examples of LgLgLp loss functions: Robust M-estimators in Statistics
…