1
**A Fast PTAS for k-Means Clustering**

Dan Feldman (Tel Aviv University), Morteza Monemizadeh, Christian Sohler (Universität Paderborn)

2
**Simple coreset for clustering problems Overview**

Introduction
Weak Coresets
- Definition
- Intuition
- The construction
- A sketch of analysis
The k-means PTAS
Conclusions

3
**Introduction Clustering**

Partition the input into sets (clusters) such that
- objects in the same cluster are similar
- objects in different clusters are dissimilar

Goal
- Simplification
- Discovery of patterns

Procedure
- Map objects to Euclidean space => point set P
- Points in the same cluster are close
- Points in different clusters are far away from each other

4
**Introduction k-means clustering**

Clustering with prototypes
- One prototype (center) for each cluster

k-means clustering
- k clusters $C_1, \dots, C_k$
- One center $c_i$ for each cluster $C_i$
- Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$

7
**Introduction Simplification / Lossy Compression**

[Figure: example with sample points (218,181,163) and (128,59,88)]

10
**Introduction Properties of k-means**

An optimal solution is easy to find if
- the centers are given: assign each point to the nearest center
- the clusters are given: use the centroid (mean) of each cluster as its center

14
**Introduction Properties of k-means**

Notation: cost(P,C) denotes the cost of the solution obtained by assigning each point of P to its nearest center in C.
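The two properties and the cost notation can be illustrated with a minimal sketch. The function names `cost`, `assign` and `centroid` and the toy point set are assumptions made for illustration; they do not come from the slides.

```python
from math import dist

def cost(points, centers):
    """k-means cost: sum of squared distances to the nearest center."""
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points)

def assign(points, centers):
    """Centers given: put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def centroid(cluster):
    """Clusters given: the best single center is the coordinate-wise mean."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Toy input (hypothetical): two well separated groups on a line.
P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
C = [(0.0, 0.0), (10.0, 0.0)]

clusters = assign(P, C)
new_centers = [centroid(cl) for cl in clusters]
```

On this toy input, replacing each center by the centroid of its cluster does not increase `cost(P, ...)`, which is the second property in action.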

15
**Weak Coresets Centroid Sets**

Definition ($\varepsilon$-approx. centroid set): A set $S$ is called an $\varepsilon$-approximate centroid set if it contains a subset $C \subseteq S$ s.t. $\mathrm{cost}(P,C) \le (1+\varepsilon)\,\mathrm{cost}(P,\mathrm{Opt})$.

Lemma [KSS04]: The centroid of a random set of $2/\varepsilon$ points is, with constant probability, a $(1+\varepsilon)$-approximation of the optimal center of $P$.

Corollary: The set of all centroids of subsets of $2/\varepsilon$ points is an $\varepsilon$-approx. centroid set.
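The lemma can be tried out empirically for a single cluster. The sketch below is only an illustration under assumptions: the Gaussian test data, the seed, the choice $\varepsilon = 0.5$, the number of trials, and sampling uniformly with replacement are all made up for the example and are not taken from the slides.

```python
import random
from math import ceil, dist

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def cost1(points, c):
    """Cost of serving all points with the single center c."""
    return sum(dist(p, c) ** 2 for p in points)

rng = random.Random(0)
P = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

eps = 0.5
m = ceil(2 / eps)              # sample size 2/eps from the lemma
opt = cost1(P, centroid(P))    # the centroid of P is the optimal single center

# Repeat the experiment: centroid of m points drawn uniformly with replacement.
trials = [cost1(P, centroid([rng.choice(P) for _ in range(m)]))
          for _ in range(50)]
hits = sum(t <= (1 + eps) * opt for t in trials)
```

`hits` counts how many of the 50 sampled centroids are $(1+\varepsilon)$-approximations; the lemma only promises a constant success probability per trial, so it is the fraction of hits, not any single trial, that is meaningful.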

16
**Weak Coresets Definition**

Definition (weak $\varepsilon$-coreset for k-means): A pair $(K,S)$ is called a weak $\varepsilon$-coreset for $P$ if for every set $C$ of $k$ centers from the $\varepsilon$-approx. centroid set $S$ we have

$(1-\varepsilon)\,\mathrm{cost}(P,C) \le \mathrm{cost}(K,C) \le (1+\varepsilon)\,\mathrm{cost}(P,C)$

[Figure, built up over four slides: point set P (light blue); set of solutions S (yellow); possible coreset with weights (red), here 3, 4, 5, 5, 4; the coreset approximates the cost of k centers (violet) from S]
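For small inputs the defining inequality can be checked by brute force. This is a sketch under assumptions: `weighted_cost` and `is_weak_coreset` are hypothetical helper names, the candidate set `S` is simply given as a list, and the toy data (duplicate points merged into weighted points) is invented for the example.

```python
from itertools import combinations
from math import dist

def weighted_cost(points, weights, centers):
    """Weighted k-means cost: each point pays weight * squared distance
    to its nearest center."""
    return sum(w * min(dist(p, c) ** 2 for c in centers)
               for p, w in zip(points, weights))

def is_weak_coreset(P, K, w, S, k, eps):
    """Check (1-eps) cost(P,C) <= cost(K,C) <= (1+eps) cost(P,C)
    for every set C of k centers taken from the candidate set S."""
    unit = [1.0] * len(P)
    for C in combinations(S, k):
        full = weighted_cost(P, unit, C)
        small = weighted_cost(K, w, C)
        if not ((1 - eps) * full <= small <= (1 + eps) * full):
            return False
    return True

# Toy example (hypothetical): P has repeated points, K stores each once
# with its multiplicity as weight, so the costs agree exactly.
P = [(0.0, 0.0)] * 3 + [(4.0, 0.0)] * 2
K = [(0.0, 0.0), (4.0, 0.0)]
w = [3.0, 2.0]
S = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

ok = is_weak_coreset(P, K, w, S, k=2, eps=0.1)
bad = is_weak_coreset(P, [(0.0, 0.0)], [5.0], S, k=2, eps=0.1)
```

The point of the "weak" definition is visible in the signature: the guarantee is only required for centers drawn from `S`, not for arbitrary centers.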

20
**Weak Coresets Ideal Sampling**

Problem
- Given $n$ numbers $a_1, \dots, a_n > 0$
- Task: approximate $A := \sum_i a_i$ by random sampling

Ideal sampling
- Assign weights $w_1, \dots, w_n$ to the numbers: $w_j = A / a_j$
- $\Pr[x = j] = a_j / A$
- Estimator: $w_x a_x$

Properties of the estimator
(1) $w_x a_x = A$ (zero variance)
(2) The expected weight of number $j$ is 1

Only problem: weights can be very large
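A minimal sketch of ideal sampling, assuming the numbers are held in a Python list; `ideal_sample` and the example values are hypothetical. Note that the weights use $A$ itself, which is why the scheme is "ideal" rather than directly usable.

```python
import random

def ideal_sample(a, rng):
    """Draw index j with probability a[j] / A and return (j, w_j)
    with w_j = A / a[j]."""
    A = sum(a)
    j = rng.choices(range(len(a)), weights=a, k=1)[0]
    return j, A / a[j]

a = [1.0, 2.0, 5.0, 0.5]
rng = random.Random(1)

estimates = []
for _ in range(100):
    j, w = ideal_sample(a, rng)
    estimates.append(w * a[j])   # estimator w_x * a_x

# Expected weight placed on each number j: Pr[x = j] * w_j.
expected_weights = [(aj / sum(a)) * (sum(a) / aj) for aj in a]
largest_weight = sum(a) / min(a)
```

Every estimate equals $A$ regardless of which index was drawn, while `largest_weight` shows how a small $a_j$ leads to a large weight.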

23
**Weak Coresets Construction**

Step 1 Compute constant factor approximation

24
**Weak Coresets Construction**

Step 2 Consider each cluster separately

Main idea: Apply ideal sampling to each cluster $C$ (with center $c$)
- $\Pr[p_i \text{ is taken}] = \mathrm{dist}(p_i, c) / \mathrm{cost}(C, c)$
- $w(p_i) = \mathrm{cost}(C, c) / \mathrm{dist}(p_i, c)$

But what about high weights? => A little twist

29
**Weak Coresets Construction**

Step 3 A little twist
- Uniform sampling from a small ball: radius = average distance / $\varepsilon$
- Ideal sampling from the "outliers"
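The twist can be sketched for a single cluster as follows. This is an illustration under assumptions, not the authors' exact procedure: the function name `sample_cluster`, the sample sizes `m_ball` and `m_out`, the way weights are normalised, and the test data are all invented here, since the slides do not specify them.

```python
import random
from math import dist

def sample_cluster(points, center, eps, m_ball, m_out, rng):
    """Uniform sampling inside the ball of radius (average distance / eps),
    distance-proportional ("ideal") sampling among the outliers."""
    d = [dist(p, center) for p in points]
    radius = (sum(d) / len(points)) / eps
    inside = [i for i, di in enumerate(d) if di <= radius]
    outside = [i for i, di in enumerate(d) if di > radius]

    # Uniform part: each sample stands for len(inside) / m_ball points.
    ball = [(points[i], len(inside) / m_ball)
            for i in rng.choices(inside, k=m_ball)]

    # Outlier part: pick proportional to distance, weight inversely so.
    out = []
    if outside:
        total = sum(d[i] for i in outside)
        picks = rng.choices(outside, weights=[d[i] for i in outside], k=m_out)
        out = [(points[i], total / (m_out * d[i])) for i in picks]
    return ball, out, len(inside), len(outside)

rng = random.Random(7)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
cluster += [(100.0 + i, 0.0) for i in range(5)]   # a few far outliers

ball, out, n_in, n_out = sample_cluster(
    cluster, (0.0, 0.0), eps=0.5, m_ball=10, m_out=5, rng=rng)
```

By Markov's inequality, fewer than an $\varepsilon$-fraction of the points can lie farther than (average distance / $\varepsilon$) from the center, so the ball always holds most of the cluster and the uniform weights inside it stay bounded.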

30
**Weak Coresets Analysis**

Fix an arbitrary set of centers K

Case (a): nearest center is "far away"
- At least a $(1-\varepsilon)$-fraction of the points lies inside the ball, by choice of radius
- The weight of the samples from outliers is at most $\varepsilon|C|$
- Forget about outliers!
- It doesn't matter where the points lie inside the ball

[Figure labels: $D$, $\varepsilon D$]

36
**Weak Coresets Analysis**

Fix an arbitrary set of centers K

Case (b): nearest center is "near"

Almost ideal sampling
- Expectation is cost(C,K)
- Low variance

38
**The centroid set Theorem**

Weak Coresets Result
- The centroid set $S$ is the set of all centroids of $2/\varepsilon$ points (with repetition) from our sample set $K$
- Can show that $K$ approximates all solutions from $S$
- Can show that $S$ is an $\varepsilon$-approx. centroid set w.h.p.

Theorem: One can compute in $O(nkd)$ time a weak $\varepsilon$-coreset $(K,S)$. The size of $K$ is $\mathrm{poly}(k, 1/\varepsilon)$. $S$ is the set of all centroids of subsets of $K$ of size $2/\varepsilon$.

39
**Weak Coresets Applications**

Fast-k-Means-PTAS(P,k)
1. Compute a weak coreset $K$
2. Project $K$ onto a $\mathrm{poly}(1/\varepsilon, k)$-dimensional space
3. Exhaustively search for the best solution $C$ in the (projection of the) centroid set
4. Return the centroids of the points that create $C$

Running time: $O(nkd + (k/\varepsilon)^{\tilde{O}(k/\varepsilon)})$
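The exhaustive-search step can be sketched on a tiny weighted coreset. This is a simplified illustration under assumptions: the projection step is skipped, the coreset `K` with weights `w` is given by hand, `m` plays the role of $2/\varepsilon$, and the helper names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement
from math import dist

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_set(K, m):
    """All centroids of m points (with repetition) from K."""
    return {centroid(sub) for sub in combinations_with_replacement(K, m)}

def weighted_cost(K, w, centers):
    return sum(wi * min(dist(p, c) ** 2 for c in centers)
               for p, wi in zip(K, w))

def best_centers(K, w, k, m):
    """Try every set of k candidate centers and keep the cheapest one."""
    S = sorted(centroid_set(K, m))
    return min(combinations(S, k), key=lambda C: weighted_cost(K, w, C))

# Tiny hand-made coreset (hypothetical) with unit weights.
K = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
w = [1.0, 1.0, 1.0, 1.0]

C = best_centers(K, w, k=2, m=2)
```

The number of candidate solutions grows exponentially in $k$ and $m$, which is why the search is run on the small coreset rather than on the full point set $P$.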

40
**Summary**

- Weak coresets with size independent of n and d
- Fast PTAS for k-means
- First PTAS for kernel k-means (if the kernel maps into a finite-dimensional space)

41
**Thank you! Christian Sohler Heinz Nixdorf Institut**

& Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: (0) 52 51/ Fax: (0) 52 51/
