1 A Fast PTAS for k-Means Clustering
Dan Feldman, Tel Aviv University
Morteza Monemizadeh, Christian Sohler, Universität Paderborn
Christian Sohler, HEINZ NIXDORF INSTITUT, Universität Paderborn, Algorithmen und Komplexität

2 Simple coreset for clustering problems: Overview
- Introduction
- Weak Coresets
  - Definition
  - Intuition
  - The construction
  - A sketch of analysis
- The k-means PTAS
- Conclusions

3 Introduction: Clustering
Clustering: partition the input into sets (clusters) such that
- objects in the same cluster are similar
- objects in different clusters are dissimilar
Goal:
- simplification
- discovery of patterns
Procedure: map the objects to Euclidean space, which gives a point set P
- points in the same cluster are close
- points in different clusters are far away from each other

4 Introduction: k-means clustering
Clustering with prototypes: one prototype (center) for each cluster
k-means clustering:
- k clusters $C_1, \dots, C_k$
- one center $c_i$ for each cluster $C_i$
- minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$
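
The k-means objective on this slide can be written out directly. The following is a minimal sketch in plain Python; the names `sq_dist` and `kmeans_cost` and the toy data are illustrative assumptions, not part of the talk.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance d(p, c)^2."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(clusters: Sequence[Sequence[Point]], centers: Sequence[Point]) -> float:
    """Sum over clusters C_i of the sum over p in C_i of d(p, c_i)^2."""
    return sum(
        sq_dist(p, c)
        for cluster, c in zip(clusters, centers)
        for p in cluster
    )


# Toy example (hypothetical data): two clusters on a line.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
centers = [(1.0, 0.0), (10.0, 0.0)]
total = kmeans_cost(clusters, centers)
print(total)
```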

7 Introduction: Simplification / Lossy Compression
[Figure labels: (128,59,88), (218,181,163)]

14 Introduction: Properties of k-means
Optimal solution, if
- the centers are given: assign each point to the nearest center
- the clusters are given: take the centroid (mean) of each cluster
Notation: cost(P,C) denotes the cost of the solution defined this way.
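
The two optimality properties and the cost(P,C) notation can be illustrated with a short sketch. Alternating the two steps is what is commonly called Lloyd's iteration; the talk itself does not name it, and the function names and data below are assumptions made for illustration.

```python
def sq_dist(p, c):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def cost(points, centers):
    """cost(P, C): every point pays the squared distance to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def assign(points, centers):
    """Given centers, the optimal clustering assigns each point to its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[j].append(p)
    return clusters


def centroid(cluster):
    """Given a cluster, the optimal center is its centroid (coordinate-wise mean)."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))


# Hypothetical toy data: four points on a line, two deliberately poor centers.
P = [(0.0, 0.0), (2.0, 0.0), (9.0, 0.0), (11.0, 0.0)]
C = [(0.0, 0.0), (11.0, 0.0)]
before = cost(P, C)
clusters = assign(P, C)
C_new = [centroid(cl) for cl in clusters if cl]
after = cost(P, C_new)
print(before, after)
```

Moving each center to the centroid of its cluster cannot increase the cost, which is what the second property states.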

15 Weak Coresets: Centroid Sets
Definition ($\varepsilon$-approximate centroid set): A set S is called an $\varepsilon$-approximate centroid set if it contains a subset $C \subseteq S$ such that $\mathrm{cost}(P, C) \le (1+\varepsilon)\,\mathrm{cost}(P, \mathrm{Opt})$.
Lemma [KSS04]: The centroid of a random set of $2/\varepsilon$ points is, with constant probability, a $(1+\varepsilon)$-approximation of the optimal center of P.
Corollary: The set of all centroids of subsets of $2/\varepsilon$ points is an $\varepsilon$-approximate centroid set.
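
The lemma can be checked empirically for a single cluster (k = 1). The sketch below assumes uniform sampling with replacement and hypothetical random data; it estimates how often the centroid of $2/\varepsilon$ sampled points is a $(1+\varepsilon)$-approximation, and is not the proof from [KSS04].

```python
import math
import random


def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def one_mean_cost(points, c):
    """Cost of serving all points from the single center c."""
    return sum(sq_dist(p, c) for p in points)


rng = random.Random(0)  # fixed seed so the run is reproducible
P = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(500)]

eps = 0.25
m = math.ceil(2 / eps)               # sample size 2/eps from the lemma
opt = one_mean_cost(P, centroid(P))  # the centroid of P is the optimal center

trials = 200
hits = 0
for _ in range(trials):
    S = [rng.choice(P) for _ in range(m)]  # uniform sample, with replacement
    if one_mean_cost(P, centroid(S)) <= (1 + eps) * opt:
        hits += 1

success_rate = hits / trials
print(m, success_rate)
```

In such experiments the success rate is typically well above one half, in line with the "constant probability" in the lemma.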

16 Weak Coresets: Definition
Definition (weak $\varepsilon$-coreset for k-means): A pair (K,S) is called a weak $\varepsilon$-coreset for P if, for every set C of k centers from the $\varepsilon$-approximate centroid set S, we have
$(1-\varepsilon)\,\mathrm{cost}(P, C) \le \mathrm{cost}(K, C) \le (1+\varepsilon)\,\mathrm{cost}(P, C)$
[Figure: point set P (light blue)]

17 Weak Coresets: Definition
[Figure: set of solutions S (yellow)]

18 Weak Coresets: Definition
[Figure: possible coreset with weights (red)]

19 Weak Coresets: Definition
[Figure: the coreset approximates the cost of k centers (violet) from S]
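
The weak coreset condition can be tested by brute force on tiny inputs. The sketch below assumes that K is a weighted point set (given here as separate lists `K` and `w`) and that "every set C of k centers from S" means every k-element subset of S; both readings, and all names, are assumptions for illustration.

```python
from itertools import combinations


def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def cost(points, centers):
    """cost(P, C) for an unweighted point set."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def weighted_cost(points, weights, centers):
    """cost(K, C) for a weighted point set."""
    return sum(w * min(sq_dist(p, c) for c in centers) for p, w in zip(points, weights))


def is_weak_coreset(P, K, w, S, k, eps):
    """Check (1-eps) cost(P,C) <= cost(K,C) <= (1+eps) cost(P,C) for all k-subsets C of S."""
    for C in combinations(S, k):
        cp = cost(P, C)
        ck = weighted_cost(K, w, C)
        if not ((1 - eps) * cp <= ck <= (1 + eps) * cp):
            return False
    return True


# Hypothetical toy instance on a line.
P = [(0.0,), (2.0,), (10.0,), (12.0,)]
K, w = [(0.0,), (12.0,)], [2.0, 2.0]

S_small = [(1.0,), (11.0,)]
S_big = [(1.0,), (6.0,), (11.0,)]
ok_small = is_weak_coreset(P, K, w, S_small, k=2, eps=0.1)
ok_big = is_weak_coreset(P, K, w, S_big, k=2, eps=0.1)
print(ok_small, ok_big)
```

The same weighted set passes for the small candidate set and fails for the larger one, which shows why the guarantee is stated relative to S rather than for all possible centers.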

22 Weak Coresets: Ideal Sampling
Problem: given n numbers $a_1, \dots, a_n > 0$, approximate $A := \sum_i a_i$ by random sampling.
Ideal sampling: assign weights $w_1, \dots, w_n$ to the numbers, with $w_j = A / a_j$ and $\Pr[x = j] = a_j / A$. Estimator: $w_x a_x$.
Properties of the estimator:
(1) $w_x a_x = A$ (zero variance)
(2) The expected weight of number j is 1
Only problem: weights can be very large.
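
Ideal sampling is easy to try on a few numbers. This is a small sketch with made-up values for $a_1, \dots, a_n$; it draws an index with probability $a_j / A$ and confirms that the estimator always returns A, while the weight of a small number can be large.

```python
import random

a = [1.0, 2.0, 3.0, 4.0]           # hypothetical input numbers
A = sum(a)
probs = [x / A for x in a]          # Pr[x = j] = a_j / A
weights = [A / x for x in a]        # w_j = A / a_j

rng = random.Random(1)
estimates = []
for _ in range(5):
    x = rng.choices(range(len(a)), weights=probs)[0]
    estimates.append(weights[x] * a[x])   # estimator w_x * a_x

# Expected weight contributed by each number j: Pr[x = j] * w_j
expected_weight = [p * w for p, w in zip(probs, weights)]
largest_weight = max(weights)       # the smallest a_j gets the largest weight
print(estimates, expected_weight, largest_weight)
```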

23 Weak Coresets: Construction, Step 1
Compute a constant-factor approximation.

27 Weak Coresets: Construction, Step 2
Consider each cluster separately.
Main idea: apply ideal sampling to each cluster C:
$\Pr[p_i \text{ is taken}] = \mathrm{dist}(p_i, c) / \mathrm{cost}(C, c)$
$w(p_i) = \mathrm{cost}(C, c) / \mathrm{dist}(p_i, c)$
But what about high weights?

29 Weak Coresets: Construction, Step 3 (a little twist)
- Uniform sampling from a small ball, radius = average distance / $\varepsilon$
- Ideal sampling from outliers
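
The twist can be sketched for a single cluster as follows. The slides give the radius and the two sampling modes but not the sample sizes or the exact weights, so the parameters `s` and `t`, the weight `|inner| / s` for uniform samples, and the normalisation of the outlier weights are all assumptions chosen so that the expected total weight equals the cluster size. The sketch also assumes $\varepsilon \le 1$, so that the ball is never empty.

```python
import math
import random


def sample_cluster(cluster, c, eps, s, t, rng):
    """Sample weighted points from one cluster with center c.

    s: number of uniform samples from the ball (hypothetical parameter)
    t: number of samples from the outliers (hypothetical parameter)
    """
    dists = [math.dist(p, c) for p in cluster]
    radius = (sum(dists) / len(cluster)) / eps      # average distance / eps

    inner = [p for p, d in zip(cluster, dists) if d <= radius]
    outer = [(p, d) for p, d in zip(cluster, dists) if d > radius]

    # Uniform sampling inside the ball: each sample stands for |inner| / s points.
    sample = [(rng.choice(inner), len(inner) / s) for _ in range(s)]

    # Ideal sampling among the outliers: probability proportional to distance,
    # weight inversely proportional to it (assumed normalisation by t).
    if outer:
        total = sum(d for _, d in outer)
        picks = rng.choices(outer, weights=[d for _, d in outer], k=t)
        sample += [(p, (total / d) / t) for p, d in picks]
    return sample, len(inner), len(outer)


# Hypothetical cluster: 99 points near the center and one far outlier.
cluster = [(i / 100,) for i in range(99)] + [(1000.0,)]
eps = 0.5
sample, n_inner, n_outer = sample_cluster(
    cluster, (0.0,), eps, s=10, t=3, rng=random.Random(0)
)
total_weight = sum(w for _, w in sample)
print(n_inner, n_outer, total_weight)
```

The weights inside the ball are bounded by construction, which is the point of the twist: only the few outliers are handled by ideal sampling.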

32 Weak Coresets: Analysis
Fix an arbitrary set of centers K.
Case (a): the nearest center is far away.
- At least a $(1-\varepsilon)$-fraction of the points lies inside the ball, by the choice of radius
- Weight of samples from outliers is at most |C|

33 Weak Coresets: Analysis
Case (a): the nearest center is far away.
At least a $(1-\varepsilon)$-fraction of the points lies inside the ball, by the choice of radius: forget about outliers!
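
The bound on the fraction of points inside the ball follows from a Markov-type counting argument. The derivation below is a sketch that assumes the radius $r = \mathrm{avg}/\varepsilon$ from the construction, where $\mathrm{avg} = \frac{1}{|C|}\sum_{p \in C} \mathrm{dist}(p, c)$ is the average distance to the cluster center c; the symbols r and avg are introduced here for illustration.

```latex
\[
\bigl|\{\, p \in C : \mathrm{dist}(p, c) > r \,\}\bigr|
\;\le\; \frac{1}{r} \sum_{p \in C} \mathrm{dist}(p, c)
\;=\; \frac{|C| \cdot \mathrm{avg}}{\mathrm{avg}/\varepsilon}
\;=\; \varepsilon\,|C| .
\]
```

So at most an $\varepsilon$-fraction of the cluster lies outside the ball, and at least a $(1-\varepsilon)$-fraction lies inside it.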

35 Weak Coresets: Analysis
Fix an arbitrary set of centers K.
Case (a): the nearest center is far away.
It doesn't matter where the points lie inside the ball.
[Figure labels: D, D]

37 Weak Coresets: Analysis
Fix an arbitrary set of centers K.
Case (b): the nearest center is near.
Almost ideal sampling:
- expectation is cost(C,K)
- low variance

38 Weak Coresets: Result
- The centroid set S is the set of all centroids of $2/\varepsilon$ points (with repetition) from our sample set K
- One can show that K approximates all solutions from S
- One can show that S is an $\varepsilon$-approximate centroid set w.h.p.
Theorem: One can compute a weak $\varepsilon$-coreset (K,S) in O(nkd) time. The size of K is $\mathrm{poly}(k, 1/\varepsilon)$. S is the set of all centroids of subsets of K of size $2/\varepsilon$.

39 Weak Coresets: Applications
Fast-k-Means-PTAS(P,k)
1. Compute a weak coreset K
2. Project K onto a $\mathrm{poly}(1/\varepsilon, k)$-dimensional space
3. Exhaustively search for the best solution C from (the projection of) the centroid set
4. Return the centroids of the points that create C
Running time: $O\bigl(nkd + (k/\varepsilon)^{\tilde{O}(k/\varepsilon)}\bigr)$
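
The exhaustive search in step 3 can be sketched on a tiny weighted set. This sketch takes the coreset K as given (step 1), skips the projection (step 2), and therefore returns the chosen centroids directly (step 4); the helper names, the parameter `m` standing in for $2/\varepsilon$, and the data are assumptions, and the enumeration is only feasible for very small inputs.

```python
from itertools import combinations, combinations_with_replacement


def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def weighted_cost(points, weights, centers):
    return sum(w * min(sq_dist(p, c) for c in centers) for p, w in zip(points, weights))


def centroid_set(K, m):
    """All centroids of m points from K, chosen with repetition."""
    return sorted({centroid(sub) for sub in combinations_with_replacement(K, m)})


def best_centers(K, weights, k, m):
    """Try every k-subset of the centroid set and keep the cheapest one on K."""
    S = centroid_set(K, m)
    return min(combinations(S, k), key=lambda C: weighted_cost(K, weights, C))


# Hypothetical weighted coreset on a line, with k = 2 and m = 2.
K = [(0.0,), (2.0,), (10.0,), (12.0,)]
weights = [1.0, 1.0, 1.0, 1.0]
best = best_centers(K, weights, k=2, m=2)
print(best, weighted_cost(K, weights, best))
```

Because K has size independent of n, the number of candidate solutions depends only on k and $\varepsilon$, which is where the second term of the running time comes from.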

40 Summary
Weak coresets:
- independent of n and d
- fast PTAS for k-means
- first PTAS for kernel k-means (if the kernel maps into a finite-dimensional space)

41 Thank you!
Christian Sohler, Heinz Nixdorf Institut & Institut für Informatik, Universität Paderborn, Fürstenallee, Paderborn, Germany
Tel.: +49 (0) 52 51/
Fax: +49 (0) 52 51/

