By: Moses Charikar, Chandra Chekuri, Tomas Feder, and Rajeev Motwani. Presented by: Sarah Hegab.


Slide 1: By: Moses Charikar, Chandra Chekuri, Tomas Feder, and Rajeev Motwani. Presented by: Sarah Hegab.

Slide 2: Outline
- Motivation
- Main Problem
- Hierarchical Agglomerative Clustering
- A Model
- Incremental Clustering
- Different incremental algorithms
- Lower bounds for incremental algorithms
- Dual Problem

Slide 3: I. Main Problem
The clustering problem is as follows: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter.
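To make the objective concrete, here is a small sketch (hypothetical points on the line, with k = 2, not from the paper) that evaluates the cost of a given partition:

```python
def diameter(cluster, dist):
    """Largest pairwise distance within one cluster."""
    return max((dist(p, q) for p in cluster for q in cluster), default=0.0)

def clustering_cost(clusters, dist):
    """The objective: the maximum cluster diameter over the partition."""
    return max(diameter(c, dist) for c in clusters)

# Hypothetical example on the line with k = 2.
d = lambda a, b: abs(a - b)
print(clustering_cost([[0, 1, 2], [10, 11]], d))  # 2
```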

Slide 4: 1. Greedy Incremental Clustering
a) Center-Greedy
b) Diameter-Greedy

Slide 5: a) Center-Greedy
The center-greedy algorithm associates a center with each cluster and merges the two clusters whose centers are closest. The center of the old cluster with the larger radius becomes the center of the merged cluster.
Theorem: The center-greedy algorithm's performance ratio has a lower bound of 2k - 1.
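The merge rule can be sketched as follows (a toy sketch with a hypothetical dict representation of clusters; the paper's incremental bookkeeping differs):

```python
import itertools

def center_greedy_merge(clusters, dist):
    """One merge step of center-greedy (a sketch, not the paper's full
    incremental algorithm): merge the two clusters whose centers are
    closest; the center of the cluster with the larger radius survives."""
    def radius(c):
        return max(dist(c["center"], p) for p in c["points"])
    i, j = min(itertools.combinations(range(len(clusters)), 2),
               key=lambda ij: dist(clusters[ij[0]]["center"],
                                   clusters[ij[1]]["center"]))
    a, b = clusters[i], clusters[j]
    center = a["center"] if radius(a) >= radius(b) else b["center"]
    merged = {"center": center, "points": a["points"] + b["points"]}
    return [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]

d = lambda p, q: abs(p - q)
cs = center_greedy_merge([{"center": 0, "points": [0]},
                          {"center": 1, "points": [1]},
                          {"center": 10, "points": [10]}], d)
# merges the two clusters centered at 0 and 1
```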

Slide 6: a) Center-Greedy cont.
Proof, step 1: tree construction for k = 2. (Figure: a weighted tree on vertices v_1, ..., v_5 with subtrees S_0, ..., S_3 and edge weights 0 and 1.)

Slide 7: a) Center-Greedy cont.
Proof, step 2: from tree to graph. Set A_i (in our example A_i = {{v_1}, {v_2}, {v_3}, {v_4}}). (Figure: the graph on v_1, ..., v_5 with edge weights 1 and 1 - ε.)

Slide 8: Post-order traversal. (Figure.)

Slide 9: a) Center-Greedy cont.
Claims:
- For 1 <= i <= 2k - 1, A_i is the set of clusters of center-greedy that contain more than one vertex after the k + i vertices v_1, ..., v_{k+i} are given.
- There is a k-clustering of G of diameter 1.
- The clustering achieving this diameter is {S_0 ∪ S_1, ..., S_{2k-2} ∪ S_{2k-1}}.

Slide 10: (Figure: the construction for k = 4.)

Slide 11: Competitiveness of Center-Greedy
Theorem: The center-greedy algorithm has performance ratio 2k - 1 in any metric space.

Slide 12: b) Diameter-Greedy
The diameter-greedy algorithm always merges the two clusters that minimize the diameter of the resulting merged cluster.
Theorem: The diameter-greedy algorithm's performance ratio is Ω(log k), even on the line.
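For contrast with center-greedy, the diameter-greedy merge rule can be sketched like this (hypothetical example points; plain lists stand in for clusters):

```python
import itertools

def diameter_greedy_merge(clusters, dist):
    """One merge step of diameter-greedy (sketch): merge the pair of
    clusters whose union has the smallest diameter."""
    def diam(points):
        return max((dist(p, q) for p, q in itertools.combinations(points, 2)),
                   default=0.0)
    i, j = min(itertools.combinations(range(len(clusters)), 2),
               key=lambda ij: diam(clusters[ij[0]] + clusters[ij[1]]))
    merged = clusters[i] + clusters[j]
    return [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]

d = lambda p, q: abs(p - q)
cs = diameter_greedy_merge([[0], [1], [5], [6.5]], d)
# merges [0] and [1]: their union has diameter 1, the smallest of any pair
```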

Slide 13: b) Diameter-Greedy cont.
Proof: 1) Assumptions
U_i = ∪_{j=1}^{F_i} {{p_ij, q_ij}, {r_ij, s_ij}},
V_i = ∪_{j=1}^{F_i} {{q_ij}, {r_ij}},
W_i = ∪_{j=1}^{F_i} {{p_ij}, {q_ij, r_ij}},
X_i = ∪_{j=1}^{F_i} {{p_ij}, {q_ij, r_ij}, {s_ij}},
Y_i = ∪_{j=1}^{F_i} {{p_ij, q_ij, r_ij}, {s_ij}},
Z_i = ∪_{j=1}^{F_i} {{p_ij, q_ij, r_ij, s_ij}}.

Slide 14: b) Diameter-Greedy cont.
Proof: 2) Invariant: When the last element of K_t is received, diameter-greedy's k + 1 clusters are (∪_{i=1}^{t-2} Z_i) ∪ Y_{t-1} ∪ X_t ∪ (∪_{i=t+1}^{r} V_i). Since there are k + 1 clusters, two of them must be merged, and the algorithm merges two clusters in V_{t+1} to form a cluster of diameter (t+1). Without loss of generality, we may assume that the clusters merged are {q_{(t+1)1}} and {r_{(t+1)1}}.

Slide 15: Competitiveness of Diameter-Greedy
Theorem: For k = 2, the diameter-greedy algorithm has performance ratio 3 in any metric space.

Slide 16: 2. Doubling Algorithm
a) Deterministic
b) Randomized
c) Oblivious
d) Randomized Oblivious

Slide 17: a) Deterministic doubling algorithm
The algorithm works in phases. At the start of phase i it has k + 1 clusters. It uses parameters α and β such that β/(β - 1) <= α. At the start of phase i the following is assumed:
1. For each cluster C_j, the radius of C_j, defined as max_{p ∈ C_j} d(c_j, p), is at most α d_i.
2. For each pair of clusters C_j and C_l, the inter-center distance d(c_j, c_l) >= d_i.
3. d_i <= OPT.

Slide 18: a) Deterministic doubling algorithm
Each phase has two stages:
1. Merging stage: the algorithm reduces the number of clusters by merging certain pairs.
2. Update stage: the algorithm accepts new points and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants.
A phase ends when the number of clusters exceeds k.

Slide 19: a) Deterministic doubling algorithm
Definition: The t-threshold graph on a set of points P = {p_1, p_2, ..., p_n} is the graph G = (P, E) such that (p_i, p_j) ∈ E if and only if d(p_i, p_j) <= t.
The merging stage defines d_{i+1} = β d_i and builds the d_{i+1}-threshold graph on the centers c_1, ..., c_{k+1}, producing new clusters C'_1, ..., C'_m. If m = k + 1, this ends phase i.
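A minimal sketch of the threshold graph and one plausible reading of the merging stage (the pick order and the exact grouping rule are my assumptions, not the paper's procedure):

```python
import itertools

def threshold_graph(centers, t, dist):
    """t-threshold graph: edge (i, j) iff d(c_i, c_j) <= t."""
    n = len(centers)
    adj = {i: set() for i in range(n)}
    for i, j in itertools.combinations(range(n), 2):
        if dist(centers[i], centers[j]) <= t:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def merge_by_threshold(centers, t, dist):
    """Sketch of the merging stage: repeatedly pick a remaining center
    and fold its threshold-graph neighbours into its cluster."""
    adj = threshold_graph(centers, t, dist)
    remaining, groups = set(range(len(centers))), []
    while remaining:
        i = min(remaining)  # arbitrary pick; the order is a free choice
        group = {i} | (adj[i] & remaining)
        groups.append(sorted(group))
        remaining -= group
    return groups

d = lambda p, q: abs(p - q)
print(merge_by_threshold([0, 1, 5, 6], 2, d))  # [[0, 1], [2, 3]]
```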

Slide 20: a) Deterministic doubling algorithm
Lemma: The pairwise distance between cluster centers after the merging stage of phase i is at least d_{i+1}.
Lemma: The radius of the clusters after the merging stage of phase i is at most d_{i+1} + α d_i <= α d_{i+1}.
The update stage continues while the number of clusters is at most k; it is restricted by the radius bound α d_{i+1}. Then phase i ends.

Slide 21: a) Deterministic doubling algorithm
Initialization: the algorithm waits until k + 1 points have arrived, then enters phase 1 with each point as the center of a cluster containing just itself, and d_1 set to the distance between the closest pair of points.

Slide 22: a) Deterministic doubling algorithm
Lemma: The k + 1 clusters at the end of the i-th phase satisfy:
1. The radius of the clusters is at most α d_{i+1}.
2. The pairwise distance between the cluster centers is at least d_{i+1}.
3. d_{i+1} <= OPT, where OPT is the diameter of the optimal clustering for the current set of points.
Theorem: The doubling algorithm has performance ratio 8 in any metric space.

Slide 23: a) Deterministic doubling algorithm
Example showing the analysis is tight (k >= 3): the input consists of k + 3 points p_1, ..., p_{k+3}; the points p_1, ..., p_{k+1} are at pairwise distance 1, and p_{k+2}, p_{k+3} are at distance 4 from the others and 8 from each other.

Slide 24: b) Randomized doubling algorithm
Choose a random value r from [1/e, 1] according to the probability density function 1/r. Let x be the minimum pairwise distance among the first k + 1 points, and set d_1 = r x, α = e, β = e/(e - 1).
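Sampling r with density 1/r on [1/e, 1] can be done by inverse-transform sampling: the density integrates to 1 on this interval, the CDF is F(r) = 1 + ln r, and inverting it gives r = e^(u-1) for u uniform on [0, 1]. A sketch:

```python
import math
import random

def sample_r(u=None):
    """Draw r from [1/e, 1] with density 1/r via inverse transform:
    the CDF is F(r) = 1 + ln r, so r = e^(u - 1) for uniform u."""
    if u is None:
        u = random.random()
    return math.exp(u - 1)

# Sanity checks: the endpoints of [0, 1] map to the endpoints of [1/e, 1].
assert abs(sample_r(0.0) - 1 / math.e) < 1e-12
assert abs(sample_r(1.0) - 1.0) < 1e-12
```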

Slide 25: b) Randomized doubling algorithm
Theorem: The randomized doubling algorithm has expected performance ratio 2e in any metric space. The same bound is also achieved for the radius measure.

Slide 26: c) Oblivious clustering algorithm
- Does not need to know k.
- Assume we have an upper bound of 1 on the maximum distance between points.
- Points are maintained in a tree.

Slide 27: c) Oblivious clustering algorithm cont.
Each vertex at depth i (i >= 0, with the root at depth 0) is:
- at distance greater than 1/2^i from every other vertex at depth i, and
- within distance 1/2^{i-1} of its parent.

Slide 28: Illustration of the oblivious clustering algorithm. (Figure.)

Slide 29: c) Oblivious clustering algorithm cont.
How do we obtain the k clusters from the tree? If k is given, let i be the greatest depth containing at most k points; those points are the k cluster centers, and the subtrees rooted at the depth-i vertices are the clusters. As points are added, the number of vertices at depth i increases; if it exceeds k, we change i to i - 1, collapsing certain clusters; otherwise, the new point is inserted into one of the existing clusters.
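A sketch of the tree-insertion rule as suggested by the invariants on slide 27 (the Node class and the descent order are my assumptions, not the paper's code):

```python
class Node:
    """Hypothetical tree vertex: a point, its depth, and its children."""
    def __init__(self, point, depth):
        self.point, self.depth, self.children = point, depth, []

def insert(root, p, dist):
    """Descend while p is within 1/2^(depth) of some existing vertex at
    that depth (too close to become its sibling); attach p one level
    below the deepest such vertex, preserving both invariants."""
    node = root
    while True:
        close = next((c for c in node.children
                      if dist(c.point, p) <= 0.5 ** c.depth), None)
        if close is None:
            node.children.append(Node(p, node.depth + 1))
            return
        node = close

d = lambda p, q: abs(p - q)
root = Node(0.0, 0)  # all points assumed within distance 1 of each other
for p in (0.4, 0.45, 0.95):
    insert(root, p, d)
# root now has two depth-1 children (0.4 and 0.95); 0.45 sits under 0.4
```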

Slide 30: c) Oblivious clustering algorithm cont.
Theorem: The algorithm that outputs the k clusters obtained from the tree construction has performance ratio 8 for both the diameter measure and the radius measure.
Proof sketch: suppose the optimal diameter d satisfies 1/2^{i+1} < d <= 1/2^i. Then points at depth i are in different optimal clusters, so there are at most k of them. Let j >= i be the greatest depth containing at most k points. Every point in a subtree rooted at depth j is within distance 1/2^j + 1/2^{j+1} + 1/2^{j+2} + ... <= 1/2^{j-1} < 4d of the root.

Slide 31: d) Randomized Oblivious
The distance threshold for depth i is r/e^i, where r is chosen once at random from [1, e] according to the PDF 1/r. The expected diameter is at most 2e · OPT diameter.

Slide 32: Lower Bounds
Theorem 1: For k >= 2, there are lower bounds of 2 and 2 - (1/2)^{k/2} on the performance ratio of deterministic and randomized algorithms, respectively, for incremental clustering on the line.

Slide 33: Lower Bounds cont.
Theorem 2: There is a lower bound of 1 + √2 on the performance ratio of any deterministic incremental clustering algorithm for arbitrary metric spaces.

Slide 34: Lower Bounds cont. (Figure.)

Slide 35: Lower Bounds cont.
Theorem 3: For any ε > 0 and k >= 2, there is a lower bound of 2 - ε on the performance ratio of any randomized incremental algorithm.

Slide 36: Lower Bounds cont.
Theorem 4: For the radius measure, no deterministic incremental clustering algorithm has a performance ratio better than 3, and no randomized algorithm has a ratio better than 3 - ε for any fixed ε > 0.

Slide 37: II. Dual Problem
For a sequence of points p_1, p_2, ..., p_n ∈ R^d, cover each point with a unit ball in R^d as it arrives, so as to minimize the total number of balls used.
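As a baseline for intuition (not the paper's algorithm), the most naive incremental covering opens a new unit ball whenever an arriving point lies in no existing ball:

```python
import math

def incremental_cover(points, radius=1.0):
    """Naive incremental covering (a baseline sketch, not the paper's
    O(2^d d log d)-ratio algorithm): when an arriving point lies in no
    existing ball, open a new unit ball centered at that point."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers

pts = [(0, 0), (0.5, 0), (3, 0), (3.2, 0.1)]
print(len(incremental_cover(pts)))  # 2
```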

Slide 38: II. Dual Problem
Rogers' theorem: R^d can be covered by any convex shape with covering density O(d log d).
Theorem: For the dual clustering problem in R^d, there is an incremental algorithm with performance ratio O(2^d d log d).
Theorem: For the dual clustering problem in R^d, any incremental algorithm must have performance ratio Ω((log d)/(log log log d)).
