# A Sublinear Time Approximation Scheme for Clustering in Metric Spaces

Author: Piotr Indyk. IEEE FOCS 1999.


Outline:

- Introduction
- Preliminaries
- The Algorithms
- The analysis of BC
- The analysis of UC
- Sublinear time algorithm

Introduction: the $k$-clustering problem.

- Input: a weighted graph $G = (X, d)$ on $N$ vertices.
- Output: a partition of $X$ into $k$ sets $S_1, \dots, S_k$ such that the value of $\sum_{i=1}^{k} \sum_{\{u,v\} \subseteq S_i} d(u,v)$ is minimized.
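To make the objective concrete, here is a minimal Python sketch that evaluates the cost of a clustering and finds the best 2-clustering by exhaustive search. It assumes the objective is the sum of pairwise distances inside each cluster; the function names and the toy points on a line are illustrative only, and the exponential search is a baseline for tiny inputs, not the algorithm of the paper.

```python
from itertools import combinations


def cluster_cost(points, dist):
    """Sum of dist(u, v) over unordered pairs {u, v} inside one cluster."""
    return sum(dist(u, v) for u, v in combinations(points, 2))


def clustering_cost(clusters, dist):
    """Total cost of a clustering: the sum of the per-cluster costs."""
    return sum(cluster_cost(c, dist) for c in clusters)


def brute_force_2_clustering(points, dist):
    """Try every bipartition (exponential time; only for tiny inputs).

    Assumes the points are distinct. The first point is fixed in s1 so
    that each bipartition is enumerated once.
    """
    points = list(points)
    first, rest = points[0], points[1:]
    best = None
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            s1 = [first, *subset]
            s2 = [p for p in rest if p not in subset]
            cost = clustering_cost([s1, s2], dist)
            if best is None or cost < best[0]:
                best = (cost, s1, s2)
    return best


# Hypothetical toy input: five points on a line with d(a, b) = |a - b|.
line = [0.0, 1.0, 2.0, 10.0, 11.0]
cost, s1, s2 = brute_force_2_clustering(line, lambda a, b: abs(a - b))
print(cost, s1, s2)
```

On this input the two well-separated groups are recovered, which is the kind of partition the approximation scheme aims to find without enumerating all bipartitions.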

Introduction: This problem is NP-complete (for $k \ge 2$) and cannot be approximated to within any constant factor. A standard way to reduce the complexity of clustering problems is to assume that the weight function $d$ is a metric.

Introduction: known facts.

- Guttman-Beck showed a 2-approximation algorithm.
- de la Vega and Kenyon gave a PTAS for metric MAX-CUT.
- Unfortunately, this does not imply a PTAS for the 2-clustering problem.

Introduction: The result of this paper focuses on the case where the de la Vega-Kenyon PTAS does not work. If done correctly, this procedure yields a $(1+\varepsilon)$-approximate solution.

Preliminaries: Let $(X, d)$ be a metric space. For any two sets $A, B \subseteq X$, we define $d(A,B) = \sum_{a \in A} \sum_{b \in B} d(a,b)$. We use $d(A)$ to denote $d(A,A)/2$. We also write $d(u,B)$ or $d(B,u)$ for $d(\{u\},B)$.

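Preliminaries: We define the average distance $\bar{d}(A,B) = \dfrac{d(A,B)}{|A|\,|B|}$, also written đ. Observe that $\bar{d}$ satisfies the triangle inequality, which is the metric property used in the analysis.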

Preliminaries: For sets $A, B$ of equal cardinality, we define $d_M(A,B)$ and $\bar{d}_M(A,B)$. Notice that both $d_M$ and $\bar{d}_M$ satisfy metric properties.

Preliminaries: For any $\alpha \in [0,1]$, $u \in X$ and $A \subseteq X$, we define $d_\alpha(u,A)$ to be the sum of the $\alpha|A|$ smallest values of $\{d(u,a) \mid a \in A\}$.

The algorithms given in this paper are randomized and produce $(1+\varepsilon)$-approximate solutions with high probability.
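The notation above can be sketched in Python as follows. This is an illustration under stated assumptions: `d_sets` takes $d(A,B)$ to be the sum of distances over all pairs $(a,b)$, `d_avg` takes the average distance to be $d(A,B)/(|A||B|)$, and `d_alpha` rounds $\alpha|A|$ to the nearest integer, a detail the slides do not specify. All function names are hypothetical.

```python
def d_sets(A, B, dist):
    """d(A, B): sum of dist(a, b) over all a in A and b in B."""
    return sum(dist(a, b) for a in A for b in B)


def d_within(A, dist):
    """d(A) = d(A, A) / 2: each unordered pair inside A is counted once."""
    return d_sets(A, A, dist) / 2


def d_avg(A, B, dist):
    """Average distance between A and B (assumed normalization |A| * |B|)."""
    return d_sets(A, B, dist) / (len(A) * len(B))


def d_alpha(u, A, alpha, dist):
    """Sum of the alpha * |A| smallest distances from u to points of A.

    Assumption: alpha * |A| is rounded to the nearest integer.
    """
    count = int(round(alpha * len(A)))
    return sum(sorted(dist(u, a) for a in A)[:count])


# Hypothetical toy data on a line.
dist = lambda a, b: abs(a - b)
A, B = [0, 1, 3], [10]
print(d_sets(A, B, dist))                    # 10 + 9 + 7
print(d_within(A, dist))                     # 1 + 3 + 2
print(d_alpha(0, [1, 2, 3, 100], 0.75, dist))  # the outlier 100 is ignored
```

The trimmed sum `d_alpha` is what makes the later comparisons robust: a small fraction of far-away sample points cannot dominate the estimate.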

The Algorithms: The result is obtained by running three algorithms in parallel: MAXCUT, BC and UC. We assume that $|S_1| = m$ and $|S_2| = n$ are given to us; otherwise we try all $N$ possible combinations.

The Algorithms: MAXCUT.

- This is the algorithm of another paper for $(1-\varepsilon)$-approximate MAX-CUT.
- The algorithm is useful if the cut/clustering ratio $d(S_1,S_2)/(d(S_1)+d(S_2))$ is smaller than a constant value $c$.
- Thus we need another algorithm for the case when the cut/clustering ratio exceeds $c$.

The Algorithms: BC.

- Balance ratio: $b(S_1,S_2) = \max\left(|S_1|/|S_2|,\ |S_2|/|S_1|\right)$.
- If the cut/clustering ratio is greater than $c$ and the balance ratio is smaller than $\rho$, we run BC.

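The Algorithms: BC.

- Uniformly choose a set $T$ of $t = O(\rho \log n)$ points.
- Guess $T_1 \subseteq T \cap S_1$ and $T_2 \subseteq T \cap S_2$ such that $|T_1| = |T_2| = \lambda = O(\log n)$.
- For each point $u \in X - T_1 - T_2$, check whether $m \cdot d_{1-\alpha}(u,T_1) < n \cdot d_{1-\alpha}(u,T_2)$.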

The Algorithms: BC (cont'd).

- If the above inequality holds, then $u$ is added to $R_1$; otherwise it is added to $R_2$.
- The pair $(R_1, R_2)$ is returned as a solution.
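The assignment step of BC can be sketched as below. This is a hedged illustration, not the paper's implementation: it assumes the test is $m \cdot d_{1-\alpha}(u,T_1) < n \cdot d_{1-\alpha}(u,T_2)$ (the comparison analyzed in the BC analysis), it takes one guessed pair $(T_1, T_2)$ as given instead of enumerating all guesses, and `bc_assign` and `d_alpha` are hypothetical names.

```python
def d_alpha(u, A, alpha, dist):
    """Sum of the alpha * |A| smallest distances from u to A (count rounded)."""
    count = int(round(alpha * len(A)))
    return sum(sorted(dist(u, a) for a in A)[:count])


def bc_assign(points, T1, T2, m, n, alpha, dist):
    """Assign every point outside T1 and T2 to R1 or R2.

    m and n are the assumed sizes |S1| and |S2|. The scaled trimmed sums
    m * d(u, T1) and n * d(u, T2) act as estimates of d(u, S1) and d(u, S2).
    """
    R1, R2 = [], []
    for u in points:
        if u in T1 or u in T2:
            continue  # sample points are not re-assigned in this sketch
        to_s1 = m * d_alpha(u, T1, 1 - alpha, dist)
        to_s2 = n * d_alpha(u, T2, 1 - alpha, dist)
        (R1 if to_s1 < to_s2 else R2).append(u)
    return R1, R2


# Hypothetical toy input: two groups of five points on a line.
points = list(range(5)) + list(range(100, 105))
dist = lambda a, b: abs(a - b)
R1, R2 = bc_assign(points, T1=[0, 1], T2=[100, 101], m=5, n=5,
                   alpha=0.5, dist=dist)
print(R1, R2)
```

In the full algorithm this step would be repeated for every guessed pair of subsets of the sample, keeping the cheapest resulting clustering.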

[Figure: BC. Labels: $X$, $S_1$, $S_2$, the sample $T$, and its subsets $T_1$, $T_2$.]

The Algorithms: UC.

- Use this algorithm when $b(S_1,S_2) > \rho$; assume $|S_1| > \rho|S_2|$.
- Obtain a set $T$ of $\lambda$ random points from $S_1$.
- Sort all points $u \in X - T$ by $d_{1-\alpha}(u,T)$ (in ascending order).
- The first $|S_1| - \lambda$ points from the list are added to $R_1$; the remaining points are added to $R_2$.
- Output $(R_1, R_2)$.
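The UC steps can be sketched as follows. The sketch assumes the sort key is the trimmed sum $d_{1-\alpha}(u,T)$ and that the sample $T$ is already drawn from the large cluster; the names `uc` and `d_alpha` and the toy data are illustrative only.

```python
def d_alpha(u, A, alpha, dist):
    """Sum of the alpha * |A| smallest distances from u to A (count rounded)."""
    count = int(round(alpha * len(A)))
    return sum(sorted(dist(u, a) for a in A)[:count])


def uc(points, size_s1, T, alpha, dist):
    """Unbalanced case: sort by trimmed distance to the sample T.

    size_s1 is the assumed size of the large cluster S1, and T is assumed
    to be a sample drawn from S1. The |S1| - |T| closest points go to R1.
    """
    rest = [u for u in points if u not in T]
    rest.sort(key=lambda u: d_alpha(u, T, 1 - alpha, dist))
    cut = size_s1 - len(T)
    return rest[:cut], rest[cut:]


# Hypothetical toy input: a large cluster 0..9 and one far-away point.
points = list(range(10)) + [100]
dist = lambda a, b: abs(a - b)
R1, R2 = uc(points, size_s1=10, T=[0, 5, 9], alpha=1 / 3, dist=dist)
print(sorted(R1), R2)
```

Because the large cluster dominates a uniform sample when the partition is very unbalanced, sampling from $X$ is likely to hit mostly $S_1$, which is why no guessing step appears here.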

[Figure: UC. Labels: $X$, $T$, $S_1$, $S_2$; the points $u$ are sorted by $d_{1-\alpha}(u,T)$.]

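The Algorithms: which algorithm yields the solution. If the cut/clustering ratio is at most $c$, the solution comes from MAXCUT. If the cut/clustering ratio is greater than $c$, the solution comes from BC when $b(S_1,S_2) < \rho$ and from UC otherwise.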

The Algorithms: Q1: Is the running time polynomial? Yes (BC, UC). Q2: Is the solution feasible? Yes.

The analysis of BC: We relate the outcome of the (randomized) comparison of $d(u,T_1)$ and $d(u,T_2)$ to a certain (deterministic) property of $d$.

Lemma 1. Consider any $u \in S_1$. If for every set $S_2' \subseteq S_2$ such that $|S_2'| \ge (1-2\alpha)|S_2|$ we have $d(u,S_2') \ge d(u,S_1)$, then with high probability $n \cdot d_{1-\alpha}(u,T_2) > m \cdot d_{1-\alpha}(u,T_1)$.

The analysis of BC (cont'd): Proof.

- Without loss of generality we can consider the set $S_2'$ that contains the $(1-2\alpha)n$ elements of $S_2$ closest to $u$.
- Moreover, we can assume $d(u,S_2') = d(u,S_1) = 1$.
- Finally, we assume that the $2\alpha n$ largest distances from $u$ to elements of $S_2$ are all equal.
- For $\lambda$ large enough there is a significant gap between the expected values of $d_{1-\alpha}(u,T_1)$ and $d_{1-\alpha}(u,T_2)$:

$$\frac{E[d_{1-\alpha}(u,T_1)]}{(1-\alpha)|T_1|} \le \frac{E[d_{1-0}(u,S_1)]}{(1-0)|S_1|}, \qquad \frac{E[d_{1-\alpha}(u,T_2)]}{(1-\alpha)|T_2|} \ge \frac{E[d_{1-0}(u,S_2')]}{(1-0)|S_2'|}.$$

The analysis of BC (cont'd): With $|T_1| = |T_2| = \lambda$, $|S_1| = m$ and $|S_2'| = (1-2\alpha)n$, the bounds become

$$\frac{m}{\lambda}\,E[d_{1-\alpha}(u,T_1)] \le 1-\alpha \le 1, \qquad \frac{n}{\lambda}\,E[d_{1-\alpha}(u,T_2)] \ge \frac{1-\alpha}{1-2\alpha} = 1+\frac{\alpha}{1-2\alpha} \ge 1+\frac{\alpha}{2}.$$

- We want to convert these expectation bounds into bounds holding with high probability.
- Applying standard tail inequalities, we obtain that $n \cdot d_{1-\alpha}(u,T_2) > m \cdot d_{1-\alpha}(u,T_1)$ with high probability if λ = Ω(log n/α ) 4

The analysis of BC (cont'd): We upper bound the additional cost incurred by assigning $u \in S_1$ to $R_2$; the opposite case can be handled in the same way. Note that $W(C) \le (1+\varepsilon)\,OPT \iff W(C) - OPT \le \varepsilon \cdot OPT$.

From the above Lemma, we can assume that for every $u \in S_1$ which has been included in $R_2$ (i.e. such that $n \cdot d_{1-\alpha}(u,T_2) \le m \cdot d_{1-\alpha}(u,T_1)$) there exists a set $S_2^u \subseteq S_2$ of cardinality $(1-2\alpha)|S_2|$ such that

$$d(u,S_2^u) < d(u,S_1). \qquad (1)$$

Thus we only need to bound $d(u, S_2 - S_2^u)$.

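The analysis of BC (cont'd): Let $U \subseteq S_1$ be the set of points moved to $R_2$ and $V \subseteq S_2$ the set of points moved to $R_1$, so that $S_1' = (S_1 - U) \cup V$ and $S_2' = (S_2 - V) \cup U$. We compare the new cost with the old one:

$$\begin{aligned}
d(S_1') + d(S_2') &= d(S_1-U) + d(S_1-U,V) + d(V) + d(S_2-V) + d(S_2-V,U) + d(U),\\
d(S_1) + d(S_2) &= d(S_1-U) + d(S_1-U,U) + d(U) + d(S_2-V) + d(S_2-V,V) + d(V).
\end{aligned}$$

After cancelling the common terms, it remains to compare

$$d(S_1-U,V) + d(S_2-V,U) = d(S_1,V) - d(U,V) + d(S_2,U) - d(V,U)$$

with

$$d(S_1-U,U) + d(S_2-V,V) = d(S_1,U) - d(U,U) + d(S_2,V) - d(V,V).$$

For a single $u \in U$, the first expression contributes $d(u,S_2^u) + d(u,S_2-S_2^u) - d(u,V)$ and the second contributes $d(u,S_1) - d(u,U)$. Since $d(u,S_2^u) < d(u,S_1)$,

$$d(u,S_2-S_2^u) - d(u,V) - \left(d(u,S_1) - d(u,S_2^u)\right) + d(u,U) \le d(u,S_2-S_2^u).$$

[Figure: the old partition $(S_1, S_2)$ and the new partition $(S_1', S_2')$, with the blocks $S_1-U$, $U$, $V$ and $S_2-V$.]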
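The cancellation step can be checked numerically. The sketch below assumes $S_1' = (S_1 - U) \cup V$ and $S_2' = (S_2 - V) \cup U$, with $d(A,B)$ summed over all pairs and $d(A) = d(A,A)/2$, and verifies on random points in the plane that the change in cost equals $d(S_1-U,V) + d(S_2-V,U) - d(S_1-U,U) - d(S_2-V,V)$. The data and names are illustrative.

```python
import math
import random


def d_sets(A, B, dist):
    """d(A, B): sum of dist(a, b) over all a in A and b in B."""
    return sum(dist(a, b) for a in A for b in B)


def d_within(A, dist):
    """d(A) = d(A, A) / 2."""
    return d_sets(A, A, dist) / 2


random.seed(0)
pts = [(random.random(), random.random()) for _ in range(12)]
dist = math.dist  # Euclidean distance between two coordinate tuples

S1, S2 = pts[:6], pts[6:]
U, V = S1[:2], S2[:3]            # points that switch sides
S1_rest, S2_rest = S1[2:], S2[3:]
S1_new, S2_new = S1_rest + V, S2_rest + U

# Change in clustering cost computed directly ...
lhs = (d_within(S1_new, dist) + d_within(S2_new, dist)) \
    - (d_within(S1, dist) + d_within(S2, dist))
# ... and via the cross terms that survive the cancellation.
rhs = d_sets(S1_rest, V, dist) + d_sets(S2_rest, U, dist) \
    - d_sets(S1_rest, U, dist) - d_sets(S2_rest, V, dist)

print(abs(lhs - rhs) < 1e-9)
```

Only the cross terms involving the moved points matter, which is why the analysis can proceed point by point over $U$ and $V$.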

The analysis of BC (cont'd): We are only interested in $u$'s such that

$$d(u,S_2) \ge (1+\varepsilon)\,d(u,S_1), \qquad (2)$$

as otherwise the difference in cost is easily bounded (i.e. $d(u,S_2) - d(u,S_1) < \varepsilon \cdot d(u,S_1)$). From (1) and (2) we obtain

$$d(u,S_2) - d(u,S_2^u) \ge \varepsilon \cdot d(u,S_1) \ge \varepsilon \cdot d(u,S_2^u),$$

which, since $|S_2 - S_2^u| = 2\alpha n$ and $|S_2^u| = (1-2\alpha)n$, can be rewritten as

$$\bar{d}(u,S_2-S_2^u) \ge \varepsilon\,\frac{1-2\alpha}{2\alpha}\,\bar{d}(u,S_2^u).$$

The analysis of BC (cont'd): By the triangle inequality we have

$$\bar{d}(S_2^u, S_2-S_2^u) \ge \bar{d}(u,S_2-S_2^u) - \bar{d}(u,S_2^u) \ge \left(1 - \frac{2\alpha}{\varepsilon(1-2\alpha)}\right)\bar{d}(u,S_2-S_2^u). \qquad (3)$$

The above gives a bound for $\bar{d}(S_2^u, S_2-S_2^u)$. In the following we show that the number of such $u$'s is also not very large if (as we assume) $d(S_1,S_2) \ge c\,(d(S_1)+d(S_2))$. Firstly, observe that

$$\bar{d}(S_1,S_2) \le \bar{d}(u,S_1) + \bar{d}(u,S_2).$$

The analysis of BC (cont'd): Since

$$\bar{d}(u,S_2) = \frac{d(u,S_2)}{n} = \frac{d(u,S_2^u)}{n} + \frac{d(u,S_2-S_2^u)}{n} = (1-2\alpha)\,\bar{d}(u,S_2^u) + 2\alpha\,\bar{d}(u,S_2-S_2^u),$$

we get

$$\begin{aligned}
\bar{d}(S_1,S_2) &\le \bar{d}(u,S_1) + (1-2\alpha)\,\bar{d}(u,S_2^u) + 2\alpha\,\bar{d}(u,S_2-S_2^u)\\
&\le \bar{d}(u,S_1) + (1-2\alpha)\,\bar{d}(u,S_2^u) + 2\alpha\left(\bar{d}(u,S_2^u) + \bar{d}(S_2^u,S_2-S_2^u)\right)\\
&= \bar{d}(u,S_1) + \bar{d}(u,S_2^u) + 2\alpha\,\bar{d}(S_2^u,S_2-S_2^u).
\end{aligned}$$

We can rewrite it as:

The analysis of BC (cont'd):

$$\frac{d(S_1,S_2)}{|S_1|\,|S_2|} \le \frac{d(u,S_1)}{|S_1|} + \frac{d(u,S_2^u)}{(1-2\alpha)|S_2|} + \frac{2\alpha\,d(S_2^u,S_2-S_2^u)}{(1-2\alpha)|S_2| \cdot 2\alpha|S_2|},$$

which implies

$$\begin{aligned}
d(S_1,S_2) &\le |S_2|\,d(u,S_1) + \frac{|S_1|\,d(u,S_2^u)}{1-2\alpha} + \frac{|S_1|\,d(S_2^u,S_2-S_2^u)}{(1-2\alpha)|S_2|}\\
&\le |S_2|\,d(u,S_1) + \frac{|S_1|\,d(u,S_2^u)}{1-2\alpha} + \frac{\rho\,d(S_2)}{1-2\alpha}.
\end{aligned}$$

Therefore

$$c\,(d(S_1)+d(S_2)) \le d(S_1,S_2) \le |S_2|\,d(u,S_1) + \frac{|S_1|\,d(u,S_2^u)}{1-2\alpha} + \frac{\rho\,d(S_2)}{1-2\alpha}.$$

The analysis of BC (cont'd): Alternatively,

$$\begin{aligned}
\left(c - \frac{\rho}{1-2\alpha}\right)(d(S_1)+d(S_2)) &\le c\,(d(S_1)+d(S_2)) - \frac{\rho\,d(S_2)}{1-2\alpha}\\
&\le |S_2|\,d(u,S_1) + \frac{|S_1|\,d(u,S_2^u)}{1-2\alpha}\\
&\le \left(|S_2| + \frac{|S_1|}{1-2\alpha}\right) d(u,S_1),
\end{aligned}$$

where the last step uses $d(u,S_2^u) \le d(u,S_1)$. The upper bound for the number of $u$ satisfying the above inequality can be obtained as follows. Assume that this number is equal to $\gamma|S_1|$.

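The analysis of BC (cont'd): We have $\sum_{u \in U} d(u,S_1) \le d(S_1)$. By plugging in the lower bound for $d(u,S_1)$ we get

$$\begin{aligned}
&\sum_{u \in U} \frac{\left(c - \frac{\rho}{1-2\alpha}\right)(d(S_1)+d(S_2))}{|S_2| + \frac{|S_1|}{1-2\alpha}} \le d(S_1)\\
\Rightarrow\ & \gamma|S_1|\,\frac{\left(c - \frac{\rho}{1-2\alpha}\right)(d(S_1)+d(S_2))}{|S_2| + \frac{|S_1|}{1-2\alpha}} \le d(S_1)\\
\Rightarrow\ & \gamma\,\frac{\left(c - \frac{\rho}{1-2\alpha}\right)(d(S_1)+d(S_2))}{\frac{1}{\rho} + \frac{1}{1-2\alpha}} \le d(S_1)\\
\Rightarrow\ & \gamma\,\frac{c - \frac{\rho}{1-2\alpha}}{\frac{1}{\rho} + \frac{1}{1-2\alpha}} \le 1.
\end{aligned}$$

Therefore

$$\gamma \le \frac{\frac{1}{\rho} + \frac{1}{1-2\alpha}}{c - \frac{\rho}{1-2\alpha}}.$$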

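The analysis of BC (cont'd): Denote the set of $u$'s as above by $U$. We can bound the total cost difference by

$$\begin{aligned}
\sum_{u \in U} d(u,S_2-S_2^u) &\le \sum_{u \in U} 2\alpha n\,\bar{d}(u,S_2-S_2^u)\\
&\le \sum_{u \in U} \frac{2\alpha n\,\bar{d}(S_2^u,S_2-S_2^u)}{1 - \frac{2\alpha}{\varepsilon(1-2\alpha)}} \qquad \text{(by (3))}\\
&\le \gamma|S_1| \cdot \max_{u \in U} \frac{2\alpha n\,\bar{d}(S_2^u,S_2-S_2^u)}{1 - \frac{2\alpha}{\varepsilon(1-2\alpha)}}\\
&= \gamma|S_1| \cdot \max_{u \in U} \frac{2\alpha n\,d(S_2^u,S_2-S_2^u)}{\left(1 - \frac{2\alpha}{\varepsilon(1-2\alpha)}\right) 2\alpha|S_2|\,(1-2\alpha)|S_2|}\\
&\le \frac{\gamma\rho\,d(S_2)}{\left(1 - \frac{2\alpha}{\varepsilon(1-2\alpha)}\right)(1-2\alpha)} = A\,d(S_2).
\end{aligned}$$

The last step uses $n = |S_2|$, $|S_1|/|S_2| \le \rho$ and $d(S_2^u,S_2-S_2^u) \le d(S_2)$.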

The analysis of BC(cont’d) The factor A becomes smaller than ε when we set c = Ω(ρ /ε) and α = O(ε), for sufficient constants 2

The analysis of UC

Lemma 2

Notes: For every $u$ included by mistake in $R_2$ there is a $v$ included by mistake in $R_1$. Let $U$ denote the set of mistaken $u$'s and let $V$ denote the set of mistaken $v$'s. We will bound the differences $d(V,S_1) - d(U,S_1)$ and $d(U,S_2) - d(V,S_2)$. It is sufficient to bound

fact1 To bound right hand side therefore

We also use the fact therefore thus

In this way we bounded the first component by setting we make the value 1/F-1 smaller than ε

To Bound the second part

Observe that by setting we make B smaller than ε

Sublinear time algorithm We improve the running time of the above algorithm to

Sublinear time algorithm: The running time of UC is bounded by the time needed to sample $t$ points from the large cluster. We improve UC by using random sampling to perform this step. We improve the running time of the MAXCUT algorithm of [2] by using the techniques of [1].

Sublinear time algorithm: The main time bottleneck is the time needed for the exhaustive partitioning of the set $T$ in BC. We divide $T$ into $C_1$ and $C_2$ and choose $T_1$ and $T_2$ from $C_1$ and $C_2$; the lemma below shows that they are good enough for our algorithm.

Sublinear time algorithm: Lemma 3. Let $(S_1, S_2)$ and $(S_1', S_2')$ be two partitions of the metric space over $S$. There exists a constant $B$ such that for any $A$ and any $\beta < 1/B$, if $d(S_1)+d(S_2) \le \beta A/B \cdot d(S_1,S_2)$ and $d(S_1')+d(S_2') \le A/B \cdot (d(S_1)+d(S_2))$, then $(S_1, S_2)$ and $(S_1', S_2')$ differ on at most $\beta n$ points.

Sublinear time algorithm: Select a sample $R$ of $r$ points. It can be split into $R_1$ and $R_2$ such that the ratios $d(R_1,R_2)/(d(R_1)+d(R_2))$ and $d(S_1,S_2)/(d(S_1)+d(S_2))$ are comparable. Find $T_1' \subseteq R_1$ and $T_2' \subseteq R_2$ such that $|T_1'| = |T_2'| = t$ and $|T_1' - S_1| = |T_2' - S_2| \le \beta t$. It turns out that $T_1'$ and $T_2'$ are almost as good as the $T_1$ and $T_2$ obtained by exhaustive search.

Sublinear time algorithm: For any two sets $A'$ and $A$ of equal size, if $\max(|A' - A|, |A - A'|) \le \beta|A|$ (set differences), then for any $\alpha$ and $u$,

$$d_{1-\alpha-\beta}(u,A) \le d_{1-\alpha}(u,A') \le d_{1-\alpha+\beta}(u,A).$$

So in Lemma 1 we can replace the test $n \cdot d_{1-\alpha}(u,T_2) \le m \cdot d_{1-\alpha}(u,T_1)$ by $n \cdot d_{1-3\alpha/2}(u,T_2) \le m \cdot d_{1-\alpha/2}(u,T_1)$.
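The sandwich inequality can be checked empirically. The sketch below assumes the condition is on set differences (at most $\beta|A|$ elements of $A$ replaced to form $A'$) and works with integer counts directly, so that $(1-\alpha)|A| = 16$ and $\beta|A| = 2$ for $|A| = 20$, i.e. $\alpha = 0.2$ and $\beta = 0.1$. The helper `d_count` and the random data are illustrative.

```python
import random


def d_count(u, A, count, dist):
    """Sum of the `count` smallest distances from u to points of A."""
    return sum(sorted(dist(u, a) for a in A)[:count])


random.seed(1)
dist = lambda a, b: abs(a - b)
k, keep, swap = 20, 16, 2   # |A|, (1 - alpha)|A|, beta|A|

ok = True
for _ in range(200):
    A = [random.uniform(0, 100) for _ in range(k)]
    # A' keeps all but `swap` elements of A and adds `swap` fresh points.
    A_prime = A[swap:] + [random.uniform(0, 100) for _ in range(swap)]
    u = random.uniform(0, 100)
    low = d_count(u, A, keep - swap, dist)    # d_{1-alpha-beta}(u, A)
    mid = d_count(u, A_prime, keep, dist)     # d_{1-alpha}(u, A')
    high = d_count(u, A, keep + swap, dist)   # d_{1-alpha+beta}(u, A)
    ok = ok and low <= mid + 1e-9 and mid <= high + 1e-9

print(ok)
```

The check reflects why slightly imperfect sample sets are tolerable: replacing a $\beta$ fraction of the sample shifts the trimmed sum by no more than changing the trimming level by $\beta$.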
