Testing of Clustering
Noga Alon, Seannie Dar, Michal Parnas, Dana Ron

2 Testing of Clustering. Noga Alon, Seannie Dar, Michal Parnas, Dana Ron

3 Property Testing (Informal Definition) For a fixed property P and any object O, determine whether O has property P, or whether O is far from having property P (i.e., far from any other object having P). The task should be performed by querying the object (in as few places as possible).

4 Examples The object can be a graph (represented by its adjacency matrix), and the property can be 3-colorability. The object can be a function, and the property can be linearity.

5 Context Property testing can be viewed as: (1) a relaxation of exactly deciding whether the object has the property; (2) a relaxation of learning the object. In either case we want the testing algorithm to be significantly more efficient than the corresponding decision/learning algorithm.

6 When can Property Testing be Useful? The object is too large to even fully scan, so an approximate decision must be made. The object is not too large, but (1) exact decision is NP-hard (e.g., coloring), or (2) we prefer a sub-linear approximate algorithm to a polynomial exact algorithm. Testing can also be used as a preliminary step to exact decision or learning: in the first case we can quickly rule out objects that are far from the property; in the second case it can aid in efficiently selecting a good hypothesis class.

7 Previous Work Testing algebraic properties: linearity, low-degree polynomials... Testing graph properties: bipartiteness, k-colorability, connectivity, acyclicity, first-order graph properties... Testing monotonicity of functions. Testing (properties defined by) regular languages, branching programs.

8 Testing of Clustering X - a set of points in R^d, |X| = n. Notation: dist(x,y) - the distance between points x and y (e.g., Euclidean). For any subset S of X: diam(S) - the diameter of S (the maximum distance between any two points of S); rad(S) - the radius of S (the radius of the smallest ball containing S).
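The two cost measures above are easy to compute for a small sample. A minimal sketch (the names `diam` and `rad_upper` are illustrative, not from the paper; `rad_upper` restricts the center to be one of the sample points, so it only upper-bounds the true radius):

```python
import itertools
import math

def dist(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def diam(S):
    """Diameter of S: maximum pairwise distance (0 for a single point)."""
    return max((dist(x, y) for x, y in itertools.combinations(S, 2)), default=0.0)

def rad_upper(S):
    """Upper bound on the radius of S: best center chosen among the points
    of S themselves.  The true radius allows an arbitrary center, so
    rad(S) <= rad_upper(S) <= 2 * rad(S)."""
    return min(max(dist(c, x) for x in S) for c in S)
```

For the four corners of a unit square the diameter is the diagonal, sqrt(2), while the true radius (center at the middle of the square) is sqrt(2)/2; `rad_upper` returns sqrt(2) because it may only pick a corner as the center.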

9 Definition Continued (for diameter cost) X is (k,b)-clusterable if there exists a k-way partition (clustering) of X s.t. each cluster has diameter at most b. X is ε-far from being (k,b')-clusterable (b' ≥ b) if there is no k-way partition of any Y ⊆ X, |Y| ≥ (1-ε)n, s.t. each cluster has diameter at most b'. In the first case the algorithm should accept, and in the second reject, with probability ≥ 2/3.

10 Our Results For general metrics (obeying the triangle inequality), and both costs: b'=2b with |S| = O((k/ε)·log k). For b'<2b, a sample of size Ω(√(n/ε)) is necessary. For the L2 metric and radius cost: b'=b, |S| = Õ(dk/ε). For the L2 metric and diameter cost: b'=(1+β)b, with |S| polynomial in k and 1/ε but exponential in the dimension d. The dependence on 1/β and the exponential dependence on the dimension d are unavoidable. All algorithms select a uniform sample S ⊆ X where |S| = poly(k,1/ε).

11 Our Results cont. Our algorithms can be used to obtain approximately good clusterings, i.e., k-clusterings with cost at most b' on all but at most an ε-fraction of the points. Independently, Mishra, Oblinger and Pitt give algorithms with similar complexities for other costs (e.g., sum of distances to center).

12 Related Work on Clustering It is hard to approximate the cost of an optimal clustering to within a constant factor (e.g., < 2), even for the L2 metric [HS,FG]. An approximation factor of 2 can be achieved efficiently under both costs [FG].

13 Testing Diameter Clustering under the L2 metric Apply the “natural” algorithm: - Uniformly and independently select a sample from X of size poly(k,1/ε) (exponential in d). - If the sample is (k,b)-clusterable then ACCEPT, otherwise REJECT. Verifying whether a sample of size m is (k,b)-clusterable (according to the diameter cost) can be done by checking all k^m k-way partitions of the sample.
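The “natural” algorithm above can be sketched directly; this is an illustrative brute-force version (the k^m partition enumeration is only feasible for small samples, and the sample size is left as an explicit parameter rather than the talk's exact bound):

```python
import itertools
import math
import random

def dist(x, y):
    """Euclidean distance between coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_k_b_clusterable_diam(sample, k, b):
    """True iff the sample admits a k-way partition in which every
    cluster has diameter at most b (checked over all k^m assignments)."""
    m = len(sample)
    for assignment in itertools.product(range(k), repeat=m):
        clusters = [[sample[i] for i in range(m) if assignment[i] == c]
                    for c in range(k)]
        if all(dist(x, y) <= b
               for cl in clusters
               for x, y in itertools.combinations(cl, 2)):
            return True
    return False

def diam_cluster_tester(X, k, b, sample_size, rng=random):
    """Sample uniformly with replacement, ACCEPT iff the sample is
    (k,b)-clusterable under the diameter cost."""
    sample = [rng.choice(X) for _ in range(sample_size)]
    return is_k_b_clusterable_diam(sample, k, b)   # True = ACCEPT
```

Note the one-sided behavior: if X itself is (k,b)-clusterable, every sample is too, so the tester always accepts.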

14 Analysis of the k=1, d=2 case If X is (k,b)-clusterable then the algorithm always accepts. Assume from now on that X is ε-far from (k,(1+β)b)-clusterable. We will show that w.p. ≥ 2/3 the sample contains at least 2 points at distance > b (causing rejection).

15 Analysis of the k=1, d=2 case cont. View the sample as being selected in p = O(1/β²) phases; in each phase we make certain progress. Notation: For a region R, A(R) - the area of R. For x ∈ X, C_x - the disc of radius b centered at x. I_j - the intersection of the discs C_x over all sample points x at the end of phase j. Note that for every x in the sample and every y ∈ I_j, dist(x,y) ≤ b.

16 Analysis of the k=1, d=2 case cont. Say that a point y ∈ X is influential w.r.t. I_j if either y ∉ I_j or A(I_j ∩ C_y) < A(I_j) - β²b²/2. If in phase j+1 we select a point y ∉ I_j, we REJECT (y is at distance > b from some sample point). Otherwise, consider I_j ∩ C_y: if A(I_j ∩ C_y) < A(I_j) - β²b²/2, we make progress (the area shrinks). Suppose an influential point is selected in each phase. Since A(I_0) ≤ πb², after at most 2π/β² phases we get y ∉ I_j, and reject.

17 Analysis of the k=1, d=2 case cont. Claim: In each phase there are > εn influential points. This follows from a geometric lemma: Lemma: For every non-influential y (so y ∈ I_j) and every z ∈ I_j, dist(y,z) ≤ (1+β)b. Hence, if there were ≤ εn influential points, removing them would leave a set that is (1,(1+β)b)-clusterable, contradicting the assumption that X is ε-far from (1,(1+β)b)-clusterable.

18 Generalizing the Argument d > 2 (k=1): Instead of discs C_x, consider balls B_x, and let I_j be the intersection of the balls. Modify the definition of influential, and prove an analogous geometric lemma for balls. k > 1: Again view the sample as selected in phases. In each phase consider all k-way partitions of the sample. Show that w.h.p. the new sample contains an influential point w.r.t. every partition-subset of every partition. After a sufficient number of phases, every k-way partition of the sample has a cluster of diameter > b, so the sample is rejected.

19 Finding Approximately Good Partitions Assume X is (k,b)-clusterable. Then with prob. ≥ 2/3, a partition of the sample can be used to implicitly define a k-clustering of diameter at most (1+β)b on all but an ε-fraction of the points. Idea: assign each non-influential point to an appropriate cluster.

20 Lower Bound for the Diameter Cost Suppose X consists of pairs of antipodal points on a ball. One can position (1/β)^((d-1)/2) pairs so that antipodal points are at distance > (1+β)b, while any two non-antipodal points are at distance ≤ b. To obtain both points of some pair, Ω((1/β)^((d-1)/4)) examples are needed.

21 Conclusions and Further Research We described sub-linear algorithms for testing of clustering under various cost measures, which can also be used for finding approximately good clusterings. Open questions: Other natural cost measures (not covered by [MOP])? Practical applications?

22 Testing Radius Clustering under the L2 metric Here too we apply the “natural” algorithm: - Uniformly and independently select a sample from X of size Õ(dk/ε). - If the sample is (k,b)-clusterable then ACCEPT, otherwise REJECT. Verifying whether a sample of size m is (k,b)-clusterable (according to the radius cost) can be done in time polynomial in m for constant k and d.

23 Analysis of Radius Clustering Definitions: Let S be a family of subsets of R^d, R a subset of R^d, and 0 < ε < 1. A subset N of R is an ε-net of R w.r.t. S if for every S ∈ S s.t. |S ∩ R| ≥ ε|R|, there exists a point x ∈ N ∩ S. (N “hits” every S that has a non-negligible intersection with R.) A subset A of R^d is shattered by S if for every A' ⊆ A there exists S ∈ S s.t. S ∩ A = A'. The VC-dimension of S, VCD(S), is the maximum size of a subset A that is shattered by S.

24 Radius Clustering Cont. Theorem: For any family of subsets S and any subset R, with probability at least 2/3, a sample of size m ≥ (8·VCD(S)/ε)·log(VCD(S)/ε) is an ε-net for R w.r.t. S. Claim: If X is ε-far from (k,b)-clusterable by the radius cost, then the algorithm rejects w.p. ≥ 2/3. Proof: Let B_{k,b} be the family of subsets of R^d defined by unions of k balls of radius b. Let C_{k,b} be the family of complements of subsets in B_{k,b}. By the assumption on X, for every S ∈ C_{k,b}, |S ∩ X| > εn.

25 Proof Cont. Thus, a subset of X is an ε-net of X w.r.t. C_{k,b} iff it contains at least one point from every S ∈ C_{k,b}. It follows that if the sample selected is an ε-net of X, then it is not (k,b)-clusterable. Since VCD(C_{k,b}) = O(dk·log k), by the Theorem it suffices that the sample be of size Õ(dk/ε) for it to be an ε-net, and hence cause a REJECT.

26 Testing of Diameter Clustering Under General Metrics Basic idea: try to find points in X that are representatives of different clusters. Show: - If X is (k,b)-clusterable, we will find at most k representatives; - If X is ε-far from (k,2b)-clusterable, we will find k+1 representatives w.h.p.

27 General Metrics Algorithm 1. Let rep_1 be an arbitrary point in X; 2. i ← 1; find-new-rep ← TRUE; 3. While i < k+1 and find-new-rep = TRUE do: (a) uniformly and independently select a sample of size ln(3k)/ε; (b) if there exists x in the sample s.t. dist(x, rep_j) > b for every j ≤ i, then i ← i+1, rep_i ← x; else find-new-rep ← FALSE; 4. If i ≤ k then ACCEPT, otherwise REJECT.
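The four steps above translate almost line-for-line into code. A minimal sketch, assuming `dist` is an arbitrary metric supplied by the caller (only the per-phase sample size ln(3k)/ε is taken from the slide; everything else is plain bookkeeping):

```python
import math
import random

def general_metric_tester(X, k, b, eps, dist, rng=random):
    """Representative-finding tester for diameter clustering under a
    general metric.  Returns True (ACCEPT) iff at most k representatives
    are found."""
    reps = [rng.choice(X)]                          # step 1: arbitrary first rep
    sample_size = math.ceil(math.log(3 * k) / eps)  # per-phase sample size
    while len(reps) < k + 1:                        # step 3
        sample = [rng.choice(X) for _ in range(sample_size)]
        new = next((x for x in sample
                    if all(dist(x, r) > b for r in reps)), None)
        if new is None:                             # no new representative found
            break
        reps.append(new)                            # step 3(b)
    return len(reps) <= k                           # step 4: ACCEPT iff <= k reps
```

For example, with Euclidean `math.dist`, k=1 and b=1, a point set contained in a ball of diameter 1 is always accepted, while two points at distance 100 are rejected with overwhelming probability.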

28 (Figure: points of X grouped around representatives rep1, rep2, rep3.)

29 Analysis of the General Metrics Algorithm Claim 1: If X is (k,b)-clusterable then the algorithm always accepts. Proof of Claim 1: The algorithm rejects only if it finds k+1 points whose pairwise distances are all > b. If X is (k,b)-clusterable no such set exists, since two of the k+1 points would have to share a cluster of diameter ≤ b. Claim 2: If X is ε-far from (k,2b)-clusterable then the algorithm rejects w.p. ≥ 2/3.

30 Analysis of the General Metrics Algorithm Cont. Proof of Claim 2: We show that w.h.p. in each iteration the sample contains a new representative, resulting in k+1 representatives and a REJECT. Consider the i-th iteration, i < k+1. Sub-claim: there are > εn points at distance > b from every rep_j, j ≤ i. Claim 2 follows, since the probability of not selecting such a point in some iteration is ≤ k·(1-ε)^(ln(3k)/ε) < 1/3. To verify the sub-claim, suppose there were < εn such points, and remove them. Then, by the triangle inequality, if we assign each remaining point x to a cluster j (j ≤ i ≤ k) s.t. rep_j is at distance at most b from x, we obtain a (k,2b)-clustering of all but < εn points, contradicting the assumption on X.
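The probability bound used above is worth checking numerically: since (1-ε)^(1/ε) ≤ e^(-1), we get k·(1-ε)^(ln(3k)/ε) ≤ k·e^(-ln(3k)) = 1/3. A quick sanity check:

```python
import math

def failure_bound(k, eps):
    """k * (1 - eps)**m for the per-iteration sample size m = ln(3k)/eps;
    this bounds the probability that some iteration misses all of the
    > eps*n far points."""
    m = math.log(3 * k) / eps
    return k * (1 - eps) ** m

# The bound k * e^{-ln(3k)} = 1/3 should hold for all k >= 1, 0 < eps < 1.
for k in (1, 2, 5, 50):
    for eps in (0.01, 0.1, 0.5):
        assert failure_bound(k, eps) <= 1 / 3 + 1e-12
```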

31 Finding an Approximately Good Partition Assume X is (k,b)-clusterable. Then the analysis implies that with prob. ≥ 2/3, the final representatives rep_1,…,rep_i, i ≤ k, can be used to (implicitly) define an ε-good (k,2b)-clustering: for each x ∈ X s.t. there exists rep_j with dist(x,rep_j) ≤ b, assign x to cluster j. We thus obtain a (k,2b)-clustering of all but at most εn points in X.
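The implicit clustering above is a one-pass assignment rule. A minimal sketch (the function name is illustrative; by the triangle inequality each resulting cluster has diameter at most 2b):

```python
import math

def assign_to_reps(X, reps, b, dist):
    """Assign each point of X to the first representative within distance b;
    points farther than b from every representative stay unassigned
    (w.h.p. at most an eps-fraction of X)."""
    clusters = [[] for _ in reps]
    unassigned = []
    for x in X:
        for j, r in enumerate(reps):
            if dist(x, r) <= b:
                clusters[j].append(x)
                break
        else:
            unassigned.append(x)
    return clusters, unassigned
```

With Euclidean `math.dist`, representatives (0,0) and (10,0), and b=1, the point (1,0) joins the first cluster, (9.5,0) the second, and (50,0) stays unassigned.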

32 Lower Bound for General Metrics If all that is known about the distance function between points in X is that the triangle inequality holds, then one cannot go below b'=2b unless a sample of size Ω(√(n/ε)) is used. Construction: matching edges have distance 2, all other pairs have distance 1. - If X does not contain matched pairs, then it is (1,1)-clusterable. - If X contains > εn matched pairs, then it is ε-far from (1,2-δ)-clusterable for every δ > 0. To distinguish the two cases the sample must contain both endpoints of some matched pair, which requires Ω(√(n/ε)) examples by a birthday-paradox argument.




36 Property Testing - Background Initially defined by Rubinfeld and Sudan in the context of program testing (of algebraic functions). Goldreich, Goldwasser and Ron initiated the study of testing properties of (undirected) graphs. A growing body of work deals with properties of functions, graphs, strings, sets of points... Many algorithms have complexity that is sub-linear in (or even independent of) the size of the object.

37 Related Work on Clustering It is hard to approximate the cost of an optimal clustering to within a constant factor (e.g., < 2), even for the L2 metric [HS,FG]. An approximation factor of 2 can be achieved efficiently under both costs [FG]. An approximation factor of (1+ε) can be achieved for the radius cost [AP].

