1 Approximations and Streaming Algorithms for Geometric Problems Piotr Indyk MIT
2 Computational Model Single * pass over the data: e 1, e 2, …,e n Bounded storage Fast processing time per element * For the purpose of this talk
3 Streaming Data Types Vector problems: Stream defines an array of numbers Maintain stats of the array, e.g., median Metric problems Clustering Graph problems, Text problems Geometric Problems [this talk]
4 Geometric Data Stream Algorithms as Data Structures Data structures that support: Insert(p) to P Possibly: Delete(p) from P Compute(P) Use space that is sub-linear in |P|
5 Insertions-only
6 Clustering in Geometric Spaces Problems: k-center [Charikar-Chekuri-Feder-Motwani’97] k-median [Guha-Mishra-Motwani-O’Callaghan’00, Meyerson’01, Charikar-O’Callaghan-Panigrahy’03] Bounds: poly(k,log n) space O(1)-approximation
7 k-median/k-center k is given Goal: choose k medians/centers to minimize: k-median: the sum of the distances k-center: the max distance
8 Geometric Space Bounds: poly(k,log n) space (1+ )-approximation Problems: Diameter, Minimum Enclosing Ball [Agarwal-Har-Peled’01, Feigenbaum- Kannan-Zhang’02, Cormode-Muthukrishnan’02, Hershberger-Suri’04] k-center [Agarwal-HarPeled’01, Agarwal-HarPeled-Varadarajan’04] k-median [HarPeled-Mazumdar’04] Range searching via -approximations: [Suri-Toth-Zhou’04] [Bagchi-Chaudhary-Eppstein-Goodrich’04] …
9 Dominant Approach: Merge and Reduce Main ideas: Design an (off-line) algorithm that converts the input into a “sketch”: Small size Sufficient to solve the problem A sketch of sketches is a sketch Partition the input in a tree-like fashion Simulate tree computation in small space Technique can traced back to ancient times i.e., 80’s [Munro-Paterson’80]
10 Tree Computation p1p1 p2p2 p3p3 p4p4 p5p5 p7p7 p6p6 p8p8 p9p9 p 10 p 11 p 12 p 13 p 15 p 14 p 16
11 Analysis Space: (sketch size)*log n Time: sketch computation time Question: Where do sketches come from ?
12 Idea I: solution=sketch Consider k-median [GMMO’00] : approximate k- median of approximate weighted k-medians is an approximate k-median Result: Constant depth tree Space: kn , >0 O(1) -approximation Works for any metric space k=3
13 Use the solution, ctd. -Approximations: find a subset S P, such that for any rectangle/halfspace/etc R, |R S|/|S| = |R P|/|P| [Matousek] : approximation of a union of approximations is an approximation [BCEG’04] : convert it into streaming algorithm, applications 1/ 2 space [STZ’04] : better/optimal bounds for rectangles and halfspaces
14 Idea 2: Core-Sets [AHP’01] Assume we want to minimize C P (o) S P is an -core-set for P, if for any o, and a set T: C P T (o) = (1 ± ) C S T (o) Note: this must hold for all o, not just the optimal one o P
15 Example: Core-set for MEB Compute extremal points: Choose “densely” spaced direction v 1 …v k I.e., for any u there is v i such that u*v i ≥ ||u|| 2 / (1+ ) For each direction maintain extremal point k=O(1/ ) (d-1)/2 suffice
16 Stream Algorithms via Core-sets Diameter/MEB/width: O(1/ ) (d-1)/2 log n space [AHP ’ 01] k-center: O(k/ d ) log n [HP ’ 01] k-median: O(k/ d ) log n [HPM ’ 04] O(k 2 / d ) [HPK ’ 05] O(k 2 d log 6 n/ ) [Chen ’ 05] O(d 3 / 7 ), k=1 [Indyk ’ 05] Faster algorithms and other results
17 Limitations Small core-sets might not exist Do not support deletions
18 Insertions and Deletions
19 Insertions and Deletions Technique: Reduction of geometric problems to vector problems Use of randomized linear embeddings Problems: Maintaining histograms of the data Classic geometric problems (matching, MST, clustering etc)
20 Streaming Algorithms for Vector Problems Norm estimation: Stream elements: (i,b), i=1…m Interpretation: x i =x i +b Want to maintain ||x|| p Why ? Examples: ||x|| p p =Σ i x i p = #non-zero coordinates in x, as p 0 … How ?
21 Dimensionality reduction x is an m-dimensional vector A is a “random” m times k matrix, k “small” Store Ax Recover (1± )||x|| 2 from Ax (with prob. 1-1/N ) [Alon-Matias-Szegedy’96] Estimator: median[ (A 1 x) (A c x) 2, (A c+1 x) (A 2c x) 2,..] 1/2, c=1/ 2, k=c log N A: constructed from 4-wise independent random variables [Johnson-Lindenstrauss’85] Estimator: ||Ax|| 2 A: each entry independently drawn from e.g. Gaussian distribution constructed using Nisan’s PRG [Indyk’00] [Indyk’00] Estimator: median[ (A 1 x),…, (A k x) ] A: as above Works for ||x|| p any p (0,2] (using p-stable distributions)
22 What it means To know ||x|| 2, suffices to know Ax Can maintain Ax when the coordinates are incremented: A(x+ be i )=Ax+ bA e i Ax Ax
23 Applications of Vector Approach Histograms/wavelet approximation Classic geometric problems (matching, MST, clustering etc)
24 Histograms View x as a function x:[1…n] [1…M] Approximate it using piecewise constant function h, with B pieces (buckets) Problem can be formulated in 2D as well (buckets become rectangular tiles)
25 Results: 1D [Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss’02] : Maintains h with B pieces such that ||x-h|| 2 ≤ (1+ )||x-h OPT || 2 Under increments/decrements of x Space: poly(B,1/ ,log n) Time: poly(B,1/ ,log n)
26 Results: 2D [Thaper-Guha-Indyk-Koudas’02] : Maintains h with B log (nM) tiles such that ||x-h|| 2 ≤ (1+ )||x-h OPT || 2 Under increments/decrements of x Space/Update time: poly(B,1/ ,log n) Histogram reconstruction time: poly(B,1/ , n) [Muthukrishnan-Strauss’03] : Maintains h with 4B tiles Time: poly(B,1/ , log(nM))
27 Minimum Weight Bi-chromatic Matching Estimate the cost of MWBM
28 Minimum Weight Matching Estimate the cost of MWM
29 Minimum Spanning Tree Estimate the cost of MST
30 Facility Location Goal: choose a set F of facilities to minimize the sum of the distances to nearest facility plus the number of facilities times f Again, report the cost
31 Approach [Indyk’04] Assume P {1… } 2 Reduce to vector problems Impose square grids G 0 …G k, with side lengths 2 0,2 1, …, 2 k, shifted at random. For each square cell c in G i, let n P (c) be the number of points from P in c. The algorithms will maintain certain statistics over n P (.), which will allow it to approximately solve the problems
32 Estimators MST:∑ i 2 i ∑ c G i [n P (c)>0] MWM:∑ i 2 i ∑ c G i [n P (c) is odd] MWBM:∑ i 2 i ∑ c G i |n G (c)-n B (c)| Fac. Loc.:∑ i 2 i ∑ c G i min[n P (c), T i ] K-median:∑ i 2 i ∑ c G i - B(Q, 2^i) n P (c) (given medians Q) Maintain #non-zero entries in n P [FM’85] Maintain L 1 difference [I’00]
33 Results ProblemAppr. MST log → 1+ MWM log MWBM log Fac.Loc. log 2 K-median XYZ → 1+ Space: (log +log n + K ) O(1) [Frahling-Indyk-Sohler’05] [Frahling-Sohler’05] […, Charikar’02, …]
34 XYZ Space: (K+log + log n) O(1) Computation TimeApproximation O(k) poly(log n+1/ )1+ 2 poly(log n+log +k) O(1) poly(log n+log +k)[ 1+ , log n log ]
35 Probabilistic embeddings into HST’s Known [Bartal’96, Charikar-Chekuri-Goel-Guha-Plotkin’98]: ||p-q|| ≤ D tree (p,q) E[ D tree (p,q) ] ≤ ||p-q|| * O(log ) T 5
36 MST E[Cost(MST in T)] ≤ O(log ) Cost(MST) Cost(MST in T) Cost(T) How to compute Cost(T) ? Sum over all levels i, of the #nodes at i, times 2 i Node c exists iff n i (c)>
37 Matching Algorithm: Match what you can at the current level Odd leftovers wait for the next level Repeat Optimal on the HST Cost=∑ i 2 i ∑ c Gi [n P (c) is odd]
38 Conclusions Algorithms for geometric data streams Insertions-only: merge and reduce, coresets Insertions and deletions: randomized linear embeddings
39 Open Problems Matchings, Facility Location, etc: Replace log by O(1) or even 1+ Possible for: MST [Frahling-Indyk-Sohler’05] k-median [Frahling-Sohler’05] Related to computing bi-chromatic matching [Agarwal-Varadarajan’04] Min-sum clustering ?
40 Open Problems High dimensions: Diameter: 2 1/2 -approx, O(d 2 n 1/2 ) space, follows from [Goel-Indyk-Varadarajan’01] c-approx, O( dn 1/(c 2 - 1) ) [Indyk’03] Conjecture: 2 1/2 -approx, O(d polylog n) space Min-width cylinder: 18-approx, O(d) space [Chan’04]
41 Open Problems Range queries: General lower bounds ? (Not just for - approximations) (1/ 2 ) -bit bound for general queries follows from LB for dot product [Indyk-Woodruff’03] and is tight (for randomized algorithms) What about e.g., half-space queries ? O(1/ 4/3 ) is known [STZ’04] Other problems [STZ’04]