Approximations and Streaming Algorithms for Geometric Problems. Piotr Indyk, MIT.

Presentation transcript:

1 Approximations and Streaming Algorithms for Geometric Problems Piotr Indyk MIT

2 Computational Model: single* pass over the data e_1, e_2, …, e_n; bounded storage; fast processing time per element. (* For the purpose of this talk.)

3 Streaming Data Types. Vector problems: the stream defines an array of numbers; maintain statistics of the array, e.g., the median. Metric problems: clustering. Graph problems, text problems. Geometric problems [this talk].

4 Geometric Data Stream Algorithms as Data Structures. Data structures that support: Insert(p) to P; possibly Delete(p) from P; Compute(P). Use space that is sub-linear in |P|.

5 Insertions-only

6 Clustering in Geometric Spaces. Problems: k-center [Charikar-Chekuri-Feder-Motwani’97]; k-median [Guha-Mishra-Motwani-O’Callaghan’00, Meyerson’01, Charikar-O’Callaghan-Panigrahy’03]. Bounds: poly(k, log n) space, O(1)-approximation.

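7 k-median/k-center. k is given. Goal: choose k medians/centers to minimize, over the input points, the distance from each point to its nearest median/center. k-median: the sum of these distances. k-center: the maximum of these distances.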
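
The two objectives on this slide can be written down directly. The following is a minimal sketch, assuming points in the Euclidean plane; the function names `k_median_cost` and `k_center_cost` are illustrative and do not come from the talk.

```python
import math

def k_median_cost(points, centers):
    # Sum over all points of the distance to the nearest center.
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def k_center_cost(points, centers):
    # Maximum over all points of the distance to the nearest center.
    return max(min(math.dist(p, c) for c in centers) for p in points)

points = [(0, 0), (1, 0), (10, 0), (12, 0)]
centers = [(0, 0), (10, 0)]
print(k_median_cost(points, centers))  # 0 + 1 + 0 + 2 = 3.0
print(k_center_cost(points, centers))  # 2.0
```

These functions only evaluate a given set of centers; choosing the centers in small space is what the streaming algorithms cited on the surrounding slides address.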

8 Geometric Space. Bounds: poly(k, log n) space, (1+ε)-approximation. Problems: Diameter, Minimum Enclosing Ball [Agarwal-Har-Peled’01, Feigenbaum-Kannan-Zhang’02, Cormode-Muthukrishnan’02, Hershberger-Suri’04]; k-center [Agarwal-HarPeled’01, Agarwal-HarPeled-Varadarajan’04]; k-median [HarPeled-Mazumdar’04]; range searching via ε-approximations [Suri-Toth-Zhou’04] [Bagchi-Chaudhary-Eppstein-Goodrich’04]; …

9 Dominant Approach: Merge and Reduce. Main ideas: design an (off-line) algorithm that converts the input into a “sketch” that has small size and is sufficient to solve the problem; a sketch of sketches is a sketch; partition the input in a tree-like fashion; simulate the tree computation in small space. The technique can be traced back to ancient times, i.e., the 80’s [Munro-Paterson’80].

10 Tree Computation [figure: merge tree built over the input points p_1 … p_16]

11 Analysis. Space: (sketch size) · log n. Time: sketch computation time. Question: where do sketches come from?
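
The tree simulation can be sketched with one stored sketch per tree level, in the style of a binary counter. This is a hedged illustration rather than code from the talk: the class `MergeReduce`, the parameter `block_size`, and the toy reduction `reduce_range` (which keeps only the minimum and maximum of 1D values, an exact sketch for the 1D diameter) are all assumptions made for the example.

```python
class MergeReduce:
    """Simulates the merge tree keeping at most one sketch per level."""

    def __init__(self, reduce_fn, block_size):
        self.reduce_fn = reduce_fn    # maps a list of items to a small list (the sketch)
        self.block_size = block_size  # number of raw items per leaf
        self.buffer = []              # raw items of the current, unfinished leaf
        self.levels = []              # levels[j]: None or a sketch of block_size * 2**j items

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) < self.block_size:
            return
        sketch = self.reduce_fn(self.buffer)
        self.buffer = []
        j = 0
        # Carry upward: merging two sketches of level j gives one sketch of level j + 1.
        while j < len(self.levels) and self.levels[j] is not None:
            sketch = self.reduce_fn(self.levels[j] + sketch)
            self.levels[j] = None
            j += 1
        if j == len(self.levels):
            self.levels.append(None)
        self.levels[j] = sketch

    def summary(self):
        # A sketch of all stored sketches plus the unfinished leaf.
        items = list(self.buffer)
        for s in self.levels:
            if s is not None:
                items.extend(s)
        return self.reduce_fn(items) if items else []

def reduce_range(values):
    # Toy sketch: the extreme values are enough to answer the 1D diameter exactly.
    return [min(values), max(values)]

mr = MergeReduce(reduce_range, block_size=4)
for i in range(1000):
    mr.insert((i * 37) % 1000)  # a permutation of 0..999
print(mr.summary())             # [0, 999]
print(len(mr.levels))           # number of tree levels in use, about log2(n / block_size)
```

Only one sketch per level is alive at any time, which is where the (sketch size) · log n space bound comes from. For approximate sketches the error typically accumulates with the depth of the tree, so the per-merge accuracy has to be chosen accordingly.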

12 Idea I: solution = sketch. Consider k-median [GMMO’00]: an approximate k-median of approximate weighted k-medians is an approximate k-median. Result: constant-depth tree; space k n^ε, ε > 0; O(1)-approximation; works for any metric space. [figure: example with k = 3]

13 Use the solution, ctd. ε-Approximations: find a subset S ⊆ P such that for any rectangle/halfspace/etc. R, |R ∩ S|/|S| = |R ∩ P|/|P| ± ε. [Matousek]: an approximation of a union of approximations is an approximation. [BCEG’04]: convert it into a streaming algorithm, applications; ≈ 1/ε^2 space. [STZ’04]: better/optimal bounds for rectangles and halfspaces.

14 Idea 2: Core-Sets [AHP’01]. Assume we want to minimize C_P(o). S ⊆ P is an ε-core-set for P if, for any o and any set T: C_{P ∪ T}(o) = (1 ± ε) C_{S ∪ T}(o). Note: this must hold for all o, not just the optimal one o_P.

15 Example: Core-set for MEB. Compute extremal points: choose “densely” spaced directions v_1 … v_k, i.e., for any u there is a v_i such that u·v_i ≥ ||u||_2 / (1+ε). For each direction maintain the extremal point. k = O(1/ε)^{(d-1)/2} directions suffice.
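
The extremal-point idea can be illustrated in the plane (d = 2) for the diameter problem. This is a minimal sketch under stated assumptions: the class name `DirectionalCoreset` is hypothetical, the k directions are evenly spaced unit vectors, and k is taken even so that every direction v has its opposite -v in the set. Under these assumptions the reported diameter is at least cos(π/k) times the true one.

```python
import math
import random

class DirectionalCoreset:
    """Keeps, for each of k fixed directions, the point that is farthest along it."""

    def __init__(self, k):
        assert k % 2 == 0, "k even, so that -v is a direction whenever v is"
        self.dirs = [(math.cos(2 * math.pi * j / k), math.sin(2 * math.pi * j / k))
                     for j in range(k)]
        self.extreme = [None] * k

    def insert(self, p):
        for j, (vx, vy) in enumerate(self.dirs):
            best = self.extreme[j]
            if best is None or p[0] * vx + p[1] * vy > best[0] * vx + best[1] * vy:
                self.extreme[j] = p

    def diameter(self):
        # Largest distance among the at most k stored points.
        pts = [p for p in self.extreme if p is not None]
        return max((math.dist(p, q) for p in pts for q in pts), default=0.0)

rng = random.Random(1)
pts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
cs = DirectionalCoreset(16)
for p in pts:
    cs.insert(p)
true_diam = max(math.dist(p, q) for p in pts for q in pts)
print(cs.diameter() / true_diam)  # between cos(pi/16) and 1
```

Since cos(π/k) is roughly 1 - π²/(2k²), about 1/√ε directions give a (1+ε)-type guarantee in the plane, consistent with the (d-1)/2 exponent on the slide.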

16 Stream Algorithms via Core-sets. Diameter/MEB/width: O(1/ε)^{(d-1)/2} log n space [AHP’01]. k-center: O(k/ε^d) log n [HP’01]. k-median: O(k/ε^d) log n [HPM’04]; O(k^2/ε^d) [HPK’05]; O(k^2 d log^6 n/ε) [Chen’05]; O(d^3/ε^7), k = 1 [Indyk’05]. Faster algorithms and other results.

17 Limitations Small core-sets might not exist Do not support deletions

18 Insertions and Deletions

19 Insertions and Deletions Technique: Reduction of geometric problems to vector problems Use of randomized linear embeddings Problems: Maintaining histograms of the data Classic geometric problems (matching, MST, clustering etc)

20 Streaming Algorithms for Vector Problems. Norm estimation: stream elements are pairs (i, b), i = 1…m, interpreted as x_i = x_i + b. Want to maintain ||x||_p. Why? Examples: ||x||_p^p = Σ_i |x_i|^p, which tends to the number of non-zero coordinates of x as p → 0; … How?

21 Dimensionality reduction. x is an m-dimensional vector; A is a “random” k × m matrix with rows A_1, …, A_k, where k is “small”. Store Ax and recover (1±ε)||x||_2 from Ax (with prob. 1-1/N).
[Alon-Matias-Szegedy’96] Estimator: median[ (A_1 x)^2 + … + (A_c x)^2, (A_{c+1} x)^2 + … + (A_{2c} x)^2, … ]^{1/2}, with c = 1/ε^2 and k = c log N. A: constructed from 4-wise independent random variables.
[Johnson-Lindenstrauss’85] Estimator: ||Ax||_2. A: each entry drawn independently from, e.g., a Gaussian distribution; constructed using Nisan’s PRG [Indyk’00].
[Indyk’00] Estimator: median[ |A_1 x|, …, |A_k x| ]. A: as above. Works for ||x||_p, any p ∈ (0,2] (using p-stable distributions).
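
The median estimator for p = 1 can be sketched with Cauchy (1-stable) entries: each coordinate of Ax is then distributed as ||x||_1 times a standard Cauchy variable, and the median of the absolute value of a standard Cauchy variable is 1. The code below is an illustration under loud assumptions: the class name `CauchySketch` is hypothetical, and each column of A is regenerated from a seed with Python's `random` module, which stands in for Nisan's PRG and does not carry its guarantees.

```python
import math
import random
import statistics

class CauchySketch:
    """Maintains y = A x for a k-row matrix A with i.i.d. standard Cauchy entries."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.y = [0.0] * k

    def _column(self, i):
        # Assumption: a seeded generator replaces the PRG, so A is never stored.
        rng = random.Random(self.seed * 1_000_003 + i)
        return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(self.k)]

    def update(self, i, b):
        # Stream element (i, b): x_i = x_i + b, hence y = y + b * (column i of A).
        col = self._column(i)
        for r in range(self.k):
            self.y[r] += b * col[r]

    def estimate_l1(self):
        return statistics.median(abs(v) for v in self.y)

sk = CauchySketch(k=1001, seed=7)
for i in range(50):
    sk.update(i, i - 20)   # x_i = i - 20, so ||x||_1 = 210 + 435 = 645
print(sk.estimate_l1())    # close to 645, up to sampling error of a few percent
```

For general p in (0, 2] the Cauchy entries would be replaced by p-stable ones and the median rescaled by the median of the absolute value of that distribution.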

22 What it means. To know ||x||_2, it suffices to know Ax. Ax can be maintained when the coordinates are incremented: A(x + b e_i) = Ax + b A e_i.
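
Linearity is what makes deletions free: an update (i, b) with negative b subtracts the same column that the insertion added. A small sketch of this, assuming Gaussian entries and the Johnson-Lindenstrauss style estimator ||Ax||_2 / √k; the class name `L2Sketch` and the seeded column generator are hypothetical stand-ins, not the construction from the cited papers.

```python
import math
import random

class L2Sketch:
    """Maintains y = A x for a k-row Gaussian matrix A under increments and decrements."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.y = [0.0] * k

    def _column(self, i):
        # Column A e_i, regenerated on demand from the seed (an assumption of this sketch).
        rng = random.Random(self.seed * 1_000_003 + i)
        return [rng.gauss(0.0, 1.0) for _ in range(self.k)]

    def update(self, i, b):
        # A(x + b e_i) = Ax + b A e_i
        col = self._column(i)
        for r in range(self.k):
            self.y[r] += b * col[r]

    def estimate_l2(self):
        return math.sqrt(sum(v * v for v in self.y) / self.k)

s = L2Sketch(k=400, seed=1)
s.update(3, 5.0)
s.update(7, 2.0)
s.update(3, -5.0)          # undoes the first update
t = L2Sketch(k=400, seed=1)
t.update(7, 2.0)           # the same vector x = 2 e_7, built without the detour
print(max(abs(a - b) for a, b in zip(s.y, t.y)))  # floating-point noise only
print(s.estimate_l2())     # close to ||x||_2 = 2
```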

23 Applications of Vector Approach Histograms/wavelet approximation Classic geometric problems (matching, MST, clustering etc)

24 Histograms. View x as a function x: [1…n] → [1…M]. Approximate it using a piecewise constant function h with B pieces (buckets). The problem can be formulated in 2D as well (buckets become rectangular tiles).
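
To make the objective concrete, the best B-bucket histogram under the L_2 error can be computed off-line by a standard dynamic program. This is not the streaming algorithm from the results that follow; it only illustrates the optimal histogram that those results compare against. The function name `best_histogram` and the 0-based half-open bucket convention are assumptions of this sketch, which also assumes 1 ≤ B ≤ len(x).

```python
import math

def best_histogram(x, B):
    """Returns (L2 error, buckets) of an optimal B-bucket histogram; bucket = (start, end, value)."""
    n = len(x)
    s = [0.0] * (n + 1)    # prefix sums of x
    s2 = [0.0] * (n + 1)   # prefix sums of x squared
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(a, b):
        # Squared error of replacing x[a:b] by its mean.
        total = s[b] - s[a]
        return max(0.0, (s2[b] - s2[a]) - total * total / (b - a))

    INF = float("inf")
    err = [[INF] * (n + 1) for _ in range(B + 1)]  # err[j][i]: best error for x[:i] with j buckets
    cut = [[0] * (n + 1) for _ in range(B + 1)]    # start of the last bucket in that solution
    err[0][0] = 0.0
    for j in range(1, B + 1):
        for i in range(1, n + 1):
            for a in range(j - 1, i):
                cand = err[j - 1][a] + sse(a, i)
                if cand < err[j][i]:
                    err[j][i] = cand
                    cut[j][i] = a

    buckets, i = [], n
    for j in range(B, 0, -1):
        a = cut[j][i]
        buckets.append((a, i, (s[i] - s[a]) / (i - a)))
        i = a
    buckets.reverse()
    return math.sqrt(err[B][n]), buckets

error, buckets = best_histogram([1, 1, 1, 5, 5, 5], B=2)
print(error)    # 0.0
print(buckets)  # [(0, 3, 1.0), (3, 6, 5.0)]
```

The dynamic program takes O(n² B) time and linear space in n, which is exactly what the streaming results avoid.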

25 Results: 1D. [Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss’02]: maintains h with B pieces such that ||x-h||_2 ≤ (1+ε)||x-h_OPT||_2, under increments/decrements of x. Space: poly(B, 1/ε, log n). Time: poly(B, 1/ε, log n).

26 Results: 2D. [Thaper-Guha-Indyk-Koudas’02]: maintains h with B log(nM) tiles such that ||x-h||_2 ≤ (1+ε)||x-h_OPT||_2, under increments/decrements of x. Space/update time: poly(B, 1/ε, log n). Histogram reconstruction time: poly(B, 1/ε, n). [Muthukrishnan-Strauss’03]: maintains h with 4B tiles. Time: poly(B, 1/ε, log(nM)).

27 Minimum Weight Bi-chromatic Matching Estimate the cost of MWBM

28 Minimum Weight Matching Estimate the cost of MWM

29 Minimum Spanning Tree Estimate the cost of MST

30 Facility Location Goal: choose a set F of facilities to minimize the sum of the distances to nearest facility plus the number of facilities times f Again, report the cost

31 Approach [Indyk’04]. Assume P ⊂ {1…Δ}^2. Reduce to vector problems: impose square grids G_0 … G_k, with side lengths 2^0, 2^1, …, 2^k, shifted at random. For each square cell c in G_i, let n_P(c) be the number of points from P in c. The algorithms maintain certain statistics over n_P(.), which allow them to approximately solve the problems.

32 Estimators
MST: ∑_i 2^i ∑_{c ∈ G_i} [n_P(c) > 0]
MWM: ∑_i 2^i ∑_{c ∈ G_i} [n_P(c) is odd]
MWBM: ∑_i 2^i ∑_{c ∈ G_i} |n_G(c) - n_B(c)|
Fac. Loc.: ∑_i 2^i ∑_{c ∈ G_i} min[n_P(c), T_i]
K-median: ∑_i 2^i ∑_{c ∈ G_i - B(Q, 2^i)} n_P(c) (given medians Q)
Maintain #non-zero entries in n_P [FM’85]. Maintain L_1 difference [I’00].
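
The first three estimators can be evaluated directly from the cell counts. The sketch below computes the counts exactly and off-line, which is an assumption made for illustration: in the streaming setting the number of non-empty cells and the L_1 difference would instead be maintained by the small-space methods cited on the slide. The function names and the `shift` parameter (the random grid shift, as a pair of non-negative integers) are hypothetical.

```python
from collections import Counter

def grid_counts(points, level, shift):
    # n_P(c) for every non-empty cell c of the grid G_level (side 2**level).
    side = 2 ** level
    return Counter(((x + shift[0]) // side, (y + shift[1]) // side) for x, y in points)

def mst_estimator(points, levels, shift=(0, 0)):
    # sum_i 2^i * #{c in G_i : n_P(c) > 0}
    return sum(2 ** i * len(grid_counts(points, i, shift)) for i in range(levels + 1))

def mwm_estimator(points, levels, shift=(0, 0)):
    # sum_i 2^i * #{c in G_i : n_P(c) is odd}
    return sum(2 ** i * sum(1 for n in grid_counts(points, i, shift).values() if n % 2 == 1)
               for i in range(levels + 1))

def mwbm_estimator(green, blue, levels, shift=(0, 0)):
    # sum_i 2^i * sum_c |n_G(c) - n_B(c)|
    total = 0
    for i in range(levels + 1):
        g = grid_counts(green, i, shift)
        b = grid_counts(blue, i, shift)
        total += 2 ** i * sum(abs(g[c] - b[c]) for c in set(g) | set(b))
    return total

print(mst_estimator([(3, 4)], levels=3))             # 1 + 2 + 4 + 8 = 15
print(mwm_estimator([(0, 0), (0, 0)], levels=3))     # 0: every cell holds an even count
print(mwbm_estimator([(1, 1)], [(1, 1)], levels=3))  # 0: the colour counts agree in every cell
```

Each estimator is a function of the vectors n_P (or n_G - n_B) at every level, which is what reduces the geometric problem to a vector problem.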

33 Results
Problem      Approximation
MST          log Δ → 1+ε [Frahling-Indyk-Sohler’05]
MWM          log Δ
MWBM         log Δ
Fac. Loc.    log^2 Δ
K-median     XYZ → 1+ε [Frahling-Sohler’05]
Space: (log Δ + log n + K)^{O(1)}
[…, Charikar’02, …]

34 XYZ
Space: (K + log Δ + log n)^{O(1)}
Computation time                   Approximation
Δ^{O(k)} poly(log n + 1/ε)         1+ε
Δ^2 poly(log n + log Δ + k)        O(1)
poly(log n + log Δ + k)            [1+ε, log n log Δ]

35 Probabilistic embeddings into HSTs. Known [Bartal’96, Charikar-Chekuri-Goel-Guha-Plotkin’98]: ||p-q|| ≤ D_tree(p,q) and E[D_tree(p,q)] ≤ ||p-q|| · O(log Δ).

36 MST. E[Cost(MST in T)] ≤ O(log Δ) Cost(MST). Cost(MST in T) ≈ Cost(T). How to compute Cost(T)? Sum, over all levels i, the number of nodes at level i times 2^i. Node c exists iff n_i(c) > 0.

37 Matching. Algorithm: match what you can at the current level; odd leftovers wait for the next level; repeat. Optimal on the HST. Cost = ∑_i 2^i ∑_{c ∈ G_i} [n_P(c) is odd].
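
The level-by-level algorithm can be sketched on the grid hierarchy as follows. This is an off-line illustration that builds the matching explicitly, whereas the streaming algorithm only estimates its cost. The function name `grid_matching` is hypothetical, and the sketch assumes points with non-negative integer coordinates and a non-negative integer shift.

```python
import math
from collections import defaultdict

def grid_matching(points, shift=(0, 0)):
    """Greedy matching: pair points within each cell, pass odd leftovers to the next level."""
    unmatched = list(points)
    pairs = []
    level = 0
    while len(unmatched) >= 2:
        side = 2 ** level
        cells = defaultdict(list)
        for p in unmatched:
            cells[((p[0] + shift[0]) // side, (p[1] + shift[1]) // side)].append(p)
        unmatched = []
        for members in cells.values():
            while len(members) >= 2:          # match what you can at this level
                pairs.append((members.pop(), members.pop()))
            unmatched.extend(members)         # at most one leftover per cell
        level += 1                            # cells double in side, so the loop terminates
    return pairs, unmatched

pairs, leftover = grid_matching([(0, 0), (1, 0), (100, 100), (101, 100)])
print(sum(math.dist(p, q) for p, q in pairs))  # 2.0
print(leftover)                                # []
```

With an odd number of points, exactly one point is returned as a leftover.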

38 Conclusions Algorithms for geometric data streams Insertions-only: merge and reduce, coresets Insertions and deletions: randomized linear embeddings

39 Open Problems. Matchings, facility location, etc.: replace log Δ by O(1) or even 1+ε. Possible for: MST [Frahling-Indyk-Sohler’05]; k-median [Frahling-Sohler’05]. Related to computing bi-chromatic matching [Agarwal-Varadarajan’04]. Min-sum clustering?

40 Open Problems. High dimensions. Diameter: 2^{1/2}-approx, O(d^2 n^{1/2}) space, follows from [Goel-Indyk-Varadarajan’01]; c-approx, O(d n^{1/(c^2-1)}) [Indyk’03]. Conjecture: ≈ 2^{1/2}-approx, O(d polylog n) space. Min-width cylinder: 18-approx, O(d) space [Chan’04].

41 Open Problems. Range queries: general lower bounds? (Not just for ε-approximations.) An Ω(1/ε^2)-bit bound for general queries follows from the LB for dot product [Indyk-Woodruff’03] and is tight (for randomized algorithms). What about, e.g., half-space queries? O(1/ε^{4/3}) is known [STZ’04]. Other problems [STZ’04].