Sublinear Algorithms via Precision Sampling Alexandr Andoni (Microsoft Research) joint work with: Robert Krauthgamer (Weizmann Inst.) Krzysztof Onak (CMU)

Goal
Compute the number of Dacians in the empire: estimate S = a_1 + a_2 + … + a_n, where a_i ∈ [0,1], sublinearly.

Sampling
- Send accountants to a subset J of the provinces, |J| = m
- Estimator: S̃ = (n/m) · ∑_{j∈J} a_j
- Chebyshev bound: with 90% success probability, 0.5·S − O(n/m) < S̃ < 2·S + O(n/m)
- For constant additive error, need m ~ n
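As a sanity check on this bound, here is a minimal simulation of the naive estimator; the data, sizes, and seeds below are made up for illustration:

```python
import random

def sample_estimate(a, m, seed=0):
    """Naive sampling estimator from the slide: average m uniformly
    sampled entries of a and scale by n/m."""
    rng = random.Random(seed)
    n = len(a)
    return sum(a[rng.randrange(n)] for _ in range(m)) * n / m

rng = random.Random(1)
n = 10_000
a = [rng.random() for _ in range(n)]  # hypothetical per-province counts in [0, 1]
S = sum(a)                            # true sum (about n/2 here)
est = sample_estimate(a, m=n // 10)   # only n/10 accountants
```

With m ≪ n the additive O(n/m) term dominates, which is why constant additive error forces m ~ n.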

Precision Sampling Framework
- Send accountants to each province, but require only approximate counts
- Estimate ã_i up to some pre-selected precision u_i: |a_i − ã_i| < u_i
- Challenge: achieve a good trade-off between
  - the quality of the approximation to S
  - the total cost of estimating each ã_i to precision u_i

Formalization
A game between a sum estimator and an adversary:
1. Adversary: fix a_1, a_2, …, a_n. Estimator: fix precisions u_i
2. Adversary: fix ã_1, ã_2, …, ã_n s.t. |a_i − ã_i| < u_i
3. Estimator: given ã_1, ã_2, …, ã_n, output S̃ s.t. |∑ a_i − S̃| < 1
- What is the cost?
  - To achieve precision u_i, use 1/u_i "resources": e.g., if a_i is itself a sum a_i = ∑_j a_ij computed by subsampling, then one needs Θ(1/u_i) samples
  - Here, average cost = (1/n) · ∑ 1/u_i
- For example, one can choose all u_i = 1/n
  - Average cost ≈ n
  - This is best possible, if the estimator is S̃ = ∑ ã_i
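The cost model is easy to state in code; a tiny sketch with illustrative values:

```python
def average_cost(u):
    """Average cost of a precision vector under the slide's model:
    precision u_i costs 1/u_i resources (e.g., Theta(1/u_i) subsamples),
    averaged over the n terms."""
    return sum(1.0 / ui for ui in u) / len(u)

n = 1000
all_fine = [1.0 / n] * n   # every term to precision 1/n, as needed for S-tilde = sum of a-tilde_i
all_crude = [1.0] * n      # crude precision everywhere: cheap but uninformative

fine_cost = average_cost(all_fine)    # about n: no better than exact counting
crude_cost = average_cost(all_crude)  # exactly 1
```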

Precision Sampling Lemma
- Goal: estimate ∑ a_i from {ã_i} satisfying |a_i − ã_i| < u_i.
- Precision Sampling Lemma: can get, with 90% success:
  - O(1) additive error and 1.5 multiplicative error: S − O(1) < S̃ < 1.5·S + O(1)
  - with average cost equal to O(log n)
- More generally: ε additive error and 1+ε multiplicative error, i.e., S − ε < S̃ < (1+ε)·S + ε, with average cost O(ε^{−3} log n)
- Example: distinguish ∑ a_i = 5 vs ∑ a_i = 0. Consider two extreme cases:
  - if five a_i = 1: sample all, but need only a crude approximation (u_i = 1/10)
  - if all a_i = 5/n: only a few with good approximation u_i = 1/n, and the rest with u_i = 1

Precision Sampling Algorithm
- Precision Sampling Lemma: can get, with 90% success:
  - O(1) additive error and 1.5 multiplicative error: S − O(1) < S̃ < 1.5·S + O(1)
  - with average cost equal to O(log n)
- Algorithm:
  - Choose each u_i ∈ [0,1] i.i.d. uniformly at random
  - Estimator: S̃ = count of the i's s.t. ã_i / u_i > 6 (modulo a normalization constant)
- Proof of correctness:
  - we use only those ã_i which are (1+ε)-approximations to a_i
  - E[S̃] ≈ ∑ Pr[a_i / u_i > 6] = ∑ a_i / 6
  - E[1/u_i] = O(log n) w.h.p.
- For the 1+ε guarantee (S − ε < S̃ < (1+ε)·S + ε with average cost O(ε^{−3} log n)): the estimator is a function of [ã_i/u_i − 4/ε]^+, and the u_i's follow a concrete distribution, the minimum of O(ε^{−3}) uniform random variables
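A minimal simulation of this counting estimator (it sidesteps the precision aspect by feeding in exact a_i; the input values are made up):

```python
import random

def psl_estimate(a_tilde, rng):
    """Counting estimator from the slide: draw u_i ~ Uniform[0,1) i.i.d.
    and count the i's with a_i / u_i > 6 (equivalently a_i > 6*u_i).
    Pr[a_i > 6*u_i] = a_i/6 for a_i in [0,1], so multiplying the count
    by 6 (the normalization constant) makes the estimate unbiased."""
    return 6 * sum(1 for ai in a_tilde if ai > 6 * rng.random())

rng = random.Random(0)
a = [0.5] * 1200                      # toy input with true sum S = 600
S = sum(a)
runs = [psl_estimate(a, rng) for _ in range(200)]
avg = sum(runs) / len(runs)           # concentrates around S
```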

Why?
- Save time:
  - Problem: computing the edit distance between two strings
  - New algorithm that obtains a (log n)^{1/ε} approximation in n^{1+O(ε)} time
  - via an efficient property-testing algorithm that uses Precision Sampling
  - More details: see the talk by Robi on Friday!
- Save space:
  - Problem: compute norms/frequency moments in streams
  - Gives a simple and unified approach to compute all ℓ_p and F_k moments, and other goodies
  - More details: now

Streaming frequencies
- Setup: estimate frequencies up to 1+ε in small space
- Let x_i = frequency of ethnicity i

  Ethnicity | Frequency
  Dacians | 358
  Galois | 12
  Barbarians | 2988

- k-th moment: F_k = ∑ x_i^k
  - k ∈ [0,2]: space O(1/ε²) [AMS'96, I'00, GC'07, Li'08, NW'10, KNW'10, KNPW'11]
  - k > 2: space Õ(n^{1−2/k}) [AMS'96, SS'02, BYJKS'02, CKS'03, IW'05, BGKS'06, BO'10]
- Sometimes the frequencies x_i are negative:
  - e.g., if measuring traffic differences (delay, etc.)
- We want a linear "dimension reduction" L: ℝ^n → ℝ^m, m ≪ n

Norm Estimation via Precision Sampling
- Idea: use PSL to compute the sum ‖x‖_k^k = ∑ |x_i|^k
- General approach:
  1. Pick u_i's according to PSL and let y_i = x_i / u_i^{1/k}
  2. Compute each y_i^k up to additive approximation O(1)
     - can be done by computing the heavy hitters of the vector y
  3. Use PSL to compute the sum ‖x‖_k^k = ∑ |x_i|^k from these approximations
- The space bound is controlled by the norm ‖y‖_2
  - since heavy hitters under ℓ_2 is the best we can do
  - note that ‖y‖_2 ≤ ‖x‖_2 · E[1/u_i]

Streaming F_k moments
- Theorem: a linear sketch for F_k with O(1) approximation, O(1) update time, and O(n^{1−2/k} log n) space (in words).
- Sketch:
  - pick random u_i ∈ [0,1] and s_i ∈ {±1}, and let y_i = s_i · x_i / u_i^{1/k}
  - throw the y_i into one hash table H, of size m = O(n^{1−2/k} log n) cells
- Update: on (i, a), set H[h(i)] += s_i · a / u_i^{1/k}
- Estimator: max_{j∈[m]} |H[j]|^k
- Randomness: O(1)-wise independence suffices
[Figure: vector x = (x_1, …, x_6) hashed into H = (y_1+y_3, y_4, y_2+y_5+y_6)]
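A toy, single-pass version of this sketch is below; parameters and inputs are invented. A real implementation fixes the hash value h(i) and sign s_i per coordinate across updates, but since each i appears exactly once in this toy stream, drawing them on the fly is equivalent:

```python
import random

def fk_sketch_estimate(x, k, m, rng):
    """One instance of the slide's linear sketch for F_k = sum |x_i|^k:
    hash y_i = s_i * x_i / u_i^(1/k) into an m-cell table H and return
    max_j |H[j]|^k, a constant-factor approximation of F_k with
    constant probability."""
    H = [0.0] * m
    for xi in x:
        u = rng.random() or 1e-12        # u_i ~ Uniform(0, 1), avoid 0
        s = rng.choice((-1.0, 1.0))      # random sign s_i
        H[rng.randrange(m)] += s * xi / u ** (1.0 / k)
    return max(abs(h) for h in H) ** k

rng = random.Random(7)
x = [rng.randrange(1, 10) for _ in range(64)]
k = 3
Fk = sum(v ** k for v in x)
# Median of independent repetitions boosts the constant success probability.
est = sorted(fk_sketch_estimate(x, k, m=256, rng=rng) for _ in range(15))[7]
```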

More Streaming Algorithms
- Other streaming algorithms:
  - Algorithm for all k-moments, including k ≤ 2
    - for k > 2, improves the existing space bounds [AMS'96, IW'05, BGKS'06, BO'10]
    - for k ≤ 2, worse space bounds [AMS'96, I'00, GC'07, Li'08, NW'10, KNW'10, KNPW'11]
  - Improved algorithm for mixed norms (ℓ_p of ℓ_k) [CM'05, GBD'08, JW'09]
    - space bounded by the (Rademacher) p-type constant
  - Algorithm for the ℓ_p-sampling problem [MW'10]
    - this work extended to give tight bounds by [JST'11]
- Connections:
  - inspired by the streaming algorithm of [IW'05], but simpler
  - turns out to be a distant relative of Priority Sampling [DLT'07]

Finale
- Other applications of the Precision Sampling framework?
- Better algorithms for precision sampling?
  - Best bound for the average cost (for 1+ε approximation):
    - upper bound: O(ε^{−3} · log n) (tight for our algorithm)
    - lower bound: Ω(ε^{−2} · log n)
  - Bounds for other cost models?
    - e.g., for cost 1/√precision, the bound is O(ε^{−3/2})
- Other forms of "access" to the a_i's?