Published by Noah Henry. Modified over 3 years ago.

1
Optimal Approximations of the Frequency Moments of Data Streams
Piotr Indyk, David Woodruff

2
The Streaming Model
Example stream: 7, 1, 1, 3, 7, 3, 4, …
Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$
Want to compute statistics on the stream
Elements arranged in adversarial order
Algorithms are given one pass over the stream
Goal: minimum-space algorithm

3
Frequency Moments [AMS96]
$n$ = stream size, $m$ = universe size
$f_i$ = # occurrences of item $i$
$k$-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
Why are frequency moments important?
$F_0$ = # of distinct elements
$F_1 = n$ = stream size
$F_2$ = self-join size

4
Applications
Estimating distinct elements with low space:
  Estimate query selectivity to huge DB without sorting
  Routers gather # distinct destinations
$F_2$ estimates size of self-joins:
  First copy: (Bob, x), (Alice, y), (Bob, z)
  Second copy: (Bob, a), (Alice, b), (Bob, c)
  Join on name: (Alice, b, y), (Bob, a, x), (Bob, a, z), (Bob, c, x), (Bob, c, z)
  $f_B^2 + f_A^2 = 4 + 1 = 5$
$F_k$ measures data skewness
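The self-join claim can be checked directly on the slide's example. The sketch below is only an illustration: the list `names` and both function names are hypothetical, and it counts matching row pairs by brute force rather than in small space.

```python
from collections import Counter

# Hypothetical join-key column matching the slide: Bob appears twice, Alice once.
names = ["Bob", "Alice", "Bob"]

def self_join_size(keys):
    """Count ordered pairs of rows whose join keys are equal."""
    return sum(1 for a in keys for b in keys if a == b)

def second_moment(keys):
    """F_2: the sum of squared key frequencies."""
    return sum(f * f for f in Counter(keys).values())

print(self_join_size(names), second_moment(names))
```

Both quantities agree because a key with frequency f contributes exactly f * f matching pairs.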

5
The Best Deterministic Algorithm
Trivial algorithm for $F_k$:
  Store/update $f_i$ for each item $i$, sum $f_i^k$ at end
  Space = $O(m \log n)$: $m$ items $i$, $\log n$ bits to count $f_i$
Negative Results [AMS96]:
  Computing $F_k$ exactly requires $\Omega(m)$ space
  Any deterministic alg. that outputs $X$ with $|F_k - X| < \epsilon F_k$ must use $\Omega(m)$ space
What about randomized algorithms?
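A minimal sketch of the trivial algorithm described on this slide, assuming the stream is any Python iterable of hashable items. The function name is hypothetical. It keeps one counter per distinct item, which is exactly the $O(m \log n)$ space the slide refers to.

```python
from collections import Counter

def exact_frequency_moment(stream, k):
    """One pass: maintain f_i for every item seen, then sum f_i^k at the end."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
    # Only items with f_i >= 1 are stored, so k = 0 counts distinct elements.
    return sum(f ** k for f in counts.values())

stream = [7, 1, 1, 3, 7, 3, 4]
print([exact_frequency_moment(stream, k) for k in range(4)])
```

For this stream the frequencies are 2, 2, 2 and 1, so the moments for k = 0, 1, 2, 3 are 4, 7, 13 and 25.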

6
Randomized Approx Algs for $F_k$
A randomized alg. $\epsilon$-approximates $F_k$ if it outputs $X$ s.t. $\Pr[|F_k - X| < \epsilon F_k] > 2/3$
Previous work (table suppresses polylog $mn$):

         Upper                                   Lower
$F_0$    $1/\epsilon^2$ [FM85, GT02, BJKST02]    $1/\epsilon^2$ [IW03, W04]
$F_1$    1                                       -
$F_2$    $1/\epsilon^2$ [AMS96]                  $1/\epsilon^2$ [W04]
$F_k$    $m^{1-1/(k-1)}$ [CK04, G04]             $m^{1-2/k}$ [BJKS02]

7
Matching Upper Bound
Our Contribution: For every $k$ there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\epsilon$-approximate $F_k$
Additional Features:
1. Works even if we allow deletions, that is, a stream of elements $(i, +)$, $(i, -)$
2. Constant update time

8
Techniques
Our algorithm
1. Divide frequencies into buckets $0, [1, 2), [2, 4), [4, 8), \ldots, [2^{i-1}, 2^i), \ldots$
2. Estimate size $s_i$ of each bucket
3. Output $X = \sum_i s_i 2^{ik}$
Previous Algorithms [AMS96, CK04, G04]
1. Cleverly construct a small-space estimator $X$ s.t. $E[X] = F_k$ and $\mathrm{Var}[X]$ is small
2. Apply Chebyshev's inequality
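To see what the bucketing steps compute, here is a sketch that uses exact bucket sizes in place of estimated ones. The function names are hypothetical, and the exact counting is for illustration only since it is not small-space. With power-of-two buckets, every frequency $f$ in bucket $i$ satisfies $f^k < 2^{ik} \le (2f)^k$, so the output lies between $F_k$ and $2^k F_k$.

```python
from collections import Counter

def bucket_sizes(stream):
    """Return {i: s_i}, where s_i counts items with frequency in [2^(i-1), 2^i)."""
    sizes = Counter()
    for f in Counter(stream).values():
        # For f >= 1, f lies in [2^(i-1), 2^i) exactly when i == f.bit_length().
        sizes[f.bit_length()] += 1
    return dict(sizes)

def bucketed_moment(sizes, k):
    """X = sum_i s_i * 2^(i*k), charging each item the top of its bucket."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

sizes = bucket_sizes([7, 1, 1, 3, 7, 3, 4])
print(sizes, bucketed_moment(sizes, 2))
```

On this stream three items have frequency 2 and one has frequency 1, so $X = 3 \cdot 2^4 + 2^2 = 52$ for $k = 2$, against a true $F_2$ of 13.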

9
What's Left?
Remaining Problem: Estimate $s_i$ = # of elements with frequency in each bucket $[2^{i-1}, 2^i)$
Is this always easy? No.
Suppose it were always easy: then we could approximate the maximum frequency
This is HARD: $\Omega(m)$ space [AMS96]
However, $\Omega(m)$ only applies to worst-case streams; otherwise we can do better: CountSketch [CCF-C]

10
For the moment, let's assume:
1. $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space (we remove this using CountSketch)
2. We have a very long RAM of random bits, 0110001… (we remove this using Nisan's generator)
[Figure: plot of frequency against items, with the maximum labeled Max]

11
General Idea: Max + Sampling
Restrict input stream to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$.
Example: stream 7, 1, 1, 3, 7, 3, 4, … with random subset = {1, 3} becomes 1, 1, 3, 3, …
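The sampling step can be sketched as follows. All names here are hypothetical, and `max_oracle` is an exact stand-in for the Max oracle assumed in the talk: it stores every counter, so it does not have the small space the real oracle is assumed to have.

```python
import random
from collections import Counter

def sample_items(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {x for x in range(1, m + 1) if rng.random() < p}

def restrict(stream, subset):
    """Keep only the stream elements whose item was sampled."""
    return [a for a in stream if a in subset]

def max_oracle(stream):
    """Exact stand-in for Max: return (item, frequency) of a most frequent item."""
    counts = Counter(stream)
    if not counts:
        return None
    return max(counts.items(), key=lambda kv: kv[1])

rng = random.Random(0)
stream = [7, 1, 1, 3, 7, 3, 4]
subset = sample_items(7, 0.5, rng)
print(subset, restrict(stream, subset), max_oracle(restrict(stream, subset)))
```

With the subset {1, 3} from the slide, `restrict` returns `[1, 1, 3, 3]` and the maximum frequency is 2.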

12
General Idea: Max + Sampling
Restrict input to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$.
What are the chances that the maximum lies in $S_i$ = elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?
$q = (1-p)^{\sum_{j > i} s_j} \cdot (1 - (1-p)^{s_i})$
Idea:
1. Estimate $q$ as $q'$ by taking independent trials and computing the fraction of trials whose max is in $S_i$
2. If we have already estimated $s_j$ for $j > i$, solve this expression for $s_i$.
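Solving the expression for $s_i$ is a short calculation. The maximum falls in $S_i$ when no item from a higher bucket is sampled and at least one item of $S_i$ is, which gives the product form of $q$. Rearranging yields $s_i = \log(1 - q / (1-p)^{T}) / \log(1-p)$ with $T = \sum_{j > i} s_j$. The sketch below assumes $0 < p < 1$; the function and parameter names are hypothetical.

```python
import math

def max_in_bucket_probability(p, s_i, larger_total):
    """q: no item from a higher bucket is sampled, and some item of S_i is."""
    return (1 - p) ** larger_total * (1 - (1 - p) ** s_i)

def solve_bucket_size(q_est, p, larger_total):
    """Invert the expression for q to recover s_i from an estimate of q."""
    ratio = q_est / (1 - p) ** larger_total
    if not 0 <= ratio < 1:
        raise ValueError("estimate of q is inconsistent with p and the larger buckets")
    return math.log(1 - ratio) / math.log(1 - p)

q = max_in_bucket_probability(0.1, s_i=5, larger_total=3)
print(solve_bucket_size(q, 0.1, larger_total=3))
```

Feeding in the exact $q$ recovers $s_i$ up to floating-point error; with an estimate $q'$ the result inherits the estimation error.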

13
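When is this estimate any good?
Recall $q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$, so estimate $s_i$ by solving this expression with $q'$ in place of $q$.
Need:
1. Good estimates of $s_j$ for $j > i$ (holds inductively)
2. $\exists\, p$ so that $q > 1/R$, where $R$ = # trials used to estimate $q$ (tight concentration of $q'$)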

14
When is this estimate any good?
$q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$
$p$ too large? $\Rightarrow$ $q$ too small
$p$ too small? $\Rightarrow$ $q$ too small
Motivates the following:
Say a class $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
If $R = \Theta(\log n)$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$
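The trade-off in $p$ can be seen numerically. The sketch below evaluates $q$ at sampling rates $p = 2^{-j}$ and picks the level where $q$ is largest. The function names, the choice of power-of-two rates, and the example values $s_i = 10$ and $\sum_{j > i} s_j = 100$ are all assumptions made for illustration.

```python
def max_in_bucket_probability(p, s_i, larger_total):
    """q = (1-p)^(sum of larger bucket sizes) * (1 - (1-p)^s_i)."""
    return (1 - p) ** larger_total * (1 - (1 - p) ** s_i)

def best_sampling_level(s_i, larger_total, max_level=30):
    """Return the j in 0..max_level for which p = 2^-j maximizes q."""
    return max(
        range(max_level + 1),
        key=lambda j: max_in_bucket_probability(2.0 ** -j, s_i, larger_total),
    )

for j in (0, 3, 7, 12, 20):
    print(j, max_in_bucket_probability(2.0 ** -j, 10, 100))
print("best level:", best_sampling_level(10, 100))
```

At $p = 1$ every larger item survives, so $q = 0$; at very small $p$ almost nothing from $S_i$ survives, so $q$ is again tiny. The largest values occur at intermediate rates.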

15
The Idealized Algorithm
1. Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
2. Restrict stream Str to $\mathrm{Str}_j^r$, those items $i$ with $h_j^r(i) = 1$
3. For each $\mathrm{Str}_j^r$, compute $\mathrm{Max}(\mathrm{Str}_j^r)$
4. To estimate $s_i$ given $s_t$ for $t > i$, find some $j$ for which enough of the $\mathrm{Max}(\mathrm{Str}_j^r)$ come from $S_i$, and then set $s_i$ by solving the expression for $q$ with $p = 2^{-j}$
5. Output the estimate $\sum_i s_i 2^{ik}$ of $F_k$
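Steps 1 to 3, and the fraction used in step 4, can be sketched under the idealized assumptions. This is not the small-space algorithm: the hash functions are simulated with fresh random draws, Max is computed from exact counts, and every name below is hypothetical. An item is kept at level $j$ with probability $2^{-j}$, which matches requiring $h_j^r(i)$ to equal one fixed value in $[2^j]$.

```python
import random
from collections import Counter

def substream_maxima(stream, m, num_levels, R, seed=0):
    """Return maxima[j][r] = max frequency in Str_j^r (0 if that substream is empty)."""
    rng = random.Random(seed)
    freqs = Counter(stream)
    maxima = []
    for j in range(num_levels):
        row = []
        for _ in range(R):
            # Simulated h_j^r: keep item x when a uniform draw from [2^j] hits one fixed value.
            kept = [x for x in range(1, m + 1) if rng.randrange(2 ** j) == 0]
            # The max over kept items equals Max applied to the restricted stream.
            row.append(max((freqs[x] for x in kept), default=0))
        maxima.append(row)
    return maxima

def fraction_in_bucket(row, i):
    """Fraction of trials whose maximum frequency lies in [2^(i-1), 2^i)."""
    lo, hi = 2 ** (i - 1), 2 ** i
    return sum(lo <= f < hi for f in row) / len(row)

maxima = substream_maxima([7, 1, 1, 3, 7, 3, 4], m=7, num_levels=3, R=4, seed=1)
print(maxima[0], fraction_in_bucket(maxima[0], 2))
```

At level $j = 0$ every item is kept, so each trial reports the global maximum frequency. The fraction returned by `fraction_in_bucket` plays the role of the estimate of $q$ for that level.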

16
Removing the assumptions
1. Assumption: $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space
[CCF-C02]: $\exists$ a 1-pass $O(B)$-space algorithm CountSketch which, given stream Str, outputs all $x$ for which $f_x^2 \ge F_2 / B$
Lemma: If $S_i$ (items with frequency in $[2^{i-1}, 2^i)$) contributes, then [conclusion missing from this transcript]
Proof: Hölder's inequality.
Recall: $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
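For readers unfamiliar with CountSketch, here is a simplified sketch of the data structure for integer items. It is an illustration and not the construction analyzed in [CCF-C02] or the talk: the class layout, the parameters `depth` and `width`, and the linear hash family modulo a Mersenne prime are assumptions of this example. It returns point estimates of frequencies, and it leaves out the extra bookkeeping needed to list all heavy items.

```python
import random
from statistics import median

class CountSketch:
    P = (1 << 61) - 1  # prime modulus for the linear hash family (an assumption)

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Per row: one (a, b) pair for the bucket hash, one for the sign hash.
        self.bucket_params = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(depth)]
        self.sign_params = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(depth)]

    def _bucket(self, row, x):
        a, b = self.bucket_params[row]
        return ((a * x + b) % self.P) % self.width

    def _sign(self, row, x):
        a, b = self.sign_params[row]
        return 1 if ((a * x + b) % self.P) & 1 else -1

    def update(self, x, delta=1):
        """Process one stream element; delta = -1 models a deletion."""
        for row in range(len(self.table)):
            self.table[row][self._bucket(row, x)] += self._sign(row, x) * delta

    def estimate(self, x):
        """Median over rows of the signed counter that x hashes to."""
        return median([self._sign(row, x) * self.table[row][self._bucket(row, x)]
                       for row in range(len(self.table))])

sketch = CountSketch(depth=5, width=64, seed=1)
for _ in range(1000):
    sketch.update(42)
for light in range(100, 150):
    sketch.update(light)
print(sketch.estimate(42))
```

Each row's error for item 42 is a signed sum over the light items that collide with it, so here the estimate cannot be off by more than 50. Because updates are additive, deletions are handled by passing a negative `delta`.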

17
Removing the assumptions
2. We have an infinite string of random bits
[Indyk00]: Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable $C$ and updates it as follows: $C = C + f(i, R_i)$
Then $R_1, \ldots, R_n$ can be generated using Nisan's PRG, and:
1. The new algorithm $A'$ has space $\tilde{O}(S)$
2. The outputs of $A$ and $A'$ are indistinguishable
Our algorithm follows this framework

18
Conclusions
Result: Tight $\tilde{O}(m^{1-2/k})$ upper bound
  Handles deletions $(j, -)$
  $\tilde{O}(1)$ update time
Open Problem: Reduce $\tilde{O}$ factors
