Published by Noah Henry. Modified over 4 years ago.

1
Optimal Approximations of the Frequency Moments of Data Streams
Piotr Indyk, David Woodruff

2
The Streaming Model
Example stream: 7, 1, 1, 3, 7, 3, 4, …
Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$
Want to compute statistics on the stream
Elements arranged in adversarial order
Algorithms are given one pass over the stream
Goal: minimum-space algorithm

3
Frequency Moments [AMS96]
$n$ = stream size, $m$ = universe size
$f_i$ = number of occurrences of item $i$
k-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
Why are frequency moments important?
$F_0$ = number of distinct elements
$F_1 = n$ = stream size
$F_2$ = self-join size

4
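Applications
Estimating distinct elements with low space:
  Estimate query selectivity to a huge DB without sorting
  Routers gather the number of distinct destinations
$F_2$ estimates the size of self-joins. Example: joining the relation {(Bob, x), (Alice, y), (Bob, z)} with a copy {(Bob, a), (Alice, b), (Bob, c)} on the name column gives five rows:
  Alice: (b, y)
  Bob: (a, x), (a, z), (c, x), (c, z)
so the join size is $f_B^2 + f_A^2 = 4 + 1 = 5$
$F_k$ measures data skewness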
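The self-join example can be checked directly. The sketch below is illustrative only: the row values and the function names `self_join_size` and `f2` are assumptions made for this example, not part of the talk. It counts joined pairs by brute force and compares the result with the second frequency moment of the name column.

```python
from collections import Counter

# Hypothetical relation mirroring the slide's example: (name, value) rows.
rows = [("Bob", "x"), ("Alice", "y"), ("Bob", "z")]

def self_join_size(rows):
    """Count ordered pairs of rows that agree on the name column (brute force)."""
    return sum(1 for a in rows for b in rows if a[0] == b[0])

def f2(rows):
    """Second frequency moment of the name column: sum of squared frequencies."""
    counts = Counter(name for name, _ in rows)
    return sum(c * c for c in counts.values())

join_size = self_join_size(rows)
moment = f2(rows)
print(join_size, moment)
```

Both quantities agree because each name with frequency $f$ contributes $f \cdot f$ matching pairs.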

5
The Best Deterministic Algorithm
Trivial algorithm for $F_k$:
  Store and update $f_i$ for each item $i$; sum $f_i^k$ at the end
  Space = $O(m \log n)$: $m$ items $i$, $\log n$ bits to count each $f_i$
Negative results [AMS96]:
  Computing $F_k$ exactly requires $\Omega(m)$ space
  Any deterministic algorithm that outputs $X$ with $|F_k - X| < \epsilon F_k$ must use $\Omega(m)$ space
What about randomized algorithms?
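A minimal sketch of the trivial algorithm, assuming an insertion-only stream held in a Python list. The function name `exact_frequency_moment` and the sample stream are illustrative choices.

```python
from collections import Counter

def exact_frequency_moment(stream, k):
    """One pass: keep a counter f_i per distinct item, then sum f_i^k.

    Space is one counter per distinct item, which is the O(m log n) bound.
    Every stored count is at least 1, so k = 0 yields the number of
    distinct items.
    """
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return sum(c ** k for c in counts.values())

stream = [7, 1, 1, 3, 7, 3, 4]
moments = [exact_frequency_moment(stream, k) for k in range(4)]
print(moments)
```

For this stream the frequencies are 2, 2, 2 and 1, so $F_0 = 4$, $F_1 = 7$, $F_2 = 13$ and $F_3 = 25$.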

6
Randomized Approx Algs for $F_k$
A randomized algorithm $\epsilon$-approximates $F_k$ if it outputs $X$ such that $\Pr[|F_k - X| < \epsilon F_k] > 2/3$
Previous work (table suppresses polylog(mn) factors):

        Upper                                   Lower
$F_0$   $1/\epsilon^2$ [FM85, GT02, BJKST02]    $1/\epsilon^2$ [IW03, W04]
$F_1$   1                                       -
$F_2$   $1/\epsilon^2$ [AMS96]                  $1/\epsilon^2$ [W04]
$F_k$   $m^{1-1/(k-1)}$ [CK04, G04]             $m^{1-2/k}$ [BJKS02]

7
Matching Upper Bound
Our contribution: for every $k$ there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\epsilon$-approximate $F_k$
Additional features:
1. Works even if we allow deletions, that is, a stream of elements $(i, +)$, $(i, -)$
2. Constant update time

8
Techniques
Our algorithm:
1. Divide frequencies into buckets $0, [1, 2), [2, 4), [4, 8), \ldots, [2^{i-1}, 2^i), \ldots$
2. Estimate the size $s_i$ of each bucket
3. Output $X = \sum_i s_i 2^{ik}$
Previous algorithms [AMS96, CK04, G04]:
1. Cleverly construct a small-space estimator $X$ such that $E[X] = F_k$ and $\mathrm{Var}[X]$ is small
2. Apply Chebyshev's inequality
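To see what the bucketed output looks like, here is a small offline sketch that uses exact bucket sizes rather than estimated ones. The function names and the sample frequencies are assumptions for illustration. Since every frequency in bucket $i$ is at least $2^{i-1}$ and less than $2^i$, the value $X$ satisfies $F_k < X \le 2^k F_k$.

```python
from collections import Counter

def bucket_sizes(freqs):
    """s[i] = number of items whose frequency lies in [2^(i-1), 2^i)."""
    sizes = Counter()
    for f in freqs:
        if f > 0:
            # For a positive integer f, f.bit_length() == i exactly when
            # 2^(i-1) <= f < 2^i.
            sizes[f.bit_length()] += 1
    return sizes

def bucketed_moment(freqs, k):
    """X = sum_i s_i * 2^(i*k), computed from exact bucket sizes."""
    return sum(s * 2 ** (i * k) for i, s in bucket_sizes(freqs).items())

freqs = [5, 3, 3, 1, 1, 8]  # hypothetical item frequencies
k = 2
exact = sum(f ** k for f in freqs)
approx = bucketed_moment(freqs, k)
print(exact, approx)
```

With these frequencies the exact value is 109 and the bucketed value is 360, inside the factor-$2^k$ window.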

9
What's Left?
Remaining problem: estimate $s_i$ = number of elements with frequency in each bucket $[2^{i-1}, 2^i)$
Is this always easy? No.
Suppose it were always easy; then we could approximate the maximum frequency
This is HARD: $\Omega(m)$ space [AMS96]
However, the $\Omega(m)$ bound only applies to worst-case streams; otherwise we can do better: CountSketch [CCF-C]

10
For the moment, let's assume:
1. $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space (we remove this using CountSketch)
2. We have a very long RAM of random bits (we remove this using Nisan's generator)
[Figure: random bit string 0110001…; plot of frequency against items, with Max marked]

11
General Idea: Max + Sampling
Restrict the input stream to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$.
Example: stream 7, 1, 1, 3, 7, 3, 4, …; random subset = {1, 3}; restricted stream 1, 1, 3, 3, …
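The restriction step can be sketched as follows. This is an illustration under stated assumptions: the universe size `m = 8`, the helper names, and the use of an exact frequency count as a stand-in for the Max oracle are all choices made for this example.

```python
import random
from collections import Counter

def sample_subset(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {i for i in range(1, m + 1) if rng.random() < p}

def restrict(stream, subset):
    """Keep only the stream elements that belong to the sampled subset."""
    return [x for x in stream if x in subset]

def max_frequency(stream):
    """Exact stand-in for the Max oracle (0 for an empty stream)."""
    return max(Counter(stream).values(), default=0)

stream = [7, 1, 1, 3, 7, 3, 4]
restricted = restrict(stream, {1, 3})   # the subset used in the slide's example
rng = random.Random(0)                  # fixed seed only for reproducibility
trial = restrict(stream, sample_subset(8, 0.5, rng))
print(restricted, max_frequency(restricted), trial)
```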

12
General Idea: Max + Sampling
Restrict the input to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$.
What are the chances the maximum lies in $S_i$ = elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?
$q = (1-p)^{\sum_{j > i} s_j} \cdot (1 - (1-p)^{s_i})$
Idea:
1. Estimate $q$ as $\tilde{q}$ by taking independent trials and computing the fraction of trials in which the max lies in $S_i$
2. If $s_j$ has already been estimated for $j > i$, solve this expression for $s_i$.
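Solving the expression for $s_i$ is a short calculation: dividing by $(1-p)^{\sum_{j>i} s_j}$ and taking logarithms gives $s_i = \ln(1 - q / (1-p)^{\sum_{j>i} s_j}) / \ln(1-p)$. The sketch below checks this round trip numerically; the function names and the parameter `higher_total` (standing for $\sum_{j>i} s_j$) are hypothetical, and $0 < p < 1$ is assumed.

```python
import math

def max_in_bucket_probability(p, s_i, higher_total):
    """q: no item from a higher bucket is sampled, and at least one from S_i is."""
    return (1 - p) ** higher_total * (1 - (1 - p) ** s_i)

def solve_bucket_size(q, p, higher_total):
    """Invert the expression for q to recover s_i (requires 0 < p < 1)."""
    ratio = q / (1 - p) ** higher_total
    if not 0 <= ratio < 1:
        raise ValueError("q is inconsistent with p and higher_total")
    return math.log(1 - ratio) / math.log(1 - p)

q = max_in_bucket_probability(p=0.1, s_i=7, higher_total=3)
recovered = solve_bucket_size(q, p=0.1, higher_total=3)
print(q, recovered)
```

In the algorithm, `q` would be an empirical fraction and `higher_total` a sum of earlier estimates, so the recovered value is only as accurate as those inputs.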

13
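When is this estimate any good?
Recall $q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$, so estimate $s_i$ by solving for it:
$s_i = \frac{\ln\left(1 - q / (1-p)^{\sum_{j > i} s_j}\right)}{\ln(1-p)}$
Need:
1. Good estimates of $s_j$ for $j > i$ (holds inductively)
2. A good estimate of $q$. Requires $\exists\, p$ so that $q > 1/R$, where $R$ = number of trials used to estimate $q$ (tight concentration of $q$)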

14
When is this estimate any good?
$q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$
$p$ too large? $\Rightarrow$ $q$ too small
$p$ too small? $\Rightarrow$ $q$ too small
Motivates the following:
Say a class $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
If $R$ is on the order of $\log n$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$
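The trade-off in $p$ can be made concrete. Writing $T = \sum_{j>i} s_j$ and $x = 1-p$, we have $q = x^T - x^{T+s_i}$, whose derivative vanishes when $x^{s_i} = T/(T+s_i)$. The sketch below evaluates $q$ at that point and compares it with a grid of other values of $p$; the function names and the sample values $s_i = 10$, $T = 100$ are assumptions for illustration, and $T > 0$ is assumed.

```python
def q_of_p(p, s_i, higher_total):
    """Probability that the sampled maximum falls in bucket S_i."""
    return (1 - p) ** higher_total * (1 - (1 - p) ** s_i)

def best_p(s_i, higher_total):
    """Maximiser of q for higher_total > 0: (1 - p)^s_i = T / (T + s_i)."""
    return 1 - (higher_total / (higher_total + s_i)) ** (1 / s_i)

s_i, higher_total = 10, 100
p_star = best_p(s_i, higher_total)
q_star = q_of_p(p_star, s_i, higher_total)
# q is small at both extremes: almost nothing sampled, or a higher bucket wins.
q_small_p = q_of_p(p_star / 100, s_i, higher_total)
q_large_p = q_of_p(0.5, s_i, higher_total)
print(p_star, q_star, q_small_p, q_large_p)
```

Even at the best $p$, the value of $q$ shrinks as $T$ grows relative to $s_i$, which is one way to read the requirement $q > 1/R$ next to the definition of a contributing class.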

15
The Idealized Algorithm
1. Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
2. Restrict stream Str to $\mathrm{Str}_j^r$, those items $i$ with $h_j^r(i) = 1$
3. For each $\mathrm{Str}_j^r$, compute $\mathrm{Max}(\mathrm{Str}_j^r)$
4. To estimate $s_i$ given $s_t$ for $t > i$, find some $j$ for which enough of the $\mathrm{Max}(\mathrm{Str}_j^r)$ come from $S_i$, and then set $s_i$ by solving the expression for $q$ (with $p = 2^{-j}$) for $s_i$
5. Output $\sum_i s_i 2^{ik}$ as the estimate of $F_k$
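Steps 1 to 3 can be sketched as a grid of substreams, one per level $j$ and trial $r$. Everything specific here is an assumption for illustration: the linear hash $(a \cdot i + b) \bmod P$, keeping an item when its hash is 0 modulo $2^j$ (standing in for $h_j^r(i) = 1$), the exact frequency table used in place of the Max oracle, and the name `max_bucket_counts`. The result records, for each level, how many trials had their maximum in each bucket, which is the raw data step 4 needs.

```python
import math
import random
from collections import Counter

P = 2_147_483_647  # prime 2^31 - 1, assumed larger than every item id

def max_bucket_counts(stream, m, R, seed=0):
    """counts[j][i] = number of the R trials at level j whose Max lies in bucket i.

    Bucket 0 means the substream was empty. Level j keeps an item with
    probability about 2^-j.
    """
    rng = random.Random(seed)
    freqs = Counter(stream)               # exact stand-in for the Max oracle
    levels = max(1, math.ceil(math.log2(m)))
    counts = [Counter() for _ in range(levels + 1)]
    for j in range(levels + 1):
        for _ in range(R):
            a, b = rng.randrange(1, P), rng.randrange(P)
            kept = [f for i, f in freqs.items()
                    if ((a * i + b) % P) % (1 << j) == 0]
            top = max(kept, default=0)
            counts[j][top.bit_length()] += 1
    return counts

stream = [7, 1, 1, 3, 7, 3, 4]
counts = max_bucket_counts(stream, m=8, R=20)
for j, c in enumerate(counts):
    print(j, dict(c))
```

At level 0 every item is kept, so all trials report the bucket of the global maximum; deeper levels increasingly report lower buckets or empty substreams.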

16
Removing the assumptions
1. Assumption: $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space
[CCF-C02]: $\exists$ a 1-pass $O(B)$-space algorithm CountSketch which, given stream Str, outputs all $x$ for which $f_x^2 \ge F_2 / B$
Lemma: If $S_i = [2^{i-1}, 2^i)$ contributes, then [formula not recoverable from the slide]
Proof: Hölder's inequality.
Recall: $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
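For readers unfamiliar with CountSketch, here is a minimal point-query sketch of the data structure. It is a simplified illustration, not the construction used in the talk: the class layout, the linear hashes modulo a prime, and the default depth of 5 rows are assumptions, and it only estimates the frequency of a queried item (listing all heavy items, as in the statement above, needs extra candidate tracking that is omitted here). Updates are linear, so deletions are handled by negative deltas.

```python
import random
import statistics

P = 2_147_483_647  # prime 2^31 - 1, assumed larger than every item id

class CountSketch:
    def __init__(self, width, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Per row: (a, b) pick the bucket, (c, d) pick the sign.
        self.params = [tuple(rng.randrange(1, P) for _ in range(4))
                       for _ in range(depth)]

    def _cell(self, row, x):
        a, b, c, d = self.params[row]
        bucket = ((a * x + b) % P) % self.width
        sign = 1 if ((c * x + d) % P) % 2 == 0 else -1
        return bucket, sign

    def update(self, x, delta=1):
        """Process (x, +) with delta=1 or (x, -) with delta=-1."""
        for row in range(len(self.table)):
            bucket, sign = self._cell(row, x)
            self.table[row][bucket] += sign * delta

    def estimate(self, x):
        """Median over rows of the signed counter that x hashes to."""
        values = []
        for row in range(len(self.table)):
            bucket, sign = self._cell(row, x)
            values.append(sign * self.table[row][bucket])
        return statistics.median(values)

sketch = CountSketch(width=16)
for _ in range(5):
    sketch.update(3)
sketch.update(3, -1)
sketch.update(3, -1)
print(sketch.estimate(3))
```

With a single item in the stream there are no collisions, so the estimate equals the net frequency exactly; with many items the estimate carries an error that depends on $F_2$ and the table width.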

17
Removing the assumptions
2. We have an infinite string of random bits
[Indyk00]: Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable $C$ and updates it as follows: $C = C + f(i, R_i)$
Then $R_1, \ldots, R_n$ can be generated using Nisan's PRG, and:
1. The new algorithm $A'$ has space $\tilde{O}(S)$
2. The outputs of $A$ and $A'$ are indistinguishable
Our algorithm follows this framework
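The recursive shape of Nisan's generator can be sketched briefly. A seed consists of one starting block $x$ and $t$ hash functions, and the output is $G_t(x) = G_{t-1}(x) \,\|\, G_{t-1}(h_t(x))$ with $G_0(x) = x$, giving $2^t$ blocks. The code below is a toy illustration under stated assumptions: blocks are integers modulo a prime and each hash is a linear map $a x + b \bmod P$, which is a simplification of the hash family used in the real construction, and the name `nisan_blocks` is hypothetical.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1; blocks are integers modulo P (an assumption)

def nisan_blocks(x, hashes):
    """Expand block x with t hashes into 2^t blocks.

    G_0(x) = [x];  G_t(x) = G_{t-1}(x) + G_{t-1}(h_t(x)).
    """
    if not hashes:
        return [x]
    *rest, (a, b) = hashes
    return nisan_blocks(x, rest) + nisan_blocks((a * x + b) % P, rest)

rng = random.Random(0)  # fixed seed only for reproducibility
hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(3)]
x = rng.randrange(P)
blocks = nisan_blocks(x, hashes)   # plays the role of R_1, ..., R_8
print(len(blocks), blocks[:2])
```

The seed here is one block plus three hash descriptions, while the output has eight blocks; in general $t$ hashes yield $2^t$ blocks, which is why the streaming algorithm does not need to store $R_1, \ldots, R_n$ explicitly.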

18
Conclusions
Result: tight $\tilde{O}(m^{1-2/k})$ upper bound
Handles deletions $(j, -)$
$\tilde{O}(1)$ update time
Open problem: reduce the $\tilde{O}$ factors
