# Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.

The Streaming Model. A stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$, for example 7, 1, 1, 3, 7, 3, 4, … We want to compute statistics on the stream. Elements arrive in adversarial order, and the algorithm is given one pass over the stream. Goal: a minimum-space algorithm.

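Frequency Moments [AMS96]. Let $n$ be the stream size, $m$ the universe size, and $f_i$ the number of occurrences of item $i$. The $k$-th moment is $F_k = \sum_{i=1}^{m} f_i^k$. Why are frequency moments important? $F_0$ is the number of distinct elements, $F_1 = n$ is the stream size, and $F_2$ is the self-join size.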

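Applications. Estimating the number of distinct elements with low space: estimate query selectivity against a huge database without sorting; routers gather the number of distinct destinations. $F_2$ estimates the size of self-joins. Example: joining the rows (Bob, x), (Alice, y), (Bob, z) with the rows (Bob, a), (Alice, b), (Bob, c) on the name column gives five rows, (Alice, b, y), (Bob, a, x), (Bob, a, z), (Bob, c, x), (Bob, c, z), matching $f_B^2 + f_A^2 = 4 + 1 = 5$. $F_k$ measures data skewness.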
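The link between $F_2$ and self-join size can be checked on the small example. The sketch below is illustrative only: the row contents are reconstructed from the slide's example, and the helper names `join_size` and `second_moment` are hypothetical.

```python
from collections import Counter

# Rows reconstructed from the slide's example; the join key comes first.
table = [("Bob", "x"), ("Alice", "y"), ("Bob", "z")]

def join_size(left_rows, right_rows):
    """Count the rows produced by an equi-join on the first column."""
    right_counts = Counter(key for key, _ in right_rows)
    return sum(right_counts[key] for key, _ in left_rows)

def second_moment(keys):
    """F_2: the sum of squared frequencies of the keys."""
    return sum(c * c for c in Counter(keys).values())

keys = [key for key, _ in table]
print(join_size(table, table))  # self-join size
print(second_moment(keys))      # F_2 of the key column
```

Both calls return the same number because each key with frequency $f$ contributes $f \cdot f$ rows to the self-join.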

The Best Deterministic Algorithm. Trivial algorithm for $F_k$: store and update $f_i$ for each item $i$, then sum $f_i^k$ at the end. Space = $O(m \log n)$: $m$ items $i$, and $\log n$ bits to count each $f_i$. Negative results [AMS96]: computing $F_k$ exactly requires $\Omega(m)$ space, and any deterministic algorithm that outputs $X$ with $|F_k - X| < \varepsilon F_k$ must use $\Omega(m)$ space. What about randomized algorithms?
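A minimal sketch of the trivial algorithm, assuming the stream fits in a Python list; the function name `exact_frequency_moment` is hypothetical. It keeps one counter per distinct item, which is exactly the $O(m \log n)$ space the slide describes.

```python
from collections import Counter

def exact_frequency_moment(stream, k):
    """Compute F_k exactly by storing one counter per distinct item."""
    counts = Counter()
    for item in stream:          # single pass over the stream
        counts[item] += 1        # update f_i
    # Only items that occurred are stored, so k = 0 counts distinct items.
    return sum(c ** k for c in counts.values())

stream = [7, 1, 1, 3, 7, 3, 4]
print(exact_frequency_moment(stream, 0))  # number of distinct elements
print(exact_frequency_moment(stream, 1))  # stream length
print(exact_frequency_moment(stream, 2))  # self-join size
```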

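Randomized Approx Algs for $F_k$. A randomized algorithm $\varepsilon$-approximates $F_k$ if it outputs $X$ such that $\Pr[|F_k - X| < \varepsilon F_k] > 2/3$. Previous work (the table suppresses polylog($mn$) factors):

| | Upper | Lower |
|---|---|---|
| $F_0$ | $1/\varepsilon^2$ [FM85, GT02, BJKST02] | $1/\varepsilon^2$ [IW03, W04] |
| $F_1$ | 1 | - |
| $F_2$ | $1/\varepsilon^2$ [AMS96] | $1/\varepsilon^2$ [W04] |
| $F_k$ | $m^{1-1/(k-1)}$ [CK04, G04] | $m^{1-2/k}$ [BJKS02] |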

Matching Upper Bound. Our contribution: for every $k$ there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\varepsilon$-approximate $F_k$. Additional features:

1. Works even if we allow deletions, that is, a stream of elements $(i, +)$, $(i, -)$.
2. Constant update time.

Techniques. Our algorithm:

1. Divide frequencies into buckets $0, [1, 2), [2, 4), [4, 8), \ldots, [2^{i-1}, 2^i), \ldots$
2. Estimate the size $s_i$ of each bucket.
3. Output $X = \sum_i s_i 2^{ik}$.

Previous algorithms [AMS96, CK04, G04]:

1. Cleverly construct a small-space estimator $X$ such that $E[X] = F_k$ and $\mathrm{Var}[X]$ is small.
2. Apply Chebyshev's inequality.
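To see what the bucketed output $X = \sum_i s_i 2^{ik}$ looks like, here is a sketch that computes it from exact frequencies. This is only an illustration of the rounding step, not the small-space algorithm: the bucket sizes $s_i$ are computed exactly here, and the names `bucket_index` and `bucketed_moment` are hypothetical.

```python
from collections import Counter

def bucket_index(f):
    """Return i >= 1 with 2**(i-1) <= f < 2**i, for an integer f >= 1."""
    return f.bit_length()

def bucketed_moment(stream, k):
    """X = sum_i s_i * 2**(i*k), using exact bucket sizes s_i."""
    counts = Counter(stream)
    sizes = Counter(bucket_index(f) for f in counts.values())  # s_i per bucket
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

stream = [7, 1, 1, 3, 7, 3, 4]
exact = sum(c ** 2 for c in Counter(stream).values())
print(exact, bucketed_moment(stream, 2))
```

Because every frequency in bucket $i$ satisfies $2^{i-1} \le f < 2^i$, this rounding alone gives $F_k < X \le 2^k F_k$ on a non-empty stream. Buckets with a ratio closer to 1 would presumably be needed to tighten that factor.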

What's Left? Remaining problem: estimate $s_i$, the number of elements with frequency in each bucket $[2^{i-1}, 2^i)$. Is this always easy? No. If it were always easy, we could approximate the maximum frequency, and that is hard: it needs $\Omega(m)$ space [AMS96]. However, the $\Omega(m)$ bound only applies to worst-case streams; otherwise we can do better with CountSketch [CCF-C].

For the moment, let's assume:

1. There exists a 1-pass oracle Max returning the maximum frequency using $O(B)$ space (we remove this assumption using CountSketch).
2. We have a very long RAM of random bits, 0110001… (we remove this assumption using Nisan's generator).

[Figure: items plotted against frequency, with the maximum marked Max.]

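General Idea: Max + Sampling. Restrict the input stream to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$. Example: for the stream 7, 1, 1, 3, 7, 3, 4, … and the random subset $\{1, 3\}$, the restricted stream keeps only the occurrences of 1 and 3.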
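The restriction step can be sketched in a few lines. The helper names `sample_subset`, `restrict`, and `max_frequency` are hypothetical, and `max_frequency` stands in for the Max oracle by counting exactly, which the real algorithm cannot afford.

```python
import random
from collections import Counter

def sample_subset(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {item for item in range(1, m + 1) if rng.random() < p}

def restrict(stream, subset):
    """Keep only the stream elements whose item was sampled."""
    return [item for item in stream if item in subset]

def max_frequency(stream):
    """Exact stand-in for the Max oracle: the largest frequency in the stream."""
    counts = Counter(stream)
    return max(counts.values()) if counts else 0

stream = [7, 1, 1, 3, 7, 3, 4]
restricted = restrict(stream, {1, 3})       # the slide's example subset
print(restricted, max_frequency(restricted))

rng = random.Random(0)                       # fixed seed, for repeatability only
print(restrict(stream, sample_subset(7, 0.5, rng)))
```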

General Idea: Max + Sampling. Restrict the input to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$. What are the chances that the maximum lies in $S_i$, the set of elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?

$$q = (1-p)^{\sum_{j > i} s_j} \cdot \left(1 - (1-p)^{s_i}\right)$$

Idea:

1. Estimate $q$ by taking independent trials and computing the fraction of trials in which the maximum falls in $S_i$.
2. If $s_j$ has already been estimated for all $j > i$, solve this expression for $s_i$.
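Solving the expression for $s_i$ is a short piece of algebra. The abbreviation $T_i = \sum_{j > i} s_j$ is introduced here for convenience and is not notation from the slides; the derivation assumes $0 < p < 1$ and $q < (1-p)^{T_i}$.

$$
\begin{aligned}
q &= (1-p)^{T_i}\left(1 - (1-p)^{s_i}\right) \\
(1-p)^{s_i} &= 1 - \frac{q}{(1-p)^{T_i}} \\
s_i &= \frac{\ln\left(1 - q\,(1-p)^{-T_i}\right)}{\ln(1-p)}
\end{aligned}
$$

The first factor is the probability that no element of a higher class is sampled, and the second is the probability that at least one element of $S_i$ is sampled.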

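When is this estimate any good? Recall $q = (1-p)^{\sum_{j > i} s_j}\left(1 - (1-p)^{s_i}\right)$, so $s_i$ can be estimated by solving this expression for it. We need:

1. Accurate estimates of $s_j$ for $j > i$ (holds inductively).
2. Some $p$ for which $q > 1/R$, where $R$ is the number of trials used to estimate $q$ (this gives tight concentration of the estimate of $q$).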
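A small simulation can show how the estimate behaves. This is a sketch under simplifying assumptions: the sizes of the higher classes are known exactly, sampling uses true randomness from a seeded `random.Random`, and the function names are hypothetical.

```python
import math
import random

def prob_max_in_class(p, s_i, higher):
    """q = (1-p)^higher * (1 - (1-p)^s_i), where higher = sum of s_j for j > i."""
    return (1 - p) ** higher * (1 - (1 - p) ** s_i)

def estimate_q(p, s_i, higher, trials, rng):
    """Fraction of trials in which the sampled maximum falls in class i."""
    hits = 0
    for _ in range(trials):
        # No element of a higher class may be sampled...
        none_higher = all(rng.random() >= p for _ in range(higher))
        # ...and at least one element of class i must be sampled.
        some_in_class = any(rng.random() < p for _ in range(s_i))
        hits += none_higher and some_in_class
    return hits / trials

def solve_for_size(q, p, higher):
    """Invert the expression for q to recover s_i (requires q < (1-p)^higher)."""
    return math.log(1 - q / (1 - p) ** higher) / math.log(1 - p)

p, s_i, higher = 0.1, 20, 5          # illustrative values, not from the slides
q = prob_max_in_class(p, s_i, higher)
q_hat = estimate_q(p, s_i, higher, trials=4000, rng=random.Random(1))
print(q, q_hat, solve_for_size(q_hat, p, higher))
```

Trying a much larger or much smaller `p` makes `q` tiny, so far more trials are needed before `q_hat` is useful.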

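When is this estimate any good? This motivates the following definition: a class $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$. If $R$ is on the order of $\log n$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$. Recall $q = (1-p)^{\sum_{j > i} s_j}\left(1 - (1-p)^{s_i}\right)$. If $p$ is too large, $q$ is too small, because some element of a higher class is almost surely sampled. If $p$ is too small, $q$ is also too small, because no element of $S_i$ is likely to be sampled.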

The Idealized Algorithm.

1. Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$.
2. Restrict the stream Str to $\mathrm{Str}_j^r$, those items $i$ with $h_j^r(i) = 1$.
3. For each $\mathrm{Str}_j^r$, compute $\mathrm{Max}(\mathrm{Str}_j^r)$.
4. To estimate $s_i$ given $s_t$ for $t > i$, find some $j$ for which enough of the $\mathrm{Max}(\mathrm{Str}_j^r)$ come from $S_i$, and then set $s_i$ by solving the expression for $q$.
5. Output the estimate $\sum_i s_i 2^{ik}$ of $F_k$.

Removing the assumptions. 1. Assumption: there exists a 1-pass oracle Max returning the maximum frequency using $O(B)$ space. [CCF-C02]: there exists a 1-pass $O(B)$-space algorithm CountSketch which, given a stream Str, outputs all $x$ for which $f_x^2 \ge F_2/B$. Lemma: if $S_i$ (the elements with frequency in $[2^{i-1}, 2^i)$) contributes, then […]. Proof: Hölder's inequality. Recall: $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$.
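For readers unfamiliar with CountSketch, here is a minimal sketch of the data structure in its usual textbook form. It is not the authors' code: the linear hash functions modulo a Mersenne prime, the class layout, and the parameter names `depth` and `width` are assumptions of this example, and it only answers point queries rather than listing all heavy items.

```python
import random
from statistics import median

PRIME = (1 << 61) - 1  # modulus for the linear hash functions (an assumption)

class CountSketch:
    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        # Per row: coefficients (a, b) for the bucket hash, (c, d) for the sign hash.
        self.params = [tuple(rng.randrange(1, PRIME) for _ in range(4))
                       for _ in range(depth)]

    def _bucket_and_sign(self, row, item):
        a, b, c, d = self.params[row]
        bucket = ((a * item + b) % PRIME) % self.width
        sign = 1 if ((c * item + d) % PRIME) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, delta=1):
        """Process (item, +) with delta=1 or (item, -) with delta=-1."""
        for row in range(len(self.rows)):
            bucket, sign = self._bucket_and_sign(row, item)
            self.rows[row][bucket] += sign * delta

    def estimate(self, item):
        """Median over rows of the signed counter that the item hashes to."""
        values = []
        for row in range(len(self.rows)):
            bucket, sign = self._bucket_and_sign(row, item)
            values.append(sign * self.rows[row][bucket])
        return median(values)

sketch = CountSketch(depth=5, width=64)
for _ in range(1000):
    sketch.update(42)            # one heavy item
for light in range(1, 51):
    sketch.update(light)         # fifty items seen once each
print(sketch.estimate(42))
```

Since updates are additive, deletions are handled by passing a negative `delta`, which matches the $(i, +)$, $(i, -)$ stream model mentioned earlier in the talk.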

Removing the assumptions. 2. Assumption: we have an infinite string of random bits. Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, \ldots, R_n$, that maintains a variable $C$ while processing a stream and updates it as $C = C + f(i, R_i)$ [Indyk00]. Then $R_1, \ldots, R_n$ can be generated using Nisan's PRG, and:

1. The new algorithm $A'$ has space $\tilde{O}(S)$.
2. The outputs of $A$ and $A'$ are indistinguishable.

Our algorithm follows this framework.

Conclusions. Result: a tight $\tilde{O}(m^{1-2/k})$ upper bound, which handles deletions $(j, -)$ and has $\tilde{O}(1)$ update time. Open problem: reduce the $\tilde{O}$ factors.
