# Algorithms for Large Data Sets
Ziv Bar-Yossef, Lecture 12, June 18, 2006


1 Algorithms for Large Data Sets
Course website: http://www.ee.technion.ac.il/courses/049011

2 Data Streams

3 Outline
- The data stream model
- Approximate counting
- Distinct elements
- Frequency moments

4 The Data Stream Model
f: Aⁿ → B
- A, B are arbitrary sets; n is a positive integer (think of n as large)
- Given x ∈ Aⁿ, each entry x_i is called an "element"
- Typically, A and B are "small" (constant-size) sets

Goal: given x ∈ Aⁿ, compute f(x)
- Frequently, an approximation of f(x) suffices
- Usually, we will use randomization

Streaming access to the input:
- The algorithm reads the input in "sequential passes"
- In each pass, x is read in the order x_1, x_2, …, x_n
- Impossible: random access, going backwards
- Possible: storing portions of x (or other functions of x) in memory

5 Complexity Measures
Space
- Objective: use as little memory as possible
- Note: with unlimited space, the data stream model is the same as the standard RAM model
- Ideally, at most O(log^c n) for some constant c

Number of passes
- Objective: use as few passes as possible
- Ideally, only a single pass
- Usually, no more than a constant number of passes

Running time
- Objective: use as little time as possible
- Ideally, at most O(n log^c n) for some constant c

6 Motivation
Types of large data sets:
- Pre-stored: kept on magnetic or optical media (tapes, disks, DVDs, …)
- Generated on the fly: data feeds, streaming media, packet streams, …

Access to large data sets:
- Random access: costly (if the data is pre-stored), infeasible (if the data is generated on the fly)
- Streaming access: the only feasible option

Resources:
- Memory: the primary bottleneck
- Number of passes: a few (if the data is pre-stored), a single pass (if the data is generated on the fly)
- Time: cannot be more than quasi-linear

7 Approximate Counting [Morris 77, Flajolet 85]
Input: a bitstring x ∈ {0,1}ⁿ
Goal: find H = the number of 1's in x

Naïve solution: just count them!
- O(log H) bits of space

Can we do better?
- Answer 1: No! Information theory implies an Ω(log H) lower bound.
- Answer 2: Yes! But only approximately:
  - Output the closest power of 1+ε to H
  - Note: the number of possible outputs is O(log_{1+ε} H) = O((1/ε) log H)
  - Hence, only O(log log H + log(1/ε)) bits of space suffice

8 Approximate Counting (ε = 1)
k ← 0
for i = 1 to n do
- if x_i = 1, then with probability 1/2^k, set k ← k + 1
output 2^k − 1

General idea:
- The expected number of 1's needed to increment k to k + 1 is 2^k
- k = 0 → k = 1: after seeing 1 one
- k = 1 → k = 2: after seeing 2 additional 1's
- k = 2 → k = 3: after seeing 4 additional 1's
- …
- k = i−1 → k = i: after seeing 2^{i−1} additional 1's
- Therefore, we expect k to reach i after seeing 1 + 2 + 4 + … + 2^{i−1} = 2^i − 1 1's
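The counter above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code; the function name and the choice to average repeated runs to exhibit unbiasedness are ours.

```python
import random

def morris_count(bits, rng):
    # Morris's approximate counter (eps = 1): stores only k, i.e.
    # O(log log H) bits, instead of the count itself.
    k = 0
    for b in bits:
        if b == 1 and rng.random() < 0.5 ** k:
            k += 1
    return 2 ** k - 1  # E[2^k - 1] = H, the number of 1's seen

# Averaging many independent runs should land near the true count H,
# since the estimator is unbiased.
rng = random.Random(0)
H = 1000
stream = [1] * H
mean = sum(morris_count(stream, rng) for _ in range(500)) / 500
```

A single run only guarantees a factor-2 approximation with high probability, which is exactly what the variance bound on the next slides establishes.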

9 Approximate Counting: Analysis
For m = 0, …, H, let K_m = the value of the counter after seeing m 1's.
For i = 0, …, m, let p_{m,i} = Pr[K_m = i].

Recursion:
- p_{0,0} = 1
- p_{m,0} = 0, for m = 1, …, H
- p_{m,i} = p_{m−1,i} (1 − 1/2^i) + p_{m−1,i−1} · 1/2^{i−1}, for m = 1, …, H and i = 1, …, m−1
- p_{m,m} = p_{m−1,m−1} · 1/2^{m−1}, for m = 1, …, H

10 Approximate Counting: Analysis
Define C_m = 2^{K_m}.
Lemma: E[C_m] = m + 1.
Therefore, C_H − 1 is an unbiased estimator for H.
One can show that Var[C_H] is small, and therefore w.h.p. H/2 ≤ C_H − 1 ≤ 2H.

Proof of lemma: by induction on m.
- Basis: E[C_0] = 1, E[C_1] = 2.
- Inductive step: suppose m ≥ 2 and E[C_{m−1}] = m.

11 Approximate Counting: Analysis
Using the recursion for p_{m,i}:
E[C_m] = Σ_i p_{m,i} · 2^i
  = Σ_i [ p_{m−1,i} (1 − 1/2^i) + p_{m−1,i−1} · 1/2^{i−1} ] · 2^i
  = Σ_i p_{m−1,i} (2^i − 1) + Σ_i p_{m−1,i−1} · 2
  = (E[C_{m−1}] − 1) + 2
  = m + 1.
(The second sum equals 2 because the probabilities p_{m−1,i−1} sum to 1.)

12 Better Approximation
So far, we have a factor-2 approximation. How do we obtain a 1+ε approximation?
k ← 0
for i = 1 to n do
- if x_i = 1, then with probability 1/(1+ε)^k, set k ← k + 1
output ((1+ε)^k − 1)/ε
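The base-(1+ε) variant is the same sketch with 2 replaced by 1+ε; the probability of incrementing drops more slowly, so the counter resolves finer multiplicative scales. Again a minimal illustration with names of our choosing:

```python
import random

def morris_count_eps(bits, eps, rng):
    # Approximate counter with base (1 + eps):
    # O(log log H + log(1/eps)) bits of space.
    k = 0
    for b in bits:
        if b == 1 and rng.random() < (1.0 + eps) ** (-k):
            k += 1
    return ((1.0 + eps) ** k - 1.0) / eps  # unbiased estimate of H

rng = random.Random(0)
H = 1000
mean = sum(morris_count_eps([1] * H, 0.1, rng) for _ in range(500)) / 500
```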

13 Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
Input: a vector x ∈ {1, 2, …, m}ⁿ
Goal: find D = the number of distinct elements of x
- Example: if x = (1,2,3,1,2,3), then D = 3

Naïve solution: use a bit vector of size m, and track the values that appear at least once
- O(m) bits of space

Can we do better?
- Answer 1: No! If we want the exact number, we need Ω(m) bits of space.
  - Information theory gives only Ω(log m); the Ω(m) bound needs communication complexity arguments.
- Answer 2: Yes! But only approximately, using only O(log m) bits of space.

14 Estimating the Size of a Random Set
Suppose we choose D << M^{1/2} elements uniformly and independently from {1, …, M}:
- X_1 is uniformly chosen from {1, …, M}
- X_2 is uniformly chosen from {1, …, M}
- …
- X_D is uniformly chosen from {1, …, M}

For each k = 1, …, D, how many elements of {1, …, M} do we expect to be smaller than min{X_1, …, X_k}?
- k = 1: we expect M/2 elements to be less than X_1
- k = 2: we expect M/3 elements to be less than min{X_1, X_2}
- k = 3: we expect M/4 elements to be less than min{X_1, X_2, X_3}
- …
- k = D: we expect M/(D+1) elements to be less than min{X_1, …, X_D}

Conversely, suppose S is a set of randomly chosen elements from {1, …, M} whose size is unknown. Then, if t = min S, we can estimate |S| as M/t − 1.
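This "estimate the size from the minimum" idea is easy to check by simulation. The sketch below (our own illustration; the constants D and M are arbitrary) draws D uniform samples and estimates D as M/min − 1; since a single estimate is heavy-tailed, we take a median over many trials:

```python
import random
import statistics

M = 10 ** 8
D = 100

def estimate_size(rng):
    # Draw D uniform samples from {1, ..., M}; their minimum sits near
    # M/(D+1), so M/min - 1 roughly recovers the (unknown) sample count.
    t = min(rng.randrange(1, M + 1) for _ in range(D))
    return M / t - 1

rng = random.Random(0)
median_est = statistics.median(estimate_size(rng) for _ in range(501))
```

The median estimate lands within a small constant factor of D, matching the factor-6 guarantee proved for the streaming algorithm below.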

15 Distinct Elements, 1st Attempt
Let M >> m².
Pick a random "hash function" h: {1, …, m} → {1, …, M}:
- h(1), …, h(m) are chosen uniformly and independently from {1, …, M}
- Since M >> m², the probability of collisions is tiny

min ← M
for i = 1 to n do
- if h(x_i) < min, min ← h(x_i)
output M/min

16 Distinct Elements: Analysis
Space: O(log M) = O(log m)
- Not quite; we'll discuss this later.

Correctness:
- Let a_1, …, a_D be the distinct values among x_1, …, x_n
- S = {h(a_1), …, h(a_D)} is a set of D random and independent elements from {1, …, M}
- Note: min = min S
- The algorithm outputs M/(min S)

Lemma: With probability at least 2/3, D/6 ≤ M/min ≤ 6D.

17 Distinct Elements: Correctness
Part 1: show that Pr[M/min > 6D] ≤ 1/6.
Define for k = 1, …, D: Y_k = 1 if h(a_k) < M/(6D), and Y_k = 0 otherwise.
Define: Y = Y_1 + … + Y_D.
Note: M/min > 6D iff min < M/(6D) iff Y ≥ 1. Since E[Y_k] ≈ 1/(6D), we have E[Y] ≈ 1/6.

18 Markov's Inequality
X ≥ 0: a non-negative random variable, t > 1.
Then: Pr[X ≥ t · E[X]] ≤ 1/t.
Need to show: Pr[Y ≥ 1] ≤ 1/6.
By Markov's inequality (with E[Y] = 1/6 and t = 6): Pr[Y ≥ 1] = Pr[Y ≥ 6 · E[Y]] ≤ 1/6.

19 Distinct Elements: Correctness
Part 2: show that Pr[M/min < D/6] ≤ 1/6.
Define for k = 1, …, D: Y_k = 1 if h(a_k) ≤ 6M/D, and Y_k = 0 otherwise.
Define: Y = Y_1 + … + Y_D.
Note: M/min < D/6 iff min > 6M/D iff Y = 0. Since E[Y_k] ≈ 6/D, we have E[Y] ≈ 6.

20 Chebyshev's Inequality
X: an arbitrary random variable, t > 0.
Then: Pr[|X − E[X]| ≥ t] ≤ Var[X]/t².
Need to show: Pr[Y = 0] ≤ 1/6.
By Chebyshev's inequality, Pr[Y = 0] ≤ Pr[|Y − E[Y]| ≥ 6] ≤ Var[Y]/36.
By independence of Y_1, …, Y_D: Var[Y] = Var[Y_1] + … + Var[Y_D] ≤ E[Y] = 6.
Hence, Pr[Y = 0] ≤ 6/36 = 1/6.

21 How to Store the Hash Function?
How many bits are needed to represent a random hash function h: [m] → [M]?
- O(m log M) = O(m log m) bits
- More than the naïve algorithm!

Solution: use "small" families of hash functions
- H will be a family of functions h: [m] → [M]
- |H| = O(m^c) for some constant c
- Each h ∈ H can be represented in O(log m) bits
- Need H to be "explicit": given the representation of h, we can compute h(x) efficiently for any x
- How do we make sure H has the "random-like" properties of totally random hash functions?

22 Universal Hash Functions [Carter, Wegman 79]
H is a 2-universal family of hash functions if:
- For all x ≠ y ∈ [m] and for all z, w ∈ [M], when h is chosen from H at random, Pr[h(x) = z and h(y) = w] = 1/M²

Conclusions:
- For each x, h(x) is uniform in [M]
- For all x ≠ y, h(x) and h(y) are independent
- h(1), …, h(m) is a sequence of uniform, pairwise-independent random variables

k-universal families: a straightforward generalization

23 Construction of a Universal Family
Suppose m = M and m is a prime power. [m] can then be identified with the finite field F_m.
Each pair of elements a, b ∈ F_m defines one hash function in H:
- h_{a,b}(x) = ax + b (operations in F_m)
- |H| = |F_m|² = m²

Note: if x ≠ y ∈ [m] and z, w ∈ [m], then h_{a,b}(x) = z and h_{a,b}(y) = w iff
  ax + b = z and ay + b = w.
Since x ≠ y, this linear system has a unique solution (a, b), and thus if we choose a, b at random, the probability of hitting that solution is exactly 1/m².
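For a small prime m, the 1/m² property can be verified exhaustively: over all m² pairs (a, b), exactly one satisfies both constraints. A minimal check (the particular values of m, x, y, z, w below are arbitrary):

```python
m = 7  # a prime, so Z_m is a field

def h(a, b, x):
    # h_{a,b}(x) = ax + b, with arithmetic in F_m
    return (a * x + b) % m

# For fixed x != y and any targets z, w, the linear system ax+b=z, ay+b=w
# has exactly one solution (a, b), so exactly 1 of the m^2 functions hits it.
x, y, z, w = 2, 5, 3, 6
hits = sum(1 for a in range(m) for b in range(m)
           if h(a, b, x) == z and h(a, b, y) == w)
```

So a uniformly random (a, b) gives Pr[h(x) = z and h(y) = w] = 1/m², which is the definition of 2-universality on the previous slide.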

24 Distinct Elements, 2nd Attempt
Use a random hash function from a 2-universal family of hash functions, rather than a totally random hash function.

Space:
- O(log m) for tracking the minimum
- O(log m) for storing the hash function

Correctness:
- Part 1: h(a_1), …, h(a_D) are still uniform in [M]
  - Linearity of expectation holds regardless of whether Y_1, …, Y_D are independent or not.
- Part 2: h(a_1), …, h(a_D) are still uniform in [M]
  - Main point: the variance of pairwise-independent variables is additive: Var[Y_1 + … + Y_D] = Var[Y_1] + … + Var[Y_D]
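Putting the pieces together, here is a sketch of the full estimator with a hash of the form ax + b mod M drawn from a 2-universal family (M prime). Function names and parameters are ours; since any single run only succeeds with probability 2/3, we take a median over independent runs:

```python
import random
import statistics

def distinct_estimate(stream, M, rng):
    # Pick h(x) = (a*x + b) mod M from a 2-universal family (M prime);
    # track only the minimum hash value seen, then output M/min.
    a = rng.randrange(1, M)
    b = rng.randrange(M)
    lo = min((a * x + b) % M for x in stream)
    return M / max(lo, 1)  # guard against a zero minimum

M = 1000003  # a prime larger than m^2 for element universe m = 1000
stream = [i % 250 for i in range(5000)]  # D = 250 distinct values
rng = random.Random(0)
median_est = statistics.median(distinct_estimate(stream, M, rng)
                               for _ in range(25))
```

The median of independent runs boosts the per-run 2/3 success probability, a standard trick that the lecture's factor-6 lemma justifies.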

25 Distinct Elements, Better Approximation
So far we have a factor-6 approximation. How do we get a better one?

1+ε approximation algorithm:
- Track the t = O(1/ε²) smallest hash values, rather than just the smallest one.
- If v is the largest among these, output tM/v.

Space: O((1/ε²) log m)
- A better algorithm achieves O(1/ε² + log m)
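The t-smallest-values idea can be sketched as follows. This illustration simulates a truly random hash by memoizing fresh random draws (our own shortcut, not the lecture's construction), and uses `heapq.nsmallest` in place of the small bounded buffer a streaming implementation would maintain:

```python
import heapq
import random

def kmv_estimate(stream, t, M, rng):
    # Keep the t smallest hash values; if v is the largest of those
    # (the t-th smallest overall), output t*M/v.
    table = {}
    def h(x):
        # simulate a truly random hash by memoizing uniform draws
        if x not in table:
            table[x] = rng.randrange(1, M + 1)
        return table[x]
    smallest = heapq.nsmallest(t, {h(x) for x in stream})
    v = smallest[-1]
    return t * M / v

rng = random.Random(0)
M = 10 ** 12
stream = [i % 400 for i in range(8000)]  # D = 400 distinct values
est = kmv_estimate(stream, 64, M, rng)
```

With t = 64 the relative error concentrates around 1/√t ≈ 12%, consistent with t = O(1/ε²) for a 1+ε approximation.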

26 Frequency Moments [Alon, Matias, Szegedy 96]
Input: a vector x ∈ {1, 2, …, m}ⁿ
Goal: find F_k = the k-th frequency moment of x, F_k = Σ_j f_j^k, where for each j ∈ {1, …, m}, f_j = the number of occurrences of j in x
- Example: if x = (1,1,1,2,2,3) then f_1 = 3, f_2 = 2, f_3 = 1

Examples:
- F_1 = n (counting)
- F_0 = number of distinct elements
- F_2 = a measure of "pairwise collisions"
- F_k = a measure of "k-wise collisions"
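To make the definition concrete, here is a straightforward (non-streaming) computation of F_k, using the slide's example vector; it is a reference implementation for checking the streaming algorithms, not one of them:

```python
from collections import Counter

def frequency_moment(x, k):
    # F_k = sum over values j of f_j^k, where f_j counts occurrences of j.
    freqs = Counter(x)
    if k == 0:
        return len(freqs)  # F_0 = number of distinct elements
    return sum(f ** k for f in freqs.values())

x = (1, 1, 1, 2, 2, 3)  # f_1 = 3, f_2 = 2, f_3 = 1
```

For this x: F_0 = 3 (distinct elements), F_1 = 6 = n, and F_2 = 3² + 2² + 1² = 14.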

27 Frequency Moments: Data Stream Algorithms
- F_0: O(1/ε² + log m) space
- F_1: O(log log n + log(1/ε)) space
- F_2: O(1/ε² (log m + log n)) space
- F_k, k > 2: O(1/ε² · m^{1−2/k}) space

28 End of Lecture 12

