
Slide 1: Algorithms for Large Data Sets
Ziv Bar-Yossef, Lecture 12, June 18, 2006
http://www.ee.technion.ac.il/courses/049011

Slide 2: Data Streams

Slide 3: Outline
- The data stream model
- Approximate counting
- Distinct elements
- Frequency moments

Slide 4: The Data Stream Model
- f: A^n → B, where A and B are arbitrary sets and n is a positive integer (think of n as large)
- Given x ∈ A^n, each entry x_i is called an "element"
- Typically, A and B are "small" (constant-size) sets
- Goal: given x ∈ A^n, compute f(x)
  - Frequently, an approximation of f(x) suffices
  - Usually, we will use randomization
- Streaming access to the input:
  - The algorithm reads the input in "sequential passes"; in each pass, x is read in the order x_1, x_2, ..., x_n
  - Impossible: random access, going backwards
  - Possible: storing portions of x (or other functions of x) in memory

Slide 5: Complexity Measures
- Space
  - Objective: use as little memory as possible
  - Note: with unlimited space, the data stream model is the same as the standard RAM model
  - Ideally, at most O(log^c n) for some constant c
- Number of passes
  - Objective: use as few passes as possible
  - Ideally, only a single pass; usually, no more than a constant number of passes
- Running time
  - Objective: use as little time as possible
  - Ideally, at most O(n log^c n) for some constant c

Slide 6: Motivation
- Types of large data sets:
  - Pre-stored: kept on magnetic or optical media (tapes, disks, DVDs, ...)
  - Generated on the fly: data feeds, streaming media, packet streams, ...
- Access to large data sets:
  - Random access: costly (if the data is pre-stored) or infeasible (if the data is generated on the fly)
  - Streaming access: the only feasible option
- Resources:
  - Memory: the primary bottleneck
  - Number of passes: a few (pre-stored data) or a single pass (data generated on the fly)
  - Time: cannot be more than quasi-linear

Slide 7: Approximate Counting [Morris 77], [Flajolet 85]
- Input: a bit string x ∈ {0,1}^n
- Goal: find H = number of 1's in x
- Naive solution: just count them! Takes O(log H) bits of space
- Can we do better?
  - Answer 1: No! Information theory implies an Ω(log H) lower bound
  - Answer 2: Yes! But only approximately:
    - Output the power of 1+ε closest to H
    - Note: the number of possible outputs is O(log_{1+ε} H) = O((1/ε) log H)
    - Hence, only O(log log H + log(1/ε)) bits of space suffice

Slide 8: Approximate Counting (ε = 1)
  k ← 0
  for i = 1 to n do
    if x_i = 1, then with probability 1/2^k set k ← k + 1
  output 2^k − 1
- General idea: the expected number of 1's needed to increment k to k + 1 is 2^k
  - k = 0 → k = 1: after seeing 1 one
  - k = 1 → k = 2: after seeing 2 additional 1's
  - k = 2 → k = 3: after seeing 4 additional 1's
  - ...
  - k = i−1 → k = i: after seeing 2^{i−1} additional 1's
- Therefore, we expect k to become i after seeing 1 + 2 + 4 + ... + 2^{i−1} = 2^i − 1 ones
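A minimal Python sketch of the ε = 1 counter above (the function name and bit-stream interface are my own, not from the lecture):

```python
import random

def morris_count(bits):
    """Morris counter with epsilon = 1: stores only k, about log log H bits."""
    k = 0
    for b in bits:
        # on each 1, increment k with probability 1/2^k
        if b == 1 and random.random() < 0.5 ** k:
            k += 1
    return 2 ** k - 1  # the estimator C_H - 1 analyzed on the next slides
```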

Slide 9: Approximate Counting: Analysis
- For m = 0, ..., H, let K_m = value of the counter after seeing m 1's
- For i = 0, ..., m, let p_{m,i} = Pr[K_m = i]
- Recursion:
  - p_{0,0} = 1
  - p_{m,0} = 0, for m = 1, ..., H
  - p_{m,i} = p_{m−1,i} · (1 − 1/2^i) + p_{m−1,i−1} · 1/2^{i−1}, for m = 1, ..., H and i = 1, ..., m−1
  - p_{m,m} = p_{m−1,m−1} · 1/2^{m−1}, for m = 1, ..., H

Slide 10: Approximate Counting: Analysis
- Define: C_m = 2^{K_m}
- Lemma: E[C_m] = m + 1
- Therefore, C_H − 1 is an unbiased estimator for H
- One can show that Var[C_H] is small, and therefore w.h.p. H/2 ≤ C_H − 1 ≤ 2H
- Proof of lemma: by induction on m
  - Basis: E[C_0] = 1, E[C_1] = 2
  - Induction step: suppose m ≥ 2 and E[C_{m−1}] = m

Slide 11: Approximate Counting: Analysis
- Completing the induction step, using the recursion for p_{m,i} from slide 9:
  E[C_m] = Σ_i 2^i · p_{m,i}
         = Σ_i 2^i [ p_{m−1,i} (1 − 1/2^i) + p_{m−1,i−1} · 1/2^{i−1} ]
         = Σ_i (2^i − 1) p_{m−1,i} + 2 Σ_i p_{m−1,i−1}
         = (E[C_{m−1}] − 1) + 2
         = E[C_{m−1}] + 1 = m + 1
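As a quick empirical check of the lemma, one can average the estimator over many runs (a sketch reusing morris_count from the earlier block; the stream length and trial count are arbitrary):

```python
H = 1000
trials = 20000
stream = [1] * H
avg = sum(morris_count(stream) for _ in range(trials)) / trials
print(avg)  # close to H = 1000, since E[2^(K_H)] = H + 1
```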

Slide 12: Better Approximation
- So far, a factor-2 approximation. How do we obtain a 1+ε approximation?
  k ← 0
  for i = 1 to n do
    if x_i = 1, then with probability 1/(1 + ε)^k set k ← k + 1
  output ((1 + ε)^k − 1)/ε
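The earlier sketch generalizes directly (again, the names are my own; the increment probability now decays by a factor of 1 + ε per step):

```python
import random

def morris_count_eps(bits, eps):
    """Morris counter with base 1 + eps: smaller eps, finer increments."""
    k = 0
    for b in bits:
        if b == 1 and random.random() < (1.0 + eps) ** (-k):
            k += 1
    # unbiased by the same induction as before: E[(1 + eps)^k] = 1 + eps * H
    return ((1.0 + eps) ** k - 1.0) / eps
```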

Slide 13: Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
- Input: a vector x ∈ {1,2,...,m}^n
- Goal: find D = number of distinct elements of x
  - Example: if x = (1,2,3,1,2,3), then D = 3
- Naive solution: use a bit vector of size m, and track the values that appear at least once. Takes O(m) bits of space
- Can we do better?
  - Answer 1: No! If we want the exact number, we need Ω(m) bits of space
    - Information theory gives only Ω(log m); the Ω(m) bound needs communication complexity arguments
  - Answer 2: Yes! But only approximately: use only O(log m) bits of space

Slide 14: Estimating the Size of a Random Set
- Suppose we choose D ≪ M^{1/2} elements uniformly and independently from {1,...,M}:
  - X_1 is uniformly chosen from {1,...,M}
  - X_2 is uniformly chosen from {1,...,M}
  - ...
  - X_D is uniformly chosen from {1,...,M}
- For each k = 1,...,D, how many elements of {1,...,M} do we expect to be smaller than min{X_1,...,X_k}?
  - k = 1: we expect M/2 elements to be less than X_1
  - k = 2: we expect M/3 elements to be less than min{X_1,X_2}
  - k = 3: we expect M/4 elements to be less than min{X_1,X_2,X_3}
  - ...
  - k = D: we expect M/(D+1) elements to be less than min{X_1,...,X_D}
- Conversely, suppose S is a set of randomly chosen elements of {1,...,M} whose size is unknown. Then, if t = min S, we can estimate |S| as M/t − 1 (see the simulation below)
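A small simulation of this estimator (a sketch with arbitrary parameters; it counts how often M/t − 1 lands within a factor of 6 of the true size, matching the lemma of slide 16):

```python
import random

M = 10 ** 12   # universe size, with M >> D^2
D = 1000       # true number of sampled elements
trials = 2000
good = 0
for _ in range(trials):
    t = min(random.randint(1, M) for _ in range(D))
    est = M / t - 1  # estimate the set size from the minimum alone
    good += D / 6 <= est <= 6 * D
print(good / trials)  # typically around 0.8 or higher
```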

Slide 15: Distinct Elements, 1st Attempt
- Let M ≫ m^2
- Pick a random "hash function" h: {1,...,m} → {1,...,M}: h(1),...,h(m) are chosen uniformly and independently from {1,...,M}
  - Since M ≫ m^2, the probability of collisions is tiny
  min ← M
  for i = 1 to n do
    if h(x_i) < min, then min ← h(x_i)
  output M/min
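A sketch of this first attempt in Python (names are my own; the truly random hash function is simulated with a lazily filled dictionary, which of course stores far more than O(log m) bits, exactly the problem slides 21-24 address):

```python
import random

def distinct_elements_estimate(stream, m):
    """First attempt: hash every element and track the minimum hash value."""
    M = 100 * m * m          # M >> m^2, so collisions are unlikely
    h = {}                   # simulated random function {1..m} -> {1..M}
    lo = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        lo = min(lo, h[x])
    return M / lo

print(distinct_elements_estimate([1, 2, 3, 1, 2, 3], m=3))  # D = 3
```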

Slide 16: Distinct Elements: Analysis
- Space: O(log M) = O(log m)
  - Not quite. We'll discuss this later.
- Correctness:
  - Let a_1,...,a_D be the distinct values among x_1,...,x_n
  - S = {h(a_1),...,h(a_D)} is a set of D random and independent elements of {1,...,M}
  - Note: min = min S, so the algorithm outputs M/(min S)
- Lemma: With probability at least 2/3, D/6 ≤ M/min ≤ 6D

Slide 17: Distinct Elements: Correctness
- Part 1: show that Pr[M/min > 6D] ≤ 1/6
- Define for k = 1,...,D: Z_k = 1 if h(a_k) < M/(6D), and Z_k = 0 otherwise
- Define: Z = Z_1 + ... + Z_D
- Note: M/min > 6D iff min < M/(6D) iff Z ≥ 1; since each h(a_k) is uniform in {1,...,M}, E[Z_k] ≤ 1/(6D), and by linearity of expectation E[Z] ≤ 1/6

Slide 18: Markov's Inequality
- X ≥ 0: a non-negative random variable, t > 1. Then: Pr[X ≥ t · E[X]] ≤ 1/t
- Need to show: Pr[Z ≥ 1] ≤ 1/6
- By Markov's inequality, Pr[Z ≥ 1] ≤ E[Z] ≤ 1/6

Slide 19: Distinct Elements: Correctness
- Part 2: show that Pr[M/min < D/6] ≤ 1/6
- Define for k = 1,...,D: Y_k = 1 if h(a_k) ≤ 6M/D, and Y_k = 0 otherwise
- Define: Y = Y_1 + ... + Y_D
- Note: M/min < D/6 iff min > 6M/D iff Y = 0; since each h(a_k) is uniform in {1,...,M}, E[Y_k] ≈ 6/D, and by linearity of expectation E[Y] ≈ 6

Slide 20: Chebyshev's Inequality
- X: an arbitrary random variable, λ > 0. Then: Pr[|X − E[X]| ≥ λ] ≤ Var[X]/λ^2
- Need to show: Pr[Y = 0] ≤ 1/6
- By Chebyshev's inequality, Pr[Y = 0] ≤ Pr[|Y − E[Y]| ≥ 6] ≤ Var[Y]/36
- By independence of Y_1,...,Y_D: Var[Y] = Σ_k Var[Y_k] ≤ Σ_k E[Y_k] = E[Y] ≈ 6 (each Y_k is an indicator, so Var[Y_k] ≤ E[Y_k])
- Hence, Pr[Y = 0] ≤ 6/36 = 1/6

Slide 21: How to Store the Hash Function?
- How many bits are needed to represent a random hash function h: [m] → [M]?
  - O(m log M) = O(m log m) bits: more than the naive algorithm!
- Solution: use "small" families of hash functions
  - H will be a family of functions h: [m] → [M]
  - |H| = O(m^c) for some constant c, so each h ∈ H can be represented in O(log m) bits
  - Need H to be "explicit": given the representation of h, we can compute h(x) efficiently for any x
- How do we make sure H has the "random-like" properties of totally random hash functions?

Slide 22: Universal Hash Functions [Carter, Wegman 79]
- H is a 2-universal family of hash functions if: for all x ≠ y ∈ [m] and for all z,w ∈ [M], when choosing h from H at random, Pr[h(x) = z and h(y) = w] = 1/M^2
- Conclusions:
  - For each x, h(x) is uniform in [M]
  - For all x ≠ y, h(x) and h(y) are independent
  - h(1),...,h(m) is a sequence of uniform, pairwise-independent random variables
- k-universal families: a straightforward generalization

Slide 23: Construction of a Universal Family
- Suppose m = M and m is a prime power; [m] can then be identified with the finite field F_m
- Each pair of elements a,b ∈ F_m defines one hash function in H, so |H| = |F_m|^2 = m^2:
  h_{a,b}(x) = ax + b (operations in F_m)
- Note: if x ≠ y ∈ [m] and z,w ∈ [m], then h_{a,b}(x) = z and h_{a,b}(y) = w iff ax + b = z and ay + b = w
- Since x ≠ y, this system has a unique solution (a,b); thus, choosing a,b at random hits the solution with probability exactly 1/m^2
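A sketch of this construction in Python, specialized to a prime modulus p rather than a general prime power (the class name is mine):

```python
import random

class UniversalHash:
    """h_{a,b}(x) = (a*x + b) mod p: a 2-universal family over the field F_p."""
    def __init__(self, p):
        self.p = p                    # p must be prime
        self.a = random.randrange(p)  # the pair (a, b) selects one of the
        self.b = random.randrange(p)  # p^2 functions; storing it takes 2 log p bits
    def __call__(self, x):
        return (self.a * x + self.b) % self.p

h = UniversalHash(101)
print(h(1), h(2))  # uniform and pairwise independent over {0, ..., 100}
```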

Slide 24: Distinct Elements, 2nd Attempt
- Use a random hash function from a 2-universal family, rather than a totally random hash function
- Space:
  - O(log m) for tracking the minimum
  - O(log m) for storing the hash function
- Correctness:
  - Part 1: h(a_1),...,h(a_D) are still uniform in [M], and linearity of expectation holds regardless of whether Z_1,...,Z_D are independent
  - Part 2: h(a_1),...,h(a_D) are still uniform in [M]; the main point is that the variance of pairwise-independent variables is additive: Var[Y] = Var[Y_1] + ... + Var[Y_D]

Slide 25: Distinct Elements, Better Approximation
- So far we had a factor-6 approximation. How do we get a better one?
- 1+ε approximation algorithm:
  - Find the t = O(1/ε^2) smallest elements, rather than just the smallest one
  - If v is the largest among these, output tM/v
- Space: O((1/ε^2) log m)
- A better algorithm achieves O(1/ε^2 + log m)
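A sketch of the t-smallest-values variant (hypothetical names; it keeps the t smallest distinct hash values in a max-heap, via negation, and falls back to an exact count when fewer than t distinct values appear):

```python
import heapq
import random

def distinct_elements_t_smallest(stream, m, t):
    """Estimate D as t*M/v, where v is the t-th smallest distinct hash value."""
    M = 100 * m * m
    h = {}                  # stand-in for the 2-universal hash function
    heap, seen = [], set()  # negated max-heap of the t smallest values seen
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        v = h[x]
        if v in seen:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            seen.add(v)
        elif v < -heap[0]:
            seen.discard(-heapq.heappushpop(heap, -v))
            seen.add(v)
    if len(heap) < t:
        return len(heap)    # fewer than t distinct elements: count is exact
    return t * M / (-heap[0])
```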

Slide 26: Frequency Moments [Alon, Matias, Szegedy 96]
- Input: a vector x ∈ {1,2,...,m}^n
- Goal: find F_k = Σ_{j=1}^m (f_j)^k, the k-th frequency moment of x
  - For each j ∈ {1,...,m}, f_j = number of occurrences of j in x
  - Example: if x = (1,1,1,2,2,3), then f_1 = 3, f_2 = 2, f_3 = 1
- Examples:
  - F_1 = n (counting)
  - F_0 = number of distinct elements
  - F_2 = a measure of "pairwise collisions"
  - F_k = a measure of "k-wise collisions"
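A naive reference implementation of the definition, not a streaming algorithm, just to pin down what F_k is:

```python
from collections import Counter

def frequency_moment(x, k):
    """F_k = sum over j of f_j^k; Counter only lists values with f_j >= 1."""
    return sum(f ** k for f in Counter(x).values())

x = (1, 1, 1, 2, 2, 3)
print(frequency_moment(x, 1))  # 6 = n
print(frequency_moment(x, 0))  # 3 = number of distinct elements (f^0 = 1 each)
print(frequency_moment(x, 2))  # 3^2 + 2^2 + 1^2 = 14
```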

Slide 27: Frequency Moments: Data Stream Algorithms
- F_0: O(1/ε^2 + log m) space
- F_1: O(log log n + log(1/ε)) space
- F_2: O((1/ε^2)(log m + log n)) space
- F_k, k > 2: O((1/ε^2) m^{1−2/k}) space

Slide 28: End of Lecture 12
