Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 13, June 25, 2006
http://www.ee.technion.ac.il/courses/049011

Slide 2: Data Streams (cont.)

Slide 3: Outline
- Distinct elements
- L_p norms
- Notation: for integers a < b, [a,b] = {a, a+1, ..., b}

Slide 4: Distinct Elements
[Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
- Input: a vector x ∈ [1,m]^n
- Goal: find D = the number of distinct elements of x
- Exact algorithms: need Ω(m) bits of space
- Deterministic algorithms: need Ω(m) bits of space
- Approximate randomized algorithms: O(log m) bits of space

Slide 5: Distinct Elements, 1st Attempt
- Let M >> m^2
- Pick a "random hash function" h: [1,m] → [1,M]
  - h(1),...,h(m) are chosen uniformly and independently from [1,M]
  - Since M >> m^2, the probability of a collision is tiny

  1. min ← M
  2. for i = 1 to n do
  3.   read x_i from the stream
  4.   if h(x_i) < min then min ← h(x_i)
  5. output M/min
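A runnable sketch of this first attempt. The function name is mine, and the "truly random" hash is simulated by drawing and caching one uniform value per key; that cache is exactly the O(m log M)-bit table whose cost the next slide complains about. The intuition: the minimum of D uniform draws from [1,M] concentrates around M/(D+1), so M/min is roughly D.

```python
import random

def distinct_estimate(stream, m, seed=0):
    """Estimate the number of distinct elements via the minimum hash value.

    Simulates a truly random hash h: [1,m] -> [1,M] with M >> m^2 by
    drawing a fresh uniform value the first time each key is seen and
    caching it (the cache *is* the space cost the analysis flags)."""
    rng = random.Random(seed)
    M = 100 * m * m        # M >> m^2, so hash collisions are unlikely
    h = {}                 # simulated random hash function
    lo = M                 # running minimum hash value
    for x in stream:
        if x not in h:
            h[x] = rng.randint(1, M)
        lo = min(lo, h[x])
    return M / lo
```

The estimator is heavy-tailed, so a single run can be off by a large constant factor; the slides' later refinement (tracking the t smallest values) is what tightens it.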

Slide 6: Distinct Elements: Analysis
- Space:
  - O(log M) = O(log m) bits for min
  - O(m log M) = O(m log m) bits for h
- Too much! Worse than the naïve O(m)-space algorithm
- Next: show how to use more "space-efficient" hash functions

Slide 7: Small Families of Hash Functions
- H = { h | h: [1,m] → [1,M] }: a family of hash functions
- |H| = O(m^c) for some constant c
  - Therefore, each h ∈ H can be represented in O(log m) bits
- Need H to be "explicit": given the representation of h, we can compute h(x) efficiently for any x
- How do we make sure H has the "random-like" properties of truly random hash functions?

Slide 8: Universal Hash Functions [Carter, Wegman 79]
- H is a 2-universal family of hash functions if: for all x ≠ y ∈ [1,m] and all z,w ∈ [1,M], when h is chosen from H uniformly at random, Pr[h(x) = z and h(y) = w] = 1/M^2
- Conclusions:
  - For each x, h(x) is uniform in [1,M]
  - For all x ≠ y, h(x) and h(y) are independent
  - h(1),...,h(m) is a sequence of uniform, pairwise-independent random variables
- k-universal families: a straightforward generalization

Slide 9: Construction of a Universal Family
- Suppose M is a prime power
  - [1,M] can be viewed as the finite field F_M
  - [1,m] can be viewed as a subset of the elements of F_M
- H = { h_{a,b} | a,b ∈ F_M }, where h_{a,b}(x) = ax + b (arithmetic in F_M)
- Note:
  - |H| = M^2
  - If x ≠ y ∈ F_M and z,w ∈ F_M, then h_{a,b}(x) = z and h_{a,b}(y) = w iff ax + b = z and ay + b = w
  - Since x ≠ y, this linear system in the unknowns (a,b) has a unique solution
  - Hence Pr_{a,b}[h_{a,b}(x) = z and h_{a,b}(y) = w] = 1/M^2
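The uniqueness argument can be checked exhaustively for a small prime. This snippet (mine; p = 7 and the pair x, y are arbitrary choices) enumerates all M^2 functions h_{a,b} and confirms that each output pair (z, w) is hit by exactly one (a, b), i.e. with probability exactly 1/M^2:

```python
# Exhaustive check of 2-universality for h_{a,b}(x) = (a*x + b) mod p over F_p.
p = 7              # M = p, a prime, so arithmetic mod p is the field F_p
x, y = 2, 5        # any fixed pair with x != y
counts = {}
for a in range(p):
    for b in range(p):
        z, w = (a * x + b) % p, (a * y + b) % p
        counts[(z, w)] = counts.get((z, w), 0) + 1
# The map (a, b) -> (h(x), h(y)) is a bijection when x != y, so every
# (z, w) pair occurs exactly once among the p^2 hash functions.
assert all(counts.get((z, w), 0) == 1 for z in range(p) for w in range(p))
```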

Slide 10: Distinct Elements, 2nd Attempt
- Use a 2-universal hash function rather than a truly random one
- Space:
  - O(log m) bits for tracking the minimum
  - O(log m) bits for storing the hash function
- Correctness:
  - Part 1: h(a_1),...,h(a_D) are still uniform in [1,M]; linearity of expectation holds regardless of whether the variables are independent
  - Part 2: h(a_1),...,h(a_D) are now pairwise independent rather than fully independent; the main point is that the variance of a sum of pairwise-independent variables is still additive: Var[Σ_j Z_j] = Σ_j Var[Z_j]

Slide 11: Distinct Elements, Better Approximation
- So far we have a factor-6 approximation. How do we get a better one?
- A (1 + ε)-approximation algorithm:
  - Find the t = O(1/ε^2) smallest hash values, rather than just the smallest one
  - If v is the largest among these, output tM/v
- Space: O(1/ε^2 · log m) bits
  - A better algorithm: O(1/ε^2 + log m) bits
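A sketch of this t-smallest-values variant, combined with the 2-universal hash h_{a,b}(x) = (a·x + b) mod p from slide 9. The function name, the particular prime p, and the constant 16 inside t = O(1/ε^2) are my illustrative choices, not values fixed by the lecture:

```python
import heapq
import random

def distinct_estimate_eps(stream, eps, seed=0):
    """(1 + eps)-style distinct-elements estimate: keep the t = O(1/eps^2)
    smallest values of a 2-universal hash, output t*M/v where v is the
    largest kept value."""
    rng = random.Random(seed)
    p = (1 << 61) - 1                  # a large prime, playing the role of M
    a = rng.randrange(1, p)
    b = rng.randrange(p)
    t = max(1, int(16 / eps ** 2))     # illustrative constant in O(1/eps^2)
    heap = []                          # max-heap (via negation) of the t smallest hashes
    kept = set()                       # hash values currently in the heap
    for x in stream:
        hx = (a * x + b) % p
        if hx in kept:                 # duplicate hash value: already counted
            continue
        if len(heap) < t:
            heapq.heappush(heap, -hx)
            kept.add(hx)
        elif hx < -heap[0]:            # smaller than the current t-th smallest
            kept.discard(-heapq.heappushpop(heap, -hx))
            kept.add(hx)
    if len(heap) < t:                  # fewer than t distinct hashes: count is exact
        return len(heap)
    return t * p / -heap[0]
```

With eps = 0.25 this tracks t = 256 values; the t-th smallest of D uniform hashes sits near (t/D)·p, so t·p/v lands near D.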

Slide 12: L_p Norms
- Input: an integer vector x ∈ [-m,+m]^n
- Goal: find ||x||_p = the L_p norm of x, ||x||_p = (Σ_i |x_i|^p)^{1/p}
- Popular instantiations:
  - L_2: Euclidean distance
  - L_1: Manhattan distance
  - L_∞: max_i |x_i|
  - L_0: number of non-zeros (with the conventions x^0 = 1 for x ≠ 0 and 0^0 = 0); not a norm
- Data stream algorithm: can be done trivially in O(log m) space (maintain the running sum Σ_i |x_i|^p)

Slide 13: L_p Norms: The "Cash Register" Model
- Input: a sequence X of N pairs (i_1,a_1),...,(i_N,a_N)
  - For each j, i_j ∈ {1,...,n}
  - For each j, a_j ∈ [-m,m]
  - Example: X = (1,3), (3,-2), (1,-5), (2,4), (2,1)
- For each i = 1,...,n, let S_i = { j | i_j = i }
  - Example: S_1 = {1,3}, S_2 = {4,5}, S_3 = {2}
- Define x_i = Σ_{j ∈ S_i} a_j
  - Example: x_1 = -2, x_2 = 5, x_3 = -2
- Goal: find ||x||_p = the L_p norm of x
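The aggregation this slide describes can be sketched directly (offline, without the space constraint; the function name is mine, and p = 0 and p = ∞ are handled by their usual conventions):

```python
from collections import defaultdict

def lp_norm_cash_register(pairs, p):
    """Aggregate a cash-register stream of (i, a) pairs into x_i = sum of
    the a's carrying index i, then return ||x||_p.

    p = 0 counts non-zero coordinates; p = inf takes the max |x_i|."""
    x = defaultdict(int)
    for i, a in pairs:
        x[i] += a
    if p == 0:
        return sum(1 for v in x.values() if v != 0)
    if p == float("inf"):
        return max(abs(v) for v in x.values())
    return sum(abs(v) ** p for v in x.values()) ** (1 / p)
```

On the slide's example stream the aggregated vector is x = (-2, 5, -2), so the L_1 norm is 2 + 5 + 2 = 9.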

Slide 14: L_p Norms in the "Cash Register" Model: Applications
- Standard L_p norms
- L_p distances:
  - Input: two vectors x,y ∈ [-m,+m]^n (interleaved arbitrarily)
  - Goal: find ||x - y||_p
- Frequency moments:
  - Input: a vector X ∈ [1,n]^N
    - Example: X = (1, 2, 3, 1, 1, 2)
  - For each i = 1,...,n, define x_i = the frequency of i in X
    - Example: x_1 = 3, x_2 = 2, x_3 = 1
  - Goal: output ||x||_p
  - Special cases:
    - p = ∞: frequency of the most frequent element
    - p = 0: number of distinct elements
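The frequency-moment reduction can be checked on the slide's own example: each element i of X is read as the cash-register pair (i, 1), so the aggregated coordinate x_i is the frequency of i.

```python
from collections import Counter

# Frequency moments as a special case of the cash-register model:
# X = (1, 2, 3, 1, 1, 2) becomes the pairs (1,1), (2,1), (3,1), (1,1), ...
X = [1, 2, 3, 1, 1, 2]
freq = Counter(X)                 # x_i = frequency of i in X
assert freq == {1: 3, 2: 2, 3: 1}
assert max(freq.values()) == 3    # p = infinity: the most frequent element's count
assert len(freq) == 3             # p = 0: the number of distinct elements
```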

Slide 15: L_p Norms: State of the Art
- 0 < p ≤ 2: O(log n log m)-space algorithm [Indyk 00]
- 2 < p < ∞:
  - O(n^{1-2/p} log m)-space algorithm [Indyk, Woodruff 05]
  - Ω(n^{1-2/p-o(1)}) space lower bound [Saks, Sun 02], [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03]
- p = ∞:
  - O(n)-space algorithm [Alon, Matias, Szegedy 96]
  - Ω(n) space lower bound [Alon, Matias, Szegedy 96]
- p = 0 (distinct elements):
  - O(log n + 1/ε^2)-space algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
  - Ω(log n + 1/ε^2) space lower bound [Alon, Matias, Szegedy 96], [Indyk, Woodruff 03]

Slide 16: Stable Distributions
- D: a distribution on R; x ∈ R^n; p ∈ (0,2]
- The distribution D_x:
  - Z_1,...,Z_n: i.i.d. random variables with distribution D
  - D_x = the distribution of Σ_i x_i Z_i
- The distribution D_{p,x}:
  - Z: a random variable with distribution D
  - D_{p,x} = the distribution of ||x||_p · Z
- Definition: D is p-stable if, for every x, D_x = D_{p,x}
- Examples:
  - p = 2: the standard normal distribution
  - p = 1: the Cauchy distribution
  - Other p's: no closed-form pdf
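A quick empirical illustration of 1-stability (my code; Cauchy variates are drawn by inverting the cdf). Since Σ_i x_i Z_i is distributed like ||x||_1 · Z and the median of |Z| for a standard Cauchy is tan(π/4) = 1, the median of |Σ_i x_i Z_i| over many trials should land near ||x||_1:

```python
import math
import random
import statistics

def cauchy(rng):
    """Sample a standard Cauchy variate via the inverse-cdf method."""
    return math.tan(math.pi * (rng.random() - 0.5))

def stability_demo(x, trials=20000, seed=1):
    """Return the empirical median of |sum_i x_i Z_i| over i.i.d. Cauchy Z_i.
    By 1-stability this should be close to ||x||_1 * median(|Z|) = ||x||_1."""
    rng = random.Random(seed)
    return statistics.median(
        abs(sum(xi * cauchy(rng) for xi in x)) for _ in range(trials)
    )

l1 = sum(abs(xi) for xi in (3, -1, 2))   # ||x||_1 = 6
est = stability_demo((3, -1, 2))         # should land near 6
```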

Slide 17: Indyk's Algorithm
- For simplicity, assume p = 1
- Input: a sequence X = (i_1,a_1),...,(i_N,a_N)
- Output: a value z such that (1-ε)·||x||_1 ≤ z ≤ (1+ε)·||x||_1, with probability ≥ 1-δ
- "Cauchy hash function" h: [1,n] → R
  - h(1),...,h(n) are i.i.d. with the Cauchy distribution
  - In practice, use bounded precision

Slide 18: Indyk's Algorithm, 1st Attempt
1. k ← O(1/ε^2 · log(1/δ))
2. generate k Cauchy hash functions h_1,...,h_k
3. for t = 1,...,k do
4.   A_t ← 0
5. for j = 1,...,N do
6.   read (i_j,a_j) from the data stream
7.   for t = 1,...,k do
8.     A_t ← A_t + a_j · h_t(i_j)
9. output median(|A_1|,...,|A_k|)
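A small-scale runnable sketch of this pseudocode (mine; the hash tables are stored explicitly, which is precisely the space problem slide 23 raises, and the median is taken over |A_t|, matching the correctness analysis that follows):

```python
import math
import random
import statistics

def indyk_l1(pairs, n, eps=0.1, seed=2):
    """Sketch of Indyk's L1 estimator: k counters A_t, each updated by
    A_t += a * h_t(i) with Cauchy-distributed h_t(i), output median |A_t|.

    Each h_t is stored as an explicit table of n values here; the real
    algorithm regenerates them pseudo-randomly (Nisan's generator)."""
    rng = random.Random(seed)
    k = max(1, int(8 / eps ** 2))        # illustrative constant in O(1/eps^2)
    # h[t][i] ~ standard Cauchy, via the inverse-cdf method
    h = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n + 1)]
         for _ in range(k)]
    A = [0.0] * k
    for i, a in pairs:
        for t in range(k):
            A[t] += a * h[t][i]
    return statistics.median(abs(v) for v in A)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]   # the stream from slide 13; ||x||_1 = 9
est = indyk_l1(X, n=3, eps=0.1, seed=2)          # should land near 9
```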

Slide 19: Correctness Analysis
- Fix some t ∈ [1,k]
- What value does A_t have at the end of the execution? A_t = Σ_j a_j · h_t(i_j) = Σ_i x_i · h_t(i)
- Recall: h_t(1),...,h_t(n) are i.i.d. with a 1-stable (Cauchy) distribution
- Therefore, A_t is distributed the same as ||x||_1 · Z, where Z is a random variable with the Cauchy distribution

Slide 20: Correctness Analysis (cont.)
- Z_1,...,Z_k: i.i.d. random variables with the Cauchy distribution
- Output of the algorithm: median(|A_1|,...,|A_k|)
- This is distributed the same as median(||x||_1·|Z_1|,...,||x||_1·|Z_k|) = ||x||_1 · median(|Z_1|,...,|Z_k|)
- Conclusion: it is enough to show that median(|Z_1|,...,|Z_k|) ∈ [1-ε, 1+ε] with probability ≥ 1-δ

Slide 21: Correctness Analysis (cont.)
- Claim: Let Z be a Cauchy random variable. Then Pr[|Z| ≤ 1] = 1/2.
- Proof: the cdf of the Cauchy distribution is F(z) = 1/2 + arctan(z)/π. Therefore Pr[|Z| ≤ 1] = F(1) - F(-1) = (2/π)·arctan(1) = 1/2.
- Claim: Let Z be a Cauchy random variable. For any sufficiently small ε > 0, Pr[|Z| ≤ 1-ε] ≤ 1/2 - ε/π and Pr[|Z| ≤ 1+ε] ≥ 1/2 + ε/(2π).
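The cdf facts can be verified numerically. This snippet assumes only the standard Cauchy cdf F(z) = 1/2 + arctan(z)/π, from which Pr[|Z| ≤ t] = (2/π)·arctan(t):

```python
import math

def pr_abs_cauchy_le(t):
    """Pr[|Z| <= t] for a standard Cauchy Z, from the cdf F(z) = 1/2 + atan(z)/pi."""
    return (2 / math.pi) * math.atan(t)

# At t = 1 the probability is exactly 1/2, so median(|Z|) = 1.
assert abs(pr_abs_cauchy_le(1.0) - 0.5) < 1e-12
# Near t = 1 the cdf of |Z| moves at rate 1/pi, so a shift of eps moves the
# probability away from 1/2 by roughly eps/pi (here eps = 1e-3).
assert 0.0003 < 0.5 - pr_abs_cauchy_le(0.999) < 0.0004
```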

Slide 22: Correctness Analysis (cont.)
- Claim: Let Z_1,...,Z_k be k = O(1/ε^2 · log(1/δ)) i.i.d. Cauchy random variables. Then Pr[median(|Z_1|,...,|Z_k|) ∉ [1-ε, 1+ε]] ≤ δ.
- Proof:
  - For j = 1,...,k, let Y_j = 1 if |Z_j| < 1-ε, and Y_j = 0 otherwise
  - Then median(|Z_1|,...,|Z_k|) < 1-ε iff Σ_j Y_j ≥ k/2
  - By the previous slide, E[Σ_j Y_j] ≤ k/2 - εk/4
  - By the Chernoff-Hoeffding bound, Pr[Σ_j Y_j ≥ k/2] ≤ δ/2
  - A similar analysis shows Pr[median(|Z_1|,...,|Z_k|) > 1+ε] ≤ δ/2
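A simulation of the claim (parameters mine: k = 2000 plays the role of O(1/ε² · log(1/δ)) with ε = 0.2). The median of k i.i.d. |Cauchy| variates should land in [1-ε, 1+ε] in essentially every run:

```python
import math
import random
import statistics

def median_trial(k, seed):
    """One run of the experiment: median of k i.i.d. |Cauchy| variates."""
    rng = random.Random(seed)
    return statistics.median(
        abs(math.tan(math.pi * (rng.random() - 0.5))) for _ in range(k)
    )

eps = 0.2
k = 2000   # playing the role of O(1/eps^2 * log(1/delta))
# Count how many of 100 independent runs land within [1 - eps, 1 + eps].
hits = sum(1 for s in range(100) if abs(median_trial(k, s) - 1) < eps)
```

With k this large the failure probability per run is tiny, so hits should be at or very near 100.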

Slide 23: Space Analysis
- Space used: k = O(1/ε^2 · log(1/δ)) times:
  - A_t: O(log m) bits
  - h_t: O(n log m) bits
- Too much!
- This time we really need h_t(1),...,h_t(n) to be totally independent
  - Otherwise, the resulting distribution is not stable
  - So we cannot use universal hashing
  - What can we do?

Slide 24: Pseudo-Random Generators for Space-Bounded Computations [Nisan 90]
- Notation: U_k = a sequence of k truly random bits
- An S-space, R-random-bit randomized algorithm A:
  - Uses at most S bits of space
  - Uses at most R random bits
  - Accesses its random bits sequentially
  - A(x,U_R): the (random) output of A on input x
- Nisan's pseudo-random generator is a function G: {0,1}^{O(S log R)} → {0,1}^R such that:
  - for every S-space, R-random-bit randomized algorithm A, and for every input x,
  - A(x,U_R) has almost the same distribution as A(x,G(U_{O(S log R)}))

Slide 25: Space Analysis (cont.)
- Suppose the input stream were guaranteed to arrive in the following order:
  - first, all pairs of the form (1,*);
  - then, all pairs of the form (2,*); ...
  - finally, all pairs of the form (n,*)
- Then we could generate the values h_t(1),...,h_t(n) on the fly, with no need to store them
  - O(log m) bits would suffice to store the hash function's current value
- Therefore, for such input streams, Indyk's algorithm uses:
  - O(log m) bits of space
  - O(n log m) random bits

Slide 26: Space Analysis (cont.)
- Conclusion: for "ordered" input streams, Indyk's algorithm is an O(log m)-space, O(n log m)-random-bit randomized algorithm
- We can therefore use Nisan's generator:
  - Each h_t can now be generated from a seed of only O(log m · log n) random bits
  - Space needed: O(log n · log m) bits
- Crucial observation: the output distribution of Indyk's algorithm does not depend on the order of the input stream
- Conclusion: if we generate the Cauchy hash functions using Nisan's generator, Indyk's algorithm works even for "unordered" streams

Slide 27: Wrapping Up
- Space used: k = O(1/ε^2 · log(1/δ)) times:
  - A_t: O(log m) bits
  - h_t: O(log n · log m) bits (using Nisan's generator)
- Total: O(1/ε^2 · log(1/δ) · log n · log m) bits

Slide 28: End of Lecture 13

