Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.


1 Big Data Lecture 5: Estimating the second moment, dimension reduction, applications

2 The second moment
Stream: A,B,A,C,D,D,A,A,E,B,E,E,F,…
Frequencies in the prefix shown: f(A)=4, f(B)=2, f(C)=1, f(D)=2, f(E)=3, f(F)=1
The second moment: F₂ = Σₓ f(x)²
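As a concrete illustration (not from the original slides), a minimal Python computation of F₂ done exactly; the rest of the lecture is about approximating this value in small space:

```python
from collections import Counter

def exact_f2(stream):
    """Second moment F2 = sum of f(x)**2 over all distinct items x."""
    freq = Counter(stream)
    return sum(c * c for c in freq.values())

# For the prefix above:
# exact_f2("ABACDDAAEBEEF") == 4**2 + 2**2 + 1 + 2**2 + 3**2 + 1 == 35
```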

3–4 Alon, Matias, Szegedy '96 (Gödel Prize 2005)
Draw a random hash function h : [d] → {−1, +1}.
Maintain the single counter Z = Σₓ h(x)·f(x), i.e., add h(x) to Z on each arrival of x.

5 AMS Analysis

6 2-wise independent hash family
Suppose h : [d] → [T]. Fix two values t₁, t₂ in the range of h, and fix two values x₁ ≠ x₂ in the domain of h. What is the probability that h(x₁) = t₁ and h(x₂) = t₂?

7 2-wise independent hash family
A family H of hash functions is 2-wise independent iff for all x₁ ≠ x₂ and all t₁, t₂:
Pr_{h∈H}[h(x₁) = t₁ and h(x₂) = t₂] = 1/T²

8 2-wise independent hash family
H = {x ↦ (ax + b) mod T | 0 ≤ a, b < T} is 2-wise independent if T is a prime > d.
H = {x ↦ 2(((ax + b) mod T) mod 2) − 1 | 0 ≤ a, b < T} is approximately 2-wise independent from [d] to {−1, 1}.
We can get exact 2-wise independence by more complicated constructions.
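A minimal Python sketch of these two constructions (illustrative, not from the slides; note that the variance analysis on slide 10 will need a 4-wise independent family, which requires a degree-3 polynomial instead of ax + b):

```python
import random

def make_hash(p):
    """Draw h(x) = (a*x + b) mod p from the 2-wise independent family;
    p must be a prime larger than the domain size d."""
    a, b = random.randrange(p), random.randrange(p)
    return lambda x: (a * x + b) % p

def make_sign_hash(p):
    """Approximately 2-wise independent h : [d] -> {-1, +1}
    (the second family above)."""
    h = make_hash(p)
    return lambda x: 2 * (h(x) % 2) - 1
```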

9 Draw h from a 2-wise independent family
E[Z²] = Σₓ f(x)²·E[h(x)²] + Σ_{x≠y} f(x)f(y)·E[h(x)h(y)] = Σₓ f(x)² = F₂,
since h(x)² = 1 and, by 2-wise independence, E[h(x)h(y)] = 0 for x ≠ y.
Z² is an unbiased estimator for F₂!
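In code (a sketch assuming integer items, so the hash family above applies):

```python
def ams_single_estimate(stream, sign_hash):
    """Maintain Z = sum_x h(x) * f(x) by adding h(x) on each arrival of x;
    Z**2 is an unbiased estimate of F2."""
    z = 0
    for x in stream:
        z += sign_hash(x)
    return z * z
```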

10 What is the variance of Z²?
Here we will assume that h is drawn from a 4-wise independent family H.

11 What is the variance of Z²?
With 4-wise independence, E[Z⁴] = Σₓ f(x)⁴ + 3·Σ_{x≠y} f(x)²f(y)², so
Var(Z²) = E[Z⁴] − (E[Z²])² = 2·Σ_{x≠y} f(x)²f(y)² ≤ 2F₂².

12 Chebyshev’s Inequality
Pr[|Z² − F₂| ≥ εF₂] ≤ Var(Z²)/(ε²F₂²) ≤ 2/ε²

13 Chebyshev’s Inequality
Pr[|Z² − F₂| ≥ εF₂] ≤ 2/ε²
If ε is small this bound exceeds 1 and is meaningless… We need to reduce the variance. How?

14 Averaging
Draw k independent hash functions h₁, h₂, …, h_k, maintain Z₁, …, Z_k, and use the average A = (1/k)·Σᵢ Zᵢ². Then E[A] = F₂ and Var(A) = Var(Z²)/k ≤ 2F₂²/k.
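A minimal sketch of the averaging step, reusing ams_single_estimate from above (a genuine one-pass version would keep the k counters side by side and update all of them on each arrival):

```python
def ams_average(stream, k, fresh_sign_hash):
    """Average k independent AMS estimates: the variance drops by a
    factor of k. fresh_sign_hash() must return a new random hash."""
    items = list(stream)
    return sum(ams_single_estimate(items, fresh_sign_hash())
               for _ in range(k)) / k
```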

15 Chebyshev’s Inequality
Pr[|A − F₂| ≥ εF₂] ≤ Var(A)/(ε²F₂²) ≤ 2/(kε²)
Pick k = 8/ε², so this probability is at most 1/4.

16 Boosting the confidence – Chernoff bounds
With k = 8/ε², a single average A is (1 ± ε)-accurate with probability at least 3/4. We now boost this confidence.

17 Boosting the confidence – Chernoff bounds
Now repeat the experiment s = O(log(1/δ)) times. We get A₁, …, A_s (assume they are sorted). Return their median. Why is this good?

18 Boosting the confidence – Chernoff bounds
Each of A₁, …, A_s is bad (outside (1 ± ε)F₂) with probability ≤ 1/4.
For the median to be bad, more than half of A₁, …, A_s must be bad. (Remove the pair consisting of the largest and the smallest and repeat… if both components of some pair are good, then the median is good.)
A₁, A₂, …, A_{s−1}, A_s

19 Boosting the confidence – Chernoff bounds
What is the probability that more than half are bad?
Chernoff: let X = X₁ + … + X_s where each Xᵢ is Bernoulli with p = 1/4; then Pr[X > s/2] ≤ e^{−cs} for a constant c > 0.
So s = O(log(1/δ)) with a large enough constant makes the failure probability at most δ.
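Putting the pieces together, a sketch of the median-of-averages estimator; the constants 8 and 4 are illustrative choices consistent with the bounds above:

```python
import math
import statistics

def ams_f2_estimate(stream, eps, delta, fresh_sign_hash):
    """(1 +/- eps)-approximation of F2 with probability >= 1 - delta."""
    k = math.ceil(8 / eps ** 2)             # each average fails w.p. <= 1/4
    s = math.ceil(4 * math.log(1 / delta))  # repetitions for the median
    items = list(stream)
    return statistics.median(
        ams_average(items, k, fresh_sign_hash) for _ in range(max(s, 1))
    )
```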

20 Recap
The whole construction is linear: stack the sign functions as the rows of a matrix S ∈ {−1, +1}^{k×d}; the sketch is the matrix-vector product S·f, and each Zᵢ² is the square of one coordinate of S·f.

21 This is a random projection…
S·f is a random projection of the frequency vector f. It preserves distances in the sense: (1/k)·‖S·f‖² ≈ ‖f‖₂².

22 Make it look more familiar…
Normalize: with Π = (1/√k)·S, we have ‖Π·f‖² ≈ ‖f‖₂², i.e., the projection preserves the Euclidean norm up to a factor (1 ± ε).

23–25 Dimension reduction
Let A be a random orthonormal k×d matrix; applying it projects onto a random k-dimensional subspace: x ↦ A·x.
JL: for ε ∈ [0,1], with high probability (1−ε)·‖x‖² ≤ (d/k)·‖A·x‖² ≤ (1+ε)·‖x‖².

26 Johnson-Lindenstrauss
JL: Project the vectors x₁, …, x_n into a random k-dimensional subspace. For k = O(log(n)/ε²), with probability 1 − 1/n^c, every pairwise distance is preserved up to a factor (1 ± ε):
(1−ε)·‖xᵢ − xⱼ‖ ≤ √(d/k)·‖A(xᵢ − xⱼ)‖ ≤ (1+ε)·‖xᵢ − xⱼ‖ for all i, j.
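A minimal numpy sketch of the lemma in action. It uses the Gaussian variant of JL (a k×d matrix with N(0, 1/k) entries) instead of an exact orthonormal projection; both give the same guarantee, and the constant 8 is illustrative:

```python
import numpy as np

def jl_project(X, eps, rng=None):
    """Map the n rows of X (an n x d array) to k = O(log(n)/eps**2)
    dimensions, approximately preserving pairwise distances."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    k = int(np.ceil(8 * np.log(n) / eps ** 2))
    A = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))  # Gaussian projection
    return X @ A.T
```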

27–28 The proof
(A: a random orthonormal k×d matrix.)
Obs1: It’s enough to prove the claim for vectors with ‖x‖₂ = 1, since the statement is invariant under scaling x.

29–30 The proof
Obs2: Instead of projecting a fixed vector onto a random k-dim subspace, we may equivalently look at the first k coordinates of a random unit vector (by rotational symmetry).

31–33 The case k = 1
Let z be a random unit vector in d dimensions and look at its first coordinate z₁. By symmetry E[z₁²] = 1/d, and JL for k = 1 asks: for ε ∈ [0,1], how likely is d·z₁² to fall outside (1 ± ε)?

34–36 An application: approximate period
Sequence: 10,3,20,1,10,3,18,1,11,5,20,2,12,1,19,1,…
Find the shift r such that the distance between the sequence and its shift by r, Σᵢ (xᵢ − xᵢ₊ᵣ)², is minimized.
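In code, the objective for a single candidate shift r (assuming the sum-of-squared-differences formulation above):

```python
def period_cost(x, r):
    """Distance between x and its shift by r: sum of squared differences."""
    return sum((x[i] - x[i + r]) ** 2 for i in range(len(x) - r))
```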

37–38 An exact algorithm
Find the r minimizing Σᵢ (xᵢ − xᵢ₊ᵣ)².
Each value of r takes linear time ⇒ O(m²) overall.
We could instead sketch/project all windows of length r and compare the sketches… but that is O(m²·k) just for the sketching…
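The exact algorithm is then a direct loop over all candidate shifts, reusing period_cost from above:

```python
def best_period_exact(x):
    """O(m) work per candidate r, O(m**2) overall."""
    return min(range(1, len(x)), key=lambda r: period_cost(x, r))
```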

39 Obs1: We can sketch faster…
Each coordinate of a window’s sketch is a running inner product of the sequence with a (random) unit vector, and a running inner product is just a convolution of two vectors.

40–44 Convolution
Example: to convolve (1, 2, 3, 4, 5) with (3, 2, 1), slide the reversed short vector along the long one and take an inner product at each offset.
We can compute the convolution in O(m·log(r)) time using the FFT.
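A minimal numpy sketch of convolution via the FFT; the O(m·log(r)) bound quoted on the slide comes from convolving in blocks of length about 2r rather than one length-m transform:

```python
import numpy as np

def convolve_fft(a, b):
    """Full linear convolution of a and b using the FFT."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

# Matches np.convolve([1, 2, 3, 4, 5], [3, 2, 1]) up to float rounding.
```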

45 Obs1: We can sketch faster
We can compute the first coordinate of all the sketches in O(m·log(r)) time ⇒ we can sketch all positions in O(m·log(r)·k).
But we still have many possible values of r…
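Concretely, one random vector u of length r yields the corresponding sketch coordinate of every window at once (a sketch reusing convolve_fft from above):

```python
def sketch_coordinate_all_windows(x, u):
    """Inner product of every length-r window of x with u: the
    correlation of x with u, i.e. convolution with u reversed."""
    r = len(u)
    full = convolve_fft(x, u[::-1])
    return full[r - 1 : len(x)]  # entry s is <x[s:s+r], u>
```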

46 Obs2: Sketch only in powers of 2
We compute all these sketches in O(log(m)·m·log(r)·k) time.

47 When r is not a power of 2?
Split the window z into pieces x and y whose lengths are powers of 2, with sketches S(x) and S(y); use S(x) + S(y) as S(z).

48 The algorithm
Compute the sketches at powers of 2 in O(log(m)·m·log(r)·k) time.
For a fixed r we can approximate the cost in O((m/r)·k) time.
Summing over all r gives O(m·log(m)·k) (a harmonic sum).

49 The algorithm
Total running time: O(m·log³(m)).

50 Bibliography
Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1), 1999.
W. B. Johnson, J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26, 1984, 189–206.
Jiří Matoušek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2), 2008.
Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000.

