Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)


1 Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)

2 A scenario
Incoming IPs: 131.107.65.14, 18.9.22.69, 80.97.56.20, 131.107.65.14

IP               Frequency
131.107.65.14    3
18.9.22.69       2
80.97.56.20      2
128.112.128.81   9
127.0.0.1        8
257.2.5.7        0
7.8.20.13        1

Challenge: compute something on the table, using small space.
Example of “something”:
- # distinct IPs
- max frequency
- other statistics…

3 Sublinear: a panacea?
- Sub-linear space algorithm for solving the Travelling Salesperson Problem?
  - Sorry, perhaps a different lecture
- Hard to solve even very simple problems sublinearly:
  - Ex: what is the count of distinct IPs seen?
- Will settle for:
  - Approximate algorithms: (1+ε)-approximation
    true answer ≤ output ≤ (1+ε) * (true answer)
  - Randomized: the above holds with probability 95%
- Quick and dirty way to get a sense of the data

IP               Frequency
131.107.65.14    3
18.9.22.69       2
80.97.56.20      2
128.112.128.81   9
127.0.0.1        8
257.2.5.7        0
8.3.20.12        1

4 Streaming data
- Data through a router
- Data stored on a hard drive, or streamed remotely
  - More efficient to do a linear scan on a hard drive
  - Working memory is the (smaller) main memory

5 Application areas
- Data can come from:
  - Network logs, sensor data
  - Real time data
  - Search queries, served ads
  - Databases (query planning)
  - …

6 Problem 1: # distinct elements
Stream: 2 5 7 5 5

i   Frequency
2   1
5   3
7   1

7 Distinct Elements: idea 1
Stream: 2 7 5

Algorithm DISTINCT:
  Initialize:
    minHash = 1
    hash function h into [0,1]
  Process (int i):
    if (h(i) < minHash) minHash = h(i);
  Output: 1/minHash - 1

[Flajolet-Martin’85, Alon-Matias-Szegedy’96]
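
A minimal Python sketch of the DISTINCT pseudocode on slide 7 may help make it concrete. The SHA-256-based `h` is an assumption standing in for the idealized hash function into [0,1]; the class and method names are illustrative and do not come from the slides.

```python
import hashlib


def h(i, seed=0):
    """Map an item to a pseudo-random float in [0, 1).

    Assumption: SHA-256 stands in for the ideal hash function on the slide.
    """
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Distinct:
    """Keeps only the smallest hash value seen so far."""

    def __init__(self, seed=0):
        self.seed = seed
        self.min_hash = 1.0

    def process(self, i):
        # Repeated items hash to the same value, so duplicates never change the state.
        self.min_hash = min(self.min_hash, h(i, self.seed))

    def output(self):
        # The minimum of n independent uniform values is about 1/(n+1),
        # so 1/min_hash - 1 is a (noisy) estimate of n.
        return 1.0 / self.min_hash - 1.0


d = Distinct()
for item in [2, 5, 7, 5, 5]:
    d.process(item)
print(d.output())
```

A single run like this has high variance; it only illustrates why the state depends on the set of distinct items and not on how often each one repeats.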

8 Distinct Elements: idea 2
ZEROS(x): number of leading zeros in the binary expansion of x, e.g. x = 0.0000001100101 gives ZEROS(x) = 6

Idea 1, for comparison:
Algorithm DISTINCT:
  Initialize:
    minHash = 1
    hash function h into [0,1]
  Process (int i):
    if (h(i) < minHash) minHash = h(i);
  Output: 1/minHash - 1

Idea 2:
Algorithm DISTINCT:
  Initialize:
    minHash2 = 0
    hash function h into [0,1]
  Process (int i):
    if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(i));
  Output: 2^minHash2
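
The second idea on slide 8 can be sketched in Python as follows. This is an illustration under assumptions: hashes are treated as 64-bit binary fractions, SHA-256 stands in for the ideal hash function, and the names `h_bits`, `zeros`, and `DistinctZeros` are hypothetical. The slide's test `h(i) < 1/2^minHash2` holds exactly when `ZEROS(h(i)) >= minHash2`, so the update is written here as a running maximum.

```python
import hashlib

BITS = 64  # assumed hash width for this sketch


def h_bits(i, seed=0):
    """Hash an item to a 64-bit integer, read as the binary fraction x / 2**64."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def zeros(x, width=BITS):
    """Number of leading zeros of x written as a `width`-bit binary fraction."""
    return width - x.bit_length()


class DistinctZeros:
    """Stores only a small integer: the largest leading-zero count seen."""

    def __init__(self, seed=0):
        self.seed = seed
        self.max_zeros = 0  # plays the role of minHash2 on the slide

    def process(self, i):
        self.max_zeros = max(self.max_zeros, zeros(h_bits(i, self.seed)))

    def output(self):
        return 2 ** self.max_zeros


d = DistinctZeros()
for item in [2, 5, 7, 5, 5]:
    d.process(item)
print(d.output())
```

Compared with idea 1, the state is an exponent instead of a full hash value, at the price of an estimate that is always a power of two.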

9 Problem 2: max count
- Problem: compute the maximum frequency of an element in the stream
- Bad news:
  - Hard to distinguish whether an element repeated (max = 1 vs 2)
- Good news:
  - Can find “heavy hitters”:
    elements with frequency > total frequency / s,
    using space proportional to s

Stream: 2 5 7 5 5

IP   Frequency
2    1
5    3    <- heavy hitter
7    1

10 Heavy Hitters: CountMin
Stream: 2 5 7 5 5
[Figure: sketch counters being updated as stream elements arrive]

Algorithm CountMin:
  Initialize (w, L):
    array Sketch[L][w]
    L hash functions h[L], into {0,…,w-1}
  Process (int i):
    for (j=0; j<L; j++)
      Sketch[j][ h[j](i) ] += 1;
  Output:
    foreach i in PossibleIP {
      freq[i] = int.MaxValue;
      for (j=0; j<L; j++)
        freq[i] = min(freq[i], Sketch[j][ h[j](i) ]);
    }
    // freq[] is the frequency estimate

[Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05]
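
The CountMin pseudocode on slide 10 translates fairly directly into Python. The version below is a sketch under assumptions: salted SHA-256 stands in for the L independent hash functions, and the names `CountMin`, `process`, and `estimate` are illustrative.

```python
import hashlib


class CountMin:
    def __init__(self, w, L):
        self.w = w  # number of counters per row
        self.L = L  # number of rows, one hash function per row
        self.sketch = [[0] * w for _ in range(L)]

    def _h(self, j, i):
        # Assumption: salting SHA-256 with the row index j approximates
        # L independent hash functions into {0, ..., w-1}.
        digest = hashlib.sha256(f"{j}:{i}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def process(self, i):
        for j in range(self.L):
            self.sketch[j][self._h(j, i)] += 1

    def estimate(self, i):
        # Every row over-counts i by whatever collides with it,
        # so the minimum across rows is the tightest upper bound.
        return min(self.sketch[j][self._h(j, i)] for j in range(self.L))


cm = CountMin(w=16, L=3)
for item in [2, 5, 7, 5, 5]:
    cm.process(item)
print({i: cm.estimate(i) for i in (2, 5, 7)})
```

The estimate never falls below the true frequency; collisions can only inflate it, which is why taking the minimum over rows helps.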

11 Heavy Hitters: analysis
[Figure: sketch counters for element 5]

Algorithm CountMin:
  Initialize (w, L):
    array Sketch[L][w]
    L hash functions h[L], into {0,…,w-1}
  Process (int i):
    for (j=0; j<L; j++)
      Sketch[j][ h[j](i) ] += 1;
  Output:
    foreach i in PossibleIP {
      freq[i] = int.MaxValue;
      for (j=0; j<L; j++)
        freq[i] = min(freq[i], Sketch[j][ h[j](i) ]);
    }
    // freq[] is the frequency estimate

12 Problem 3: Moments

IP   Frequency   Frequency^2   Frequency^4
2    1           1             1
5    3           9             81
7    2           4             16
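
As a baseline for the numbers on slide 12, the k-th frequency moment (the sum of the k-th powers of the frequencies) can be computed exactly with linear space. The Python below is only that exact baseline, not a small-space streaming algorithm; the stream is a hypothetical one chosen to match the frequencies 2:1, 5:3, 7:2.

```python
from collections import Counter


def moment(stream, k):
    """Exact k-th frequency moment: sum over distinct items of frequency**k."""
    freq = Counter(stream)
    return sum(f ** k for f in freq.values())


stream = [2, 5, 7, 5, 5, 7]  # hypothetical stream with frequencies 2:1, 5:3, 7:2
print(moment(stream, 2))  # 1 + 9 + 4 = 14
print(moment(stream, 4))  # 1 + 81 + 16 = 98
```

With k = 0 the same function counts distinct elements, and larger k puts more and more weight on the most frequent element.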

13

14 Scenario 2: distributed traffic
Two sketches should be sufficient to compute something on the difference or sum.

Router 1 stream: 131.107.65.14, 18.9.22.69, 35.8.10.140
IP               Frequency
131.107.65.14    1
18.9.22.69       1
35.8.10.140      1

Router 2
IP               Frequency
131.107.65.14    1
18.9.22.69       2
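
One way to see why two sketches can suffice: a counter-based sketch such as CountMin is linear in the frequency table, so sketches built separately with the same hash functions can be added or subtracted entrywise. The Python below is a sketch under assumptions; the width `W`, the row count `L`, the SHA-256-based `bucket`, and the two router streams (chosen to match the tables on slide 14) are all illustrative.

```python
import hashlib

W, L = 16, 3  # assumed sketch width and number of rows


def bucket(j, ip):
    """Row-j hash of an IP into {0, ..., W-1}; both routers must share it."""
    digest = hashlib.sha256(f"{j}:{ip}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % W


def sketch(stream):
    s = [[0] * W for _ in range(L)]
    for ip in stream:
        for j in range(L):
            s[j][bucket(j, ip)] += 1
    return s


def combine(s1, s2, sign=1):
    """Entrywise sum (sign=1) or difference (sign=-1) of two sketches."""
    return [[x + sign * y for x, y in zip(r1, r2)] for r1, r2 in zip(s1, s2)]


router_a = ["131.107.65.14", "18.9.22.69", "35.8.10.140"]
router_b = ["131.107.65.14", "18.9.22.69", "18.9.22.69"]

total = combine(sketch(router_a), sketch(router_b))
# The combined sketch equals the sketch of the concatenated traffic.
print(total == sketch(router_a + router_b))
```

Only the two small sketches need to travel between routers; the raw streams never have to be brought together.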

15 Common primitive: estimate sum
a_1  a_2  a_3  a_4
a_1  a_3

16 Precision Sampling Framework
Values:      a_1  a_2  a_3  a_4
Precisions:  u_1  u_2  u_3  u_4
Estimates:   ã_1  ã_2  ã_3  ã_4

17 Formalization
Sum Estimator    Adversary

18 Precision Sampling Lemma
- Goal: estimate Σ a_i from {ã_i} satisfying |a_i - ã_i| < u_i.
- Precision Sampling Lemma: can get, with 90% success:
  - O(1) additive error and 1.5 multiplicative error:
    S - O(1) < S̃ < 1.5*S + O(1)
  - with average cost equal to O(log n)
- General version: ε additive error and 1+ε multiplicative error:
    S - ε < S̃ < (1+ε)S + ε
  with average cost equal to O(ε^-3 log n)
- Example: distinguish Σ a_i = 3 vs Σ a_i = 0
  - Consider two extreme cases:
    - if three a_i = 1: enough to have a crude approximation for all (u_i = 0.1)
    - if all a_i = 3/n: only a few with good approximation u_i = 1/n, and the rest with u_i = 1

[A-Krauthgamer-Onak’11]

19 Precision Sampling Algorithm
- Precision Sampling Lemma: can get, with 90% success:
  - O(1) additive error and 1.5 multiplicative error:
    S - O(1) < S̃ < 1.5*S + O(1)
  - with average cost equal to O(log n)
- Algorithm:
  - Choose each u_i ∈ [0,1] i.i.d.
  - Estimator: S̃ = count of i's s.t. ã_i / u_i > 6 (up to a normalization constant)
- Proof of correctness:
  - we use only ã_i which are 1.5-approximation to a_i
  - E[S̃] ≈ Σ Pr[a_i / u_i > 6] = Σ a_i / 6.
  - E[1/u_i] = O(log n) w.h.p.
- General version (slide annotations):
  - ε additive error and 1+ε multiplicative error: S - ε < S̃ < (1+ε)S + ε, with cost O(ε^-3 log n)
  - estimator: function of [ã_i/u_i - 4/ε]_+ and the u_i's
  - concrete distribution of u_i: minimum of O(ε^-3) u.r.v.
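
The simplified estimator on slide 19 can be simulated in a few lines of Python. This is a sketch under stated assumptions: each a_i lies in [0,1], the approximation ã_i is simulated as a_i plus random noise strictly smaller than u_i (the lemma allows any such value), and the normalization constant is taken to be 6 because the slide gives E[count] ≈ Σ a_i / 6. The function name is hypothetical.

```python
import random


def precision_sampling_estimate(a, rng):
    """Simplified estimator: 6 * #{i : ã_i / u_i > 6}, assuming each a_i in [0, 1]."""
    count = 0
    for a_i in a:
        u_i = 1.0 - rng.random()  # uniform in (0, 1], avoids u_i = 0
        # Simulated approximation: any ã_i with |a_i - ã_i| < u_i is allowed.
        a_tilde = a_i + 0.99 * u_i * rng.uniform(-1.0, 1.0)
        if a_tilde / u_i > 6:
            count += 1
    return 6 * count


rng = random.Random(0)
a = [rng.random() for _ in range(1000)]
print(sum(a), precision_sampling_estimate(a, rng))
```

Only the few indices that draw a small u_i need an accurate ã_i, which is where the low average cost comes from; when every a_i is 0, no ratio can exceed 1, so the estimate is exactly 0.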

20 [Figure labels: x = (x_1, x_2, x_3, x_4, x_5, x_6); H = (y_1+y_3, y_4, y_2+y_5+y_6)]

21 Streaming++
- LOTS of work in the area:
  - Surveys
    - Muthukrishnan: http://algo.research.googlepages.com/eight.ps
    - McGregor: http://people.cs.umass.edu/~mcgregor/papers/08-graphmining.pdf
    - Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf
  - Open problems: http://sublinear.info
- Examples:
  - Moments, sampling
  - Median estimation, longest increasing sequence
  - Graph algorithms
    - E.g., dynamic graph connectivity [AGG’12, KKM’13,…]
  - Numerical algorithms (e.g., regression, SVD approximation)
    - Fastest (sparse) regression […CW’13,MM’13,KN’13,LMP’13]
    - related to Compressed Sensing

22

