# Finding Frequent Items in Data Streams

Moses Charikar (Princeton Un., Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers Un., Google Inc.)


Presented by Amir Rothschild

Presenting:

- A 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. The algorithm achieves especially good space bounds for Zipfian distributions.
- A 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.

Definitions:

- Data stream: S = q1, q2, …, qn, where each qi is drawn from a set of objects O = {o1, …, om}.
- Object oi appears ni times in S.
- Order the oi so that n1 ≥ n2 ≥ … ≥ nm.
- Relative frequency: fi = ni / n.

The first problem: FindApproxTop(S, k, ε)

- Input: stream S, int k, real ε.
- Output: k elements from S such that:
  - every element oi in the output satisfies ni > (1 - ε)·nk;
  - the output contains every item with ni > (1 + ε)·nk.

Clarifications:

- This is not the problem discussed last week!
- The sampling algorithm does not give any bounds for this version of the problem.

Hash functions

We say that h is a pairwise independent hash function, mapping into a range of size r, if h is chosen uniformly at random from a family H such that for every pair of distinct objects x ≠ y and every pair of values a, b:

Pr[h(x) = a and h(y) = b] = 1/r²
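A classic construction of (approximately) such a family, sketched in Python, is h(x) = ((a·x + b) mod p) mod r with p prime and a, b drawn at random. This construction is illustrative and not from the slides:

```python
import random

# The Mersenne prime 2^31 - 1 keeps the modular arithmetic fast and simple.
P = 2_147_483_647

def make_hash(r):
    """Draw a random member of the family, mapping ints to {0, ..., r-1}."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % r
```

Once drawn, a member of the family is a fixed deterministic function; the probabilistic guarantee is over the random choice of a and b.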

Let’s start with some intuition…

- Idea: let s be a hash function from objects to {+1, -1}, and let c be a counter.
- For each qi in the stream, update c += s(qi).
- Estimate ni = c · s(oi) (since E[c · s(oi)] = ni).
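The single-counter idea can be sketched in Python as follows (not from the slides; the sign function here is a memoized random table, standing in for a pairwise-independent hash to {+1, -1}):

```python
import random

def make_sign():
    """Random sign function s: objects -> {+1, -1}."""
    table = {}
    def s(o):
        if o not in table:
            table[o] = random.choice((+1, -1))
        return table[o]
    return s

def single_counter_estimator(stream):
    """One pass over the stream: c += s(q). Returns an estimator for n_o."""
    s = make_sign()
    c = 0
    for q in stream:
        c += s(q)
    return lambda o: c * s(o)   # estimate of n_o, since s(o) * s(o) = +1
```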

Realization

[Slide figure: a worked example of the counter update c += s(qi) over the stream o1, o2, o2, o2, o3, o2, shown for several sign functions s1, …, s4.]

Claim: E[c · s(oi)] = ni.

Proof:

- For each element oj other than oi: s(oj)·s(oi) = -1 w.p. 1/2 and s(oj)·s(oi) = +1 w.p. 1/2.
- So oj adds +nj to the counter w.p. 1/2 and -nj w.p. 1/2, and thus has no influence on the expectation.
- oi, on the other hand, adds +ni to the counter w.p. 1 (since s(oi)·s(oi) = +1).
- So the expectation (average) is +ni.

That’s not enough:

- The variance of the estimate is very high: Var[c · s(oi)] = Σ_{j≠i} nj².
- O(m) objects have estimates that are wrong by more than the variance.

First attempt to fix the algorithm…

- Use t independent hash functions sj and t different counters cj.
- For each element qi in the stream: for each j in {1, 2, …, t}, do cj += sj(qi).
- Take the mean or the median of the estimates cj · sj(oi) to estimate ni.
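This first attempt can be sketched in Python (illustrative code, not from the slides; the sign functions are memoized random tables standing in for pairwise-independent hashes):

```python
import random
import statistics

def make_sign():
    """Random sign function s: objects -> {+1, -1}."""
    table = {}
    def s(o):
        if o not in table:
            table[o] = random.choice((+1, -1))
        return table[o]
    return s

def median_estimator(stream, t=5):
    """t independent sign functions and counters; estimate n_o by the median."""
    signs = [make_sign() for _ in range(t)]
    counters = [0] * t
    for q in stream:
        for j in range(t):
            counters[j] += signs[j](q)
    return lambda o: statistics.median(counters[j] * signs[j](o)
                                       for j in range(t))
```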

Still not enough

- Collisions with high-frequency elements like o1 can spoil most estimates of lower-frequency elements, such as ok.

The solution!!!

- Divide & conquer:
- Don’t let each element update every counter.
- More precisely: replace each counter with a hash table of b counters, and have each item update one counter per hash table.

Presenting the CountSketch algorithm… Let’s start working…

[Slide figure: the CountSketch data structure — t hash tables T1, …, Tt of b buckets each, with bucket hash functions h1, …, ht and sign functions s1, …, st.]

The CountSketch data structure

- Define the CountSketch d.s. as follows:
- Let t and b be parameters whose values are determined later.
- h1, …, ht – hash functions from objects O to {1, 2, …, b}.
- T1, …, Tt – arrays of b counters.
- s1, …, st – hash functions from objects O to {+1, -1}.
- From now on, define: hi[oj] := Ti[hi(oj)].

The d.s. supports 2 operations:

- Add(q): for each i in {1, …, t}, hi[q] += si(q).
- Estimate(q): return median over i of hi[q] · si(q).
- Why median and not mean? To show the median is close to reality, it’s enough to show that half of the estimates are good. The mean, on the other hand, is very sensitive to outliers.
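The two operations can be sketched as a small Python class (illustrative; here the hash functions are memoized random tables standing in for pairwise-independent families):

```python
import random
import statistics

class CountSketch:
    """t hash tables of b counters each, plus sign functions, as defined above."""

    def __init__(self, t, b):
        self.t, self.b = t, b
        self.T = [[0] * b for _ in range(t)]   # t arrays of b counters
        self._h = [{} for _ in range(t)]       # bucket hash h_i: O -> {0..b-1}
        self._s = [{} for _ in range(t)]       # sign hash s_i: O -> {+1, -1}

    def _hash(self, i, q):
        h = self._h[i].setdefault(q, random.randrange(self.b))
        s = self._s[i].setdefault(q, random.choice((+1, -1)))
        return h, s

    def add(self, q):
        # Add(q): h_i[q] += s_i(q) for every table i
        for i in range(self.t):
            h, s = self._hash(i, q)
            self.T[i][h] += s

    def estimate(self, q):
        # Estimate(q): median over i of h_i[q] * s_i(q)
        vals = []
        for i in range(self.t):
            h, s = self._hash(i, q)
            vals.append(self.T[i][h] * s)
        return statistics.median(vals)
```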

Finally, the algorithm:

- Keep a CountSketch d.s. C and a heap of the top k elements.
- Given a data stream q1, …, qn, for each j = 1, …, n:
  - C.Add(qj);
  - If qj is in the heap, increment its count.
  - Else, if C.Estimate(qj) > the smallest estimated count in the heap, add qj to the heap (if the heap is full, evict the object with the smallest estimated count).
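The loop above can be sketched in Python; for brevity this version tracks the k candidates in a plain dict and finds the minimum with a linear scan instead of a heap, and takes the sketch's add/estimate operations as parameters (all names here are illustrative, not from the slides):

```python
from collections import Counter

def top_k(stream, k, add, estimate):
    """One pass: sketch every item, and track (at most) k candidate items.
    `add`/`estimate` are the sketch's operations; any implementation can be
    plugged in (e.g. a CountSketch, or exact counts for testing)."""
    counts = {}                              # candidate item -> maintained count
    for q in stream:
        add(q)
        if q in counts:
            counts[q] += 1                   # item already tracked: bump its count
        elif len(counts) < k:
            counts[q] = estimate(q)          # room left: start tracking q
        else:
            victim = min(counts, key=counts.get)
            if estimate(q) > counts[victim]:
                del counts[victim]           # evict smallest estimated count
                counts[q] = estimate(q)
    return counts

# Demo with exact counts standing in for the sketch:
exact = Counter()
result = top_k(['a'] * 5 + ['b'] * 3 + ['c'], 2,
               lambda q: exact.update([q]), lambda q: exact[q])
```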

And now for the hard part: algorithm analysis

Definitions

Claims & Proofs

The CountSketch algorithm space complexity: taking b = O(k + (1/(ε·nk))² · Σ_{j>k} nj²) and t = O(log(n/δ)) gives O(t·b) counters overall, where δ is the allowed failure probability.

Analysis of the CountSketch algorithm for the Zipfian distribution

Zipfian distribution

- Zipfian(z): Pr[q = oi] = c/i^z for some constant c.
- This distribution is very common in human languages (useful in search engines).
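For experimenting with the algorithm, a Zipfian stream can be sampled with a short Python helper (illustrative; the function and parameter names are ours, not from the slides):

```python
import random

def zipf_stream(m, z, n):
    """Sample a length-n stream over objects 1..m with Pr[q = i] proportional
    to 1/i^z (random.choices normalizes the weights for us)."""
    weights = [1.0 / i ** z for i in range(1, m + 1)]
    return random.choices(range(1, m + 1), weights=weights, k=n)
```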

[Slide figure: plot of Pr[q = oi] as a function of i for the Zipfian distribution.]

Observations

- The k most frequent elements can only be preceded (in the output) by elements j with nj > (1 - ε)·nk.
- ⇒ Choosing l instead of k, so that n(l+1) < (1 - ε)·nk, will ensure that our list includes the k most frequent elements.

Analysis for the Zipfian distribution

For this distribution, the space complexity of the algorithm is: [formula lost from the original slide]

Proof of the space bounds: Part 1, l=O(k)

Proof of the space bounds: Part 2

Comparison of space requirements for random sampling vs. our algorithm

Yet another algorithm which uses the CountSketch d.s.: finding the items with the largest change in frequency

The problem

- Let n_o(S) be the number of occurrences of o in S.
- Given 2 streams S1, S2, find the items o for which |n_o(S1) - n_o(S2)| is maximal.
- This is a 2-pass algorithm.

The algorithm – first pass

- First pass – only update the counters: for each q in S1, hi[q] += si(q); for each q in S2, hi[q] -= si(q). The counters then sketch the difference n_o(S1) - n_o(S2).

The algorithm – second pass

- Pass over S1 and S2 again, maintaining a set A of the items with the largest estimated change: if the estimated change of q exceeds the smallest estimated change in A, add q to A (evicting that smallest item if A is full), and keep exact counts for the items in A.

Explanation

- Though A can change, items once removed are never added back.
- Thus, exact counts can be maintained for all objects currently in A.
- The space bounds for this algorithm are similar to those of the former, with nk replaced by the k-th largest frequency change.
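The two-pass scheme can be modeled compactly in Python; in this simplified sketch, exact difference counters stand in for the CountSketch estimates of n_o(S1) - n_o(S2), and the second pass is modeled by a direct selection of the k largest absolute changes (illustrative code, not from the slides):

```python
from collections import Counter

def largest_change(s1, s2, k):
    """Return the k items whose frequency changes most between s1 and s2."""
    diff = Counter()
    for q in s1:          # first pass over S1: increment the counters
        diff[q] += 1
    for q in s2:          # first pass over S2: decrement the same counters
        diff[q] -= 1
    # Second pass (modeled): keep the k items with the largest |change|.
    return sorted(diff, key=lambda o: abs(diff[o]), reverse=True)[:k]
```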