# Efficient Computation of Frequent and Top-k Elements in Data Streams

Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi
Department of Computer Science, University of California, Santa Barbara

Outline
- Problem Definition
- Space-Saving: Summarizing the Data Stream

Motivation
- Motivated by Internet advertising commissioners

Problem Definition
- Given an alphabet $A$ and a stream $S$ of size $N$, a frequent element $E$ is an element whose frequency $F$ exceeds a user-specified support $\phi N$.
- The top-$k$ elements are the $k$ elements with the highest frequency.
- Both problems:
  - They are closely related, yet no integrated solution has been proposed.
  - An exact solution requires $O(\min(N, |A|))$ space, which motivates approximate variations.
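
As a baseline for the definitions above, the exact solution can be sketched with one counter per distinct element, which is where the $O(\min(N, |A|))$ space comes from. The function names `exact_frequent` and `exact_top_k` are illustrative and do not appear in the presentation.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return the elements whose exact frequency F exceeds phi * N."""
    counts = Counter(stream)      # one counter per distinct element
    n = sum(counts.values())      # stream size N
    return {e for e, f in counts.items() if f > phi * n}


def exact_top_k(stream, k):
    """Return the k (element, frequency) pairs with the highest frequency."""
    return Counter(stream).most_common(k)


stream = "ABBACABBDDBEC"          # the stream used later in the slides
print(exact_frequent(stream, 0.3))
print(exact_top_k(stream, 2))
```

The memory used grows with the number of distinct elements, which is what the approximate variations try to avoid.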

Practical Frequent Elements
$\epsilon$-Deficient Frequent Elements [Manku ’02]: every element reported as frequent should have $F > (\phi - \epsilon)N$, where $\epsilon$ is the user-defined error.

Practical Top-k
FindApproxTop($S$, $k$, $\epsilon$) [Charikar ’02]: retrieve a list of $k$ elements such that every element $E_i$ in the list has $F_i > (1 - \epsilon) F_k$, where $E_k$ is the $k$th-ranked element.

Related Work
Algorithms Classification
- Counter-based techniques
  - Keep an individual counter for each monitored element.
  - If the observed ID is monitored, its counter is updated.
  - If the observed ID is not monitored, the action taken depends on the algorithm.
- Sketch-based techniques
  - Estimate the frequency of all elements using bit-maps of counters.
  - Each element is hashed into the counters’ space using a family of hash functions.
  - The hashed-to counters are queried for the frequencies.
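
To make the sketch-based idea concrete, here is a minimal Count-Min-style sketch. This is a different and simpler structure than the CountSketch and GroupTest algorithms cited in this talk, chosen only because it is short. The class name, the `depth` and `width` parameters, and the use of `hashlib.blake2b` to simulate a family of hash functions are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters; each row uses its own hash function."""

    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, x):
        # Derive one hash function per row by salting the input with the row id.
        digest = hashlib.blake2b(f"{row}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, x):
        # Every element is hashed into one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(row, x)] += 1

    def estimate(self, x):
        # Query the hashed-to counters; collisions can only inflate them,
        # so the minimum is the tightest available estimate.
        return min(self.table[row][self._index(row, x)]
                   for row in range(self.depth))


sketch = CountMinSketch(depth=4, width=64)
for element in "ABBACABBDDBEC":
    sketch.add(element)
print(sketch.estimate("B"))
```

Unlike a counter-based summary, the sketch stores no element IDs, so it can estimate the frequency of any element but cannot list the frequent ones by itself.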

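Recent Work (Comparison)

| Algorithm | Nature | Space Bound | Handles |
|---|---|---|---|
| CountSketch [Charikar ’02] | Sketch | $O((k/\epsilon^2) \log(N/\delta))$, where $\delta$ is the failure probability | FindApproxTop($S$, $k$, $\epsilon$) |
| GroupTest [Cormode ’03] | Sketch | $O(\phi^{-1} \log(\phi^{-1}) \log(\lvert A \rvert))$ | Hot Items |
| Frequent [Demaine ’02] | Counter | $O(1/\epsilon)$, proved by [Bose ’03] | FE |
| Probabilistic-Inplace [Demaine ’02] | Counter | $O(m)$, where $m$ is the available memory | FindCandidateTop($S$, $k$, $m/2$) |
| Lossy Counting [Manku ’02] | Counter | $(1/\epsilon) \log(\epsilon N)$ | $\epsilon$-Deficient FE |
| Sticky Sampling [Manku ’02] | Counter | $(2/\epsilon) \log(\phi^{-1}\delta^{-1})$ | $\epsilon$-Deficient FE |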

The Space-Saving Algorithm
- Space-Saving is counter-based.
- It monitors only $m$ elements.
- It makes only over-estimation errors.
- Frequency estimation is more accurate for significant elements.
- It keeps track of the maximum possible errors.

Space-Saving By Example
Stream: $S$ = A B B A C A B B D D B E C, summarized with $m = 3$ counters.

Space-Saving Algorithm: for every element in the stream $S$:
- If a monitored element is observed, increment its Count.
- If a non-monitored element is observed:
  - Replace the element with the minimum hits, min.
  - Increment the minimum Count to min + 1.
  - Set the new element's error to min, its maximum possible over-estimation.

Stream-Summary snapshots from the slides, in stream order:

| Elements observed so far | Stream-Summary as Element: Count (error) |
|---|---|
| A B B A C | A: 2 (0), B: 2 (0), C: 1 (0) |
| A B B A C A | A: 3 (0), B: 2 (0), C: 1 (0) |
| A B B A C A B B | B: 4 (0), A: 3 (0), C: 1 (0) |
| A B B A C A B B D | B: 4 (0), A: 3 (0), D: 2 (1) |
| A B B A C A B B D D B | B: 5 (0), A: 3 (0), D: 3 (1) |
| A B B A C A B B D D B E | B: 5 (0), E: 4 (3), A: 3 (0) |
| A B B A C A B B D D B E C | B: 5 (0), E: 4 (3), C: 4 (3) |
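
The update rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it keeps the counters in plain dictionaries and finds the minimum with a linear scan, so each replacement costs $O(m)$ rather than the constant time the talk aims for. The function name `space_saving` and the tie-breaking rule (first minimum in insertion order) are assumptions.

```python
def space_saving(stream, m):
    """Summarize `stream` while monitoring at most m elements.

    Returns (element, count, error) triples sorted by count, descending.
    """
    counts = {}   # element -> estimated count (never an under-estimate)
    errors = {}   # element -> maximum possible over-estimation
    for x in stream:
        if x in counts:
            # Monitored element: increment its Count.
            counts[x] += 1
        elif len(counts) < m:
            # A free counter is available: start monitoring with no error.
            counts[x] = 1
            errors[x] = 0
        else:
            # Non-monitored element: replace the element with minimum hits.
            victim = min(counts, key=counts.get)
            min_count = counts.pop(victim)
            del errors[victim]
            counts[x] = min_count + 1   # increment the minimum Count
            errors[x] = min_count       # x may have been over-counted by min
    return sorted(((e, counts[e], errors[e]) for e in counts),
                  key=lambda triple: -triple[1])


print(space_saving("ABBACABBDDBEC", 3))
```

On the example stream with `m = 3` this ends with B at count 5 (error 0) and E and C at count 4 (error 3 each). Intermediate states can differ from the slides when two elements tie for the minimum, since either may be replaced.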

Space-Saving Observations
$S$ = ABBACABBDDBEC, $N = 13$

Observations:
- The sum of the Counts is $N$.
- The minimum number of hits satisfies $\mathit{min} \le N/m$.
- In this example, $\mathit{min} = 4$.
- The minimum number of hits, min, is an upper bound on the error of any element.

| Element | Count | error (max possible) |
|---|---|---|
| B | 5 | 0 |
| E | 4 | 3 |
| C | 4 | 3 |
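
The bound on min follows from the first observation in one step. Every observed element raises the total of the Counts by exactly one (a replacement turns min into min + 1), so the Counts sum to $N$. Assuming all $m$ counters are in use, each of them is at least min:

$$
\sum_{i=1}^{m} \mathit{Count}_i = N
\quad\text{and}\quad
\mathit{Count}_i \ge \mathit{min}
\;\Longrightarrow\;
m \cdot \mathit{min} \le N
\;\Longrightarrow\;
\mathit{min} \le \frac{N}{m}.
$$

For the example, $5 + 4 + 4 = 13 = N$ and $\mathit{min} = 4 \le 13/3 \approx 4.33$.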

Space-Saving Proved Properties
$S$ = ABBACABBDDBEC, $N = 13$

- Property 1: if an element $E$ has frequency $F > \mathit{min}$, then $E$ must be in Stream-Summary. Example: $F(B) = F_1 = 5$, $\mathit{min} = 4$.
- Property 2: the Count at position $i$ in Stream-Summary is no less than $F_i$, the frequency of the $i$th-ranked element. Example: $F(A) = F_2 = 3$, $\mathit{Count}_2 = 4$.
- Property 2 is important to guarantee the correctness and order of the top-$k$.

| Element | Count | error (max possible) |
|---|---|---|
| B | 5 | 0 |
| E | 4 | 3 |
| C | 4 | 3 |

Space-Saving Intuition
- Make use of the skewed property of the data.
- A minority of the elements, the more frequent ones, get the majority of the hits.
- Frequent elements will reside in the counters with bigger values.
- They will not be distorted by the ineffective hits of the infrequent elements.
- Numerous infrequent elements reside in the smaller counters.

Space-Saving Intuition (Cont’d)
If the skew remains, but the popular elements change over time:
- The elements that are growing more popular will gradually be pushed to the top of the list.
- If one of the previously popular elements loses its popularity, its relative position will decline as other counters get incremented.

Space-Saving Data Structure
- We need a data structure that:
  - Increments counters in constant time
  - Keeps elements sorted by their counters
- We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02].
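
One way to meet both requirements is a doubly linked list of buckets, one bucket per distinct count value, kept in ascending order. The sketch below is written under that assumption and is not taken from the presentation. The names `Bucket`, `StreamSummary`, `offer`, and `items` are hypothetical.

```python
class Bucket:
    """All monitored elements that currently share one count value."""

    def __init__(self, value):
        self.value = value
        self.elements = set()
        self.prev = None   # bucket with the next smaller value
        self.next = None   # bucket with the next larger value


class StreamSummary:
    def __init__(self, m):
        self.m = m
        self.head = None      # bucket holding the minimum count
        self.bucket_of = {}   # element -> its Bucket
        self.error = {}       # element -> max possible over-estimation

    def _unlink_if_empty(self, b):
        if b.elements:
            return
        if b.prev is None:
            self.head = b.next
        else:
            b.prev.next = b.next
        if b.next is not None:
            b.next.prev = b.prev

    def _increment(self, x):
        """Move x to the bucket for value + 1; only neighbors are touched."""
        b = self.bucket_of[x]
        nxt = b.next
        if nxt is None or nxt.value != b.value + 1:
            new = Bucket(b.value + 1)
            new.prev, new.next = b, nxt
            b.next = new
            if nxt is not None:
                nxt.prev = new
            nxt = new
        b.elements.discard(x)
        nxt.elements.add(x)
        self.bucket_of[x] = nxt
        self._unlink_if_empty(b)

    def offer(self, x):
        if x in self.bucket_of:
            self._increment(x)
        elif len(self.bucket_of) < self.m:
            # Free counter: start monitoring x with count 1 and no error.
            if self.head is None or self.head.value != 1:
                b = Bucket(1)
                b.next = self.head
                if self.head is not None:
                    self.head.prev = b
                self.head = b
            self.head.elements.add(x)
            self.bucket_of[x] = self.head
            self.error[x] = 0
        else:
            # Replace an arbitrary element of the minimum bucket with x.
            b = self.head
            victim = b.elements.pop()
            del self.bucket_of[victim]
            del self.error[victim]
            b.elements.add(x)
            self.bucket_of[x] = b
            self.error[x] = b.value
            self._increment(x)

    def items(self):
        """Return (element, count, error) triples, largest count first."""
        out = []
        b = self.head
        while b is not None:
            out.extend((e, b.value, self.error[e]) for e in sorted(b.elements))
            b = b.next
        return out[::-1]


summary = StreamSummary(3)
for element in "ABBACABBDDBEC":
    summary.offer(element)
print(summary.items())
```

Because an increment only moves an element to the adjacent bucket, the list stays sorted without any re-sorting, and the minimum is always available at `head`.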

Frequent Elements Queries
- Traverse Stream-Summary and report all elements that satisfy the user support.
- Any element whose guaranteed hits $= (\mathit{Count} - \mathit{error}) > \phi N$ is guaranteed to be a frequent element.
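
The query can be sketched as a single pass over the summary. The function name `frequent_elements` and the representation of the summary as `(element, count, error)` triples are assumptions. The sample data is the final summary of the ABBACABBDDBEC example, and the support $\phi = 0.3$ is picked here for illustration.

```python
def frequent_elements(summary, phi, n):
    """Split the monitored elements into candidates and guaranteed hits.

    summary: iterable of (element, count, error) triples.
    """
    threshold = phi * n
    # Counts never under-estimate, so every frequent element is a candidate.
    candidates = [e for e, count, _ in summary if count > threshold]
    # Count - error never over-estimates, so these are certainly frequent.
    guaranteed = [e for e, count, error in summary
                  if count - error > threshold]
    return candidates, guaranteed


summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, phi=0.3, n=13)
print(candidates)   # elements whose Count exceeds 3.9
print(guaranteed)   # elements whose Count - error exceeds 3.9
```

Here B, E, and C are all candidates, but only B is guaranteed, because E and C each have a guaranteed count of just 1.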

Frequent Elements Example
Stream-Summary rows as they appear on the slide:
- Element: B, D, G, A, Q, F, C, E
- Count: 20, 14, 12, 9, 7, 5, 3
- error: 1, 4, 2
- Guaranteed Hits = Count − error: 19, 8

For $N = 73$, $m = 8$, $\phi = 0.15$:
- Frequent elements should have a support of 11 hits, since $\phi N = 0.15 \times 73 = 10.95$.
- Candidate frequent elements are B, D, and G.
- Guaranteed frequent elements are B and D, since their guaranteed hits exceed the support.

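Frequent Elements Space Bounds

| Algorithm | General Distribution | Zipf($\alpha$) |
|---|---|---|
| Space-Saving | $O(1/\epsilon)$ | $(1/\epsilon)^{1/\alpha}$ |
| GroupTest | $O(\phi^{-1} \log(\phi^{-1}) \log(\lvert A \rvert))$ | |
| Frequent | $O(1/\epsilon)$, proved by [Bose ’03] | |
| Lossy Counting | $(1/\epsilon) \log(\epsilon N)$ | |
| Sticky Sampling | $(2/\epsilon) \log(\phi^{-1}\delta^{-1})$ | |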

FE: Quantitative Comparison
Example: $N = 10^6$, $|A| = 10^4$, $\phi = 10^{-1}$, $\epsilon = 10^{-2}$, and failure probability $\delta = 10^{-1}$.
- Uniform data:
  - Space-Saving and Frequent: 100 counters
  - Sticky Sampling: 700 counters
  - Lossy Counting: 1000 counters
  - GroupTest: $C \times 930$ counters, $C \ge 1$
- Zipfian data with $\alpha = 2$:
  - Space-Saving: 10 counters
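
The two Space-Saving figures can be checked directly from its bounds, $O(1/\epsilon)$ in general and $(1/\epsilon)^{1/\alpha}$ for Zipfian data. The hidden constants are taken as 1, which is an assumption made here for illustration:

$$
\frac{1}{\epsilon} = \frac{1}{10^{-2}} = 100,
\qquad
\left(\frac{1}{\epsilon}\right)^{1/\alpha} = 100^{1/2} = 10.
$$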

FE: Qualitative Comparison
Frequent:
- It has a bound similar to that of Space-Saving in the general distribution case.
- It is built and queried in a way that does not allow the user to specify an error threshold.
- There is no feasible extension to track under-estimation errors.
- Every observation of a non-monitored element increases the errors for all the monitored elements, since their counters get decremented.
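
The last point is easier to see in code. The sketch below follows the common textbook formulation of decrement-based counting, in which the decrement happens when a non-monitored element arrives and all $m$ counters are in use. It is an assumption that this matches the cited Frequent algorithm in every detail, and the function name `frequent` is hypothetical.

```python
def frequent(stream, m):
    """Decrement-based counting with at most m counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # Non-monitored element and no free counter: every monitored
            # element loses one hit, and counters that reach zero are freed.
            for e in list(counters):
                counters[e] -= 1
                if counters[e] == 0:
                    del counters[e]
    return counters


print(frequent("ABBACABBDDBEC", 3))
```

On the example stream this reports B with 3 hits although its true frequency is 5. The counters under-estimate, and nothing records by how much, whereas Space-Saving only over-estimates and stores a bound on the error per element.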

FE: Qualitative Comparison (Cont’d)
GroupTest:
- It does not output frequencies at all.
- It reveals nothing about the relative order of the elements.
- It assumes that IDs are $1 \ldots |A|$. This can only be enforced by building an indexed lookup table, so in practice it needs $O(|A|)$ space.

FE: Qualitative Comparison (Cont’d)
Lossy Counting and Sticky Sampling:
- The theoretical space bound of Space-Saving is much tighter than those of Lossy Counting and Sticky Sampling.

Top-k Elements Queries
- Traverse the Stream-Summary and report the top-$k$ elements.
- From Property 2, we assert:
  - Guaranteed top-$k$ elements: any element whose guaranteed hits $= (\mathit{Count} - \mathit{error}) \ge \mathit{Count}_{k+1}$ is guaranteed to be in the top-$k$.
  - Guaranteed top-$k'$ (where $k' \approx k$): the top-$k'$ elements reported are guaranteed to be the correct top-$k'$ iff, for every element in the top-$k'$, guaranteed hits $= (\mathit{Count} - \mathit{error}) \ge \mathit{Count}_{k'+1}$.
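
These two checks can be sketched as follows. The function name `top_k`, the triple representation, and the restriction to $k$ smaller than the number of monitored elements (so that $\mathit{Count}_{k+1}$ exists) are assumptions. The sample data is the final summary of the ABBACABBDDBEC example.

```python
def top_k(summary, k):
    """Report the top-k candidates and which of them are guaranteed.

    summary: (element, count, error) triples sorted by count, descending.
    Returns (candidates, guaranteed, all_guaranteed).
    """
    if not 0 < k < len(summary):
        raise ValueError("need 0 < k < number of monitored elements")
    next_count = summary[k][1]          # Count_{k+1}
    top = summary[:k]
    # An element is certainly in the top-k if even its guaranteed hits
    # reach the Count of the first element outside the list.
    guaranteed = [e for e, count, error in top
                  if count - error >= next_count]
    # The reported list is the correct top-k iff every member is guaranteed.
    all_guaranteed = len(guaranteed) == k
    return [e for e, _, _ in top], guaranteed, all_guaranteed


summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
print(top_k(summary, 1))
print(top_k(summary, 2))
```

For `k = 1` the answer B is guaranteed. For `k = 2` only B is guaranteed, since E's guaranteed hits (1) fall below $\mathit{Count}_3 = 4$. A caller can search nearby values of `k` for one where `all_guaranteed` holds, which is the $k'$ of the slide.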

Top-k Elements Example
Stream-Summary rows as they appear on the slide:
- Element: B, D, G, A, Q, F, C, E
- Count: 20, 14, 12, 9, 7, 5, 3
- error: 1, 4, 2
- Guaranteed Hits = Count − error: 19, 8

For $k = 3$, $m = 8$:
- B, D, and G are the top-3 candidates.
- B and D are guaranteed to be in the top-3.
- B, D, G, and A are guaranteed to be the top-4. Here $k' = 4$.
- B and D are guaranteed to be the top-2. This is another $k' = 2$.

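Top-k Elements Space Bounds

| Algorithm | General Distribution | Zipf($\alpha$) |
|---|---|---|
| Space-Saving | FindApproxTop($S$, $k$, $\epsilon$): $O((k/\epsilon) \log N)$ | Exact top-$k$ problem: $\alpha = 1$: $O(k^2 \log \lvert A \rvert)$; $\alpha > 1$: $O((k/\alpha)^{1/\alpha} k)$ |
| CountSketch | $O((k/\epsilon^2) \log(N/\delta))$ | $\alpha \ge 1$: $O(k \log(N/\delta))$ |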

Top-k: Quantitative Comparison
For $N = 10^6$, $|A| = 10^4$, $k = 100$, $\epsilon = 10^{-1}$, and $\delta = 10^{-1}$:
- Uniform data:
  - Space-Saving: 1000 counters
  - CountSketch: $C \times 2.3 \times 10^7$ counters, $C \gg 1$
- Zipfian data with $\alpha = 2$:
  - Space-Saving: 66 counters
  - CountSketch: $C \times 230$ counters, $C \gg 1$

Top-k: Qualitative Comparison
CountSketch:
- General distribution: Space-Saving has a tighter theoretical space bound.
- Zipf($\alpha$) distribution: Space-Saving solves the exact problem, while CountSketch solves the approximate problem.
- Space-Saving has a tighter bound in cases of:
  - Skewed data
  - Long streams
- Space-Saving has zero probability of failure.

Top-k: Qualitative Comparison (Cont’d)
Probabilistic-Inplace:
- It outputs $m/2$ elements, which is too many.
- Zipf($\alpha$) distribution: no space analysis is offered for Zipfian data.

Experimental Results - Setup
- Synthetic data: Zipf($\alpha$), with $\alpha$ varied over 0.0, 0.5, 1.0, …, 2.5, 3.0, and $N = 10^7$ hits.
- Real data (ValueClick, Inc.): similar results.
- Precision: number of correct elements found / size of the entire output.
- Recall: number of correct elements found / number of actual correct elements.
- Run time: stream processing time + query time.
- Space used: including the hash table.
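
The two quality metrics can be written down directly from these definitions. The function name `precision_recall` is illustrative, and returning 1.0 when a denominator is empty is a convention chosen here, not something the presentation specifies.

```python
def precision_recall(reported, actual):
    """Compute precision and recall of a reported set of elements.

    reported: elements output by the algorithm.
    actual:   elements that truly answer the query.
    """
    reported, actual = set(reported), set(actual)
    correct = len(reported & actual)
    # Assumed convention: an empty denominator yields 1.0.
    precision = correct / len(reported) if reported else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall


# Three elements reported, of which only B is truly frequent.
print(precision_recall(["B", "E", "C"], ["B"]))
```

In this example recall is 1 because the one correct element was found, while precision is 1/3 because two of the three reported elements are wrong.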

Frequent Elements Results
- Query: $\phi = 10^{-2}$, $\epsilon = 10^{-4}$, and $\delta = 10^{-2}$.
- We compared with GroupTest and Frequent.
- All algorithms had a recall of 1. That is, they all output the correct elements among their output.
- Space-Saving was able to guarantee all its output to be correct.

Frequent Elements Precision

Frequent Elements Run Time

Frequent Elements Space Used

Top-k Elements Results
- Query: $k = 100$, $\epsilon = 10^{-4}$, and $\delta = 10^{-2}$.
- We compared with:
  - CountSketch: it was re-run several times. The hidden constant was estimated to be 16 in order to have output of competitive quality.
  - Probabilistic-InPlace: it was allowed the same number of counters as Space-Saving.
- Space-Saving was able to guarantee all its output to be correct.

Top-k Elements Precision

Top-k Elements Recall

Top-k Elements Run Time

Top-k Elements Space Used
