
1 Efficient Computation of Frequent and Top-k Elements in Data Streams
Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi
Department of Computer Science, University of California, Santa Barbara

2 Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

3 Motivation
Motivated by Internet advertising commissioners.
– Before rendering an advertisement for a user, query the click stream to decide which advertisements to display.
– If the user's profile is not a frequent clicker, s/he will probably not click any displayed advertisement: show Pay-Per-Impression advertisements.
– If the user's profile is a frequent clicker, s/he may click a displayed advertisement: show Pay-Per-Click advertisements, and retrieve the top advertisements to choose what to display.

4 Problem Definition
– Given an alphabet A and a stream S of size N, a frequent element E is an element whose frequency F exceeds a user-specified support φN.
– The top-k elements are the k elements with the highest frequencies.
– The two problems are closely related, yet no integrated solution had been proposed.
– An exact solution requires O(min(N, |A|)) space, which motivates approximate variations.
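Stated formally (a restatement of the definitions above, writing F(e) for the frequency of element e in S):

\[
\text{Frequent}(\phi) = \{\, e \in A : F(e) > \phi N \,\}, \qquad
\text{Top-}k = \{\, e_{1},\dots,e_{k} : F(e_{1}) \ge \dots \ge F(e_{k}) \ge F(e) \text{ for all } e \notin \{e_{1},\dots,e_{k}\} \,\}.
\]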

5 Practical Frequent Elements
ε-Deficient Frequent Elements [Manku 02]:
– All frequent elements output should have F > (φ - ε)N, where ε is a user-defined error.
(Diagram: a frequency axis marking the two thresholds (φ - ε)N and φN.)

6 Practical Top-k
FindApproxTop(S, k, ε) [Charikar 02]:
– Retrieve a list of k elements such that every element E_i in the list has F_i > (1 - ε) F_k, where E_k is the k-th ranked element.
(Diagram: a frequency axis marking the two thresholds (1 - ε) F_4 and F_4.)

7 Related Work: Algorithms Classification
– Counter-Based techniques
Keep an individual counter for each monitored element.
If the observed ID is monitored, its counter is updated.
If the observed ID is not monitored, an algorithm-dependent action is taken.
– Sketch-Based techniques
Estimate frequencies for all elements using bit-maps of counters.
Each element is hashed into the counter space using a family of hash functions.
The hashed-to counters are queried for the frequencies.

8 Recent Work (Comparison)

Algorithm | Nature | Space Bound | Handles
CountSketch [Charikar 02] | Sketch | O((k/ε²) log(N/δ)), δ is the failure probability | FindApproxTop(S, k, ε)
GroupTest [Cormode 03] | Sketch | O(φ⁻¹ log(φ⁻¹) log(|A|)) | Hot Items
Frequent [Demaine 02] | Counter | O(1/ε), proved by [Bose 03] | FE
Probabilistic-Inplace [Demaine 02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2)
Lossy Counting [Manku 02] | Counter | O((1/ε) log(εN)) | ε-Deficient FE
Sticky Sampling [Manku 02] | Counter | O((2/ε) log(φ⁻¹ δ⁻¹)) | ε-Deficient FE

9 Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

10 The Space-Saving Algorithm
– Space-Saving is counter-based.
– Monitor only m elements.
– Only over-estimation errors.
– Frequency estimation is more accurate for significant elements.
– Keep track of the maximum possible error of each counter.

11 Space-Saving By Example
Space-Saving Algorithm, for every element in the stream S:
– If a monitored element is observed, increment its Count.
– If a non-monitored element is observed, replace the element with the minimum hits, min; increment the minimum Count to min + 1; the maximum possible over-estimation, min, is recorded as its error.

Example with m = 3 counters on the stream A B B A C A B B D D B E C (a Python sketch follows this slide):
– After A B B A C: Elements (A, B, C), Counts (2, 2, 1), errors (0, 0, 0)
– After the next A: Elements (A, B, C), Counts (3, 2, 1), errors (0, 0, 0)
– After B B: Elements (B, A, C), Counts (4, 3, 1), errors (0, 0, 0)
– After D (replaces C): Elements (B, A, D), Counts (4, 3, 2), errors (0, 0, 1)
– After D B: Elements (B, A, D), Counts (5, 3, 3), errors (0, 0, 1)
– After E (replaces D): Elements (B, E, A), Counts (5, 4, 3), errors (0, 3, 0)
– After C (replaces A): Elements (B, E, C), Counts (5, 4, 4), errors (0, 3, 3)
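The update step can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the names (SpaceSaving, counts, errors) are illustrative, and the dictionary scan for the minimum costs O(m) per replacement, unlike the constant-time Stream-Summary structure introduced later. Ties for the minimum may be broken differently than on the slide, but the final summary for the example stream comes out the same.

```python
class SpaceSaving:
    def __init__(self, m):
        self.m = m              # number of monitored counters
        self.counts = {}        # element -> estimated count
        self.errors = {}        # element -> maximum possible over-estimation

    def observe(self, e):
        if e in self.counts:                     # monitored: just increment
            self.counts[e] += 1
        elif len(self.counts) < self.m:          # a free counter is available
            self.counts[e] = 1
            self.errors[e] = 0
        else:                                    # replace the minimum counter
            e_min = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(e_min)
            del self.errors[e_min]
            self.counts[e] = min_count + 1       # new element starts at min + 1
            self.errors[e] = min_count           # its over-estimation is at most min

# Replaying the slide's example stream with m = 3 counters:
ss = SpaceSaving(3)
for x in "ABBACABBDDBEC":
    ss.observe(x)
print(ss.counts)   # {'B': 5, 'E': 4, 'C': 4}
print(ss.errors)   # {'B': 0, 'E': 3, 'C': 3}
```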

12 Space-Saving Observations
For S = A B B A C A B B D D B E C, N = 13, the final summary is Elements (B, E, C), Counts (5, 4, 4), errors (0, 3, 3).
– The summation of the Counts is N.
– The minimum number of hits, min ≤ N/m; in this example, min = 4.
– The minimum number of hits, min, is an upper bound on the error of any element.
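The bound on min follows directly from the first observation: every arrival increments exactly one counter, so the m Counts sum to N, and the smallest of m numbers cannot exceed their average:

\[
\min_i \mathit{Count}_i \;\le\; \frac{1}{m}\sum_{i=1}^{m} \mathit{Count}_i \;=\; \frac{N}{m}.
\]

Here min ≤ 13/3 ≈ 4.33, consistent with min = 4.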

13 Space-Saving Proved Properties
For S = A B B A C A B B D D B E C, N = 13, with final summary Elements (B, E, C), Counts (5, 4, 4), errors (0, 3, 3):
1. If an element E has frequency F > min, then E must be in the Stream-Summary. Example: F(B) = F_1 = 5 > min = 4, and B is in the summary.
2. The Count at position i in the Stream-Summary is no less than F_i, the frequency of the i-th ranked element. Example: F(A) = F_2 = 3, while Count_2 = 4.

14 Space-Saving Intuition
– Make use of the skewed property of the data: a minority of the elements, the more frequent ones, gets the majority of the hits.
– Frequent elements reside in the counters with larger values, so they are not distorted by the ineffective hits of the infrequent elements.
– The numerous infrequent elements reside in the smaller counters.

15 Space-Saving Intuition (Contd)
If the skew remains but the popular elements change over time:
– The elements that are growing more popular will gradually be pushed to the top of the list.
– If a previously popular element loses its popularity, its relative position will decline as other counters get incremented.

16 Space-Saving Data Structure
We need a data structure that
– Increments counters in constant time
– Keeps elements sorted by their counters
We propose the Stream-Summary structure, similar to the data structure in [Demaine 02] (a sketch follows this slide).
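A minimal sketch of such a structure, using hypothetical names (Bucket, StreamSummary) rather than the paper's exact design: a doubly linked list of buckets ordered by count, where each bucket holds all elements sharing that count, so an increment only moves an element to the neighbouring bucket in constant time.

```python
class Bucket:
    def __init__(self, value):
        self.value = value          # count shared by all elements in this bucket
        self.elements = set()       # elements currently having this count
        self.prev = self.next = None

class StreamSummary:
    def __init__(self):
        self.head = None            # bucket with the smallest count
        self.bucket_of = {}         # element -> its bucket

    def add(self, e):
        """Insert a new element with count 1 (used while free counters remain)."""
        if self.head is None or self.head.value != 1:
            b = Bucket(1)
            b.next = self.head
            if self.head is not None:
                self.head.prev = b
            self.head = b
        self.head.elements.add(e)
        self.bucket_of[e] = self.head

    def increment(self, e):
        """Move e to the bucket for count+1; all pointer work is O(1)."""
        old = self.bucket_of[e]
        old.elements.discard(e)
        nxt = old.next
        if nxt is None or nxt.value != old.value + 1:
            nxt = Bucket(old.value + 1)         # create the count+1 bucket next to old
            nxt.prev, nxt.next = old, old.next
            if old.next is not None:
                old.next.prev = nxt
            old.next = nxt
        nxt.elements.add(e)
        self.bucket_of[e] = nxt
        if not old.elements:                    # unlink the now-empty old bucket
            if old.prev is not None:
                old.prev.next = nxt
            else:
                self.head = nxt
            nxt.prev = old.prev

# Tiny usage example:
ss = StreamSummary()
for x in "ABAB":
    ss.increment(x) if x in ss.bucket_of else ss.add(x)
print({e: b.value for e, b in ss.bucket_of.items()})   # {'A': 2, 'B': 2}
```

Keeping an additional pointer to the largest bucket would allow the queries below to traverse elements in decreasing order of Count.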

17 Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

18 Frequent Elements Queries
– Traverse the Stream-Summary and report all elements that satisfy the user support.
– Any element whose guaranteed hits = (Count - error) > φN is guaranteed to be a frequent element.
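A sketch of this query over the counters kept by Space-Saving (hypothetical function name; for simplicity it takes the counts and errors maps from the earlier sketch rather than traversing a Stream-Summary):

```python
def frequent_elements(counts, errors, phi, n):
    """Return (candidates, guaranteed) for support phi over a stream of length n."""
    threshold = phi * n
    candidates = [e for e, c in counts.items() if c > threshold]
    guaranteed = [e for e in candidates if counts[e] - errors[e] > threshold]
    return candidates, guaranteed

# Using the summary obtained for the stream ABBACABBDDBEC (N = 13):
counts = {'B': 5, 'E': 4, 'C': 4}
errors = {'B': 0, 'E': 3, 'C': 3}
print(frequent_elements(counts, errors, phi=0.3, n=13))
# (['B', 'E', 'C'], ['B'])  -- only B is guaranteed to be frequent
```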

19 Frequent Elements Example
For N = 73, m = 8, φ = 0.15:
– Frequent elements should have a support of 11 hits.
– Candidate frequent elements (Count > 11) are B, D, and G.
– Guaranteed frequent elements are B and D, since their guaranteed hits = Count - error > 11.
(Table on the slide: Count, error, and guaranteed hits for the monitored elements B, D, G, A, Q, F, C, E.)

20 Frequent Elements Space Bounds

Algorithm | General Distribution | Zipf(α)
Space-Saving | O(1/ε) | O((1/ε)^(1/α))
GroupTest | O(φ⁻¹ log(φ⁻¹) log(|A|)) | –
Frequent | O(1/ε), proved by [Bose 03] | –
Lossy Counting | O((1/ε) log(εN)) | –
Sticky Sampling | O((2/ε) log(φ⁻¹ δ⁻¹)) | –

21 FE: Quantitative Comparison
Example: N = 10^6, |A| = 10^4, φ = 10^-1, ε = 10^-2, and δ, the failure probability, = 10^-1.
Uniform data:
– Space-Saving and Frequent: 100 counters
– Sticky Sampling: 700 counters
– Lossy Counting: 1000 counters
– GroupTest: C*930 counters, C > 1
Zipfian data with α = 2:
– Space-Saving: 10 counters (from the (1/ε)^(1/α) bound: 100^(1/2) = 10)

22 FE: Qualitative Comparison
Frequent:
– It has a bound similar to Space-Saving in the general distribution case.
– It is built and queried in a way that does not allow the user to specify an error threshold.
– There is no feasible extension to track under-estimation errors.
– Every observation of a non-monitored element increases the errors of all the monitored elements, since their counters get decremented.

23 FE: Qualitative Comparison (Contd)
GroupTest:
– It does not output frequencies at all.
– It reveals nothing about the relative order of the elements.
– It assumes that IDs are 1 … |A|. This can only be enforced by building an indexed lookup table.
– Thus, in practice it needs O(|A|) space.

24 FE: Qualitative Comparison (Contd)
Lossy Counting and Sticky Sampling:
– The theoretical space bound of Space-Saving is much tighter than those of Lossy Counting and Sticky Sampling.

25 Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

26 Top-k Elements Queries
Traverse the Stream-Summary and report the top-k elements. From Property 2, we assert:
– Guaranteed top-k elements: any element whose guaranteed hits = (Count - error) ≥ Count_{k+1} is guaranteed to be in the top-k.
– Guaranteed top-k' (where k' ≥ k): the top-k' elements reported are guaranteed to be the correct top-k' iff for every element in the top-k', guaranteed hits = (Count - error) ≥ Count_{k'+1}.
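A sketch of the top-k query with both guarantee checks, again over the counts/errors maps from the earlier sketch (hypothetical names, not the paper's code):

```python
def top_k(counts, errors, k):
    """Return (reported top-k, elements guaranteed in the top-k, whether the set is exact)."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    topk = ranked[:k]
    # Count of the (k+1)-th ranked counter, or 0 if fewer than k+1 counters exist
    boundary = counts[ranked[k]] if len(ranked) > k else 0
    # membership guarantee: guaranteed hits reach the (k+1)-th Count
    guaranteed_members = [e for e in topk if counts[e] - errors[e] >= boundary]
    # the reported set is the exact top-k iff every reported element is guaranteed
    exact = len(guaranteed_members) == k
    return topk, guaranteed_members, exact

# Using the summary obtained for the stream ABBACABBDDBEC:
counts = {'B': 5, 'E': 4, 'C': 4}
errors = {'B': 0, 'E': 3, 'C': 3}
print(top_k(counts, errors, k=2))
# (['B', 'E'], ['B'], False)  -- B is guaranteed to be in the top-2; the set is not guaranteed
```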

27 Top-k Elements Example
For k = 3, m = 8:
– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3.
– B, D, G, and A are guaranteed to be the top-4 (here k' = 4).
– B and D are guaranteed to be the top-2 (another k', k' = 2).
(Table on the slide: Count, error, and guaranteed hits for the monitored elements B, D, G, A, Q, F, C, E.)

28 Top-k Elements Space Bounds

Algorithm | General Distribution | Zipf(α)
Space-Saving | FindApproxTop(S, k, ε): O((k/ε) log N) | Exact top-k problem: α = 1: O(k² log|A|); α > 1: O((k/α)^(1/α) k)
CountSketch | FindApproxTop(S, k, ε): O((k/ε²) log(N/δ)) | FindApproxTop(S, k, ε): α ≥ 1: O(k log(N/δ))

29 Top-k: Quantitative Comparison
For N = 10^6, |A| = 10^4, k = 100, ε = 10^-1, and δ = 10^-1.
Uniform data:
– Space-Saving: 1000 counters
– CountSketch: C*2.3*10^7 counters, C >> 1
Zipfian data with α = 2:
– Space-Saving: 66 counters
– CountSketch: C*230 counters, C >> 1

30 Top-k: Qualitative Comparison
CountSketch:
– General distribution: Space-Saving has a tighter theoretical space bound.
– Zipf(α) distribution: Space-Saving solves the exact problem, while CountSketch solves the approximate problem.
Space-Saving has a tighter bound in cases of
– Skewed data
– Long streams
It also has zero probability of failure.

31 Top-k: Qualitative Comparison (Contd)
Probabilistic-Inplace:
– Outputs m/2 elements, which is too many.
– Zipf(α) distribution: Probabilistic-Inplace does not offer space analysis in the case of Zipfian data.

32 Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

33 Experimental Results - Setup
Synthetic data:
– Zipf(α), with α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0
– N = 10^7 hits
Real data (ValueClick, Inc.): similar results.
Measures (stated as formulas below):
– Precision: number of correct elements found / entire output
– Recall: number of correct elements found / number of actual correct elements
– Run time: processing the stream + query time
– Space used: including the hash table
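As formulas, the two quality measures are:

\[
\text{Precision} = \frac{|\text{correct elements reported}|}{|\text{elements reported}|},
\qquad
\text{Recall} = \frac{|\text{correct elements reported}|}{|\text{actual correct elements}|}.
\]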

34 Frequent Elements Results
Query: φ = 10^-2, ε = 10^-4, and a fixed failure probability δ.
We compared with GroupTest and Frequent.
– All algorithms had a recall of 1; that is, they all output the correct elements among their output.
– Space-Saving was able to guarantee all of its output to be correct.

35 Frequent Elements Precision

36 Frequent Elements Run Time

37 Frequent Elements Space Used

38 Top-k Elements Results
Query: k = 100, ε = 10^-4, and a fixed failure probability δ.
We compared with:
– CountSketch: re-run several times; the hidden constant was estimated to be 16 in order to have output of competitive quality.
– Probabilistic-Inplace: allowed the same number of counters as Space-Saving.
Space-Saving was able to guarantee all of its output to be correct.

39 Top-k Elements Precision

40 Top-k Elements Recall

41 Top-k Elements Run Time

42 Top-k Elements Space Used

43 Outline
– Problem Definition
– Space-Saving: Summarizing the Data Stream
– Answering Frequent Elements Queries
– Answering Top-k Queries
– Experimental Results
– Conclusion

44 Conclusion
Contributions:
– An integrated approach to solve an interesting family of problems
– Strict error bounds using little space
– Guarantees on results
– Special attention given to Zipfian data
– Experimental validation
Future Work:
– Incremental frequent and top-k elements reporting

