Published by Sharon Tillett. Modified over 2 years ago.

1
Counting Distinct Objects over Sliding Windows
Presented by: Muhammad Aamir Cheema
Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin
University of New South Wales, Australia

2
Introduction
Counting distinct objects: given a dataset D, return the number of distinct objects in D.
Counting distinct objects against sliding windows: given a data stream, return the number of distinct objects that arrive at or after timestamp t.
Applications: traffic management, call centers, wireless communication, stock market, etc.

3
Introduction
Approximate counting: let n be the actual number of distinct objects and n' be the reported answer. Build a sketch such that every query is answered with the following guarantee:
|n - n'|/n ≤ ε with confidence (1 - δ)
Contribution:
FM-based algorithms
  SE-FM (accuracy guarantee + space usage guarantee)
  PCSA-based algorithm (no accuracy guarantee, although practical; more efficient)
k-Skyband (accuracy guarantee + efficient + no space usage guarantee)

4
FM Algorithm
Let h(x) be a uniform hash function.
Let the “pivot” p(x) be the position of the leftmost 1-bit of h(x).
Let FM be an array of size k initialized to zero.
For each record x in the dataset: FM[p(x)] = 1.
Let B = FM_min be the position of the leftmost 0-bit of FM.
Number of distinct elements = α · 2^B, where α = 1.2897385.
Each bit i of h(x) has probability 1/2 of being one.
Example (k = 4): stream r1, r2, r1, r3, r1 with h(r1) = 0010, h(r2) = 1101, h(r3) = 1000. The FM sketch goes from 0000 to 1010, so FM_min = 1.
Reference: P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.
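The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the SHA-256 based `hash_bits` helper is an assumption standing in for the uniform hash function h, and the example feeds in the hash values printed on the slide (reading the garbled value for r3 as 1000) instead of computing them.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


def hash_bits(x, k):
    """Return a k-bit hash of x as a '0'/'1' string (SHA-256 is an assumption)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    value = int.from_bytes(digest, "big") >> (256 - k)
    return format(value, "0{}b".format(k))


def pivot(bits):
    """0-based position, from the left, of the leftmost 1-bit (None if all zero)."""
    pos = bits.find("1")
    return pos if pos >= 0 else None


def fm_sketch(records, k, h=hash_bits):
    """Build one FM sketch of size k over the records."""
    fm = [0] * k
    for x in records:
        p = pivot(h(x, k))
        if p is not None:
            fm[p] = 1
    return fm


def fm_min(fm):
    """Position of the leftmost 0-bit (len(fm) if every bit is set)."""
    return fm.index(0) if 0 in fm else len(fm)


def fm_estimate(fm):
    """Estimated number of distinct elements: alpha * 2^B."""
    return ALPHA * 2 ** fm_min(fm)


# Slide example: hash values are looked up from the slide, not computed.
slide_hashes = {"r1": "0010", "r2": "1101", "r3": "1000"}
fm = fm_sketch(["r1", "r2", "r1", "r3", "r1"], 4, h=lambda x, k: slide_hashes[x])
print(fm, fm_min(fm), fm_estimate(fm))
```

With the slide's hash values the sketch ends as 1010 and FM_min is 1. Duplicates of r1 do not change the sketch, which is what makes it count distinct elements.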

5
FM Algorithm
Example: stream r1, r2, r1, r3, r1 with h(r1) = 0010, h(r2) = 1101, h(r3) = 1010; FM = 1010, FM_min = 1.
Each bit i of h(x) has probability 1/2 of being one.
An h(x) whose first i bits are zero and whose (i+1)-th bit is one has probability 1/2^(i+1).
Let n be the number of distinct elements.
FM[0] is accessed approximately n/2 times.
FM[1] is accessed approximately n/4 times.
...
FM[i] is accessed approximately n/2^(i+1) times.
If i >> log_2 n, FM[i] will almost certainly be zero.
If i << log_2 n, FM[i] will almost certainly be one.
If i ≈ log_2 n, FM[i] may be zero or one.
Hence, the first i for which FM[i] is zero may be used to approximate the number of distinct elements n.
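The three cases above follow from a short calculation. It is sketched here under the slide's assumption that hash bits are independent and uniform, so that each of the n distinct elements sets FM[i] with probability 2^{-(i+1)}:

```latex
\Pr\bigl[\mathrm{FM}[i] = 0\bigr]
  = \left(1 - 2^{-(i+1)}\right)^{n}
  \approx e^{-n / 2^{i+1}}
```

When i is much smaller than log_2 n the exponent is a large negative number and the probability is close to 0; when i is much larger than log_2 n the exponent is close to 0 and the probability is close to 1.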

6
FM Algorithm
Use r hash functions to create r FM sketches.
Initialize each FM_i to zero.
For each record x in the dataset:
  For each hash function h_i(x): FM_i[pivot] = 1.
Let B_i be the position of the leftmost 0-bit of FM_i.
B = (B_1 + B_2 + ... + B_r) / r
Number of distinct elements = α · 2^B, where α = 1.2897385.
Example: FM_1 = 1010 (B_1 = 1), FM_2 = 1100 (B_2 = 2), FM_3 = 1101 (B_3 = 2), so B = (1 + 2 + 2)/3 = 1.67.
Performance guarantee: let n be the actual number of distinct objects, n' be the reported answer and m be the domain of elements; then
P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε, k = O(log m + log 1/ε + log 1/δ) and r = O((1/ε^2) log 1/δ).
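The averaging step can be checked directly on the three sketches shown on this slide. The helper names below are illustrative only; the sketches are written as bit strings exactly as they appear in the example.

```python
ALPHA = 1.2897385  # constant quoted on the slide


def leftmost_zero(bits):
    """0-based position of the leftmost 0-bit (len(bits) if there is none)."""
    pos = bits.find("0")
    return pos if pos >= 0 else len(bits)


def averaged_estimate(sketches):
    """Average the per-sketch B_i values, then apply alpha * 2^B."""
    b_values = [leftmost_zero(s) for s in sketches]
    b = sum(b_values) / len(b_values)
    return b_values, b, ALPHA * 2 ** b


b_values, b, estimate = averaged_estimate(["1010", "1100", "1101"])
print(b_values, round(b, 2), round(estimate, 2))
```

For the slide's sketches this gives B_1 = 1, B_2 = 2, B_3 = 2 and B ≈ 1.67, matching the example, with an estimate of roughly 4.09 distinct elements.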

7
FM-based Algorithm
Maintaining one FM sketch:
  For each record (x,t) in the dataset: FM[pivot] = t.
Answering a query:
  For any t, let B = FM_min(t) be the position of the leftmost entry of FM with value less than t.
  Number of distinct elements that arrived at or after t = α · 2^B, where α = 1.2897385.
[Figure: a timestamped FM sketch maintained over the stream r1, r2, r3, r2 at timestamps 1 to 5, with h(r1) = 0010 and h(r2) = 1101; FM_min(4) = 0]
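A minimal sketch of the timestamped variant follows. It assumes timestamps start at 1 and arrive in nondecreasing order, so that 0 can mean "never set". The stream and its hash values are hypothetical and chosen only to exercise the code; they are not the slide's example.

```python
ALPHA = 1.2897385  # constant quoted on the slide


def pivot(bits):
    """0-based position, from the left, of the leftmost 1-bit (None if all zero)."""
    pos = bits.find("1")
    return pos if pos >= 0 else None


class TimestampedFM:
    """FM sketch whose entries hold the latest timestamp that hit each position."""

    def __init__(self, k):
        self.latest = [0] * k  # 0 means "never set" (assumes timestamps >= 1)

    def update(self, hash_bits, t):
        p = pivot(hash_bits)
        if p is not None:
            self.latest[p] = t  # newer arrivals overwrite older timestamps

    def fm_min(self, t):
        """Leftmost position whose stored timestamp is smaller than t."""
        for pos, stamp in enumerate(self.latest):
            if stamp < t:
                return pos
        return len(self.latest)

    def estimate(self, t):
        """Estimated number of distinct elements arriving at or after t."""
        return ALPHA * 2 ** self.fm_min(t)


# Hypothetical stream of (object, hash bits, timestamp) triples.
stream = [("a", "1011", 1), ("b", "0110", 2), ("c", "0011", 3),
          ("a", "1011", 4), ("d", "0100", 5)]
sketch = TimestampedFM(k=4)
for _, bits, t in stream:
    sketch.update(bits, t)
print(sketch.latest, sketch.fm_min(4), sketch.fm_min(5))
```

One sketch answers queries for every t, because a position counts as set for the window starting at t exactly when its stored timestamp is at least t.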

8
FM-based Algorithm
Maintain r FM sketches.
Initialize each FM_i to zero.
For each record (x,t) in the dataset:
  For each hash function h_i(x): FM_i[pivot] = t.
Answering a query:
  For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM.
  Let B = (B_1(t) + B_2(t) + ... + B_r(t)) / r.
  Number of distinct elements that arrived at or after t = α · 2^B, where α = 1.2897385.

9
Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' be the reported answer and m be the domain of elements; then
P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε, k = O(log m + log 1/ε + log 1/δ) and r = O((1/ε^2) log 1/δ).
Total space: O((1/ε^2) log(1/δ) log m)
Total maintenance cost for one record: O((1/ε^2) log(1/δ) log log m)
Total query cost: O((1/ε^2) log(1/δ) log log m)

10
PCSA-based Algorithm
Maintain r FM sketches but update only j < r sketches per record.
Generate j hash functions H(x) that map x to [1,r].
Initialize each FM to zero.
For each record (x,t) in the dataset:
  For each of the j hash functions H():
    i = H(x)
    Update the i-th FM sketch.
Answering a query:
  For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM.
  Let B = (B_1(t) + B_2(t) + ... + B_r(t)) / r.
  Number of distinct elements that arrived at or after t = (α · 2^B) / j, where α = 1.2897385.
Inspired by the PCSA technique in P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.
NOTE: No accuracy guarantee, but performs well in practice.

11
BJKST Algorithm
Main idea:
  Let h() be a hash function that hashes D to [1, m^3], where m = |D|.
  For each record x, generate its hash value h(x).
  Maintain the k-th smallest distinct hash value k_min.
  Number of distinct elements = n = k · m^3 / k_min.
Improved algorithm:
  Use r hash functions.
  Compute n_i for each hash function h_i() as above.
  Report the final answer as the median of the n_i values.
Performance guarantee: P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ).
Reference: Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, 2002.
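The estimator described above (k-th smallest hash value, then the median over r hash functions) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the salted SHA-256 hash in `make_hash` stands in for the r hash functions, and the choice to return the exact count when fewer than k distinct hash values have been seen is also an assumption.

```python
import hashlib
import heapq
import statistics


def make_hash(seed, hash_range):
    """Hypothetical hash family: salted SHA-256 reduced to [1, hash_range]."""
    def h(x):
        data = "{}:{}".format(seed, x).encode()
        return int.from_bytes(hashlib.sha256(data).digest(), "big") % hash_range + 1
    return h


def k_smallest_distinct(records, h, k):
    """Keep only the k smallest distinct hash values seen in the stream."""
    heap = []      # max-heap (stored as negatives) of the kept values
    kept = set()
    for x in records:
        v = h(x)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    return sorted(kept)


def bjkst_estimate(records, k, r, hash_range):
    """Median over r hash functions of k * hash_range / k_min."""
    estimates = []
    for seed in range(r):
        smallest = k_smallest_distinct(records, make_hash(seed, hash_range), k)
        if len(smallest) < k:
            estimates.append(len(smallest))  # fewer than k distinct values seen
        else:
            estimates.append(k * hash_range / smallest[-1])
    return statistics.median(estimates)


records = ["obj{}".format(i) for i in range(2000)]
estimate = bjkst_estimate(records, k=256, r=5, hash_range=2000 ** 3)
print(round(estimate))
```

Repeating records in the stream leaves the estimate unchanged, since only distinct hash values are kept.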

12
K-Skyband Technique
Main idea:
  Let h() be a hash function that hashes D to [1, m^3], where m = |D|.
  For each record (x,t') generate h(x) and store the record (x, h(x), t').
Answering a query q(t):
  Retrieve all records (x, h(x), t') for which timestamp t' ≥ t.
  Get the k-th smallest distinct hash value and apply the BJKST algorithm.
Limitation: requires storing all records.

13
K-Skyband Technique
For any time t, we need to find the k-th smallest hash value among records arriving no earlier than t.
A record x dominates another record y if x arrives after y and has a smaller hash value.
The k-skyband keeps only the records that are dominated by at most (k-1) records.
Maintaining the k-skyband:
  Keep a counter for each record.
  When a new element (x,t) arrives, increment the counter of all records dominated by it.
  Remove the records whose counter is at least k.
  To improve efficiency, the counters of groups of records are incremented together (domination aggregation search tree).
[Figure: records a, b, c, d, e plotted by hash value h(x) against arrival time t; k = 2]
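The counter-based maintenance can be illustrated with a plain list. This is a naive sketch that scans every stored record on each arrival; it does not implement the domination aggregation search tree mentioned on the slide. It assumes records arrive in increasing timestamp order, and dropping an older record with the same hash value as the new one (a re-arrival of the same object) is an assumption. The hash values in the example are hypothetical.

```python
def skyband_insert(skyband, hash_value, t, k):
    """Insert (hash_value, t) into a k-skyband stored as [hash, time, counter] lists."""
    for entry in skyband:
        if hash_value < entry[0]:
            entry[2] += 1  # the new record arrives later and has a smaller hash
    # Drop records dominated by k or more records, and any older copy of
    # the same hash value (assumed to be the same object arriving again).
    skyband[:] = [e for e in skyband if e[2] < k and e[0] != hash_value]
    skyband.append([hash_value, t, 0])


# Hypothetical stream of (hash value, timestamp) pairs, with k = 2.
stream = [(50, 1), (30, 2), (40, 3), (10, 4), (20, 5)]
skyband = []
snapshots = []
for hash_value, t in stream:
    skyband_insert(skyband, hash_value, t, k=2)
    snapshots.append([(e[0], e[1]) for e in skyband])
print(snapshots[-1])
```

After the fourth arrival the skyband holds the records with hash values 30, 40 and 10; the record with hash 50 has been removed because two later records have smaller hash values.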

14
K-Skyband Technique
Answering a query:
  Find k_min (the k-th smallest hash value among elements arriving no earlier than t).
  Let z be the number of elements in the k-skyband that arrived before t.
  k_min is the (z+k)-th overall smallest hash value.
Algorithm:
  Maintain a binary search tree eT that stores elements according to t.
  Maintain a binary search tree eH that stores elements according to h(x).
  When a query q(t) arrives:
    Compute z by using eT.
    Find the (z+k)-th overall smallest hash value from eH.
[Figure: records a, b, c, d, e, f plotted by hash value h(x) against arrival time t; k = 2, z = 3, k_min = 5th smallest h(x)]
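The (z+k) rule can be tried on a small example. In this sketch, sorted Python lists searched with `bisect` stand in for the binary search trees eT and eH, which is a simplification chosen for brevity. The skyband contents are hypothetical: they are what a 2-skyband would hold after the arrivals (50,1), (30,2), (40,3), (10,4), written as (hash value, timestamp).

```python
import bisect


def k_min_for_window(skyband, t, k):
    """k-th smallest hash value among records arriving at or after t, or None."""
    times = sorted(time for _, time in skyband)   # stands in for eT
    hashes = sorted(h for h, _ in skyband)        # stands in for eH
    z = bisect.bisect_left(times, t)              # stored records that arrived before t
    rank = z + k
    return hashes[rank - 1] if rank <= len(hashes) else None


# Hypothetical 2-skyband contents as (hash value, timestamp) pairs.
skyband = [(30, 2), (40, 3), (10, 4)]
for t in (2, 3, 4):
    print(t, k_min_for_window(skyband, t, k=2))
```

For t = 3 the records arriving at or after t have hash values 40 and 10, so the 2nd smallest is 40; the rule finds it as the (1+2)-th smallest stored hash value. The rule works because every stored record that arrived before t must have a hash value below k_min, otherwise k later records would dominate it and it would have been removed.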

15
Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' be the reported answer and m be the domain of elements; then
P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ).
Expected total space: O((1/ε^2) log(1/δ) log n)
Expected time complexity: O(log(1/δ) (log(1/ε) + log n))

16
Experiments
Synthetic datasets following Uniform and Zipf distributions
Real dataset: WorldCup 98 HTTP requests (20 M records)

17
Space Efficiency

19
Time Efficiency: maintenance cost

20
Time Efficiency: query response time

21
Accuracy

22
Thanks

23
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001.
Space usage: (1/ε^2) log(1/δ) m^(1/2)
Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE, 2004.
Space usage: O((N/ε^2) log(1/δ) log m)

24
Space Requirement (SE-FM)
To guarantee the performance we require the following:
k = O(log m + log 1/ε + log 1/δ)
r = O((1/ε^2) log 1/δ)
Let m > 1/ε and m > 1/δ; then k = O(log m).
Size of one sketch is k = O(log m).
Size of r sketches is O(r log m) = O((1/ε^2) log(1/δ) log m).
Total space: O((1/ε^2) log(1/δ) log m)

25
Time Complexity (SE-FM)
To guarantee the performance we require the following:
k = O(log m + log 1/ε + log 1/δ)
r = O((1/ε^2) log 1/δ)
The elements in a sketch are stored in a min-heap to support logarithmic search/update.
Hence, the cost of one search/update operation is O(log k) = O(log log m).
To maintain the sketches, we update r sketches for each record x.
Total maintenance cost for one record: O(r log log m) = O((1/ε^2) log(1/δ) log log m)
To answer a query, we search in r sketches.
Total query cost: O(r log log m) = O((1/ε^2) log(1/δ) log log m)

26
Space Usage (K-Skyband)
Performance guarantee: P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ).
Expected size of one k-skyband = O(k ln(n/k))
Expected size of r k-skybands = O(r k log(n/k)) = O((1/ε^2) log(1/δ) log n)

27
Time Complexity (K-Skyband)
Performance guarantee: P( |n' - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ).
Answering a query q(t):
  Search eT to compute z: O(log(k log n)) = O(log k + log n)
  Search eH to find the (z+k)-th element: O(log k + log n)
  This is required for all r sketches: O(r (log k + log n)) = O(log(1/δ) (log(1/ε) + log n))
