# Counting Distinct Objects over Sliding Windows


Presented by: Muhammad Aamir Cheema. Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin. University of New South Wales, Australia.

Introduction

- Counting distinct objects: given a dataset D, return the number of distinct objects in D.
- Counting distinct objects over sliding windows: given a data stream, return the number of distinct objects that arrive at or after timestamp t.
- Applications: traffic management, call centers, wireless communication, stock markets, etc.

Introduction

- Approximate counting: let n be the actual number of distinct objects and n' the reported answer. Build a sketch such that every query is answered with the guarantee $|n - n'|/n \le \epsilon$ with confidence $1 - \delta$.
- Contributions:
  - FM-based algorithm SE-FM: accuracy guarantee and space usage guarantee.
  - PCSA-based algorithm: no accuracy guarantee (although practical), but more efficient.
  - k-Skyband: accuracy guarantee and efficient, but no space usage guarantee.

FM Algorithm

- Let h(x) be a uniform hash function; each bit of h(x) is one with probability 1/2.
- Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x).
- Let FM be an array of size k initialized to zero.
- For each record x in the dataset, set FM[p(x)] = 1.
- Let $B = FM_{min}$ be the position of the leftmost 0-bit of FM.
- Number of distinct elements $= \alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.

Example (k = 4, positions counted from 0 at the left): for the stream r1, r2, r1, r3, r1 with h(r1) = 0010, h(r2) = 1101 and h(r3) = 1000, the sketch goes from 0000 to 1010, so $FM_{min} = 1$.

Reference: P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.
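The steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper names (`hash_bits`, `pivot`, `fm_min`, `build_fm`) and the use of SHA-256 as the "uniform" hash are assumptions made here, and the example takes h(r3) to have its leftmost 1-bit at position 0, as on the slide.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


def hash_bits(x, k, seed=0):
    """Hypothetical helper: k pseudo-random bits for x (k <= 64), as a string."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return format(value, "064b")[:k]


def pivot(bits):
    """Position of the leftmost 1-bit, or None if every bit is zero."""
    pos = bits.find("1")
    return pos if pos >= 0 else None


def fm_min(fm):
    """Position of the leftmost 0 entry, or len(fm) if every entry is set."""
    for i, bit in enumerate(fm):
        if bit == 0:
            return i
    return len(fm)


def build_fm(hashes, k):
    """Build a k-bit FM sketch from an iterable of bit strings."""
    fm = [0] * k
    for bits in hashes:
        p = pivot(bits)
        if p is not None:
            fm[p] = 1
    return fm


# Slide example: k = 4, stream r1, r2, r1, r3, r1 with fixed hash values.
slide_hash = {"r1": "0010", "r2": "1101", "r3": "1000"}
stream = ["r1", "r2", "r1", "r3", "r1"]
fm = build_fm((slide_hash[r] for r in stream), k=4)
b = fm_min(fm)
estimate = ALPHA * 2 ** b
print(fm, b, estimate)
```

With real data one would pass `hash_bits(x, k)` for each record instead of the fixed table; a single sketch gives a coarse estimate, which is why the later slides average several of them.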

FM Algorithm

Example: the stream r1, r2, r1, r3, r1 with h(r1) = 0010, h(r2) = 1101 and h(r3) = 1010 gives FM = 1010 and $FM_{min} = 1$.

- Each bit of h(x) is one with probability 1/2.
- An h(x) whose first i bits are zero and whose (i+1)-th bit is one occurs with probability $1/2^{i+1}$.
- Let n be the number of distinct elements. FM[0] is accessed approximately $n/2$ times, FM[1] approximately $n/4$ times, ..., FM[i] approximately $n/2^{i+1}$ times.
- If $i \gg \log_2 n$, FM[i] will almost certainly be zero.
- If $i \ll \log_2 n$, FM[i] will almost certainly be one.
- If $i \approx \log_2 n$, FM[i] may be zero or one.
- Hence, the first i for which FM[i] is zero can be used to approximate the number of distinct elements n.
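One way to make the "almost certainly zero / almost certainly one" argument precise, assuming the n distinct elements hash independently (an idealization that the slide does not state explicitly):

$$
\Pr[p(x) = i] = 2^{-(i+1)}, \qquad \Pr\big[FM[i] = 0\big] = \left(1 - 2^{-(i+1)}\right)^{n} \approx e^{-n/2^{i+1}} .
$$

For $i \ll \log_2 n$ the exponent $n/2^{i+1}$ is large, so the probability that FM[i] stays zero is close to 0; for $i \gg \log_2 n$ the exponent is close to 0, so that probability is close to 1.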

FM Algorithm

- Use r hash functions to create r FM sketches; initialize each sketch to zero.
- For each record x in the dataset and each hash function $h_i$, set $FM_i[\text{pivot}] = 1$, where the pivot is taken from $h_i(x)$.
- Let $B_i$ be the position of the leftmost 0-bit of $FM_i$, and let $B = (B_1 + B_2 + \dots + B_r)/r$.
- Number of distinct elements $= \alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.

Example: $FM_1 = 1010$ ($B_1 = 1$), $FM_2 = 1100$ ($B_2 = 2$), $FM_3 = 1101$ ($B_3 = 2$), so $B = (1 + 2 + 2)/3 \approx 1.67$.

Performance guarantee: let n be the actual number of distinct objects, n' the reported answer and m the size of the element domain. Then $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $n > 1/\epsilon$, $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2)\log(1/\delta))$.
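The averaging step in the slide's three-sketch example can be checked with a few lines of Python. The function names are illustrative choices made here; the sketches are the bit strings from the slide.

```python
from statistics import mean

ALPHA = 1.2897385  # constant quoted on the slide


def leftmost_zero(sketch):
    """Position of the leftmost '0' in a bit string, or its length if none."""
    pos = sketch.find("0")
    return pos if pos >= 0 else len(sketch)


def fm_estimate(sketches):
    """Average the B_i values of all sketches, then apply alpha * 2**B."""
    b = mean(leftmost_zero(s) for s in sketches)
    return ALPHA * 2 ** b


sketches = ["1010", "1100", "1101"]           # FM_1, FM_2, FM_3 from the slide
b_values = [leftmost_zero(s) for s in sketches]
estimate = fm_estimate(sketches)
print(b_values, round(mean(b_values), 2), estimate)
```

Averaging the exponents before exponentiating, rather than averaging the individual estimates, follows the slide's definition of B.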

FM-based Algorithm

Maintaining one FM sketch:

- For each record (x, t) in the dataset, set FM[pivot] = t.

Answering a query:

- For any t, let $B = FM_{min}(t)$ be the position of the leftmost entry of FM with value less than t.
- Number of distinct elements that arrived at or after t $= \alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.

Example (figure): stream r1, r2, r3, r2 with h(r1) = 0010 and h(r2) = 1101; the slide reports $FM_{min}(4) = 0$.

FM-based Algorithm

Maintaining r FM sketches:

- Initialize each FM sketch to zero.
- For each record (x, t) in the dataset and each hash function $h_i$, set $FM_i[\text{pivot}] = t$.

Answering a query:

- For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM sketch.
- Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$.
- Number of distinct elements that arrived at or after t $= \alpha \cdot 2^{B}$, where $\alpha = 1.2897385$.
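A minimal Python sketch of the timestamped variant follows. It assumes timestamps are positive integers (so 0 can mean "never set") and takes hash values as ready-made bit strings; the class and function names, and the small update sequence, are hypothetical and not taken from the slides.

```python
ALPHA = 1.2897385  # constant quoted on the slide


class WindowFM:
    """FM sketch whose entries hold the latest timestamp instead of a bit."""

    def __init__(self, k):
        # 0 means "never set"; timestamps are assumed to be >= 1.
        self.last_seen = [0] * k

    def update(self, bits, t):
        """Record that a hash value with the given bits arrived at time t."""
        p = bits.find("1")
        if p >= 0:
            self.last_seen[p] = t

    def b(self, t):
        """Position of the leftmost entry with value less than t."""
        for i, ts in enumerate(self.last_seen):
            if ts < t:
                return i
        return len(self.last_seen)


def window_estimate(sketches, t):
    """Estimate the number of distinct elements arriving at or after t."""
    b = sum(s.b(t) for s in sketches) / len(sketches)
    return ALPHA * 2 ** b


# Hypothetical update sequence: (hash bits, timestamp).
sk = WindowFM(k=4)
for bits, t in [("1000", 1), ("0100", 2), ("1010", 3), ("0010", 4)]:
    sk.update(bits, t)

print(sk.last_seen, sk.b(1), sk.b(3), sk.b(4))
```

Because each entry keeps only the most recent timestamp, one sketch answers queries for every window start t without being rebuilt.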

Performance Analysis

Let n be the actual number of distinct objects arriving not before time t, n' the reported answer and m the size of the element domain. Then $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $n > 1/\epsilon$, $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2)\log(1/\delta))$.

- Total space: $O((1/\epsilon^2)\log(1/\delta)\log m)$
- Total maintenance cost for one record: $O((1/\epsilon^2)\log(1/\delta)\log\log m)$
- Total query cost: $O((1/\epsilon^2)\log(1/\delta)\log\log m)$

PCSA-based Algorithm

- Maintain r FM sketches, but update only j < r of them per record.
- Generate j hash functions H(x) that map x to [1, r].
- Initialize each FM sketch to zero.
- For each record (x, t) in the dataset and each of the j hash functions H: let i = H(x) and update the i-th FM sketch.

Answering a query:

- For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM sketch.
- Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$.
- Number of distinct elements that arrived at or after t $= (\alpha \cdot 2^{B})/j$, where $\alpha = 1.2897385$.

Inspired by the PCSA technique in P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.

Note: no accuracy guarantee, but performs well in practice.

BJKST Algorithm

Main idea:

- Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
- For each record x, generate its hash value h(x).
- Maintain the k-th smallest distinct hash value, k_min.
- Number of distinct elements: $n = k\,m^3 / k\_min$.

Improved algorithm:

- Use r hash functions.
- Compute $n_i$ for each hash function $h_i()$ as above.
- Report the median of the $n_i$ values as the final answer.

Performance guarantee: $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log(1/\delta))$.

Reference: Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, 2002.
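The k-th-smallest-hash estimator and the median step might look like this in Python. It is a sketch under assumptions made here: SHA-256 with a seed stands in for the r hash functions, the hash range is passed in as `space` (the slide uses m^3), and when fewer than k distinct hash values exist the code simply returns their count, a fallback the slide does not discuss.

```python
import hashlib
import heapq
from statistics import median


def make_hash(seed, space):
    """Hypothetical seeded hash mapping any item to an integer in [1, space]."""
    def h(x):
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % space + 1
    return h


def kmin_estimate(items, k, h, space):
    """Estimate the distinct count as k * space / (k-th smallest distinct hash)."""
    smallest = heapq.nsmallest(k, {h(x) for x in items})
    if len(smallest) < k:
        # Fewer than k distinct hash values: report their number directly.
        return len(smallest)
    return k * space / smallest[-1]


def bjkst_estimate(items, k, r, space):
    """Median of r independent k-th-smallest-hash estimates."""
    return median(
        kmin_estimate(items, k, make_hash(seed, space), space)
        for seed in range(r)
    )


# 20,000 records drawn from 5,000 distinct objects.
data = [f"user{i % 5000}" for i in range(20000)]
m = len(data)
print(bjkst_estimate(data, k=256, r=5, space=m ** 3))
```

Taking the median rather than the mean makes the final answer robust to a single hash function that happens to produce an unusually small or large k-th value.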

K-Skyband Technique

Main idea:

- Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
- For each record (x, t'), generate h(x) and store the record (x, h(x), t').

Answering a query q(t):

- Retrieve all records (x, h(x), t') with timestamp t' ≥ t.
- Get the k-th smallest distinct hash value and apply the BJKST algorithm.

Limitation: requires storing all records.

K-Skyband Technique

- For any time t, we need to find the k-th smallest hash value among the records arriving no earlier than t.
- A record x dominates another record y if x arrives after y and has a smaller hash value.
- The k-skyband keeps only the records that are dominated by at most (k-1) records.

Maintaining the k-skyband:

- Keep a counter for each record.
- When a new element (x, t) arrives, increment the counter of every record dominated by it.
- Remove the records whose counter is at least k.
- To improve efficiency, counters are incremented for groups of records at once (domination aggregation search tree).

Figure: records a to e plotted by arrival time t and hash value h(x), with k = 2.
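The counter-based maintenance can be sketched in Python as below. This is a deliberately naive version: it scans all stored records on each arrival instead of using the domination aggregation search tree mentioned on the slide, it assumes strictly increasing timestamps, and its handling of a re-arriving hash value (keep only the newest copy) is an assumption made here rather than something the slide specifies. The class name and the example stream are hypothetical.

```python
class KSkyband:
    """Keeps records dominated by at most k-1 later records with smaller hash."""

    def __init__(self, k):
        self.k = k
        self.records = {}  # hash value -> [timestamp, dominated-by counter]

    def insert(self, hv, t):
        # A repeated hash value replaces its older copy (assumption).
        prev = self.records.pop(hv, None)
        prev_t = prev[0] if prev else float("-inf")
        for other_hv in list(self.records):
            rec = self.records[other_hv]
            # Newly dominated: larger hash, arrived before t, and not already
            # counted against the previous copy of this hash value.
            if other_hv > hv and prev_t < rec[0] < t:
                rec[1] += 1
                if rec[1] >= self.k:
                    del self.records[other_hv]
        self.records[hv] = [t, 0]


# Hypothetical stream of (hash value, timestamp) pairs with k = 2.
band = KSkyband(k=2)
for hv, t in [(50, 1), (30, 2), (40, 3), (10, 4)]:
    band.insert(hv, t)
after_four = sorted(band.records)

band.insert(20, 5)
after_five = sorted(band.records)
print(after_four, after_five)
```

In this run the record with hash 50 is dropped once two later records with smaller hashes (30 and 40) have arrived, since it can no longer be among the two smallest hash values of any window that contains it.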

K-Skyband Technique

Answering a query:

- Find k_min, the k-th smallest hash value among the elements arriving no earlier than t.
- Let z be the number of stored elements that arrived before t.
- Every stored element that arrived before t has a hash value smaller than k_min (otherwise it would be dominated by k records and removed), so k_min is the (z+k)-th smallest hash value overall.

Algorithm:

- Maintain a binary search tree eT that stores the elements ordered by t.
- Maintain a binary search tree eH that stores the elements ordered by h(x).
- When a query q(t) arrives, compute z using eT, then find the (z+k)-th smallest hash value overall from eH.

Figure: records a to f plotted by arrival time t and hash value h(x); with k = 2 and z = 3, k_min is the 5th smallest h(x).
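The (z+k)-th lookup can be illustrated in Python with sorted lists and `bisect` standing in for the two search trees eT and eH. The function names and the three-record skyband are hypothetical; a real implementation would keep both orderings up to date incrementally instead of sorting on each query.

```python
from bisect import bisect_left


def kth_smallest_in_window(records, k, t):
    """records: (hash value, timestamp) pairs forming a k-skyband.

    Returns the k-th smallest hash value among records with timestamp >= t,
    or None if fewer than k such records are stored.
    """
    by_time = sorted(ts for _, ts in records)   # stands in for eT
    by_hash = sorted(hv for hv, _ in records)   # stands in for eH
    z = bisect_left(by_time, t)                 # records that arrived before t
    if z + k > len(by_hash):
        return None
    return by_hash[z + k - 1]                   # (z+k)-th smallest overall


def estimate_distinct(records, k, t, space):
    """Apply the k * space / k_min estimate, where space is the hash range."""
    k_min = kth_smallest_in_window(records, k, t)
    return None if k_min is None else k * space / k_min


# Hypothetical 2-skyband: (hash value, timestamp).
skyband = [(30, 2), (40, 3), (10, 4)]
print(kth_smallest_in_window(skyband, k=2, t=3))
print(kth_smallest_in_window(skyband, k=2, t=2))
print(kth_smallest_in_window(skyband, k=2, t=4))
```

For t = 3 one record (hash 30) arrived earlier, so z = 1 and the third smallest hash overall, 40, is the second smallest within the window.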

Performance Analysis

Let n be the actual number of distinct objects arriving not before time t, n' the reported answer and m the size of the element domain. Then $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log(1/\delta))$.

- Expected total space: $O((1/\epsilon^2)\log(1/\delta)\log n)$
- Expected time complexity: $O(\log(1/\delta)\,(\log(1/\epsilon) + \log n))$

Experiments

- Synthetic datasets following Uniform and Zipf distributions.
- Real dataset: WorldCup 98 HTTP requests (20 M records).

Space Efficiency

Time Efficiency Maintenance cost

Time Efficiency Query response time

Accuracy

Thanks

- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001. Space usage: $(1/\epsilon^2)\log(1/\delta)\, m^{1/2}$
- Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE, 2004. Space usage: $O((N/\epsilon^2)\log(1/\delta)\log m)$

Space Requirement (SE-FM)

To guarantee the performance we require $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2)\log(1/\delta))$.

- Let $m > 1/\epsilon$ and $m > 1/\delta$; then $k = O(\log m)$.
- Size of one sketch: $k = O(\log m)$.
- Size of r sketches: $O(r \log m) = O((1/\epsilon^2)\log(1/\delta)\log m)$.
- Total space: $O((1/\epsilon^2)\log(1/\delta)\log m)$.

Time Complexity (SE-FM)

To guarantee the performance we require $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2)\log(1/\delta))$.

- The elements of a sketch are stored in a min-heap to support logarithmic search and update, so one search or update operation costs $O(\log k) = O(\log\log m)$.
- To maintain the sketches, we update r sketches for each record x. Total maintenance cost for one record: $O(r \log\log m) = O((1/\epsilon^2)\log(1/\delta)\log\log m)$.
- To answer a query, we search in r sketches. Total query cost: $O(r \log\log m) = O((1/\epsilon^2)\log(1/\delta)\log\log m)$.

Space Usage (K-Skyband)

Performance guarantee: $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log(1/\delta))$.

- Expected size of one k-skyband: $O(k \ln(n/k))$.
- Expected size of r k-skybands: $O(r k \log(n/k)) = O((1/\epsilon^2)\log(1/\delta)\log n)$.

Time Complexity (K-Skyband)

Performance guarantee: $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$ and $r = O(\log(1/\delta))$.

Answering a query q(t):

- Search eT to compute z: $O(\log(k \log n)) = O(\log k + \log n)$.
- Search eH to find the (z+k)-th element: $O(\log k + \log n)$.
- This is required for all r sketches: $O(r(\log k + \log n)) = O(\log(1/\delta)\,(\log(1/\epsilon) + \log n))$.
