Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of New South Wales, Australia

Similar presentations
Estimating Distinct Elements, Optimally

An Optimal Algorithm for the Distinct Elements Problem
Data Stream Algorithms Frequency Moments
Multi-Guarded Safe Zone: An Effective Technique to Monitor Moving Circular Range Queries Presented By: Muhammad Aamir Cheema 1 Joint work with Ljiljana.
Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Introduction to Algorithms Quicksort
Advisor: Prof. 陳良弼 Presenter: 鄧雅文. Introduction, Related Work, Problem Formulation, Future Work.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Fast Algorithms For Hierarchical Range Histogram Constructions
Ariel Rosenfeld Network Traffic Engineering. Call Record Analysis. Sensor Data Analysis. Medical, Financial Monitoring. Etc,
SLICE: Reviving Regions-Based Pruning for Reverse k Nearest Neighbors Queries Shiyu Yang 1, Muhammad Aamir Cheema 2,1, Xuemin.
CircularTrip: An Effective Algorithm for Continuous kNN Queries Muhammad Aamir Cheema Database Research Group, The School of Computer Science and Engineering,
Ming Hua, Jian Pei Simon Fraser UniversityPresented By: Mahashweta Das Wenjie Zhang, Xuemin LinUniversity of Texas at Arlington The University of New South.
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work:
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006
Stabbing the Sky: Efficient Skyline Computation over Sliding Windows COMP9314 Lecture Notes.
Quantile-Based KNN over Multi- Valued Objects Wenjie Zhang Xuemin Lin, Muhammad Aamir Cheema, Ying Zhang, Wei Wang The University of New South Wales, Australia.
Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)
Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Computer Science Spatio-Temporal Aggregation Using Sketches Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias Department of Computer.
What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.
Time-Decaying Sketches for Sensor Data Aggregation Graham Cormode AT&T Labs, Research Srikanta Tirthapura Dept. of Electrical and Computer Engineering.
A survey on stream data mining
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Statistic estimation over data stream Slides modified from Minos Garofalakis ( yahoo! research) and S. Muthukrishnan (Rutgers University)
A Unified Approach for Computing Top-k Pairs in Multidimensional Space Presented By: Muhammad Aamir Cheema 1 Joint work with Xuemin Lin 1, Haixun Wang.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.
Detecting Distance-Based Outliers in Streams of Data Fabrizio Angiulli and Fabio Fassetti DEIS, Universit `a della Calabria CIKM 07.
Computer Science and Engineering Loyalty-based Selection: Retrieving Objects That Persistently Satisfy Criteria Presented By: Zhitao Shen Joint work with.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
By Graham Cormode and Marios Hadjieleftheriou Presented by Ankur Agrawal ( )
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
TinyLFU: A Highly Efficient Cache Admission Policy
Top-k Similarity Join over Multi- valued Objects Wenjie Zhang Jing Xu, Xin Liang, Ying Zhang, Xuemin Lin The University of New South Wales, Australia.
Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
Influence Zone: Efficiently Processing Reverse k Nearest Neighbors Queries Presented By: Muhammad Aamir Cheema Joint work with Xuemin Lin, Wenjie Zhang,
Sampling in Space Restricted Settings Anup Bhattacharya IIT Delhi Joint work with Davis Issac (MPI), Ragesh Jaiswal (IITD) and Amit Kumar (IITD)
Information Technology Selecting Representative Objects Considering Coverage and Diversity Shenlu Wang 1, Muhammad Aamir Cheema 2, Ying Zhang 3, Xuemin.
ISOM MIS 215 Module 5 – Binary Trees. ISOM Where are we? 2 Intro to Java, Course Java lang. basics Arrays Introduction NewbieProgrammersDevelopersProfessionalsDesigners.
Calculating frequency moments of Data Stream
Efficient OLAP Operations in Spatial Data Warehouses Dimitris Papadias, Panos Kalnis, Jun Zhang and Yufei Tao Department of Computer Science Hong Kong.
Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong.
Rainer Gemulla, Wolfgang Lehner and Peter J. Haas VLDB 2006 A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets 2008/8/27 1.
Data Structures and Algorithms Instructor: Tesfaye Guta [M.Sc.] Haramaya University.
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
Computer Science and Engineering Jianye Yang 1, Ying Zhang 2, Wenjie Zhang 1, Xuemin Lin 1 Influence based Cost Optimization on User Preference 1 The University.
ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.
A Unified Algorithm for Continuous Monitoring of Spatial Queries
A Unified Framework for Efficiently Processing Ranking Related Queries
A paper on Join Synopses for Approximate Query Answering
Stochastic Skyline Operator
Finding Frequent Items in Data Streams
Sublinear Algorithmic Tools 2
Counting How Many Elements Computing “Moments”
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results
Range-Efficient Counting of Distinct Elements
Probabilistic n-of-N Skyline Computation over Uncertain Data Streams
CSCI B609: “Foundations of Data Science”
Range-Efficient Computation of F0 over Massive Data Streams
Presented by: Mahady Hasan Joint work with
Heavy Hitters in Streams and Sliding Windows
By: Ran Ben Basat, Technion, Israel
Efficient Processing of Top-k Spatial Preference Queries
Maintaining Stream Statistics over Sliding Windows
Presentation transcript:

Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of New South Wales, Australia

Introduction
- Counting distinct objects: given a dataset D, return the number of distinct objects in D.
- Counting distinct objects over sliding windows: given a data stream and a timestamp t, return the number of distinct objects that arrive at or after t.
- Applications: traffic management, call centers, wireless communication, stock markets, etc.

Introduction
Approximate counting: let n be the actual number of distinct objects and n' the reported answer. Build a sketch such that every query is answered with the guarantee $|n - n'|/n \le \epsilon$ with confidence $1 - \delta$.
Contributions:
- FM-based algorithms
  - SE-FM: accuracy guarantee and space usage guarantee
  - PCSA-based algorithm: no accuracy guarantee (although it works well in practice), but more efficient
- k-Skyband: accuracy guarantee and efficient, but no space usage guarantee

FM Algorithm
FM sketch:
- Let h(x) be a uniform hash function.
- Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x).
- Let FM be an array of size k, initialized to zero.
- For each record x in the dataset, set FM[p(x)] = 1.
- Let B = FM_min be the position of the leftmost 0-bit of FM.
- Number of distinct elements = $\alpha \cdot 2^{B}$, where $\alpha$ is a constant correction factor.
Each bit i of h(x) is one with probability 1/2.
Example (figure, k = 4): stream r1, r2, r1, r3, r1 with h(r1) = 0010 and h(r2) = 1101; FM_min = 1.
Reference: P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985.

FM Algorithm
Example (figure): stream r1, r2, r1, r3, r1 with h(r1) = 0010, h(r2) = 1101 and h(r3) = 1010 gives FM = 1010 and FM_min = 1.
- Each bit i of h(x) is one with probability 1/2.
- An h(x) whose first i bits are zero and whose (i+1)-th bit is one has probability $1/2^{i+1}$.
- Let n be the number of distinct elements. FM[0] is accessed approximately $n/2$ times, FM[1] approximately $n/4$ times, ..., and FM[i] approximately $n/2^{i+1}$ times.
- If $i \gg \log_2 n$, FM[i] will almost certainly be zero.
- If $i \ll \log_2 n$, FM[i] will almost certainly be one.
- If $i \approx \log_2 n$, FM[i] may be zero or one.
Hence, the first i for which FM[i] is zero can be used to approximate the number of distinct elements n.
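The single-sketch construction can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: SHA-256 stands in for the uniform hash, the helper names (`hash_bits`, `pivot`, `fm_from_hashes`, `fm_min`, `fm_estimate`) are hypothetical, and `ALPHA` uses the constant 1/0.77351 from the Flajolet and Martin paper because the slide's value for α did not survive in the transcript.

```python
import hashlib

# Assumption: correction constant taken from Flajolet and Martin (1985);
# the value shown on the slide is not available.
ALPHA = 1 / 0.77351


def hash_bits(x, k, seed=0):
    """Map x to a k-bit integer; SHA-256 stands in for the uniform hash h(x)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - k)


def pivot(value, k):
    """Position of the leftmost 1-bit of a k-bit value (0 is the leftmost bit)."""
    if value == 0:
        return k - 1  # all-zero hash: clamp to the last position (a choice made here)
    return k - value.bit_length()


def fm_from_hashes(hash_values, k):
    """Build the FM bit array from already-computed hash values."""
    fm = [0] * k
    for value in hash_values:
        fm[pivot(value, k)] = 1
    return fm


def fm_min(fm):
    """Position of the leftmost 0-bit (len(fm) if every bit is set)."""
    for position, bit in enumerate(fm):
        if bit == 0:
            return position
    return len(fm)


def fm_estimate(fm):
    """Estimated number of distinct elements: alpha * 2^B."""
    return ALPHA * 2 ** fm_min(fm)


# The example from the slide: stream r1, r2, r1, r3, r1 with k = 4.
slide_hashes = {"r1": 0b0010, "r2": 0b1101, "r3": 0b1010}
stream = ["r1", "r2", "r1", "r3", "r1"]
example_fm = fm_from_hashes([slide_hashes[r] for r in stream], k=4)
```

With the slide's hash values, `example_fm` is the bit array 1010 and `fm_min(example_fm)` is 1, matching the figure. For real data, `hash_bits(x, k)` would supply the hash values.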

FM Algorithm
Use r hash functions to create r FM sketches:
1. Initialize each sketch FM_i to zero.
2. For each record x in the dataset and each hash function h_i, set FM_i[p_i(x)] = 1, where p_i(x) is the pivot of h_i(x).
3. Let B_i be the position of the leftmost 0-bit of FM_i.
4. Let $B = (B_1 + B_2 + \dots + B_r)/r$.
5. Number of distinct elements = $\alpha \cdot 2^{B}$, where $\alpha$ is a constant correction factor.
Example (figure with sketches FM_1, FM_2, FM_3): $B_1 = 1$, $B_2 = 2$, $B_3 = 2$, so $B = (1 + 2 + 2)/3 \approx 1.67$.
Performance guarantee: let n be the actual number of distinct objects, n' the reported answer, and m the domain size of the elements. Then $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $n > 1/\epsilon$, $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$, and $r = O((1/\epsilon^2)\log(1/\delta))$.
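A rough Python sketch of the r-sketch variant follows. It assumes SHA-256 seeded with the sketch index as a stand-in for r independent hash functions, a sketch length of 32 bits, and the constant 1/0.77351 from the Flajolet and Martin paper for α; all function names are hypothetical.

```python
import hashlib

ALPHA = 1 / 0.77351  # assumption: constant from Flajolet and Martin (1985)
K = 32               # bits per sketch (illustrative choice)


def hash_bits(x, seed):
    """32-bit hash of x; the seed selects one of the r hash functions."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def pivot(value):
    """Position of the leftmost 1-bit of a K-bit value (0 is the leftmost bit)."""
    return K - 1 if value == 0 else K - value.bit_length()


def build_sketches(records, r):
    """Create r FM sketches, one per hash function."""
    sketches = [[0] * K for _ in range(r)]
    for x in records:
        for i in range(r):
            sketches[i][pivot(hash_bits(x, seed=i))] = 1
    return sketches


def leftmost_zero(fm):
    """B_i: position of the leftmost 0-bit of one sketch."""
    return fm.index(0) if 0 in fm else len(fm)


def estimate(sketches):
    """alpha * 2^B, where B is the mean of the B_i values."""
    b = sum(leftmost_zero(fm) for fm in sketches) / len(sketches)
    return ALPHA * 2 ** b
```

Because each sketch only records which positions were hit, feeding the same records several times leaves the sketches, and therefore the estimate, unchanged.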

FM-based Algorithm
Maintaining one FM sketch:
- For each record (x, t) in the dataset, set FM[p(x)] = t, where p(x) is the pivot of h(x).
Answering a query:
- For any t, let B = FM_min(t) be the position of the leftmost entry of FM with value less than t.
- Number of distinct elements that arrived at or after t = $\alpha \cdot 2^{B}$, where $\alpha$ is a constant correction factor.
Example (figure): stream r1, r2, r3, r2 with h(r1) = 0010 and h(r2) = 1101; the figure illustrates FM_min(4).
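The timestamped sketch can be tried out on the slide's stream with a few lines of Python. Two things here are assumptions rather than facts from the slide: the records are given timestamps 1 to 4 in arrival order, and h(r3) = 1010 is borrowed from the earlier FM example because its value is missing on this slide. Entries equal to 0 mean "never set", so timestamps are assumed to start at 1.

```python
def pivot(value, k):
    """Position of the leftmost 1-bit of a k-bit value (0 is the leftmost bit)."""
    return k - 1 if value == 0 else k - value.bit_length()


def update(fm, hash_value, t):
    """Record that an element with this hash value arrived at time t."""
    fm[pivot(hash_value, len(fm))] = t


def fm_min_at(fm, t):
    """Position of the leftmost entry whose stored timestamp is less than t."""
    for position, last_seen in enumerate(fm):
        if last_seen < t:
            return position
    return len(fm)


# Assumed hash values; h(r3) is taken from the earlier example slide.
hashes = {"r1": 0b0010, "r2": 0b1101, "r3": 0b1010}
fm = [0, 0, 0, 0]  # k = 4, all entries initially "never set"
for t, record in enumerate(["r1", "r2", "r3", "r2"], start=1):
    update(fm, hashes[record], t)
```

Under these assumptions the sketch ends as [4, 0, 1, 0], and `fm_min_at(fm, 4)` returns position 1. Storing the latest timestamp instead of a bit is what lets one sketch answer queries for any t.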

FM-based Algorithm
Maintain r FM sketches:
1. Initialize each sketch FM_i to zero.
2. For each record (x, t) in the dataset and each hash function h_i, set FM_i[p_i(x)] = t, where p_i(x) is the pivot of h_i(x).
Answering a query:
1. For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM sketch.
2. Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$.
3. Number of distinct elements that arrived at or after t = $\alpha \cdot 2^{B}$, where $\alpha$ is a constant correction factor.
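A minimal Python sketch of the r-sketch sliding-window structure is given below. The class name `SlidingFM`, the use of seeded SHA-256 as the hash family, the 32-bit sketch length, and the constant 1/0.77351 for α are all assumptions made for illustration; timestamps are assumed to be positive and non-decreasing. It scans each sketch linearly rather than using the heap mentioned in the complexity analysis.

```python
import hashlib

ALPHA = 1 / 0.77351  # assumption: constant from Flajolet and Martin (1985)


class SlidingFM:
    """r timestamped FM sketches; entry 0 means the position was never set."""

    def __init__(self, r, k=32):
        self.r = r
        self.k = k
        self.sketches = [[0] * k for _ in range(r)]

    def _pivot(self, x, i):
        """Pivot of x under the i-th hash function (seeded SHA-256 here)."""
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        value = int.from_bytes(digest, "big") >> (256 - self.k)
        return self.k - 1 if value == 0 else self.k - value.bit_length()

    def add(self, x, t):
        """Process record (x, t): store t at the pivot position of every sketch."""
        for i, fm in enumerate(self.sketches):
            fm[self._pivot(x, i)] = t

    def estimate(self, t):
        """Estimate the number of distinct elements that arrived at or after t."""
        total = 0
        for fm in self.sketches:
            # B_i(t): leftmost entry with a stored timestamp smaller than t.
            total += next((p for p, seen in enumerate(fm) if seen < t), self.k)
        return ALPHA * 2 ** (total / self.r)
```

A typical use is `s = SlidingFM(r=64)`, then `s.add(x, t)` per record and `s.estimate(t0)` per query. Since a larger t can only move each B_i(t) to the left, the estimate never increases as the window shrinks.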

Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' the reported answer, and m the domain size of the elements. Then $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $n > 1/\epsilon$, $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$, and $r = O((1/\epsilon^2)\log(1/\delta))$.
- Total space: $O((1/\epsilon^2)\log(1/\delta)\log m)$
- Total maintenance cost for one record: $O((1/\epsilon^2)\log(1/\delta)\log\log m)$
- Total query cost: $O((1/\epsilon^2)\log(1/\delta)\log\log m)$

PCSA-based Algorithm
Maintain r FM sketches but update only j < r of them per record:
1. Generate j hash functions H(x) that map x to [1, r].
2. Initialize each FM sketch to zero.
3. For each record (x, t) in the dataset and each of the j hash functions H, let i = H(x) and update the i-th FM sketch.
Answering a query:
1. For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM sketch.
2. Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$.
3. Number of distinct elements that arrived at or after t = $(\alpha \cdot 2^{B})/j$, where $\alpha$ is a constant correction factor.
Inspired by the PCSA technique in P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985.
Note: no accuracy guarantee, but performs well in practice.

BJKST Algorithm
Main idea:
- Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
- For each record x, generate its hash value h(x).
- Maintain the k-th smallest distinct hash value, k_min.
- Number of distinct elements: $n = k m^3 / k_{min}$.
Improved algorithm:
- Use r hash functions.
- Compute $n_i$ for each hash function $h_i()$ as above.
- Report the median of the $n_i$ values as the final answer.
Performance guarantee: $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$, and $r = O(\log(1/\delta))$.
Reference: Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM'02.
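The k-th-smallest-hash estimator and its median-of-r refinement might be sketched in Python as follows. This is an illustration, not the authors' code: seeded SHA-256 reduced modulo the range stands in for the hash functions, the function names are hypothetical, and returning the exact count when fewer than k distinct hash values exist is a choice made here.

```python
import hashlib
import heapq
import statistics


def hash_to_range(x, hash_range, seed):
    """Hash x to an integer in [1, hash_range] (SHA-256 as a stand-in)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") % hash_range + 1


def kmv_estimate(records, k, hash_range, seed=0):
    """Estimate n as k * hash_range / k_min using one hash function."""
    heap = []     # negated values: a max-heap of the k smallest distinct hashes
    kept = set()  # hash values currently in the heap
    for x in records:
        v = hash_to_range(x, hash_range, seed)
        if v in kept:
            continue  # duplicate hash value, already accounted for
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            # v beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count them
    k_min = -heap[0]
    return k * hash_range / k_min


def bjkst_estimate(records, k, r, hash_range):
    """Median of the estimates obtained from r hash functions."""
    estimates = [kmv_estimate(records, k, hash_range, seed=i) for i in range(r)]
    return statistics.median(estimates)
```

For a dataset of size m, the slide's hash range corresponds to `hash_range = m ** 3`. Only the k smallest hash values are kept per hash function, so the memory used is independent of the stream length.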

K-Skyband Technique
Main idea:
- Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
- For each record (x, t'), generate h(x) and store the record (x, h(x), t').
Answering a query q(t):
- Retrieve all records (x, h(x), t') whose timestamp satisfies t' ≥ t.
- Get the k-th smallest distinct hash value and apply the BJKST algorithm.
Limitation: requires storing all records.
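The store-everything baseline is easy to write down and is useful as a reference answer. The Python sketch below assumes records are kept as `(x, hash_value, arrival_time)` tuples; the hash values in the example are hand-picked for illustration, and the function names are hypothetical.

```python
def naive_k_min(stored, t, k):
    """k-th smallest distinct hash value among records arriving at or after t."""
    window_hashes = {hv for _, hv, arrival in stored if arrival >= t}
    if len(window_hashes) < k:
        return None  # fewer than k distinct hash values in the window
    return sorted(window_hashes)[k - 1]


def naive_estimate(stored, t, k, hash_range):
    """BJKST-style estimate for the window starting at t."""
    k_min = naive_k_min(stored, t, k)
    if k_min is None:
        # Small window: count the distinct hash values directly.
        return float(len({hv for _, hv, arrival in stored if arrival >= t}))
    return k * hash_range / k_min


# Hand-picked example: object b arrives twice (times 2 and 5).
stored = [("a", 5, 1), ("b", 3, 2), ("c", 4, 3), ("d", 1, 4), ("b", 3, 5)]
```

Every query scans all stored records, which is the limitation the k-skyband is meant to remove.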

K-Skyband Technique
- For any time t, we need to find the k-th smallest hash value among records arriving no earlier than t.
- A record x dominates another record y if x arrives after y and has a smaller hash value.
- The k-skyband keeps only the records that are dominated by at most (k-1) records.
Maintaining the k-skyband:
- Keep a counter for each record.
- When a new element (x, t) arrives, increment the counter of every record dominated by it.
- Remove the records whose counter is at least k.
- To improve efficiency, counters are incremented for groups of records at once (domination aggregation search tree).
Example (figure): records a to e plotted by hash value h(x) against arrival time t, with k = 2.
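The counter-based maintenance can be sketched in Python with a plain dictionary and a linear scan per arrival, in place of the aggregated tree mentioned on the slide. The function name `skyband_insert` and the record layout are hypothetical. The handling of a repeated hash value (replace the older copy and increment only the records that the older copy did not already dominate) is an assumption made here so that counters count distinct dominators.

```python
def skyband_insert(skyband, hash_value, t, k):
    """Insert a record with the given hash value arriving at time t.

    skyband maps hash_value -> [arrival_time, dominator_count] and is
    modified in place. Arrival times are assumed to be strictly increasing.
    """
    # Assumption: a re-arrival of the same hash value replaces the old copy.
    old = skyband.pop(hash_value, None)
    old_time = old[0] if old is not None else float("-inf")

    for v in list(skyband):
        record = skyband[v]
        # The new record dominates every stored record with a larger hash
        # value. Records that arrived before the old copy were already
        # counted as dominated by it, so they are skipped.
        if v > hash_value and record[0] > old_time:
            record[1] += 1
            if record[1] >= k:
                del skyband[v]  # dominated by k records: can never be needed

    skyband[hash_value] = [t, 0]  # nothing has arrived after the new record


# Small example with k = 2 and hand-picked hash values.
example = {}
for t, hv in enumerate([5, 3, 4, 1], start=1):
    skyband_insert(example, hv, t, k=2)
```

In the example, the record with hash value 5 is dropped once the records with hash values 3 and 4 have both arrived after it, leaving three records in the skyband.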

K-Skyband Technique
Answering a query:
- Find k_min, the k-th smallest hash value among elements arriving no earlier than t.
- Let z be the number of elements in the k-skyband that arrived before t.
- k_min is the (z+k)-th smallest hash value overall.
Algorithm:
- Maintain a binary search tree eT that stores elements according to t.
- Maintain a binary search tree eH that stores elements according to h(x).
- When a query q(t) arrives, compute z using eT, then find the (z+k)-th smallest hash value overall from eH.
Example (figure): records a to f plotted by h(x) against t; with k = 2 and z = 3, k_min is the 5th smallest h(x).
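The rank argument can be checked with a small Python sketch in which sorted lists play the role of the two search trees eT and eH. The input is assumed to be the current contents of a k-skyband as `(arrival_time, hash_value)` pairs; the function names and the example values are hypothetical.

```python
import bisect


def k_min_at(skyband_records, t, k):
    """(z+k)-th smallest hash value overall, or None if the window holds
    fewer than k skyband records."""
    times = sorted(arrival for arrival, _ in skyband_records)  # role of eT
    hashes = sorted(hv for _, hv in skyband_records)           # role of eH
    z = bisect.bisect_left(times, t)  # records that arrived before t
    rank = z + k
    if rank > len(hashes):
        return None
    return hashes[rank - 1]


def skyband_estimate(skyband_records, t, k, hash_range):
    """BJKST-style estimate for the window starting at t."""
    k_min = k_min_at(skyband_records, t, k)
    if k_min is None:
        # Fewer than k records in the window: all are kept, so count them.
        return float(sum(1 for arrival, _ in skyband_records if arrival >= t))
    return k * hash_range / k_min


# A valid 2-skyband: the stream had hash values 5, 3, 4, 1 at times 1..4,
# and the record with hash value 5 has already been pruned.
records = [(2, 3), (3, 4), (4, 1)]
```

For the query at t = 3 with k = 2, one record arrived before t, so the answer is the 3rd smallest hash value overall, which is 4. The rank shortcut works because every skyband record older than t has a hash value smaller than k_min; otherwise k window records would dominate it.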

Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n' the reported answer, and m the domain size of the elements. Then $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$, and $r = O(\log(1/\delta))$.
- Expected total space: $O((1/\epsilon^2)\log(1/\delta)\log n)$
- Expected time complexity: $O(\log(1/\delta)(\log(1/\epsilon) + \log n))$

Experiments
- Synthetic datasets following uniform and Zipf distributions
- Real dataset: WorldCup 98 HTTP requests (20M records)

Space Efficiency

Time Efficiency Maintenance cost

Time Efficiency Query response time

Accuracy

Thanks

- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, Space usage: $(1/\epsilon^2)\log(1/\delta)\, m^{1/2}$
- Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE Space usage: $O((N/\epsilon^2)\log(1/\delta)\log m)$

Space Requirement (SE-FM)
To guarantee the performance we require $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2)\log(1/\delta))$.
- Let $m > 1/\epsilon$ and $m > 1/\delta$; then $k = O(\log m)$.
- Size of one sketch: $k = O(\log m)$.
- Size of r sketches: $O(r \log m) = O((1/\epsilon^2)\log(1/\delta)\log m)$.
Total space: $O((1/\epsilon^2)\log(1/\delta)\log m)$

Time Complexity (SE-FM)
To guarantee the performance we require $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2)\log(1/\delta))$.
- The elements in a sketch are stored in a min-heap to support logarithmic search and update. Hence, the cost of one search or update operation is $O(\log k) = O(\log\log m)$.
- To maintain the sketches, we update r sketches for each record x. Total maintenance cost for one record: $O(r \log\log m) = O((1/\epsilon^2)\log(1/\delta)\log\log m)$.
- To answer a query, we search in r sketches. Total query cost: $O(r \log\log m) = O((1/\epsilon^2)\log(1/\delta)\log\log m)$.

Space Usage (K-Skyband)
Performance guarantee: $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$, and $r = O(\log(1/\delta))$.
- Expected size of one k-skyband: $O(k \ln(n/k))$
- Expected size of r k-skybands: $O(rk\log(n/k)) = O((1/\epsilon^2)\log(1/\delta)\log n)$

Time Complexity (K-Skyband)
Performance guarantee: $P(|n' - n|/n \le \epsilon) \ge 1 - \delta$ if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^2)$, and $r = O(\log(1/\delta))$.
Answering a query q(t):
- Search eT to compute z: $O(\log(k \log n)) = O(\log k + \log n)$
- Search eH to find the (z+k)-th element: $O(\log k + \log n)$
- This is required for all r sketches: $O(r(\log k + \log n)) = O(\log(1/\delta)(\log(1/\epsilon) + \log n))$