Sketch-Based Multi-Query Processing over Data Streams Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work with Alin Dobra (U Florida), Johannes Gehrke (Cornell), and Rajeev Rastogi (Bell Labs)

2 Talk Outline
- Introduction & Basic Stream Computation Model
- Basic Sketching for the Single-Query Case
- The Multiple-Query Case
  - Sketch Sharing
  - Correctness
  - Optimality
- Experimental Study
- Conclusions

3 Data-Stream Management
- Traditional DBMS: data stored in finite, persistent data sets
- Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy, ...
- Data-stream management: variety of modern applications
  - Network monitoring and traffic engineering
  - Telecom call-detail records
  - Network security
  - Financial applications
  - Sensor networks
  - Manufacturing processes
  - Web logs and clickstreams
  - Massive data sets

4 Data-Stream Processing Model
- Approximate answers often suffice, e.g., trend analysis, anomaly detection
- Requirements for stream synopses
  - Single Pass: each record is examined at most once, in (fixed) arrival order
  - Small Space: log or polylog in data-stream size
  - Real-time: per-record processing time (to maintain synopses) must be low
  - Delete-Proof: can handle record deletions as well as insertions
[Figure: continuous data streams R1, ..., Rk (gigabytes) feed a stream processing engine that maintains in-memory stream synopses (kilobytes); a query Q receives an approximate answer with error guarantees, e.g., "within 2% of exact answer with high probability"]

5 Synopses for Relational Streams
- Conventional data summaries fall short
  - Quantiles and 1-d histograms [MRL98,99], [GK01], [GKMS02]
    - Cannot capture attribute correlations
    - Little support for approximation guarantees
  - Samples (e.g., using Reservoir Sampling)
    - Perform poorly for joins [AGMS99] or distinct values [CCMN00]
    - Cannot handle deletion of records
  - Multi-d histograms/wavelets
    - Construction requires multiple passes over the data
- Different approach: pseudo-random sketch synopses
  - Only logarithmic space
  - Probabilistic guarantees on the quality of the approximate answer
  - Support insertion as well as deletion of records

6 Linear-Projection (aka AMS) Sketch Synopses
- Goal: Build small-space summary for distribution vector f(i) (i = 1, ..., M) seen as a stream of i-values
  - Example: the stream prefix 3, 1, 2, 4, 2, 3, 5 gives f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
- Basic Construct: Randomized linear projection of f(), i.e., the inner (dot) product $\langle f, \xi \rangle = \sum_i f(i)\,\xi_i$, where $\xi$ = vector of random values from an appropriate distribution
  - Simple to compute over the stream: add $\xi_i$ whenever the i-th value is seen
  - Generate the $\xi_i$'s in small (log M) space using pseudo-random generators
  - Tunable probabilistic guarantees on approximation error
  - Delete-Proof: just subtract $\xi_i$ to delete an occurrence of the i-th value
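The update rule on this slide can be sketched in a few lines of Python. This is only an illustration: the class name `LinearSketch` is hypothetical, and the random signs are stored in an explicit list for readability, whereas the slide assumes they are regenerated on demand from a small seed.

```python
import random


class LinearSketch:
    """One atomic sketch X = sum_i f(i) * xi[i] over the domain {1, ..., M}."""

    def __init__(self, domain_size, seed=0):
        rng = random.Random(seed)
        # Assumption: xi is materialized as a list (index 0 unused).
        # A real synopsis would derive xi[i] from an O(log M) seed instead.
        self.xi = [rng.choice((-1, 1)) for _ in range(domain_size + 1)]
        self.x = 0

    def insert(self, i):
        # Seeing value i adds xi[i] to the projection.
        self.x += self.xi[i]

    def delete(self, i):
        # Deleting an occurrence of value i subtracts xi[i] again.
        self.x -= self.xi[i]


sketch = LinearSketch(domain_size=5, seed=42)
for value in [3, 1, 2, 4, 2, 3, 5]:
    sketch.insert(value)

xi = sketch.xi
# Frequencies of the stream above: f = (1, 2, 2, 1, 1).
expected = xi[1] + 2 * xi[2] + 2 * xi[3] + xi[4] + xi[5]
```

Because the sketch is linear in f, the order of insertions and deletions does not matter; only the net frequency of each value does.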

7 Binary-Join COUNT Query
- Problem: Compute answer for the query COUNT($R \bowtie_A S$) $= \sum_i f_R(i)\, f_S(i)$
- Example: [Figure: sample data streams R.A and S.A for which COUNT($R \bowtie_A S$) = 10]
- Exact solution: too expensive, requires O(M) space!
  - M = sizeof(domain(A))

8 Basic AMS Sketching Technique [AMS96]
- Key Intuition: Use randomized linear projections of f() to define random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = COUNT($R \bowtie_A S$)
  - Var[X] is small, which gives probabilistic error guarantees (e.g., actual answer is 10 ± 1 with probability 0.9)
- Basic Idea:
  - Define a family of 4-wise independent {-1, +1} random variables $\{\xi_i : i = 1, \ldots, M\}$
  - $\Pr[\xi_i = +1] = \Pr[\xi_i = -1] = 1/2$
    - Expected value of each $\xi_i$: $E[\xi_i] = 0$
  - Variables $\xi_i$ are 4-wise independent
    - Expected value of the product of 4 distinct $\xi_i$ = 0
  - Variables $\xi_i$ can be generated using a pseudo-random generator using only O(log M) space (for seeding)!
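The slide does not say which pseudo-random generator is used. One common construction, assumed here purely for illustration, evaluates a random cubic polynomial over a prime field and keeps one bit of the result; the four coefficients are the O(log M)-size seed. The names `P` and `make_xi` are hypothetical, and mapping the low bit to ±1 introduces a bias of roughly 1/P.

```python
import random

# Assumption: a Mersenne prime larger than the domain size M.
P = 2**31 - 1


def make_xi(seed):
    """Return a function i -> +1/-1 defined by four seed coefficients.

    A uniformly random degree-3 polynomial over GF(P) yields 4-wise
    independent hash values; taking the low bit maps them to signs.
    """
    rng = random.Random(seed)
    a, b, c, d = (rng.randrange(P) for _ in range(4))

    def xi(i):
        # Horner evaluation of a*i^3 + b*i^2 + c*i + d modulo P.
        h = (((a * i + b) * i + c) * i + d) % P
        return 1 if h & 1 else -1

    return xi


xi = make_xi(seed=7)
signs = [xi(i) for i in range(1, 11)]
```

Only the four coefficients need to be stored, so the same sign for value i can be recomputed whenever i shows up again in the stream.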

9 AMS Sketch Construction
- Compute random variables: $X_R = \sum_i f_R(i)\,\xi_i$ and $X_S = \sum_i f_S(i)\,\xi_i$
  - Simply add $\xi_i$ to $X_R$ ($X_S$) whenever the i-th value is observed in R.A (S.A)
- Define $X = X_R X_S$ to be the estimate of the COUNT query
- E[X] = COUNT($R \bowtie_A S$), $\mathrm{Var}[X] \le 2 \cdot SJ(R) \cdot SJ(S)$
  - $SJ(R) = \sum_i f_R(i)^2$ is the self-join size of R
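The unbiasedness of X = X_R · X_S can be checked exactly on a toy domain by enumerating every sign vector. The frequency vectors below are hypothetical, picked so that the exact join size is 10; fully independent signs are used, which is a special case of 4-wise independence. The check also compares the empirical variance against the standard AMS bound 2·SJ(R)·SJ(S).

```python
from itertools import product

# Hypothetical frequency vectors over a domain of size 4.
f_R = [2, 1, 0, 3]
f_S = [1, 2, 1, 2]

# Exact join size: sum_i f_R(i) * f_S(i).
exact = sum(r * s for r, s in zip(f_R, f_S))


def estimate(xi):
    """AMS estimate X = X_R * X_S for one fixed sign vector xi."""
    x_r = sum(f * s for f, s in zip(f_R, xi))
    x_s = sum(f * s for f, s in zip(f_S, xi))
    return x_r * x_s


# Enumerate all 2^4 sign vectors, each equally likely.
estimates = [estimate(xi) for xi in product((-1, 1), repeat=4)]
mean = sum(estimates) / len(estimates)
variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)

# Self-join sizes SJ(R) and SJ(S).
sj_r = sum(f * f for f in f_R)
sj_s = sum(f * f for f in f_S)
```

Individual estimates can be far from the true value; it is only their expectation that equals the join size, which is why the averaging step matters.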

10 Summary of Binary-Join AMS Sketching
- Step 1: Compute random variables $X_R = \sum_i f_R(i)\,\xi_i$ and $X_S = \sum_i f_S(i)\,\xi_i$
- Step 2: Define $X = X_R X_S$
- Steps 3 & 4: Average independent copies of X; return median of averages
- Main Theorem (AGMS99): Sketching approximates COUNT to within a relative error of $\epsilon$ with probability $\ge 1 - \delta$ using space [bound not preserved in the transcript]
  - Remember: O(log M) space for "seeding" the construction of each X
[Figure: independent copies x of X are averaged in groups to give values y; the median of $2\log(1/\delta)$ such averages is returned]
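Steps 3 and 4 are the usual median-of-averages boosting: averaging shrinks the variance, and the median of several averages suppresses the occasional bad group. A minimal sketch, with the hypothetical helper name `median_of_averages` and made-up sample values:

```python
import statistics


def median_of_averages(samples, group_size):
    """Average consecutive groups of `group_size` samples, then take the median.

    Trailing samples that do not fill a whole group are ignored.
    """
    usable = len(samples) - len(samples) % group_size
    averages = [
        statistics.mean(samples[k:k + group_size])
        for k in range(0, usable, group_size)
    ]
    return statistics.median(averages)


# Made-up estimates of a quantity whose true value is 10, with one outlier.
samples = [10, 12, 8, 10, 1000, 10, 9, 11, 10]
result = median_of_averages(samples, group_size=3)
```

Here the group containing the outlier averages to 340, but the median over the three group averages is still 10.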

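11 AMS Sketching for Multi-Joins [DGGR02]
- Problem: Compute answer for COUNT($R \bowtie_A S \bowtie_B T$) $= \sum_{i,j} f_R(i)\, f_S(i,j)\, f_T(j)$
- Sketch-based solution
  - Compute random variables $X_R$, $X_S$ and $X_T$ over stream R.A, stream S (attributes A and B) and stream T.B:
    $X_R = \sum_i f_R(i)\,\xi_i$, $X_S = \sum_{i,j} f_S(i,j)\,\xi_i\,\theta_j$, $X_T = \sum_j f_T(j)\,\theta_j$
  - $\{\xi_i\}$ and $\{\theta_j\}$ are independent families of {-1, +1} random variables, one per join attribute
  - Return $X = X_R X_S X_T$ (E[X] = COUNT($R \bowtie_A S \bowtie_B T$))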
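The same exhaustive check works for the three-way join, using one sign family per join attribute. The frequency tables are hypothetical, and the family names `xi` (attribute A) and `theta` (attribute B) are chosen here for illustration.

```python
from itertools import product

# Hypothetical frequencies; both attribute domains have size 2.
f_R = [1, 2]              # f_R[i]: count of A = i in R
f_S = [[1, 0], [2, 1]]    # f_S[i][j]: count of (A = i, B = j) in S
f_T = [3, 1]              # f_T[j]: count of B = j in T

# Exact size of the join of R, S and T on A and B.
exact = sum(f_R[i] * f_S[i][j] * f_T[j] for i in range(2) for j in range(2))


def estimate(xi, theta):
    """X = X_R * X_S * X_T for fixed sign vectors xi (for A) and theta (for B)."""
    x_r = sum(f_R[i] * xi[i] for i in range(2))
    x_s = sum(f_S[i][j] * xi[i] * theta[j] for i in range(2) for j in range(2))
    x_t = sum(f_T[j] * theta[j] for j in range(2))
    return x_r * x_s * x_t


# The two families are drawn independently, so enumerate all pairs.
signs = list(product((-1, 1), repeat=2))
estimates = [estimate(xi, theta) for xi in signs for theta in signs]
mean = sum(estimates) / len(estimates)
```

If the same family were reused for both attributes (theta equal to xi), the cross terms would no longer cancel and the mean would generally differ from the join size.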

12 What About Multiple Queries?
- For example, network monitoring can require hundreds of standing queries
  - Different trends/anomalies, different parts/elements of the network
[Figure: continuous data streams R1, ..., Rk (gigabytes) feed a stream processing engine that maintains in-memory stream synopses (kilobytes) and returns approximate answers to a query workload Q1, Q2, ..., Qn; join graphs are shown for Q1 (R, S, T joined on A and B) and Q2 (R and T joined on A = B)]

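13 Sketching for Multiple Standing Queries
- Consider queries Q1 = COUNT($R \bowtie_A S \bowtie_B T$) and Q2 = COUNT($R \bowtie_{A=B} T$)
- Naive approach: construct separate sketches for each join
  - The three families of random variables (one per join edge: two for Q1, one for Q2) are independent families of pseudo-random variables
[Figure: separate join graphs for Q1 and Q2]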

14 Sketch Sharing
- Key Idea: Share sketch for relation R between the two queries
  - Reduces space required to maintain sketches (random seed + sketch itself)
  - Can be seen as vertex coalescing in the workload join graph
- Share sketch only for queries using same attributes of the same stream
  - Only sufficient condition!
  - Correctness of estimates may be an issue
[Figure: workload join graph in which Q1 and Q2 share the sketch for R; the edges at the shared vertex use the same family of random variables]

15 Sketch Sharing: Correctness of Estimates
- With sharing of sketches for both R and T, estimate X for Q1 = COUNT($R \bowtie_A S \bowtie_B T$) may be incorrect
- For correctness, families of random variables for distinct edges of a join query must be independent
  - Otherwise, unbiasedness and variance bounds are lost
  - Vertex coalescing and transitivity can cause this to be violated!
  - Very different from traditional MQO: probabilistic estimation
[Figure: workload join graph with both R and T shared between Q1 and Q2; all edges end up using the same family of random variables]

16 Sketch Sharing: Correctness & Optimality
- Formalize the problem using the notion of equivalent join-graph edges
  - Edges using the same random family (directly or by transitivity)
- Look for well-formed workload join graphs
  - No equivalent edges from the same query
  - Necessary and sufficient condition for correctness
  - Many such well-formed graphs may exist
- Optimality: Find well-formed join graph and allocation of sketching space to per-vertex sketches that is "optimal"
  - Minimizes some aggregate error metric over the query workload
    - E.g., average or maximum error over all queries
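The well-formedness condition can be tested with a union-find over sketch endpoints. The encoding below is an assumption made for illustration: each join edge is a triple (query, endpoint, endpoint), an endpoint is a (sketch, attribute) pair, and the two endpoints of an edge must use the same random family. Two edges are then equivalent exactly when their endpoints fall in the same class.

```python
def is_well_formed(edges):
    """Return True if no query has two join edges in the same equivalence class.

    edges: list of (query, endpoint_u, endpoint_v); endpoints are hashable.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Both endpoints of an edge share one random family.
    for _, u, v in edges:
        parent[find(u)] = find(v)

    seen = set()
    for query, u, _ in edges:
        key = (query, find(u))
        if key in seen:
            return False  # two edges of one query use the same family
        seen.add(key)
    return True


# Q1 joins R, S, T on A and B; Q2 joins R and T on R.A = T.B.
share_r_only = [
    ("Q1", ("R", "A"), ("S", "A")),
    ("Q1", ("S", "B"), ("T", "B")),
    ("Q2", ("R", "A"), ("T2", "B")),  # T2: a separate sketch of T for Q2
]
share_r_and_t = [
    ("Q1", ("R", "A"), ("S", "A")),
    ("Q1", ("S", "B"), ("T", "B")),
    ("Q2", ("R", "A"), ("T", "B")),   # reuses both shared sketches
]
```

With only R shared, the check passes. With both R and T shared, the Q2 edge ties the two Q1 edges together by transitivity, so the check fails.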

17 The Space Allocation Problem
- Simpler problem: Well-formed join graph is given! Allocate space...
- Query Q = COUNT($R \bowtie S \bowtie T \bowtie \cdots$)
  - Basic estimate $X = X_R X_S X_T \cdots$
  - $M_R$ is space (number of iid sketch copies) allocated to vertex/stream R
  - Number of copies of X: $M_Q = \min\{M_R, M_S, M_T, \ldots\}$
- Problem: Given join graph over queries Q1, ..., Qr and memory M, allocate space $M_R, M_S, M_T, \ldots$ to nodes/sketches $X_R, X_S, X_T, \ldots$ of the join graph such that either the average error or the maximum error over the queries is minimized, subject to the constraints
  - $M_R + M_S + M_T + \cdots = M$
  - $M_{Q_i} = \min\{M_R, M_S, M_T, \ldots\}$ for $Q_i$ = COUNT($R \bowtie S \bowtie T \bowtie \cdots$)

18 The Space Allocation Problem (contd.)
- For average error, the space-allocation problem is NP-hard (reduction from k-clique)
  - Propose a near-optimal heuristic
    - Novel algorithm for solving the continuous relaxation
    - Round to get a near-optimal integer solution
- For maximum error, a greedy algorithm based on marginal gains is actually optimal

19 The General Problem
- Discover both a well-formed join graph and an allotment of sketching space to its vertices such that workload error is minimized
  - NP-hard even for "easy" cases
    - E.g., find well-formed join graph with minimum number of vertices
- Propose an iterative greedy strategy
  - Start with disjoint per-query join graphs
  - At each step, determine the pair of join-graph vertices to coalesce such that
    - Well-formedness is preserved
    - Decrease in error is maximized
  - Use space-allocation algorithms to determine error decrease

20 Experimental Study
- Compare our sketch-sharing techniques vs. no sharing; also, measure effectiveness of our space-allocation techniques
- Database schema and query workloads inspired by the TPC-H benchmark
  - Workload 1: 12 TPC-H queries
  - Workload 2: Workload 1 plus 17 random join queries
- Synthetically generated clustered, multi-dimensional data sets controlled by Zipfian skew parameters
- Error metric for a single query: squared relative error
  - Average or maximum of per-query errors in the workload
  - Repeat each experiment 100 times and average measurements

21 Average Error, Workload 1

22 Maximum Error, Workload 1

23 Average Error, Workload 2

24 Maximum Error, Workload 2

25 Conclusions
- Addressed the problem of processing multiple data-stream queries concurrently using sketches
- Introduced the concept of sketch sharing
  - Effectively share sketching space and computation across queries
- Necessary and sufficient conditions for correct sketch sharing
  - Well-formed workload join graphs
- Novel algorithms for optimal sketch sharing
  - Space allocation and strategies for coalescing sketches
- Experimental study verified benefits of our sketch-sharing techniques
  - Improvements in accuracy ranging from a factor of 2 to 4

26 Thank you!

27 Sketch Sharing: Problem Formulation
- Problem: Given a set of queries, compute a join graph with the minimum number of (shared) sketches such that all join-query estimates are correct
- Problem is NP-hard (reduction from vertex cover)
- Simple greedy heuristic
  - Start with initial join graph with complete sharing
  - In each iteration, split node that minimizes the number of "bad" edges
    - Splitting nodes in vertex cover gets rid of "bad" edges
[Figure: initial join graph, and join graph containing "bad" edges (in red)]

28 Space Allocation to Sketches
- Key Observation: Allocating identical space to each sketch may not optimize cumulative/max error for join-query estimates
- Consider query Q: COUNT($R \bowtie S \bowtie T \bowtie \cdots$)
  - Query Q estimated as $X = X_R X_S X_T \cdots$
  - Number of copies of X: $M_Q = \min\{M_R, M_S, M_T, \ldots\}$
    - $M_R$ is space allocated to sketch $X_R$
- Relative squared error for Q: [formula not preserved in the transcript]

29 Space Allocation to Sketches: Example
- Consider queries Q1 = COUNT($R \bowtie_A S \bowtie_B T$) and Q2 = COUNT($R \bowtie_{A=B} T$)
  - Let M = 100, $w_{Q1}$ = 2500 and $w_{Q2}$ = 25
  - Est Q1 = $X_R X_S X_T$, Est Q2 = $X_R X'_T$
- Equal space to Q1 & Q2: each sketch gets 25, so $M_{Q1}$ = 25 and $M_{Q2}$ = 25
- More space to Q1: $X_R$, $X_S$, $X_T$ get 30 each and $X'_T$ gets 10, so $M_{Q1}$ = 30 and $M_{Q2}$ = 10
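The two allocations can be compared numerically. The transcript does not preserve the error formula, so the sketch below assumes that the error of a query is w_Q / M_Q and that the cumulative error is the sum over queries; both are assumptions, as are the helper names. `T2` stands for the separate sketch X'_T used by Q2.

```python
def query_space(allocation, sketches):
    """M_Q: a query gets as many copies as its least-provisioned sketch."""
    return min(allocation[s] for s in sketches)


def total_error(allocation, workload):
    """Assumed error model: sum over queries of w_Q / M_Q."""
    return sum(
        weight / query_space(allocation, sketches)
        for weight, sketches in workload.values()
    )


# Weights from the slide; sketch lists follow Est Q1 and Est Q2.
workload = {
    "Q1": (2500, ["R", "S", "T"]),
    "Q2": (25, ["R", "T2"]),
}

equal = {"R": 25, "S": 25, "T": 25, "T2": 25}
skewed = {"R": 30, "S": 30, "T": 30, "T2": 10}

equal_error = total_error(equal, workload)    # 2500/25 + 25/25
skewed_error = total_error(skewed, workload)  # 2500/30 + 25/10
```

Under this assumed model the skewed allocation has the lower cumulative error, since the heavily weighted Q1 benefits more from extra copies than Q2 loses.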

30 Average Error, Workload 3