Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.

Slides:

Advertisements

Similar presentations

Estimating Distinct Elements, Optimally

Advertisements

Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.

Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.

Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.

An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.

Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.

Fast Algorithms For Hierarchical Range Histogram Constructions

Parallel Scheduling of Complex DAGs under Uncertainty Grzegorz Malewicz.

Efficient Query Evaluation on Probabilistic Databases

1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research.

New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006

Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)

Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.

Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.

Data Stream Mining and Querying

A survey on stream data mining

Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,

Statistic estimation over data stream Slides modified from Minos Garofalakis ( yahoo! research) and S. Muthukrishnan (Rutgers University)

CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.

1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.

1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

1 Administrivia  List of potential projects will be out by the end of the week  If you have specific project ideas, catch me during office hours (right.

CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.

One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.

Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1.

Fast Approximate Wavelet Tracking on Streams Graham Cormode Minos Garofalakis Dimitris Sacharidis

Efficient Query Evaluation over Temporally Correlated Probabilistic Streams Bhargav Kanagal, Amol Deshpande ΗΥ-562 Advanced Topics on Databases Αλέκα Σεληνιωτάκη.

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.

Mining Association Rules of Simple Conjunctive Queries Bart Goethals Wim Le Page Heikki Mannila SIAM /8/261.

Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.

1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.

CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,

An Integration Framework for Sensor Networks and Data Stream Management Systems.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

The Relational Model: Relational Calculus

Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :

Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.

Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis

DAQ: A New Paradigm for Approximate Query Processing Navneet Potti Jignesh Patel VLDB 2015.

Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.

PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.

1 Relational Algebra and Calculas Chapter 4, Part A.

Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.

New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang

Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002.

Approximating Data Stream using histogram for Query Evaluation Huiping Cao Jan. 03, 2003.

Calculating frequency moments of Data Stream

Monte Carlo Linear Algebra Techniques and Their Parallelization Ashok Srinivasan Computer Science Florida State University

Sketch-Based Multi-Query Processing over Data Streams Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.

Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.

Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.

New Characterizations in Turnstile Streams with Applications

A paper on Join Synopses for Approximate Query Answering

Finding Frequent Items in Data Streams

Estimating L2 Norm MIT Piotr Indyk.

Spatial Online Sampling and Aggregation

Load Shedding Techniques for Data Stream Systems

AQUA: Approximate Query Answering

Data Integration with Dependent Sources

Feifei Li, Ching Chang, George Kollios, Azer Bestavros

Range-Efficient Computation of F0 over Massive Data Streams

Presentation transcript:

Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002

Approximate Query Evaluation  Why?  Handling load – streams coming too fast  Data stream is archived in a off-site data warehouse, expensive access of archived data  Avoid unbounded storage and computation  Ad hoc queries need approximate history  Try to look at the data items only once and in a fixed order

Synopses  Queries may access or aggregate past data  Need bounded-memory history-approximation  Synopsis? – Succinct summary of old stream tuples – Like indexes/materialized-views, but base data is unavailable  Examples – Sliding Windows – Samples – Sketches – Histograms – Wavelet representation

Sketching Techniques  Self-Join Size Estimation  Stream of values from dom(A) = {1,2,…,n}  Let f(i) = frequency of value i  Consider SJ(A) = Σ i  dom(A) f(i) 2, or Gini’s index of homogeneity.  Useful in parallel DB applications, error estimation in query result size estimation and access plan costs.  Equivalent query: Q = COUNT (R |><| A R)

Evaluating SJ(A) = Σ f i 2  To update S, keep a counter f(i) for each value i in the domain D   (|dom(A)|) bits of storage  This is true for any deterministic algorithm  Question – estimating S in sub-linear space (O(log |dom(A)|))  A randomized technique that offers strong probabilistic guarantees on the quality of SJ(A)

Self-Join Size Estimation  AMS Technique (randomized sketches)  Given (f(1),f(2),…,f(n))  Generate a family of 4-wise independent binary random variables {Z i : i = 1,…, |dom(A)|}  Z i = random{-1,1}  P[Z i = +1] = P[Z i = -1] = ½ (E[Z i ] = 0)  By using tools like orthogonal arrays, such families can be constructed online using O(log|dom(A)|) space.

Self-Join Size Estimation  Define X = Σ i  dom(A) f(i)Z i (X incrementally computable)  Theorem Exp[X 2 ] = Σ f(i) 2 – Cross-terms f(i)Z i f(j)Z j have 0 expectation – Square-terms f(i)Z i f(i)Z i = f(i) 2  Hence, X 2 is an unbiased estimator for SJ(A)  Space = log (N + Σ f(i))  Independent samples X k reduce variance

Estimation Quality  How can independent samples X k improve the quality of estimation?  Keep s 1 x s 2 iid instantiations for X k  Use independent random seeds to get a family of Z i ’s for each instance  s 1 reduces variance, s 2 boosts confidence Avg(X 1j 2 ) Avg(X 2j 2 ) Avg(X 3j 2 ) Avg(X 4j 2 ) Avg(X 5j 2 ) Median (sketch) s1s1 s2s2 Atomic Sketch

Sample Run of AMS 36257V = Z 1 = Z 2 = X 1 = 5, X 1 2 = 25 X 2 = 14, X 2 2 = 196 Est = 110.5Σv i 2 = V = Z 1 = Z 2 = X 1 = 6, X 1 2 = 36,X 2 = 12, X 2 2 = 144, Est = 90Σ v i 2 = 130,

Comments on AMS  The self-join size can be computed on-line  Sufficiently small variance (controlled by s 1 and s 2 )  Can estimate SJ(A) in one pass  Probabilistic guarantee: a relative error of at most  with probability at least 1-   Can this method be extended for other queries?

Complex Aggregate Queries  A. Dobra et al. extend the idea of AMS to provide approximate answers to complex aggregate queries. AGG   SELECT AGG FROM R 1,R 2,…,R r where   AGG: COUNT/SUM/AVERAGE   : conjunction of equality joins (R i.A j = R k.A l )  The error of these estimates is at most ε with probability 1-δ.

 Attribute Constraint of  .  Each attribute of a relation appears at most once in the join condition .  For any attribute R i.A j occuring m > 1 times in , add m-1 new “attributes” to R i  Replace m-1 occurrences in  with the new attributes   = ((R 1.A 1 =R 2.A 1 ) & (R 1.A 1 =R 3.A 1 )) becomes ((R 1.A 1 =R 2.A 1 ) & (R 1.A 2 =R 3.A 1 ))

The COUNT Query  Q COUNT : # of tuples in the cross-product of R 1,…,R r that satisfy the equality constraints in .  SELECT COUNT(*) from R 1,…,R r where   Approach: Express the above query using mathematical operators.  Rename the 2n join attributes in  to A 1, A 2,…, A 2n such that each constraint in  is A j = A n+j for 1  j  n.

Projection of Domains  D = dom(A 1 ) x dom(A 2 ) x…x dom(A 2n )  S k is the set of join attributes appearing in R k  D k = dom(A k1 ) x dom(A k2 ) x…x dom(A k|Sk| ); A k1,…,A k|Sk| are attributes in S k  D k is a projection of D on attributes in S k

Value Assignment I  An assignment I assigns values to (a subset of) join attributes  If I  D, each join attribute is assigned a value I[j] by I.  If I  D k, then I only assigns a value I[j] to attributes j  S k.  Use letter j to refer to attribute A j

Value Assignment I  I[S k ] is the projection of I on attributes in S k.  f k (I) is the number of tuples in R k that match I.

Example  SELECT COUNT (*) FROM R1, R2, R3 where R1.A1 = R2.A1 and R2.A2 = R3.A1  Renaming: SELECT COUNT (*) FROM R1, R2, R3 where A1 = A3 and A2 = A4 – A1 = R1.A1A2 = R2.A2 – A3 = R2.A1 A4 = R3.A1  Note: Joins in the form A j = A n+j  S1 = {1}, S2 = {2,3}, S3 = {4}

Example (cont.)  D = dom(A1) x dom(A2) x dom(A3) x dom (A4)  I 1 =  D; I 1 [3] = 2  D1 = dom(A1); D2 = dom(A2) x dom(A3); D3 = dom(A4)  I 1 [S 1 ] =, I 1 [S 2 ] =, I 1 [S 3 ] =  f 2 ( ) = 5 means 5 tuples from R 2 have A2=3, A3=2

Exact Q COUNT  Q COUNT =Σ I  D,  j:I[j]=I[n+j] Π r k=1 f k (I[S k ]) – Product of the # of tuples in each relation that match a value assignment I – Summed over all I  D that satisfy the equi-join constraints ε

Approximate Q COUNT  Similar to self-join size!  Construct a random variable X that is an unbiased estimator for Q COUNT  Variance can be bounded from above  Use averaging and median-selection tricks to boost the accuracy and confidence of X  Achieve small relative error with high probability

Constructing X  For each pair of join attributes j, n+j in , build random variables {Z j,l : l=1,…,|dom(A j )|},where Z j,l  {-1, +1}  Random variables belonging to families for different attribute pairs are independent  Space:Σ n j=1 O(log|dom(A j )|)

Constructing X (2)  For each relation R k, define atomic sketch X k = Σ I  Dk (f k (I)Π j  Sk Z j,I[j] )  X = unbiased estimator of Q COUNT = Π r k=1 X k  E(X) = Q COUNT  Each X k can be efficiently computed as tuples of R k are streaming in. – Initialize X k to 0 – For each tuple t in the R k stream, addΠ j  Sk Z j,t[j] to X k

Join Graph and Acyclicity  A join graph is an undirected graph – a node for each relation, and, – an edge for each join attribute pair j, n+j  If the join graph is acyclic, the variance of X is bounded from above.  Many join queries are acyclic, like chain joins and star joins.

Probabilistic Guarantees  If Q COUNT is acyclic, multi-join COUNT query over relations R 1,…, R r, then a sketch of size O(log|dom(A j )|) is possible to approximate Q COUNT so that: – the relative error of the estimate is at most ε with probability 1-δ.

Basic Notions of Approximation  For aggregate queries (e.g., SUM, COUNT), approximation quality can be measured by absolute relative error: – |Estimated value – Actual value| / Actual value  Open question: for queries involving more than simple aggregation, how should we define approximation?  Consider S |><| B T: (S: {A,B}, T: {B,C}) ABC Doctor 810.3Lawyer 310.2Teacher ABC 810.3Lawyer 310.2Teacher Actual Result Approximate Result

Basic Notions of Approximation (2)  Can we accept this kind of approximation? ABC Doctor 810.3Lawyer 310.2Teacher Actual ResultApproximate Result ABC Doctor 810.3Student 310.2Teacher

Basic Notions of Approximation (3)  Can we provide useful (semantically correct) but stale results? ABC Doctor 810.3Lawyer 310.2Teacher Actual Result (at time t) Approximate Result (correct result at time t -  ) ABC Doctor 810.3Lawyer

Related Issues  Metric for set-valued queries  Composition of approximate operators  How is it understood/controlled by user?  Integrate into query language  Query planning and interaction with resource allocation  Accuracy-efficiency-storage tradeoff and global metric

Conclusions  To evaluate aggregate queries, O(|dom(A)|) size is needed to obtain exact answers.  True for any deterministic algorithms  We introduce randomized algorithms that uses only O(Σ n j=1 log|dom(A j )|) space.  Only 1 pass is needed for estimation.

Conclusions (2)  Express the target query by mathematical operators  Use an unbiased estimator for the expression  Independent samples reduce variance  A probabilistic guarantee: the relative error of the estimate is at most ε with probability 1-δ.  Question: Can we use this technique to answer queries involving more general selection predicates (e.g., inequality joins)?

References N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments, STOC ’96. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD ’02.

Thank You!