Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.
Design of the fast-pick area Based on Bartholdi & Hackman, Chpt. 7.
Dr. Miguel Bagajewicz Sanjay Kumar DuyQuang Nguyen Novel methods for Sensor Network Design.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Kaushik Chakrabarti(Univ Of Illinois) Minos Garofalakis(Bell Labs) Rajeev Rastogi(Bell Labs) Kyuseok Shim(KAIST and AITrc) Presented at 26 th VLDB Conference,
CS4432: Database Systems II
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Fast Algorithms For Hierarchical Range Histogram Constructions
Introduction to Histograms Presented By: Laukik Chitnis
1 12. Principles of Parameter Estimation The purpose of this lecture is to illustrate the usefulness of the various concepts introduced and studied in.
Linked Bernoulli Synopses Sampling Along Foreign Keys Rainer Gemulla, Philipp Rösch, Wolfgang Lehner Technische Universität Dresden Faculty of Computer.
Harikrishnan Karunakaran Sulabha Balan CSE  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid Chaudhuri Gautam Das Vivek Narasayya Presented by :Sayed.
Advanced Database Systems September 2013 Dr. Fatemeh Ahmadi-Abkenari 1.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.
Evaluating Window Joins Over Unbounded Streams By Nishant Mehta and Abhishek Kumar.
Evaluating Hypotheses
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
Estimation 8.
Inferences About Process Quality
Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning 1 Evaluating Hypotheses.
©Silberschatz, Korth and Sudarshan14.1Database System Concepts 3 rd Edition Chapter 14: Query Optimization Overview Catalog Information for Cost Estimation.
8 TECHNIQUES OF INTEGRATION. There are two situations in which it is impossible to find the exact value of a definite integral. TECHNIQUES OF INTEGRATION.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Integrals 5.
Sampling Theory Determining the distribution of Sample statistics.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Random Sampling, Point Estimation and Maximum Likelihood.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Section 8.1 Estimating  When  is Known In this section, we develop techniques for estimating the population mean μ using sample data. We assume that.
CMSC424: Database Design Instructor: Amol Deshpande
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
1 Lecture 3: March 6, 2007 Topic: 1. Frequency-Sampling Methods (Part I)
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
View Materialization & Maintenance Strategies By Ashkan Bayati & Ali Reza Vazifehdoost.
PROBABILITY AND STATISTICS FOR ENGINEERING Hossein Sameti Department of Computer Engineering Sharif University of Technology Principles of Parameter Estimation.
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Presented By Anirban Maiti Chandrashekar Vijayarenu
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.
By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
University of Texas at Arlington Presented By Srikanth Vadada Fall CSE rd Sep 2010 Dynamic Sample Selection for Approximate Query Processing.
Chapter 13 Query Optimization Yonsei University 1 st Semester, 2015 Sanghyun Park.
Sampling Theory Determining the distribution of Sample statistics.
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies 병렬 분산 컴퓨팅 연구실 석사 1 학기 김남희.
ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.
A paper on Join Synopses for Approximate Query Answering
Distribution of the Sample Means
Anthony Okorodudu CSE ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan.
ICICLES: Self-tuning Samples for Approximate Query Answering
Spatial Online Sampling and Aggregation
AQUA: Approximate Query Answering
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Presentation transcript:

Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering

Introduction Traditional query processing has focused on providing exact answers to queries. In large data providing an exact answer to a complex query can take minutes, or even hours. There are a number of scenarios in which an exact answer may not be required, a fast, approximate answer is preferred.

Introduction An approximate answer can also provide feedback on how well-posed a query is. It can provide a tentative answer to a query when the base data is unavailable. Provide an estimated response in orders of magnitude less time than the time to compute an exact answer. Avoiding or minimizing the number of accesses to the base data

Introduction Schemes for providing approximate join aggregates that rely on using random samples of base relations alone suffer from serious disadvantages. Instead, the use of precomputed samples of a small set of distinguished joins.

The Aqua System Improve response times for queries by avoiding accesses to the original data altogether. Instead, Aqua maintains smaller-sized statistical summaries, called synopses. Statistics take the form of various types of samples and histograms on the data in the data warehouse.

Aqua Architecture Aqua has three key components: Statistics Collection: responsible for collecting all the synopses which Aqua uses to answer queries posed by the user. Query Rewriting: responsible for parsing the input SQL query and generating an appropriately translated query. Additionally, the rewriting involves appropriate scaling of certain operators. Maintenance: responsible for keeping the synopses up to date in the presence of updates to the underlying data.

The Problem with Joins Non-Uniform Result Sample: In general, the join of two uniform random base samples is not a uniform random sample of the output of the join. In most cases, this non- uniformity significantly degrades the accuracy of the answer and the confidence bounds. Small Join Result Sizes: The join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds since they critically depend on the query result size.

Non-Uniform Result Sample In order for the join of the base samples to be a uniform random sample of the actual join, the probability of any two joined tuples to be in the former should be the same as their probability in the latter.

Counter Example Consider the (equality) join of two relations R and S on an attribute X. The distribution of X values in the two relations are given in the figure. The edges connect joining tuples. Consider joining base samples from R and S. Assume that each tuple is selected for a base sample with probability 1/r. From the figure, we see that a1 and a2 are in the join if and only if both a tuples are selected from R and the one a tuple is selected from S. This occurs with probability 1/r 3, since there are three tuples that must be selected. On the other hand, a1 and b1 are in the join if and only if the four tuples incident to these edges are selected. This occurs with probability only 1/r 4. This contrasts with the fact that in a uniform random sample of the actual join, the probability that both a1 and a2 are selected equals the probability that both a1 and b1 are selected.

Small Join Result Sizes Consider two relations, A and B, and base samples comprising of 1% of each relation. The size of the foreign key join between A and B is equal to the size of A. However the expected size of the join of the base samples is 0.01% of the size of A, since for each tuple in A, there is only one tuple in B that joins with it, and that tuple is in the sample for B with only a 1% probability. In general, consider a k-way foreign key join and k base samples each comprising 1/r of the tuples in their respective base relations. Then the expected size of the join of the base samples is 1/r k of the size of the actual join. Thus, it is in general impossible to produce good quality approximate answers using samples on the base relations alone.

Join Synopses A practical and effective solution for producing approximate join aggregates of good quality. Precompute samples of join results, making quality answers possible even on complex joins. A way to precompute such samples is to execute all possible join queries of interest and collect samples of their results. This is not feasible since it is too expensive to compute and maintain.

Join Synopses Definition 1: Foreign Key Join: A 2-way join r 1 ⋈ r 2, r 1 ≠ r 2, is a foreign key join if the join attribute is a foreign key in r 1. For k ≥ 3, a k-way join is a k-way foreign key join if there is an ordering r 1, r 2,…,r k of the relations being joined that satisfies the following property: for i = 2,3,..., k, s i- 1 ⋈ r i is a 2-way foreign key join, where s i-1 is the relation obtained by joining r 1,r 2,...,r i-1.

Join Synopses In order to develop this solution, the database schema is modeled by a graph with a vertex for each base relation and a directed edge from a vertex u to a vertex v ≠ u if there are one or more attributes in u’s relation that constitute a foreign key for v’s relation. The edge is labeled with the foreign key. The figure shows the corresponding graph for the TPC-D schema.

Join Synopses Lemma 1: The subgraph of G on the k nodes in any k-way foreign key join must be a connected subgraph with a single root node. Denote the relation corresponding to the root node as the source relation for the k-way foreign key join. Lemma 2: There is a 1-1 correspondence between tuples in a relation r 1 and tuples in any k-way foreign key join with source relation r 1.

Join Synopses Definition 2: Join Synopses: For each node u in G, corresponding to a relation r 1, define J(u) to be the output of the maximum foreign key join r 1 ⋈ r 2 ⋈... ⋈ r k, with source r 1. Let S u be a uniform random sample of r 1. Define a join synopsis, J(S u ), to be the output of S u ⋈ r 2 ⋈ … ⋈ r k. The join synopses of a schema consists of J(S u ) for all u in G.

Join Synopses Theorem 1: Let r 1 ⋈... ⋈ r k, k > 2, be an arbitrary k-way foreign key join, with source relation r 1. Let u be the node in G corresponding to r 1, and let S, be a uniform random sample of r 1. Let A be the set of attributes in r k. Then, the following are true: J(S u ) is a uniform random sample of J(U), with |S u | tuples. (From Lemma 2.) r 1 ⋈ … ⋈ r k = π A J(u) (From above definition.) π A J(S u ) is a uniform random sample of r 1 ⋈ … ⋈ r k with |S u | tuples. (From above 2 statements.)

Join Synopses Lemma 3: From a single join synopsis for a node whose maximum foreign key join has K relations, we can extract a uniform random sample of the output of between K – 1 and 2 K – 1 – 1 distinct foreign key joins. Lemma 4: For any node u whose maximum foreign key join is a k-way join, the number of tuples in its renormalized join synopsis J(S u ) is at most k|S u |

Allocation Optimal Strategies: Seek to select join synopsis (join sample) sizes so as to minimize the average relative error over a collection of aggregate queries. For each relation, R i, we determine the fraction, f i, of the queries in S for which R i is either the source relation in a foreign key join or the sole relation in a query without joins. The average relative error bound over the queries is proportional to ∑ i [f i /√n i ]. n i is the number of tuples allocated to the join sample for the source R i

Allocation Heuristic Strategies: Strategies that can be used in the absence of query work load information. EqJoin: divides up the space allotted, N, equally amongst the relations. Each relation devotes all its allocated space to join synopses. (For relations with no descendants in the schema, this equates to a sample of the base relation.) CubeJoin: divides up the space amongst the relations in proportion to the cube root of their join synopsis tuple sizes. Each relation devotes all its allocated space to join synopses. PropJoin: divides up the space amongst the relations in proportion to their join synopsis tuple sizes. Each relation devotes all its allocated space to join synopses, and hence each join synopsis has the same number of tuples.

Improved Accuracy Measures A critical issue in approximate query answering of aggregation queries is that of providing confidence bounds for the answers. An important advantage of using join synopses is that queries with foreign key joins can be treated as queries without joins Known confidence bounds for single-table queries are much faster to compute and much more accurate than the confidence bounds for multi-table queries.

Maintenance of Join Synopses Need to maintain synchronization when base relation is updated. If a tuple is added: Let P u be current probability for including a newly arrived tuple for relation u in the random sample S u. Let u ⋈ r 2 ⋈ r 3 ⋈ …. ⋈ r k be the max foreign key join with source u. We add T p (new tuple) to S u with probability P u. If T p is added to S u, we add to J(S u ) the tuple T p ⋈ r 2 ⋈ r 3 ⋈ ….r k. On delete of a tuple T p from u: Find if T p is in S u. If there, then delete it from S u and remove the tuple in J(S u ) corresponding to T p. If sample becomes too small, repopulate the sample by rescanning relation u.

Experimental Evaluation Test bed – TPC-D decision support benchmark. DB of around 300 MB. The query used is

Join Synopsis Accuracy

Join Synopsis Maintenance

Conclusion Focus on computing approximate answers to aggregates computed on multi-way joins. For Database’s with schema’s that involve only foreign key joins, join synopses has been proposed as a solution. Join synopses can be maintained effectively during updates.