Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande

Contents 1. Introduction 2. Need for approximate answers 3. Problem with joins 4. Join synopses 5. Allocation 6. Maintenance of join synopses 7. Experimental Evaluation

Introduction This paper  demonstrates difficulty of providing good approximate answers to join queries  proposes join synopses as the efficient solution for this problem  presents strategy for allocating available space for join synopses  provides efficient algorithm for maintaining join synopses in presence of updates to the base relations

Why Approximate answers ? Reduce overhead for large DBs and improve response time Reduce access to the base relation Example of Approximate answers  Initial queries in the data mining which are used to determine what the interesting queries are  Queries requesting numerical answers and full precision of exact answer not needed e.g. total, average The research in this paper was conducted while developing efficient approximate query answering system, Aqua.

Aqua System improve response time by avoiding frequent access to original data maintains smaller sized statistical summaries, called synopses, on warehouse. sits on the top of the DBMS. There key components  Statistic Collection  Query Rewriting  Maintenance Collects all synopses, uses it to answer queries posed by user parses sql input, rewrite queries for scaling certain operators to fit for synopses Keep synopses up to date during updating of original data

Problem with Joins Natural set of synopses for an approximate query includes uniform random samples of each base relations Non-uniform result samples - For the join to be uniform random sample, probabilities of tuples in join samples must be equal Small join result Join of relations R & S on attribute X aabbaabb abab R.XS.X a2 a1 b1 Probabilities of tuples a1 and a2 being selected should be same as prob. of tuples a1 and b1 selected in join uniform random sampling prob. (a1,a2)=(1/r)*(1/r)*(1/r)=1/r 3 prob. (a1,b1)=(1/r)*(1/r)*(1/r) *(1/r)=1/r 4 To get uniform join samples is very difficult

Join Synopses Naïve way - execute all possible join queries and collect samples Join synopses - samples are taken from small set of distinguished joins Can obtain random samples of all possible joins in the schema This is scheme is for foreign key joins Modeled database schema as a graph  vertex - base relation  directed edge (u to v) – if u has at least one attribute which is foreign key in v

Join Synopses Key result proved - There is 1-1 correspondence between a tuple in relation ‘r’ & a tuple in the output of any foreign key join involving ‘r’ & any of its descendents in the graph. A sample Sr of a relation ‘r’ can be used to produce another relation J(Sr) called a join synopsis of ‘r’. ( provides random samples). Join synopses of R is simply a sample of R where as for C it is the join of N, R and sample of C.

Join Synopses For each node u in database schema G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1xr2x..xrk with source r1. Let Su be a uniform random sample of r1. The join synopsis J(Su) is the output of Suxr2xr3…..xrk. J(Su) is a uniform random sample of J(u) with |Su| tuples. Thus we can extract from our synopsis a uniform random sample of the output of any k-way foreign key join. From 1 join synopsis for a node whose foreign key join has k relations, we can extract a URS of the output of between k-1 & pow(2,k-1)-1 distinct foreign key joins.

Allocation Allocate space among various join synopses when certain properties of query workload are known. Identify heuristics for the common case when such properties are not known. Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign key joins. For each relation Ri, find fraction Fi of queries in S for which Ri is the source relation in a foreign key join. It is known that the error bounds are inversely proportional to sqrt(n).(n- number of tuples in join sample). Select join synopsis sizes so as to minimize the average relative error.

Allocation The average relative error bound over the queries is proportional to sum(f i /sqrt(n i )) n i is selected so as to minimize the above equation for the total memory allocated for join synopses For each relation R i if s i = size of single join synopses tuple then join synopses size is chosen so as sum (n i s i ) <= Total memory allocated In the absence of query work load information heuristic strategies can be used.  EqJoin  CubeJoin  PropJoin divides space equally amongst the relations divides space proportional to cube root of their join synopses tuple sizes divides space proportional to their join synopses tuple sizes

Maintenance of Join Synopses Need to maintain the join synopses when base relation is updated (insert or delete) does not require frequent access to base relation If a new tuple is inserted  Let P u be the probability of newly arrived tuple for relation u in random sample S u  Let uxr2xr3x….xrk be the max foreign key join with source u.  We add ‘T’ (new tuple) to S u with probability P u.  If ‘T’ is added to S u, we add to J(S u ) the tuple Txr2xr3x….rk

Maintenance of Join Synopses  If T is added to S u and S u exceeds its target size, then select uniformly at random a tuple T’ to evict from S u and remove the tuple in J(S u ) corresponding to T’. On delete of a tuple T from u  T is in S u delete the tuple from S u and remove the tuple from J(S u ) corresponding to T  If sample becomes too small due to many deletions repopulate by scanning relation u. This algorithm performs lookups with the base relation with small probability P u

Experimental Evaluation Two classes of experiments  Accuracy experiments  Maintenance experiments Accuracy Experiments  Compares accuracy of techniques based on join synopses and based on base samples  parameters varied - query selectivity and total space allocated to precomputed summaries (summary size/join synopses size) Maintenance Experiments  Study cost of keeping join synopses up to date in presence of insertions/deletions to the underlying data.

Experimental Evaluation Ran results on TPC-D decision support benchmark Query used is an aggregate that is computed on join of Lineitem, Customer, Order, Supplier, Nation and Region. The query used is region parameter is set to ‘ASIA’ and selection predicate is on o_orderdate column to the range [1/1/94, 1/1/95] Query selectivity is varied using these parameters

Experimental Evaluation Accuracy Experiments

Experimental Evaluation Maintenance Experiments tuples inserted in lineitem table

Conclusion Provides uniform random sampling for joins in the database having foreign key joins. Focus on computing approximate answers to aggregates computed on multi-way joins. Join synopses can be maintained effectively during updates.

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

Similar presentations

Presentation on theme: "Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

Similar presentations

Presentation on theme: "Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande."— Presentation transcript:

Similar presentations

About project

Feedback