Presentation is loading. Please wait.

Presentation is loading. Please wait.

Query Sampling in DB2. Motivation Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want.

Similar presentations


Presentation on theme: "Query Sampling in DB2. Motivation Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want."— Presentation transcript:

1 Query Sampling in DB2

2 Motivation Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want just approximate answers

3 Solution: Sampling Processing less data => huge performance improvement Approximate answers often suffice

4 Current Support of Sampling in DB2: The Rand() Function RAND() returns a uniform random number between 0 and 1 SELECT * FROM original query WHERE rand() < 0.01 Advantage:User can easily specify the size of sample he wants Disadvantage:The RAND() operator does not provide any optimization as it is applied to the query result

5 Current Support of Sampling in DB2: The TABLESAMPLE operator Can place sampling clause after any SQL table reference SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE … General form of sampling clause TABLESAMPLE samplingMethod(p) samplingMethod is one of the two:  BERNOULLI: row-level Bernoulli sampling  SYSTEM: page-level (efficient) sampling method p = inclusion probability for each row (%) = (expected) sampling fraction

6 Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.) Advantage:By pushing sampling to the bottom of a query tree, can provide huge performance improvements Disadvantage: How to extrapolate the sampling rate at the base table to the query result? Joins Group by Count (DISTINCT) Subqueries

7 Extrapolation Problem Let R be a base relation referenced in query Q. Then: uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables) Example (CMN99): Let Q = R S, where: R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3),…, (a2, bk)} S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3),…, (a1, ck)} sample(R, rate1) sample(S, rate2) cannot generate sample(R S, rate3) for any reasonable values of rate1 and rate2

8 Solution Push sampling to the bottom of the query tree – effectively replacing RAND() with TABLESAMPLE - whenever possible –Selections –Projections without duplicate removal –Foreign key joins –Some types of group by (using materialized views)

9 Rule Implementation The “pushdown” implemented as a condition-action rule –When all operators in a query commute with sampling, the rule is fired –Applied during query rewrite optimization

10 Foreign Key Joins A two-way join r1 r2 is a foreign key join if a join attribute is the foreign key in r1 (this definition can be easily extended to n-way join). The relation r1 is called a source relation. A random sample of r1 r2 can be produced by joining a random sample of r1 (the source relation) with r2.

11 FK-Join Query Transformation select * from (select * from F, D where F.foreignkey = D.primarykey) where RAND()<0.1 select * from F TABLESAMPLE bernoulli (10), D where F.foreignkey = D.primarykey)

12 Sampling group by queries using materialized views create view MQT as ( select DISTINCT D1.x1, D2.x2 from F, D1, D2 where F.y1=D1.x1 and Fy2=D2.x2) select * from (select (exp1, exp2,…) from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2 and pred1, pred2,… group by x1, x2) where RAND()<0.1 select (exp1, exp2,…) from F, MQT TABLESAMPLE bernoulli (10.0) where F.y1=MQT.x1 and F.y2=MQT.x2 and pred1, pred2,… group by x1, x2

13 Experiments Modified 12 queries from TPCH –62% performance improvement for 10% sample –76% improvement for 1% sample Defined several MQTs and queries in TPCH schema –95% improvement for 1% sample


Download ppt "Query Sampling in DB2. Motivation Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want."

Similar presentations


Ads by Google