Query Sampling in DB2

Motivation
– Data volume is growing fast
– Many algorithms do not scale up with data volume
– For exploratory analysis, users often want just approximate answers

Solution: Sampling
– Processing less data => huge performance improvement
– Approximate answers often suffice

Current Support of Sampling in DB2: The RAND() Function
– RAND() returns a uniform random number between 0 and 1
– Usage, where "original query" stands for any query:
    SELECT * FROM (original query) WHERE RAND() < 0.01
– Advantage: the user can easily specify the desired sample size (here, about 1% of the result rows)
– Disadvantage: RAND() offers no optimization opportunity, because the filter is applied to the query result after the full query has been evaluated
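
The RAND() filter above can be mimicked outside the database. The following is a minimal Python sketch, not DB2 code: `rand_filter`, `full_result`, and `rows_processed` are hypothetical names, and a list of integers stands in for the complete query result.

```python
import random

def rand_filter(result_rows, p, rng):
    """Mimic `WHERE RAND() < p`: keep each row of an already-computed
    result independently with probability p."""
    return [row for row in result_rows if rng.random() < p]

rng = random.Random(42)
full_result = list(range(100_000))   # stand-in for the complete query result
sample = rand_filter(full_result, 0.01, rng)

# The whole result had to exist before the filter could run,
# so every row was still produced and inspected.
rows_processed = len(full_result)
```

The sample holds roughly 1% of the rows, but `rows_processed` equals the full result size, which is the disadvantage the slide points out.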

Current Support of Sampling in DB2: The TABLESAMPLE Operator
– A sampling clause can be placed after any SQL table reference:
    SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE …
– General form of the sampling clause:
    TABLESAMPLE samplingMethod(p)
– samplingMethod is one of:
  – BERNOULLI: row-level Bernoulli sampling
  – SYSTEM: page-level (efficient) sampling
– p = inclusion probability for each row (%) = (expected) sampling fraction
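
The difference between the two sampling methods can be illustrated with a simplified Python model. This is a sketch under stated assumptions, not DB2's implementation: the table is modeled as a hypothetical list of 1,000 pages with 100 rows each, and the function names are invented for illustration.

```python
import random

def bernoulli_sample(pages, p_percent, rng):
    """Row-level: every row is visited and kept independently with probability p."""
    p = p_percent / 100.0
    return [row for page in pages for row in page if rng.random() < p]

def system_sample(pages, p_percent, rng):
    """Page-level: each page is kept or skipped as a whole with probability p."""
    p = p_percent / 100.0
    kept = [page for page in pages if rng.random() < p]
    return [row for page in kept for row in page], len(kept)

rng = random.Random(7)
# Hypothetical table: 1,000 pages of 100 rows each.
pages = [[(pg, slot) for slot in range(100)] for pg in range(1000)]

row_sample = bernoulli_sample(pages, 10.0, rng)
page_sample, pages_read = system_sample(pages, 10.0, rng)
```

Both samples contain about 10% of the rows in expectation. In this model the page-level variant only touches the pages it keeps, while the row-level variant visits every row, which is why the slide calls SYSTEM the efficient method.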

Current Support of Sampling in DB2: The TABLESAMPLE Operator (cont.)
– Advantage: by pushing sampling to the bottom of a query tree, it can provide huge performance improvements
– Disadvantage: how should the sampling rate at the base table be extrapolated to the query result? Problematic cases:
  – Joins
  – GROUP BY
  – COUNT(DISTINCT)
  – Subqueries

Extrapolation Problem
Let R be a base relation referenced in query Q. Then:
    uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables)
Example (CMN99): Let Q = R ⋈ S, where:
    R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3), …, (a2, bk)}
    S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3), …, (a1, ck)}
Then sample(R, rate1) ⋈ sample(S, rate2) cannot generate sample(R ⋈ S, rate3) for any reasonable values of rate1 and rate2.
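
The CMN99 example can be checked with a small simulation. The sketch below is illustrative only: k = 200, the 10% rates, the seed, and all function names are assumptions chosen for the demonstration, and the join is taken on attribute A.

```python
import random

def build(k):
    """R and S exactly as in the example, for a given k."""
    R = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
    S = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
    return R, S

def join(R, S):
    """Join on the shared attribute A."""
    return [(a, b, c) for (a, b) in R for (a_s, c) in S if a == a_s]

def bernoulli(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [r for r in rows if rng.random() < rate]

k = 200
R, S = build(k)
full_join = join(R, S)          # k rows for a1 plus k rows for a2

rng = random.Random(1)
trials = 500
empty = sum(
    1 for _ in range(trials)
    if not join(bernoulli(R, 0.1, rng), bernoulli(S, 0.1, rng))
)
empty_fraction = empty / trials
```

All a1 matches hinge on the single row (a1, b0) of R and all a2 matches on the single row (a2, c0) of S, so the join of two 10% samples comes out empty in roughly 0.9 × 0.9 = 81% of the runs, even though the full join has 2k rows. When one of those two rows does survive, about a tenth of its k join rows appear at once, which is not how a uniform sample of the join behaves.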

Solution
Push sampling to the bottom of the query tree, effectively replacing RAND() with TABLESAMPLE, whenever possible:
– Selections
– Projections without duplicate removal
– Foreign key joins
– Some types of GROUP BY (using materialized views)

Rule Implementation
The "pushdown" is implemented as a condition-action rule:
– Condition: all operators in the query commute with sampling
– Action: the rule fires and pushes the sampling down
– The rule is applied during query rewrite optimization
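
A condition-action rule of this kind might be sketched as follows. This is a hypothetical Python illustration, not the DB2 rewrite engine: the `Op` class, the operator labels, and the returned strings are all invented, and the sketch only decides whether the rule fires, without performing the rewrite.

```python
from dataclasses import dataclass, field
from typing import List

# Operator kinds listed in the deck as commuting with sampling
# (the labels themselves are hypothetical).
COMMUTES = {"scan", "select", "project_no_distinct", "fk_join"}

@dataclass
class Op:
    kind: str
    children: List["Op"] = field(default_factory=list)

def all_commute(op):
    """Condition: every operator in the query tree commutes with sampling."""
    return op.kind in COMMUTES and all(all_commute(c) for c in op.children)

def apply_rule(query):
    """Action: report whether the RAND() filter may be replaced by TABLESAMPLE."""
    return "push down to TABLESAMPLE" if all_commute(query) else "keep RAND() filter"

# A selection over a foreign key join: the rule fires.
q1 = Op("select", [Op("fk_join", [Op("scan"), Op("scan")])])
# A projection with duplicate removal: the rule does not fire.
q2 = Op("project_distinct", [Op("scan")])

decision_1 = apply_rule(q1)
decision_2 = apply_rule(q2)
```

A real rule would also have to pick the table that receives the sampling clause, for example the source relation of a foreign key join.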

Foreign Key Joins
– A two-way join r1 ⋈ r2 is a foreign key join if a join attribute is the foreign key in r1 (this definition extends easily to n-way joins)
– The relation r1 is called the source relation
– A random sample of r1 ⋈ r2 can be produced by joining a random sample of r1 (the source relation) with r2
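
The reason sampling commutes with a foreign key join is that each row of the source relation produces exactly one row of the join. The Python sketch below illustrates this under assumptions: D and F are small hypothetical tables, and the coin flips are fixed per F row up front so that both plans see the same randomness.

```python
import random

# Hypothetical dimension table D (primary key -> attribute) and fact table F
# (row id, foreign key into D). Every F row matches exactly one D row.
D = {1: "red", 2: "green", 3: "blue"}
F = [(row_id, 1 + row_id % 3) for row_id in range(1000)]

# One coin flip per F row, drawn once and shared by both plans.
rng = random.Random(11)
coin = {row_id: rng.random() for row_id, _ in F}

def fk_join(f_rows):
    """Join F rows with D on F.foreignkey = D.primarykey."""
    return [(row_id, fk, D[fk]) for row_id, fk in f_rows]

# Plan 1: join first, then filter the result (the RAND() style).
join_then_sample = [t for t in fk_join(F) if coin[t[0]] < 0.1]

# Plan 2: sample the source relation F first, then join (the TABLESAMPLE style).
sample_then_join = fk_join([(row_id, fk) for row_id, fk in F if coin[row_id] < 0.1])
```

Both plans return the same rows, but the second one joins only the sampled tenth of F.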

FK-Join Query Transformation
Original query:
    select *
    from (select * from F, D where F.foreignkey = D.primarykey)
    where RAND() < 0.1
Rewritten query:
    select *
    from F TABLESAMPLE bernoulli (10), D
    where F.foreignkey = D.primarykey

Sampling Group By Queries Using Materialized Views
Materialized query table (MQT) holding the distinct group combinations:
    create view MQT as (
      select DISTINCT D1.x1, D2.x2
      from F, D1, D2
      where F.y1 = D1.x1 and F.y2 = D2.x2)
Original query:
    select *
    from (select (exp1, exp2, …)
          from F, D1, D2
          where F.y1 = D1.x1 and F.y2 = D2.x2 and pred1, pred2, …
          group by x1, x2)
    where RAND() < 0.1
Rewritten query:
    select (exp1, exp2, …)
    from F, MQT TABLESAMPLE bernoulli (10.0)
    where F.y1 = MQT.x1 and F.y2 = MQT.x2 and pred1, pred2, …
    group by x1, x2
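
The idea behind the MQT rewrite is that the sample is taken over groups rather than over fact rows, so every group that survives is computed from all of its rows. The Python sketch below illustrates this with hypothetical data: F is reduced to (y1, y2, value) triples, SUM stands in for the unspecified aggregate expressions, and the predicates are left out.

```python
import random
from collections import defaultdict

def group_sum(rows):
    """GROUP BY (y1, y2) with SUM(value) as a stand-in aggregate."""
    totals = defaultdict(int)
    for y1, y2, value in rows:
        totals[(y1, y2)] += value
    return dict(totals)

rng = random.Random(3)
# Hypothetical fact table: (y1, y2, value).
F = [(rng.randrange(20), rng.randrange(10), rng.randrange(100)) for _ in range(5000)]

full = group_sum(F)   # unsampled group-by result
mqt = sorted(full)    # distinct (x1, x2) combinations, as the MQT would hold

# Bernoulli(10%) over the MQT rows, i.e. over groups.
sampled_keys = {key for key in mqt if rng.random() < 0.1}

# Join F with the sampled MQT, then group.
sampled = group_sum([row for row in F if (row[0], row[1]) in sampled_keys])
```

Each group in `sampled` carries the same aggregate value as in `full`, which matches filtering the finished group-by result with RAND(). Sampling F directly would instead thin out every group and distort aggregates such as SUM and COUNT.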

Experiments
– Modified 12 queries from TPC-H
  – 62% performance improvement for a 10% sample
  – 76% improvement for a 1% sample
– Defined several MQTs and queries in the TPC-H schema
  – 95% improvement for a 1% sample
