Published by Stephen Morrison. Modified over 8 years ago.
1
Query Sampling in DB2
2
Motivation
– Data volume is growing fast
– Many algorithms do not scale up with data volume
– For exploratory analysis, users often want just approximate answers
3
Solution: Sampling
– Processing less data => huge performance improvement
– Approximate answers often suffice
4
Current Support of Sampling in DB2: The RAND() Function
– RAND() returns a uniform random number between 0 and 1
    SELECT * FROM (original query) WHERE RAND() < 0.01
– Advantage: the user can easily specify the desired sampling fraction
– Disadvantage: RAND() gives no performance benefit, because the filter is applied to the query result after the whole query has been evaluated
5
Current Support of Sampling in DB2: The TABLESAMPLE operator
– A sampling clause can be placed after any SQL table reference
    SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE …
– General form of the sampling clause: TABLESAMPLE samplingMethod(p)
– samplingMethod is one of:
  – BERNOULLI: row-level Bernoulli sampling
  – SYSTEM: page-level (efficient) sampling
– p = inclusion probability for each row (%) = (expected) sampling fraction
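The difference between the two sampling methods can be illustrated with a small simulation. This is only a sketch of the semantics described on the slide, not DB2's implementation; the table layout (1,000 pages of 100 rows) and the function names are assumptions made for the example.

```python
import random

def bernoulli_sample(pages, p, rng):
    """Row-level sampling: each row is kept independently with probability p."""
    return [row for page in pages for row in page if rng.random() < p]

def system_sample(pages, p, rng):
    """Page-level sampling: each page is kept or dropped as a whole with probability p."""
    return [row for page in pages if rng.random() < p for row in page]

# Hypothetical table: 1,000 pages of 100 rows each, rows numbered 0..99999.
ROWS_PER_PAGE = 100
pages = [list(range(i * ROWS_PER_PAGE, (i + 1) * ROWS_PER_PAGE)) for i in range(1000)]

rng = random.Random(42)
row_sample = bernoulli_sample(pages, 0.10, rng)
page_sample = system_sample(pages, 0.10, rng)

# Number of distinct pages that contribute at least one row to each sample.
pages_touched_bernoulli = len({row // ROWS_PER_PAGE for row in row_sample})
pages_touched_system = len({row // ROWS_PER_PAGE for row in page_sample})
```

Both samples contain about 10% of the rows in expectation, but in this toy model the row-level sample draws from nearly every page while the page-level sample draws from only about a tenth of them, which is why SYSTEM is labeled the efficient method. The price is that rows from the same page enter or leave the sample together.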
6
Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.)
– Advantage: pushing sampling to the bottom of the query tree can provide huge performance improvements
– Disadvantage: how should the sampling rate at the base table be extrapolated to the query result? Problem cases:
  – Joins
  – Group by
  – COUNT(DISTINCT)
  – Subqueries
7
Extrapolation Problem
– Let R be a base relation referenced in query Q. Then:
    uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables)
– Example (CMN99): let Q = R ⋈ S, where:
    R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3),…, (a2, bk)}
    S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3),…, (a1, ck)}
– sample(R, rate1) ⋈ sample(S, rate2) cannot generate sample(R ⋈ S, rate3) for any reasonable values of rate1 and rate2
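The CMN99 example can be worked through concretely. In the sketch below, k = 1000 and the hand-picked samples are assumptions chosen to keep the example deterministic; the helper name join_on_a is hypothetical.

```python
def join_on_a(r, s):
    """Natural join of R(A, B) and S(A, C) on attribute A."""
    return [(a, b, c) for (a, b) in r for (a2, c) in s if a == a2]

k = 1000
R = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
S = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]

# Full join: k tuples with A = a1 plus k tuples with A = a2.
full = join_on_a(R, S)

# A low-rate sample of R usually misses its single a1 row, and a low-rate
# sample of S usually misses its single a2 row. These samples are picked by
# hand (every 100th row, skipping the first) to show that typical case.
sample_R = R[1::100]   # contains only a2 rows
sample_S = S[1::100]   # contains only a1 rows

joined_samples = join_on_a(sample_R, sample_S)
```

Here the full join has 2k tuples, yet joining the two samples yields nothing. A join tuple can only survive if the single (a1, b0) row of R or the single (a2, c0) row of S is sampled, so with rate1 = rate2 = 0.01 the joined samples are non-empty with probability at most 0.02.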
8
Solution
– Push sampling to the bottom of the query tree (effectively replacing RAND() with TABLESAMPLE) whenever possible:
  – Selections
  – Projections without duplicate removal
  – Foreign key joins
  – Some types of group by (using materialized views)
9
Rule Implementation
– The “pushdown” is implemented as a condition-action rule
  – The rule fires when all operators in a query commute with sampling
  – It is applied during query rewrite optimization
10
Foreign Key Joins
– A two-way join r1 ⋈ r2 is a foreign key join if a join attribute is a foreign key in r1 referencing r2 (this definition extends easily to n-way joins).
– The relation r1 is called the source relation.
– A random sample of r1 ⋈ r2 can be produced by joining a random sample of r1 (the source relation) with r2.
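Why sampling commutes with a foreign key join can be checked with a short sketch: each source row matches exactly one row of the other relation, so the join result has one tuple per source row, and filtering either side with the same per-row coin flips gives the same answer. The tables F and D and their contents below are hypothetical.

```python
import random

# Hypothetical dimension table D (primary key -> attribute) and fact table F
# of (row id, foreign key into D). Every F row matches exactly one D row.
D = {1: "red", 2: "green", 3: "blue"}
F = [(row_id, 1 + row_id % 3) for row_id in range(1000)]

def fk_join(f_rows, d):
    """Join F with D on F.foreignkey = D.primarykey."""
    return [(row_id, fk, d[fk]) for (row_id, fk) in f_rows]

# One independent coin flip per source row, with inclusion probability 0.10.
rng = random.Random(7)
keep = {row_id: rng.random() < 0.10 for (row_id, _) in F}

# Plan A: join first, then filter the result (the RAND() style).
join_then_sample = [t for t in fk_join(F, D) if keep[t[0]]]

# Plan B: sample the source relation first, then join (the TABLESAMPLE style).
sample_then_join = fk_join([f for f in F if keep[f[0]]], D)
```

The two plans return identical tuples, but Plan B joins only about a tenth of the source rows. The argument depends on the one-to-one correspondence between source rows and join tuples, which is what fails in the general join of the extrapolation example.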
11
FK-Join Query Transformation
– Original query:
    select * from (select * from F, D where F.foreignkey = D.primarykey) where RAND() < 0.1
– Rewritten query:
    select * from F TABLESAMPLE bernoulli (10), D where F.foreignkey = D.primarykey
12
Sampling group by queries using materialized views
– Materialized view (MQT) holding the distinct group keys:
    create view MQT as (select DISTINCT D1.x1, D2.x2 from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2)
– Original query:
    select * from (select (exp1, exp2,…) from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2 and pred1, pred2,… group by x1, x2) where RAND() < 0.1
– Rewritten query:
    select (exp1, exp2,…) from F, MQT TABLESAMPLE bernoulli (10.0) where F.y1=MQT.x1 and F.y2=MQT.x2 and pred1, pred2,… group by x1, x2
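The idea behind the group-by rewrite can be sketched as follows: sampling the distinct group keys (the role the MQT plays) and joining them back to F selects whole groups, so every group that survives keeps its exact aggregate. The fact rows, the SUM aggregate, and the omission of the dimension tables D1 and D2 are simplifying assumptions for this illustration.

```python
import random
from collections import defaultdict

# Hypothetical fact rows (y1, y2, measure); (y1, y2) is the group key.
F = [(i % 5, i % 7, float(i)) for i in range(700)]

def group_sum(rows):
    """SUM(measure) grouped by (y1, y2)."""
    totals = defaultdict(float)
    for y1, y2, measure in rows:
        totals[(y1, y2)] += measure
    return dict(totals)

# Full group-by result, and the distinct group keys standing in for the MQT.
full = group_sum(F)
mqt = sorted(full)

# Bernoulli sample of the group keys with inclusion probability 0.10.
rng = random.Random(3)
sampled_keys = {key for key in mqt if rng.random() < 0.10}

# Join F with the sampled keys, then aggregate.
sampled = group_sum(row for row in F if (row[0], row[1]) in sampled_keys)
```

Each group is included with probability 0.10, matching a RAND() filter applied to the grouped result, while only the fact rows belonging to the sampled groups are aggregated.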
13
Experiments
– Modified 12 queries from TPC-H
  – 62% performance improvement for 10% sample
  – 76% improvement for 1% sample
– Defined several MQTs and queries in the TPC-H schema
  – 95% improvement for 1% sample