Query Sampling in DB2.

Query Sampling in DB2

Motivation Many algorithms do not scale up with data volume
Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want just approximate answers

Solution: Sampling Processing less data => huge performance improvement Approximate answers often suffice Can you tell the difference between the two pie charts? It's pretty difficult, yet the pie chart based on the sample can be computed in about 1/100th of the time.

Current Support of Sampling in DB2:
The Rand() Function RAND() returns a uniform random number between 0 and 1 SELECT * FROM original query WHERE rand() < 0.01 Advantage: User can easily specify the size of sample he wants Disadvantage: The RAND() operator does not provide any optimization as it is applied to the query result Note that we need to scale up the answer for aggregates like SUM and COUNT, but not for AVERAGE

Current Support of Sampling in DB2: The TABLESAMPLE operator
Can place sampling clause after any SQL table reference SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE … General form of sampling clause TABLESAMPLE samplingMethod(p) samplingMethod is one of the two: BERNOULLI: row-level Bernoulli sampling SYSTEM: page-level (efficient) sampling method p = inclusion probability for each row (%) = (expected) sampling fraction

Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.)
Advantage: By pushing sampling to the bottom of a query tree, can provide huge performance improvements Disadvantage: How to extrapolate the sampling rate at the base table to the query result? Joins Group by Count (DISTINCT) Subqueries

Extrapolation Problem
Let R be a base relation referenced in query Q. Then: uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables) Example (CMN99): Let Q = R S, where: R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3),…, (a2, bk)} S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3),…, (a1, ck)} sample(R, rate1) sample(S, rate2) cannot generate sample(R S, rate3) for any reasonable values of rate1 and rate2

Solution Push sampling to the bottom of the query tree – effectively replacing RAND() with TABLESAMPLE - whenever possible Selections Projections without duplicate removal Foreign key joins Some types of group by (using materialized views)

Foreign Key Joins A two-way join r1 r2 is a foreign key join if a join attribute is the foreign key in r1 (this definition can be easily extended to n-way join). The relation r1 is called a source relation. A random sample of r r2 can be produced by joining a random sample of r1 (the source relation) with r2.

FK-Join Query Transformation
select * from ( select * from F, D where F.foreignkey = D.primarykey) where RAND()<0.1 select * from F TABLESAMPLE bernoulli (10), D where F.foreignkey = D.primarykey)

Sampling group by queries using materialized views
select * from (select (exp1, exp2,…) from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2 and pred1, pred2,… group by x1, x2) where RAND()<0.1 create view MQT as ( select DISTINCT D1.x1, D2.x2 from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2) select (exp1, exp2,…) from F, MQT TABLESAMPLE bernoulli (10.0) where F.y1=MQT.x1 and F.y2=MQT.x2 and pred1, pred2,… group by x1, x2

Experiments Modified 12 queries from TPCH
62% performance improvement for 10% sample 76% improvement for 1% sample Defined several MQTs and queries in TPCH schema 95% improvement for 1% sample

Query Sampling in DB2.

Similar presentations

Presentation on theme: "Query Sampling in DB2."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Query Sampling in DB2.

Similar presentations

Presentation on theme: "Query Sampling in DB2."— Presentation transcript:

Similar presentations

About project

Feedback