# Lecture 10: Query Optimization II, Automatic Database Design


Recap: Query Planning

- Use an analytical cost model to estimate the time needed for a query execution plan tree
- Selectivity (fraction of input tuples returned):
  - col = value: 1/ICARD(col), i.e. 1/n where n is the number of unique col values; 1/10 if no index
  - col > value: (max - value) / (max - min), or 1/3 when min and max are not known
  - col1 = col2: 1/max(ICARD(c1), ICARD(c2)), or 1/10 if no index
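The selectivity rules from the recap can be sketched as small Python functions. This is a minimal illustration, not any particular system's estimator: the function names are made up for this example, `None` stands for "no index statistics available", and the range rule uses the fraction of the value range lying above `value`, i.e. (max - value) / (max - min).

```python
from typing import Optional


def sel_eq_const(icard: Optional[int]) -> float:
    """col = value: 1/ICARD when the distinct-value count is known, else 1/10."""
    return 1.0 / icard if icard else 0.1


def sel_gt_const(value: float, lo: Optional[float], hi: Optional[float]) -> float:
    """col > value: share of the [lo, hi] range above value, else the 1/3 default."""
    if lo is None or hi is None or hi <= lo:
        return 1.0 / 3.0
    frac = (hi - value) / (hi - lo)
    # Clamp so that constants outside [lo, hi] still give a valid fraction.
    return min(1.0, max(0.0, frac))


def sel_eq_cols(icard1: Optional[int], icard2: Optional[int]) -> float:
    """col1 = col2: 1/max(ICARD) when both counts are known, else 1/10.

    Simplification for this sketch: if only one count is known we still
    fall back to the 1/10 default.
    """
    if icard1 and icard2:
        return 1.0 / max(icard1, icard2)
    return 0.1


if __name__ == "__main__":
    print(sel_eq_const(50))           # 50 distinct values
    print(sel_gt_const(75, 0, 100))   # quarter of the range lies above 75
    print(sel_eq_cols(10, 40))
```

Multiplying the selectivities of the predicates in a plan by the input cardinalities gives the estimated intermediate result sizes that the cost model works with.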

Selinger Heuristics

- Push down all filters and projections
- Skip cross-joins
- Left-deep plans only
- Get from O(n!) to O(2^n) optimization time

Selinger Optimizer Algorithm

Algorithm: compute the optimal way to generate every sub-join of size 1, size 2, ..., n (in that order), e.g. {A}, {B}, {C}, {AB}, {AC}, {BC}, {ABC}

    R ← set of relations to join
    for i in {1...|R|}:
        for S in {all length-i subsets of R}:
            optjoin(S) = a join (S-a), where a is the relation that minimizes:
                cost(optjoin(S-a))                       (precomputed in previous iteration!)
                + min. cost to join optjoin(S-a) to a
                + min. access cost for a

Selinger, as code

This is the same algorithm as on the previous slide, written differently.

    R ← set of relations to join
    for i in {1...|R|}:
        for S in {all length-i subsets of R}:
            optcost[S] = ∞
            optjoin[S] = ø
            for a in S:    // a is a relation
                c = optcost[S-a] + min. cost to join optjoin[S-a] to a + min. access cost for a
                if c < optcost[S]:
                    optcost[S] = c
                    optjoin[S] = optjoin[S-a] joined optimally w/ a

optcost[S-a] and optjoin[S-a] were pre-computed in the previous iteration!
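The slide's pseudocode translates fairly directly into Python. The sketch below is one possible rendering under assumptions that are not in the slides: the function name `selinger`, the `access_cost` dictionary, the `join_cost(rest, a)` callback, and the toy cost model (cardinalities with no selectivity, nested-loops cost equal to the product of input sizes) are all hypothetical and chosen only to make the example runnable.

```python
from itertools import combinations
from math import inf, prod


def selinger(relations, access_cost, join_cost):
    """Left-deep join ordering by dynamic programming over subsets.

    relations   -- list of relation names
    access_cost -- dict: name -> min. access cost for that relation
    join_cost   -- function(rest: frozenset, a: str) -> min. cost to join
                   the best plan for `rest` with relation `a`
    Returns (plan, cost); a plan is a name or a nested (left_plan, name) pair.
    """
    optcost, optjoin = {}, {}
    # Size-1 subsets: the best plan is the best access path.
    for r in relations:
        optcost[frozenset([r])] = access_cost[r]
        optjoin[frozenset([r])] = r
    # Larger subsets, in increasing size, reuse the smaller results.
    for i in range(2, len(relations) + 1):
        for combo in combinations(relations, i):
            s = frozenset(combo)
            optcost[s] = inf
            for a in combo:  # a is the relation joined last
                rest = s - {a}
                c = optcost[rest] + join_cost(rest, a) + access_cost[a]
                if c < optcost[s]:
                    optcost[s] = c
                    optjoin[s] = (optjoin[rest], a)
    full = frozenset(relations)
    return optjoin[full], optcost[full]


# Toy inputs (hypothetical): cardinalities double as access costs, and a
# join costs |rest| * |a| tuple comparisons with no selectivity applied.
CARD = {"A": 10, "B": 100, "C": 1000}


def toy_join_cost(rest, a):
    return prod(CARD[r] for r in rest) * CARD[a]


if __name__ == "__main__":
    plan, cost = selinger(list(CARD), CARD, toy_join_cost)
    print(plan, cost)
```

Keying the tables by `frozenset` makes the "precomputed in previous iteration" lookups explicit: every `optcost[rest]` read in iteration i was written in iteration i-1.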

Example

4 relations: A, B, C, D (only consider NL join)

Optjoin:

- A = best way to access A (e.g., sequential scan, or predicate pushdown into index...)
- B = best way to access B
- C = best way to access C
- D = best way to access D
- {A,B} = AB (or BA)
- {A,C} = AC (or CA)
- {B,C} = BC (or CB)
- {A,D} …
- {B,D}
- {C,D}

Example (cont'd)

Optjoin:

- {A,B,C} =
  - remove A: ({B,C})A
  - remove B: ({A,C})B
  - remove C: ({A,B})C
- {A,C,D} = …
- {A,B,D} = …
- {B,C,D} = …
- …
- {A,B,C,D} =
  - remove A: ({B,C,D})A
  - remove B: ({A,C,D})B
  - remove C: ({A,B,D})C
  - remove D: ({A,B,C})D

Complexity

- Number of subsets of a set of size n = |power set of n| = 2^n (here, n is the number of relations)
- How much work per subset? We have to iterate through each element of each subset, so this is at most n
- n·2^n complexity (vs n!)
- n = 12 → 49K vs 479M
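The n = 12 figures can be checked with a few lines of Python. The helper names `dp_work` and `left_deep_orders` are made up for this sketch; they simply evaluate n·2^n and n!.

```python
from math import factorial


def dp_work(n: int) -> int:
    """Upper bound on dynamic-programming steps: n choices for each of 2^n subsets."""
    return n * 2 ** n


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders when every permutation is enumerated."""
    return factorial(n)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, dp_work(n), left_deep_orders(n))
```

For n = 12 this gives 49,152 against 479,001,600, which matches the rounded 49K vs 479M on the slide.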

Interesting Orders

- Push down sorts when it is profitable
  - Merge joins are usually faster than NLJ
- Another round of dynamic programming
- For k interesting orders, the complexity is k·n·2^n

Study Break: Join Ordering

For the query:

    SELECT * FROM A, B, C, D
    WHERE A.v = B.v AND B.w = C.w AND C.w = D.w;

- How many left-deep plans are possible?
- How many plans or subsets of plans do we evaluate using the optimization algorithm?
- Which one(s) can we eliminate as cross products?
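One way to explore the cross-product question is to enumerate the left-deep orders and flag those in which some step joins a relation that shares no predicate with anything already joined. The sketch below makes an assumption worth stating: it uses only the three predicates exactly as written in the query (A-B, B-C, C-D). An optimizer that also infers B.w = D.w by transitivity would treat B-D as joinable too and would therefore flag fewer orders. The names `PREDICATES` and `has_cross_product` are hypothetical.

```python
from itertools import permutations

RELATIONS = ("A", "B", "C", "D")

# Join predicates as written in the query; each one links a pair of relations.
PREDICATES = {frozenset("AB"), frozenset("BC"), frozenset("CD")}


def has_cross_product(order, predicates=PREDICATES):
    """True if some step of the left-deep order joins a relation that has
    no predicate connecting it to any relation already in the plan."""
    for k in range(1, len(order)):
        joined, nxt = order[:k], order[k]
        if not any(frozenset((r, nxt)) in predicates for r in joined):
            return True
    return False


all_orders = list(permutations(RELATIONS))
no_cross = [o for o in all_orders if not has_cross_product(o)]

if __name__ == "__main__":
    print(len(all_orders), "left-deep orders in total")
    print(len(no_cross), "orders without a cross product:")
    for o in no_cross:
        print("  ", "".join(o))
```

The same check applied per subset, rather than per full order, is what lets the dynamic program skip entries such as {A,C} before they are ever extended.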

Automatic DB Design

- Key idea: optimize data layout for performance
- Make a well-known set of queries execute fast
- Use cost models to estimate the utility of different designs

Materialized Views

sales: (saleid, date, time, register, product, price, ...)

    CREATE MATERIALIZED VIEW sales_by_date AS
    SELECT date, product, sum(price), count(*) AS quantity
    FROM sales
    GROUP BY date, product

Key properties:

- Kept up to date as data is added
- Selected for use automatically by the optimizer when appropriate
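To make "kept up to date as data is added" concrete, here is a toy in-memory model of incremental maintenance for the `sales_by_date` aggregate. It is only an illustration of the idea, not how any particular DBMS implements materialized views; the class `SalesByDateView` and its methods are invented for this sketch, and only the columns used by the view are modeled.

```python
from collections import defaultdict


class SalesByDateView:
    """Toy stand-in for the sales_by_date materialized view.

    Keeps sum(price) and count(*) per (date, product) and updates them
    on every insert instead of rescanning the base table.
    """

    def __init__(self):
        self.base_rows = []                         # the underlying sales table
        self.groups = defaultdict(lambda: [0.0, 0])  # (date, product) -> [sum, count]

    def insert_sale(self, date, product, price):
        """Insert into the base table and apply the delta to the view."""
        self.base_rows.append((date, product, price))
        entry = self.groups[(date, product)]
        entry[0] += price
        entry[1] += 1

    def lookup(self, date, product):
        """Answer the aggregate from the view without touching base_rows."""
        total, quantity = self.groups.get((date, product), (0.0, 0))
        return total, quantity


if __name__ == "__main__":
    view = SalesByDateView()
    view.insert_sale("2024-01-01", "coffee", 2.5)
    view.insert_sale("2024-01-01", "coffee", 2.5)
    view.insert_sale("2024-01-01", "tea", 2.0)
    print(view.lookup("2024-01-01", "coffee"))
```

The trade-off this models is the one a design tool has to weigh: each insert does a little extra work so that queries grouped by date and product can skip the scan of `sales`.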

Conclusions

- Use dynamic programming to efficiently enumerate the costs of different query plans
  - Start with one table and add more
- Physical DB design is complicated!
  - Picking the right indexes and materialized views
  - Combinations of heuristics and what-if cost modeling are needed
  - Designs may be adaptive to changing workloads