CS 440 Database Management Systems

Name: CS 440 Database Management Systems
Uploaded: 2017-12-19T14:15:03+00:00
Duration: PTM21S14
Channel: Franklin Lang
Description: CS 440 Database Management Systems

CS 440 Database Management Systems
Lecture 7: Query Optimization

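DBMS Architecture
[Figure: DBMS architecture diagram. Users, web forms, applications, and the DBA submit queries and transactions. Query side: Query Parser, Query Rewriter, Query Optimizer, Query Executor. Transaction side: Transaction Manager, Logging & Recovery, Lock Manager. Below them: Files & Access Methods, Buffer Manager, and Storage Manager, with Buffers and Lock Tables in Main Memory, and Storage at the bottom. The diagram marks which components belong to today’s lecture and which to past lectures.]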

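Many query plans to execute a SQL query
- Compute the join of R(A,B), S(B,C), T(C,D), U(D,E).
- [Figure: two alternative join trees over R, S, T, and U with different join orders.]
- Even more plans: multiple algorithms to execute each operation.
- [Figure: a join tree over S, R, T, and U annotated with physical operators: a hash join, sort-merge joins, table scans, and index scans.]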

Query optimization: picking the fastest plan
- Optimal approach:
  - enumerate each possible plan
  - measure its performance by running it
  - pick the fastest one
  - What’s wrong?
- Rule-based optimization:
  - Use a set of predefined rules to generate a fast plan.
  - e.g., if there is an index over a table, use it for scan and join.

Definitions
Statistics on table R:
- T(R): number of tuples in R
- B(R): number of blocks in R
  - B(R) = T(R) / block size, with the block size measured in tuples per block
- V(R,A): number of distinct values of attribute A in R

Review: Clustered index
- The relation is stored on disk in the order of the index.
- [Figure: an INDEX column of key entries pointing into a DATA file whose blocks hold the tuples sorted on the same key (10, 20, 30, ..., 80).]

Plans to select tuples from R: σ_{A=a}(R)
- We have a clustered index on R.
- Plans:
  - (clustered) index-based scan
  - table scan (sequential access)
- Statistics on R:
  - B(R) = 5000, T(R) = 200,000
  - V(R,A) = 2; one value appears in 95% of the tuples.
- Clustered index scan vs. table scan? Table scan is the winner!
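The block counts behind this comparison can be worked out from the statistics on the slide. The sketch below is only an illustration: it counts data blocks and ignores the index blocks an index scan must also read, which the slide does not quantify.

```python
# Statistics from the slide.
B_R = 5000        # blocks in R
T_R = 200_000     # tuples in R
V_R_A = 2         # distinct values of A

tuples_per_block = T_R // B_R

# Uniform assumption: each value matches T(R) / V(R,A) tuples,
# so a clustered index scan would read B(R) / V(R,A) data blocks.
uniform_blocks = B_R // V_R_A

# Actual skew: the frequent value matches 95% of the tuples.
matching_tuples = T_R * 95 // 100
skewed_blocks = matching_tuples // tuples_per_block

# A table scan always reads every block once, sequentially.
table_scan_blocks = B_R

print(uniform_blocks, skewed_blocks, table_scan_blocks)
```

For the frequent value, the index scan touches almost as many data blocks as the table scan before any index blocks are counted, which is one plausible reading of why the table scan wins here.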

Query optimization methods
- Rule-based optimizer fails:
  - It uses static rules.
  - The rules do not consider the distribution of the data.
- Cost-based optimization:
  - predict the cost of each plan
  - search the plan space to find the fastest one
  - do it efficiently: optimization itself should be fast!

Cost-based optimization
- Plan space:
  - which plans to consider?
  - it is time-consuming to explore all alternatives
- Cost estimator:
  - how to estimate the cost of each plan without executing it?
  - we would like to have accurate estimates
- Search algorithm:
  - how to search the plan space fast?
  - we would like to avoid checking inefficient plans

Space of query plans
- Selection:
  - algorithms: sequential, index-based
  - ordering: why does it matter?
- Join:
  - algorithms: nested loop, sort-merge, hash
  - ordering
- Ordering/Grouping:
  - can an “interesting order” be produced by join/selection?
  - algorithms: sorting, hash-based

Reducing plan space
- Multiple logical query plans for each SQL query
- Schema: Star(sname, birthdate), StarsIn(movie, name, year)

    SELECT movie, sname
    FROM Star, StarsIn
    WHERE Star.sname = StarsIn.name AND year = 1950

- Plan 1: π_{movie,sname}(σ_{year=1950}(StarsIn ⋈_{StarsIn.name = Star.sname} Star))
- Plan 2: π_{movie,sname}(σ_{year=1950}(StarsIn) ⋈_{StarsIn.name = Star.sname} Star), generally faster
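A small sketch of why applying the selection before the join tends to be cheaper: both plans return the same answer, but the join in the second plan handles fewer tuples. The tuples below are made up for illustration and do not come from the lecture.

```python
# Hypothetical toy data; the tuples are invented for this example.
star = [("Alice", 1920), ("Bob", 1930), ("Carol", 1925)]      # (sname, birthdate)
stars_in = [("M1", "Alice", 1950), ("M2", "Bob", 1960),
            ("M3", "Carol", 1950), ("M4", "Alice", 1970)]      # (movie, name, year)

def join(stars_in_rows, star_rows):
    """Nested-loop join on StarsIn.name = Star.sname."""
    return [(m, n, y, s, b)
            for (m, n, y) in stars_in_rows
            for (s, b) in star_rows if n == s]

# Plan 1: join first, then select year = 1950, then project.
joined = join(stars_in, star)
plan1 = sorted((m, s) for (m, n, y, s, b) in joined if y == 1950)

# Plan 2: select year = 1950 first, then join, then project.
filtered = [row for row in stars_in if row[2] == 1950]
joined_small = join(filtered, star)
plan2 = sorted((m, s) for (m, n, y, s, b) in joined_small)

print(plan1 == plan2)                  # both plans give the same answer
print(len(joined), len(joined_small))  # sizes of the intermediate join results
```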

Reducing plan space
- Push selection down to reduce the number of rows.
- Push projection down to reduce the number of columns.

    SELECT movie, sname
    FROM Star, StarsIn
    WHERE Star.sname = StarsIn.name

- Plan 1: π_{movie,sname}(StarsIn ⋈_{StarsIn.name = Star.sname} Star)
- Plan 2: π_{movie,sname}(π_{movie,name}(StarsIn) ⋈_{StarsIn.name = Star.sname} π_{sname}(Star))
- Pushing projection down is less effective than pushing selection down.

Reducing plan space
- Searching the full plan space requires exponential computation!
- System-R style considers only left-deep joins.
- [Figure: example join trees over R, S, T, and U.]
- Left-deep trees allow us to generate fully pipelined plans.
  - Intermediate results are not written to temporary files.
  - Not all left-deep trees are fully pipelined (e.g., sort-merge join).

Reducing plan space
- System-R style avoids plans with Cartesian products.
  - The size of a Cartesian product is generally larger than that of a (natural) join.
- Example: R(A,B), S(B,C), U(C,D)
  - (R ⋈ U) ⋈ S has a Cartesian product
  - pick (R ⋈ S) ⋈ U instead
- If Cartesian products cannot be avoided, delay them.
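The Cartesian-product rule can be sketched as a filter over left-deep join orders. This is a minimal illustration using the schemas from the example, not System R's actual enumeration code; the helper name `avoids_cartesian_product` is hypothetical.

```python
from itertools import permutations

# Schemas from the slide's example.
schemas = {"R": {"A", "B"}, "S": {"B", "C"}, "U": {"C", "D"}}

def avoids_cartesian_product(order):
    """True if every relation joined in shares an attribute with what came before."""
    seen = set(schemas[order[0]])
    for name in order[1:]:
        if not (seen & schemas[name]):
            return False          # no common attribute: this join is a Cartesian product
        seen |= schemas[name]
    return True

# Each permutation is one left-deep join order.
valid = [o for o in permutations(schemas) if avoids_cartesian_product(o)]
for o in valid:
    print(" ⋈ ".join(o))
```

Orders that start with R and U (in either order) are dropped, because R(A,B) and U(C,D) share no attribute.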

Cost estimation
- Relative accuracy:
  - The goal is to compare plans, not to predict exact cost.
  - More of an art than an exact science.
- Each operator: input size, cost, output size
  - estimate cost based on input size
    - Example: the cost of a sort-merge join R ⋈ S is 3 B(R) + 3 B(S)
  - estimate output size (for the next operator), or selectivity
    - selectivity: ratio of output size to input size

Cost estimation: Selinger style
- Input: statistics on each table
  - T(R): number of tuples in R
  - B(R): number of blocks in R
    - B(R) = T(R) / block size
  - V(R,A): number of distinct values of attribute A in R
- Assumptions of attribute and predicate independence
- When no estimate is available, use magic numbers.
- Newer alternative approach: histograms
  - The full data distribution is too much information to keep, so summarize it with a histogram.

Selectivity factors: selection
- Point selection: S = σ_{A=a}(R)
  - T(S) ranges from 0 to T(R) - V(R,A) + 1
  - consider its mean: F = 1 / V(R,A)
- Range selection: S = σ_{A>a}(R)
  - F = (max(A) - a) / (max(A) - min(A))
  - non-arithmetic inequality: use the magic number F = 1/3
- Range selection: S = σ_{b<A<a}(R)
  - F = (a - b) / (max(A) - min(A))
  - if A is not arithmetic, use the magic number F = 1/4
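The selectivity factors for single-predicate selections can be written as small functions. This is a sketch: the function names and the statistics in the usage lines are hypothetical, and only the formulas and magic numbers come from the slide.

```python
def point_selectivity(v_r_a):
    """F for A = a: mean fraction of tuples per distinct value."""
    return 1 / v_r_a

def greater_than_selectivity(a, min_a, max_a):
    """F for A > a on an arithmetic attribute with known min and max."""
    return (max_a - a) / (max_a - min_a)

def between_selectivity(b, a, min_a, max_a):
    """F for b < A < a on an arithmetic attribute with known min and max."""
    return (a - b) / (max_a - min_a)

# Magic numbers from the slide, used when A is not arithmetic.
MAGIC_INEQUALITY = 1 / 3
MAGIC_RANGE = 1 / 4

# Hypothetical statistics: T(R) = 10,000, V(R,A) = 50, A ranges over [0, 100].
t_r = 10_000
print(t_r * point_selectivity(50))                 # estimated tuples with A = a
print(t_r * greater_than_selectivity(75, 0, 100))  # estimated tuples with A > 75
print(t_r * between_selectivity(20, 40, 0, 100))   # estimated tuples with 20 < A < 40
```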

- Range selection: column IN (set of values)
  - F: union of point selections

- S = σ_{A=1 AND B>10}(R)
  - multiply: 1/V(R,A) for the equality and 1/3 for the inequality
  - T(R) = 10,000, V(R,A) = 50
  - T(S) = 10,000 / (50 * 3) ≈ 66
- S = σ_{A=1 OR B>10}(R)
  - sum of the estimates of the predicates minus their product
  - T(S) = 200 + 3333 - 66 = 3467
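The AND/OR arithmetic above can be checked with exact fractions. This is a sketch under the slide's assumptions (predicate independence, magic number 1/3 for the inequality); the variable names are illustrative.

```python
from fractions import Fraction

t_r = 10_000
f_eq = Fraction(1, 50)    # A = 1: 1 / V(R,A)
f_ineq = Fraction(1, 3)   # B > 10: magic number for an inequality

# AND: multiply the factors (predicate independence).
f_and = f_eq * f_ineq
# OR: sum of the factors minus their product.
f_or = f_eq + f_ineq - f_eq * f_ineq

print(float(t_r * f_and))  # about 66.7; the slide reports 66
print(float(t_r * f_or))   # about 3466.7; the slide reports 3467
```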

Selectivity factors: join predicates
- Containment of values assumption:
  - if V(S,A) <= V(R,A), the A values in S are a subset of the A values in R
- Let’s assume V(S,A) <= V(R,A):
  - each tuple t in S joins x tuple(s) in R
  - consider its mean: x = T(R) / V(R,A)
  - T(R ⋈_A S) = T(S) * T(R) / V(R,A)
- In general: T(R ⋈_A S) = T(R) * T(S) / max(V(R,A), V(S,A))
- Typical join: A is a key in R and a foreign key in S.
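A short sketch of the join-size estimate, applied to the typical key / foreign-key case. The statistics are hypothetical; only the formula comes from the slide.

```python
def join_size(t_r, t_s, v_r_a, v_s_a):
    """Estimated T(R ⋈_A S) under the containment-of-values assumption."""
    return t_r * t_s / max(v_r_a, v_s_a)

# Hypothetical key / foreign-key join: A is a key in R, so V(R,A) = T(R).
t_r, t_s = 1_000, 20_000
v_r_a, v_s_a = 1_000, 800

# The estimate equals T(S): each S tuple matches exactly one R tuple.
print(join_size(t_r, t_s, v_r_a, v_s_a))
```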

Search the plan space
- Baseline: exhaustive search
  - enumerate all combinations and compare their costs
  - enormous space!
  - [Figure: example join trees over R, S, T, and U.]
- Search method parameters:
  - plan tree development
    - construction: bottom-up, top-down
    - modification: improve a somehow-connected tree
  - algorithms
    - heuristic selection: make choices based on heuristics
    - hill climbing: find “nearby” plans with lowest cost
    - dynamic programming: construction by greedy selection

Plan search: System-R style
- A.k.a. Selinger-style optimization
- Bottom-up:
  - start from the ground relations (in FROM)
  - work up the tree to form a plan
  - compute the cost of larger plans based on their sub-trees
- Dynamic programming:
  - greedily remove sub-trees that are costly (useless)

Dynamic programming
- Step 1: For each {Ri}:
  - size({Ri}) = B(Ri)
  - plan({Ri}) = Ri
  - cost({Ri}) = cost of access to Ri
    - e.g., B(Ri) if there is no index on Ri
- Step 2: For each {Ri, Rj}:
  - size({Ri,Rj}) = estimate of the size of the join
  - cost({Ri,Rj}) = cost function of the sizes of Ri and Rj
    - the number of I/O accesses of the chosen join algorithm
  - plan({Ri,Rj}) = the join algorithm with the smallest cost

Dynamic programming
- Step i: For each S ⊆ {R1, …, Rn} of cardinality i do:
  - compute size(S)
  - for every S1, S2 s.t. S = S1 ∪ S2:
    - c = cost(S1) + cost(S2) + cost(S1 ⋈ S2)
  - cost(S) = the smallest c
  - plan(S) = the plan for cost(S)
- Return plan({R1, …, Rn})

Dynamic programming: example
- Let’s assume that the cost of each join is the size of its intermediate results.
  - to simplify the example
  - other cost measures, e.g., number of I/O accesses, are possible
- cost(R) = 0 (no intermediate results)
- cost(R ⋈ S) = 0 (no intermediate results)
- cost((R ⋈ S) ⋈ T) = cost(R ⋈ S) + cost(T) + size(R ⋈ S) = size(R ⋈ S)

Dynamic programming: example
- Relations: R, S, T, U
- Number of tuples: 2000, 5000, 3000, 1000
- We use a toy size estimation method: size(A ⋈ B) = 0.01 * T(A) * T(B)
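The toy size estimate can be computed for every subset of relations. The sketch assumes the formula is applied once per join, with the previous join's size standing in for T(A); that reading is an assumption, chosen because it reproduces the 3M and 30M figures used later in the example.

```python
from itertools import combinations

tuples = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}

def size(relations):
    """Toy estimate: apply size(A ⋈ B) = 0.01 * T(A) * T(B) once per join."""
    result = None
    for name in relations:
        # Integer arithmetic: multiplying by 0.01 is dividing by 100.
        result = tuples[name] if result is None else result * tuples[name] // 100
    return result

for k in (2, 3, 4):
    for combo in combinations("RSTU", k):
        print("".join(combo), size(combo))
```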

The table is filled in bottom-up: first the pairs, then the triples, then the full query.

Query   Size    Cost    Plan
RS      100k    0       RS
RT      60k     0       RT
RU      20k     0       UR
ST      150k    0       TS
SU      50k     0       US
TU      30k     0       UT
RST     3M      60k     S(RT)
RSU     1M      20k     S(UR)
RTU     0.6M    20k     T(UR)
STU     1.5M    30k     S(UT)
RSTU    30M     110k    (US)(RT)
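The dynamic-programming steps and the toy example can be sketched together as below. This is an illustration under the example's assumptions (cost of a join = size of the intermediate results feeding it; size = 0.01 * T(A) * T(B) per join), not System R's implementation. It considers every split of a subset, including bushy ones, which is how the example arrives at (US)(RT). The names `optimize`, `size`, and `intermediate` are hypothetical.

```python
from itertools import combinations

tuples = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}

def size(rels):
    """Toy size estimate: 0.01 * T(A) * T(B), applied once per join."""
    result = 1
    for i, name in enumerate(sorted(rels)):
        result = tuples[name] if i == 0 else result * tuples[name] // 100
    return result

def intermediate(rels):
    """A join input counts as an intermediate result only if it is itself a join."""
    return size(rels) if len(rels) > 1 else 0

def optimize(names):
    # best[S] = (cost, plan) for the cheapest plan found for the subset S.
    best = {frozenset([n]): (0, n) for n in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            s = frozenset(combo)
            candidates = []
            # Try every split of S into two non-empty, disjoint parts.
            for j in range(1, k // 2 + 1):
                for left in combinations(combo, j):
                    s1 = frozenset(left)
                    s2 = s - s1
                    cost = (best[s1][0] + best[s2][0]
                            + intermediate(s1) + intermediate(s2))
                    candidates.append((cost, f"({best[s1][1]} {best[s2][1]})"))
            best[s] = min(candidates, key=lambda c: c[0])
    return best

best = optimize("RSTU")
for s in sorted(best, key=lambda x: (len(x), sorted(x))):
    print("".join(sorted(s)), size(s), *best[s])
```

The printed plans may list the join inputs in a different order than the table (for example R before U), since this toy cost model does not distinguish left and right inputs.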

Plan search: all operations
- Base relation access:
  - find all plans for accessing each base relation
  - push down selections and projections
  - choose good plans, discard bad ones
  - keep the cheapest plan for unordered output and for each interesting order
- Join ordering:
  - use the bottom-up dynamic programming
  - consider only left-deep join trees: n! orderings for n tables
  - postpone Cartesian products
- Finally: grouping/ordering
  - use an interesting order
  - additional sorting

Nested subqueries
- Subqueries are optimized separately.
- Correlation: order of evaluation
  - uncorrelated queries: nested subqueries do not reference outer subqueries
    - evaluate the most deeply nested subquery first
  - correlated queries: nested subqueries reference the outer subqueries

    SELECT name
    FROM employee X
    WHERE salary > (SELECT salary
                    FROM employee
                    WHERE employee_num = X.manager)

Nested subqueries – cont.
- correlated queries: nested subqueries reference the outer subqueries

    SELECT name
    FROM employee X
    WHERE salary > (SELECT salary
                    FROM employee
                    WHERE employee_num = X.manager)

- The nested subquery is evaluated once for each tuple in the outer query.
- If there is a small number of distinct values in the outer relation, it is worth sorting the tuples.
  - This reduces the number of evaluations of the nested query.
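The sorting idea can be sketched as follows: once the outer tuples are sorted on the correlated attribute, equal values are adjacent, so the nested query only needs to be re-evaluated when the value changes. The employee tuples and the helper `manager_salary` are hypothetical stand-ins, not part of the lecture.

```python
# Hypothetical employee tuples: (employee_num, name, salary, manager).
employees = [
    (1, "Ann", 90, None),
    (2, "Bo", 95, 1),
    (3, "Cy", 70, 1),
    (4, "Di", 99, 2),
    (5, "Ed", 60, 2),
]

evaluations = 0

def manager_salary(manager_num):
    """Stand-in for the nested subquery: the salary of employee manager_num."""
    global evaluations
    evaluations += 1
    for num, _name, salary, _mgr in employees:
        if num == manager_num:
            return salary
    return None

result = []
last_manager, last_salary = object(), None
# Sort the outer tuples on the correlated attribute so equal values are adjacent.
for num, name, salary, manager in sorted(
        employees, key=lambda e: (e[3] is None, e[3] or 0)):
    if manager != last_manager:      # re-evaluate only when the value changes
        last_manager, last_salary = manager, manager_salary(manager)
    if last_salary is not None and salary > last_salary:
        result.append(name)

print(result, evaluations)
```

With five outer tuples and three distinct manager values (including the missing one), the nested query runs three times instead of five.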

Summary: optimization
- Plan space:
  - huge number of alternatives, all semantically equivalent
- Why important:
  - the difference between good and bad plans can be orders of magnitude
- Ideal goal:
  - map a declarative query to the most efficient plan
- Conventional wisdom: at least avoid bad plans

State of the art
- Academia: always a core database research topic
  - optimizing for interactive querying
  - optimizing for novel parallel frameworks
- Industry: most optimizers use System-R style
  - They started with rule-based optimization.
  - Oracle 7 and its prior versions used rule-based optimization
  - Oracle 7 – 10: rule-based and cost-based
  - Oracle 10g (2003): cost-based

What you should know
- The importance of query optimization:
  - the difference between fast and slow plans
- The query optimization problem:
  - find the fast plans efficiently
- The components of a cost-based (System-R style) query optimizer:
  - plan space definition
  - cost estimation
  - search algorithm
