Published by Avice Armstrong. Modified over 6 years ago.
1
Robust Query Processing through Progressive Optimization
Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic. SIGMOD 2004.
2
Steps in cost-based query optimization
Generate logically equivalent expressions using equivalence rules.
Annotate the resulting expressions to get alternative query plans.
Choose the cheapest plan based on estimated cost.
3
Estimation of plan cost
1. Statistical information about relations, e.g. number of tuples, number of distinct values for an attribute.
2. Statistics estimation for intermediate results, to compute the cost of complex expressions.
3. Cost formulae for algorithms, computed using statistics.
4
Motivation
Cost-based optimization depends heavily on accurate cardinality estimates.
What if there are errors in those estimates?
Errors can occur due to:
  Inaccurate statistics.
  Invalid assumptions (e.g. attribute independence, parameter markers).
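To make the attribute independence point concrete, here is a minimal sketch of how multiplying per-column selectivities can misestimate a cardinality. The toy table, column values, and function names are hypothetical and are not taken from the slides.

```python
# Hypothetical toy table: the two columns are perfectly correlated
# (every Honda is a Civic, every Toyota is a Corolla).
rows = [("Honda", "Civic")] * 50 + [("Toyota", "Corolla")] * 50


def selectivity(pred):
    """Fraction of rows that satisfy a predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)


sel_make = selectivity(lambda r: r[0] == "Honda")
sel_model = selectivity(lambda r: r[1] == "Civic")

# Independence assumption: multiply the single-column selectivities.
estimated = sel_make * sel_model * len(rows)

# True cardinality of the conjunctive predicate.
actual = sum(1 for r in rows if r[0] == "Honda" and r[1] == "Civic")

print(estimated, actual)
```

Here the estimate is half the true cardinality; with more correlated predicates the error compounds multiplicatively.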
5
Progressive Query Optimization
Idea: lazily trigger reoptimization during execution if observed cardinalities indicate that the current plan is suboptimal.
Introduces a checkpoint (CHECK) operator to compare actual vs. estimated cardinality.
Key idea: precompute the cardinality ranges for which the plan is optimal.
6
Example of Progressive Optimization in Action
7
Road Map
Background
Risk-opportunity trade-off
Architecture
Validity range and CHECK operator
Materialization
CHECK variants
Placement of CHECK
Performance analysis
8
Evaluating a re-optimization scheme
Risk vs. opportunity
Risk (performance loss / regression): the extent to which reoptimization is not worthwhile.
  Reoptimization may choose the same or even a worse plan.
  Cardinality errors may cancel each other out, so fixing one may give an even worse plan.
  Work may have to be redone.
Opportunity (performance gain): refers to the aggressiveness of the scheme, i.e. more CHECK operators.
POP seeks to minimise risk and overhead through judicious placement of CHECKs.
9
Background: Redbrick, Kabra & DeWitt 98 (KD98)
[Figure: risk vs. opportunity axis]
Redbrick
  Star schema with a fact table and multiple dimension tables.
  First apply selections on the dimension tables, then decide what plan to use.
Kabra & DeWitt 98 (KD98)
  Introduced the idea of mid-query reoptimization.
  Allows partial results to be used like materialized views.
  But uses an ad hoc cardinality threshold, and only reoptimizes fully materialized plans.
10
Background: Tukwila data integration system, Query Scrambling
[Figure: risk vs. opportunity axis]
Tukwila data integration system
  The optimizer may have no idea of the statistics.
  Interleaves optimization and query execution.
  Partial query plans. Fragment: a fully pipelined tree with doubly pipelined hash joins.
Query Scrambling
  Reorders the query to deal with delayed sources.
11
Background: Eddies (Telegraph)
[Figure: risk vs. opportunity axis]
Eddies (Telegraph)
Ingres/DEC Rdb: run multiple access methods competitively, then choose.
Parametric Query Optimization (PQO), e.g. Cole and Graefe 94, Hulgeri and Sudarshan 02
  Choose from a set of plans, each optimal for a selectivity range.
  POP is the converse: find the optimal cardinality range for a given plan.
12
Example of Progressive Optimization in Action
13
Progressive Query Optimization (POP)
14
Architecture of POP
15
Architecture of POP
CHECK operator to find out whether a plan is suboptimal.
  At optimization time, find the cardinality range [l, u] (at the CHECK location) for which the plan is optimal.
  At run time, ensure the cardinality is within [l, u].
  If violated, stop plan execution and reoptimize.
Location of CHECKs.
Reoptimize taking the observed cardinality into account, and exploiting intermediate results where beneficial.
Heuristic: limit the number of reoptimizations (default: 3).
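The run-time behaviour described above can be sketched as a pass-through operator plus a driver loop. This is an illustrative sketch only: `check`, `ValidityRangeViolated`, `run_with_reoptimization`, and the `make_plan` callback are hypothetical names, and a real engine would reuse intermediate results rather than rerun from scratch.

```python
class ValidityRangeViolated(Exception):
    """Raised when the observed row count leaves the validity range [low, high]."""

    def __init__(self, observed, low, high):
        super().__init__(f"observed {observed} rows, valid range [{low}, {high}]")
        self.observed = observed


def check(rows, low, high):
    """Pass rows through unchanged; raise once the count leaves [low, high]."""
    count = 0
    for row in rows:
        count += 1
        if count > high:
            # Upper bound exceeded: 'count' is only a lower bound on the true size.
            raise ValidityRangeViolated(count, low, high)
        yield row
    if count < low:
        # Input exhausted before reaching the lower bound.
        raise ValidityRangeViolated(count, low, high)


def run_with_reoptimization(make_plan, max_reopts=3):
    """Run a plan, re-planning with observed cardinalities on each violation.

    make_plan(known, checks_enabled) is a hypothetical optimizer callback that
    returns an iterable of result rows.
    """
    known = {}
    for attempt in range(max_reopts + 1):
        # After max_reopts reoptimizations, run the final plan without CHECKs.
        plan = make_plan(known, checks_enabled=attempt < max_reopts)
        try:
            # list() only returns if the plan completes, so rows produced by an
            # aborted attempt are discarded rather than returned to the caller.
            return list(plan)
        except ValidityRangeViolated as violation:
            known["observed"] = violation.observed
    raise RuntimeError("plan raised a violation even with CHECKs disabled")
```

The cap of three reoptimizations mirrors the default heuristic on the slide; everything else about the interface is an assumption.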
16
Validity Ranges
Consider a plan edge e that flows rows into operator o, and let P be the subplan rooted at o.
The validity range for e is an upper and lower bound on the number of rows flowing through e, such that if the range is violated at run time, we can guarantee that P is suboptimal.
Defined conservatively.
Ad hoc thresholds (proposed earlier) are a bad idea: e.g. even a 100x error on a very small relation may not make a difference to the optimal plan.
17
Finding Optimality Ranges
18
Finding Optimality Ranges
Plan Popt with root operator oopt is compared with another plan Palt that differs only in its root operator oalt.
19
Finding Optimality Ranges
Need to solve cost(Palt, c) - cost(Popt, c) = 0, where c is the cardinality on edge e.
Cost functions can be complex, non-linear, or non-continuous.
20
Newton-Raphson Iteration
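As a rough illustration of the root-finding step, the sketch below applies Newton-Raphson with a numerical derivative to cost(Palt, c) - cost(Popt, c). The two cost functions are smooth, made-up stand-ins (real optimizer cost functions may be non-continuous, as noted earlier), and `crossover` is a hypothetical helper name.

```python
import math


def cost_opt(c):
    """Hypothetical cost of the chosen plan: a fixed per-row cost."""
    return 2.0 * c


def cost_alt(c):
    """Hypothetical cost of the alternative: start-up cost plus c*log(c) work."""
    return 500.0 + 0.1 * c * math.log2(c + 1.0)


def crossover(c0, tol=1e-6, max_iter=50, h=1e-3):
    """Find c where cost_alt(c) == cost_opt(c), starting from estimate c0.

    Returns None if the iteration does not converge.
    """

    def diff(c):
        return cost_alt(c) - cost_opt(c)

    c = c0
    for _ in range(max_iter):
        f = diff(c)
        if abs(f) < tol:
            return c
        # Central-difference approximation of the derivative of diff at c.
        slope = (diff(c + h) - diff(c - h)) / (2.0 * h)
        if slope == 0.0:
            return None
        c -= f / slope
    return None


upper_bound = crossover(100.0)
print(upper_bound)
```

Starting from an estimated cardinality of 100, the root gives the cardinality above which the alternative becomes cheaper, i.e. a candidate upper end of the validity range under these assumed cost functions.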
21
What does this achieve?
Detects suboptimality of the root operator where Popt and Palt share the same input edges.
The validity range might miss a cross-over point with a plan that uses a different join order (and hence has different input edges).
Two plans are structurally equivalent if they share the same set of edges, where an edge is defined by the set of rows flowing through it during query execution.
This allows different algorithms, and flipping inner and outer.
22
Optimality wrt structurally equivalent plans
Theorem: …. Suppose edges ei1, ei2, …, eik are seen to be “erroneous” wrt cardinality. Then the following statements are equivalent:
1. P is suboptimal with respect to another plan P' that has the same set of edges {e1, e2, …, em}.
2. At least one of Pi1, Pi2, …, Pik is suboptimal given the cardinality errors in those edges in {e1, e2, …, em} that lie under them.
3. At least one of oi1, oi2, …, oik is a suboptimal operator given the cardinality errors in {e1, e2, …, em} that are in its input edges.
24
Conservative detection of suboptimality
Suppose we “detect” suboptimality of (R Join S) Join T wrt the estimated cost of (R Join T) Join S.
During run time, we can never observe the cardinality of R Join T.
We would be making an arbitrary guess as to the correlation of the predicates on the R and T tables.
Best not to infer suboptimality wrt such estimates.
However, reoptimization may still result in a different join order.
25
Exploiting Intermediate Results
All intermediate results are stored as temporary materialized views (MVs), with their cardinalities available to the optimizer.
An intermediate result can be reused if that leads to a better plan, but is not necessarily used, e.g. if a join result is very large and a different join order is preferred.
It must be reused if it has performed side effects.
Reoptimization is done as part of the same transaction.
26
Optional use of MV
27
Variants of CHECK
The variants are applicable in different cases and trade off risk for opportunity:
Lazy checking (LC)
Lazy checking with eager materialization (LCEM)
Eager checking without compensation (ECWC)
Eager checking with buffering (ECB)
Eager checking with deferred compensation (ECDC)
28
Variants of CHECK
29
Lazy Checking
30
Lazy Checking
Add CHECKs above a materialization point (SORT, TEMP, etc.).
No compensation is needed: no results could have been output before reoptimization.
Materialized results can be reused.
Very low overhead.
31
Lazy checking with eager materialization
32
Lazy checking with eager materialization
Can insert a materialization point if one does not exist already.
Risk: overhead of materialization.
Typically done only for the outer input of an indexed nested-loop (INL) join: the cost is low if the outer is small (as estimated by the optimizer), and the INL join is in trouble anyway if the outer is large.
33
Eager Checking
Lazy checking may be too late, e.g. if a very bad join order was chosen, with huge intermediate results.
Idea: check even before the entire result is materialized, and stop early.
Problem: what if some results have already been output? Compensation.
34
Eager Checking Without Compensation
35
Eager Checking Without Compensation
EC without Compensation: the CHECK is pushed down below the materialization point, into the pipeline.
36
Eager Checking with Buffering
37
Eager Checking with Buffering
CHECK and buffer: rows are held in a buffer and output from the buffer only once the bound is certain to hold, e.g. for ranges [0, b) or [b, infinity]; otherwise reoptimize.
“Delayed pipelining”.
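One way to picture eager checking with buffering is a generator that holds rows back until the validity range can no longer be violated. This is a sketch under assumptions: `check_with_buffer` and `RangeViolated` are hypothetical names, and the range is treated as inclusive [low, high] rather than the half-open form on the slide.

```python
import math


class RangeViolated(Exception):
    """Signals that reoptimization is needed; no row has been output yet."""


def check_with_buffer(rows, low, high=math.inf):
    """Buffer rows until the count is certain to stay within [low, high]."""
    buffer = []
    count = 0
    it = iter(rows)
    for row in it:
        count += 1
        if count > high:
            # Upper bound exceeded while everything is still buffered.
            raise RangeViolated(count)
        buffer.append(row)
        if high == math.inf and count >= low:
            # Lower bound reached and there is no upper bound:
            # the range can no longer be violated, so stop buffering.
            break
    else:
        # Input ended while still buffering; only the lower bound can fail now.
        if count < low:
            raise RangeViolated(count)
    # Release the buffered rows, then pipeline whatever input remains.
    yield from buffer
    yield from it
```

With only a lower bound, output is delayed by at most `low` rows; with a finite upper bound, the whole input is buffered before anything is released, which is why the buffer size matters for this variant.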
38
EC with Deferred Compensation
39
EC with Deferred Compensation
Only for SPJ queries.
The identifiers of all rows returned to the user are stored in a table S, which is later used in the new plan in an anti-join with the new result stream.
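The deferred-compensation idea can be sketched with a set standing in for table S. The row format `(row_id, payload)`, the `Reoptimize` exception, and the `execute` function are hypothetical and chosen only for illustration.

```python
class Reoptimize(Exception):
    """Hypothetical signal raised by a CHECK inside the first plan."""


def execute(first_plan, new_plan):
    """Stream results; after a reoptimization, suppress rows already returned.

    Both plans yield (row_id, payload) pairs; row_id identifies a result row.
    """
    returned_ids = set()  # plays the role of the side table S
    try:
        for row_id, payload in first_plan:
            returned_ids.add(row_id)
            yield row_id, payload
        return  # first plan finished without a violation
    except Reoptimize:
        pass
    # Anti-join: keep only rows of the new plan whose id is not in S.
    for row_id, payload in new_plan:
        if row_id not in returned_ids:
            yield row_id, payload


def failing_plan():
    """Example first plan: returns two rows, then a CHECK fires."""
    yield (1, "a")
    yield (2, "b")
    raise Reoptimize()


new_plan = [(3, "c"), (1, "a"), (2, "b"), (4, "d")]
result = list(execute(failing_plan(), new_plan))
print(result)
```

The user sees every row exactly once even though the new plan recomputes the full result, at the price of maintaining S and the extra anti-join.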
40
CHECK Placement
41
CHECK Placement
LCEM and ECB: outer side of nested-loop join.
LC: above materialization points.
ECWC and ECDC: anywhere.
Do not place CHECKs if there is no alternative plan above the CHECK, or for simple queries with low estimated cost.
42
Performance Analysis: Robustness
43
Performance Analysis: Robustness
TPC-H Q10: replace the constant in the selection on lineitem by a parameter marker, so the optimizer does not know the actual selectivity.
5 different optimal plans.
44
Risk Analysis
Analyze LC, LCEM, ECB.
Can be reoptimized more than once.
Conclusion: low overhead/risk.
45
Risk Analysis
46
Risk Analysis
47
Opportunity Analysis
Goal: how often does the opportunity to reoptimize arise?
Introduce LC/LCEM/ECB checkpoints, but turn off reoptimization and run the same plan.
Opportunity region for ECB: dotted line.
48
Opportunity Analysis
49
POP in (in)action
Real-world workload (DMV data and queries).
Complex predicates lead to cardinality estimation errors: substring comparison, LIKE, IN, ..
50
POP in (in)action
51
POP in (in)action
52
POP in (in)action (contd.)
Reoptimization may result in the choice of a worse plan due to:
  Two estimation errors canceling each other out.
  Reuse of intermediate results.
53
Conclusions
POP gives us a robust mechanism for reoptimization through the insertion of CHECK operators (in their various flavors).
Higher opportunity at low risk.