Presentation is loading. Please wait.

Presentation is loading. Please wait.

Access Path Selection in a Relational Database Management System Selinger et al.

Similar presentations


Presentation on theme: "Access Path Selection in a Relational Database Management System Selinger et al."— Presentation transcript:

1 Access Path Selection in a Relational Database Management System Selinger et al.

2 Query Optimization Declarative query language relieves programmer of the burden to choose an access plan. Difference between good and bad access plans can be several orders of magnitude. Problem: How do we find a good plan?

3 How to choose a good plan? Divide into three problems: What is the set of plans we consider? How do we compare (cost) plans? How do we choose a “good” plan from this set?

4 Background: SQL SELECT {DISTINCT} FROM {WHERE } {GROUP BY {HAVING }} {ORDER BY };

5 Processing SQL Statements Four steps: –Parsing –Optimization –Code generation –Execution

6 SQL Query Types Single block queries only –optimize nested sub-queries separately Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries. –Rewrite to remove correlations where possible. (Tricky, more about this later.) SPJ queries only in this paper

7 Problem 1: Plan Space Fixed set of individual access methods Sequential & index (clustered/unclustered) scans NL-join, (sort)-merge join, hash join Sorting & hash-based grouping Plan flows in a non-blocking fashion with get-next iterators

8 Plan Space (Contd.) Assumptions in System R: –Selections of “sargable” predicates are "pushed down" –Projections are "pushed down" –Single query blocks –Only left-deep plan trees (There are n! plans (not factoring in choice of join method)) –Avoid Cartesian products

9 Problem 2: How to Cost a Plan Estimation of input and intermediate cardinalities based on simple statistical models (e.g., uniform distribution assumption, attribute independence) Estimation of costs for each operator based on statistics about input relations –Cost is weighted function between I/O and CPU (no distinction between random and sequential IO)

10 Cost Estimation (Selinger) Maintenance of simple statistics: –# of tuples & pages –# of values per column (only for indexed columns) Assumption of attribute independence When no estimate available, use magic number These estimations are done periodically

11 Cost Estimation (Today) Sampling: so far only concrete results for base relations Histograms: getting better. Common in industry, some interesting new research. Controlling "error propagation"

12 Problem 3: Choosing a Plan Exhaustive search Dynamic Programming (prunes suboptimal parts of the search space): System R Top-down, transformative version of DP: Volcano, Cascades (used in MS SQL Server?) Randomized search algorithms (e.g. Ioannidis & Kang) Techniques from Operations Research Etc.

13 System R Approach Recall: Only left-deep plan trees (n! different plans) Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming

14 Dynamic Programming Approach 1.Find all 1-table plans for each base relation 2.Try all ways of joining i-table plans saved so far with 1-table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order Note: secondary join predicates are just like selections that can’t be pushed down 3.At the end, GROUP BY and ORDER BY –Use plan in interesting order, or add sort to cheapest unordered plan.

15 Evaluation Complexity of dynamic programming: about n2 n-1, intermediate storage: plans “No-cartesian-products” rule can make a big difference for some queries. DP only works up to 10-15 joins Adding parameters to the search space makes things worse (e.g. expensive predicates, distribution, parallelism, etc.)

16 Nested Queries Subqueries optimized separately Uncorrelated vs. correlated subqueries –Uncorrelated subqueries are basically constants to be computed once –Correlated subqueries are like function calls

17 Query Rewrite in IBM DB2 Leung et al.

18 Why Query Rewrite? Problem: –Very complex queries automatically generated through tools with many levels of subqueries –Correlated subqueries –Selinger approach only works one block at a time Main idea: Transform the query into a simpler, equivalent query

19 Query Rewrite is Tricky SELECT P.pno FROM Parts P WHERE P.quantity = (SELECT COUNT(*) FROM Supply S WHERE S.pno = P.pno AND S.shipdate < 1-1-2000) INSERT INTO Temp (pnum, cnt) (SELECT P.pno, COUNT(*) FROM Supply S WHERE S.shipdate < 1-1- 2000 GROUP BY P.pno) SELECT P.pno FROM Parts P, Temp T WHERE P.quantity = T.cnt AND P.pno = T.pno


Download ppt "Access Path Selection in a Relational Database Management System Selinger et al."

Similar presentations


Ads by Google