Presentation on theme: "Overview of Query Evaluation (contd.) Chapter 12 Ramakrishnan and Gehrke (Sections 12.4-12.6)"— Presentation transcript:
Overview of Query Evaluation (contd.) Chapter 12 Ramakrishnan and Gehrke (Sections )
CPSC 404, Laks V.S. Lakshmanan2 What you will learn from this lecture v What is a query plan? v What options exist for choosing query plans? v Example plans and cost estimation? v How are reduction factors estimated and used?
Query Optimization Overview v SQL query (extended) RA operator tree. v several choices (algorithms/strategies) for evaluating each operator. v Query evaluation plan = operator tree above plus commitment to a strategy for each op. Query evaluation plan = operator tree above plus commitment to a strategy for each op. v In general, many many plans. v QO makes search for the “best” plan. –Uses lot of info.: available indexes, constraint selectivity, operand sizes, value distributions, estimates of intermediate results, file structure, to name a few.
CPSC 404, Laks V.S. Lakshmanan4 Query Optimization Overview select blah from blah where … select project join select R S Operator Tree T
CPSC 404, Laks V.S. Lakshmanan5 Query Optimization Overview select blah from blah where … select project join select R S Plan (Tree) Indexed NLJ. T Nested Loops; Table scan Index probe On-the-fly. <==
Overview of Query Optimization v Plan: Tree of RA ops, with choice of algo for each op. – Each operator typically implemented using a `pull’ interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them. v Two main issues: – For a given query, what plans are considered? u Algorithm to search plan space for cheapest (estimated) plan. – How is the cost of a plan estimated? v Ideally: Want to find best plan. Practically: Avoid terrible plans! v We will study the System R approach. (Pioneered by IBM.)
Highlights of System R Optimizer v Impact: – Most widely used currently; works well for < 10 joins. v Cost estimation: Approximate art at best. – Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. – Considers combination of CPU and I/O costs. – Expose to cost & intermediate result size estimation (only) in this lecture. v Plan Space: Too large, must be pruned. – Only the space of left-deep plans is considered. u Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation. – Cartesian products avoided.
Schema for Examples v Assume the following sizes: v Ratings: – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. v Songs: – Each tuple is 50 bytes long, 80 tuples per page, 500 pages. v Output cost ignored usually (pipelining). Songs ( sid : integer, sname : string, genre : string, year : date) Ratings ( uid : integer, sid : integer, time : date, rating : integer)
Some Strategy Options v Scan or index probe (if we manage to push a condition) – selection. v Scan/index scan – projection. [will see later, can use hashing for the DE part, if any!] v Tuple-based nested loops join (TNL) [almost never used], page-based NLJ, “block”-based NLJ (chunk-based), indexed NLJ, sort-merge join, … v A query plan chooses a strategy per op. logical query rewriting may precede plan selection.
Motivating Example (Plan 0) v Cost: *1000 I/Os v By no means the worst plan! v Misses several opportunities: selections could have been `pushed’ earlier, no use is made of any available indexes, etc. v Goal of optimization: To find more efficient plans that compute the same answer. SELECT S.genre FROM Ratings R, Songs S WHERE R.sid=S.sid AND R.uid=50 AND S.year>2000 Songs Ratings sid=sid uid=50 year > 2000 genre Songs Ratings sid=sid Uid=50 year > 2000 genre (Simple Nested Loops) (On-the-fly) RA Tree: Plan:
Nested Loop Join Plan (contd.) v The cost * 1000 for NLJ corresponds to putting Songs (the smaller table) in the outer for loop. v Which among the above I/Os will be sequential and which ones random? v Why? v When does the randomness of I/O make a big impact?
Alternative Plans 1 (No Indexes) v Main difference: push selects. v With 5 buffer pages, cost of plan: – Scan Ratings (1000) + write temp T1 (10 pages, if we have 100 users, uniform distribution). – Scan Songs (500) + write temp T2 (250 pages, if we have data for years ). – Join: u Sort T1 (2*2*10) [2 phases and 1 iteration in phase 2; just read and write in each.] u sort T2 (2*4*250) [2 phases and 3 iterations in phase 2 (why?).] u merge (10+250) [1 scan of each sorted table.] – output cost ignored usually. – Total: 4060 page I/Os. (2300 I/Os for Join for Select.) Ratings Songs sid=sid uid=50 genre (On-the-fly) year >’00 (Scan; write to temp T1) (Scan; write to temp T2) (Sort-Merge Join)
No Indexes (contd.) – The cost would be_____if nested loop join was used. – The cost would be_____if we materialized T2 and pipelined T1 and used NLJ. – And it’d be_____if T1 was materialized and T2 pipelined.
Alternative Plan 1 (contd.) v Block-based NLJ: group blocks/pages of one table into chunks of k pages and read the other table in page by page for every chunk of first table. –Note: the “block” in “block-based” refers to chunk. v If we used block-based nested loop (BNL ) join, (e.g., scan T2 for every 3-page chunk of T1), join cost = 10+4*250, total cost = = (Why 4 times?) v If we `push’ projections, T1 has only sid, T2 only sid and genre: –Assume fields of equal size (Ratings). Sizes: 10, 10, 15, 15 bytes (Songs). – T1 fits in ceil(10/4) = 3 pages, while T2 fits in 250/2 = 125 pages; cost of BNL drops to under 250 I/Os, total < – The cost will actually be ____.
Alternative Plans 2 With Indexes v Clustered index on uid (Ratings); uncl. on sid (Songs). v With clustered index on uid of Ratings, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages. v Indexed NL join with pipelining (outer is not materialized). v Decision not to push year>2000 before the join is based on availability of sid index on Songs; what if we pushed it? v Cost: Selection of Ratings tuples (10 I/Os); for each, must get matching Songs tuple (1000*1.2); total 1210 I/Os. (1.2 = avg. I/O for hash index.) v Join column sid is a key for Songs. –At most one matching tuple, unclustered index on sid OK. –projecting out unnecessary fields from outer doesn’t help. (why?) Ratings Songs sid=sid Uid=50 genre (On-the-fly) year > 2000 (Use hash index; do not write result to temp) (Index Nested Loops, with pipelining ) (On-the-fly) Hash index
Alternative Plans 2 With Indexes vTvTotal cost = 1210 I/Os when alternative 1 is used. What about alternative 2? vIvIf in the previous example, the index on uid was unclustered, the cost would be ____. vWvWould we still consider using index-based NLJ? vWvWhy (not)? vIvIf we used BNL join, then would pushing projection on outer relation help? Why?
Cost Estimation v For each plan considered, must estimate cost: – Must estimate cost of each operation in plan tree. u Depends on input cardinalities. u We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.) – Must estimate size of result for each operation in tree! u Use information about the input relations. u For selections and joins, assume independence of predicates. v We’ll discuss the System R cost estimation approach. – Inexact, but works ok in practice. – More sophisticated techniques known now.
Statistics and Catalogs v Need information about the relations and indexes involved. Recall: Catalogs typically contain at least: – # tuples (NTuples) and # pages (NPages) for each relation. – # distinct key values (NKeys) and NPages for each index. – Index height, low/high key values (Low/High) for each tree index. v Catalogs updated periodically. – Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. v More detailed information (e.g., histograms of the values in some field) are sometimes stored.
Size Estimation and Reduction Factors v Consider a query block: v Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause. v Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RF’s. – Implicit assumption that terms are independent! – Term col=value has RF 1/NKeys(I), given index I on col – Term col1=col2 has RF 1/ MAX (NKeys(I1), NKeys(I2)) – Term col>value has RF (High(I)-value)/(High(I)-Low(I)) SELECT attribute list FROM relation list WHERE term1 AND... AND termk
Summary v Query optimization is an important task in a relational DBMS. v Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). v Two parts to optimizing a query: – Consider a set of alternative plans. u Must prune search space; typically, left-deep plans only. – Must estimate cost of each plan that is considered. u Must estimate size of result and cost for each plan node. u Key issues: Statistics, indexes, operator implementations.