Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.

Slides:



Advertisements
Similar presentations
Examples of Physical Query Plan Alternatives
Advertisements

Overview of Query Evaluation (contd.) Chapter 12 Ramakrishnan and Gehrke (Sections )
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Query Optimization Chapters 14.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
1 Overview of Query Evaluation Chapter Objectives  Preliminaries:  Core query processing techniques  Catalog  Access paths to data  Index matching.
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
Query Evaluation. SQL to ERA SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational.
1 Relational Query Optimization Module 5, Lecture 2.
Query processing and optimization. Advanced DatabasesQuery processing and optimization2 Definitions Query processing –translation of query into low-level.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
SPRING 2004CENG 3521 Join Algorithms Chapter 14. SPRING 2004CENG 3522 Schema for Examples Similar to old schema; rname added for variations. Reserves:
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Query Evaluation Chapter 12.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Query Optimization Chapter 15.
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
Query Optimization Overview Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December 2, 2004 Some slide content derived.
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
Query Optimization, part 2 CS634 Lecture 13, Mar Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Overview of Implementing Relational Operators and Query Evaluation
Introduction to Database Systems1 Relational Query Optimization Query Processing: Topic 2.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 12: Overview.
COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization.
Query Optimization. overview Histograms A histogram is a data structure maintained by a DBMS to approximate a data distribution Equiwidth vs equidepth.
1 Overview of Query Evaluation Chapter Overview of Query Evaluation  Plan : Tree of R.A. ops, with choice of alg for each op.  Each operator typically.
Database systems/COMP4910/Melikyan1 Relational Query Optimization How are SQL queries are translated into relational algebra? How does the optimizer estimates.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Copyright © Curt Hill Query Evaluation Translating a query into action.
12.1Database System Concepts - 6 th Edition Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Join Operation Sorting 、 Other.
1 Database Systems ( 資料庫系統 ) December 3, 2008 Lecture #10.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Implementing Relational Operators and Query Evaluation Chapter 12.
Chapter 12 Query Processing. Query Processing n Selection Operation n Sorting n Join Operation n Other Operations n Evaluation of Expressions 2.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Introduction to Query Optimization, R. Ramakrishnan and J. Gehrke 1 Introduction to Query Optimization Chapter 13.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Introduction to Query Optimization Chapter 13.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Database Applications (15-415) DBMS Internals- Part IX Lecture 20, March 31, 2016 Mohammad Hammoud.
Database Management System
Running Example – Airline
CS222P: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Chapter 12: Query Processing
Overview of Query Optimization
Chapter 15 QUERY EXECUTION.
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Introduction to Database Systems
Examples of Physical Query Plan Alternatives
Relational Operations
Database Applications (15-415) DBMS Internals- Part IX Lecture 21, April 1, 2018 Mohammad Hammoud.
Chapter 12 Query Processing (1)
Overview of Query Evaluation
Implementation of Relational Operations
Evaluation of Relational Operations: Other Techniques
CS222: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Evaluation of Relational Operations: Other Techniques
Presentation transcript:

Query Evaluation

An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day DATE, descCode CHAR(10)) SELECT ename FROM Employees NATURAL JOIN Maintenances WHERE planeId = 100 AND rating > 5;  ename (  planeId=100 AND rating>5 (Employees Maintenances))  ename  planeId=100 AND rating>5 Maintenances Employees RA expressions can are represented by an expression tree. An algorithm is chosen for each node in the expression tree.

Query Optimization SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational operators, with labels identifying the algorithm to use at each node. –These expression trees can be transformed to "better" trees. –Algorithms for individual operators can be combined in many ways to evaluate a query. Indexes are very important. The process of finding a good evaluation plan is called query optimization.

Clustering/Non-Clustering Indexes Clustering index: tuples (of the relation) with same search key are stored together as controlled by the index. –Same as "Primary" Non-Clustering index: tuples (of the relation) with same search key are stored randomly, not controlled by the index. –Same as "Secondary"

Running Example – Airline Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day DATE, descCode CHAR(10)) Assume that –each tuple of Maintenances is 40 bytes long –a block can hold 100 Maintenances tuples (4K block) –we have 1000 blocks of such tuples. Assume that –each tuple of Employees is 50 bytes long, –a block can hold 80 Employees tuples –we have 500 blocks of such tuples.

Algorithms for selection  R.attr = value (R) if no index on R.attr, then just scan R. –How many disk accesses if R is the Maintenances relation? –On average 1000/2 block (pages) reads. If there is an index we have to typically do 3 disk accesses –This is, assuming a non-clustering B-Tree with 3 levels, with the root in main memory.  R.attr < value (R) Even when there is a non-clustering index we might better scan the relation ignoring the index. Why? Of course, if we have a clustering index, we use it.

Algorithms for projections Given a projection we have to scan the relation and drop certain fields of each tuple. –That’s easy. However, if we need to do a set projection (as opposed to bag projection) specified with the DISTINCT keyword in SQL, we need to remove duplicates. –This is more expensive. –Usually done by sorting, in order to co-locate the duplicates and then remove them. –Can be combined with the final pass of sorting.

Algorithms for joins Joins are expensive operations and very common. Consider the natural join of Maintenances and Employees. Index nested loops join Suppose Employees has an index (B-Tree) on the SIN column. We can scan Maintenances and, for each tuple, use the index to probe Employees for matching tuples. Analysis: –Takes about 3 I/Os on average to retrieve the appropriate leaf of the index. –For each of the 100,000 maintenance records we try to access the corresponding employee with 3 I/Os. –So, 100,000*3 = 300,000 I/Os on average!!

Algorithms for joins (sort-merge) Sort-merge Sort both tables on the join column, and then scan them to find matches. Analysis: –Sort Maintenances in two passes, and Employees in two passes –Cost for sort is 2 * 2 * 1000 = 4000 I/Os for Maintenances and 2 * 2 * 500 = 2000 I/Os. for Employees –Then we merge. This requires an additional scan of both tables. –Thus the total cost is = 7500 I/Os. (Much better!!)

Algorithms for joins (sort-merge) So, we have: index nested loops join: 300,000 I/Os sort-merge join : 7,500 I/Os. Why bother with index nested loops join? Well, “index nested loops” method has the nice property that it is incremental. –The cost of our example join is incremental in the number of Maintenances tuples that we process. –Therefore, if some additional selection in the query allows us to consider only a small subset of Maintenances tuples, we can avoid computing the full join of Maintenances and Employees. Suppose we only want the result of the join for the plane 100, and there are very few such maintenances. –For each such Maintenances tuple, we probe Employees, and we are done. Sort-merge join, on the other hand, will scan the entire Maintenances table at least once, –The cost of this step alone is likely to be much higher than the entire cost of index nested loops join.

Query Optimization Observe that the choice of index nested loops join is based on considering the query as a whole, including the extra selection on Maintenances, rather than just the join operation by itself. This leads us to the next topic, query optimization, which is the process of finding a good plan for an entire query. Query optimization is one of the most important tasks of a relational DBMS. The optimizer generates alternative plans and chooses the plan with the least estimated cost.

Query evaluations plans Recall: A query evaluation plan consists of an extended relational algebra tree, with additional annotations at each node indicating the implementation method to use for each relational operator.  ename (on the fly)  planeId=100 AND rating>5 (on the fly) (file scan) Employees Maintenances (file scan) (nested loops) Method to use Sometimes it might be possible, to pipeline the result of one operator to another operator without creating a temporary table for the intermediate result. This saves in cost. When the input to a unary operator (e.g.  or  ) is pipelined into it, we say the operator is applied on-the- fly.

Alternative query evaluation plans Let’s look at two (naïve plans)  ename (on the fly)  planeId=100 AND rating>5 (on the fly) Maintenances (file scan) Employees (file scan) (nested loops join)  ename (on the fly)  planeId=100 AND rating>5 (on the fly) Maintenances (file scan) Employees (file scan) (sort merge join) Cost for this plan: 300,000 I/Os for the join.  and  are done in the fly; no I/O cost for them. Cost for this plan: 7,500 I/Os for the join.  and  are done in the fly; no I/O cost for them. We don’t consider the cost of writing the final result, since it is the same for any plan.

Plan: Pushing selections I Good heuristic for joins is to reduce the sizes of the tables to be joined as much as possible. One approach is to apply selections early; i.e. 'push' the selection ahead of the join.  ename (on the fly)  planeId=100 Maintenances (file scan) Employees (file scan)  rating>5 (sort-merge join) (scan write to temp T2) (scan write to temp T1)

Plan: Pushing selections II Cost of applying planeId=100 to Maintenances is: –the cost of scanning Maintenances (1000 blocks) plus –the cost of writing the result to a temporary table Tl. To estimate the size of T1, we reason as follows: –if we know that there are 100 planes, we can assume that maintenances are spread out uniformly across all planes and estimate the number of blocks in T1 to be 1000/100 = 10. i.e = 1010 I/Os.  ename (on the fly)  planeId=100 Maintenances (file scan) Employees (file scan)  rating>5 (sort-merge join) (scan write to temp T2) (scan write to temp T1)

Plan: Pushing selections III The cost of applying rating> 5 to Employees is: –the cost of scanning Employees (500 blocks) plus –the cost of writing out the result to a temporary table, say, T2. If we assume that ratings are uniformly distributed over the range 1 to 10, we can approximately estimate the size of T2 as 250 blocks. I.e = 750 I/Os  ename (on the fly)  planeId=100 Maintenances (file scan) Employees (file scan)  rating>5 (sort-merge join) (scan write to temp T2) (scan write to temp T1)

Plan: Pushing selections IV To do a sort-merge join of T1 and T2, they are first completely sorted and then merged. Assume we do that in two passes for each one. Thus, –the cost of sorting T1 is 2 * 2 * 10 = 40 I/Os. –the cost of sorting T2 is 2 * 2 * 250 = 1000 I/Os. To merge (for the join) the sorted versions of T1 and T2, we need to scan these tables: –the cost of this step is =260 I/Os. The final projection is done on-the- fly, and by convention we ignore the cost of writing the final result. The total cost of the plan shown is: –( )+( ) = 3060 I/Os  ename (on the fly)  planeId=100 Maintenances (file scan) Employees (file scan)  rating>5 (sort-merge join) (scan write to temp T2) (scan write to temp T1)

Pushing projections A further refinement is to push projections, just like we pushed selections past the join. Observe that only the SIN attribute of T1 and the SIN and ename attributes of T2 are really required. When we scan Maintenances and Employees to do the selections, we could also eliminate unwanted columns. –This on-the-fly projection reduces the sizes of the temporary tables T1 and T2. –The reduction in the size of T1 is substantial because only an integer field is retained. –In fact, T1 now fits within three blocks (and be all can be in MM), and we can perform a block nested loops join with a single scan of T2. –The cost of the join step drops to under 250 page I/Os, and the total cost of the plan drops to about 2000 I/Os.

Plan using Indexes I Suppose that we have a clustering B-Tree index on the planeId field of Maintenances and another B-Tree index on the SIN field of Employees. We can use the plan in the figure.  ename (on the fly)  planeId=100 Maintenances (clustering index on planeId) Employees (index on SIN)  rating>5 (on the fly) (Index nested loops with pipelining ) (Use clustering index, don’t write results to temp)

Plan using Indexes II Assume that there are 100 planes and assume that the maintenances are spread out uniformly across all planes. –We can estimate the number of selected tuples to be 100,000/100=1000. –Since the index on planeId is clustering, these 1000 tuples appear consecutively and therefore, the cost is: –10 blocks I/Os + 2 I/Os (the 2 I/Os are for finding the first block via the index.  ename (on the fly)  planeId=100 Maintenances (clustering index on planeId) Employees (index on SIN)  rating>5 (on the fly) (Index nested loops with pipelining ) (Use clustered index, don’t write results to temp)

Plan using Indexes III For each selected tuple, we retrieve matching Employees tuples using the index on the SIN field; The join field SIN is a key for Employees. Therefore, at most one Employees tuple matches a given Maintenances tuple. –The cost of retrieving this matching tuple is 3 I/Os. –So, for the 1000 Maintenances tuples we get 3000 I/O’s. For each tuple in the result of the join, we perform the selection rating>5 and the projection of ename on-the-fly. So, total: = 3012 I/Os.  ename (on the fly)  planeId=100 Maintenances (clustering index on planeId) Employees (index on SIN)  rating>5 (on the fly) (Index nested loops with pipelining ) (Use clustered index, don’t write results to temp)

Plan using Indexes IV Why didn’t we push the selection rating>5 ahead of the join? Had we performed the selection before the join, the selection would involve scanning Employees (since no index is available on the rating field of Employees). Also, once we apply such a selection, we have no index on the SIN field of the result of the selection. Thus, pushing selections ahead of joins is a good heuristic, but not always the best strategy. –Typically, the existence of useful indexes is the reason a selection is not pushed.  ename (on the fly)  planeId=100 Maintenances (clustering index on planeId) Employees (index on SIN)  rating>5 (on the fly) (Index nested loops with pipelining ) (Use clustered index, don’t write results to temp)

Plan using Indexes V What about this different plan? The estimated size of T would be the number of blocks for the 1,000 Maintenances tuples that have planeId=100. –i.e. T is 10 blocks. The cost of  planeId=100 as before is 12 I/Os to retrieve plus 10 additional I/Os I/Os to write T. Then, sort T on the SIN attribute. 2passes = 40 I/Os. Employees is already sorted since it has a clustering index on SIN. The merge phase of the join needs a scan of both T and Employees. So, =510 I/Os. Total: (12+10) = 572 I/Os  ename (on the fly)  planeId=100 Maintenances (clustering index on planeId) Employees (clust. index on SIN)  rating>5 (on the fly) (sort-merge join)(Use clustering index, write results to temp T) This improved plan also demonstrates that pipelining is not always the best strategy.