DBMS 2001Notes 5: Query Processing1 Principles of Database Management Systems 5: Query Processing Pekka Kilpeläinen (partially based on Stanford CS245.

Slides:



Advertisements
Similar presentations
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Advertisements

1 Lecture 23: Query Execution Friday, March 4, 2005.
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Dr. Kalpakis CMSC 661, Principles of Database Systems Query Execution [15]
Completing the Physical-Query-Plan. Query compiler so far Parsed the query. Converted it to an initial logical query plan. Improved that logical query.
Notions of clustering Clustered relation: tuples are stored in blocks mostly devoted to that relation. Clustering index: tuples (of the relation) with.
Nested-Loop joins “one-and-a-half” pass method, since one relation will be read just once. Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in.
15.3 Nested-Loop Joins By: Saloni Tamotia (215). Introduction to Nested-Loop Joins  Used for relations of any side.  Not necessary that relation fits.
Notions of clustering Clustered file: e.g. store movie tuples together with the corresponding studio tuple. Clustered relation: tuples are stored in blocks.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Lecture 24: Query Execution Monday, November 20, 2000.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
CS 4432query processing1 CS4432: Database Systems II.
Query Execution :Nested-Loop Joins Rohit Deshmukh ID 120 CS-257 Rohit Deshmukh ID 120 CS-257.
Query Compiler: 16.7 Completing the Physical Query-Plan CS257 Spring 2009 Professor Tsau Lin Student: Suntorn Sae-Eung ID: 212.
Query Processing & Optimization
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
Query Execution Chapter 15 Section 15.1 Presented by Khadke, Suvarna CS 257 (Section II) Id
CS 4432query processing - lecture 121 CS4432: Database Systems II Lecture #12 Query Processing Professor Elke A. Rundensteiner.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 242 Database Systems II Query Execution.
CPS216: Advanced Database Systems Notes 06:Query Execution (Sort and Join operators) Shivnath Babu.
CSCE Database Systems Chapter 15: Query Execution 1.
Advanced Database Systems Notes:Query Processing (Overview) Shivnath Babu.
Query Execution Optimizing Performance. Resolving an SQL query Since our SQL queries are very high level, the query processor must do a lot of additional.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
CPS216: Advanced Database Systems Notes 02:Query Processing (Overview) Shivnath Babu.
SCUHolliday - COEN 17814–1 Schedule Today: u Query Processing overview.
CPS216: Data-Intensive Computing Systems Introduction to Query Processing Shivnath Babu.
CS 4432query processing1 CS4432: Database Systems II Lecture #11 Professor Elke A. Rundensteiner.
CPS216: Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.
Chapters 15-16a1 (Slides by Hector Garcia-Molina, Chapters 15 and 16: Query Processing.
CS4432: Database Systems II Query Processing- Part 3 1.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
CS 257 Chapter – 15.9 Summary of Query Execution Database Systems: The Complete Book Krishna Vellanki 124.
16.7 Completing the Physical- Query-Plan By Aniket Mulye CS257 Prof: Dr. T. Y. Lin.
Lecture 24 Query Execution Monday, November 28, 2005.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
CS4432: Database Systems II Query Processing- Part 2.
CPS216: Advanced Database Systems Notes 07:Query Execution (Sort and Join operators) Shivnath Babu.
CSCE Database Systems Chapter 15: Query Execution 1.
Query Processing CS 405G Introduction to Database Systems.
Lecture 17: Query Execution Tuesday, February 28, 2001.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Chapter 12 Query Processing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
CS 540 Database Management Systems
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.
1 Lecture 23: Query Execution Monday, November 26, 2001.
Query Processing COMP3017 Advanced Databases Nicholas Gibbins
CS4432: Database Systems II Query Processing- Part 1 1.
CPS216: Advanced Database Systems Notes 02:Query Processing (Overview) Shivnath Babu.
1 Ullman et al. : Database System Principles Notes 6: Query Processing.
CS 540 Database Management Systems
CS 440 Database Management Systems
Database Management System
Chapter 15 QUERY EXECUTION.
Query Execution Presented by Khadke, Suvarna CS 257
Sidharth Mishra Dr. T.Y. Lin CS 257 Section 1 MH 222 SJSU - Fall 2016
Chapters 15 and 16b: Query Optimization
Lecture 2- Query Processing (continued)
Lecture 24: Query Execution
Query Execution Index Based Algorithms (15.6)
Lecture 23: Query Execution
Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.
Lecture 22: Query Execution
Lecture 22: Query Execution
Lecture 11: B+ Trees and Query Execution
Lecture 22: Friday, November 22, 2002.
Lecture 24: Query Execution
Lecture 20: Query Execution
Presentation transcript:

DBMS 2001Notes 5: Query Processing1 Principles of Database Management Systems 5: Query Processing Pekka Kilpeläinen (partially based on Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom)

DBMS 2001Notes 5: Query Processing2 Query Processing How does the query processor execute queries? Query  Query Plan - SQL- expression of methods to realize the query Focus: Relational System

DBMS 2001Notes 5: Query Processing3 Example SELECT B, D FROM R, S WHERE R.C=S.C AND R.A = "c" AND S.E = 2

DBMS 2001Notes 5: Query Processing4 RABC S CDE a11010x2 b12020y2 c21030z2 d23540x1 e34550y3 AnswerB D 2 x

DBMS 2001Notes 5: Query Processing5 How to execute query? - Do Cartesian product RxS - Select tuples - Do projection Basic idea

DBMS 2001Notes 5: Query Processing6 RxSR.AR.BR.CS.CS.DS.E a x 2 a y 2. c x 2. Bingo! Got one...

DBMS 2001Notes 5: Query Processing7 Problem A Cartesian product RxS may be LARGE –need to create and examine n x m tuples, where n = |R| and m = |S| –For example, n = m = 1000 => 10 6 records  need more efficient evaluation methods

DBMS 2001Notes 5: Query Processing8 Overview of Query Execution {P1,P2,…..} {P1,C1>...} parse convert apply laws estimate result sizes consider physical plans estimate costs pick best execute Pi answer SQL query parse tree logical query plan “improved” l.q.p l.q.p. +sizes statistics

DBMS 2001Notes 5: Query Processing9 Relational Algebra - used to describe logical plans Ex: Original logical query plan  B,D  R.A =“c”  S.E=2  R.C=S.C  x RS OR:  B,D [  R.A=“c”  S.E=2  R.C = S.C (RxS)] SELECT  B,D  WHERE ...  FROM  R,S 

DBMS 2001Notes 5: Query Processing10 Improved logical query plan:  B,D  R.A = “c”  S.E = 2 R S Plan II natural join

DBMS 2001Notes 5: Query Processing11 R S A B C  A='c'  ( R )  E=2  ( S ) C D E a 1 10 A B C C D E 10 x 2 b 1 20c x 2 20 y 2 c y 2 30 z 2 d z 2 40 x 1 e y 3  B,D

DBMS 2001Notes 5: Query Processing12 Physical Query Plan: (1) Use R.A index to select tuples of R with R.A = “c” (2) For each R.C value found, use the index on S.Cto find matching tuples (3) Eliminate S tuples with S.E  2 (4) Join matching R,S tuples, project on attributes B and D, and place in result Detailed description to execute the query: - algorithms to implement operations; order of execution steps; how relations are accessed; For example:

DBMS 2001Notes 5: Query Processing13 R S A B C C D E a x 2 b y 2 c z 2 d x 1 e y 3 AC I1I1 I2I2 =“c” check=2? output: next tuple:

DBMS 2001Notes 5: Query Processing14 Outline (Chapter 6) (Relational algebra for queries –representation for logical query plans –operations that we need to support) Algorithms to implement relational operations –efficiency estimates (for selecting the most appropriate) We concentrate on algorithms for selections and joins

DBMS 2001Notes 5: Query Processing15 Physical operators Principal methods for executing operations of relational algebra Building blocks of physical query plans Major strategies –scanning tables –sorting, indexing, hashing

DBMS 2001Notes 5: Query Processing16 Cost Estimates Estimate only # of disk I/O operations –dominating efficiency factor exception: communication of data over network Simplifying assumptions –ignore the cost of writing the result result blocks often passed in memory to further operations ("pipelining") –I/O happens one block at a time (e.g., ignore usage of cylinder sized blocks)

DBMS 2001Notes 5: Query Processing17 Parameters for Estimation Kept as statistics for each relation R: –T(R) : # of tuples in R –B(R): # of blocks to hold all tuples of R –V(R, A): # of distinct values for attribute R.A = SELECT COUNT (DISTINCT A) FROM R M: # of available main memory buffers (estimate)

DBMS 2001Notes 5: Query Processing18 Cost of Scanning a Relation Normally assume relation R to be clustered, that is, stored in blocks exclusively (or predominantly) used for representing R For example, consider a clustered-file organization of relations DEPT(Name, …) and EMP(Name, Dname, …)

DBMS 2001Notes 5: Query Processing19 DEPT: Toy,... EMP: Ann, Toy,... EMP: Bob, Toy,... … DEPT: Sales,... EMP: John, Sales,... EMP: Ken, Sales,... … … Relation EMP might be considered clustered, relation DEPT probably not For a clustered relation R, sufficient to read (approx.) B(R) blocks for a full scan If relation R not clustered, most tuples probably in different blocks => input cost approx. T(R)

DBMS 2001Notes 5: Query Processing20 Classification of Physical Operators (1) By method: –sort-based process relation(s) in the sorted order –hash-based process relation(s) partitioned in hash buckets –index-based apply existing indexes especially useful for selections

DBMS 2001Notes 5: Query Processing21 Classification of Physical Operators (2) By applicability and cost: –one-pass methods if at least one argument relation fits in main memory –two-pass methods if memory not sufficient for one-pass process relations twice, storing intermediate results on disk –multi-pass generalization of two-pass for HUGE relations

DBMS 2001Notes 5: Query Processing22 Implementing Selection How to evaluate  C (R)? –Sufficient to examine one tuple at a time  Easy to evaluate in one pass: Read each block of R using one input buffer Output records that satisfy condition C –If R clustered, cost = B(R); else T(R) Projection  A (R) in a similar manner

DBMS 2001Notes 5: Query Processing23 Index-Based Selection Consider selection  A='c' (R) If there is an index on R.A, we can locate tuples t with t.A='c' directly What is the cost? –How many tuples are selected? estimate: T(R)/V(R,A) on the average if A is a primary key, V(R,A) =T(R)  1 disk I/O

DBMS 2001Notes 5: Query Processing24 Index-Based Selection (cont.) Index is clustering, if tuples with A='c' are stored in consecutive blocks (for any 'c') A index A

DBMS 2001Notes 5: Query Processing25 Selection using a clustering index We estimate a fraction T(R)/V(R,A) of all R tuples to satisfy A='c’. Apply same estimate to data blocks accessible through a clustering index => is an estimate for the number of block accesses Further simplifications: Ignore, e.g., –cost of reading the (few) index blocks –unfilled room left intentionally in blocks –…

DBMS 2001Notes 5: Query Processing26 Selection Example Consider  A=0 (R) when T(R)=20,000, B(R)=1000, and there's an index on R.A –simple scan of R if R not clustered: cost = T(R) = 20,000 if R clustered: cost = B(R) = 1000 –if V(R,A)=100 and index is … not clustering  cost = T(R)/V(R,A) = 200 clustering  cost = B(R)/V(R,A)= 10 –if V(R,A)=20,000 (i.e., A is key)  cost = 1 Time if disk I/O 15 ms 5 min 15 sec 3 sec 0,15 sec 15 ms

DBMS 2001Notes 5: Query Processing27 Processing of Joins Consider natural join R(X,Y) S(Y,Z) –general joins rather similarly, possibly with additional selections (for complex join conditions) Assumptions: –Y = join attributes common to R and S –S is the smaller of relations: B(S) ≤≤B(R)

DBMS 2001Notes 5: Query Processing28 One-Pass Join Requirement: B(S) < M, i.e., S fits in memory Read entire S in memory (using one buffer); Build a dictionary (balanced tree, hash table) using join attributes of tuples as search key Read each block of R (using one buffer); For each tuple t, find matching tuples from the dictionary, and output their join I/O cost: B(S) + B(R) ≤

DBMS 2001Notes 5: Query Processing29 What If Memory Insufficient? Basic join strategy: –"nested-loop" join – ”1+n pass” operation: one relation read once, the other repeateadly –no memory limitations can be used for relations of any size

DBMS 2001Notes 5: Query Processing30 Nested-loop join (conceptually) for each tuple s  S do for each tuple r  R do if r.Y = s.Y then output join of r and s; Cost (like for Cartesian product): T(S) *(1 + T(R)) = T(S) + T(S)T(R)

DBMS 2001Notes 5: Query Processing31 If R and S clustered, can apply block-based nested-loop join: for each chunck of M-1 blocks of S do Read blocks in memory; Insert tuples in a dictionary using the join attributes; for each block b of R do Read b in memory; for each tuple r in b do Find matching tuples from the dictionary; output their join with r;

DBMS 2001Notes 5: Query Processing32 Cost of Block-Based Nested-Loop Join Consider R(X,Y) S(Y,Z) when B(R)=1000, B(S)=500, and M = 101 –Use 100 buffers for loading S  500/100 = 5 chunks –Total I/O cost = 5 x ( ) = 5500 blocks R as the outer-loop relation  I/O cost 6000 –in general, using the smaller relation in the outer loop gives an advantage of B(R) - B(S) operations

DBMS 2001Notes 5: Query Processing33 Analysis of Nested-Loop join B(S)/(M-1) outer-loop iterations; Each reads M-1 + B(R) blocks  total cost = B(S) + B(S)B(R)/(M-1), or approx. B(S)B(R)/M blocks Not the best method, but sometimes the only choice Next: More efficient join algorithms

DBMS 2001Notes 5: Query Processing34 Sort-Based Two-Pass Join Idea: Joining relations R and S on attribute Y is rather easy, if the relations are sorted using Y –IF not too many tuples join for any value of the join attributes. (E.g. if  Y (R) =  Y (S) = {y}, all tuples match, and we may need to resort to nested-loop join) If relations not sorted already, they have to be sorted (with two-phase multi-way merge sort, since they do not fit in memory)

DBMS 2001Notes 5: Query Processing35 Sort-Based Two-Pass Join 1. Sort R with join attributes Y as the sort key; 2. Do the same for relation S; 3. Merge the sorted relations, using 1 buffer for current input block of each relation: - skip tuples whose Y-value y not in both R and S - read blocks of both R and S for all tuples whose Y value is y - output all possible joins of the matching tuples r  R and s  S

DBMS 2001Notes 5: Query Processing36 Example: Join of R and S sorted on Y 1 a 2 c 3 c 4 d 5 e... b 1 c 2 c 3 c 4 e 5... R(X, Y) S(Y, Z) … Main memory 1 a 2 c b 1 c 2 3 c 4 d c 3 c 4 2 c 2 2 c 3 2 c 4 3 c 2 3 c 3 3 c 4 5 e... e 5... …

DBMS 2001Notes 5: Query Processing37 Analysis of Sort-Based Two-Phase Join Consider R(X,Y) S(Y,Z) when B(R)=1000, B(S)=500, and M = 101 –Remember two-phase multiway merge sort: each block read + written + read + written once  4 x (B(R) + B(S)) = 6000 disk I/Os –Merge of sorted relations for the join: B(R) + B(S)= 1500 disk I/Os –Total I/O cost = 5 x ( B(R) + B(S) ) = 7500 Seems big, but for large R and S much better than B(R)B(S)/M of block-based nested loop join

DBMS 2001Notes 5: Query Processing38 Analysis of Sort-Based Two-Phase Join Limitations? Sorting requires max(B(R), B(S)) ≤≤M 2 Variation: Perform only phase I (building of sorted sublists) of the sorting, and merge all of them (can handle at most M) for the join –I/O cost = 3 x ( B(R) + B(S) ) –requires the union of R and S to fit in at most M sublists, each of which at most M blocks long  works if B(R)+ B(S) ≤≤M 2

DBMS 2001Notes 5: Query Processing39 Two-Phase Join with Hashing Idea: If relations do not fit in memory, first hash the tuples of each relation in buckets. Then join tuples in each pair of buckets. For a join on attributes Y, use Y as the hash key –Hash Phase: For each relation R and S: Use 1 input buffer, and M-1 output buffers as hash buckets Read each block and hash its tuples; When output buffer gets full, write it on disk as the next block of that bucket

DBMS 2001Notes 5: Query Processing40 Two-Phase Join with Hashing The hashing phase produces buckets (sequences of blocks) R 1, …, R M-1 and S 1, …, S M-1 Tuples r  R and s  S join iff r.Y = s.Y => h(r.Y) = h(s.Y) => r occurs in bucket R i and s occurs in bucket S i for the same i

DBMS 2001Notes 5: Query Processing41 Hash-Join: The Join Phase For each i = 1, …, M-1, perform one-pass join between buckets R i and S i  the smaller one has to fit in M-1 main memory buffers Average size for bucket R i is approx. B(R)/M, and B(S)/M for bucket S i  Approximated memory requirement min(B(R), B(S)) < M 2

DBMS 2001Notes 5: Query Processing42 Cost of Hash-Join Consider R(X,Y) S(Y,Z) when B(R)=1000, B(S)=500, and M = 101 Hashing  100 buckets for both R and S, with avg sizes 1000/100=10 and 500/100=5 I/O cost 4500 blocks: –hashing phase 2x x500 = 3000 blocks –join phase: (in total for the 100 one- pass joins) In general: cost = 3(B(R) + B(S))

DBMS 2001Notes 5: Query Processing43 Index-Based Join Still consider R(X,Y) S(Y,Z) Assume there's an index on S.Y Can compute the join by –reading each tuple t of R –locating matching tuples of S by index- lookup for t.Y, and –outputting their join with tuple t Efficiency depends on many factors

DBMS 2001Notes 5: Query Processing44 Cost of Index-Based Join Cost of scanning R: –B(R), if clustered; T(R), if not On the average, T(S)/V(S,Y) matching tuples found by index lookup; Cost of loading them (total for all tuples of R): –T(R)T(S)/V(S,Y), if index not clustered –T(R)B(S)/V(S,Y), if index clustered Cost of loading tuples of S dominates

DBMS 2001Notes 5: Query Processing45 Example: Cost of Index-Join Again R(X,Y) S(Y,Z) with B(R)=1000, B(S)=500; T(R) = 10,000, T(S) = 5000, and V(S,Y) = 100 Assume R clustered, and the index on S.Y is clustering  I/O cost ,000 x 500/100 = 51,000 blocks Often not this bad…

DBMS 2001Notes 5: Query Processing46 Index-Join useful... … when |R| << |S|, and V(S,Y) large (i.e, the index on S.Y is selective) For example, if Y primary key of S: –each of the T(R) index lookups locates at most one record of relation S => at most T(R) input operations to load blocks of S => Total cost only B(R) + T(R), if R clustered, and T(R) + T(R) = 2T(R), if R not clustered

DBMS 2001Notes 5: Query Processing47 Joins Using a Sorted Index [Book: 6.7.4] Still consider R(X,Y) S(Y,Z) Assume there's a sorted index on both R.Y and S.Y –B-tree or a sorted sequential index Scan both indexes in the increasing order of Y –like merge-join, without need to sort first –if index dense, can skip nonmatching tuples without loading them –very efficient Details to excercises?

DBMS 2001Notes 5: Query Processing48 Summary - Query Processing Overall picture: Query -> logical plan -> physical plan -> execution Physical operators –to implement physical query plans –selections, joins –one-pass, two-pass, (multi-pass) –based on scanning, sorting, hashing, existing indexes Next: More about query compilation (Chapter 7)