Relational Operators

Presentation transcript:

1 Relational Operators

2 Outline
Logical/physical operators
Cost parameters and sorting
One-pass algorithms
Nested-loop joins
Two-pass algorithms

3 Query Execution
[Figure: User/Application → (query or update) → Query compiler → (query execution plan) → Execution engine → (record, index requests) → Index/record mgr. → (page commands) → Buffer manager → (read/write pages) → Storage manager → storage]

4 Logical vs. Physical Operators
Logical operators
– what they do
– e.g., union, selection, project, join, grouping
Physical operators
– how they do it
– e.g., nested loop join, sort-merge join, hash join, index join

5 Query Execution Plans
SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer=Q.name AND Q.city='Madison'
[Figure: plan tree, bottom to top: Purchase (table scan) and σ City='Madison' over Person (index scan) feed ⋈ Buyer=name (nested loops join), followed by π buyer]
Query plan:
– logical tree
– implementation choice at every node
– scheduling of operations
Some operators are from relational algebra, and others (e.g., scan, group) are not.

6 How do We Combine Operations?
The iterator model. Each operation is implemented by 3 functions:
– Open: sets up the data structures and performs initializations
– GetNext: returns the next tuple of the result
– Close: ends the operation and cleans up the data structures
Enables pipelining!
Contrast with the data-driven materialize model.
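
The three-function interface can be sketched in Python as follows. This is only an illustration, not code from the slides: the class names `Scan` and `Select`, the helper `run`, and the sample rows are hypothetical, and an in-memory list stands in for a table on disk.

```python
class Scan:
    """Leaf operator over an in-memory list (stand-in for a stored table)."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0                      # initialize the cursor

    def get_next(self):
        if self.pos >= len(self.rows):
            return None                   # None signals end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None                   # release state


class Select:
    """Selection operator: pulls tuples from its child one at a time."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan from the root: open, pull until exhausted, close."""
    plan.open()
    out = []
    row = plan.get_next()
    while row is not None:
        out.append(row)
        row = plan.get_next()
    plan.close()
    return out


people = [("alice", "Madison"), ("bob", "Seattle"), ("carol", "Madison")]
madison = run(Select(Scan(people), lambda r: r[1] == "Madison"))
```

Because `Select.get_next` asks its child for one tuple at a time, no intermediate result is materialized, which is the pipelining the slide refers to.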

7 Cost Parameters
Cost parameters:
– M = number of blocks that fit in main memory
– B(R) = number of blocks holding R
– T(R) = number of tuples in R
– V(R,a) = number of distinct values of the attribute a in R
Estimating the cost:
– Important in optimization (next lecture)
– Compute I/O cost only
– We compute the cost to read the tables
– We don't compute the cost to write the result (because of pipelining)

8 Reminder: Sorting
Two-pass multi-way merge sort
Step 1:
– Read M blocks at a time, sort, write
– Result: have runs of length M on disk
Step 2:
– Merge M-1 runs at a time, write to disk
– Result: have runs of length M(M-1) ≈ M²
Cost: 3B(R)
Assumption: B(R) ≤ M²
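
A minimal sketch of the two steps, assuming each block is a small Python list of keys and that lists in memory stand in for runs on disk. The function names `two_pass_sort` and `two_pass_sort_cost` are made up for this illustration.

```python
import heapq


def two_pass_sort(blocks, M):
    """blocks: list of blocks (each a list of keys); M: memory size in blocks."""
    # Step 1: read M blocks at a time, sort them in memory, write one run.
    runs = []
    for i in range(0, len(blocks), M):
        chunk = [key for block in blocks[i:i + M] for key in block]
        runs.append(sorted(chunk))

    # Step 2: one merge pass handles at most M-1 runs
    # (the remaining buffer holds the output).
    if len(runs) > M - 1:
        raise ValueError("B(R) is too large for two passes with this M")
    return list(heapq.merge(*runs))


def two_pass_sort_cost(B):
    """I/O cost from the slide: read + write in step 1, read in step 2."""
    return 3 * B
```

For example, `two_pass_sort([[5, 1], [4, 2], [9, 0], [3, 8]], M=3)` forms two runs and merges them in a single pass.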

9 Scanning Tables
The table is clustered (i.e., its blocks consist only of records from this table):
– Table scan: if we know where the blocks are
– Index scan: if we have a sparse index to find the blocks
The table is unclustered (e.g., its records are placed on blocks with other tables):
– May need one read for each record

10 Cost of the Scan Operator
Clustered relation:
– Table scan: B(R); to sort: 3B(R)
– Index scan: B(R); to sort: B(R) or 3B(R)
Unclustered relation:
– T(R); to sort: T(R) + 2B(R)

11 One pass algorithm

12 One-pass Algorithms
Selection σ(R), projection π(R)
Both are tuple-at-a-time algorithms
Cost: B(R)
[Figure: input buffer → unary operator → output buffer]

13 One-pass Algorithms
Duplicate elimination δ(R)
Need to keep a dictionary in memory:
– balanced search tree
– hash table
– etc.
Cost: B(R)
Assumption: B(δ(R)) ≤ M
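
As a sketch of one-pass duplicate elimination, assuming blocks are Python lists and a `set` plays the role of the in-memory dictionary (the function name `distinct_one_pass` is hypothetical):

```python
def distinct_one_pass(blocks):
    """Yield each distinct tuple once while reading every block exactly once."""
    seen = set()            # the in-memory dictionary; must fit in M blocks
    for block in blocks:    # one read per block, hence cost B(R)
        for t in block:
            if t not in seen:
                seen.add(t)
                yield t
```

Only the distinct tuples are kept in memory, which is why the assumption is on the size of the result rather than on B(R).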

14 One-pass Algorithms
Grouping: γ city, sum(price) (R)
Need to keep a dictionary in memory
Also store the sum(price) for each city
Cost: B(R)
Assumption: number of cities fits in memory
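
The same pattern works for grouping. The sketch below assumes tuples of the form `(city, price)` and uses a `dict` as the dictionary; the function name and sample data are illustrative only.

```python
def sum_price_by_city(blocks):
    """One pass over R, keeping one running sum per city in memory."""
    totals = {}
    for block in blocks:
        for city, price in block:
            totals[city] = totals.get(city, 0) + price
    return totals


sales = [[("Madison", 10), ("Seattle", 5)], [("Madison", 7)]]
totals = sum_price_by_city(sales)
```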

15 One-pass Algorithms
Binary operations: R ∩ S, R ∪ S, R - S
Assumption: min(B(R), B(S)) ≤ M
Scan one table first, then the next, eliminate duplicates
Cost: B(R)+B(S)

16 Nested loop join

17 Nested Loop Joins
Tuple-based nested loop R ⋈ S
for each tuple r in R do
  for each tuple s in S do
    if r and s join then output (r,s)
Cost: T(R) T(S), sometimes T(R) B(S)

18 Nested Loop Joins
Block-based Nested Loop Join
(M-2 blocks hold a chunk of S, leaving one buffer for R and one for output)
for each (M-2) blocks bs of S do
  for each block br of R do
    for each tuple s in bs do
      for each tuple r in br do
        if r and s join then output(r,s)

19 Nested Loop Joins
[Figure: block-based nested loop join of R & S. Main memory holds a hash table for a block of S (k < B-1 pages), an input buffer for R, and an output buffer that produces the join result.]

20 Nested Loop Joins
Block-based Nested Loop Join
Cost:
– Read S once: cost B(S)
– Outer loop runs B(S)/(M-2) times, and each time we need to read R: costs B(S)B(R)/(M-2)
– Total cost: B(S) + B(S)B(R)/(M-2)
Notice: it is better to iterate over the smaller relation first
S ⋈ R: S = outer relation, R = inner relation
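
The block-based join and its cost can be sketched together. This is an assumption-laden illustration: blocks are Python lists, the chunk size follows the M-2 of the cost formula, and the names `block_nested_loop_join` and `bnl_cost` are invented here. The function also counts block reads so the formula can be checked against the loop.

```python
import math


def block_nested_loop_join(S_blocks, R_blocks, M, match):
    """Join with S as outer relation; returns (result, block reads)."""
    chunk = M - 2                       # buffers available for the S chunk
    out = []
    r_reads = 0
    for i in range(0, len(S_blocks), chunk):
        s_tuples = [s for b in S_blocks[i:i + chunk] for s in b]
        for br in R_blocks:             # R is re-read once per chunk of S
            r_reads += 1
            for s in s_tuples:
                for r in br:
                    if match(r, s):
                        out.append((r, s))
    return out, len(S_blocks) + r_reads


def bnl_cost(B_S, B_R, M):
    """B(S) + B(S)B(R)/(M-2), rounding the number of chunks up."""
    return B_S + math.ceil(B_S / (M - 2)) * B_R


S_blocks = [[1, 2], [3, 4], [5, 6]]
R_blocks = [[2, 3], [6, 7]]
pairs, reads = block_nested_loop_join(S_blocks, R_blocks, 4,
                                      lambda r, s: r == s)
```

With M = 4 the outer loop runs twice, so R is read twice, for 3 + 2·2 = 7 block reads in total.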

21 Two pass algorithm

22 Two-Pass Algorithms Based on Sorting
Duplicate elimination δ(R)
Simple idea: sort first, then eliminate duplicates
Step 1: sort runs of size M, write
– Cost: 2B(R)
Step 2: merge M-1 runs, but include each tuple only once
– Cost: B(R)
– Some complications...
Total cost: 3B(R)
Assumption: B(R) ≤ M²

23 Two-Pass Algorithms Based on Sorting
Grouping: γ city, sum(price) (R)
Same as before: sort, then compute the sum(price) for each group
As before: compute sum(price) during the merge phase
Total cost: 3B(R)
Assumption: B(R) ≤ M²

24 Two-Pass Algorithms Based on Sorting
Binary operations: R ∩ S, R ∪ S, R - S
Idea: sort R, sort S, then do the right thing
A closer look:
– Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S)
– Step 2: merge M/2 runs from R; merge M/2 runs from S; output a tuple on a case-by-case basis
Total cost: 3B(R)+3B(S)
Assumption: B(R)+B(S) ≤ M²

25 Two-Pass Algorithms Based on Sorting
Join R ⋈ S
Start by sorting both R and S on the join attribute:
– Cost: 4B(R)+4B(S) (because we need to write to disk)
Read both relations in sorted order, match tuples
– Cost: B(R)+B(S)
Difficulty: many tuples in R may match many in S
– If at least one set of tuples fits in M, we are OK
– Otherwise need nested loop, higher cost
Total cost: 5B(R)+5B(S)
Assumption: B(R) ≤ M², B(S) ≤ M²

26 Two-Pass Algorithms Based on Sorting
Join R ⋈ S
If the number of tuples in R matching those in S is small (or vice versa), we can compute the join during the merge phase
Total cost: 3B(R)+3B(S)
Assumption: B(R) + B(S) ≤ M²
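
The matching step of the sort-based join can be sketched as below. It is an in-memory illustration only: `sorted` stands in for the external sort, tuples are assumed to carry the join attribute in position 0, and the name `merge_join` is hypothetical. The inner group handling corresponds to the case where the set of S tuples sharing a key fits in memory.

```python
def merge_join(R, S, key=lambda t: t[0]):
    """Sort both inputs on the join key, then match them in one merge scan."""
    R = sorted(R, key=key)
    S = sorted(S, key=key)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = key(R[i]), key(S[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the group of S tuples sharing this key
            # (assumed small enough to hold in memory).
            j_end = j
            while j_end < len(S) and key(S[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key against the S group.
            while i < len(R) and key(R[i]) == kr:
                for s in S[j:j_end]:
                    out.append((R[i], s))
                i += 1
            j = j_end
    return out


R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
S = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(R, S)
```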

27 Two Pass Algorithms Based on Hashing
Idea: partition a relation R into buckets, on disk
Each bucket has size approx. B(R)/M
Does each bucket fit in main memory?
– Yes if B(R)/M ≤ M, i.e. B(R) ≤ M²
[Figure: relation R (B(R) blocks on disk) is read through 1 input buffer, hashed by function h into M-1 output buffers (M main memory buffers in total), and written back to disk as M-1 partitions.]

28 Hash Based Algorithms for δ
Recall: δ(R) = duplicate elimination
Step 1. Partition R into buckets
Step 2. Apply δ to each bucket (may read in main memory)
Cost: 3B(R)
Assumption: B(R) ≤ M²

29 Hash Based Algorithms for γ
Recall: γ(R) = grouping and aggregation
Step 1. Partition R into buckets
Step 2. Apply γ to each bucket (may read in main memory)
Cost: 3B(R)
Assumption: B(R) ≤ M²

30 Hash-based Join
R ⋈ S
Recall the main memory hash-based join:
– Scan S, build buckets in main memory
– Then scan R and join

31 Partitioned Hash Join
R ⋈ S
Step 1:
– Hash S into M-1 buckets
– Send all buckets to disk
Step 2:
– Hash R into M-1 buckets
– Send all buckets to disk
Step 3:
– Join every pair of buckets

32 Partitioned Hash-Join
Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i.
Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches.
[Figure, partitioning phase: the original relation on disk is read through 1 input buffer, hashed by function h into M-1 output buffers, and written to disk as partitions 1, 2, ..., M-1.]
[Figure, join phase: partitions of R & S on disk; main memory holds a hash table for partition Si (< M-1 pages) built with hash fn h2, an input buffer for Ri, and an output buffer that produces the join result.]

33 Partitioned Hash Join
Cost: 3B(R) + 3B(S)
Assumption: min(B(R), B(S)) ≤ M²
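
A small sketch of the three steps, assuming tuples carry the join attribute in position 0 and that dictionaries of lists stand in for buckets written to disk. The names `partition`, `partitioned_hash_join`, and `partitioned_hash_join_cost` are invented for this illustration; Python's built-in `hash` plays the role of h, and a per-bucket `dict` plays the role of the second hash function h2.

```python
from collections import defaultdict


def partition(tuples, n, key):
    """Split tuples into n buckets with hash function h (here: hash % n)."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[hash(key(t)) % n].append(t)
    return buckets


def partitioned_hash_join(R, S, M, key=lambda t: t[0]):
    n = M - 1
    S_parts = partition(S, n, key)      # Step 1: hash S into M-1 buckets
    R_parts = partition(R, n, key)      # Step 2: hash R into M-1 buckets
    out = []
    for i, s_bucket in S_parts.items():  # Step 3: join matching bucket pairs
        table = defaultdict(list)        # in-memory table for bucket S_i
        for s in s_bucket:
            table[key(s)].append(s)
        for r in R_parts.get(i, []):     # stream bucket R_i past the table
            for s in table.get(key(r), []):
                out.append((r, s))
    return out


def partitioned_hash_join_cost(B_R, B_S):
    """Read and write both relations once to partition, then read them again."""
    return 3 * (B_R + B_S)


R = [(1, "a"), (2, "b"), (5, "c")]
S = [(2, "x"), (5, "y"), (5, "z"), (7, "w")]
result = sorted(partitioned_hash_join(R, S, M=4))
```

Tuples with equal keys always land in buckets with the same index, so only pairs (R_i, S_i) need to be joined.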

34 Hybrid Hash Join Algorithm
When we have more memory: B(S) << M²
Partition S into k buckets
But keep first bucket S_1 in memory, k-1 buckets to disk
Partition R into k buckets
– First bucket R_1 is joined immediately with S_1
– Other k-1 buckets go to disk
Finally, join k-1 pairs of buckets:
– (R_2,S_2), (R_3,S_3), …, (R_k,S_k)

35 Hybrid Join Algorithm
How big should we choose k?
Average bucket size for S is B(S)/k
Need to fit B(S)/k + (k-1) blocks in memory
– B(S)/k + (k-1) ≤ M
– k slightly larger than B(S)/M (since B(S)/k ≤ M - (k-1) < M)

36 Hybrid Join Algorithm
How many I/Os?
Recall: cost of partitioned hash join:
– 3B(R) + 3B(S)
Now we save 2 disk operations per block for one bucket
Recall there are k buckets
Hence we save (2/k)(B(R) + B(S))
Cost: (3 - 2/k)(B(R) + B(S)) ≈ (3 - 2M/B(S))(B(R) + B(S)), taking k ≈ B(S)/M
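
The two hybrid-join formulas are easy to evaluate numerically. The sketch below uses invented function names and illustrative sizes (B(R) = 1000, B(S) = 500, M = 100) that do not come from the slides.

```python
def smallest_k(B_S, M):
    """Smallest number of buckets k with B(S)/k + (k-1) <= M, or None."""
    for k in range(1, M + 1):
        if B_S / k + (k - 1) <= M:
            return k
    return None


def hybrid_hash_join_cost(B_R, B_S, M):
    """(3 - 2/k)(B(R) + B(S)) with the approximation k = B(S)/M."""
    k = B_S / M
    return (3 - 2 / k) * (B_R + B_S)


k = smallest_k(500, 100)
hybrid = hybrid_hash_join_cost(1000, 500, 100)
plain = 3 * (1000 + 500)
```

With these numbers B(S)/M = 5, the smallest feasible k is 6, and the estimated cost drops from 4500 I/Os for the plain partitioned join to about 3900.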

37 Index-Based Algorithms
In a clustered index, all tuples with the same value of the key are clustered on as few blocks as possible
[Figure: tuples with key value a packed into adjacent blocks]

38 Index Based Selection
Selection on equality: σ a=v (R)
Clustered index on a: cost B(R)/V(R,a)
Unclustered index on a: cost T(R)/V(R,a)

39 Index Based Selection
Example: B(R) = 2000, T(R) = 100,000, V(R,a) = 20; compute the cost of σ a=v (R)
Cost of table scan:
– If R is clustered: B(R) = 2000 I/Os
– If R is unclustered: T(R) = 100,000 I/Os
Cost of index-based selection:
– If index is clustered: B(R)/V(R,a) = 100 I/Os
– If index is unclustered: T(R)/V(R,a) = 5000 I/Os
Notice: when V(R,a) is small, an unclustered index is useless (here 5000 I/Os, versus 2000 for scanning the clustered table)
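
The arithmetic of this example can be reproduced with a few lines. The function name `selection_costs` and the dictionary keys are made up for this sketch; the formulas and the input numbers are the ones on the slide.

```python
def selection_costs(B, T, V):
    """I/O estimates for an equality selection on attribute a of R."""
    return {
        "table_scan_clustered": B,       # read every block of R
        "table_scan_unclustered": T,     # up to one read per tuple
        "index_clustered": B / V,        # matching tuples are packed together
        "index_unclustered": T / V,      # one read per matching tuple
    }


costs = selection_costs(B=2000, T=100_000, V=20)
```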

40 Index Based Join
R ⋈ S
Assume S has an index on the join attribute
Iterate over R, for each tuple fetch corresponding tuple(s) from S
Assume R is clustered. Cost:
– If index is clustered: B(R) + T(R)B(S)/V(S,a)
– If index is unclustered: B(R) + T(R)T(S)/V(S,a)
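
The two cost formulas differ only in the price of one index probe, which the following sketch makes explicit. The function name and the sample sizes are assumptions chosen for illustration, not values from the slides.

```python
def index_join_cost(B_R, T_R, B_S, T_S, V_S_a, clustered_index):
    """Scan clustered R once, then probe the index on S for every tuple of R."""
    per_probe = (B_S if clustered_index else T_S) / V_S_a
    return B_R + T_R * per_probe


# Illustrative sizes: R has 100 blocks and 1,000 tuples;
# S has 2,000 blocks, 100,000 tuples, and 20 distinct join values.
clustered = index_join_cost(100, 1000, 2000, 100_000, 20, True)
unclustered = index_join_cost(100, 1000, 2000, 100_000, 20, False)
```

Under these assumed sizes the probe cost dominates: each tuple of R costs about 100 I/Os with a clustered index and about 5000 with an unclustered one.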

41 Index Based Join
Assume both R and S have a sorted index (B+ tree) on the join attribute
Then perform a merge join (called zig-zag join)
Cost: B(R) + B(S)