Database Applications (15-415) DBMS Internals- Part VII Lecture 19, March 27, 2018 Mohammad Hammoud.

Slides:



Advertisements
Similar presentations
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 12, Part A.
Advertisements

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Implementation of relational operations
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Implementation of Relational Operations CS186, Fall 2005 R&G - Chapter 14 First comes thought; then organization of that thought, into ideas and plans;
1 Implementation of Relational Operations Module 5, Lecture 1.
Evaluation of Relational Operators 198:541. Relational Operations  We will consider how to implement: Selection ( ) Selects a subset of rows from relation.
1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational.
SPRING 2004CENG 3521 Join Algorithms Chapter 14. SPRING 2004CENG 3522 Schema for Examples Similar to old schema; rname added for variations. Reserves:
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
1 Evaluation of Relational Operations Yanlei Diao UMass Amherst March 01, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#14: Implementation of Relational Operations.
1 Implementation of Relational Operations: Joins.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
Lec3/Database Systems/COMP4910/031 Evaluation of Relational Operations Chapter 14.
RELATIONAL JOIN Advanced Data Structures. Equality Joins With One Join Column External Sorting 2 SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid=S1.sid.
Implementing Natural Joins, R. Ramakrishnan and J. Gehrke with corrections by Christoph F. Eick 1 Implementing Natural Joins.
1 Database Systems ( 資料庫系統 ) December 19, 2004 Lecture #12 By Hao-hua Chu ( 朱浩華 )
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
Implementation of Relational Operations R&G - Chapter 14 First comes thought; then organization of that thought, into ideas and plans; then transformation.
Database Management Systems 1 Raghu Ramakrishnan Evaluation of Relational Operations Chpt 14.
Hash Tables and Query Execution March 1st, Hash Tables Secondary storage hash tables are much like main memory ones Recall basics: –There are n.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
Database Applications (15-415) DBMS Internals- Part VIII Lecture 19, March 29, 2016 Mohammad Hammoud.
Database Applications (15-415) DBMS Internals- Part IX Lecture 20, March 31, 2016 Mohammad Hammoud.
Relational Operator Evaluation. Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g., SAP admin)
Implementation of Relational Operations Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein, Mike Franklin, and etc for.
Database Applications (15-415) DBMS Internals- Part VIII Lecture 17, Oct 30, 2016 Mohammad Hammoud.
Database Systems (資料庫系統)
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
Introduction to Query Optimization
Evaluation of Relational Operations
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Examples of Physical Query Plan Alternatives
Database Applications (15-415) DBMS Internals- Part VI Lecture 18, Mar 25, 2018 Mohammad Hammoud.
Relational Operations
CS222P: Principles of Data Management Notes #11 Selection, Projection
Database Applications (15-415) DBMS Internals- Part VI Lecture 15, Oct 23, 2016 Mohammad Hammoud.
CS222P: Principles of Data Management Notes #12 Joins!
Introduction to Database Systems
CS222: Principles of Data Management Notes #12 Joins!
Selected Topics: External Sorting, Join Algorithms, …
Database Applications (15-415) DBMS Internals- Part IX Lecture 21, April 1, 2018 Mohammad Hammoud.
Overview of Query Evaluation
Overview of Query Evaluation
Implementation of Relational Operations
Lecture 13: Query Execution
CS222: Principles of Data Management Notes #11 Selection, Projection
Evaluation of Relational Operations: Other Techniques
Overview of Query Evaluation: JOINS
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Evaluation of Relational Operations: Other Techniques
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #10 Selection, Projection Instructor: Chen Li.
CS222/CS122C: Principles of Data Management UCI, Fall Notes #11 Join!
CS222P: Principles of Data Management UCI, Fall 2018 Notes #11 Join!
Presentation transcript:

Database Applications (15-415) DBMS Internals- Part VII Lecture 19, March 27, 2018 Mohammad Hammoud

Today… Last Session: DBMS Internals- Part VI Today’s Session: Algorithms for Relational Operations Today’s Session: DBMS Internals- Part VII Algorithms for Relational Operations (Cont’d) Announcements: P3 is due on Apr 15

DBMS Layers Queries Query Optimization and Execution Relational Operators Transaction Manager Files and Access Methods Recovery Manager Buffer Management Lock Manager Disk Space Management DB

Outline Introduction The Selection Operation The Projection Operation The Join Operation Done!

Assumptions We assume the following two relations: For Reserves, we assume: Each tuple is 40 bytes long, 100 tuples per page, 1000 pages For Sailors, we assume: Each tuple is 50 bytes long, 80 tuples per page, 500 pages Our cost metric is the number of I/Os We ignore the computational and output costs Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: date, rname: string)

The Projection Operation Consider the following query, Q, which implies a projection: How can we evaluate Q? Scan R and remove unwanted attributes (STEP 1) Eliminate any duplicate tuples (STEP 2) STEP2 is difficult and can be pursued using two basic approaches: Projection Based on Sorting Projection Based on Hashing SELECT DISTINCT R.sid, R.bid FROM Reserves R

The Projection Operation Faloutsos CMU - 15-415/615 The Projection Operation Discussions on: Projection Based on Sorting Projection Based on Hashing

Projection Based on Sorting The approach based on sorting has the following steps: Step 1: Scan R and produce a set of tuples, S, which contains only the wanted attributes Step 2: Sort S using external sorting Step 3: Scan the sorted result, compare adjacent tuples, and discard duplicates What is the I/O cost (assuming we use temporary relations)? Step 1: M + T I/Os, where M is the number of pages of R and T is the number of pages of the temporary relation Step 2: 2T × # of passes I/Os Step 3: T I/Os In step 2, the combination of all attributes is used as a key for sorting.

The Projection Operation: An Example Consider Q again: How many I/Os would evaluating Q incur? Step 1: M + T = 1000 I/Os + 250 I/Os, assuming each tuple written in the temporary relation is 10 bytes long Step 2: if B (say) is 20, we can sort the temporary relation in 2 passes at a cost of 2×250×2 = 1000 I/Os Step 3: add another 250 I/Os for the scan Total = 2500 I/Os SELECT DISTINCT R.sid, R.bid FROM Reserves R

B-Way Merge Sort How can we sort a file with N pages using B buffer pages? Pass 0: use B buffer pages and sort internally This will produce sorted B-page runs Passes 1, 2, …: use B – 1 buffer pages for input and the remaining page for output; do (B-1)-way merge in each run INPUT 1 . . . INPUT 2 . . . . . . OUTPUT INPUT B-1 Disk Disk B Main memory buffers

B-Way Merge Sort: I/O Cost Analysis Number of passes = For our example (i.e., 250 pages), using 20 buffer pages Number of passes = 1 + ⌈log(20-1)⌈250/20⌉⌉ = 2

The Projection Operation: An Example Consider Q again: How many I/Os would evaluating Q incur? Step 1: M + T = 1000 I/Os + 250 I/Os, assuming each tuple written in the temporary relation is 10 bytes long Step 2: if B (say) is 20, we can sort the temporary relation in 2 passes at a cost of 2×250×2 = 1000 I/Os Step 3: add another 250 I/Os for the scan Total = 2500 I/Os SELECT DISTINCT R.sid, R.bid FROM Reserves R Can we do better?

Projection Based on Modified External Sorting Projection based on sorting can be simply done by modifying the external sorting algorithm How can this be achieved? Pass 0: Project out unwanted attributes Passes 1, 2, 3, etc.: Eliminate duplicates during merging What is the I/O cost? Pass 0: M + T I/Os Passes 1, 2, 3, etc.: Cost of merging In step 2, the combination of all attributes is used as a key for sorting.

Projection Based on Modified External Sorting: An Example Consider Q again: How many I/Os would evaluating Q incur? Pass 0: M + T = 1000 + 250 I/Os Pass 1: read the runs (total of 250 pages) and merge them Grand Total = 1500 I/Os (as opposed to 2500 I/Os using the unmodified version!) SELECT DISTINCT R.sid, R.bid FROM Reserves R

The Projection Operation Faloutsos CMU - 15-415/615 The Projection Operation Discussions on: Projection Based on Sorting Projection Based on Hashing

Projection Based on Hashing The algorithm based on hashing has two phases: Partitioning Phase Duplicate Elimination Phase Partitioning Phase (assuming B buffers): Read R using 1 input buffer, one page at a time For each tuple in the input page Discard unwanted fields Apply hash function h1 to choose one of B-1 output buffers In step 2, the combination of all attributes is used as a key for sorting.

Projection Based on Hashing The algorithm based on hashing has two phases: Partitioning Phase Duplicate Elimination Phase Partitioning Phase: Two tuples that belong to different partitions are guaranteed not to be duplicates B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h1 B-1 Partitions . . . In step 2, the combination of all attributes is used as a key for sorting.

Projection Based on Hashing The algorithm based on hashing has two phases: Partitioning Phase Duplicate Elimination Phase Duplicate Elimination Phase: Read each partition and build a corresponding in-memory hash table, using hash function h2 (!= h1) on all fields, while discarding duplicates If a partition P does not fit in memory, apply hash-based projection algorithm recursively on P In step 2, the combination of all attributes is used as a key for sorting.

Projection Based on Hashing The algorithm based on hashing has two phases: Partitioning Phase Duplicate Elimination Phase What is the I/O cost of hash-based projection? Partitioning phase = M (to read R) + T (to write out the projected tuples) I/Os Duplicate Elimination phase = T (to read in every partition) (CPU and final writing costs are ignored) Total Cost = M + 2T In step 2, the combination of all attributes is used as a key for sorting.

Projection Based on Hashing: An Example Consider Q again: How many I/Os would evaluating Q incur? Partitioning phase: M + T = 1000 + 250 I/Os Duplicate Elimination phase: T = 250 I/Os Total = 1500 I/Os (as opposed to 2500 I/Os and 1500 I/Os using projection based on sorting and projection based on modified external sorting, respectively) SELECT DISTINCT R.sid, R.bid FROM Reserves R Which one is better, projection based on modified external sorting or projection based on hashing?

Sorting vs. Hashing The sorting-based approach is superior if: The duplicate frequency is high Or the distribution of (hash) values is very skewed With the sorting-based approach the result is sorted! Most DBMSs incorporate a sorting utility, which can be used to implement projection relatively easy Hence, sorting is the standard approach for projection! In step 2, the combination of all attributes is used as a key for sorting.

Index-Only Scan Can an index be used for projections? Useful if the key includes all wanted attributes As such, key values can be simply retrieved from the index without ever accessing the actual relation! This technique is referred to as index-only scan If an ordered (i.e., tree) index contains all wanted attributes as prefix of search key, we can: Retrieve index entries in order (index-only scan) Discard unwanted fields and compare adjacent tuples to eliminate duplicates In step 2, the combination of all attributes is used as a key for sorting.

Outline Introduction The Selection Operation The Projection Operation The Join Operation

The Join Operation Consider the following query, Q, which implies a join: How can we evaluate Q? Compute R × S Select (and project) as required But, the result of a cross-product is typically much larger than the result of a join Hence, it is very important to implement joins without materializing the underlying cross-product SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid

The Join Operation We will study five join algorithms, two of which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Assumptions We assume equality joins with: R representing Reserves and S representing Sailors M pages in R, pR tuples per page, m tuples total N pages in S, pS tuples per page, n tuples total We ignore the output and computational costs

The Join Operation We will study five join algorithms, two of which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Simple Nested Loops Join Algorithm #0: (naive) nested loop (SLOW!) R(A,..) S(A, ......) m n

Simple Nested Loops Join Algorithm #0: (naive) nested loop (SLOW!) for each tuple r of R for each tuple s of S if match: print (r, s) R(A,..) S(A, ......) m n

Simple Nested Loops Join Algorithm #0: (naive) nested loop (SLOW!) for each tuple r of R for each tuple s of S if match: print (r, s) Outer Relation Inner Relation R(A,..) S(A, ......) m n

Simple Nested Loops Join Algorithm #0: (naive) nested loop (SLOW!) How many disk accesses (‘M’ and ‘N’ are the numbers of pages for ‘R’ and ‘S’)? R(A,..) S(A, ......) m n

Simple Nested Loops Join Algorithm #0: (naive) nested loop (SLOW!) How many disk accesses (‘M’ and ‘N’ are the numbers of pages for ‘R’ and ‘S’)? I/O Cost = M+m*N R(A,..) S(A, ......) m n

Simple Nested Loops Join Algorithm #0: (naive) nested loop (SLOW!) - Cost = M + (pR * M) * N = 1000 + 100*1000*500 I/Os - At 10ms/IO, total = ~6 days (!) I/O Cost = M+m*N R(A,..) S(A, ......) m n Can we do better?

Nested Loops Join: A Simple Refinement Algorithm: Read in a page of R Read in a page of S Print matching tuples COST= ? R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Nested Loops Join: A Simple Refinement Algorithm: Read in a page of R Read in a page of S Print matching tuples COST= M+M*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Nested Loops Join Which relation should be the outer? COST= M+M*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Nested Loops Join Which relation should be the outer? A: The smaller (page-wise) COST= M+M*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Nested Loops Join M=1000, N=500 - if larger is the outer: Cost = 1000 + 1000*500 = 501,000 = 5010 sec (~ 1.4h) COST= M+M*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Nested Loops Join M=1000, N=500 - if smaller is the outer: Cost = 500 + 1000*500 = 500,500 = 5005 sec (~ 1.4h) COST= N+M*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Summary: Simple Nested Loops Join What if we do not apply the page-oriented refinement? Cost = M+ (pR * M) * N = 1000 + 100*1000*500 I/Os At 10ms/IO, total = ~6 days (!) What if we apply the page-oriented refinement? Cost = M * N + M = 1000*500+1000 I/Os At 10ms/IO, total = 1.4 hours (!) What if the smaller relation is the outer? Slightly better

The Join Operation We will study five join algorithms, two of which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Block Nested Loops What if we have B buffer pages available? R(A,..) M pages, m tuples N pages, n tuples

Block Nested Loops What if we have B buffer pages available? A: Give B–1 buffer pages to outer and 1 page to inner R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Block Nested Loops Algorithm: COST= ? Read in B−1 pages of R Read in a page of S Print matching tuples COST= ? R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Block Nested Loops Algorithm: COST= M+⌈M/(B−1)⌉*N Read in B−1 pages of R Read in a page of S Print matching tuples COST= M+⌈M/(B−1)⌉*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Block Nested Loops If the smallest (outer) relation fits in memory? 15-415/615 Faloutsos Block Nested Loops If the smallest (outer) relation fits in memory? That is, M = B−1 Cost =? R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Block Nested Loops If the smallest (outer) relation fits in memory? That is, M = B−1 Cost = N+M (minimum!) R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Nested Loops - Guidelines Pick as outer the smallest table (= fewest pages) Fit as much of it in memory as possible Loop over the inner

The Join Operation We will study five join algorithms, two of which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Index Nested Loops Join What if there is an index on one of the relations on the join attribute(s)? A: Leverage the index by making the indexed relation inner R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Index Nested Loops Join Assuming an index on S: for each tuple r of R for each tuple s of S where s.sid = r.sid Add (r, s) to result R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Index Nested Loops Join What will be the cost? Cost: M + m * c (c: look-up cost) ‘c’ depends on the type of index, the adopted alternative and whether the index is clustered or un-clustered! R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

The Join Operation We will study five join algorithms, two of which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Sort-Merge Join Sort both relations on join attribute(s) Scan each relation and merge This works only for equality join conditions! R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Sort-Merge Join: An Example ?

Sort-Merge Join: An Example NO

Sort-Merge Join: An Example ?

Sort-Merge Join: An Example YES Output the two tuples

Sort-Merge Join: An Example ?

Sort-Merge Join: An Example YES

Sort-Merge Join: An Example YES Output the two tuples

Sort-Merge Join: An Example ?

Sort-Merge Join: An Example NO

Sort-Merge Join: An Example ?

Sort-Merge Join: An Example YES Continue the same way! Output the two tuples

Sort-Merge Join COST = 2M*⌈1+logB−1⌈M/B⌉⌉ + 2N*⌈1+logB−1⌈N/B⌉⌉ + M + N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Sort-Merge Join COST = 2M*⌈1+logB−1⌈M/B⌉⌉ + 2N*⌈1+logB−1⌈N/B⌉⌉ + M + N For B = 100: = 2*1000*⌈1+log99⌈1000/100⌉⌉ + 2*500*⌈1+log99 ⌈500/100⌉⌉ + 1000 + 500 = 2*1000*2 + 2*500*2 + 1000 + 500 = 7,500 I/Os R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Sort-Merge vs. Block Nested Loop Assuming 100 buffer pages, Reserves and Sailors can be sorted in 2 passes Sort-Merge cost = 7,500 I/Os Block Nested Loop cost = ? I/Os R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Block Nested Loops Algorithm: COST= M+⌈M/(B−1)⌉*N Read in B−1 pages of R Read in a page of S Print matching tuples COST= M+⌈M/(B−1)⌉*N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Sort-Merge vs. Block Nested Loop Assuming 100 buffer pages, Reserves and Sailors can be sorted in 2 passes Sort-Merge cost = 7,500 I/Os Block Nested Loop cost = 6,500 I/Os R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Sort-Merge vs. Block Nested Loop Assuming 35 buffer pages, Reserves and Sailors can be sorted in 2 passes Sort-Merge cost = 7,500 I/Os Block Nested Loop cost = 15,500 I/Os R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

Sort-Merge vs. Block Nested Loop Assuming 300 buffer pages, Reserves and Sailors can be sorted in 2 passes Sort-Merge cost = 7,500 I/Os Block Nested Loop cost = 2,500 I/Os R(A,..) S(A, ......) M pages, m tuples N pages, n tuples The Block Nested Loops Join is more sensitive to the buffer size!

Sort-Merge Join Can we do better? COST = 2M*⌈1+logB−1⌈M/B⌉⌉ + 2N*⌈1+logB−1⌈N/B⌉⌉ + M + N R(A,..) S(A, ......) M pages, m tuples N pages, n tuples Can we do better?

Sort-Merge Join If B > √L where L is the number of pages of the larger relation: Using replacement sort (outputs on avg. 2B-sized runs/pass) and combining the merging phases of the sort and the join: COST = 3 (M + N) COST = 3 (1000 + 500) = 4,500 I/Os R(A,..) S(A, ......) M pages, m tuples N pages, n tuples

The Join Operation We will study five join algorithms, two which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Hash Join The join algorithm based on hashing has two phases: Partitioning (also called Building) Phase Probing (also called Matching) Phase Idea: Hash both relations on the join attribute into k partitions, using the same hash function h Premise: R tuples in partition i can join only with S tuples in the same partition i In step 2, the combination of all attributes is used as a key for sorting.

Hash Join: Partitioning Phase Partition both relations using hash function h Two tuples that belong to different partitions are guaranteed not to match B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h B-1 Partitions . . . In step 2, the combination of all attributes is used as a key for sorting.

Hash Join: Probing Phase Read in a partition of R, hash it using h2 (!= h) Scan the corresponding partition of S and search for matches Partitions of R & S Input buffer for Si Hash table for partition Ri (k < B-1 pages) B main memory buffers Disk Output buffer Join Result hash fn h2 In step 2, the combination of all attributes is used as a key for sorting.

Hash Join: Cost Total Cost = 3 (M + N) What is the cost of the partitioning phase? We need to scan R and S, and write them out once Hence, cost is 2M + 2N = 2 (M + N) I/Os What is the cost of the probing phase? We need to scan each partition of R and S once (assuming no partition overflows) Hence, cost is M + N I/Os Total Cost = 3 (M + N) In step 2, the combination of all attributes is used as a key for sorting.

Hash Join: Cost (Cont’d) Total Cost = 3 (M + N) Joining Reserves and Sailors would cost 3 (500 + 1000) = 4,500 I/Os Assuming 10ms per I/O, hash join takes less than 1 minute! This underscores the importance of using a good join algorithm (e.g., Simple NL Join takes ~140 hours!) In step 2, the combination of all attributes is used as a key for sorting. But, so far we have been assuming that partitions fit in memory!

Memory Requirements and Overflow Handling How can we increase the chances for a given partition in the probing phase to fit in memory? Maximize the number of partitions If we partition R (or S) into k partitions, what would be the size of each partition (in terms of B)? At least k output buffer pages and 1 input buffer page Given B buffer pages, k = B – 1 Hence, the size of an R (or S) partition = M / (B – 1) What is the number of pages in the (in-memory) hash table built during the probing phase per a partition? f * M / (B – 1), where f is a fudge factor In step 2, the combination of all attributes is used as a key for sorting.

Memory Requirements and Overflow Handling What else do we need in the probing phase? A buffer page for scanning the S partition An output buffer page What is a good value of B as such? B > f * M / (B – 1) + 2 Therefore, we need (approx.) What if a partition overflows? Apply the hash join technique recursively (as is the case with the projection operation) In step 2, the combination of all attributes is used as a key for sorting.

Hash Join vs. Sort-Merge Join If (M is the # of pages in the smaller relation) and we assume uniform partitioning, the cost of hash join is 3(M+N) I/Os If (N is the # of pages in the larger relation), the cost of sort-merge join is 3(M+N) I/Os In step 2, the combination of all attributes is used as a key for sorting. Which algorithm to use, hash join or sort-merge join?

Hash Join vs. Sort-Merge Join If the available number of buffer pages falls between and , hash join is preferred (why?) Hash Join shown to be highly parallelizable (beyond the scope of the class) Hash join is sensitive to data skew while sort-merge join is not Results are sorted after applying sort-merge join (may help “upstream” operators) Sort-merge join goes fast if one of the input relations is already sorted In step 2, the combination of all attributes is used as a key for sorting.

The Join Operation We will study five join algorithms, two of which enumerate the cross-product and three which do not Join algorithms which enumerate the cross-product: Simple Nested Loops Join Block Nested Loops Join Join algorithms which do not enumerate the cross-product: Index Nested Loops Join Sort-Merge Join Hash Join

Next Class Queries Query Optimization and Execution Continue… Relational Operators Transaction Manager Files and Access Methods Recovery Manager Buffer Management Lock Manager Disk Space Management DB