Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.

Joins With One Join Column
    SELECT *
    FROM Reserves R1, Sailors S1
    WHERE R1.sid = S1.sid
• In algebra: R ⋈ S. Common! Must be carefully optimized. R × S is large; so, R × S followed by a selection is inefficient.
• Assume: M pages in R, p_R tuples per page, N pages in S, p_S tuples per page.
  – In our examples, R is Reserves and S is Sailors.
• Cost metric: # of I/Os. Ignore output costs.

Cross-Product Based Joins
• Apply selection on all pairs of tuples.
• Family of “nested-loops” joins.

Tuple Nested Loops Join
    foreach tuple r in R do
        foreach tuple s in S do
            if r_i == s_j then add <r, s> to result
  – Cost: M + p_R * M * N = 1000 + 100*1000*500 I/Os.
• How can you use locality to improve performance?
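The loop on this slide can be sketched in Python as follows. This is only an illustration: relations are modeled as in-memory lists of tuples, the sample Reserves and Sailors rows are made up, and the cost helper simply evaluates the slide's formula with M = 1000, p_R = 100, N = 500.

```python
def tuple_nested_loops_join(R, S, i, j):
    """Join R and S on R[i] == S[j] by testing every pair of tuples."""
    result = []
    for r in R:              # outer relation, one tuple at a time
        for s in S:          # full scan of the inner relation per outer tuple
            if r[i] == s[j]:
                result.append(r + s)
    return result


def tuple_nl_cost(M, p_R, N):
    """I/O cost from the slide: scan R once, scan S once per R tuple."""
    return M + p_R * M * N


# Hypothetical sample rows: Reserves(sid, bid) and Sailors(sid, name).
reserves = [(22, 101), (58, 103), (22, 102)]
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty")]

joined = tuple_nested_loops_join(reserves, sailors, 0, 0)
cost = tuple_nl_cost(M=1000, p_R=100, N=500)
```

With the slide's numbers the formula evaluates to roughly 50 million I/Os, which is why the following slides look for ways to scan S less often.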

Page-Oriented NL Join
    foreach page p_R in R do
        foreach page p_S in S do
            foreach tuple r in p_R do
                foreach tuple s in p_S do
                    if r_i == s_j then add <r, s> to result
  – Cost: M + M*N = 1000 + 1000*500
  – If smaller relation (S) is outer, cost = 500 + 500*1000
  – More improvements?
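A minimal sketch of the page-oriented variant, assuming each relation is a list of pages and each page is a list of tuples. The `io` counter is an illustrative stand-in for page reads, and the tiny sample pages are invented.

```python
def page_nested_loops_join(outer_pages, inner_pages, i, j):
    """Join on outer[i] == inner[j], counting one I/O per page read."""
    result = []
    io = 0
    for page_o in outer_pages:
        io += 1                       # read one page of the outer relation
        for page_i in inner_pages:
            io += 1                   # read one page of the inner relation
            for r in page_o:
                for s in page_i:
                    if r[i] == s[j]:
                        result.append(r + s)
    return result, io


def page_nl_cost(outer, inner):
    """Formula from the slide: outer + outer * inner page reads."""
    return outer + outer * inner


# Hypothetical data: R has 3 pages, S has 2 pages, 2 tuples per page.
R_pages = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "f")]]
S_pages = [[(2, "x"), (4, "y")], [(6, "z"), (7, "w")]]

rows_r_outer, io_r_outer = page_nested_loops_join(R_pages, S_pages, 0, 0)
rows_s_outer, io_s_outer = page_nested_loops_join(S_pages, R_pages, 0, 0)
```

Putting the smaller relation on the outside saves only the difference in the first term, matching the slide's 501,000 versus 500,500 I/Os.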

Block Nested Loops Join
• Use one page as an input buffer for scanning the inner S, one page as the output buffer, and use all remaining pages to hold a “chunk” of outer R.
  – For each matching tuple r in R-chunk, s in S-page, add <r, s> to result. Then read next R-chunk, scan S, etc.
[Figure: R & S; hash table for chunk of R (k < B-1 pages); input buffer for S; output buffer; join result.]
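The chunked scan can be sketched as below, under the assumption that pages are Python lists and that a dictionary stands in for the in-memory hash table on the R chunk shown in the figure. The parameter `chunk_pages` and the sample data are illustrative.

```python
from collections import defaultdict


def block_nested_loops_join(R_pages, S_pages, i, j, chunk_pages):
    """Join on R[i] == S[j], scanning S once per chunk of R."""
    result = []
    s_scans = 0
    for start in range(0, len(R_pages), chunk_pages):
        # Load the next chunk of the outer relation into a hash table.
        table = defaultdict(list)
        for page in R_pages[start:start + chunk_pages]:
            for r in page:
                table[r[i]].append(r)
        # One full scan of the inner relation, one page at a time.
        s_scans += 1
        for page in S_pages:
            for s in page:
                for r in table.get(s[j], []):
                    result.append(r + s)
    return result, s_scans


# Hypothetical data: 3 pages of R, 2 pages of S, chunks of 2 pages.
R_pages = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "f")]]
S_pages = [[(2, "x"), (4, "y")], [(6, "z"), (7, "w")]]

rows, scans = block_nested_loops_join(R_pages, S_pages, 0, 0, chunk_pages=2)
```

Here S is scanned twice (once per chunk) rather than three times (once per page of R), which is the whole point of holding a larger chunk of the outer relation in memory.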

Examples of Block Nested Loops
• Cost: Scan of outer + #outer chunks * scan of inner
  – #outer chunks = ⌈# of pages of outer / chunk size⌉
• With Reserves (R) as outer, and 100 pages per chunk:
  – Cost of scanning R is 1000 I/Os; a total of 10 chunks.
  – Per chunk of R, we scan Sailors (S); 10*500 I/Os.
  – If space for just 90 pages of R, we would scan S 12 times.
• With 100-page chunk of Sailors as outer:
  – Cost of scanning S is 500 I/Os; a total of 5 chunks.
  – Per chunk of S, we scan Reserves; 5*1000 I/Os.
• How does one exploit sequential I/O?
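The arithmetic in these examples can be checked with a small helper. It assumes the number of outer chunks is the outer page count divided by the chunk size, rounded up, which is what the 10, 12 and 5 chunk counts on the slide imply.

```python
def block_nl_cost(outer_pages, inner_pages, chunk_pages):
    """Scan of outer + (#outer chunks) * scan of inner, in page I/Os."""
    chunks = (outer_pages + chunk_pages - 1) // chunk_pages  # ceiling division
    return outer_pages + chunks * inner_pages


reserves_outer_100 = block_nl_cost(1000, 500, 100)  # 10 chunks
reserves_outer_90 = block_nl_cost(1000, 500, 90)    # 12 chunks
sailors_outer_100 = block_nl_cost(500, 1000, 100)   # 5 chunks
```

The totals come to 6000, 7000 and 5500 I/Os respectively, so with equal chunk sizes the smaller relation is still the better outer.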

Smarter than Cross-Products
• Cross-product based joins necessary!
• Improvements for special cases?
  – Simple conditions (X > Y)
      ??
  – Equality conditions (X = Y)
      ??

Index Nested Loops Join
    foreach tuple r in R do
        foreach tuple s in S where r_i == s_j do
            add <r, s> to result
• Index on the join column of one relation? Make it the inner relation, and use it!
  – Cost: M + ((M * p_R) * cost of finding matching S tuples)
• Cost: For each R tuple, an “index probe”.
• Clustered index: 1 I/O (typical); unclustered: up to 1 I/O per matching S tuple.
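A rough sketch of the idea, assuming a Python dictionary stands in for the index on the join column of S. The helper names and the per-probe cost of 3 I/Os are hypothetical, since the slide leaves the probe cost open.

```python
from collections import defaultdict


def build_index(S, j):
    """Stand-in for an existing index on column j of the inner relation."""
    index = defaultdict(list)
    for s in S:
        index[s[j]].append(s)
    return index


def index_nested_loops_join(R, S_index, i):
    """For each outer tuple, probe the index instead of scanning S."""
    result = []
    for r in R:
        for s in S_index.get(r[i], []):
            result.append(r + s)
    return result


def index_nl_cost(M, p_R, probe_cost):
    """Slide formula: scan R, plus one probe per R tuple."""
    return M + (M * p_R) * probe_cost


# Hypothetical sample rows: Reserves(sid, bid) and Sailors(sid, name).
reserves = [(22, 101), (58, 103), (22, 102), (99, 104)]
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty")]

joined = index_nested_loops_join(reserves, build_index(sailors, 0), 0)
cost = index_nl_cost(M=1000, p_R=100, probe_cost=3)  # probe_cost is assumed
```

Under that assumed probe cost the join needs about 301,000 I/Os, far below the tuple nested loops figure, because each R tuple touches only the S tuples that can match.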

Partitioned Joins
• Step 1: break R and S into N partitions.
• Step 2: join N corresponding pairs of partitions.
• Why is this efficient? (do the math)
• Focus on equality partitioning.
• What are common partitioning algorithms?
• How are partition sizes distributed?
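One way to "do the math" is to count how many tuple pairs have to be compared. The sketch below uses made-up sizes (1000 R tuples, 500 S tuples, 10 partitions) and compares an evenly split case with a skewed one. It counts comparisons only, not I/Os.

```python
def pairs_without_partitioning(r_tuples, s_tuples):
    """Every R tuple is compared with every S tuple."""
    return r_tuples * s_tuples


def pairs_with_partitioning(r_sizes, s_sizes):
    """Only tuples in corresponding partitions are compared."""
    return sum(r * s for r, s in zip(r_sizes, s_sizes))


n = 10  # hypothetical number of partitions
baseline = pairs_without_partitioning(1000, 500)
uniform = pairs_with_partitioning([100] * n, [50] * n)
skewed = pairs_with_partitioning([910] + [10] * 9, [455] + [5] * 9)
```

With even partitions the work drops by a factor of n. With one dominant partition most of the saving disappears, which is why the distribution of partition sizes matters.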

Sort-Merge Join (R ⋈_{i=j} S)
• Sort R and S on the join column, then scan them to do a “merge” (on join col.), and output result tuples.
  – Mechanics of the algorithm?
• What about skew?
• Cost: M log M + N log N + (M+N)
  – The cost of scanning, M+N, could be M*N (very unlikely!)
  – How about sequential vs random I/O?
• With 35, 100 or 300 buffer pages, both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500.
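One possible answer to the "mechanics" question, sketched in memory with Python's built-in sort standing in for external sorting. The cost helper assumes each sorting pass reads and writes every page once, which reproduces the slide's total of 7500 for two passes.

```python
def sort_merge_join(R, S, i, j):
    """Sort both inputs on the join column, then merge matching groups."""
    R = sorted(R, key=lambda t: t[i])
    S = sorted(S, key=lambda t: t[j])
    result = []
    a = b = 0
    while a < len(R) and b < len(S):
        if R[a][i] < S[b][j]:
            a += 1
        elif R[a][i] > S[b][j]:
            b += 1
        else:
            key = R[a][i]
            # Find the end of the group of S tuples with this key.
            b_end = b
            while b_end < len(S) and S[b_end][j] == key:
                b_end += 1
            # Pair every R tuple with this key against the whole S group.
            while a < len(R) and R[a][i] == key:
                for s in S[b:b_end]:
                    result.append(R[a] + s)
                a += 1
            b = b_end
    return result


def sort_merge_cost(M, N, passes):
    """Each sort pass reads and writes every page; the merge reads both once."""
    return 2 * passes * (M + N) + (M + N)


# Hypothetical sample rows: Reserves(sid, bid) and Sailors(sid, name).
reserves = [(58, 103), (22, 101), (22, 102), (31, 104)]
sailors = [(31, "lubber"), (22, "dustin"), (58, "rusty"), (71, "zorba")]

joined = sort_merge_join(reserves, sailors, 0, 0)
cost = sort_merge_cost(M=1000, N=500, passes=2)
```

The inner group loop is where the M*N worst case comes from: if every tuple in both relations shares one join value, each R tuple is paired with all of S.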

Refinement of Sort-Merge Join
• We can combine the merging phases in the sorting of R and S with the merging required for the join.
  – Let B > √L, where L is the size of the larger relation.
  – Mechanics?
• In practice, cost of sort-merge join, like the cost of external sorting, is linear.
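A small worked check of the refinement, under two assumptions that should be read as such: the memory condition is taken to be B > √L with L the page count of the larger relation, and the refined cost is taken to be 3(M+N), the figure quoted in the later hash-join comparison (read and write both relations to form runs, then read them once while merging and joining).

```python
import math


def min_buffers_for_refinement(M, N):
    """Smallest integer B with B > sqrt(L), L = pages in the larger relation."""
    L = max(M, N)
    return math.isqrt(L) + 1


def refined_sort_merge_cost(M, N):
    """Write sorted runs (read + write), then merge and join in one read pass."""
    return 3 * (M + N)


buffers = min_buffers_for_refinement(M=1000, N=500)
cost = refined_sort_merge_cost(M=1000, N=500)
```

For Reserves and Sailors this gives 32 buffer pages and 4500 I/Os, compared with 7500 for the unrefined two-pass version.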

Hash-Join
• Partition both relations using hash f’n h: R tuples in partition i will only match S tuples in partition i.
[Figure: partitioning phase. The original relation is read from disk through one INPUT buffer; hash function h routes tuples to B-1 OUTPUT buffers (B main memory buffers in total); partitions 1, 2, ..., B-1 are written to disk.]
• Read in a partition of R, hash it using h2. Scan matching partition of S, search for matches.
[Figure: matching phase. Partitions of R & S are read from disk; the B main memory buffers hold a hash table for partition Ri (k < B-1 pages) built with hash f’n h2, an input buffer for Si, and an output buffer for the join result.]
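The two phases can be sketched in memory as follows. This is an illustration only: Python's `hash` modulo the partition count plays the role of h, a dictionary plays the role of the h2 hash table, and `num_partitions` stands in for the B-1 output buffers.

```python
from collections import defaultdict


def hash_join(R, S, i, j, num_partitions):
    """Partition both inputs with h, then join matching partitions."""

    def h(key):
        return hash(key) % num_partitions

    # Partitioning phase: tuples with equal join values land in the same slot.
    R_parts = [[] for _ in range(num_partitions)]
    S_parts = [[] for _ in range(num_partitions)]
    for r in R:
        R_parts[h(r[i])].append(r)
    for s in S:
        S_parts[h(s[j])].append(s)

    # Matching phase: build a table on R_i, then probe it with S_i.
    result = []
    for R_i, S_i in zip(R_parts, S_parts):
        table = defaultdict(list)  # the dict's own hashing acts as h2
        for r in R_i:
            table[r[i]].append(r)
        for s in S_i:
            for r in table.get(s[j], []):
                result.append(r + s)
    return result


# Hypothetical sample rows: Reserves(sid, bid) and Sailors(sid, name).
reserves = [(22, 101), (58, 103), (22, 102), (31, 104)]
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty"), (71, "zorba")]

joined = hash_join(reserves, sailors, 0, 0, num_partitions=3)
```

The output order depends on how tuples fall into partitions, so unlike sort-merge the result is not sorted on the join column.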

Observations on Hash-Join
• #partitions k ≤ B-1, and B-1 > size of largest partition to be held in memory. Assuming uniformly sized partitions, and maximizing k, we get:
  – k = B-1, and M/(B-1) < B-1, i.e., B-1 must be > √M
• If we build an in-memory hash table to speed up the matching of tuples, a little more memory is needed.
• Skew?
• In partitioning phase, read+write both relns; 2(M+N). In matching phase, read both relns; M+N I/Os.
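The cost and memory observations can be checked numerically. The sketch assumes the memory condition is B-1 > √M for the relation whose partitions are held in memory, and it ignores the extra space an in-memory hash table would need.

```python
import math


def hash_join_cost(M, N):
    """Partitioning reads and writes both relations; matching reads both."""
    partitioning = 2 * (M + N)
    matching = M + N
    return partitioning + matching


def min_buffers_for_hash_join(M):
    """Smallest integer B with B-1 > sqrt(M), assuming uniform partitions."""
    return math.isqrt(M) + 2


cost = hash_join_cost(M=1000, N=500)
buffers_if_reserves_built = min_buffers_for_hash_join(1000)
buffers_if_sailors_built = min_buffers_for_hash_join(500)
```

The total is 3(M+N) = 4500 I/Os. Under these assumptions, holding partitions of the smaller relation in memory lowers the buffer requirement from 33 to 24 pages.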

Hash-Join vs Sort-Merge Join
• Given a minimum amount of memory (what is this, for each?), both have a cost of 3(M+N) I/Os.
• Hash Join is superior if relation sizes differ greatly. Also, shown to be highly parallelizable.
• Sort-Merge is less sensitive to data skew; result is sorted.
• Sequential and random I/O in different phases.
