Presentation is loading. Please wait.

Presentation is loading. Please wait.

1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational.

Similar presentations


Presentation on theme: "1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational."— Presentation transcript:

1 1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational Operations Chapter 14 Focus on Join Algorithms

2 2  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Relational Operations  Several algorithms exist for each “logic operator”:  Selection ( ) Selects a subset of rows from relation.  Projection ( ) Deletes unwanted columns from relation.  Join ( ) Allows us to combine two relations.  Set-difference ( ) Tuples in reln. 1, but not in reln. 2.  Union ( ) Tuples in reln. 1 and in reln. 2.  Aggregation ( SUM, MIN, etc.) and GROUP BY  We will focus as example on the JOIN operator.

3 3  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Schema for Examples  Similar to old schema; rname added for variations.  Reserves:  Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.  Sailors:  Each tuple is 50 bytes long, 80 tuples per page, 500 pages. Sailors ( sid : integer, sname : string, rating : integer, age : real) Reserves ( sid : integer, bid : integer, day : dates, rname : string)

4 4  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Equality Joins With One Join Column  In algebra: R S. Common! Must be carefully optimized.  R S is large; so R S followed by a selection is inefficient.  Assume:  M tuples in R, p R tuples per page, N tuples in S, p S tuples per page.  In our examples, R is Reserves and S is Sailors.  We will consider more complex join conditions later.  Cost metric : # of I/Os. We will ignore output costs. SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid=S1.sid

5 5  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Typical Choices  Nested Loops Join  Simple Nested Loops Join: Tuple-oriented, Page-oriented  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join

6 6  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Simple Nested Loops Join  Tuple-oriented: For each tuple in outer relation R, we scan inner relation S.  Cost: M + p R * M * N = 1000 + 100*1000*500 I/Os.  Page-oriented: For each page of R, get each page of S, and write out matching pairs of tuples, where r is in R-page and S is in S-page.  Cost: M + M*N = 1000 + 1000*500 I/Os  If smaller relation (S) is outer, cost = 500 + 500*1000 foreach tuple r in R do foreach tuple s in S do if r i == s j then add to result RS

7 7  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Block Nested Loops Join  One page as input buffer for scanning inner S,  one page as the output buffer,  remaining pages to hold ``block’’ of outer R.  For each matching tuple r in R-block, s in S-page, add to result.  Then read next R-block, scan S again. Etc.  Find matching tuple?  Use in-memory hashing.... R & S Hash table for block of R (k < B-1 pages) Input buffer for S Output buffer... Join Result RS

8 8  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Examples of Block Nested Loops  Cost: Scan of outer + #outer blocks * scan of inner  #outer blocks =  With Reserves (R) as outer, and 100 pages of R as block:  Cost of scanning R is 1000 I/Os; a total of 10 blocks.  Per block of R, we scan Sailors (S); 10*500 I/Os.  E.g., If a block is 90 pages of R, we would scan S 12 times.  With 100-page block of Sailors as outer:  Cost of scanning S is 500 I/Os; a total of 5 blocks.  Per block of S, we scan Reserves; 5*1000 I/Os.  Optimizations?  With sequential reads considered, analysis changes: may be best to divide buffers evenly between R and S.  Double buffering would also be suitable.

9 9  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Index Nested Loops Join  An index on join column of one relation (say S), use S as inner and exploit the index.  Cost: M + ( (M*p R ) * cost of finding matching S tuples)  For each R tuple, cost of probing S index is :  about 1.2 for hash index,  2-4 for B+ tree.  Cost of retrieving S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering:  Clustered index: 1 I/O (typical),  Unclustered: up to 1 I/O per matching S tuple. foreach tuple r in R do foreach tuple s in S where r i == s j do add to result

10 10  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Examples of Index Nested Loops  Hash-index (Alt. 2) on sid of Sailors (as inner):  Scan Reserves: 1000 page I/Os, 100*1000 tuples.  For each Reserves tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get to the one matching Sailors tuple. Total: 220,000 I/Os.  Hash-index (Alt. 2) on sid of Reserves (as inner):  Scan Sailors: 500 page I/Os, 80*500 tuples.  For each Sailors tuple: 1.2 I/Os to find index page with data entries + cost of retrieving matching Reserves tuples. Assuming uniform distribution, 2.5 reservations per sailor (100*1000)/(80*500). Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is clustered.

11 11  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Simple vs. Index Nested Loops Join  Assume: M Pages in R, p R tuples per page, N Pages in S, p S tuples per page, B Buffer Pages.  Nested Loops Join  Simple Nested Loops Join Tuple-oriented: M + p R * M * N Page-oriented: M + M * N Smaller as outer helps.  Block Nested Loops Join M + N*  M/(B-2)  Dividing buffer evenly between R and S helps.  Index Nested Loops Join M + ( (M*p R ) * cost of finding matching S tuples) cost of finding matching S tuples = cost of Probe + cost of retrieving  Even with unclustered index, if number of matching inner tuples for each outer tuple is small, cost of INLJ is much smaller than SNLJ.

12 12  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Join: Sort-Merge (R S)  Merge on Join Column: Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple. At this point, all R tuples with same value in Ri ( current R group ) and all S tuples with same value in Sj ( current S group ) match ; So output for all pairs of such tuples. Then resume scanning R and S (as above)  R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.) i=j (1). Sort R and S on the join column. (2). Scan R and S to do a ``merge’’ on join col. (3). Output result tuples.

13 13  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Example of Sort-Merge Join

14 14  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Example of Sort-Merge Join  Average Cost:  ~ In practice, roughly linear in M and N  M log M + N log N + (M+N)  Best case: ?  Worst case: ?

15 15  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Problem of Sort-Merge Join  Average Cost: M log M + N log N + (M+N)  The cost of scanning, M+N  Could be M*N  Many pages in R in same partition. ( Worst, All of them). The pages for this partition in S don’t fit into RAM. Re-scan S is needed. Multiple scan S is expensive!  Worst Case: M*N  Can guarantee M+N if key-FK join, or no duplicates. R S

16 16  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Comparison with Sort-Merge Join  Average Cost: O(M log M + N log N + (M+N))  Assume B = {35, 100, 300}; and R = 1000 pages, S = 500 pages  Sort-Merge Join: both R and S can be sorted in 2 passes, logM = log N = 2 total join cost: 2*2*1000 + 2*2*500 + (1000 + 500) = 7500.  Block Nested Loops Join: 2500 ~ 15000

17 17  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Refinement of Sort-Merge Join  IDEA : Combine the merging phases when sorting R ( or S) with the merging in join algorithm.  With B >, where L is the size of the larger relation. The number of runs per relation is less than.  Allocate 1 page per run of each relation, and `merge’ while checking the join condition.  Cost: (read+write R and S in Pass 0) + (read R and S in merging pass and join on fly) (+ writing of result tuples).  In example, cost goes down from 7500 to 4500 I/Os.  In practice, cost of sort-merge join, like the cost of external sorting, is linear.

18 Hash-Join Partition both relations using same hash fn h : R tuples in partition i will only match S tuples in partition i.  Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches. Partitions of R & S Input buffer for Si Hash table for partition Ri (k < B-1 pages) B main memory buffers Disk Output buffer Disk Join Result hash fn h2 B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h B-1 Partitions 1 2 B-1...

19 19  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Cost of Hash-Join  In partitioning phase, read+write both relations:  2(M+N).  In matching phase, read both relations:  M+N I/Os.  Total : 3(M+N)  E.g., total of 4500 I/Os in our running example.

20 20  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Observation on Hash-Join  Memory Requirement  Partition fit into available memory?  Assuming B buffer pages. #partitions k <= B-1 (why?),  Assuming uniformly sized partitions, and maximizing k, we get: k= B-1, and M/(B-1) in-memory hash table to speed up the matching of tuples, a little more memory is needed: f * M/(B-1) f is fudge factor used to capture the small increase in size between the partition and a hash table for partition.  Probing phase, one for S, one for output, B> f*M/(B-1)+2 for hash join to perform well.

21 21  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Observation on Hash Join  Overflow  If the hash function does not partition uniformly, one or more R partitions may not fit in memory.  Significantly degrade the performance.  Can apply hash-join technique recursively to do the join of this overflow R-partition with corresponding S-partition.

22 22  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Hybrid Hash-Join  Minimum memory for Hash Join: B > f(m/k)  If more memory available, use it!  Extra space: B – (k+1) > f(M/k)  How? Hybrid Hash-Join.  Build an in-memory hash table for the first partition of R during the partitioning phase.  Join the remaining as Hash Join.  Saving: avoid writing the first partitions of R and S to disk. E.g. R = 500 pages, S=1000 pages B = 300 partition phase: scan R and write one partition out. 500 + 250 scan S and write out one partition. 1000 + 500 probing phase: only second partition is scaned: 250+500 Total = 3000 ( Hash Join will take 4500)

23 23  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Hash-Join vs. Block Nested Join  If hash table for entire smaller relation fits in memory, equal.  Otherwise, Hash-Join is more effective. H H H H H S1S2S3S4S5 R1 R2 R3 R4 R5 Block Nested Join Hash Join

24 24  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Hash-Join vs. Sort-Merge Join  Sort-Merge Join vs. Hash Join:  Given a certain amount of memory: B > N is the larger relation size. both have a cost of 3(M+N) I/Os.  If partition is not uniformly sized (data skew); Sort-Merge less sensitive; result is sorted.  Hash Join superior if relation sizes differ greatly; B is between and.

25 25  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join General Join Conditions  Equalities over several attributes  (e.g., R.sid=S.sid AND R.rname=S.sname ):  INL-Join : build index on (if S is inner); or use existing indexes on sid or sname.  SM-Join and H-Join : sort/partition on combination of the two join columns.  Inequality conditions  (e.g., R.rname < S.sname ):  INL-Join: need (clustered!) B+ tree index. Range probes on inner; # matches likely to be much higher than for equality joins.  Hash Join, Sort Merge Join not applicable.  Block NL quite likely to be the best join method here.

26 26  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Conclusion Not one method wins ! Optimizer must assess situation to select best possible candidate


Download ppt "1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational."

Similar presentations


Ads by Google