Join Processing in Database Systems with Large Main Memories ACM Transactions on Database Systems Vol. 11, No. 3, Sep 1986 Leonard D. Shapiro Donghui Zhang,


1 Join Processing in Database Systems with Large Main Memories ACM Transactions on Database Systems Vol. 11, No. 3, Sep 1986 Leonard D. Shapiro Donghui Zhang, UC Riverside, May 17, 2002

2 Content: Join Definition; Internal-Memory Joins; External-Memory Model; External-Memory Joins; Performance Comparisons; Conclusions

3 Problem Definition Join operator: given relations R and S, report all pairs (r, s) such that r ∈ R, s ∈ S, and the two records satisfy some given condition. Equi-join: r and s join if a certain attribute of r is equal to a certain attribute of s.

4 Faculty:
FacName    DeptID
Gunopulos  1
Kumar      2
Li         2
Tsotras    1
Stephen    3
Zaniolo    3
Department:
DeptID  DeptName
1       Computer Science
2       Electrical Engineering
3       Mathematics
SELECT FacName, DeptName FROM Faculty, Department WHERE Faculty.DeptID = Department.DeptID

5 Internal-Memory Solutions ({R} denotes the number of records in R) Nested-loop join: compare every record in R with every record in S; cost = {R}*{S}. Sort-merge join: sort R and S, then merge them; cost = O({S}*log{S}) (if {R} < {S}). Hash join: build a hash table for R; for every record in S, probe the hash table; cost = O({S}).
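
The nested-loop and hash variants above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names are made up here, and the data is the Faculty/Department example from the earlier slide.

```python
from collections import defaultdict

# Sample data from the Faculty/Department example slide.
faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

def nested_loop_join(r, s, r_key, s_key):
    """Compare every record of r with every record of s ({R}*{S} comparisons)."""
    return [(a, b) for a in r for b in s if r_key(a) == s_key(b)]

def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then probe it once per record of the other."""
    table = defaultdict(list)
    for b in build:
        table[build_key(b)].append(b)
    return [(b, p) for p in probe for b in table.get(probe_key(p), [])]

# Build on the smaller relation (Department), probe with the larger (Faculty).
pairs = hash_join(department, faculty, lambda d: d[0], lambda f: f[1])
result = [(f[0], d[1]) for d, f in pairs]   # (FacName, DeptName), as in the SQL query
```

Both functions return the same pairs; the difference is only in how many comparisons they need to find them.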

6 External-Memory Model Both relations R and S reside on disk; each disk page holds up to B records; a disk page (block) has to be read into memory before the records in it can be processed; the internal-memory join algorithms need to be extended to the external-memory model; the costs that matter for a join are I/O and CPU.

7 Block-Nested Loop Join
1. for every block in R
2.   scan through S;
3.   join records in the R block with the S records.
I/O: about |R|*|S| page reads (plus |R| reads for R itself), where |R| is the number of blocks R occupies; works well with a small buffer (e.g. two blocks).
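
A minimal sketch of the loop above, with pages modelled as Python lists and a counter standing in for disk reads. The helper names (`paginate`, `block_nested_loop_join`) and the toy data are assumptions made for this illustration. The counter charges one read per R block plus one read per S page in each scan of S.

```python
def paginate(records, b):
    """Split a list of records into pages holding at most b records each."""
    return [records[i:i + b] for i in range(0, len(records), b)]

def block_nested_loop_join(r_pages, s_pages, r_key, s_key):
    """Join page by page, counting page reads as a stand-in for disk I/O."""
    out, page_reads = [], 0
    for r_page in r_pages:            # one read per R block
        page_reads += 1
        for s_page in s_pages:        # a full scan of S for each R block
            page_reads += 1
            for r in r_page:
                for s in s_page:
                    if r_key(r) == s_key(s):
                        out.append((r, s))
    return out, page_reads

r_pages = paginate([(i, f"r{i}") for i in range(6)], 2)       # |R| = 3 pages
s_pages = paginate([(i % 3, f"s{i}") for i in range(8)], 2)   # |S| = 4 pages
matches, reads = block_nested_loop_join(
    r_pages, s_pages, lambda r: r[0], lambda s: s[0])
# reads is |R| + |R|*|S| = 3 + 12 = 15 for this toy input
```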

8 External Sort-Merge Join Extend the internal-memory sort-merge join by changing the sorting algorithm to external-memory merge sort. [Figure: merge sort]

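9 External Sort-Merge Join (cont.) Optimization: omit the final pass of merge sort by pipelining the sort result into the join; if the buffer size is on the order of the square root of the relation size in pages (as in the example below, where 100 = sqrt(10,000)), R and S can be sorted by reading each of them twice; e.g. with page size = 8KB, relations of 10,000 pages (80MB) each, and a buffer of 100 pages (<1MB), two passes are enough.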
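
The two-pass idea can be sketched as follows, under simplifying assumptions: "disk" runs are plain Python lists, memory is measured in records rather than pages, and all function names are hypothetical. Pass 1 produces memory-sized sorted runs; pass 2 merges the runs lazily, so the merged stream is consumed by the join without being written out again.

```python
import heapq
from itertools import groupby

def sorted_runs(records, key, memory_records):
    """Pass 1: cut the input into memory-sized chunks and sort each one (a run)."""
    return [sorted(records[i:i + memory_records], key=key)
            for i in range(0, len(records), memory_records)]

def merged(runs, key):
    """Pass 2: lazily merge all runs; the result is a stream, not a stored file."""
    return heapq.merge(*runs, key=key)

def merge_join(r_sorted, s_sorted, r_key, s_key):
    """Join two key-ordered streams by advancing whichever side is behind."""
    r_groups = groupby(r_sorted, key=r_key)
    s_groups = groupby(s_sorted, key=s_key)
    r_item, s_item = next(r_groups, None), next(s_groups, None)
    while r_item is not None and s_item is not None:
        (rk, rg), (sk, sg) = r_item, s_item
        if rk < sk:
            r_item = next(r_groups, None)
        elif rk > sk:
            s_item = next(s_groups, None)
        else:
            s_list = list(sg)             # records of S sharing this key
            for r in rg:
                for s in s_list:
                    yield (r, s)
            r_item, s_item = next(r_groups, None), next(s_groups, None)

key = lambda t: t[0]
R = [(5, "a"), (1, "b"), (3, "c"), (1, "d"), (4, "e")]
S = [(1, "x"), (4, "y"), (1, "z"), (2, "w")]
joined = list(merge_join(merged(sorted_runs(R, key, 2), key),
                         merged(sorted_runs(S, key, 2), key), key, key))
```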

10 Classic Hash Join
1. Build an in-memory hash table for the smaller relation;
2. For each record in the larger relation, probe the hash table.
Works when the smaller relation R fits in memory. If the smaller relation does not fit in memory, partition it into smaller buckets!

11 Simple Hash Join
1. for each logical bucket j
2.   for each record r in R
3.     if r is in bucket j then
4.       insert r into the hash table;
5.   for each record s in S
6.     if s is in bucket j then
7.       probe the hash table;
Classic hash join is a special case, with one bucket; optimization: write the tuples not in bucket j to disk, so that later passes read only the remaining tuples; works well when memory is large (nearly as large as |R|).
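
A sketch of the simple hash join including the optimization mentioned on the slide: records that do not belong to the current bucket are set aside (standing in for "written to disk") so that each later pass scans fewer records. The function name, the use of Python's built-in `hash` modulo the bucket count, and the in-memory lists are all assumptions of this illustration.

```python
def simple_hash_join(r, s, r_key, s_key, n_buckets):
    """One pass per logical bucket; only bucket j of R is held in the hash table."""
    out = []
    r_rest, s_rest = list(r), list(s)
    for j in range(n_buckets):
        table, r_next, s_next = {}, [], []
        for rec in r_rest:
            if hash(r_key(rec)) % n_buckets == j:
                table.setdefault(r_key(rec), []).append(rec)
            else:
                r_next.append(rec)        # would be written back to disk
        for rec in s_rest:
            if hash(s_key(rec)) % n_buckets == j:
                for match in table.get(s_key(rec), []):
                    out.append((match, rec))
            else:
                s_next.append(rec)        # would be written back to disk
        r_rest, s_rest = r_next, s_next   # later passes see only the leftovers
    return out

key = lambda t: t[0]
R = [(k, f"r{k}") for k in range(10)]
S = [(k % 5, f"s{k}") for k in range(12)]
joined = simple_hash_join(R, S, key, key, n_buckets=3)
```

With `n_buckets=1` the function degenerates to the classic hash join: a single pass that builds one table for all of R.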

12 GRACE Hash Join
1. partition R into n buckets so that each bucket fits in memory;
2. partition S into n buckets;
3. for each bucket j do
4.   for each record r in Rj do
5.     insert r into a hash table;
6.   for each record s in Sj do
7.     probe the hash table.
Works well when memory is small.
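
The two phases of GRACE can be sketched like this. The partitions are ordinary lists standing in for files on disk, and `partition` and `grace_hash_join` are names chosen for this example only.

```python
def partition(records, key, n):
    """Phase 1: distribute records over n buckets by hashing the join key."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

def grace_hash_join(r, s, r_key, s_key, n):
    """Partition both inputs with the same hash, then join bucket by bucket."""
    r_parts = partition(r, r_key, n)      # would be written to disk
    s_parts = partition(s, s_key, n)      # would be written to disk
    out = []
    for r_j, s_j in zip(r_parts, s_parts):
        table = {}
        for rec in r_j:                   # phase 2: build on R_j ...
            table.setdefault(r_key(rec), []).append(rec)
        for rec in s_j:                   # ... and probe with S_j
            for match in table.get(s_key(rec), []):
                out.append((match, rec))
    return out

key = lambda t: t[0]
R = [(k, f"r{k}") for k in range(10)]
S = [(k % 5, f"s{k}") for k in range(12)]
joined = grace_hash_join(R, S, key, key, n=4)
```

Because R and S are split with the same hash function, matching records can only meet inside the same bucket pair, which is why the buckets can be joined independently.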

13 Hybrid Hash Join Hybrid of simple hash join and GRACE; when partitioning R, keep the records of the first bucket in memory as a hash table (typically this means that the first bucket uses more pages in memory; every other partition gets one page as an output buffer); when partitioning S, probe the hash table directly for records of the first bucket; saving: no need to write R1 and S1 to disk or read them back into memory; works well for both large and small memory.
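
A rough sketch of the hybrid scheme, assuming bucket 0 plays the role of the "first bucket" and that lists stand in for the spilled partitions. The function name and the `spilled` counter (records that would have gone to disk) are illustrative additions, not part of the original slides.

```python
def hybrid_hash_join(r, s, r_key, s_key, n):
    """Keep bucket 0 of R in memory while partitioning; spill the other buckets."""
    out, spilled = [], 0
    table = {}                                # in-memory hash table for bucket 0
    r_parts = [[] for _ in range(n)]          # r_parts[0] stays empty
    for rec in r:
        j = hash(r_key(rec)) % n
        if j == 0:
            table.setdefault(r_key(rec), []).append(rec)
        else:
            r_parts[j].append(rec)            # would be written to disk
            spilled += 1
    s_parts = [[] for _ in range(n)]
    for rec in s:
        j = hash(s_key(rec)) % n
        if j == 0:                            # probe immediately, never spilled
            for match in table.get(s_key(rec), []):
                out.append((match, rec))
        else:
            s_parts[j].append(rec)            # would be written to disk
            spilled += 1
    for j in range(1, n):                     # remaining buckets: as in GRACE
        table = {}
        for rec in r_parts[j]:
            table.setdefault(r_key(rec), []).append(rec)
        for rec in s_parts[j]:
            for match in table.get(s_key(rec), []):
                out.append((match, rec))
    return out, spilled

key = lambda t: t[0]
R = [(k, f"r{k}") for k in range(10)]
S = [(k % 5, f"s{k}") for k in range(12)]
joined, spilled = hybrid_hash_join(R, S, key, key, n=3)
```

In this toy run 13 of the 22 input records are spilled; a GRACE-style join would have written all 22.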

14 Handling Partition Overflow Case 1, overflow on disk: an R partition is larger than memory (note: the size of the S partitions does not matter). Solution (a): create small partitions first and combine them before the join; Solution (b): partition recursively. Case 2, overflow in memory: the in-memory hash table of R becomes too large. Solution: revise the partitioning scheme and keep a smaller partition in memory.
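
Solution (b) for Case 1 might look like the sketch below. It assumes non-negative integer join keys and a toy per-level hash (`level_hash`, which picks one digit of the key per recursion level); a real system would use a family of independent hash functions. Only the size of the R side is tested, matching the note that the size of the S partitions does not matter.

```python
def level_hash(k, level, fanout):
    """Toy per-level hash (an assumption of this sketch): the level-th
    base-`fanout` digit of a non-negative integer key."""
    return (k // fanout ** level) % fanout

def split_until_fits(r_part, s_part, key, capacity,
                     fanout=2, level=0, max_level=16):
    """Yield (R_i, S_i) pairs, re-partitioning while R_i exceeds `capacity`."""
    if len(r_part) <= capacity or level == max_level:
        yield r_part, s_part              # fits, or give up (e.g. one hot key)
        return
    r_sub = [[] for _ in range(fanout)]
    s_sub = [[] for _ in range(fanout)]
    for rec in r_part:
        r_sub[level_hash(key(rec), level, fanout)].append(rec)
    for rec in s_part:                    # S follows the same split as R
        s_sub[level_hash(key(rec), level, fanout)].append(rec)
    for r_i, s_i in zip(r_sub, s_sub):
        yield from split_until_fits(r_i, s_i, key, capacity,
                                    fanout, level + 1, max_level)

key = lambda t: t[0]
R = [(k, f"r{k}") for k in range(16)]
S = [(k % 16, f"s{k}") for k in range(40)]
pieces = list(split_until_fits(R, S, key, capacity=4))
```

The `max_level` cut-off is there because no amount of re-hashing can split a partition in which every record has the same key.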

15 Conclusions Addressed the equi-join problem in the external-memory environment; with the decreasing cost of memory, hash-based joins are better than nested-loop and sort-merge joins; proposed three hash-based algorithms (simple hash join, GRACE join and hybrid hash join), of which the hybrid hash join is the best.

16 Hash-based Nested Loop Join This is a hybrid of hash-based and nested-loop join; in pure hash-based joins, we have two steps: first, partition the source relations; next, join each partition separately; hash-based nested-loop join: no need to partition; read some pages of R to fill memory and build a hash table for them, then scan through S; repeat until all of R has been processed.
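
A minimal sketch of this variant, with memory measured in records instead of pages and a counter for the number of scans of S; the function name and toy data are assumptions made for the illustration.

```python
def hash_nested_loop_join(r, s, r_key, s_key, memory_records):
    """Load R one memory-sized chunk at a time; scan S once per chunk."""
    out, s_scans = [], 0
    for i in range(0, len(r), memory_records):
        table = {}
        for rec in r[i:i + memory_records]:   # fill memory with part of R
            table.setdefault(r_key(rec), []).append(rec)
        s_scans += 1
        for rec in s:                         # probe with all of S
            for match in table.get(s_key(rec), []):
                out.append((match, rec))
    return out, s_scans

key = lambda t: t[0]
R = [(k, f"r{k}") for k in range(10)]
S = [(k % 5, f"s{k}") for k in range(12)]
joined, s_scans = hash_nested_loop_join(R, S, key, key, memory_records=4)
```

The number of scans of S is the number of memory-sized chunks of R (three here), so the approach degrades gradually as R outgrows memory.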

