Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna.

Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna

Introduction Hashing is done if the data is too big to store in main memory buffers. – Hash all the tuples of the argument(s) using an appropriate hash key. – For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation have the same hash value. – This reduces the size of the operand(s) by a factor equal to the number of buckets.

Partitioning Relations by Hashing Algorithm: initialize M-1 buckets using M-1 empty buffers; FOR each block b of relation R DO BEGIN read block b into the Mth buffer; FOR each tuple t in b DO BEGIN IF the buffer for bucket h(t) has no room for t THEN BEGIN copy the buffer t o disk; initialize a new empty block in that buffer; END; copy t to the buffer for bucket h(t); END ; FOR each bucket DO IF the buffer for this bucket is not empty THEN write the buffer to disk;

Duplicate Elimination For the operation δ(R) hash R to M-1 Buckets. (Note that two copies of the same tuple t will hash to the same bucket) Do duplicate elimination on each bucket R i independently, using one-pass algorithm The result is the union of δ(R i ), where R i is the portion of R that hashes to the ith bucket

Requirements Number of disk I/O's: 3*B(R) – B(R) < M(M-1), only then the two-pass, hash-based algorithm will work In order for this to work, we need: – hash function h evenly distributes the tuples among the buckets – each bucket R i fits in main memory (to allow the one-pass algorithm) – i.e., B(R) ≤ M 2

Grouping and Aggregation Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes (Note: all tuples in the same group end up in the same bucket) Use the one-pass algorithm to process each bucket independently Uses 3*B(R) disk I/O's, requires B(R) ≤ M 2

Union, Intersection, and Difference For binary operation we use the same hash function to hash tuples of both arguments. R U S we hash both R and S to M-1 R ∩ S we hash both R and S to 2(M-1) R-S we hash both R and S to 2(M-1) Requires 3(B(R)+B(S)) disk I/O’s. Two pass hash based algorithm requires min(B(R)+B(S))≤ M2

Hash-Join Algorithm Use same hash function for both relations; hash function should depend only on the join attributes Hash R to M-1 buckets R 1, R 2, …, R M-1 Hash S to M-1 buckets S 1, S 2, …, S M-1 Do one-pass join of R i and S i, for all i 3*(B(R) + B(S)) disk I/O's; min(B(R),B(S)) ≤ M 2

Sort based Vs Hash based For binary operations, hash-based only limits size to min of arguments, not sum Sort-based can produce output in sorted order, which can be helpful Hash-based depends on buckets being of equal size Sort-based algorithms can experience reduced rotational latency or seek time

Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna.

Similar presentations

Presentation on theme: "Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna.

Similar presentations

Presentation on theme: "Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna."— Presentation transcript:

Similar presentations

About project

Feedback