CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More.

CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More

6/8/20162 Copyright notice Materials and lecture notes in this course are adapted from various sources, including the authors of the textbook and references, Internet, etc. The Copyright belong to the original authors. Thanks!

Hashed Files Hashing for disk files is called External Hashing The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1,..., bucket M-1. –Typically, a bucket corresponds to one (or a fixed number of) disk block. One of the file fields is designated to be the hash key of the file. The record with hash key value K is stored in bucket i, where i=h(K), and h is the hashing function. Search is very efficient on the hash key. Collisions occur when a new record hashes to a bucket that is already full. –An overflow file is kept for storing such records. –Overflow records that hash to each bucket can be linked together.

CPSC 8620Notes 64 key  h(key) Hashing...... Buckets (typically 1 disk block)

CPSC 8620Notes 65...... Two alternatives records...... (1) key  h(key)

CPSC 8620Notes 66 (2) key  h(key) Index record key 1 Two alternatives

CPSC 8620Notes 67 (2) key  h(key) Index record key 1 Two alternatives Alt (2) for “secondary” search key

CPSC 8620Notes 68 Example hash function Key = ‘x 1 x 2 … x n ’ n byte character string Have b buckets h: add x 1 + x 2 + ….. x n – compute sum modulo b

CPSC 8620Notes 69  This may not be best function …  Read Knuth Vol. 3 if you really need to select a good function.

CPSC 8620Notes 610 Next: example to illustrate inserts, overflows, deletes h(K)

CPSC 8620Notes 611 EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 01230123

CPSC 8620Notes 612 EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 01230123 d a c b h(e) = 1

CPSC 8620Notes 613 EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 01230123 d a c b h(e) = 1 e

CPSC 8620Notes 614 01230123 a b c e d EXAMPLE: deletion Delete: e f f g

CPSC 8620Notes 615 01230123 a b c e d EXAMPLE: deletion Delete: e f f g maybe move “g” up c

CPSC 8620Notes 616 01230123 a b c e d EXAMPLE: deletion Delete: e f f g maybe move “g” up c d

Overflows There are numerous methods for collision resolution, including the following: –Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found. –Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location. –Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.

CPSC 8620Notes 618 Rule of thumb: Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit

CPSC 8620Notes 619 Rule of thumb: Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit If < 50%, wasting space If > 80%, overflows significant depends on how good hash function is & on # keys/bucket

Hashed Files The hash function h should distribute the records uniformly among the buckets –Otherwise, search time will be increased because many overflow records will exist. Main disadvantages of static external hashing: –Fixed number of buckets M is a problem if the number of records in the file grows or shrinks. –Ordered access on the hash key is quite inefficient (requires sorting the records).

CPSC 8620Notes 621 How do we cope with growth? Overflows and reorganizations Dynamic hashing

CPSC 8620Notes 622 How do we cope with growth? Overflows and reorganizations Dynamic hashing Extensible Linear

Dynamic And Extensible Hashed Files Dynamic and Extendible Hashing Techniques –Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records. –These techniques include the following: dynamic hashing, extensible hashing, and linear hashing. Both dynamic and extensible hashing use the binary representation of the hash value h(K) in order to access a directory. –In dynamic hashing the directory is a binary tree. –In extensible hashing the directory is an array of size 2 d where d is called the global depth. Records are distributed among buckets based on the values of the leading bits in their hash values

CPSC 8620Notes 624 Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(K)  use i  grows over time…. 00110101

CPSC 8620Notes 625 (b) Use directory h(K)[i ] to bucket............

CPSC 8620Notes 626 Example: h(k) is 4 bits; 2 keys/bucket i = 1 1 1 0001 1001 1100 Insert 1010

CPSC 8620Notes 627 Example: h(k) is 4 bits; 2 keys/bucket i = 1 1 1 0001 1001 1100 Insert 1010 1 1100 1010

CPSC 8620Notes 628 Example: h(k) is 4 bits; 2 keys/bucket i = 1 1 1 0001 1001 1100 Insert 1010 1 1100 1010 New directory 2 00 01 10 11 i = 2 2

CPSC 8620Notes 629 1 0001 2 1001 1010 2 1100 Insert: 0111 0000 00 01 10 11 2 i = Example continued

CPSC 8620Notes 630 1 0001 2 1001 1010 2 1100 Insert: 0111 0000 00 01 10 11 2 i = Example continued 0111 0000 0111 0001

CPSC 8620Notes 631 1 0001 2 1001 1010 2 1100 Insert: 0111 0000 00 01 10 11 2 i = Example continued 0111 0000 0111 0001 2 2

CPSC 8620Notes 632 00 01 10 11 2 i = 2 1001 1010 2 1100 2 0111 2 0000 0001 Insert: 1001 Example continued

CPSC 8620Notes 633 00 01 10 11 2 i = 2 1001 1010 2 1100 2 0111 2 0000 0001 Insert: 1001 Example continued 1001 1010

CPSC 8620Notes 634 00 01 10 11 2 i = 2 1001 1010 2 1100 2 0111 2 0000 0001 Insert: 1001 Example continued 1001 1010 000 001 010 011 100 101 110 111 3 i = 3 3

CPSC 8620Notes 635 Extensible hashing: deletion No merging of blocks Merge blocks and cut directory if possible (Reverse insert procedure)

CPSC 8620Notes 636 Deletion example: Run thru insert example in reverse!

CPSC 8620Notes 637 Note: Still need overflow chains Example: many records with duplicate keys 1 1101 1100 22 insert 1100 1100 if we split:

CPSC 8620Notes 638 Solution: overflow chains 1 1101 1100 1 insert 1100 add overflow block: 1100 1101

CPSC 8620Notes 639 Extensible hashing Can handle growing files - with less wasted space - with no full reorganizations Summary +

CPSC 8620Notes 640 Extensible hashing Can handle growing files - with less wasted space - with no full reorganizations Summary + Indirection (Not bad if directory in memory) Directory doubles in size - -

Dynamic Hashing In dynamic hashing the addresses of the buckets were either the n high-order bits or n − 1 high-order bits, depending on the total number of keys belonging to the respective bucket. Dynamic hashing maintains a tree-structured directory with two types of nodes: –Internal nodes that have two pointers—the left pointer corresponding to the 0 bit (in the hashed address) and a right pointer corresponding to the 1 bit. –Leaf nodes—these hold a pointer to the actual bucket with records.

Dynamic Hashing

CPSC 8620Notes 643 Linear hashing Another dynamic hashing scheme Two ideas: (a) Use i low order bits of hash 01110101 grows b i

CPSC 8620Notes 644 Linear hashing Another dynamic hashing scheme Two ideas: (a) Use i low order bits of hash 01110101 grows b i (b) File grows linearly

CPSC 8620Notes 645 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets

CPSC 8620Notes 646 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets If h(k)[i ]  m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2 i -1 Rule

CPSC 8620Notes 647 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets If h(k)[i ]  m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2 i -1 Rule insert 0101

CPSC 8620Notes 648 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets If h(k)[i ]  m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2 i -1 Rule 0101 can have overflow chains! insert 0101

CPSC 8620Notes 649 Note In textbook, n is used instead of m n=m+1 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets n=10

CPSC 8620Notes 650 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets

CPSC 8620Notes 651 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets 10 1010

CPSC 8620Notes 652 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets 10 1010 0101 insert 0101

CPSC 8620Notes 653 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets 10 1010 0101 insert 0101 11

CPSC 8620Notes 654 Example b=4 bits, i =2, 2 keys/bucket 00 01 1011 0101 1111 0000 1010 m = 01 (max used block) Future growth buckets 10 1010 0101 insert 0101 11 1111 0101

CPSC 8620Notes 655 Example Continued: How to grow beyond this? 00 01 1011 111110100101 0000 m = 11 (max used block) i = 2...

CPSC 8620Notes 656 Example Continued: How to grow beyond this? 00 01 1011 111110100101 0000 m = 11 (max used block) i = 2 0000 100 101 110 111 3...

CPSC 8620Notes 657 Example Continued: How to grow beyond this? 00 01 1011 111110100101 0000 m = 11 (max used block) i = 2 0000 100 101 110 111 3... 100

CPSC 8620Notes 658 Example Continued: How to grow beyond this? 00 01 1011 111110100101 0000 m = 11 (max used block) i = 2 0000 100 101 110 111 3... 100 101 0101

CPSC 8620Notes 659  When do we expand file? Keep track of: # used slots total # of slots = U

CPSC 8620Notes 660 If U > threshold then increase m (and maybe i )  When do we expand file? Keep track of: # used slots total # of slots = U

CPSC 8620Notes 661 Linear Hashing Can handle growing files - with less wasted space - with no full reorganizations No indirection like extensible hashing Summary + + Can still have overflow chains -

CPSC 8620Notes 662 Example: BAD CASE Very full Very emptyNeed to move m here… Would waste space...

CPSC 8620Notes 663 Hashing - How it works - Dynamic hashing - Extensible - Linear Summary

CPSC 8620Notes 664 Next: Indexing vs Hashing Index definition in SQL Multiple key access

CPSC 8620Notes 665 Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Indexing vs Hashing

CPSC 8620Notes 666 INDEXING (Including B Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 Indexing vs Hashing

CPSC 8620Notes 667 Index definition in SQL Create index name on rel (attr) Create unique index name on rel (attr) defines candidate key Drop INDEX name

CPSC 8620Notes 668 CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...)... at least in SQL... Note

CPSC 8620Notes 669 ATTRIBUTE LIST  MULTIKEY INDEX (next) e.g., CREATE INDEX foo ON R(A,B,C) Note

CPSC 8620Notes 670 Motivation: Find records where DEPT = “Toy” AND SAL > 50k Multi-key Index

CPSC 8620Notes 671 Strategy I: Use one index, say Dept. Get all Dept = “Toy” records and check their salary I1I1

CPSC 8620Notes 672 Use 2 Indexes; Manipulate Pointers ToySal > 50k Strategy II:

CPSC 8620Notes 673 Multiple Key Index One idea: Strategy III: I1I1 I2I2 I3I3

CPSC 8620Notes 674 Example Record Dept Index Salary Index Name=Joe DEPT=Sales SAL=15k Art Sales Toy 10k 15k 17k 21k 12k 15k 19k

CPSC 8620Notes 675 For which queries is this index good? Find RECs Dept = “Sales” SAL=20k Find RECs Dept = “Sales” SAL > 20k Find RECs Dept = “Sales” Find RECs SAL = 20k

CPSC 8620Notes 676 Interesting application: Geographic Data DATA: x y...

CPSC 8620Notes 677 Queries: What city is at ? What is within 5 miles from ? Which is closest point to ?

CPSC 8620Notes 678 h n b i a c o d Example e g f m l k j

CPSC 8620Notes 679 h n b i a c o d 10 20 Example e g f m l k j

CPSC 8620Notes 680 h n b i a c o d 10 20 Example e g f m l k j 25 15 3520 40 30 20 10

CPSC 8620Notes 681 h n b i a c o d 10 20 Example e g f m l k j 25 15 3520 40 30 20 10 5 15

CPSC 8620Notes 682 h n b i a c o d 10 20 Example e g f m l k j 25 15 3520 40 30 20 10 h i a b c d e f g n o m l j k 5 15

CPSC 8620Notes 683 h n b i a c o d 10 20 Example e g f m l k j 25 15 3520 40 30 20 10 h i a b c d e f g n o m l j k Search points near f Search points near b 5 15

CPSC 8620Notes 684 Queries Find points with Yi > 20 Find points with Xi < 5 Find points “close” to i = Find points “close” to b =

CPSC 8620Notes 685 Many types of geographic index structures have been suggested kd-Trees (very similar to what we described here) Quad Trees R Trees...

CPSC 8620Notes 686 Two more types of multi key indexes Grid Partitioned hash

CPSC 8620Notes 687 Grid Index Key 2 X 1 X 2 …… X n V 1 V 2 Key 1 V n To records with key1=V 3, key2=X 2

CPSC 8620Notes 688 CLAIM Can quickly find records with –key 1 = V i  Key 2 = X j –key 1 = V i –key 2 = X j

CPSC 8620Notes 689 CLAIM Can quickly find records with –key 1 = V i  Key 2 = X j –key 1 = V i –key 2 = X j And also ranges…. –E.g., key 1  V i  key 2 < X j

How do we find entry i,j in linear structure? CPSC 8620Notes 690 i, j position S+0 position S+1 position S+2 position S+3 position S+4 position S+9 pos(i, j) = max number of i values N=4

How do we find entry i,j in linear structure? CPSC 8620Notes 691 i, j position S+0 position S+1 position S+2 position S+3 position S+4 position S+9 pos(i, j) = S + iN + j max number of i values N=4 Issue: Cells must be same size, and N must be constant! Issue: Some cells may overflow, some may be sparse...

CPSC 8620Notes 692 Solution: Use Indirection Buckets V 1 V 2 V 3 * Grid only V 4 contains pointers to buckets Buckets -- X1 X2 X3

CPSC 8620Notes 693 With indirection: Grid can be regular without wasting space We do have price of indirection

CPSC 8620Notes 694 Can also index grid on value ranges SalaryGrid Linear Scale 123 ToySalesPersonnel 0-20K1 20K-50K2 50K-3 8

CPSC 8620Notes 695 Grid files Good for multiple-key search Space, management overhead (nothing is free) Need partitioning ranges that evenly split keys + - -

CPSC 8620Notes 696 Idea: Key1 Key2 Partitioned hash function h1h2 010110 1110010

CPSC 8620Notes 697 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111., EX: Insert

CPSC 8620Notes 698 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111., EX: Insert

CPSC 8620Notes 699 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111. Find Emp. with Dept. = Sales  Sal=40k

CPSC 8620Notes 6100 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111. Find Emp. with Dept. = Sales  Sal=40k

CPSC 8620Notes 6101 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111. Find Emp. with Sal=30k

CPSC 8620Notes 6102 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111. Find Emp. with Sal=30k look here

CPSC 8620Notes 6103 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111. Find Emp. with Dept. = Sales

CPSC 8620Notes 6104 h1(toy)=0000 h1(sales)=1001 h1(art)=1010.011. h2(10k)=01100 h2(20k)=11101 h2(30k)=01110 h2(40k)=00111. Find Emp. with Dept. = Sales look here

CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More.

Similar presentations

Presentation on theme: "CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More.

Similar presentations

Presentation on theme: "CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More."— Presentation transcript:

Similar presentations

About project

Feedback