
Static Hashing (using overflow for collision management, e.g., h(key) mod M)




1 Static Hashing (using overflow for collision management), e.g., h(key) = key mod M. The hash function maps a key to one of M primary bucket pages (numbered 0 to M-1). Overflow pages are allocated if needed, either as a separate linked list for each bucket (only page numbers are needed for the pointers) or as a single shared linked list. Long overflow chains can develop and degrade performance. – Extendible and Linear Hashing are dynamic techniques to fix this problem.
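The per-bucket overflow chaining described above can be sketched in a few lines. This is a hedged, in-memory illustration only: the slide's buckets are disk pages identified by page numbers, while here a `Bucket` object with a small capacity stands in for a page, and the names are my own.

```python
M = 4  # number of primary buckets, fixed for the life of the file

class Bucket:
    """A primary bucket page with a chain of overflow 'pages'."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = []
        self.overflow = None  # next overflow page in this bucket's chain

    def insert(self, key):
        if len(self.entries) < self.capacity:
            self.entries.append(key)
        else:
            # primary page full: allocate an overflow page if needed
            if self.overflow is None:
                self.overflow = Bucket(self.capacity)
            self.overflow.insert(key)

    def search(self, key):
        if key in self.entries:
            return True
        # follow the overflow chain (this is where long chains hurt)
        return self.overflow.search(key) if self.overflow else False

buckets = [Bucket() for _ in range(M)]
for k in [4, 12, 32, 16, 1, 5]:
    buckets[k % M].insert(k)      # h(key) = key mod M

print(buckets[0].search(16))      # True: found via the overflow chain
```

A search that walks a long chain touches one page per chain link, which is exactly the degradation the dynamic schemes below avoid.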

2 Other static hashing overflow handling methods, e.g., h(key) mod M. Overflow can also be handled by open addressing (more commonly used for internal hash tables, where a bucket is an allocation of main memory). In open addressing, upon collision, search forward in the bucket sequence for the next open record slot. E.g., with slots 0-6 and h(rec_key) = 1: collision at slot 1, slot 2? no, slot 3? yes. Then to search, apply h; if the record is not found there, search sequentially ahead until it is found (circling around to the search start point).
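The probe-forward behaviour just described (linear probing) can be sketched as follows, assuming a 7-slot table as in the figure; `insert` and `search` are illustrative names, not from any library:

```python
SLOTS = 7
table = [None] * SLOTS

def insert(key):
    i = key % SLOTS                    # h(key)
    for _ in range(SLOTS):
        if table[i] is None:
            table[i] = key
            return i
        i = (i + 1) % SLOTS            # collision: try next slot, wrapping
    raise RuntimeError("table full")

def search(key):
    i = key % SLOTS
    for _ in range(SLOTS):
        if table[i] is None:           # an empty slot ends the probe chain
            return None
        if table[i] == key:
            return i
        i = (i + 1) % SLOTS            # circle around past the start point
    return None

insert(8)                              # 8 mod 7 = 1 -> slot 1
insert(15)                             # 15 mod 7 = 1: collision, lands in slot 2
print(search(15))                      # 2
```

Note that stopping a search at the first empty slot is only safe if deletions either never happen or leave tombstones, a detail the slide glosses over.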

3 Other overflow handling methods: h0(key), then h1, then h2, ... Overflow can also be handled by re-hashing. In re-hashing, upon collision, apply the next hash function from a sequence of hash functions. Then to search, apply h0; if the record is not found, apply the next hash function, until it is found or the list of hash functions is exhausted. These methods can also be combined.
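The hash-function sequence h0, h1, h2, ... can be sketched as below. The particular formula for h_i is an arbitrary choice for illustration (the slides do not specify one); any sequence of probe functions is applied in the same insert-and-search pattern:

```python
SLOTS = 7
table = [None] * SLOTS

def h(i, key):
    """h_i(key): an illustrative sequence of hash functions (assumed, not from the slides)."""
    return (key + i * (1 + key % 5)) % SLOTS

def insert(key):
    for i in range(SLOTS):             # try h0, h1, h2, ... in order
        s = h(i, key)
        if table[s] is None:
            table[s] = key
            return s
    raise RuntimeError("hash sequence exhausted")

def search(key):
    for i in range(SLOTS):             # re-apply the same sequence
        s = h(i, key)
        if table[s] is None:
            return None
        if table[s] == key:
            return s
    return None

insert(8)                              # h0(8) = 1
insert(15)                             # h0(15) = 1 collides; h1(15) = 2
print(search(15))                      # 2
```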

4 Extendible Hashing Idea: use a directory of pointers to buckets; split just the bucket that overflowed, and double the directory only when needed. The directory is much smaller than the file, so doubling it is cheap. Only one page of data entries is split. No overflow pages! The trick lies in how the hash function is adjusted.

5 Example: blocking factor (bfr) = 4 (# entries per bucket). Local depth of a bucket: # of bits used to determine if an entry belongs to that bucket. Global depth (gd) of the directory: max # of bits needed to tell which bucket an entry belongs to (= max of the local depths). Insert: if the bucket is full, split it (allocate 1 new page, redistribute the entries over those 2 pages). In the figure, global depth = 2, the directory entries are 00, 01, 10, 11, and all four buckets have local depth 2: Bucket A holds 4*, 12*, 32*, 16*; Bucket B holds 1*, 5*, 21*, 13*; Bucket C holds 10*; Bucket D holds 15*, 7*, 19*. To find the bucket for a new key value r, apply the hash function h and take just the last global-depth bits of h(r), not all of it (the last 2 bits in this example; for simplicity we let h(r) = r here). E.g., h(5) = 5 = 101 binary, so it is in the bucket pointed to in the directory by 01. Apply h to the key value r, then follow the pointer at the last 2 bits of h(r).
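Taking "just the last global-depth bits of h(r)" is a bit-mask operation. A one-line sketch, with h(r) = r as on the slide:

```python
def bucket_index(key, global_depth):
    """Directory index for a key: the last `global_depth` bits of h(key) = key."""
    return key & ((1 << global_depth) - 1)

print(bucket_index(5, 2))    # 5 = 101b, last 2 bits 01 -> directory entry 1
print(bucket_index(13, 2))   # 13 = 1101b, last 2 bits 01 -> directory entry 1
```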

6 Example: how did we get there? Start with global depth = 1: the directory has entries 0 and 1, pointing to Bucket A and Bucket B, each with local depth 1. The first insert is 4: h(4) = 4 = 100 binary, so it goes to the bucket pointed to in the directory by 0 (Bucket A).

7 Example: insert 12, 32, 16 and 1. h(12) = 12 = 1100 binary: bucket pointed to in the directory by 0. h(32) = 32 = 10 0000 binary: bucket pointed to by 0. h(16) = 16 = 1 0000 binary: bucket pointed to by 0. h(1) = 1 = 1 binary: bucket pointed to by 1. Bucket A now holds 4*, 12*, 32*, 16*; Bucket B holds 1*.

8 Example: insert 5, 21 and 13. h(5) = 5 = 101 binary: bucket pointed to in the directory by 1. h(21) = 21 = 1 0101 binary: bucket pointed to by 1. h(13) = 13 = 1101 binary: bucket pointed to by 1. Bucket B now holds 1*, 5*, 21*, 13*.

9 Example: the 9th insert is 10. h(10) = 10 = 1010 binary: bucket pointed to in the directory by 0. Collision! Bucket A is full, so split Bucket A into A and C. Since A's local depth equals the global depth, double the directory (by copying what is there and adding a bit on the left), giving entries 00, 01, 10, 11; reset one pointer so that entry 10 points to the new Bucket C. The local depths of A and C become 2. Redistribute the values among A and C if necessary; not necessary this time, since all the 2's bits are already correct: 4 = 100, 12 = 1100, 32 = 100000, 16 = 10000. Then 10 = 1010 goes into Bucket C.

10 Example: insert 15, 7 and 19. h(15) = 15 = 1111 binary, h(7) = 7 = 111 binary, h(19) = 19 = 1 0011 binary: all belong to the full Bucket B. Split Bucket B into B and D. No need to double the directory, because the local depth of B is less than the global depth. Reset one pointer (entry 11 now points to Bucket D) and reset the local depths of B and D to 2. Redistribute values among B and D if necessary (not necessary this time for B's existing entries). Bucket D receives 15*, 7*, 19*.

11 Insert 20: h(20) = 20 = 10100 binary. The bucket pointed to by 00 (Bucket A) is full! Split A. Since A's local depth (2) equals the global depth, double the directory (entries 000 through 111) and reset 1 pointer: entry 100 now points to the new Bucket E (the `split image' of Bucket A), and the local depths of A and E become 3. Redistribute the contents of A: 32* and 16* stay in A (last 3 bits 000), while 4* and 12* move to E (last 3 bits 100), where 20* joins them.

12 Points to Note: 20 = 10100 binary. The last 2 bits (00) tell us r belongs in either A or its split image, but not which one; the last 3 bits are needed to tell which. – Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket. – Global depth of directory: max # of bits needed to tell which bucket an entry belongs to (= max of local depths). When does a bucket split cause directory doubling? – When, before the insert, the local depth of the bucket equals the global depth. The insert causes the local depth to exceed the global depth, so the directory is doubled by copying it over and `fixing' the pointer to the split image page. – The use of least significant bits enables efficient doubling via copying of the directory!
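The split and directory-doubling rules above can be sketched end to end. This is a hedged toy implementation (class and method names are my own, and it starts from a one-entry directory with global depth 0, whereas the slides' example starts with two buckets), but it reaches the same final state for the example's insert sequence:

```python
class Bucket:
    def __init__(self, depth, capacity=4):
        self.depth = depth                   # local depth
        self.capacity = capacity
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.gd = 0                          # global depth
        self.dir = [Bucket(0, capacity)]     # directory of bucket references
        self.capacity = capacity

    def _idx(self, key):
        return key & ((1 << self.gd) - 1)    # last gd bits of h(key) = key

    def insert(self, key):
        # Note: duplicates beyond one page are not handled (the slides
        # observe that duplicates may require overflow pages).
        b = self.dir[self._idx(key)]
        if len(b.keys) < b.capacity:
            b.keys.append(key)
            return
        if b.depth == self.gd:               # local depth = global depth:
            self.dir = self.dir + self.dir   # double directory by copying
            self.gd += 1
        # split the full bucket into itself and its "split image"
        b.depth += 1
        image = Bucket(b.depth, b.capacity)
        mask = 1 << (b.depth - 1)            # the newly significant bit
        for i in range(len(self.dir)):       # fix pointers to the image
            if self.dir[i] is b and (i & mask):
                self.dir[i] = image
        old, b.keys = b.keys, []
        for k in old:                        # redistribute over the 2 pages
            (image if (k & mask) else b).keys.append(k)
        self.insert(key)                     # retry the pending insert

    def search(self, key):
        return key in self.dir[self._idx(key)].keys

eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20]:
    eh.insert(k)
print(eh.gd, eh.search(20))                  # global depth 3 after the final split
```

Because the directory doubles by self-concatenation, entries i and i + 2^(gd-1) point to the same bucket right after doubling, which is why resetting a single pointer per split suffices.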

13 Comments on Extendible Hashing: if the directory fits in memory, an equality search is answered with one disk access; else two. – The directory grows in spurts and, if the distribution of hash values is skewed, the directory can grow large. – Multiple entries with the same hash value cause problems! Delete: if removal of a data entry makes a bucket empty, it can be merged with its `split image'. – As soon as each directory element points to the same bucket as its (merged) split image, the directory can be halved.

14 Linear Hash File: starts with M buckets (numbered 0, 1, ..., M-1) and initial hash function h0 = mod M (or more generally, h0(key) = h(key) mod M for any hash function h that maps into the integers). Use chaining to shared overflow pages to handle overflows. At the first overflow, split bucket 0 into bucket 0 and bucket M, and rehash bucket 0's records using h1 = mod 2M. Henceforth, if h0 yields a value ≤ 0, rehash using h1 = mod 2M. At the next overflow, split bucket 1 into bucket 1 and bucket M+1, and rehash bucket 1's records using h1 = mod 2M. Henceforth, if h0 yields a value ≤ 1, use h1, and so on. When all of the original M buckets have been split (M collisions), rehash all overflow records using h1. Relabel h1 as h0 (discarding the old h0 forever) and start a new "round" by repeating the process above for all future collisions (i.e., there are now buckets 0, ..., 2M-1 and h0 = mod 2M). To search for a record, let n = the index of the last bucket split so far in the given round; if h0(key) is not greater than n, use h1, else use h0.
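A compact sketch of the round-robin splitting just described. It is an in-memory toy (overflow records simply extend a Python list rather than chaining to shared overflow pages, and the names are my own); n is kept as the number of splits completed so far, so the search rule "h0(key) ≤ last split index" becomes `a < n`:

```python
class LinearHash:
    def __init__(self, M=5, capacity=2):
        self.M = M                         # buckets at the start of the round
        self.n = 0                         # splits completed in this round
        self.capacity = capacity
        self.buckets = [[] for _ in range(M)]

    def _addr(self, key):
        a = key % self.M                   # h0
        if a < self.n:                     # bucket a was already split:
            a = key % (2 * self.M)         # use h1 = mod 2M
        return a

    def insert(self, key):
        a = self._addr(key)
        overflowed = len(self.buckets[a]) >= self.capacity
        self.buckets[a].append(key)        # list growth stands in for overflow pages
        if overflowed:                     # any overflow triggers the next split
            self._split()

    def _split(self):
        # split bucket n (round-robin, NOT the bucket that overflowed)
        old = self.buckets[self.n]
        self.buckets.append([])            # new bucket n + M
        self.buckets[self.n] = []
        for k in old:                      # rehash bucket n's records with h1
            self.buckets[k % (2 * self.M)].append(k)
        self.n += 1
        if self.n == self.M:               # round complete: relabel h1 as h0
            self.M *= 2
            self.n = 0

    def search(self, key):
        return key in self.buckets[self._addr(key)]

lh = LinearHash(M=5, capacity=2)
for k in range(30):
    lh.insert(k)
print(all(lh.search(k) for k in range(30)))   # True
```

The design point to notice is that no directory is needed: the pair (M, n) alone determines which of h0 and h1 addresses any key.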

15 Linear Hash example, M = 5. Buckets 0-4 live on pages 45, 99, 23, 78 and 98 and already hold records placed by mod 5: 25|CLAY (bucket 0); 11|BROWN, 21|BARBIE (bucket 1); 02|BAID, 22|ZHU (bucket 2); 33|GOOD (bucket 3); 14|THAISZ, 24|CROWE (bucket 4).
– Insert 27|JONES: h0(27) = mod5(27) = 2. Collision! Split bucket 0 into buckets 0 and 5 (page 21) and rehash bucket 0's records using mod 10; n = 0. 27|JONES goes to an overflow page.
– Insert 8|SINGH: h0(8) = mod5(8) = 3; it fits in bucket 3.
– Insert 15|LOWE: h0(15) = mod5(15) = 0 ≤ n, so use h1: h1(15) = mod10(15) = 5; it goes to bucket 5.
– Insert 32|FARNS: h0(32) = mod5(32) = 2. Collision! Split bucket 1 into buckets 1 and 6 (page 101) and rehash using mod 10; n = 1.
– Insert 39|TULIP: h0(39) = mod5(39) = 4. Collision! Split bucket 2 into buckets 2 and 7 (page 104) and rehash using mod 10; n = 2.
– Insert 31|ROSE: h0(31) = mod5(31) = 1. Collision! Split bucket 3 into buckets 3 and 8 (page 105) and rehash using mod 10; n = 3.
– Insert 36|SCHOTZ: h0(36) = mod5(36) = 1. Collision! Split bucket 4 into buckets 4 and 9 (page 107) and rehash using mod 10; n = 4: all of the original 5 buckets have now been split, so the round ends. Rehash the overflow records with mod 10 and relabel h1 = mod 10 as the new h0.

16 Linear Hash example, 2nd round: M = 10, h0 = mod 10, with buckets 0-9 (pages 45, 99, 23, 78, 98, 21, 101, 104, 105, 107) carried over from the first round. The overflow records are re-addressed with the new h0: h0(27) = 7, h0(39) = 9, h0(36) = 6. h0(32) = 2: collision! Split bucket 0 into buckets 0 and 10 (page 109) and rehash using mod 20; n = 0. h0(31) = 1: collision! Split bucket 1 into buckets 1 and 11 (page 110) and rehash using mod 20; n = 1. Insert 10|RADHA: h0(10) = mod10(10) = 0 ≤ n, so use h1: h1(10) = mod20(10) = 10; it goes to bucket 10. ETC.

17 Summary: Hash-based indexes are best for equality searches; they cannot support range searches. Static Hashing can lead to performance degradation due to collision handling problems. Extendible Hashing avoids performance problems by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) – A directory keeps track of the buckets and doubles periodically. – It can get large with skewed data, and costs an additional I/O if it does not fit in main memory.

18 Summary (contd.): Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages. – Overflow pages are not likely to be long. – Duplicates are handled easily. – Space utilization could be lower than in Extendible Hashing, since splits are not concentrated on `dense' data areas. Skew occurs when the hash values of the data entries are not uniform! The slide's figures plot count against values v1, v2, ..., vn for three cases: distribution skew (the values cluster in one part of the range), count skew (some values occur far more often than others), and distribution-and-count skew combined.




