Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Hash-Based Indexes Chapter 11. 2 Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.

Similar presentations


Presentation on theme: "1 Hash-Based Indexes Chapter 11. 2 Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic."— Presentation transcript:

1 1 Hash-Based Indexes Chapter 11

2 2 Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic hashing techniques exist: Trade-offs similar to ISAM vs. B+ trees.

3 3 Static Hashing  h ( k ) mod N = bucket to which data entry with key k belongs. (N = # of buckets) h(key) mod N h key Primary bucket pages Overflow pages 1 0 N-1

4 4 Static Hashing : h ( k ) mod N  # primary pages fixed (N = # of buckets)  allocated sequentially  never de-allocated  overflow pages if needed. h(key) mod N h key Primary bucket pages Overflow pages 2 0 N-1

5 5 Static Hashing  h ( k ) mod N = bucket to which data entry with key k belongs with N = # of buckets  Hash function works on search key of record r.  h() must distribute values over range [0... N-1].  For example, h ( key ) = (a * key + b)  a and b are constants  lots known about how to tune h.

6 6 Static Hashing Cons  Primary pages fixed space  static structure.  Fixed # bucket is the problem:  Rehashing can be done  Not good for search.  In practice, instead use overflow chains.  Long overflow chains can develop and degrade performance.  Solution: Employ dynamic techniques to fix this problem:  Extendible hashing, or  Linear Hashing

7 7 Extendible Hashing  Problem: Bucket (primary page) becomes full.  Solution: Why not re-organize file by doubling # of buckets?  Reading and writing all pages is expensive!  Idea : Use directory of pointers to buckets instead of buckets  double # of buckets by doubling the directory  split just the bucket that overflowed!  Discussion :  + Directory much smaller than file, so doubling is cheaper. + Only one page of data entries is split. + No overflow pages ever.  Trick lies in how hash function is adjusted!

8 8 Example  Directory : array[4]  To find bucket for r, take last ` global depth ’ # bits of function h ( r ) 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5*

9 9 Example  If h ( r ) = 5 = 101,  it is in bucket pointed to by 01.  If h ( r ) = 4 = 100,  it is in bucket pointed to by 00. 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5*

10 10 Insertion  Splitting may :  double the directory, or  simply link in a new page. To tell what to do : Compare global depth with local depth for split bucket. 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5*  Insert : If bucket is full, split it ( allocate new page, re-distribute content ).

11 11 Insert h (r)= 6 = binary 110 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5*

12 12 Insert h (r)= 6 = binary 110 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5* 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5* 6*

13 13 Insert h (r)=20 = binary 10100 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5*

14 14 Insert h (r)=20 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5* 20* 3 Bucket A2 Split Bucket A into two buckets A1 and A2. 4*12* 3 Bucket A1 32*16 *

15 15 Insert h (r)=20 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5* 20* 3 Bucket A24*12* 3 Bucket A1 32* 16*

16 16 Insert h (r)=20 20* 00 01 10 11 2 2 2 2 LOCAL DEPTH 2 2 DIRECTORY GLOBAL DEPTH Bucket A Bucket B Bucket C Bucket D Bucket A2 (`split image' of Bucket A) 1* 5*21*13* 32* 16* 10* 15*7*19* 4*12* 19* 2 2 2 000 001 010 011 100 101 110 111 3 3 3 DIRECTORY Bucket A Bucket B Bucket C Bucket D Bucket A2 (`split image' of Bucket A) 32* 1*5*21*13* 16* 10* 15* 7* 4* 20* 12* LOCAL DEPTH GLOBAL DEPTH

17 17 Points to Note  20 = binary 10100.  Last 2 bits (00) tell us r belongs in A (or not A).  Last 3 bits needed to tell A1 or A2  Using bits:  Global depth of directory : Max # of bits needed to tell which bucket an entry belongs to  Local depth of a bucket : # of bits used to determine if an entry belongs to this bucket.

18 18 More Points to Note Bits:  Global depth of directory : Max # of bits needed to tell which bucket an entry belongs to  Local depth of a bucket : # of bits used to determine if an entry belongs to this bucket.  When does bucket split cause directory doubling?  Before insert, local depth of bucket = global depth.  Insert causes local depth to become < global depth ;  Directory is doubled by copying it over and `fixing’ pointer to split image page.

19 19 Extendible Hashing : Delete  Delete :  If removal of data entry makes bucket empty, can be merged with `split image’.  If each directory element points to same bucket as its split image, can halve directory.

20 20 Comments on Extendible Hashing  If directory fits in memory, then equality search answered with one disk access; else two.  100MB file, 100 bytes/rec, 4K pages contain 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.  Directory grows in spurts.  If the distribution of hash values is skewed, directory can grow large.

21 21 Summary  Hash-based indexes: best for equality searches, cannot support range searches.  Static Hashing can lead to long overflow chains.  Extendible Hashing avoids overflow pages by splitting full bucket when new data to be added  Directory to keep track of buckets, doubles periodically


Download ppt "1 Hash-Based Indexes Chapter 11. 2 Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic."

Similar presentations


Ads by Google