Presentation is loading. Please wait.

Presentation is loading. Please wait.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006.

Similar presentations


Presentation on theme: "Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006."— Presentation transcript:

1 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006

2 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke2 Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Basic idea: static hashing.  Two dynamic hashing techniques:  Extendible Hashing  Linear Hashing

3 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke3 Motivation  B+-tree needs log B N I/Os to get one key. Can we do faster?  If we use key as bucket number, then one I/O is good enough!  Problem: too many buckets!  What if (key % M)?  Got the idea, but should use a Hashing function on key. Primary bucket pages 1 0 M-1

4 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke4 Static Hashing  # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.  h ( k ) mod M = bucket to which data entry with key k belongs. (M = # of buckets) h(key) mod M h key Primary bucket pages Overflow pages 1 0 M-1

5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke5 Static Hashing (Contd.)  Buckets contain data entries.  Hash fn works on search key field of record r. Must distribute values over range 0... M-1.  h ( key ) = (a * key + b) usually works well.  Long overflow chains can develop and degrade performance.  Extendible and Linear Hashing : Dynamic techniques to fix this problem.

6 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke6 Extendible Hashing  Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets?  Problem 1: Reading and writing all pages is expensive!  Problem 2: May have many under-utilized pages!  Idea : Use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed!  Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page !  Trick lies in how hash function is adjusted!

7 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke7 Example  Directory is array of size 4.  To find bucket for r, take last ` global depth ’ # bits of h ( r ); we denote r by h ( r ).  If h ( r ) = 5 = binary 101, it is in bucket pointed to by 01.  Insert : If bucket is full, split it ( allocate new page, re-distribute ).  If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.) 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5*

8 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke8 Example  Insert 14.  14%4 = 2.  Does not overflow.  32%8 = 0, 16%8 = 0 -- stay in bucket 000  4 % 8 = 4, 12%8 = 4, 20%8 = 4 -- move to bucket 100 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH DIRECTORY Bucket A Bucket B Bucket C Bucket D DATA PAGES 10* 1*21* 4*12*32* 16* 15*7*19* 5* 14*  Insert 20.  20%4 = 0.  Overflow!

9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke9 Insert h (r)=20 (Causes Doubling) 20* 00 01 10 11 2 2 2 2 LOCAL DEPTH 2 2 DIRECTORY GLOBAL DEPTH Bucket A Bucket B Bucket C Bucket D Bucket A2 (`split image' of Bucket A) 1* 5*21*13* 32* 16* 10* 15*7*19* 4*12* 19* 2 2 2 000 001 010 011 100 101 110 111 3 3 3 DIRECTORY Bucket A Bucket B Bucket C Bucket D Bucket A2 (`split image' of Bucket A) 32* 1*5*21*13* 16* 10* 15* 7* 4* 20* 12* LOCAL DEPTH GLOBAL DEPTH

10 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke10 Points to Note  When does bucket split cause directory doubling?  Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth ;  How to implement the doubling of directory?  Copying it over! (Use of least significant bits enables this!)  `fixing’ pointer to split image page.

11 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke11 Comments on Extendible Hashing  Good: no overflow page. If directory fits in memory, equality search answered with one disk access guaranteed; (else two).  Since directory contains only key + pointer, it should generally be small enough to fit in memory.  Bad: if the distribution of hash values is skewed, directory can grow large.  Multiple entries with same hash value cause problems!

12 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke12 Sample quiz question  Is it possible for an insertion to cause the directory to double more than once?  If yes, show an example.

13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke13 Deletion on Extendible Hashing  Find the record and delete.  Rule 1: If removal of data entry makes bucket empty, can be merged with `split image’.  Rule 2: If each directory element points to the same bucket as its split image, can halve directory.

14 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke14 Linear Hashing  This is another dynamic hashing scheme, an alternative to Extendible Hashing.  LH handles the problem of long overflow chains without using a directory, and handles duplicates.  Idea :  Keep a pointer which alternate through each bucket.  Whenever a new overflow page is generated, split the bucket the pointer points to, and add a new bucket at the end of the file.  Observation: the number of buckets is doubled after splitting all bucket!

15 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke15  Let there be eight records:  32, 44, 14, 18  9, 25, 31, 35  Let each disk page hold up to four records.  Consider inserting 10, 30, 36, 7, 5, 11… Example of Linear Hashing

16 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke16 Example of Linear Hashing  On split, h Level+1 is used to re-distribute entries. 0 h h 1 ( This info is for illustration only!) Level=0, N=4 00 01 10 11 000 001 010 011 (The actual contents of the linear hashed file) Next=0 PRIMARY PAGES Data entry r with h(r)=5 Primary bucket page 44* 36* 32* 25* 9*5* 14*18* 10* 30* 31*35* 11* 7* 0 h h 1 Level=0 00 01 10 11 000 001 010 011 Next=1 PRIMARY PAGES 44* 36* 32* 25* 9*5* 14*18* 10* 30* 31*35* 11* 7* OVERFLOW PAGES 43* 00 100 Let h Level =(a*key+b) % 4 h Level+1 =(a*key+b) % 8 If insert 22 ……

17 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke17 Example of Linear Hashing 0 h h 1 Level=0 00 01 10 11 000 001 010 011 Next=2 PRIMARY PAGES 44* 36* 32* 25* 9* 14*18* 10* 30* 31*35* 11* 7* OVERFLOW PAGES 43* 00 100 Let h Level =(a*key+b) % 4 h Level+1 =(a*key+b) % 8 22* 5* 01 101 Important notice: after buckets 010 and 011 are split, Next moves back To 000!

18 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke18 Example: End of a Round 0 h h 1 22* 00 01 10 11 000 001 010 011 00 100 Next=3 01 10 101 110 Level=0 PRIMARY PAGES OVERFLOW PAGES 32* 9* 5* 14* 25* 66* 10* 18* 34* 35*31* 7* 11* 43* 44*36* 37*29* 30* 0 h h 1 37* 00 01 10 11 000 001 010 011 00 100 10 101 110 Next=0 Level=1 111 11 PRIMARY PAGES OVERFLOW PAGES 11 32* 9*25* 66* 18* 10* 34* 35* 11* 44* 36* 5* 29* 43* 14* 30* 22* 31*7* 50*

19 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke19 LH File  In the middle of a round. Level h Buckets that existed at the beginning of this round: this is the range of Next Bucket to be split of other buckets) in this round Level hsearch key value) ( )( Buckets split in this round: If is in this range, must use h Level+1 `split image' bucket. to decide if entry is in created (through splitting `split image' buckets:

20 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke20 Linear Hashing (Contd.)  Insert : Find bucket by applying h Level / h Level+1 :  If bucket to insert into is full: Add overflow page and insert data entry. ( Maybe ) Split Next bucket and increment Next.  Can choose any criterion to `trigger’ split.  Since buckets are split round-robin, long overflow chains don’t develop!  Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.

21 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke21 Summary  Hash-based indexes: best for equality searches, cannot support range searches.  Static Hashing can lead to long overflow chains.  Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. ( Duplicates may require overflow pages. )  Directory to keep track of buckets, doubles periodically.  Can get large with skewed data; additional I/O if this does not fit in main memory.

22 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke22 Summary (Contd.)  Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages.  Overflow pages not likely to be long.  Duplicates handled easily.  Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas. Can tune criterion for triggering splits to trade-off slightly longer chains for better space utilization.  For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!

23 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke23 Sample Quiz Question  Min/Max # insertions to increase global depth by 1? 13* 00 01 10 11 2 2 2 2 2 LOCAL DEPTH GLOBAL DEPTH Bucket A Bucket B Bucket C Bucket D 10* 1*21* 4*12*32* 16* 15*7*19* 5* 14*  What about by 2?

24 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke24 0 h h 1 Level=0 00 01 10 11 000 001 010 011 Next=2 PRIMARY PAGES 44* 36* 32* 25* 9* 14*18* 10* 30* 31*35* 11* 7* OVERFLOW PAGES 43* 00 100 22* 5* 01 101 Sample Quiz Question  Min/Max # insertions to cause Next to point to bucket 011?  What about for Next to point to 000?


Download ppt "Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006."

Similar presentations


Ads by Google