E.G.M. Petrakis: Hashing on the Disk (presentation transcript)

Slide 1: Hashing on the Disk
- Keys are stored in "disk pages" ("buckets"); several records fit within one page.
- Retrieval: find the address of the page, bring the page into main memory; searching within the page comes for free.

Slide 2: Basic Scheme
[Figure: a hash function maps the key space Σ to data pages 0, 1, 2, ..., m-1, each of capacity b]
- Page size b: maximum number of records in a page.
- Space utilization u: the fraction of allocated page space actually occupied by records.

Slide 3: Collisions
- Keys that hash to the same address are stored within the same page.
- If the page is full, either:
  i. page splits: allocate a new page and split the page contents between the old and the new page, or
  ii. overflows: keep a list of overflow pages.
[Figure: a full primary page with a chain of overflow pages]

Slide 4: Access Time
- Goal: find a key in one disk access.
- Access time ~ number of disk accesses.
- Large u: good space utilization, but many overflows or splits => more disk accesses.
- Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses.

Slide 5: Categories of Methods
- Static (require total file reorganization as the file grows): open addressing, separate chaining.
- Dynamic (the file grows dynamically, adapting to the data size): dynamic hashing, extendible hashing, linear hashing, spiral storage, ...

Slide 6: Dynamic Hashing Schemes
- File size adapts to data size without total reorganization.
- Typically 1-3 disk accesses to access a key.
- Access time and u are a typical trade-off.
- u between 50-100% (typically ~69%).
- More complicated to implement.

Slide 7: Schemes With Index
- Two disk accesses: one to access the index, one to access the data.
- With the index in main memory => one disk access.
- Problem: the index may become too large.
- Examples: dynamic hashing (Larson 1978), extendible hashing (Fagin et al. 1979).
[Figure: index pointing to the data pages]

Slide 8: Schemes Without Index
- Ideally, less space and fewer disk accesses (but at least one).
- Examples: linear hashing (Litwin 1980), linear hashing with partial expansions (Larson 1980), spiral storage (Martin 1979).
[Figure: the address space mapped directly onto the data space]

Slide 9: Hash Functions
- Must support a shrinking or growing file: as the address space shrinks or grows, the hash function adapts to these changes.
- Use hash functions based on the first (or last) bits of the key = b_{n-1} b_{n-2} ... b_i b_{i-1} ... b2 b1 b0.
- h_i(key) = b_{i-1} ... b2 b1 b0 supports 2^i addresses.
- h_i uses one more bit than h_{i-1}, so it can address a larger file.
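A minimal sketch of such a bit-based family of hash functions, assuming keys are (or can be mapped to) non-negative integers; the function name and the choice of suffix bits are illustrative, not part of the slides:

    def h(i, key):
        """Last i bits of key: an address in [0, 2**i); h_{i+1} uses one more bit."""
        return key & ((1 << i) - 1)

    # Example: moving from h_2 to h_3 doubles the addressable space.
    key = 0b10110
    assert h(2, key) == 0b10     # 2 bits -> 4 possible addresses
    assert h(3, key) == 0b110    # 3 bits -> 8 possible addresses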

Slide 10: Dynamic Hashing (Larson 1978)
- Two-level index:
  - primary h1(key): accesses a hash table (1st level),
  - secondary h2(key): accesses a binary tree (2nd level) whose leaves point to the data pages.
[Figure: 1st-level hash table, 2nd-level binary trees, data pages of size b]

Slide 11: Index
- Primary index is fixed (static): h1(key) = key mod m.
- Dynamic behavior comes from the secondary index:
  - h2(key) uses i bits of the key;
  - the bit sequence h2 = b_{i-1} ... b2 b1 b0 denotes which path to follow in the binary tree index in order to reach the data page;
  - scan h2 from right to left (bit 1: follow the right child, bit 0: follow the left child).
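As a hedged illustration of this two-level lookup (not the authors' code), the sketch below assumes each primary slot holds the root of a binary tree whose internal nodes have left/right children and whose leaves hold a data page; h2 is consumed right to left, one bit per tree level:

    class Node:
        def __init__(self, left=None, right=None, page=None):
            self.left, self.right, self.page = left, right, page

    def lookup_page(primary_table, key, m):
        node = primary_table[key % m]     # primary index: h1(key) = key mod m
        bits = key                        # h2(key): scan bits right to left
        while node.page is None:          # internal node: consume one more bit
            node = node.right if bits & 1 else node.left
            bits >>= 1
        return node.page                  # leaf: the data page to bring into memory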

Slide 12: Example
[Figure: primary index with h1(key) = key mod 6 (slots 0..5), binary trees as 2nd level, data pages. Example addresses: h1 = 1, h2 = "0"; h1 = 1, h2 = "01"; h1 = 1, h2 = "11"; h1 = 5, h2 = any. The length of h2 is at most the depth of the binary tree (here 2).]

Slide 13: Insertions
- Initially: fixed-size primary index and no data.
- Insert a record into a new page under its h1 address.
- If the page is full, allocate one extra page:
  - split the keys between the old and the new page,
  - use one extra bit of h2 for addressing.
[Figure: a full page with h1 = 1, h2 = any splits into two pages addressed by h1 = 1, h2 = 0 and h1 = 1, h2 = 1]
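A hedged sketch of this split step, reusing the Node layout from the lookup sketch above; bit_pos (the number of h2 bits already consumed by the full leaf) is an assumed parameter:

    def split_leaf(node, bit_pos):
        """Turn a full leaf into an internal node with two leaves, using one extra h2 bit."""
        records = node.page
        node.page = None
        node.left = Node(page=[k for k in records if not (k >> bit_pos) & 1])   # extra bit 0
        node.right = Node(page=[k for k in records if (k >> bit_pos) & 1])      # extra bit 1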

Slide 14: [figure-only slide, no text preserved in the transcript]

Slide 15: Deletions
- Find the record to be deleted using h1, h2; delete it.
- Check the "sibling" page: do the two pages together hold fewer than b records?
  - If yes, merge the two pages and delete the now-empty page.
  - Shrink the binary tree index by one level and drop one bit of h2.
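The merge test can be sketched in the same style (illustrative only): if the two sibling leaves hold fewer than b records combined, they collapse back into their parent and the tree loses one level at that point:

    def try_merge(parent, b):
        l, r = parent.left, parent.right
        if l and r and l.page is not None and r.page is not None \
                and len(l.page) + len(r.page) < b:
            parent.page = l.page + r.page          # parent becomes a leaf again
            parent.left = parent.right = None      # index shrinks by one level here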

Slide 16: Merging Example
[Figure: deleting a record leaves two sibling pages with few enough records to be merged back into a single page]

Slide 17: Extendible Hashing (Fagin et al. 1979)
- A form of dynamic hashing without the tree index: primary hashing is omitted.
- Only secondary hashing, with all binary trees kept at the same level (flattened into a directory).
- The index (directory) shrinks and grows according to the file size.
- Data pages are attached to the index.

Slide 18: From Dynamic to Extendible Hashing
[Figure: dynamic hashing with all binary trees at the same level; the tree paths 00, 01, 10, 11 become index entries, each labelled with its number of address bits]

Slide 19: Insertions
- Initially: 1 index entry and 1 data page, 0 address bits; insert records into that data page.
- Global depth d: the index has 2^d entries.
- Local depth l: number of address bits actually used by a page.
[Figure: one-entry index pointing to a single data page of size b]

Slide 20: Page "0" Overflows
[Figure: after the first overflow the index has entries 0 and 1; global depth d = 1, and both pages have local depth l = 1]

Slide 21: Page "0" Overflows (cont.)
- One more key bit for addressing and one extra page => the index doubles!
- Split the contents of the previous page between the 2 pages according to the next bit of the key.
- Global depth d: number of index bits => index size 2^d.
- Local depth l: number of bits used for addressing records in a page.
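A possible in-memory layout for the directory and pages of slides 19-21, assuming suffix (last-d-bits) addressing, which is option (b) of the insertion algorithm on slide 25; names are illustrative:

    class Bucket:
        def __init__(self, l):
            self.l = l                # local depth: address bits shared by all records here
            self.records = []

    class Directory:
        def __init__(self):
            self.d = 0                        # global depth
            self.buckets = [Bucket(0)]        # 2**d = 1 entry initially

        def find_bucket(self, key):
            return self.buckets[key & ((1 << self.d) - 1)]    # last d bits of the key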

Slide 22: Page "0" Overflows (again)
[Figure: index entries 00, 01, 10, 11 (d = 2); the page with local depth 1 contains records sharing the same 1st bit of the key, the pages with local depth 2 contain records sharing the same 2 bits]

Slide 23: Page "01" Overflows
- One more key bit is needed for addressing => the index doubles again (d = 3).
- 2^(d-l): number of index pointers that point to a page.
[Figure: index entries 000..111; pages with local depths 1, 2, 3, 3]

Slide 24: Page "100" Overflows
- No need to double the index (l < d for that page).
- Page 100 splits into two (1 new page); its local depth l is increased by 1.
[Figure: index entries 000..111; only the local depth of the split page increases]

Slide 25: Insertion Algorithm
- If l < d: split only the overflowed page (1 extra page).
- If l = d: the index is doubled and the page is split; d is increased by 1 => 1 more bit for addressing.
- Update the pointers (either way):
  a) if the d prefix bits are used for addressing:
       d = d + 1;
       for (i = 2^d - 1; i >= 0; i--) index[i] = index[i/2];
  b) if the d suffix bits are used:
       for (i = 0; i <= 2^d - 1; i++) index[i + 2^d] = index[i];
       d = d + 1;
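Below is a hedged, runnable version of this insertion algorithm for the suffix-bit case, built on the Directory/Bucket sketch above; the capacity B and the loop structure are illustrative choices, not the authors' code:

    B = 4   # assumed page capacity b

    def insert(directory, key):
        bucket = directory.find_bucket(key)
        bucket.records.append(key)
        while len(bucket.records) > B:                     # overflowed page
            if bucket.l == directory.d:                    # l = d: double the index
                directory.buckets += directory.buckets[:]  # index[i + 2^d] = index[i]
                directory.d += 1
            bucket.l += 1                                  # now l < d: split on one more bit
            bit = 1 << (bucket.l - 1)
            new_bucket = Bucket(bucket.l)
            new_bucket.records = [k for k in bucket.records if k & bit]
            bucket.records = [k for k in bucket.records if not k & bit]
            for i, ptr in enumerate(directory.buckets):    # redirect half of the pointers
                if ptr is bucket and i & bit:
                    directory.buckets[i] = new_bucket
            bucket = directory.find_bucket(key)            # re-check the key's page

Splitting only redirects pointers when l < d, matching slide 24; doubling copies the first half of the directory into the second, matching case (b) above.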

Slide 26: Deletion Algorithm
- Find and delete the record.
- Check the sibling page: if the two pages together hold fewer than b records,
  - merge the pages and free the empty page,
  - decrease the local depth l by 1 (records in the merged page share 1 less common bit),
  - if every page now has l < d, reduce the index to half its size and update the pointers.
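With suffix addressing, the "reduce index" step can be sketched as follows (illustrative only): the directory can be halved exactly when every entry i points to the same page as entry i + 2^(d-1), i.e. no page actually needs all d bits:

    def try_halve(directory):
        if directory.d == 0:
            return
        half = 1 << (directory.d - 1)
        if all(directory.buckets[i] is directory.buckets[i + half] for i in range(half)):
            directory.buckets = directory.buckets[:half]    # keep the first 2^(d-1) entries
            directory.d -= 1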

Slide 27: Deletion Example
[Figure: deleting with merging: two pages of local depth 3 merge into one page of local depth 2; once every page has l < d, the index shrinks from entries 000..111 back to 00..11]

Slide 28: Observations
- If a page splits and more than b keys still share the same next bit, take one more bit for addressing (increase l); if d = l the index doubles again!
- Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with the same value).
- If the distribution is known, transform it to a uniform one.
- Dynamic hashing performs better for non-uniform distributions (it is affected only locally).

Slide 29: Performance
- For n records and page size b: expected size of the index given by Flajolet's analysis [formula not preserved in the transcript].
- 1 disk access per retrieval when the index is in main memory.
- 2 disk accesses when the index is on disk.
- Overflows increase the number of disk accesses.

Slide 30: Storage Utilization with Page Splitting
- In general 50% < u < 100%.
- On average u ~ ln 2 ~ 69% (no overflows).
[Figure: a full page of size b before splitting becomes two half-full pages after splitting]

Slide 31: Storage Utilization with Overflows
- Allowing overflow pages achieves higher u and avoids index doubling (when d = l).
- Higher u is achieved with small overflow pages:
  - u = 2b/3b ~ 66% after splitting,
  - with small overflow pages (e.g., of size b/2): u = (b + b/2)/2b ~ 75%.
- Double the index only if the overflow page itself overflows!
[Figure: a primary page of size b with an attached overflow page]
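The two utilization figures are just the slide's arithmetic; a quick check (b is symbolic on the slide, a concrete value is used here only to evaluate the fractions):

    b = 100
    print(2 * b / (3 * b))        # ~0.66, the "after splitting" figure
    print((b + b / 2) / (2 * b))  # 0.75, with overflow pages of size b/2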

Slide 32: Linear Hashing (Litwin 1980)
- Dynamic scheme without an index: hash addresses refer directly to pages.
- Overflows are allowed.
- The file grows one page at a time.
- The page that splits is not always the one that overflowed: pages split in a predetermined order.

Slide 33: Linear Hashing (cont.)
- Initially n empty pages.
- A pointer p points to the page that splits next.
- Overflows are allowed.
[Figure: a row of pages of size b with the split pointer p at the leftmost page]

Slide 34: File Growing
- A page splits whenever the "splitting criterion" is satisfied:
  - a new page is added at the end of the file,
  - the contents of the old page (the one p points to) are split between the old and the new page based on key values,
  - the pointer p advances to the next page.
[Figure: the split pointer p advancing along the file]

Slide 35: Example
- b = b_page = 4, b_overflow = 1.
- Initially n = 5 pages, hash function h0 = k mod 5.
- Splitting criterion: u > A% (alternatively: split when an overflow page overflows, etc.).
[Figure: pages 0..4 holding keys hashed with h0 = k mod 5 (4, 319, 613, 303, 402, 27, 737, 712, 16, 711, 125, 320, 90, 435, 215, 522, ...); inserting the new element 438 triggers a split, with p pointing to page 0]

Slide 36: Example (cont.)
- Page 5 is added at the end of the file.
- The contents of page 0 are split between pages 0 and 5 based on the hash function h1 = key mod 10.
- p now points to the next page (page 1).
[Figure: pages 0..5 after the split]
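A small check of the split in this example (only keys from the figure that hash to 0 under h0 are used; their exact placement in the drawing is assumed):

    keys = [320, 90, 435, 215, 125]
    for k in keys:
        assert k % 5 == 0                  # h0 = k mod 5 put them all in page 0
        print(k, "-> page", k % 10)        # h1 = k mod 10: stays in 0 or moves to 5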

Slide 37: Hash Functions
- Initially h0 = key mod n.
- As new pages are added at the end of the file, h0 alone becomes insufficient.
- The file will eventually double its size; at that point use h1 = key mod 2n.
- In the meantime:
  - use h0 for pages not yet split,
  - use h1 for pages that have already split.
- The contents of the page pointed to by p are split based on h1.

Slide 38: Hash Functions (cont.)
- When the file has doubled its size, h0 is no longer needed: set h0 = h1 and continue (e.g., h0 = k mod 10).
- The file will eventually double its size again.
- Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%).

Slide 39: Hash Functions
- Initially n pages and 0 <= h0(k) < n.
- Series of hash functions h_i(k) = k mod (2^i n), i = 0, 1, 2, ... (generalizing h0 and h1 above).
- Selection of hash function: if h_i(k) >= p then use h_i(k), else use h_{i+1}(k).
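A hedged, self-contained sketch of linear hashing as described on slides 33-39; the splitting criterion (utilization above 0.85) and the absence of overflow-page handling are simplifying assumptions:

    class LinearHashFile:
        def __init__(self, n, b):
            self.n, self.b = n, b             # initial page count, page capacity
            self.i = 0                        # expansion level (file doublings so far)
            self.p = 0                        # pointer to the next page to split
            self.pages = [[] for _ in range(n)]

        def h(self, level, key):
            return key % (self.n * (1 << level))        # h_i(k) = k mod (2^i * n)

        def address(self, key):
            a = self.h(self.i, key)
            return a if a >= self.p else self.h(self.i + 1, key)    # slide 39 rule

        def insert(self, key):
            self.pages[self.address(key)].append(key)   # overflow chains omitted
            if self.split_criterion():
                self.split()

        def split_criterion(self):
            records = sum(len(pg) for pg in self.pages)
            return records > 0.85 * self.b * len(self.pages)    # assumed u > A%

        def split(self):
            self.pages.append([])                       # new page at the end of the file
            old, self.pages[self.p] = self.pages[self.p], []
            for k in old:                               # redistribute with h_{i+1}
                self.pages[self.h(self.i + 1, k)].append(k)
            self.p += 1
            if self.p == self.n * (1 << self.i):        # the file has doubled
                self.p, self.i = 0, self.i + 1          # "set h0 = h1 and continue"

For instance, LinearHashFile(n=5, b=4) mirrors the setup of the example on slides 35-36.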

Slide 40: Linear Hashing with Partial Expansions (Larson 1980)
- Problem with linear hashing: pages to the right of p wait a long time before they split, leading to long overflow chains on the rightmost pages.
- Solution: do not wait that long to split a page.
- k partial expansions: take the pages in groups of k; all k pages of a group split together, so the file grows at a lower rate.

Slide 41: Two Partial Expansions
- Initially 2n pages, n groups, 2 pages per group.
- Groups: (0, n), (1, 1+n), ..., (i, i+n), ..., (n-1, 2n-1).
- Pages in the same group split together => some of their records go to a new page at the end of the file (position 2n).
[Figure: two pointers to the pages of the same group, in the ranges 0..n and n..2n; new pages appended from position 2n on]
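A tiny illustration of this grouping (the position of the new page for group j is an assumption, consistent with appending one page per group split):

    n = 4
    groups = [(j, j + n) for j in range(n)]          # (0,4) (1,5) (2,6) (3,7)
    new_page = {j: 2 * n + j for j in range(n)}      # group j spills into page 2n + j
    print(groups, new_page)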

Slide 42: 1st Expansion
- After n splits all pages have been split; the file has 3n pages (1.5 times larger), so the file grows at a lower rate.
- After the 1st expansion, take the pages in groups of 3: (j, j+n, j+2n), 0 <= j < n.
[Figure: the file growing from 2n to 3n pages]

Slide 43: 2nd Expansion
- After another n splits the file has size 4n.
- Repeat the same process, now starting with 4n pages in 2n groups.
[Figure: two pointers to the pages of the same group; file positions 0..2n..4n]

Slide 44: Performance Comparison
[Figure: disk accesses per retrieval vs. relative file size (1 to 2), for linear hashing and linear hashing with 2 partial expansions]

Disk accesses (b = 5, b' = 5, u = 0.85):

  Operation  | LH, 3 partial exp. | LH, 2 partial exp. | Linear Hashing
  retrieval  | 1.09               | 1.12               | 1.17
  insertion  | 3.31               | 3.21               | 3.57
  deletion   | 3.56               | 3.53               | 4.04

Slide 45: Dynamic Hashing Schemes (Summary)
- Very good performance on membership, insert, and delete operations.
- Suitable for both main memory and disk: b = 1-3 records for main memory, b = 1-4 KB pages for disk.
- Critical parameter: space utilization u.
  - Large u => more overflows, worse performance.
  - Small u => fewer overflows, better performance.
- Suitable for direct-access queries (random accesses) but not for range queries.

