Faloutsos & Pavlo Lecture#11 (R&G ch. 11) Hashing Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications Faloutsos & Pavlo Lecture#11 (R&G ch. 11) Hashing
Outline (static) hashing extendible hashing linear hashing Faloutsos - Pavlo CMU SCS 15-415/615 Outline (static) hashing extendible hashing linear hashing Hashing vs B-trees Faloutsos - Pavlo CMU SCS 15-415/615
(Static) Hashing Problem: “find EMP record with ssn=123” Faloutsos - Pavlo CMU SCS 15-415/615 (Static) Hashing Problem: “find EMP record with ssn=123” What if disk space was free, and time was at premium? Faloutsos - Pavlo CMU SCS 15-415/615
Hashing A: Brilliant idea: key-to-address transformation: #0 page 123; Smith; Main str #123 page #999,999,999 Faloutsos - Pavlo CMU SCS 15-415/615
Hashing Since space is NOT free: use M, instead of 999,999,999 slots hash function: h(key) = slot-id #0 page 123; Smith; Main str #123 page #999,999,999 Faloutsos - Pavlo CMU SCS 15-415/615
Hashing Typically: each hash bucket is a page, holding many records: 123; Smith; Main str #h(123) M Faloutsos - Pavlo CMU SCS 15-415/615
Hashing Notice: could have clustering, or non-clustering versions: #0 page #h(123) M 123; Smith; Main str. Faloutsos - Pavlo CMU SCS 15-415/615
Hashing Notice: could have clustering, or non-clustering versions: ... EMP file 123; Smith; Main str. 234; Johnson; Forbes ave 345; Tompson; Fifth ave #0 page #h(123) M 123 ... Faloutsos - Pavlo CMU SCS 15-415/615
Indexing- overview hashing hashing functions size of hash table collision resolution extendible hashing Hashing vs B-trees Faloutsos - Pavlo CMU SCS 15-415/615
Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method Faloutsos - Pavlo CMU SCS 15-415/615
Design decisions - functions Goal: uniform spread of keys over hash buckets Popular choices: Division hashing Multiplication hashing Faloutsos - Pavlo CMU SCS 15-415/615
Division hashing h(x) = (a*x+b) mod M eg., h(ssn) = (ssn) mod 1,000 gives the last three digits of ssn M: size of hash table - choose a prime number, defensively (why?) Faloutsos - Pavlo CMU SCS 15-415/615
Division hashing eg., M=2; hash on driver-license number (dln), where last digit is ‘gender’ (0/1 = M/F) in an army unit with predominantly male soldiers Thus: avoid cases where M and keys have common divisors - prime M guards against that! Faloutsos - Pavlo CMU SCS 15-415/615
Multiplication hashing h(x) = [ fractional-part-of ( x * φ ) ] * M φ: golden ratio ( 0.618... = ( sqrt(5)-1)/2 ) in general, we need an irrational number advantage: M need not be a prime number but φ must be irrational Faloutsos - Pavlo CMU SCS 15-415/615
Other hashing functions quadratic hashing (bad) ... Faloutsos - Pavlo CMU SCS 15-415/615
Other hashing functions quadratic hashing (bad) ... conclusion: use division hashing Faloutsos - Pavlo CMU SCS 15-415/615
Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method Faloutsos - Pavlo CMU SCS 15-415/615
Size of hash table eg., 50,000 employees, 10 employee-records / page Q: M=?? pages/buckets/slots Faloutsos - Pavlo CMU SCS 15-415/615
Size of hash table eg., 50,000 employees, 10 employees/page Q: M=?? pages/buckets/slots A: utilization ~ 90% and M: prime number Eg., in our case: M= closest prime to 50,000/10 / 0.9 = 5,555 Faloutsos - Pavlo CMU SCS 15-415/615
Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution Q: what is a ‘collision’? A: ?? Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution FULL 123; Smith; Main str. #0 page #h(123) M Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution Q: what is a ‘collision’? A: ?? Q: why worry about collisions/overflows? (recall that buckets are ~90% full) A: ‘birthday paradox’ Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution open addressing linear probing (ie., put to next slot/bucket) re-hashing separate chaining (ie., put links to overflow pages) Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution FULL linear probing: 123; Smith; Main str. #0 page FULL 123; Smith; Main str. #h(123) M Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution FULL re-hashing h1() 123; Smith; Main str. h2() #0 page h1() FULL 123; Smith; Main str. #h(123) h2() M Faloutsos - Pavlo CMU SCS 15-415/615
Collision resolution FULL separate chaining 123; Smith; Main str. Faloutsos - Pavlo CMU SCS 15-415/615
Design decisions - conclusions Faloutsos - Pavlo CMU SCS 15-415/615 Design decisions - conclusions function: division hashing h(x) = ( a*x+b ) mod M size M: ~90% util.; prime number. collision resolution: separate chaining easier to implement (deletions!); no danger of becoming full Faloutsos - Pavlo CMU SCS 15-415/615
Outline (static) hashing extendible hashing linear hashing Hashing vs B-trees Faloutsos - Pavlo CMU SCS 15-415/615
Problem with static hashing problem: overflow? problem: underflow? (underutilization) Faloutsos - Pavlo CMU SCS 15-415/615
Solution: Dynamic/extendible hashing idea: shrink / expand hash table on demand.. ..dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’ [Fagin et al] Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing FULL 123; Smith; Main str. #0 page #h(123) M Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing FULL solution: don’t overflow – instead: SPLIT the bucket in two #0 page FULL 123; Smith; Main str. #h(123) M Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing in detail: keep a directory, with ptrs to hash-buckets Q: how to divide contents of bucket in two? A: hash each key into a very long bit string; keep only as many bits as needed Eventually: Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing directory 00... 01... 10... 11... 0001... 0111... 10101... 10... 10011... 10110... 11... 1101... 101001... Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing directory 00... 01... 10... 11... 0001... 0111... 10101... 10... 10011... 10110... 11... 1101... 101001... Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing directory 00... 01... 10... split on 3-rd bit 11... 0001... 0111... 00... 01... 10101... 10... split on 3-rd bit 10011... 10110... 11... 101001... 1101... Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing directory 00... 01... new page / bucket 10... 11... 0001... 0111... 00... 01... new page / bucket 10... 10011... 10101... 11... 101001... 10110... 1101... Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing directory (doubled) new page / bucket 000... 001... 1101... 10011... 0111... 0001... 101001... 10101... 10110... 000... 001... 010... 011... 100... 101... 110... 111... new page / bucket Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing BEFORE AFTER 00... 01... 10... 11... 000... 001... 10101... 10110... 1101... 10011... 0111... 0001... 101001... 0001... 000... 001... 010... 011... 100... 101... 110... 111... 0111... 101001... 10101... 10110... 10011... 1101... Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing global local 1 1 3 2 BEFORE AFTER 00... 01... 10... 11... 10101... 10110... 1101... 10011... 0111... 0001... 101001... 2 0001... 000... 001... 010... 011... 100... 101... 110... 111... 0111... 101001... 10101... 10110... 10011... 1101... Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing 1 1 3 2 Do all splits lead to directory doubling? BEFORE AFTER 1 1 3 00... 01... 10... 11... 10101... 10110... 1101... 10011... 0111... 0001... 101001... 2 0001... 000... 001... 010... 011... 100... 101... 110... 111... 0111... 2 3 3 101001... 10101... 10110... 10011... 2 2 1101... Do all splits lead to directory doubling? A: Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing 1 1 3 2 Do all splits lead to directory doubling? BEFORE AFTER 1 1 3 00... 01... 10... 11... 10101... 10110... 1101... 10011... 0111... 0001... 101001... 2 0001... 000... 001... 010... 011... 100... 101... 110... 111... 0111... 2 3 3 101001... 10101... 10110... 10011... 2 2 1101... Do all splits lead to directory doubling? A:NO! (most don’t) Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing 1 1 3 2 BEFORE AFTER 00... 01... 10... 11... 000... 10101... 10110... 1101... 10011... 0111... 0001... 101001... 2 0001... 000... 001... 010... 011... 100... 101... 110... 111... 0111... 2 3 3 101001... 10101... 10110... 10011... 2 2 1101... Give insertions, that cause split, but no directory dup.? Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing 1 1 3 2 BEFORE AFTER 00... 01... 10... 11... 000... 10101... 10110... 1101... 10011... 0111... 0001... 101001... 2 0001... 000... 001... 010... 011... 100... 101... 110... 111... 0111... 2 3 3 101001... 10101... 10110... 10011... 2 2 1101... Give insertions, that cause split, but no directory dup.? A: 000… and 001… Faloutsos - Pavlo CMU SCS 15-415/615
Extendible hashing Summary: directory doubles on demand or halves, on shrinking files needs ‘local’ and ‘global’ depth Faloutsos - Pavlo CMU SCS 15-415/615
Outline (static) hashing extendible hashing linear hashing Faloutsos - Pavlo CMU SCS 15-415/615 Outline (static) hashing extendible hashing linear hashing Hashing vs B-trees Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - overview Motivation main idea search algo insertion/split algo deletion Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing Motivation: ext. hashing needs directory etc etc; which doubles (ouch!) Q: can we do something simpler, with smoother growth? Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing Motivation: ext. hashing needs directory etc etc; which doubles (ouch!) Q: can we do something simpler, with smoother growth? A: split buckets from left to right, regardless of which one overflowed (‘crazy’, but it works well!) - Eg.: Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing Initially: h(x) = x mod N (N=4 here) Assume capacity: 3 records / bucket Insert key ‘17’ 1 2 3 bucket- id 4 8 5 9 13 6 7 11 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing Initially: h(x) = x mod N (N=4 here) overflow of bucket#1 17 1 2 3 bucket- id 4 8 5 9 13 6 7 11 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing Initially: h(x) = x mod N (N=4 here) overflow of bucket#1 Split #0, anyway!!! 17 1 2 3 bucket- id 4 8 5 9 13 6 7 11 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing Initially: h(x) = x mod N (N=4 here) Split #0, anyway!!! Q: But, how? 17 1 2 3 bucket- id 4 8 5 9 13 6 7 11 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing A: use two h.f.: h0(x) = x mod N h1(x) = x mod (2*N) 17 1 2 3 bucket- id 4 8 5 9 13 6 7 11 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - after split: A: use two h.f.: h0(x) = x mod N h1(x) = x mod (2*N) bucket- id 1 2 3 4 5 9 13 8 6 7 11 4 17 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - after split: A: use two h.f.: h0(x) = x mod N h1(x) = x mod (2*N) bucket- id 1 2 3 4 5 9 13 8 6 7 11 4 overflow 17 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - after split: A: use two h.f.: h0(x) = x mod N h1(x) = x mod (2*N) split ptr bucket- id 1 2 3 4 5 9 13 8 6 7 11 4 overflow 17 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - overview Motivation main idea search algo insertion/split algo deletion Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - searching? h0(x) = x mod N (for the un-split buckets) h1(x) = x mod (2*N) (for the splitted ones) split ptr bucket- id 1 2 3 4 5 9 13 8 6 7 11 4 overflow 17 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - searching? Q1: find key ‘6’? Q2: find key ‘4’? Q3: key ‘8’? split ptr bucket- id 1 2 3 4 5 9 13 8 6 7 11 4 overflow 17 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - searching? Algo to find key ‘k’: compute b= h0(k); // original slot if b<split-ptr // has been split compute b=h1(k) search bucket b Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - overview Motivation main idea search algo insertion/split algo deletion Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - insertion? Algo: insert key ‘k’ compute appropriate bucket ‘b’ if the overflow criterion is true split the bucket of ‘split-ptr’ split-ptr ++ (*) Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - insertion? notice: overflow criterion is up to us!! Q: suggestions? Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - insertion? notice: overflow criterion is up to us!! Q: suggestions? A1: space utilization >= u-max Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - insertion? notice: overflow criterion is up to us!! Q: suggestions? A1: space utilization >= u-max A2: avg length of ovf chains > max-len A3: .... Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - insertion? Algo: insert key ‘k’ compute appropriate bucket ‘b’ if the overflow criterion is true split the bucket of ‘split-ptr’ split-ptr ++ (*) what if we reach the right edge?? Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - split now? h0(x) = x mod N (for the un-split buckets) h1(x) = x mod (2*N) for the splitted ones) split ptr 2 3 4 5 6 1 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - split now? h0(x) = x mod N (for the un-split buckets) h1(x) = x mod (2*N) (for the splitted ones) split ptr 1 2 3 4 5 6 7 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - split now? h0(x) = x mod N (for the un-split buckets) h1(x) = x mod (2*N) (for the splitted ones) split ptr 1 2 3 4 5 6 7 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - split now? h0(x) = x mod N (for the un-split buckets) h1(x) = x mod (2*N) (for the splitted ones) split ptr 1 2 3 4 5 6 7 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - split now? this state is called ‘full expansion’ split ptr 1 2 3 4 5 6 7 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - observations In general, at any point of time, we have at most two h.f. active, of the form: hn(x) = x mod (N * 2n) hn+1(x) = x mod (N * 2n+1) (after a full expansion, we have only one h.f.) Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - overview Motivation main idea search algo insertion/split algo deletion Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - deletion? reverse of insertion: Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - deletion? reverse of insertion: if the underflow criterion is met contract! Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - how to contract? h0(x) = mod N (for the un-split buckets) h1(x) = mod (2*N) (for the splitted ones) split ptr 2 3 4 5 6 1 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - how to contract? h0(x) = mod N (for the un-split buckets) h1(x) = mod (2*N) (for the splitted ones) split ptr ?? 2 3 4 5 1 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - how to contract? h0(x) = mod N (for the un-split buckets) h1(x) = mod (2*N) (for the splitted ones) split ptr 2 3 4 5 1 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - how to contract? h0(x) = mod N (for the un-split buckets) h1(x) = mod (2*N) (for the splitted ones) split ptr 2 3 4 5 1 Faloutsos - Pavlo CMU SCS 15-415/615
Linear hashing - how to contract? h0(x) = mod N (for the un-split buckets) h1(x) = mod (2*N) (for the splitted ones) split ptr 2 3 4 5 1 Faloutsos - Pavlo CMU SCS 15-415/615
Outline (static) hashing extendible hashing linear hashing Hashing vs B-trees Faloutsos - Pavlo CMU SCS 15-415/615
Hashing - pros? Faloutsos - Pavlo CMU SCS 15-415/615
Hashing - pros? Speed, on exact match queries on the average Faloutsos - Pavlo CMU SCS 15-415/615
B(+)-trees - pros? Faloutsos - Pavlo CMU SCS 15-415/615
B(+)-trees - pros? Speed on search: Speed on insertion + deletion exact match queries, worst case range queries nearest-neighbor queries Speed on insertion + deletion smooth growing and shrinking (no re-org) Faloutsos - Pavlo CMU SCS 15-415/615
B(+)-trees vs Hashing Speed on search: Speed on insertion + deletion exact match queries, worst case range queries nearest-neighbor queries Speed on insertion + deletion smooth growing and shrinking (no re-org) Speed, on exact match queries on the average Faloutsos - Pavlo CMU SCS 15-415/615
Conclusions B-trees and variants: in all DBMSs hash indices: in some Faloutsos - Pavlo CMU SCS 15-415/615 Conclusions B-trees and variants: in all DBMSs hash indices: in some (but hashing is useful for joins - later...) Faloutsos - Pavlo CMU SCS 15-415/615