E.G.M. Petrakis: Hashing

1 Hashing
- Data organization in main memory or on disk: sequential, binary trees, …
- The location of a key depends on the other keys => unnecessary key comparisons to find a key
- Question: can a key be found with a single comparison?
- Hashing: the location of a record is computed using its key only
- Fast for random accesses, slow for range queries

2 Hash Table
- Hash function: transforms keys to array indices
[Figure: hash table with index and data columns; h(key) maps a key to a table position]

3 [Figure: table with columns position, key, record, filled using h(key) = key mod 1000]

4 Good Hash Functions
1. Uniform: distributes keys evenly in space
2. Perfect: two records cannot occupy the same location
3. Order preserving: keys keep their relative order in the table
- It is difficult to find such hash functions
- Property 2 is the most essential
- Most functions are no better than h(key) = key mod m
- Hash collision: two different keys are mapped to the same position

5 Collision Resolution
1. Open addressing (rehashing): compute a new position in the table to store the key (no extra space)
   i. linear probing
   ii. double hashing
2. Separate chaining: keep lists of the keys mapped to the same position (uses extra space)

6 Open Addressing
- Computes a new address to store the key if its position is occupied (rehashing)
  - if the new address is occupied too, compute another one, … until an empty position is found
- Primary hash function: i = h(key)
- Rehash function: rh(i) = rh(h(key))
- Hash sequence: (h_0, h_1, h_2, …) = (h(key), rh(h(key)), rh(rh(h(key))), …)
- To find a key, follow the same hash sequence

7 Example
- i = h(key) = key mod 100
- rh(i) = (i + 1) mod 100
- key: 193
- i = h(193) = 93
- rh(i) = (93 + 1) mod 100 = 94
- If position 93 is already occupied, key 193 will occupy position 94
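The open-addressing scheme and the example above can be sketched in a few lines of Python. This is only an illustration: the names `hash_sequence` and `insert`, and the key 93 placed in the table beforehand, are assumptions made for the example and do not come from the slides.

```python
from itertools import islice

M = 100  # table size used in the example


def h(key):
    """Primary hash function: h(key) = key mod 100."""
    return key % M


def rh(i):
    """Rehash function: rh(i) = (i + 1) mod 100."""
    return (i + 1) % M


def hash_sequence(key):
    """Yield h(key), rh(h(key)), rh(rh(h(key))), ..."""
    i = h(key)
    while True:
        yield i
        i = rh(i)


def insert(table, key):
    """Store key at the first empty position of its hash sequence."""
    for pos in islice(hash_sequence(key), M):
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("no empty position found")


table = [None] * M
table[93] = 93             # assumed: an earlier key already occupies position 93
print(insert(table, 193))  # 94
```

A search would walk the same sequence and stop at the key or at the first empty position.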

8 Problem 1: Locate Empty Positions
- No empty position can be found when:
  i. the table is full
     - keep a check on the number of empty positions
  ii. the rehash function fails to find an empty position although the table is not full!
- Example: i = h(key) = key mod 1000
  - rh(i) = (i + 200) mod 1000 => checks only 5 positions on a table of 1000 positions
  - rh(i) = (i + 1) mod 1000 => checks successive positions
  - rh(i) = (i + c) mod 1000, where GCD(c, m) = 1 => checks all positions
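A quick way to check the claim about the step size is to count how many distinct positions a rehash function of the form rh(i) = (i + c) mod m can reach. The helper name `positions_visited` and the step c = 7 are illustrative choices, not part of the slides.

```python
from math import gcd


def positions_visited(start, c, m):
    """Count the distinct positions reached by repeatedly applying rh(i) = (i + c) mod m."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return len(seen)


m = 1000
for c in (200, 1, 7):
    # The number of reachable positions is m / GCD(c, m).
    print(c, gcd(c, m), positions_visited(0, c, m))
```

With c = 200 only 5 positions are reachable, while c = 1 and c = 7 (both with GCD(c, m) = 1) reach all 1000.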

9 Problem 2: Primary Clustering
- Different keys that hash to different addresses compete with each other in successive rehashes
- i = h(key) = key mod 100
- rh(i) = (i + 1) mod 100
- keys: 1990, 1991, 1992, 1993, 1994 => 94

10 Problem 3: Secondary Clustering
- Different keys that hash to the same hash value have the same rehash sequence
- i = h(key) = key mod 10
- rh(i, j) = (i + j) mod 10, where j = 1, 2, 3, … counts the rehashes and i is the previous position
  i. key 23: h(23) = 3, rh = 4, 6, 9, 3, …
  ii. key 13: h(13) = 3, rh = 4, 6, 9, 3, …
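The two sequences above can be reproduced with a short sketch. It assumes, as the listed values suggest, that j is the rehash counter and that each rehash is applied to the previous position; the function name `rehash_sequence` is illustrative.

```python
def rehash_sequence(key, m=10, count=4):
    """Return the first `count` rehash positions for rh(i, j) = (i + j) mod m."""
    i = key % m              # primary position h(key)
    positions = []
    for j in range(1, count + 1):
        i = (i + j) % m      # the j-th rehash starts from the previous position
        positions.append(i)
    return positions


print(rehash_sequence(23))  # [4, 6, 9, 3]
print(rehash_sequence(13))  # [4, 6, 9, 3]
```

Both keys start at position 3 and therefore follow exactly the same path, which is the secondary clustering effect.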

11 Linear Probing
- Store the key in the next free position
- h_0 = h(key), usually h_0 = key mod m
- h_i = (h_{i-1} + 1) mod m, i >= 1
- Example: S = {22, 35, 301, 99, 102, 452}
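A minimal sketch of insertion with linear probing follows. The table size is not preserved in the transcript, so m = 10 is an assumption made only for this example, and `linear_probe_insert` is an illustrative name.

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing and return the number of probes used."""
    m = len(table)
    pos = key % m                      # h_0 = key mod m
    for probes in range(1, m + 1):
        if table[pos] is None:
            table[pos] = key
            return probes
        pos = (pos + 1) % m            # h_i = (h_{i-1} + 1) mod m
    raise RuntimeError("table is full")


table = [None] * 10                    # assumed table size m = 10
keys = [22, 35, 301, 99, 102, 452]
total = sum(linear_probe_insert(table, k) for k in keys)
print(table)
print(total)
```

Under this assumption keys 102 and 452 both hash to position 2, which is taken by 22, so they slide to positions 3 and 4.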

12 Observation 1
- Different insertion sequences => different hash sequences
- S_1 = {11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53} => 28 probes
- S_2 = {53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11} => 30 probes
- h(key) = key mod 13
[Figure: table contents and number of probes for each insertion sequence]

13 Observation 2
- Deletions are not easy:
  - i = h(key) = key mod 10
  - rh(i) = (i + 1) mod 10
- Action: delete(65) and then search(5), where 5 was stored after 65 in the same hash sequence
- Problem: the search will stop at the emptied position and will not find 5
- Solution:
  - mark the position as deleted rather than empty
  - the marked position can be reused
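The "mark as deleted" solution can be sketched as follows. The sentinel names `EMPTY` and `DELETED` and the function names are illustrative assumptions; the hash and rehash functions are the ones from the slide.

```python
EMPTY, DELETED = object(), object()   # sentinels; the names are illustrative
M = 10


def probe(key):
    """Hash sequence for h(key) = key mod 10, rh(i) = (i + 1) mod 10."""
    i = key % M
    for _ in range(M):
        yield i
        i = (i + 1) % M


def insert(table, key):
    for i in probe(key):
        if table[i] is EMPTY or table[i] is DELETED:   # deleted positions are reused
            table[i] = key
            return i
    raise RuntimeError("table is full")


def search(table, key):
    for i in probe(key):
        if table[i] is EMPTY:          # only a truly empty position ends the search
            return None
        if table[i] is not DELETED and table[i] == key:
            return i
    return None


def delete(table, key):
    i = search(table, key)
    if i is not None:
        table[i] = DELETED             # mark instead of emptying
    return i


table = [EMPTY] * M
insert(table, 65)          # stored at position 5
insert(table, 5)           # position 5 is taken, stored at position 6
delete(table, 65)
print(search(table, 5))    # 6
```

Had `delete` written `EMPTY` instead of `DELETED`, the search for 5 would stop at position 5 and fail, which is exactly the problem described above.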

14 Observation 3
- Linear probing tends to create long sequences of occupied positions
  - the longer a sequence is, the longer it tends to become
- P: probability to use a position in the cluster
[Figure: a cluster B of occupied positions in a table of m positions]

15 Observation 4
- Linear probing suffers from both primary and secondary clustering
- Solution: double hashing
  - uses two hash functions h_1, h_2 and a rehashing function rh

16 Double Hashing
- Two hash functions and a rehashing function
  - primary hash function: i = h_1(key) = key mod m
  - secondary hash function: h_2(key)
  - rehashing function: rh(i, key) = (i + h_2(key)) mod m
- h_2(m, key) is some function of m and key
  - helps rh in computing random positions in the hash table
  - h_2 is computed once for each key!

17 Example of Double Hashing
i. hash functions:
  - h_1(key) = key mod m
  - h_2(key) is derived from q = (key div m) mod m
ii. rehash function:
  - rh(i, key) = (i + h_2(key)) mod m

18 Example (continued)
A. m = 10, key = 23
  - h_1(23) = 3, h_2(23) = 2
  - rh(3, 2) = (3 + 2) mod 10 = 5
  - rehash sequence: 5, 7, 9, 1, …
B. m = 10, key = 13
  - h_1(13) = 3, h_2(13) = 1
  - rh(3, 1) = (3 + 1) mod 10 = 4
  - rehash sequence: 4, 5, 6, …
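The double hashing example can be reproduced with the sketch below. It takes h_2(key) = q = (key div m) mod m, which matches the values in the example; the fallback to a step of 1 when q = 0 is an assumption added here so that the sequence always moves, and the function names are illustrative.

```python
def h1(key, m):
    """Primary hash function: key mod m."""
    return key % m


def h2(key, m):
    """Secondary hash function based on q = (key div m) mod m."""
    q = (key // m) % m
    return q if q != 0 else 1   # assumption: use step 1 when q is 0


def rehash_sequence(key, m, count):
    """Return the first `count` positions produced by rh(i, key) = (i + h2(key)) mod m."""
    step = h2(key, m)           # computed once per key
    i = h1(key, m)
    positions = []
    for _ in range(count):
        i = (i + step) % m
        positions.append(i)
    return positions


print(rehash_sequence(23, 10, 4))  # [5, 7, 9, 1]
print(rehash_sequence(13, 10, 3))  # [4, 5, 6]
```

Keys 23 and 13 collide at position 3 but then follow different sequences. Note that with m = 10 a step of 2 can reach only 5 positions, in line with the GCD(c, m) = 1 condition of Problem 1; choosing a prime m avoids this.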

19 Performance of Open Addressing
- Distinguish between
  - successful and
  - unsuccessful search
- Assume a series of probes to random positions
  - the probes are independent events
  - load factor: λ = n/m (n keys stored in a table of m positions)
  - λ: probability to probe an occupied position
  - each position has the same probability P = 1/m

20 Unsuccessful Search
- The hash sequence is exhausted
  - let u be the expected number of probes
  - u equals the expected length of the hash sequence
  - P(k): probability to search k positions in the hash sequence

22 Unsuccessful Search (continued)
- The probes are independent events
- u increases with λ => performance drops as λ increases
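Under the stated assumptions (independent probes, each hitting an occupied position with probability λ), the expected number of probes can be worked out as below. This is the standard textbook derivation, given here as a reconstruction and not necessarily in the author's exact form.

```latex
% k probes are needed when the first k-1 positions are occupied and the k-th is empty
P(k) = \lambda^{k-1}(1-\lambda), \qquad k \ge 1

% expected length of the hash sequence
u = \sum_{k \ge 1} k\,P(k)
  = (1-\lambda)\sum_{k \ge 1} k\,\lambda^{k-1}
  = \frac{1-\lambda}{(1-\lambda)^{2}}
  = \frac{1}{1-\lambda}
```

For example, λ = 0.5 gives u = 2 and λ = 0.9 gives u = 10, so u grows quickly as the table fills up.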

23 Successful Search
- The hash sequence is not exhausted
  - the number of probes to find a key equals the number of probes s at the time the key was inserted, plus 1
  - λ was smaller at that time
  - consider all values of λ up to the current one
  - s increases with λ
- u: equivalent to unsuccessful search
- The result is an approximation
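One standard way to "consider all values of λ" is to average the insertion cost over the load factors the table went through while it was being filled. The sketch below assumes the random-probing estimate u(x) = 1/(1 - x) for an unsuccessful search at load factor x, with u(x) already counting the final probe; it is a textbook approximation, not necessarily the author's exact formula.

```latex
s(\lambda) \approx \frac{1}{\lambda}\int_{0}^{\lambda} u(x)\,dx
           = \frac{1}{\lambda}\int_{0}^{\lambda} \frac{dx}{1-x}
           = \frac{1}{\lambda}\,\ln\frac{1}{1-\lambda}
```

For λ = 0.5 this gives s ≈ 1.39, which is smaller than the corresponding u = 2.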

24 Performance
- The performance drops as λ increases
  - the higher the value of λ, the higher the probability of collisions
- Unsuccessful search is more expensive than successful search
  - unsuccessful search exhausts the hash sequence

25 Experimental Results
              SUCCESSFUL                   UNSUCCESSFUL
LOAD FACTOR   LINEAR   i + bkey   DOUBLE   LINEAR   i + bkey   DOUBLE
25%             1.17       1.16     1.15     1.39       1.37     1.33
50%             1.50       1.44     1.39     2.50       2.19     2.00
75%             2.50       2.01     1.85     8.50       4.64     4.00
90%             5.50       2.85     2.56    50.50      11.40    10.00
95%            10.50       3.52     3.15   200.50      22.04    20.00
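The LINEAR and DOUBLE columns of the table agree, to two decimals, with the well-known closed-form approximations for linear probing and for random (double-hashing-like) probing. The sketch below evaluates those formulas; the function names are illustrative, and the "i + bkey" column is left out because no formula for it appears in the transcript.

```python
import math


def linear_successful(lam):
    """Linear probing, successful search: (1 + 1/(1 - lam)) / 2."""
    return 0.5 * (1 + 1 / (1 - lam))


def linear_unsuccessful(lam):
    """Linear probing, unsuccessful search: (1 + 1/(1 - lam)^2) / 2."""
    return 0.5 * (1 + 1 / (1 - lam) ** 2)


def double_successful(lam):
    """Random probing, successful search: (1/lam) * ln(1/(1 - lam))."""
    return math.log(1 / (1 - lam)) / lam


def double_unsuccessful(lam):
    """Random probing, unsuccessful search: 1/(1 - lam)."""
    return 1 / (1 - lam)


for lam in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"{lam:.0%}",
          f"{linear_successful(lam):7.2f}", f"{double_successful(lam):7.2f}",
          f"{linear_unsuccessful(lam):7.2f}", f"{double_unsuccessful(lam):7.2f}")
```

The printed rows can be compared directly with the table, for instance 2.50 and 2.00 probes for an unsuccessful search at a load factor of 50%.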

26 Performance on Full Table

27 Separate Chaining
- Keys hashing to the same hash value are stored in separate lists
  - one list per hash position
  - can store more than m records
  - easy to implement
  - the keys in each list can be ordered

28 [Figure: separate chaining with h(key) = key mod m]
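A minimal sketch of separate chaining with h(key) = key mod m is given below. The class name `ChainedHashTable`, its methods and the sample keys are illustrative assumptions; Python lists stand in for the linked lists of the slides.

```python
class ChainedHashTable:
    """Hash table with one list (chain) per position."""

    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def insert(self, key):
        chain = self.chains[key % self.m]
        if key not in chain:
            chain.append(key)

    def search(self, key):
        """Return (found, comparisons) for a scan of the key's chain."""
        chain = self.chains[key % self.m]
        for comparisons, stored in enumerate(chain, start=1):
            if stored == key:
                return True, comparisons
        return False, len(chain)       # the entire chain was scanned

    def delete(self, key):
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)          # no special marking is needed
            return True
        return False


t = ChainedHashTable(10)
for k in (19, 29, 49, 59, 7):          # sample keys chosen for illustration
    t.insert(k)
print(t.chains[9])                     # [19, 29, 49, 59]
print(t.search(49))                    # (True, 3)
print(t.search(39))                    # (False, 4)
```

Since chains grow independently of the table, the structure can hold more than m keys, and deleting a key simply removes it from its chain.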

29 Performance of Separate Chaining
- Depends on the average chain size
  - insertions are independent events
  - let P(c, n, m) be the probability that a position has been selected c times after n insertions on a table of size m
  - P(c, n, m) is also the probability that the chain at that position has length c => binomial distribution
  - p = 1/m: success case
  - q = 1 - p: failure case

30 Performance of Separate Chaining (continued)
- For large n and m, with λ = n/m: $P(c,n,m) \approx \frac{1}{c!}\lambda^{c}e^{-\lambda}$ (Poisson distribution)
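The step from the binomial to the Poisson distribution can be checked numerically. The sketch below compares the two for an illustrative table (n = 900 and m = 1000 are assumed values, chosen only for the demonstration).

```python
from math import comb, exp, factorial


def binomial(c, n, m):
    """Probability that a position is selected exactly c times in n insertions (p = 1/m)."""
    p = 1 / m
    return comb(n, c) * p ** c * (1 - p) ** (n - c)


def poisson(c, lam):
    """Poisson approximation (1/c!) * lam^c * e^(-lam)."""
    return lam ** c * exp(-lam) / factorial(c)


n, m = 900, 1000                       # illustrative sizes, lam = 0.9
for c in range(5):
    print(c, round(binomial(c, n, m), 4), round(poisson(c, n / m), 4))
```

The two columns agree closely for a table of this size, which is why the simpler Poisson form is used for the chain-length analysis.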

31 Unsuccessful Search
- The entire chain is searched
  - the average number of comparisons equals the average chain length u = n/m = λ

32 Successful Search
- Not the whole chain is searched
  - the average number of comparisons equals the length s of the chain at the time the key was inserted, plus 1
  - the performance at the time a key was inserted equals that of unsuccessful search!

33 Performance
- The performance drops with the length of the chains
  - worst case: all keys are stored in a single chain
  - worst case performance: O(N)
  - unsuccessful search performs better than successful search!! Why?
  - no problem with deletions!!

34 Coalesced Hashing
- The hash sequence is implemented as a linked list within the hash table
  - no rehash function
  - the next hash position is the next available position in the linked list
  - extra space for the list
- Example: h(key) = key mod 10, keys: 19, 29, 49, 59

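35 Coalesced Hashing Example
- h(key) = key mod 10
- keys: 14, 29, 34, 28, 42, 39, 84, 38
- initialization: avail = 9 (avail points to the list of empty positions)
- The table holds both the lists of rehashing positions and the list of empty positions
[Figure: table contents after initialization and after the insertions]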
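The insertions of this example can be traced with the sketch below. It follows the common textbook variant of coalesced hashing, in which `avail` starts at the last position (9) and moves downwards to the next empty one; the slide's figure may instead keep an explicit list of empty positions, so the final layout shown there could differ. All function and variable names are illustrative.

```python
def coalesced_insert_all(keys, m=10):
    """Insert keys with coalesced hashing; return the table and the link fields."""
    table = [None] * m
    link = [None] * m        # link[i]: next position of the chain passing through i
    avail = m - 1            # assumed: empty positions are taken from the top down

    for key in keys:
        pos = key % m
        if table[pos] is None:
            table[pos] = key
            continue
        while link[pos] is not None:          # walk to the end of the chain
            pos = link[pos]
        while avail >= 0 and table[avail] is not None:
            avail -= 1                        # skip positions already in use
        if avail < 0:
            raise RuntimeError("table is full")
        table[avail] = key
        link[pos] = avail                     # append the new position to the chain
    return table, link


def coalesced_search(table, link, key):
    """Follow the chain that starts at h(key); return the key's position or None."""
    pos = key % len(table)
    while pos is not None and table[pos] is not None:
        if table[pos] == key:
            return pos
        pos = link[pos]
    return None


table, link = coalesced_insert_all([14, 29, 34, 28, 42, 39, 84, 38])
print(table)
print(coalesced_search(table, link, 38))
```

In this variant key 28 finds its home position 8 already taken by 34, a key from a different chain, so the two chains merge; this merging is what gives the method its name.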

36 Performance of Coalesced Hashing
- Unsuccessful search
- Successful search
[Figure: probes/search for unsuccessful and successful search]

