© Neeraj Suri EU-NSF ICT March 2006 Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de Introduction to Computer Science 2 Hash.


1 © Neeraj Suri EU-NSF ICT March 2006 Dependable Embedded Systems & SW Group Introduction to Computer Science 2 Hash Tables (2) Prof. Neeraj Suri Dan Dobre

2 ICS-II Hash Tables (2) Overview
• So far:
  - Direct hashing
  - Hash functions (folding, modulo, etc.)
  - Collision resolution (linear and quadratic probing)
• What's next?
  - Collision resolution continued
  - Cost analysis of hashing
  - Hashing on external memory
  - Extendible (dynamic) hashing
  - Excursus: (pseudo-)random numbers and their application

3 Double/repeated Hashing
• If a collision occurs, the key is hashed a second time using another hash function.
• This can be generalized: if a collision occurs again, the key is hashed with the next hash function in a fixed sequence.
• If the collision persists after k hash functions have been tried, another technique has to be applied.
• Avoids the accumulation of collisions; delete remains complex, and reaching every slot of the memory space is problematic.
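The repeated-hashing idea can be sketched in a few lines of Python. This is only an illustration: the table size `M` and the three hash functions are hypothetical choices, since the slides do not name any concrete functions, and the search shown here assumes that no keys have been deleted.

```python
from typing import Callable, List, Optional

M = 11  # hypothetical table size

# Hypothetical sequence of k = 3 hash functions, tried in this order.
HASH_FUNCTIONS: List[Callable[[int], int]] = [
    lambda k: k % M,
    lambda k: (k // M) % M,
    lambda k: (3 * k + 1) % M,
]


def insert(table: List[Optional[int]], key: int) -> Optional[int]:
    """Try each hash function in turn; return the slot used, or None if all collide."""
    for h in HASH_FUNCTIONS:
        slot = h(key)
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    return None  # all k functions collided: another technique has to be applied


def search(table: List[Optional[int]], key: int) -> Optional[int]:
    """Follow the same sequence of hash functions (valid as long as nothing was deleted)."""
    for h in HASH_FUNCTIONS:
        slot = h(key)
        if table[slot] == key:
            return slot
        if table[slot] is None:
            return None  # an empty slot ends the probe sequence
    return None


table: List[Optional[int]] = [None] * M
for key in (5, 16, 137):
    print(key, "->", insert(table, key))
```

With these particular functions the key 137 collides under all three of them, which is the case where the slide says another technique is needed.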

4 Chaining of synonyms in the same HT
• Members of a collision class (synonyms) are chained together.
• Each memory slot in the HT must have an additional pointer.
• Because there is no separate overflow area, collisions continue to occur when a slot is occupied by a key of a foreign collision class.
• Chaining doesn't prevent collisions, but it speeds up the search.
• Delete becomes considerably easier, because only one pointer has to be reset.
• Insert has to follow the pointer chain until a free slot is found.
• If the home address is occupied by another key (which does not belong there), that key is moved.

5 Chaining: Example
• $h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
• Insert: 11, 32, 8,

6 Chaining: Example
• $h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
• Now insert 12
• Move 32: search left for pointer, then move further to position

7 Chaining: Example
• $h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
• Now insert 12 in its home address

8 Chaining: Example
• $h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
• Delete 11:
  - Follow the chain until 25 is reached (4-0-6)
  - Move 25 to its home address 4
  - Delete pointer “6” in address

9 Chaining: Example
• $h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
• The collision chain up to 32 is now broken (empty address 6)
• But this is not a problem, since pointers are used for chaining

10 Chaining with separate overflow
• All records that cannot be stored at their own home address are transferred to an overflow area.
• The overflow area can be:
  - A single overflow area for all synonyms, with only one entry point
      simple; avoids pointers in the hash table
      possibly long synonym chains, therefore only suitable when collisions are rare
  - A single overflow area with more than one entry point
      efficient, since only the members of one collision class are traversed
      requires a pointer for each entry in the hash table
      the reference to the synonym chain can be implemented using double hashing → in the case of a collision, the synonyms (mostly few) of 2 collision classes are affected

11 Chaining with separate overflow
• The separate overflow area can be assigned dynamically.
• The HT can be restricted to the keys in the home address; all data can be stored in the dynamic overflow area.
• Since pointers can refer to any address, this corresponds to a partition of the overflow area.
• Chaining of synonyms is a preferred method.

Position  Key        Pointer (chained synonyms)
0         HAYDN      HAENDEL, VIVALDI
1         BEETHOVEN  BACH, BRAHMS
2         CORELLI
3
4         SCHUBERT   LISZT
5         MOZART
6
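A minimal Python sketch of chaining with a dynamically allocated overflow, using the composer names from the slide. The hash function is an assumption: the slide does not state it, but taking the alphabet index of the first letter (A = 0) modulo 7 reproduces the positions shown. The insertion order is also assumed.

```python
from typing import Dict, List

M = 7  # positions 0..6, as on the slide


def h(name: str) -> int:
    # Assumed hash function (not given on the slide):
    # alphabet index of the first letter, A = 0, modulo the table size.
    return (ord(name[0].upper()) - ord("A")) % M


def build(names: List[str]) -> Dict[int, List[str]]:
    """Return position -> chain of synonyms in insertion order."""
    table: Dict[int, List[str]] = {pos: [] for pos in range(M)}
    for name in names:
        chain = table[h(name)]
        if name not in chain:  # an insert first searches the whole chain
            chain.append(name)
    return table


# Assumed insertion order; only the order within each chain matters here.
composers = ["HAYDN", "BEETHOVEN", "CORELLI", "SCHUBERT", "MOZART",
             "HAENDEL", "BACH", "LISZT", "VIVALDI", "BRAHMS"]

table = build(composers)
for pos in range(M):
    print(pos, " -> ".join(table[pos]))
```

The first element of each chain plays the role of the key stored at the home address; the remaining elements are the synonyms reached through the pointer.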

12 Hashing: analysis of the costs
• Cost measure: number of steps (addressing attempts)
• Assumptions:
  - The same time effort for every evaluation of $h(K_p)$ and every search step
  - The hash table is occupied by $n$ keys
• Search costs $S_n$ = delete costs without rearrangement
• Insert costs = costs of an unsuccessful search $U_n$
• Delete costs = $S_n$ + rearrangement $R_n$
• Costs can be expressed as a function of the allocation factor $\beta = n/m$

13 Hashing: analysis of the costs – extreme cases
• Worst case:
  - $S_n = n$
  - $U_n = n + 1$
  - One collision class, access as in a linear list
• Best case:
  - $S_n = 1$
  - $U_n = 1$
  - No collisions

14 Hashing: analysis of the costs – average cases
• The average case depends on the overflow handling
• Assumption:
  - $h(K_p)$ distributes the keys uniformly -> the probability that a key has a given hash value $i$, $0 \le i \le m-1$, is $1/m$

15 Costs using linear probing
• Example: $h_i(k) = (h_0(k) + i) \bmod m$
• With a low allocation of the HT, no problem
• With a higher allocation, drastic degradation
• The probability $p$ that slot 7 will be occupied next is $1/m$, because slot 6 is free
• The probability that slot 14 will be occupied next is $5/m$ (the $p$ for 14 as home address plus the sum of the $p$ for 10, 11, 12, 13, each of which overflows to 14)
• Long chains become longer, and chains can grow together (insert in 3 or 14)
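The $1/m$ versus $5/m$ argument can be checked with a small Python sketch. The table size `m = 16` is a hypothetical value (the slide's figure is not part of the transcript), and only the occupied slots 10 to 13 mentioned in the text are modelled.

```python
from fractions import Fraction
from typing import Set


def fill_probability(occupied: Set[int], slot: int, m: int) -> Fraction:
    """Probability that the free `slot` receives the next key.

    Assumes linear probing with step +1 and uniformly distributed home
    addresses: the slot is hit by its own home address and by every home
    address in the run of occupied slots directly before it.
    """
    if slot in occupied:
        raise ValueError("slot is already occupied")
    run = 0
    pos = (slot - 1) % m
    while pos in occupied and run < m:
        run += 1
        pos = (pos - 1) % m
    return Fraction(run + 1, m)


m = 16                       # hypothetical table size
occupied = {10, 11, 12, 13}  # the cluster named on the slide

print(fill_probability(occupied, 7, m))   # slot 6 is free, so only home address 7 counts
print(fill_probability(occupied, 14, m))  # home address 14 plus the four slots before it
```

The longer the cluster in front of a free slot, the more likely that slot is to be filled next, which is why long chains keep growing.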

16 Costs using linear probing
• According to Knuth, with $0 \le \beta = n/m < 1$:
  - $S_n = 0.5\,(1 + 1/(1-\beta))$
  - $U_n = 0.5\,(1 + 1/(1-\beta)^2)$
• The number of search steps increases drastically with a higher allocation factor $\beta$
[Figure: $S_n$ and $U_n$ (steps) as a function of $\beta$]
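To get a feeling for how quickly the costs rise, the two formulas can be evaluated for a few allocation factors. The function name and the chosen values of β are illustrative only.

```python
from typing import Tuple


def linear_probing_costs(beta: float) -> Tuple[float, float]:
    """Return (S_n, U_n) for linear probing at allocation factor beta, 0 <= beta < 1."""
    if not 0 <= beta < 1:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    s = 0.5 * (1 + 1 / (1 - beta))
    u = 0.5 * (1 + 1 / (1 - beta) ** 2)
    return s, u


for beta in (0.25, 0.5, 0.75, 0.9):
    s, u = linear_probing_costs(beta)
    print(f"beta={beta:.2f}  S_n={s:.2f}  U_n={u:.2f}")
```

At β = 0.5 the formulas give 1.5 and 2.5 steps; at β = 0.9 they give about 5.5 and 50.5 steps.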

17 Costs using optimal collision resolution
• With optimal methods for collision resolution, a uniform distribution can approximately be assumed despite collisions
• E.g.: rehashing, pseudo-random numbers, etc.
• The probability that a slot is occupied/free depends on the number of slots already allocated ($n$) and on the number still available ($m-n$)
• E.g.: $P_{free} = (m-n)/m$
• See the script for details of the derivation

18 Costs using optimal collision resolution (2)
• Approximately, with $0 \le \beta = n/m < 1$:
  - $S_n \approx \left|(1/\beta)\,\ln(1-\beta)\right|$
  - $U_n \approx 1/(1-\beta)$
• The number of search steps can improve drastically with independent allocation after collision resolution
[Figure: $S_n$ and $U_n$ (steps) as a function of $\beta$]
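A short numerical sketch of these approximations; the function name and the sample values of β are illustrative.

```python
import math
from typing import Tuple


def uniform_probing_costs(beta: float) -> Tuple[float, float]:
    """Return the approximate (S_n, U_n) for 0 < beta < 1."""
    if not 0 < beta < 1:
        raise ValueError("beta must satisfy 0 < beta < 1")
    s = abs(math.log(1 - beta) / beta)  # |(1/beta) * ln(1 - beta)|
    u = 1 / (1 - beta)
    return s, u


for beta in (0.25, 0.5, 0.75, 0.9):
    s, u = uniform_probing_costs(beta)
    print(f"beta={beta:.2f}  S_n={s:.2f}  U_n={u:.2f}")
```

At β = 0.9 this gives roughly 2.56 and 10 steps, compared with about 5.5 and 50.5 steps from the linear-probing formulas at the same allocation factor.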

19 Costs using separate overflow
• Assumption: uniform distribution of the keys over all chains; $n/m = \beta$ keys per chain; furthermore linear chaining (Q: how big is $S_n$?)
• When key $i$ is inserted into the HT, $i-1$ keys are in the table and $(i-1)/m$ keys in each chain
• The cost of finding a free place is 1 step for the home address plus $(i-1)/m$ steps to reach the end of the chain (we must first check whether the key already exists in the table)
• Averaged over all $n$ keys: $S_n = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right) = 1 + \frac{n-1}{2m} \approx 1 + \frac{\beta}{2}$
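The intermediate steps of the average, written out; the only ingredient beyond the slide is the standard sum $\sum_{i=1}^{n}(i-1) = n(n-1)/2$.

```latex
\begin{align*}
S_n &= \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)
     = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1) \\
    &= 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2}
     = 1 + \frac{n-1}{2m} \\
    &\approx 1 + \frac{n}{2m}
     = 1 + \frac{\beta}{2}
     \qquad (n \gg 1,\ \beta = n/m)
\end{align*}
```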

20 Costs using separate overflow
• For a successful search, half of the chain is traversed on average
• For an unsuccessful search, the entire chain has to be traversed
• Chaining is superior to the other methods; it stays efficient even with high overflow ($\beta > 1$)
[Figure: $S_n$, $U_n$]

21 Hashing on external memory (b>1)
• With a bucket factor $b > 1$, $b$ records can be stored at one address
• Suitable for both main and external memory, particularly attractive with external memory
• On a collision, the new record is simply stored in the same bucket
• A bucket overflows only with its $(b+1)$-th entry
• On overflow, the known methods for collision resolution can be applied:
  - Overflow in the primary area
  - Separate overflow area

22 Hashing on external memory
• An overflow bucket can be assigned dynamically and linked via an overflow address
• An overflow bucket can serve as the overflow area for several home addresses
• Recommended: one chain per collision class
• With $b > 1$, the allocation factor is $\beta = n/(b\,m)$
• Sequence for storing records in a bucket:
  - According to the insert sequence (sequential)
  - According to the sorting sequence (linked list)

23 Hashing on external memory
• Typical bucket size:
  - Sector
  - Track
  - Page
  - Generally: transfer unit (1 I/O per bucket)
• As with B-trees, I/O dominates (approx ms) → a more complex hash function is justified
• Relative search costs inside one bucket are low
• Insert always at the first free space in the chain
• On deletion, there is no need to close gaps (or only inside a page)
• Empty overflow buckets are removed from the chain

24 Example: b=2
• $b = 2$; $h(k) = k \bmod 7$
• Insert: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4,
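A minimal Python sketch of this example with one overflow chain per home address. It uses only the keys that are visible in the insert list above; the class and function names are illustrative, and the layout of the original figure is not available, so the resulting chains are what this sketch produces, not necessarily what the slide drew.

```python
from dataclasses import dataclass, field
from typing import List, Optional

B = 2  # bucket factor
M = 7  # number of primary buckets, h(k) = k mod 7


@dataclass
class Bucket:
    records: List[int] = field(default_factory=list)
    overflow: Optional["Bucket"] = None  # next bucket in the overflow chain


def insert(table: List[Bucket], key: int) -> None:
    """Insert at the first free space in the chain; append a bucket if none is free."""
    bucket = table[key % M]
    first_free: Optional[Bucket] = None
    while True:
        if key in bucket.records:
            return  # key already stored
        if first_free is None and len(bucket.records) < B:
            first_free = bucket
        if bucket.overflow is None:
            break
        bucket = bucket.overflow
    if first_free is None:  # whole chain is full: assign a new overflow bucket
        first_free = Bucket()
        bucket.overflow = first_free
    first_free.records.append(key)


def chain(table: List[Bucket], address: int) -> List[List[int]]:
    """Return the records of the primary bucket and all its overflow buckets."""
    result = []
    bucket: Optional[Bucket] = table[address]
    while bucket is not None:
        result.append(list(bucket.records))
        bucket = bucket.overflow
    return result


table = [Bucket() for _ in range(M)]
for key in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4):
    insert(table, key)

for address in range(M):
    print(address, chain(table, address))
```

Home address 4 receives five of these keys (11, 32, 25, 18, 4), so with b = 2 it needs two overflow buckets, while all other addresses fit into their primary bucket.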

25 Example: b=2 (2)
• Now: delete

26 Example: b=2 (3)
• Chains are not closed up! Records are rearranged inside a page if needed

27 Summary: Hashing on external memory
• Primary buckets always remain assigned because of relative addressing
• Overflow buckets are assigned dynamically (appended); empty buckets are deleted
• When the data shrinks strongly, buckets may become underfilled (reorganization of the file, e.g. by rehashing all entries stored in the hash table)

28 Approximate values for Hashing
• Selected values for $S_n(b)$ and $U_n(b)$ as a function of $b$ and $\beta$
• Rule of thumb: $b$ is typically determined by the data transfer unit; select $\beta$ such that $S \approx 1.05$ to $1.08$ holds

29 Hashing vs. B+-Tree
• Access costs of a well-designed hash method are better than those of a B+-tree (1.05 vs. path length)
• Disadvantages:
  - No sorting of the keys (sequential output is considerably more expensive)
  - Hashing is static
      Not extendable; long chains lead to degeneration
      Occupies the complete designated memory space even with a small number of keys (can also be an advantage: the required memory space is largely defined from the beginning)

30 Extendible Hashing
• Disadvantages of static hash methods with a strongly growing volume of data:
  - The primary area must be dimensioned generously from the beginning (→ poor initial allocation)
  - If the capacity of the primary area is exceeded, the overflow chains grow fast
  - Run-time behavior degrades
  - Reorganization requires unloading the entire volume of data and loading it again → interruption of operation (often not possible, e.g., with 24x7 operation)

31 Extendible Hashing
• Therefore we need a hash method that
  - Permits dynamic growing and shrinking of the hash area
  - Guarantees constant run-time behavior independently of the size of the data
  - Requires no more than 2 page accesses to find a record
  - Avoids overflow mechanisms and total reorganization
  - Guarantees a high allocation of the memory independently of the growth of the key set

32 Extendible Hashing
• Must avoid overflow buckets
• We would like stability and are ready to pay for it, i.e., a constant 2 accesses
• Available (known to us) techniques:
  - Balancing as in B-trees (constant path length)
  - Addressing via coding of the key, as in digital trees
• Extendible hashing uses these techniques in order to guarantee stable access with exactly 2 I/O operations.

33 Extendible Hashing
• The hash function transforms keys into binary strings (coding)
• Only the first $n$ bits are used, as many as necessary (addressing as in a digital tree)
• Additional indirection via the container board
  - With few keys, few bits are sufficient
  - With many keys, additional bits are used
• Containers are added or removed as necessary (balancing)
• The container board is “doubled” if necessary → costs memory space, but no computation-intensive operations

34 Example: Extendible Hashing
• Insertion sequence: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27
[Table: binary coding of each key]

35 Extendible Hashing, b=2
• Initial situation:
  - The container board contains only one reference
  - It points to an empty container
• Insert 11 → works without problems

36 Extendible Hashing, b=2
• Next key 8 → doesn't fit anymore
• Thus, doubling of the capacity through duplication of the container board (still no extra containers!)

37 Extendible Hashing, b=2
• Blue numbers: implicit through the addresses of the container board
• Now: next key 8 → fits through partition of the boards

38 Extendible Hashing, b=2
• Next key 25 → doesn't fit in the first board, no other address available (for partition of the container) → the container board has to be doubled

39 Extendible Hashing, b=2
• Again: doubling the container board does not generate an extra container
• Next key (still) 25

40 Extendible Hashing, b=2
• Additional container

41 Extendible Hashing, b=2
• Next key 21 → no problems

42 Extendible Hashing, b=2
• Next key 15 → easy doubling of the container board

43 Extendible Hashing, b=2
• Next key 15 → still not possible → doubling again

44 Extendible Hashing, b=2
• Next key 15 → now the selectivity is sufficiently large → container doubling

45 Extendible Hashing, b=2
• Next key (straightforward): 2

46 Extendible Hashing, b=2
• Finish

47 Extendible Hashing
• The prefix of the key doesn't always have to be used; one can also use the postfix
• For keys that are not uniformly distributed, an internal hash function can be used to produce the bit string used in extendible hashing

48 Summary, extendible Hashing
• Key fragment with $n$ bits → direct hashing (container board)
• Containers have a bucket factor $b > 1$ (typically $b > 20$)
• Search:
  - Look up the container address in the container board
  - Search in the container (e.g., binary search)

49 Summary, extendible Hashing
• Insert:
  - Look up the container address in the container board
  - Search in the container
  - If found → good, no further actions
  - If not found:
      If there is a free slot in the container → insert
      If there is no free slot:
        - Double the container board until the key fragment is selective enough to establish more containers (note: sometimes the container board doesn't need to be doubled)
        - Add new containers and, if needed, redistribute the keys from the old container among the new containers
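The search and insert steps can be sketched in Python as follows. This is a simplified in-memory illustration under stated assumptions, not the slides' exact construction: it addresses the container board with the low-order bits of the key (the postfix variant), uses the integer key itself as the bit string, and the names `Container`, `ExtendibleHash`, and `board` are hypothetical. The container contents it produces can therefore differ from the figures in the original deck.

```python
from typing import List

B = 2  # bucket factor, as in the slides' example


class Container:
    def __init__(self, depth: int) -> None:
        self.depth = depth  # number of key bits this container is selective on
        self.keys: List[int] = []


class ExtendibleHash:
    def __init__(self) -> None:
        self.depth = 0                                # bits used by the container board
        self.board: List[Container] = [Container(0)]  # always 2**depth references

    def _index(self, key: int) -> int:
        return key & ((1 << self.depth) - 1)          # low-order `depth` bits

    def search(self, key: int) -> bool:
        return key in self.board[self._index(key)].keys

    def insert(self, key: int) -> None:
        while True:
            container = self.board[self._index(key)]
            if key in container.keys:
                return                                # found: no further action
            if len(container.keys) < B:
                container.keys.append(key)            # free slot: insert
                return
            if container.depth == self.depth:
                # Board is not selective enough: double it. No container is
                # created here, the references are merely duplicated.
                self.board = self.board + self.board
                self.depth += 1
            self._split(container)                    # then retry the insert

    def _split(self, old: Container) -> None:
        """Add one container and redistribute the keys of the full one."""
        old.depth += 1
        new = Container(old.depth)
        bit = 1 << (old.depth - 1)                    # the newly significant bit
        keys, old.keys = old.keys, []
        for k in keys:
            (new if k & bit else old).keys.append(k)
        for i, c in enumerate(self.board):
            if c is old and i & bit:
                self.board[i] = new


table = ExtendibleHash()
for key in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27):
    table.insert(key)

print("board size:", len(table.board))
for i, c in enumerate(table.board):
    print(format(i, f"0{table.depth}b"), c.keys)
```

A split sometimes needs no doubling at all, namely when the full container is still shared by several board entries; this corresponds to the note on the slide that the container board does not always have to be doubled.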

50 Summary, extendible Hashing
• Delete:
  - Look up the container address in the container board
  - Search in the container
  - If found → delete
  - If the container is empty → delete the container and set the pointer in the container board to the neighbor container

51 Extendible Hashing
• In principle very similar to direct hashing using the first bits of the key ($h(k) = k / 2^x$)
• BUT: with direct hashing, doubling the table when an overflow occurs is much more expensive. For extendible hashing, each pointer only has to be written to two successive addresses; for direct hashing, each address has to be split.

52 Example: Extendible Hashing vs. Direct Hashing
(There is no container board in direct hashing, but we added it here for the sake of understanding)

53 Analysis, extendible Hashing
• Search has a constant cost: two I/O operations
• Delete may additionally require deleting the container, but the cost is still constant
• Insert “usually” needs at most 5 operations (search, write to the container, if needed write to other containers, write to the container board)
• BUT IN ADDITION: if needed, reorganization of the container board (duplicate all pointers)

54 Analysis, extendible Hashing
• Doubling of the container board occurs mainly in main memory → low cost in comparison to I/O operations
• A very successful and widely used method

55 Excursus: Pseudo-random numbers
• A topic that is closely related to hashing
• Why “pseudo”-random numbers?
  - The computer is a “good” computational servant
  - Algorithms are always executed reliably in the same way
  - Consequence: generating random numbers is not a strength of computers!
• Applications:
  - Games
  - Simulation
  - Generating keys for cryptography
  - But especially also numerical solutions of problems

56 Example of an application
• Computation of $\pi$
• Area of the unit circle ($\pi$)
• Compute the area of a quarter of the circle ($\pi/4$) numerically and then multiply by 4 → $\pi$

57 Compute Pi
• Counting: 36 x 36 = 1296 small boxes
• Or roll the dice!
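Both approaches can be sketched in Python. The rule for counting a box (its centre lies inside the quarter circle), the sample count, and the seed are assumptions made for this illustration; the slide only gives the 36 x 36 grid and the idea of rolling dice.

```python
import math
import random


def pi_by_counting(n: int = 36) -> float:
    """Count the boxes of an n x n grid whose centre lies in the quarter circle."""
    inside = 0
    for i in range(n):
        for j in range(n):
            x = (i + 0.5) / n
            y = (j + 0.5) / n
            if x * x + y * y <= 1.0:
                inside += 1
    return 4 * inside / (n * n)


def pi_by_dice(samples: int, seed: int = 0) -> float:
    """Throw pseudo-random points into the unit square and count the hits."""
    rng = random.Random(seed)  # fixed seed: the "random" run is reproducible
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / samples


print("counting:", pi_by_counting(36))
print("dice    :", pi_by_dice(100_000))
print("math.pi :", math.pi)
```

In two dimensions the systematic count is hard to beat; the random variant becomes attractive when the number of dimensions grows and a full grid is no longer affordable.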

58 Compute Pi
• Particularly for computations of multi-dimensional cases (e.g., physical systems with many degrees of freedom, physics simulations, crash tests, …) it isn't possible to go through all possible parameters systematically
• Using (good) multi-dimensional random numbers can lead to better results with fewer values

59 Pseudo-random numbers
• For this type of application, pseudo-random numbers are even better than “real” random numbers
• How does a typical pseudo-random generator work?
  - It needs an initialization $z_0$
  - A random function computes the next random number from the last one: $z_n = Z(z_{n-1})$
• The requirements are similar to those for hash and collision resolution functions:
  - Uniform distribution of the random numbers
  - All random numbers (from a specific interval) should eventually appear once in the sequence

60 Example: Mid-square-generator
• Was implemented e.g., in Apple II
• $z_n = \mathrm{middle\_digits}(z_{n-1}^2)$
• Example: $z_0 = 42$ → 42 x 42 = 1764; 76 x 76 = 5776, etc.
• Sequence: 42, 76, 77, 92, 46, 11, 12, 14, 19, 36, 29, 84, 5, 2, 0, 0, 0, ...
• Many sequences either end in 0 or repeat in a short cycle (24, 57, 24, 57, ...)
• Very bad generator
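The two-digit mid-square generator from this example can be written in a few lines of Python. The square is treated as a four-digit number with leading zeros, and its two middle digits are kept; the function names are illustrative.

```python
from typing import List


def mid_square(z: int) -> int:
    """Square a two-digit number and keep the middle two of the four digits."""
    return (z * z // 10) % 100


def sequence(seed: int, length: int) -> List[int]:
    """Return the first `length` numbers of the sequence, starting with the seed."""
    out = [seed]
    while len(out) < length:
        out.append(mid_square(out[-1]))
    return out


print(sequence(42, 17))  # runs into 0 and stays there
print(sequence(24, 6))   # alternates between 24 and 57
```

Both weaknesses named on the slide show up directly: the seed 42 collapses to 0 after 14 steps, and the seed 24 is trapped in a cycle of length 2.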

61 Linear congruence-generator
• Better: linear congruence generator
• Appears familiar to us
• $z_n = (z_{n-1} \cdot a + b) \bmod m$
• Example: z n = (z n-1 * ) mod 40 … generates an optimal sequence …

62 Linear congruence-generator
• $z_n = (z_{n-1} \cdot a + b) \bmod m$
• The parameters $a$, $b$, $m$ determine the quality
• As in hashing: it is reasonably easy to define the minimal requirements for good quality, e.g., $a$ and $m$ coprime
• But: a uniform distribution in several dimensions is hard
• Example: 2, 7, 4, 9, 6, 1, 8, 3, 0, 5, …
  - One dimension: uniformly distributed
  - Two dimensions: (2, 7), (4, 9), (6, 1), (8, 3), (0, 5) are located on two “lines”, i.e. not uniformly distributed
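A small Python sketch of both points. The parameters `a = 5`, `b = 3`, `m = 16` are hypothetical values chosen for illustration (they are not the ones from the slides); the second part only analyses the example sequence quoted above and does not try to regenerate it.

```python
from typing import Dict


def lcg(z: int, a: int, b: int, m: int) -> int:
    """One step of the linear congruence generator."""
    return (z * a + b) % m


def period(seed: int, a: int, b: int, m: int) -> int:
    """Length of the cycle the generator eventually runs into."""
    seen: Dict[int, int] = {}
    z, step = seed, 0
    while z not in seen:
        seen[z] = step
        z = lcg(z, a, b, m)
        step += 1
    return step - seen[z]


# Hypothetical parameters: every value 0..15 appears once per cycle.
print("period:", period(0, 5, 3, 16))

# The example sequence from the slide, grouped into two-dimensional points.
slide_sequence = [2, 7, 4, 9, 6, 1, 8, 3, 0, 5]
pairs = list(zip(slide_sequence[0::2], slide_sequence[1::2]))
offsets = {(y - x) % 10 for x, y in pairs}
print("pairs:", pairs)
print("y - x (mod 10):", offsets)
```

Every pair satisfies y = x + 5 (mod 10), so in the plane the points fall on the two line segments y = x + 5 and y = x - 5 although each single coordinate looks uniformly distributed.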

63 Linear congruence-generator
• There is a separate research area in computer science and mathematics that focuses on finding good pseudo-random generators
• For numerical applications, pseudo-random numbers are often better than real random numbers
• For cryptography this no longer applies: there are plug-in cards that generate real random numbers based on quantum physics …

64 Thoughts: Hash / Random
• Often the computer appears to produce chaos
• The computer cannot really do this: looked at closely, it is always just another kind of order
• The “chaotic” arrangement of data in hash tables and pseudo-random generators are good examples of this

