Published by Trevion Reddell. Modified over 2 years ago.

1
Introduction to Computer Science 2
Hash Tables (2)
Prof. Neeraj Suri, Dan Dobre
Dependable Embedded Systems & SW Group, www.deeds.informatik.tu-darmstadt.de
© Neeraj Suri, EU-NSF ICT March 2006

2
ICS-II - 2008, Hash Tables (2)
Overview
So far: direct hashing; hash functions (folding, modulo, etc.); collision resolution (linear and quadratic probing).
What's next: collision resolution continued; cost analysis of hashing; hashing on external memory; extendible (dynamic) hashing; excursus on (pseudo-)random numbers and their application.

3
Double/repeated hashing
If a collision occurs, the key is hashed a second time using another hash function.
This can be generalized: whenever a collision occurs, the key is hashed again using the next hash function. If the collision persists after k hash functions have been used, another technique has to be applied.
Repeated hashing avoids the accumulation of collisions. However, delete remains complex, and reaching the entire memory space is problematic.
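The idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the three hash functions and the table size 7 are assumptions chosen only to provoke collisions, and the function simply gives up once all k functions have been tried.

```python
def insert_with_rehashing(table, key, hash_functions):
    """Try each hash function in turn; return (slot, attempts) on success."""
    for attempts, h in enumerate(hash_functions, start=1):
        slot = h(key) % len(table)
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot, attempts
    # All k hash functions collided: another technique would be needed here.
    raise RuntimeError(f"no free slot for key {key} after {len(hash_functions)} attempts")


# Hypothetical hash functions, picked for illustration only.
HASH_FUNCTIONS = [
    lambda k: k % 7,
    lambda k: k // 7,
    lambda k: 3 * k + 1,
]

table = [None] * 7
placements = [insert_with_rehashing(table, k, HASH_FUNCTIONS) for k in (11, 32, 8, 25)]
print(placements)  # (slot, attempts) per key
```

Key 32 collides under the first two functions and is only placed by the third, while 25 needs two attempts. Note that nothing guarantees the k functions together can reach every slot, which is the accessibility problem the slide mentions.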

4
Chaining of synonyms in the same HT
Members of a collision class are chained, so each memory slot in the HT needs an additional pointer.
Because there is no separate overflow area, collisions continue to occur when a slot is occupied by a foreign key. Chaining does not prevent collisions, but it speeds up the search.
Delete becomes considerably easier, because only one pointer has to be reset.
Insert requires following the pointer chain until a free place is found. If the home address is occupied by another key that does not belong there, that key is moved.

5
Chaining: Example
$h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
Insert: 11, 32, 8, 25

Position   0    1    2    3    4    5    6
Key        -    8    -    -    11   32   25
Pointer    -    -    -    -    5    6    -

6
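Chaining: Example
$h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
Now insert 12. Its home address 5 is occupied by 32, which does not belong there.
Move 32: search left for the pointer to it, then move 32 further to position 0.

Position   0    1    2    3    4    5    6
Key        -    8    -    -    11   32   25
Pointer    -    -    -    -    5    6    -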

7
Chaining: Example
$h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
Now insert 12 in its home address (5).

Position   0    1    2    3    4    5    6
Key        32   8    -    -    11   -    25
Pointer    6    -    -    -    0    -    -

8
Chaining: Example
$h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
Delete 11: follow the chain until 25 is reached (4-0-6), move 25 to its home address 4, and delete the pointer "6" in address 0.

Position   0    1    2    3    4    5    6
Key        32   8    -    -    11   12   25
Pointer    6    -    -    -    0    -    -

9
Chaining: Example
$h_0(K) = K \bmod 7$; $h_i(K) = (h_0(K) + i) \bmod m$
The collision chain leading to 32 is now broken (address 6 is empty). This is not a problem, since pointers are used for chaining.

Position   0    1    2    3    4    5    6
Key        32   8    -    -    25   12   -
Pointer    -    -    -    -    0    -    -

10
Chaining with separate overflow
All records that cannot be stored at their own home address are transferred to an overflow area. The overflow area can be organized in two ways.
A single overflow area for all synonyms with only one entry point: simple, and avoids pointers in the hash table; but synonym chains may become long, so it is only suitable when collisions are infrequent.
A single overflow area with more than one entry point: efficient, since only the members of one collision class are searched; requires a pointer for each entry in the hash table. The reference to the synonym chain can also be implemented using double hashing; in the case of collisions, the synonyms (mostly few) of 2 collision classes are then affected.

11
Chaining with separate overflow
The separate overflow area can be assigned dynamically.
The HT can be restricted to the keys in the home address, or all data can be stored in the dynamic overflow area.
Since pointers can refer to any address, this corresponds to a partition of the overflow area.
Chaining of synonyms is a preferred method.

Position   Key         Pointer (overflow chain)
0          HAYDN       -> HAENDEL -> VIVALDI
1          BEETHOVEN   -> BACH -> BRAHMS
2          CORELLI
3
4          SCHUBERT    -> LISZT
5          MOZART
6
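A minimal Python sketch of chaining with a dynamically growing overflow, using one list per home address. The hash function is an assumption: the slide does not state it, but "position of the first letter in the alphabet (A = 0) mod 7" happens to reproduce the composer table, so it is used here for illustration.

```python
M = 7  # number of home addresses


def h(name):
    # Assumed hash function (not given on the slide): first letter, A = 0, mod 7.
    return (ord(name[0]) - ord("A")) % M


def insert(table, name):
    """Append the key to the synonym chain of its home address."""
    chain = table[h(name)]
    if name not in chain:
        chain.append(name)


def search(table, name):
    """Return the number of steps needed to find the key, or None if absent."""
    chain = table[h(name)]
    return chain.index(name) + 1 if name in chain else None


table = [[] for _ in range(M)]
for composer in ("HAYDN", "BEETHOVEN", "CORELLI", "SCHUBERT", "MOZART",
                 "HAENDEL", "BACH", "LISZT", "VIVALDI", "BRAHMS"):
    insert(table, composer)

for position, chain in enumerate(table):
    print(position, " -> ".join(chain))
```

Only members of one collision class are ever compared during a search, which is why the chain length, not the table occupancy, determines the cost.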

12
Hashing: analysis of the costs
Cost measure: number of steps (addressing attempts).
Assumption: all evaluations of $h(K_p)$ and all search steps take the same time.
The hash table is allocated with n keys.
Search costs $S_n$ = delete costs without rearrangement.
Insert costs = costs of an unsuccessful search $U_n$.
Delete costs = $S_n$ + rearrangement costs $R_n$.
Costs can be expressed as a function of the allocation factor $\beta = n/m$.

13
Hashing: analysis of the costs (extreme cases)
Worst case: $S_n = n$, $U_n = n + 1$. There is one collision class; access is as in a linear list.
Best case: $S_n = 1$, $U_n = 1$. There are no collisions.

14
Hashing: analysis of the costs (average cases)
The average case depends on the overflow handling.
Assumption: $h(K_p)$ distributes the keys uniformly, i.e., the probability that a key has hash value $i$, $0 \le i \le m-1$, is $1/m$.

15
Costs using linear probing
Example: $h_i(k) = (h_0(k) + i) \bmod m$
With a low allocation of the HT there is no problem; with a higher allocation, performance degrades drastically.
The probability p that address 7 will be allocated next is 1/m, because 6 is free.
The probability that 14 will be allocated is 5/m (the p for 14 as home address plus the sum of the p for 10, 11, 12, 13, each of which produces an overflow onto 14).
Long chains grow longer, and chains can grow together (insert in 3 or 14).
[Figure: hash table with addresses 0 to 16]
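The probability argument can be checked with a short Python sketch. The occupancy pattern below is hypothetical (the figure did not survive extraction); it is chosen so that addresses 10 to 13 are occupied, 6 is free, and inserts at 3 or 14 would join two chains, as the slide describes.

```python
from fractions import Fraction


def next_slot_probabilities(occupied, m):
    """For each free slot, the probability that the next key ends up there
    under linear probing with uniformly distributed home addresses."""
    probs = {}
    for slot in range(m):
        if slot in occupied:
            continue
        # Every occupied slot in the run directly before `slot` overflows into it.
        run = 0
        while (slot - run - 1) % m in occupied:
            run += 1
        probs[slot] = Fraction(run + 1, m)
    return probs


M = 17
OCCUPIED = {1, 2, 4, 5, 10, 11, 12, 13, 15, 16}  # assumed occupancy, for illustration
probs = next_slot_probabilities(OCCUPIED, M)
for slot, p in sorted(probs.items()):
    print(slot, p)
```

The free slot behind the longest run is the most likely target, so long runs tend to get longer: this is the clustering effect behind the degradation at high allocation.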

16
Costs using linear probing
According to Knuth:
$S_n = 0.5\,\bigl(1 + 1/(1-\beta)\bigr)$ with $0 \le \beta = n/m < 1$
$U_n = 0.5\,\bigl(1 + 1/(1-\beta)^2\bigr)$
The number of search steps increases drastically with a higher allocation factor.
[Figure: $S_n$ and $U_n$ in steps (1 to 8) plotted over the allocation factor (0.1 to 0.9)]
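To get a feeling for the numbers, the two formulas can be tabulated directly. This is a small sketch; the function name and the variable `beta` for the allocation factor n/m are chosen here for illustration.

```python
def linear_probing_costs(beta):
    """Expected steps for successful (S) and unsuccessful (U) search
    with linear probing at allocation factor beta = n/m."""
    if not 0 <= beta < 1:
        raise ValueError("allocation factor must satisfy 0 <= beta < 1")
    s = 0.5 * (1 + 1 / (1 - beta))
    u = 0.5 * (1 + 1 / (1 - beta) ** 2)
    return s, u


print("beta     S_n     U_n")
for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
    s, u = linear_probing_costs(beta)
    print(f"{beta:4.1f}  {s:6.2f}  {u:6.2f}")
```

At half allocation a successful search needs 1.5 steps on average; at 90% allocation an unsuccessful search already needs about 50.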

17
Costs using optimal collision resolution
With optimal methods for collision resolution (e.g., rehashing, pseudo-random numbers), a uniform distribution can be approximately assumed despite collisions.
The probability that a place is occupied or free depends on the number of already allocated places (n) and on the number still available (m-n), e.g., $P_{\text{free}} = (m-n)/m$.
See the lecture notes for details of the derivation.

18
Costs using optimal collision resolution (2)
Approximately:
$S_n \approx \left|\frac{1}{\beta}\ln(1-\beta)\right|$ with $0 \le \beta = n/m < 1$
$U_n \approx 1/(1-\beta)$
The number of search steps can improve drastically when allocation after collision resolution is independent.
[Figure: $S_n$ and $U_n$ in steps (1 to 8) plotted over the allocation factor (0.1 to 0.9)]
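The approximations are easy to evaluate. The sketch below uses `beta` for the allocation factor n/m and a function name invented for this example.

```python
import math


def optimal_resolution_costs(beta):
    """Approximate steps for successful (S) and unsuccessful (U) search
    when probing is independent of earlier collisions; beta = n/m."""
    if not 0 < beta < 1:
        raise ValueError("allocation factor must satisfy 0 < beta < 1")
    s = abs(math.log(1 - beta) / beta)
    u = 1 / (1 - beta)
    return s, u


print("beta     S_n     U_n")
for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
    s, u = optimal_resolution_costs(beta)
    print(f"{beta:4.1f}  {s:6.2f}  {u:6.2f}")
```

At 90% allocation an unsuccessful search takes about 10 steps here, compared with about 50 under the linear probing formula of the previous slide.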

19
Costs using separate overflow
Assumption: the keys are distributed uniformly over all chains, so there are n/m keys per chain; furthermore linear chaining is used. (Q: how big is $S_n$?)
When key i is inserted in the HT, i-1 keys are already in the table, i.e., (i-1)/m keys in each chain.
The costs to find a free place are 1 step for the home address plus (i-1)/m steps to reach the end of the chain (one must first check whether the key already exists in the table).
Averaged over all n keys:
$S_n = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right) = 1 + \frac{n-1}{2m} \approx 1 + \frac{\beta}{2}$
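The intermediate steps of the average, written out. The only ingredient is the sum of the first n-1 integers; $\beta = n/m$ denotes the allocation factor, and the last step assumes n is large enough that (n-1)/m is close to n/m.

```latex
\begin{align*}
S_n &= \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)
     = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1) \\
    &= 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2}
     = 1 + \frac{n-1}{2m}
     \approx 1 + \frac{\beta}{2}, \qquad \beta = \frac{n}{m}.
\end{align*}
```

For example, a table with as many keys as home addresses ($\beta = 1$) gives $S_n \approx 1.5$.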

20
Costs using separate overflow
For a successful search, on average half of the chain is traversed.
For an unsuccessful search, the entire chain has to be traversed.
Chaining is superior to the other methods: efficiency stays good even with high overflow ($\beta > 1$).

$\beta$   0.5    0.75   1      1.5    2      3      4      5
$S_n$     1.25   1.37   1.5    1.75   2      2.5    3      3.5
$U_n$     1.11   1.22   1.37   1.72   2.14   3.05   4.02   5.01

21
Hashing on external memory (b > 1)
With a bucket factor b > 1, b records can be stored at one address.
This is suitable for both main and external memory, and particularly attractive for external memory.
On a collision, the new record is simply stored in the same bucket. The bucket overflows only with the (b+1)-th entry.
On overflow, the known methods for collision resolution can be applied: overflow in the primary area, or a separate overflow area.

22
Hashing on external memory
Overflow buckets can be assigned dynamically and linked via an overflow address.
An overflow bucket can serve as overflow area for several home addresses.
Recommended: one chain per collision class.
With b > 1 the allocation factor is $\beta = n/(b \cdot m)$.
Order in which records are stored in a bucket: by insertion order (sequential) or by sort order (linked list).

23
Hashing on external memory
Typical bucket size: sector, track, or page; in general, the transfer unit (1 I/O per bucket).
As with B-trees, I/O dominates (approx. 6-10 ms), so a more complex hash function is justified.
The relative search costs inside one bucket are low.
Insert always goes to the first free space in the chain.
On deletion there is no need to close gaps (or only inside a page).
Empty overflow buckets are removed from the chain.

24
Example: b = 2
$h(k) = k \bmod 7$
Insert: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27
[Figure: empty hash table with addresses 0 to 6]

25
Example: b = 2 (2)
Now: delete 25

Address             0    1    2    3    4        5    6
Primary bucket      21   8    2    -    11       -    13
                    -    15   -    -    32       -    20
Overflow bucket 1                       25, 18        27
Overflow bucket 2                       4

26
Example: b = 2 (3)
Gaps in the chains are not closed! Records are rearranged only inside a page, if needed.

Address             0    1    2    3    4        5    6
Primary bucket      21   8    2    -    11       -    13
                    -    15   -    -    32       -    20
Overflow bucket 1                       18            27
Overflow bucket 2                       4
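The example can be replayed with a small Python sketch. The representation is an assumption made for illustration: each home address holds a chain of buckets, the first being the primary bucket and the rest dynamically appended overflow buckets.

```python
B = 2  # bucket factor: records per bucket
M = 7  # number of home addresses


def make_table():
    # One chain per address; each chain starts with an empty primary bucket.
    return [[[]] for _ in range(M)]


def insert(table, key):
    """Store the key in the first bucket of its chain that has free space."""
    chain = table[key % M]
    for bucket in chain:
        if len(bucket) < B:
            bucket.append(key)
            return
    chain.append([key])  # all buckets full: append a new overflow bucket


def delete(table, key):
    """Remove the key; gaps are closed only inside its own bucket."""
    chain = table[key % M]
    for i, bucket in enumerate(chain):
        if key in bucket:
            bucket.remove(key)
            if not bucket and i > 0:
                del chain[i]  # empty overflow buckets leave the chain
            return True
    return False


table = make_table()
for k in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27):
    insert(table, k)
before = [list(map(list, chain)) for chain in table]  # snapshot before deleting
delete(table, 25)
print(table[4])
```

After deleting 25, the key 4 stays in the second overflow bucket although the first one now has a free slot: no records are moved between buckets.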

27
Summary: Hashing on external memory
Primary buckets always remain assigned because of relative addressing.
Overflow buckets are assigned dynamically (appended); empty overflow buckets are deleted.
If the file shrinks strongly, buckets may end up sparsely filled (remedy: reorganization of the file, e.g., by rehashing all entries stored in the hash table).

28
Approximate values for hashing
Selected values for $S_n(b)$ and $U_n(b)$ as a function of b and β.
Rule of thumb: b is typically determined by the data transfer unit; select β such that $S \approx 1.05$ to $1.08$ holds.

29
Hashing vs. B+-tree
The access costs of a well-designed hash method are better than those of a B+-tree (1.05 vs. path length).
Disadvantages:
The keys are not sorted (sequential output in key order is considerably more expensive).
Hashing is static: it is not extendable, and long chains lead to degeneration.
Even with a small number of keys it consumes the complete designated memory space (this can also be an advantage: the required memory space is largely defined from the beginning).

30
Extendible hashing
Disadvantages of static hash methods when the volume of data grows strongly:
The primary area must be dimensioned generously from the beginning (poor initial allocation).
If the capacity of the primary area is exceeded, the overflow chains grow fast and the run-time behavior degrades.
Reorganization requires unloading the entire volume of data and loading it again, which interrupts operation (often not possible, e.g., with 24x7 operation).

31
Extendible hashing
Therefore we need a hash method that:
permits dynamic growing and shrinking of the hash area;
guarantees constant run-time behavior independent of the size of the data;
requires no more than 2 page accesses to find a record;
avoids overflow mechanisms and total reorganization;
guarantees a high allocation of the memory independent of the growth of the key set.

32
Extendible hashing
We must avoid overflow buckets.
We want stability and are ready to pay for it, i.e., always 2 accesses.
Available (known to us) techniques: balancing as in B-trees (constant path length), and addressing via coding of the key as in digital trees.
Extendible hashing uses these techniques to guarantee stable access with exactly 2 I/O operations.

33
Extendible hashing
The hash function transforms keys into binary strings (coding).
Only the first n bits are used, as many as necessary (addressing as in a digital tree).
There is an additional indirection via a container board.
With few keys, few bits are sufficient; with many keys, additional bits are used.
Containers are added or removed as necessary (balancing).
The container board is "doubled" if necessary: this costs memory space, but requires no intensive computation.

34
Example: Extendible hashing
Insertion sequence: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27

Key   Code      Key   Code
11    001011    2     000010
32    100000    18    010010
8     001000    13    001101
25    011001    20    010100
21    010101    4     000100
15    001111    27    011011
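In this example the code of a key is simply its 6-bit binary representation. A short Python sketch (the helper names are invented here) reproduces the codes and shows how a prefix of a given length is taken as the address:

```python
BITS = 6  # code length used in the example


def code(key):
    """6-bit binary code of the key, most significant bit first."""
    if not 0 <= key < 2 ** BITS:
        raise ValueError(f"key {key} does not fit into {BITS} bits")
    return format(key, f"0{BITS}b")


def prefix(key, depth):
    """The first `depth` bits of the code, used as the address."""
    return code(key)[:depth]


SEQUENCE = (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27)
for k in SEQUENCE:
    print(f"{k:2d}  {code(k)}")
```

Keys 11, 8, and 15 all start with 001, so three prefix bits cannot separate them; a fourth bit is needed to tell 15 (0011) from 11 and 8 (0010).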

35
Extendible hashing, b = 2
Initial situation: the container board contains only a reference to an empty container.
Insert 11 (001011) and 32 (100000): works without problems.

36
Extendible hashing, b = 2
Next key: 8 (001000). It doesn't fit anymore.
Thus the capacity is doubled by duplicating the container board (still no extra containers!).

37
Extendible hashing, b = 2
Blue numbers: implicit through the addresses of the container board.
Now the next key, 8 (001000), fits because the container is partitioned.

38
Extendible hashing, b = 2
Next key: 25 (011001). It doesn't fit in the first container, and no other address is available (for partitioning the container), so the container board has to be doubled.

39
Extendible hashing, b = 2
Again: doubling the container board does not create an extra container.
Next key (still): 25 (011001).

40
Extendible hashing, b = 2
Additional container.

41
Extendible hashing, b = 2
Next key: 21 (010101). No problems.

42
Extendible hashing, b = 2
Next key: 15 (001111). Simple doubling of the container board.

43
Extendible hashing, b = 2
Next key: 15 (001111). Still not possible. Doubling again.

44
Extendible hashing, b = 2
Next key: 15 (001111). Now the selectivity is sufficient: the container is doubled.

45
Extendible hashing, b = 2
Next keys (straightforward): 2 (000010), 18 (010010), 13 (001101), 20 (010100), 4 (000100), 27 (011011).

46
Extendible hashing, b = 2
Finish.

47
Extendible hashing
The prefix of the key does not always have to be used; one can also use the postfix.
For keys that are not uniformly distributed, an internal hash function can be used to produce the bit string used in extendible hashing.

48
Summary, extendible hashing
A key fragment of n bits is used for direct hashing into the container board.
Containers have a bucket factor b > 1 (typically b > 20).
Search:
Look up the container address in the container board.
Search in the container (e.g., binary search).

49
Summary, extendible hashing
Insert:
Look up the container address in the container board and search in the container.
If the key is found: good, no further action.
If it is not found and there is a free slot in the container: insert.
If there is no free slot: double the container board until the key fragment is selective enough to establish more containers (note: sometimes the container board doesn't need to be doubled). Then add new containers and, if needed, redistribute the keys from the old container among the new containers.

50
Summary, extendible hashing
Delete:
Look up the container address in the container board and search in the container.
If the key is found: delete it.
If the container is then empty: delete the container and set its pointer in the container board to the neighbor container.
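The search and insert procedures can be sketched compactly in Python. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the class and attribute names are invented, keys are 6-bit integers addressed by their leading bits, each container remembers how many prefix bits it covers, and delete is omitted.

```python
class Container:
    def __init__(self, depth):
        self.depth = depth  # number of prefix bits this container covers
        self.keys = []


class ExtendibleHashTable:
    def __init__(self, bits=6, capacity=2):
        self.bits = bits            # length of the key code
        self.capacity = capacity    # bucket factor b
        self.depth = 0              # prefix bits used by the container board
        self.board = [Container(0)]

    def _address(self, key):
        return key >> (self.bits - self.depth)  # leading `depth` bits

    def search(self, key):
        return key in self.board[self._address(key)].keys

    def insert(self, key):
        if not 0 <= key < 2 ** self.bits:
            raise ValueError("key does not fit into the code length")
        while True:
            container = self.board[self._address(key)]
            if key in container.keys:
                return
            if len(container.keys) < self.capacity:
                container.keys.append(key)
                return
            if container.depth == self.bits:
                raise OverflowError("all bits used; container cannot be split")
            if container.depth == self.depth:
                # Double the board: every pointer now fills two successive addresses.
                self.board = [c for c in self.board for _ in range(2)]
                self.depth += 1
            self._split(container)

    def _split(self, container):
        container.depth += 1
        sibling = Container(container.depth)
        shift = self.depth - container.depth
        for i, c in enumerate(self.board):
            if c is container and (i >> shift) & 1:
                self.board[i] = sibling  # addresses whose new bit is 1
        old, container.keys = container.keys, []
        for k in old:
            self.board[self._address(k)].keys.append(k)


table = ExtendibleHashTable()
for k in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27):
    table.insert(k)
print(table.depth, [c.keys for c in table.board])
```

Replaying the slide sequence ends with a board of 16 addresses (4 prefix bits) pointing to 7 distinct containers. Inserting 15 doubles the board twice, whereas inserting 18 and 20 only splits a container, matching the note that the board does not always need to be doubled.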

51
Extendible hashing
In principle, extendible hashing is very similar to direct hashing using the first bits of the key ($h(k) = k / 2^x$).
BUT: with direct hashing, doubling the table when an overflow occurs is much more expensive. With extendible hashing, each pointer only has to be set for two successive addresses; with direct hashing, each address has to be split.

52
Example
[Figure: extendible hashing vs. direct hashing]
(There is no container board in direct hashing, but we added it here for the sake of understanding.)

53
Analysis, extendible hashing
Search has a constant cost: two I/O operations.
Delete may additionally involve deleting the container, but the cost is still constant.
Insert "usually" needs at most 5 operations (search, write to the container, if needed write to other containers, write to the container board).
BUT IN ADDITION: if needed, reorganization of the container board (duplicate all pointers).

54
Analysis, extendible hashing
Doubling of the container board takes place mainly in main memory, so its cost is low in comparison to I/O operations.
A very successful and widely used method.

55
Excursus: Pseudo-random numbers
A topic closely related to hashing.
Why "pseudo"-random numbers? The computer is a "good" computational servant: algorithms are always executed reliably in the same way. Consequence: generating random numbers is not a strength of computers!
Applications: games, simulation, generating keys for cryptography, but especially also numerical solutions of problems.

56
Example of an application
Computation of Pi: the area of the unit circle is Pi.
Compute the area of a quarter of the circle (Pi/4) numerically and then multiply by 4.

57
Compute Pi
Counting: 36 x 36 = 1296 small boxes.
Or roll the dice!
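Both approaches fit in a few lines of Python. This is a sketch under assumptions: the counting variant tests the center of each small box (the slide does not say how boxes on the boundary are counted), and "rolling the dice" is modeled with Python's `random` module using a fixed seed for reproducibility.

```python
import random


def pi_by_counting(n=36):
    """Count the boxes of an n x n grid whose center lies inside the quarter circle."""
    inside = sum(
        1
        for i in range(n)
        for j in range(n)
        if ((i + 0.5) / n) ** 2 + ((j + 0.5) / n) ** 2 <= 1
    )
    return 4 * inside / (n * n)


def pi_by_dice(samples, seed=0):
    """Throw random points into the unit square and count hits in the quarter circle."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples) if rng.random() ** 2 + rng.random() ** 2 <= 1
    )
    return 4 * inside / samples


print(pi_by_counting())
print(pi_by_dice(100_000))
```

In two dimensions systematic counting is perfectly adequate; the random variant becomes interesting when the number of dimensions makes a full grid infeasible.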

58
Compute Pi
Particularly for high-dimensional computations (e.g., physical systems with many degrees of freedom, physics simulations, crash tests, ...) it isn't possible to go through all possible parameters systematically.
Using (good) multi-dimensional random numbers can lead to better results with fewer values.

59
Pseudo-random numbers
For this type of application, pseudo-random numbers are even better than "real" random numbers.
How does a typical pseudo-random generator work?
It needs an initialization $z_0$.
A random function computes the next random number from the last one: $z_n = Z(z_{n-1})$.
The requirements are similar to those for hash and collision resolution functions: uniform distribution of the random numbers, and all random numbers (from a specific interval) should eventually appear once in the sequence.

60
Example: Mid-square generator
Was implemented, e.g., in the Apple II.
$z_n = \text{middle\_digits}(z_{n-1}^2)$
Example: $z_0 = 42$; 42 x 42 = 1764, so the next number is 76; 76 x 76 = 5776, so the next is 77; etc.
Sequence: 42, 76, 77, 92, 46, 11, 12, 14, 19, 36, 29, 84, 5, 2, 0, 0, 0, ...
Many sequences either end in 0 or repeat in a short cycle (24, 57, 24, 57, ...).
A very bad generator.
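A small Python sketch reproduces the sequence. It assumes two-digit numbers whose square is padded to four digits, so that the "middle digits" are the hundreds and tens digits; the function names are invented for this example.

```python
def mid_square(z):
    """Next number: the two middle digits of the four-digit square of z."""
    return (z * z // 10) % 100


def sequence(z0, length):
    """The first `length` numbers of the mid-square sequence starting at z0."""
    numbers = [z0]
    while len(numbers) < length:
        numbers.append(mid_square(numbers[-1]))
    return numbers


print(sequence(42, 17))
print(sequence(24, 6))
```

Starting from 42 the generator collapses to 0 after 14 steps, and starting from 24 it is trapped in a cycle of length 2.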

61
Linear congruential generator
Better: the linear congruential generator. It looks familiar to us:
$z_n = (z_{n-1} \cdot a + b) \bmod m$
Example: $z_n = (z_{n-1} \cdot 21 + 17) \bmod 40$ generates an optimal sequence, in which all 40 values appear before it repeats:
1 - 38 - 15 - 12 - 29 - 26 - 3 - 0 - 17 - 14 - 31 - 28 - 5 - 2 - 19 - 16 - 33 - 30 - 7 - 4 - 21 - 18 - 35 - 32 - 9 - 6 - 23 - 20 - 37 - 34 - 11 - 8 - 25 - 22 - 39 - 36 - 13 - 10 - 27 - 24 - 1
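The example generator is easy to check in Python. The sketch below (function names invented here) generates the sequence and measures the cycle length, so that "optimal" can be verified as "period equals m".

```python
from itertools import islice


def lcg(seed, a, b, m):
    """Yield z_0, z_1, ... with z_n = (z_{n-1} * a + b) mod m."""
    z = seed
    while True:
        yield z
        z = (z * a + b) % m


def take(generator, n):
    """The first n values of a generator as a list."""
    return list(islice(generator, n))


def period(seed, a, b, m):
    """Length of the cycle the generator eventually runs into."""
    seen = {}
    for step, z in enumerate(lcg(seed, a, b, m)):
        if z in seen:
            return step - seen[z]
        seen[z] = step


print(take(lcg(1, 21, 17, 40), 10))
print(period(1, 21, 17, 40))
```

For comparison, `period(1, 20, 17, 40)` shows how a poor choice of the multiplier shortens the cycle drastically.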

62
Linear congruential generator
$z_n = (z_{n-1} \cdot a + b) \bmod m$
The parameters a, b, m determine the quality.
As with hashing, it is reasonably easy to define the minimal requirements for good quality, e.g., a and m coprime.
But: uniform distribution in multiple dimensions is hard.
Example: 2, 7, 4, 9, 6, 1, 8, 3, 0, 5, ...
One dimension: uniformly distributed.
Two dimensions: (2, 7), (4, 9), (6, 1), (8, 3), (0, 5) lie on two "lines" and are not uniformly distributed.

63
Linear congruential generator
Finding good pseudo-random generators is a separate research area in computer science and mathematics.
For numerical applications, pseudo-random numbers are often better than real random numbers.
For cryptography this no longer applies: there are plug-in cards that generate real random numbers based on quantum physics ...

64
Thoughts: Hash / Random
Often the computer produces apparent chaos.
The computer cannot really do this: if you look closely, it is always just another kind of order.
The "chaotic" arrangement of data in hash tables and pseudo-random generators are good examples of this.
