1 HashTable

2 Dictionary
A collection of data that is accessed by “key” values
– The keys may be ordered or unordered
– Duplicate keys may or may not be allowed
Supports the following fundamental methods:
– void put(Object key, Object data): inserts data into the dictionary under the specified key
– Object get(Object key): returns the data associated with the specified key; an error occurs if the specified key is not in the dictionary
– Object remove(Object key): removes and returns the data associated with the specified key; an error occurs if the specified key is not in the dictionary
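A minimal sketch of this abstraction in Java (the interface name is illustrative, not a standard library type):

// A hypothetical interface capturing the three fundamental dictionary methods.
public interface SimpleDictionary {
    // Inserts data into the dictionary under the specified key.
    void put(Object key, Object data);

    // Returns the data associated with the specified key; an error (here, an
    // unchecked exception) occurs if the key is not present.
    Object get(Object key);

    // Removes and returns the data associated with the specified key; an error
    // occurs if the key is not present.
    Object remove(Object key);
}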

3 Abstract Dictionary Example

Operation    Output    Dictionary
put(5, A)    None      ((5,A))
put(7, B)    None      ((5,A), (7,B))
put(2, C)    None      ((5,A), (7,B), (2,C))
get(A)       Error     ((5,A), (7,B), (2,C))
get(7)       B         ((5,A), (7,B), (2,C))
put(2, Q)    None      ((5,A), (7,B), (2,C), (2,Q))
get(2)       C or Q    ((5,A), (7,B), (2,C), (2,Q))
remove(Q)    Error     ((5,A), (7,B), (2,C), (2,Q))
remove(2)    C or Q    ((5,A), (7,B), (2,C)) or ((5,A), (7,B), (2,Q))

4 What is a Hashtable?
A hashtable is an unordered dictionary that uses an array to store data
– Each data element is associated with a key
– Each key is mapped into an array index using a hash function
– The key AND the data are then stored in the array
Hashtables are commonly used in the construction of compiler symbol tables.

5 Dictionaries: AVL Trees vs. Hashtables

Method    AVL (worst and average)    Hashtable (worst)    Hashtable (average)
put       O(log N)                   O(N)                 O(1)
get       O(log N)                   O(N)                 O(1)
remove    O(log N)                   O(N)                 O(1)

AVL: not bad. Hashtable average case: astounding!

6 Simple Example
Insert data into the hashtable using characters as keys
The hashtable is an array of “items”
The hashtable’s capacity is 7
The hash function must take a character as input and convert it into a number between 0 and 6.
Use the following hash function: let P be the position of the character in the English alphabet (starting with 1); then h(K) = P.
The function must be normalized in order to map into the appropriate range (0-6); the normalized hash function is h(K) % 7.
[Figure: an empty table with slots 0 through 6]
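As a rough Java sketch (assuming uppercase keys 'A' through 'Z'; the method name is illustrative):

// Map an uppercase letter to a slot index in a table of capacity 7.
static int hash(char key) {
    int p = key - 'A' + 1;   // P: position in the English alphabet, starting at 1
    return p % 7;            // normalize into the range 0..6
}
// For example, hash('B') == 2, hash('S') == 5 (19 mod 7), and hash('J') == 3 (10 mod 7).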

7 Example
put(B, Data1), put(S, Data2), put(J, Data3), put(N, Data4), put(X, Data5), put(W, Data6), put(B, Data7), get(X), get(W)
(alphabet positions: B=2, S=19, J=10, N=14, X=24, W=23)
[Figure: the first few entries are placed in their hashed slots; then a put maps to a slot that is already occupied, marked ???]
This is called a collision. Collisions are handled via a “collision resolution policy”.

8 From Keys to Indices
The mapping of keys to indices of a hash table is called a hash function
A hash function is usually the composition of two maps: a hash code map and a compression map.
– An essential requirement of the hash function is to map equal keys to equal indices
– A “good” hash function minimizes the probability of collisions

9 Popular Hash-Code Maps
Integer cast: for numeric types with 32 bits or less, we can reinterpret the bits of the number as an int
Component sum: for numeric types with more than 32 bits (e.g., long and double), we can add the 32-bit components
Polynomial accumulation: for strings of a natural language, combine the character values (ASCII or Unicode) a0, a1, ..., a(n-1) by viewing them as the coefficients of a polynomial: a0 + a1 x + ... + a(n-1) x^(n-1)
– The polynomial is computed with Horner’s rule, ignoring overflows, at a fixed value x: a0 + x (a1 + x (a2 + ... x (a(n-2) + x a(n-1)) ... ))
– The choice x = 33, 37, 39, or 41 gives at most 6 collisions on a vocabulary of 50,000 English words
Why is the component-sum hash code bad for strings?
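A small Java sketch of polynomial accumulation evaluated with Horner’s rule, using x = 33 (one of the constants mentioned above); overflow is deliberately ignored:

// Hash code a0 + a1*x + ... + a(n-1)*x^(n-1), where a_i are the character values.
static int polynomialHash(String s) {
    final int x = 33;
    int h = 0;
    // Horner's rule: a0 + x*(a1 + x*(a2 + ... + x*a(n-1)))
    for (int i = s.length() - 1; i >= 0; i--) {
        h = s.charAt(i) + x * h;
    }
    return h;
}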

10 Popular Compression Maps
Division: h(k) = |k| mod N
– The choice N = 2^k is bad because not all the bits are taken into account
– The table size N is usually chosen as a prime number
– Certain patterns in the hash codes are propagated
Multiply, Add, and Divide (MAD): h(k) = |a·k + b| mod N
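A hedged Java sketch of the MAD compression map; the constants a and b here are arbitrary illustrative choices, not prescribed values:

// Compress an arbitrary hash code into the range 0..N-1 using |a*k + b| mod N.
static int compress(int hashCode, int N) {
    final long a = 1_000_003L;   // illustrative constants; a should not be a multiple of N
    final long b = 12_345L;
    return (int) (Math.abs(a * hashCode + b) % N);
}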

11 Details and Definitions
Load factor: the number of items stored in the table divided by the capacity of the table
Various means of “collision resolution” can be used. The collision resolution policy determines what is done when two keys map to the same array index.
– Open Addressing: look for an open slot
– Separate Chaining: keep a list of key/value pairs in each slot

12 Example
put(B, Data1), put(S, Data2), put(J, Data3), put(N, Data4), put(X, Data5), put(W, Data6), get(X), get(W)
(alphabet positions: B=2, S=19, J=10, N=14, X=24, W=23)
Open Addressing: when a collision occurs, probe for an empty slot. In this case, use linear probing (looking “down”) until an empty slot is found.
[Figure: the table after all the puts. X collides with J (both hash to slot 3) and is placed in slot 4; W collides with B (slot 2) and probes down to the empty slot 6. get(X) and get(W) follow the same probe sequences to find Data5 and Data6.]

13 Open Addressing
Uses a “probe sequence” to look for an empty slot to use
The first location examined is the “hash” address
The sequence of locations examined when locating data is called the “probe sequence”
The probe sequence {s(0), s(1), s(2), ...} can be described as follows: s(i) = norm(h(K) + p(i))
– where h(K) is the “hash function” mapping K to an integer
– p(i) is a “probing function” returning an offset for the i-th probe
– norm is the “normalizing function” (usually the remainder modulo the table capacity)

14 Open Addressing
Linear probing
– Use p(i) = i
– The probe sequence becomes {norm(h(k)), norm(h(k)+1), norm(h(k)+2), ...}
Quadratic probing
– Use p(i) = i^2
– The probe sequence becomes {norm(h(k)), norm(h(k)+1), norm(h(k)+4), ...}
– Must be careful to allow full coverage of “empty” array slots
– A theorem states that this method will find an empty slot if the table is not more than ½ full.

15 Linear Probing
If the current location is used, try the next table location:

linear_probing_insert(K)
    if (table is full) error
    probe = h(K)
    while (table[probe] occupied)
        probe = (probe + 1) mod M
    table[probe] = K

Lookups walk along the table until the key or an empty slot is found
Uses less memory than chaining (don’t have to store all those links)
Slower than chaining (may have to walk along the table for a long way)
Deletion is more complex (either mark the deleted slot or fill in the slot by shifting some elements down)
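A minimal Java sketch of an integer-keyed table using linear probing (no resizing or removal; class and method names are illustrative):

// Open-addressing table with linear probing, storing keys only.
public class LinearProbingTable {
    private final Integer[] table;

    public LinearProbingTable(int capacity) {
        table = new Integer[capacity];
    }

    private int hash(int key) {
        return Math.floorMod(key, table.length);   // normalize into 0..capacity-1
    }

    public void insert(int key) {
        int probe = hash(key);
        int start = probe;
        while (table[probe] != null) {             // slot occupied: try the next one
            probe = (probe + 1) % table.length;
            if (probe == start) throw new IllegalStateException("table is full");
        }
        table[probe] = key;
    }

    public boolean contains(int key) {
        int probe = hash(key);
        int start = probe;
        while (table[probe] != null) {             // stop at the first truly empty slot
            if (table[probe] == key) return true;
            probe = (probe + 1) % table.length;
            if (probe == start) break;             // walked the whole table
        }
        return false;
    }
}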

16 Linear Probing Example
h(k) = k mod 13
Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
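Assuming the key list reads 18, 41, 22, 44, 59, 32, 31, 73, a short driver using the sketch from the previous slide:

// Driver for the slide's example: capacity 13, h(k) = k mod 13, linear probing.
public static void main(String[] args) {
    LinearProbingTable t = new LinearProbingTable(13);
    int[] keys = {18, 41, 22, 44, 59, 32, 31, 73};
    for (int k : keys) {
        t.insert(k);   // e.g., 44 hashes to 5, already taken by 18, so it lands in slot 6
    }
}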

17 Linear Probing Example (cont.)
[Figure: the table after the insertions]

18-22 Linear Probing (figure sequence)
Successive probes for a key examine h(key), (h(key) + 1) mod N, (h(key) + 2) mod N, (h(key) + 3) mod N, (h(key) + 4) mod N, ...

23-28 Quadratic Probing (figure sequence, N = 17, prime)
Successive probes examine h(key), (h(key) + 1) mod N, (h(key) + 4) mod N, (h(key) + 9) mod N, ..., (h(key) + 121) mod N, (h(key) + 144) mod N.

29 Quadratic Probing
[Figure: quadratic probing in a table of size N = 17 (prime)]
Theorem: If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.

30 Quadratic Probing
[Figure: quadratic probing in a table of size N = 17 (prime)]
Theorem: If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.
Application: probing visited only 9 of the 17 bins, but if the table is half empty, not all of those 9 bins can be occupied, so we must be able to insert a new element into one of them.

31 Collisions
Given N people in a room, what are the odds that at least two of them will have the same birthday?
Think of it as a table with a capacity of 365: after N insertions, what are the odds of at least one collision?
“Who Wants to Be a Millionaire?”: assume N = 23 (the load factor is therefore 23/365 = 6.3%). What are the approximate odds that two of these people have the same birthday?
Choices: 10%, 25%, 50%, 75%, 90%, 99%

32 Collisions
Let Q(n) be the probability that when n people are in a room, nobody has the same birthday.
Let P(n) be the probability that when n people are in a room, at least two of them have the same birthday.
P(n) = 1 - Q(n)
Consider that:
Q(1) = 1
Q(2) = the odds that the first person doesn’t collide times the odds of one more person not “colliding”
Q(2) = Q(1) * 364/365
Q(3) = Q(2) * 363/365
Q(4) = Q(3) * 362/365
...
Q(n) = (365/365) * (364/365) * (363/365) * ... * ((365 - n + 1)/365)
Q(n) = 365! / (365^n * (365 - n)!)
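The recurrence is easy to check numerically; a small Java sketch:

// P(n): probability that at least two of n people share a birthday.
static double probabilityOfCollision(int n) {
    double q = 1.0;                    // Q(0) = 1
    for (int i = 0; i < n; i++) {
        q *= (365.0 - i) / 365.0;      // Q(i+1) = Q(i) * (365 - i) / 365
    }
    return 1.0 - q;                    // P(n) = 1 - Q(n)
}
// probabilityOfCollision(23) is about 0.507, matching the 50.7% figure on the next slide.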

33 Collisions

Number of people (N)    Odds of a collision
5                       2.7%
10                      11.7%
15                      25.3%
23                      50.7%
30                      70.1%
40                      89.1%
45                      94.1%
100                     99.9999%

Collisions are more frequent than you might expect, even for low load factors!

34 Hashcodes and Table Size
Hashcodes should be fast/easy to compute
Keys should distribute evenly across the table
Hashtable capacities are usually kept at prime values to avoid problems with probe sequences
– Consider inserting into the 16-slot table below (indices 0-15) using quadratic probing and a key object that hashes to index 2
[Figure: an empty table with slots 0 through 15]
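A quick way to see the problem is to count how many distinct slots quadratic probing can ever reach from a given home index; a Java sketch (the method name is illustrative):

import java.util.Set;
import java.util.TreeSet;

// Collect every slot that quadratic probing (offsets i*i) can reach from `home`.
static Set<Integer> reachableSlots(int home, int capacity) {
    Set<Integer> slots = new TreeSet<>();
    for (int i = 0; i < capacity; i++) {
        slots.add((home + i * i) % capacity);
    }
    return slots;
}
// reachableSlots(2, 16) yields {2, 3, 6, 11}: only 4 of the 16 slots are ever probed.
// reachableSlots(2, 17) yields 9 distinct slots, consistent with the 9-of-17 observation on slide 30.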

35 We Need to Have a Little Talk
How do we remove an item from a hashtable that uses open addressing?
Consider a table of size 11 with the following sequence of operations, using h(K) = K % 11 and linear probing (p(i) = i):
– put(36, D1)
– put(23, D2)
– put(4, D3)
– put(46, D4)
– put(1, D5)
– remove(23)
– remove(36)
– get(1)

36 Removal
If an item is removed from the table, it could mess up gets on other items in the table.
Fix the problem by using a “tombstone” marker to indicate that, while the item has been removed from the array slot, the slot should be considered “occupied” for purposes of later gets.
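One common way to represent tombstones is a sentinel entry distinct from any real key; a hedged Java sketch (the names Entry and TOMBSTONE are illustrative):

// A slot is either null (truly empty), a real entry, or the TOMBSTONE sentinel.
static final class Entry {
    final Object key, value;
    Entry(Object key, Object value) { this.key = key; this.value = value; }
}
static final Entry TOMBSTONE = new Entry(null, null);

// Lookups must keep probing past a tombstone, but inserts may reuse the slot.
static boolean stopsLookup(Entry[] table, int slot) {
    return table[slot] == null;                              // only a truly empty slot ends a get
}
static boolean usableForInsert(Entry[] table, int slot) {
    return table[slot] == null || table[slot] == TOMBSTONE;  // empty or tombstoned
}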

37 Double Hashing
Another probing strategy is to use “double hashing”
The probe sequence becomes s(k, i) = norm(h(k) + i * h2(k))
The probe offsets are determined by “two” hash functions, and this typically performs better than linear or quadratic probing.

38 Double Hashing Example
h1(K) = K mod 13
h2(K) = 8 - (K mod 8)
We want h2(K) to be an offset to add (note it is always between 1 and 8, never zero).
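In Java, the i-th probe location for this example might look like the following (capacity 13 assumed, matching h1):

static int h1(int k) { return Math.floorMod(k, 13); }
static int h2(int k) { return 8 - Math.floorMod(k, 8); }  // always in 1..8, so never zero

// i-th slot examined for key k under double hashing: norm(h1(k) + i*h2(k)).
static int probe(int k, int i) {
    return Math.floorMod(h1(k) + i * h2(k), 13);
}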

39 Double Hashing Example (cont.)
[Figure: the table after the insertions]

40 Separate Chaining
A way to “avoid” collisions: each array slot contains a list of data elements
The fundamental methods then become:
– put: hash into the array and add to the list
– get: hash into the array and search the list
– remove: hash into the array and remove from the list
The built-in HashMap and Hashtable classes (in Java) use separate chaining
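A minimal separate-chaining table in Java (a sketch only, not how the real HashMap is implemented; no resizing):

import java.util.LinkedList;

public class ChainedTable<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] slots;

    @SuppressWarnings("unchecked")
    public ChainedTable(int capacity) {
        slots = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) slots[i] = new LinkedList<>();
    }

    private int slotFor(K key) {
        return Math.floorMod(key.hashCode(), slots.length);      // hash into the array
    }

    public void put(K key, V value) {
        for (Entry<K, V> e : slots[slotFor(key)]) {
            if (e.key.equals(key)) { e.value = value; return; }   // key already chained: overwrite
        }
        slots[slotFor(key)].add(new Entry<>(key, value));          // otherwise add to the list
    }

    public V get(K key) {
        for (Entry<K, V> e : slots[slotFor(key)]) {               // search the list in this slot
            if (e.key.equals(key)) return e.value;
        }
        return null;                                               // key not present
    }
}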

41 Chaining Example
put(B, Data1), put(S, Data2), put(J, Data3), put(N, Data4), put(X, Data5), put(W, Data6), put(B, Data7), get(X), get(W)
(alphabet positions: B=2, S=19, J=10, N=14, X=24, W=23)
[Figure: the first few entries are placed in their hashed slots; then a put maps to a slot that is already occupied, marked ???]

42 Chaining Example
Same operations as before: put(B, Data1), put(S, Data2), put(J, Data3), put(N, Data4), put(X, Data5), put(W, Data6), put(B, Data7), get(X), get(W)
[Figure: with separate chaining, each slot holds a list, so colliding entries simply join the list in their slot and both gets succeed. “I’m so relieved!”]

43 Theoretical Results
Let α = N/M be the load factor: the average number of keys per array index
Analysis is probabilistic, rather than worst-case
[Figure: expected number of probes as a function of α, for successful (“found”) and unsuccessful (“not found”) searches]

44 Expected Number of Probes vs. Load Factor
[Figure: plot of expected number of probes against load factor]

45 Summary
Dictionaries may be ordered or unordered
– Unordered dictionaries can be implemented with lists (array-based or linked) or hashtables (best solution)
– Ordered dictionaries can be implemented with lists (array-based or linked) or trees (AVL (best solution), splay, BST)

