Dictionaries and Hash Tables Cmput Lecture 24 Department of Computing Science University of Alberta ©Duane Szafron 2000 Some code in this lecture is based on code from the book: Java Structures by Duane A. Bailey or the companion structure package Revised 3/28/00
About This Lecture
In this lecture we will study a container interface called Dictionary and an implementation class called HashTable.
Outline
- Dictionary Interface
- HashTable Class
- Iterators
- External Chaining
Dictionary
A Dictionary is an unordered container that holds key-value pairs. The keys are unique, but the values need not be.
[Figure: a dictionary mapping keys to values; labels shown: 45, "Barney", "Wilma", "Betty", "Fred".]
Dictionary Hierarchy
In java.util, Dictionary is a class. In the structure package, the Dictionary interface is an extension of the Store interface. The class HashTable will implement the Dictionary interface.
[Figure: hierarchy Store -> Dictionary -> HashTable.]
Structure Interface - Store
public interface Store {
    public int size();
    // post: returns the number of elements contained in
    // the store.

    public boolean isEmpty();
    // post: returns true iff store is empty.

    public void clear();
    // post: clears the store so that it contains no
    // elements.
}
code based on Bailey pg. 18
Structure Interface - Dictionary 1
public interface Dictionary extends Store {
    public Object put(Object key, Object value);
    // pre: key is non-null
    // post: puts the key-value pair in this Dictionary. If a
    // matching key was in this Dictionary, returns the old value.
    // Otherwise, returns null

    public Object get(Object key);
    // pre: key is non-null
    // post: returns the value with the given key or null if
    // no matching key is found

    public boolean contains(Object value);
    // pre: value is non-null
    // post: returns true iff the Dictionary contains the value
code based on Bailey pg. 268
Structure Interface - Dictionary 2
    public boolean containsKey(Object key);
    // pre: key is non-null
    // post: returns true iff the Dictionary contains the key

    public Object remove(Object key);
    // pre: key is non-null
    // post: removes a key-value pair whose key is "equal" to
    // the given key and returns the value. If no matching key
    // was found, then returns null

    public Iterator keys();
    // post: returns an Iterator for traversing all keys

    public Iterator elements();
    // post: returns an Iterator for traversing all values
}
code based on Bailey pg. 267
Dictionary - Obvious Implementations
We could implement a Dictionary using two parallel containers (Arrays, Vectors, Lists, etc.), one for the keys and one for the values. We could also implement a Dictionary using a single container that holds Associations. In either case, the methods get(Object), put(Object, Object), contains(Object), containsKey(Object) and remove(Object) would each require O(n) calls to the equals(Object) method for Lists. If the keys are Comparable and kept sorted, a binary search reduces the key comparisons to O(log n) for Arrays and Vectors. Can we do better?
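The single-container approach can be sketched as follows. This is a minimal illustration, not the structure package's code: the class name ListDictionary and the nested Pair class (standing in for Association) are assumptions made for this example, and only put and get are shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a dictionary kept as an unsorted list of key-value pairs.
class ListDictionary {
    // Minimal key-value pair; a stand-in for the structure package's Association.
    private static final class Pair {
        final Object key;
        Object value;

        Pair(Object key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Pair> pairs = new ArrayList<Pair>();

    // Stores the pair; returns the old value for key, or null if the key is new.
    public Object put(Object key, Object value) {
        for (Pair p : pairs) {              // up to n calls to equals
            if (p.key.equals(key)) {
                Object old = p.value;
                p.value = value;
                return old;
            }
        }
        pairs.add(new Pair(key, value));
        return null;
    }

    // Returns the value stored under key, or null if no matching key is found.
    public Object get(Object key) {
        for (Pair p : pairs) {              // up to n calls to equals
            if (p.key.equals(key)) {
                return p.value;
            }
        }
        return null;
    }

    public int size() {
        return pairs.size();
    }
}
```

Every operation walks the list from the front, which is where the O(n) cost comes from.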
Dictionary - Parcel Analogy
Assume that you are about to leave a busy mall or amusement park and you are one of about a thousand people picking up a parcel at any time during the day. This is a Dictionary problem with names as keys and parcels as values. Assume the mall has 100 bins that each hold about 10 parcels. How should the mall organize these parcels to minimize waiting time?
Parcels - Using Bins
When you buy your item, you are asked for the last two digits of your phone number and your parcel is sent to that bin. When you pick up your parcel the attendant asks for the last two digits of your phone number, goes to the correct bin ( ) and searches through the parcels (1-10) to get the one with your name. This is an example of hashing. Each item is assigned a hash number (or index) that is used to select a bin which contains a small number of items that can be searched for your item.
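The bin selection rule can be written as a tiny hash function. The sketch below is illustrative only: the class name ParcelBins, the method binFor and the sample phone number are invented for this example, and it assumes the phone number is given as a string containing at least two digits.

```java
// Hypothetical sketch of the parcel-bin rule: hash a phone number to one of 100 bins.
class ParcelBins {
    // Returns a bin number in 0..99 taken from the last two digits of the phone number.
    // Punctuation and spaces are ignored.
    static int binFor(String phone) {
        String digits = phone.replaceAll("[^0-9]", "");
        if (digits.length() < 2) {
            throw new IllegalArgumentException("need at least two digits");
        }
        return Integer.parseInt(digits.substring(digits.length() - 2));
    }

    public static void main(String[] args) {
        // The sample number is made up for the demonstration.
        System.out.println("bin " + binFor("555-0142"));
    }
}
```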
Selecting Bin Numbers
Would the first two digits of a phone number be as good as the last two digits? There are only a few combinations of first two digits that most local residents share (for example, 42, 43, 44, 45, 46, 47, 48, 92, 96, 98 in Edmonton), so a few bins would overflow and others would be empty. What about using the first two or last two letters of the name of the person? This would take 26*26 = 676 bins, but even so, some bins would be fuller than others. For maximum efficiency, we want the keys to be uniformly distributed over the bin numbers.
Hash Functions
A hash function maps keys to bin numbers.
- It should map keys uniformly across all bins.
- It should be fast to compute.
- It should be applicable to all objects.
Example: h("Paul") = 28
When two keys map to the same bin, we have a hash collision. When a collision occurs, a collision resolution algorithm is used to establish the locations of the colliding keys in the bin. In some cases, when we know all of the key values in advance, we can construct a perfect hash function that maps each key to a different bin (no collisions).
Hash Tables
A hash table is a container (usually an Array or Vector) whose elements are used as bins. In the basic implementation, each entry in the hash table is a bin that holds a single element.
[Figure: a table holding "longest", "to", "kiwi" and "fifth"; hash function = length % 7, so "kiwi" hashes to 4.]
Hash Tables Collisions
If there is a hash collision, the collision resolution algorithm selects a different bin for the new element to be inserted. This is called open addressing.
[Figure: the same table holding "longest", "to", "kiwi" and "fifth"; hash function = length % 7, so "poem" also hashes to 4. Where should it go?]
Linear Probing
One open addressing algorithm is called linear probing:
- Locations are checked from the hash location to the end of the table and the element is placed in the first empty slot.
- If the bottom of the table is reached, checking "wraps around" to the start of the table.
Collision resolution modifies how a search is done since the match for a search might not be at the hash location. For example, if linear probing is used, the search must continue down the table until a match or empty location is found.
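Linear probing can be sketched with the lecture's example strings and its hash function, length % 7. The class LinearProbingDemo and its method names are assumptions made for this illustration; it stores bare strings rather than key-value pairs and does not support removal.

```java
// Hypothetical sketch of open addressing with linear probing over a table of strings.
class LinearProbingDemo {
    private final String[] table;

    LinearProbingDemo(int capacity) {
        table = new String[capacity];
    }

    // Hash function from the example: string length modulo the table size.
    int hash(String s) {
        return s.length() % table.length;
    }

    // Inserts s and returns the slot it occupies, or -1 if the table is full.
    int insert(String s) {
        int index = hash(s);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[index] == null || table[index].equals(s)) {
                table[index] = s;
                return index;
            }
            index = (index + 1) % table.length;   // probe linearly, wrapping around
        }
        return -1;
    }

    // Returns the slot holding s, or -1 if s is not in the table.
    int find(String s) {
        int index = hash(s);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[index] == null) {
                return -1;                        // an empty slot ends the search
            }
            if (table[index].equals(s)) {
                return index;
            }
            index = (index + 1) % table.length;
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingDemo t = new LinearProbingDemo(7);
        for (String s : new String[] {"longest", "to", "kiwi", "fifth", "poem"}) {
            System.out.println(s + " -> slot " + t.insert(s));
        }
    }
}
```

Note that find follows exactly the same probe sequence as insert, which is why a search may have to look past the hash location.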
Linear Probing Example
[Figure: the table holding "longest", "to", "kiwi", "fifth" and "poem"; hash function = length % 7, so "poem" hashes to 4 and is placed by linear probing.]
Other Open Addressing Schemes
Linear probing has an offset value of 1. Instead, we can use double hashing: a second hash function generates a different offset from the first hash location.
[Figure: the table holding "longest", "to", "kiwi", "poem" and "fifth". For "fred", the first hash function (length % 7) gives 4 and the second hash function (value of the first character) gives 6, so the next location tried is (4 + 6) % 7 = 3.]
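A double hashing insert might look like the sketch below. It reuses the example's two hash functions (length % 7 and the alphabet position of the first character). The class name, the method names and the fallback to an offset of 1 when the second hash is a multiple of the table size are assumptions added for this illustration; the probe loop is only guaranteed to visit every slot when the table size is prime.

```java
// Hypothetical sketch of open addressing with double hashing.
class DoubleHashingDemo {
    private final String[] table;

    DoubleHashingDemo(int capacity) {
        table = new String[capacity];
    }

    // First hash: string length modulo the table size.
    int hash(String s) {
        return s.length() % table.length;
    }

    // Second hash: alphabet position of the first letter (a = 1 ... z = 26).
    // Assumes s starts with an ASCII letter.
    int offset(String s) {
        int value = Character.toLowerCase(s.charAt(0)) - 'a' + 1;
        // Assumption: an offset that is a multiple of the table size would never
        // leave the first slot, so fall back to 1 in that case.
        return (value % table.length == 0) ? 1 : value;
    }

    // Inserts s and returns the slot it occupies, or -1 if no free slot was found.
    int insert(String s) {
        int index = hash(s);
        int step = offset(s);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[index] == null || table[index].equals(s)) {
                table[index] = s;
                return index;
            }
            index = (index + step) % table.length;   // jump by the second hash
        }
        return -1;
    }

    public static void main(String[] args) {
        DoubleHashingDemo t = new DoubleHashingDemo(7);
        for (String s : new String[] {"longest", "to", "kiwi", "fifth", "poem", "fred"}) {
            System.out.println(s + " -> slot " + t.insert(s));
        }
    }
}
```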
Element Deletion Problem
Open addressing affects element removal. When an element is removed, the "hole" may prevent us from finding another element that hashed to the same location.
[Figure: the table holding "longest", "to", "kiwi", "poem" and "fifth"; hash function = length % 7, so "poem" hashes to 4. After a removal, the search reaches the hole and stops before finding "poem".]
Element Deletion
Deletions can be handled in two ways:
- Mark the deleted location as Reserved. During insertion, a reserved location can be re-used.
- Move all of the elements that hashed to the same location as the removed element "up" in the hash table after a deletion.
[Figure: two tables after the removal. In the first, the freed location is marked Reserved and "longest", "to", "kiwi" and "poem" stay in place; in the second, "poem" has been moved up to fill the hole.]
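The Reserved strategy can be sketched on a small table of strings. This is an illustrative sketch under stated assumptions, not the lecture's HashTable class: the class TombstoneDemo, its method names and the use of a private sentinel string as the Reserved marker are all invented for this example. It uses linear probing with hash = length % capacity.

```java
// Hypothetical sketch: linear probing with a Reserved marker for removed slots.
class TombstoneDemo {
    // Sentinel for a removed slot. A fresh String instance is created on purpose
    // so that it can be recognised by identity (==) and never matches a real key.
    private static final String RESERVED = new String("reserved");
    private final String[] table;

    TombstoneDemo(int capacity) {
        table = new String[capacity];
    }

    private int hash(String s) {
        return s.length() % table.length;
    }

    // Returns the slot holding s; otherwise the first reusable slot on its probe
    // path (a reserved slot if one was passed, else the empty slot); -1 if none.
    private int locate(String s) {
        int index = hash(s);
        int firstReserved = -1;
        for (int probes = 0; probes < table.length && table[index] != null; probes++) {
            if (table[index] == RESERVED) {
                if (firstReserved == -1) firstReserved = index;
            } else if (table[index].equals(s)) {
                return index;
            }
            index = (index + 1) % table.length;
        }
        if (firstReserved != -1) return firstReserved;
        return table[index] == null ? index : -1;
    }

    // Returns the slot holding s, or -1 if s is not stored.
    int slotOf(String s) {
        int i = locate(s);
        return (i != -1 && table[i] != null && table[i] != RESERVED) ? i : -1;
    }

    boolean contains(String s) {
        return slotOf(s) != -1;
    }

    void insert(String s) {
        int i = locate(s);
        if (i == -1) throw new IllegalStateException("table is full");
        table[i] = s;
    }

    void remove(String s) {
        int i = slotOf(s);
        if (i != -1) table[i] = RESERVED;   // leave a marker instead of a hole
    }
}
```

Because the removed slot is marked rather than emptied, a later search keeps probing past it and still finds elements that were placed further along the same probe path.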
Efficiency of HashTables
If the number of collisions is small, searching, adding and removing elements in a hash table each take O(1) (constant) time. To reduce the number of collisions, in addition to using a good hash function, we must make sure the table does not get too full. The load factor of a hash table is the ratio of occupied entries to the total number of entries (the capacity). For best results, the load factor should not be above 0.6. If it gets higher, we should extend the hash table and re-hash all of its elements.
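The load factor check is simple arithmetic, sketched below. The class and method names are hypothetical, the 0.6 limit is passed in as a parameter, and the "double and add one" growth rule is just one possible choice for extending the table.

```java
// Hypothetical sketch of load factor bookkeeping for a hash table.
class LoadFactorDemo {
    // Load factor = occupied entries / total entries.
    static double loadFactor(int count, int capacity) {
        return (double) count / capacity;
    }

    // True if adding one more element would push the load factor above maxLoad.
    static boolean mustGrowBeforeInsert(int count, int capacity, double maxLoad) {
        return count + 1 > maxLoad * capacity;
    }

    // One possible growth rule (an assumption here): double the capacity and add one.
    static int grownCapacity(int capacity) {
        return capacity * 2 + 1;
    }

    public static void main(String[] args) {
        int capacity = 10;
        for (int count = 0; count <= 7; count++) {
            System.out.println(count + " elements: load factor "
                    + loadFactor(count, capacity)
                    + ", grow before next insert: "
                    + mustGrowBeforeInsert(count, capacity, 0.6));
        }
    }
}
```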
Implementation of HashTable
We will use an array of Associations. We will use the Reserved strategy for deletions. We will grow the HashTable when the load factor gets too high. We will cache the logical size to make it easier to determine when the load factor is too high. The size of the HashTable should be a prime, but we will allow the user to specify the initial size. When the table must be grown, we will double its size and add one. (e.g., run some experiments using size 97 vs 100)
Example
hash = value(first char of key) * length(key), reduced modulo the table size (11 here)
put "Aho", prog-lang -> 1*3 % 11 = 3
put "Scott", automata -> 19*5 % 11 = 7
put "Hopcroft", automata -> 9
put "Backus", prog-lang -> 1
put "von Neuman", archit -> 22*10 % 11 = 0
put "Turing", coding -> 10
put "Jacobsen", softeng -> 3
[Figure: the 11-entry table holding Aho, Hopcroft, Backus, von Neuman, Scott, Jacobsen and Turing.]
Example
hash = value(first char of key) * length(key), reduced modulo the table size (23 after the rehash). Each line shows the old location followed by the new one.
put "Aho", prog-lang -> 3 -> 1*3 % 23 = 3
put "Scott", automata -> 7 -> 19*5 % 23 = 3
put "Hopcroft", automata -> 9 -> 18
put "Backus", prog-lang -> 1 -> 12
put "von Neuman", archit -> 0 -> 13
put "Turing", coding -> 10 -> 5
put "Jacobsen", softeng -> 3 -> 11
put "McCarthy", AI -> 13*8 % 23 = 12
[Figure: the 11-entry table is rehashed into a 23-entry table holding Scott, Aho, Hopcroft, Backus, von Neuman, Turing, Jacobsen and McCarthy.]
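The example's hash function can be checked with a few lines of code. The sketch below assumes that value(first char) means the letter's position in the alphabet (a = 1 ... z = 26, ignoring case) and that the product is reduced modulo the table size; the class name NameHashDemo is hypothetical.

```java
// Hypothetical sketch of the example's hash function.
class NameHashDemo {
    // hash = (alphabet position of the first letter) * (key length) % capacity.
    // Assumes the key starts with an ASCII letter; case is ignored.
    static int hash(String key, int capacity) {
        int value = Character.toLowerCase(key.charAt(0)) - 'a' + 1;
        return (value * key.length()) % capacity;
    }

    public static void main(String[] args) {
        String[] keys = {"Aho", "Scott", "Hopcroft", "Backus", "Turing", "Jacobsen", "McCarthy"};
        for (String key : keys) {
            // Prints the location in an 11-entry table, then in a 23-entry table.
            System.out.println(key + ": " + hash(key, 11) + " -> " + hash(key, 23));
        }
    }
}
```

Two keys that share a location, such as "Aho" and "Scott" in the 23-entry table, still need a collision resolution step such as linear probing.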
HashTable - State and Constructors
class HashTable implements Dictionary {
    protected static Association reserved = new Association("reserved", null);
    protected Association data[];
    protected int count;
    protected int capacity;
    protected final double loadFactor = 0.6;

    public HashTable(int initialCapacity) {
        // pre: initialCapacity > 0
        // post: constructs a HashTable with given initial size.
        this.data = new Association[initialCapacity];
        this.capacity = initialCapacity;
        this.count = 0;
    }

    public HashTable() {
        // post: constructs a HashTable with a default size.
        this(997);
    }
code based on Bailey pg. 270
HashTable - Store Interface
    /* Interface Store Methods */
    public int size() {
        // post: returns the number of elements in the store.
        return this.count;
    }

    public boolean isEmpty() {
        // post: returns true iff store is empty.
        return this.size() == 0;
    }

    public void clear() {
        // post: clears the store so that it contains no elements.
        for (int index = 0; index < this.capacity; index++)
            this.data[index] = null;
        this.count = 0;
    }
code based on Bailey SPackage
HashTable - get
    public Object get(Object key) {
        // pre: key is non-null
        // post: returns the value with the given key or null if
        // no matching key is found
        int index;
        Association found;

        index = this.locate(key); // locate does the work
        found = this.data[index];
        if (found == null || found == reserved)
            return null;
        return found.value();
    }
code based on Bailey pg. 275
HashTable - put 1
    public Object put(Object key, Object value) {
        // pre: key is non-null
        // post: puts the key-value pair in this Dictionary. If a
        // matching key was in this Dictionary, returns the old value.
        // Otherwise, returns null
        int index;
        Association found;
        Object oldValue;

        if (count + 1 > this.loadFactor * capacity)
            this.rehash();
        index = this.locate(key); // locate does the work
        found = this.data[index];
        if (found == null || found == reserved) { // not found
            this.data[index] = new Association(key, value);
            this.count++;
            return null;
        }
code based on Bailey pg. 274
HashTable - put 2 and containsKey
        else { // found
            oldValue = found.value();
            found.setValue(value);
            return oldValue;
        }
    }

    public boolean containsKey(Object key) {
        // pre: key is non-null
        // post: returns true iff the Dictionary contains the key
        int index;

        index = this.locate(key); // locate does the work
        return this.data[index] != null && this.data[index] != reserved;
    }
code based on Bailey pg. 275
HashTable - remove
    public Object remove(Object key) {
        // pre: key is non-null
        // post: removes a key-value pair whose key is "equal" to
        // the given key and returns the value. If no matching key
        // was found, then returns null
        int index;
        Association found;
        Object oldValue;

        index = this.locate(key); // locate does the work
        found = this.data[index];
        if (found == null || found == reserved) // not found
            return null;
        this.count--;
        oldValue = found.value();
        this.data[index] = reserved; // leave a marker so later probes continue
        return oldValue;
    }
code based on Bailey pg. 276
HashTable - locate 1
    protected int locate(Object key) {
        // pre: key is non-null
        // post: returns ideal index of key in table
        int index;
        int reservedIndex;

        index = Math.abs(key.hashCode() % this.capacity);
        reservedIndex = -1;
code based on Bailey pg. 274
HashTable - locate 2
        while (this.data[index] != null) {
            if (this.data[index] == reserved) {
                // remember the first reserved location we pass
                if (reservedIndex == -1) reservedIndex = index;
            }
            else if (key.equals(this.data[index].key()))
                return index; // we have located the key
            index = (index + 1) % this.capacity; // probe linearly
        }
        if (reservedIndex == -1)
            return index; // haven't hit a reserved location, so return the empty index
        else // return first available (reserved) index
            return reservedIndex;
    }
code based on Bailey pg. 274
HashTable - rehash
    protected void rehash() {
        // post: resizes table and re-hashes all elements
        Association association;
        Iterator iterator;

        iterator = new HashtableIterator(this.data); // iterates over the old array
        this.capacity = this.capacity * 2 + 1;
        this.data = new Association[this.capacity];
        this.count = 0;
        while (iterator.hasMoreElements()) {
            association = (Association) iterator.nextElement();
            put(association.key(), association.value());
        }
    }
code based on Bailey SPackage
Iterators
Create a HashtableIterator class whose elements are Associations. A HashtableIterator is used in rehash(). Also, let each KeyIterator or ValueIterator be a filter on a HashtableIterator (see textbook).
HashtableIterator - public 1
class HashtableIterator implements Iterator {
    protected int current;
    protected Association data[];

    public HashtableIterator(Association[] table) {
        // post: constructs a new hash table iterator
        this.data = table;
        this.reset();
    }

    public void reset() {
        // post: resets iterator to beginning of hash table
        this.current = 0;
        this.findNextElement();
    }

    public boolean hasMoreElements() {
        // post: returns true if there are unvisited elements
        return this.current < this.data.length;
    }
code based on Bailey SPackage
HashtableIterator - public 2
    public Object nextElement() {
        // pre: hasMoreElements()
        // post: returns current element, increments iterator
        Object result;

        result = this.data[this.current];
        this.current++; // step past the element just returned
        this.findNextElement();
        return result;
    }

    public Object value() {
        // pre: hasMoreElements()
        // post: returns current element (key and value)
        return this.data[this.current];
    }
code based on Bailey SPackage
HashtableIterator - findNextElement
    protected void findNextElement() {
        // post: moves current index to the next real element
        // (stays put if current already refers to a real element)
        while (this.current < this.data.length &&
               (this.data[this.current] == null ||
                this.data[this.current] == HashTable.reserved))
            this.current++;
    }
}
code based on Bailey SPackage
External Chaining
Instead of implementing a hash table whose entries are associations, we can have a hash table whose entries are containers for associations. Then when there is a hashing collision, we put all elements that collided into a common container.
[Figure: a table whose entries are containers, holding "longest", "to", "kiwi", "fifth", "largest", "there", "fred" and "association".]
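External chaining can be sketched with one list per table entry. The class ChainingDemo and its method names are hypothetical, the lists come from java.util rather than the structure package, and the hash function is assumed to be string length modulo the table size, as in the earlier open addressing examples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hash table that resolves collisions by external chaining.
class ChainingDemo {
    // One container (chain) per table entry.
    private final List<List<String>> bins = new ArrayList<List<String>>();

    ChainingDemo(int capacity) {
        for (int i = 0; i < capacity; i++) {
            bins.add(new ArrayList<String>());
        }
    }

    // Assumed hash function: string length modulo the table size.
    private int hash(String s) {
        return s.length() % bins.size();
    }

    // Adds s to the chain at its hash location unless it is already there.
    void insert(String s) {
        List<String> bin = bins.get(hash(s));
        if (!bin.contains(s)) {
            bin.add(s);
        }
    }

    // Only the one chain at the hash location has to be searched.
    boolean contains(String s) {
        return bins.get(hash(s)).contains(s);
    }

    // Exposes a chain so the demonstration can show which elements collided.
    List<String> bin(int index) {
        return bins.get(index);
    }

    public static void main(String[] args) {
        ChainingDemo t = new ChainingDemo(7);
        String[] words = {"longest", "to", "kiwi", "fifth", "largest", "there", "fred", "association"};
        for (String w : words) {
            t.insert(w);
        }
        for (int i = 0; i < 7; i++) {
            System.out.println(i + ": " + t.bin(i));
        }
    }
}
```

Removal needs no Reserved marker with this scheme, because deleting from one chain never breaks the search path to another element.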
Some Principles from the Textbook
25. Provide a method for hashing the objects you implement.
26. Equivalent objects should return equal hash codes.
principles from Bailey ch. 13
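In Java, both principles come down to overriding equals(Object) and hashCode() together. The class below is a hypothetical example written for this note (it is not from the textbook); it assumes both fields are non-null.

```java
import java.util.Objects;

// Hypothetical key class that follows both principles.
final class Name {
    private final String first;
    private final String last;

    Name(String first, String last) {
        this.first = first;
        this.last = last;
    }

    // Two names are equivalent when both parts match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Name)) return false;
        Name that = (Name) other;
        return first.equals(that.first) && last.equals(that.last);
    }

    // Built from the same fields as equals, so equivalent names
    // always return equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(first, last);
    }
}
```

If hashCode() were left at its default, two equivalent Name objects would usually hash to different locations, and a lookup with a freshly built key would fail to find the stored value.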