Chapter 12 Hash Table
● So far, the best worst-case time for searching is O(log n). ● Hash tables offer an average search time of O(1), but a worst-case search time of O(n).
Learning Objectives ● Develop the motivation for hashing. ● Study hash functions. ● Understand collision resolution and compare and contrast various collision resolution schemes. ● Summarize the average running times for hashing under various collision resolution schemes. ● Explore the java.util.HashMap class.
12.1 Motivation ● Let's design a data structure using an array for which the indices could be the keys of entries. ● Suppose we wanted to store the keys 1, 3, 5, 8, 10, with a guaranteed one-step access to any of these.
12.1 Motivation ● The space consumption does not depend on the actual number of entries stored. It depends on the range of keys. ● What if we wanted to store strings? For each string, we would first have to compute a numeric key that is equivalent to it. java.lang.String.hashCode() computes the numeric equivalent (or hashcode) of a string by an arithmetic manipulation involving its individual characters.
12.1 Motivation ● Using numeric keys directly as indices is out of the question for most applications: there isn't enough space for an array that covers the entire range of keys.
12.2 Hashing ● A simple hash function for a table of size 10: h(k) = k mod 10
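The two-step process (obtain a numeric hashcode, then map it to a table location) can be sketched in Java as below. This is only an illustrative sketch: the class name SimpleHashSketch and the method indexFor are hypothetical, and Math.floorMod is used in place of a plain mod so that negative hashcodes still land inside the table.

```java
public class SimpleHashSketch {
    // Maps a numeric hashcode to a slot in a table of the given size,
    // i.e. h(k) = k mod tableSize. Math.floorMod keeps the result in
    // the range 0..tableSize-1 even when the hashcode is negative.
    static int indexFor(int hashcode, int tableSize) {
        return Math.floorMod(hashcode, tableSize);
    }

    public static void main(String[] args) {
        int tableSize = 10;
        // Numeric keys are mapped directly.
        for (int key : new int[] {1, 3, 5, 8, 10, 25}) {
            System.out.println(key + " -> " + indexFor(key, tableSize));
        }
        // A string is first reduced to a numeric hashcode, then mapped.
        String s = "hello";
        System.out.println(s + " -> " + indexFor(s.hashCode(), tableSize));
    }
}
```

Note that keys 5 and 25 both map to slot 5, which is exactly the kind of collision the next slides deal with.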
12.2 Hashing ● ear collides with cat at position 4. ● There is empty space in the table, and it is up to the collision resolution scheme to find an appropriate position for this string. ● A better mapping function ● For any hash function one could devise, there are always hashcodes that could force the mapping function to be ineffective by generating lots of collisions.
12.3 Collision Resolution ● There are two ways to resolve collisions. ● Open addressing: find another location for the colliding key within the hash table. ● Closed addressing: store all keys that hash to the same location in a data structure that “hangs off” that location.
Linear Probing
● As more and more entries are hashed into the table, they tend to form clusters that get bigger and bigger. ● The number of probes on collisions gradually increases, thus slowing down the hash time to a crawl.
Linear Probing ● Insert "cat", "ear", "sad", and "aid"
Linear Probing ● Clustering is the downfall of linear probing, so we need to look to another method of collision resolution that avoids clustering.
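A minimal sketch of linear probing insertion may help make the clustering visible. The class LinearProbingSketch is hypothetical, and the home slot is passed in explicitly because the hash function used in the slides' figures is not reproduced here. Only the collision of "ear" with "cat" at position 4 comes from the slides; the home slots chosen for "sad" and "aid" are assumptions made for illustration.

```java
public class LinearProbingSketch {
    private final String[] slots;

    LinearProbingSketch(int capacity) {
        slots = new String[capacity];
    }

    // Inserts key starting at its home slot and stepping one slot at a
    // time (wrapping around) until an empty slot is found.
    // Returns the slot used, or -1 if every slot is occupied.
    int insert(String key, int home) {
        for (int i = 0; i < slots.length; i++) {
            int pos = (home + i) % slots.length;
            if (slots[pos] == null) {
                slots[pos] = key;
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingSketch table = new LinearProbingSketch(10);
        System.out.println(table.insert("cat", 4)); // 4: home slot is free
        System.out.println(table.insert("ear", 4)); // 5: collides with cat
        System.out.println(table.insert("sad", 5)); // 6: assumed home 5 is taken by ear
        System.out.println(table.insert("aid", 4)); // 7: walks the whole cluster 4..6
    }
}
```

Each new key that lands anywhere in the cluster extends it by one slot, so later insertions into that region need more probes.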
Quadratic Probing
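● Reduces the clustering seen with linear probing. ● When probing stops with a failure to find an empty spot, as many as half the locations of the table may still be unoccupied. ● Example: for a key that hashes to location 2, the probes visit locations 2, 3, 6, 0, 7, and 5 and then repeat them endlessly; if all of those are occupied, the insertion is not done, even though about half the table is empty.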
Quadratic Probing ● For any given prime N, once a location is examined twice, all locations that are examined thereafter are also ones that have been already examined.
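The repetition can be observed with a short sketch. It assumes the usual probe rule (h + i*i) mod N, and it assumes N = 11 and h = 2, values inferred from the probe sequence 2, 3, 6, 0, 7, 5 rather than stated in the slides. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class QuadraticProbeDemo {
    // Returns the distinct locations visited by the probes
    // (h + i*i) mod n for i = 0..n-1, in order of first visit.
    static List<Integer> distinctProbes(int h, int n) {
        Set<Integer> seen = new LinkedHashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add((h + i * i) % n);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Assumed table size 11 and home location 2.
        System.out.println(distinctProbes(2, 11)); // [2, 3, 6, 0, 7, 5]
    }
}
```

Only 6 of the 11 locations are ever examined, so an insertion can fail while locations 1, 4, 8, 9, and 10 remain unused.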
Chaining ● If a collision occurs at location i of the hash table, it simply adds the colliding entry to a linked list that is built at that location.
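As a rough illustration of chaining, the sketch below keeps one linked list per table location. It is not the java.util.HashMap implementation; the class ChainedTableSketch and its methods are hypothetical, and the mapping (floorMod of hashCode by the table length) is an assumption chosen for simplicity.

```java
import java.util.LinkedList;

public class ChainedTableSketch {
    private final LinkedList<String>[] chains;

    @SuppressWarnings("unchecked")
    ChainedTableSketch(int capacity) {
        chains = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    // Assumed mapping: hashcode reduced modulo the table length.
    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), chains.length);
    }

    // Adds key to the chain at its location; a colliding key simply
    // joins the list that already hangs off that location.
    void insert(String key) {
        LinkedList<String> chain = chains[indexFor(key)];
        if (!chain.contains(key)) {
            chain.add(key);
        }
    }

    // Searches only the one chain the key could be in.
    boolean contains(String key) {
        return chains[indexFor(key)].contains(key);
    }

    public static void main(String[] args) {
        ChainedTableSketch table = new ChainedTableSketch(1); // capacity 1 forces collisions
        table.insert("cat");
        table.insert("ear");
        System.out.println(table.contains("ear")); // true
        System.out.println(table.contains("sad")); // false
    }
}
```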
Running times ● We assume that the hashing process itself (hashcode and mapping) takes O(1) time. ● The running time of insertion is therefore determined by the collision resolution scheme.
12.4 The java.util.HashMap Class ● Consider a university-wide database that stores student records. Every student is assigned a unique id (key), with which are associated several pieces of information such as name, address, credits, gpa, etc. These pieces of information constitute the value.
12.4 The java.util.HashMap Class ● A StudentInfo dictionary stores (id, info) pairs for all the students enrolled in the university. ● The operations corresponding to this relationship can be found in java.util.Map.
12.4 The java.util.HashMap Class ● The Map interface also provides operations to enumerate all the keys, enumerate all the values, get the size of the dictionary, check whether the dictionary is empty, and so on. ● The java.util.HashMap implements the dictionary abstraction as specified by the java.util.Map interface. It resolves collisions using chaining.
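The dictionary operations named above might be exercised as follows. The StudentInfo class here is a hypothetical stand-in with only a few illustrative fields, and the ids and values are made up; the Map and HashMap calls (put, get, containsKey, size, isEmpty, keySet, values) are the standard java.util ones.

```java
import java.util.HashMap;
import java.util.Map;

public class StudentDirectorySketch {
    // Hypothetical value type; the fields are illustrative only.
    static class StudentInfo {
        final String name;
        final int credits;
        final double gpa;

        StudentInfo(String name, int credits, double gpa) {
            this.name = name;
            this.credits = credits;
            this.gpa = gpa;
        }
    }

    // Builds a small dictionary of (id, info) pairs.
    static Map<Integer, StudentInfo> sample() {
        Map<Integer, StudentInfo> students = new HashMap<>();
        students.put(1001, new StudentInfo("Ada", 60, 3.9));
        students.put(1002, new StudentInfo("Alan", 45, 3.7));
        return students;
    }

    public static void main(String[] args) {
        Map<Integer, StudentInfo> students = sample();
        System.out.println(students.get(1001).name);     // Ada
        System.out.println(students.containsKey(2000));  // false
        System.out.println(students.size());             // 2
        System.out.println(students.isEmpty());          // false
        // Enumerate all keys and all values.
        for (Integer id : students.keySet()) {
            System.out.println("id " + id);
        }
        for (StudentInfo info : students.values()) {
            System.out.println(info.name + " " + info.gpa);
        }
    }
}
```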
Table and Load Factor ● When the no-arg constructor is used: default initial capacity of 16, default load factor of 0.75. ● The table size is defined as the actual number of key-value mappings in the hash table.
Table and Load Factor ● We can choose an initial capacity, but HashMap only uses capacities that are powers of 2. ● A requested capacity of 101 becomes 128.
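The rounding from 101 to 128 can be illustrated with a small helper that finds the smallest power of 2 not less than the requested capacity. This is a sketch of the idea, not the actual HashMap source; the method name and the MAXIMUM_CAPACITY bound of 2^30 used to keep the loop from overflowing are assumptions of this example.

```java
public class CapacityRoundingSketch {
    // Assumed upper bound so the shift below can never overflow an int.
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Returns the smallest power of two that is >= requested,
    // capped at MAXIMUM_CAPACITY.
    static int roundUpToPowerOfTwo(int requested) {
        int capacity = 1;
        while (capacity < requested && capacity < MAXIMUM_CAPACITY) {
            capacity <<= 1; // double the capacity
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(101)); // 128
        System.out.println(roundUpToPowerOfTwo(16));  // 16
        System.out.println(roundUpToPowerOfTwo(17));  // 32
    }
}
```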
Table and Load Factor ● An initial capacity of 128.
Storage of Entries ● Relevant fields in the HashMap class. ● threshold is the size threshold: the product of the capacity and the load factor (N * t).
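The threshold arithmetic is simple enough to check directly. In the sketch below the helper name threshold is hypothetical; the values 16 and 0.75 are the documented java.util.HashMap defaults, and 128 is the capacity used in the earlier example.

```java
public class ThresholdSketch {
    // Size threshold: product of the capacity N and the load factor t,
    // truncated to an int.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));  // 12
        System.out.println(threshold(128, 0.75f)); // 96
    }
}
```

With the defaults, the table is considered due for growth once it holds 12 mappings.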
Storage of Entries ● Entry[] table sets up an array of chains. Map.Entry is defined inside the Map interface. next holds a reference to the next Entry in its linked list.
Adding an Entry ● Example: a name serves as the key to a phone number value.
Adding an Entry
● If the key argument is null, a special object, NULL_KEY, is returned; otherwise the key argument is returned as is.
Adding an Entry
● Example: h = 25 and length = 16. The binary representations of h and length - 1 are 11001 and 01111; their bit-wise and is 01001, that is, 9.
Adding an Entry ● Since length is a power of 2, say 2^c, the binary representation of length is a 1 followed by c zeros, and that of length - 1 is c ones. ● Any h is expressible as 2^c * k + r, with 0 <= r < 2^c. r is the result of the bit-wise and, since the 2^c * k part occupies only higher-order bits, which are zeroed out in the process.
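The claim that the bit-wise and with length - 1 yields the remainder r can be checked with a few lines of Java. The method name indexFor is used here for illustration, and the shortcut is valid only under the stated assumption that length is a power of 2 (and, for the comparison with %, that h is non-negative).

```java
public class IndexForSketch {
    // Keeps only the low-order bits of h; equals h mod length
    // when length is a power of two and h is non-negative.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int h = 25;
        int length = 16;
        System.out.println(Integer.toBinaryString(h));          // 11001
        System.out.println(Integer.toBinaryString(length - 1)); // 1111
        System.out.println(indexFor(h, length));                // 9
        System.out.println(h % length);                         // 9
    }
}
```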
Adding an Entry
● The if statement triggers a rehashing process if the size is equal to or greater than the threshold.
Rehashing
Searching
12.5 Quadratic Probing: Repetition of Probe Locations ● Quadratic probing examines only about N/2 locations of the table before starting to repeat locations. ● Suppose a key is hashed to location h, where there is a collision. The following locations are then examined: (h + 1) mod N, (h + 4) mod N, (h + 9) mod N, and so on, that is, (h + i^2) mod N for the i-th probe.
12.5 Quadratic Probing: Repetition of Probe Locations ● What if two different probes, i and j, end up at the same location? Then (h + i^2) mod N = (h + j^2) mod N, so N divides i^2 - j^2 = (i + j)(i - j).
12.5 Quadratic Probing: Repetition of Probe Locations ● Since N is a prime number, it must divide one of the factors (i + j) or (i - j). ● N divides (i - j) only when at least N probes have been made already. ● N divides (i + j) when i + j = N, at the very least. ● So j = N - i: probe N - i revisits the location of probe i, and no new locations appear after about N/2 probes.
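The conclusion j = N - i can be spot-checked numerically. The sketch below assumes the probe rule (h + i*i) mod N and uses N = 11 and h = 2 purely as example values; the class and method names are hypothetical.

```java
public class ProbeRepetitionCheck {
    // Returns the first probe number j with i < j < n whose location
    // (h + j*j) mod n equals that of probe i, or -1 if there is none.
    static int firstRepeatOf(int i, int h, int n) {
        int target = (h + i * i) % n;
        for (int j = i + 1; j < n; j++) {
            if ((h + j * j) % n == target) {
                return j;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int n = 11; // assumed prime table size
        int h = 2;  // assumed home location
        for (int i = 1; i <= n / 2; i++) {
            // Each line shows i, the probe that repeats it, and N - i.
            System.out.println(i + " " + firstRepeatOf(i, h, n) + " " + (n - i));
        }
    }
}
```

For every i from 1 to 5 the second and third columns agree, which matches the argument on the slide.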
12.6 Summary ● A hash table implements the dictionary operations of insert, search, and delete on (key, value) pairs. ● Given a key, a hash function for a given hash table computes an index into the table as a function of the key by first obtaining a numeric hashcode, and then mapping this hashcode to a table location.
12.6 Summary ● When a new key hashes to a location in the hash table that is already occupied, it is said to collide with the occupying key. ● Collision resolution is the process used upon collision to determine an unoccupied location in the hash table where the colliding key may be inserted. ● In searching for a key, the same hash function and collision resolution scheme must be used as for its insertion.
12.6 Summary ● A good hash function must be O(1) time and must distribute entries uniformly over the hash table. ● Open addressing relocates a colliding entry in the hash table itself. Closed addressing stores all entries that hash to a location, in a data structure that “hangs off” that location. ● Linear probing and quadratic probing are instances of open addressing, while chaining is an instance of closed addressing.
12.6 Summary ● Linear probing leads to clustering of entries, with the clusters growing larger as more and more collisions occur. Clustering degrades performance significantly. ● Quadratic probing attempts to reduce clustering. On the other hand, quadratic probing may leave as much as half the hash table empty while reporting failure to insert a new entry.
12.6 Summary ● Chaining is the simplest way to resolve collisions and also results in better performance than linear probing or quadratic probing. ● The worst-case search time for linear probing, quadratic probing, and chaining is O(n). ● The load factor of a hash table is the ratio of the number of keys, n, to the capacity, N.
12.6 Summary ● The average performance of chaining depends on the load factor. For a perfect hash function that always distributes keys uniformly, the average search time for chaining is O(1).