Hashing
Motivating Applications
>> Large collection of datasets
>> Datasets are dynamic (insert, delete)
>> Goal: efficient searching, insertion, and deletion
>> Hashing is applicable ONLY to exact-match searching
Direct Address Tables
>> If the key domain is U, create an array T of size |U|
>> For each key k, store the object in T[k]
>> Supports insertion, deletion, and searching in O(1)
Direct Address Tables
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
    return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
    T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
    T[key[x]] ← NIL
>> Running time for these operations: O(1)
Drawbacks
>> If U is large, e.g., the domain of integers, then T is large (sometimes infeasible)
>> Limited to integer keys and does not support duplicate keys
Solution is to use hash tables
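The three direct-address operations above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class name `DirectAddressTable` and the `(key, value)` calling style are assumptions made for the example (the pseudocode passes an element x and reads key[x] instead), and `None` stands in for NIL.

```python
class DirectAddressTable:
    """Direct-address table over the key universe {0, ..., universe_size - 1}."""

    def __init__(self, universe_size):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * universe_size

    def search(self, key):
        # DIRECT-ADDRESS-SEARCH: a single array access, O(1).
        return self.slots[key]

    def insert(self, key, value):
        # DIRECT-ADDRESS-INSERT: overwrites whatever was stored under this key.
        self.slots[key] = value

    def delete(self, key):
        # DIRECT-ADDRESS-DELETE: reset the slot to NIL.
        self.slots[key] = None


table = DirectAddressTable(universe_size=10)
table.insert(3, "record for key 3")
found = table.search(3)
table.delete(3)
missing = table.search(3)
```

Note that the array is allocated for the whole universe even though only one key is stored, which is exactly the space drawback listed above.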
Direct Address Tables: Example
>> U is the domain (universe of keys)
>> K is the set of actual keys
Hashing
>> A technique that maps values from a certain domain or range to another domain or range
[Figure: a hash function mapping the domain of string values to the domain of integer values (3, 15, 20, 55 shown)]
Hashing: Example
[Figure: a hash function maps student IDs in the domain [950,000 … 960,000] to the range [0 … 10,000]]
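The slide does not say which function performs the student-ID mapping. One simple possibility that is consistent with the two ranges shown is to subtract the lowest ID. The sketch below assumes exactly that; the names `LOWEST_ID`, `HIGHEST_ID`, and `slot_for_student` are hypothetical.

```python
LOWEST_ID = 950_000   # smallest student ID in the domain (from the slide)
HIGHEST_ID = 960_000  # largest student ID in the domain (from the slide)


def slot_for_student(student_id):
    """Map a student ID in [950000, 960000] to a slot in [0, 10000].

    Assumption: the mapping is a plain offset; the original slide
    does not specify the function.
    """
    if not LOWEST_ID <= student_id <= HIGHEST_ID:
        raise ValueError("student ID outside the expected domain")
    return student_id - LOWEST_ID


first_slot = slot_for_student(950_000)
last_slot = slot_for_student(960_000)
middle_slot = slot_for_student(955_123)
```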
Hash Tables
>> When |K| is much smaller than |U|, a hash table requires much less space than a direct-address table
>> Can reduce storage requirements to O(|K|)
>> Can still get O(1) search time, but in the average case, not the worst case
Hash Tables: Main Idea
>> Use a hash function h to compute the slot for each key k
>> Store the element in slot h(k)
>> Maintain a hash table T[0…m-1] of size m
>> A hash function h transforms a key into an index in the hash table: h : U → {0, 1, . . . , m - 1}
>> We say that k hashes to slot h(k)
Hash Tables: Main Idea
[Figure: keys k1 … k5 from K (actual keys), a subset of U (universe of keys), are mapped by h into a hash table of size m with last slot m - 1; h(k2) = h(k5)]
>> m is much smaller than |U| (m << |U|)
>> m can be even smaller than |K|
Example
>> Back to the example of 100 students, each with a 9-digit SSN
>> All we need is a hash table of size 100
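As a rough illustration of the 100-slot table, the sketch below hashes a few 9-digit numbers with h(k) = k mod 100. The sample values are made up for the example and are not real SSNs, and the choice of a plain modulo is an assumption since the slide does not name a hash function.

```python
TABLE_SIZE = 100  # table size from the example: one slot per student


def h(ssn):
    # Assumed hash function: keep the last two decimal digits.
    return ssn % TABLE_SIZE


# Made-up 9-digit keys used purely for illustration.
sample_ssns = [123456789, 987654321, 555443389]
slots = [h(ssn) for ssn in sample_ssns]
```

The first and third keys share their last two digits, so they land in the same slot even though the keys differ.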
What About Collisions?
[Figure: keys from K mapped by h into a hash table with last slot m - 1; k2 and k5 collide because h(k2) = h(k5)]
>> A collision means two or more keys hash to the same slot
Handling Collisions
>> Many ways to handle collisions
>> Chaining
>> Open addressing
   >> Linear probing
   >> Quadratic probing
   >> Double hashing
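The slide only names the three open-addressing strategies. As a hedged sketch, the functions below use the common textbook forms of the probe sequences, where i is the probe number (0, 1, 2, ...). The primary hash `key % m`, the secondary hash `1 + key % (m - 1)`, and the constants `c1 = c2 = 1` are assumptions chosen for illustration.

```python
def linear_probe(key, i, m):
    # Linear probing: step one slot at a time from the home slot.
    return (key % m + i) % m


def quadratic_probe(key, i, m, c1=1, c2=1):
    # Quadratic probing: the offset grows quadratically with the probe number.
    return (key % m + c1 * i + c2 * i * i) % m


def double_hash_probe(key, i, m):
    # Double hashing: the step size comes from a second hash function.
    h1 = key % m
    h2 = 1 + key % (m - 1)  # always at least 1, so each probe moves
    return (h1 + i * h2) % m


m = 11
key = 27
linear = [linear_probe(key, i, m) for i in range(4)]
quadratic = [quadratic_probe(key, i, m) for i in range(4)]
double = [double_hash_probe(key, i, m) for i in range(4)]
```

All three sequences start at the same home slot (27 mod 11 = 5) and differ only in how they move away from it.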
Chaining: Main Idea
>> Put all elements that hash to the same slot into a linked list (chain)
>> Slot j contains a pointer to the head of the list of all elements that hash to j
Chaining - Discussion
>> Choosing the size of the hash table
   >> Small enough not to waste space
   >> Large enough that the lists remain short
   >> Typically 10%-20% of the total number of elements
>> How should we keep the lists: ordered or not?
   >> Usually each list is an unsorted linked list
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(key[x])]
>> Worst-case running time is O(1) (when no check for duplicates is performed)
>> May or may not allow duplicates, depending on the application
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(key[x])]
>> Need to find the element to be deleted
>> Worst-case running time: deletion depends on searching the corresponding list
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]
>> Running time is proportional to the length of the list of elements in slot h(k)
>> What are the worst case and the average case?
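The chained insert, delete, and search operations can be sketched together as below. This is one possible illustration under stated assumptions: the class name `ChainedHashTable` is hypothetical, keys are integers hashed with `key % m`, elements are `(key, value)` pairs, and a `collections.deque` stands in for the linked list so that insertion at the head stays O(1).

```python
from collections import deque


class ChainedHashTable:
    """Hash table with chaining; each slot holds a chain of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        # A deque per slot stands in for the linked list of the slides.
        self.slots = [deque() for _ in range(m)]

    def _h(self, key):
        # Assumed hash function for integer keys.
        return key % self.m

    def insert(self, key, value):
        # CHAINED-HASH-INSERT: add at the head of the chain, no duplicate check.
        self.slots[self._h(key)].appendleft((key, value))

    def search(self, key):
        # CHAINED-HASH-SEARCH: walk the chain of slot h(key).
        for stored_key, value in self.slots[self._h(key)]:
            if stored_key == key:
                return value
        return None

    def delete(self, key):
        # CHAINED-HASH-DELETE: find the element first, then unlink it.
        chain = self.slots[self._h(key)]
        for index, (stored_key, _) in enumerate(chain):
            if stored_key == key:
                del chain[index]
                return True
        return False


table = ChainedHashTable(m=7)
table.insert(10, "a")  # 10 % 7 == 3
table.insert(17, "b")  # 17 % 7 == 3, same chain as key 10
found = table.search(17)
deleted = table.delete(10)
after_delete = table.search(10)
```

Both search and delete walk the chain, which is why their cost depends on the chain length rather than on the table size.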
Analysis of Hashing with Chaining: Worst Case
[Figure: table T with last slot m - 1 and a single chain]
>> All keys go to a single chain
>> Chain size is O(n)
>> Searching is O(n) + time to apply h(k)
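A small, contrived illustration of the worst case: if every key is a multiple of the table size and the hash function is `key % m` (an assumption for this sketch), all keys fall into slot 0 and the table degenerates into one list.

```python
m = 10
keys = [0, 10, 20, 30, 40, 50]  # every key is a multiple of m

# Plain Python lists stand in for the chains.
chains = [[] for _ in range(m)]
for key in keys:
    chains[key % m].append(key)

longest_chain = max(len(chain) for chain in chains)
non_empty_slots = sum(1 for chain in chains if chain)
```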
Analysis of Hashing with Chaining: Average Case
[Figure: table T with last slot m - 1 and its chains]
>> With a good hash function and a uniform distribution of keys
   >> Any given element is equally likely to hash into any of the m slots
   >> All chains will have similar sizes
>> Let n be the total number of keys and m the hash table size
>> Average chain size: O(n/m)
>> Average search time: O(1 + n/m), the common case (the 1 accounts for computing h(k))
Analysis of Hashing with Chaining: Average Case
>> If m (# of slots) is proportional to n (# of keys): m = Θ(n)
>> Then n/m = O(1)
>> Searching takes constant time on average
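To make the n/m argument concrete, the sketch below counts chain lengths for an idealized, perfectly uniform case: the keys 0 … n - 1 hashed with `key % m`. Both the key set and the function name `chain_lengths` are assumptions for illustration; real key distributions are rarely this even.

```python
def chain_lengths(n, m):
    """Return the chain length of every slot after hashing keys 0..n-1 with key % m."""
    lengths = [0] * m
    for key in range(n):
        lengths[key % m] += 1
    return lengths


# Growing the table in proportion to the number of keys keeps chains the same length.
small = max(chain_lengths(n=1000, m=100))
scaled = max(chain_lengths(n=2000, m=200))

# Keeping the table fixed while the keys double makes the chains twice as long.
fixed_table = max(chain_lengths(n=2000, m=100))
```

When m grows with n, the ratio n/m, and with it the expected search cost, stays constant.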
Hash Functions
Hash Functions
>> A hash function transforms a key k into a table address (0…m-1)
>> What makes a good hash function?
   (1) Easy to compute
   (2) Approximates a random function: for every input, every output is equally likely (simple uniform hashing)
   (3) Reduces the number of collisions
Hash Functions
>> Goal: map a key k into one of the m slots in the hash table
>> Make the table size m a prime number
   >> Avoids even and power-of-2 numbers
>> Common function: h(k) = F(k) mod m
   >> F(k) is some function or operation on k (usually generates an integer)
   >> The output of “mod” is a number in [0…m-1]
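One way to see why a power-of-2 table size can be a poor choice is to hash keys that share a common factor with m. The sketch below is an illustrative experiment with assumed inputs (multiples of 8) and a hypothetical helper `slots_used`; it compares m = 8 with the prime m = 7.

```python
def slots_used(keys, m):
    """Count how many distinct slots the keys occupy under h(k) = k mod m."""
    return len({key % m for key in keys})


# Assumed key pattern: every key is a multiple of 8.
keys = [8 * i for i in range(14)]

with_power_of_two = slots_used(keys, m=8)  # every key maps to slot 0
with_prime = slots_used(keys, m=7)         # keys spread over all 7 slots
```

With m = 8 the modulo only looks at the low three bits of each key, which are all zero here, while the prime table size spreads the same keys evenly.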
Examples of Hash Functions
>> Collection of images: F(k) = sum of the pixel colors, h(k) = F(k) mod m
>> Collection of strings: F(k) = sum of the ASCII values, h(k) = F(k) mod m
>> Collection of numbers: F(k) = k, h(k) = F(k) mod m
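The three example functions can be sketched as below. The function names are hypothetical, the image is assumed to be a flat list of integer pixel values, and `ord` is used for the character codes (it matches ASCII for ASCII text). The table size m = 101 is an arbitrary prime picked for the example.

```python
def hash_image(pixels, m):
    # F(k): sum of the pixel values (image assumed to be a flat list of ints).
    return sum(pixels) % m


def hash_string(text, m):
    # F(k): sum of the character codes of the string.
    return sum(ord(ch) for ch in text) % m


def hash_number(k, m):
    # F(k): the number itself.
    return k % m


m = 101  # arbitrary prime table size for the example
image_slot = hash_image([255, 0, 128, 64], m)
cat_slot = hash_string("cat", m)
act_slot = hash_string("act", m)
number_slot = hash_number(950123, m)
```

Summing character codes ignores character order, so anagrams such as "cat" and "act" always collide; this keeps the function easy to compute at the cost of more collisions for such inputs.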