# CSE Lecture 16 – Hashing & Tables

## Presentation on theme: "CSE Lecture 16 – Hashing & Tables"— Presentation transcript:

CSE 30331 Lecture 16 – Hashing & Tables
Binary Search Tree vs. Hash Table Hash Function (quick intro) Collision Coping with Collisions Open addressing & linear probing Chaining with separate lists Hash Functions What works C++ Function Objects Hash Iterators Efficiency of Hash Methods

BST vs Hash Table Both used to implement Sets & Maps
Binary Search Tree – ordered associative container Order (log N) access (average & worst) Hash Table – unordered associative container Order(1) access (average case)

Hash Function A hash function converts a key into a numeric (unsigned int) table index Ideal hash functions uniformly distribute keys to all available indices When two keys hash to the same index a collision occurs Keys are not in any particular order (numeric, alphabetical, ...) within the table

Example Hash Function hf(n) = n, the identity function
index = hf(n)%m, where m is table size

Collision Given keys p and q, and table size m
hf(p)%m and hf(q)%m produce the same index hf(36)= %7 = 1

Coping with Collisions
Three primary methods exist for coping with collisions Rehashing: use same key but different hash function Linear Probing: examine successive locations (index, index+1, index+2, ...) Chaining: implement table with separate list at each table[index] location Note: Except for the last case, the table is a fixed size.

Hash Table Using Linear Probing – Open Addressing

Linear Probing PseudoCode
// insert item into table of size n using hashFunc() to // calculate index. this assumes no duplicate keys, and some // method of indicating that a hash table location is empty int index = hashFunc(item) % n; int origIndex = index; do { if table[index] is empty insert item as table[index] and return else if table[index] matches item return index = (index+1) % n; // this is next location to probe } while (index != origIndex); throw overflowError; // if we get here, table is full & does // not contain item

Problems with Linear Probing
Clustering of items occurs as number of items approaches size of table Colliding items fill in gaps between other entries This forms runs or clusters within the table Items in the cluster are a mix of items that hash to different indices Degraded performance results Long sequences of repeated probes are required to find what is sought

Chaining – Uses Lists or Buckets
Implement the hash table as a vector of lists Each list (bucket, chain, ...) contains all items that hash to the associated table location Buckets are not mixed like clusters in linear probing Table size can grow easily by expanding individual buckets as necessary The number of buckets stays constant Within a bucket, items are unordered and must be searched linearly

Chaining with Separate Lists Example

C++ Function Objects Function object is an instance of a class that contains only a single function – operator() Function objects are easily passed as parameters to other functions Commonly used to implement hash functions and comparison operations template <typename T> class greaterThan { public: bool operator() (const T& x, const T& y) const { return x > y; } };

Using a function object
Here is a template function that swaps two parameters only IF the comparison is true template<typename T, typename Compare> void swap(T& a, T& b, Compare comp) { if (comp(a,b)) T temp = a; a = b; b = temp; } Here is a sample call swap(x, y, greaterThan);

Reasonable Hash Functions
Integer key: Identity function Good distribution if key or a portion of it is random class hfIntKey { public: bool operator() (int key) const { return key; } };

Reasonable Hash Functions
Integer key: Midsquare technique Extracts middle two bytes of 4 byte square of key Works well with random and non-random keys class hfMidSq { public: bool operator() (int key) const { unsigned int n = key; return ((n*n)/256) % 65536; // ^16-1 } };

Reasonable Hash Functions
String key: string-to-number Simple function uses ASCII codes for the string characters to build n-digit unsigned integers out of n-digit strings class hfString { public: bool operator() (string key) const { unsigned int prime = ; int n(0); for (int i=0; i < key.length(); i++) n = n*8 + key[i]; return (n > 0 ? (n % prime) : (-n % prime) ); } };

Reasonable Hash Functions
String key: folding Uses substrings as numbers and combines them by addition or multiplication or … Example: Sum of the 3 character substrings of a SSN Assuming no dashes in SSN … “ ”  = 1962 class hfSSN { public: bool operator() (string ssn) const { return ( atoi(ssn.substr(0,3).c_str()) + atoi(ssn.substr(3,3).c_str()) + atoi(ssn.substr(6,3).c_str()) ); } };

Hash Class – not in STL See headers in Ford & Topp include folder
d_hash.h – for the hash table using buckets d_hashf.h – for hash function object d_uset.h – for unordered set based on hash class d_hiter.h – for hash class iterator and const_iterator

Hash Class template <typename T, typename HashFunc> class hash {
public : hash (int nbuckets, const HashFunc& hfunc = HashFunc()); hash (T *first, T *last, int nbuckets, const HashFunc& hfunc = HashFunc()); bool empty() const; int size() const; iterator find(const T& item); pair<iterator,bool> insert(const T& item); int erase(const T& item); void erase(iterator pos); void erase(iterator first, iterator last); iterator begin(); const_iterator begin() const; iterator end(); const_iterator end() const; private: int numBuckets; // number of buckets vector<list<T> > bucket; // table is vector of lists HashFunc hf; // hash function int hashtableSize; // number of elements };

Hash::find(item) template <typename T, typename HashFunc>
hash<T,HashFunc>::iterator hash<T,HashFunc>::find(const T& item) { int hashIndex = int(hf(item) % numBuckets); list<T>& myBucket = bucket[hashIndex]; list<T>::iterator bucketIter; // traverse list and look for a match with item bucketIter = myBucket.begin(); while(bucketIter != myBucket.end()) if (*bucketIter == item) // return iterator to found item return iterator(this, hashIndex, bucketIter); bucketIter++; } // did not find item, so return iterator to table end return end();

Hash::insert(item) template <typename T, typename HashFunc>
pair<hash<T, HashFunc>::iterator,bool> hash<T, HashFunc>::insert(const T& item) { int hashIndex = int(hf(item) % numBuckets); list<T>& myBucket = bucket[hashIndex]; list<T>::iterator bucketIter; bool success; bucketIter = myBucket.begin(); while (bucketIter != myBucket.end()) if (*bucketIter == item) break; // found the item already in bucket else bucketIter++; if (bucketIter == myBucket.end()) { bucketIter = myBucket.insert(bucketIter, item); success = true; hashtableSize++; } success = false; // item already in table return pair<iterator,bool> (iterator(this,hashIndex,bucketIter), success);

Hash Iterator hIter Referencing Element 22 in Table ht

Determining Performance
The Load Factor (λ) measures the table density Where (m = size of table, n = items in table) Linear addressing (m = size of vector, maxitems) Chaining (m = number of buckets) Worst case (all items hash to same table location or bucket) Linear search is O(n) Making table size prime helps prevent nonuniform distribution causing this worst case

Average Case - Chaining
Finding bucket is O(1) – using hash function Uniform hashing implies each bucket has n/m items Assuming uniform hash distribution The ith item was inserted at the end of its bucket when the previous (i-1) items were spread evenly over the m buckets To find this item takes 1+(i-1)/m comparisons since there are (on average) (i-1)/m items ahead of it in its bucket Average performance of search for an arbitrary item is the average of the number of comparisons required to find each item in the list

Efficiency of Hash Methods
Hash table size = m, Number of elements in hash table = n, Load factor  = n/m Average Probes for Successful Search for Unsuccessful Search Open Probe Chaining

Final Variations Universal Hashing Perfect Hashing
Choose hf(n) randomly before execution from set of hash functions Prevents same clustering of collisions each time given set of data is used in hash table Efficiency is more likely to be to be Θ(1), even worst case Perfect Hashing Two tier approach (requires static set of keys) Uses two hash functions from universal hf(n) set Like chaining, but with secondary hash tables instead of chains Size of secondary hash tables is square of number of items hashing to that table using first hash function Second hash function is chosen so no collisions occur in each secondary table Efficiency is guaranteed to be Θ(1)

Summary Hash Table Simulates the fastest searching technique, knowing the index of the required value in a vector and array and apply the index to access the value, by applying a hash function that converts the data to an integer After obtaining an index by dividing the value from the hash function by the table size and taking the remainder, access the table. Normally, the number of elements in the table is much smaller than the number of distinct data values, so collisions occur. To handle collisions, we must place a value that collides with an existing table element into the table in such a way that we can efficiently access it later. Average running time for a search of a hash table is Θ(1) Worst case is Θ(n)

Summary Collision Resolution Linear open probe addressing
the table is a vector or array of static size After using the hash function to compute a table index, look up the entry in the table. If the values match, perform an update if necessary. If the table entry is empty, insert the value in the table.

Summary Collision Resolution (Cont…) Chaining with separate lists.
The hash table is a vector of list objects Each list is a sequence of colliding items. After applying the hash function to compute the table index, search the list for the data value. If it is found, update its value; otherwise, insert the value at the back of the list. You search only items that collided at the same table location There is no limitation on the number of values in the table, and deleting an item from the table involves only erasing it from its corresponding list

Download ppt "CSE Lecture 16 – Hashing & Tables"

Similar presentations