Hashing 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005.

Plan  Today  Seat assignments  Hash functions  Reading:  For today and next time: Sedgewick Chapter 14  Reminder: HW0 due on Thursday

Hash Tables An Alternative Representation for Dictionaries

Dictionary Interface An Abstract Data Type that maintains a dynamic set is a Dictionary. Crucial operations:  Insert  Find  Remove Standard operations: create, destroy, copy,…

Dictionary Interface  insert: may or may not allow multiple occurrences  find: membership query, often also retrieves associated information  remove: may use deferred actions for speed-up (amortized running time)

Small Universe  Suppose we have a small universe U = {0,1,2,…,M-1} of items.  We want to maintain a subset A of U.  Easy: use an array of bits (boolean) of size M.  Insert: A[k] = 1  Find: return A[k] != 0  Remove: A[k] = 0 Operations are constant time.
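The bit-array idea on this slide can be sketched in Java as follows. The class name SmallUniverseSet and its method names are hypothetical, chosen only for illustration; a boolean array stands in for the array of bits.

```java
// Sketch: a subset A of the small universe {0, ..., M-1}, one flag per possible item.
// SmallUniverseSet is a hypothetical name, not something defined in the slides.
final class SmallUniverseSet {
    private final boolean[] present; // present[k] is true iff k is in A

    SmallUniverseSet(int universeSize) {
        present = new boolean[universeSize];
    }

    void insert(int k) { present[k] = true; }   // A[k] = 1
    boolean find(int k) { return present[k]; }  // return A[k] != 0
    void remove(int k) { present[k] = false; }  // A[k] = 0

    public static void main(String[] args) {
        SmallUniverseSet a = new SmallUniverseSet(16);
        a.insert(5);
        System.out.println(a.find(5)); // true
        a.remove(5);
        System.out.println(a.find(5)); // false
    }
}
```

Each operation is a single array access, which is where the constant running time comes from.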

Direct Access Tables  In most applications we do not store simple items but pairs (key, object).  Use an array of pointers (references to objects).  Insert: A[key] = object  Find: return A[key]  Remove: A[key] = null Again operations are constant time.
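A minimal Java sketch of a direct access table, assuming integer keys in the range 0..M-1. The class name DirectAccessTable is hypothetical.

```java
// Sketch: a direct access table that stores one object reference per possible key.
// DirectAccessTable is a hypothetical name used only for this illustration.
final class DirectAccessTable<V> {
    private final Object[] slots; // slots[key] holds the object, or null if absent

    DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    void insert(int key, V value) { slots[key] = value; } // A[key] = object

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }            // return A[key]

    void remove(int key) { slots[key] = null; }           // A[key] = null

    public static void main(String[] args) {
        DirectAccessTable<String> table = new DirectAccessTable<>(10);
        table.insert(4, "four");
        System.out.println(table.find(4)); // four
        table.remove(4);
        System.out.println(table.find(4)); // null
    }
}
```

The table needs one slot for every key in the universe, whether or not that key is ever used.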

Large Universe  But what if the universe U of keys is large (and the subset is small)? e.g., names, symbol table of a compiler.  Even when the identifiers are at most 16 characters long there are some 10^28 possibilities.

Hashing – the Idea  Map keys into integers in the range 0..m-1, where m is the table size and m << M.  Pick a “good” mapping from keys to integers:  Easy to compute  Even distribution into the table  [Figure: the letters a to z mapped into table slots 0 to 10]

Hashing – Terminology  The array in which we store the objects is the hash table.  To enter an object into the table, we compute an index from the key.  The map from the key to the index is a hash function h: h(key) = index

Space-Time Tradeoff  A direct table has O(1) operations in the worst case. But space may be prohibitive.  Sequential search minimizes space, but at the cost of time.  Hashing balances space and time (on average) by changing the size of the hash table.

Problem - Collisions  Fundamental problem: Some keys map to the same location, a collision: h(x) = h(y).  Can we prevent collisions?

Pigeonhole Principle  There is no way to avoid collisions.  Since m << M there must be at least two keys that map to the same index.  The famous Pigeonhole Principle: If you put more than k items into k bins, then at least one bin contains more than one item.

Problem - Hash Function  Second problem: How do we find a suitable hash function?  Ideally, we want to distribute the keys uniformly over the hash table to minimize collisions.  That is, we want h to appear random, as though “hashing” the keys.

Hash Functions

Hashing-Efficiency  We also need to make sure h(k) is easy to compute.  Note that k could be a fairly complicated data structure. How do you turn an array of integers into a single integer? Or how about a tree?  Goal: All operations should be constant time.  But things can go badly wrong on rare occasions.
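One possible answer to the array question, sketched in Java: fold the elements together one at a time with a fixed multiplier. This is an illustration, not a method given in the slides; the class name ArrayHashSketch is hypothetical, and the multiplier 37 is borrowed from the string example later in the deck.

```java
// Sketch: turn an int[] key into a single table index by folding its elements.
// ArrayHashSketch and the choice of multiplier 37 are assumptions for illustration.
final class ArrayHashSketch {
    static int hash(int[] key, int m) {
        int h = 0;
        for (int x : key) {
            h = 37 * h + x; // int arithmetic may overflow and wrap around
        }
        return Math.floorMod(h, m); // always in 0..m-1, even if h wrapped to a negative value
    }

    public static void main(String[] args) {
        System.out.println(hash(new int[] {1, 2, 3}, 101)); // 32
        System.out.println(hash(new int[] {3, 2, 1}, 101)); // 41
    }
}
```

The cost is proportional to the length of the array, so the hash is constant time only when keys have bounded size.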

Division method  Assume wlog the keys are integers.  A simple hash function is h(k) = k mod m, where m is the table size.  The choice of m is crucial.  Good choice: m prime.
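A small Java sketch of the division method, with a helper that counts how many distinct slots a regular key pattern reaches. The class and method names are hypothetical; the keys used (multiples of 8) are an assumed example of regular input.

```java
// Sketch: the division method h(k) = k mod m, plus a check of how well it spreads
// regularly spaced keys. DivisionHash and distinctSlots are hypothetical names.
final class DivisionHash {
    static int hash(int k, int m) {
        return Math.floorMod(k, m); // non-negative result even for negative k
    }

    // Counts the distinct slots hit by the keys 0, step, 2*step, ..., (count-1)*step.
    static int distinctSlots(int step, int count, int m) {
        boolean[] used = new boolean[m];
        int distinct = 0;
        for (int i = 0; i < count; i++) {
            int slot = hash(i * step, m);
            if (!used[slot]) {
                used[slot] = true;
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println(distinctSlots(8, 100, 16)); // 2
        System.out.println(distinctSlots(8, 100, 17)); // 17
    }
}
```

With m = 16 the multiples of 8 only ever reach slots 0 and 8, while the prime m = 17 spreads the same keys over every slot.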

Division method  Primes are fairly dense, so this is no great restriction on the table size.  In fact, we can nearly double the table size at each step: 31, 61, 127, 251, 509, 1021, 2039, …  Store these values in a table; don’t try to compute them on the fly.
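The advice to keep the sizes in a table might look like this in Java. The sketch contains only the seven primes listed on the slide; the class name PrimeSizes and the method nextSize are hypothetical.

```java
// Sketch: look up the next table size in a precomputed list of primes.
// Only the values shown on the slide are included; the names are hypothetical.
final class PrimeSizes {
    private static final int[] PRIMES = {31, 61, 127, 251, 509, 1021, 2039};

    // Returns the smallest listed prime strictly greater than the current size.
    static int nextSize(int current) {
        for (int p : PRIMES) {
            if (p > current) {
                return p;
            }
        }
        throw new IllegalStateException("no larger size in the precomputed table");
    }

    public static void main(String[] args) {
        System.out.println(nextSize(31));   // 61
        System.out.println(nextSize(1021)); // 2039
    }
}
```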

Multiplication Method  Another hash function is h(k) = floor( m (k r mod 1) ), where 0 < r < 1 is cleverly chosen.  Advantage: the choice of m is not critical.  Ideally r should be irrational; then the values (i r mod 1), i = 1, 2, ..., M are very evenly distributed over [0,1].  Of course, there is a little problem here.
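A rough Java sketch of the multiplication method. The slide does not name a value for r; the sketch assumes r = (sqrt(5) - 1) / 2, approximately 0.618, a commonly suggested choice. The class name MultiplicationHash is hypothetical.

```java
// Sketch: the multiplication method h(k) = floor(m * (k*r mod 1)).
// The value of R is an assumption (the slide only says "cleverly chosen"),
// and a double can only approximate an irrational number.
final class MultiplicationHash {
    private static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;

    static int hash(int k, int m) {
        double x = k * R;
        double frac = x - Math.floor(x);  // k*r mod 1, the fractional part
        int index = (int) (m * frac);
        return Math.min(index, m - 1);    // guard against rounding up to m
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 3; k++) {
            System.out.println(hash(k, 10)); // 6, then 2, then 8
        }
    }
}
```

Here m = 10 is not prime, yet consecutive keys still land in well separated slots, which is the sense in which the choice of m is not critical.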

Random Input  Note that good hash functions are easy to come by if the input is random (as a bit pattern). Then we can take simply a few bits from the input (say, the first or last 16 bits).  However, such a method would fail miserably if the input shows some regularity. No good for general use.
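To illustrate the failure mode, the Java sketch below takes the last 16 bits of the key as the hash value. The regular input (multiples of 65536) is an assumed example, and the class name LowBitsHash is hypothetical.

```java
// Sketch: use the last 16 bits of the key as its hash value.
// Fine for random bit patterns, useless for keys that agree in those bits.
final class LowBitsHash {
    static int hash(int k) {
        return k & 0xFFFF; // keep only the low 16 bits
    }

    public static void main(String[] args) {
        System.out.println(hash(0x12345678)); // 22136 (0x5678)
        // Regular input: every multiple of 65536 has its low 16 bits equal to zero.
        for (int i = 1; i <= 3; i++) {
            System.out.println(hash(i << 16)); // 0 every time: all keys collide
        }
    }
}
```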

Integer keys?  The assumption that objects in U are integers has to be taken with a grain of salt.  Often we have to massage things a bit to extract numbers.  Of course, in the end everything is just one (possibly huge) number written in binary. This can be used in some languages like C to directly extract hash values from these bits.

Example: Strings
    public int hashCode(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 37 * h + key.charAt(i);  // 37 is a magic number
        h %= m;
        if (h < 0)  // overflow?
            h += m;
        return h;
    }
This is really an interpretation of the string as a number in base 37 (not ordinary radix notation, though).
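Ignoring integer overflow, the loop in hashCode evaluates a polynomial in 37 by Horner's rule. Writing c_i for the character code key.charAt(i) and n for the string length (symbols introduced here, not in the slides), the result is:

```latex
h = \Bigl( \sum_{i=0}^{n-1} c_i \cdot 37^{\,n-1-i} \Bigr) \bmod m ,
\qquad \text{e.g. before the final reduction, } \texttt{"ab"} \mapsto 97 \cdot 37 + 98 = 3687 .
```

The "digits" c_i can be larger than 36, which is presumably why the slide says this is not ordinary radix notation.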

Hash functions  Desired properties  Approximates a random distribution  Over the range of table index values  Efficient calculation  Approaches  Modular arithmetic  Many  Perfect hashing  When full set of input keys known in advance

Next time: Collisions
