
1 Maps and Hashing

2 Part I: Basic Concepts (Eric Roberts, CS 106B)

3 An Illustrative Mapping Application
Suppose that you want to write a program that displays the name of a state given its two-letter postal abbreviation. This program is an ideal application for the Map class because what you need is a map between two-letter codes and state names. Each two-letter code uniquely identifies a particular state and therefore serves as a key for a Map; the state names are the corresponding values. To implement this program in C++, you need to perform the following steps, which are illustrated on the following slide:
1. Create a Map<string,string> containing the key/value pairs.
2. Read in the two-letter abbreviation to translate.
3. Call get on the Map to find the state name.
4. Print out the name of the state.

4 The PostalLookup Program
/* Function prototype */
void initStateMap(Map<string,string> & map);

int main() {
   Map<string,string> stateMap;
   initStateMap(stateMap);
   while (true) {
      cout << "Enter two-letter state abbreviation: ";
      string code = getLine();
      if (code == "") break;
      if (stateMap.containsKey(code)) {
         cout << code << " = " << stateMap.get(code) << endl;
      } else {
         cout << code << " = ???" << endl;
      }
   }
   return 0;
}

/* Initializes the map of state abbreviations (abridged on the slide) */
void initStateMap(Map<string,string> & map) {
   map.put("AL", "Alabama");
   map.put("AK", "Alaska");
   map.put("AZ", "Arizona");
   . . .
   map.put("FL", "Florida");
   map.put("GA", "Georgia");
   map.put("HI", "Hawaii");
   . . .
   map.put("WI", "Wisconsin");
   map.put("WY", "Wyoming");
}

Sample run:
Enter two-letter state abbreviation: HI
HI = Hawaii
Enter two-letter state abbreviation: WI
WI = Wisconsin
Enter two-letter state abbreviation: VE
VE = ???
Enter two-letter state abbreviation:

5 A Simplified Version of the Map Class
map.size()            Returns the number of key/value pairs in the map.
map.isEmpty()         Returns true if the map is empty.
map.clear()           Removes all key/value pairs from the map.
map.put(key, value)   Makes an association between key and value, discarding any existing one.
map.get(key)          Returns the most recent value associated with key.
map.containsKey(key)  Returns true if there is a value associated with key.
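A short usage sketch of these operations (building on the Stanford Map class from the PostalLookup example; the specific keys are just sample data):

#include <iostream>
#include <string>
#include "map.h"                  /* the simplified Map interface described above */
using namespace std;

int main() {
   Map<string,string> stateMap;
   stateMap.put("HI", "Hawaii");             /* adds a new key/value pair       */
   stateMap.put("WI", "Wisconsin");
   stateMap.put("HI", "Hawaii");             /* put discards any existing value */
   cout << stateMap.size() << endl;          /* prints 2                        */
   if (stateMap.containsKey("HI")) {
      cout << stateMap.get("HI") << endl;    /* prints Hawaii                   */
   }
   stateMap.clear();                         /* removes all key/value pairs     */
   cout << stateMap.isEmpty() << endl;       /* prints 1 (true)                 */
   return 0;
}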

6 Implementation Strategies for Maps
There are several strategies you might choose to implement the map operations get and put. Those strategies include:
1. Linear search. Keep track of all the name/value pairs in an array. In this model, both the get and put operations run in O(N) time.

7 Exercise: Linear Search Map
As our programming exercise, we'll build the linear search version of the map. We start with the PostalLookup.cpp program and the map.h interface. What we need to write are the following files:
The mappriv.h file that describes the representation.
The mapimpl.cpp file that implements the public methods.
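One possible shape for the private section is sketched below. This is only an illustration of an array-based representation in the style of the hashmappriv.h file shown later in the handout; the names KeyValuePair, INITIAL_CAPACITY, capacity, and count are assumptions, not the official solution.

/*
 * Sketch only: a possible mappriv.h private section for the linear-search
 * version.  The identifiers below are illustrative assumptions.
 */

private:

/* Type definition for one key/value pair */
   struct KeyValuePair {
      std::string key;
      std::string value;
   };

/* Constant for the initial allocated size of the dynamic array */
   static const int INITIAL_CAPACITY = 10;

/* Instance variables */
   KeyValuePair *array;     /* Dynamic array of key/value pairs      */
   int capacity;            /* Allocated size of the array           */
   int count;               /* Number of pairs currently in the map  */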

8 Implementation Strategies for Maps
There are several strategies you might choose to implement the map operations get and put. Those strategies include:
1. Linear search. Keep track of all the name/value pairs in an array. In this model, both the get and put operations run in O(N) time.
2. Binary search. If you keep the array sorted by the two-character code, you can use binary search to find the key. Using this strategy improves the performance of get to O(log N).
3. Table lookup in a grid. In this specific example, you can store the state names in a 26 x 26 Grid<string> in which the first and second indices correspond to the two letters in the code. Because you can now find any code in a single step, this strategy is O(1), although this performance comes at a cost in memory space.
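To make strategy 3 concrete, here is a minimal sketch of the table-lookup idea, using a plain 26 x 26 array of strings in place of the Stanford Grid class; the helper names putState and getState are assumptions for illustration.

#include <iostream>
#include <string>
using namespace std;

/* Sketch only: table lookup keyed by the two letters of the code. */
string stateTable[26][26];            /* empty strings mark unused entries */

void putState(string code, string name) {
   stateTable[code[0] - 'A'][code[1] - 'A'] = name;
}

string getState(string code) {
   return stateTable[code[0] - 'A'][code[1] - 'A'];   /* "" if no such state */
}

int main() {
   putState("HI", "Hawaii");
   putState("WI", "Wisconsin");
   cout << getState("HI") << endl;    /* Hawaii, found in a single step: O(1) */
   return 0;
}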

9 The Idea of Hashing
The third strategy on the preceding slide shows that one can make the get and put operations run very quickly, even to the point that the cost of finding a key is independent of the number of keys in the table. This O(1) performance is possible only if you know where to look for a particular key. To get a sense of how you might achieve this goal in practice, it helps to think about how you find a word in a dictionary. Most dictionaries have thumb tabs that indicate where the words beginning with each letter appear: words starting with A are in the A section, and so on. The most common implementations of maps use a strategy called hashing, which is conceptually similar to the thumb tabs in a dictionary. The critical idea is that you can improve performance enormously if you use the key to figure out where to look.

10 Hash Codes
The rest of today's lecture focuses on the implementation of the HashMap class in the Stanford libraries, which uses the hashing strategy. The HashMap class requires the existence of a free function called hashCode that transforms a key into a nonnegative integer. The hash code tells the implementation where it should look for a particular key, thereby reducing the search time dramatically. For today, I'll focus on the case when the keys are strings. The important things to remember about hash codes are:
1. Every string has a hash code, even if you don't know what it is.
2. The hash code for any particular string is always the same.
3. If two strings are equal (i.e., they contain the same characters), they have the same hash code.

11 The hashCode Function for Strings
const int HASH_SEED = 5381;               /* Starting point for first cycle */
const int HASH_MULTIPLIER = 33;           /* Multiplier for each cycle      */
const int HASH_MASK = unsigned(-1) >> 1;  /* All bits except the sign       */

/*
 * Function: hashCode
 * Usage: int code = hashCode(key);
 *
 * This function takes a string key and uses it to derive a hash code,
 * which is a nonnegative integer related to the key by a deterministic
 * function that distributes keys well across the space of integers.
 * The general method is called linear congruence, which is also used
 * in random-number generators.  The specific algorithm used here is
 * called djb2 after the initials of its inventor, Daniel J. Bernstein,
 * Professor of Mathematics at the University of Illinois at Chicago.
 */

int hashCode(string str) {
   unsigned hash = HASH_SEED;
   int nchars = str.length();
   for (int i = 0; i < nchars; i++) {
      hash = HASH_MULTIPLIER * hash + str[i];
   }
   return (hash & HASH_MASK);
}

12 The Bucket Hashing Strategy
One common strategy for implementing a map is to use the hash code for each key to select an index into an array that will contain all the keys with that hash code. Each element of that array is conventionally called a bucket. In practice, the array of buckets is smaller than the number of hash codes, making it necessary to convert the hash code into a bucket index, typically by executing a statement like

   int index = hashCode(key) % nBuckets;

The value in each element of the bucket array cannot be a single key/value pair, given the chance that different keys fall into the same bucket. Such situations are called collisions. To take account of the possibility of collisions, each element of the bucket array is a linked list of the keys that fall into that bucket, as illustrated on the next slide.
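As a small illustration of this index computation (reusing the hashCode function from the previous slide and a deliberately tiny bucket count), the sketch below prints the bucket index for a few keys; any two keys that print the same index would collide and share a chain.

#include <iostream>
#include <string>
using namespace std;

const int HASH_SEED = 5381;
const int HASH_MULTIPLIER = 33;
const int HASH_MASK = unsigned(-1) >> 1;

int hashCode(string str) {                       /* same djb2 function as above */
   unsigned hash = HASH_SEED;
   for (int i = 0; i < (int) str.length(); i++) {
      hash = HASH_MULTIPLIER * hash + str[i];
   }
   return (hash & HASH_MASK);
}

int main() {
   const int nBuckets = 7;                       /* small on purpose, as in the simulation */
   string keys[] = { "AL", "AK", "AZ", "HI" };
   for (string key : keys) {
      int index = hashCode(key) % nBuckets;      /* convert the hash code to a bucket index */
      cout << key << " -> bucket " << index << endl;
   }
   return 0;
}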

13 Simulating Bucket Hashing
The slide steps through building the map with seven buckets (numbered 0 through 6), each holding a linked chain of key/value cells:

stateMap.put("AL", "Alabama"):  hashCode("AL") -> 2091, and 2091 % 7 -> 5, so the key "AL" goes in bucket 5.
stateMap.put("AK", "Alaska"):   hashCode("AK") -> 2090, and 2090 % 7 -> 4, so the key "AK" goes in bucket 4.
stateMap.put("AZ", "Arizona"):  hashCode("AZ") -> 2105, and 2105 % 7 -> 5. Because bucket 5 already contains "AL", the cell for "AZ" must be added to that bucket's chain.

The rest of the keys are added similarly, so every bucket ends up with a chain of the states whose codes hash to it.

Suppose you then call stateMap.get("HI"): hashCode("HI") -> 2305, and 2305 % 7 -> 2, so the key "HI" must be in bucket 2 and can be located by searching that chain.

14 Achieving O(1) Performance
The simulation on the previous slide uses only seven buckets to emphasize what happens when collisions occur: the smaller the number of buckets, the more likely collisions become. In practice, the implementation of HashMap uses a much larger value for nBuckets to minimize the opportunity for collisions. If the number of buckets is considerably larger than the number of keys, most of the bucket chains will either be empty or contain exactly one key/value pair. The ratio of the number of keys to the number of buckets is called the load factor of the map. Because a map achieves O(1) performance only if the load factor is small, the library implementation of HashMap increases the number of buckets when the table becomes too full. This process is called rehashing.
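The handout leaves rehashing as an exercise (see the implementation notes on slide 22), but a minimal sketch of one possible private helper appears below. It assumes the Cell, buckets, and nBuckets representation from the hashmappriv.h file later in the handout; the method name expandAndRehash is an assumption, not part of the library interface.

/*
 * Sketch only: one possible rehashing step.  Doubling the bucket count
 * (plus one, to keep it odd) keeps the load factor small; every existing
 * cell is relinked into the chain for its new bucket index.
 */
template <typename KeyType, typename ValueType>
void HashMap<KeyType,ValueType>::expandAndRehash() {
   int oldNBuckets = nBuckets;
   Cell **oldBuckets = buckets;
   nBuckets = 2 * oldNBuckets + 1;
   buckets = new Cell*[nBuckets];
   for (int i = 0; i < nBuckets; i++) {
      buckets[i] = NULL;
   }
   for (int i = 0; i < oldNBuckets; i++) {
      Cell *cp = oldBuckets[i];
      while (cp != NULL) {
         Cell *next = cp->link;
         int bucket = hashCode(cp->key) % nBuckets;   /* index under the new size        */
         cp->link = buckets[bucket];                  /* relink at the head of its chain */
         buckets[bucket] = cp;
         cp = next;
      }
   }
   delete[] oldBuckets;
}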

15 The hashmap.h Interface
/*
 * File: hashmap.h
 *
 * This interface exports the HashMap class, which maintains a collection
 * of key/value pairs using a hashtable as the underlying data structure.
 */

#ifndef _hashmap_h
#define _hashmap_h

#include <string>

/*
 * Class: HashMap<KeyType,ValueType>
 *
 * The HashMap class maintains an association between keys and values.
 * The types used for keys and values are specified as template parameters,
 * which makes it possible to use this structure with any data type.
 */

template <typename KeyType, typename ValueType>
class HashMap {

public:

16 The hashmap.h Interface
/*
 * Constructor: HashMap
 * Usage: HashMap<KeyType,ValueType> map;
 *
 * Initializes a new empty map that associates keys and values of the
 * specified types.  The type used for the key must define the == operator,
 * and there must be a free function with the following signature:
 *
 *     int hashCode(KeyType key);
 *
 * that returns a positive integer determined by the key.  This interface
 * exports hashCode functions for string and the C++ primitive types.
 */

   HashMap();

/*
 * Destructor: ~HashMap
 * Usage: (usually implicit)
 *
 * Frees any heap storage associated with this map.
 */

   ~HashMap();

17 The hashmap.h Interface
/*
 * Method: size
 * Usage: int nEntries = map.size();
 *
 * Returns the number of entries in this map.
 */

   int size();

/*
 * Method: isEmpty
 * Usage: if (map.isEmpty()) . . .
 *
 * Returns true if this map contains no entries.
 */

   bool isEmpty();

/*
 * Method: clear
 * Usage: map.clear();
 *
 * Removes all entries from this map.
 */

   void clear();

18 The hashmap.h Interface
/*
 * Method: put
 * Usage: map.put(key, value);
 *
 * Associates key with value in this map.  Any previous value associated
 * with key is replaced by the new value.
 */

   void put(KeyType key, ValueType value);

/*
 * Method: get
 * Usage: ValueType value = map.get(key);
 *
 * Returns the value associated with key in this map.  If key is not
 * found, the get method signals an error.
 */

   ValueType get(KeyType key);

/*
 * Method: containsKey
 * Usage: if (map.containsKey(key)) . . .
 *
 * Returns true if there is an entry for key in this map.
 */

   bool containsKey(KeyType key);

19 The hashmap.h Interface
#include "hashmappriv.h" }; #include "hashmapimpl.cpp" /* * Function: hashCode * Usage: int hash = hashCode(key); * * Returns a hash code for the specified key, which is always a * nonnegative integer. This function is overloaded to support * all of the primitive types and the C++ <code>string</code> type. */ int hashCode(std::string key); int hashCode(int key); int hashCode(char key); int hashCode(long key); int hashCode(double key); #endif /* * Method: put * Usage: map.put(key, value); * * Associates key with value in this map. Any previous value associated * with key is replaced by the new value. */ void put(KeyType key, ValueType value); * Method: get * Usage: ValueType value = map.get(key); * * Returns the value associated with key in this map. If key is not * found, the get method signals an error. ValueType get(KeyType key); * Method: containsKey * Usage: if (map.containsKey(key)) . . . * Returns true if there is an entry for key in this map. bool containsKey(KeyType key);

20 The hashmappriv.h File

/*
 * File: hashmappriv.h
 * -------------------
 * This file contains the private section of the hashmap.h interface.
 */

private:

/* Type definition for cells in the bucket chain */

   struct Cell {
      KeyType key;
      ValueType value;
      Cell *link;
   };

/* Instance variables */

   Cell **buckets;       /* Dynamic array of pointers to cells */
   int nBuckets;         /* The number of buckets in the array */
   int count;            /* The number of entries in the map   */

21 The hashmappriv.h File

/*
 * Private method: findCell
 * Usage: Cell *cp = findCell(bucket, key);
 *
 * Finds a cell in the chain for the specified bucket that matches key.
 * If a match is found, the return value is a pointer to the cell containing
 * the matching key.  If no match is found, the function returns NULL.
 */

   Cell *findCell(int bucket, KeyType key) {
      Cell *cp = buckets[bucket];
      while (cp != NULL && key != cp->key) {
         cp = cp->link;
      }
      return cp;
   }

22 The hashmapimpl.cpp Implementation
/*
 * File: hashmapimpl.cpp
 *
 * This file contains the implementation of the hashmap.h interface.
 * Because of the way C++ compiles templates, this code must be
 * available to the compiler when it reads the header file.
 */

#ifdef _hashmap_h

#include "error.h"

/*
 * Implementation notes: HashMap class
 *
 * In this map implementation, the entries are stored in a hashtable.
 * The hashtable keeps an array of buckets, where each bucket is a
 * linked list of elements that share the same hash code.  If two or
 * more keys have the same hash code (which is called a "collision"),
 * each of those keys will be on the same list.  Ideally, however, the
 * number of such collisions will be small, so that all of the operations
 * can run in constant time.  To achieve that goal, it is necessary to
 * expand the number of buckets when the lists start to fill up.  That
 * operation (called "rehashing") is not implemented here and is instead
 * left as an exercise.
 */

23 The hashmapimpl.cpp Implementation
/* Constant definitions */

const int INITIAL_BUCKET_COUNT = 101;

/*
 * Implementation notes: constructor and destructor
 *
 * The constructor allocates the array of buckets and initializes
 * each bucket to the empty list.  The destructor must free the memory,
 * but the clear function already takes care of that.
 */

template <typename KeyType,typename ValueType>
HashMap<KeyType,ValueType>::HashMap() {
   nBuckets = INITIAL_BUCKET_COUNT;
   buckets = new Cell*[nBuckets];
   for (int i = 0; i < nBuckets; i++) {
      buckets[i] = NULL;
   }
   count = 0;
}

template <typename KeyType,typename ValueType>
HashMap<KeyType,ValueType>::~HashMap() {
   clear();
}

24 The hashmapimpl.cpp Implementation
template <typename KeyType,typename ValueType>
int HashMap<KeyType,ValueType>::size() {
   return count;
}

template <typename KeyType,typename ValueType>
bool HashMap<KeyType,ValueType>::isEmpty() {
   return count == 0;
}

template <typename KeyType,typename ValueType>
void HashMap<KeyType,ValueType>::clear() {
   for (int i = 0; i < nBuckets; i++) {
      Cell *cp = buckets[i];
      while (cp != NULL) {
         Cell *oldCell = cp;
         cp = cp->link;
         delete oldCell;
      }
      buckets[i] = NULL;
   }
   count = 0;
}

25 The hashmapimpl.cpp Implementation
template <typename KeyType,typename ValueType>
void HashMap<KeyType,ValueType>::put(KeyType key, ValueType value) {
   int bucket = hashCode(key) % nBuckets;
   Cell *cp = findCell(bucket, key);
   if (cp == NULL) {
      cp = new Cell;
      cp->key = key;
      cp->link = buckets[bucket];
      buckets[bucket] = cp;
      count++;
   }
   cp->value = value;
}

template <typename KeyType,typename ValueType>
ValueType HashMap<KeyType,ValueType>::get(KeyType key) {
   int bucket = hashCode(key) % nBuckets;
   Cell *cp = findCell(bucket, key);
   if (cp == NULL) error("get: No value for key");
   return cp->value;
}

template <typename KeyType,typename ValueType>
bool HashMap<KeyType,ValueType>::containsKey(KeyType key) {
   int bucket = hashCode(key) % nBuckets;
   return findCell(bucket, key) != NULL;
}

#endif
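The interface above has no removal operation, but deleting a key from a chained table is a natural extension. The sketch below is a hypothetical remove method (not part of the hashmap.h interface shown here) that unlinks the matching cell from its bucket chain.

/*
 * Sketch only: a hypothetical remove(key) for the chained implementation.
 * It walks the chain for the key's bucket, keeping a pointer to the
 * previous cell so the matching cell can be unlinked and freed.
 */
template <typename KeyType,typename ValueType>
void HashMap<KeyType,ValueType>::remove(KeyType key) {
   int bucket = hashCode(key) % nBuckets;
   Cell *cp = buckets[bucket];
   Cell *prev = NULL;
   while (cp != NULL && !(cp->key == key)) {
      prev = cp;
      cp = cp->link;
   }
   if (cp == NULL) return;                   /* key not present: nothing to do       */
   if (prev == NULL) {
      buckets[bucket] = cp->link;            /* matching cell was first in the chain */
   } else {
      prev->link = cp->link;                 /* splice the cell out of the chain     */
   }
   delete cp;
   count--;
}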

26 Java Review
public boolean equals(Object obj) {
   return (this == obj);
}

27 Part II: Advanced Concepts
Jingyu Zhou

28 Map
A very useful abstraction: any kind of dictionary, lookup table, index, database, etc.
Stores key-value pairs
Fast access via key
Operations to optimize: put, get

29 Basic Idea
Use a hash function to map keys into positions in a hash table.
Ideally: if element e has key k and h is the hash function, then e is stored in position h(k) of the table.
To search for e, compute h(k) to locate its position. If no element is there, the dictionary does not contain e.

30 Example Dictionary: Student Records
Keys are ID numbers ( ), no more than 100 students.
Hash function: h(k) = k maps each ID into a distinct table position.
(Figure: the hash table is an array table[1001] with buckets numbered 0 through 1000.)

31 Analysis (Ideal Case)
O(b) time to initialize the hash table (b = number of positions, or buckets, in the hash table)
O(1) time to perform insert, remove, and search

32 Ideal Case is Unrealistic
Works for implementing dictionaries, but many applications have key ranges that are too large to have a 1-1 mapping between buckets and keys!
Example: Suppose the key can take on values from 0 to 65,535 (2-byte unsigned int), and we expect about 1,000 records at any given time. It is impractical to use a hash table with 65,536 slots!

33 Hash Functions
If the key range is too large, use a hash table with fewer buckets and a hash function that maps multiple keys to the same bucket:
h(k1) = β = h(k2): k1 and k2 have a collision at slot β
Popular hash functions: hashing by division, h(k) = k % D, where D is the number of buckets in the hash table.
Example: hash table with 11 buckets, h(k) = k % 11:
80 -> 3 (80 % 11 = 3), 40 -> 7, 65 -> 10
58 -> 3: collision!

34 Collision Resolution Policies
Two classes:
Open hashing, a.k.a. separate chaining
Closed hashing, a.k.a. open addressing
The difference has to do with whether colliding records are stored outside the table (open hashing) or whether a collision results in storing one of the records at another slot in the table (closed hashing).

35 Closed Hashing
Associated with closed hashing is a rehash strategy:
"If we try to place x in bucket h(x) and find it occupied, find an alternative location h1(x), h2(x), etc. Try each in order; if none is empty, the table is full."
h(x) is called the home bucket.
The simplest rehash strategy is called linear hashing: hi(x) = (h(x) + i) % D
In general, the collision resolution strategy is to generate a sequence of hash table slots (the probe sequence) that can hold the record; test each slot until an empty one is found (probing).
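A minimal C++ sketch of this insertion strategy (the names ClosedHashTable and EMPTY are assumptions for illustration, and keys are plain ints):

#include <vector>

const int EMPTY = -1;                          // assumed sentinel for an unused slot

struct ClosedHashTable {
   std::vector<int> table;
   int D;                                      // number of buckets

   ClosedHashTable(int nSlots) : table(nSlots, EMPTY), D(nSlots) {}

   int h(int k) const { return k % D; }        // home bucket

   bool insert(int k) {
      for (int i = 0; i < D; i++) {
         int slot = (h(k) + i) % D;            // linear hashing: hi(x) = (h(x) + i) % D
         if (table[slot] == EMPTY) {
            table[slot] = k;
            return true;
         }
      }
      return false;                            // no empty slot found: the table is full
   }
};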

36 Example Linear (Closed) Hashing
D = 8; keys a, b, c, d have hash values h(a) = 3, h(b) = 0, h(c) = 4, h(d) = 3.
After inserting a, b, and c, bucket 0 holds b, bucket 3 holds a, and bucket 4 holds c.
Where do we insert d? Bucket 3 is already filled, so we follow the probe sequence for linear hashing:
h1(d) = (h(d) + 1) % 8 = 4, which is filled by c
h2(d) = (h(d) + 2) % 8 = 5, which is empty, so d goes in bucket 5
h3(d) = (h(d) + 3) % 8 = 6, and so on through 7, 0, 1, 2: the probe sequence wraps around to the beginning of the table.

37 Operations Using Linear Hashing
Test for membership: findItem
Examine h(k), h1(k), h2(k), ..., until we find k, reach an empty bucket, or return to the home bucket.
If no deletions are possible, this strategy works!
What about deletions? If we reach an empty bucket, we cannot be sure that k is not somewhere later in the probe sequence, because that bucket might have been occupied when k was inserted and emptied by a later deletion.
We need a special placeholder, deleted, to distinguish a bucket that was never used from one that once held a value.
We may need to reorganize the table after many deletions.
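A sketch of the membership test and deletion with the deleted placeholder might look like the following (again with illustrative names EMPTY and DELETED for integer keys; this is an assumption, not a prescribed implementation):

#include <vector>

const int EMPTY   = -1;                        // slot never used
const int DELETED = -2;                        // slot once held a value (tombstone)

struct ProbingTable {
   std::vector<int> table;
   int D;

   ProbingTable(int nSlots) : table(nSlots, EMPTY), D(nSlots) {}

   int h(int k) const { return k % D; }

   // Returns the slot holding k, or -1 if k is not in the table.
   int findItem(int k) const {
      for (int i = 0; i < D; i++) {
         int slot = (h(k) + i) % D;
         if (table[slot] == EMPTY) return -1;  // a never-used slot ends the search
         if (table[slot] == k) return slot;    // found it (DELETED slots are skipped)
      }
      return -1;                               // wrapped all the way around
   }

   void remove(int k) {
      int slot = findItem(k);
      if (slot != -1) table[slot] = DELETED;   // tombstone, so later searches keep probing
   }
};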

38 Performance Analysis - Worst Case
Initialization: O(b), where b is the number of buckets.
Insert and search: O(n), where n is the number of elements in the table; in the worst case all n keys have the same home bucket.
No better than a linear list for maintaining a dictionary!

39 Performance Analysis - Avg Case
Distinguish between successful and unsuccessful searches.
A delete is a successful search for the record to be deleted.
An insert is an unsuccessful search along its probe sequence.
The expected cost of hashing is a function of how full the table is: the load factor α = n/m.

40 Hash Analysis: Open Addressing Search
Assumption: uniform hashing; the table length is m, it holds n elements, and the load factor is α = n/m.
Unsuccessful search (an insert has the same cost):
Let q_i (1 <= i <= n+1) be the probability that the search makes exactly i probes, i.e., that the first i-1 probes hit occupied slots and the i-th does not:
   q_i = (n/m) x ((n-1)/(m-1)) x ... x ((n-i+2)/(m-i+2)) x (1 - (n-(i-1))/(m-(i-1)))
The expected number of probes is therefore
   U_n = sum over i of (i x q_i) = 1/(1 - n/(m+1)) ≈ 1/(1 - α) = 1 + α + α^2 + α^3 + ...
If α = 0.5, then U_n ≈ 2; if α = 0.9, then U_n ≈ 10.

41 Hash Analysis: Open Addressing Search
Assumption: uniform hashing; the table length is m.
Successful search: A search for a key k follows the same probe sequence as was followed when the element with key k was inserted. If k was the (i+1)st key inserted into the hash table, the expected number of probes made in a search for k is at most 1/(1 - i/m) = m/(m - i). Averaging over all n keys in the hash table gives the average number of probes in a successful search:
   S_n = (1/n) x sum over i from 0 to n-1 of m/(m - i) ≈ (1/α) x ln(1/(1 - α))
If α = 0.5, then S_n ≈ 1.387; if α = 0.9, then S_n ≈ 2.559.

42 Improved Collision Resolution
Linear probing: hi(x) = (h(x) + i) % D
All buckets in the table will be candidates for inserting a new record before the probe sequence returns to the home position.
Clustering of records leads to long probe sequences.
Linear probing with skipping: hi(x) = (h(x) + i*c) % D, where c is a constant other than 1.
Records with adjacent home buckets will not follow the same probe sequence.
(Pseudo)random probing: hi(x) = (h(x) + r_i) % D, where r_i is the ith value in a random permutation of the numbers from 1 to D-1.
Insertions and searches use the same sequence of "random" numbers.
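The three probe functions can be written down directly; the sketch below is just an illustration (the caveat about c being relatively prime to D is an added note, not from the slide).

#include <vector>

// Linear probing: hi(x) = (h(x) + i) % D
int linearProbe(int home, int i, int D) {
   return (home + i) % D;
}

// Linear probing with skipping: hi(x) = (h(x) + i*c) % D
// (c should be relatively prime to D so the sequence visits every bucket)
int skipProbe(int home, int i, int c, int D) {
   return (home + i * c) % D;
}

// (Pseudo)random probing: hi(x) = (h(x) + r[i]) % D,
// where r is a fixed permutation of 1 .. D-1 shared by inserts and searches
int randomProbe(int home, int i, const std::vector<int> &r, int D) {
   return (home + r[i]) % D;
}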

43 Example: Clustering with Linear Probing, h(k) = k % 11
Before the insertion, the table holds 1001 (slot 0), 9537 (slot 1), 3016 (slot 2), 9874 (slot 7), 2009 (slot 8), and 9875 (slot 9).
Insert 1052 (home bucket 7): slots 7, 8, and 9 are occupied, so 1052 ends up in slot 10.
1. What if the next element has home bucket 0? It goes to bucket 3. The same happens for elements with home bucket 1 or 2; only a record with home position 3 stays in 3. So the probability that the next record goes to bucket 3 is p = 4/11.
2. Similarly, records hashing to 7, 8, or 9 end up in 10.
3. Only records hashing to 4 end up in bucket 4 (p = 1/11); the same holds for buckets 5 and 6.
After 1052 is placed in slot 10, the run of occupied slots wraps around, and the next element lands in bucket 3 with probability p = 8/11.

44 Hash Functions - Numerical Values
Consider h(x) = x % 16: poor distribution, not very random, because it depends solely on the least significant four bits of the key.
Better: the mid-square method. If keys are integers in the range 0, 1, ..., K, pick an integer C such that D*C^2 is about equal to K^2; then
   h(x) = floor(x^2 / C) % D
extracts a middle base-D digit of x^2 (the middle r bits, where 2^r = D). This is better because most or all of the bits of the key contribute to the result.
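A minimal sketch of the mid-square idea (using 64-bit arithmetic to avoid overflow for modest keys; the function name is illustrative):

#include <cstdint>

// Mid-square hashing: square the key and extract a middle base-D digit
// by dividing by C and reducing mod D, so most bits of x affect the result.
int midSquareHash(std::int64_t x, std::int64_t C, int D) {
   return (int) ((x * x / C) % D);
}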

45 Hash Function – Strings of Characters
Folding method:

int h(String x, int D) {
   int i, sum;
   for (sum = 0, i = 0; i < x.length(); i++) {
      sum += (int) x.charAt(i);
   }
   return (sum % D);
}

Sums the ASCII values of the letters in the string.
The ASCII value for 'A' is 65, so for 10 upper-case letters the sum will be in the range 650 to 900; this works well when D is around 100, for example.
The order of the characters in the string has no effect.

46 Hash Function – Strings of Characters
Much better: cyclic shift

static long hashCode(String key, int D) {
   int h = 0;
   for (int i = 0; i < key.length(); i++) {
      h = (h << 4) | (h >>> 27);       // cyclic shift of the bits of h (unsigned right shift)
      h += (int) key.charAt(i);
   }
   return (h & 0x7fffffff) % D;        // mask the sign bit so the bucket index is nonnegative
}

47 Open Hashing
Each bucket in the hash table is the head of a linked list.
All elements that hash to a particular bucket are placed on that bucket's linked list.
Records within a bucket can be ordered in several ways: by order of insertion, by key value, or by frequency of access.

48 Open Hashing Data Organization
(Figure: an array of buckets numbered 0 through D-1, each pointing to a linked list of the records that hash to it.)

49 Analysis
Open hashing is most appropriate when the hash table is kept in main memory, implemented with a standard in-memory linked list.
We hope that the buckets are roughly equal in size, so that the lists will be short.
If there are n elements in the set, then each bucket will have roughly n/D elements.
If we can estimate n and choose D to be roughly as large, then the average bucket will have only one or two members.

50 Analysis Cont’d Average time per dictionary operation:
With D buckets and n elements in the dictionary, there are on average n/D elements per bucket, so the insert, search, and remove operations each take O(1 + n/D) time.
If we can choose D to be about n, this is constant time.
Assuming each element is equally likely to be hashed to any bucket, the running time is constant, independent of n.

51 Comparison with Closed Hashing
Worst case performance is O(n) for both.
(Figure: a chart comparing the number of operations for open and closed hashing with D = 9 and h(x) = x % D.)

52 Hashing Problems
1. Draw the 11-entry hash table that results from hashing the keys 12, 44, 13, 88, 23, 94, 11, 39, 20 using the hash function h(i) = (2i + 5) mod 11 with closed hashing and linear probing.
2. Give pseudo-code for listing all identifiers in a hash table in lexicographic order, using open hashing and the hash function h(x) = first character of x. What is the running time?
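If you want to check your answer to problem 1, the short sketch below simulates the insertions (it reuses the linear-probing idea from the earlier sketches; the names are illustrative).

#include <iostream>
#include <vector>
using namespace std;

int main() {
   const int D = 11;
   const int EMPTY = -1;
   vector<int> table(D, EMPTY);
   int keys[] = { 12, 44, 13, 88, 23, 94, 11, 39, 20 };
   for (int k : keys) {
      int home = (2 * k + 5) % D;              // h(i) = (2i + 5) mod 11
      for (int i = 0; i < D; i++) {
         int slot = (home + i) % D;            // linear probing
         if (table[slot] == EMPTY) {
            table[slot] = k;
            break;
         }
      }
   }
   for (int slot = 0; slot < D; slot++) {      // print the final table
      cout << slot << ": ";
      if (table[slot] == EMPTY) cout << "-";
      else cout << table[slot];
      cout << endl;
   }
   return 0;
}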

53 Review: Expected Number of Probes in Searches
Let α = n/m (the load factor).
Chaining: an unsuccessful search takes about 1 + α probes; a successful search also takes about 1 + α probes (one comparison plus the average number of elements ahead of the target in its chain).
Open addressing (assuming uniform hashing): an unsuccessful search takes about 1/(1 - α) probes; a successful search takes about (1/α) ln(1/(1 - α)) probes (see slide 41).

54 Hash Functions
A hash function is denoted by h: {0,1}* -> {0,1}^n,
where n is a security parameter, say 128, 160, 256 or 512.
In English: a function that can be applied to data of any size and produces a fixed-length output, usually short.
Three types of security requirements:
One-way: given an output z, it is difficult to find x such that z = h(x).
Weak collision-resistance: given x, it is difficult to find y ≠ x such that h(y) = h(x).
Strong collision-resistance: it is difficult to find any pair (x, y) such that h(x) = h(y).
Note: strong collision-resistance implies weak collision-resistance, which in turn implies one-wayness.
If m is some message, then h(m) is called the message digest of m.

55 Birthday Attack Birthday Paradox:
If there are 23 people in a room, the probability that at least two people have the same birthday is slightly more than 50%. If there are 30, the probability is around 70%.
This process is analogous to throwing k balls randomly into n bins and checking whether some bin contains at least two balls.
To have more than a 50% chance of finding at least two balls in one bin, k ≈ 1.17 * n^(1/2).
E.g. n = 365 gives k ≈ 23.
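As a quick sanity check of the bound (a throwaway sketch, not from the slides):

#include <cmath>
#include <cstdio>

int main() {
   double n = 365.0;
   // k ~ 1.17 * sqrt(n) people needed for a >50% chance of a shared birthday
   std::printf("k ~ %.1f\n", 1.17 * std::sqrt(n));   // about 22.4, i.e. roughly 23
   return 0;
}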

56 Birthday Attack Against a Hash Function
Finding collisions of a hash function using the Birthday Paradox:
Randomly choose k messages x1, x2, ..., xk.
Search for a pair of messages, say xi and xj, such that h(xi) = h(xj). If such a pair exists, a collision has been found.
This birthday attack imposes a lower bound on the size of message digests. E.g., a 40-bit message digest would be very insecure, since a collision could be found with probability at least 1/2 after slightly over 2^20 (about a million) random hashes.

57 Size of a Message Digest / Hash Value
h : {0,1}* -> {0,1}^n
If n = 64, the probability of finding a collision is higher than 1/2 after slightly more than 2^32 random hashes have been tried.
On a machine that can carry out 100,000 hashes per second, it takes about 12 hours to find the first collision with probability higher than 1/2.
Recommended message digest lengths (in bits): 128 (MD5), 160 (SHA-1), 256 (SHA-256) or 512 (SHA-512).
For those recommended lengths, because the number of possible hashes is so large, the odds of finding a collision by chance are negligibly small (one in 2^80 for SHA-1).

58 Examples: MD5 and SHA-1 MD5
MD – Message Digest, designed by Ron Rivest in 1992.
Output length: 128 bits.
A birthday attack can be launched using 2^64 trials.
SHA-1
Developed by NIST in 1995, based on MD4, a precursor to MD5.
Output length: 160 bits.
More difficult to launch a birthday attack: it needs 2^80 trials.
SHA-2 (SHA-256/384/512)
Based on SHA-1, with a longer hash value.

59 Security Updates of Hash Functions
MD5
In August 2004, Wang et al. showed that it is "easy" to find collisions in MD5. They found many collisions in a very short time (minutes).
SHA-1
In February 2005, Wang et al. showed that collisions can be found in SHA-1 with an estimated effort of 2^69 hash computations, less than the 2^80 hash computations required by a birthday attack.
Impacts
This hurts digital signatures.
It does not affect HMAC, where collisions aren't important.
For applications that require the underlying hash function to be collision resistant, it's time to migrate away from SHA-1 and start using the newer standards SHA-256 and SHA-512.

60 Some Details about Finding Collisions in SHA-1
Q: How hard would it be to find collisions in SHA-1?
A: The reported attacks require an estimated work factor of 2^69 (approximately 590 billion billion) hash computations. While this is well beyond what is currently feasible using a normal computer, it is potentially feasible for attackers who have specialized hardware. For example, with 10,000 custom ASICs that can each perform 2 billion hash operations per second, the attack would take about one year. Computing improvements predicted by Moore's Law will make the attack more practical over time, e.g. making it possible for a widespread Internet virus to use compromised computers to mount such attacks as well. Once a collision has been found, additional collisions can be found trivially by concatenating data to the matching messages.
Borrowed from v0.1

61 SHA-1 Broken
Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu (mostly from Shandong University in China) found:
collisions in the full SHA-1 in 2^69 hash operations, much less than the brute-force attack of 2^80 operations based on the hash length;
collisions in SHA-0 in 2^39 operations;
collisions in 58-round SHA-1 in 2^33 operations.

62 The Art of Computer Programming, Volume III
(Figure: insert and search results from Volume III.)

63 Reference PAC Chapter 15, DS Chapter 7.4

64 Next: Expression Trees (PAC Chapter 17)

65 The End

