Cpt S 223 – Advanced Data Structures Hashing


1 Cpt S 223 – Advanced Data Structures Hashing
Nirmalya Roy School of Electrical Engineering and Computer Science Washington State University

2 Overview
Hashing: a technique supporting insertion, deletion, and search in average-case constant time. Operations that require the elements to be sorted (e.g., find minimum) are not efficiently supported.
Hash table ADT
Implementations
Analysis
Applications

3 Hash Table One approach
A hash table is an array of fixed size (called TableSize).
Array elements are indexed by a key, which is mapped to an array index in 0 … TableSize − 1.
The mapping (or hash function) h goes from key to index, e.g., h(“john”) = 3.
(The figure shows each table slot holding an element value together with its key.)

4 Factors Affecting Hash Table Design
Hash function
Table size (usually fixed at the start)
Collision handling schemes

5 Hash Table (cont’d) Insert Delete Search What if h(“john”) = h(“joe”)?
Insert: T[h(“john”)] = <“john”, 25000> (the data record is stored at the slot obtained by hashing the key)
Delete: T[h(“john”)] = NULL, i.e., T[3] = NULL
Search: return T[h(“john”)]
What if h(“john”) = h(“joe”)?
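A minimal sketch of these three operations, assuming integer salary values and a toy character-sum hash. The names Entry, h, and TableSize are illustrative, not the course's code; collisions are ignored here and handled in the later slides.

#include <iostream>
#include <string>
#include <vector>

struct Entry {                      // one slot of the table
    std::string key;
    int value = 0;
    bool occupied = false;
};

const int TableSize = 10;

int h(const std::string& key) {     // toy hash: character sum mod TableSize
    int sum = 0;
    for (char c : key) sum += c;
    return sum % TableSize;
}

int main() {
    std::vector<Entry> T(TableSize);

    Entry e; e.key = "john"; e.value = 25000; e.occupied = true;
    T[h("john")] = e;                            // insert <"john", 25000>
    std::cout << T[h("john")].value << '\n';     // search: prints 25000
    T[h("john")] = Entry();                      // delete: reset the slot to empty
}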

6 Hash Function
The mapping from key to array index, h(key) ==> hash table index, is called a hash function.
Typically a many-to-one mapping.
Ideally, different keys map to different indices and keys are distributed evenly over the table.
A collision occurs when the hash function maps two keys to the same array index.

7 Hash Function (cont’d)
Simple hash function: h(key) = key mod TableSize (assumes integer keys).
For random keys, h() distributes keys evenly over the table.
But what if TableSize = 100 and all keys are multiples of 10? (See the sketch below.)
Better if TableSize is a prime number, not too close to a power of 2 or 10.
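A small illustration of the point above. With TableSize = 100 and keys that are all multiples of 10, key mod TableSize can only ever produce the ten indices 0, 10, …, 90; a prime table size such as 97 spreads the same keys over the whole table. The numbers here are my own example, not from the slides.

#include <iostream>
#include <set>

int main() {
    std::set<int> mod100, mod97;
    for (int key = 10; key <= 1000; key += 10) {   // 100 keys, all multiples of 10
        mod100.insert(key % 100);   // non-prime size sharing a factor with the keys
        mod97.insert(key % 97);     // prime size
    }
    std::cout << "distinct slots, TableSize=100: " << mod100.size() << '\n';  // 10
    std::cout << "distinct slots, TableSize=97:  " << mod97.size()  << '\n';  // 97
}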

8 Hash Function for String Keys
Add up the character ASCII values (0–127) to produce integer keys, e.g., “abcd” = 97 + 98 + 99 + 100 = 394, so h(“abcd”) = 394 mod TableSize.
Small strings may not use all of the table: the hash value is at most strlen(s) * 127, which may be much less than TableSize.
Suppose TableSize = 10,007 (a prime) and keys are 8 or fewer characters long: hash values lie between 0 and 127 * 8 = 1,016. Not an equitable distribution.
Anagrams map to the same index: h(“abcd”) = h(“dbac”).
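A sketch of this character-sum hash (the name simpleHash is illustrative). It is simple but weak: short keys only reach small indices, and anagrams always collide.

#include <iostream>
#include <string>

int simpleHash(const std::string& key, int tableSize) {
    int sum = 0;
    for (char c : key) sum += c;          // add up character values
    return sum % tableSize;
}

int main() {
    const int tableSize = 10007;
    std::cout << simpleHash("abcd", tableSize) << '\n';   // 394 (97+98+99+100)
    std::cout << simpleHash("dbac", tableSize) << '\n';   // 394 as well: anagrams collide
}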

9 Hash Function for String Keys
Treat the first 3 characters of the string as a base-27 integer (26 letters plus space): key = S[0] + 27 * S[1] + 27^2 * S[2].
Assumes the first 3 characters are randomly distributed, which is not true for English.
There are 26^3 = 17,576 possible combinations of 3 characters, but a check of a large dictionary reveals that only 2,851 different combinations actually occur.
Even if none of these combinations collide, only about 28% of the table (TableSize = 10,007) can actually be hashed to.
The function is easily computable, but not appropriate if the hash table is reasonably large.

10 Hash Function for String Keys (cont’d)
Use all N characters of the string as an N-digit base-K integer.
Choose K to be a prime number larger than the number of different digits (characters), e.g., K = 29, 31, 37.
If L is the length of string s, then h(s) = (s[0] + 37 s[1] + 37^2 s[2] + … + 37^(L−1) s[L−1]) mod TableSize, a polynomial function of 37; for three characters, h_k = k_0 + 37 k_1 + 37^2 k_2.
Use Horner's rule to compute h(s) efficiently (see the sketch below).
Limit L for long strings. Keys could be a complete street address; include only a couple of characters from the street address, e.g., the characters in the odd positions.
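A minimal sketch of the Horner's-rule evaluation (the name hornerHash is illustrative). Iterating the characters left to right effectively treats the first character as the most significant base-37 digit, the reverse of the polynomial as written above; either ordering gives a usable hash. Overflow in the unsigned accumulator is harmless, since it amounts to working modulo 2^32.

#include <iostream>
#include <string>

unsigned int hornerHash(const std::string& s, int tableSize) {
    unsigned int hashVal = 0;
    for (char c : s)
        hashVal = 37 * hashVal + static_cast<unsigned char>(c);   // one Horner step
    return hashVal % tableSize;
}

int main() {
    std::cout << hornerHash("junk", 10007) << '\n';
}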

11 Collision Resolution What happens when h(k1) = h(k2)?
Collision resolution strategies Separate Chaining Store colliding keys in a linked list at the same hash table index Open addressing Store colliding keys elsewhere in the table

12 Collision Resolution Approach #1
Chaining Collision Resolution Approach #1

13 Collision Resolution by Chaining
Hash table T is a vector of lists Only singly-linked lists needed if memory is tight Key k is stored in list at T[h(k)] E.g. TableSize = 10 h(k) = k mod 10 Insert first 10 perfect squares Insertion sequence = 0, 1, 4, 9, 16, 25, 36, 49, 64, 81

14 Operations on Chaining
Search: use the hash function to determine which list to traverse, then search that list.
Insert: check the appropriate list to see if the element is already present; if it is new, it can be inserted at the front of the list (see the sketch below).
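A minimal sketch of separate chaining with integer keys and h(k) = k mod TableSize, just the two operations described above; the class name and details are illustrative, not the course's full implementation. The comments show the chains produced by the perfect-squares example from the previous slide.

#include <iostream>
#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(int size) : lists(size) {}

    bool contains(int x) const {                       // search
        const std::list<int>& whichList = lists[hash(x)];
        for (int v : whichList) if (v == x) return true;
        return false;
    }

    void insert(int x) {                               // insert at the front if new
        if (!contains(x)) lists[hash(x)].push_front(x);
    }

private:
    std::vector<std::list<int>> lists;
    int hash(int x) const { return x % static_cast<int>(lists.size()); }
};

int main() {
    ChainedHashTable t(10);
    for (int x : {0, 1, 4, 9, 16, 25, 36, 49, 64, 81}) t.insert(x);
    // Resulting chains: 0:{0} 1:{81,1} 4:{64,4} 5:{25} 6:{36,16} 9:{49,9}
    std::cout << t.contains(49) << ' ' << t.contains(50) << '\n';   // prints: 1 0
}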

15 Implementation of Chaining Hash Table
Hash table stores an array of linked lists Generic hash functions for integer and string keys

16 Implementation of Chaining Hash Table (cont’d)

17 Implementation of Chaining Hash Table (cont’d)
STL algorithm: find Each of these operations takes time linear in the length of the list

18 Implementation of Chaining Hash Table (cont’d)
No duplicates Doubles size of table and reinserts current elements (more on this later)

19 Implementation of Chaining Hash Table (cont’d)
All hash objects must define == and != operators Hash function to handle Employee object type

20 Collision Resolution by Chaining: Analysis
Load factor λ of a hash table T: N = number of elements in T, M = size of T, λ = N / M.
The average length of a chain is λ.
Unsuccessful search: O(λ). Successful search: O(λ / 2).
Ideally, we want λ ≈ 1 (not a function of N), i.e., TableSize ≈ number of elements you expect to store in the table.
Keep TableSize prime to ensure a good distribution.

21 Conclusion: Collision Resolution by Chaining
Structures other than linked lists could be used to hold colliding keys: a binary search tree (BST), or even another hash table.
But if the table is large and the hash function is good, all the lists should be short, so it is not worthwhile to try anything more complicated.

22 Collision Resolution Approach #2
Open Addressing Collision Resolution Approach #2

23 Collision Resolution by Open Addressing
When a collision occurs, look elsewhere in the table for an empty slot.
Advantages over chaining: no additional list structures; no memory allocation/deallocation during insertion/deletion (which is slow).
Disadvantages: insertion is slower, since several attempts may be needed to find an empty slot; the table needs to be bigger than a chaining-based table to achieve average-case constant-time performance, keeping the load factor λ ≤ 0.5.
Tables using this approach are known as probing hash tables.

24 Collision Resolution by Open Addressing
Probe sequence Sequence of slots in hash table to search h0(x), h1(x), h2(x), … Needs to visit each slot exactly once Needs to be repeatable (so we can find/delete what we’ve inserted) Hash function hi(x) = (hash(x) + f(i)) mod TableSize f(0) = 0 ==> first try Three common collision resolution strategies Linear Probing Quadratic Probing Double Hashing

25 Linear Probing f(i) is a linear function of i
E.g., f(i) = i. The function f is known as the collision resolution strategy.
Example with hash(x) = x mod TableSize, TableSize = 10 (X marks a collision):
h0(89) = (hash(89) + f(0)) mod 10 = 9
h0(18) = (hash(18) + f(0)) mod 10 = 8
h0(49) = (hash(49) + f(0)) mod 10 = 9 (X)
h1(49) = (hash(49) + f(1)) mod 10 = 0

26 Linear Probing Example
Insert sequence: 89, 18, 49, 58, 69
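A short sketch reproducing this example: TableSize = 10, hash(x) = x mod 10, f(i) = i. The resulting layout in the comment follows directly from the insertion order.

#include <iostream>
#include <vector>

int main() {
    const int tableSize = 10;
    std::vector<int> table(tableSize, -1);            // -1 marks an empty slot

    for (int x : {89, 18, 49, 58, 69}) {
        int i = 0;
        int pos = x % tableSize;
        while (table[pos] != -1)                      // probe until an empty slot
            pos = (x % tableSize + ++i) % tableSize;  // h_i(x) = (hash(x) + i) mod 10
        table[pos] = x;
    }

    for (int slot = 0; slot < tableSize; ++slot)
        std::cout << slot << ": " << table[slot] << '\n';
    // Expected layout: 0:49  1:58  2:69  8:18  9:89 (other slots empty)
}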

27 Linear Probing: Analysis
Probe sequences can get long.
Primary clustering: keys tend to cluster in one part of the table. Any key that hashes into the cluster requires several attempts to resolve the collision, and is then added to the end of the cluster, making it even bigger.
Even if the table is relatively empty, blocks of occupied cells start forming; this effect is known as primary clustering.

28 Linear Probing: Analysis (cont’d)
Expected number of probes for an insertion or unsuccessful search: (1/2)(1 + 1/(1 − λ)^2).
Expected number of probes for a successful search: (1/2)(1 + 1/(1 − λ)).
Example (λ = 0.5): insert/unsuccessful search 2.5 probes; successful search 1.5 probes.
Example (λ = 0.9): insert/unsuccessful search 50.5 probes; successful search 5.5 probes.

29 Random Probing: Analysis
Random probing does not suffer from clustering (assuming a large table and probes that are independent of each other).
Expected number of probes for an insertion or unsuccessful search: 1/(1 − λ).
Expected number of probes for a successful search: (1/λ) ln(1/(1 − λ)).
Example: at λ = 0.5 a successful search takes about 1.4 probes; at λ = 0.9, about 2.6 probes.

30 Linear vs. Random Probing
(The figure plots the expected number of probes against the load factor λ for linear and random probing; U = unsuccessful search, S = successful search, I = insert.)
Linear probing example (λ = 0.9): insert/unsuccessful search ≈ 50.5 probes; successful search ≈ 5.5 probes.
Random probing example (λ = 0.9): insert/unsuccessful search ≈ 10 probes; successful search ≈ 4 probes.

31 Quadratic Probing Avoids primary clustering f(i) is quadratic in i
E.g., f(i) = i^2.
Example (continuing the earlier table):
h0(58) = (hash(58) + f(0)) mod 10 = 8 (X)
h1(58) = (hash(58) + f(1)) mod 10 = 9 (X)
h2(58) = (hash(58) + f(2)) mod 10 = 2

32 Quadratic Probing Example
Insert sequence: 89, 18, 49, 58, 69 Question: Delete 49, find 49, is there a problem?

33 Quadratic Probing: Analysis
Difficult to analyze in general.
Theorem 5.1: a new element can always be inserted if the table is at least half empty and TableSize is prime.
Proof sketch: let TableSize be an odd prime greater than 3, and consider the alternative locations h(x) + i^2 (mod TableSize) for 0 ≤ i ≤ ⌊TableSize/2⌋. These locations are all distinct: if h(x) + i^2 ≡ h(x) + j^2 (mod TableSize) with i ≠ j, then (i − j)(i + j) ≡ 0 (mod TableSize), so the prime TableSize would have to divide i − j or i + j, which is impossible for distinct i, j in this range; contradiction.
Without these conditions, probing may never find an empty slot, even if one exists.
Ensure the table never gets half full; if it gets close, expand it.

34 Quadratic Probing (cont’d)
Only M (where M is TableSize) different probe sequences: elements that hash to the same position probe the same alternative cells. This is known as secondary clustering.
Deletion (e.g., remove 89): emptying slots can break probe sequences.
Lazy deletion: differentiate between empty and deleted slots, and skip deleted slots while probing. This slows operations (it effectively increases λ).

35 Quadratic Probing: Implementation
Class Interface for Hash Table using probing strategies

36 Quadratic Probing: Implementation (cont’d)
Lazy deletion

37 Quadratic Probing: Implementation (cont’d)
Ensure table size is prime

38 Quadratic Probing: Implementation (cont’d)
Find: skip DELETED slots; no duplicates allowed.
The quadratic probe sequence can be computed incrementally: f(i) = f(i − 1) + (2i − 1), i.e., each probe adds the next odd number (see the sketch below).
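A sketch of the quadratic-probing search loop built on this identity: instead of recomputing f(i) = i^2, each step adds the next odd number (1, 3, 5, …). The HashEntry record and status values mirror the lazy-deletion idea from the previous slides, but the exact names are illustrative, not the course's implementation; hashString stands in for any string hash, e.g. the Horner hash shown earlier.

#include <string>
#include <vector>

enum EntryStatus { ACTIVE, EMPTY, DELETED };

struct HashEntry {
    std::string element;
    EntryStatus info = EMPTY;
};

unsigned int hashString(const std::string& s) {
    unsigned int h = 0;
    for (char c : s) h = 37 * h + static_cast<unsigned char>(c);
    return h;
}

int findPos(const std::vector<HashEntry>& array, const std::string& x) {
    int offset = 1;
    int currentPos = hashString(x) % array.size();

    // Keep probing while the slot is in use (ACTIVE or DELETED) by another key.
    while (array[currentPos].info != EMPTY && array[currentPos].element != x) {
        currentPos += offset;                  // advance by the next odd increment
        offset += 2;
        if (currentPos >= static_cast<int>(array.size()))
            currentPos -= static_cast<int>(array.size());   // single subtraction instead of mod
    }
    return currentPos;                         // either x's slot or an EMPTY slot
}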

39 Quadratic Probing: Implementation (cont’d)
Insert No duplicates Remove No deallocation needed

40 Double Hashing Quadratic probing problems: secondary clustering
Elements that hash to the same position probe the same alternative cells.
Double hashing combines two different hash functions: f(i) = i * hash2(x).
Good choices for hash2(x)? It should never evaluate to 0. A common choice: hash2(x) = R − (x mod R), where R is a prime number less than TableSize.
Previous example with R = 7:
h0(49) = (hash(49) + f(0)) mod 10 = 9 (X)
h1(49) = (hash(49) + 1 * (7 − 49 mod 7)) mod 10 = 6
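A small sketch of this probe sequence, using the slide's functions hash(x) = x mod TableSize and hash2(x) = R − (x mod R) with R = 7; the loop just prints the first few probe positions for key 49.

#include <iostream>

int main() {
    const int tableSize = 10;   // the slide's small example; a prime size is better
    const int R = 7;
    const int x = 49;

    int hash2 = R - (x % R);    // 7 - (49 % 7) = 7, never 0
    for (int i = 0; i < 3; ++i)
        std::cout << "h" << i << "(49) = "
                  << (x % tableSize + i * hash2) % tableSize << '\n';
    // Prints 9, 6, 3: slot 9 collides with 89, so 49 lands in slot 6.
}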

41 Double Hashing Example

42 Double Hashing: Analysis
It is imperative that TableSize is prime. E.g., try inserting 23 into the previous table (TableSize = 10, not prime): hash(23) = 3 and hash2(23) = 7 − (23 mod 7) = 5, so the probe sequence only ever visits slots 3 and 8, both already occupied, and the insertion fails even though the table is mostly empty.
Empirical tests show that double hashing is close to random hashing.
The extra hash function takes extra time to compute.

43 Rehashing Increase the size of the hash table when load factor too high Typically expand the table to twice its size (but still prime) Reinsert existing elements into new hash table

44 Rehashing Example
Hash table with linear probing, h(x) = x mod 7, input 13, 15, 6, 24: λ = 4/7 ≈ 0.57.
Insert 23: λ = 5/7 ≈ 0.71, so rehash.
After rehashing with h(x) = x mod 17: λ = 5/17 ≈ 0.29.

45 Rehashing Analysis Rehashing takes O(N) time But happens infrequently
Specifically Must have been N/2 insertions since last rehash Amortizing the O(N) cost over the N/2 prior insertions yields only constant additional time per insertion

46 Rehashing Implementation
When to rehash?
When the table is half full (λ = 0.5)
When an insertion fails
When the load factor reaches some threshold
Works for both chaining and open addressing (see the sketch below).
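A sketch of rehashing for the chaining table sketched earlier: build a table of roughly twice the size (ideally the next prime, via a nextPrime helper that is assumed here, not shown) and reinsert every element under the new table size. The load-factor-of-1 trigger is one of the policies listed above.

#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(int size) : lists(size) {}

    void insert(int x) {
        lists[x % lists.size()].push_front(x);
        if (++currentSize > lists.size())      // rehash when the load factor passes 1
            rehash();
    }

private:
    void rehash() {
        std::vector<std::list<int>> oldLists = lists;

        // ideally nextPrime(2 * oldLists.size()); doubling is shown for brevity
        lists = std::vector<std::list<int>>(2 * oldLists.size());
        for (const auto& chain : oldLists)     // reinsert every element:
            for (int x : chain)                // each one hashes with the new size
                lists[x % lists.size()].push_front(x);
    }

    std::vector<std::list<int>> lists;
    std::size_t currentSize = 0;
};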

47 Rehashing for Chaining

48 Rehashing for Quadratic Probing

49 Hash Tables in C++ STL
Hash tables are not part of the original (pre-C++11) C++ Standard Library.
Some implementations of the STL provide hash tables, e.g., SGI's STL: hash_set and hash_map.

50 Hash Set in SGI's STL

#include <iostream>
#include <cstring>
#include <hash_set>   // SGI extension, not part of standard C++
using namespace std;

struct eqstr {        // key equality test
    bool operator()(const char* s1, const char* s2) const {
        return strcmp(s1, s2) == 0;
    }
};

// template arguments: key type, hash function, key equality test
void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
            const char* word) {
    hash_set<const char*, hash<const char*>, eqstr>::const_iterator it = Set.find(word);
    cout << word << ": " << (it != Set.end() ? "present" : "not present") << endl;
}

int main() {
    hash_set<const char*, hash<const char*>, eqstr> Set;
    Set.insert("kiwi");
    lookup(Set, "kiwi");
}

51 Hash Map in SGI's STL

#include <iostream>
#include <cstring>
#include <hash_map>   // SGI extension, not part of standard C++
using namespace std;

struct eqstr {        // key equality test
    bool operator()(const char* s1, const char* s2) const {
        return strcmp(s1, s2) == 0;
    }
};

int main() {
    // template arguments: key type, data type, hash function, key equality test
    hash_map<const char*, int, hash<const char*>, eqstr> months;
    months["january"] = 31;
    months["february"] = 28;
    months["december"] = 31;
    cout << "january -> " << months["january"] << endl;
}

52 Problem with Large Tables
What if hash table is too large to store in main memory? Solution: Store hash table on disk Minimize disk accesses But… Collisions require disk accesses Rehashing requires a lot of disk accesses Solution: Extendible hashing

53 Extendible Hashing Store hash table in a depth–1 tree
Every search takes 2 disk accesses Insertions require few disk accesses Hash the keys to a long integer (“extendible”) Use first few bits of extended keys as the keys in the root node (“directory”) Leaf nodes contain all extended keys starting with the bits in the associated root node key
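A very rough sketch of the directory lookup described above. The structure below (Leaf, Directory, field names) is illustrative only; a real implementation stores each leaf on its own disk page, so a search reads the directory (one access) and then the single leaf it points to (a second access). The sketch assumes 64-bit extended keys and a global depth of at least 1.

#include <cstdint>
#include <vector>

struct Leaf {
    int localDepth = 0;              // dL: number of leading bits shared by its keys
    std::vector<std::uint64_t> keys; // up to M extended keys per disk page
};

struct Directory {
    int globalDepth;                 // D: use the first D bits of the extended key
    std::vector<Leaf*> leaves;       // 2^D entries, possibly pointing to shared leaves

    Leaf* find(std::uint64_t extendedKey) const {
        // take the top D bits of the 64-bit extended key as the directory index
        std::uint64_t index = extendedKey >> (64 - globalDepth);
        return leaves[index];
    }
};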

54 Extendible Hashing Example
Extendible hash table containing N = 12 data elements.
The first D = 2 bits of each key are used as the root node (directory) keys, so the directory has 2^D entries.
Each leaf contains up to M = 4 data elements, as determined by the disk page size.
Each leaf stores the number of starting bits its keys have in common (d_L).

55 Extendible Hashing Example (cont’d)
After inserting 100100: the directory is split (doubled) and rewritten.
Leaves not involved in the split are now pointed to by two adjacent directory entries; these leaves did not need to be accessed during the split.

56 Extendible Hashing Example (cont’d)
After inserting 000000: one leaf splits, and only two pointers change in the directory.

57 Extendible Hashing Analysis
Expected number of leaves is (N/M) * log_2 e ≈ 1.44 (N/M).
The average leaf is ln 2 ≈ 0.69 full (the same as for B-trees).
Expected size of the directory is O(N^(1 + 1/M) / M), which is O(N/M) for large M (elements per leaf).

58 Hash Table Applications
Maintaining symbol table in compilers Accessing tree or graph nodes by name E.g., city names in Google maps Maintaining a transposition table in games Remember previous game situations and the move taken (avoid re-computation) Dictionary lookups Spelling checkers Natural language understanding (word sense)

59 Summary Hash tables support fast insert and search
O(1) average case performance Deletion possible, but degrades performance Not good if need to maintain ordering over elements Many applications

60 Points to Remember – Hash Tables
Keep the table size prime.
Keep the table size much larger than the number of inputs (to keep λ close to 0, or at least below 0.5 for probing).
There are tradeoffs between chaining and probing.
Clustering (and hence the cost of resolving collisions) decreases in this order: linear probing, quadratic probing, {random probing, double hashing}.
Rehashing is required to resize the hash table when λ exceeds 0.5.
Hash tables are good for searching, but not if there is some order implied by the data.

