1 Joe Meehean 1.  BST easy to implement average-case times O(LogN) worst-case times O(N)  AVL Trees harder to implement worst case times O(LogN)  Can.

1 Joe Meehean 1

- BST
  - easy to implement
  - average-case times O(log N)
  - worst-case times O(N)
- AVL trees
  - harder to implement
  - worst-case times O(log N)
- Can we do better in the average case?

- "Dictionary" ADT
  - average-case time O(1) for lookup, insert, and delete
- Idea
  - store keys (and associated values) in an array
  - compute each key's array index as a function of the key's value
  - take advantage of the array's fast random access
- Alternative implementation for sets and maps

- Goal
  - store info about a company's 50 employees
  - each employee has a unique employee ID in the range 100-200
- Approach
  - use an array of size 101 (the range of IDs)
  - store employee E's info in array[E - 100]
- Result
  - insert, lookup, and delete are each O(1)
  - wasted space: 51 of the 101 locations are unused
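The employee example can be sketched in C++ roughly as follows. This is only an illustration: the `Employee` record and the `EmployeeTable` class are hypothetical names, since the slides only say that "info" is stored at array[E - 100].

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical record type; the slides only say "info".
struct Employee {
    std::string name;
};

// Direct-address table: employee E lives in slot E - 100.
class EmployeeTable {
public:
    static const int kMinId = 100;
    static const int kMaxId = 200;
    static const int kSlots = kMaxId - kMinId + 1;  // 101 slots

    EmployeeTable() { used_.fill(false); }

    // Store e under the given ID. O(1).
    void insert(int id, const Employee& e) {
        info_[index(id)] = e;
        used_[index(id)] = true;
    }

    // Returns nullptr when no employee is stored under this ID. O(1).
    const Employee* lookup(int id) const {
        return used_[index(id)] ? &info_[index(id)] : nullptr;
    }

    // Mark the slot unused again. O(1).
    void remove(int id) { used_[index(id)] = false; }

private:
    // hash(x) = x - 100; only valid for IDs inside the known range.
    static std::size_t index(int id) {
        assert(id >= kMinId && id <= kMaxId);
        return static_cast<std::size_t>(id - kMinId);
    }

    std::array<Employee, kSlots> info_;
    std::array<bool, kSlots> used_;
};
```

Every operation is a single array access, which is where the O(1) bound comes from; the price is the 51 slots that stay unused.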

- Less functionality than trees
- Hash tables cannot efficiently
  - find min
  - find max
  - print the entire table in sorted order
- Must be very careful how we use them

- Hashtable: the underlying array
- Hash function: function that converts a key to an index
  - in the example: hash(x) = x - 100
- TableSize: size of the underlying array or vector
- Bucket: a single cell of the hash table array
- Collision: when two keys hash to the same bucket

- Keys we are using have a hash function
  - or we can define good hash functions for them
- Keys overload the following operators
  - ==
  - !=

- How do we make a good hash function?
- What should we do about collisions?
- How large should we make our hash table?

- Hash function should be fast
- Keys should be evenly distributed
  - different keys should have different hash values
- Should reduce space needed
  - e.g., student IDs are 10 digits
  - do not need an array of size 10,000,000,000
  - there are only ~3,000 students

- Convert key to an int n
  - scramble up the data
  - ensure the data spreads over the entire integer space
- Return n % TableSize
  - ensures that n doesn't fall off the end of the table

- Method 1 (string keys)
  - convert each char to an int
  - sum them
  - return sum % TableSize
- Advantages
  - simple
  - time is O(key length)
- Problems
  - short keys may not reach the end of the table
    - sum of characters < TableSize (by a lot)
  - maps all permutations to the same hash
    - hash("able") = hash("bale")
  - time is O(key length)
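A minimal sketch of Method 1, assuming `std::string` keys; the function name `additiveHash` is made up for this illustration.

```cpp
#include <cassert>
#include <string>

// Method 1: sum the character codes, then reduce modulo the table size.
int additiveHash(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    int sum = 0;
    for (unsigned char c : key) {
        sum += c;  // each char contributes its integer code
    }
    return sum % tableSize;
}
```

With a table of 10,007 buckets, "able" and "bale" both sum to 404 and land in the same bucket, and even an eight-character key such as "zzzzzzzz" only reaches 976, so most of the table is never used by short keys.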

- Method 2
  - multiply individual chars by different values
  - then sum, for a key a of length n:
    a[0] * 37^n + a[1] * 37^(n-1) + ... + a[n-1] * 37
    (the sum over i = 0..n-1 of a[i] * 37^(n-i))
- Advantages
  - produces a big range of values
  - permutations hash to different values
- Disadvantages
  - relies on integer overflow
  - need to worry about negative hashes
- Handling a negative hash
    hash = hash % TableSize
    if(hash < 0) hash += TableSize
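Method 2 could be sketched as below. Two choices here are assumptions rather than something the slides prescribe: the sum is computed as a running product, and the arithmetic is done in `std::uint32_t`, because overflow of signed integers is undefined behavior in C++ while unsigned overflow simply wraps around. The helper `toIndex` shows the slides' negative-hash fix for code that keeps the hash in a signed `int`. All three function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Method 2 as a running product: after the loop, the first character has
// been multiplied by 37^n and the last one by 37.
// Unsigned arithmetic wraps around on overflow instead of being undefined.
std::uint32_t polyHashRaw(const std::string& key) {
    std::uint32_t hash = 0;
    for (unsigned char c : key) {
        hash = (hash + c) * 37u;
    }
    return hash;
}

// Reduce to a bucket index. The value is unsigned, so it cannot be negative.
int polyHash(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    return static_cast<int>(polyHashRaw(key) %
                            static_cast<std::uint32_t>(tableSize));
}

// The slides' fix, for code that keeps the hash in a signed int.
int toIndex(int hash, int tableSize) {
    assert(tableSize > 0);
    int index = hash % tableSize;
    if (index < 0) {
        index += tableSize;
    }
    return index;
}
```

Unlike the plain character sum, this version gives "able" and "bale" different hash values, because each position is weighted by a different power of 37.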

- Fast hash vs. evenly distributed hash
  - a faster hash function often distributes keys less evenly
  - a more even distribution often requires a slower hash function
- String example
  - could use only some of the characters
  - faster, but more collisions likely

- What if two keys hash to the same bucket (array entry)?
- Array entries are linked lists (or trees)
  - different keys with the same hash value are stored in the same list (or tree)
  - commonly called chained bucket hashing, or just chaining

- TableSize = 10
- keys: 10-digit student IDs
- hashfn = sum of digits % TableSize

ID (Key)     Value  Sum  Hash Code
9014638161   A      39   9
9103287648   B      48   8
4757414352   C      42   2
8377690440   D      48   8
9031397831   E      44   4

Resulting table (bucket: chained values):
0:
1:
2: C
3:
4: E
5:
6:
7:
8: B -> D
9: A

- During a lookup
  - how can we tell which value we want if there is more than one entry in the bucket?
- Compare the keys
  - buckets store keys and values

- Related to hashing function
  - some hashing functions lead to data clustered together
- Using a prime TableSize helps resolve this issue
  - hash values are then not likely to share a factor with the table size

- If the number of keys is known in advance
  - make the hash table a little larger
  - a prime near 1.25 * the number of keys
  - a little room to avoid collisions
  - trades space for potentially faster lookup
- If the number of keys is not known in advance
  - plan to expand the array as needed
  - coming up in another lecture

- Lookup Key k
  1. Compute h = hash(k)
  2. See if k is in the list in hashtable[h]
- Insert Key k
  1. Compute h = hash(k)
  2. Make sure k is not already in hashtable[h]
  3. Add k to the list in hashtable[h]
- Delete Key k
  1. Compute h = hash(k)
  2. Remove k from the list in hashtable[h]

template <class K, class Hash>
class HashSet{
  private:
    vector<list<K> > table;
    int currentSize;
    Hash hashfn;
  public:
    …
    bool contains(const K&) const;
    void insert(const K&);
    void remove(const K&);
};
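One way the chained set could be filled in is sketched below. It is not the lecture's implementation: the class is named `ChainedHashSet` to keep it apart from the declaration above, the default `std::hash<K>` and the fixed table size are assumptions, and `DigitSumHash` is a hypothetical functor reproducing the sum-of-digits hash from the student ID example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Hash functor for the student ID example: sum of the decimal digits.
struct DigitSumHash {
    std::size_t operator()(long long id) const {
        std::size_t sum = 0;
        for (; id > 0; id /= 10) {
            sum += static_cast<std::size_t>(id % 10);
        }
        return sum;
    }
};

template <class K, class Hash = std::hash<K> >
class ChainedHashSet {
public:
    explicit ChainedHashSet(std::size_t tableSize = 101)
        : table_(tableSize), currentSize_(0) {
        assert(tableSize > 0);
    }

    // Lookup: hash to a bucket, then search that bucket's list.
    bool contains(const K& k) const {
        const std::list<K>& bucket = table_[indexOf(k)];
        return std::find(bucket.begin(), bucket.end(), k) != bucket.end();
    }

    // Insert: add k to its bucket unless it is already there.
    void insert(const K& k) {
        std::list<K>& bucket = table_[indexOf(k)];
        if (std::find(bucket.begin(), bucket.end(), k) == bucket.end()) {
            bucket.push_back(k);
            ++currentSize_;
        }
    }

    // Delete: remove k from its bucket if present.
    void remove(const K& k) {
        std::list<K>& bucket = table_[indexOf(k)];
        typename std::list<K>::iterator it =
            std::find(bucket.begin(), bucket.end(), k);
        if (it != bucket.end()) {
            bucket.erase(it);
            --currentSize_;
        }
    }

    std::size_t size() const { return currentSize_; }

private:
    std::size_t indexOf(const K& k) const {
        return hashfn_(k) % table_.size();
    }

    std::vector<std::list<K> > table_;
    std::size_t currentSize_;
    Hash hashfn_;
};
```

With `ChainedHashSet<long long, DigitSumHash> ids(10)`, the IDs 9103287648 and 8377690440 both have digit sum 48 and share bucket 8, yet both remain retrievable because lookups compare keys inside the bucket.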

- Recall chaining hash tables
  - array cells stored linked lists
  - 2 keys with the same hash end up in the same list
- Chaining hash tables require 2 data structures
  - hash table and linked list
- Can we solve collisions with more hashing?
  - use just one data structure

- No linked list in array cells
- Collisions handled using an alternative hash
  - try cells h_0(x), h_1(x), h_2(x), … until an empty cell is found
  - h_i(x) = hash(x) + f(i)
  - f(i) is the collision resolution strategy
- Probing
  - looking for alternative hash locations

- f(i) is a linear function
  - often f(i) = i
- If a collision occurs, look in the next cell
  - hash(x) + 1
  - keep looking until an empty cell is found
  - hash(x) + 2, hash(x) + 3, …
  - use modulus to wrap around the table
- Should eventually find an empty cell
  - if the table is not full

- Simple hash: h(x) = x % TableSize, with TableSize = 10
- Insert 89
  - h_0(89) = 9, cell 9 is empty, store 89 in cell 9
- Insert 18
  - h_0(18) = 8, cell 8 is empty, store 18 in cell 8
- Insert 49
  - h_0(49) = 9, collision with 89
  - h_1(49) = (9 + 1) % 10 = 0, cell 0 is empty, store 49 in cell 0
- Insert 58
  - h_0(58) = 8, h_1(58) = 9, and h_2(58) = 0 all collide
  - h_3(58) = (8 + 3) % 10 = 1, cell 1 is empty, store 58 in cell 1

Table after the inserts:
cell:   0   1   2   3   4   5   6   7   8   9
key:    49  58  -   -   -   -   -   -   18  89
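The insert sequence above can be reproduced with a short sketch. It assumes non-negative integer keys, uses -1 as an "empty" sentinel, and does not check for duplicates; `linearProbeInsert` and `kEmpty` are names invented for the illustration.

```cpp
#include <cassert>
#include <vector>

const int kEmpty = -1;  // sentinel for an empty cell; assumes keys >= 0

// Linear probing: try h_i(x) = (x % TableSize + i) % TableSize for
// i = 0, 1, 2, ... Returns the cell used, or -1 if the table is full.
int linearProbeInsert(std::vector<int>& table, int key) {
    assert(key >= 0);
    const int size = static_cast<int>(table.size());
    for (int i = 0; i < size; ++i) {
        int cell = (key % size + i) % size;
        if (table[cell] == kEmpty) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Starting from `std::vector<int> table(10, kEmpty)` and inserting 89, 18, 49, and 58 in that order places them in cells 9, 8, 0, and 1, matching the walkthrough.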

- Advantages
  - no need for a list
  - collision resolution function is fast
- Disadvantages
  - requires more bookkeeping
  - primary clustering

- What if an entry is deleted and we try to look up another entry that collided with it?
  - table holds 49 in cell 0, 58 in cell 1, 18 in cell 8, 89 in cell 9
  - Delete 89: h_0(89) = 9, cell 9 is emptied
  - Lookup 49: h_0(49) = 9, cell 9 is now empty, so the lookup stops and wrongly reports that 49 is not in the table

- Need extra information per cell
- Differentiate between states
  - ACTIVE: cell contains a valid key
  - EMPTY: cell never contained a valid key
  - DELETED: cell previously contained a valid key
- All cells start EMPTY
- Lookup
  - keep looking until you find the key or an EMPTY cell

Table before the delete (A = ACTIVE, E = EMPTY, D = DELETED):
cell:   0   1   2   3   4   5   6   7   8   9
key:    49  58  -   -   -   -   -   -   18  89
state:  A   A   E   E   E   E   E   E   A   A

- Delete 89
  - h_0(89) = 9, cell 9 is marked DELETED
- Lookup 49
  - h_0(49) = 9, cell 9 is DELETED, so keep looking
  - h_1(49) = 0, cell 0 is ACTIVE and holds 49: found

Table after the delete:
cell:   0   1   2   3   4   5   6   7   8   9
key:    49  58  -   -   -   -   -   -   18  89
state:  A   A   E   E   E   E   E   E   A   D

- Inserting into deleted cells: should we?

- Insert
  - find the 1st empty cell, to make sure the key is not already in the table (prevents duplicates)
  - find the 1st empty or deleted cell to insert into
  - doubles run time
- Special case
  - insert a key, delete it, then reinsert it
  - can insert into the deleted cell the key previously occupied
  - lookup knows the item is not in the table when it finds that deleted entry

template <class K>
class HashSet{
  private:
    vector<HashEntry<K> > table;
    int currentSize;
    ...
};

template <class K>
class HashEntry{
  public:
    enum EntryType{ACTIVE, EMPTY, DELETED};
  private:
    K element;
    EntryType info;
    friend class HashSet<K>;
};
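To show how the three cell states interact, here is a small self-contained sketch restricted to non-negative `int` keys with hash(x) = x and linear probing. The class name `ProbingIntSet` and its interface are assumptions made for the illustration, not the lecture's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class ProbingIntSet {
public:
    explicit ProbingIntSet(std::size_t tableSize) : table_(tableSize) {
        assert(tableSize > 0);
    }

    bool contains(int key) const { return findActive(key) < table_.size(); }

    // First pass rules out a duplicate; second pass reuses the first
    // EMPTY or DELETED cell on the probe path.
    bool insert(int key) {
        if (contains(key)) return false;
        for (std::size_t i = 0; i < table_.size(); ++i) {
            Entry& e = table_[probe(key, i)];
            if (e.info != ACTIVE) {
                e.element = key;
                e.info = ACTIVE;
                return true;
            }
        }
        return false;  // no free cell left
    }

    // Lazy deletion: mark the cell instead of emptying it.
    bool remove(int key) {
        std::size_t pos = findActive(key);
        if (pos == table_.size()) return false;
        table_[pos].info = DELETED;
        return true;
    }

private:
    enum EntryType { ACTIVE, EMPTY, DELETED };
    struct Entry {
        int element;
        EntryType info;
        Entry() : element(0), info(EMPTY) {}  // all cells start EMPTY
    };

    // Linear probing: h_i(x) = (hash(x) + i) % TableSize, with hash(x) = x.
    std::size_t probe(int key, std::size_t i) const {
        assert(key >= 0);
        return (static_cast<std::size_t>(key) + i) % table_.size();
    }

    // Index of the ACTIVE cell holding key, or table_.size() if absent.
    // Stops at the first EMPTY cell but walks past DELETED cells.
    std::size_t findActive(int key) const {
        for (std::size_t i = 0; i < table_.size(); ++i) {
            std::size_t pos = probe(key, i);
            const Entry& e = table_[pos];
            if (e.info == EMPTY) break;
            if (e.info == ACTIVE && e.element == key) return pos;
        }
        return table_.size();
    }

    std::vector<Entry> table_;
};
```

With a table of size 10 holding 89, 18, 49, and 58, removing 89 leaves cell 9 DELETED, so a later lookup of 49 still walks past it and finds 49 in cell 0.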

- No more bucket lists
- Use a collision resolution strategy
  - h_i(x) = hash(x) + f(i)
- If a collision occurs, try the next cell
  - f(i) = i
  - repeat until you find an empty cell
- Need extra bookkeeping
  - ACTIVE, EMPTY, DELETED

- What could go wrong?
- How can we fix it?
  - Professor Meehean, you haven't told us what "it" is yet.

- Clusters of data
  - a key that hashes into a cluster requires several attempts to resolve the collision
  - and then makes the cluster even bigger
  - too many keys for one bucket spill over and eat up the neighboring bucket's space, then that bucket's keys spill into the next one, etc.
- Inserting keys in space that should be empty results in collisions
  - clusters have overrun whole chunks of the hash table

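Table before the insert (linear probing, h(x) = x % 10):
cell:   0   1   2   3   4   5   6   7   8   9
key:    49  58  29  -   -   -   -   -   18  89

- Insert 30
  - h_0(30) = 0, collision with 49
  - h_1(30) = 1, collision with 58
  - h_2(30) = 2, collision with 29
  - h_3(30) = 3, cell 3 is empty, store 30 in cell 3
- None of the stored keys hashes to 0; the cluster that started at cells 8 and 9 has overrun cells 0-2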

- Primary clustering only gets worse as the load factor gets larger
  - as memory use gets more efficient
  - performance gets worse

- Primary clustering is caused by the linear nature of linear probing
  - collisions end up right next to each other
- What if we jumped farther away on a collision?
  - f(i) = i^2
- If a collision occurs…
  - hash(x) + 1, hash(x) + 4, hash(x) + 9, …

- h_i(x) = (h(x) + i^2) % TableSize, with h(x) = x % TableSize and TableSize = 10

Table before the insert:
cell:   0   1   2   3   4   5   6   7   8   9
key:    49  -   -   -   -   -   -   -   18  89

- Insert 58
  - h_0(58) = 8, collision with 18
  - h_1(58) = (8 + 1) % 10 = 9, collision with 89
  - h_2(58) = (8 + 4) % 10 = 2, cell 2 is empty, store 58 in cell 2
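The same insert can be written as a short sketch, again assuming non-negative integer keys and a -1 sentinel for empty cells; `quadraticProbeInsert` and `kEmptyCell` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEmptyCell = -1;  // sentinel for an empty cell; assumes keys >= 0

// Quadratic probing: try h_i(x) = (x + i^2) % TableSize for i = 0, 1, 2, ...
// Returns the cell used, or -1 if none of the probed cells was free.
int quadraticProbeInsert(std::vector<int>& table, int key) {
    assert(key >= 0);
    const std::size_t size = table.size();
    for (std::size_t i = 0; i < size; ++i) {
        std::size_t cell = (static_cast<std::size_t>(key) + i * i) % size;
        if (table[cell] == kEmptyCell) {
            table[cell] = key;
            return static_cast<int>(cell);
        }
    }
    return -1;
}
```

Note that a return value of -1 does not prove the table is full: the quadratic sequence can revisit cells, so an insert may fail even though some cell is still empty.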

- Quadratic probing eliminates primary clustering
- Keys with the same hash…
  - probe the same alternative cells
  - clusters still exist per bucket
  - just spread out
- Called secondary clustering
- Can we beat secondary clustering?

- If the first hashing function causes a collision, try a second hashing function
  - h_i(x) = hash(x) + f(i)
  - f(i) = i * hash_2(x)
  - h_0(x) = hash(x)
  - h_1(x) = hash(x) + hash_2(x)
  - h_2(x) = hash(x) + 2 * hash_2(x)
  - h_3(x) = hash(x) + 3 * hash_2(x)

- hash_2(x) must be carefully selected
- It can never be 0
  - h_1(x) = hash(x) + 1 * 0
  - h_2(x) = hash(x) + 2 * 0
  - h_1(x) = h_2(x) = h_3(x) = … = h_n(x)
- It must eventually probe all cells
  - quadratic probing is only guaranteed to reach half of the cells
  - requires TableSize to be prime

- hash_2(x) = R - (x % R)
  - where R is a prime smaller than TableSize
  - previous value of TableSize?

- h_i(x) = (hash(x) + i * hash_2(x)) % TableSize
- hash_2(x) = R - (x % R), with R = 7 and TableSize = 10

Table before the insert:
cell:   0   1   2   3   4   5   6   7   8   9
key:    -   -   -   -   -   -   -   -   18  89

- Insert 49
  - h_0(49) = 9, collision with 89
  - hash_2(49) = 7 - (49 % 7) = 7 - 0 = 7
  - h_1(49) = (9 + 1 * 7) % 10 = 16 % 10 = 6, cell 6 is empty, store 49 in cell 6

Why a prime TableSize is important

- h_i(x) = ((x % TableSize) + i * hash_2(x)) % TableSize
- hash_2(x) = 7 - (x % 7)
- TableSize = 10 (not prime)

Table before the insert:
cell:   0   1   2   3   4   5   6   7   8   9
key:    69  -   -   58  -   -   49  -   18  89

- Insert 23
  - hash_2(23) = 7 - (23 % 7) = 7 - 2 = 5, so h_i(23) = (3 + i * 5) % 10
  - h_0(23) = 3, collision with 58
  - h_1(23) = (3 + 1 * 5) % 10 = 8, collision with 18
  - h_2(23) = (3 + 2 * 5) % 10 = 3, collision with 58 again
  - h_3(23) = (3 + 3 * 5) % 10 = 8, collision with 18 again
- 5 is a factor of 10
  - the probe sequence wraps forever, landing on the same buckets
- If TableSize is prime, the result of hash_2(x) can never be a factor of it
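The cycling probe sequence is easy to check with a few lines of code. The sketch below only computes which cells would be probed; the function names `secondHash` and `doubleHashProbes` are invented for the illustration, and keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <vector>

// hash_2(x) = R - (x % R), always in the range 1..R, so never 0.
int secondHash(int key, int r) {
    assert(key >= 0 && r > 0);
    return r - (key % r);
}

// First `count` cells probed by double hashing:
// h_i(x) = (x % TableSize + i * hash_2(x)) % TableSize.
std::vector<int> doubleHashProbes(int key, int tableSize, int r, int count) {
    assert(tableSize > 0);
    std::vector<int> cells;
    const int step = secondHash(key, r);
    for (int i = 0; i < count; ++i) {
        cells.push_back((key % tableSize + i * step) % tableSize);
    }
    return cells;
}
```

For key 23 with R = 7, `doubleHashProbes(23, 10, 7, 4)` yields 3, 8, 3, 8, the two-cell loop shown above, whereas the prime size in `doubleHashProbes(23, 11, 7, 4)` yields 1, 6, 0, 5, four distinct cells.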

- What to do when the hash table gets too full?
  - a problem for both chained and probing hash tables
  - degrades performance
  - may cause insert failure for quadratic probing

- Create another table 2x the size
  - really, the nearest prime to 2x the table size
- Scan the original table
  - compute the new hash for valid entries
  - insert them into the new table

- Assume quadratic probing
- hash(x) = x % TableSize, with TableSize = 5

Table before the insert (cells 3 and 4 ACTIVE, all others EMPTY):
cell:   0   1   2   3   4
key:    -   -   -   58  49

- Insert 23
  - h_0(23) = 3, collision with 58
  - h_1(23) = (3 + 1) % 5 = 4, collision with 49
  - h_2(23) = (3 + 4) % 5 = 2, cell 2 is empty, store 23 in cell 2

Table after the insert (cells 2, 3, and 4 ACTIVE):
cell:   0   1   2   3   4
key:    -   -   23  58  49

- Rehash into a new table of size 11, scanning the old table with index i
  - i = 2: 23 % 11 = 1, store 23 in cell 1
  - i = 3: 58 % 11 = 3, store 58 in cell 3
  - i = 4: 49 % 11 = 5, store 49 in cell 5

New table (cells 1, 3, and 5 ACTIVE, all others EMPTY):
cell:   0   1   2   3   4   5   6   7   8   9   10
key:    -   23  -   58  -   49  -   -   -   -   -
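The rehash walkthrough can be sketched as follows. The sketch simplifies the table to a vector of non-negative ints with -1 marking an empty cell (so there is no DELETED state to skip), and it picks the smallest prime at or above twice the old size; the helper names are assumptions for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kNoKey = -1;  // sentinel for an empty cell; assumes keys >= 0

bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime >= n.
std::size_t nextPrime(std::size_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Quadratic probing insert; returns false if no free cell was found.
bool insertKey(std::vector<int>& table, int key) {
    assert(key >= 0);
    for (std::size_t i = 0; i < table.size(); ++i) {
        std::size_t cell =
            (static_cast<std::size_t>(key) + i * i) % table.size();
        if (table[cell] == kNoKey) {
            table[cell] = key;
            return true;
        }
    }
    return false;
}

// Build a table of roughly twice the size and reinsert every stored key,
// recomputing each hash against the new table size.
std::vector<int> rehash(const std::vector<int>& old) {
    std::vector<int> bigger(nextPrime(2 * old.size()), kNoKey);
    for (std::size_t i = 0; i < old.size(); ++i) {
        if (old[i] != kNoKey) {
            bool placed = insertKey(bigger, old[i]);
            assert(placed);
            (void)placed;  // silence unused-variable warnings in release builds
        }
    }
    return bigger;
}
```

Applied to the size-5 table holding 23, 58, and 49 in cells 2, 3, and 4, `rehash` produces a table of size 11 with the keys in cells 1, 3, and 5, as in the walkthrough.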

- Cost of rehashing: O(N)
- Initialization or offline (batch)
  - cost is amortized
  - at least N/2 inserts between rehashes
- Interactive
  - can cause periodic unresponsiveness
  - program is snappy for N/2 - 1 operations
  - the N/2th operation causes a rehash

- How do we use C++ hash maps and hash sets?
- When should we use a map…
  - backed by a hash table
  - backed by a tree (e.g., BST, B+)
- When should we use a set…
  - backed by a hash table
  - backed by a tree (e.g., BST, B+)

- unordered_map and unordered_set
  - alternative implementations of map and set
  - use a hash table
  - require a hash unary functor
  - require an equals predicate functor
- Require C++11 or later
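A brief sketch of `std::unordered_map` with a user-defined key shows where the two functors go. The `StudentId` type, the functor names, and the stored values are hypothetical; for built-in key types such as `int` or `std::string`, the defaults `std::hash` and `std::equal_to` can be used instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type wrapping a 10-digit student ID.
struct StudentId {
    long long value;
};

// Hash unary functor: maps a key to a std::size_t.
struct StudentIdHash {
    std::size_t operator()(const StudentId& id) const {
        return std::hash<long long>()(id.value);
    }
};

// Equals predicate functor: decides whether two keys are the same key.
struct StudentIdEqual {
    bool operator()(const StudentId& a, const StudentId& b) const {
        return a.value == b.value;
    }
};

typedef std::unordered_map<StudentId, std::string,
                           StudentIdHash, StudentIdEqual> Roster;

Roster makeRoster() {
    Roster roster;
    roster[StudentId{9014638161LL}] = "A";
    roster[StudentId{9103287648LL}] = "B";
    assert(roster.size() == 2);  // distinct keys, so nothing was overwritten
    return roster;
}
```

Lookups then go through `roster.find(key)` or `roster.count(key)`; iterating over the map visits the entries in no particular order, which is the price of dropping the tree.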

- Lookup Key k
  1. Compute h = hash(k)
  2. See if k is in the list in hashtable[h]
- Time for lookup
  - time for step 1 + time for step 2
- Worst case for step 2
  - all keys hash to the same index
  - O(# keys in table) = O(N)

- If the hash function distributes keys uniformly
  - the probability that hash(k) = h is 1/TableSize
  - for all h in the range 0 to TableSize - 1
- Then the probability of a collision = N/TableSize, where N is the number of keys in the table
  - if N ≤ TableSize, then p(collision) ≤ 1

- Loophole compacts to…
  - If the hash function distributes keys uniformly
  - AND the subset of keys distributes uniformly
  - AND # of keys ≤ TableSize
  - AND the hash function is O(1)
  - Then the average time for lookup is O(1)

- Insert Key k
  - compute h = hash(k)
  - put k in the table at or near table[h]
- Complexity
  - hash function: should be O(1)
  - collision resolution: O(N)
    - chained: must check all keys in the list
    - probing: probe may hit every other filled cell
  - worst case: O(N)
  - loophole average case: O(1)

- Delete Key k
  - compute h = hash(k)
  - remove k from at or near table[h]
- Complexity
  - same as lookup and insert
  - O(N) in the worst case
  - O(1) in the loophole average case

- Loophole
  - limited collisions
  - O(1) average complexity for lookup, insert, and delete
- Worst-case times
  - insert: O(N)
    - even with the loophole, because a rehash makes this possible
  - lookup, delete: O(N)

- Alternative implementation for sets and maps, but…
- Balanced tree, all operations are O(log N)
  - safe, middle-of-the-road performance
- Gamble on hash implementations
  - potential O(1) operations
  - potential O(N) operations
- Some operations are not efficient
  - print in sorted order
  - find largest/smallest

- Must be certain there will be a small # of hash key collisions
  - not just a small probability
  - an actual worst-case small # of collisions
  1. All keys are known in advance and hashing doesn't cause a large # of collisions
  2. The map/set will always store all keys
     - no collisions due to modulus
     - no key similarities due to a select sample
