Chapter 11 Hash Tables © John Urrutia 2014, All Rights Reserved


1 Chapter 11 Hash Tables

2 The Problem
Given an array of 20,000,000 elements, how do I:
Rapidly identify whether a searched-for element exists?
Add elements randomly but maintain rapid access to those elements?
Delete elements without needing to reorganize?

3 Consider this problem
Your task: create an associative array that counts the number of times each word has been used in a novel.
How would you find out if the current word is in the array? How long would it take?
There are over 750,000 words in the English language. How big would the array need to be?
Would it be possible to achieve O(1) or close to it?
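The word-counting task can be sketched with .NET's built-in hash table, `Dictionary<TKey, TValue>`. This is only a minimal illustration: the class name `WordCounter`, the separator list, and the case-insensitive comparison are assumptions, not part of the slides.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Counts how often each word occurs in the text.
    // Lookup and update are expected O(1) per word because Dictionary hashes its keys.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        char[] separators = { ' ', '\t', '\n', '\r', '.', ',', ';', ':', '!', '?' };
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(word, out int seen);   // seen stays 0 for a new word
            counts[word] = seen + 1;
        }
        return counts;
    }
}
```

The dictionary only ever holds the words that actually occur, so its size follows the novel's vocabulary rather than the size of the language.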

4 The Answer: Hash Tables
Definition: a hash table, or hash map, is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number).

5 The Answer: Hash Tables
Perfect solution: the associative array has enough entries to match all possible values of the key.
Actual implementation: because in most situations only a small number of entries are actually used, we don't need every possibility. Instead, a hash function transforms the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

6 The Hash Function
Needs to compute an array index (the hash) from the key of the data to be stored.
Needs to spread keys evenly, as if at random, across the slots.
Needs to avoid hashing multiple keys to the same slot as far as possible.

7 The Hash Function
A simple hash algorithm:
1. Sum the char values of the data to be stored.
2. Reduce the sum modulo the size of the array (the number of buckets to be filled).
Example: for the string "BZ", the char 'B' has a value of 66 and the char 'Z' has a value of 90, so the sum is 156. If the array can contain only 26 entries, 156 mod 26 is 0.
The method follows on the next slide.

8 Simple Hash Method
    // Sums the character codes of x and reduces the sum modulo the table size.
    private static long SimpleHashCalc(string x, int hashLimit)
    {
        long calcHashVal = 0;
        foreach (char ch in x)
        {
            calcHashVal += (int)ch;
        }
        return calcHashVal % hashLimit;   // hashLimit = number of buckets
    }

9 Collisions!!
Using the previous method with 26 buckets, what's returned for "BZ" or "ZB", "CY" or "YC", and "$8.00"?
Answer: 0 in every case, which means each of these will occupy the same array element.
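The collisions can be checked with a small sketch of the sum-and-modulo scheme, assuming the 26-bucket table from the earlier example. The names `SimpleHashDemo` and `SimpleHash` are illustrative only.

```csharp
public static class SimpleHashDemo
{
    // Sum of the character codes, reduced modulo the number of buckets.
    public static long SimpleHash(string key, int tableSize)
    {
        long sum = 0;
        foreach (char ch in key) sum += ch;
        return sum % tableSize;
    }
}
```

Addition ignores character order, so "BZ" and "ZB" always share a sum (66 + 90 = 156), and "CY" happens to reach the same sum (67 + 89 = 156). "$8.00" sums to 36 + 56 + 46 + 48 + 48 = 234, a different multiple of 26, so it also lands in slot 0.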

10 Collision Avoidance
Create a perfect hash function to eliminate all collisions (but for arbitrary keys there is no such thing). A hash that weights each character by its position at least separates keys such as "BZ" and "ZB":
    private static long hashCalc(string x, int limit)
    {
        const int multiplier = 31;   // illustrative constant; it must not equal limit,
                                     // or every term but the last vanishes under the modulo
        long calcHashVal = 0;
        foreach (char ch in x)
        {
            calcHashVal = (calcHashVal * multiplier + (int)ch) % limit;
        }
        return calcHashVal;
    }

11 Collisions!!
Using the improved method, what's returned now, assuming the size is 65,536?
"BZ" or "ZB", "CY" or "YC"
Answer: 2136, 2856, 2166 and 2826.

12 Collision Avoidance
As data is added to the array, collisions increase, making the process less efficient.
Simple changes help:
Double the size of the array.
Base the array size on the nearest prime number greater than the expected size, e.g. 65536 → 65537.

13 Collision Avoidance
Calculating prime numbers
    // Returns the smallest prime greater than min.
    private int getPrime(int min)
    {
        for (int j = min + 1; true; j++)
            if (isPrime(j)) return j;
    }

    // Is n prime?
    private bool isPrime(int n)
    {
        if (n < 2) return false;             // 0 and 1 are not prime
        for (int j = 2; j * j <= n; j++)     // for all j up to sqrt(n)
            if (n % j == 0)                  // divides by j?
                return false;                // yes, so not prime
        return true;                         // no, so prime
    }
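The two sizing suggestions (double the array, then move up to a prime) could be combined roughly as follows. `TableSizing`, `NextPrime`, and `GrownSize` are hypothetical names, and the growth rule itself is an assumption rather than something the slides prescribe.

```csharp
public static class TableSizing
{
    // Trial division up to sqrt(n); values below 2 are not prime.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int j = 2; (long)j * j <= n; j++)
            if (n % j == 0) return false;
        return true;
    }

    // Smallest prime strictly greater than min.
    public static int NextPrime(int min)
    {
        for (int j = min + 1; ; j++)
            if (IsPrime(j)) return j;
    }

    // Hypothetical growth rule: double the size, then move up to the next prime.
    public static int GrownSize(int currentSize)
    {
        return NextPrime(2 * currentSize);
    }
}
```

For example, `NextPrime(65536)` gives 65537, and a 37-slot table would grow to 79 slots, the first prime above 74.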

14 Collision Handling
Not all collisions can be avoided. How can we handle them?
Open addressing: look for the next available slot.
Separate chaining: use a list for each hash.

15 Open addressing
Searching for the next available slot.
Linear probing: increment the index until an empty element is found or we run out of slots to look in.

16 Open addressing
Linear probing
Advantages: simple to code.
Disadvantages: causes keys to cluster, which promotes more collisions (primary clustering).

17 Open addressing
Linear probing (one step at a time)
    private static long findNext(string[] intData, long intCurrIndx)
    {
        // Visit every other slot once, starting just after the current one.
        for (long step = 1; step < intData.Length; step++)
        {
            long indx = (intCurrIndx + step) % intData.Length;   // correct index for wrap-around
            if (string.IsNullOrEmpty(intData[indx]))
                return indx;                                     // indx = next empty slot
        }
        return -1;                                               // array is full
    }

18 Open addressing
Quadratic probing
Advantages: tries to avoid primary clustering.
Disadvantages: the offset from the original slot grows quadratically, and keys that hash to the same slot still follow the same probe sequence (secondary clustering).

19 Open addressing
Quadratic probing (offsets of 1², 2², 3², 4², 5², … from the original slot)
    private static long findNext(string[] intData, long intCurrIndx)
    {
        // Probe intCurrIndx + 1, + 4, + 9, ... and give up after Length attempts.
        for (long i = 1; i <= intData.Length; i++)
        {
            long indx = (intCurrIndx + i * i) % intData.Length;  // correct index for wrap-around
            if (string.IsNullOrEmpty(intData[indx]))
                return indx;                                     // indx = next empty slot
        }
        return -1;                                               // no empty slot on this probe sequence
    }

20 Open addressing
Double hashing
Advantages: eliminates both primary and secondary clustering.
Disadvantages: requires coding two hashing algorithms.

21 Open addressing
Double hashing
    private static long findNext(string[] intData, long lngCurrIndx, long lngHashConst = 5)
    {
        // Step size in 1..lngHashConst. It is derived here from the current index;
        // a secondary hash of the key itself would go in its place.
        long step = lngHashConst - (lngCurrIndx % lngHashConst);
        long count = 0;                            // elements looked at
        for (long indx = lngCurrIndx; ++count <= intData.Length; indx += step)
        {
            indx %= intData.Length;                // correct index for wrap-around
            if (string.IsNullOrEmpty(intData[indx]))
                return indx;                       // indx = next empty slot
        }
        return -1;                                 // no empty slot found
    }

22 Collisions!!
Separate chaining: each element of the array is a list, and for every collision an item is added to that list.
Advantages: no need for additional coding or handling of collisions.
Disadvantages: additional resources are required for each item, and storage must be dynamically allocated.

23 Separate Chaining
Hash table definition
    ArrayList[] strHashTab = new ArrayList[37];
    for (int i = 0; i < 37; i++)
        strHashTab[i] = new ArrayList();
Adding to hash table
    foreach (string x in strInputData)
    {
        intHashIndx = hashCalc(x, strHashTab.Length);
        strHashTab[intHashIndx].Add(x);
    }
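Looking an item up in a chained table follows the same path as adding it: hash the key, then scan only that one bucket. The sketch below is an assumption-laden illustration. It uses `List<string>` in place of `ArrayList`, a positional hash with an arbitrary multiplier of 31, and the hypothetical names `ChainedLookup`, `Build`, and `Contains`.

```csharp
using System.Collections.Generic;

public static class ChainedLookup
{
    // Positional hash; the multiplier 31 is an assumption made for this sketch.
    public static int Hash(string key, int tableSize)
    {
        long h = 0;
        foreach (char ch in key) h = (h * 31 + ch) % tableSize;
        return (int)h;
    }

    // Creates one empty list per slot, then appends each item to its bucket.
    public static List<string>[] Build(IEnumerable<string> items, int tableSize)
    {
        var table = new List<string>[tableSize];
        for (int i = 0; i < tableSize; i++) table[i] = new List<string>();
        foreach (string item in items) table[Hash(item, tableSize)].Add(item);
        return table;
    }

    // Only the bucket the key hashes to has to be scanned.
    public static bool Contains(List<string>[] table, string key)
    {
        return table[Hash(key, table.Length)].Contains(key);
    }
}
```

The cost of `Contains` is the length of a single chain, which is why probe lengths in separate chaining track the load factor.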

24 Separate Chaining
Displaying hash table
    // Requires System.Collections and System.Windows.Forms.
    private static void showHashTable(ArrayList[] strShowTable)
    {
        int i = 0;
        string strMsgText = "";
        foreach (ArrayList mySL in strShowTable)
        {
            for (int j = 0; j < mySL.Count; j++)
                strMsgText += String.Format(
                    "Index= {0:00} Item={1:00} Value = {2}\n", i, j, mySL[j]);
            i++;
        }
        MessageBox.Show(strMsgText, "Hash Table",
            MessageBoxButtons.OK, MessageBoxIcon.Information);
    }

25 Efficiency of the big 3
[Chart: average probe length versus load factor for linear probing, quadratic probing, and separate chaining]

26 Summary
A hash table is based on an array.
The range of key values is usually greater than the size of the array.
A key value is hashed to an array index by a hash function.
An English-language dictionary is a typical example of a database that can be efficiently handled with a hash table.
The hashing of a key to an already-filled array cell is called a collision.

27 Summary
Collisions can be handled in two major ways: open addressing and separate chaining.
Open addressing: items hashing to a full slot are placed in another slot. There are three types:
Linear probing: the step size is always 1. If the hash index is x, the probe goes to x, x+1, x+2, x+3, …
Quadratic probing: the offset from x is the square of the step number, so the probe goes to x, x+1, x+4, x+9, x+16, …
Double hashing: the step size depends on the key and is obtained from a secondary hash function, so the probe goes to x, x+s, x+2s, x+3s, x+4s, …
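The three probe sequences can be generated side by side to see how they differ. This is a minimal sketch: the class `ProbeSequences` and its method names are hypothetical, and the step `s` is passed in directly rather than computed by a real secondary hash function.

```csharp
using System.Collections.Generic;

public static class ProbeSequences
{
    // Linear probing: x, x+1, x+2, ... (mod tableSize)
    public static List<int> Linear(int x, int tableSize, int count)
    {
        var seq = new List<int>();
        for (int i = 0; i < count; i++) seq.Add((x + i) % tableSize);
        return seq;
    }

    // Quadratic probing: x, x+1, x+4, x+9, ... (mod tableSize)
    public static List<int> Quadratic(int x, int tableSize, int count)
    {
        var seq = new List<int>();
        for (int i = 0; i < count; i++) seq.Add((x + i * i) % tableSize);
        return seq;
    }

    // Double hashing: x, x+s, x+2s, ... where s would come from a second hash of the key.
    public static List<int> DoubleHash(int x, int s, int tableSize, int count)
    {
        var seq = new List<int>();
        for (int i = 0; i < count; i++) seq.Add((x + i * s) % tableSize);
        return seq;
    }
}
```

With x = 5 in an 11-slot table, the first four probes are 5, 6, 7, 8 for linear probing, 5, 6, 9, 3 for quadratic probing, and 5, 8, 0, 3 for double hashing with s = 3.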

28 Summary
Open addressing
Probe length is the number of steps required to find a specified item.
Primary clusters occur when linear probing fills contiguous sequences of cells. These reduce performance.
Quadratic probing eliminates primary clustering but suffers from the less severe secondary clustering.
Secondary clustering occurs when keys hash to the same value and follow the same sequence during a probe.

29 Summary
Separate chaining: each array element is a linked list, and items hashing to the same index are inserted in that list.

30 Summary
The number of steps required to find a specified item is called the probe length.
In linear probing, contiguous sequences of filled cells appear. They are called primary clusters, and they reduce performance.
Quadratic probing eliminates primary clustering but suffers from the less severe secondary clustering.
Secondary clustering occurs because all the keys that hash to the same value follow the same sequence of steps during a probe.

31 Summary
All keys that hash to the same value follow the same probe sequence because the step size does not depend on the key, but only on the hash value.
If the secondary hash function returns a value s in double hashing, the probe goes to x, x+s, x+2s, x+3s, x+4s, and so on, where s depends on the key but remains constant during the probe.
The load factor is the ratio of data items in a hash table to the array size.
The maximum load factor in open addressing should be around 0.5. For double hashing at this load factor, searches will have an average probe length of 2.

32 Summary
Search times go to infinity as load factors approach 1.0 in open addressing. It's crucial that an open-addressing hash table does not become too full.
A load factor of 1.0 is appropriate for separate chaining. At this load factor a successful search has an average probe length of 1.5, and an unsuccessful search, 2.0.
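The probe-length figures quoted in the summary match the usual textbook approximations, which the slides do not state explicitly. The sketch below assumes those formulas: 1 + a/2 and 1 + a for separate chaining, and 1 / (1 - a) for an unsuccessful search with double hashing, where a is the load factor. `LoadFactorMath` and its method names are hypothetical.

```csharp
public static class LoadFactorMath
{
    // Load factor = number of stored items / number of array slots.
    public static double LoadFactor(int itemCount, int tableSize)
    {
        return (double)itemCount / tableSize;
    }

    // Separate chaining (assumed approximations): the chain holds a items on average.
    public static double ChainingSuccessful(double a) { return 1 + a / 2; }
    public static double ChainingUnsuccessful(double a) { return 1 + a; }

    // Double hashing, unsuccessful search (assumed approximation);
    // the value grows without bound as a approaches 1.
    public static double DoubleHashingUnsuccessful(double a) { return 1 / (1 - a); }
}
```

Under these assumptions a load factor of 1.0 gives 1.5 and 2.0 for chaining, and a load factor of 0.5 gives 2 for double hashing, while 0.9 already gives about 10 and 0.99 about 100.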

33 Summary
Probe lengths in separate chaining increase linearly with load factor.
A string can be hashed by multiplying each character by a different power of a constant, adding the products, and using the modulo operator (%) to reduce the result to the size of the hash table.
To avoid overflow, we can apply the modulo operator at each step in the process, if the polynomial is expressed using Horner's method.
Hash table sizes should generally be prime numbers. This is especially important in quadratic probing and separate chaining.
Hash tables can be used for external storage. One way to do this is to have the elements in the hash table contain disk-file block numbers.
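The string-hashing point can be made concrete by computing the same polynomial two ways: directly, with arbitrary-precision arithmetic so nothing overflows, and with Horner's method, reducing modulo the table size at every step. The names `HornerHash`, `Direct`, and `Horner` are hypothetical, and the base is left as a parameter because the slides do not fix a constant.

```csharp
using System.Numerics;

public static class HornerHash
{
    // Direct form: sum of key[i] * base^(n-1-i), reduced once at the end.
    // BigInteger keeps the intermediate sum exact however long the key is.
    public static long Direct(string key, int baseValue, int tableSize)
    {
        BigInteger sum = BigInteger.Zero;
        for (int i = 0; i < key.Length; i++)
            sum += new BigInteger((int)key[i]) * BigInteger.Pow(baseValue, key.Length - 1 - i);
        return (long)(sum % tableSize);
    }

    // Horner's method: ((c0 * b + c1) * b + c2) ... with a modulo after every step,
    // so the running value never exceeds roughly tableSize * baseValue.
    public static long Horner(string key, int baseValue, int tableSize)
    {
        long h = 0;
        foreach (char ch in key)
            h = (h * baseValue + ch) % tableSize;
        return h;
    }
}
```

Both methods return the same slot for any key, but only the Horner form stays within ordinary integer range for long strings.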
