# The ADT Hash Table What is a table?


A table is a collection of items, each including several pieces of information. One of these pieces of information is a search key. A table is another example of an ADT whose insertion, deletion and retrieval of items is made by value and not by position.

| City | Country | Population |
| --- | --- | --- |
| Cairo | Egypt | |
| London | England | |
| Paris | France | |
| Rome | Italy | |

- "City" is the search key.
- Items may be in order with respect to the search key.
- Items may or may not have the same search key.
- Retrieval of items is efficient if it is based on the search key value, e.g. "Retrieve the population of London".

So far we have seen some examples of ADTs where the insertion, deletion and retrieval of items is performed by value and not by position. They include ordered linked lists, binary search trees, AVL trees and heaps. Search in an ordered linked list is performed sequentially, search in a binary search tree uses a binary search approach, and search in a heap is restricted to the root of the tree. In this lecture we are going to see another type of ADT, called a hash table, which also allows insertion, deletion and retrieval by value. In this type of ADT, search is performed via hash functions. We will first briefly introduce what a table is, and then give the definition of a hash table as a particular type of table, together with examples of hash functions.

A table, also called a dictionary or map, is a collection of entries, where each entry has two parts: a keyword (or search key) and a value associated with that key. For example, in the table given above each entry has City as its keyword and Country and Population as its value. The search key makes insertion, retrieval and deletion of a particular item more efficient. In the small example on this slide, each item includes three main attributes. We can assume that City is the search key for each item in the table, and that all the items are stored in alphabetical order with respect to this search key. Suppose now that we want to get the following information: "the population of London". We could scan the column City, look for London, and then return the value of the population for this item. Since the items in the table are in order with respect to City, we could find the item with city London more efficiently by using binary search: start in the middle and then move to the appropriate half of the table.

Entries in a dictionary can either have distinct (unique) search keys or share the same search key across different entry values. For example, an English dictionary has different entries with the same key, because the same word can have different meanings, whereas a dictionary of student records has a single entry for each key when the key is the student identification number, as this is unique for each student. Hash tables are particular types of dictionaries that allow more efficient search and retrieval.

Hash Tables
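The binary search described in the notes can be sketched in Java as follows. This is only an illustration: the class name `CityTable`, the `Entry` type and the method `countryOf` are assumptions made for the example, and because the population figures are not given in the transcript the sketch retrieves the country instead.

```java
class CityTable {
    // One table item: the search key (city) and its associated value (country).
    static final class Entry {
        final String city;
        final String country;

        Entry(String city, String country) {
            this.city = city;
            this.country = country;
        }
    }

    // Items are kept in alphabetical order of the search key, as in the example table.
    static final Entry[] TABLE = {
        new Entry("Cairo", "Egypt"),
        new Entry("London", "England"),
        new Entry("Paris", "France"),
        new Entry("Rome", "Italy")
    };

    // Binary search by city: start in the middle, then move to the appropriate half.
    // Returns the country, or null if the city is not in the table.
    static String countryOf(String city) {
        int low = 0;
        int high = TABLE.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = city.compareTo(TABLE[mid].city);
            if (cmp == 0) {
                return TABLE[mid].country;
            } else if (cmp < 0) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return null;
    }
}
```

A call such as `CityTable.countryOf("London")` inspects at most about log2(n) entries, which is the cost that hashing tries to improve on.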

Access Procedures for Tables
createTable( ) // post: Creates an empty table.
isEmpty( ) // post: Determines whether a table is empty.
getSize( ) // post: Returns the number of entries in the table.
add(key, newElem) // post: Adds the pair (key, newElem) to the table in its proper order according to its key.
remove(key) // post: Removes from the table the entry with the given key. Returns either the value associated with the given key or null if no such entry exists.
getValue(key) // post: Retrieves the value that corresponds to a given key from a table. Returns null if no such entry exists in the table.

These are examples of basic access procedures that we could implement when we define an ADT table. Note that in this lecture we will see the implementation details of these access procedures only for a particular type of table, namely hash tables. Different types of data structures can be used to implement an ADT table. Among these, the most relevant are sorted arrays, sorted linked lists and binary search trees. But is there a way of inserting, searching and deleting data that is more efficient than the approaches seen so far?

The choice of data structure and the implementation of the above access procedures depend on whether we need tables with duplicate keys or with distinct (unique) keys. With duplicate keys, the implementation of add is simpler but the implementation of remove is more complex. What should remove do in case of duplicates: remove the first entry with the given key, or remove all the entries with that key? Similarly, what should getValue(key) do: return the first entry found with the given key, or return a list of all the entries with that key? With distinct keys, it is the add operation that needs to change. Adding a new entry with a key that already exists could either replace the value currently associated with the key with the new value, or throw an exception saying that the key already exists. So we could in principle define different implementations of a table depending on these assumptions. For simplicity, we assume here that keys are distinct.

For each type of implementation (e.g. sorted linked lists as a dynamic implementation, or array-based or vector-based static implementations) we could opt for different underlying data structures. For instance, in the case of a sorted linked list, a node could include a reference to an entry object, which holds the key and the value; or we could use two parallel chains of nodes, one referring to keys and the other to values; or a node could include three reference variables: key, value and next.
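As a rough illustration, the access procedures listed above could be written as a Java interface, with a small sorted-list implementation to show the distinct-key behaviour. The names `Table` and `SortedListTable` are assumptions for this sketch, and so is the decision that adding an existing key replaces its value (the notes also mention throwing an exception as an alternative).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// The access procedures of the ADT table; createTable() corresponds to a constructor.
interface Table<K, V> {
    boolean isEmpty();
    int getSize();
    void add(K key, V newElem);   // assumed here: an existing key has its value replaced
    V remove(K key);              // returns the removed value, or null if absent
    V getValue(K key);            // returns the value, or null if absent
}

// A simple static-style implementation that keeps entries sorted by key.
class SortedListTable<K extends Comparable<K>, V> implements Table<K, V> {
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    public boolean isEmpty() { return keys.isEmpty(); }

    public int getSize() { return keys.size(); }

    public void add(K key, V newElem) {
        int i = Collections.binarySearch(keys, key);
        if (i >= 0) {
            values.set(i, newElem);            // distinct keys: replace the old value
        } else {
            int insertAt = -(i + 1);           // position that keeps the keys sorted
            keys.add(insertAt, key);
            values.add(insertAt, newElem);
        }
    }

    public V remove(K key) {
        int i = Collections.binarySearch(keys, key);
        if (i < 0) return null;
        keys.remove(i);
        return values.remove(i);
    }

    public V getValue(K key) {
        int i = Collections.binarySearch(keys, key);
        return i >= 0 ? values.get(i) : null;
    }
}
```

Here retrieval costs O(log n) because of the binary search, while insertion and removal may shift elements; this is the baseline against which the hash table is compared.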

Hashing: the basic idea
Binary search tree: a particular type of binary tree that enables easy search of specific items. But:
- the efficiency of the search method depends on the form of the binary search tree,
- in the best case scenario (balanced trees), for a tree with 10,000 items the operations "retrieve", "insert" and "delete" would still require about log2 10,000 ≈ 13 steps.

Is there a more efficient way of storing, searching and retrieving data?

The add operation:

    add(searchKey, newElem){
        i = the array index that the address calculator gives us for the searchKey of newElem
        table[i] = newElem
    }

[Figure: an element (or its search key) is passed to the address calculator, which selects one of the array locations 1, 2, 3, 4, 5, ...]

The method "add" is of the order O(1) (it requires constant time).

In the previous lectures we introduced the binary search tree abstract data type as a structure for storing and retrieving information more efficiently than an ordinary binary tree would allow. The efficiency is due to the ordering on the values of the nodes that defines a binary search tree. Because of this ordering, the operations retrieve, insert and delete can all be performed within the order of log2(n) in a balanced tree, where n is the number of nodes in the tree. But for large n, the value of log2(n) could still be too big for some particular applications. Consider, for instance, the 999 emergency system, which detects the telephone number of an emergency call and has to search a directory database for the caller's address. In situations like this we definitely need a very fast, virtually instantaneous way of searching data.

The main question we are addressing in this lecture is whether there is a way of inserting, searching and deleting data that is more efficient than the approaches seen so far. The answer is yes. It consists of a completely new strategy for locating (and inserting or deleting) an item almost instantaneously, called "hashing". Tables that use this strategy are called hash tables.

Imagine an array "table" of n items, with each array slot capable of holding a single table element, and a seemingly magical box called an "address calculator". Whenever we have a new element that we want to insert in the table, the address calculator tells us where we should place it in the array. That is, it returns the array index of the location in which the element should be stored. The add operation then looks like the pseudocode sketched in this slide. The advantage of this strategy is that the insertion operation is virtually of the order O(1), assuming that the time spent calculating the address is small. This means that it requires a constant time that is the same regardless of the size of the table. Similarly, the retrieve time is of the order O(1).

In principle you could think of a very simple hash function, given by the identity function. This means using as the key of an entry the array index at which the entry is stored. In this case insertion and retrieval are very efficient (O(1)), but this is not realistic when the range of possible keys is large. We need to think of clever functions that are not too difficult to compute and still allow us to identify unambiguously the array index at which the record is stored. The same address calculator can be used for the "getValue" and "remove" operations, as illustrated in the next slide.

The retrieve operation and the remove operation
    getValue(searchKey)
    // post: Returns the element that has a matching searchKey.
    // post: If not found, it returns null.
        i = the array index that the address calculator gives us for the element whose search key is equal to searchKey
        if (table[i].getKey( ) equals searchKey)
            return table[i].getValue( );
        else
            return null;

    remove(searchKey)
    // post: Deletes the element that has a matching searchKey and returns true;
    // post: otherwise returns false.
        i = the array index that the address calculator gives us for the given searchKey
        if (table[i].getKey( ) equals searchKey){
            delete element from table[i];
            return true;
        }
        else
            return false;

getValue(key) and remove(key) are also operations of the order O(1) (they require constant time).

If we have such an ideal mechanism for computing the position in an array for a given search key, we can also provide a very efficient implementation of the operations getValue(searchKey) and remove(searchKey), as shown in this slide. We simply ask the address calculator to tell us where it would insert such an element. Because we would have inserted the element earlier using the add procedure given in the previous slide, if the desired element is present in the table it will be in the array location "i" specified by the address calculator. The same holds for the remove operation, where in addition we have to delete the element.

Looking at these three operations, it appears that we can perform getValue(searchKey), add(searchKey, newElem) and remove(searchKey) virtually instantaneously. We never have to search for an element through the data structure. Instead we simply let the address calculator determine where the element should be. The amount of time required to carry out these operations is of the order O(1) and depends only on how quickly the address calculator can perform its computation. The address calculator is usually referred to as the hash function, the approach is known as hashing, and the array table is called a hash table. In the next slide we summarise this concept with a definition.
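The three operations built around an address calculator can be sketched in Java as below. The class `AddressCalculatorTable`, its `Entry` type and the use of `IntUnaryOperator` to stand in for the "magical box" are all assumptions made for illustration. Like the slide pseudocode, the sketch does no collision handling, so it is only correct when the calculator never sends two different keys to the same index; unlike the pseudocode, it checks for an empty slot before comparing keys.

```java
import java.util.function.IntUnaryOperator;

class AddressCalculatorTable {
    // A stored element together with its search key.
    private static final class Entry {
        final int key;
        final String value;

        Entry(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Entry[] table;
    private final IntUnaryOperator addressCalculator; // maps a search key to an array index

    AddressCalculatorTable(int tableSize, IntUnaryOperator addressCalculator) {
        this.table = new Entry[tableSize];
        this.addressCalculator = addressCalculator;
    }

    // O(1): one address computation and one array write.
    void add(int searchKey, String newElem) {
        int i = addressCalculator.applyAsInt(searchKey);
        table[i] = new Entry(searchKey, newElem);
    }

    // O(1): returns the element with a matching key, or null if not found.
    String getValue(int searchKey) {
        int i = addressCalculator.applyAsInt(searchKey);
        Entry e = table[i];
        return (e != null && e.key == searchKey) ? e.value : null;
    }

    // O(1): deletes the element with a matching key and returns true; otherwise false.
    boolean remove(int searchKey) {
        int i = addressCalculator.applyAsInt(searchKey);
        Entry e = table[i];
        if (e != null && e.key == searchKey) {
            table[i] = null;
            return true;
        }
        return false;
    }
}
```

For example, `new AddressCalculatorTable(10, k -> k % 10)` uses the remainder modulo 10 as a (far from perfect) address calculator for small non-negative keys.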

Definition
The ADT Hash Table is an array (i.e. the array table) of elements (possibly associated with a search key unique for each element), together with a hash function and access procedures. The access procedures include insertion, deletion and retrieval of an element by means of the hash function. A hash table can also be empty, in which case no element is stored in the array table.

The hash function determines the location in the table of a new element, using its value (or its search key, if any). In the same way, it makes it possible to locate the position of an existing element. The hash function takes a search key and maps it into an integer array index. A perfect hash function is an ideal function that maps each search key into a unique array index.

Understanding how Hash Functions work
Example: the table is a directory database with fewer than 10,000,000 people, each person with his or her telephone number as search key. The telephone number is of type int.
- Store the person with number … in table[…]: 10,000,000 memory locations to spare!
- Numbers are regional. Store the person with number … in table[4567]: 10,000 memory locations to spare!

Hash function: … → 4567. It takes a value of the search key and transforms it into an integer used as array index value.

Note: the above is an example of a perfect hash function.

To understand how a hash function works, we can consider again the 999 emergency number system mentioned earlier. If for each person the system had a search key given by the person's telephone number, we could store the record of a person whose telephone number is num in the array location table[num]. This approach would be fine if we were prepared to spare millions of memory locations for the array table. However, we could use the fact that telephone numbers are regional, and therefore decide that if a person's number was … we would store the person's record at location table[4567]. This would require an array table of only 10,000 elements. This transformation of … into the array index 4567 is a simple example of a hash function.

In this case a hash function "h" must take an arbitrary integer x and map it into an integer that we can use as an array index. In our example the index is an integer in the range 0 to 9999, and the hash function is h(x) = i, where x is any integer and i is an integer in the range 0 to 9999. Having a hash function allows us to implement the step "i = the array index that the address calculator gives us for the element whose search key is equal to searchKey" simply as i = h(searchKey).

Note that in this example we have the benefit of knowing a priori all the possible values that our search key can assume. Even when we add a new person with a new number, we know which telephone numbers are available (i.e. unallocated), and this person will have a search key value given by one of these numbers.
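One plausible reading of this example, offered only as an assumption since the slide's actual telephone number is not preserved, is that the hash function keeps the last four digits of the number. A minimal Java sketch of that reading (the class name `PhoneHash` and the sample number used in the comment are hypothetical):

```java
class PhoneHash {
    // 10,000 array locations, indexed 0 to 9999, as in the example.
    static final int TABLE_SIZE = 10_000;

    // h(x) = i with 0 <= i <= 9999: keeps the last four decimal digits of the number.
    // Assumed reading of the slide; e.g. the hypothetical number 5554567 maps to 4567.
    static int h(int phoneNumber) {
        return Math.floorMod(phoneNumber, TABLE_SIZE);
    }
}
```

With this definition the step `i = h(searchKey)` replaces the address calculator. The function is only perfect if no two numbers in the region share their last four digits, which is the a priori knowledge the notes rely on.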

The collision problem
A perfect hash function is a hash function that maps each search key into a unique array index. It is possible if we know in advance all the possible search key values.

In practice, we don't know all possible search key values. A hash function can map two or more search keys into the same integer array index: x ≠ y and h(x) = h(y) = i.

A collision occurs when two or more elements with search keys x and y are told by the hash function to be stored in the same array location table[i], where i = h(x) = h(y). The two search keys x and y are said to have collided.

A way of solving collisions is to provide appropriate collision-resolution schemes.

Basic requirements for a "good" hash function:
- be easy and fast to compute
- place elements evenly throughout the hash table (i.e. minimise collisions)

The previous example is relatively simple because we know a priori all the possible telephone numbers, those assigned and those available for a given region. In this case it is possible, as we just did, to define a perfect hash function, i.e. a hash function that assigns a unique array index to each search key value. In practice, we don't know the possible values of the search keys. For instance, consider an air traffic control system that keeps a table of all current flights, using the flight number as search key. Suppose that the flight number also has four digits, e.g. flight number 4567. Not knowing a priori which flights will be current, the system should in principle have an array of 10,000 locations, even if only 50 flights are allowed to be current at any time. The advantage of a hash function is that we could decide to have an array table of only 101 locations (i.e. with index ranging from 0 to 100) and use a hash function that maps any four-digit flight number into an integer in the range 0 through 100.

But can this be done without problems? Is hashing really as good as it sounds? Ideally, we want the hash function to map each element's search key x into a unique integer i. But in practice, when we don't know the possible values of the search keys, as in the air traffic control system, a hash function can well map two or more search keys x and y into the same integer. That is, the hash function can tell us to store two different elements in the same array table location, causing what is known as a collision. This can happen even when only a few elements are stored in the array table and the array still has plenty of empty slots for new elements. Collisions are therefore the main problem associated with hash tables.

One possible way of avoiding collisions is to make the array table big enough to provide an empty location for each possible search key value. This would again mean allocating a vast amount of memory that is not necessarily used at the same time. A better solution is to provide a collision-resolution scheme. These schemes usually require that the hash function places items evenly throughout the hash table. We will see why when we present some of these schemes.
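To make the collision problem concrete, the following Java sketch assumes the flight table of 101 locations and, as one possible choice of hash function, the flight number modulo 101. The class `CollisionDemo` and its method names are illustrative assumptions, not part of the lecture.

```java
import java.util.ArrayList;
import java.util.List;

class CollisionDemo {
    static final int TABLE_SIZE = 101; // indices 0 to 100

    // Assumed hash function: maps a four-digit flight number into 0..100.
    static int h(int flightNumber) {
        return flightNumber % TABLE_SIZE;
    }

    // Returns the keys that hash to a location already claimed by an earlier key.
    static List<Integer> collidingKeys(int[] keys) {
        boolean[] taken = new boolean[TABLE_SIZE];
        List<Integer> collided = new ArrayList<>();
        for (int key : keys) {
            int i = h(key);
            if (taken[i]) {
                collided.add(key);   // x != y but h(x) = h(y) = i
            } else {
                taken[i] = true;
            }
        }
        return collided;
    }
}
```

With only three flights, 4567, 1000 and 7597, the first and the last already collide (both give index 22) even though 98 locations are still empty.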

Examples of Hash Functions (1)
Assume hash functions have integers as search keys.

Selecting digits: given a search key number composed of a certain number of digits, the hash function picks digits at specific places in the search key number, e.g. h(…) = 35 (select the fourth and the last digit).
- Simple and fast.
- Generally does not distribute the elements evenly in the hash table.

Folding: given a search key number, the function defines the index by adding up all the digits in the search key, e.g. h(…) = … (add the digits). Or by first grouping the digits and then adding the groups, e.g. h(…) = … (group the digits and add them up).

Note: you can apply more than one hash function to a search key.

It is sufficient to consider arbitrary numbers as arguments of the hash function. Search keys that are not integers can be transformed into integers before applying the hash function. Given that the result of the hash function has to lie within an integer range (the index range of the array table), what we need are ways of converting an arbitrary integer into an integer within a given range. There are many different ways to do so, and this slide lists some common transformations, or hash functions.

The first is selecting digits at specific positions in the given integer. For instance, given a search key …, we could use the hash function that takes the fourth and last digits of this number, mapping it into the integer 35. But this type of mapping does not give an even distribution of the elements over the array table. In fact it causes many collisions.

Another example of a hash function is called folding. This consists of taking an arbitrary integer and adding all its digits to define its associated index. An example is given in the slide. This, however, causes lots of collisions, because for big integer search key values the addition of digits gives only a limited range of indexes. For instance, if the search key value is an integer with 9 digits, the hash function satisfies 0 ≤ h(x) ≤ 81. An alternative is to first group the digits, creating intermediate numbers, and then add these intermediate numbers together, as shown in the slide. In this case, for the same type of search key grouped three digits at a time, the hash function ranges over 0 ≤ h(x) ≤ 3 * 999 = 2997. Of course, if 2997 is larger than the size of the table we want, we can still apply more than one hash function to the same search key value. For the value 2997, for instance, we could apply folding again, grouping the digits in pairs, and get 29 + 97 = 126.
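The digit-based hash functions above can be sketched in Java as follows. The class `DigitHashes` and its method names are assumptions for illustration, and the sample key 001364825 is not taken from the slide: it is a nine-digit key chosen so that its fourth and last digits give 35, the value mentioned in the notes. Keys are assumed to be non-negative and to have at most nine digits.

```java
class DigitHashes {
    // Selecting digits: builds the index from the 4th and the last digit
    // of the key written with nine digits (leading zeros included).
    static int selectDigits(int key) {
        String digits = String.format("%09d", key);
        return (digits.charAt(3) - '0') * 10 + (digits.charAt(8) - '0');
    }

    // Folding: adds up all the decimal digits, so 0 <= result <= 81 for nine digits.
    static int foldDigits(int key) {
        int sum = 0;
        for (int k = key; k > 0; k /= 10) {
            sum += k % 10;
        }
        return sum;
    }

    // Folding with grouping: adds groups of three digits,
    // so 0 <= result <= 3 * 999 = 2997 for nine digits.
    static int foldGroups(int key) {
        int sum = 0;
        for (int k = key; k > 0; k /= 1000) {
            sum += k % 1000;
        }
        return sum;
    }
}
```

For the assumed key 001364825, `selectDigits` gives 35, `foldDigits` gives 0+0+1+3+6+4+8+2+5 = 29 and `foldGroups` gives 001 + 364 + 825 = 1190. The narrow range of `foldDigits` is what makes collisions so frequent for that variant.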

Examples of Hash Functions (2)
Modulo arithmetic: given a search key number, the function defines the index as the remainder of the search key value divided by some fixed number, e.g. h(…) = … mod tableSize.
- We can get lots of collisions.
- We can distribute the elements in the table more evenly if tableSize is prime.

Converting a character string to an integer: if the search key is a string, we can first convert it into an integer and then apply the hash function, e.g. h("NOTE") = …, using the ASCII values of the letters, and h("TONE") = …. We can try different ways of converting strings into numbers to get better hash function results.

Alternatively we could use modulo arithmetic as the hash function. In this case, to have fewer collisions we should have a reasonably large array table and tableSize. This method places elements in the table more evenly if the fixed number is a prime number. It also spreads elements in the table more evenly than the previous methods do, so it is often preferable.

Finally, if our search key is not an integer but a character string, we can try different methods of converting the string into an integer, so as to get better results once we apply the hash function to the generated integers. For instance, given the search key "NOTE" we could transform each character into a number and then add the numbers up; or consider the position each letter has in the alphabet and add the resulting numbers up; or take the position each letter has in the alphabet, transform the positions into binary numbers, concatenate them into one long binary number, take its decimal value and then apply the modulo arithmetic hash function. The last option gives fewer collisions.

To summarise, the basic requirements to take into account when defining a hash function are:
1) fast computation of the hash value,
2) uniform distribution of hash values across the table (or array),
3) not many different search keys should hash to the same array index.

However, given that the size of the table cannot be too big, even when we are able to define a hash function that fulfils the above criteria there will still be collisions that need to be resolved. This is because it is not always possible to know a priori all the possible values of the search keys.
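The string conversions described in the notes can be compared with a short Java sketch. Everything named here (`StringHashes`, `asciiSum`, `packLetters`, the table size of 101 and the 5 bits per letter) is an assumption chosen for illustration; the values the sketch produces are not claimed to be the ones on the original slide.

```java
class StringHashes {
    static final int TABLE_SIZE = 101; // a prime table size, assumed for the example

    // Conversion 1: add up the character codes (ASCII values for plain letters).
    static int asciiSum(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;
        }
        return sum;
    }

    // Conversion 2: take each upper-case letter's position in the alphabet (A=1 .. Z=26),
    // write it as a 5-bit binary number and concatenate the bits into one integer.
    // A long holds at most 12 letters this way.
    static long packLetters(String s) {
        long packed = 0;
        for (char c : s.toCharArray()) {
            packed = (packed << 5) | (c - 'A' + 1);
        }
        return packed;
    }

    // Modulo arithmetic applied to each conversion.
    static int hashBySum(String s) {
        return asciiSum(s) % TABLE_SIZE;
    }

    static int hashByPacking(String s) {
        return (int) (packLetters(s) % TABLE_SIZE);
    }
}
```

Summing ignores letter order, so "NOTE" and "TONE" both sum to 310 and collide at index 7. The concatenation keeps the order: "NOTE" becomes 474757 (index 57) and "TONE" becomes 671173 (index 28), which illustrates why the notes say this conversion gives fewer collisions.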

Collision-resolution schemes
Two main approaches:
1. Assign another location within the hash table to the new collided element.
2. Change the structure of the hash table: each table[i] can accommodate more than one element.

1. Open addressing schemes: in case of collision, probe for some other empty location to place the element in. The probe sequence of locations used by the add procedure has to be efficiently reproducible by the delete and retrieve procedures.

Linear probing: with i = 7597 mod 101 = 22, the colliding elements 7597, 4567, 0628 and 3658 are stored at locations 22 (i), 23 (i+1), 24 (i+2) and 25 (i+3).
- Table locations have to be defined to be in one of three states: empty, deleted, occupied; otherwise, after deletion, the retrieve operation might stop prematurely.
- Elements tend to cluster together. Parts of the table might be too dense and others relatively empty, making the hashing less efficient.

Two general approaches to collision resolution are common. The first, called open addressing, assigns another location within the hash table to the new element. The second consists of restructuring the hash table so that each location table[i] can accommodate more than one element.

In open addressing, when there is a collision, the idea is to probe for some other empty location. The sequence of locations that we examine is called the probe sequence. The probe sequence used by the insert procedure has to be defined so that the retrieve and delete operations can easily reproduce it in order to locate existing items. One example is linear probing. In this case, to resolve a collision, the hash table is searched sequentially for an empty location, starting from the original hash location. Starting, for instance, from table[h(searchKey)], which is already occupied, we check table[h(searchKey)+1], table[h(searchKey)+2], etc. The retrieve procedure behaves in the same way: compute the hash function and continue searching for the element in the locations after that. We might need to wrap the index around to continue from the beginning of the table. If the procedure reaches an empty location without finding the element, the result is null.

But what happens when we delete an element and then want to retrieve another one that has the same hash value? The deletion would make one of the locations in the probe sequence empty. The retrieve operation might then stop prematurely when searching for an element, because it encounters an empty location even though the probe sequence continues after it. A solution is to define each location of the table to be either "empty", "occupied" or "deleted". This solves the problem with the retrieve procedure. However, linear probing still tends to create large clusters of elements in the table, making hashing less efficient.
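A minimal Java sketch of linear probing with the three location states is given below. The class `LinearProbingTable`, the use of int keys with String values, and the choice of key mod table size as the hash function are assumptions for illustration; re-adding an existing key is assumed to replace its value.

```java
import java.util.Arrays;

class LinearProbingTable {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final String[] values;
    private final State[] states;

    LinearProbingTable(int tableSize) {
        keys = new int[tableSize];
        values = new String[tableSize];
        states = new State[tableSize];
        Arrays.fill(states, State.EMPTY);
    }

    private int h(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Follows the probe sequence h, h+1, h+2, ... (wrapping around).
    // DELETED locations are skipped; only an EMPTY location ends the search.
    int indexOf(int key) {
        int i = h(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (states[i] == State.EMPTY) return -1;
            if (states[i] == State.OCCUPIED && keys[i] == key) return i;
            i = (i + 1) % keys.length;
        }
        return -1;
    }

    // Stores the pair in the first non-occupied location of the probe sequence.
    // Returns false only if the table is full.
    boolean add(int key, String value) {
        int existing = indexOf(key);
        if (existing >= 0) {
            values[existing] = value;
            return true;
        }
        int i = h(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (states[i] != State.OCCUPIED) {
                keys[i] = key;
                values[i] = value;
                states[i] = State.OCCUPIED;
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;
    }

    String getValue(int key) {
        int i = indexOf(key);
        return i >= 0 ? values[i] : null;
    }

    // Marks the location as DELETED rather than EMPTY, so later retrievals
    // of keys further along the same probe sequence do not stop prematurely.
    boolean remove(int key) {
        int i = indexOf(key);
        if (i < 0) return false;
        states[i] = State.DELETED;
        values[i] = null;
        return true;
    }
}
```

With a table of size 101, adding 7597, 4567, 628 and 3658 fills locations 22 to 25 as in the slide. After removing 4567, a lookup of 628 still succeeds because location 23 is marked deleted instead of empty.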

Double hashing
The probe sequence is not sequential, but defined using the given search key. It uses the hash function "h" to calculate the initial index, and a second function "h'" to calculate the size of the probing step, using the same search key. The function h' has the following properties:
- h'(key) ≠ 0
- h' ≠ h

Example: h(key) = key mod 11, h'(key) = 7 - (key mod 7).
- 58: h(58) = 3, stored at table[3].
- 14: h(14) = 3, collision; h'(14) = 7, i = 3 + 7 = 10, stored at table[10].
- 91: h(91) = 3, collision; h'(91) = 7, i = (3 + 7 + 7) % 11 = 6, stored at table[6].

Double hashing, by contrast, drastically reduces clustering. It is also an open addressing scheme and as such it also uses the concept of a probe sequence. But instead of a sequential, linear probing, it defines the step of the probe sequence using the same search key. To do so, the scheme uses one hash function to calculate the index and a second hash function to calculate the step size of the probe sequence. For example, if we have tableSize = 11 and the hash function h(key) = key mod 11, we could define the second function as h'(key) = 7 - (key mod 7). In this way the step size is never zero and the function is different from the hash function.

This slide gives an example in which the hash table has size 11 and the two functions are defined as above. We start by inserting the element 58. The hash function gives h(58) = 3, and since location 3 is empty at the beginning, we store 58 at table[3]. We then try to insert the item with search key 14. h(14) = 3 as well, which causes the first collision. So the second function is used, which gives a step of 7. Since location 3 + 7 is empty, 14 is stored at table[10]. Finally we want to insert the element 91. Again h(91) = 3, and h'(91) = 7 - (91 mod 7) = 7. But now the location table[10] is already occupied. So the method wraps the table indexes around in the following way: table[10 + 7] is not a valid index of the table, so to wrap around we consider table[(10 + 7) mod 11], i.e. table[6].

One of the problems with the open addressing strategy for collision resolution is that each table location is in one of three states: occupied, empty or deleted. It is possible to reach a situation in which no location in the array is empty, or in which no addition can be made, however good the hash function is and however many or few entries are actually in the table. The next approach, based on restructuring the hash table, does not have this problem.
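The worked example can be reproduced with a short Java sketch. The class `DoubleHashingTable` and its methods are assumptions for illustration; the sketch stores non-negative integer keys only and leaves out deletion, which would need the same three-state marking as any open addressing scheme.

```java
class DoubleHashingTable {
    static final int TABLE_SIZE = 11;

    private final Integer[] table = new Integer[TABLE_SIZE]; // null means empty

    // First hash function: the initial index.
    static int h(int key) {
        return key % TABLE_SIZE;
    }

    // Second hash function: the probing step, always between 1 and 7, never 0.
    static int hPrime(int key) {
        return 7 - (key % 7);
    }

    // Inserts the key and returns the index where it was stored, or -1 if no
    // free location was found. Because 11 is prime, any step from 1 to 7
    // visits every location before repeating.
    int add(int key) {
        int i = h(key);
        int step = hPrime(key);
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
            i = (i + step) % TABLE_SIZE;  // wrap around the end of the table
        }
        return -1;
    }

    // Reproduces the same probe sequence to locate a key; -1 if absent.
    int indexOf(int key) {
        int i = h(key);
        int step = hPrime(key);
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            if (table[i] == null) return -1;
            if (table[i] == key) return i;
            i = (i + step) % TABLE_SIZE;
        }
        return -1;
    }
}
```

Inserting 58, 14 and 91 in that order returns the indices 3, 10 and 6, matching the slide.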

Restructuring the Hash Table
Alter the structure of the hash table so as to allow more than one element to be stored at the same location.

Buckets: the array table is defined so that each location table[i] is itself an array, called a bucket. Limitation: how to choose the size of each bucket?

Separate chaining: the array table is defined as an array of linked lists. Each location table[i] is a reference to a linked list, called the chain, of all the elements that have collided to the same integer i.

    public class ChainNode{
        private keyedElem elem;
        private ChainNode next;
        …….
    }

    public class HashTable{
        private final int TABLESIZE=101;
        private ChainNode[ ] table;
        private int size;
        …..
    }

Another way of resolving collisions is to change the structure of the array table (the hash table) so that it can accommodate more than one element in the same location. We describe here two ways of altering the hash table. Buckets are not very efficient, because we might end up either using too much memory or not having enough space for the collisions. Separate chaining is the better solution. The slide gives the basic class declarations of a hash table that uses separate chaining as its collision-resolution strategy.

A Separate Chaining Structure
[Figure: the array Table, with locations 1, 2, 3, …, Size-1. Each location of the hash table contains a reference to a linked list.]

Implementing Hash Table with separate chaining
    getValue(searchKey){
        i = hashIndex(searchKey);
        node = table[i];
        while ((node != null) && (node.getElem( ).getKey( ) != searchKey)) {
            node = node.getNext( );
        }
        if (node != null) {
            return node.getElem( );
        }
        else return null;
    }

    add(key, newElem){
        searchKey = key;
        i = hashIndex(searchKey);
        node = reference to a new node containing newElem;
        node.setNext(table[i]);
        table[i] = node;
    }

"hashIndex" is a protected procedure of the class HashTable.

Pseudocode for two of the access procedures defined at the beginning of this lecture, add(key, newElem) and getValue(key), is given here, assuming the use of separate chaining for resolving collisions. Note that the remove access procedure is very similar to getValue(key), so it is left to you as an exercise. The method hashIndex is responsible for calculating the hash function. It can implement any of the hash functions described before, or any other you might be able to find. The example implementation given here assumes an unsorted chain of nodes, so adding at the beginning of the chain is the most efficient choice. We could also have a sorted chain. In that case the insertion would have to respect the ordering in the chain.
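The pseudocode can be turned into a small runnable Java class along the following lines. This is a sketch under stated assumptions rather than the lecture's own implementation: the class name `ChainedHashTable`, the int keys with String elements, and key mod 101 as `hashIndex` are all illustrative choices. Unlike the slide's add, this version first checks the chain so that keys stay distinct, and it includes one possible version of remove.

```java
class ChainedHashTable {
    private static final int TABLE_SIZE = 101;

    // A chain node holding the key, the element and a reference to the next node.
    private static final class ChainNode {
        final int key;
        String elem;
        ChainNode next;

        ChainNode(int key, String elem, ChainNode next) {
            this.key = key;
            this.elem = elem;
            this.next = next;
        }
    }

    private final ChainNode[] table = new ChainNode[TABLE_SIZE];
    private int size;

    // Assumed hash function: key mod table size.
    protected int hashIndex(int searchKey) {
        return Math.floorMod(searchKey, TABLE_SIZE);
    }

    int getSize() {
        return size;
    }

    // Replaces the element if the key is already in the chain (distinct keys);
    // otherwise adds a new node at the beginning of the unsorted chain.
    void add(int key, String newElem) {
        int i = hashIndex(key);
        for (ChainNode node = table[i]; node != null; node = node.next) {
            if (node.key == key) {
                node.elem = newElem;
                return;
            }
        }
        table[i] = new ChainNode(key, newElem, table[i]);
        size++;
    }

    // Walks the chain at table[i] until the key is found or the chain ends.
    String getValue(int searchKey) {
        ChainNode node = table[hashIndex(searchKey)];
        while (node != null && node.key != searchKey) {
            node = node.next;
        }
        return node != null ? node.elem : null;
    }

    // Unlinks the matching node and returns its element, or null if absent.
    String remove(int searchKey) {
        int i = hashIndex(searchKey);
        ChainNode prev = null;
        for (ChainNode node = table[i]; node != null; prev = node, node = node.next) {
            if (node.key == searchKey) {
                if (prev == null) {
                    table[i] = node.next;
                } else {
                    prev.next = node.next;
                }
                size--;
                return node.elem;
            }
        }
        return null;
    }
}
```

Because each location is a chain, colliding keys such as 4567 and 7597 simply share the list at index 22, and removal needs no "deleted" marker.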

Summary
- Hashing is the process that calculates where in an array a data element should be, rather than searching for it. It allows efficient retrievals, insertions and deletions.
- A hash function should be easy to compute and should scatter the elements evenly throughout the table.
- Collisions occur when two different search keys hash into the same array location.
- There are two strategies to resolve collisions, using probing and chaining respectively.

Conclusion
- What is an Abstract Data Type
- Introduce individual ADTs: Lists, Stacks, Queues, Trees, Heaps, AVL Trees, Hash Tables
- Understand the data type abstractly
- Define the specification of the data type
- Use the data type in small applications, basing solely on its specification
- Implement the data type: static approach, dynamic approach
- Some fundamental algorithms for some ADTs: pre-order, in-order and post-order traversal, heapsort

We can now conclude this second part of the course with an overview slide similar to the one I gave you in the introduction lecture, but this time as a summary of what we have seen and what you are supposed to know. Note that this slide includes AVL Trees.

