Data Structures Hash Tables


1 Data Structures Hash Tables
Phil Tayco Slide version 1.0 May 4, 2015

2 Hash Tables Storage space revisited
A common argument in recent computing is that large amounts of disk space have become inexpensive to acquire. In many situations, using large amounts of storage is therefore no longer treated as critical. This opens the door to using arrays for managing data sets.

3 Hash Tables Sorted data
If we are okay with using arrays, we can identify the situations where they work well. Sorted data leads to O(log n) search performance. Sorting the data is at best O(n log n) using quicksort, or O(n) per insertion if we keep the array in order while performing maintenance. Performance is strong if the data is sorted, but maintaining that order can be costly.

4 Hash Tables Unless we don’t need to sort
Sorted data helps when presenting parts or all of the data (such as a web page report). If there is no need to show sorted data (such as an employee management system where records are maintained one at a time), then the need to sort the data is removed. Searching unsorted data, however, is O(n), so we are now looking for a structure that provides O(log n) maintenance performance (or better) without needing to sort (and we are okay with using arrays).

5 Hash Tables Array index as key
To do this, we need to take advantage of the fact that arrays allow direct access to their elements. Direct access is achieved by using the array index number. The question is how to make the most of the array index when performing the maintenance functions.

6 Hash Tables An ideal example
Consider a company of 1,000 employees that is very unlikely to ever exceed 100,000. Storage is not an issue, and memory capacity can easily accommodate 100,000 records. The program that maintains these records has no functionality requiring the employee records to be shown in any sorted order. This is all great, because an array can be allocated with enough space to handle the worst case of 100,000 records.

7 Hash Tables Index representation
To take full advantage of the array, we treat the array index as a key value identifying an employee. Sequential employee id numbers make the perfect key (employee 15 is employees[14]). On a larger scale, an employee SSN could be used the same way (assuming you can hold up to 999,999,999 records!). Each employee id maps to a unique index value, so there would never be overlap (unless you reused employee ids after employees left the company).

8 Hash Tables Ideal efficiency
Just how fast is this approach? Search: you know the id number, so you know the array index and have direct access. Insert: keeping track of the last assigned employee id makes adding new employees trivial. Update/Delete: a search followed by the appropriate change. Each one of these operations ends up at O(1)!
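
A minimal sketch of this direct-index idea in Java, assuming sequential ids, a hypothetical Employee record type, and the 100,000-record worst case from the example; every maintenance operation reduces to a single array access.

// Direct-index table: employee id N lives at records[N - 1], so each
// operation is a single array access. Employee is an illustrative record type.
public class DirectIndexTable {
    static class Employee {
        int id;
        String name;
        Employee(int id, String name) { this.id = id; this.name = name; }
    }

    private static final int MAX_IDS = 100_000;          // worst-case capacity from the example
    private final Employee[] records = new Employee[MAX_IDS];

    public void insert(Employee e)       { records[e.id - 1] = e; }     // O(1)
    public Employee search(int id)       { return records[id - 1]; }    // O(1), null if absent
    public void update(int id, String n) {                              // O(1): search + change
        Employee e = records[id - 1];
        if (e != null) e.name = n;
    }
    public void delete(int id)           { records[id - 1] = null; }    // O(1)

    public static void main(String[] args) {
        DirectIndexTable table = new DirectIndexTable();
        table.insert(new Employee(15, "Pat"));
        System.out.println(table.search(15).name);   // employee 15 is records[14]
    }
}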

9 Hash Tables Reality Such ideal situations are exactly that: ideal
Most situations lose out on some factor: there may not be quite enough storage space, forcing a smaller array size, or the ID values may not be unique numbers. Can we reduce the array size and still find a way to line up a unique record ID with an array index?

10 Hash Tables Hashing Hashing involves deriving an index value through some logical calculation. The calculation is applied to a field or combination of fields of the record to produce an index. Typical example: add all the ASCII values of some field like the first and last name, then use mod to bring the result into the range of valid indexes.

11 Hash Tables Calculations
Example: “Phil Tayco” is the name on the record. Adding all the ASCII character values gives 397 for “Phil” and 512 for “Tayco”, for a total of 909. Say we only allow for 500 array elements. We can mod this total by the array size: 909 % 500 = array index 409. Using this approach means we have a consistent formula for deriving an index value.
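
A minimal sketch of this calculation in Java, assuming a 500-element array and that first and last name are the hashed fields:

// ASCII-sum hash: add up the character codes, then fold into a valid index.
public class AsciiSumHash {
    static final int TABLE_SIZE = 500;   // array capacity from the slide example

    // Sum the character codes of a string ("Phil" -> 397, "Tayco" -> 512).
    static int asciiSum(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) sum += c;
        return sum;
    }

    // Fold the combined sum into a valid index: (397 + 512) % 500 = 409.
    static int hash(String firstName, String lastName) {
        return (asciiSum(firstName) + asciiSum(lastName)) % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash("Phil", "Tayco"));   // prints 409
    }
}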

12 Hash Tables Limitations
Challenges immediately come to mind when looking at this example: eventually, the index calculation for 2 different records will derive the same value (called a “collision”), and a calculation that guarantees a unique value often requires a large amount of space that goes heavily underutilized. We need to keep the capacity of the array reasonable while handling the inevitable collisions.

13 Hash Tables Collisions
There are multiple approaches for handling collisions when hashing. Open addressing uses the strategy of finding another open element in the array by following a search-like algorithm. The assumption is that there will be enough space for all entries (i.e. the estimated maximum capacity of the hash array is adequate).

14 Hash Tables Linear Probing
Linear probing is the most basic open addressing algorithm. If a collision occurs, look in the next immediate spot in the array. If it is open, place the new item there. If it is not, continue looking at the next array index (wrapping to index 0 if needed) until an open spot is found. This is an issue only if the capacity is reached (making the initial estimate important).

15 Hash Tables Linear Probe Search
If the hash array uses this form of collision handling on insert, the other functions must follow suit. Search uses the hash function to check whether the given record is at the hash location. If that location is “empty”, the search is over. If the record is there, it is found. Otherwise, the search continues with the next array element. “Empty”, however, must be defined as a predetermined record value. Why…?

16 Hash Tables Linear Probe Delete
Because a delete cannot simply mean performing the search and, if the record is found, removing it from the array. That would leave an empty spot in the array that could be interpreted as “record not found” during a later search. Instead, the array element is changed to another predetermined value meaning “deleted”. Search does not treat this as an empty spot.
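
A compact sketch of linear probe insert, search, and delete in Java, assuming String records, an ASCII-sum style hash, and a reserved tombstone string standing in for the “deleted” marker described above:

// Linear-probing hash table sketch: open slots are null (empty) and removed
// slots hold a DELETED tombstone so later searches keep probing past them.
public class LinearProbeTable {
    private static final String DELETED = "<deleted>";  // illustrative tombstone value
    private final String[] slots;

    public LinearProbeTable(int capacity) { slots = new String[capacity]; }

    private int hash(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) sum += c;
        return sum % slots.length;
    }

    // Probe forward (wrapping to index 0) until an empty or deleted slot is found.
    public boolean insert(String key) {
        int i = hash(key);
        for (int tries = 0; tries < slots.length; tries++) {
            if (slots[i] == null || slots[i] == DELETED) { slots[i] = key; return true; }
            i = (i + 1) % slots.length;      // collision: try the next index
        }
        return false;                        // table is full
    }

    // Stop only at a truly empty slot; skip tombstones and non-matching keys.
    public int search(String key) {
        int i = hash(key);
        for (int tries = 0; tries < slots.length; tries++) {
            if (slots[i] == null) return -1;                         // empty: not found
            if (slots[i] != DELETED && slots[i].equals(key)) return i;
            i = (i + 1) % slots.length;
        }
        return -1;
    }

    // Delete marks the slot rather than emptying it, preserving probe chains.
    public boolean delete(String key) {
        int i = search(key);
        if (i < 0) return false;
        slots[i] = DELETED;
        return true;
    }

    public static void main(String[] args) {
        LinearProbeTable t = new LinearProbeTable(7);
        t.insert("R");                       // hashes to index 5
        t.insert("D");                       // also hashes to 5, probes to 6
        t.delete("R");                       // index 5 becomes a tombstone
        System.out.println(t.search("D"));   // 6: the probe skips the tombstone at 5
    }
}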

17 Hash Tables Example: records “T”, “Y” and “R” have been hashed into the array. [Diagram: T Y R]

18 Hash Tables New record “D” comes in and the hash function calculates its index as index [3]. [Diagram: T Y R D]

19 Hash Tables Record “D” collides with record “T”. Linear probe means we try the next index. [Diagram: T Y R D]

20 Hash Tables However, record “Y” is already there, so we try the next one. That spot is open, so that’s where “D” goes. [Diagram: T Y D R]

21 Hash Tables Later on, record “Y” is called for deletion. When “Y” is hashed, its index value is [4]. “Y” is there, so the deletion is performed. [Diagram: T Y D R]

22 Hash Tables However, if we remove it, that creates an empty space… [Diagram: T D]

23 Hash Tables If we left it this way, when a search for record “D” begins, its original hash value is still [3]. [Diagram: T D R]

24 Hash Tables Since index [3] is not “empty”, the search goes to index [4], which is empty, and then incorrectly returns “not found”. [Diagram: T D R]

25 Hash Tables The solution: instead of removing the record, put in a designated “deleted” value (such as -1). [Diagram: T -1 D R]

26 Hash Tables Now when the search for record “D” is performed, the linear probe treats the “-1” as not empty and correctly continues the search. [Diagram: T -1 D R]

27 Hash Tables Linear Probe Efficiency
As records start to fill up the array, the efficiency of the algorithm degrades toward O(n). The rate of degradation depends on the quality of the hash function (how spread out the calculated locations are) and the nature of the data (do the selected fields result in well spread hash values?). Other methods of probing exist, such as quadratic probing and double hashing.
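
A rough sketch of how these probe sequences differ, assuming a table of size m, a home index h from the hash function, probe attempt number i, and (for double hashing) a second hash value h2; these are the common textbook formulas rather than anything specified in these slides:

// Probe-sequence sketches: each returns the index to try on attempt i.
public class ProbeSequences {
    static int linear(int h, int i, int m)             { return (h + i) % m; }          // step by 1
    static int quadratic(int h, int i, int m)          { return (h + i * i) % m; }      // step by i^2
    static int doubleHash(int h, int i, int h2, int m) { return (h + i * h2) % m; }     // step by h2

    public static void main(String[] args) {
        // Home index 3 in a table of size 10: compare the first few probes.
        for (int i = 0; i < 4; i++)
            System.out.printf("i=%d linear=%d quadratic=%d double=%d%n",
                i, linear(3, i, 10), quadratic(3, i, 10), doubleHash(3, i, 7, 10));
    }
}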

28 Hash Tables The bottom line
Whatever hash function and open addressing probe approach you take, the logic and strategy are the same: determine the appropriate field(s) to hash, develop a hash function that generates reasonably spaced index values, and design a collision handling approach that takes advantage of the hash strategy. Best and worst case will always range from O(1) to O(n); open addressing is about reducing the likelihood of hitting O(n).

29 Hash Tables A more dynamic approach
What if you’re not quite sure of your capacity estimate? Or perhaps the maximum size is outrageously large and would lead to a lot of unused space. A second collision handling approach allows for keeping a reasonably sized array and addressing the collisions dynamically. “Dynamic” memory management implies a second structure…

30 Hash Tables A hash array of linked lists
This method, known as “separate chaining”, makes each element of the array the “head” node of a linked list. When an insert is performed, the hash index is found and the new element is inserted into the linked list at that index. If a collision occurs, it’s okay because the linked list insert handles it. When a search or delete is performed, the initial hash takes place, followed by a standard linked list search or delete.
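
A minimal separate chaining sketch in Java, assuming String records and using java.util.LinkedList for the chains in place of a hand-written linked list:

import java.util.LinkedList;

// Separate-chaining sketch: each array element holds a linked list (the
// slides' "head node"), so colliding keys simply become extra nodes there.
public class ChainedHashTable {
    private final LinkedList<String>[] chains;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int capacity) {
        chains = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) chains[i] = new LinkedList<>();
    }

    private int hash(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) sum += c;
        return sum % chains.length;
    }

    // Insert: hash to a chain, then a standard linked-list insert at the head.
    public void insert(String key)  { chains[hash(key)].addFirst(key); }

    // Search: hash to a chain, then a standard linked-list search.
    public boolean search(String key) { return chains[hash(key)].contains(key); }

    // Delete: hash to a chain, then a standard linked-list delete.
    public boolean delete(String key) { return chains[hash(key)].remove(key); }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(13);
        t.insert("T"); t.insert("Y"); t.insert("R"); t.insert("D");
        System.out.println(t.search("D"));   // true
        t.delete("Y");
        System.out.println(t.search("Y"));   // false
    }
}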

31 Hash Tables Same example as before: 3 records as heads of lists in the hash array. [Diagram: T Y R]

32 Hash Tables Record “D” is hashed to index [3] and is inserted into the linked list there (note that T is now the 2nd node in that list). [Diagram: D Y R T]

33 Hash Tables Deleting record “Y” is simply hashing to index [4] and performing a linked list delete. [Diagram: D R T]

34 Hash Tables Searching for “D” hashes to index [3] as normal and a linked list search is performed (which happens to find the head node!). [Diagram: D R T]

35 Hash Tables Separate Chaining pros and cons
The overhead of using a linked list does impact performance, but not necessarily the coding effort, since the list functions can be modularized. In theory, the performance is the same as open addressing, since it still depends on the hash function developed. The size of the hash array is not a critical dependency, since the linked lists absorb the need for additional space. The combination of a hash function that yields widely spread index values with linked lists for collision handling is generally preferred.

36 Hash Tables Summary Hash tables are a strong fit for situations where single-record search and maintenance is the primary use, because of their near O(1) performance. Obtaining records in ordered groups or sets is challenging and not well suited to hash tables. Collisions can be handled using open addressing or separate chaining; the latter is generally considered more flexible for performance and memory usage. The key is the hash function itself – many formulas and theories exist on which fields and calculations to use to derive index values.

