Data Structures Hash Tables

Data Structures: Hash Tables. Phil Tayco. Slide version 1.0, May 4, 2015.

Hash Tables: Storage space revisited. A common argument in recent computing is that large amounts of disk space have become cheap to acquire. Designs can therefore treat heavy space usage as less critical than it once was. This makes arrays a viable choice for managing data sets.

Hash Tables: Sorted data. If we are okay with using arrays, we can identify the situations where they work well. Searching sorted data is O(log n) with binary search. Sorting the data is at best O(n log n) on average with an algorithm such as quicksort, and keeping the data in order during maintenance costs O(n) per insert or delete. Performance is strong when the data is sorted, but keeping it sorted can be costly.
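The trade-off on this slide can be sketched with Python's standard `bisect` module. This is only an illustration: the `ids` list and the `contains` helper are hypothetical names, not part of the original slides.

```python
import bisect

# Hypothetical sorted keys, kept in order at all times.
ids = [3, 8, 15, 23, 42]


def contains(sorted_keys, key):
    """Binary search: O(log n) comparisons on sorted data."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


# Keeping the order on insert: finding the position is O(log n),
# but shifting the later elements to make room is O(n).
bisect.insort(ids, 10)
```

The search is fast because the list is sorted; the insert is where the maintenance cost shows up.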

Hash Tables: Unless we don’t need to sort. Sorted data helps when presenting parts or all of the data (such as a web page report). If there is no need to show sorted data (such as an employee management system where records are maintained one at a time), then the need to sort is removed. Searching unsorted data, however, is O(n), so we are now looking for a structure that offers O(log n) maintenance performance or better without requiring sorted data (and we are okay with using arrays).

Hash Tables: Array index as key. To get there, we rely on the fact that arrays allow direct access to their elements. Direct access is achieved by using the array index number. The question is how to make the most of the array index when performing the maintenance functions.

Hash Tables: An ideal example. Consider a company of 1,000 employees that is very unlikely ever to exceed 100,000. Storage is not an issue and memory can easily accommodate 100,000 records. The program that maintains these records has no functionality that requires showing the employee records in any sorted order. This is ideal, because an array can be allocated with enough space to handle the worst case of 100,000 records.

Hash Tables: Index representation. To take full advantage of the array, we treat the array index as a key value that identifies an employee. Sequential employee id numbers make the perfect key (Employee 15 is employees[14]). On a larger scale, an employee SSN can be used in the same way (assuming you can hold up to 999,999,999 records!). Each employee id maps to a unique index value, so there would never be overlap (unless you reused the ids of employees who left the company).

Hash Tables: Ideal efficiency. Just how fast is this approach? Search: you know the id number, so you know the array index and you have direct access. Insert: keeping track of the last assigned employee id number makes adding a new employee easy. Update/Delete: a search followed by the appropriate change. Each one of these ends up at O(1)!
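A minimal sketch of this ideal case, assuming ids start at 1 and are handed out sequentially. The names `MAX_EMPLOYEES`, `employees`, `next_id` and the record layout are illustrative assumptions, not taken from the slides.

```python
# Direct-address table: the employee id itself determines the array index.
MAX_EMPLOYEES = 100_000            # assumed worst-case capacity
employees = [None] * MAX_EMPLOYEES
next_id = 1                        # last assigned id + 1; ids start at 1


def insert(name):
    """Store a new record at index id - 1 and return the assigned id. O(1)."""
    global next_id
    if next_id > MAX_EMPLOYEES:
        raise OverflowError("capacity estimate exceeded")
    emp_id = next_id
    employees[emp_id - 1] = {"id": emp_id, "name": name}
    next_id += 1
    return emp_id


def search(emp_id):
    """Return the record for emp_id, or None if absent. O(1) direct access."""
    if 1 <= emp_id <= MAX_EMPLOYEES:
        return employees[emp_id - 1]
    return None


def delete(emp_id):
    """Clear the slot for emp_id. O(1): a direct access plus an assignment."""
    if 1 <= emp_id <= MAX_EMPLOYEES:
        employees[emp_id - 1] = None
```

No operation loops over the array, which is why each one is O(1); the price is an array sized for the worst case.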

Hash Tables: Reality. Such ideal situations are just that: ideal. Real situations tend to lose out on some factor: there may not be enough storage space, requiring a smaller array, or the ID values may not be unique numbers. Can we reduce the array size and still find a way to line up a unique record ID with an array index?

Hash Tables: Hashing. Hashing derives an index value through some logical calculation. The calculation is applied to a field, or a combination of fields, of the record. Typical example: add the ASCII values of all characters in some field, such as first and last name, and use mod to reduce the sum to an index.

Hash Tables: Calculations. Example: “Phil Tayco” as the name of the record. Add all ASCII character values (ignoring the space): 80 + 104 + 105 + 108 = 397 for “Phil” and 84 + 97 + 121 + 99 + 111 = 512 for “Tayco”, for a total of 909. Say we only allow for 500 array elements. We then mod this value by the array size: 909 % 500 = array index 409. This approach gives us a consistent formula to derive an index value.
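The calculation can be sketched as a small function. The name `ascii_sum_hash` is hypothetical, and skipping the space character is an assumption made to match the slide, which sums only the letters. With standard ASCII codes ('P' is 80, 'T' is 84) the letters of “Phil Tayco” sum to 909, which gives index 409 for a 500-element array.

```python
def ascii_sum_hash(name, table_size):
    """Add the ASCII codes of the non-space characters, then mod by the array size."""
    total = sum(ord(ch) for ch in name if ch != " ")
    return total % table_size


index = ascii_sum_hash("Phil Tayco", 500)  # 909 % 500
```

The same name always produces the same index, which is the property the search and delete functions rely on.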

Hash Tables: Limitations. Challenges immediately come to mind when looking at this example. Eventually, the index calculation for 2 different records will produce the same value (called a “collision”). A calculation that guarantees a unique value often requires a large amount of space that is heavily underutilized. We need to keep the capacity of the array reasonable while handling the inevitable collisions.
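To see how easily a collision arises with a sum-of-ASCII hash, note that any two names made of the same letters produce the same sum. The sketch below groups a few hypothetical names by their index; `ascii_sum_hash` and `group_by_index` are illustrative names, and the sample records are invented for the demonstration.

```python
from collections import defaultdict


def ascii_sum_hash(name, table_size=500):
    """Add the ASCII codes of the non-space characters, then mod by the array size."""
    return sum(ord(ch) for ch in name if ch != " ") % table_size


def group_by_index(names, table_size=500):
    """Map each hash index to the list of names that land on it."""
    buckets = defaultdict(list)
    for name in names:
        buckets[ascii_sum_hash(name, table_size)].append(name)
    return dict(buckets)


# "Tayco Phil" is a rearrangement of "Phil Tayco", so the sums are identical.
groups = group_by_index(["Phil Tayco", "Tayco Phil", "Ada Lovelace"])
collisions = {i: ns for i, ns in groups.items() if len(ns) > 1}
```

Any index in `collisions` holds more than one record, which is the situation the collision handling strategies have to resolve.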

Hash Tables: Collisions. There are multiple approaches for handling collisions when hashing. Open addressing finds another open element in the array by following a search-like algorithm. The assumption is that there will be enough space for all entries (i.e. the estimated maximum capacity of the hash array is adequate).

Hash Tables: Linear Probing. Linear probing is the basic open addressing algorithm. If a collision occurs, look in the next immediate spot in the array. If it is open, place the new item there. If it is not, continue looking at the next array index (wrapping to index 0 if needed) until an open spot is found. This is an issue only if the capacity is reached (making the initial estimate important).

Hash Tables: Linear Probe Search. If the hash array uses this form of collision handling on insert, the other functions must follow suit. Search uses the hash function to check whether a given record is at its hash location. If that location is “empty”, the search is over and the record is not found. If the record is there, it is found. Otherwise, the search continues with the next array element. “Empty”, however, must be defined as a predetermined record value. Why…?

Hash Tables: Linear Probe Delete. Because a delete cannot simply perform the search and, if the record is found, remove it from the array. This would leave an empty spot in the array that a later search may interpret as “record not found”. Instead, the array element is changed to another predetermined value meaning “deleted”. Search does not treat this as an empty spot.
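The insert, search and delete rules for linear probing can be sketched together as below. This is one possible arrangement rather than the author's implementation: the class name `LinearProbeTable`, the sentinel objects `EMPTY` and `DELETED`, and the choice to let insert reuse a “deleted” slot are all assumptions. The sketch also assumes the same key is never inserted twice.

```python
EMPTY = object()    # predetermined value: slot has never held a record
DELETED = object()  # predetermined value: slot once held a record


class LinearProbeTable:
    def __init__(self, capacity, hash_fn):
        self.capacity = capacity
        self.hash_fn = hash_fn              # maps a key to an integer
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        """Yield indices from the hash location onward, wrapping to index 0."""
        start = self.hash_fn(key) % self.capacity
        for step in range(self.capacity):
            yield (start + step) % self.capacity

    def insert(self, key):
        """Place key in the first open spot; return the index used."""
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key
                return i
        raise OverflowError("hash array is full")

    def search(self, key):
        """Return the index of key, or -1. Only a true EMPTY slot ends the search."""
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return -1
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return i
        return -1

    def delete(self, key):
        """Mark the slot as DELETED instead of emptying it."""
        i = self.search(key)
        if i == -1:
            return False
        self.slots[i] = DELETED
        return True
```

A table would be created with a capacity and any function that turns a key into an integer, for example `LinearProbeTable(500, hash_fn)` with a sum-of-ASCII function. Using distinct sentinel objects instead of a value such as -1 avoids clashing with a real record that happens to equal the marker.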

Hash Tables: Example. Records “T”, “Y” and “R” have been hashed into the array. [Array diagram: T, Y, R]

Hash Tables: New record “D” comes in and the hash function calculates its index as index [3]. [Array diagram: T, Y, R; incoming: D]

Hash Tables: Record “D” collides with record “T”. Linear probe means try the next index. [Array diagram: T, Y, R; incoming: D]

Hash Tables: However, record “Y” is already there, so we try the next one. It is open, so that’s where “D” goes. [Array diagram: T, Y, D, R]

Hash Tables: Later on, record “Y” is called for deletion. When “Y” is hashed, its index value is [4]. “Y” is there, so the deletion is performed. [Array diagram: T, Y, D, R]

Hash Tables: However, if we remove it, that creates an empty space… [Array diagram: T, (empty), D, R]

Hash Tables: If we left it this way, when a search for record “D” begins, its original hash value is still [3]. [Array diagram: T, (empty), D, R]

Hash Tables: Since index [3] is not “empty”, search goes to index [4], which is empty, and then incorrectly returns “not found”. [Array diagram: T, (empty), D, R]

Hash Tables: The solution is, instead of removing the record, to put in a designated “deleted” value (such as -1). [Array diagram: T, -1, D, R]

Hash Tables: Now when a search for record “D” is performed, the linear probe will treat the “-1” as not empty and continue the search correctly. [Array diagram: T, -1, D, R]
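The walkthrough above can be replayed in a few lines. The hash values are forced through a lookup table so the records land where the example puts them: “T” and “D” hash to [3] and “Y” to [4], as stated. The index for “R” is not given in the example, so [6] is an assumption, as are the array size of 8 and the names `HASH`, `search`, `naive` and `marked`.

```python
EMPTY, DELETED = None, -1

# Forced hash values for the example records ("R" at 6 is assumed).
HASH = {"T": 3, "Y": 4, "R": 6, "D": 3}


def search(slots, key):
    """Linear probe search: stop at EMPTY, skip anything else that is not key."""
    i = HASH[key]
    for _ in range(len(slots)):
        if slots[i] is EMPTY:
            return -1
        if slots[i] == key:
            return i
        i = (i + 1) % len(slots)
    return -1


# State after T, Y, R and then D (probed from [3] to [5]) were inserted.
slots = [EMPTY] * 8
slots[3], slots[4], slots[5], slots[6] = "T", "Y", "D", "R"

naive = list(slots)
naive[4] = EMPTY       # delete "Y" by emptying the slot
marked = list(slots)
marked[4] = DELETED    # delete "Y" by writing the "deleted" value

naive_result = search(naive, "D")    # stops at the empty [4]: not found
marked_result = search(marked, "D")  # skips the -1 at [4] and finds "D" at [5]
```

Using -1 as the marker follows the slide; it only works as long as -1 can never be a legitimate record value.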

Hash Tables: Linear Probe Efficiency. As records start to fill up the array, the efficiency of the algorithm degrades toward O(n). The degradation depends on how well the hash function spreads out locations and on the nature of the data (do the selected fields result in spaced-out hash values?). Other methods of probing exist: quadratic probing and double hashing.
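The three probing methods differ only in how the next index is computed after a collision. The formulas below are common textbook forms and are assumptions here, since the slides name the methods without defining them; the function names and the particular second hash `1 + key_value % (size - 1)` are illustrative.

```python
def linear_probe(start, attempt, size):
    """Attempt 0 is the hash location; each retry moves one slot further."""
    return (start + attempt) % size


def quadratic_probe(start, attempt, size):
    """Retries jump by 1, 4, 9, ... slots, which spreads out clusters."""
    return (start + attempt * attempt) % size


def double_hash_probe(start, attempt, size, key_value):
    """The step size comes from a second hash of the key and is never zero."""
    step = 1 + (key_value % (size - 1))
    return (start + attempt * step) % size


# Probe sequences for a record that hashes to index 3 in an array of size 11.
linear = [linear_probe(3, a, 11) for a in range(4)]
quadratic = [quadratic_probe(3, a, 11) for a in range(4)]
double = [double_hash_probe(3, a, 11, 25) for a in range(4)]
```

With double hashing, two records that collide at the same location usually follow different probe paths, because their step sizes differ. A prime array size, such as 11 here, keeps every step size coprime with the size, so the sequence can reach every slot.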

Hash Tables: The bottom line. Whatever hash function and open addressing probe approach you take, the logic and strategy are the same: determine an appropriate field or fields for hash use; develop a hash function that generates reasonably spaced index values; design a collision handling approach that takes advantage of the hash strategy. Performance will always range from a best case of O(1) to a worst case of O(n). Open addressing means trying to reduce the likelihood of O(n).

Hash Tables: A more dynamic approach. What if you’re not quite sure of your capacity estimate? Or perhaps the maximum size is outrageously large and would leave much of the space unused. A second collision handling approach keeps the array at a reasonable size and addresses the collisions dynamically. “Dynamic” memory management implies a second structure…

Hash Tables: A hash array of linked lists. This method, known as “Separate Chaining”, makes each element of the array the “head” node of a linked list. When insert is performed, the hash index is found and the new element is inserted into the linked list there. If a collision occurs, it’s okay because the linked list insert handles it. When search or delete is performed, the initial hash takes place, followed by a standard linked list search or delete.
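A minimal sketch of separate chaining follows, assuming new records are inserted at the head of the list. The class names `Node` and `ChainedTable` and the helper `chain` are hypothetical, and the hash function is passed in so any index calculation can be used.

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedTable:
    def __init__(self, capacity, hash_fn):
        self.capacity = capacity
        self.hash_fn = hash_fn            # maps a key to an integer
        self.heads = [None] * capacity    # each element is the head of a list

    def _index(self, key):
        return self.hash_fn(key) % self.capacity

    def insert(self, key):
        """Insert at the head of the list; a collision just lengthens the list."""
        i = self._index(key)
        self.heads[i] = Node(key, self.heads[i])

    def search(self, key):
        """Hash to the list, then run a standard linked list search."""
        node = self.heads[self._index(key)]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key):
        """Hash to the list, then unlink the matching node if there is one."""
        i = self._index(key)
        prev, node = None, self.heads[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False

    def chain(self, i):
        """Return the keys stored in the list at index i, head first."""
        keys, node = [], self.heads[i]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys
```

No “deleted” marker is needed here: removing a node from a linked list cannot interrupt the search for any other record.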

Hash Tables: Same example as before. 3 records are heads of lists in the hash array. [Array of lists: T; Y; R]

Hash Tables: Record “D” is hashed to index [3] and is inserted into the linked list (note that T is now the 2nd node in the linked list there). [Array of lists: D -> T; Y; R]

Hash Tables: Delete of record “Y” is simply hashing to index [4] and performing a linked list delete. [Array of lists: D -> T; R]

Hash Tables: Search for “D” hashes to index [3] as normal and a linked list search is performed (which happens to find it at the head node!). [Array of lists: D -> T; R]

Hash Tables: Separate Chaining pros and cons. The overhead of using a linked list does impact performance, but not necessarily the coding, since the functions can be modularized. In theory, the performance is the same as open addressing since it still depends on the hash function developed. The size of the hash array is not a critical dependency since the linked lists handle the need for additional space. The right combination of a hash function that yields wide-ranging index values with the use of linked lists is generally preferred.

Hash Tables: Summary. Hash tables are a strong fit for situations where single-record search and maintenance are the primary operations, because of their near O(1) performance. Obtaining records in ordered groups and data sets is difficult and not well suited to hash tables. Collisions can be handled using open addressing or separate chaining, the latter of which is generally considered more flexible for performance and memory usage. The key is the hash function itself: many formulas and theories exist on what fields and calculations to use to derive index values.