Hashing General idea: Get a large array

Slides:



Advertisements
Similar presentations
Hash Tables.
Advertisements

Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
Data Structures Using C++ 2E
Hashing as a Dictionary Implementation
What we learn with pleasure we never forget. Alfred Mercier Smitha N Pai.
Appendix I Hashing. Chapter Scope Hashing, conceptually Using hashes to solve problems Hash implementations Java Foundations, 3rd Edition, Lewis/DePasquale/Chase21.
© 2004 Goodrich, Tamassia Hash Tables1  
Searching Kruse and Ryba Ch and 9.6. Problem: Search We are given a list of records. Each record has an associated key. Give efficient algorithm.
Using arrays – Example 2: names as keys How do we map strings to integers? One way is to convert each letter to a number, either by mapping them to 0-25.
Hashing Techniques.
Hashing CS 3358 Data Structures.
1 Chapter 9 Maps and Dictionaries. 2 A basic problem We have to store some records and perform the following: add new record add new record delete record.
© 2006 Pearson Addison-Wesley. All rights reserved13 A-1 Chapter 13 Hash Tables.
1 CSE 326: Data Structures Hash Tables Autumn 2007 Lecture 14.
Hash Tables1 Part E Hash Tables  
Hash Tables1 Part E Hash Tables  
Hashing COMP171 Fall Hashing 2 Hash table * Support the following operations n Find n Insert n Delete. (deletions may be unnecessary in some applications)
Hash Tables1 Part E Hash Tables  
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
COMP 171 Data Structures and Algorithms Tutorial 10 Hash Tables.
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
Hashing 1. Def. Hash Table an array in which items are inserted according to a key value (i.e. the key value is used to determine the index of the item).
ICS220 – Data Structures and Algorithms Lecture 10 Dr. Ken Cosh.
Hash Table March COP 3502, UCF.
Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
Hashing Table Professor Sin-Min Lee Department of Computer Science.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
1 Hash table. 2 Objective To learn: Hash function Linear probing Quadratic probing Chained hash table.
1 Hash table. 2 A basic problem We have to store some records and perform the following:  add new record  delete record  search a record by key Find.
Hash Tables1   © 2010 Goodrich, Tamassia.
Comp 335 File Structures Hashing.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
1 HASHING Course teacher: Moona Kanwal. 2 Hashing Mathematical concept –To define any number as set of numbers in given interval –To cut down part of.
Hashing Hashing is another method for sorting and searching data.
© 2004 Goodrich, Tamassia Hash Tables1  
Searching Given distinct keys k 1, k 2, …, k n and a collection of n records of the form »(k 1,I 1 ), (k 2,I 2 ), …, (k n, I n ) Search Problem - For key.
CS201: Data Structures and Discrete Mathematics I Hash Table.
Data Structures and Algorithms Hashing First Year M. B. Fayek CUFE 2010.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
Lecture 12COMPSCI.220.FS.T Symbol Table and Hashing A ( symbol) table is a set of table entries, ( K,V) Each entry contains: –a unique key, K,
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
Hashing 8 April Example Consider a situation where we want to make a list of records for students currently doing the BSU CS degree, with each.
Chapter 10 Hashing. The search time of each algorithm depend on the number n of elements of the collection S of the data. A searching technique called.
Hashing Basis Ideas A data structure that allows insertion, deletion and search in O(1) in average. A data structure that allows insertion, deletion and.
Hash Table March COP 3502, UCF 1. Outline Hash Table: – Motivation – Direct Access Table – Hash Table Solutions for Collision Problem: – Open.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Hashing Suppose we want to search for a data item in a huge data record tables How long will it take? – It depends on the data structure – (unsorted) linked.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
1 Hashing by Adlane Habed School of Computer Science University of Windsor May 6, 2005.
Hashing COMP171. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Data Structures Using C++ 2E
Hashing, Hash Function, Collision & Deletion
Hashing CSE 2011 Winter July 2018.
Hashing Alexandra Stefan.
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
Review Graph Directed Graph Undirected Graph Sub-Graph
Hash functions Open addressing
Advanced Associative Structures
Hash Table.
Hash Table.
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CS202 - Fundamental Structures of Computer Science II
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Data Structures – Week #7
Presentation transcript:

Hashing General idea: Get a large array Design a hash function h which converts keys into array indices . Insert (key, value) in the array at h(key) position. Can also be used in external memory (addresses instead of array indices).

Example Suppose the size of the hash table is k. How to map arbitrary large numbers to numbers in the range 0.....k-1? A possible solution: division hashing. index = largeNumber % k (remainder of division by k). However, even this is not very practical (largeNumber is too large in our case). Another problem is that the same index may now be associated with different words.

Requirements for Hash function The following are general requirements for hash functions: If the hash table has size k, hash function should return numbers in the range 0....k-1. Hash function must be quick to calculate the value. Hash function should minimize collisions (when several keys are hashed to the same number). The last requirement is satisfied if each key is equally likely to map to any of the k indices (uniform distribution). Usually we cannot completely avoid collisions (as in the example with words).

Hashing Methods Division method Multiplication method Folding method Other hashing methods

Division method Algorithm: H(x) = x mod m + 1   Where m is some predetermined divisor integer (i.e., te table size), x is the preconditioned key, and mod stands for modulo. Note that adding 1 is only necessary if the table starts at key 1 (if it starts at 0, the algorithm simplifies to (H(x) = x mod m).  So, in other words: given an item, divide the preconditioned key of that item by the table size (+1). The remainder is the hash key.  

Division method Given a hash table with 10 buckets, what is the hash key for 'Cat'?     Since 'Cat' = 131130 when converted to ASCII, then x = 131130. We are given the table size (i.e., m = 10, starting at 0 and ending at 9).             H(x) = x mod m             H(131130) = 131130 mod 10                               = 0 (there is no remainder)     'Cat' is inserted into the table at address 0.  

Multiplication method The method that is used in the slide is the multiplication method. It multiplies all the individual digits in the key together, and takes the remainder after dividing the resulting number by the table size. In functional notation, the algorithm is:        H(x) = (a * b * c * d *....) mod m        Where:  m is the table size, a, b, c, d, etc. are the individual digits of the item, and mod stands for modulo. Let's apply this algorithm to an example.

Example Given a hash table of ten buckets (0 through 9), what is the hash key for 'Cat'?     Since 'Cat' = 131130 when converted to ASCII, then x = 131130     We are given the table size (i.e., m = 10).     The constant can be any number we want, let's use five (i.e., c = 5).           H(x) = (a * b * c * d *....) mod m         H(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10                           = 0 mod 10                           = 0     'Cat' is inserted into the table at address 0.  

Folding method Algorithm: H(x) = (a + b + c) mod m Where a, b, and c represent the preconditioned key broken down into three parts, m is the table size, and mod stands for modulo.   In other words: the sum of three parts of the preconditioned key is divided by the table size. The remainder is the hash key.  

Examples Fold the key 123456789 into a hash table of ten spaces (0 through 9).     We are given x = 123456789 and the table size (i.e., m = 10). Since we can break x into three parts any way we want to, we will break it up evenly.     Thus a = 123, b = 456, and c = 789.         H(x) = (a + b + c) mod m         H(123456789) = (123+456+789) mod 10                                 = 1368 mod 10                                 = 8      123456789 is inserted into the table at address 8.

Other hashing methods Now that we have shown, in detail, a few of the many hashing methods which you can use, we will briefly discuss the length-dependent method, midsquare method, digit-analysis and addition method.

Other hashing methods Midsquare mehod => Multiply x by itself (i.e., x2) and select a number of digits from the middle of the result. How many digits you select will depend on your table size and the size of your hash key. Addition method =>The sum of the digits of the preconditioned key is divided by the table size. The remainder is the hash key. Digit analysis method => Certain digits of the preconditioned key are selected in a certain predetermined and consistent pattern.

Perfect Hashing In exceptional circumstances, it is possible to design a perfect hashing function which avoids collisions. For example, if all items in the collection are given in advance (e.g. a dictionary) and a function is designed to suit this collection and nothing else.

Handling Collisions There are several methods of solving the problem of collision: open addressing: probe array for the "next" slot which is still empty. Different methods for finding the "next" slot: linear probing quadratic probing double hashing separate chaining: each slot in the array contains a reference to a bucket of items with the keys hashed to that slot. Usually a linked list.

Open addressing Using the open addressing technique, after a collision, new locations are examined until an empty one is found. The item to be inserted is placed here. If an item is deleted, then it is replaced with the word "DELETED" . The sequence in which other locations are examined can be determined in many different ways. For example, after a collision, new locations could be searched from bottom to top, top to bottom, or by every 2 places.

Open addressing Compute the index given the key Go to the index If it is free, insert the item If it is not free, go to the next possible slot. To find the next possible slot, generates a "probing sequence".

Probing Sequence linear probing: search sequentially for the next free slot. The simplest probing sequence is h(k), h(k) + 1, h(k) + 2, ... but in general We can do h(k), h(k) + c, h(k) + 2c, h(k) + 3c, ... where c is some constant. Problem: clustering. If several keys are mapped to the same index, we get large clusters of occupied cells with empty spaces in between. The larger the cluster, the longer the search. Suppose k1, k2 and k3 are all mapped to the same index. Then searching for/inserting (k1,v1) is OK, inserting (k2,v2) takes an extra step, inserting (k3,v3) takes extra two steps etc.:

Probing Sequence K1, V1 K2, V2 K3, V3

Linear probing After first being located to an already occupied table position, the item is sent sequentially down the table until an empty space is found. If m is the number of possible positions in the table, then the linear probe continues until either an empty spot is found, or until m-1 locations have been searched. One thing that happens with linear probing however is clustering, or bunching together.  Primary clustering occurs when items are always hashed to the same spot in the table, and the lookup requires searching through 4 or 5 buckets before finding the empty spot.     

Clustering A graphic might help explain this: The opposite of clustering is uniform distribution. Simply, this is when the hash function uniformly distributes the items over the table.

Chaining A collision has occurred. The position resolved by the resolution technique is already full, therefore the element you wish to insert into the hash table cannot go there without losing data. Separate chaining simply makes a backup list of elements that would have otherwise been inserted into a given position had that position been empty.

Chaining For example, let us assume that position x in the table has been previously filled by a hash insertion. Now, let us assume that the element we wish to insert into the hash table resolves to position x as well. We cannot simply overwrite the element already in position x. Each hash table entry requires a pointer to a linked list of records that can hold the element data. This list is only used when a collision occurs. The element that we are inserting, E, is then appended onto the list associated with the target position, x. The result is a linked list associated with each table position, each list containing only those elements that collided on that table position with an element already in that position.

Chaining The worst case: all keys hashed to the same slot. Items kept in a linked list of size N (size of the collection), search O(N). Best case: hash function distributes the keys uniformly, the size of the bucket is N/k.

Probing Sequence Quadratic probing - the offset is squared at every step. Jumps around much more, but the items with the same keys make exactly the same jumps. Secondary clustering: each new item with the same hash value requires a longer probing sequence. Double hashing: offset is calculated by a second hashing function h', which is different from the first one: h(k), h(k) + h'(k), h(k) + 2h'(k), h(k) + 3h'(k), ... Solves the problem of clustering.

Quadratic Probing h(k,i) = (h(k) + c1i + c2 i2) mod m Only certain combination of c1, c2, and m use the entire hash table. h(k1,0) = h(k2,0) implies h(k1,i) = h(k2,i). This leads to secondary clustering. There are only m (<<m!) distinct probe sequences. Example: h(k1,i) = (h(k) + i + i2) mod m, c1 = c2 = 1 In this example, the probe sequence is h(k)h(k) + 2h(k) + 6h(k) + 12...

Double hashing Double hashing is another method of collision resolution, but unlike the linear collision resolution, double hashing uses a second hashing function which normally limits multiple collisions. The idea is that if two values hash to the same spot in the table, a constant can be calculated from the initial value using the second hashing function which can then be used to change the sequence of locations in the table, but still have access to the entire table.

Useful rules for the table size Unless it is possible to map each key to a unique index in the array, table size should be a prime number (important if modulo division is used; also for linear probing). If open addressing is used, table size should be larger than the number of values stored. Optimal load factor for open addressing is 1/2 (the table twice larger than the number of items). Load factor is the ratio of the number of items to the size of the hash table.

Summary Hash tables are very useful when the size of collection is known in advance and search, insertion and deletion should be fast. If the table is large enough and the hash function is chosen well, these operations are almost constant. Difficult to resize (copy into a larger array and change the hashing function) and to iterate through all items (array contains gaps).