Hashing COMP171

Hashing 2 Hashing …
* Again, we have a (dynamic) set of elements on which we do 'search', 'insert', and 'delete'
- Linear structures: lists, stacks, queues, …
- Nonlinear structures: trees, graphs (relations between elements are explicit)
- Now consider the case where the relations among elements are not important, but we want searching to be efficient (as in a dictionary)!
* Hashing generalizes the ordinary array:
- direct addressing! An array is a direct-address table
- Given a set of N keys, compute the index, then use an array of size N
- Key k stored at index k -> direct addressing; key k stored at index h(k) -> hashing
* The basic operation is O(1)!
* To 'hash' (literally, to 'chop into pieces' or to 'mince') is to apply a 'map' or a 'transform' to the key …

Hashing 3 Hash Table
* A hash table is a data structure that supports finds, insertions, and deletions (deletions may be unnecessary in some applications)
* The implementation of hash tables is called hashing
- a technique that allows the above operations to execute in constant average time
* Tree operations that require ordering information among the elements are not supported:
- findMin and findMax
- successor and predecessor
- reporting data within a given range
- listing the data in order

Hashing 4 General Idea
* The ideal hash table data structure is an array of some fixed size, containing the items
* A search is performed based on the key: each key is mapped into some position in the range 0 to TableSize-1
* The mapping is called the hash function
(Figure: a hash table; each data item is stored at the position its key maps to.)

Hashing 5 Unrealistic Solution
* Each position (slot) corresponds to a key in the universe of keys
- T[k] corresponds to an element with key k
- If the set contains no element with key k, then T[k] = NULL

Hashing 6 Unrealistic Solution
* Insertions, deletions, and finds all take O(1) (worst-case) time
* Problem: wastes too much space if the universe is large compared with the actual number of elements to be stored
- E.g. student IDs are 8-digit integers, so the universe size is 10^8, but we only have about 7000 students

Hashing 7 Hashing
* Usually m << N
* h(K_i) = an integer in [0, …, m-1], called the hash value of K_i
* The keys are assumed to be natural numbers; if they are not, they can always be converted to or interpreted as natural numbers

Hashing 8 Example Applications
* Compilers use hash tables (symbol tables) to keep track of declared variables
* On-line spell checkers: after prehashing the entire dictionary, one can check each word in constant time and print out the misspelled words in the order of their appearance in the document
* Useful in applications where the input keys come in sorted order; this is a bad case for binary search trees. AVL trees and B+-trees are harder to implement, and they are not necessarily more efficient

Hashing 9 Hash Function
* With hashing, an element of key k is stored in T[h(k)]
* h: hash function
- maps the universe U of keys into the slots of a hash table T[0..m-1]
- an element of key k hashes to slot h(k)
- h(k) is the hash value of key k

Hashing 10 Collision
* Problem: collision
- two keys may hash to the same slot
- can we ensure that any two distinct keys get different cells? No, not if N > m, where m is the size of the hash table
* Task 1: design a good hash function
- that is fast to compute, and
- that minimizes the number of collisions
* Task 2: design a method to resolve the collisions when they occur

Hashing 11 Design Hash Function
* A simple and reasonable strategy: h(k) = k mod m
- e.g. m = 12, k = 100, h(k) = 4
- requires only a single division operation (quite fast)
* Certain values of m should be avoided
- e.g. if m = 2^p, then h(k) is just the p lowest-order bits of k; the hash function does not depend on all the bits
- similarly, if the keys are decimal numbers, m should not be a power of 10
* It is good practice to set the table size m to be a prime number
* Good values for m: primes not too close to exact powers of 2
- e.g. if the hash table is to hold 2000 numbers and we don't mind an average of 3 numbers hashing to the same entry, choose m = 701
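A minimal sketch of the division method in C++; m = 701 follows the slide's example, and the function name is illustrative:

```cpp
#include <cstddef>

// Division-method hash: h(k) = k mod m.
// m is a prime not too close to a power of 2 (701, as on the slide).
const std::size_t TABLE_SIZE = 701;

std::size_t hash_int(std::size_t k) {
    return k % TABLE_SIZE;  // a single division operation: fast
}
// If m were 2^p, the result would be just the p lowest-order bits of k,
// ignoring the rest of the key -- which is why powers of 2 are avoided.
```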

Hashing 12 Deal with String-type Keys
* Can the keys be strings?
* Most hash functions assume that the keys are natural numbers
- if keys are not natural numbers, a way must be found to interpret them as natural numbers
* Method 1: add up the ASCII values of the characters in the string
- Problems:
  - Different permutations of the same set of characters have the same hash value
  - If the table size is large, the keys are not distributed well. E.g. suppose m = 10007 and all keys are eight or fewer characters long. Since each ASCII value is <= 127, the hash function can only take values between 0 and 127 * 8 = 1016

Hashing 13
* Method 2: examine only the first 3 characters, e.g. h(K) = K[0] + 27 K[1] + 27^2 K[2], where 27 is the size of the alphabet a, …, z plus space
- If the first 3 characters are random and the table size is 10,007 => a reasonably equitable distribution
- Problem:
  - English is not random
  - Only 28 percent of the table can actually be hashed to (assuming a table size of 10,007)
* Method 3: compute h(K) = (K[0] 27^(L-1) + K[1] 27^(L-2) + … + K[L-1]) mod m for a key of length L
- involves all characters in the key and can be expected to distribute well
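A C++ sketch of Method 3 using Horner's rule; the radix 27 follows the slide's alphabet of a, …, z plus space (textbooks often use 37 instead), and the function name is illustrative:

```cpp
#include <string>
#include <cstddef>

// Method 3: treat the key as a base-27 number and reduce mod the table
// size, so that every character of the key contributes to the hash value.
std::size_t hash_str(const std::string& key, std::size_t tableSize) {
    std::size_t h = 0;
    for (char ch : key)
        h = (27 * h + static_cast<unsigned char>(ch)) % tableSize;  // Horner's rule
    return h;
}
```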

Hashing 14 Collision Handling: (1) Separate Chaining
* Like 'equivalence classes', or clock numbers in math
* Instead of a plain hash table, we use a table of linked lists
* Keep a linked list of all keys that hash to the same value
* Example: keys are a set of squares; hash function: h(K) = K mod 10

Hashing 15 Separate Chaining Operations
* To insert a key K (see the sketch below):
- Compute h(K) to determine which list to traverse
- If T[h(K)] contains a null pointer, initialize this entry to point to a linked list that contains K alone
- If T[h(K)] is a non-empty list, add K at the beginning of this list
* To delete a key K:
- Compute h(K), then search for K within the list at T[h(K)]; delete K if it is found
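A minimal sketch of these operations in C++; all names are illustrative, not the course's actual code:

```cpp
#include <list>
#include <vector>
#include <algorithm>
#include <cstddef>

// Separate chaining: T is a table of linked lists, one list per slot.
class ChainedHashTable {
    std::vector<std::list<int>> T;
    std::size_t h(int K) const { return static_cast<std::size_t>(K) % T.size(); }
public:
    explicit ChainedHashTable(std::size_t m = 101) : T(m) {}

    void insert(int K) {                 // add K at the front of its list
        std::list<int>& L = T[h(K)];
        if (std::find(L.begin(), L.end(), K) == L.end())
            L.push_front(K);
    }
    bool find(int K) const {             // traverse only one list
        const std::list<int>& L = T[h(K)];
        return std::find(L.begin(), L.end(), K) != L.end();
    }
    void remove(int K) { T[h(K)].remove(K); }  // deletion is easy
};
```

If the hash function distributes the keys well, each list stays a small constant length, matching the constant-time claim on the next slide.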

Hashing 16 Separate Chaining Features
* If the hash function works well, the number of keys in each linked list will be a small constant
* Therefore, we expect each search, insertion, and deletion to be done in constant time
* Disadvantage: memory allocation for linked-list manipulation slows the program down
* Advantage: deletion is easy

Hashing 17 Collision Handling: (2) Open Addressing
* Instead of following pointers, compute the sequence of slots to be examined
* Open addressing: relocate the key K to be inserted if it collides with an existing key
- i.e. we store K at an entry different from T[h(K)]
* Two issues arise:
- what is the relocation scheme?
- how do we search for K later?
* Three common methods for resolving a collision in open addressing:
- Linear probing
- Quadratic probing
- Double hashing

Hashing 18 Open Addressing Strategy
* To insert a key K, compute h_0(K). If T[h_0(K)] is empty, insert K there. If a collision occurs, probe alternative cells h_1(K), h_2(K), … until an empty cell is found
* h_i(K) = (hash(K) + f(i)) mod m, with f(0) = 0
- f: the collision resolution strategy

Hashing 19 Linear Probing
* f(i) = i
- cells are probed sequentially (with wrap-around)
- h_i(K) = (hash(K) + i) mod m
* Insertion (see the sketch below):
- Let K be the new key to be inserted; compute hash(K)
- For i = 0 to m-1:
  - compute L = (hash(K) + i) mod m
  - if T[L] is empty, put K there and stop
- If we cannot find an empty entry to put K in, the table is full and we should report an error
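A sketch of linear-probing insertion following the loop above (names are illustrative; quadratic probing would replace f(i) = i with f(i) = i^2):

```cpp
#include <vector>
#include <optional>
#include <stdexcept>
#include <cstddef>

// Linear probing: probe h(K), h(K)+1, h(K)+2, ... (mod m) until an
// empty cell is found. An empty optional marks an empty slot.
class LinearProbingTable {
    std::vector<std::optional<int>> T;
    std::size_t hash(int K) const { return static_cast<std::size_t>(K) % T.size(); }
public:
    explicit LinearProbingTable(std::size_t m = 11) : T(m) {}

    void insert(int K) {
        for (std::size_t i = 0; i < T.size(); ++i) {
            std::size_t L = (hash(K) + i) % T.size();   // f(i) = i, with wrap-around
            if (!T[L] || *T[L] == K) { T[L] = K; return; }
        }
        throw std::overflow_error("hash table is full"); // no empty entry found
    }
};
```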

Hashing 20 Linear Probing Example
* h_i(K) = (hash(K) + i) mod m
* E.g., inserting keys 89, 18, 49, 58, 69 with hash(K) = K mod 10
- To insert 58, probe T[8], T[9], T[0], T[1]
- To insert 69, probe T[9], T[0], T[1], T[2]

Hashing 21 Primary Clustering
* We call a block of contiguously occupied table entries a cluster
* On average, when we insert a new key K, we hit the middle of a cluster, so the time to insert K is proportional to half the size of the cluster. That is, the larger the cluster, the slower the performance
* Linear probing has the following disadvantages:
- Once h(K) falls into a cluster, the cluster is certain to grow in size by one; this can worsen the performance of future insertions
- If two clusters are separated by only one entry, inserting one key into a cluster can merge the two clusters. The cluster size can thus increase drastically from a single insertion, so insertion performance can deteriorate drastically after a single insertion
- Large clusters are easy targets for collisions

Hashing 22 Quadratic Probing Example
* f(i) = i^2
* h_i(K) = (hash(K) + i^2) mod m
* E.g., inserting keys 89, 18, 49, 58, 69 with hash(K) = K mod 10
- To insert 58, probe T[8], T[9], T[(8+4) mod 10]
- To insert 69, probe T[9], T[(9+1) mod 10], T[(9+4) mod 10]

Hashing 23 Quadratic Probing
* Two keys with different home positions will have different probe sequences
- e.g. m = 101, h(k1) = 30, h(k2) = 29
- probe sequence for k1: 30, 30+1, 30+4, 30+9, …
- probe sequence for k2: 29, 29+1, 29+4, 29+9, …
* If the table size is prime, then a new key can always be inserted as long as the table is at least half empty (see the proof in the textbook)
* Secondary clustering:
- Keys that hash to the same home position probe the same alternative cells
- Simulation results suggest that it generally costs less than an extra half probe per search
- To avoid secondary clustering, the probe sequence needs to be a function of the original key value, not of the home position

Hashing 24 Double Hashing
* To alleviate the problem of clustering, the sequence of probes for a key should be independent of its primary position => use two hash functions: hash() and hash2()
* f(i) = i * hash2(K)
- E.g. hash2(K) = R - (K mod R), where R is a prime smaller than m

Hashing 25 Double Hashing Example
* h_i(K) = (hash(K) + f(i)) mod m; hash(K) = K mod m
* f(i) = i * hash2(K); hash2(K) = R - (K mod R)
* Example: m = 10, R = 7; insert keys 89, 18, 49, 58, 69
- To insert 49, hash2(49) = 7, so the 2nd probe is T[(9+7) mod 10]
- To insert 58, hash2(58) = 5, so the 2nd probe is T[(8+5) mod 10]
- To insert 69, hash2(69) = 1, so the 2nd probe is T[(9+1) mod 10]

Hashing 26 Choice of hash2()
* hash2() must never evaluate to zero
* For any key K, hash2(K) must be relatively prime to the table size m; otherwise, we will only be able to examine a fraction of the table entries
- E.g., if hash(K) = 0 and hash2(K) = m/2, then we can only examine the entries T[0] and T[m/2], and nothing else!
* One solution is to make m prime, choose R to be a prime smaller than m, and set hash2(K) = R - (K mod R)
* Quadratic probing, however, does not require a second hash function
- it is likely to be simpler and faster in practice
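A sketch of the slides' double-hashing scheme in C++; m = 10 and R = 7 follow the example above (though in practice m itself should also be prime), and all names are illustrative:

```cpp
#include <cstddef>

const std::size_t m = 10;  // table size, from the slide's example
const std::size_t R = 7;   // a prime smaller than m

std::size_t hash1(int K) { return static_cast<std::size_t>(K) % m; }
std::size_t hash2(int K) { return R - (static_cast<std::size_t>(K) % R); }  // in [1, R]: never zero

// i-th probe position for key K: (hash1(K) + i * hash2(K)) mod m
std::size_t probe(int K, std::size_t i) {
    return (hash1(K) + i * hash2(K)) % m;
}
// e.g. K = 49: hash1(49) = 9, hash2(49) = 7 - 0 = 7,
// so the 2nd probe is (9 + 7) mod 10 = 6, as on the previous slide.
```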

Hashing 27 Deletion in Open Addressing
* Actual deletion cannot be performed in open-addressing hash tables
- otherwise it would isolate records further down the probe sequence
* Solution: add an extra bit to each table entry, and mark a deleted slot by storing a special value DELETED (a 'tombstone'); this is called 'lazy deletion'
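A minimal sketch of how a slot might carry the extra state for lazy deletion (names are illustrative):

```cpp
// Each slot carries a state flag; 'deleting' a key only marks the slot
// DELETED, so probe sequences through it remain intact.
enum class SlotState { EMPTY, OCCUPIED, DELETED };

struct Slot {
    int key = 0;
    SlotState state = SlotState::EMPTY;
};
// During a find, probing continues past DELETED slots (a key may live
// further down the probe sequence); during an insert, a DELETED slot
// may be reused for the new key.
```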

Hashing 28 Re-hashing
* If the table becomes full:
- double the size, and re-hash everything with a new hash function (see the sketch below)
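A sketch of re-hashing for an open-addressing table; nextPrime and the linear-probing insert here are minimal illustrative helpers, not code from the course:

```cpp
#include <vector>
#include <optional>
#include <cstddef>

static bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}
static std::size_t nextPrime(std::size_t n) {  // smallest prime >= n
    while (!isPrime(n)) ++n;
    return n;
}
static void insert(std::vector<std::optional<int>>& T, int K) {  // linear probing
    for (std::size_t i = 0; i < T.size(); ++i) {
        std::size_t L = (static_cast<std::size_t>(K) % T.size() + i) % T.size();
        if (!T[L]) { T[L] = K; return; }
    }
}

// Roughly double the table (rounded up to a prime) and re-insert every
// key, so each key is hashed with the new table size.
void rehash(std::vector<std::optional<int>>& T) {
    std::vector<std::optional<int>> old = std::move(T);
    T.assign(nextPrime(2 * old.size()), std::nullopt);
    for (const auto& slot : old)
        if (slot) insert(T, *slot);
}
```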

Pattern Matching COMP171 Spring 2009

Hashing 30 Pattern Matching
* Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of the pattern within the text
* Example: T = … and P = 0001:
- the first occurrence starts at T[1]
- the second occurrence starts at T[5]
- the third occurrence starts at T[11]

Hashing 31 Naïve algorithm
* Worst-case running time = O(nm)
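A sketch of the naïve algorithm in C++ (the function name is illustrative): try every shift s and compare character by character, giving the O(nm) worst case:

```cpp
#include <string>
#include <vector>
#include <cstddef>

// Return the starting positions of every occurrence of P in T.
std::vector<std::size_t> naiveMatch(const std::string& T, const std::string& P) {
    std::vector<std::size_t> occ;
    if (P.empty() || P.size() > T.size()) return occ;
    for (std::size_t s = 0; s + P.size() <= T.size(); ++s) {
        std::size_t j = 0;
        while (j < P.size() && T[s + j] == P[j]) ++j;  // compare at shift s
        if (j == P.size()) occ.push_back(s);           // match starts at T[s]
    }
    return occ;
}
```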

Hashing 32 Can we do better?
* The naïve algorithm is O(mn) in the worst case
* But linear-time algorithms do exist (optional):
- Boyer-Moore
- Knuth-Morris-Pratt
- Finite automata
* Using the idea of 'hashing': the Rabin-Karp algorithm

Hashing 33 Boyer-Moore Algorithm
* The basic idea is simple
* We match the pattern P against substrings of the text string T from right to left
* We align the pattern with the beginning of the text string and compare characters starting from the rightmost character of the pattern. If a comparison fails, we shift the pattern to the right; but by how far?

Hashing 34
(Figure: the pattern ANPANMAN aligned at 8 successive positions against a text containing an X. The X in position 8 excludes all 8 of the possible alignments at once, since X does not occur in the pattern.)

Hashing 35 Rabin-Karp Algorithm
* Key idea:
- Think of the pattern P[0..m-1] as a key, and transform it into an equivalent integer p
- Similarly, transform substrings of the text string T[] into integers: for s = 0, 1, …, n-m, transform T[s..s+m-1] into an equivalent integer t_s
- The pattern occurs at position s if and only if p = t_s
* If we can compute p and the t_s quickly, the pattern matching problem reduces to comparing p with n-m+1 integers

Hashing 36 Rabin-Karp Algorithm
* How to compute p?
- p = 2^(m-1) P[0] + 2^(m-2) P[1] + … + 2 P[m-2] + P[m-1]
* Using Horner's rule: p = P[m-1] + 2 (P[m-2] + 2 (P[m-3] + … + 2 P[0] … ))
- This takes O(m) time, assuming each arithmetic operation can be done in O(1) time
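A sketch of Horner's rule in C++, assuming a binary pattern of '0'/'1' characters as in the slides' example:

```cpp
#include <string>

// p = 2^{m-1} P[0] + ... + 2 P[m-2] + P[m-1], evaluated by Horner's rule:
// one multiplication and one addition per character, so O(m) total.
unsigned long long patternValue(const std::string& P) {
    unsigned long long p = 0;
    for (char c : P)
        p = 2 * p + (c - '0');
    return p;
}
```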

Hashing 37 Rabin-Karp Algorithm
* Similarly, we can compute the (n-m+1) integers t_s from the text string
* Done this way, it takes O((n-m+1) m) time, assuming that each arithmetic operation can be done in O(1) time
* This is a bit time-consuming

Hashing 38 Rabin-Karp Algorithm
* A better method computes each integer incrementally from the previous result: t_(s+1) = 2 (t_s - 2^(m-1) T[s]) + T[s+m]
- compute the offset 2^(m-1) once, and use Horner's rule to compute t_0
* This takes O(n+m) time, assuming that each arithmetic operation can be done in O(1) time

Hashing 39 Problem
* The problem with the previous strategy is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time
- In fact, given a very long integer, we may not even be able to represent it with the default integer type
* Therefore, we use modular arithmetic: let q be a prime number such that 2q can be stored in one computer word
- This ensures that all computations can be done using single-precision arithmetic

Hashing 40
(Code on slide: computing the equivalent integer p for the pattern takes O(m); computing all the t_s with the incremental method takes O(n+m).)
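Putting the pieces together, a sketch of the full Rabin-Karp algorithm with modular arithmetic; d = 2 (binary text, as in the slides) and q = 10^9 + 7 are assumed parameters, since any prime q with 2q fitting in a machine word will do:

```cpp
#include <string>
#include <vector>
#include <cstddef>

// Return the starting positions of every occurrence of P in T,
// assuming both strings consist of '0'/'1' characters.
std::vector<std::size_t> rabinKarp(const std::string& T, const std::string& P) {
    const unsigned long long d = 2, q = 1000000007ULL;  // radix and prime modulus (assumed)
    std::size_t n = T.size(), m = P.size();
    std::vector<std::size_t> occ;
    if (m == 0 || m > n) return occ;

    unsigned long long offset = 1;                       // 2^{m-1} mod q
    for (std::size_t i = 1; i < m; ++i) offset = offset * d % q;

    unsigned long long p = 0, t = 0;                     // Horner's rule: O(m)
    for (std::size_t i = 0; i < m; ++i) {
        p = (d * p + (P[i] - '0')) % q;
        t = (d * t + (T[i] - '0')) % q;
    }
    for (std::size_t s = 0; s + m <= n; ++s) {
        // p == t is only a candidate: verify character by character,
        // since distinct substrings can collide modulo q
        if (p == t && T.compare(s, m, P) == 0) occ.push_back(s);
        if (s + m < n)                                   // roll: t_{s+1} from t_s
            t = (d * (t + q - (T[s] - '0') * offset % q) + (T[s + m] - '0')) % q;
    }
    return occ;
}
```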

Hashing 41
* Once we use modular arithmetic, when p = t_s for some s we can no longer be sure that P[0..m-1] is equal to T[s..s+m-1]
* Therefore, after the equality test p = t_s, we should compare P[0..m-1] with T[s..s+m-1] character by character to ensure that we really have a match
* The worst-case running time thus becomes O(nm), but the algorithm avoids many unnecessary character comparisons in practice

Hashing 42 A spell checker with hashing
* Start by reading in words from a dictionary file named dictionary. The words in this file are listed one per line, sorted alphabetically
* Store each word in a hash table, using chaining to resolve collisions
- Start with a table size of roughly 4K entries (the table size should be prime)
- If necessary, rehash to a larger table size to keep the load factor less than 1.0
* After hashing each word in the dictionary file, read in the user-specified text file and check it for spelling errors by looking up each word in the hash table
- A word is defined as a string of letters (possibly containing single quotes), separated by white space and/or punctuation marks
- If a word cannot be found in the hash table, it represents a possible misspelling
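A compact sketch of the spell checker; std::unordered_set (itself a chained hash table) stands in for the hand-built chaining table the assignment asks for, and "document.txt" is an assumed name for the user-specified text file:

```cpp
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    // Hash every dictionary word (one word per line in the file).
    std::unordered_set<std::string> dict;
    std::ifstream din("dictionary");
    for (std::string w; din >> w; ) dict.insert(w);

    // Scan the text file; a word is a run of letters, possibly with
    // single quotes, ended by white space or punctuation.
    std::ifstream fin("document.txt");   // assumed file name
    std::string word;
    char c;
    while (fin.get(c)) {
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '\'')
            word += c;
        else if (!word.empty()) {
            if (!dict.count(word))       // O(1) expected lookup
                std::cout << "possible misspelling: " << word << '\n';
            word.clear();
        }
    }
    if (!word.empty() && !dict.count(word))  // last word, if the file ends mid-word
        std::cout << "possible misspelling: " << word << '\n';
    return 0;
}
```

Case handling is omitted for brevity; a real checker would likely lowercase each word before the lookup.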