CS 253: Algorithms Chapter 11 Hashing Credit: Dr. George Bebis

The Search Problem
Given an array A containing n keys and a search key x, find the index i such that x = A[i]. A key may be part of a larger record.
Applications:
- Banking: keeping track of customer account information; search through records to check balances and perform transactions
- Flight reservations: search to find empty seats, cancel or modify reservations
- Search engines: look for all documents containing a given word

Dictionaries
Dictionary = a data structure that supports the basic operations insert, delete, and search of an item (based on a given key).
Queries that return information about the set S:
- Search(S, k)
- Minimum(S), Maximum(S)
- Successor(S, x), Predecessor(S, x)
Operations that modify the set:
- Insert(S, k)
- Delete(S, k) (not used very often)

Direct Addressing
If key values are distinct and drawn from a universe U = {0, 1, ..., m - 1}, you may use a direct-address table: an array T[0 .. m - 1].
- Each slot (position) in T corresponds to a key in U
- For an element x with key k, a pointer to x (or x itself) is placed in slot T[k]
- If the set contains no element with key k, then T[k] = NIL

Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
  return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
  T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
  T[key[x]] ← NIL
Running time for each of these operations: O(1)
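A minimal executable sketch of a direct-address table (Python used here for illustration; the class and method names are this sketch's own, not from the slides):

class DirectAddressTable:
    """Direct addressing: one slot per key in U = {0, 1, ..., m-1}."""
    def __init__(self, m):
        self.slots = [None] * m      # T[0 .. m-1], all initially NIL

    def search(self, k):             # DIRECT-ADDRESS-SEARCH: O(1)
        return self.slots[k]

    def insert(self, key, value):    # DIRECT-ADDRESS-INSERT: O(1)
        self.slots[key] = value

    def delete(self, key):           # DIRECT-ADDRESS-DELETE: O(1)
        self.slots[key] = None

Usage: t = DirectAddressTable(10); t.insert(3, "record"); t.search(3) returns "record".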

Comparing Different Implementations
Implementing dictionaries using direct addressing, ordered/unordered arrays, and ordered/unordered linked lists:

Implementation     | Insert | Search
direct addressing  | O(1)   | O(1)
ordered array      | O(N)   | O(lg N)
ordered list       | O(N)   | O(N)
unordered array    | O(1)   | O(N)
unordered list     | O(1)   | O(N)

Examples Using Direct Addressing
Suppose the keys are 9-digit Social Security Numbers. We could use the same strategy as before, but it would be very inefficient: an array of one billion slots would be needed to store, say, 100 records!
- |U| (the size of the entire key universe) can be very large
- |K| (the set of keys actually used) can be much smaller than |U|

Hash Tables
When K is much smaller than U, a hash table requires much less space than a direct-address table: storage can be reduced to Θ(|K|) while still achieving O(1) search time on average (not in the worst case).
Idea: use a function h(.) to compute the slot for each key, and store the element in slot h(k).
A hash function h transforms a key into an index in a hash table T[0 .. m-1]:
  h : U → {0, 1, ..., m - 1}
We say that k hashes to slot h(k).
Advantage: reduces the range of array indices handled: m instead of |U|.

Revisit Example 2
Suppose the keys are 9-digit Social Security Numbers. One possible hash function: h(ssn) = ssn mod 100 (the last 2 digits of the ssn).
e.g., if ssn = 108 334 198, then h(108334198) = 98.
[Figure: the universe of keys U contains the actual key set K = {k1, ..., k5}; each key maps to a slot of T[0 .. m-1], with h(k2) = h(k5).]
Do you see any problems with this approach? Collisions can occur!

Collisions
A collision: two or more keys hash to the same slot!
For a given set K of keys:
- If |K| > m, collisions will definitely happen (i.e., there must be at least two keys that have the same hash value)
- If |K| ≤ m, collisions may or may not happen
Avoiding collisions completely is hard, even with a good hash function.
Handling collisions:
- Chaining
- Open addressing: linear probing, quadratic probing, double hashing

Handling Collisions Using Chaining
Idea: put all elements that hash to the same slot into a linked list. Slot j contains a pointer to the head of the list of all elements that hash to j.

Operations in Chained Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
  insert x at the head of list T[h(key[x])]
Worst-case running time: O(1). (Note that you may also need to search the list first to check whether the element is already there.)
Alg.: CHAINED-HASH-SEARCH(T, k)
  search for an element with key k in list T[h(k)]
Running time is proportional to the length of the list at slot h(k).
Alg.: CHAINED-HASH-DELETE(T, x)
  delete x from the list T[h(key[x])]
Worst-case running time is the same as for search.
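A minimal sketch of a chained hash table (Python; built-in lists stand in for the linked lists, and the division-method hash is an assumed choice):

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]       # slot j holds a chain

    def _h(self, k):
        return k % self.m                          # division-method hash

    def insert(self, k, v):
        self.table[self._h(k)].insert(0, (k, v))   # insert at head: O(1)

    def search(self, k):
        for key, val in self.table[self._h(k)]:    # scan the chain at h(k)
            if key == k:
                return val
        return None

    def delete(self, k):
        chain = self.table[self._h(k)]
        for i, (key, _) in enumerate(chain):       # find, then unlink
            if key == k:
                del chain[i]
                return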

Analysis of Hashing with Chaining: Average Case
Worst case: all n keys hash to the same slot; search/delete operations may then take Θ(n).
The average case depends on how well the hash function distributes the n keys among the m slots.
Simple uniform hashing: any given element is equally likely to hash into any of the m slots (i.e., when T is empty, the collision probability Pr(h(x) = h(y)) is 1/m).
- Length of the list at slot j: n_j, for j = 0, 1, ..., m - 1
- Number of keys in the table: n = n_0 + n_1 + ... + n_{m-1}
- Average value of n_j: E[n_j] = α = n/m
Load factor of a hash table T: α = n/m = the average number of elements stored in a chain.

Case I: Unsuccessful Search (i.e., the item is not stored in the table)
Theorem: An unsuccessful search in a hash table takes expected time Θ(1 + α) under the assumption of simple uniform hashing (i.e., given any random x and y, the collision probability Pr(h(x) = h(y)) is 1/m).
Proof: we need to search to the end of the list T[h(k)].
- Expected length of that list: E[n_{h(k)}] = α = n/m
- Total time required: O(1) (to compute h(k)) plus α, which is Θ(1 + α)

Case II: Successful Search (i.e., the item is found in the table)
A successful search in a hash table takes expected time Θ(1 + α/2) under the assumption of simple uniform hashing, where α = n/m (the expected length of a list).
If m (the number of slots) is proportional to n (the number of elements in the table):
  n = O(m) ⇒ α = n/m = O(m)/m = O(1)
  ⇒ searching (both successful and unsuccessful) takes constant time O(1) on average.

Hash Functions
What makes a good hash function?
- Easy to compute
- Approximates a random function: over all the keys, every output is equally likely (simple uniform hashing)
In practice, it is very hard to satisfy the simple uniform hashing property, i.e., we don't know in advance the probability distribution that keys are drawn from. So:
- Minimize the chance that closely related keys hash to the same slot (e.g., strings such as "pt" and "pts" should hash to different slots)
- Derive a hash value that is independent of any patterns that may exist in the distribution of the keys

The Division Method
Idea: map a key k into one of the m slots by taking the remainder of k divided by m:
  h(k) = k mod m
Advantage: fast; requires only a single division operation.
Disadvantage: certain values of m are bad, e.g., powers of 2 or non-prime numbers. If m = 2^p, then h(k) is just the least significant p bits of k.
Choose m to be a prime not close to a power of 2.
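A one-line sketch of the division method (Python; the prime m = 701 is an illustrative choice, not from the slides):

def hash_division(k, m=701):
    # m is a prime not close to a power of 2 (2^9 = 512, 2^10 = 1024)
    return k % m

print(hash_division(108334198))   # maps a 9-digit SSN into 0..700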

The Multiplication Method
  h(k) = ⌊m · (kA mod 1)⌋, where 0 < A < 1
and (kA mod 1) = kA − ⌊kA⌋ is the fractional part of kA.
Disadvantage: slower than the division method.
Advantage: the value of m is not critical; typically m = 2^p.
Example (binary): m = 2^3, A = .101101, k = 110101
  kA = 1001010.0110011
  Discard the integer part: 1001010
  Shift .0110011 left by 3 bits: 011.0011
  Take the integer part: 011
Thus, h(110101) = 011.
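A sketch of the multiplication method in Python. Knuth's suggested constant A = (√5 − 1)/2 is an assumption here; the slide's example uses a different binary fraction:

import math

def hash_mul(k, m, A=(math.sqrt(5) - 1) / 2):
    # h(k) = floor(m * frac(k * A)), with 0 < A < 1
    frac = (k * A) % 1.0        # fractional part of k*A
    return int(m * frac)        # floor, since frac >= 0

print(hash_mul(123456, 8))      # a slot in 0..7; m = 2^3 is fine here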

Open Addressing
If we have enough contiguous memory to store all the keys (m > n), we can store the keys in the table itself: no linked lists are needed.
- Insertion: if a slot is full, try another one, until you find an empty one
- Search: follow the same sequence of probes
- Deletion: more difficult (we'll see why)
Search time depends on the length of the probe sequence!

Generalized hash function notation:
A hash function now takes two arguments: (i) the key value, and (ii) the probe number:
  h(k, p), p = 0, 1, ..., m-1
The probe sequence <h(k,0), h(k,1), ..., h(k,m-1)> must be a permutation of <0, 1, ..., m-1>.
There are m! possible permutations; an optimal hash function would be able to produce all m! probe sequences.
(e.g., inserting key 14 might follow the probe sequence <1, 5, 9>)

Common Open Addressing Methods Linear probing Quadratic probing Double hashing

Linear probing: Inserting a key
Idea: when there is a collision, check the next available position in the table (i.e., keep probing):
  h(k, i) = (h1(k) + i) mod m,  i = 0, 1, 2, ...
- First slot probed: h1(k)
- Second slot probed: h1(k) + 1
- Third slot probed: h1(k) + 2, and so on, wrapping around to the start of the table
Probe sequence: <h1(k), h1(k)+1, h1(k)+2, ...> (mod m)
Linear probing can generate at most m distinct probe sequences. Why? The entire sequence is determined by the initial probe position h1(k), and there are only m possible initial positions.

Linear probing: Searching for a key
Four cases for the probed position:
1. It is occupied by an element with an equal key
2. It is marked as 'deleted'
3. It is occupied by a different element
4. It is empty
In cases 2 and 3, probe the next higher index until the element is found or an empty position is reached; the process wraps around to the beginning of the table.

Linear probing: Deleting a key
Problem: we cannot simply mark the slot as empty; otherwise it may become impossible to retrieve keys that were inserted when that slot was occupied.
Solution: mark the slot with a sentinel value DELETED. A deleted slot can later be reused for insertion.
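A sketch of open addressing with linear probing, including the DELETED sentinel (Python; the names and the sentinel object are this sketch's own, and the division-method h1 is an assumed choice):

DELETED = object()   # sentinel marking a removed slot

class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, k):
        h1 = k % self.m
        for i in range(self.m):          # h(k, i) = (h1(k) + i) mod m
            yield (h1 + i) % self.m

    def insert(self, k):
        for j in self._probe(k):
            if self.slots[j] is None or self.slots[j] is DELETED:
                self.slots[j] = k        # empty and deleted slots are reusable
                return j
        raise RuntimeError("hash table overflow")

    def search(self, k):
        for j in self._probe(k):
            if self.slots[j] is None:    # truly empty slot: k is absent
                return None
            if self.slots[j] == k:       # skips DELETED and mismatched keys
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.slots[j] = DELETED      # mark, don't empty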

Primary Clustering Problem
With linear probing, some slots become more likely to be filled than others: long runs of occupied slots build up, and search time increases!
[Figure: initially every slot is hit with probability 1/m; slots just past growing clusters are hit with higher probability, e.g., 2/m, 4/m, 5/m in the example.]

Quadratic probing
  h(k, i) = (h'(k) + c1·i + c2·i²) mod m,  i = 0, 1, ..., m-1
where h': U → {0, 1, ..., m-1} and c1, c2 > 0.
The clustering problem is less serious, but still an issue (secondary clustering).
How many distinct probe sequences can quadratic probing generate? m, since the initial probe position determines the entire probe sequence.
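A sketch of the quadratic probe sequence (Python; the choice c1 = c2 = 1/2 with m a power of 2 is a common textbook assumption, since it makes the sequence a full permutation):

def quadratic_probe_sequence(k, m):
    # h(k, i) = (h'(k) + c1*i + c2*i^2) mod m with c1 = c2 = 1/2,
    # i.e. offset i*(i+1)//2; for m a power of 2 this visits every slot.
    h0 = k % m                               # h'(k): division-method initial probe
    return [(h0 + i * (i + 1) // 2) % m for i in range(m)]

print(quadratic_probe_sequence(14, 8))       # a full permutation of 0..7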

Double Hashing
Use one hash function to determine the first slot, and a second hash function to determine the increment for the probe sequence:
  h(k, i) = (h1(k) + i·h2(k)) mod m,  i = 0, 1, ...
- Initial probe: h1(k)
- Each subsequent probe is offset by h2(k) (mod m), and so on
Advantage: avoids clustering.
Disadvantage: harder to delete an element.
Double hashing can generate up to m² probe sequences. Why? For two different keys ka and kb, even if h1(ka) = h1(kb), it is unlikely that h2(ka) = h2(kb), so two keys that initially probe the same location may still follow different probe sequences. After the initial probe, the increment can be any of m values, each producing a distinct probe sequence; with m possible initial probes, there are m · m = m² sequences in total.

Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i·h2(k)) mod 13
Insert key 14:
  h(14, 0) = h1(14) = 14 mod 13 = 1
  h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5
  h(14, 2) = (h1(14) + 2·h2(14)) mod 13 = (1 + 8) mod 13 = 9
[Figure: a table with slots 0..12 already holding 79, 69, 98, 72, and 50; slots 1 and 5 are occupied, so key 14 is placed in slot 9.]
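A sketch that reproduces the example's probe sequence (Python; the function name is this sketch's own):

def double_hash_probes(k, m=13):
    h1 = k % 13                          # h1(k) = k mod 13
    h2 = 1 + (k % 11)                    # h2(k) = 1 + (k mod 11)
    return [(h1 + i * h2) % m for i in range(m)]

print(double_hash_probes(14)[:3])        # [1, 5, 9]: the slots probed for key 14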

Analysis of Open Addressing
Theorem 11.6: Given an open-address hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 - α), assuming uniform hashing.
Proof sketch: ignore the problem of clustering and assume that all probe sequences are equally likely. For an unsuccessful search:
- Pr(a probe hits an occupied cell) = α (load factor α = n/m)
- Pr(a probe hits an empty cell) = 1 - α
- Pr(the search terminates in exactly 2 probes) = α(1 - α)
- Pr(the search terminates in exactly k probes) = α^(k-1) · (1 - α)
What is the expected number of probes in a search? (See the derivation below.)
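A worked derivation of the expected probe count from the geometric distribution above (standard algebra, not spelled out on the slides):

\begin{align*}
E[\text{probes}] &= \sum_{k=1}^{\infty} k\,\alpha^{k-1}(1-\alpha)
                  = (1-\alpha)\sum_{k=1}^{\infty} k\,\alpha^{k-1} \\
                 &= (1-\alpha)\cdot\frac{1}{(1-\alpha)^2}
                  = \frac{1}{1-\alpha},
\end{align*}

using $\sum_{k \ge 1} k x^{k-1} = 1/(1-x)^2$ for $|x| < 1$.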

Analysis of Open Addressing (cont'd)
Successful search: the expected number of probes is at most (1/α) · ln(1/(1 - α)), assuming uniform hashing.
Examples:
Unsuccessful search (1/(1 - α)):
  α = 0.5 ⇒ E[#steps] = 2
  α = 0.9 ⇒ E[#steps] = 10
Successful search ((1/α) · ln(1/(1 - α))):
  α = 0.5 ⇒ E[#steps] ≈ 1.386
  α = 0.9 ⇒ E[#steps] ≈ 2.558