Hash Tables Comp 550

Dictionary: a dynamic-set data structure for storing items indexed by keys. Supports the operations Insert, Search, and Delete (each taking O(1) time). Applications: the symbol table of a compiler, routing tables for network communication, associative arrays (Python), page tables, spell checkers, document fingerprints, and more. Hash tables: effective implementations of dictionaries.

Dictionary by Direct-address Tables Direct-address tables are ordinary arrays that support direct addressing by key values: the element whose key is k is obtained by indexing into the kth position of the array. Applicable when we can afford to allocate an array with one position for every possible key, i.e. when the universe of keys U is small. The dictionary operations (search, insert, delete) can each be implemented in O(1) time. Straightforward details in CLRS 11.1.
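As a concrete illustration, here is a minimal direct-address table in Python (the class name and the value-storing convention are illustrative assumptions, not taken from CLRS):

```python
class DirectAddressTable:
    """Direct addressing: one array slot per possible key in U = {0, ..., u-1}."""
    def __init__(self, u):
        self.slots = [None] * u      # one slot for every key in the universe

    def insert(self, key, value):    # O(1)
        self.slots[key] = value

    def search(self, key):           # O(1); None means "absent"
        return self.slots[key]

    def delete(self, key):           # O(1)
        self.slots[key] = None
```

The array has |U| slots no matter how few keys are actually stored, which is exactly why the technique only suits small universes.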

Hashing Figure: a hash function h : U → {0, 1, …, m–1} maps the universe of keys U, of which the actual keys K form a small subset (n = |K| << |U|), to the slots of a hash table T[0..m–1]; key k “hashes” to slot T[h(k)].

Dictionary by Hash Tables Notation: U – universe of all possible keys; K – set of keys actually stored in the dictionary; n = |K| << |U|. Hash tables use arrays of size m = Θ(n): Define functions that map keys to slots of the hash table. Resolve collisions, since many keys map to the same slot. Support search, insert, delete, but not always in O(1) worst-case time. A hash function h : U → {0, 1, …, m–1} maps keys from U to the slots of a hash table T[0..m–1]; key k maps or “hashes” to slot T[h(k)].

Hashing Figure: a collision; two distinct keys k2 and k5 hash to the same slot, h(k2) = h(k5).

Two questions: 1. How can we choose the hash function to minimize collisions? 2. What do we do about collisions when they occur?

Hash table design considerations: Hash function design (minimize collisions by spreading keys evenly; some collisions must occur, because we map many-to-one). Collision resolution (separate chaining, CLRS 11.2; open addressing, CLRS 11.4; perfect hashing, CLRS 11.5). Worst- and average-case times of operations.

Collision Resolution by Chaining Figure: keys that hash to the same slot (here h(k2) = h(k5) = h(k6) and h(k3) = h(k7)) are kept in a linked list hanging off that slot; each table entry stores the head of its list.

Hashing with Chaining Dictionary operations: Chained-Hash-Insert(T, x): insert x at the head of list T[h(key[x])]; worst-case complexity O(1). Chained-Hash-Search(T, k): search for an element with key k in list T[h(k)]; worst-case complexity proportional to the length of the list. Chained-Hash-Delete(T, x): delete x from the list T[h(key[x])]; worst-case complexity is the search time + O(1), given a pointer to the preceding element or a doubly-linked list.
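A minimal chained hash table in Python, a sketch in which Python lists stand in for the slide's linked lists and the built-in hash stands in for h (all names are illustrative):

```python
class ChainedHashTable:
    """Collision resolution by chaining: slot j holds a chain of keys with h(key) = j."""
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]  # one (initially empty) chain per slot

    def _h(self, k):
        return hash(k) % self.m              # stand-in for a good hash function

    def insert(self, k):
        self.table[self._h(k)].append(k)     # head insertion in a linked list would be O(1)

    def search(self, k):
        return k in self.table[self._h(k)]   # time proportional to the chain's length

    def delete(self, k):
        self.table[self._h(k)].remove(k)     # search time plus O(1) unlinking
```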

Analysis of Chained-Hash-Search Worst-case search time: time to compute h(k) + Θ(n). Average time: depends on how well h distributes keys among the slots. Assume simple uniform hashing: any key is equally likely to hash into any of the m slots, independent of where any other key hashes; and O(1) time to compute h(k). Define the load factor α = n/m = average number of keys per slot, where n is the number of keys stored in the table and m is the number of slots (= the number of linked lists).

Some results Theorem: An unsuccessful search takes expected time Θ(1+α). Theorem: A successful search takes expected time Θ(1+α). Theorem: For the special case with m = n slots, with probability at least 1–1/n, the longest list is O(ln n / ln ln n).

Expected Cost of an Unsuccessful Search Theorem: An unsuccessful search takes expected time Θ(1+α). Assume: any key not already in the table is equally likely to hash to any of the m slots. Proof: We must follow pointers to the end of list T[h(k)], which has α items in expectation, so we expect 1+α pointer accesses. Adding the time to compute the hash function, the expected total time remains Θ(1+α). Key fact: the average bin size is α = n/m.

Expected Cost of a Successful Search Theorem: A successful search takes expected time Θ(1+α). Note: despite the similar look, this is very different from the previous theorem; we never look at the empty bins, but often at the full ones! Assume: the element being searched for is equally likely to be any of the n elements in the table. Proof: To find x we look first at all elements inserted after x, then at x. (Think of inserting in reverse order; the cost becomes 1 + the number already in the bucket!) Let X_ij = IRV{keys i & j hash to the same slot}; Pr(X_ij = 1) = 1/m by simple uniform hashing. We want E[(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} X_ij)] = 1 + (1/n) Σ_{i=1..n} (n–i)/m = 1 + (n–1)/(2m) ≤ 1 + α/2. So we expect 1 + α/2 pointer accesses.

Bounding the Size of the Longest List Theorem: For the special case with m = n slots, with probability at least 1–1/n, the longest list is O(ln n / ln ln n). Proof: Let Z_{i,k} = IRV{key i hashes to slot k}; Pr(Z_{i,k} = 1) = 1/m by simple uniform hashing. The probability that a particular slot k receives at least κ keys is (letting m = n) at most C(n,κ)·(1/n)^κ ≤ 1/κ!. If we choose κ = 3 ln n / ln ln n, then κ! > n² and 1/κ! < 1/n². Thus, by a union bound over the n slots, the probability that any slot receives more than κ keys is < 1/n. With probability at least 1–1/n, the longest list has O(ln n / ln ln n) keys.

Size of the Longest List with 2 Choices Theorem: Using m = n slots and 2 hash functions, placing each item in the shorter of the two lists, with probability at least 1–1/n the longest list is O(lg lg n). Proof idea: The height of a ball is i if it was the i-th ball to be placed in its bin. Note: (total # of balls of height i) ≥ (total # of bins with ≥ i balls). Let a_i be the fraction of bins with ≥ i balls; note a_2 ≤ 1/2. Pr[a ball has height ≥ i+1] ≤ a_i², since to place a ball at height i+1 we must select 2 bins of height ≥ i. So we expect a_{i+1} ≤ a_i², and we are unlikely to see more than i = O(lg lg n) balls in a bin. A sketch of this placement rule follows.
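Here is a tiny simulation of the two-choice rule; the two hash functions and their constants are arbitrary illustrative choices, and the printed maximum is typically far smaller than with a single hash function:

```python
import random

n = m = 1000
table = [[] for _ in range(m)]
h1 = lambda k: (k * 2654435761) % m     # two illustrative, unrelated hash functions
h2 = lambda k: (k * 40503 + 12345) % m

def insert_two_choice(key):
    i, j = h1(key), h2(key)             # look at both candidate bins,
    shorter = i if len(table[i]) <= len(table[j]) else j
    table[shorter].append(key)          # and place the key in the shorter one

for k in random.sample(range(10**9), n):
    insert_two_choice(k)
print(max(len(b) for b in table))       # typically O(lg lg n), per the theorem
```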

Implications for separate chaining If n = O(m), then the load factor α = n/m = O(m)/m = O(1), so search takes constant time on average. Deletion takes O(1) worst-case time if you have a pointer to the preceding element in the list. Hence, for hash tables with chaining, all dictionary operations take O(1) time on average, given the assumptions of simple uniform hashing and O(1)-time hash function evaluation. Extra memory (& coding) is needed for the linked-list pointers. Can we satisfy the simple uniform hashing assumption?

Good hash functions (CLRS 11.2)

Good Hash Functions Recall the assumptions of simple uniform hashing: any key is equally likely to hash into any of the slots, independent of where any other key hashes; and h(k) is computable in O(1) time. Hash values should be independent of any patterns that might exist in the data. E.g., if each key is drawn independently from U according to a probability distribution P, we want, for all j ∈ [0…m–1]: Σ_{k : h(k) = j} P(k) = 1/m. In practice we often use heuristics, based on the domain of the keys, to create a hash function that performs well.

Keys as Natural Numbers Let's assume that keys are natural numbers, even if we have to encode them to make them so. Example: interpret a character string as an integer expressed in some radix notation. E.g. “CLRS”: ASCII values C=67, L=76, R=82, S=83. Use base 2^7 = 128 to cover all basic ASCII values. So CLRS = 67·128³ + 76·128² + 82·128¹ + 83·128⁰ = 141,764,947. Why not just sum the ASCII values?
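The radix encoding above in a few lines of Python (the function name and the radix default are illustrative):

```python
def string_to_key(s, radix=128):
    """Interpret an ASCII string as an integer in the given radix (base 2**7 = 128)."""
    k = 0
    for ch in s:
        k = k * radix + ord(ch)   # merely summing ord(ch) would collide on anagrams
    return k

print(string_to_key("CLRS"))      # 67*128**3 + 76*128**2 + 82*128 + 83 = 141764947
```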

“Division Method” Map each key k into one of the m slots by taking the remainder of k divided by m; that is, h(k) = k mod m. Example: m = 31 and k = 78 → h(k) = 16. Advantage: fast, since it requires just one division operation. Disadvantage: for some values of m, such as m = 2^p, the hash depends on just a subset of the bits of the key. Good choice for m: primes are good, if not too close to a power of 2 (or 10).

Multiplication Method Map each key k to one of the m slots indicated by the fractional part of k times a chosen real 0 < A < 1. That is, h(k) = ⌊m (kA mod 1)⌋ = ⌊m (kA – ⌊kA⌋)⌋. Example: m = 1000, k = 123, A ≈ 0.6180339887… h(k) = ⌊1000 · (123 · 0.6180339887 mod 1)⌋ = ⌊1000 · 0.0181806…⌋ = 18. Disadvantage: a bit slower than the division method. Advantage: the value of m is not critical. Details on the next slide.

Multiplication Method – Implementation Simple implementation for m a power of 2. Choose m = 2^p, for some integer p. Let the word size of the machine be w bits. Pick a w-bit integer 0 < s < 2^w; Knuth suggests s = (√5 – 1)·2^(w–1), i.e. A = (√5 – 1)/2. Let A = s/2^w (we need 0 < A < 1). Assume that key k fits into a single word (k takes w bits). Compute k·s = r1·2^w + r0. The integer part of kA is r1; drop it. Then h(k) is the p most significant bits of r0 (shift r0 right by w – p; think of a binary point just left of r0). Some values of A work better than others (we want to mix the bits); some values of m are easier to work with than others (we want floor/ceiling computations to be easy).
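A minimal sketch of this implementation in Python, assuming w = 64 and p = 14 (so m = 2^14); the constant s is ⌊((√5–1)/2)·2^64⌋, per Knuth's suggestion:

```python
w, p = 64, 14                        # word size w; table size m = 2**p = 16384
s = 11400714819323198485             # floor(((sqrt(5)-1)/2) * 2**64), Knuth's A

def hash_mult(k):
    r0 = (k * s) & ((1 << w) - 1)    # r0 = low-order word of the 2w-bit product k*s
    return r0 >> (w - p)             # h(k) = the p most significant bits of r0

print(hash_mult(123456789))
```

In C the mask is free on a 64-bit machine, leaving just one multiply and one shift per hash.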

Open Addressing Idea: store all n keys in the m slots of the hash table itself. (What can you say about the load factor α = n/m? It can be at most 1.) Each slot contains either a key or NIL. To search for key k: Examine slot h(k); examining a slot is known as a probe. If slot h(k) contains key k, the search is successful. If the slot contains NIL, the search is unsuccessful. There is a third possibility: slot h(k) contains a key that is not k. In that case, compute the index of some other slot, based on k and on which probe we are on, and keep probing until we either find key k or reach a slot holding NIL. Advantages: avoids pointers, so there is less code and we can dedicate all the memory to the table itself.

Open addressing Figure: on a collision h(k2) = h(k5), key k5 is placed in another slot of the table itself (here slot h(k5)+1).

Probe Sequence The sequence of slots examined during a key search constitutes a probe sequence. A probe sequence must be a permutation of the slot numbers: we examine every slot in the table if we have to, and we never examine any slot more than once. One way to think of it: extend the hash function to h : U × {0, 1, …, m–1} → {0, 1, …, m–1}, where the second argument is the probe number and the result is a slot number.

Operations: Search & Insert

Hash-Search(T, k)
1. i ← 0
2. repeat j ← h(k, i)
3.     if T[j] = k
4.         then return j
5.     i ← i + 1
6. until T[j] = NIL or i = m
7. return NIL

Hash-Insert(T, k)
1. i ← 0
2. repeat j ← h(k, i)
3.     if T[j] = NIL
4.         then T[j] ← k
5.              return j
6.     else i ← i + 1
7. until i = m
8. error “hash table overflow”

Search looks for key k. Insert first searches for a free slot, then inserts (line 4).

Deletion We cannot simply set the slot containing the key we want to delete to NIL. Why? (A NIL there would end later searches too early for keys whose probe sequences passed through that slot.) Use a special value DELETED instead of NIL when marking a slot as empty during deletion. Search should treat DELETED as though the slot holds a key that does not match the one being searched for. Insert should treat DELETED as though the slot were empty, so that it can be reused. Disadvantage: search time no longer depends only on the load factor α. Hence, chaining is more common when keys have to be deleted. A sketch combining these rules appears below.
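A sketch of open addressing in Python that combines the pseudocode above with the DELETED convention; linear probing serves as the probe sequence h(k, i), and all names are illustrative:

```python
NIL, DELETED = None, object()              # sentinels: never-used vs. deleted slot

class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.T = [NIL] * m

    def _h(self, k, i):
        return (hash(k) + i) % self.m      # linear probing as the probe sequence

    def insert(self, k):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] is NIL or self.T[j] is DELETED:   # reuse deleted slots
                self.T[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] is NIL:           # a never-used slot ends the probe sequence
                return None
            if self.T[j] == k:             # DELETED never compares equal to a key
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.T[j] = DELETED            # mark rather than empty: later searches stay correct
```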

Computing Probe Sequences The ideal situation is uniform hashing, a generalization of simple uniform hashing: each key is equally likely to have any of the m! permutations of 0, 1,…, m–1 as its probe sequence. True uniform hashing is hard to implement; we approximate it with techniques that guarantee to probe a permutation of [0…m–1], even if they don't produce all m! probe sequences: Linear probing. Quadratic probing. Double hashing.

Linear Probing and Quadratic Probing Linear probing: h(k, i) = (h(k,0) + i) mod m. The initial probe determines the entire probe sequence. Suffers from primary clustering: long runs of occupied slots build up, and long runs tend to get longer, since an empty slot preceded by i full slots gets filled next with probability (i+1)/m. Quadratic probing: h(k, i) = (h(k) + c1·i + c2·i²) mod m, for constants c1, c2 and an auxiliary hash function h. Can suffer from secondary clustering: two keys with the same initial probe position have identical probe sequences.

Open addressing with linear probing Figure: k5 collides with k2 at h(k2) = h(k5) and is placed in the next slot, h(k5)+1.

Double Hashing h(k, i) = (h1(k) + i·h2(k)) mod m, with two auxiliary hash functions: h1 gives the initial probe, and h2 gives the step between the remaining probes. We must have h2(k) relatively prime to m, so that the probe sequence is a full permutation of 0, 1,…, m–1. Either choose m to be a power of 2 and have h2(k) always return an odd number, or let m be prime and keep 1 < h2(k) < m. This gives Θ(m²) different probe sequences, one for each possible combination of h1(k) and h2(k), which is close to ideal uniform hashing.
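A sketch of the power-of-2 variant in Python, where forcing the low bit of h2(k) makes it odd and hence relatively prime to m (using hash(str(k)) as the second hash is purely an illustrative stand-in):

```python
def make_double_hash(m):                # assumes m is a power of 2
    def h(k, i):
        h1 = hash(k) % m                # initial probe position
        h2 = (hash(str(k)) % m) | 1     # forcing the low bit makes h2(k) odd, so it is
        return (h1 + i * h2) % m        # relatively prime to m and the probes cover all slots
    return h

h = make_double_hash(16)
print(sorted(h(42, i) for i in range(16)))   # a permutation of 0..15
```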

Open addressing with double hashing Figure: k5 collides with k2 at h1(k2) = h1(k5); subsequent probes go to h1(k5) + h2(k5), h1(k5) + 2·h2(k5), … until a free slot is found.

Analysis of Open-address Hashing Analysis is in terms of the load factor α = n/m. Assumptions: The table never completely fills, so n < m and α < 1. Uniform hashing (all probe sequences equally likely). No deletion. In a successful search, each key is equally likely to be searched for.

Expected cost of an unsuccessful search Theorem: Under the uniform hashing assumption, the expected number of probes in an unsuccessful search in an open-address hash table is at most 1/(1–α). Proof: Let P_k = IRV{the first k–1 probes hit occupied slots}. The expected number of probes is just Σ_{k≥1} Pr(P_k = 1) ≤ Σ_{k≥1} α^(k–1) = 1/(1–α). If α is a constant, search takes O(1) time. Corollary: Inserting an element into an open-address table takes at most 1/(1–α) probes on average.

Expected cost of a successful search Theorem: Under the uniform hashing assumption, the expected number of probes in a successful search in an open-address hash table is at most (1/α) ln(1/(1–α)). Proof: A successful search for a key k follows the same probe sequence as when k was inserted. Suppose that k was the (i+1)st key inserted; at that time, the load factor was i/m, so by the previous corollary the expected number of probes to insert k was at most 1/(1–i/m) = m/(m–i). Averaging over the n equally likely positions i = 0, …, n–1 for key k: (1/n) Σ_{i=0..n–1} m/(m–i) = (1/α) Σ_{j=m–n+1..m} 1/j ≤ (1/α) ∫_{m–n}^{m} dx/x = (1/α) ln(m/(m–n)) = (1/α) ln(1/(1–α)).

Analysis of Linear Probing [PPR07] Pagh, Pagh, and Ruzic showed two things: If pairs of keys hash to independent locations, but triples do not, then linear probing can have expected Ω(log n) search time. If 5-tuples hash to independent locations, then linear probing has guaranteed expected O(1) search time.

Sketch of analysis for m = 3n [PPR07] Imagine a perfect binary tree spanning the hash table A[1..m]. A node at height i has 2^i array positions under it and expects (1/3)·2^i items to hash to them; call the node dangerous if at least (2/3)·2^i items do. (We count the original location h(x) of each x, not where x actually lands, which may be h(x)+1, h(x)+2, ….) Key observation: if key q finds a run of occupied slots containing h(q) with between 2^i and 2^(i+1) items, then the ancestor of h(q) at height i–2 is dangerous, or one of its siblings in an O(1) neighborhood is. (The run is spanned by 4 to 9 nodes at height i–2; if none were dangerous, each node after the first would have more than 2^(i–2)/3 free positions in its subtree, so a few of them would absorb all excess elements and the run could not continue: a contradiction.) The expected cost of an operation is therefore Σ_i O(2^i)·Pr[h(q) lies in a run of 2^i to 2^(i+1) items], which is at most O(1) times Σ_i O(2^i)·Pr[a designated node at height i–2 is dangerous]. The rest is a balls-in-bins analysis: we want the probability that a bin of expected size μ = 2^(i–2)/3 actually receives 2μ elements. Second moments (Chebyshev, i.e. pairwise independence) bound this by O(1/μ), which is not enough, since Σ_i 2^i·O(1/2^(i–2)) = O(lg n). The 4th moment gives O(1/μ²), and Σ_i 2^i·O(1/2^(2(i–2))) = Σ_i O(2^(–i)) = O(1), so the series decays geometrically.

Universal Hashing A malicious adversary who has learned the hash function can choose keys that all map to the same slot, giving worst-case behavior. Defeat the adversary using universal hashing: Use a different random hash function each time. Ensure that the random hash function is independent of the keys that are actually going to be stored. Ensure that the random hash function is “good” by carefully designing a class of functions to choose from.

Universal Set of Hash Functions A finite collection of hash functions H, mapping a universe of keys U into {0, 1,…, m–1}, is “universal” if, for every pair of distinct keys k, l ∈ U, the number of h ∈ H with h(k) = h(l) is at most |H|/m. Key idea: use number theory to pick a large set H where choosing h ∈ H at random makes Pr{h(k) = h(l)} = 1/m. (A random h can be expected to satisfy simple uniform hashing.) With table size m, pick a prime p ≥ the largest key, and define a set of hash functions, for a, b ∈ [0…p–1] with a > 0: h_{a,b}(k) = ((ak + b) mod p) mod m. Related to linear congruential random number generators (CLRS 31).

Example Set of Universal Hash Functions With table size m, pick a prime p ≥ the largest key. Define a set of hash functions, for a, b ∈ [0…p–1] with a > 0: h_{a,b}(k) = ((ak + b) mod p) mod m. Claim: H is a 2-universal family. Proof: Fix r ≠ s and calculate, for keys x ≠ y, Pr([(ax + b) = r (mod p)] AND [(ay + b) = s (mod p)]). We must have a(x–y) = (r–s) (mod p), which uniquely determines a over the field Z_p, and then b = r – ax (mod p). Thus this probability is 1/(p(p–1)). Now, the number of pairs r ≠ s with r = s (mod m) is at most p(p–1)/m, so Pr[(ax + b mod p) = (ay + b mod p) (mod m)] ≤ 1/m. QED
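Drawing a random member of this family is a few lines of Python (the function name and the choice p = 2^31 – 1 are illustrative; any prime at least as large as the largest key works):

```python
import random

def random_universal_hash(p, m):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from the family H."""
    a = random.randrange(1, p)        # a in {1, ..., p-1}
    b = random.randrange(p)           # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

p = (1 << 31) - 1                     # a prime >= the largest key (2**31 - 1 is prime)
h = random_universal_hash(p, m=100)
print(h(12345), h(67890))
```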

Chain-Hash-Search with Universal Hashing Theorem: Using chaining and universal hashing on key k: If k is not in the table T, the expected length of the list that k hashes to is ≤ α. If k is in the table T, the expected length of the list that k hashes to is ≤ 1+α. Proof: Let X_kl = IRV{h(k) = h(l)}; E[X_kl] = Pr{h(k) = h(l)} ≤ 1/m. If key k ∉ T, the expected number of pointer references is E[Σ_{l ∈ T} X_kl] ≤ n/m = α. If key k ∈ T, it is E[1 + Σ_{l ∈ T, l ≠ k} X_kl] ≤ 1 + (n–1)/m < 1 + α.

Perfect Hashing [FKS82] Figure: the set K of actual keys (k1, k4, k5, …) within the universe U is fixed and known in advance, which is the setting for perfect hashing.

Two consequences of E[X_ij] = 1/m Recall our analysis of search with chaining: let X_ij = IRV{keys i & j hash to the same slot}, so the expected total number of collisions is Σ_{i<j} E[X_ij] = n(n–1)/(2m). Consider m = n²: the expected number of collisions is then less than ½, and if the average number of collisions is < ½, then more than half the time we have no collisions! So: pick a random universal hash function and hash into a table with m = n² slots; repeat until there is no collision. Note: Thm. 11.9 in CLRS; uses the Markov inequality in its proof.

Two consequences, continued Consider m = n: we can show that the squared list sizes add up to O(n) in expectation. Let Z_{i,k} = IRV{key i hashes to slot k} and X_ij = IRV{keys i & j hash to the same slot}, and write n_k = Σ_i Z_{i,k} for the size of list k. Then Σ_k n_k² = Σ_k (Σ_i Z_{i,k})² = Σ_i Σ_j X_ij (taking X_ii = 1), so E[Σ_k n_k²] = n + 2·(n(n–1)/2)/m = n + (n–1) < 2n = O(n). Note: Thm. 11.10 in CLRS.

Perfect Hashing If you know the n keys in advance, you can make a hash table with O(n) size and worst-case O(1) lookup time: just use two levels of hashing. A first-level table of size n maps key k to a slot j; a second-level table of size n_j² (where n_j is the number of keys landing in slot j) then resolves that slot without collisions. Dynamic versions have been created, but they are usually less practical than other hash methods. Key idea: exploit both ends of the space/#collisions tradeoff. A sketch of the construction follows.
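A runnable sketch of the two-level construction, redefining the universal-family helper locally for self-containment; the 4n retry threshold is a Markov-style bound assumed for this sketch, and distinct integer keys below p are assumed:

```python
import random

def random_hash(p, m):                        # h_{a,b} from the universal family
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys, p=(1 << 31) - 1):
    """Two-level FKS sketch: a level-1 table of size n, level-2 tables of size n_j**2."""
    n = len(keys)
    while True:                               # E[sum of n_j^2] < 2n, so few retries expected
        h1 = random_hash(p, n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) < 4 * n:
            break
    level2 = []
    for b in buckets:
        mj = len(b) ** 2                      # quadratic size: collision-free w.p. > 1/2
        while True:
            h2 = random_hash(p, mj) if mj else None
            slots = [None] * mj
            for k in b:
                j = h2(k)
                if slots[j] is not None:      # collision: redraw h2 for this bucket
                    break
                slots[j] = k
            else:
                break                         # every key placed without collision
        level2.append((h2, slots))
    return h1, level2

def lookup(tables, k):                        # worst-case O(1): two hash evaluations
    h1, level2 = tables
    h2, slots = level2[h1(k)]
    return h2 is not None and slots[h2(k)] == k

keys = random.sample(range(1, 1 << 30), 100)  # distinct integer keys below p
tables = build_perfect(keys)
assert all(lookup(tables, k) for k in keys)
```

Total space is O(n + Σ n_j²) = O(n), matching the analysis on the preceding slides.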