Presentation is loading. Please wait.

Presentation is loading. Please wait.

David Luebke 1 11/26/2015 Hash Tables. David Luebke 2 11/26/2015 Hash Tables ● Motivation: Dictionaries ■ Set of key/value pairs ■ We care about search,

Similar presentations


Presentation on theme: "David Luebke 1 11/26/2015 Hash Tables. David Luebke 2 11/26/2015 Hash Tables ● Motivation: Dictionaries ■ Set of key/value pairs ■ We care about search,"— Presentation transcript:

1 David Luebke 1 11/26/2015 Hash Tables

2 David Luebke 2 11/26/2015 Hash Tables ● Motivation: Dictionaries ■ Set of key/value pairs ■ We care about search, insertion, and deletion ■ We typically don’t care about sorted order ■ Example: given a student’s SSN, give their GPA.

3 David Luebke 3 11/26/2015 Hash Tables ● More formally: ■ Given a table T and a record x, with key (= symbol) and satellite data, we need to support: ○ Insert (T, x) ○ Delete (T, x) ○ Search(T, x) ■ We want these to be fast, but don’t care about sorting the records ● The structure we will use is a hash table ■ Supports all the above in O(1) expected time!

4 David Luebke 4 11/26/2015 Hashing: Keys ● In the following discussions we will consider all keys to be (possibly large) natural numbers ● How can we convert ASCII strings to natural numbers for hashing purposes?

5 David Luebke 5 11/26/2015 Direct Addressing ● Suppose: ■ The range of keys is 0..m-1 ■ Keys are distinct ● The idea: ■ Set up an array T[0..m-1] in which ○ T[i] = xif x  T and key[x] = i ○ T[i] = NULLotherwise ■ This is called a direct-address table ○ Operations take O(1) time! ○ So what’s the problem?

6 David Luebke 6 11/26/2015 The Problem With Direct Addressing ● Direct addressing works well when the range m of keys is relatively small ● But what if the keys are 32-bit integers? ■ Problem 1: direct-address table will have 2 32 entries, more than 4 billion ■ Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be ● Solution: map keys to smaller range 0..m-1 ● This mapping is called a hash function

7 David Luebke 7 11/26/2015 Example: SSN’s ● Suppose we have student records, that we want to index by the key of social security numbers. ● Direct Addressing means having an array of 1,000,000,000 entries ● What if I just have an array of 5,000,000 entries and divide each number by two before putting the record in the array? ● Dividing by two is the hash function ● What’s the problem with doing this?

8 David Luebke 8 11/26/2015 Hash Functions ● Next problem: collision T 0 m - 1 h(k 1 ) h(k 4 ) h(k 2 ) = h(k 5 ) h(k 3 ) k4k4 k2k2 k3k3 k1k1 k5k5 U (universe of keys) K (actual keys)

9 David Luebke 9 11/26/2015 Resolving Collisions ● How can we solve the problem of collisions? ● Solution 1: chaining ● Solution 2: open addressing

10 David Luebke 10 11/26/2015 Open Addressing ● Basic idea ■ To insert: if slot is full, try another slot, …, until an open slot is found (probing) ■ To search, follow same sequence of probes as would be used when inserting the element ○ If reach element with correct key, return it ○ If reach a NULL pointer, element is not in table ● Good for fixed sets (adding but no deletion)

11 David Luebke 11 11/26/2015 Open Addressing Hashing ● Approach ■ Hash table contains objects ■ Probe  examine table entry ■ Collision ○ Move K entries past current location ○ Wrap around table if necessary ■ Find location for X Examine entry at A[ key(X) ] If entry = X, found If entry = empty, X not in hash table Else increment location by K, repeat

12 David Luebke 12 11/26/2015 Open Addressing Hashing ● Approach ■ Linear probing ○ Move one slot in the table at a time ○ May form clusters of contiguous entries ■ Deletions – the tricky part ○ Find location for X ○ If X inside cluster, leave non-empty marker ■ Insertion ○ Find location for X ○ Insert if X not in hash table ○ Can insert X at first non-empty marker

13 David Luebke 13 11/26/2015 Open Addressing Example ● Hash codes ■ H(A) = 6H(C) = 6 ■ H(B) = 7H(D) = 7 ● Hash table ■ Size = 8 elements ■  = empty entry ■ * = non-empty marker ● Linear probing ■ Collision  move 1 entry past current location 1234567812345678 

14 David Luebke 14 11/26/2015 Open Addressing Example ● Operations (A=>6, B =>7, C=>6, D=>7) ■ Insert A, Insert B, Insert C, Insert D 1234567812345678 AA 1234567812345678 ABAB 1234567812345678 ABCABC 1234567812345678 DABCDABC

15 David Luebke 15 11/26/2015 Open Addressing Example ● Operations (A=>6, B =>7, C=>6, D=>7) ■ Find A, Find B, Find C, Find D 1234567812345678 1234567812345678 1234567812345678 1234567812345678 DABCDABC DABCDABC DABCDABC DABCDABC

16 David Luebke 16 11/26/2015 Open Addressing Example ● Operations (A=>6, B =>7, C=>6, D=>7) ■ Delete A, Delete C, Find D, Insert C 1234567812345678 1234567812345678 1234567812345678 1234567812345678 DCB*DCB* D*BCD*BC D*B*D*B* D*B*D*B*

17 David Luebke 17 11/26/2015 Types of Probing ● Linear: If there’s a collision, skip by a constant number of entries (we did 1 at a time) ■ Problem: creates clusters ● Quadratic probing: for k, the key, and i, the number of times I’ve probed so far, h(k,i) = (h’(k) + ci + di 2 ) mod m where c and d are constants and h’ is yet another hash function ■ Result: the more collisions I get, the more space I skip

18 David Luebke 18 11/26/2015 Types of Probing ● Double hashing h(k,i) = (h1(k) + i*h2(k)) mod m for h1 and h2 auxiliary hash functions. ■ first hash depends on h1 only ■ collisions skip different amounts ● Some of the best hash functions in practice are double hash functions

19 David Luebke 19 11/26/2015 Chaining ● Chaining puts elements that hash to the same slot in a linked list: —— T k4k4 k2k2 k3k3 k1k1 k5k5 U (universe of keys) K (actual keys) k6k6 k8k8 k7k7 k1k1 k4k4 —— k5k5 k2k2 k3k3 k8k8 k6k6 k7k7

20 David Luebke 20 11/26/2015 Chaining ● How do we insert an element? —— T k4k4 k2k2 k3k3 k1k1 k5k5 U (universe of keys) K (actual keys) k6k6 k8k8 k7k7 k1k1 k4k4 —— k5k5 k2k2 k3k3 k8k8 k6k6 k7k7

21 David Luebke 21 11/26/2015 Chaining —— T k4k4 k2k2 k3k3 k1k1 k5k5 U (universe of keys) K (actual keys) k6k6 k8k8 k7k7 k1k1 k4k4 —— k5k5 k2k2 k3k3 k8k8 k6k6 k7k7 ● How do we delete an element?

22 David Luebke 22 11/26/2015 Chaining ● How do we search for a element with a given key? —— T k4k4 k2k2 k3k3 k1k1 k5k5 U (universe of keys) K (actual keys) k6k6 k8k8 k7k7 k1k1 k4k4 —— k5k5 k2k2 k3k3 k8k8 k6k6 k7k7

23 David Luebke 23 11/26/2015 Analysis of Chaining ● Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot ● Given n keys and m slots in the table: the load factor  = n/m = average # keys per slot ● What will be the average cost of an unsuccessful search for a key?

24 David Luebke 24 11/26/2015 Analysis of Chaining ● Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot ● Given n keys and m slots in the table, the load factor  = n/m = average # keys per slot ● What will be the average cost of an unsuccessful search for a key? A: O(1+  )

25 David Luebke 25 11/26/2015 Analysis of Chaining ● Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot ● Given n keys and m slots in the table, the load factor  = n/m = average # keys per slot ● What will be the average cost of an unsuccessful search for a key? A: O(1+  ) ● What will be the average cost of a successful search?

26 David Luebke 26 11/26/2015 Analysis of Chaining ● Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot ● Given n keys and m slots in the table, the load factor  = n/m = average # keys per slot ● What will be the average cost of an unsuccessful search for a key? A: O(1+  ) ● What will be the average cost of a successful search? A: O(1 +  /2) = O(1 +  )

27 David Luebke 27 11/26/2015 Analysis of Chaining Continued ● So the cost of searching = O(1 +  ) ● If the number of keys (n) is proportional to the number of slots in the table (m), what is  ? ● A:  = O(1) ■ In other words, we can make the expected cost of searching constant if we make  constant

28 David Luebke 28 11/26/2015 Choosing A Hash Function ● Clearly choosing the hash function well is crucial ■ What will a worst-case hash function do? ■ What will be the time to search in this case? ● What are desirable features of the hash function? ■ Should distribute keys uniformly into slots ■ Should not depend on patterns in the data

29 David Luebke 29 11/26/2015 Hash Functions: Worst Case Scenario ● Scenario: ■ You are given an assignment to implement hashing ■ You will self-grade in pairs, testing and grading your partner’s implementation ■ In a blatant violation of the honor code, your partner: ○ Analyzes your hash function ○ Picks a sequence of “worst-case” keys, causing your implementation to take O(n) time to search ● What’s an honest CS student to do?

30 David Luebke 30 11/26/2015 Hash Functions: Universal Hashing ● As before, when attempting to foil an malicious adversary: randomize the algorithm ● Universal hashing: pick a hash function randomly in a way that is independent of the keys that are actually going to be stored ■ Guarantees good performance on average, no matter what keys adversary chooses

31 David Luebke 31 11/26/2015 Practice ● 11.2-2 Demonstrate the insertion of the following keys into a hash table with collisions resolved by chaining. Let the table have 9 slots and let the hash function be h(k) = k mod 9. ● Keys are: 5,28,19,15,20,33,12,17,10 ● Do the same for an open address table with linear probing.


Download ppt "David Luebke 1 11/26/2015 Hash Tables. David Luebke 2 11/26/2015 Hash Tables ● Motivation: Dictionaries ■ Set of key/value pairs ■ We care about search,"

Similar presentations


Ads by Google