Can’t provide fast insertion/removal and fast lookup at the same time Vectors, Linked Lists, Stack, Queues, Deques 4 Data Structures - CSCI 102 Copyright.


1 Data Structure Limitations
Data Structures - CSCI 102, Copyright © William C. Cheng
Vectors, Linked Lists, Stacks, Queues, Deques: can’t provide fast insertion/removal and fast lookup at the same time.
Binary Search Trees, Heaps: provide consistently fast operations, but must maintain an internal ordering.
What if we didn’t care about the ordering of the elements at all?
How can we further improve the performance of lookup, add, and removal?

2 Lookup Tables
For operations where we only care about fast add/remove/search, not fast traversal, we create a table structure to optimize for fast lookup.
Each value in the table has a unique key.
The key is used as a short identifier to look up an entire value in the table.
Example: your student ID is used to look up your student record (e.g. name, GPA, etc.).

3 Lookup Tables
What kind of operations do we need to perform on a lookup table?
Search(key): see if a particular value identified by key is in the table.
Insert(key,value): insert a new value identified by key into the table.
Remove(key): remove the value identified by key from the table.
We don’t care as much about traversal (visiting all elements) in this scenario.

4 Sample Object
We want to keep a directory of all the students at USC and be able to look them up by their student ID.
Let’s assume ID is a unique integer.
struct Student { string name; double gpa; int id; };

5 Direct Address Table
If we can guarantee that student IDs will always range from 0 to N (e.g. 0 to 4999), we could just store them in an array with N + 1 entries:
Student data[5000];
Then when we want to grab a particular student, we know Student N is at index N:
int id = 3285; Student s = data[id];

6 Direct Address Table
[Diagram: student IDs 0, 2 and 4 index directly into slots 0, 2 and 4 of a Data array whose slots run from 0 to 4999. Each occupied slot holds a Student object (Name, GPA, ID): John Doe, 3.2, 0; Jane Doe, 2.6, 2; Some Guy, 3.7, 4.]

7 Direct Address Table
Direct Addressing: maps keys directly to the indexes in an array.
Operations are fast: O(1) worst case.
Unused array indexes need to be marked: generally use NULL.
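A minimal sketch of a direct address table, assuming the Student struct from the earlier slide. The class name DirectAddressTable and its methods are hypothetical, chosen only for illustration; a null pointer plays the role of the NULL marker for unused slots.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Hypothetical direct address table: the student ID is the array index.
// A null pointer marks an unused slot.
class DirectAddressTable {
public:
    explicit DirectAddressTable(std::size_t capacity) : slots_(capacity) {}

    // O(1): store a copy of the student at index s.id.
    void insert(const Student& s) {
        assert(inRange(s.id));
        slots_[static_cast<std::size_t>(s.id)] = std::make_unique<Student>(s);
    }

    // O(1): mark the slot as unused again.
    void remove(int id) {
        if (inRange(id)) slots_[static_cast<std::size_t>(id)].reset();
    }

    // O(1): returns nullptr when no student is stored under this ID.
    const Student* search(int id) const {
        return inRange(id) ? slots_[static_cast<std::size_t>(id)].get() : nullptr;
    }

private:
    bool inRange(int id) const {
        return id >= 0 && static_cast<std::size_t>(id) < slots_.size();
    }

    std::vector<std::unique_ptr<Student>> slots_;
};
```

Constructing the table with a capacity of 5000 covers IDs 0 to 4999, and every slot is allocated whether or not it is ever used.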

8 Direct Address Table
Direct Addressing Issues
Key Restrictions: keys must be numeric, and keys must fall into a nice, uniform range.
Array Size: if there are N possible keys, then data[] must be of size N. Our array could get HUGE. What if we’re only using a small number of keys? Tons of space is wasted.
How can we get around these limitations?

9 Hash Functions
A hash function maps key values to array indexes.
Input records all have a unique key.
The hash function maps a key to an array index.
Records are stored at data[hash(key)].
Ideally every unique key also has a unique hash(key).
Direct Addressing essentially uses a hash function that does nothing:
int directAddressHash(int studentId) { return studentId; }

10 Hash Tables
[Diagram: student IDs (keys) 0, 2 and 4 pass through the Hash Function; hash(0), hash(2) and hash(4) each select the slot of the Data array that holds the matching Student object (Name, GPA, ID): John Doe, 3.2, 0; Jane Doe, 2.6, 2; Some Guy, 3.7, 4.]

11 Hash Tables
Hash Functions: how can we avoid having to make our array gigantic to hold all possible keys?
Recall direct addressing:
int directAddressHash(int studentId) { return studentId; }
Simple solution: use modular arithmetic.
int modularHash(int studentId) { return studentId % ARRAY_SIZE; }
Size of the backing array is no longer dependent on the number of unique keys.
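A small sketch of the modular hash in compilable form. The value 101 for ARRAY_SIZE is an assumption made for illustration (the slides only name the constant), and the non-negativity check is added because % on a negative int can give a negative result in C++.

```cpp
#include <cassert>
#include <cstddef>

// Assumed table size for illustration; the slides do not give a value.
constexpr std::size_t ARRAY_SIZE = 101;

// Maps any non-negative student ID into the range [0, ARRAY_SIZE).
std::size_t modularHash(int studentId) {
    assert(studentId >= 0);  // negative operands could yield a negative remainder
    return static_cast<std::size_t>(studentId) % ARRAY_SIZE;
}
```

With this table size, IDs 53 and 3285 both map to index 53, so shrinking the array means two distinct keys can land in the same slot.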

12 Hash Functions
What makes a good hash function?
Fast: hashing is supposed to be faster than a binary search tree, so hash(key) needs to be O(1).
Deterministic: if we have a key K, then hash(K) must always give the same result.
Uniform distribution: the hash function should uniformly distribute keys across all of the available indexes in the storage array.
Making a good hash function is hard.

13 Hash Functions
Making a hash function
Map your data into the set of natural numbers N = {0, 1, 2, ...}. For strings, use things like ASCII letter codes.
Prime numbers are your friend: prime table sizes tend to yield better results.
Try to be independent of any patterns that may exist in the data. Handle variants of the same pattern, e.g. make sure "get" and "gets" hash differently.
You won’t usually have to write your own, but you should know what the default hash function does.
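One common way to turn a string into a natural number is a polynomial hash over its character codes. The sketch below is an illustration, not the hash function the course uses: the name stringHash, the multiplier 31 and the prime table size 101 used in the usage note are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative polynomial string hash: treats the character codes as digits
// of a base-31 number and reduces modulo the table size at every step.
std::size_t stringHash(const std::string& key, std::size_t tableSize) {
    assert(tableSize > 0);
    std::size_t h = 0;
    for (unsigned char c : key) {
        h = (h * 31 + c) % tableSize;
    }
    return h;
}
```

With a table size of 101, "get" hashes to 18 and "gets" to 67, so the two variants of the same pattern land in different slots.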

14 Hash Tables
Hashing Issues
Hash Tables do not maintain any ordering of their internal elements.
Creating a perfect hash function is almost impossible.
Collisions: when two distinct keys generate the same hash value, it’s called a collision: hash(K1) == hash(K2).

15 Collision Handling
Open Addressing: all data is stored directly in the hash table. No extra data structures are needed.
If we try to insert a new element and there’s a collision, keep probing the hash table until we find a vacant space.
Probing: if a collision occurs, use a deterministic algorithm to calculate the next array index to check (based on the initial hash result).

16 Open Addressing (Linear Probing)
Start with an empty Hash Table. Data: slots 0 to 4, all empty.

17 Open Addressing (Linear Probing)
Insert "John Doe" with ID = 123 (Student: Name John Doe, GPA 2.8, ID 123).

18 Open Addressing (Linear Probing)
Insert "John Doe" with ID = 123: hash(123) = 1.

19 Open Addressing (Linear Probing)
Insert "John Doe" with ID = 123: hash(123) = 1; data[1] is empty, no collision.

20 Open Addressing (Linear Probing)
Insert "John Doe" with ID = 123: hash(123) = 1; data[1] is empty, no collision; store it there.

21 Open Addressing (Linear Probing)
Hash Table contains one item. Data: slot 1 holds John Doe (2.8, 123); slots 0, 2, 3 and 4 are empty.

22 Open Addressing (Linear Probing)
Insert "Jane Doe" with ID = 202 (Student: Name Jane Doe, GPA 3.4, ID 202). Data: slot 1 holds John Doe (2.8, 123).

23 Open Addressing (Linear Probing)
Insert "Jane Doe" with ID = 202: hash(202) = 3.

24 Open Addressing (Linear Probing)
Insert "Jane Doe" with ID = 202: hash(202) = 3; data[3] is empty, no collision.

25 Open Addressing (Linear Probing)
Insert "Jane Doe" with ID = 202: hash(202) = 3; data[3] is empty, no collision; store it there.

26 Open Addressing (Linear Probing)
Hash Table contains two items. Data: slot 1 holds John Doe (2.8, 123); slot 3 holds Jane Doe (3.4, 202).

27 Open Addressing (Linear Probing)
Insert "Some Guy" with ID = 401 (Student: Name Some Guy, GPA 3.5, ID 401). Data: slot 1 holds John Doe (2.8, 123); slot 3 holds Jane Doe (3.4, 202).

28 Open Addressing (Linear Probing)
Insert "Some Guy" with ID = 401: hash(401) = 1.

29 Open Addressing (Linear Probing)
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision!

30 Open Addressing (Linear Probing)
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision! hash(401)+1 = 2.

31 Open Addressing (Linear Probing)
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision! hash(401)+1 = 2; data[2] is empty, no collision.

32 Open Addressing (Linear Probing)
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision! hash(401)+1 = 2; data[2] is empty, no collision; store it there.

33 Open Addressing (Linear Probing)
Hash Table contains three items. Data: slot 1 holds John Doe (2.8, 123); slot 2 holds Some Guy (3.5, 401); slot 3 holds Jane Doe (3.4, 202).
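The walkthrough above can be sketched as a small linear probing table. This is an illustration under stated assumptions: the class LinearProbingTable and its methods are hypothetical names, and because the slides never define hash(), the stub slideHash simply returns the values shown on the slides (1 for IDs 123 and 401, 3 for ID 202).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Hypothetical open addressing table that resolves collisions by linear probing.
class LinearProbingTable {
public:
    using HashFn = std::function<std::size_t(int)>;

    LinearProbingTable(std::size_t capacity, HashFn hash)
        : slots_(capacity), used_(capacity, false), hash_(std::move(hash)) {
        assert(capacity > 0);
    }

    // Returns the slot index that was used, or -1 if the table is full.
    int insert(const Student& s) {
        const std::size_t start = hash_(s.id) % slots_.size();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const std::size_t idx = (start + i) % slots_.size();  // next slot, wrapping around
            if (!used_[idx] || slots_[idx].id == s.id) {
                slots_[idx] = s;
                used_[idx] = true;
                return static_cast<int>(idx);
            }
        }
        return -1;
    }

    // Follows the same probe sequence; a vacant slot means the key is absent.
    const Student* search(int id) const {
        const std::size_t start = hash_(id) % slots_.size();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const std::size_t idx = (start + i) % slots_.size();
            if (!used_[idx]) return nullptr;
            if (slots_[idx].id == id) return &slots_[idx];
        }
        return nullptr;
    }

private:
    std::vector<Student> slots_;
    std::vector<bool> used_;
    HashFn hash_;
};

// Stub that reproduces the hash values shown on the slides; not a real hash function.
std::size_t slideHash(int id) { return id == 202 ? 3 : 1; }
```

Inserting John Doe (123), Jane Doe (202) and Some Guy (401) into a five-slot table built with slideHash fills slots 1, 3 and 2 in that order. Removal is left out of the sketch because simply emptying a slot would cut the probe sequence for keys stored after it, so it needs extra bookkeeping such as deletion markers.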

34 Open Addressing (Linear Probing)
What is the Big O of each of these operations?
Search(key), Insert(key,value), Remove(key): Average: O(1), Worst Case: O(N).
Operations depend on the table’s load factor ("utilization"): how big is the table, and how many slots are taken already?
load factor = (# of elements) / (size of array)
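As a quick worked instance of the load factor formula, here is a one-function sketch; the name loadFactor is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// load factor = (# of elements) / (size of array)
double loadFactor(std::size_t numElements, std::size_t arraySize) {
    assert(arraySize > 0);
    return static_cast<double>(numElements) / static_cast<double>(arraySize);
}
```

For the linear probing walkthrough, three students in a five-slot table give a load factor of 3 / 5 = 0.6. The closer this value gets to 1, the longer probe sequences tend to become.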

35 Collision Handling
Chaining: each slot in the Hash Table can now contain a list of elements instead of a single element.
When multiple items hash to the same slot, they are placed in the list at that slot.
This requires the overhead of an extra list for each slot that contains one or more elements.

36 Chaining
Hash Table contains two items: John Doe (2.8, 123) and Jane Doe (3.4, 202). Data: slots 0 to 4.

37 Chaining
Insert "Some Guy" with ID = 401 (Student: Name Some Guy, GPA 3.5, ID 401).

38 Chaining
Insert "Some Guy" with ID = 401: hash(401) = 1.

39 Chaining
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision!

40 Chaining
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision! Chaining says to add the new entry to the list at data[1].

41 Chaining
Insert "Some Guy" with ID = 401: hash(401) = 1; data[1] is non-empty, collision! Chaining says to add the new entry to the list at data[1]. Insert Some Guy in the list at data[1].

42 Chaining
Hash Table contains three items: John Doe (2.8, 123), Jane Doe (3.4, 202) and Some Guy (3.5, 401). Some Guy is now in the list at data[1].

43 Chaining
What is the Big O of each of these operations?
Search(key): Average: O(1), Worst Case: O(N).
Insert(key,value): Average: O(1), Worst Case: O(1).
Remove(key): Average: O(1), Worst Case: O(N).
Operations depend on the average length of a chain (except for insert).
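A minimal chaining sketch that matches these costs, assuming each slot holds a std::list. The class ChainedTable and its methods are hypothetical names. Insert stays O(1) in the worst case only because it pushes onto the front of the chain without checking for an existing entry, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Hypothetical chained hash table: every slot owns a list of students.
class ChainedTable {
public:
    using HashFn = std::function<std::size_t(int)>;

    ChainedTable(std::size_t capacity, HashFn hash)
        : buckets_(capacity), hash_(std::move(hash)) {
        assert(capacity > 0);
    }

    // O(1) worst case: no scan of the chain, so the caller must not
    // insert an ID that is already present.
    void insert(const Student& s) { bucket(s.id).push_front(s); }

    // O(chain length): walk the list stored at the hashed slot.
    const Student* search(int id) const {
        for (const Student& s : bucket(id)) {
            if (s.id == id) return &s;
        }
        return nullptr;
    }

    // O(chain length): returns true if an entry was removed.
    bool remove(int id) {
        std::list<Student>& chain = bucket(id);
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->id == id) {
                chain.erase(it);
                return true;
            }
        }
        return false;
    }

    std::size_t chainLength(std::size_t slot) const { return buckets_[slot].size(); }

private:
    std::list<Student>& bucket(int id) {
        return buckets_[hash_(id) % buckets_.size()];
    }
    const std::list<Student>& bucket(int id) const {
        return buckets_[hash_(id) % buckets_.size()];
    }

    std::vector<std::list<Student>> buckets_;
    HashFn hash_;
};
```

If every key hashes to the same slot, that one chain holds all N elements, which is where the O(N) worst case for search and remove comes from.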

44 Collision Handling
The Problem: if a malicious user knows what hash function you’re using, they can intentionally cause your worst-case behavior.
Universal Hashing: when the Hash Table is created, randomly choose a hash function independent of the keys that are going to be stored.
No single input gives worst-case behavior (just like randomized Quicksort).
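One well-known universal family for integer keys is h(k) = ((a*k + b) mod p) mod m, with p a prime larger than any key and a, b drawn at random when the table is created. The sketch below illustrates that family; it is not taken from the slides, and the class name UniversalHash and the choice p = 2^31 - 1 are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Prime modulus 2^31 - 1; keys must lie in [0, kPrime).
constexpr std::uint64_t kPrime = 2147483647ULL;

// Hypothetical hash functor whose coefficients are chosen at random on construction.
class UniversalHash {
public:
    UniversalHash(std::size_t tableSize, std::mt19937_64& rng) : m_(tableSize) {
        assert(tableSize > 0);
        a_ = std::uniform_int_distribution<std::uint64_t>(1, kPrime - 1)(rng);
        b_ = std::uniform_int_distribution<std::uint64_t>(0, kPrime - 1)(rng);
    }

    // Deterministic for a given object, but unpredictable without knowing a_ and b_.
    std::size_t operator()(int key) const {
        assert(key >= 0);
        const std::uint64_t k = static_cast<std::uint64_t>(key);
        assert(k < kPrime);
        // a_ * k is below 2^62, so the sum cannot overflow 64 bits.
        return static_cast<std::size_t>(((a_ * k + b_) % kPrime) % m_);
    }

private:
    std::size_t m_;
    std::uint64_t a_ = 1;
    std::uint64_t b_ = 0;
};
```

In real use the generator would be seeded from an unpredictable source such as std::random_device, so that an attacker cannot precompute a set of colliding keys.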

45 Collision Handling
Multi-Level Hashing: like chaining, but each element in the hash table holds another hash table with a different hash function.
By hashing multiple times, we can greatly decrease the odds of a collision.
Perfect Hashing: if the set of possible keys is static (never changes), we can develop a perfect multi-level hash to give O(1) worst case performance.
e.g. the reserved keywords in a programming language are a static set of keys.

46 Hash Tables
Other Notes
Hash Tables generally do provide a way for you to retrieve a list of the known keys. Just keep in mind there is no guaranteed ordering of the keys.
When these slides were written, C++ had no built-in hash table and unordered_map was still a proposal for the STL; std::unordered_map has been part of the standard library since C++11.
Google Sparse Hash provides C++ hash tables.
Boost C++ Libraries provides hash tables: http://www.boost.org/
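A short sketch of retrieving the known keys from a hash table, using std::unordered_map, which the C++11 standard added to the library. The helper names knownKeys and gpaOf are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Collects every key in the table. The iteration order of an
// unordered_map is unspecified, so the result has no guaranteed ordering.
std::vector<int> knownKeys(const std::unordered_map<int, Student>& table) {
    std::vector<int> keys;
    keys.reserve(table.size());
    for (const auto& entry : table) {
        keys.push_back(entry.first);
    }
    return keys;
}

// Average O(1) lookup; the caller must pass an ID that is in the table.
double gpaOf(const std::unordered_map<int, Student>& table, int id) {
    const auto it = table.find(id);
    assert(it != table.end());
    return it->second.gpa;
}
```

Code that needs the keys in sorted order has to sort the returned vector itself or use an ordered container such as std::map instead.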

