Dictionaries and Their Implementations

Slides:



Advertisements
Similar presentations
Chapter 11. Hash Tables.
Advertisements

Analysis of Algorithms CS 477/677
Hash Tables.
Hashing.
Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
CSCE 3400 Data Structures & Algorithm Analysis
Data Structures Using C++ 2E
Hashing as a Dictionary Implementation
CS202 - Fundamental Structures of Computer Science II
CS 253: Algorithms Chapter 11 Hashing Credit: Dr. George Bebis.
Dictionaries and Their Implementations Chapter 18 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013.
Hashing Techniques.
Hashing CS 3358 Data Structures.
1 Hashing (Walls & Mirrors - end of Chapter 12). 2 I hate quotations. Tell me what you know. – Ralph Waldo Emerson.
1 Chapter 9 Maps and Dictionaries. 2 A basic problem We have to store some records and perform the following: add new record add new record delete record.
© 2006 Pearson Addison-Wesley. All rights reserved13 A-1 Chapter 13 Hash Tables.
Sets and Maps Chapter 9. Chapter 9: Sets and Maps2 Chapter Objectives To understand the Java Map and Set interfaces and how to use them To learn about.
Hashing COMP171 Fall Hashing 2 Hash table * Support the following operations n Find n Insert n Delete. (deletions may be unnecessary in some applications)
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
COMP 171 Data Structures and Algorithms Tutorial 10 Hash Tables.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (continued) Advanced Implementation of Tables.
Hashing General idea: Get a large array
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (excerpts) Advanced Implementation of Tables CS102 Sections 51 and 52 Marc Smith and.
CS 221 Analysis of Algorithms Data Structures Dictionaries, Hash Tables, Ordered Dictionary and Binary Search Trees.
Hash Table March COP 3502, UCF.
Spring 2015 Lecture 6: Hash Tables
Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.
Hashing Table Professor Sin-Min Lee Department of Computer Science.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
Algorithm Course Dr. Aref Rashad February Algorithms Course..... Dr. Aref Rashad Part: 4 Search Algorithms.
Implementing Dictionaries Many applications require a dynamic set that supports dictionary-type operations such as Insert, Delete, and Search. E.g., a.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (continued) Advanced Implementation of Tables.
1 Hash table. 2 A basic problem We have to store some records and perform the following:  add new record  delete record  search a record by key Find.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Comp 335 File Structures Hashing.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
1 HASHING Course teacher: Moona Kanwal. 2 Hashing Mathematical concept –To define any number as set of numbers in given interval –To cut down part of.
Hashing as a Dictionary Implementation Chapter 19.
David Luebke 1 11/26/2015 Hash Tables. David Luebke 2 11/26/2015 Hash Tables ● Motivation: Dictionaries ■ Set of key/value pairs ■ We care about search,
Data Structures and Algorithms Lecture (Searching) Instructor: Quratulain Date: 4 and 8 December, 2009 Faculty of Computer Science, IBA.
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
Chapter 5: Hashing Part I - Hash Tables. Hashing  What is Hashing?  Direct Access Tables  Hash Tables 2.
Chapter 10 Hashing. The search time of each algorithm depend on the number n of elements of the collection S of the data. A searching technique called.
Hashing Basis Ideas A data structure that allows insertion, deletion and search in O(1) in average. A data structure that allows insertion, deletion and.
Hash Table March COP 3502, UCF 1. Outline Hash Table: – Motivation – Direct Access Table – Hash Table Solutions for Collision Problem: – Open.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Hashing Suppose we want to search for a data item in a huge data record tables How long will it take? – It depends on the data structure – (unsorted) linked.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
Chapter 13 C Advanced Implementations of Tables – Hash Tables.
Hashing COMP171. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Dictionaries and Their Implementations Chapter 18 Data Structures and Problem Solving with C++: Walls and Mirrors, Frank Carrano, © 2012.
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
Hashing.
Data Structures Using C++ 2E
Data Abstraction & Problem Solving with C++
Slides by Steve Armstrong LeTourneau University Longview, TX
Hashing Alexandra Stefan.
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
Quadratic probing Double hashing Removal and open addressing Chaining
Dictionaries and Their Implementations
CS202 - Fundamental Structures of Computer Science II
Advanced Implementation of Tables
Advanced Implementation of Tables
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Chapter 13 Hashing © 2011 Pearson Addison-Wesley. All rights reserved.
Presentation transcript:

Dictionaries and Their Implementations CS 302 - Data Structures Mehmet H Gunes Modified from authors’ slides

Contents The ADT Dictionary Possible Implementations Selecting an Implementation Hashing Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The ADT Dictionary Recall concept of a sort key in Chapter 11 Often must search a collection of data for specific information Use a search key Applications that require value-oriented operations are frequent Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The ADT Dictionary A collection of data about certain cities Consider the need for searches through this data based on other than the name of the city A collection of data about certain cities Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

ADT Dictionary Operations Test whether dictionary is empty. Get number of items in dictionary. Insert new item into dictionary. Remove item with given search key from dictionary. Remove all items from dictionary. Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

ADT Dictionary Operations Get item with a given search key from dictionary. Test whether dictionary contains an item with given search key. Traverse items in dictionary in sorted search-key order. Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

ADT Dictionary UML diagram for a class of dictionaries View interface, Listing 18-1 UML diagram for a class of dictionaries Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Possible Implementations Sorted (by search key), array-based Sorted (by search key), link-based Unsorted, array-based Unsorted, link-based Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Possible Implementations A dictionary entry Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Possible Implementations The data members for two sorted linear implementations of the ADT dictionary for the data: (a) array based; (b) link based View header file for class of dictionary entries, Listing 18-2 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Possible Implementations The data members for a binary search tree implementation of the ADT dictionary for the data Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Sorted Array-Based Implementation of ADT Dictionary Consider header file for the class ArrayDictionary, Listing 18-3 Note definition of method add, Listing 18-A Bears responsibility for keeping the array items sorted Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Binary Search Tree Implementation of ADT Dictionary Dictionary class will use composition Will have a binary search tree as one of its data members Reuses the class BinarySearchTree from Chapter 16 View header file, Listing 18-4 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Selecting an Implementation Reasons for considering linear implementations Perspective, Efficiency Motivation Questions to ask What operations are needed? How often is each operation required? Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Selecting an Implementation Three Scenarios Insertion and traversal in no particular order Retrieval – consider: Is a binary search of a linked chain possible? How much more efficient is a binary search of an array than a sequential search of a linked chain? Insertion, removal, retrieval, and traversal in sorted order – add and remove must: Find the appropriate position in the dictionary. Insert into (or remove from) this position. Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Selecting an Implementation Insertion for unsorted linear implementations: (a) array based; (b) link based Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Selecting an Implementation Insertion for sorted linear implementations: (a) array based; (b) pointer based Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Selecting an Implementation The average-case order of the ADT dictionary operations for various implementations Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hashing

The Search Problem Unsorted list Sorted list Binary Search tree O(N) Sorted list O(logN) using arrays (i.e., binary search) O(N) using linked lists Binary Search tree O(logN) (i.e., balanced tree) O(N) (i.e., unbalanced tree) Can we do better than this? Direct Addressing Hashing

Direct Addressing Assumptions: Idea: Key values are distinct Each key is drawn from a universe U = {0, 1, . . . , n - 1} Idea: Store the items in an array, indexed by keys

Direct Addressing (cont’d) Direct-address table representation: An array T[0 . . . n - 1] Each slot, or position, in T corresponds to a key in U For an element x with key k, a pointer to x will be placed in location T[k] If there are no elements with key k in the set, T[k] is empty, represented by NIL Search, insert, delete in O(1) time! 22 22

Direct Addressing (cont’d) Example 1: Suppose that the are integers from 1 to 100 and that there are about 100 records. Create an array A of 100 items and stored the record whose key is equal to i in in A[i]. |K| = |U| |K|: # elements in K |U|: # elements in U

Direct Addressing (cont’d) Example 2: Suppose that the keys are 9-digit social security numbers (SSN) Although we could use the same idea, it would be very inefficient (i.e., use an array of 1 billion size to store 100 records) |K| << |U| 24 24

Hashing Binary search tree retrieval have order O(log2n) Need a different strategy to locate an item Consider a “magic box” as an address calculator Place/retrieve item from that address in an array Ideally to a unique number for each key Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hashing Address calculator Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hashing is a means used to order and access elements in a list quickly by using a function of the key value to identify its location in the list. the goal is O(1) time The function of the key value is called a hash function.

Hashing Idea: Use a function h to compute the slot for each key Store the element in slot h(k) A hash function h transforms a key into an index in a hash table T[0…m-1]: h : U → {0, 1, . . . , m - 1} We say that k hashes to slot h(k)

Hashing (cont’d) h : U → {0, 1, . . . , m - 1} hash table size: m U U (universe of keys) h(k1) h(k4) k1 K (actual keys) k4 k2 h(k2) = h(k5) k5 k3 h(k3) m - 1 h : U → {0, 1, . . . , m - 1} hash table size: m

Hashing (cont’d) Example 2: Suppose that the keys are 9-digit social security numbers (SSN)

Advantages of Hashing Reduce the range of array indices handled: m instead of |U| where m is the hash table size: T[0, …, m-1] Storage is reduced. O(1) search time (i.e., under assumptions).

Properties of Good Hash Functions Good hash function properties Easy to compute (2) Approximates a random function i.e., for every input, every output is equally likely. (3) Minimizes the chance that similar keys hash to the same slot i.e., strings such as pt and pts should hash to different slot.

Hashing Pseudocode for getItem Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hashing Pseudocode for remove Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hash Functions Possible algorithms Selecting digits Folding Modulo arithmetic Converting a character string to an integer Use ASCII values Factor the results, Horner’s rule Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Collisions Collisions occur when h(ki)=h(kj), i≠j U (universe of keys) U (universe of keys) h(k1) h(k4) k1 K (actual keys) k4 k2 h(k2) = h(k5) k5 k3 h(k3) m - 1

Collisions (cont’d) For a given set K of keys: If |K| ≤ m, collisions may or may not happen, depending on the hash function! If |K| > m, collisions will definitely happen i.e., there must be at least two keys that have the same hash value Avoiding collisions completely might not be easy.

Resolving Collisions A collision Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Resolving Collisions Approach 1: Open addressing Probe for another available location Can be done linearly, quadratically Removal requires specify state of an item Occupied, emptied, removed Clustering is a problem Double hashing can reduce clustering Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Open Addressing Idea: store the keys in the table itself No need to use linked lists anymore Basic idea: Insertion: if a slot is full, try another one, until you find an empty one. Search: follow the same probe sequence. Deletion: need to be careful! Search time depends on the length of probe sequences! e.g., insert 14 probe sequence: <1, 5, 9>

Generalize hash function notation: A hash function contains two arguments now: (i) key value, and (ii) probe number h(k,p), p=0,1,...,m-1 Probe sequence: <h(k,0), h(k,1), h(k,2), …. > Example: e.g., insert 14 Probe sequence: <h(14,0), h(14,1), h(14,2)>=<1, 5, 9>

Generalize hash function notation: Probe sequence must be a permutation of <0,1,...,m-1> There are m! possible permutations e.g., insert 14 Probe sequence: <h(14,0), h(14,1), h(14,2)>=<1, 5, 9> 42

Resolving Collisions Linear probing with h ( x ) = x mod 101 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Linear probing: Inserting a key Idea: when there is a collision, check the next available position in the table: h(k,i) = (h1(k) + i) mod m i=0,1,2,... i=0: first slot probed: h1(k) i=1: second slot probed: h1(k) + 1 i=2: third slot probed: h1(k)+2, and so on How many probe sequences can linear probing generate? m probe sequences maximum probe sequence: < h1(k), h1(k)+1 , h1(k)+2 , ....> wrap around

Linear probing: Searching for a key Given a key, generate a probe sequence using the same procedure. Three cases: (1) Position in table is occupied with an element of equal key FOUND (2) Position in table occupied with a different element  KEEP SEARCHING (3) Position in table is empty NOT FOUND m - 1 wrap around

Linear probing: Searching for a key Running time depends on the length of the probe sequences. Need to keep probe sequences short to ensure fast search. m - 1 wrap around 46 46

Linear probing: Deleting a key First, find the slot containing the key to be deleted. Can we just mark the slot as empty? It would be impossible to retrieve keys inserted after that slot was occupied! Solution “Mark” the slot with a sentinel value DELETED The deleted slot can later be used for insertion. e.g., delete 98 m - 1

Primary Clustering Problem Long chunks of occupied slots are created. As a result, some slots become more likely than others. Probe sequences increase in length.  search time increases!! initially, all slots have probability 1/m Slot b: 2/m Slot d: 4/m Slot e: 5/m

Resolving Collisions Quadratic probing with h ( x ) = x mod 101 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Quadratic probing Clustering is less serious but still a problem secondary clustering How many probe sequences can quadratic probing generate? m -- the initial position determines probe sequence 50 50

Resolving Collisions Double hashing during the insertion of 58, 14, and 91 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,... Double Hashing (1) Use one hash function to determine the first slot. (2) Use a second hash function to determine the increment for the probe sequence: h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,... Initial probe: h1(k) Second probe is offset by h2(k) mod m, so on ... Advantage: handles clustering better Disadvantage: more time consuming How many probe sequences can double hashing generate? m2

Double Hashing: Example h1(k) = k mod 13 h2(k) = 1+ (k mod 11) h(k,i) = (h1(k) + i h2(k) ) mod 13 Insert key 14: i=0: h(14,0) = h1(14) = 14 mod 13 = 1 i=1: h(14,1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5 i=2: h(14,2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9 79 69 98 72 50 1 2 3 4 5 6 7 8 9 14 10 11 12

Resolving Collisions Approach 2: Restructuring the hash table Each hash location can accommodate more than one item Each location is a “bucket” or an array itself Alternatively, design the hash table as an array of linked chains – called “separate chaining” Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Resolving Collisions Separate chaining Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Chaining How to choose the size of the hash table m? Small enough to avoid wasting space. Large enough to avoid many collisions and keep linked-lists short. Typically 1/5 or 1/10 of the total number of elements. Should we use sorted or unsorted linked lists? Unsorted Insert is fast Can easily remove the most recently inserted elements

Analysis of Hashing with Chaining: Worst Case How long does it take to search for an element with a given key? Worst case: All n keys hash to the same slot then O(n) plus time to compute the hash function T chain m - 1

Analysis of Hashing with Chaining: Average Case It depends on how well the hash function distributes the n keys among the m slots Under the following assumptions: (1) n = O(m) (2) any given element is equally likely to hash into any of the m slots i.e., simple uniform hashing property then  O(1) time plus time to compute the hash function n0 = 0 nm – 1 = 0 T n2 n3 nj nk

The Efficiency of Hashing Efficiency of hashing involves the load factor alpha (α) Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The Efficiency of Hashing Linear probing – average value for α Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The Efficiency of Hashing Quadratic probing and double hashing – efficiency for given α Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The Efficiency of Hashing Separate chaining – efficiency for given α Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The Efficiency of Hashing The relative efficiency of four collision-resolution methods Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

The Efficiency of Hashing The relative efficiency of four collision-resolution methods Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Maintaining Hashing Performance Collisions and their resolution typically cause the load factor α to increase To maintain efficiency, restrict the size of α α  0.5 for open addressing α  1.0 for separate chaining If load factor exceeds these limits Increase size of hash table Rehash with new hashing function Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

What Constitutes a Good Hash Function? Easy and fast to compute? Scatter data evenly throughout hash table? How well does it scatter random data? How well does it scatter non-random data? Note: traversal in sorted order is inefficient when using hashing Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hashing and Separate Chaining for ADT Dictionary A dictionary entry when separate chaining is used Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013

Hashing and Separate Chaining for ADT Dictionary View Listing 18-5, The class HashedEntry Note the definitions of the add and remove functions, Listing 18-B Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013