Hash Tables: Introduction to Algorithms. CSE 680, Prof. Roger Crawfis.


Motivation Arrays provide an indexed way to access a set. Many times we need an association between two sets, or a set of keys and associated data, and ideally we would like to access this data directly with the keys. We would like a data structure that supports fast search, insertion, and deletion; we usually do not care about sorting. The abstract data type is usually called a Dictionary, Map, or Partial Map: float googleStockPrice = stocks["GOOG"].CurrentPrice;

Dictionaries What is the best way to implement this? Linked lists? Doubly linked lists? Queues? Stacks? Multiple indexed arrays (e.g., data[key[i]])? To answer this, consider the complexity of the operations: insertion, deletion, and search.

Direct Addressing Let's look at an easy case. Suppose: the range of keys is 0..m-1 and the keys are distinct. Possible solution: set up an array T[0..m-1] in which T[i] = x if x is in T and key[x] = i, and T[i] = NULL otherwise. This is called a direct-address table. Operations take O(1) time! So what's the problem?
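The direct-address table described above fits in a few lines. This is a sketch; the class and method names are my own, not from the slides.

```python
class DirectAddressTable:
    """Direct-address table: one slot per possible key in 0..m-1."""

    def __init__(self, m):
        self.slots = [None] * m   # T[i] = value if key i is present, None otherwise

    def insert(self, key, value):   # O(1)
        self.slots[key] = value

    def delete(self, key):          # O(1)
        self.slots[key] = None

    def search(self, key):          # O(1); None means "not present"
        return self.slots[key]
```

Every operation is a single array access, which is why direct addressing is O(1); the cost is an array of size m, the size of the entire key range.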

Direct Addressing Direct addressing works well when the range m of keys is relatively small. But what if the keys are 32-bit integers? Problem 1: the direct-address table will have 2^32 entries, more than 4 billion. Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be prohibitive. Solution: map keys to a smaller range 0..p-1, with p much smaller than m (ideally proportional to the number of keys actually stored).

Hash Table Hash tables provide O(1) support for all of these operations! The key idea is that rather than indexing an array directly, we index it through some function, h(x), called a hash function: myArray[ h(key) ]. Key questions: What is the set that x comes from? What is h() and what is its range?

Hash Table Consider this problem: If I know a priori the p keys from some finite set U, is it possible to develop a function h(x) that will uniquely map the p keys onto the set of numbers 0..p-1?

Hash Functions In general this is a difficult problem, so try something simpler. [Figure: keys k1..k5 from the universe of keys U, with the actual keys K a subset of U, hashed by h into table slots 0..p-1; here h(k2) = h(k5).]

Hash Functions A collision occurs when h(x) maps two keys to the same location. [Figure: same diagram as before; the collision is h(k2) = h(k5).]

Hash Functions A hash function, h, maps keys of a given type to integers in a fixed interval [0, N-1]. Example: h(x) = x mod N is a hash function for integer keys. The integer h(x) is called the hash value of x. A hash table for a given key type consists of a hash function h and an array (called the table) of size N. The goal is to store item (k, o) at index i = h(k).

Example We design a hash table storing employee records using their social security number, SSN, as the key. The SSN is a nine-digit positive integer. Our hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x.

Example Suppose instead our hash table uses an array of size N = 100, and we have n = 49 employees. We need a method to handle collisions; as long as the chance of collision is low, we can achieve fast access. Setting N = 10,000 and looking at the last four digits reduces the chance of collision.
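The last-four-digits hash from this example is just a remainder; the function name below is my own.

```python
def ssn_hash(ssn, N=10_000):
    """Map a nine-digit SSN to its last four digits, an index in 0..N-1."""
    return ssn % N
```

For example, ssn_hash(123456789) gives 6789, the last four digits of the key.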

Collisions Can collisions be avoided? If the set of keys is static, yes: see perfect hashing (not covered). In general, no. Two primary techniques for resolving collisions: chaining, which keeps a collection at each key slot, and open addressing, which uses the next open slot if the current slot is full.

Chaining Chaining puts elements that hash to the same slot in a linked list. [Figure: keys from the universe U hash into the table; keys that collide share a slot and are chained together in a linked list.]

Chaining How do we insert an element? [Figure: same chaining diagram.]

Chaining How do we delete an element? Do we need a doubly-linked list for efficient delete? [Figure: same chaining diagram.]

Chaining How do we search for an element with a given key? [Figure: same chaining diagram.]
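One way to answer the three questions above is a small chained hash table. This is a sketch with names of my own choosing, using a Python list in place of each linked list.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m=13):
        self.table = [[] for _ in range(m)]

    def _chain(self, key):
        return self.table[hash(key) % len(self.table)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                # key already present: overwrite in place
                chain[i] = (key, value)
                return
        chain.append((key, value))      # new key: O(1) append to the chain

    def search(self, key):
        for k, v in self._chain(key):   # walk the chain for this slot
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

Insert appends to (or updates) the slot's chain; search and delete walk only that one chain, so their cost is proportional to the chain's length, not to the whole table.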

Open Addressing Basic idea: To insert, if the slot is full, try another slot, and so on, until an open slot is found (probing). To search, follow the same sequence of probes as would be used when inserting the element: if we reach an element with the correct key, return it; if we reach a NULL cell, the element is not in the table. Good for fixed sets (adding but no deletion). Example: spell checking.

Open Addressing The colliding item is placed in a different cell of the table. No dynamic memory; fixed table size. Load factor: n/N, where n is the number of items to store and N the size of the hash table. Clearly, n ≤ N, or n/N ≤ 1. To get reasonable performance, n/N < 0.5.

Probing The key question is: what should the next cell to try be? Random would be great, but we need to be able to repeat it. Three common techniques: linear probing (useful for discussion only), quadratic probing, and double hashing.

Linear Probing Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell. Each table cell inspected is referred to as a probe. Colliding items lump together, causing future collisions to produce longer sequences of probes. Example: h(x) = x mod 13; insert keys 18, 41, 22, 44, 59, 32, 31, 73 in this order.

Search with Linear Probing Consider a hash table A that uses linear probing. get(k): we start at cell h(k) and probe consecutive locations until one of the following occurs: an item with key k is found, an empty cell is found, or N cells have been unsuccessfully probed. To ensure efficiency, if k is not in the table we want to find an empty cell as soon as possible; the load factor can NOT be close to 1.

Algorithm get(k):
    i ← h(k)
    p ← 0
    repeat
        c ← A[i]
        if c = ∅
            return null
        else if c.key() = k
            return c.element()
        else
            i ← (i + 1) mod N
            p ← p + 1
    until p = N
    return null

Linear Probing Search for key = 20: h(20) = 20 mod 13 = 7; starting at rank 7, go through ranks 8, 9, …, 12, 0 and find it. Search for key = 15: h(15) = 15 mod 13 = 2; go through ranks 2, 3 and return null. Example: h(x) = x mod 13; insert keys 18, 41, 22, 44, 59, 32, 31, 73, 12, 20 in this order.
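The walkthrough above can be checked mechanically. This is a sketch with my own function names; the table size, hash function, and keys are the ones from the example.

```python
def lp_insert(table, key):
    """Insert key with linear probing; returns the slot used."""
    N = len(table)
    i = key % N                      # h(x) = x mod N
    for _ in range(N):
        if table[i] is None:
            table[i] = key
            return i
        i = (i + 1) % N              # next (circular) cell
    raise RuntimeError("table full")

def lp_search(table, key):
    """Return the slot holding key, or None once an empty cell is reached."""
    N = len(table)
    i = key % N
    for _ in range(N):
        if table[i] is None:
            return None
        if table[i] == key:
            return i
        i = (i + 1) % N
    return None

table = [None] * 13
for k in (18, 41, 22, 44, 59, 32, 31, 73, 12, 20):
    lp_insert(table, k)
```

Searching for 20 probes ranks 7 through 12 and then 0, where it is found; searching for 15 stops at the empty rank 3 and returns None.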

Updates with Linear Probing To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements. remove(k): we search for an entry with key k; if such an entry (k, o) is found, we replace it with the special item AVAILABLE and return element o. We have to modify the other methods to skip AVAILABLE cells. put(k, o): we throw an exception if the table is full; we start at cell h(k) and probe consecutive cells until either a cell i is found that is empty or stores AVAILABLE, or N cells have been unsuccessfully probed; we store entry (k, o) in cell i.
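A sketch of put/get/remove with the AVAILABLE marker, assuming integer keys and h(k) = k mod N. Like the slide's put, this version does not check whether the key already exists elsewhere in the table; all names are mine.

```python
AVAILABLE = ("AVAILABLE",)             # sentinel marking a deleted cell

def put(table, key, value):
    """Store (key, value) in the first empty or AVAILABLE cell; returns the slot."""
    N = len(table)
    i = key % N
    for _ in range(N):
        if table[i] is None or table[i] is AVAILABLE:
            table[i] = (key, value)
            return i
        i = (i + 1) % N
    raise RuntimeError("table full")

def get(table, key):
    N = len(table)
    i = key % N
    for _ in range(N):
        cell = table[i]
        if cell is None:               # truly empty: key is not present
            return None
        if cell is not AVAILABLE and cell[0] == key:
            return cell[1]
        i = (i + 1) % N                # skip AVAILABLE cells and mismatches
    return None

def remove(table, key):
    N = len(table)
    i = key % N
    for _ in range(N):
        cell = table[i]
        if cell is None:
            return None
        if cell is not AVAILABLE and cell[0] == key:
            table[i] = AVAILABLE       # mark deleted; keeps probe chains intact
            return cell[1]
        i = (i + 1) % N
    return None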

Quadratic Probing Primary clustering occurs with linear probing because probes follow the same linear pattern: if a bin is inside a cluster, then the next bin probed must either also be in that cluster or expand the cluster. Instead of searching forward in a linear fashion, quadratic probing tries to jump far enough to get out of the current (unknown) cluster.

Quadratic Probing Suppose that an element should appear in bin h. If bin h is occupied, then check the following sequence of bins: h + 1^2, h + 2^2, h + 3^2, h + 4^2, h + 5^2, ..., that is, h + 1, h + 4, h + 9, h + 16, h + 25, ... (all taken mod M). [Figure: example probe sequence with M = 17.]

Quadratic Probing If one of the bins h + i^2 falls into a cluster, this does not imply the next one will.

Quadratic Probing For example, suppose an element was to be inserted in bin 23 in a hash table with M = 31 bins. The sequence in which the bins would be checked is: 23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0.

Quadratic Probing Even if two bins are initially close, the sequences in which subsequent bins are checked differ greatly. Again with M = 31 bins, compare the first 16 bins checked starting from bins 22 and 23: 22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30 versus 23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0.
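The two sequences above are easy to reproduce; the function name is my own.

```python
def quadratic_probes(h, M, count):
    """First `count` bins checked by quadratic probing from home bin h: (h + i^2) mod M."""
    return [(h + i * i) % M for i in range(count)]
```

quadratic_probes(23, 31, 16) and quadratic_probes(22, 31, 16) reproduce the two sequences from the slide, showing how quickly the sequences diverge even from adjacent home bins.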

Quadratic Probing Thus, quadratic probing solves the problem of primary clustering. Unfortunately, there is a second problem which must be dealt with. Suppose we have M = 8 bins: 1^2 = 1, 2^2 = 4, 3^2 = 9 ≡ 1 (mod 8). In this case, we are checking bin h + 1 twice having checked only one other bin.

Quadratic Probing Unfortunately, there is no guarantee that h + i^2 mod M will cycle through 0, 1, ..., M - 1. Solution: require that M be prime. In this case, h + i^2 mod M for i = 0, ..., (M - 1)/2 will cycle through exactly (M + 1)/2 values before repeating.

Quadratic Probing Example with M = 11: 0, 1, 4, 9, 16 ≡ 5, 25 ≡ 3, 36 ≡ 3 (repetition begins). With M = 13: 0, 1, 4, 9, 16 ≡ 3, 25 ≡ 12, 36 ≡ 10, 49 ≡ 10. With M = 17: 0, 1, 4, 9, 16, 25 ≡ 8, 36 ≡ 2, 49 ≡ 15, 64 ≡ 13, 81 ≡ 13.
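The (M + 1)/2 claim from the previous slide can be checked directly for these three primes; the function name is my own.

```python
def quadratic_residues(M):
    """Distinct values of i^2 mod M, i.e. the distinct probe offsets from the home bin."""
    return {(i * i) % M for i in range(M)}
```

For M = 11 the distinct offsets are {0, 1, 3, 4, 5, 9}: exactly (11 + 1)/2 = 6 values, matching the example above.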

Quadratic Probing Thus, quadratic probing avoids primary clustering. Unfortunately, we are not guaranteed that we will use all the bins. In reality, if the hash function is reasonable, this is not a significant problem until the load factor approaches 1.

Secondary Clustering The phenomenon of primary clustering will not occur with quadratic probing However, if multiple items all hash to the same initial bin, the same sequence of numbers will be followed This is termed secondary clustering The effect is less significant than that of primary clustering

Double Hashing Use two hash functions. If M is prime, we will eventually examine every position in the table:

double_hash_insert(K)
    if (table is full) error
    probe = h1(K)
    offset = h2(K)
    while (table[probe] occupied)
        probe = (probe + offset) mod M
    table[probe] = K

Double Hashing Many of the same (dis)advantages as linear probing, but it distributes keys more uniformly than linear probing does. Notes: h2(x) should never return zero, and M should be prime.

Double Hashing Example h1(K) = K mod 13, h2(K) = 8 - (K mod 8). We want h2 to be an offset to add; this h2 returns values in 1..8, so it is never zero.
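The double_hash_insert pseudocode with this example's h1 and h2 can be sketched as follows; the keys used below are borrowed from the earlier linear-probing example.

```python
def h1(k):
    return k % 13

def h2(k):
    return 8 - (k % 8)               # offset in 1..8, never zero

def double_hash_insert(table, key):
    """Insert key; on collision, step by h2(key) (mod M) until a free cell is found."""
    M = len(table)
    probe = h1(key)
    offset = h2(key)
    for _ in range(M):
        if table[probe] is None:
            table[probe] = key
            return probe
        probe = (probe + offset) % M
    raise RuntimeError("table full")

table = [None] * 13
for k in (18, 41, 22, 44):
    double_hash_insert(table, k)
```

Key 44 collides at slot 5 (taken by 18); its offset is h2(44) = 4, so it probes slot 9 (taken by 22) and then slot 0, where it lands. Since M = 13 is prime, any nonzero offset eventually visits every slot.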

Open Addressing Summary In general, the hash function now takes two arguments: the key value and the probe number: h(k, p), p = 0, 1, ..., m-1. The probe sequence should be a permutation of <0, 1, ..., m-1>. There are m! possible permutations; good hash functions should be able to produce all m! probe sequences.

Open Addressing Summary None of the methods discussed can generate more than m^2 different probe sequences. Linear probing: clearly, only m probe sequences. Quadratic probing: the initial slot determines a fixed probe sequence, so only m distinct probe sequences. Double hashing: each possible pair (h1(k), h2(k)) yields a distinct probe sequence, so m^2 probe sequences.

Choosing A Hash Function Clearly choosing the hash function well is crucial. What will a worst-case hash function do? What will be the time to search in this case? What are desirable features of the hash function? Should distribute keys uniformly into slots Should not depend on patterns in the data

From Keys to Indices A hash function is usually the composition of two maps: a hash code map (key → integer) and a compression map (integer → [0, N-1]). An essential requirement of the hash function is to map equal keys to equal indices. A good hash function minimizes the probability of collisions.

Java Hash Java provides a hashCode() method for the Object class, which typically returns a value derived from the 32-bit memory address of the object. Note that this is NOT the final hash key or hash function: it maps data to the universe, U, of 32-bit integers, and there is still a hash function inside the HashTable. Unfortunately, that function is x mod N, and N is usually a power of 2. The hashCode() method should be suitably redefined for your own classes. If your dictionary keys are Integers in Java (or probably .NET), the default hash function is poor. At a minimum, set the initial capacity to a large prime (but Java will reset this too!). Note, we have access to the Java source, so we can determine this; my guess is that it is just as bad in .NET, but we cannot look at the source.

Popular Hash-Code Maps Integer cast: for numeric types with 32 bits or less, we can reinterpret the bits of the number as an int. Component sum: for numeric types with more than 32 bits (e.g., long and double), we can add the 32-bit components. We need to do this to avoid longs that differ only in their high-order bits all hashing to the same 32-bit integer.

Popular Hash-Code Maps Polynomial accumulation: for strings of a natural language, combine the character values (ASCII or Unicode) a0 a1 ... a(n-1) by viewing them as the coefficients of a polynomial: a0 + a1 x + a2 x^2 + ... + a(n-1) x^(n-1).

Popular Hash-Code Maps The polynomial is computed with Horner's rule, ignoring overflows, at a fixed value x: a0 + x (a1 + x (a2 + ... + x (a(n-2) + x a(n-1)) ... )). The choice x = 33, 37, 39, or 41 gives at most 6 collisions on a vocabulary of 50,000 English words; Java uses 31. Why is the component-sum hash code bad for strings?
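A sketch of polynomial accumulation with Horner's rule, using x = 31 as Java does; the function name and the 32-bit mask (standing in for "ignoring overflows") are my own choices.

```python
def poly_hash(s, x=31):
    """Polynomial hash code via Horner's rule, truncated to 32 bits."""
    h = 0
    for ch in s:
        h = (h * x + ord(ch)) & 0xFFFFFFFF   # ignore overflow past 32 bits
    return h
```

This also answers the slide's closing question: a component (character) sum gives anagrams such as "stop" and "pots" the same hash code, while polynomial accumulation weights each position differently and separates them.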

Random Hashing Random hashing Uses a simple random number generation technique Scatters the items randomly throughout the hash table

Popular Compression Maps Division: h(k) = |k| mod N. The choice N = 2^m is bad because not all the bits are taken into account and certain patterns in the hash codes are propagated; the table size N should be a prime number. Multiply, Add, and Divide (MAD): h(k) = |ak + b| mod N. This eliminates patterns provided a mod N ≠ 0; the same formula is used in linear congruential (pseudo) random number generators.

The Division Method h(k) = k mod m. In words: hash k into a table with m slots using the slot given by the remainder of k divided by m. What happens to elements with adjacent values of k? What happens if m is a power of 2 (say 2^p)? What if m is a power of 10? Upshot: pick table size m = a prime number not too close to a power of 2 (or 10).
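A quick illustration of the power-of-2 pitfall asked about above: with m = 2^p, only the low p bits of the key matter, so keys that differ only in their high bits all collide. The keys below are arbitrary values chosen for the demonstration.

```python
keys = [0x1234ABCD + (i << 16) for i in range(4)]   # differ only in high-order bits

slots_pow2  = {k % 256 for k in keys}   # m = 2^8: only the low 8 bits are used
slots_prime = {k % 251 for k in keys}   # m = 251 (prime): high bits influence the slot
```

All four keys land in a single slot when m = 256, while the prime modulus spreads them into four distinct slots.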

The Multiplication Method For a constant A, 0 < A < 1: h(k) = floor( m (k A - floor(k A)) ). What does the term (k A - floor(k A)) represent?

The Multiplication Method For a constant A, 0 < A < 1: h(k) = floor( m (k A - floor(k A)) ). The term k A - floor(k A) is the fractional part of k A. Choose m = 2^p, and choose A not too close to 0 or 1. Knuth: a good choice is A = (sqrt(5) - 1)/2.
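A sketch of the multiplication method with Knuth's constant; the function name and the choice m = 2^14 are mine. This floating-point version is for illustration; a production version would typically use integer fixed-point arithmetic instead.

```python
import math

A = (math.sqrt(5) - 1) / 2            # Knuth's suggestion, about 0.618

def h_mult(k, m=2**14):
    """Multiplication method: h(k) = floor(m * frac(k * A))."""
    frac = (k * A) % 1.0              # fractional part of k*A
    return int(m * frac)
```

Because only the fractional part of k*A is kept, the result always lies in 0..m-1 regardless of how large k is, and the choice of m is not critical (a power of 2 is fine here, unlike with the division method).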

Recap So, we have two possible strategies for handling collisions: chaining and open addressing. And we have several possible hash functions that try to minimize the probability of collisions. What is the algorithmic complexity?

Analysis of Chaining Assume simple uniform hashing: each key in the table is equally likely to be hashed to any slot. Given n keys and m slots in the table, the load factor α = n/m = average number of keys per slot.

Analysis of Chaining What will be the average cost of an unsuccessful search for a key? O(1 + α)

Analysis of Chaining What will be the average cost of a successful search? O(1 + α/2) = O(1 + α)

Analysis of Chaining So the cost of searching is O(1 + α). If the number of keys n is proportional to the number of slots in the table, what is α? A: α = O(1). In other words, we can make the expected cost of searching constant if we make α constant.

Analysis of Open Addressing Consider the load factor α, and assume each key is uniformly hashed. The probability that we hit an occupied cell is then α. The probability that the next probe hits an occupied cell is also α. We terminate when an unoccupied cell is hit, with probability (1 - α). From Theorem 11.6, the expected number of probes in an unsuccessful search is at most 1/(1 - α). Theorem 11.8: the expected number of probes in a successful search is at most (1/α) ln(1/(1 - α)).