1 Hashing by Adlane Habed School of Computer Science University of Windsor May 6, 2005.



2 Review

3 Review Arrays, lists, queues, stacks and trees are used to store and retrieve records. Each record has a key value: Student #: Name: Adelson-Velskii Grade: A+ Other information: avl

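4 …Review [Figure: sequential search and binary search for a key in an array (key = 13 in the binary search example), with the number of comparisons for each.]

5 …Review [Figure: retrieving key = 13 in a balanced binary search tree, with the number of comparisons.]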

6 …Review Complexity by data structure. Sorted array: search is O(log n); insert and delete are O(n). Sorted linked list: search, insert and delete are O(n). Balanced BST: search, insert and delete are O(log n).

7 Agenda What is hashing? Hash functions. Collision-resolution strategies. Analysis. Problems to think about.

8 What is hashing? 1. Basic idea 2. Definitions 3. Perfect hashing 4. Collisions 5. Open-addressing vs. Chaining

9 Basic idea A data structure that allows insertion, deletion and search in O(1) on average. A data structure that requires limited or no search in order to find a record. The location of the record is calculated from the value of its key. No order in the stored records. No findMin or findMax.

10 …Basic idea Consider records with integer key values: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Create a table of 10 cells: index of each cell in the range [0..9]. Each record is stored in the cell whose index corresponds to its key value. For example, the record with key 2 is stored in cell 2 and the record with key 8 in cell 8.

11 Definitions Hashing: the process of accessing a record, stored in a table, by mapping the value of its key to a position in the table. Hash function: a function that maps key values to table positions. Hash table: the array where the records are stored. Hash value: the value returned by the hash function. It usually corresponds to a position in the hash table.

12 Perfect hashing Hash function: H(key) = key. Examples: H(8) = 8 and H(2) = 2. [Figure: each key is mapped by the hash function to the position of its record in the hash table.]

13 …Perfect hashing Each key value maps to a different position in the table. All the keys need to be known before the table is created. Problem: what if the keys are neither contiguous nor in the range of the indices of the table? Solution: find a hash function that allows perfect hashing! Is this always possible?

14 …Perfect hashing Example: a company has 100 employees. The Social Insurance Number (SIN) is used as the key for each record. Given a 9-digit SIN, should we create a table of 1,000,000,000 cells for only 100 employees? Knowing the SINs of all 100 employees in advance does not guarantee that a perfect hash function can be found.

15 …Perfect hashing The birthday paradox: how many persons need to be together in a room in order to, "most likely", have two of them with the same date of birth (month/day)? Answer: only 23 people. Hint: calculate p, the probability that no two persons have the same date of birth.
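
The hint can be followed numerically. The sketch below is only an illustration: it assumes 365 equally likely birthdays and ignores leap years, and the function names are hypothetical, not taken from the slides.

```python
def prob_all_distinct(people, days=365):
    """Probability p that no two of `people` persons share a birthday."""
    p = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p *= (days - i) / days
    return p


def smallest_group(threshold=0.5, days=365):
    """Smallest group size for which a shared birthday is more likely than `threshold`."""
    n = 1
    while 1.0 - prob_all_distinct(n, days) <= threshold:
        n += 1
    return n


print(smallest_group())                      # 23
print(round(1 - prob_all_distinct(23), 4))   # 0.5073
```

The same reasoning applies to hashing: with 365 table positions and only 23 keys, a collision is already more likely than not under this uniform model.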

16 …Perfect hashing Hash functions that allow perfect hashing are so rare that it is worth looking for them only in special circumstances. In addition, the collection of records is often not known in advance.

17 Collisions What if we cannot find a perfect hash function? Collision: more than one key will map to the same location in the table! Can we avoid collisions? No, except in the case of perfect hashing (rare). Solution: select a "good" hash function and use a collision-resolution strategy.

18 …Collisions Example: The keys are integers and the hash function is hashValue = key mod tableSize. If tableSize = 10, all records whose keys have the same rightmost digit have the same hash value. Insert 13 and

19 Open-addressing vs. chaining Open-addressing: the records are stored directly in the table, and collisions are dealt with using collision-resolution strategies. Chaining: each cell of the hash table points to a linked list.

20 …Chaining H(key) = key mod tableSize. Insert 13, insert 23, insert 18. A collision is resolved by inserting the colliding elements in a linked list.
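
A minimal sketch of chaining, under stated assumptions: tableSize = 10 is taken from the earlier collision example, Python lists stand in for the linked lists, and the class and method names are hypothetical.

```python
class ChainedHashTable:
    """Hash table with chaining; each cell holds a list of (key, record) pairs."""

    def __init__(self, table_size=10):
        self.table_size = table_size
        self.cells = [[] for _ in range(table_size)]

    def _hash(self, key):
        # Division hash function: H(key) = key mod tableSize.
        return key % self.table_size

    def insert(self, key, record=None):
        chain = self.cells[self._hash(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, record)   # key already present: overwrite
                return
        chain.append((key, record))        # a collision simply extends the chain

    def search(self, key):
        for existing_key, record in self.cells[self._hash(key)]:
            if existing_key == key:
                return record
        raise KeyError(key)

    def delete(self, key):
        chain = self.cells[self._hash(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                del chain[i]
                return
        raise KeyError(key)


table = ChainedHashTable()
for key in (13, 23, 18):
    table.insert(key, f"record {key}")
print(table.cells[3])  # 13 and 23 collide and share the chain of cell 3
print(table.cells[8])  # 18 sits alone in cell 8
```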

21 Hash functions 1. Hash functions 2. Division 3. Digits selection 4. Mid-square 5. Folding 6. String keys

22 Hash functions Can we have a hash function that avoids collisions? Collisions are nearly unavoidable! If we are careful when selecting the hash function, the number of collisions will be small. Exception: the hash function is selected for a specific set of records (perfect hashing).

23 …Hash functions A poor hash function: maps keys non-uniformly into table locations, or maps a set of contiguous keys into clusters. An ideal hash function: maps keys uniformly and randomly onto the entire range of table locations; each location is equally likely to be used for a randomly chosen key; fast computation.

24 Hash functions: division Division: H(key) = key mod tableSize, with 0 ≤ key mod tableSize ≤ tableSize-1. Empirical studies have shown that this function gives very good results.

25 …division Assume H(key) = key mod tableSize. All keys such that key mod tableSize = 0 map into position 0 in the table. All keys such that key mod tableSize = 1 map into position 1 in the table. This phenomenon is unavoidable for positions 0 and 1: we wish to avoid this phenomenon when possible.

26 …division Assume tableSize = 25. All keys that are multiples of 5 will map into positions 0, 5, 10, 15 and 20 in the table! Why? Because key and tableSize have 5 as a common factor: there exists an integer m such that key = m×5. Therefore, key mod 25 = 5×(m mod 5) is a multiple of 5.

27 … division Choose tableSize as a prime number. Example: tableSize = 29 (a prime number): 5 mod 29 = 5, 10 mod 29 = 10, 15 mod 29 = 15, 20 mod 29 = 20, 25 mod 29 = 25, 30 mod 29 = 1, 35 mod 29 = 6, 40 mod 29 = 11…
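
The effect of a common factor can be checked with a short experiment. This is only a sketch: the set of keys (multiples of 5 below 500) and the helper name are assumptions chosen for illustration.

```python
def positions_used(keys, table_size):
    """Sorted list of table positions hit by H(key) = key mod tableSize."""
    return sorted({key % table_size for key in keys})


multiples_of_5 = range(0, 500, 5)

# tableSize = 25 shares the factor 5 with every key: only 5 cells are ever used.
print(positions_used(multiples_of_5, 25))        # [0, 5, 10, 15, 20]

# tableSize = 29 is prime: the same keys spread over all 29 cells.
print(len(positions_used(multiples_of_5, 29)))   # 29
```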

28 Hash functions: digit selection Digit(s) selection: key = $d_1 d_2 d_3 d_4 d_5 d_6 d_7 d_8 d_9$. If the collection of records is known, how to choose the digit(s)? Analysis of the occurrence of each digit.

29 Digit selection: analysis Assume 10 records are to be stored:

30 …Digit selection: analysis Assume 100 records are to be stored. [Figure: digit-occurrence charts, one showing a non-uniform distribution and one showing a uniform distribution.]

31 …Digit selection: analysis Consider the hash function $H(d_1 d_2 d_3 d_4 d_5 d_6 d_7 d_8 d_9) = d_5 d_7$. $d_5$ and $d_7$ are uniformly distributed … but $d_5 = 3$ and $d_7 = 8$ very often appear together! 38 is the only position used in the range, increasing the chances of collisions. Analysis of correlation is required.

32 Hash functions: mid-square Mid-square: consider key = $d_1 d_2 d_3 d_4 d_5$. Compute $d_1 d_2 d_3 d_4 d_5 \times d_1 d_2 d_3 d_4 d_5 = r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8 r_9 r_{10}$. Select middle digits, for example $r_4 r_5 r_6$. Why the middle digits and not the leftmost or rightmost digits?

33 Mid-square: example [Figure: a key ending in 321 multiplied by itself.] Only the digits 321 contribute to the 3 rightmost digits (041) of the multiplication result. A similar remark applies to the leftmost digits. All key digits contribute to the middle digits of the multiplication result.
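
A minimal sketch of the mid-square method for a 5-digit key. The key 54321 is a hypothetical value, chosen only because it ends in 321 like the example above; the function name and its parameters are also assumptions.

```python
def mid_square(key, product_digits=10, start=3, length=3):
    """Square the key and select middle digits r4 r5 r6 of the 10-digit result."""
    squared = str(key * key).zfill(product_digits)   # r1 r2 ... r10 as a string
    return int(squared[start:start + length])        # 0-based slice: r4 r5 r6


key = 54321
print(key * key)        # 2950771041
print(mid_square(key))  # 77, from the middle digits "077"

# The 3 rightmost digits of the product depend only on the key's last 3 digits.
print((key * key) % 1000 == (321 * 321) % 1000)   # True
```

In practice the selected digits would still be reduced to the table's index range, for example with mod tableSize.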

34 Hash functions: folding Folding: consider key = $d_1 d_2 d_3 d_4 d_5$. Combine portions of the key to form a smaller result. In general, folding is used in conjunction with other functions. Example: $H(key) = d_1 + d_2 + d_3 + d_4 + d_5 \le 45$ or $H(key) = d_1 + d_2 d_3 + d_4 d_5 \le 207$.

35 Folding: example Consider a computer with 16-bit registers, i.e. integers < $2^{16} = 65536$. Assume the 9-digit SIN is used as a key. The SIN requires folding before it is used: $d_1 + d_2 d_3 d_4 d_5 + d_6 d_7 d_8 d_9 \le 20007$.
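
The folding of a 9-digit SIN can be sketched as follows. The function name and the sample number 123456789 are hypothetical; only the grouping of digits (1 + 4 + 4) comes from the slide.

```python
def fold_sin(sin):
    """Fold a 9-digit key as d1 + d2d3d4d5 + d6d7d8d9."""
    digits = f"{sin:09d}"          # keep leading zeros so there are always 9 digits
    return int(digits[0]) + int(digits[1:5]) + int(digits[5:9])


print(fold_sin(123456789))   # 1 + 2345 + 6789 = 9135
print(fold_sin(999999999))   # 9 + 9999 + 9999 = 20007, the largest possible value
```

The folded value fits in 16 bits and can then be passed to another function, such as division by the table size.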

36 The key is a string When the key is a string, the ASCII code of each character in the string is considered. The ASCII code is an integer value in the range 0…127. String to decimal conversion: consider key = "data". hashValue = $(\text{'a'} + \text{'t'} \times 128 + \text{'a'} \times 128^2 + \text{'d'} \times 128^3) \bmod \mathrm{tableSize}$

37 …The key is a string This method generates huge numbers that the machine might not store correctly. Goal: reduce the number of arithmetic operations and generate relatively small numbers. hashValue = 'd' mod tableSize; hashValue = (hashValue×128 + 'a') mod tableSize; hashValue = (hashValue×128 + 't') mod tableSize; hashValue = (hashValue×128 + 'a') mod tableSize.
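
The step-by-step computation above is Horner's rule with a reduction after every step. The sketch below compares it with the direct polynomial; the function names and the table size 101 are assumptions for illustration. Python integers do not overflow, so the direct version still works here, whereas on fixed-width machine integers it might not.

```python
def string_hash(key, table_size, base=128):
    """Incremental hash: reduce mod tableSize after every character."""
    hash_value = 0
    for ch in key:
        hash_value = (hash_value * base + ord(ch)) % table_size
    return hash_value


def string_hash_direct(key, table_size, base=128):
    """Direct conversion: build the full base-128 number, then reduce once."""
    total = sum(ord(ch) * base ** i for i, ch in enumerate(reversed(key)))
    return total % table_size


# Both versions agree; the incremental one never holds a value above
# (tableSize - 1) * 128 + 127.
print(string_hash("data", 101) == string_hash_direct("data", 101))  # True
```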

38 Collision-resolution strategies in open addressing 1. Linear probing: The problem of clustering 2. Quadratic probing

39 Linear probing If H(key) is already occupied: search sequentially (wrapping around the table if necessary) until an empty position is found. Example: H(key) = key mod tableSize. [Figure: a sequence of insertions into the table, ending with the insertion of key 9.]

40 …Linear probing hashValue = H(key). Probe table positions (hashValue + i) mod tableSize with i = 1, 2, …, tableSize-1, until an empty position is found in the table or all positions have been checked.
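
A minimal sketch of insertion with linear probing, following the probe rule above. The function name, the use of `None` for an empty cell, and the keys 89, 18, 49, 58, 9 are assumptions made for the example.

```python
def linear_probe_insert(table, key):
    """Insert key using H(key) = key mod tableSize and linear probing.

    Returns the position used; raises if every position has been checked.
    """
    size = len(table)
    hash_value = key % size
    for i in range(size):                 # i = 0 is the home position itself
        pos = (hash_value + i) % size     # wrap around the end of the table
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("hash table is full")


table = [None] * 10
positions = [linear_probe_insert(table, key) for key in (89, 18, 49, 58, 9)]
print(positions)  # [9, 8, 0, 1, 2]
print(table)      # [49, 58, 9, None, None, None, None, None, 18, 89]
```

Keys 49, 58 and 9 all end up in consecutive cells next to 89 and 18, which is the clustering effect discussed next.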

41 Primary clustering Linear probing causes many items to be stored in a few areas, creating clusters: this is known as primary clustering. Contiguous keys are mapped into contiguous table locations. Consequence: slow search even when the table's load factor λ is small, where λ = (number of occupied locations)/tableSize.

42 Quadratic probing Collision-resolution strategy that eliminates primary clustering. It works as follows: hashValue = H(key); if table[hashValue] is occupied, probe table positions $(\mathrm{hashValue} + i^2) \bmod \mathrm{tableSize}$, $i = 1, 2, 3, \ldots$ until an empty position is found.
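
The same insertion can be sketched with quadratic probing. As before, the function name, the `None` marker for empty cells and the sample keys are assumptions; the table size 10 is used only to keep the arithmetic easy to follow, not because it is a good choice.

```python
def quadratic_probe_insert(table, key):
    """Insert key using H(key) = key mod tableSize and quadratic probing."""
    size = len(table)
    hash_value = key % size
    for i in range(size):
        pos = (hash_value + i * i) % size   # offsets 0, 1, 4, 9, ...
        if table[pos] is None:
            table[pos] = key
            return pos
    # Unlike linear probing, the probe sequence may miss empty cells.
    raise OverflowError("no empty position found along the probe sequence")


table = [None] * 10
positions = [quadratic_probe_insert(table, key) for key in (89, 18, 49, 58, 9)]
print(positions)  # [9, 8, 0, 2, 3]
```

Compared with linear probing on the same keys, 58 and 9 jump past the occupied run instead of extending it.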

43 …Quadratic probing [Figure: a sequence of insertions into the table, including keys 89, 49 and 9.] Quadratic probing creates spaces between the inserted elements hashing to the same position: it eliminates primary clustering.

44 …Quadratic probing Very important result: if quadratic probing is used, tableSize is prime and the table is at least half empty, then the insertion of a new element is guaranteed and no cell is probed twice.
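
The "no cell is probed twice" part of this result can be checked empirically for small tables. The sketch below is not a proof; it only tests that the first ⌈tableSize/2⌉ probe positions are distinct, and the function names and table sizes are assumptions.

```python
def first_probes(hash_value, table_size):
    """Positions (hashValue + i^2) mod tableSize for i = 0 .. tableSize // 2."""
    count = table_size // 2 + 1
    return [(hash_value + i * i) % table_size for i in range(count)]


def probes_are_distinct(table_size):
    """True if, for every home position, the first probes never repeat a cell."""
    return all(
        len(set(first_probes(h, table_size))) == table_size // 2 + 1
        for h in range(table_size)
    )


print([probes_are_distinct(p) for p in (7, 11, 29, 101)])  # [True, True, True, True]
print(probes_are_distinct(16))  # False: 0 and 4 both give offset 0 mod 16
```

With a prime table size, more than half of the cells are reachable, so a table that is at least half empty always offers an empty cell along the probe sequence.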

45 Analysis

46 Analysis We calculate the average number of comparisons for a successful search (S) and for an unsuccessful search (U), given the load factor of the table.

47 Analysis U = unsuccessful search, S = successful search. The hash function H is uniform. Linear probing: $U = \frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$, $S = \frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$. Quadratic probing: $U = \frac{1}{1-\lambda}$, $S = -\frac{1}{\lambda}\ln(1-\lambda)$. Chaining: $U = \lambda$, $S = 1 + \frac{\lambda}{2}$.

48 Comparison [Table: U and S for linear probing, quadratic probing and chaining at three load factors λ; the numeric values are not preserved in this transcript.]
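
The formulas can be evaluated directly to compare the three strategies. The load factors 0.5, 0.75 and 0.9 below are assumed values chosen for illustration, and the function names are hypothetical.

```python
import math


def linear_probing(lam):
    """(U, S) for linear probing at load factor lam."""
    return (1 + 1 / (1 - lam) ** 2) / 2, (1 + 1 / (1 - lam)) / 2


def quadratic_probing(lam):
    """(U, S) for quadratic probing at load factor lam."""
    return 1 / (1 - lam), -math.log(1 - lam) / lam


def chaining(lam):
    """(U, S) for chaining at load factor lam."""
    return lam, 1 + lam / 2


for lam in (0.5, 0.75, 0.9):   # assumed load factors
    for name, formula in (("linear", linear_probing),
                          ("quadratic", quadratic_probing),
                          ("chaining", chaining)):
        u, s = formula(lam)
        print(f"lambda={lam:<4} {name:<9} U={u:6.2f} S={s:5.2f}")
```

At λ = 0.5, for instance, linear probing gives U = 2.5 and S = 1.5, quadratic probing gives U = 2 and S ≈ 1.39, and chaining gives U = 0.5 and S = 1.25.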

49 Problems to think about

50 Proofs Proof of the birthday paradox. In quadratic probing: $\mathrm{pos}_i = (H(\mathrm{key}) + i^2) \bmod \mathrm{tableSize}$. Show that: $\mathrm{pos}_i = (\mathrm{pos}_{i-1} + 2i - 1) \bmod \mathrm{tableSize}$. What is the advantage of this result?
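
Before attempting the proof, the recurrence can be checked numerically. This is a sanity check rather than a proof; the function names and the sample values (home position 7, tableSize 29) are assumptions.

```python
def direct_positions(hash_value, table_size, count):
    """pos_i = (H(key) + i^2) mod tableSize for i = 0 .. count - 1."""
    return [(hash_value + i * i) % table_size for i in range(count)]


def incremental_positions(hash_value, table_size, count):
    """pos_i = (pos_{i-1} + 2i - 1) mod tableSize, starting from pos_0 = H(key)."""
    positions = [hash_value % table_size]
    for i in range(1, count):
        positions.append((positions[-1] + 2 * i - 1) % table_size)
    return positions


print(direct_positions(7, 29, 20) == incremental_positions(7, 29, 20))  # True
```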

51 Implementation issues Implementation of hash tables. Deletion in the case of open-addressing. How to keep a table at least half empty? Empirical evaluation of different hash functions for a particular problem. Empirical evaluation of probing strategies for a particular problem.

52 Other questions What is the relationship between the number of probes for an insertion and an unsuccessful search?