CS202 - Fundamental Structures of Computer Science II

Presentation transcript:

CS202 - Fundamental Structures of Computer Science II, lec07-hashing, February 5, 2019
HASHING
Using balanced search trees (2-3, 2-3-4, red-black, and AVL trees), we can implement the table operations (retrieval, insertion, and deletion) efficiently: O(log N).
Can we find a data structure that performs these table operations faster than balanced search trees, in O(1) on average? Yes: HASH TABLES.
A hash table consists of an array (indices 0..n-1) and an address calculator (the hash function), which maps a search key to an array index between 0 and n-1.

Hash Function – Address Calculator
[Figure: Hash Table]

Hashing
A hash function tells us where to place an item in an array called a hash table. This method is known as hashing.
A hash function maps a search key to an integer between 0 and n-1. Many different hash functions are possible, e.g. h(x) = x mod n if x is an integer.
A hash function is designed for the data type of its search keys (int, string, ...).
A collision occurs when the hash function maps more than one item to the same array location. Collisions have to be resolved by some mechanism.
A perfect hash function maps each search key to a unique location of the hash table. A perfect hash function is possible if we know all the search keys in advance.
In practice, we do not know all the search keys, so a hash function can map more than one key to the same location (a collision).

Hash Function
We can design many different hash functions, but a good hash function should be easy and fast to compute, and should place items evenly throughout the hash table.
We will consider only hash functions that operate on integers. If the key is not an integer, we first map it to an integer and then apply the hash function.
The hash table size should be prime. With a prime table size, items tend to be placed more evenly throughout the hash table, which reduces the number of collisions.
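The effect of a prime table size can be sketched with a small experiment. The keys below (multiples of 10) and the two table sizes (10 and 11) are illustrative assumptions, chosen so that the keys share a factor with the non-prime size; they are not taken from the slides.

```python
from collections import Counter

def slot_usage(keys, table_size):
    """Count how many keys land in each slot under h(x) = x mod table_size."""
    return Counter(key % table_size for key in keys)

# Illustrative keys (an assumption): 10, 20, ..., 220.
keys = [10 * i for i in range(1, 23)]

composite = slot_usage(keys, 10)   # non-prime size sharing a factor with the keys
prime = slot_usage(keys, 11)       # prime size

print("size 10:", len(composite), "slot(s) used, fullest holds", max(composite.values()))
print("size 11:", len(prime), "slot(s) used, fullest holds", max(prime.values()))
```

With size 10 every key falls into slot 0, while size 11 spreads the same keys over all 11 slots.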

Hash Functions -- Selecting Digits
If the search keys are big integers (e.g. nine-digit numbers), we can select certain digits and combine them to create the address.
h(033475678) = 37, selecting the 2nd and 5th digits (table size is 100)
h(023455678) = 25
Digit selection is not a good hash function because it does not place items evenly throughout the hash table.
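A minimal sketch of digit selection, assuming keys are padded with leading zeros to nine digits. The function name `select_digits` and its parameters are hypothetical; the third call uses a made-up key to show how keys that differ only in unselected digits collide.

```python
def select_digits(key, positions, width=9):
    """Build an address from the digits at the given 1-based positions."""
    digits = str(key).zfill(width)   # keep leading zeros, e.g. 033475678
    return int("".join(digits[p - 1] for p in positions))

print(select_digits(33475678, (2, 5)))    # key 033475678
print(select_digits(23455678, (2, 5)))    # key 023455678
print(select_digits(931279999, (2, 5)))   # made-up key with the same 2nd and 5th digits as the first
```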

Hash Functions – Folding
Folding: select all the digits and add them.
h(033475678) = 0 + 3 + 3 + 4 + 7 + 5 + 6 + 7 + 8 = 43
0 ≤ h(nine-digit search key) ≤ 81
We can also split the key into groups of digits and add the groups.
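Both folding variants can be sketched as follows. The function names and the group size of three digits are assumptions made for illustration; a group sum can exceed the table size, so in practice it would still be reduced (for example with mod) before use as an index.

```python
def fold_digits(key):
    """Add all decimal digits of the key."""
    return sum(int(d) for d in str(key))

def fold_groups(key, group_size=3, width=9):
    """Split the zero-padded key into digit groups and add the groups."""
    digits = str(key).zfill(width)
    return sum(int(digits[i:i + group_size]) for i in range(0, width, group_size))

print(fold_digits(33475678))    # digits of 033475678
print(fold_digits(999999999))   # largest possible digit sum for nine digits
print(fold_groups(33475678))    # 033 + 475 + 678
```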

Hash Functions – Modulo Arithmetic
Modulo arithmetic provides a simple and effective hash function. We will use modulo arithmetic as our hash function in the rest of the discussion.
h(x) = x mod tableSize
The table size should be prime. Some prime numbers: 7, 11, 13, ..., 101, ...
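A minimal sketch of the modulo hash function with a prime table size of 101. The two keys are illustrative assumptions, picked because they happen to map to the same location and so show a collision.

```python
TABLE_SIZE = 101   # a prime table size

def h(key, table_size=TABLE_SIZE):
    """Modulo hash: map an integer key to an index in 0..table_size-1."""
    return key % table_size

# Illustrative keys (assumptions): two different keys, one location.
print(h(4567), h(7597))
```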

Hash Functions – Converting a Character String into an Integer
If the search keys are strings, we first convert the string into an integer and then apply a hash function designed for integers to that value to compute the address.
We can use the ASCII codes of the characters in the conversion.
Consider the string “NOTE”, and assign 1 (00001) to ‘A’, ...
N is 14 (01110), O is 15 (01111), T is 20 (10100), E is 5 (00101).
Concatenate the four binary numbers to get a new binary number: 01110011111010000101 = 474,757. Then apply x mod tableSize.
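The 5-bit concatenation can be sketched with shifts. The function name `string_to_int` is hypothetical, and the sketch assumes keys made only of the letters A to Z; the table size 101 in the last line is just one of the primes listed earlier.

```python
def string_to_int(word):
    """Pack letters as 5-bit codes (A=1, ..., Z=26) into a single integer."""
    value = 0
    for ch in word.upper():
        code = ord(ch) - ord("A") + 1
        if not 1 <= code <= 26:
            raise ValueError(f"unsupported character: {ch!r}")
        value = (value << 5) | code   # append the next 5-bit group
    return value

n = string_to_int("NOTE")
print(format(n, "020b"), n)   # the 20-bit concatenation and its decimal value
print(n % 101)                # address for an assumed table size of 101
```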

Collision Resolution
There are two general approaches to collision resolution in hash tables:
Open addressing: each entry holds one item.
Chaining: each entry can hold more than one item.
    Buckets: each entry holds a certain number of items.

A Collision
[Figure: a collision; table size is 101]

Open Addressing
During an attempt to insert a new item into a table, if the hash function indicates a location in the hash table that is already occupied, we probe for some other empty (or open) location in which to place the item. The sequence of locations that we examine is called the probe sequence.
A scheme that uses this approach is said to use open addressing.
There are different open-addressing schemes: linear probing, quadratic probing, and double hashing.

Open Addressing – Linear Probing
In linear probing, we search the hash table sequentially, starting from the original hash location. If a location is occupied, we check the next location. We wrap around from the last table location to the first table location if necessary.

Linear Probing - Example
Table size is 11 (0..10). Hash function: h(x) = x mod 11
Insert keys:
20 mod 11 = 9
30 mod 11 = 8
2 mod 11 = 2
13 mod 11 = 2 → 2+1 = 3
25 mod 11 = 3 → 3+1 = 4
24 mod 11 = 2 → 2+1, 2+2, 2+3 = 5
10 mod 11 = 10
9 mod 11 = 9 → 9+1, (9+2) mod 11 = 0
Resulting table (- marks an empty location):
Index   0   1   2   3   4   5   6   7   8   9  10
Key     9   -   2  13  25  24   -   -  30  20  10
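The insertions in this example can be replayed with a short sketch. The function name `insert_linear` and the use of `None` for an empty location are assumptions made for illustration.

```python
def insert_linear(table, key):
    """Insert key using linear probing; return the location used."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size    # wrap around at the end of the table
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 11
for key in (20, 30, 2, 13, 25, 24, 10, 9):
    insert_linear(table, key)
print(table)
```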

Linear Probing – Clustering Problem
One of the problems with linear probing is that table items tend to cluster together in the hash table, so the table contains groups of consecutively occupied locations. This phenomenon is called primary clustering.
Clusters can get close to one another and merge into a larger cluster. Thus, one part of the table might be quite dense even though another part has relatively few items.
Primary clustering causes long probe sequences and therefore decreases the overall efficiency.

Open Addressing – Quadratic Probing
The primary clustering problem can be almost eliminated by using the quadratic probing scheme.
In quadratic probing, we start from the original hash location i. If a location is occupied, we check the locations i+1^2, i+2^2, i+3^2, i+4^2, ... We wrap around from the last table location to the first table location if necessary.

Quadratic Probing - Example
Table size is 11 (0..10). Hash function: h(x) = x mod 11
Insert keys:
20 mod 11 = 9
30 mod 11 = 8
2 mod 11 = 2
13 mod 11 = 2 → 2+1^2 = 3
25 mod 11 = 3 → 3+1^2 = 4
24 mod 11 = 2 → 2+1^2, 2+2^2 = 6
10 mod 11 = 10
9 mod 11 = 9 → 9+1^2, (9+2^2) mod 11, (9+3^2) mod 11 = 7
Resulting table (- marks an empty location):
Index   0   1   2   3   4   5   6   7   8   9  10
Key     -   -   2  13  25   -  24   9  30  20  10
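The same key sequence can be replayed with quadratic probing. This is a sketch under the same assumptions as before (`None` marks an empty location; the function name is hypothetical). Unlike linear probing, the quadratic probe sequence does not visit every location, so the loop can fail even when free locations remain; a known result is that with a prime table size an insertion is guaranteed to succeed as long as the table is at most half full.

```python
def insert_quadratic(table, key):
    """Insert key using quadratic probing; return the location used."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step * step) % size   # home, home+1, home+4, home+9, ...
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free location found on the probe sequence")

table = [None] * 11
for key in (20, 30, 2, 13, 25, 24, 10, 9):
    insert_quadratic(table, key)
print(table)
```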

Open Addressing – Double Hashing
Double hashing also reduces clustering.
In linear probing and quadratic probing, the probe sequences are independent of the key. Instead, we can select the increments used during probing with a second hash function.
The second hash function h2 should satisfy: h2(key) ≠ 0 and h2 ≠ h1.
We first probe the location h1(key). If the location is occupied, we probe the locations h1(key)+h2(key), h1(key)+(2*h2(key)), ...

Double Hashing - Example
Table size is 11 (0..10). Hash functions: h1(x) = x mod 11, h2(x) = 7 – (x mod 7)
Insert keys:
58 mod 11 = 3
14 mod 11 = 3 → 3+7 = 10
91 mod 11 = 3 → 3+7, (3+2*7) mod 11 = 6
Resulting table (- marks an empty location):
Index   0   1   2   3   4   5   6   7   8   9  10
Key     -   -   -  58   -   -  91   -   -   -  14
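A minimal sketch of double hashing with the two hash functions of this example. The function name `insert_double` and the use of `None` for an empty location are assumptions. Because h2 always returns a value between 1 and 7 and the table size 11 is prime, the probe sequence eventually visits every location.

```python
def h1(key):
    return key % 11

def h2(key):
    return 7 - (key % 7)   # always in 1..7, so never 0

def insert_double(table, key):
    """Insert key using double hashing; return the location used."""
    size = len(table)
    start, step = h1(key), h2(key)
    for i in range(size):
        slot = (start + i * step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 11
for key in (58, 14, 91):
    insert_double(table, key)
print(table)
```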

Open Addressing – Retrieval & Deletion
In open addressing, to find an item with a given key, we probe the same locations as during insertion until we find the desired item or reach an empty location.
Deletions in open addressing cause complications: we CANNOT simply empty the location of a deleted item, because the newly emptied location would stop a later retrieval prematurely and incorrectly indicate a failure.
Solution: each location in the hash table is in one of three states: occupied, empty, or deleted. A deleted location is treated as an occupied location during retrieval and insertion.
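The three-state idea can be sketched with linear probing and a special marker for deleted locations. All names here are hypothetical. In this sketch retrieval skips over deleted locations, while insertion is allowed to reuse them; that reuse is a common variant and an assumption of this example (treating deleted locations as occupied on insertion is also correct, but never reclaims the space). Keys are assumed to be distinct.

```python
EMPTY = None
DELETED = object()   # marker for a location whose item was removed

def probe(table, key):
    """Yield the linear probe sequence for key."""
    size = len(table)
    home = key % size
    for step in range(size):
        yield (home + step) % size

def find(table, key):
    for slot in probe(table, key):
        if table[slot] is EMPTY:
            return None                      # a truly empty location ends the search
        if table[slot] is not DELETED and table[slot] == key:
            return slot
    return None

def insert(table, key):
    for slot in probe(table, key):
        if table[slot] is EMPTY or table[slot] is DELETED:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

def delete(table, key):
    slot = find(table, key)
    if slot is not None:
        table[slot] = DELETED                # do not reset to EMPTY
    return slot

table = [EMPTY] * 11
for key in (2, 13, 24):                      # all three hash to location 2
    insert(table, key)
delete(table, 13)
print(find(table, 24))                       # still reachable past the deleted location
print(find(table, 13))
print(insert(table, 35))                     # 35 also hashes to 2 and reuses the deleted location
```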

Separate Chaining
Another way to resolve collisions is to change the structure of the hash table.
In open addressing, each location of the hash table holds only one item. Instead, we can define a hash table in which each location is itself an array, called a bucket, that stores the items which hash to this location.
Problem: what should the size of the bucket be?
A better approach is to design the hash table as an array of linked lists. This collision resolution method is known as separate chaining.
In separate chaining, each entry of the hash table is a pointer to a linked list (the chain) of the items that the hash function has mapped to that location.
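A minimal sketch of separate chaining. The class and method names are hypothetical, and each chain is a Python list standing in for the linked list described above; the hash function is assumed to be key mod table size on integer keys.

```python
class ChainedHashTable:
    """Hash table with one chain of (key, value) pairs per location."""

    def __init__(self, size=11):
        self.buckets = [[] for _ in range(size)]

    def _chain(self, key):
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: replace its value
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def find(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def remove(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]             # no special "deleted" state is needed
                return True
        return False

t = ChainedHashTable(11)
t.insert(2, "a")
t.insert(13, "b")                        # collides with key 2, joins the same chain
print(t.find(13), len(t.buckets[2]))
```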

Separate Chaining
[Figure: separate chaining]

Hashing - Analysis
An analysis of the average-case efficiency of hashing involves the load factor α, which is the ratio of the current number of items in the table to the table size.
α = (current number of items) / tableSize
The load factor measures how full a hash table is. The hash table should not be filled too much if we want good performance from hashing.
Unsuccessful searches generally require more time than successful searches.
In average-case analyses, we assume that the hash function distributes the keys uniformly over the hash table.

Linear Probing – Analysis
For linear probing, the approximate average number of comparisons (probes) that a search requires is as follows, where α is the load factor:
(1/2) [1 + 1/(1 – α)] for a successful search
(1/2) [1 + 1/(1 – α)^2] for an unsuccessful search
As the load factor increases, the number of collisions increases, causing increased search times. To maintain efficiency, it is important to prevent the hash table from filling up.

Linear Probing – Analysis -- Example
What is the average number of probes for a successful search and an unsuccessful search in this hash table? Hash function: h(x) = x mod 11
Index   0   1   2   3   4   5   6   7   8   9  10
Key     9   -   2  13  25  24   -   -  30  20  10
Successful search (locations probed for each key):
20: 9 -- 30: 8 -- 2: 2 -- 13: 2,3 -- 25: 3,4 -- 24: 2,3,4,5 -- 10: 10 -- 9: 9,10,0
Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
Unsuccessful search (locations probed from each starting location; we assume that the hash function distributes the keys uniformly):
0: 0,1 -- 1: 1 -- 2: 2,3,4,5,6 -- 3: 3,4,5,6 -- 4: 4,5,6 -- 5: 5,6 -- 6: 6 -- 7: 7 -- 8: 8,9,10,0,1 -- 9: 9,10,0,1 -- 10: 10,0,1
Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
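The two averages can be checked by counting probes directly. This is a sketch: the table literal below is the result of the linear probing example (`None` marks an empty location), and the helper names are hypothetical.

```python
from fractions import Fraction

SIZE = 11
table = [9, None, 2, 13, 25, 24, None, None, 30, 20, 10]

def probes_successful(key):
    """Probes needed to find a key that is in the table."""
    slot, count = key % SIZE, 1
    while table[slot] != key:
        slot = (slot + 1) % SIZE
        count += 1
    return count

def probes_unsuccessful(home):
    """Probes needed to reach an empty location starting from home."""
    slot, count = home, 1
    while table[slot] is not None:
        slot = (slot + 1) % SIZE
        count += 1
    return count

keys = [k for k in table if k is not None]
avg_ss = Fraction(sum(probes_successful(k) for k in keys), len(keys))
avg_us = Fraction(sum(probes_unsuccessful(s) for s in range(SIZE)), SIZE)
print(avg_ss, avg_us)
```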

Quadratic Probing & Double Hashing – Analysis
For quadratic probing and double hashing, the approximate average number of comparisons (probes) that a search requires is as follows, where α is the load factor:
[–ln(1 – α)] / α for a successful search
1 / (1 – α) for an unsuccessful search
On average, both methods require fewer comparisons than linear probing.

Separate Chaining – Analysis
For separate chaining, the approximate average number of comparisons (probes) that a search requires is as follows, where α is the load factor:
1 + α/2 for a successful search
α for an unsuccessful search
Separate chaining is the most efficient collision resolution scheme, but it requires more storage: we need storage for the pointer fields.
We can easily perform deletion with the separate chaining scheme. Deletion is very difficult in open addressing.

The relative efficiency of four collision-resolution methods
[Figure]
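The comparison can be approximated numerically. The sketch below uses the standard textbook approximations for the average number of probes at load factor α: linear probing (1/2)[1 + 1/(1 – α)] and (1/2)[1 + 1/(1 – α)^2]; quadratic probing and double hashing –ln(1 – α)/α and 1/(1 – α); separate chaining 1 + α/2 and α (successful and unsuccessful search, respectively). The function names and the sampled load factors are assumptions made for illustration.

```python
import math

def linear_probing(a):
    """(successful, unsuccessful) average probes for linear probing."""
    return 0.5 * (1 + 1 / (1 - a)), 0.5 * (1 + 1 / (1 - a) ** 2)

def quadratic_or_double(a):
    """(successful, unsuccessful) for quadratic probing and double hashing."""
    return -math.log(1 - a) / a, 1 / (1 - a)

def separate_chaining(a):
    """(successful, unsuccessful) for separate chaining."""
    return 1 + a / 2, a

print("alpha  linear (S, U)    quad/double (S, U)  chaining (S, U)")
for a in (0.25, 0.5, 0.75, 0.9):
    ls, lu = linear_probing(a)
    qs, qu = quadratic_or_double(a)
    cs, cu = separate_chaining(a)
    print(f"{a:<6} {ls:6.2f} {lu:6.2f}    {qs:6.2f} {qu:6.2f}      {cs:6.2f} {cu:6.2f}")
```

The open-addressing columns grow sharply as α approaches 1, while the chaining columns grow only linearly.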

What Constitutes a Good Hash Function
A hash function should be easy and fast to compute.
A hash function should scatter the data evenly throughout the hash table. How well does the hash function scatter random data? How well does the hash function scatter non-random data?
Two general principles:
The hash function should use the entire key in the calculation.
If a hash function uses modulo arithmetic, the table size should be prime.

Hash Table versus Search Trees
For most operations, a hash table performs better than search trees.
However, traversing the data in a hash table in sorted order is very difficult. For operations of this kind, a hash table is not a good choice, e.g. finding all the items in a certain range.
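The range-query weakness can be sketched as follows. A Python `set` plays the role of the hash table and a sorted list stands in for the in-order view of a search tree; both stand-ins, the keys, and the function names are assumptions made for illustration.

```python
import bisect

keys = [20, 30, 2, 13, 25, 24, 10, 9]

hashed = set(keys)            # hash-based: no useful ordering of the items
sorted_keys = sorted(keys)    # stand-in for a search tree's sorted order

def range_query_hash(lo, hi):
    """Must examine every item, then sort the matches."""
    return sorted(k for k in hashed if lo <= k <= hi)

def range_query_sorted(lo, hi):
    """Locate both ends by binary search and slice between them."""
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]

print(range_query_hash(10, 25))
print(range_query_sorted(10, 25))
```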

Data with Multiple Organizations
Several independent data structures do not support all operations efficiently.
We may need multiple organizations of the data to get efficient implementations of all operations. One organization is used for certain operations, and the other organizations are used for the remaining operations.
