Hash Tables - Motivation


Hash Tables - Motivation

Consider the problem of storing several (key, value) pairs in a data structure that supports the following operations efficiently:
- Insert(key, value)
- Delete(key, value)
- Find(key)

Data structures we have looked at so far:
- Arrays – O(1) Insert/Delete, O(N) Search
- Linked Lists – O(1) Insert/Delete, O(N) Search
- Search Trees (BST, AVL, Splay) – all operations O(log N)

Can we make Find/Insert/Delete all O(1)?

Hash Tables – Main Idea

Main idea: use the key (string or number) to index directly into an array – O(1) time to access records.

// Returns the index of the key in the table
int HashFunction(int key){
  return key % N;
} //end-HashFunction

Example (from the figure): a hash table with N = 11 holding the keys 45, 3, 16, 20, and 42.
- 45 % 11 = 1, 3 % 11 = 3, 16 % 11 = 5
- 20 % 11 = 9 and 42 % 11 = 9: keys 20 and 42 both map to the same slot, 9

This is called a collision! We need to handle collisions.

Hash Tables – String Keys

If the keys are strings (e.g. "cem", "ali", "veli", "hasan" in the figure, with N = 11), then we need to convert them to an integer index:

// Returns the index of the key in the table
int HashFunc(char key[]){
  ?????????
} //end-HashFunc

How should the body of HashFunc be filled in?

Hash Functions for String Keys

If keys are strings, we can get an integer by adding up the ASCII values of the characters in the key:

// Returns the index of the key in the table
int HashFunction(String key){
  int hashCode = 0;
  int len = key.length();
  for (int i=0; i<len; i++){
    hashCode += key.charAt(i);
  } //end-for
  return hashCode % N;
} //end-HashFunction

Problems?
1. Maps "abc" and "bac" to the same slot!
2. If all keys are 8 or fewer characters long, they map only to positions 0 through 8*127 = 1016.

We need to distribute keys evenly over the table.

Hash Functions for String Keys

Problems with adding up character values for string keys:
- If string keys are short, they will not hash to all of the hash table.
- Different character combinations hash to the same value: "abc", "bca", and "cab" all add up to 6.

Suppose keys can use any of 29 characters plus blank. A good hash function for strings: treat the characters as digits in base 30 (using "a" = 1, "b" = 2, "c" = 3, ..., "z" = 29, " " (space) = 30):
- "abc" = 1*30^2 + 2*30^1 + 3 = 900 + 60 + 3 = 963
- "bca" = 2*30^2 + 3*30^1 + 1 = 1800 + 90 + 1 = 1891
- "cab" = 3*30^2 + 1*30^1 + 2 = 2700 + 30 + 2 = 2732

We can use 32 instead of 30 and shift left by 5 bits for faster multiplication.

Hash Functions for String Keys

A good hash function for strings: treat the characters as digits in base 30. We can use 32 instead of 30 and shift left by 5 bits for faster multiplication:

// Returns the index of the key in the table
int HashFunction(String key){
  int hashCode = 0;
  int len = key.length();
  for (int i=0; i<len; i++){
    // shift left by 5 bits = multiply by 32; 'a' maps to digit 0 here
    hashCode = (hashCode << 5) + key.charAt(i) - 'a';
  } //end-for
  return hashCode % N;
} //end-HashFunction

Hash Table Size

We need to make sure that the hash table is big enough for all keys and that it helps the hash function distribute the keys evenly.
- What if TableSize is 10 and all keys end in 0? All keys would map to the same slot!
- Need to pick TableSize carefully: typically, a prime number is chosen. A sketch of one way to pick it follows.
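A minimal sketch of choosing a prime table size at least as large as a requested capacity. The helper names isPrime/nextPrime and the brute-force primality test are illustrative assumptions, not from the slides:

// Smallest prime >= n, found by trial division (fine for table sizes).
static boolean isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long) d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

static int nextPrime(int n) {
    while (!isPrime(n)) n++;
    return n;
}

// Example: nextPrime(10) returns 11, the table size used in the earlier figure.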

Properties of Good Hash Functions

- Should be efficiently computable – O(1) time
- Should hash evenly throughout the hash table
- Should utilize all slots in the table
- Should minimize collisions

Collisions and their Resolution

A collision occurs when two different keys hash to the same value.
- E.g. for TableSize = 17, the keys 18 and 35 hash to the same value: 18 mod 17 = 1 and 35 mod 17 = 1.
- We cannot store both data records in the same slot in the array!

Two different methods for collision resolution:
- Separate chaining: use a data structure (such as a linked list) to store multiple items that hash to the same slot.
- Open addressing (or probing): search for empty slots using a second function and store the item in the first empty slot that is found.

Separate Chaining

Each hash table cell holds a pointer to a linked list of records with the same hash value.
- Collision: insert the item into the linked list.
- To Find an item: compute the hash value, then do a Find on the linked list.
- A linked list supports Find/Insert/Delete within each chain.
- Could also use BSTs: O(log N) time instead of O(N). But the chains are usually small – not worth the overhead of BSTs.

Example (from the figure): Hash(k1) = i, Hash(k2) = Hash(k3) = Hash(k4) = j, and Hash(k5) = Hash(k6) = k, so slot i holds the chain [k1], slot j holds [k2, k3, k4], and slot k holds [k5, k6].
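A minimal sketch of separate chaining in Java, assuming integer keys and the key % N idea from the earlier slide; the class and method names are illustrative, not from the slides:

import java.util.LinkedList;

class ChainedHashTable {
    private final LinkedList<Integer>[] table;
    private final int N;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        N = size;
        table = new LinkedList[N];
        for (int i = 0; i < N; i++) table[i] = new LinkedList<>();
    }

    private int hash(int key) { return ((key % N) + N) % N; }   // non-negative slot index

    void insert(int key) { table[hash(key)].add(key); }         // O(1)

    boolean find(int key) { return table[hash(key)].contains(key); }   // O(chain length)

    void delete(int key) { table[hash(key)].remove((Integer) key); }   // O(chain length)
}

With a good hash function the chains stay short, so Find and Delete are close to O(1) on average.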

Load Factor of a Hash Table

Let N = number of items to be stored. The load factor is LF = N/TableSize.
- Suppose TableSize = 2 and the number of items N = 10: LF = 5.
- Suppose TableSize = 10 and the number of items N = 2: LF = 0.2.

With separate chaining:
- Average length of a chain = LF.
- Average time for accessing an item = O(1) + O(LF).
- We want LF to be close to 1 (i.e. TableSize ~ N), but chaining continues to work for LF > 1.

Collision Resolution by Open Addressing

Linked lists can take up a lot of space...

Open addressing (or probing): when a collision occurs, try alternative cells in the array until an empty cell is found.
- Given an item X, try cells h0(X), h1(X), h2(X), ..., hi(X)
- hi(X) = (Hash(X) + F(i)) mod TableSize, with F(0) = 0

F is the collision resolution function. Three possibilities:
- Linear: F(i) = i
- Quadratic: F(i) = i^2
- Double hashing: F(i) = i*Hash2(X)

Open Addressing I: Linear Probing

Main idea: when a collision occurs, scan down the array one cell at a time looking for an empty cell.
- hi(X) = (Hash(X) + i) mod TableSize (i = 0, 1, 2, ...)
- Compute the hash value and keep incrementing until a free cell is found.

In-class example: insert {18, 19, 20, 29, 30, 31} into an empty hash table with TableSize = 10 using (a) separate chaining and (b) linear probing. A sketch of (b) follows.
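A minimal sketch of linear probing insertion for part (b) of the in-class example, assuming Hash(X) = X mod TableSize; the class and method names are illustrative:

class LinearProbingTable {
    private final Integer[] table;          // null means the slot is empty

    LinearProbingTable(int size) { table = new Integer[size]; }

    void insert(int key) {
        int h = key % table.length;
        for (int i = 0; i < table.length; i++) {            // probe h, h+1, h+2, ...
            int slot = (h + i) % table.length;
            if (table[slot] == null) { table[slot] = key; return; }
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(10);
        for (int key : new int[]{18, 19, 20, 29, 30, 31}) t.insert(key);
        // Expected: slots 0-3 hold 20, 29, 30, 31; slots 8-9 hold 18, 19; slots 4-7 stay null.
        for (int i = 0; i < 10; i++) System.out.println(i + ": " + t.table[i]);
    }
}

Note how 29, 30, and 31 pile up right after 20: that run of consecutive occupied slots is exactly the clustering discussed two slides below.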

Load Factor Analysis of Linear Probing

Recall: load factor LF = N/TableSize.
- Fraction of empty cells = 1 – LF
- Expected number of cells we probe before finding an empty one = 1/(1 – LF)

It can be shown that the expected number of probes for:
- Successful searches = O(1 + 1/(1 – LF))
- Insertions and unsuccessful searches = O(1 + 1/(1 – LF)^2)

Keep LF <= 0.5 to keep the number of probes small (between 1 and 5). (E.g. what happens when LF = 0.99? See the sketch below.)
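A quick numeric check of the 1/(1 – LF) estimate above; this is a rough model that ignores clustering and the constants hidden in the O() bounds:

public class ProbeEstimate {
    public static void main(String[] args) {
        double[] loadFactors = {0.25, 0.5, 0.75, 0.9, 0.99};
        for (double lf : loadFactors) {
            double probes = 1.0 / (1.0 - lf);   // expected cells probed to find an empty one
            System.out.printf("LF = %.2f -> about %.1f probes%n", lf, probes);
        }
        // LF = 0.99 gives about 100 probes: access time degrades badly as LF approaches 1.
    }
}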

Drawbacks of Linear Probing

- Works until the array is full, but as the number of items N approaches TableSize (LF ~ 1), access time approaches O(N).
- Very prone to cluster formation (as in our example). If a key hashes into a cluster, finding a free cell involves going through the entire cluster, and inserting the key at the end of the cluster makes the cluster grow: future Inserts will be even more time-consuming!
- This type of clustering is called primary clustering.
- We can have cases where the table is empty except for a few clusters, which does not satisfy the good-hash-function criterion of distributing keys uniformly.

Open Addressing II: Quadratic Probing

Main idea: spread out the search for an empty slot by incrementing by i^2 instead of i.
- hi(X) = (Hash(X) + i^2) mod TableSize (i = 0, 1, 2, ...)
- No primary clustering, but secondary clustering is possible.

Example 1: insert {18, 19, 20, 29, 30, 31} into an empty hash table with TableSize = 10.
Example 2: insert {1, 2, 5, 10, 17} with TableSize = 16 (see the sketch below).

Theorem: if TableSize is prime and LF < 0.5, quadratic probing will always find an empty slot.
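A minimal sketch of quadratic probing with a probe limit, illustrating Example 2 above: with TableSize = 16 (not prime), inserting 17 after {1, 2, 5, 10} fails even though the table is mostly empty. The class and method names are illustrative:

class QuadraticProbingTable {
    private final Integer[] table;

    QuadraticProbingTable(int size) { table = new Integer[size]; }

    boolean insert(int key) {
        int h = key % table.length;
        for (int i = 0; i < table.length; i++) {             // probe h, h+1, h+4, h+9, ...
            int slot = (h + i * i) % table.length;
            if (table[slot] == null) { table[slot] = key; return true; }
        }
        return false;                                        // gave up after TableSize probes
    }

    public static void main(String[] args) {
        QuadraticProbingTable t = new QuadraticProbingTable(16);
        for (int key : new int[]{1, 2, 5, 10}) t.insert(key);
        // 17 % 16 = 1; the probe sequence 1, 2, 5, 10, 1, 10, 5, 2, ... only revisits
        // occupied slots, so the insert reports failure despite 12 empty slots.
        System.out.println("insert(17) succeeded? " + t.insert(17));
    }
}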

Open Addressing III: Double Hashing

Idea: spread out the search for an empty slot by using a second hash function. No primary or secondary clustering.
- hi(X) = (Hash(X) + i*Hash2(X)) mod TableSize for i = 0, 1, 2, ...
- E.g. Hash2(X) = R – (X mod R), where R is a prime smaller than TableSize.

Try this example: insert {18, 19, 20, 29, 30, 31} into an empty hash table with TableSize = 10 and R = 7.

No clustering, but slower than quadratic probing due to the Hash2 computation.
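A minimal sketch of double hashing with Hash2(X) = R – (X mod R), assuming non-negative integer keys. As the slide notes, R should be a prime smaller than TableSize; in practice TableSize should also be prime so every probe step is coprime with the table size and the probe sequence visits every slot. The class name is illustrative:

class DoubleHashingTable {
    private final Integer[] table;
    private final int R;                      // a prime smaller than table.length

    DoubleHashingTable(int size, int r) { table = new Integer[size]; R = r; }

    private int hash2(int key) { return R - (key % R); }   // in 1..R, never 0, so we always move

    boolean insert(int key) {
        int h = key % table.length;
        int step = hash2(key);
        for (int i = 0; i < table.length; i++) {           // probe h, h+step, h+2*step, ...
            int slot = (h + i * step) % table.length;
            if (table[slot] == null) { table[slot] = key; return true; }
        }
        return false;
    }
}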

Lazy Deletion with Probing

We need to use lazy deletion if we use probing (why?). Think about how Find(X) would work...
- Mark array slots as "Active/Not Active".

If the table gets too full (LF ~ 1) or if many deletions have occurred:
- The running time for Find etc. gets too long, and
- Inserts may fail!

What do we do?
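A minimal sketch of lazy deletion on top of linear probing. It assumes three slot states, EMPTY/ACTIVE/DELETED: this is one way to realize the slide's "Active/Not Active" marking, with EMPTY kept separate so Find knows when it may safely stop. Names are illustrative:

class LazyDeletionTable {
    private enum State { EMPTY, ACTIVE, DELETED }

    private int[] keys;
    private State[] states;

    LazyDeletionTable(int size) {
        keys = new int[size];
        states = new State[size];
        java.util.Arrays.fill(states, State.EMPTY);
    }

    private int hash(int key) { return ((key % keys.length) + keys.length) % keys.length; }

    void insert(int key) {
        for (int i = 0; i < keys.length; i++) {
            int slot = (hash(key) + i) % keys.length;
            if (states[slot] != State.ACTIVE) {              // reuse EMPTY or DELETED slots
                keys[slot] = key;
                states[slot] = State.ACTIVE;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    boolean find(int key) {
        for (int i = 0; i < keys.length; i++) {
            int slot = (hash(key) + i) % keys.length;
            if (states[slot] == State.EMPTY) return false;   // nothing was ever inserted past here
            if (states[slot] == State.ACTIVE && keys[slot] == key) return true;
            // DELETED slots are skipped: the key may have been inserted beyond them
        }
        return false;
    }

    void delete(int key) {
        for (int i = 0; i < keys.length; i++) {
            int slot = (hash(key) + i) % keys.length;
            if (states[slot] == State.EMPTY) return;
            if (states[slot] == State.ACTIVE && keys[slot] == key) {
                states[slot] = State.DELETED;                // lazy: just mark the slot
                return;
            }
        }
    }
}

If delete() emptied the slot outright, a later Find could stop at that hole and miss keys stored further along the probe sequence; that is why probing needs lazy deletion.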

Rehashing

Rehashing: allocate a larger hash table (of size 2*TableSize) whenever LF exceeds a particular value.

How does it work?
- We cannot just copy data from the old table: the bigger table has a new hash function.
- Go through the old hash table, ignoring items marked deleted.
- Recompute the hash value for each non-deleted key and put the item in its new position in the new table.

Running time = O(N), but it happens very infrequently.
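A minimal sketch of a rehash() method, written as a continuation of the lazy-deletion sketch above (it reassigns the keys/states fields and reuses the hypothetical nextPrime helper from the table-size sketch):

private void rehash() {
    int[] oldKeys = keys;
    State[] oldStates = states;

    int newSize = nextPrime(2 * oldKeys.length);   // roughly double, rounded up to a prime
    keys = new int[newSize];
    states = new State[newSize];
    java.util.Arrays.fill(states, State.EMPTY);

    for (int i = 0; i < oldKeys.length; i++) {
        if (oldStates[i] == State.ACTIVE) {        // items marked DELETED are dropped for good
            insert(oldKeys[i]);                    // recomputes the hash for the bigger table
        }
    }
}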

Extendible Hashing

What if we have large amounts of data that can only be stored on disk and we want to find any item in 1-2 disk accesses?
- We could use B-trees, but deciding which of many branches to go to takes time.

Extendible hashing: store items according to their bit patterns.
- Hash(X) = first dL bits of X
- Each leaf contains up to M data items with dL identical leading bits.
- The root contains pointers to the sorted data items in the leaves.

Extendible Hashing – The Details

Extendible hashing stores data according to bit patterns.
- The root is known as the directory.
- M is the size of a disk block, i.e., the number of keys that can be stored within the disk block.

Example (from the figure, M = 3): Hash(X) = first 2 bits of X, so the directory has entries 00, 01, 10, 11 pointing to disk blocks holding {0000, 0010, 0011}, {0101, 0110}, {1000, 1010, 1011}, and {1110}.
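A minimal sketch of the directory lookup, assuming fixed-width 4-bit keys as in the figure; the class and function names are illustrative:

class ExtendibleDirectory {
    // Directory index = the leading directoryBits bits of a keyBits-bit key.
    static int directoryIndex(int key, int directoryBits, int keyBits) {
        return key >>> (keyBits - directoryBits);
    }

    public static void main(String[] args) {
        System.out.println(directoryIndex(0b0010, 2, 4));   // 0 -> directory entry 00
        System.out.println(directoryIndex(0b1110, 2, 4));   // 3 -> directory entry 11
    }
}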

Extendible Hashing – More Details

Insert:
- If the leaf is full, split the leaf.
- Increase the number of directory bits by one if necessary (e.g. 000, 001, 010, ...).

To avoid collisions and too much splitting, we would like the bits to be nearly random:
- Hash keys to long integers and then look at the leading bits.

Extendible Hashing – Splitting Example: Insert(0111)

Before the insert (from the figure, M = 3): Hash(X) = first 1 bit of X, so the directory has two entries; the leaf for prefix 0 holds {0010, 0110, 0100} and the leaf for prefix 1 holds {1101, 1010, 1111}.

Inserting 0111 overflows the leaf for prefix 0, so that leaf is split and the directory grows to 2 bits (00, 01, 10, 11): the keys starting with 00 and 01 go into separate blocks, while directory entries 10 and 11 both point to the unchanged leaf {1101, 1010, 1111}.

Extendible Hashing – Splitting Example: Insert(1011)

Continuing the previous example with Hash(X) = first 2 bits of X: the leaf shared by directory entries 10 and 11 holds {1101, 1010, 1111} and is full.

Inserting 1011 splits that leaf into one block for prefix 10 ({1010, 1011}) and one for prefix 11 ({1101, 1111}). The directory already distinguishes 10 from 11, so it does not need to grow.

Extendible Hashing – Splitting Example: Insert(0101)

Continuing with Hash(X) = first 2 bits of X: the leaf for prefix 01 holds {0110, 0100, 0111} and is full.

Inserting 0101 overflows that leaf, and the two resulting leaves need 3 leading bits to be told apart, so the directory doubles to 3 bits (000 ... 111): prefix 010 gets {0100, 0101}, prefix 011 gets {0110, 0111}, and the remaining leaves ({0010}, {1010, 1011}, {1101, 1111}) are each shared by two directory entries.

Applications of Hash Tables

- In compilers: used to keep track of declared variables in source code – this hash table is known as the "symbol table."
- In storing information associated with strings. Example: counting word frequencies in a text (see the sketch below).
- In on-line spell checkers: the entire dictionary is stored in a hash table, and each word in the text is hashed – if it is not found, the word is misspelled.
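A minimal sketch of the word-frequency application using Java's built-in hash table, java.util.HashMap, with strings as keys and counts as values; the example text is made up:

import java.util.HashMap;
import java.util.Map;

public class WordCount {
    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog the end";
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("\\s+")) {
            freq.merge(word, 1, Integer::sum);   // insert or increment in O(1) average time
        }
        System.out.println(freq.get("the"));     // prints 3
    }
}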

Hash Tables vs Search Trees

- Hash tables are good if you would like to perform ONLY Insert/Delete/Find.
- Hash tables are not good if you would also like to get a sorted ordering of the keys: keys are stored arbitrarily in the hash table.
- Hash tables use more space than search trees. Our rule of thumb: a time versus space tradeoff.
- Search trees support sort/predecessor/successor/min/max operations, which cannot be supported by a hash table.