Search
We've got all the students here at this university and we want to find information about one of the students. How do we do it?
- Linked list?
- Binary search tree?
- AVL tree?
- Binary heap?
- Array?
- We want something better...

Hashing
Let's go back to arrays:
- If we know the index where the student occurs in the array, we can access the student info in one step, i.e., O(1).
- We need a way to MAP the student's key (e.g., the student's id number) to an index in the array.
- Mapping: a way of taking a key and mapping it to an index (a number).
- A hashing function maps the key to an index.

Mapping
- We've got 5000 students, each with a student id that's 5 digits long.
- Why not use the student ids as the index? Can you think of a better way to map the student to an index?
- How big should the array be? What problems might we hit?

Hash Functions
- Goal: take x keys and map each key to a different index in an x-element array.
  - This is a perfect hashing function.
- If we cannot define a perfect hashing function, we must deal with collisions: when more than one key maps to the same index.
- We need to worry about:
  - the hashing function
  - the array size
  - how we handle collisions

Hash Function
A good hash function:
- Maps all keys to indices within the array
- Distributes keys evenly within the array
- Avoids collisions
- Is fast to compute

Potential Hash Functions
- Could just take the key (which somehow can be represented as a number) and mod it by the array size.
  - E.g., student.id % arraySize
- Problem: could end up with many keys hashing to the same value.
  - E.g., the array size is 100 and the keys are all multiples of 10.

Improving Hash Functions: Array Size
- We know that we're probably not going to be able to fill the array perfectly (we'll have some unfilled slots).
- So let's pick the size of the array carefully:
  - Make it a prime number (this works better with larger primes that aren't close to powers of 2).
- E.g., 8 random numbers between 0 and 100, hash function is number % 11:
  - x: 71  y: 5
  - x: 81  y: 4
  - x: 75  y: 9
  - x: 89  y: 1
  - x: 29  y: 7
  - x: 99  y: 0
  - x: 79  y: 2
  - x: 72  y: 6

Hash Functions
- There are many hashing functions; you can come up with your own...
- Remember, a hash function should:
  - be quick to calculate
  - distribute keys evenly within a range
  - consistently map a key to the same index
- An example (the multiplication method):
  - Multiply the key k by some constant c between 0 and 1: k*c
  - Take the fractional part of k*c (the part that gets cut off when you floor a number): (k*c) - floor(k*c)
  - Multiply that by the array size m: m * ((k*c) - floor(k*c))
  - Take the floor of that: h(k) = floor(m * ((k*c) - floor(k*c)))
  - A good value for c is (sqrt(5) - 1)/2
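
A minimal C++ sketch of this multiplication method (the array size of 17 and the helper name multHash are illustrative assumptions, not from the slides):

#include <cmath>
#include <cstdio>

int multHash(int k, int m) {
    const double c = (std::sqrt(5.0) - 1.0) / 2.0;   // the suggested constant
    double frac = k * c - std::floor(k * c);          // fractional part of k*c
    return (int)std::floor(m * frac);                 // index in [0, m-1]
}

int main() {
    const int m = 17;                                  // assumed array size, just for illustration
    for (int k : {71, 81, 75, 89, 29})
        std::printf("h(%d) = %d\n", k, multHash(k, m));
    return 0;
}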

Potential Hash Functions: Strings
- A simple function to map strings to integers:
  - Add up the characters' ASCII values (0-255) to produce integer keys.
  - E.g., "abcd" = 97 + 98 + 99 + 100 = 394  ==>  h("abcd") = 394 % ArraySize
- Calculations are quick, but depend on the length of the string.
- Potential problems:
  - Anagrams will map to the same index: h("listen") == h("silent")
  - Small strings may not use all of the array:
    - h("a") < 255, h("I") < 255, h("be") < 510
    - If our array size is 3000, the hash function will skew the indexing towards the beginning of the array.
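
A minimal sketch of the ASCII-sum string hash described above; the 3000-slot table size comes from the slide's example, and the function name is made up:

#include <iostream>
#include <string>

unsigned asciiSumHash(const std::string& s, unsigned tableSize) {
    unsigned sum = 0;
    for (char ch : s)
        sum += (unsigned char)ch;                      // add each character's ASCII value
    return sum % tableSize;
}

int main() {
    unsigned tableSize = 3000;
    std::cout << asciiSumHash("abcd", tableSize) << "\n";    // 394
    std::cout << asciiSumHash("listen", tableSize) << " == "
              << asciiSumHash("silent", tableSize) << "\n";  // anagrams collide
    return 0;
}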

Hashing of Strings (2.0)
- Treat the first 3 characters of the string as a base-27 integer (26 letters plus space).
  - Key = (S[0] + (27^1 * S[1]) + (27^2 * S[2])) % ArrayLength
  - You could pick some number other than 27...
- Which problem does this address?
- Calculated quickly (good!)
- Problem with this approach:
  - It's better, but there are an awful lot of words in the English language that start with the same first 3 letters:
    - record, recreation, receipt, reckless, recitation...
    - preclude, preference, predecessor, preen, previous...
    - destitute, destroy, desire, designate, desperate...
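
A minimal sketch of the first-three-characters idea, assuming lowercase letters map to 1-26 and space maps to 0 (the slide only says "base-27", so that digit mapping is an assumption):

#include <iostream>
#include <string>

int firstThreeHash(const std::string& s, int arrayLength) {
    int digits[3] = {0, 0, 0};                      // treat missing characters as space (0)
    for (int i = 0; i < 3 && i < (int)s.length(); i++)
        digits[i] = s[i] - 'a' + 1;                 // 'a' -> 1, ..., 'z' -> 26
    int key = digits[0] + 27 * digits[1] + 27 * 27 * digits[2];
    return key % arrayLength;
}

int main() {
    // Words sharing their first 3 letters collide, as the slide points out.
    std::cout << firstThreeHash("record", 1000) << " == "
              << firstThreeHash("receipt", 1000) << "\n";
    return 0;
}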

Hashing with Strings (3.0)
- Use all N characters of the string as an N-digit base-b number.
- Choose b to be a prime number larger than the number of different characters, e.g., b = 29, 31, 37.
- If L = length of string s, then:
    for (i = 0; i < L; i++) { h += s[L-i-1] * pow(37, i); }
    h = h % ArrayLength;
- Code:

#include <string>
using namespace std;

int main() {
    string strarr[10] = {"release","quirk","craving","cuckold","estuary","vitrify","logship","vase","bowl","cat"};
    string maparr[17];
    for (int i = 0; i < 10; i++) {
        unsigned long h = 0;
        unsigned long power = 1;                   // 37^j, built up in integer arithmetic (pow() would go through doubles)
        int L = strarr[i].length();
        for (int j = 0; j < L; j++) {
            h += (unsigned char)strarr[i][L-j-1] * power;
            power *= 37;
        }
        h %= 17;                                   // ArrayLength = 17
        maparr[h] = strarr[i];                     // a later string overwrites on collision
    }
    return 0;
}

Hashing Function Results (base: 37, array length: 17)
- [Table from the slide: each of the ten strings with its base-37 hash value and that value % 17, plus the resulting 17-slot array.]
- Problems:
  - Longer calculations, especially for longer words.
  - Even with this hashing function we have a collision!

Collisions
- When multiple keys map to the same array index.
- There's a trade-off between the number of collisions and the size of the array:
  - Huge arrays should mean fewer collisions.
- Load factor: number of stored keys (n) / total number of slots (m).
  - Indicates how full the array is.
- But with a reasonable array size, we will have collisions, no matter how good our hashing function is...

Handling Collisions
There are many ways to handle collisions:
- Chaining
- Linear probing
- Quadratic probing
- Random probing
- Double hashing
- etc.

Collisions: Chaining
- Two keys hash to the same index; we could store them both at that index.
- Make each entry in the array a pointer to a linked list.
  - (You thought we'd escaped pointers for a while, huh?)
- The hash array is an array of linked lists.
  - Insert the element either at the head or at the tail.
  - The key is stored in the list at arr[h(k)].
- E.g., arraySize = 10, h(k) = k % 10
  - Insert: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
  - Note: we shouldn't pick 10 as an array size; it is used here only for easy demonstration.
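
A minimal sketch of chaining with integer keys, inserting at the head of each list; the class and method names are illustrative, not from the slides:

#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(int size) : table(size) {}

    void insert(int key) {
        table[hash(key)].push_front(key);          // insert at the head of the chain
    }

    bool search(int key) const {
        for (int k : table[hash(key)])             // walk the chain stored at arr[h(k)]
            if (k == key) return true;
        return false;
    }

private:
    int hash(int key) const { return key % (int)table.size(); }
    std::vector<std::list<int>> table;             // array of linked lists
};

int main() {
    ChainedHashTable h(10);                        // arraySize = 10, as in the slide's example
    for (int k : {0, 1, 4, 9, 16, 25, 36, 49, 64, 81}) h.insert(k);
    return h.search(49) ? 0 : 1;
}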

Chaining
- Worst case, how long to:
  - Insert?
  - Delete?
  - Search?

Chaining Downfalls
- Linked lists could get long, especially when the number of keys approaches the number of slots in the array.
- A bit more memory because of pointers.
- Must allocate and deallocate memory (slower).
- Absolute worst case: all N elements in one linked list!
  - Bad hash function!

Open Addressing
- Store all elements directly in the hash array.
  - So no pointers to linked lists.
- When a collision occurs, look for another empty slot.
  - Probe for another empty slot in a systematic way.
  - Why systematic?
- We will most likely need a larger array than for chaining. Why?

Open Addressing: Linear Probing
- Hash the key to an index.
- If that index is full, look at the next slot.
- If that is full, look at the next slot.
- Continue until a slot in the array is empty, and insert the key in that empty slot.
- If we hit the end of the array, loop back to the beginning.
- Effectiveness?
  - Insert?
  - Delete?
  - Search?
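
A minimal sketch of insertion with linear probing, assuming integer keys and -1 as the empty-slot marker (both are illustrative choices):

#include <vector>

const int EMPTY = -1;                              // assumed marker for an unused slot

// Insert key using linear probing; returns false if the table is completely full.
bool linearProbeInsert(std::vector<int>& table, int key) {
    int m = (int)table.size();
    int start = key % m;                           // h(k)
    for (int i = 0; i < m; i++) {
        int idx = (start + i) % m;                 // wrap around at the end of the array
        if (table[idx] == EMPTY) {
            table[idx] = key;
            return true;
        }
    }
    return false;                                  // every slot was full
}

int main() {
    std::vector<int> table(11, EMPTY);
    linearProbeInsert(table, 58);
    linearProbeInsert(table, 69);                  // 69 % 11 == 58 % 11, so it probes forward
    return 0;
}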

Problems: Clustering
- Keys tend to cluster in one part of the array.
  - Keys that hash into the cluster will be placed at the end of the cluster, making the cluster even larger.
- One fix: instead of always stepping by 1, add 1, then add 2 to that, then add 3 to that, etc.
  - E.g., h(k0) = 3: check 3, then 4, then 6, then 9, then 13, etc.
  - Helps some if keys are clustered in the same area.
  - Doesn't help as much if many keys hash to the same index.
- Over time, probing takes longer.

Open Addressing: Quadratic Probing
- Another way of dealing with collisions:
  - h_i(k) = (h(k) + i^2) % ArraySize
  - So the probe sequence would be: h(k) + 0, then +1, then +4, then +9, then +16, etc.
- Example (ArraySize = 10, h(58) = 8):
  - h_0(58) = (h(58) + 0) % 10 = 8  (X)
  - h_1(58) = (h(58) + 1) % 10 = 9  (X)
  - h_2(58) = (h(58) + 4) % 10 = 2  (X)
  - h_3(58) = (h(58) + 9) % 10 = 7
- This helps to avoid the clustering right around the collision (keys are even more spread out).
- Doesn't help a lot when many keys hash to the same index in the hash array.
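
A minimal sketch of the quadratic probe sequence under the same assumptions (integer keys, -1 as the empty marker):

#include <vector>

const int EMPTY = -1;                              // assumed marker for an unused slot

// Insert key using quadratic probing: try (h(k) + i*i) % m for i = 0, 1, 2, ...
bool quadraticProbeInsert(std::vector<int>& table, int key) {
    int m = (int)table.size();
    int h = key % m;
    for (int i = 0; i < m; i++) {
        int idx = (h + i * i) % m;
        if (table[idx] == EMPTY) {
            table[idx] = key;
            return true;
        }
    }
    return false;                                  // gave up after m probes
}

int main() {
    std::vector<int> table(10, EMPTY);
    table[8] = 8; table[9] = 9; table[2] = 2;      // occupy the slots from the slide's example
    quadraticProbeInsert(table, 58);               // probes 8, 9, 2, then lands in slot 7
    return 0;
}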

Next: Pseudo-Random Probing
- Ideally, when a collision happens, the next index selected would be chosen randomly from the unvisited slots in the array.
- But we can't select the next index truly randomly. Why not?
- Instead, pseudo-random probing:
  - Use the same fixed sequence of random numbers every time.
  - For the ith slot in the probe sequence: h(k) + r(i), where r(i) is the ith value in a random permutation of the offsets from 0 up to the length of the array minus 1.
  - All insertions and searches use the same sequence of random numbers.

Pseudo-Random Probing
- So for instance, with random number sequence rs = {0, 8, 3, ...}:
  - h_0(33) = (33 + rs[0]) % 10 = 3
  - h_0(43) = (43 + rs[0]) % 10 = 3  X
  - h_1(43) = (43 + rs[1]) % 10 = 1
  - h_0(51) = (51 + rs[0]) % 10 = 1  X
  - h_1(51) = (51 + rs[1]) % 10 = 9
  - h_0(53) = (53 + rs[0]) % 10 = 3  X
  - h_1(53) = (53 + rs[1]) % 10 = 1  X
  - h_2(53) = (53 + rs[2]) % 10 = 6
- Calculations: quick!
- Helps with clustering (when keys cluster in the same area of the hash array).
- Doesn't really help when many keys hash to the same index.
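
A minimal sketch of pseudo-random probing, assuming the probe offsets come from a permutation shuffled once with a fixed seed and then reused by every insertion and search:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

const int EMPTY = -1;                              // assumed marker for an unused slot

// Build the fixed probe-offset permutation r(0..m-1), shared by all operations.
std::vector<int> makeProbeSequence(int m) {
    std::vector<int> r(m);
    std::iota(r.begin(), r.end(), 0);              // 0, 1, ..., m-1
    std::mt19937 gen(12345);                       // fixed seed: same "random" order every run
    std::shuffle(r.begin() + 1, r.end(), gen);     // keep r[0] == 0 so the first probe is h(k)
    return r;
}

bool pseudoRandomInsert(std::vector<int>& table, const std::vector<int>& r, int key) {
    int m = (int)table.size();
    for (int i = 0; i < m; i++) {
        int idx = (key % m + r[i]) % m;            // h(k) + r(i), wrapped into the table
        if (table[idx] == EMPTY) { table[idx] = key; return true; }
    }
    return false;
}

int main() {
    std::vector<int> table(10, EMPTY);
    std::vector<int> r = makeProbeSequence(10);
    for (int k : {33, 43, 51, 53}) pseudoRandomInsert(table, r, k);
    return 0;
}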

Double Hashing
- Problem: if more than one key hashes to the same index, then with linear probing, quadratic probing, and even pseudo-random probing, the probes follow the same pattern.
  - The probe sequence after that first hash is based on the index, not on the original key.
- Fix: double hashing.
  - If there is a collision, probe at: p(k, i) = (h(k) + i * h2(k)) % ArraySize
  - Example second hash: h2(k) = 1 + (k mod m)
  - Make m a prime number less than the size of the array.

Example of Double Hashing
- E.g., array size = 11, m = 7, h2(k) = 1 + (k mod m)
- h_0(55) = 55 % 11 = 0
- h_0(66) = 66 % 11 = 0  X
  - h2(66) = 1 + (66 % 7) = 4
  - p_1(66) = (0 + 1*4) % 11 = 4
- h_0(11) = 11 % 11 = 0  X
  - h2(11) = 1 + (11 % 7) = 5
  - p_1(11) = (0 + 1*5) % 11 = 5
- h_0(88) = 88 % 11 = 0  X
  - h2(88) = 1 + (88 % 7) = 5
  - p_1(88) = (0 + 1*5) % 11 = 5  X
  - p_2(88) = (0 + 2*5) % 11 = 10
- Note: why do we need to add 1 to the h2 function?
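
A minimal sketch of double hashing using the numbers from this example (table size 11, m = 7); the EMPTY marker and function names are illustrative assumptions:

#include <vector>

const int EMPTY = -1;                              // assumed marker for an unused slot

int h2(int key, int m) { return 1 + key % m; }     // second hash; the +1 keeps the step nonzero

bool doubleHashInsert(std::vector<int>& table, int key, int m) {
    int size = (int)table.size();
    int step = h2(key, m);
    for (int i = 0; i < size; i++) {
        int idx = (key % size + i * step) % size;  // p(k, i) = (h(k) + i*h2(k)) % size
        if (table[idx] == EMPTY) { table[idx] = key; return true; }
    }
    return false;
}

int main() {
    std::vector<int> table(11, EMPTY);
    for (int k : {55, 66, 11, 88}) doubleHashInsert(table, k, 7);
    // 55 -> slot 0, 66 -> slot 4, 11 -> slot 5, 88 -> slot 10, matching the example above
    return 0;
}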

Deletion with Probing
- What if we delete a value? Would this cause a problem?
- Quick and dirty solution:
  - When you delete, mark the slot as "deleted" somehow, different from an empty slot.
  - When probing during a search, continue past "deleted" slots until either the value is found or a slot is empty.
  - Note: the array must have an empty value (and hopefully a bunch of empty values). Why?
- Problem: we could have a hash array with very few values, yet search could take a while.
  - May need "compaction", sort of like "defragging": remove all values from the hash array and rehash.
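
A minimal sketch of the "deleted" marker idea with linear probing; the DELETED sentinel and the helper names are illustrative assumptions:

#include <vector>

const int EMPTY = -1;                              // slot never used
const int DELETED = -2;                            // slot used, then deleted ("tombstone")

bool probeSearch(const std::vector<int>& table, int key) {
    int m = (int)table.size();
    for (int i = 0; i < m; i++) {
        int idx = (key % m + i) % m;
        if (table[idx] == EMPTY) return false;     // a truly empty slot ends the search
        if (table[idx] == key) return true;        // keep walking past DELETED slots
    }
    return false;
}

void probeDelete(std::vector<int>& table, int key) {
    int m = (int)table.size();
    for (int i = 0; i < m; i++) {
        int idx = (key % m + i) % m;
        if (table[idx] == EMPTY) return;           // key isn't in the table
        if (table[idx] == key) { table[idx] = DELETED; return; }
    }
}

int main() {
    std::vector<int> table(11, EMPTY);
    table[3] = 58; table[4] = 69;                  // 69 originally probed past 58 (both hash to 3)
    probeDelete(table, 58);                        // slot 3 becomes DELETED, not EMPTY
    return probeSearch(table, 69) ? 0 : 1;         // still found, because we probe past the tombstone
}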

Back to Inserting
- What is the best case for insertion? What is the worst case? When does the worst case happen?
- Clearly, the more we avoid collisions, the more efficient hashing is.
- Usually, the more elements in the hash array, the more collisions (back to the load factor of the hash array).
- Rule of thumb: we don't want the hash array to get more than 70% full.
- When a hash table (array) is more than 70% full, we want to:
  - Allocate a new array, at least double the previous array's size.
  - Take all the values and rehash, modifying the hashing function so that it maps to all indices in the new array.
  - Time: O(n). Ugh!
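
A minimal sketch of the rehashing step, assuming chained buckets and the 70% load-factor threshold mentioned above:

#include <list>
#include <utility>
#include <vector>

// Grow the table once the load factor n/m exceeds 0.7: allocate an array at least
// double the size and re-insert (rehash) every key using the new size.
void rehashIfNeeded(std::vector<std::list<int>>& table, int numKeys) {
    if (numKeys <= 0.7 * table.size()) return;
    std::vector<std::list<int>> bigger(2 * table.size());
    for (const auto& bucket : table)
        for (int key : bucket)
            bigger[key % (int)bigger.size()].push_front(key);  // O(n) work in total
    table = std::move(bigger);
}

int main() {
    std::vector<std::list<int>> table(5);
    int numKeys = 0;
    for (int k : {12, 7, 23, 31}) {
        table[k % (int)table.size()].push_front(k);
        numKeys++;
        rehashIfNeeded(table, numKeys);                        // grows to 10 slots at the 4th insert
    }
    return 0;
}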

Hash Tables
- Good for:
  - Data that can handle random access
  - Data that requires a lot of searching
- Not so good for:
  - Data that must be ordered (finding the largest, smallest, median value, etc.)
  - Dynamic data (a lot of adding and deleting of data)
  - Data that doesn't have a lot of unique keys