Tirgul 11 Notes: Hash tables – reminder, examples, some new material.


Dictionary / Map ADT This ADT stores pairs of the form (key, data) (in Java: "value" instead of "data"). It supports the operations insert(key, data), find(key), and delete(key). One way to implement it is with search trees; the standard operations then take O(log n). What if we want to be more efficient?
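As a sketch, the search-tree implementation of this ADT can be demonstrated with java.util.TreeMap (a red-black tree, so put/get/remove are O(log n)); the wrapper class name TreeDictionary is illustrative, not part of the notes.

```java
import java.util.TreeMap;

// Dictionary/Map ADT backed by a balanced search tree.
// java.util.TreeMap is a red-black tree, so every operation is O(log n).
public class TreeDictionary {
    private final TreeMap<Integer, String> map = new TreeMap<>();

    public void insert(int key, String data) { map.put(key, data); }   // O(log n)
    public String find(int key)              { return map.get(key); }  // O(log n), null if absent
    public void delete(int key)              { map.remove(key); }      // O(log n)
}
```

The hash tables below aim to replace the O(log n) above with O(1) on average.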

Direct addressing Say the keys come from a (large) set U. One way to have fast operations is to allocate an array of size |U|. This is of course a waste of memory, since most entries in the array will remain empty. For example, a Hebrew dictionary (e.g., Even-Shushan) holds fewer than 100,000 words, whereas the number of possible combinations of Hebrew letters is much bigger (22^5 for 5-letter words alone). It is impractical to allocate all this space that will never be used.
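A minimal sketch of a direct-address table, assuming the keys are integers in {0, ..., |U|−1} (the class name is illustrative): every operation is O(1), but the array has one slot per *possible* key, which is exactly the memory waste described above.

```java
// Direct-address table: one slot per possible key in U = {0, ..., uSize-1}.
// insert/find/delete are all O(1), but memory is Theta(|U|) even when
// only a handful of keys are ever stored.
public class DirectAddressTable {
    private final String[] slots;

    public DirectAddressTable(int uSize) { slots = new String[uSize]; }

    public void insert(int key, String data) { slots[key] = data; }
    public String find(int key)              { return slots[key]; }  // null if absent
    public void delete(int key)              { slots[key] = null; }
}
```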

Hash table In a hash table we allocate an array of size m, which is much smaller than |U|, and we use a hash function h() to determine the entry of each key. When we want to insert/delete/find a key k, we look for it in entry h(k) of the array. Notice that this way it is not necessary to have an order among the elements of the table. Two questions arise from this: first, what kind of hash function should we use, and second, what happens when several keys are mapped to the same entry (clearly this might happen, since U is much larger than m). But first, an example...

Example Take for example the login names of the students in dast. There are about 300 names, so putting them in a search tree will create a tree of height 8, even if it is balanced. Storing them in a hash table of 100 entries, on the other hand, gives the following spread of names over entries (the x-axis gives the number of items in an entry, and the y-axis gives how many entries have that load):

Example (cont.) We used the very simple hash function of taking "mod" of the name. Notice that even if the spread of names were perfect, there would still be 3 names per entry, so the actual spread of the hash is quite good. Also notice that in a search tree, half of the elements are in the leaves, so it would take 8 operations to find them; in this specific example the worst case is still 8, but the search for most of the elements (about 80% of them) takes about half of that.

How to choose hash functions The crucial point: the hash function should "spread" the keys of U equally among all the entries of the array. Unfortunately, since we don't know in advance which keys we will get from U, this can be done only approximately. Remark: hash functions usually assume that the keys are numbers. We'll discuss later what to do when the keys are not numbers.

The division method If we have a table of size m, we can use the hash function h(k) = k mod m. Some values of m are better than others: –Good m's are prime numbers not too close to a power of 2. –A bad choice is m = 2^p: the function then uses only the lowest p bits of the key. –Likewise, if the keys are decimal, m = 10^p is a bad choice. For example, if we have |U|=2000 and we want each search to take (on average) 3 operations, we can choose m=701. (Which keys will fall into entry 0?) Another example (Cormen, page 226, question ): suppose m=9; what are the hash values of the keys 5, 28, 19, 15, 20, 33, 12, 17, 10?
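The exercise above can be checked mechanically; a one-line sketch of the division method (assuming non-negative integer keys):

```java
// Division method: h(k) = k mod m.
// For m = 9, the keys 5, 28, 19, 15, 20, 33, 12, 17, 10
// hash to 5, 1, 1, 6, 2, 6, 3, 8, 1 respectively.
public class DivisionHash {
    public static int hash(int k, int m) { return k % m; }
}
```

Note that three keys (28, 19, 10) collide in entry 1 and two pairs collide in entries 6 and 1 as well, which previews the collision-handling discussion below.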

The multiplication method One disadvantage of the previous hash function is that it depends on the size of the table: the size we choose for the table affects the performance of the hash function. In the multiplication method we have a constant 0<A<1. We take the fractional part of kA and multiply it by m. Formally, h(k) = ⌊m·(kA mod 1)⌋. For example, if m=100 and A=1/3, then for k=10, h(k)=33, and for k=11, h(k)=66. You can see that this is not a good choice of A, since we would get only three values of h(k)... It is argued (as a result of experiments with different values) that A=(sqrt(5)-1)/2 is a good choice. For example (m=1000): h(61)=700, h(62)=318, h(63)=936, h(64)=554.
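A small sketch of h(k) = ⌊m·(kA mod 1)⌋, reproducing the values quoted above (floating-point doubles are precise enough for these small keys):

```java
// Multiplication method: h(k) = floor(m * frac(k*A)).
// A = (sqrt(5)-1)/2 ~ 0.618... is the experimentally recommended constant.
public class MultiplicationHash {
    public static final double A = (Math.sqrt(5) - 1) / 2;

    public static int hash(int k, int m, double a) {
        double frac = (k * a) % 1.0;   // fractional part of k*A
        return (int) (m * frac);
    }
}
```

With a = 1/3 only the three values 0, 33, 66 ever appear (for m = 100), matching the "bad A" remark above.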

What if keys are not numbers? The hash functions we showed only work for numbers. When keys are not numbers, we should first convert them to numbers. For example, a string can be treated as a number in radix 256 (each character is a digit between 0 and 255). Thus, the string "key" is translated to ((int)'k')*256^2 + ((int)'e')*256^1 + ((int)'y')*256^0. This may cause a problem for a long string, since we get a very large number. If we use the division method, we can use the facts that (a+b) mod n = (a+(b mod n)) mod n and (a*b) mod n = (a*(b mod n)) mod n.

Translating long strings to numbers (cont.) We can write the integer value of "word" as (((256*w + o)*256 + r)*256 + d). Using the properties of mod, we get the following simple algorithm:

    int hash(String s, int m)
        int h = s[0] mod m
        for (i = 1; i < s.length; i++)
            h = ((h*256) + s[i]) mod m
        return h

Notice that h is always smaller than m. Taking mod at every step also keeps the intermediate values small, so the computation does not overflow.
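The pseudocode above translates almost directly into Java; this sketch assumes m is small enough that h*256 + 255 fits in an int (i.e., m below roughly 2^23):

```java
// Radix-256 string hashing by Horner's rule, taking mod m after
// every step so the intermediate value never overflows.
public class StringHash {
    public static int hash(String s, int m) {
        int h = s.charAt(0) % m;
        for (int i = 1; i < s.length(); i++)
            h = (h * 256 + s.charAt(i)) % m;
        return h;
    }
}
```

The result is identical to computing the full radix-256 number first and then taking mod m, by the two mod identities quoted above.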

How to handle collisions Now that we have good hash functions, we should think about the second issue: collisions. Collisions are more likely to happen when the hash table is more loaded. We define the load factor as α = n/m, where n is the number of keys in the hash table and m is the table size. In the lecture you saw two methods for handling collisions. The first is chaining: each array entry is actually a linked list that holds all the keys mapped to that entry. It was proved in class that in a hash table with chaining, a search operation takes O(1+α) time under the simple uniform hashing assumption. (Notice that in the chaining method, the load factor may be greater than one.)
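A minimal sketch of chaining (division-method hash; the class name is illustrative). Insertion at the head of a list is O(1); an unsuccessful search scans one chain, whose expected length is α:

```java
import java.util.LinkedList;

// Chaining: each array entry holds a linked list of all keys hashed to it.
public class ChainedHashTable {
    private final LinkedList<Integer>[] table;
    private final int m;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        this.m = m;
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) table[i] = new LinkedList<>();
    }

    private int h(int key) { return Math.floorMod(key, m); }

    public void insert(int key)  { table[h(key)].addFirst(key); }        // O(1)
    public boolean find(int key) { return table[h(key)].contains(key); } // O(1 + alpha) expected
    public void delete(int key)  { table[h(key)].remove((Integer) key); }
}
```

For example, with m = 9 the keys 5 and 14 both land in entry 5 and simply share that chain.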

Open addressing In this method, the table itself holds all the keys. We change the hash function to receive two parameters: the first is the key and the second is the probe number. We first try h(k,0); if that slot is occupied we try h(k,1), and so on. It is required that {h(k,0),...,h(k,m-1)} be a permutation of {0,...,m-1}, so that after at most m probes we will definitely find a place to put k (unless the table is full). Notice that here the load factor must be at most one. In this method there is a problem with deleting keys: when we search for a key and reach an empty slot, we don't know whether the key we're searching for doesn't exist, or there was a key in that slot that has since been deleted (so the search should have continued past it).

Open addressing (cont.) There are several ways to implement this hashing: –linear probing: h(k,i) = (h'(k) + i) mod m. Problem (primary clustering): if several consecutive slots are occupied, the first free slot after them has a high probability of being filled next. Thus large clusters build up, increasing search time. –quadratic probing: h(k,i) = (h'(k) + c1*i + c2*i^2) mod m. Better than linear probing, but secondary clustering can still occur. –double hashing: h(k,i) = (h1(k) + i*h2(k)) mod m.

Performance (without proofs) Assuming uniform hashing, insertion into an open-address hash table and an unsuccessful search both require at most 1/(1-α) probes on average. For a successful search, the average number of probes is (1/α)·ln(1/(1-α)). For example, if the table is 50% full then a search takes about 1.4 probes on average; if the table is 90% full then a search takes about 2.6 probes on average.

Example (Cormen, page 240, question ) Consider inserting the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of size m=11 using open addressing. The primary hash function is h1(k) = k mod m. Show the result for linear probing, for quadratic probing with c1=1 and c2=3, and for double hashing with h2(k) = 1 + (k mod (m-1)).
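The linear-probing part of this exercise can be traced mechanically; a sketch (the class name is illustrative) that inserts each key at the first free slot among h(k,0), h(k,1), ...:

```java
// Linear probing with h(k, i) = (k mod m + i) mod m.
// Probes slots h(k,0), h(k,1), ... and places k in the first free one.
public class LinearProbing {
    public static Integer[] insertAll(int[] keys, int m) {
        Integer[] table = new Integer[m];   // null = empty slot
        for (int k : keys) {
            for (int i = 0; i < m; i++) {
                int slot = (k % m + i) % m;
                if (table[slot] == null) { table[slot] = k; break; }
            }
        }
        return table;
    }
}
```

For the keys above, 15 collides with 4 (both hash to 4) and moves to slot 5; 59 has to probe slots 4, 5, 6, 7 before settling in slot 8, illustrating primary clustering.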

Summary (or: when to use hash tables) Hash tables are very useful for implementing dictionaries when we don't have an order on the elements, or when we have an order but need only the standard operations. On the other hand, hash tables are less useful if we have an order and need more than the standard operations: for example last(), or an iterator over all elements, which is problematic if the load factor is very low. We should have a good estimate of the number of elements we need to store (for example, the Hebrew University has about 30,000 students each year, but it is still a dynamic database). Re-hashing: if we don't know the number of elements a priori, we might need to perform re-hashing, i.e., increasing the size of the table and re-inserting all the elements.