COMP 103 Hashing Marcus Frean 2015-T2 Lecture 31

Presentation transcript:

COMP 103 Hashing. 2015-T2 Lecture 31. Marcus Frean, School of Engineering and Computer Science, Victoria University of Wellington. Marcus Frean, Lindsay Groves, Peter Andreae and Thomas Kuehne, VUW.

RECAP-TODAY. RECAP: Linked structures, including trees and heaps, achieved perfect O(log n) insert/find performance. TODAY: Hashing, with O(1) insert/find performance!

NEW TOPIC: Sets with O(1) operations? BASIC IDEA: convert any item into an integer, then use the integer to know where to insert / find that item. Potentially, Set, Bag and Map with constant-time insert / find! Two challenges: How do we compute the hash code? How do we deal with collisions?

Hashing. We need a way to compute an index for an object, e.g. for add(“2001 – A Space Odyssey”). “Hashing”: compute the “hash code” of an object. [Figure: the hash function maps “2001 – A Space Odyssey” to 581, and that cell is ticked in an array of ticks and crosses with cells up to N.]

Collisions are a problem. But there are too many possible film titles! Suppose the hash function always produces a number between 0 and 1000 ⇒ some film titles must end up with the same number! ⇒ “Collision”. [Figure: “2001 – A Space Odyssey” and “Gravity” are each passed through HASH into the array of ticks and crosses, with cells up to N.]

Detecting collisions. Store the item in the array, instead of a boolean. Questions: How to choose a hash function that minimises collisions? How to manage collisions when they occur? [Figure: “2001 – A Space Odyssey” and “Gravity” are each passed through HASH into an array of items, with cells up to N.]
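
The idea of storing the item itself, so that a collision can be detected, might be sketched as follows. This is only an illustration under assumptions: the class name DirectHashTable is hypothetical, and String.hashCode() stands in for whatever hash function is chosen.

```java
// Hypothetical sketch: one item per cell, no collision handling yet.
class DirectHashTable {
    private final String[] cells;

    DirectHashTable(int capacity) {
        cells = new String[capacity];
    }

    // Map the hash code to a valid index; floorMod copes with negative hash codes.
    private int indexOf(String item) {
        return Math.floorMod(item.hashCode(), cells.length);
    }

    /** Returns true if the item is stored (or already there), false on a collision. */
    boolean add(String item) {
        int i = indexOf(item);
        if (cells[i] == null) {
            cells[i] = item;
            return true;
        }
        // The cell is occupied: it is only a collision if a different item lives there.
        return cells[i].equals(item);
    }

    boolean contains(String item) {
        return item.equals(cells[indexOf(item)]);
    }
}
```

With an array of booleans, contains could not tell “Gravity” from any other title that hashes to the same cell; comparing against the stored item is what makes the collision visible.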

Computing Hash Codes. “Wish list” for a hashCode method: It should produce an integer. It should distribute the hash codes evenly through the range (this minimises collisions). It should be fast to compute. It should take account of all components of the object. It must be consistent with equals(): two items that are equal must have the same hash value. Can we avoid clashes altogether? That would be perfect! ⇒ perfect hash function.

A Simple Hash Function for Strings. We could add up the codes of all the characters:

    private int hash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++)
            hashCode += value.charAt(i);
        return hashCode;
    }

Why is this not very good?

Example: Hashing course codes
418 ← DEAF101
419 ← DEAF102 DEAF201
⋮
429 ← BBSC201 MDIA101
430 ← ECHI410 MDIA102 MDIA201
431 ← ECHI303 JAPA111 JAPA201 MDIA202 MDIA220 MDIA301
432 ← ARCH101 ASIA101 BBSC231 BBSC303 BBSC321 CHEM201 ECHI403 ECHI412 JAPA112 JAPA211 JAPA301 MDIA203 MDIA302 MDIA320
⋮
450 ← ANTH412 ARCH389 ARTH111 BIOL228 BIOL327 BIOL372 CHEM489 COML304 COML403 COML421 COMP102 COMP201 CRIM313 CRIM421 DESN215 DESN233 ECON328 ECON409 ECON418 ECON508 EDUC449 EDUC458 EDUC548 EDUC557 ENGL228 ENGL408 ENGL426 ENGL435 ENGL444 ENGL453 FREN124 FREN331 FREN403 FREN412 GEOL362 GEOL407 GERM214 GERM403 GERM412 INFO213 INFO312 INFO402 ITAL206 ITAL215 LALS501 LATI404 LING224 LING323 LING404 MAOR102 MARK304 MARK403 MATH206 MATH314 MATH323 MATH431 MOFI403 PHIL104 PHIL203 PHIL302 PHIL320 PHIL401 PHIL410 RELI321 RELI411 SAMO101
⋮
A lot of collisions!
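
A quick way to see why the character-sum hash collides so often is to run it on a few of the course codes listed above. The sketch below reuses the summing loop from the earlier slide; the class name SumHashDemo and the static method name sumHash are just for this illustration.

```java
// Hypothetical demo of the character-sum hash on course codes from the slide.
class SumHashDemo {
    static int sumHash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++)
            hashCode += value.charAt(i);   // position of the character is ignored
        return hashCode;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("DEAF101"));  // 418
        System.out.println(sumHash("COMP102"));  // 450
        System.out.println(sumHash("COMP201"));  // 450: same characters, different order
        System.out.println(sumHash("PHIL104"));  // 450: different characters, same total
    }
}
```

Because addition ignores order, any rearrangement of the same characters collides, and seven characters drawn from letters and digits can only produce a narrow range of totals.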

Better Hash Functions. Make the contribution of each character depend on its position:

    private int hash(String course) {
        int k = 257;
        int hashCode = 0;
        for (int i = 0; i < course.length(); i++)
            hashCode = hashCode * k + course.charAt(i);
        return hashCode;
    }

For a seven-character code s, this computes $\mathrm{hashCode}(s) = k^6 s_0 + k^5 s_1 + k^4 s_2 + k^3 s_3 + k^2 s_4 + k\, s_5 + s_6$ (it is best to use a prime number for the constant k). If the constant is divisible by the number of buckets, then not all digits will contribute to unique locations. In the worst case, only the last digit determines the bucket. (The container has to use hashCode modulo #buckets.)
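
As a rough sketch of how a container might turn this position-dependent hash into a bucket index, the code below pairs the loop from the slide with a modulo step. The names PolyHashDemo, polyHash and bucketIndex are hypothetical, and using Math.floorMod (rather than %) is an assumption made here because int overflow can leave the hash code negative.

```java
// Hypothetical sketch: polynomial string hash plus reduction to a bucket index.
class PolyHashDemo {
    static int polyHash(String s) {
        final int k = 257;                  // the prime constant used on the slide
        int hashCode = 0;
        for (int i = 0; i < s.length(); i++)
            hashCode = hashCode * k + s.charAt(i);  // may overflow and wrap around
        return hashCode;
    }

    // floorMod always returns a value in 0..numBuckets-1, even for negative hash codes.
    static int bucketIndex(String s, int numBuckets) {
        return Math.floorMod(polyHash(s), numBuckets);
    }

    public static void main(String[] args) {
        // Rearranged characters no longer collide, unlike the character-sum hash.
        System.out.println(polyHash("COMP102") != polyHash("COMP201"));  // true
        // With 257 buckets (k is a multiple of the bucket count), short strings
        // are placed by their last character alone.
        System.out.println(bucketIndex("AB", 257) == bucketIndex("ZB", 257));  // true
    }
}
```

The second line of output illustrates the warning about the constant and the bucket count: for strings short enough not to overflow, every term except the last is a multiple of 257 and vanishes under the modulo.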

Perfect Hash Functions. A perfect hash function gives no collisions for a given data set. Example, for VUW courses:

    private int hash(String course) {
        int hash = 0;
        for (int i = 0; i < course.length(); i++)
            hash = (hash * 51 + course.charAt(i)) % 72201;
        return hash;
    }

Building a perfect hash function is very difficult, is very specific to a particular set of possible values, and is only useful in very specialised circumstances.

Dealing with Collisions. Two approaches: Use a collection at each place (“buckets” or “chaining”). Look for an empty place in the hashtable (“probing” or “open addressing”). [Figure: “2001 – A Space Odyssey” and “Gravity” are each passed through HASH into the array, with cells up to N.]

Collisions: chaining / buckets. Store a Set in each cell: hash value → which set. This is what Java's HashMap does. If the sets get too big ⇒ rehash: double the array size and reassign the elements. Performance? If the array is of size k, each subset will be about 1/k-th of size(), so cost ≈ cost(hashCode) + cost(subset). [Figure: an array of buckets holding the words eel, gnu, jay, ant, fox, hen, owl, pig, sow, tui, kea, cow, elk, nit, ray, yak, cod, roe, dog, bee, ape, bat, bug, cat.]
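
A minimal sketch of the bucket approach, under stated assumptions, is given below. The class name ChainedHashSet is hypothetical, each bucket is a plain list rather than a Set, and the rule “rehash when there are more than two items per bucket on average” is an arbitrary threshold picked for illustration; it is not how java.util.HashMap is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hash set with one small collection per cell.
class ChainedHashSet {
    private List<List<String>> buckets;
    private int size = 0;

    ChainedHashSet(int initialBuckets) {
        buckets = newBuckets(initialBuckets);
    }

    private static List<List<String>> newBuckets(int n) {
        List<List<String>> b = new ArrayList<>(n);
        for (int i = 0; i < n; i++) b.add(new ArrayList<>());
        return b;
    }

    // The hash value decides which bucket an item belongs to.
    private List<String> bucketFor(String item) {
        return buckets.get(Math.floorMod(item.hashCode(), buckets.size()));
    }

    boolean contains(String item) {
        return bucketFor(item).contains(item);   // cost(hashCode) + cost(subset)
    }

    boolean add(String item) {
        List<String> bucket = bucketFor(item);
        if (bucket.contains(item)) return false;
        bucket.add(item);
        size++;
        if (size > 2 * buckets.size()) rehash();  // assumed threshold, for illustration
        return true;
    }

    // Double the number of buckets and reassign every element.
    private void rehash() {
        List<List<String>> old = buckets;
        buckets = newBuckets(old.size() * 2);
        for (List<String> bucket : old)
            for (String item : bucket)
                bucketFor(item).add(item);
    }

    int size() { return size; }
    int bucketCount() { return buckets.size(); }
}
```

Starting from two buckets, adding ten distinct items triggers two doublings in this sketch, so the buckets stay short and contains only ever scans a small list.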

Java and hashCode. All objects have a hashCode method and an equals method, so you can call equals on any object and you can put any object into a HashSet, HashMap, … Many predefined classes (e.g. String) have good equals and hashCode methods defined. The default equals method compares references, i.e., equals is ==; if this is not what you want, define your own equals method. The default hashCode returns an integer based on the reference (pointer value). If you redefine equals, you should redefine hashCode too!
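
To make the last point concrete, here is a small sketch of a class that redefines equals and hashCode together. The Film class and its fields are invented for this example; Objects.hash is one convenient way to combine the fields, not the only one.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class: two Film objects are equal if title and year match.
final class Film {
    private final String title;
    private final int year;

    Film(String title, int year) {
        this.title = title;
        this.year = year;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Film)) return false;
        Film f = (Film) other;
        return year == f.year && title.equals(f.title);
    }

    // Uses exactly the fields that equals compares, so equal films get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(title, year);
    }

    public static void main(String[] args) {
        Set<Film> films = new HashSet<>();
        films.add(new Film("Gravity", 2013));
        films.add(new Film("Gravity", 2013));   // equal to the first, so not added again
        System.out.println(films.size());       // 1
    }
}
```

If hashCode were left at its default, the two equal Film objects would most likely land in different buckets and the set would end up holding both.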

Linear Probing. The hash value tells us where to start looking: if value.hashCode() → p, start at index p; if the cell is used, try p+1, p+2, p+3, …; wrap round to 0 at the end of the array. Example with hash = (name[0]+name[1])%7: Stu (2), Sam (4), Stig (2), Sven (5), Sun (3), Steve (2). [Figure: the names placed by linear probing into an array with cells up to 6.]
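
The probing rule can be sketched in a few lines. This is a fixed-size illustration only: the class name LinearProbingSet is hypothetical, String.hashCode() is used in place of the two-character hash from the example, and there is no resizing or removal.

```java
// Hypothetical sketch of a fixed-size hash set using linear probing.
class LinearProbingSet {
    private final String[] table;
    private int count = 0;

    LinearProbingSet(int capacity) {
        table = new String[capacity];
    }

    // Where to start looking for this item.
    private int home(String item) {
        return Math.floorMod(item.hashCode(), table.length);
    }

    boolean contains(String item) {
        int p = home(item);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[p] == null) return false;        // empty cell: item is not here
            if (table[p].equals(item)) return true;
            p = (p + 1) % table.length;                // try p+1, wrapping round to 0
        }
        return false;                                  // every cell checked
    }

    boolean add(String item) {
        if (contains(item)) return false;
        if (count == table.length) throw new IllegalStateException("table is full");
        int p = home(item);
        while (table[p] != null)
            p = (p + 1) % table.length;                // skip used cells
        table[p] = item;
        count++;
        return true;
    }
}
```

Both add and contains follow the same probe sequence from the same starting index, which is what lets contains stop at the first empty cell.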

Hash Tables and Load Factor. When is the hash table “full”? When the number of items is close to the array size, we may have to probe a large number of cells to find an empty cell ⇒ performance becomes very slow. Linear probing is particularly bad! We should not let the table get more than 70%–80% full (the maximum “load factor”). With a low load factor, the cost is O(1); with a high load factor, the cost is O(N). [Figure: an array holding “kea”, “eel”, “pig”, “cat”, “bee”, “fox”, “dog”, “owl”, “hen”, “ant”.]

ensureCapacity. If it is full, double and copy. But how do you copy? The index depends on hashCode and length (division method)! And it depends on previous collisions... ⇒ Have to rehash everything! [Figure: the table before and after doubling; “eel”, “kea”, “ant”, “cat”, “bee”, “fox”, “dog” end up in different cells of the larger array.]
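
One possible shape for ensureCapacity in a linear-probing table is sketched below. The class name ResizingProbingSet, the initial capacity of 8 and the 0.7 maximum load factor are assumptions for this example (0.7 is taken from the low end of the 70% to 80% guideline above).

```java
// Hypothetical sketch: linear probing with doubling and a full rehash.
class ResizingProbingSet {
    private static final double MAX_LOAD = 0.7;   // assumed maximum load factor
    private String[] table = new String[8];       // assumed initial capacity
    private int count = 0;

    private static int home(String item, int length) {
        return Math.floorMod(item.hashCode(), length);   // index depends on the length
    }

    // Insert into the given array by linear probing; assumes at least one free cell.
    private static void place(String[] cells, String item) {
        int p = home(item, cells.length);
        while (cells[p] != null)
            p = (p + 1) % cells.length;
        cells[p] = item;
    }

    // If one more item would exceed the load factor, double and rehash everything.
    private void ensureCapacity() {
        if (count + 1 <= MAX_LOAD * table.length) return;
        String[] bigger = new String[table.length * 2];
        for (String item : table)
            if (item != null) place(bigger, item);  // re-insert, do not copy by index
        table = bigger;
    }

    boolean contains(String item) {
        int p = home(item, table.length);
        while (table[p] != null) {                  // the table always has an empty cell
            if (table[p].equals(item)) return true;
            p = (p + 1) % table.length;
        }
        return false;
    }

    boolean add(String item) {
        if (contains(item)) return false;
        ensureCapacity();
        place(table, item);
        count++;
        return true;
    }

    int size() { return count; }
    int capacity() { return table.length; }
}
```

Copying cell i of the old array to cell i of the new one would not work, because both the modulo and the earlier collisions that shaped each item's position change when the length changes.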

Linear Probing: Runs and Clustering. Linear probing is particularly bad: Repeated collisions at one index create runs. Runs → linear performance. With linear probing, runs join up ⇒ they grow fast: the bigger the run, the faster it grows. This is called “clustering”. Does it help to increase the step size (p, p+d, p+2d, …)? [Figure: a table holding “dog”, “kea”, “eel”, “cat”, “bee”, “fox”, “ant”, with hen, owl, pig, gnu, emu, rat, tui arriving and extending the runs.]

Quadratic Probing. Make the sequence of probes have increasing steps, so runs don’t join up so fast: h, h+1, h+4, h+9, h+16, …, i.e. p=h, p+=1, p+=3, p+=5, p+=7, p+=9, … In general, quadratic probing uses a quadratic formula: $\mathrm{probe}_i = \mathrm{hash} + a\,i + b\,i^2 \quad (b \neq 0)$. E.g. with a = b = ½, the step sizes become 1, 2, 3, … instead of 1, 3, 5, … [Figure: a table holding “dog”, “hen”, “kea”, “eel”, “bee”, “cat”, “fox”, “owl”, “ant”.]

Quadratic Probing. Another problem, perhaps? The sequence might wrap back on itself before checking each cell. If we choose a = b = ½, and the length is a power of 2 ⇒ it is guaranteed not to wrap until it has checked every cell! $\mathrm{probe}_i = \mathrm{hash} + \tfrac{1}{2}(i + i^2)$ ⇒ probes are hash, hash+1, hash+3, hash+6, hash+10, hash+15, ... ⇒ step sizes are 1, 2, 3, 4, 5, … [Figure: a table holding “dog”, “hen”, “eel”.]
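
The claim about a = b = ½ and power-of-2 lengths can be checked by brute force for small tables. The sketch below, with the hypothetical names TriangularProbeDemo and visitsEveryCell, follows the probe sequence for as many steps as there are cells and reports whether every cell was reached.

```java
// Hypothetical check: do step sizes 1, 2, 3, ... reach every cell of the table?
class TriangularProbeDemo {
    /** True if probes hash + (i + i*i)/2, for i = 0 .. length-1, hit every cell. */
    static boolean visitsEveryCell(int hash, int length) {
        boolean[] seen = new boolean[length];
        int visited = 0;
        int p = Math.floorMod(hash, length);
        for (int i = 0; i < length; i++) {
            if (!seen[p]) {
                seen[p] = true;
                visited++;
            }
            p = (p + i + 1) % length;   // next step size is i + 1
        }
        return visited == length;
    }

    public static void main(String[] args) {
        System.out.println(visitsEveryCell(5, 8));   // true: length is a power of 2
        System.out.println(visitsEveryCell(0, 16));  // true: length is a power of 2
        System.out.println(visitsEveryCell(0, 7));   // false: probes repeat cells 0, 1, 3, 6
    }
}
```

For length 7 the sequence revisits cells it has already seen, which is the “wrap back on itself” problem; a check like this only tests particular sizes and is not a proof of the general statement.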

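Hash Table with Probing: remove. Inserted: Stu (2), Sven (5), Sam (4), Steve (2), Sun (4). Now remove: Sam (4). What’s the problem? contains(Sun) will return false! Sun hashed to the same index as Sam and was placed further along, so a search for Sun that meets an empty cell where Sam used to be stops too early. To remove, we need to leave a marker (not null, not a value!): insert a “tombstone” key instead.

    public void remove() { throw new UnsupportedOperationException(); }

[Figure: an array with cells up to 6 holding Sun, Stu, Steve, Sam, Sven, Stig.]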
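
One way the tombstone idea might look in code is sketched below. The class name TombstoneProbingSet and the marker object are assumptions for this example; the marker is a distinct String instance that is compared by reference, so it can never be confused with a real item.

```java
// Hypothetical sketch: linear probing with tombstones for removed items.
class TombstoneProbingSet {
    // A unique marker object; compared with ==, never with equals.
    private static final String TOMBSTONE = new String("<deleted>");
    private final String[] table;

    TombstoneProbingSet(int capacity) {
        table = new String[capacity];
    }

    private int home(String item) {
        return Math.floorMod(item.hashCode(), table.length);
    }

    // Returns the index holding the item, or -1 if it is absent.
    private int find(String item) {
        int p = home(item);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[p] == null) return -1;                         // truly empty: stop
            if (table[p] != TOMBSTONE && table[p].equals(item)) return p;
            p = (p + 1) % table.length;                              // tombstone or other item
        }
        return -1;
    }

    boolean contains(String item) {
        return find(item) >= 0;
    }

    boolean add(String item) {
        if (contains(item)) return false;
        int p = home(item);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[p] == null || table[p] == TOMBSTONE) {         // tombstones are reusable
                table[p] = item;
                return true;
            }
            p = (p + 1) % table.length;
        }
        throw new IllegalStateException("table is full");
    }

    boolean remove(String item) {
        int p = find(item);
        if (p < 0) return false;
        table[p] = TOMBSTONE;                                        // leave a marker, not null
        return true;
    }
}
```

For instance, the strings "Aa" and "BB" have the same String.hashCode() value, so they share a starting index; after adding both and removing "Aa", contains("BB") still succeeds because the search steps over the tombstone instead of stopping at it.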

Iterator? Iterating through a hash table is not so simple! There will be nulls to skip over. The order in which items are returned appears random (and may change when the array is doubled!). At each call to next(), the Iterator must advance the index to the next non-null cell. Could be slow!... [Figure: a table holding “dog”, “kea”, “eel”, “cat”, “bee”, “fox”, “ant” with empty cells in between.]
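
An iterator that skips the empty cells could be sketched like this. The class name TableIterator is hypothetical, and the sketch assumes the table is a plain String array with null for empty cells (tombstones are not handled).

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: iterate over the non-null cells of a hash table array.
class TableIterator implements Iterator<String> {
    private final String[] table;
    private int index = 0;          // always rests on a non-null cell, or past the end

    TableIterator(String[] table) {
        this.table = table;
        advance();
    }

    // Move the index forward to the next non-null cell; may scan many empty cells.
    private void advance() {
        while (index < table.length && table[index] == null)
            index++;
    }

    @Override
    public boolean hasNext() {
        return index < table.length;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String item = table[index++];
        advance();
        return item;
    }
}
```

A full pass costs time proportional to the array length rather than to the number of items, which is why a sparsely filled table is slow to iterate over, and the items come out in array order, not insertion order.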

Hashing summary. Hashing gives add/find that is crazily quick. Two ideas: buckets and probing. With the probing method, removing requires “tombstones”. When a hashtable is too full, you need to increase its size: this requires rehashing everything. A HashSet could be slow to iterate over.