COMP 103 Hashing Marcus Frean 2015-T2 Lecture 31

Marcus Frean, Lindsay Groves, Peter Andreae and Thomas Kuehne, VUW
Marcus Frean, School of Engineering and Computer Science, Victoria University of Wellington

RECAP-TODAY
RECAP: Linked structures, including trees and heaps, achieved perfect O(log n) insert/find performance.
TODAY: Hashing: O(1) insert/find performance!

NEW TOPIC: Sets with O(1) operations?
BASIC IDEA:
- convert any item into an integer
- use the integer to know where to insert / find that item
Potentially: Sets, Bags and Maps with constant-time insert / find!
2 challenges:
- how to compute the hash code?
- how to deal with collisions?

Hashing
We need a way to compute an index for an object: add(“2001 – A Space Odyssey”)
“Hashing”: compute the “hash code” of an object: “2001 – A Space Odyssey” → hash function → 581
[Figure: an array of ✔/✗ cells with indices 1, 2, …, 9, …, 581, …, N; the hash code 581 selects the cell for this title.]

Collisions are a problem
But there are too many possible film titles!
Suppose the hash function always produces a number between 0 and 1000
⇒ some film titles must end up with the same number!
⇒ “Collision”
[Figure: “2001 – A Space Odyssey” and “Gravity” are both hashed into the same array of ✔/✗ cells (indices 1, 2, …, 9, …, 581, …, N).]

Detecting collisions
Store the item in the array, instead of a boolean.
Questions:
- How to choose a hash function that minimises collisions?
- How to manage collisions when they occur?
[Figure: “2001 – A Space Odyssey” and “Gravity” hashed into an array of items (indices 1, 2, …, 9, …, 581, …, N).]

Computing Hash Codes
“Wish list” for a hashCode method:
- Should produce an integer
- Should distribute the hash codes evenly through the range (this minimises collisions)
- Should be fast to compute
- Should take account of all components of the object
- Must be consistent with equals(): two items that are equal must have the same hash value
Can we avoid clashes altogether? That would be perfect! ⇒ perfect hash function

A Simple Hash Function for Strings
We could add up the codes of all the characters:

    private int hash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++)
            hashCode += value.charAt(i);
        return hashCode;
    }

Why is this not very good?
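One way to see the weakness: a sum ignores the order of the characters, so any two strings built from the same characters collide. The sketch below reuses the summing function unchanged; the class name SumHashDemo is made up for illustration.

```java
// Hypothetical demo class; hash() is the character-summing function shown above.
class SumHashDemo {
    static int hash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++)
            hashCode += value.charAt(i);   // the position i plays no role in the result
        return hashCode;
    }

    public static void main(String[] args) {
        System.out.println(hash("COMP102"));  // 450
        System.out.println(hash("COMP201"));  // 450 as well: same characters, different order
    }
}
```

Short strings over a small alphabet also give sums in a narrow range, so most of a large array would never be used.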

Example: Hashing course codes
418 ← DEAF101
419 ← DEAF102 DEAF
⋮
429 ← BBSC201 MDIA101
430 ← ECHI410 MDIA102 MDIA201
431 ← ECHI303 JAPA111 JAPA201 MDIA202 MDIA220 MDIA301
432 ← ARCH101 ASIA101 BBSC231 BBSC303 BBSC321 CHEM201 ECHI403 ECHI412 JAPA112 JAPA211 JAPA301 MDIA203 MDIA302 MDIA
⋮
450 ← ANTH412 ARCH389 ARTH111 BIOL228 BIOL327 BIOL372 CHEM489 COML304 COML403 COML421 COMP102 COMP201 CRIM313 CRIM421 DESN215 DESN233 ECON328 ECON409 ECON418 ECON508 EDUC449 EDUC458 EDUC548 EDUC557 ENGL228 ENGL408 ENGL426 ENGL435 ENGL444 ENGL453 FREN124 FREN331 FREN403 FREN412 GEOL362 GEOL407 GERM214 GERM403 GERM412 INFO213 INFO312 INFO402 ITAL206 ITAL215 LALS501 LATI404 LING224 LING323 LING404 MAOR102 MARK304 MARK403 MATH MATH314 MATH323 MATH431 MOFI403 PHIL104 PHIL PHIL302 PHIL320 PHIL401 PHIL410 RELI321 RELI411 SAMO
⋮
a lot of collisions!

Better Hash Functions
Make the contribution of each character depend on its position:

    private int hash(String course) {
        int k = 257;
        int hashCode = 0;
        for (int i = 0; i < course.length(); i++)
            hashCode = hashCode * k + course.charAt(i);
        return hashCode;
    }

For a 7-character string s:
hashCode(s) = k^6 × s0 + k^5 × s1 + k^4 × s2 + k^3 × s3 + k^2 × s4 + k × s5 + s6
(it is best to use a prime number for the constant k)
If the constant is divisible by the number of buckets, then not all characters (the “digits” of the sum above) contribute to the bucket index. In the worst case, only the last character determines the bucket. (The container has to use hashCode modulo #buckets.)
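The effect of a badly chosen constant can be checked directly. The sketch below is illustrative only: the class and method names are made up, k is passed as a parameter so two constants can be compared, and k = 256 with 16 buckets is an assumed example of a constant that is a multiple of the bucket count.

```java
// Hypothetical demo of the positional hash with different constants k.
class PositionalHashDemo {
    // Same loop as above; int overflow simply wraps around.
    static int hash(String s, int k) {
        int hashCode = 0;
        for (int i = 0; i < s.length(); i++)
            hashCode = hashCode * k + s.charAt(i);
        return hashCode;
    }

    // The container reduces the hash code to a bucket index.
    // Math.floorMod keeps the index non-negative even if the hash code is negative.
    static int bucket(String s, int k, int numBuckets) {
        return Math.floorMod(hash(s, k), numBuckets);
    }

    public static void main(String[] args) {
        // k = 257: reordering the characters now changes the hash code.
        System.out.println(hash("COMP102", 257) != hash("COMP201", 257));  // true

        // k = 256 is a multiple of 16 buckets: only the last character counts.
        System.out.println(bucket("COMP102", 256, 16));  // 2, because '2' is 50 and 50 % 16 = 2
        System.out.println(bucket("MATH132", 256, 16));  // 2 again
    }
}
```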

Perfect Hash Functions
A perfect hash function gives no collisions for a given data set.
Example - for VUW courses:

    private int hash(String course) {
        int hash = 0;
        for (int i = 0; i < course.length(); i++)
            hash = (hash * 51 + course.charAt(i)) % 72201;
        return hash;
    }

Building a perfect hash function is:
- very difficult
- very specific to a particular set of possible values
- only useful in very specialised circumstances

Dealing with Collisions
Two approaches:
- Use a collection at each place (“buckets” or “chaining”)
- Look for an empty place in the hashtable (“probing” or “open addressing”)
[Figure: “2001 – A Space Odyssey” and “Gravity” hashed into the same array (indices 1, 2, …, 9, …, 581, …, N).]

Collisions: chaining / buckets
Store a Set in each cell: hash value → which set.
This is what Java's HashMap does.
If the sets get too big ⇒ rehash: double the array size and reassign the elements.
Performance?
- if the array is of size k, each subset will be about 1/k-th of size()
- cost ≈ cost(hashCode) + cost(subset)
[Figure: an array of buckets holding the animal names eel, gnu, jay, ant, fox, hen, owl, pig, sow, tui, kea, cow, elk, nit, ray, yak, cod, roe, dog, bee, ape, bat, bug, cat, with a “why?” callout.]
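A minimal sketch of the bucket idea, assuming a fixed number of buckets and String items. ChainedStringSet is a made-up class for illustration, not the code inside Java's own collections, and it leaves out the rehash step.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical set of Strings using chaining: one small list per array cell.
class ChainedStringSet {
    private final List<LinkedList<String>> buckets = new ArrayList<>();
    private int count = 0;

    ChainedStringSet(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) buckets.add(new LinkedList<>());
    }

    // The hash value decides which bucket to look in.
    private LinkedList<String> bucketFor(String item) {
        return buckets.get(Math.floorMod(item.hashCode(), buckets.size()));
    }

    boolean contains(String item) {
        return bucketFor(item).contains(item);   // cost(hashCode) + cost(search in one bucket)
    }

    boolean add(String item) {
        LinkedList<String> bucket = bucketFor(item);
        if (bucket.contains(item)) return false; // a set holds no duplicates
        bucket.add(item);
        count++;
        return true;
    }

    boolean remove(String item) {
        boolean removed = bucketFor(item).remove(item);
        if (removed) count--;
        return removed;
    }

    int size() { return count; }
}
```

Removal needs no special handling here, because each bucket is an ordinary collection.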

Java and hashCode
All objects have a hashCode method and an equals method, so:
- you can call equals on any object
- you can put any object into a HashSet, HashMap, …
Many predefined objects (eg String) have good equals and hashCode methods defined.
The default equals method:
- compares references, i.e., equals is ==
- if this is not what you want, define your own equals method
The default hashCode returns an integer based on the reference (pointer value).
If you redefine equals, you should redefine hashCode too!
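As an illustration of redefining both methods together, here is a small value class. Course and its two fields are made up for this sketch; Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class: two Course objects are equal if subject and number match.
class Course {
    private final String subject;
    private final int number;

    Course(String subject, int number) {
        this.subject = subject;
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Course)) return false;
        Course c = (Course) other;
        return number == c.number && subject.equals(c.subject);
    }

    @Override
    public int hashCode() {
        // Built from the same fields as equals, so equal objects get equal hash codes.
        return Objects.hash(subject, number);
    }

    public static void main(String[] args) {
        Set<Course> courses = new HashSet<>();
        courses.add(new Course("COMP", 103));
        System.out.println(courses.contains(new Course("COMP", 103)));  // true
    }
}
```

With only equals redefined, the second Course object would usually hash to a different place and contains would report false.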

Linear Probing
Hash value tells us where to start looking.
- if value.hashCode() → p, start at index p
- if the cell is used, try p+1, p+2, p+3, …
- wrap round to 0 at the end of the array
[Figure: a small table (indices 1 to 6 shown) using hash = (name[0] + name[1]) % 7, with the names Stu (2), Sam (4), Stig (2), Sven (5), Sun (3) and Steve (2).]
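The probing rule can be sketched as a small fixed-size table. LinearProbingSet is a made-up class; it uses String.hashCode() rather than the two-letter hash from the figure, and it supports neither resizing nor removal.

```java
// Hypothetical fixed-size set of Strings using linear probing.
class LinearProbingSet {
    private final String[] data;

    LinearProbingSet(int capacity) { data = new String[capacity]; }

    // Returns the cell holding item, or the first empty cell on its probe sequence,
    // or -1 if the table is full and the item is absent.
    private int findIndex(String item) {
        int start = Math.floorMod(item.hashCode(), data.length);
        for (int i = 0; i < data.length; i++) {
            int p = (start + i) % data.length;   // p, p+1, p+2, ... wrapping round to 0
            if (data[p] == null || data[p].equals(item)) return p;
        }
        return -1;
    }

    boolean contains(String item) {
        int p = findIndex(item);
        return p >= 0 && data[p] != null;
    }

    boolean add(String item) {
        int p = findIndex(item);
        if (p < 0) throw new IllegalStateException("table is full");
        if (data[p] != null) return false;       // already present
        data[p] = item;
        return true;
    }
}
```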

Hash Tables and Load Factor
When is the hash table “full”?
When the number of items is close to the array size:
- may have to probe a large number of cells to find an empty cell ⇒ performance becomes very slow
- linear probing is particularly bad!
Should not let the table get more than 70% - 80% full (maximum “load factor”).
With a low load factor, the cost is O(1); with a high load factor, it approaches O(N).
[Figure: a nearly full table containing “kea”, “eel”, “pig”, “cat”, “bee”, “fox”, “dog”, “owl”, “hen”, “ant”.]

ensureCapacity
If it is full, double and copy: how do you copy?
Index depends on…
- hashCode and length (division method)!
- and it depends on previous collisions...
⇒ Have to rehash everything!
[Figure: “eel”, “kea”, “ant”, “cat”, “bee”, “fox”, “dog” shown in the old array, copied as-is into a doubled array, and then in their rehashed positions.]
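A sketch of doubling with a full rehash, under stated assumptions: GrowingProbingSet is a made-up class, the starting size of 8 is arbitrary, and the 0.7 threshold is one choice from the 70% to 80% guideline.

```java
// Hypothetical linear-probing set that doubles and rehashes when it gets too full.
class GrowingProbingSet {
    private static final double MAX_LOAD = 0.7;   // assumed maximum load factor
    private String[] data = new String[8];        // assumed initial size
    private int count = 0;

    // Cell holding item, or the first empty cell on its probe sequence.
    // The table always has an empty cell, so the loop terminates.
    private static int findIndex(String[] table, String item) {
        int p = Math.floorMod(item.hashCode(), table.length);
        while (table[p] != null && !table[p].equals(item))
            p = (p + 1) % table.length;
        return p;
    }

    boolean contains(String item) {
        return data[findIndex(data, item)] != null;
    }

    boolean add(String item) {
        int p = findIndex(data, item);
        if (data[p] != null) return false;
        data[p] = item;
        count++;
        ensureCapacity();
        return true;
    }

    // Copying cell-by-cell would be wrong: each index depends on the array length,
    // so every item is re-inserted into the bigger array from scratch.
    private void ensureCapacity() {
        if (count <= MAX_LOAD * data.length) return;
        String[] bigger = new String[data.length * 2];
        for (String item : data)
            if (item != null) bigger[findIndex(bigger, item)] = item;
        data = bigger;
    }

    int capacity() { return data.length; }
}
```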

Linear Probing: Runs and Clustering
Linear probing is particularly bad:
- Repeated collisions at one index create runs
- Runs → linear performance
- With linear probing, runs join up ⇒ they grow fast: the bigger the run, the faster it grows
This is called “clustering”.
Does it help to increase the step size (p, p+d, p+2d, …)?
[Figure: a table containing “dog”, “kea”, “eel”, “cat”, “bee”, “fox”, “ant”, with numbered insertions (1,2 5 3 4) of hen, owl, pig, gnu, emu, rat, tui extending the runs.]

Quadratic Probing
Make the sequence of probes have increasing steps: runs don’t join up so fast
- h, h+1, h+4, h+9, h+16, …
- p=h, p+=1, p+=3, p+=5, p+=7, p+=9, …
In general, quadratic probing uses a quadratic formula:
probe_i = hash + a·i + b·i^2 (b ≠ 0)
Eg: with a = b = ½, the step sizes become 1, 2, 3, … instead of 1, 3, 5, …
[Figure: a table containing “dog”, “hen”, “kea”, “eel”, “bee”, “cat”, “fox”, “owl”, “ant”.]

Quadratic Probing
Another problem, perhaps? The sequence might wrap back on itself before checking each cell.
If we choose a = b = ½, and the length is a power of 2
⇒ guaranteed not to wrap until it has checked every cell!
probe_i = hash + ½(i + i^2)
⇒ probes are hash, hash+1, hash+3, hash+6, hash+10, hash+15, …
⇒ step sizes are 1, 2, 3, 4, 5, …
[Figure: a table containing “dog”, “hen”, “eel”.]
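The coverage claim is easy to check by brute force. The sketch below counts how many distinct cells the first `length` probes reach when a = b = ½; the class and method names are made up for illustration.

```java
// Hypothetical check of which cells quadratic probing with a = b = 1/2 visits.
class TriangularProbeDemo {
    // i-th probe: hash + (i + i*i)/2, wrapped to the table length.
    // i + i*i is always even, so the integer division is exact.
    static int probe(int hash, int i, int length) {
        return Math.floorMod(hash + (i + i * i) / 2, length);
    }

    // Number of distinct cells visited by probes 0 .. length-1.
    static int distinctCells(int hash, int length) {
        boolean[] seen = new boolean[length];
        int distinct = 0;
        for (int i = 0; i < length; i++) {
            int p = probe(hash, i, length);
            if (!seen[p]) {
                seen[p] = true;
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println(distinctCells(3, 8));  // 8: every cell is checked
        System.out.println(distinctCells(3, 7));  // 4: the sequence wraps back on itself
    }
}
```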

Hash Table with Probing: remove
Inserted: Stu (2), Sven (5), Sam (4), Steve (2), Sun (4)
Now remove: Sam (4)
What’s the problem? contains(Sun) will return false!
To remove, need to leave a marker (not null, not a value!)

    public void remove() {
        throw new UnsupportedOperationException();
    }

Insert a “tombstone” key instead.
[Figure: a table (indices 1 to 6 shown) containing Sun, Stu, Steve, Sam, Sven, Stig.]
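One possible shape for tombstone-based removal is sketched below. TombstoneSet and its marker object are made up for illustration; the table has a fixed capacity and uses String.hashCode() with linear probing.

```java
// Hypothetical linear-probing set whose remove() leaves a tombstone behind.
class TombstoneSet {
    // A unique marker object, always compared with ==, so it can never equal a real item.
    private static final String TOMBSTONE = new String("<deleted>");
    private final String[] data;

    TombstoneSet(int capacity) { data = new String[capacity]; }

    // Index of item, or -1 if absent. A null ends the search; a tombstone does not.
    private int indexOf(String item) {
        int start = Math.floorMod(item.hashCode(), data.length);
        for (int i = 0; i < data.length; i++) {
            int p = (start + i) % data.length;
            if (data[p] == null) return -1;
            if (data[p] != TOMBSTONE && data[p].equals(item)) return p;
        }
        return -1;
    }

    boolean contains(String item) { return indexOf(item) >= 0; }

    boolean add(String item) {
        if (contains(item)) return false;
        int start = Math.floorMod(item.hashCode(), data.length);
        for (int i = 0; i < data.length; i++) {
            int p = (start + i) % data.length;
            if (data[p] == null || data[p] == TOMBSTONE) {  // tombstone cells can be reused
                data[p] = item;
                return true;
            }
        }
        throw new IllegalStateException("table is full");
    }

    boolean remove(String item) {
        int p = indexOf(item);
        if (p < 0) return false;
        data[p] = TOMBSTONE;   // keeps later items on the same run reachable
        return true;
    }
}
```

For example, in a table of 5 cells the strings "a", "f" and "k" all hash to index 2; after removing "f", a search for "k" still walks past the tombstone and finds it.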

Iterator?
Iterating through a hash table is not so simple!
- there will be nulls to skip over
- the order in which items are returned appears random (and may change when the array is doubled!)
At each call to next(), the Iterator must advance the index to the next non-null cell. Could be slow!...
[Figure: a table containing “dog”, “kea”, “eel”, “cat”, “bee”, “fox”, “ant” with empty cells in between.]
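The skipping step might look like the sketch below. TableIterator is a made-up class that walks the backing array of a probing table directly; it assumes the array contains only items and nulls (a table with tombstones would have to skip those too).

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical iterator over the backing array of a probing hash table.
class TableIterator implements Iterator<String> {
    private final String[] data;
    private int index = 0;   // always rests on a non-null cell, or on data.length

    TableIterator(String[] data) {
        this.data = data;
        skipEmpty();
    }

    // Advance to the next non-null cell; may scan many empty cells in a sparse table.
    private void skipEmpty() {
        while (index < data.length && data[index] == null) index++;
    }

    @Override
    public boolean hasNext() { return index < data.length; }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String item = data[index++];
        skipEmpty();
        return item;
    }
}
```

Items come out in array order, which is determined by hash codes and collisions rather than by insertion order.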

hashing summary
- hashing gives add/find that is crazily quick
- two ideas: buckets and probing
- with the probing method, removing requires “tombstones”
- when a hashtable is too full, you need to increase its size: this requires rehashing everything
- a HashSet could be slow to iterate over
