COMP 103 Hashing 2013-T2 Lecture 28 Thomas Kuehne School of Engineering and Computer Science, Victoria University of Wellington  Marcus Frean, Lindsay.


1 COMP 103 Hashing
2013-T2 Lecture 28
Thomas Kuehne, School of Engineering and Computer Science, Victoria University of Wellington
© Marcus Frean, Lindsay Groves, Peter Andreae and Thomas Kuehne, VUW

2 RECAP-TODAY
RECAP
• Linked structures, including trees and heaps, achieved O(log n) insert/find performance at best
TODAY
• Mind-blowingly fast sorting
• O(1) insert/find performance!

3 Linear Time Sorting Algorithm
• Constant time per entry to sort

    private int[] HashSort(int[] numbers) {
        int[] present = new int[7];                // present[v] counts how often value v occurs
        for (int i = 0; i < numbers.length; i++)   // arrays have a length field, not a length() method
            present[numbers[i]]++;
        return present;                            // the counts, indexed by value
    }

• Limitations
  • elements must be integers
  • element value range must be limited
  • frequency data structure may be sparsely populated
[Figure: the input array numbers (5 3 5 2 6 1) and the frequency array present, indexed 0 to 6]
cf. BucketSort
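The slide's code stops after counting. A minimal sketch of the remaining step, reading the counts back out in index order, might look like this. The class name `CountingSortSketch`, the method name `countingSort`, and the `maxValue` parameter are assumptions made for illustration and are not from the slides.

```java
import java.util.Arrays;

class CountingSortSketch {
    // Sorts values that are assumed to lie in the range 0..maxValue.
    static int[] countingSort(int[] numbers, int maxValue) {
        int[] present = new int[maxValue + 1];   // present[v] = how often v occurs
        for (int n : numbers) {
            present[n]++;
        }
        int[] sorted = new int[numbers.length];
        int next = 0;
        // Walk the counts in index order and emit each value as often as it was seen.
        for (int value = 0; value <= maxValue; value++) {
            for (int c = 0; c < present[value]; c++) {
                sorted[next++] = value;
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] numbers = {5, 3, 5, 2, 6, 1};
        System.out.println(Arrays.toString(countingSort(numbers, 6)));
        // prints [1, 2, 3, 5, 5, 6]
    }
}
```

The total work is proportional to the number of entries plus the size of the value range, which is why the range has to stay small.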

4 Hashing
• Fixing the limitations
  • convert element into an integer
  • use a hash function to assign an integer to an element
• Potential
  • Set, Bag, Maps with constant time insert / find!
• Challenges
  • how to compute the hash code?
  • how to deal with collisions?

5 O(1) Sets with big values?
• We need a way to compute an index for an object: add(“2001 – A Space Odyssey”)
• “Hashing”: compute the “hash code” of an object
[Figure: an array of ✔/✗ flags indexed 0 … N; the hash function maps “2001 – A Space Odyssey” to index 581]

6 O(1) Sets with big values?
• But there are too many possible film titles!
• Suppose the hash function always produces a number between 0 and 1000
  ⇒ some film titles must end up with the same number!
  ⇒ “Collision”
[Figure: “Gravity” and “2001 – A Space Odyssey” both pass through HASH into the array of ✔/✗ flags indexed 0 … N]

7 Detecting collisions
• Store the item in the array, instead of a boolean
• Questions
  1. How to choose a hash function that minimises collisions?
  2. How to manage collisions when they occur?
[Figure: “Gravity” and “2001 – A Space Odyssey” both pass through HASH into the array of items indexed 0 … N]

8 A HashSet

    private E[] data;

    public boolean contains(E value) {
        int hash = Math.abs(value.hashCode() % data.length);
        if (data[hash] == null) return false;
        else if (data[hash].equals(value)) return true;
        else // Collision !!!
    }

    public boolean add(E value) {
        int hash = Math.abs(value.hashCode() % data.length);
        if (data[hash] == null) {
            data[hash] = value;
            size++;
            return true;
        }
        else if (data[hash].equals(value)) return false;
        else // Collision !!!
    }

• Cost is independent of number of items in Set
• Cost is determined by cost of hashCode()
• hashCode() must be consistent with equals(): a.equals(b) ⇒ a.hashCode() == b.hashCode()
• every class defines these methods (hashCode() and equals())
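A small sketch of the index computation used in contains and add, showing why the remainder is taken before the absolute value. The class name `BucketIndexDemo` and the helper `indexOf` are hypothetical names introduced here.

```java
class BucketIndexDemo {
    // The slide's approach: remainder first, then absolute value.
    // The remainder lies strictly between -tableLength and tableLength,
    // so its absolute value is always a valid array index.
    static int indexOf(Object value, int tableLength) {
        return Math.abs(value.hashCode() % tableLength);
    }

    public static void main(String[] args) {
        Integer awkward = Integer.MIN_VALUE;        // Integer.hashCode() is the int value itself
        System.out.println(indexOf(awkward, 10));   // prints 8

        // Reversing the order fails for this value,
        // because Math.abs(Integer.MIN_VALUE) is still negative.
        System.out.println(Math.abs(awkward.hashCode()) % 10);   // prints -8
    }
}
```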

9 Computing Hash Codes
Wish list summary for a hashCode function
• Should produce an integer
• Should distribute the hash codes evenly through the range (minimises collisions)
• Should be fast to compute
• Should take account of all components of the object
• Must be consistent with equals(): two items that are equal must have the same hash value
Can we avoid clashes altogether? That would be perfect! ⇒ perfect hash function

10 A Simple Hash Function for Strings
• We could add up the codes of all the characters:

    private int hash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++)
            hashCode += value.charAt(i);
        return hashCode;
    }

Why is this not very good?

11 Example: Hashing course codes
418 ← DEAF101
419 ← DEAF102 DEAF201
⋮
429 ← BBSC201 MDIA101
430 ← ECHI410 MDIA102 MDIA201
431 ← ECHI303 JAPA111 JAPA201 MDIA202 MDIA220 MDIA301
432 ← ARCH101 ASIA101 BBSC231 BBSC303 BBSC321 CHEM201 ECHI403 ECHI412 JAPA112 JAPA211 JAPA301 MDIA203 MDIA302 MDIA320
⋮
450 ← ANTH412 ARCH389 ARTH111 BIOL228 BIOL327 BIOL372 CHEM489 COML304 COML403 COML421 COMP102 COMP201 CRIM313 CRIM421 DESN215 DESN233 ECON328 ECON409 ECON418 ECON508 EDUC449 EDUC458 EDUC548 EDUC557 ENGL228 ENGL408 ENGL426 ENGL435 ENGL444 ENGL453 FREN124 FREN331 FREN403 FREN412 GEOL362 GEOL407 GERM214 GERM403 GERM412 INFO213 INFO312 INFO402 ITAL206 ITAL215 LALS501 LATI404 LING224 LING323 LING404 MAOR102 MARK304 MARK403 MATH206 MATH314 MATH323 MATH431 MOFI403 PHIL104 PHIL203 PHIL302 PHIL320 PHIL401 PHIL410 RELI321 RELI411 SAMO101
⋮
a lot of collisions!
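The listed values can be reproduced with the character-sum hash. Because addition ignores order, any two codes built from the same characters collide. A runnable sketch follows; the wrapper class `AdditiveHashDemo` is a name assumed for illustration.

```java
class AdditiveHashDemo {
    // The character-sum hash from the slides.
    static int hash(String value) {
        int hashCode = 0;
        for (int i = 0; i < value.length(); i++) {
            hashCode += value.charAt(i);
        }
        return hashCode;
    }

    public static void main(String[] args) {
        System.out.println(hash("DEAF101"));   // prints 418
        System.out.println(hash("DEAF102"));   // prints 419
        System.out.println(hash("DEAF201"));   // prints 419: same characters, different order
        System.out.println(hash("COMP102"));   // prints 450
    }
}
```

Course codes use a small alphabet and a fixed length, so the sums also fall into a narrow range, which makes collisions even more likely.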

12 Better Hash Functions
• Make the contribution of each character depend on its position:

    private int hash(String course) {
        int k = 257;
        int hashCode = 0;
        for (int i = 0; i < course.length(); i++)
            hashCode = hashCode * k + course.charAt(i);
        return hashCode;
    }

For a 7-character course code s:
hashCode(s) = k^6 × s_0 + k^5 × s_1 + k^4 × s_2 + k^3 × s_3 + k^2 × s_4 + k^1 × s_5 + s_6
(it is best to use a prime number for the constant k)
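A sketch showing that the positional hash separates codes that the character sum confuses. The class name `PositionalHashDemo` and the helper `index` are assumptions. `Math.floorMod` is used here as one way to map a possibly negative hash code to a table index; the slides use `Math.abs` of the remainder instead.

```java
class PositionalHashDemo {
    // The positional hash from the slides.
    static int hash(String course) {
        int k = 257;
        int hashCode = 0;
        for (int i = 0; i < course.length(); i++) {
            // int arithmetic wraps around silently on overflow,
            // so the result may be negative for longer strings.
            hashCode = hashCode * k + course.charAt(i);
        }
        return hashCode;
    }

    // Maps a hash code to an index in 0..tableSize-1, even when the code is negative.
    static int index(String course, int tableSize) {
        return Math.floorMod(hash(course), tableSize);
    }

    public static void main(String[] args) {
        System.out.println(hash("AB"));                            // prints 16771 (65 * 257 + 66)
        System.out.println(hash("DEAF102") != hash("DEAF201"));    // prints true
    }
}
```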

13 Perfect Hash Functions
• Perfect hash function gives no collisions for a given data set
• Example - for VUW courses

    private int hash(String course) {
        int hash = 0;
        for (int i = 0; i < course.length(); i++)
            hash = (hash * 51 + course.charAt(i)) % 72201;
        return hash;
    }

• Building a perfect hash function is
  • very difficult
  • very specific to a particular set of possible values
  • only useful in very specialised circumstances
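Whether a function is perfect depends entirely on the key set, so in practice it has to be checked against that set. The sketch below is one way to do that check. The class `PerfectHashCheck` and the method `isCollisionFree` are hypothetical names; the full VUW course list is not available here, so the example uses a few stand-in keys rather than claiming anything about the real data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

class PerfectHashCheck {
    // The hash function from the slide.
    static int slideHash(String course) {
        int hash = 0;
        for (int i = 0; i < course.length(); i++) {
            hash = (hash * 51 + course.charAt(i)) % 72201;
        }
        return hash;
    }

    // Returns true if no two keys get the same hash value.
    // Assumes the keys themselves are distinct.
    static boolean isCollisionFree(List<String> keys, ToIntFunction<String> hash) {
        Set<Integer> seen = new HashSet<>();
        for (String key : keys) {
            if (!seen.add(hash.applyAsInt(key))) {
                return false;   // this value was already produced by an earlier key
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("DEAF102", "DEAF201");
        // The character-sum hash collides on these two keys.
        System.out.println(isCollisionFree(keys, s -> s.chars().sum()));   // prints false
    }
}
```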

14 Dealing with Collisions
• Two approaches
  • Use a collection at each place (“buckets” or “chaining”)
  • Look for an empty place in the hashtable (“probing” or “open addressing”)
[Figure: “2001 – A Space Odyssey” and “Gravity” both pass through HASH into the array indexed 0 … N]
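The second approach can be sketched with linear probing, the simplest form of open addressing: on a collision, step to the next cell until an empty one turns up. This is a minimal sketch under several assumptions: the class name `ProbingSetSketch` is hypothetical, the capacity is fixed, and removal and resizing are left out.

```java
class ProbingSetSketch {
    private final String[] data;
    private int size = 0;

    ProbingSetSketch(int capacity) {
        data = new String[capacity];
    }

    // The cell where the search for a value starts.
    private int home(String value) {
        return Math.floorMod(value.hashCode(), data.length);
    }

    boolean add(String value) {
        int i = home(value);
        for (int probes = 0; probes < data.length; probes++) {
            if (data[i] == null) {           // empty place found: store the item here
                data[i] = value;
                size++;
                return true;
            }
            if (data[i].equals(value)) {     // already in the set
                return false;
            }
            i = (i + 1) % data.length;       // collision: try the next cell, wrapping around
        }
        throw new IllegalStateException("table is full");
    }

    boolean contains(String value) {
        int i = home(value);
        for (int probes = 0; probes < data.length; probes++) {
            if (data[i] == null) return false;          // an empty cell ends the search
            if (data[i].equals(value)) return true;
            i = (i + 1) % data.length;
        }
        return false;
    }

    int size() {
        return size;
    }
}
```

The strings "Aa" and "BB" have the same hashCode in Java, so adding both forces a probe past the first cell.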

15 Collisions: chaining / buckets
• Store a Set in each cell: hash value → which set
• Performance?
  • if the array is of size k, each subset will be about 1/k th of size()
  • cost ≈ cost(hashCode) + cost(subset)
• This is what Java's HashMap does.
• If the sets get too big ⇒ Rehash: double array size and reassign elements
[Figure: an array of buckets, each holding several three-letter animal names (ant, fox, hen, …)]
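A minimal sketch of chaining with rehashing. It is not how `java.util.HashMap` is written internally; the class name `ChainedSetSketch`, the initial capacity of 8, and the rehash threshold of two items per bucket on average are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ChainedSetSketch<E> {
    private List<List<E>> buckets = newTable(8);   // initial capacity is an arbitrary choice
    private int size = 0;

    private static <T> List<List<T>> newTable(int capacity) {
        List<List<T>> table = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            table.add(new ArrayList<>());
        }
        return table;
    }

    // hash value → which bucket
    private List<E> bucketFor(E value) {
        return buckets.get(Math.floorMod(value.hashCode(), buckets.size()));
    }

    boolean contains(E value) {
        return bucketFor(value).contains(value);   // cost(hashCode) + cost(searching one bucket)
    }

    boolean add(E value) {
        List<E> bucket = bucketFor(value);
        if (bucket.contains(value)) {
            return false;
        }
        bucket.add(value);
        size++;
        if (size > 2 * buckets.size()) {           // threshold is an assumption
            rehash();
        }
        return true;
    }

    // Double the number of buckets and reassign every element.
    private void rehash() {
        List<List<E>> old = buckets;
        buckets = newTable(old.size() * 2);
        for (List<E> bucket : old) {
            for (E item : bucket) {
                bucketFor(item).add(item);
            }
        }
    }

    int size() {
        return size;
    }
}
```

Every element must be reassigned during a rehash because its bucket index depends on the table size.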

16 Java and hashCode
• All objects have a hashCode method and an equals method, so:
  • you can call equals on any object
  • and you can put any object into a HashSet, HashMap, …
• Many predefined classes (e.g. String) have good equals and hashCode methods defined
• The default equals method:
  • compares references, i.e., equals is ==
  • if this is not what you want, define your own equals method
• The default hashCode
  • returns an integer based on the object's identity (the reference), not on its contents
• If you redefine equals, you should redefine hashCode too!
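A short sketch of redefining equals and hashCode together. The `Film` class and its fields are hypothetical and chosen to match the film-title examples earlier in the lecture. `Objects.hash` is one convenient standard-library way to combine the fields.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Film {
    private final String title;
    private final int year;

    Film(String title, int year) {
        this.title = title;
        this.year = year;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Film)) return false;
        Film that = (Film) other;
        return year == that.year && title.equals(that.title);
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals compares,
        // so equal films always get equal hash codes.
        return Objects.hash(title, year);
    }

    public static void main(String[] args) {
        Set<Film> films = new HashSet<>();
        films.add(new Film("Gravity", 2013));
        // A different object with the same contents is found in the set.
        System.out.println(films.contains(new Film("Gravity", 2013)));   // prints true
    }
}
```

With equals redefined but the default hashCode left in place, the two Film objects would most likely land in different buckets and the lookup would fail.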

