
1 HASH TABLES
The crucial disadvantage of arrays is that we need to allocate the size of the structure in advance. We tend to overestimate its size and end up with a very sparse structure.

2 STORING BIG DATA
We tend to think that the actual number of keys to be stored is equal to the universe of possible keys.

3 HASH TABLES
Often the number of keys to be stored is smaller than the number of keys in the universe. In this case, a hash table may save us a lot of space.

4 HASH TABLES
How can you store all possible SSNs in an array? Use an array with range 0 - 999,999,999: a billion possible locations! This will give you O(1) access time, but considering there are approximately 308,000,000 people in the USA, you waste 1,000,000,000 - 308,000,000 = 692,000,000 array entries!

5 PROBLEM - WASTED SPACE
Problem: the range of key values we are mapping (0 - 999,999,999) is too large compared to the number of actual keys (US citizens).

6 HASH TABLES
All search structures so far relied on a comparison operation, with performance O(n) or O(log n) for input of size n. WE CAN DO BETTER WITH HASHING.

7 Simplest case: assume we have keys with values in the range 1..M. Use a hash method to compute an int from the key, and use that int to select a slot in a direct-access table in which to store the item.

8 HASH(KEY)
To search for an item with key k, look in slot hash(k), which produces an int that maps to an index in the array. If there's an item there, you've found it. If the tag is 0, it's missing.

9 CONSTANT TIME SEARCH
This produces a constant-time search: O(1).

10 EXAMPLE (IDEAL) HASH FUNCTION
Suppose we now have Strings and must hash them to an integer. Our hash function maps the following values:
hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

The resulting table:
0  kiwi
1
2  banana
3  watermelon
4
5  apple
6  mango
7  cantaloupe
8  grapes
9  strawberry

11 WHY HASH TABLES?
We use key/value pairs to store an Entry in the table. We use a hash function to map a key (e.g. hawk) to an integer. The value column holds the data we are actually interested in.

key       value
robin     robin info
sparrow   sparrow info
hawk      hawk info
seagull   seagull info
bluejay   bluejay info
owl       owl info

12 HASH FUNCTIONS
Hash tables normally provide O(1) (constant) time to access an element. In a direct-access table, an element with key k is stored in slot k, where k is an integer value. In a hash table, the element is stored in slot hash(key).

13 HASH FUNCTIONS
hash(k) is a hash function. It maps the universe U of keys into the slots of a hash table that is smaller than the universe, thus reducing the size of the space we need to use.

14 PICTORIAL VIEW OF HASH TABLES
[Figure: keys k1, k2, k3, k4 from the universe of values are mapped to a smaller number of slots.]

15 HASHING
Assume I have a hash function where the key is a String, e.g. a label which represents a city in our HPAir project:
hash(key) → integer
That is, the function maps a string (the city name) to an int, which is an index into the HashMap. What performance (Big-O) do I get?

16 HASH TABLES - CONSTRAINTS
Initial constraints when hashing a key to an integer:
- The hash code of a key must be unique.
- Keys must lie in a small range.
- For storage efficiency, keys must be dense in the range. If they're sparse (lots of gaps between values), a lot of space is used to obtain speed.

17 HASH TABLES
Hashing keys produces integers; therefore we need a hash function hash(key) → integer, i.e. one that maps (hashes) a key to an integer. Applying this function to the key produces a unique address.

18 PROBLEMS WITH A UNIQUE ADDRESS FOR EACH KEY
If hash(key) maps each key to a unique integer in the range 0..m-1, then search is O(1). BUT THIS IS HARD TO DO!

19 Example: using an n-character key, e.g. a String, where n is the number of characters in the String. Use a String class method (such as toCharArray) to change the String to a character array. Then call a method with the array and the number of chars in the String: hash(char array, number of characters).

20 HASHING A STRING OF CHARACTERS
    // n = number of chars in the String
    int hash(char[] sarray, int n) {
        int sum = 0, i = 0;
        // sum the character codes (ASCII values) of the characters;
        // a char is already a number, so no method call is needed
        while (n-- > 0)
            sum = sum + sarray[i++];
        // 256 = number of 8-bit character codes; returns a value in 0..255
        return sum % 256;
    }

21 EVALUATION
    int hash(char[] sarray, int n) {
        int sum = 0, i = 0;
        // get the character code of each character and sum them
        while (n-- > 0)
            sum = sum + sarray[i++];
        return sum % 256;   // returns a value in 0..255
    }
The loop runs once per character, so the work is proportional to the key length n. That length is fixed for a given String and does not depend on how many items are in the table, so the hash step is treated as O(1).

22 HASH TABLES - PROBLEM - COLLISIONS
With this hash function:
    int hash(char[] s, int n) {
        int sum = 0, i = 0;
        while (n-- > 0)
            sum = sum + s[i++];
        return sum % 256;
    }
the keys AB and BA give the same result: hash(AB, 2) and hash(BA, 2) return the same value. The ASCII (Unicode) value of A is 65 and of B is 66. Add them together in any order and they equal 131. This is called a collision.
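
The collision on this slide can be checked directly. The following is a minimal sketch, assuming the key is passed as a String rather than a char array plus length; the class name SumHash is made up for illustration.

```java
final class SumHash {
    // Sum the character codes of the key and reduce the sum to 0..255.
    static int hash(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;   // a char widens to its numeric code, e.g. 'A' is 65
        }
        return sum % 256;
    }
}
```

Because addition ignores order, SumHash.hash("AB") and SumHash.hash("BA") are both 65 + 66 = 131, so any two keys that are rearrangements of each other collide under this function.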

23 COLLISIONS
Because we're mapping a larger universe into a smaller set of slots, collisions occur. Therefore giving every key a unique slot is HARD TO DO. A variety of techniques are used for resolving collisions.

24 PICTORIAL VIEW OF COLLISION
[Figure: keys k1 to k5 mapped to table slots. Sometimes keys map to the same memory location: a COLLISION.]

25 HASH TABLES - COLLISION SOLUTIONS I
We need to store the actual key with the item in the hash table.
- Compute the address: index = hash(key).
- Look at that index in the table.
- If the location is occupied, try the next entry until we find an open one.

26 COLLISION RESOLUTION & OPEN HASHING
The most common resolution mechanism for collisions is called chaining. This is also called open hashing. Being "open", the hash table stores a linked list of the entries whose keys hash to the same value. Chaining combines the concepts of linked lists and direct-access structures like arrays: each slot of the hash table is a pointer to a linked list.

27 CHAINING OR OPEN HASHING
When hashing a key, if a collision happens, the new key is stored in the linked list at that location. E.g., suppose that we're mapping the universe of integers to a hash table of size 10.

28 OPEN HASH TABLE
[Figure: keys, buckets, and entries. John Smith and Sandra map to the same location, so a linked list is started from John to Sandra.]

29 HASH TABLES - LINKED LISTS
Collision resolution: a linked list is attached to each primary table slot.
Three entries map to the same location: h(k) == h(k1) == h(k2)
Searching for k1:
- Calculate hash(k1).
- The first item doesn't match.
- Follow the linked list to k1.
- If NULL is found, the key isn't in the table.

30 HASH TABLES - LINKED LISTS
If a search can be satisfied by any item with key k, performance is still O(1). But if the key values are different, we get O(max), where max is the largest number of duplicates, i.e. the length of the longest chain (linked list).

31 TECHNIQUE TWO - USE AN OVERFLOW AREA
The linked list is constructed in a special area of the table called the OVERFLOW AREA.
If two keys map to the same location, hash(k) == hash(j), and k is stored first, then adding j works as follows:
- hash(j) maps to the same slot as hash(k); find k there.
- Go to the first slot in the overflow area.
- Put j in it.
Searching is the same as for a linked list.

32 HASHING (103)
hash(103) = 103 mod 10
hash(103) = 3
Our hash function is based on the division method for creating hash functions: hash(k) = k mod size

33 HASHING (103)
hash(n) = 103 mod 10
hash(n) = 3
[Figure: 103 is stored in slot 3 of the 10-slot table.]

34 HASHING (69)
hash(n) = 69 mod 10
hash(n) = 9
[Figure: 69 is stored in slot 9.]

35 HASHING (20)
h(n) = 20 mod 10
h(n) = 0
[Figure: 20 is stored in slot 0.]

36 HASHING (13)
hash(n) = 13 mod 10
hash(n) = 3
[Figure: 13 joins the linked list at slot 3.]

37 HASHING (110)
hash(n) = 110 mod 10
hash(n) = 0
[Figure: 110 joins the linked list at slot 0.]

38 HASHING (53)
hash(n) = 53 mod 10
hash(n) = 3
[Figure: 53 joins the linked list at slot 3.]

39 FINAL HASH TABLE
[Figure: slot 0 holds 20 and 110; slot 3 holds 103, 13, and 53; slot 9 holds 69; all other slots are empty.]

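40 SEARCHING FOR 53 USING CHAINING
[Figure: the final hash table; the search starts at slot hash(53) = 3.]

41 SEARCHING FOR 53
[Figure: the linked list at slot 3 is examined.]

42 SEARCHING FOR 53
[Figure: a temp reference points into the linked list at slot 3.]

43 SEARCHING FOR 53
[Figure: temp advances along the linked list.]

44 SEARCHING FOR 53
[Figure: temp advances along the linked list until 53 is reached.]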
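
The walk-through above (inserting 103, 69, 20, 13, 110, 53 with hash(k) = k mod 10, then searching for 53) can be sketched in Java. This is an illustrative sketch only: the class name ChainedIntTable is invented, it uses java.util.LinkedList for the chains, and it assumes new keys are appended at the end of a chain, which the slides do not state.

```java
import java.util.LinkedList;

final class ChainedIntTable {
    private final LinkedList<Integer>[] slots;   // one chain per table slot

    @SuppressWarnings("unchecked")
    ChainedIntTable(int size) {
        slots = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            slots[i] = new LinkedList<>();
        }
    }

    // Division method: hash(k) = k mod size (floorMod keeps the result non-negative).
    int hash(int key) {
        return Math.floorMod(key, slots.length);
    }

    // Store the key in the chain of its slot, ignoring duplicates.
    void insert(int key) {
        LinkedList<Integer> chain = slots[hash(key)];
        if (!chain.contains(key)) {
            chain.addLast(key);   // assumption: append at the tail of the chain
        }
    }

    // Walk the chain at hash(key), as the temp reference does on the slides.
    boolean contains(int key) {
        return slots[hash(key)].contains(key);
    }

    int chainLength(int slot) {
        return slots[slot].size();
    }
}
```

After inserting the six keys into a table of size 10, slot 3 holds a chain of three keys (103, 13, 53), so a search for 53 compares against at most three entries rather than all six.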

45 CLOSED HASHING - RE-HASH FUNCTIONS
Closed hashing is a method of collision resolution in hash tables. With this method, a hash collision is resolved by probing, or searching through other locations in the array.

46 +1 SOLUTION - LINEAR PROBING
In one variation, the probing sequence is called (+1), or linear probing. Continue probing adjacent locations until an unused array slot is found. Then put the Entry in that location.

47 CLOSED HASHING - E.G. LINEAR PROBING
Closed hashing keeps keys in the main table and uses a re-hash function, which has many variations. Linear probing (the previous example) is the most commonly used. Closed hashing uses the main table, or flat area, to find another location.

48 REHASH FUNCTION - LINEAR PROBING
The rehash function for linear probing is hash(x) + 1: keep going to the next slot until you find an empty one.

49 INSERTION, I
Suppose you want to add seagull to this hash table. Also suppose:
- hashCode(seagull) = 143
- table[143] is not empty
- table[143] != seagull
- table[144] is not empty
- table[144] != seagull
- table[145] is empty
Therefore, put seagull at location 145.
[Figure: table containing robin, sparrow, hawk, bluejay, and owl, with seagull being added.]

50 SEARCHING, I
Suppose you want to look up seagull in this hash table. Also suppose:
- hashCode(seagull) = 143
- table[143] is not empty
- table[143] != seagull
- table[144] is not empty
- table[144] != seagull
- table[145] is not empty
- table[145] == seagull!
We found seagull at location 145.
[Figure: table containing robin, sparrow, hawk, seagull, bluejay, and owl.]

51 SEARCHING, II
Suppose you want to look up cow in this hash table. Also suppose:
- hashCode(cow) = 144
- table[144] is not empty
- table[144] != cow
- table[145] is not empty
- table[145] != cow
- table[146] is empty
If cow were in the table, we should have found it by now. Therefore, it isn't here.
[Figure: table containing robin, sparrow, hawk, seagull, bluejay, and owl.]

52 INSERTION, II
Suppose you want to add hawk to this hash table. Also suppose:
- hashCode(hawk) = 143
- table[143] is not empty
- table[143] != hawk
- table[144] is not empty
- table[144] == hawk
hawk is already in the table, so do nothing.
[Figure: table containing robin, sparrow, hawk, seagull, bluejay, and owl.]

53 INSERTION, III
Suppose:
- You want to add cardinal to this hash table
- hashCode(cardinal) = 147
- The last location is 148
- 147 and 148 are occupied
Solution: treat the table as circular; after 148 comes 0. Hence, cardinal goes in location 0 (or 1, or 2, or ...).
[Figure: table containing robin, sparrow, hawk, seagull, bluejay, and owl.]

54 LINEAR PROBING - REVIEW
Closed hashing uses linear probing (among others). Linear probing: if position h(key) is occupied, do a linear search in the table until you find an empty slot. The slots are searched in this order: h(key), h(key)+1, h(key)+2, ..., h(key)+c, wrapping around to slot 0 after the last slot.

55 EXPANDING THE TABLE
If the table becomes full, an exception can be thrown or we can expand the capacity. This process is involved because if we double the size, we risk a sparse structure that can impact the efficiency we seek. One solution is to rehash the table using the new table size.
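
Rehashing into a larger table might look like the sketch below. It assumes integer keys, the division hash, and linear probing; the class and method names are invented for illustration. Every key is re-inserted because its slot depends on the table size, so simply copying the old array would leave keys where a search would not look for them.

```java
final class Rehash {
    // Insert with linear probing; assumes the table has at least one free slot.
    static void insert(Integer[] table, int key) {
        int i = Math.floorMod(key, table.length);
        while (table[i] != null && table[i] != key) {
            i = (i + 1) % table.length;   // wrap around at the end of the table
        }
        table[i] = key;
    }

    // Build a larger table and re-insert every key using the new size.
    static Integer[] expand(Integer[] old, int newSize) {
        Integer[] bigger = new Integer[newSize];
        for (Integer key : old) {
            if (key != null) {
                insert(bigger, key);
            }
        }
        return bigger;
    }
}
```

For example, the keys 3, 8, and 13 all have home slot 3 in a table of size 5, but in a table of size 11 they move to slots 3, 8, and 2 and no longer collide.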

56 CLOSED HASHING - BUCKETS
One implementation for closed hashing groups hash table slots into buckets. The M slots of the hash table are divided into B buckets, with each bucket consisting of M/B slots. The hash function assigns each record to the first slot within one of the buckets.

57 BUCKET HASHING - USES MAIN TABLE
If this slot is already occupied, then the bucket slots are searched sequentially until an open slot is found.

58 BUCKETS ON THE TABLE
If a bucket is entirely full, then the record is stored in an overflow bucket of infinite capacity at the end of the table. All buckets share the same overflow bucket.

59 SLOTS OR BUCKETS - 4 BUCKETS
[Figure: hash table slots grouped into 4 buckets.]

60 BUCKET HASHING
To search, hash the key to determine which bucket should contain the record. The records in this bucket are then searched. How is this better than linear probing (+1)?

61 BUCKET HASHING
If the desired key value is not found and the bucket still has free slots, then the search is complete. If the bucket is full, then the search goes to the overflow bucket. If many records are in the overflow bucket, this will be an expensive process.

62 BUCKET HASHING ADVANTAGE
Bucket methods are good for implementing hash tables stored on disk, because the bucket size can be set to the size of a disk block. Whenever search or insertion occurs, the entire bucket is read into memory.

63 USING BUCKETS
Because the entire bucket is then in memory, processing an insert or search operation requires only one disk access, unless the bucket is full. If the bucket is full, then the overflow bucket must be retrieved from disk as well.
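
The bucket scheme in the preceding slides can be sketched in memory as follows. The sketch assumes integer keys, that M is a multiple of B, that the bucket is chosen by key mod B, and that no keys are ever deleted; the class name BucketTable and the use of an ArrayList for the shared overflow bucket are choices made here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketTable {
    private final Integer[] slots;          // the M slots of the main table
    private final int bucketCount;          // B buckets
    private final int bucketSize;           // M / B slots per bucket
    private final List<Integer> overflow = new ArrayList<>();   // shared overflow bucket

    BucketTable(int m, int b) {
        slots = new Integer[m];
        bucketCount = b;
        bucketSize = m / b;
    }

    // Index of the first slot of the bucket this key hashes to.
    private int bucketStart(int key) {
        return Math.floorMod(key, bucketCount) * bucketSize;
    }

    void insert(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) { slots[i] = key; return; }   // first open slot in the bucket
            if (slots[i] == key) return;                        // already present
        }
        if (!overflow.contains(key)) {
            overflow.add(key);   // bucket is full: use the shared overflow bucket
        }
    }

    boolean contains(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) return false;   // free slot in the bucket: search is complete
            if (slots[i] == key) return true;
        }
        return overflow.contains(key);            // bucket is full: check the overflow bucket
    }

    int overflowSize() {
        return overflow.size();
    }
}
```

With M = 8 and B = 4, the keys 0, 4, and 8 all hash to bucket 0, which has two slots, so the third key lands in the overflow bucket. A search never probes outside its own bucket and the overflow bucket.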

64 CLUSTERING
Even with a good hash function, linear probing has its problems. The position of the initial mapping of key k is called the home position of k. When several insertions map to the same home position, they end up placed contiguously in the table. This collection of keys with the same home position is called a cluster.

65 CLUSTERS
A cluster is a group of items not containing any open slots. Clusters cause efficiency to degrade.

66 CLUSTERING
As clusters grow, the probability increases that a key will map to the middle of a cluster, increasing the rate of the cluster's growth.

67 CLUSTERS
This tendency of linear probing to place items together is known as primary clustering. As these clusters grow, they merge with other clusters, forming even bigger clusters which grow even faster.

68 OTHER COLLISION TECHNIQUES
We have looked at chaining (linked lists, i.e. open hashing) and at closed hashing with linear probing and with bucket hashing. Let us look at some other collision techniques.

69 Other closed hashing techniques are:
- Quadratic probing: a variant of the above where the term being added to the hash result is squared: $h(key) + c^2$
- Random probing: the term being added to the hash function is a random number: h(key) + random()

70 REHASH FUNCTIONS
Rehashing is a technique where a sequence of hashing functions is defined ($h_1, h_2, \ldots, h_k$). If a collision occurs, the functions are used in this order.

71 hash2(j) - second hash function
Use a second hash function (re-hashing). Suppose hash(k) == hash(j) and k is stored first. Adding j:
- Calculate hash(j).
- Find k there first.
- Calculate hash2(j), where hash2 is some other hash function.
- Repeat until we find an empty slot.
- Put j in it.
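
One common way to use a second hash function is double hashing, where hash2 sets the step size between probes. The slide does not say how hash2 is applied, so the sketch below is an assumption: it probes (hash1(key) + i * hash2(key)) mod m for i = 0, 1, 2, ..., with a prime table size m and hash2(key) = 1 + (key mod (m - 1)), which is never zero. The class name is invented.

```java
final class DoubleHashTable {
    private final Integer[] slots;

    // Assumption: the size is prime, so every step size visits all slots.
    DoubleHashTable(int primeSize) {
        slots = new Integer[primeSize];
    }

    private int hash1(int key) {
        return Math.floorMod(key, slots.length);
    }

    // Second hash function: a step size in the range 1..size-1.
    private int hash2(int key) {
        return 1 + Math.floorMod(key, slots.length - 1);
    }

    // Returns the slot where the key is stored, or -1 if the table is full.
    int insert(int key) {
        int home = hash1(key);
        int step = hash2(key);
        for (int i = 0; i < slots.length; i++) {
            int slot = (home + i * step) % slots.length;
            if (slots[slot] == null || slots[slot] == key) {
                slots[slot] = key;
                return slot;
            }
        }
        return -1;
    }
}
```

In a table of size 7, the keys 10, 17, and 24 all have home slot 3, but 17 steps by 6 and 24 steps by 1, so they end up in slots 3, 2, and 4. Keys that share a home slot usually follow different probe sequences, which is what separates this from linear probing.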

72 HASH TABLES - RE-HASH FUNCTIONS
The re-hash function has many variations.
Quadratic probing:
- The probe number is squared
- Avoids primary clustering
- Secondary clustering occurs: all keys which collide on h(x) follow the same sequence
- First a = h(j), then a + c, a + 4c, a + 9c, a + 16c, ...

73 QUADRATIC PROBING
Some versions use $p(K, i) = c_1 i^2 + c_2 i + c_3$ for some choice of constants $c_1$, $c_2$, and $c_3$. Secondary clustering is generally less of a problem than primary clustering.
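
A quadratic probe sequence can be generated as in the sketch below, which assumes the general form p(K, i) = c1*i^2 + c2*i + c3 and adds that offset to the home position modulo the table size m. The class and method names are invented for illustration.

```java
final class QuadraticProbe {
    // Offset of the i-th probe: p(K, i) = c1*i^2 + c2*i + c3.
    static int offset(int i, int c1, int c2, int c3) {
        return c1 * i * i + c2 * i + c3;
    }

    // First `count` slots visited for a key whose home position is `home`.
    static int[] probeSequence(int home, int m, int count, int c1, int c2, int c3) {
        int[] sequence = new int[count];
        for (int i = 0; i < count; i++) {
            sequence[i] = Math.floorMod(home + offset(i, c1, c2, c3), m);
        }
        return sequence;
    }
}
```

With c1 = 1 and c2 = c3 = 0, a key with home position 3 in a table of size 10 visits slots 3, 4, 7, 2: the offsets are 0, 1, 4, 9. Every key with the same home position follows this same sequence, which is the secondary clustering mentioned above.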

74 SEARCHING IN A HASH TABLE
We have already seen how searching works with chaining. With closed hashing, we use the following steps:
- Given a target, hash the target.
- Take the value of the hash of the target and go to that slot.
- If the target exists, it is in this slot or further along its probe sequence.
- Search from the current slot using a linear search.

75 LOOK UP A KEY
    public Object lookup(Object key) {
        int i = find_slot(key);    // method to find the key's slot in the table
        if (slot[i] != null)       // slot is occupied: key is in the table
            return slot[i].value;  // return the value in the slot
        else                       // slot is empty: key is not in the table
            return null;           // not found
    }

76 LINEAR PROBING AND SINGLE-SLOT STEP
    public int find_slot(Object key) {
        int i = hash(key);   // use a hash method to hash the key
        // search until we either find the key, or find an empty slot
        while (slot[i] != null && !slot[i].key.equals(key)) {
            i = (i + 1) % slot.length;   // step one slot, wrapping at the end of the table
        }
        return i;
    }
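
The two methods on slides 75 and 76 can be put together into a small runnable class. This is a sketch under stated assumptions: keys and values are Strings, the hash is the key's hashCode() reduced with Math.floorMod, and the table always keeps at least one empty slot so the probe loop terminates. The names ProbingMap, Entry, and set are invented here.

```java
final class ProbingMap {
    private static final class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final Entry[] slot;

    ProbingMap(int size) {
        slot = new Entry[size];
    }

    private int hash(String key) {
        return Math.floorMod(key.hashCode(), slot.length);
    }

    // Probe from the home slot until the key or an empty slot is found.
    // Assumption: at least one slot is always empty, so the loop ends.
    private int findSlot(String key) {
        int i = hash(key);
        while (slot[i] != null && !slot[i].key.equals(key)) {
            i = (i + 1) % slot.length;
        }
        return i;
    }

    // Returns the value stored for the key, or null if the key is absent.
    String lookup(String key) {
        Entry e = slot[findSlot(key)];
        return e == null ? null : e.value;
    }

    // Adds the key, or overwrites its value if it is already present.
    void set(String key, String value) {
        int i = findSlot(key);
        if (slot[i] == null) {
            slot[i] = new Entry(key, value);
        } else {
            slot[i].value = value;
        }
    }
}
```

The same findSlot serves both lookup and set: an occupied result means the key was found, and an empty result is exactly the slot where a new entry belongs.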

77 DELETING IN A TABLE - CLOSED HASHING
Suppose you want to look up cow in this hash table. Also suppose:
- hashCode(cow) = 144
- table[144] is not empty
- table[144] != cow
- table[145] is not empty
- table[145] != cow
- table[146] is empty
If cow were in the table, we should have found it by now. Therefore it is not there.
[Figure: table containing robin, sparrow, hawk, seagull, bluejay, and owl.]

78 DELETING FROM A TABLE
Problem: when an empty slot is reached, we assume the item we are searching for is not there. Deletion leaves an empty slot. When we next search for an item using linear probing, we assume the item is not there when we reach that empty slot.

79 TOMBSTONES
We assume the item is not there when we reach the empty slot when, in fact, the item could be AFTER the empty slot.

80 TOMBSTONES
Therefore, straight deletion of an item would not work. Instead, the cell is marked (usually by use of a boolean variable) when an item is deleted. The slot is often termed a tombstone.
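
A tombstone flag can be sketched as follows for a linear-probing table of integer keys. The class name and layout (a separate boolean array of deletion marks) are assumptions made for illustration; the slides only say that the cell is marked with a boolean.

```java
final class TombstoneTable {
    private final Integer[] keys;
    private final boolean[] deleted;   // true marks a tombstone

    TombstoneTable(int size) {
        keys = new Integer[size];
        deleted = new boolean[size];
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Returns the index of the key, or -1. A tombstone does not stop the search.
    private int find(int key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null && !deleted[i]) return -1;    // never-used slot: key is absent
            if (keys[i] != null && keys[i] == key) return i;
            i = (i + 1) % keys.length;
        }
        return -1;
    }

    boolean contains(int key) {
        return find(key) >= 0;
    }

    void insert(int key) {
        if (contains(key)) return;
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {        // empty slot or tombstone: reuse it
                keys[i] = key;
                deleted[i] = false;
                return;
            }
            i = (i + 1) % keys.length;
        }
        throw new IllegalStateException("table is full");
    }

    void delete(int key) {
        int i = find(key);
        if (i >= 0) {
            keys[i] = null;
            deleted[i] = true;            // leave a tombstone instead of a plain empty slot
        }
    }
}
```

In a table of size 10, the keys 13, 23, and 33 occupy slots 3, 4, and 5. After deleting 23, a search for 33 passes over the tombstone in slot 4 and still finds 33 in slot 5; without the mark the search would stop at slot 4 and wrongly report 33 as missing.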

81 HASH TABLES - SUMMARY SO FAR...
- Potential O(1) search time, if a suitable function hash(key) → integer can be found
- Space for speed trade-off
- Full hash tables don't work (more later!)
- Collisions are inevitable

82 Various resolution strategies looked at so far:
- Linked lists
- Overflow areas
- Re-hash functions:
  - Linear probing: h is +1
  - Quadratic probing: h is + $i^2$
  - Any other hash function, or even a sequence of functions!

83 COMPARISON OF COLLISION TECHNIQUES
[Figure: comparison of linear probing, random probing, and chaining.]

84 HASHING WITH CHAINING
What is the running time to insert/search/delete?
- Insert: it takes O(1) time to compute the hash function and insert at the head of the linked list.
- Search: proportional to the maximum linked list length.
- Delete: same as search.

85 EFFICIENCY OF CHAINING
Therefore, if we have a bad hash function, all n keys may hash to the same table index, giving an O(n) run-time! So how can we create a good hash function?

86 HASH TABLES - CHOOSING THE HASH FUNCTION
Some functions are definitely better than others! Key criterion: minimum number of collisions.
- Keeps chains short
- Maintains O(1) on average

87 WRITING YOUR OWN HASHCODE METHOD
A hashCode method must:
- Return a value that is a legal array index
- Always return the same value for the same input (it can't use random numbers, or the time of day)
- Return the same value for equal inputs (it must be consistent with your equals method)

88 HASHCODE FUNCTION
It does not need to return different values for different inputs; some collisions are inevitable. A good hashCode method should:
- Be efficient to compute
- Give a uniform distribution of array indices, so NO SPARSE ARRAYS!
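
As an illustration of these rules, here is a sketch of a hypothetical City class (the class and its two fields are invented for this example) whose hashCode is consistent with equals. Note that Java's own hashCode() may return any int, including negative values, so the sketch reduces it to a legal array index in a separate step with Math.floorMod.

```java
import java.util.Objects;

final class City {
    private final String name;
    private final String country;

    City(String name, String country) {
        this.name = name;
        this.country = country;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof City)) return false;
        City c = (City) other;
        return name.equals(c.name) && country.equals(c.country);
    }

    // Uses exactly the fields that equals compares, so equal objects
    // always produce the same hash code. No randomness, no clock.
    @Override
    public int hashCode() {
        return Objects.hash(name, country);
    }

    // Reduce the 32-bit hash code to a legal index for a table of the given size.
    int slotIn(int tableSize) {
        return Math.floorMod(hashCode(), tableSize);
    }
}
```

Two City objects built from the same name and country are equal and therefore map to the same slot, while the result of slotIn always lies in 0..tableSize-1.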

89 OTHER CONSIDERATIONS
The hash table might fill up; we need to be prepared for that. Generally speaking, hash tables work best when the table size is a prime number.

90 HASH TABLES IN JAVA
Java provides two classes, Hashtable and HashMap, which implement the Map interface. Both are maps: they associate keys with values. Hashtable is synchronized; it can be accessed safely from multiple threads. Hashtable uses an open hash, and has a rehash method to increase the size of the table.

91 HASHMAP
HashMap is newer, faster, and usually better, but it is not synchronized. By default, HashMap uses a bucket hash (linked list), and it has a remove method.

92 HASH TABLE OPERATIONS
Both Hashtable and HashMap are in java.util. Both have no-argument constructors, as well as constructors that take an integer table size. Both have the methods listed in the next slide.

93 METHODS
    public V put(K key, V value)   // put the entry in the table
    public V get(Object key)       // returns the value for this key, or null
    public void clear()            // clears the table
    public Set<K> keySet()         // returns the keys in the table as a Set
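
A short usage sketch of these methods with java.util.HashMap follows. The bird names reuse the example data from the earlier slides; the class name BirdMapDemo is invented.

```java
import java.util.HashMap;
import java.util.Map;

final class BirdMapDemo {
    // Build a small map from bird name (key) to its info (value).
    static Map<String, String> build() {
        Map<String, String> birds = new HashMap<>();
        birds.put("hawk", "hawk info");
        birds.put("owl", "owl info");
        birds.put("robin", "robin info");
        return birds;
    }
}
```

On the map returned by build(), get("hawk") returns "hawk info", get("cow") returns null because the key is absent, keySet() contains the three bird names, and put on an existing key returns the value it replaced.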

94 HASH TABLES - REDUCING THE RANGE TO [0, m)
We've mapped the keys to a range of integers 0 ≤ key < r, where r is the total number of possible keys (for social security numbers, keys run from 0 to 999,999,999). Now we must reduce this range to [0, m), where m is a reasonable size for the hash table.

95 HASH TABLES - HASH FUNCTIONS
Some typical functions:
Division: use a mod function, hash(k) = abs(k mod m), where m is the table size. This yields a range between 0 and m-1.
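
In Java the division method can be written as below; the class name is invented. The % operator can return a negative remainder for a negative key, which is why the absolute value is taken. Taking it after the mod, as in the formula above, also stays correct for Integer.MIN_VALUE, whose absolute value does not fit in an int.

```java
final class DivisionHash {
    // hash(k) = abs(k mod m): k % m lies strictly between -m and m,
    // so the absolute value is always in 0..m-1.
    static int hash(int k, int m) {
        return Math.abs(k % m);
    }
}
```

For example, DivisionHash.hash(103, 10) is 3 and DivisionHash.hash(-7, 10) is 7.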

96 Some typical functions: choice of m?
- Powers of 2 are generally not good: $h(k) = k \bmod 2^n$ keeps only the low n bits of k.
- Prime numbers close to $2^n$ are good choices.

97 CHOOSING A VIABLE VALUE FOR M
Prime numbers close to $2^n$ are good choices. E.g., if you want a table of about 4000 entries, choose m = 4093. Other methods are in your text.
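
A table size such as 4093 can be found by searching downward from a power of two for the nearest prime. The sketch below uses simple trial division, which is adequate for table sizes of this magnitude; the class and method names are invented.

```java
final class TableSize {
    // Trial division: n is prime if no d with d*d <= n divides it.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Largest prime that is <= limit, or -1 if there is none.
    static int largestPrimeAtMost(int limit) {
        for (int n = limit; n >= 2; n--) {
            if (isPrime(n)) return n;
        }
        return -1;
    }
}
```

TableSize.largestPrimeAtMost(4096) returns 4093, the value suggested on the slide for a table of roughly 4000 entries.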

98 PERFORMANCE ANALYSIS
If n slots in a table of size m are occupied, the load factor α is defined as:
$\alpha = n / m$
where n = number of items and m = number of slots. α = 1 means the table is full, and α = 0 means the table is empty. It is generally good to get a value < 1, near 0.8.

99 [Figure: average number of probes versus load factor for a successful search, comparing linear probing, separate chaining, and double hashing.]

100 [Figure: average number of probes versus load factor for an unsuccessful search, comparing linear probing, double hashing, and separate chaining.]

101 HASH TABLES - COLLISION RESOLUTION SUMMARY
Chaining
+ Unlimited number of elements
+ Unlimited number of collisions
- Overhead of multiple linked lists
Re-hashing
+ Fast re-hashing
+ Fast access through use of main table space
- Maximum number of elements must be known
- Multiple collisions become probable
- CLUSTERING!
Overflow area
+ Fast access
+ Collisions don't use primary table space

102 TERMS TO KNOW
Open addressing: looks for another open position in the table other than the one to which the element is originally hashed. Requires that the load factor be < 1.
Open addressing using linear probing: seeks the next available position; creates clusters. Alternative methods: quadratic probing, etc.
Separate chaining: if two keys map to the same address, separate chaining creates a linked list of keys that map to that address.

103 HASHCODE FUNCTION IN JAVA
A hash function has two parts. The first part maps key k to an integer. There is a default hashCode() in Java: the method maps each object to an integer. It returns a 32-bit integer, which may be derived from where the object is in memory. A location-based hash works poorly for Strings, as two strings could be in different locations in memory and contain the same data.

104 HASH TABLES - REVIEW
If you can meet the constraints of a hash function that gives a Big-O of O(1), hash tables will generally give good performance: O(1) search.

105 BUT: not advisable for unknown data. If collection size is relatively static (few insertions and deletions), memory management is actually simpler.

106 UNIVERSAL OR PERFECT HASHING
"Dynamic perfect hashing" involves using a second hash table as the data structure to store multiple values within a particular bucket. How do we find the next location with this approach?

107 UNIVERSAL HASHING
What advantages does it have over linear probing? What are possible problems with the approach? Perfect hashing means that read access takes constant time even in the worst case.

108 UNIVERSAL OR PERFECT HASHING
For inserting, the time bounds are only true on average. To make insertion fast enough, the second-level hash table is very large for the number of keys it holds ($k^2$ slots for k keys), large enough so that collisions become unlikely.

109 SECOND LEVEL HASH TABLES
This is not a problem for table size, because the first-level hash distributes keys evenly, so that on average the second-level hash tables are still relatively small. The hash functions for the second-level tables are chosen at random from a set of parameterized hash functions.

110 UNIVERSAL HASHING
Perfect hashing is possible when you know exactly what set of keys you are going to be hashing when you design your hash function. It's popular for hashing keywords for compilers. Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.

111 CHAINED BUCKET
Note: when using chaining, each linked list attached to a slot is called a bucket; this is called chained bucket hashing. However, there is also bucket hashing done on the main table, just to make things clear.

