Algorithms: Design and Analysis
1 Algorithms: Design and Analysis
Semester 2. Topic 5: Hashing. Objectives: introduce hashing, hash functions, hash tables, collisions, linear probing, double hashing, bucket hashing, and Java's HashMap

2 1. Searching an Array If an array is not sorted, a search requires O(n) time. If an array is sorted, a binary search requires O(log n) time. If an array is organized using hashing, then it is possible to have constant time search: O(1).
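The three search costs above can be seen side by side in a short sketch (the array contents and helper names are illustrative, not from the slides):

```java
import java.util.Arrays;
import java.util.HashSet;

public class SearchCompare {
    // O(n): a linear scan works on any array, sorted or not
    static boolean linearSearch(int[] a, int target) {
        for (int x : a)
            if (x == target) return true;
        return false;
    }

    // O(log n): binary search requires a sorted array
    static boolean binarySearchSorted(int[] sorted, int target) {
        return Arrays.binarySearch(sorted, target) >= 0;
    }

    // O(1) expected: hash-based lookup (after an O(n) one-time build of the set)
    static boolean hashSearch(HashSet<Integer> set, int target) {
        return set.contains(target);
    }

    public static void main(String[] args) {
        int[] sorted = {2, 5, 8, 13, 21, 34};
        HashSet<Integer> set = new HashSet<>();
        for (int x : sorted) set.add(x);

        System.out.println(linearSearch(sorted, 13));        // true
        System.out.println(binarySearchSorted(sorted, 13));  // true
        System.out.println(hashSearch(set, 13));             // true
        System.out.println(hashSearch(set, 99));             // false
    }
}
```

All three report the same answer; only the cost of getting it differs.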

3 2. Hashing A hash function calculates an array index position for a data item. The array plus the hash function is called a hash table. Hash tables support insertion and search in O(1) time, but the hash function must be carefully coded to be this fast.

4 A Simple Hash Function
A simple hash function which returns the values:
hash("kiwi") == 0, hash("banana") == 2, hash("watermelon") == 3, hash("apple") == 5, hash("mango") == 6, hash("cantaloupe") == 7, hash("grapes") == 8, hash("strawberry") == 9
Use:
String[] fruits = new String[10];
String f1 = "mango";
fruits[ hash(f1) ] = f1; // O(1) insertion
The resulting fruits[] array (index: value): 0: kiwi, 2: banana, 3: watermelon, 5: apple, 6: mango, 7: cantaloupe, 8: grapes, 9: strawberry (cells 1 and 4 empty)

5 Possible Implementation (not examinable)
int hash(String s)
{  return hash(s, 163, 53) % 10;  }   // prime seed & modulo

int hash(String s, int seed, int modulo)
// convert each character into a number, and combine
{  long hash = 0;
   for (int i = 0; i < s.length(); i++) {
     hash += ((long) Math.pow(seed, i) * (int) s.charAt(i));
     hash = Math.floorMod(hash, modulo);   // hash % modulo
   }
   return (int) hash;
}

6 A Second Example
A hash table studs[] storing (ID, Name) Student objects, where the ID is a nine-digit integer
The hash table is an array of size N == 10,000
The hash function is hash(ID) == the last four digits of the ID key
Array cells 0, 1, 2, ..., 9997, 9998, 9999 hold Student objects such as ( , tim), ( , ad), ( , jim)

7 Sample Code
Student[] studs = new Student[10000];
Student s1 = new Student( , "ad");
Student s2 = new Student( , "tim");
studs[ hash(s1.getID()) ] = s1; // O(1) insertion
studs[ hash(s2.getID()) ] = s2;
  :
Student s = studs[ hash( ) ]; // O(1) lookup
System.out.println( s.getName() );

int hash(long n) { return (int)(n % 10000); } // too simple

8 3. Hash Function Problem
Real hash functions produce values which collide:
hash("kiwi") == 0, hash("banana") == 2, hash("watermelon") == 3, hash("apple") == 5, hash("mango") == 6, hash("cantaloupe") == 7, hash("grapes") == 8, hash("strawberry") == 9, and also hash("honeydew") == 6
Two fruits ("mango" and "honeydew") hash to the same index, 6. Now what?

9 Collisions A collision occurs when two items hash to the same array location, e.g. "mango" and "honeydew" both hash to 6. Where should we place the second and later items that hash to this same location? Three popular solutions: linear probing, double hashing, bucket hashing (chaining)

10 4. Solution #1: Linear Probing
Place the colliding item in the next empty table cell (perhaps after cycling around to the start of the table)
Each inspected table cell is called a probe
Example: num = 32 is to be added, with hash(32) = 6
The table (N = 13) holds (index: key): 2: 41, 5: 18, 6: 44, 7: 59, 9: 22
Three probes are needed to insert 32: cells 6 and 7 are occupied, cell 8 is empty, so 32 is added at index 8

11 Example 1 A hash table with N = 13 and h(k) = k mod 13
Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
Total number of probes: 19
Final table (index: key): 2: 41, 5: 18, 6: 44, 7: 59, 8: 32, 9: 22, 10: 31, 11: 73 (cells 0, 1, 3, 4, 12 empty)
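Example 1 can be reproduced with a short sketch (the class and method names are illustrative; the slides give only the hash function and the keys):

```java
public class LinearProbing {
    static final int N = 13;                 // table size from Example 1
    static Integer[] table = new Integer[N];
    static int probes = 0;                   // total probes across all insertions

    // insert key using h(k) = k mod N, stepping by 1 on a collision
    static void insert(int key) {
        int i = key % N;
        while (true) {
            probes++;                        // one probe per inspected cell
            if (table[i] == null) { table[i] = key; return; }
            i = (i + 1) % N;                 // wrap around the table
        }
    }

    public static void main(String[] args) {
        int[] keys = {18, 41, 22, 44, 59, 32, 31, 73};
        for (int k : keys) insert(k);
        System.out.println("total probes = " + probes);   // prints: total probes = 19
    }
}
```

The 19 probes break down as 1 + 1 + 1 + 2 + 1 + 3 + 6 + 4 for the eight keys in insertion order; 31 alone costs 6 probes because it lands in the growing cluster at cells 5..9.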

12 Example 2: Insertion
The table holds: 141: robin, 143: sparrow, 144: hawk, 147: bluejay, 148: owl (cells 142, 145, 146 are empty)
Suppose we want to add "seagull": hash(seagull) == 143
3 probes: table[143] is not empty; table[144] is not empty; table[145] is empty
put "seagull" at location 145

13 Searching
The table now holds: 141: robin, 143: sparrow, 144: hawk, 145: seagull, 147: bluejay, 148: owl
Look up "seagull": hash(seagull) == 143
3 probes: table[143] != seagull; table[144] != seagull; table[145] == seagull
found "seagull" at location 145

14 Searching Again
Look up "cow": hash(cow) == 144
3 probes: table[144] != cow; table[145] != cow; table[146] is empty
"cow" is not in the table, since the probes reached an empty cell
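The two searches above can be sketched as code. The hash values (143 for "seagull", 144 for "cow") are hard-coded, since the slides do not give the string hash function, and index 0 of the array stands for cell 141:

```java
public class ProbeSearch {
    // index 0 stands for cell 141 on the slides: "sparrow" is at 143, "hawk" at 144, etc.
    static String[] table =
        { "robin", null, "sparrow", "hawk", "seagull", null, "bluejay", "owl" };

    // probe forward from the home cell; reaching an empty cell ends the search
    static int find(String key, int home) {
        int i = home;
        int probes = 0;
        while (table[i] != null) {
            probes++;
            if (key.equals(table[i])) {
                System.out.println("found " + key + " at cell " + (141 + i)
                                   + " after " + probes + " probes");
                return i;
            }
            i = (i + 1) % table.length;
        }
        // the probe of the empty cell also counts
        System.out.println(key + " not found after " + (probes + 1) + " probes");
        return -1;
    }

    public static void main(String[] args) {
        find("seagull", 143 - 141);   // 3 probes, found at cell 145
        find("cow",     144 - 141);   // 3 probes, ends at the empty cell 146
    }
}
```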

15 Insertion Again
Add "hawk": hash(hawk) == 143
2 probes: table[143] != hawk; table[144] == hawk
"hawk" is already in the table, so do nothing

16 Insertion Again
Add "cardinal": hash(cardinal) == 147
147 and 148 are occupied, so the search for space loops back to the start of the table: "cardinal" goes in location 0 (or 1, or 2, or ...)

17 Lazy Deletion Deletions are done by marking a table cell as deleted, rather than emptying the cell. Deleted locations are treated as empty when inserting and as occupied during a search.

18 Lazy Deletion Example
Delete "sparrow" by marking its location (143) as "DELETED"
If the "sparrow" cell were really emptied, what would happen when we later try to find "hawk"?
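A minimal sketch of lazy deletion with a tombstone marker (the sentinel string and cell layout are assumptions; index 0 again stands for cell 141):

```java
public class LazyDelete {
    static final String DELETED = "<deleted>";   // tombstone sentinel (any unused value works)
    static String[] table =
        { "robin", null, "sparrow", "hawk", "seagull", null, "bluejay", "owl" };

    // mark the cell as DELETED instead of setting it to null
    static void delete(String key, int home) {
        for (int i = home; table[i] != null; i = (i + 1) % table.length)
            if (key.equals(table[i])) { table[i] = DELETED; return; }
    }

    // a tombstone is treated as occupied: the probe walks past it
    static boolean find(String key, int home) {
        for (int i = home; table[i] != null; i = (i + 1) % table.length)
            if (key.equals(table[i])) return true;   // DELETED never equals a real key
        return false;
    }

    public static void main(String[] args) {
        delete("sparrow", 2);                  // sparrow hashed to cell 143 (index 2)
        System.out.println(find("hawk", 2));   // prints: true (probing walks past the tombstone)
    }
}
```

If delete() had written null instead of DELETED, the find("hawk") probe would stop at cell 143 and wrongly report "hawk" missing.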

19 Where is "hawk"? Find "hawk": hash(hawk) == 143
If cell 143 had really been emptied, the search would stop there: an empty location 143 means "hawk" was never added. But it was added, and probing shifted it to location 144. Lazy deletion avoids this problem.

20 Clustering Linear probing tends to form "clusters".
A cluster is a sequence of non-empty locations, e.g. the cluster in Example 1 on slide 11
The bigger a cluster gets, the more likely it is that new items will hash into that cluster, making it bigger still.
Clusters reduce hash table efficiency: searching becomes sequential, O(n), where n is the number of items in the table

21 Load Factor One way to measure the speed of a hash table is to calculate the number of items in the table (n) divided by the size of the table (N). This ratio is called the load factor == n/N. A good load factor is close to 0: there will be few collisions and few clusters, so linear probing will be fast, O(1). A bad load factor is close to 1: there will be many collisions and large clusters, so linear probing will be slow, O(n).

22 5. Solution #2: Double Hashing
In the event of a collision, compute a second 'offset' hash function (called d() on the next slide); its value is used as the offset from the collision location. Collisions in linear probing always use an offset of 1, which causes clusters. Double hashing makes the offset more random.

23 Example of Double Hashing
Same data and hash function as slide 11
A hash table with N = 13, h(k) = k mod 13, and offset function d(k) = 7 - k mod 7
Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
Total number of probes: 11
Final table (index: key): 0: 31, 2: 41, 5: 18, 6: 32, 7: 59, 8: 73, 9: 22, 10: 44 (cells 1, 3, 4, 11, 12 empty)

24 Probes as a Measure Another way to judge the speed of a hash table is to count the number of probes needed to put all the data into the table. The linear probing example on slide 11 needed 19 probes. The double hashing example on the previous slide needed only 11 probes for the same data and hash function, which shows that double hashing is faster.
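The double hashing example on slide 23 can be reproduced by changing only the probe step in the linear probing sketch (class and method names are again illustrative):

```java
public class DoubleHashing {
    static final int N = 13;
    static Integer[] table = new Integer[N];
    static int probes = 0;

    static int h(int k) { return k % N; }       // primary hash from slide 23
    static int d(int k) { return 7 - k % 7; }   // secondary 'offset' hash from slide 23

    static void insert(int key) {
        int i = h(key);
        int step = d(key);                      // offset depends on the key, not fixed at 1
        while (true) {
            probes++;
            if (table[i] == null) { table[i] = key; return; }
            i = (i + step) % N;                 // jump by the key's own offset
        }
    }

    public static void main(String[] args) {
        int[] keys = {18, 41, 22, 44, 59, 32, 31, 73};
        for (int k : keys) insert(k);
        System.out.println("total probes = " + probes);   // prints: total probes = 11
    }
}
```

With the same keys, the linear probing version needed 19 probes; here 44 costs only 2 probes (it jumps from cell 5 straight to cell 10) and 31 costs 3, so the total drops to 11.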

25 6. Solution #3: Bucket Hashing
The previous solutions only use an array. In bucket hashing, an array cell points to a list of items that all hash to that cell. This is also called chaining.

26 Chaining is generally faster than linear probing.
With linear probing and double hashing, the number of items (n) is limited to the table size (N), whereas the lists in chaining can keep growing. To delete an element, just erase it from its list.
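A minimal chaining sketch, assuming integer keys and h(k) = k mod N (the slides describe the idea but give no code):

```java
import java.util.LinkedList;

public class BucketHash {
    static final int N = 13;
    @SuppressWarnings("unchecked")
    static LinkedList<Integer>[] buckets = new LinkedList[N];

    static void insert(int key) {
        int i = key % N;
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        buckets[i].add(key);                      // colliding keys just extend the list
    }

    static boolean find(int key) {
        LinkedList<Integer> b = buckets[key % N];
        return b != null && b.contains(key);
    }

    static void delete(int key) {
        LinkedList<Integer> b = buckets[key % N];
        if (b != null) b.remove((Integer) key);   // just erase it from its list
    }

    public static void main(String[] args) {
        for (int k : new int[] {18, 44, 31}) insert(k);  // all three hash to bucket 5
        delete(44);
        System.out.println(find(18) + " " + find(44));   // prints: true false
    }
}
```

Note the cast to Integer in delete(): without it, remove(int) would be read as removal by list index rather than by value.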

27 7. Table Resizing As the number of items in the hash table increases, the load factor gets closer to 1. Solution: Increase the hash table size. But the items in the old table must be re-hashed to be moved to new positions in the new table.
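The re-hashing step can be sketched as follows, assuming integer keys, h(k) = k mod N, and linear probing (the class name and doubling policy are illustrative):

```java
public class Rehash {
    // Move every key from the old table into a table twice the size.
    // Keys can land at new positions because the modulus (the table size) changes.
    static Integer[] resize(Integer[] old) {
        Integer[] bigger = new Integer[old.length * 2];
        for (Integer key : old) {
            if (key == null) continue;
            int i = key % bigger.length;          // re-hash with the NEW table size
            while (bigger[i] != null)             // resolve collisions by linear probing
                i = (i + 1) % bigger.length;
            bigger[i] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        Integer[] small = new Integer[5];
        small[7 % 5] = 7;                         // key 7 lives at index 2 when N == 5
        Integer[] big = resize(small);
        for (int i = 0; i < big.length; i++)
            if (big[i] != null)
                System.out.println("key 7 moved to index " + i);  // index 7, since 7 % 10 == 7
    }
}
```

This is why resizing is expensive: every stored key must be touched once, even though later insertions and searches get faster because the load factor drops.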

28 8. Using Java's HashMap
A HashMap with Strings as keys and values; it uses a bucket hash, and has a remove() method
A telephone book as a HashMap: "Charles Nguyen" maps to "(531) ", "Lisa Jones" maps to "(402) ", "William H. Smith" maps to "(998) "

29 Coding a Map
HashMap<String, String> phoneBook = new HashMap<String, String>();
phoneBook.put("Charles Nguyen", "(531) ");
phoneBook.put("Lisa Jones", "(402) ");
phoneBook.put("William H. Smith", "(998) "); // O(1) cost
String phoneNumber = phoneBook.get("Lisa Jones"); // O(1) cost
System.out.println( phoneNumber );
prints: (402)

30 HashMap<String, String> h = new HashMap<String, String>(100 /* capacity */, 0.75f /* load factor */);
h.put( "WA", "Washington" );
h.put( "NY", "New York" );
h.put( "RI", "Rhode Island" );
h.put( "BC", "British Columbia" );

31 Capacities and Load Factors
HashMaps round capacities up to powers of two, e.g. the 100 on the previous slide becomes 128; the default capacity is 16
Load factor is n/capacity, where n == the number of elements added to the HashMap
In Java's HashMap, the default load factor is 0.75
The load factor is used to decide when it is time to double the size of the HashMap: with the defaults, just after you have added 12 elements, since 16 * 0.75 == 12
