Hash Tables and Sets Lecture 3
Sets
A set is simply a collection of elements. Unlike lists, the elements are not ordered. It is a very abstract, general concept with broad usefulness: the set of all Google search queries from the past 24 hours, the set of all photos with your face in them, the set of all files in a folder.
How are sets represented in computers? Consider the following problem: we want to store a large set of approximately 10 million random numbers, and the following operations are happening constantly:
Add: inserting a new number into the set.
Delete: removing an existing element from the set.
Lookup: checking if a new random number is in the set.
Representing Sets
Suppose we use an ArrayList for this heavily churning set: Add, Delete, and Lookup are all O(n). (Add is O(n) because a set must first check that the number is not already present.)
Suppose the ArrayList is sorted: Lookup is O(log n), but Add and Delete are still O(n), because elements must be shifted to keep the list in order.
Cleverer algorithms:
Self-balancing trees: Lookup, Add, and Delete are guaranteed O(log n).
Hash tables: Lookup, Add, and Delete are worst-case O(n), but O(1) on average.
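As a rough illustration of the sorted-ArrayList option, the sketch below keeps an ArrayList of integers in sorted order and uses the standard Collections.binarySearch for lookup. The class name SortedListSet is hypothetical and was invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class: a set of integers stored in a sorted ArrayList.
class SortedListSet {
    private final List<Integer> items = new ArrayList<Integer>();

    // Lookup: binary search over the sorted list, O(log n).
    boolean contains(int x) {
        return Collections.binarySearch(items, x) >= 0;
    }

    // Add: finding the position is O(log n), but inserting shifts every
    // later element one slot to the right, so the total cost is O(n).
    boolean add(int x) {
        int pos = Collections.binarySearch(items, x);
        if (pos >= 0) {
            return false; // already present; a set holds no duplicates
        }
        items.add(-pos - 1, x); // binarySearch returns -(insertionPoint) - 1
        return true;
    }

    // Delete: removal shifts later elements to the left, again O(n).
    boolean remove(int x) {
        int pos = Collections.binarySearch(items, x);
        if (pos < 0) {
            return false;
        }
        items.remove(pos); // calls remove(int index), not remove(Object)
        return true;
    }

    int size() {
        return items.size();
    }
}
```

The binary search makes lookup fast, but the shifting inside add and remove is what keeps those two operations linear.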
Using Buckets
Let's go back to ArrayLists, but use a different approach: create 2 ArrayLists. Even numbers go in the first list; odd numbers go in the second list.
Now Add/Delete/Lookup take only half the work: check whether the number is even or odd, get the right ArrayList, and search through about 5 million entries instead of 10 million.
This is promising, but it is still O(n).
Using Buckets
Yet another approach: instead of two ArrayLists, let's use 4.
Multiples of 4, which have the property (x % 4) == 0, go in the first list.
If (x % 4) == 1, then x goes in the second list.
If (x % 4) == 2, then x goes in the third list.
If (x % 4) == 3, then x goes in the fourth list.
Now Add/Delete/Lookup take only ¼ as much work: calculate the number mod 4, find the right list, and search through 2.5 million elements instead of 10 million.
This is even better, but it is still O(n).
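The mod-4 scheme can be sketched as follows, under the assumption that the elements are Java ints. ModBucketSet is a hypothetical name, and the bucket count is a constructor parameter so the same code covers the 2-list and 4-list cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: a set of integers split across a fixed
// number of ArrayList buckets, chosen by the remainder of x.
class ModBucketSet {
    private final List<List<Integer>> buckets = new ArrayList<List<Integer>>();

    ModBucketSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<Integer>());
        }
    }

    // Math.floorMod keeps the index non-negative even for negative x,
    // whereas x % n can be negative in Java.
    private List<Integer> bucketFor(int x) {
        return buckets.get(Math.floorMod(x, buckets.size()));
    }

    boolean contains(int x) {
        return bucketFor(x).contains(x); // linear search, but in one bucket only
    }

    boolean add(int x) {
        List<Integer> bucket = bucketFor(x);
        if (bucket.contains(x)) {
            return false; // already in the set
        }
        bucket.add(x);
        return true;
    }

    boolean remove(int x) {
        // Integer.valueOf forces remove(Object) rather than remove(int index).
        return bucketFor(x).remove(Integer.valueOf(x));
    }

    int bucketSize(int index) {
        return buckets.get(index).size();
    }
}
```

Each operation still scans a whole list, so the cost only drops by a constant factor equal to the number of buckets.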
Using Buckets
Yet another approach: use 10 million buckets! If the numbers are truly randomly distributed, then some buckets may be empty, some buckets may have 2 or even 100 elements, and on average each bucket has close to 1 element.
Suddenly, Add/Delete/Lookup become very cheap: O(1) on average. As long as we scale up the number of buckets to match the amount of data, we can maintain O(1) lookup.
This is a hash table!
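To get a feel for how evenly random numbers spread when there are as many buckets as values, here is a small experiment. It is only a sketch: BucketOccupancy, fill, and the seed are names and choices made up for this example, and it counts how many values land in each bucket rather than storing them.

```java
import java.util.Random;

// Hypothetical experiment: drop n random non-negative ints into n buckets
// and look at how full the buckets get.
class BucketOccupancy {
    // Returns the number of values that landed in each bucket.
    static int[] fill(int n, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[n];
        for (int i = 0; i < n; i++) {
            int value = random.nextInt(Integer.MAX_VALUE); // in [0, MAX_VALUE)
            counts[value % n]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] counts = fill(n, 42L);
        int empty = 0;
        int largest = 0;
        for (int c : counts) {
            if (c == 0) {
                empty++;
            }
            largest = Math.max(largest, c);
        }
        System.out.println("empty buckets:  " + empty);
        System.out.println("largest bucket: " + largest);
    }
}
```

With uniformly random values one would expect roughly 1/e (about 37%) of the buckets to stay empty and the largest bucket to hold only a handful of values, so a lookup touches very few elements.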
Hash Functions
In our example, we were only storing integers. We can use the same idea to store arbitrary data, as long as one thing is provided: a hash function.
What is a hash function? It is a function that converts any data into an integer. This integer is used to determine the bucket in which to store the data.
The hash function must ensure a fairly even distribution in the table. More on this later.
Example Hash Function
Suppose we wish to store a set of strings instead of integers. We need a hash function. Here's a simple one: let a = 1, b = 2, c = 3, …, z = 26, and sum the value of each letter.
"asdf".hashCode() = a + s + d + f = 1 + 19 + 4 + 6 = 30
So "asdf" goes in the 30th bucket.
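A minimal sketch of this letter-sum hash in Java is shown below. The class and method names are hypothetical, and the code assumes the input consists only of lowercase English letters.

```java
// Hypothetical helper illustrating the letter-sum hash described above.
class LetterSumHash {
    // Maps a = 1, b = 2, ..., z = 26 and sums the values.
    // Assumes the input contains only lowercase English letters.
    static int hash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + c);
            }
            sum += c - 'a' + 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("asdf")); // 1 + 19 + 4 + 6 = 30
    }
}
```

In a table with a fixed number of buckets, the bucket index would typically be this value taken modulo the bucket count.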
Hash Collisions
This hash function has some problems.
It only deals with English letters. We can solve this by using the ASCII or Unicode value of each character instead of its index in the English alphabet.
It is prone to collisions. A hash collision occurs when two or more distinct values have the same hash code. With the example hash function, all anagrams collide:
least = 57
steal = 57
stale = 57
Therefore, this hash table would be very bad for storing sets of anagrams! It would degenerate into a single ArrayList, since only one bucket would be used.
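The sketch below applies the Unicode fix and shows that it does not help with anagrams: summing character values is still order-independent. CharSumHash is a hypothetical name. For comparison, the example also prints Java's built-in String.hashCode(), which weights each character by its position and so gives these three words different values.

```java
// Hypothetical helper: sums the UTF-16 code unit values of a string,
// so it accepts any characters, not just a-z.
class CharSumHash {
    static int hash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i); // char widens to its numeric code unit value
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] words = {"least", "steal", "stale"};
        for (String w : words) {
            // The sum is identical for every anagram; the built-in hash differs.
            System.out.println(w + " sum=" + hash(w) + " builtin=" + w.hashCode());
        }
    }
}
```

Any hash that ignores character order will send all anagrams of a word to the same bucket, whatever per-character values it uses.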
Generalizing
What exactly is a hash table? Given elements that have a hash function, hash tables are just arrays!
Each array element is an ArrayList, in order to resolve collisions.
The number of buckets is proportional to the number of elements in the set.
Hash tables exploit a time-memory tradeoff to get quick lookup times.
The array is resized when the hash table gets too full.
Load factor: the ratio of filled hash table slots to total slots. The load factor is 0.0 when the hash table is empty and 1.0 when every bucket has at least one element. When the load factor reaches a certain value, 0.75 in our case, the array gets larger to maintain sparseness.
Hash tables can get much more complicated than this, but the fundamentals remain the same.
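Putting these pieces together, a hash set along these lines might look like the sketch below. It is not the lab's code: ChainedHashSet, the initial capacity of 8, and the doubling policy are assumptions made for illustration, and add returns a boolean in the style of java.util.Set. The load factor follows the definition above (non-empty buckets divided by total buckets).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an array of ArrayList buckets that doubles in size
// once the load factor reaches 0.75.
class ChainedHashSet<T> {
    private static final double MAX_LOAD = 0.75;
    private List<List<T>> buckets = newBuckets(8); // initial capacity is arbitrary
    private int size = 0;   // number of elements stored
    private int filled = 0; // number of non-empty buckets

    private static <E> List<List<E>> newBuckets(int n) {
        List<List<E>> result = new ArrayList<List<E>>(n);
        for (int i = 0; i < n; i++) {
            result.add(new ArrayList<E>());
        }
        return result;
    }

    // floorMod keeps the index valid even when hashCode() is negative.
    private List<T> bucketFor(T element, List<List<T>> table) {
        return table.get(Math.floorMod(element.hashCode(), table.size()));
    }

    public boolean contains(T element) {
        return bucketFor(element, buckets).contains(element);
    }

    public boolean add(T element) {
        List<T> bucket = bucketFor(element, buckets);
        if (bucket.contains(element)) {
            return false; // already present
        }
        if (bucket.isEmpty()) {
            filled++;
        }
        bucket.add(element);
        size++;
        if (loadFactor() >= MAX_LOAD) {
            resize();
        }
        return true;
    }

    public boolean remove(T element) {
        List<T> bucket = bucketFor(element, buckets);
        if (!bucket.remove(element)) {
            return false;
        }
        if (bucket.isEmpty()) {
            filled--;
        }
        size--;
        return true;
    }

    public int size() { return size; }
    public int bucketCount() { return buckets.size(); }
    public double loadFactor() { return (double) filled / buckets.size(); }

    // Doubles the bucket array and re-inserts every element, because each
    // element's bucket index depends on the number of buckets.
    private void resize() {
        List<List<T>> bigger = newBuckets(buckets.size() * 2);
        int nowFilled = 0;
        for (List<T> bucket : buckets) {
            for (T element : bucket) {
                List<T> target = bucketFor(element, bigger);
                if (target.isEmpty()) {
                    nowFilled++;
                }
                target.add(element);
            }
        }
        buckets = bigger;
        filled = nowFilled;
    }
}
```

Resizing costs O(n) when it happens, but because the table doubles each time, that cost is spread over many cheap insertions.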
The Lab
In this lab, we have implemented a very simple hash table, SimpleHashTable.java. It is so simple that it cannot handle collisions! Each bucket isn't an ArrayList; it's just a single element when full, or null if empty.
Your task is to modify the code and implement collision resolution. This means that each array slot should be an ArrayList instead of merely an Object.
Java Generics
You will see some strange angle-bracket notation: ArrayList<T>, SimpleHashTable<T>.
If parentheses indicate function arguments, then angle brackets indicate type arguments. Type arguments are a way of specifying data structures that work on various types.
ArrayList<String> has:
boolean add(String arg0)
String get(int index)
SimpleHashSet<Integer> has:
void add(Integer arg0)
boolean contains(Integer arg0)
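Here is a short example of type arguments in use. Since the lab's own class is not shown here, java.util.HashSet stands in for it; GenericsDemo and firstOf are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// java.util.HashSet stands in here for the lab's hash set class.
class GenericsDemo {
    // A generic method: T is a type argument chosen by the caller.
    static <T> T firstOf(List<T> items) {
        return items.get(0);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        names.add("ada");            // the parameter type is String
        String first = names.get(0); // get returns String, no cast needed

        Set<Integer> numbers = new HashSet<Integer>();
        numbers.add(42);             // int is autoboxed to Integer
        boolean found = numbers.contains(42);

        // names.add(42); // would not compile: 42 is not a String
        System.out.println(first + " " + found + " " + firstOf(names));
    }
}
```

The same ArrayList code serves both element types; the type argument only tells the compiler which types to accept and return.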
Operations to Implement
SimpleHashSet.java:
public void add(T element)
public boolean contains(T element)
public boolean remove(T element)
public void clear()
public boolean isEmpty()
public int size()
Some of these may remain unchanged. You will also have to edit the private members and reimplement some private methods.