School of Computing Clemson University Fall, 2012

Lecture 5. Hashing Applications II
CpSc 212: Algorithms and Data Structures
Brian C. Dean
School of Computing, Clemson University
Fall 2012

So Far, We’ve Covered…
Hash tables:
Collision resolution via probing, chaining, and cuckoo hashing.
Common hash functions, e.g., h(k) = (ak+b) % M.
Expanding a hash table when it gets too full.
Hashing large objects:
Polynomial hash functions and their analysis.
Applications in security, string matching, etc.
The birthday paradox and how it relates to hashing and collisions.

Hash Tables as Maps
We often augment the elements in a hash table with extra information. Example problem: find the most frequently occurring words in a text file. Solution: maintain a hash table of strings, each having an associated integer frequency count. We can think of a hash table in this setting as a “map” from strings to ints.
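The word-frequency solution above can be sketched in a few lines of Python, whose built-in dict is a hash table. This is a minimal sketch, not the lecture's code: the function name most_frequent_words, the tokenizing regular expression, and the alphabetical tie-breaking rule are all assumptions made for illustration.

```python
import re

def most_frequent_words(text, top_n=3):
    """Return the top_n (word, count) pairs in text, most frequent first."""
    counts = {}  # hash table acting as a map: str -> int
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", text.lower()):
        counts[word] = counts.get(word, 0) + 1
    # Sort by descending count; break ties alphabetically (an arbitrary choice).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

print(most_frequent_words("the cat and the hat and the bat", top_n=2))
```

Each update to counts is an expected constant-time hash table operation, so the counting pass takes expected time linear in the number of words.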

Associative Arrays
Hash-based maps are supported in many languages (e.g., Perl, Python, JavaScript) using simple array notation:
grade["Brian"] = 100;
id_num = ;
student_name[id_num] = "Brian";
Even though these look like arrays, they are actually hash tables under the hood! Space-efficient and fast.

Storing Large Records
We often think of a hash table as storing (key, value) pairs, i.e., a map from key -> value. Alternatively, we can think of storing large structs in a hash table, where each one is identified by a designated “key” element. Remember that to make the hash table more nimble, we often store just (key, pointer to rest of struct) in the table itself.

Another Nice Application: Pseudo-Random Number Generation
True random numbers are impossible to generate on a deterministic computer. Hash functions give us a simple way to generate pseudo-random numbers, though, since they are designed to look as “random” as possible. Just evaluate h(0), h(1), h(2), … to get a nice stream of pseudo-random numbers.
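As a rough illustration of evaluating h(0), h(1), h(2), …, the sketch below uses SHA-256 from Python's standard hashlib module as the hash function. That choice, the seed parameter, and the names h and pseudo_random_stream are assumptions for this example; the slides do not say which hash function to use.

```python
import hashlib

def h(i, seed=0):
    """Hash the counter i (plus a seed) to a 64-bit integer."""
    data = f"{seed}:{i}".encode()
    digest = hashlib.sha256(data).digest()
    # Keep the first 8 bytes of the digest as a 64-bit value.
    return int.from_bytes(digest[:8], "big")

def pseudo_random_stream(n, seed=0):
    """Return h(0), h(1), ..., h(n-1) as a list of pseudo-random integers."""
    return [h(i, seed) for i in range(n)]

# Example: five pseudo-random die rolls in the range 1..6.
print([1 + v % 6 for v in pseudo_random_stream(5, seed=42)])
```

A well-mixing hash matters here. A simple linear function such as h(k) = (ak+b) % M, evaluated at 0, 1, 2, …, yields an arithmetic progression mod M, which does not look random at all.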

Application: Estimating the # of Distinct Elements in a Large Data Stream
Your program monitors a set of N integers streaming off a sensor very quickly. How can you determine the number of distinct integers that go by? Store them all in a hash table, skipping duplicates: O(N) expected time. Now what if you don’t have enough memory to store the hash table (say N is very large)?

If the D distinct numbers streaming by were random, then we would expect:
~ D/2 of them to be even (ending in 0 in binary).
~ D/4 of them to be multiples of 4 (ending in 00).
~ D/8 of them to be multiples of 8 (ending in 000).
etc.
Look at the # of trailing 0’s (when written in binary) of all the numbers streaming by. The max of these is an estimate of log2 D.

So we can estimate the number of distinct numbers streaming by if the stream consists of random numbers. The numbers in our input need not be random, so this technique would not necessarily work on them directly. However, we can make our numbers look random by looking at their hashes instead. And this doesn’t change the number of distinct elements (barring hash collisions)!
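The trailing-zeros idea can be sketched as follows. This is a minimal sketch under stated assumptions: SHA-256 truncated to 64 bits stands in for the hash function, and the names hash64, trailing_zeros, and estimate_distinct are hypothetical. It keeps only one integer of state, no matter how long the stream is.

```python
import hashlib

def hash64(x):
    """Map x to a 64-bit integer that looks random (assumed hash: SHA-256)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def trailing_zeros(v, width=64):
    """Number of trailing 0 bits in v; by convention, width if v == 0."""
    if v == 0:
        return width
    # v & -v isolates the lowest set bit; its bit_length gives its position + 1.
    return (v & -v).bit_length() - 1

def estimate_distinct(stream):
    """Estimate the number of distinct items as 2^(max trailing zeros of the hashes)."""
    max_zeros = 0
    for x in stream:
        max_zeros = max(max_zeros, trailing_zeros(hash64(x)))
    return 2 ** max_zeros

# Duplicates hash identically, so they cannot change the estimate.
print(estimate_distinct([7, 7, 3, 9, 3, 7] * 1000))
```

A single estimate of this kind is coarse: it is always a power of two and can be off by a sizable factor. A common refinement is to combine the estimates from several independent hash functions.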

Exploiting Collisions
Problem: given a large dictionary, find all pairs of words that are anagrams, e.g., “clemsontigers” and “scoresmelting”. How can we test if two words are anagrams of each other?
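One common answer, which the slide leaves as a question, is to give each word a signature made of its letters in sorted order: two words are anagrams exactly when their signatures are equal. Using the signature as the hash table key makes anagrams collide on purpose. The sketch below assumes this approach; the name anagram_groups is hypothetical.

```python
from collections import defaultdict

def anagram_groups(words):
    """Group words that are anagrams of each other; drop words with no partner."""
    buckets = defaultdict(list)  # signature -> list of words with that signature
    for w in words:
        signature = "".join(sorted(w))  # the same for all anagrams of w
        buckets[signature].append(w)
    return [group for group in buckets.values() if len(group) > 1]

print(anagram_groups(["clemsontigers", "hash", "scoresmelting", "listen", "silent"]))
```

Every pair of words within a returned group is an anagram pair, so the groups list all pairs without comparing each word against every other word.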

Exploiting Collisions: “Geometric” Hashing
“Dear GPS, please tell me all the restaurants within a 1 mile radius of my current location…” Subdivide space by “hashing” all points to the appropriate cell in a 2D array. What if this array is only sparsely filled? How can we represent it in a space-efficient manner?
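One way to handle a sparsely filled grid is to store only the non-empty cells in a hash table keyed by cell coordinates. The sketch below assumes points lie in a flat plane with coordinates in miles (real latitude/longitude data would need a projection first). The class GridIndex and its methods are hypothetical names used only for illustration.

```python
import math
from collections import defaultdict

class GridIndex:
    """Sparse 2D grid: a hash table from (cell_x, cell_y) to the points in that cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # only non-empty cells use memory

    def _cell(self, x, y):
        # "Hash" a point to the integer coordinates of its grid cell.
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, name, x, y):
        self.cells[self._cell(x, y)].append((name, x, y))

    def within(self, x, y, radius):
        """Return names of all stored points within radius of (x, y)."""
        cx, cy = self._cell(x, y)
        reach = math.ceil(radius / self.cell_size)  # how many cells to look out
        found = []
        for i in range(cx - reach, cx + reach + 1):
            for j in range(cy - reach, cy + reach + 1):
                # .get avoids creating empty cells during a query.
                for name, px, py in self.cells.get((i, j), ()):
                    if math.hypot(px - x, py - y) <= radius:
                        found.append(name)
        return found

index = GridIndex(cell_size=1.0)
index.insert("diner", 0.2, 0.3)
index.insert("cafe", 0.9, -0.4)
index.insert("bistro", 5.0, 5.0)
print(sorted(index.within(0.0, 0.0, radius=1.0)))
```

With the cell size set equal to the query radius, a query inspects only the 3 x 3 block of cells around the current location, however many points are stored elsewhere.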

Exploiting Collisions: Locality-Sensitive Hashing
In machine learning, we often design a very complicated function for measuring “similarity” between two objects. E.g., sim( , ) = 0.73 A “locality sensitive” hash function is designed so that the probability of two items colliding is proportional to their similarity. This gives a fast way to estimate similarity.
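MinHash is one well-known locality-sensitive scheme, shown here as an illustration rather than as the method the lecture has in mind. It applies when objects are sets and similarity is Jaccard similarity (intersection size divided by union size): under a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity. The sketch assumes SHA-256 with a seed as a family of hash functions; all function names are hypothetical.

```python
import hashlib

def seeded_hash(item, seed):
    """One member of an assumed hash family, selected by seed."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=200):
    """For each hash function, record the minimum hash value over the set."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures collide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity, for comparison."""
    return len(a & b) / len(a | b)

a = set(range(0, 60))
b = set(range(30, 90))
print(jaccard(a, b), estimate_similarity(minhash_signature(a), minhash_signature(b)))
```

Once signatures are computed, comparing two objects costs time proportional to the signature length rather than to the size of the objects, which is where the speedup comes from.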
