CSE 326: Data Structures Lecture #12

CSE 326: Data Structures Lecture #12
Whoa… Good Hash, Man Bart Niswonger Summer Quarter 2001

Today’s Outline Hashing Unix Tutorial Midterm
What do you want covered? Midterm Amortized time ADT vs Data Structure Various things today, but in general rather a short day I really want to increase the amount of interactivity in class – I am going to ask you to do small activities, write down ideas/questions, work with your neighbors, etc. Please be efficient – please be willing – please work with different people from time to time – please take it seriously – I am doing this because I think it can help you understand some of the more difficult ideas and also because I think it can keep you more focused on what is going on. I will let you know if any of it will be graded, but I don’t want it to be – I will only do that as a way to make you participate Hashing

Intermediate Unix Tutorial
2 minutes 3 things you love about unix 3 things you hate 5 things you wish you knew how to do 1 gift idea The friendly acm folks are putting on another unix tutorial mainly for your benefit. Since its basically for you guys, they want to know what you want to know!! Take a piece of paper – and in 2 minutes write the above 11 things ALSO!!! I want to give them each a little thank you present for helping you guys out – something from the class, but I need help thinking of things – everyone give me one idea, and then we’ll talk about it Collect ideas and vote?

Asymptotic Time Bounds worst-case running time
Over m operations Worst-case for single operation may be really bad, but worst-case for m operations is bounded Lots of people noticed that the SkewHeap::merge method I gave on the midterm was flawed in that it didn’t swap the children Almost no one got the right answer for how adding this line affects the running time! So I ask again – how does that swap change the running time? Work through simple example showing that with out the swap, might as well be merging a binary tree, worst case is O(n), O(mn) for m operations With the swap- since stuff moves back and forth-it is better than that. The question is how much - and through some math which I wont go into, it can be shown to be O(m log n) IN THE WORST CASE even though single operations can be bad

ADT vs Data Structure Abstract Data Type Data structures Abstract
Operations & semantics Data-less One No notion of running time or complexity Data structures Concrete implementation Set of algorithms a Holds data Many Very particular running times and complexities Some distinctions between ADTs and DSs Abstract (no implementation) – concrete ( implementation) Operations and what they mean at a high level – implementation via a collection of algorithms Adts know nothing about how the data is stored, just that there is a particular concept of information (data & key) – ds store data explicitly and support “concept of information” One adt has many dss Adts have no implementation, no algorithms, hence no complexity – dss have algorithms which have complexities Tensions that might cause you to choose one ds over another? time vs. space performance vs. elegance generality vs. simplicity one operation’s performance vs. another’s

Dictionary ADT Review Dictionary operations
create destroy insert find delete Stores values associated with user-specified keys values may be any (homogenous) type keys may be any (homogenous) comparable type kim chi spicy cabbage Krispy Kreme tasty doughnut kiwi Australian fruit kale leafy green Krispix breakfast cereal insert kohlrabi - upscale tuber find(kiwi) kiwi - Australian fruit Dictionaries associate some key with a value, just like a real dictionary (where the key is a word and the value is its definition). In this example, I’ve stored foods associated with short descriptions. This is probably the most valuable and widely used ADT we’ll hit. What are some of the implementations we have seen and the complexity of their operations? (30 seconds)

Hash Table Approach Kiwi Kumquat f(x) Kim chi Kale Kohlrabi
But… is there a problem in this pipe-dream?

Hash Table Dictionary Data Structure
Hash function: maps keys to integers result: can quickly find the right spot for a given entry Unordered and sparse table result: cannot efficiently list all entries, Cannot find min and max efficiently, Cannot find all items within a specified range efficiently. Kiwi f(x) Kumquat F(x) Kim chi Kale Kohlrabi A binary search tree is a binary tree in which all nodes in the left subtree of a node have lower values than the node. All nodes in the right subtree of a node have higher value than the node. It’s like making that recursion into the data structure! I’m storing integers at each node. Does everybody think that’s what I’m _really_ going to store? What do I need to know about what I store? (comparison, equality testing)

Hash Table Terminology
hash function table Kumquat f(x) Kiwi Kim chi collision Kale Kohlrabi load factor  = # of entries in table tableSize keys

Hash Table Code (First Pass)
Value & find(Key & key) { int index = hash(key) % tableSize; return Table[index]; } What should the hash function be? (for integers) What should the table size be? How should we resolve collisions? Take 2 minutes – work with a neighbor – answer these questions No looking at the notes!!! Lets hear some ideas!! -- how do I hash integers? Everyone else – hold those thoughts

A Good Hash Function… is easy (fast) to compute (O(1) and practically fast). distributes the data evenly (hash(a)  hash(b)) uses the whole hash table (for all 0  k < size, there’s an i such that hash(i) % size = k).

A Good Hash Function for Integers
Choose tableSize is prime hash(n) = n % tableSize Example: tableSize = 7 insert(4) insert(17) find(12) insert(9) delete(17) 1 2 3 Here is an example of a way to hash integers How’d you do? Now take 3 minutes with the same person and come up with a way to hash strings – again, no looking ahead!! 4 5 6

Good Hash Function for Strings?
I want to be able to: insert(“kale”) insert(“Krispy Kreme”) insert(“kim chi”)

Good Hash Function for Strings?
Sum the ASCII values of the characters. Consider only the first 3 characters. Uses only 2871 out of 17,576 entries in the table on English words. Let s = s1s2s3s4…s5: choose hash(s) = s1 + s s s … + sn128n Problems: hash(“really, really big”) = well… something really, really big hash(“one thing”) % 128 = hash(“other thing”) % 128 Think of the string as a base 128 number.

Easy to Compute String Hash
Use Horner’s Rule int hash(String s) { h = 0; for (i = s.length() - 1; i >= 0; i--) { h = (si + 128*h) % tableSize; } return h; What is horner’s rule???

Universal Hashing For any fixed hash function, there will be some pathological sets of inputs everything hashes to the same cell! Solution: Universal Hashing Start with a large (parameterized) class of hash functions No sequence of inputs is bad for all of them! When your program starts up, pick one of the hash functions to use at random (for the entire time) Now: no bad inputs, only unlucky choices! If universal class large, odds of making a bad choice very low If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

“Random” Vector Universal Hash
Parameterized by prime size and vector: a = <a0 a1 … ar> where 0 <= ai < size Represent each key as r + 1 integers where ki < size size = 11, key = ==> <3,9,7,5,2> size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4> ha(k) = R is the range – the number of things you could feed into the hash function Case one – range is integers 0 – 9 Case two – range is letters dot product with a “random” vector!

Universal Hash Function
Strengths: works on any type as long as you can form ki’s if we’re building a static table, we can try many a’s a random a has guaranteed good properties no matter what we’re hashing Weaknesses must choose prime table size larger than any ki

Hash Function Summary Goals of a hash function Example Hash functions
reproducible mapping from key to table entry evenly distribute keys across the table separate commonly occurring keys (neighboring keys?) complete quickly Example Hash functions h(n) = n % size h(n) = string as base 128 number % size One Universal hash function: dot product with random vector There are many, many hash functions – some give you one guarantee, others give you another- these are some simple ones, I encourage you to come up with your own, to look on the web for more, talk to me, etc.

How to Design a Hash Function
Know what your keys are Study how your keys are distributed Try to include all important information in a key in the construction of its hash Try to make “neighboring” keys hash to very different places Prune the features used to create the hash until it runs “fast enough” (very application dependent)

Collisions Pigeonhole principle says we can’t avoid all collisions
try to hash without collision m keys into n slots with m > n try to put 6 pigeons into 5 holes What do we do when two keys hash to the same entry? open hashing: put little dictionaries in each entry closed hashing: pick a next entry to try The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon sized holes. Don’t snicker. But, the fact is that no hash function can perfectly hash m keys into fewer than m slots. They won’t fit. What do we do? 1) Shove the pigeons in anyway. 2) Try somewhere else when we’re shoving two pigeons in the same place. Does closed hashing solve the original problem?no – sometimes quite badly – like run out of space after using ½ of buckets! shove extra pigeons in one hole!

To Do Project II Homework 4 Read Chapter 5 (fast!)

Coming Up More hashing Cool stuff! Project III

CSE 326: Data Structures Lecture #12

Similar presentations

Presentation on theme: "CSE 326: Data Structures Lecture #12"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CSE 326: Data Structures Lecture #12

Similar presentations

Presentation on theme: "CSE 326: Data Structures Lecture #12"— Presentation transcript:

Similar presentations

About project

Feedback