1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.


2 Review: Hash Tables
A hash table = implementation of a dictionary
Main idea: use an array – direct access, in time O(1)
Problem: keys are not integers
Solution: use a hash function, h(key) = index
No hash function is perfect → collisions. Two ways to deal with collisions

3 Review: Hashing with Separate Chaining
Put a little dictionary at each entry
–choose type as appropriate
–common case is unordered linked list (chain)
Properties
–performance degrades with length of chains
–λ can be greater than 1
[Figure: table with entries 0 to 6; a and d share one chain, e and b share another, c sits alone]
h(a) = h(d)
h(e) = h(b)
What was λ?
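A minimal sketch of separate chaining, assuming each chain is an unordered Python list and keys are hashed with the built-in hash(). The class and method names are illustrative and do not come from the lecture.

```python
class ChainedHashTable:
    """Dictionary with one unordered chain (a Python list) per table entry."""

    def __init__(self, table_size=7):
        self.table = [[] for _ in range(table_size)]
        self.count = 0

    def _chain(self, key):
        # h(key) = index: reduce the hash value to a table position.
        return self.table[hash(key) % len(self.table)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.count += 1

    def find(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        # Deletion is an ordinary removal from the chain.
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.count -= 1
                return True
        return False

    def load_factor(self):
        # Number of keys divided by table size; may exceed 1 with chaining.
        return self.count / len(self.table)


table = ChainedHashTable(table_size=7)
for n in range(10):
    table.insert(n, n * n)
print(table.load_factor())   # 10 keys in 7 entries, so greater than 1
```

Because every chain can grow, the load factor is not capped at 1; the cost shows up as longer chains to scan.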

4 Review: Closed Hashing
Problem with separate chaining: memory consumed by pointers – 32 (or 64) bits per key!
What if we only allow one key at each entry?
–two objects that hash to the same spot can't both go there
–first one there gets the spot
–next one must go in another spot
Properties
–λ ≤ 1
–performance degrades with difficulty of finding right spot
[Figure: table with entries 0 to 6 holding a, c, e, d, b, one key per entry]
h(a) = h(d)
h(e) = h(b)

5 Review: Closed Hashing
Given an item X, try cells h_0(X), h_1(X), h_2(X), …, h_i(X)
h_i(X) = (Hash(X) + F(i)) mod TableSize
–Define F(0) = 0
F is the collision resolution function. Some possibilities:
–Linear: F(i) = i
–Quadratic: F(i) = i^2
–Double Hashing: F(i) = i · Hash_2(X)
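The three collision resolution functions can be compared with a short sketch. The helper names and the sample values (hash value 3, table size 7, second hash 5) are assumptions chosen for illustration.

```python
def probe_sequence(hash_value, table_size, f, attempts):
    """Cells tried for one item: (Hash(X) + F(i)) mod TableSize, i = 0, 1, ..."""
    return [(hash_value + f(i)) % table_size for i in range(attempts)]


def linear(i):
    return i


def quadratic(i):
    return i * i


def double_hashing(hash2_value):
    # F(i) = i * Hash_2(X); hash2_value is the item's second hash value.
    return lambda i: i * hash2_value


print(probe_sequence(3, 7, linear, 4))             # [3, 4, 5, 6]
print(probe_sequence(3, 7, quadratic, 4))          # [3, 4, 0, 5]
print(probe_sequence(3, 7, double_hashing(5), 4))  # [3, 1, 6, 4]
```

All three start at the same cell because F(0) = 0; they differ only in where they go after a collision.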

6 Deletion with Separate Chaining
Why is this slide blank?

7 Deletion in Closed Hashing
[Figure: table with cells 0 to 6; cells 0 to 3 hold 0, 1, 2, 7. After delete(2), cell 2 is empty: 0, 1, (empty), 7]
find(7): Where is it?!
What should we do instead?

8 Lazy Deletion
[Figure: table with cells 0 to 6; cells 0 to 3 hold 0, 1, 2, 7. After delete(2), cell 2 holds the marker #: 0, 1, #, 7]
# indicates deleted value: if you find it, probe again
find(7)
But now what is the problem?
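A minimal sketch of lazy deletion, assuming linear probing, integer keys, and h(x) = x mod TableSize as in the figure. The sentinel names EMPTY and DELETED and the class name are illustrative, not taken from the lecture.

```python
EMPTY = None
DELETED = object()   # tombstone, drawn as "#" on the slide


class LinearProbingTable:
    def __init__(self, table_size=7):
        self.cells = [EMPTY] * table_size

    def _probe(self, key):
        size = len(self.cells)
        start = key % size
        for i in range(size):
            yield (start + i) % size

    def find(self, key):
        """Return the cell index holding key, or None."""
        for idx in self._probe(key):
            cell = self.cells[idx]
            if cell is EMPTY:
                return None          # a truly empty cell ends the search
            if cell is not DELETED and cell == key:
                return idx
            # tombstone or another key: probe again
        return None

    def insert(self, key):
        first_free = None
        for idx in self._probe(key):
            cell = self.cells[idx]
            if cell is EMPTY:
                if first_free is None:
                    first_free = idx
                break
            if cell is DELETED:
                if first_free is None:
                    first_free = idx   # remember it, keep checking for a duplicate
            elif cell == key:
                return                 # already present
        if first_free is None:
            raise RuntimeError("table is full")
        self.cells[first_free] = key

    def delete(self, key):
        idx = self.find(key)
        if idx is None:
            return False
        self.cells[idx] = DELETED      # mark the cell, do not empty it
        return True


table = LinearProbingTable(7)
for key in (0, 1, 2, 7):
    table.insert(key)
table.delete(2)
print(table.find(7))   # 3: the search walks past the tombstone in cell 2
```

This sketch lets insert reuse a tombstone, but a search still has to walk past every tombstone on its probe path, so tables that see many deletions slow down until they are rebuilt.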

9 The Squished Pigeon Principle
An insert using Closed Hashing cannot work with a load factor of 1 or more.
–Quadratic probing can fail if λ > ½
–Linear probing and double hashing slow if λ > ½
–Lazy deletion never frees space
Separate chaining becomes slow once λ > 1
–Eventually becomes a linear search of long chains
How can we relieve the pressure on the pigeons?
REHASH!

10 Rehashing Example
Separate chaining: h_1(x) = x mod 5 rehashes to h_2(x) = x mod 11
Before (table size 5, λ = 1):
  0: 25
  1:
  2: 37, 52
  3: 83, 98
  4:
After (table size 11, λ = 5/11):
  3: 25
  4: 37
  6: 83
  8: 52
  10: 98
  (all other entries empty)
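The rehashing example can be replayed in a few lines. This sketch assumes chains are plain lists and uses the keys from the slide; build_chains is a hypothetical helper name.

```python
def build_chains(keys, table_size):
    """Separate chaining with h(x) = x mod table_size."""
    table = [[] for _ in range(table_size)]
    for key in keys:
        table[key % table_size].append(key)
    return table


keys = [25, 37, 52, 83, 98]
old_table = build_chains(keys, 5)            # load factor 5/5 = 1
# Rehash: walk every chain of the old table and insert into the larger one.
new_table = build_chains([k for chain in old_table for k in chain], 11)

print(old_table)   # [[25], [], [37, 52], [83, 98], []]
print(new_table)   # 25, 37, 83, 52, 98 land in entries 3, 4, 6, 8, 10
```

Every key has to be hashed again because its index depends on the table size, which is why a rehash costs time proportional to the number of stored keys.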

11 Rehashing Amortized Analysis
Consider sequence of n operations
insert(3); insert(19); insert(2); …
What is the max number of rehashes? log n
What is the total time?
–let's say a regular hash takes time a, and rehashing an array containing k elements takes time bk.
Amortized time = (an + b(2n−1))/n = O(1)
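A small simulation makes the amortized bound concrete. It assumes the table starts at size 1 and doubles whenever it is full, which is one common policy and not necessarily the exact one used in the lecture; rehash_cost is a hypothetical helper.

```python
def rehash_cost(n, initial_size=1):
    """Insert n keys into a doubling table.

    Returns (number of rehashes, total number of elements moved by rehashes).
    """
    size, stored, rehashes, moved = initial_size, 0, 0, 0
    for _ in range(n):
        if stored == size:       # table full: double it and move every key
            moved += stored
            size *= 2
            rehashes += 1
        stored += 1
    return rehashes, moved


rehashes, moved = rehash_cost(1024)
print(rehashes, moved)   # 10 rehashes (log2 of 1024), 1023 elements moved
```

The moved counts form the series 1 + 2 + 4 + …, which never exceeds 2n − 1, so the total time an + b(2n − 1) divided by n stays constant.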

12 Rehashing without Stretching
Suppose input is a mix of inserts and deletes
–Never more than TableSize/2 active keys
–Rehash when λ = 1 (half the table must be deletions)
Worst-case sequence:
–T/2 inserts, T/2 deletes, T/2 inserts, Rehash, T/2 deletes, T/2 inserts, Rehash, …
Rehashing at most doubles the amount of work – still O(1)

13 Case Study
Spelling dictionary
–50,000 words
–static
–arbitrary(ish) preprocessing time
Goals
–fast spell checking
–minimal storage
Practical notes
–almost all searches are successful
–words average about 8 characters in length
–50,000 words at 8 bytes/word is 400K
–pointers are 4 bytes
–there are many regularities in the structure of English words
Why?

14 Solutions
–sorted array + binary search
–separate chaining
–open addressing + linear probing

15 Storage
Assume words are strings and entries are pointers to strings
Array + binary search: n pointers
Separate chaining: table size + 2n pointers = n/λ + 2n pointers
Closed hashing: n/λ pointers

16 Analysis
50K words, 4 bytes per pointer
Binary search
–storage: n pointers + words = 200K + 400K = 600K
–time: log_2 n ≤ 16 probes per access, worst case
Separate chaining – with λ = 1
–storage: n/λ + 2n pointers + words = 200K + 400K + 400K = 1MB
–time: 1 + λ/2 probes per access on average = 1.5
Closed hashing – with λ = 0.5
–storage: n/λ pointers + words = 400K + 400K = 800K
–time: ½(1 + 1/(1−λ)) probes per access on average = 1.5
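The storage and probe figures can be checked with a short calculation. The constants come from the case study (50,000 words, 8 bytes per word, 4-byte pointers). The linear probing estimate ½(1 + 1/(1−λ)) is the standard formula for a successful search and is an assumption here, since the transcript lost the formula from the slide. All function names are illustrative.

```python
import math

N_WORDS = 50_000
POINTER_BYTES = 4
WORD_BYTES = 8
WORDS_TOTAL = N_WORDS * WORD_BYTES        # 400,000 bytes of word text


def sorted_array_bytes():
    return N_WORDS * POINTER_BYTES + WORDS_TOTAL


def separate_chaining_bytes(load_factor):
    table_slots = round(N_WORDS / load_factor)
    # One pointer per table slot, plus two pointers per chain node.
    return (table_slots + 2 * N_WORDS) * POINTER_BYTES + WORDS_TOTAL


def closed_hashing_bytes(load_factor):
    table_slots = round(N_WORDS / load_factor)
    return table_slots * POINTER_BYTES + WORDS_TOTAL


def binary_search_probes():
    return math.ceil(math.log2(N_WORDS))


def chaining_probes(load_factor):
    return 1 + load_factor / 2


def linear_probing_probes(load_factor):
    # Assumed estimate for a successful search with linear probing.
    return 0.5 * (1 + 1 / (1 - load_factor))


print(sorted_array_bytes())            # 600000
print(separate_chaining_bytes(1.0))    # 1000000
print(closed_hashing_bytes(0.5))       # 800000
print(binary_search_probes())          # 16
print(chaining_probes(1.0), linear_probing_probes(0.5))   # 1.5 1.5
```

Both hashing schemes trade extra space for a drop from about 16 probes to about 1.5 probes per lookup.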

17 Approximate Hashing
Suppose we want to reduce the space requirements for a spelling checker by accepting the risk of once in a while overlooking a misspelled word
Ideas?

18 Approximate Hashing
Strategy:
–Do not store keys, just a bit indicating cell is in use
–Keep λ low so that it is unlikely that a misspelled word hashes to a cell that is in use

19 Example
50,000 English words
Table of 500,000 cells, each 1 bit
–8 bits per byte
Total memory: 500K/8 = 62.5K
–versus 1MB separate chaining, 800K closed hashing (open addressing), 600K sorted array
Correctly spelled words will always hash to a used cell
What is probability a misspelled word hashes to a used cell?

20 Rough Error Calculation
Suppose hash function is optimal – hash is a random number
Load factor λ ≤ 0.1
–Lower if several correctly spelled words hash to the same cell
So probability that a misspelled word hashes to a used cell is ≤ 10%
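A minimal sketch of approximate hashing as a one-bit-per-cell table. It assumes SHA-256 from hashlib as a stand-in for the "optimal" hash function, and it uses a scaled-down synthetic word list (5,000 made-up words in 50,000 bits) to keep the same 10:1 ratio as the example. The class name and word list are hypothetical.

```python
import hashlib


class ApproximateSet:
    """Stores one bit per cell and no keys, so lookups can give false positives."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)   # 8 bits per byte

    def _index(self, word):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        i = self._index(word)
        self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, word):
        i = self._index(word)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def load_factor(self):
        used = sum(bin(byte).count("1") for byte in self.bits)
        return used / self.num_bits


dictionary = ApproximateSet(50_000)
for n in range(5_000):
    dictionary.add(f"word{n}")          # synthetic stand-ins for real words

print(len(dictionary.bits))             # 6250 bytes for 50,000 cells
print(dictionary.might_contain("word42"))   # True: stored words always hit
print(dictionary.load_factor())         # at most 0.1
```

A stored word always maps to a used cell, so the only possible error is accepting a misspelled word whose hash happens to land on a used cell.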

21 Exact Error Calculation
What is expected load factor?
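One way to answer this, assuming the hash behaves like a uniformly random function: a given cell stays unused only if all n words miss it, so the expected fraction of used cells in a table of T cells works out as follows. The symbols n and T are introduced here for illustration.

```latex
\[
E[\lambda] \;=\; 1 - \left(1 - \frac{1}{T}\right)^{n} \;\approx\; 1 - e^{-n/T}
\]
\[
n = 50{,}000,\quad T = 500{,}000:\qquad
E[\lambda] \;\approx\; 1 - e^{-0.1} \;\approx\; 0.095
\]
```

Under that assumption the chance of accepting a misspelled word is about 9.5%, slightly below the rough 10% bound because some correctly spelled words share a cell.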

22 Puzzler
Suppose you have a HUGE hash table that you often need to re-initialize to "empty". How can you do this in small constant time, regardless of the size of the table?

23 A Random Hash…
Extensible hashing
–Hash tables for disk-based databases – minimizes number of disk accesses
Minimal perfect hash function
–Hash a given set of n keys into a table of size n with no collisions
–Might have to search large space of parameterized hash functions to find
–Application: compilers
One way hash functions
–Used in cryptography
–Hard (intractable) to invert: given just the hash value, recover the key

24 Databases
A database is a set of records, each a tuple of values
–E.g.: [ name, ss#, dept., salary ]
How can we speed up queries that ask for all employees in a given department?
How can we speed up queries that ask for all employees whose salary falls in a given range?

25 Hash Tables on Secondary Storage (Disks)
Main differences:
One bucket = one block, hence may hold multiple keys
Open chaining: use overflow blocks when needed
Closed chaining never used

26 Hash Table Example
Assume 1 bucket (block) stores 2 keys + pointers
h(e)=0
h(b)=h(f)=1
h(g)=2
h(a)=h(c)=3
Buckets:
  0: e
  1: b, f
  2: g
  3: a, c

27 Searching in a Hash Table
Search for a:
Compute h(a)=3
Read bucket 3
1 disk access
Buckets:
  0: e
  1: b, f
  2: g
  3: a, c

28 Insertion in Hash Table
Place in right bucket, if space
E.g. h(d)=2
Buckets:
  0: e
  1: b, f
  2: g, d
  3: a, c

29 Insertion in Hash Table
Create overflow block, if no space
E.g. h(k)=1
More overflow blocks may be needed
Buckets:
  0: e
  1: b, f → overflow block: k
  2: g, d
  3: a, c
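The bucket-per-block scheme with overflow blocks can be sketched in memory, counting block reads in place of disk accesses. The hash values are the ones given on the slides; the class names and the lookup-table hash function are assumptions made for illustration.

```python
BLOCK_CAPACITY = 2   # one bucket (block) stores 2 keys


class Block:
    def __init__(self):
        self.keys = []
        self.overflow = None     # next overflow block, if any


class BlockedHashTable:
    def __init__(self, num_buckets, hash_fn):
        self.buckets = [Block() for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def insert(self, key):
        block = self.buckets[self.hash_fn(key)]
        # Walk the overflow chain until a block with space is found.
        while len(block.keys) == BLOCK_CAPACITY:
            if block.overflow is None:
                block.overflow = Block()
            block = block.overflow
        block.keys.append(key)

    def search(self, key):
        """Return the number of blocks read to find key, or None if absent."""
        block = self.buckets[self.hash_fn(key)]
        reads = 0
        while block is not None:
            reads += 1
            if key in block.keys:
                return reads
            block = block.overflow
        return None


# Hash values taken from the slides.
h = {"e": 0, "b": 1, "f": 1, "g": 2, "a": 3, "c": 3, "d": 2, "k": 1}
table = BlockedHashTable(4, h.__getitem__)
for key in "ebfgacdk":
    table.insert(key)

print(table.search("a"))   # 1 block read
print(table.search("k"))   # 2 block reads: bucket 1, then its overflow block
```

Each overflow block on a chain adds one more read to a search, which is the performance degradation the overflow scheme suffers from.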

30 Hash Table Performance
Excellent, if no overflow blocks
Degrades considerably when number of keys exceeds the number of buckets (i.e. many overflow blocks).

31 Extensible Hash Table
Allows hash table to grow, to avoid performance degradation
Assume a hash function h that returns numbers in {0, …, 2^k – 1}
Start with n = 2^i << 2^k, only look at first i most significant bits

32 Extensible Hash Table
E.g. i=1, n=2^i=2, k=4
Note: we only look at the first bit (0 or 1)
Directory (i=1):
  0 → block [0(010)] (uses 1 bit)
  1 → block [1(011)] (uses 1 bit)

33 Insertion in Extensible Hash Table
Insert 1110
Directory (i=1):
  0 → block [0(010)] (uses 1 bit)
  1 → block [1(011), 1(110)] (uses 1 bit)

34 Insertion in Extensible Hash Table
Now insert 1010
Need to extend table, split blocks
i becomes 2
Directory (i=1):
  0 → block [0(010)] (uses 1 bit)
  1 → block [1(011), 1(110)] + 1(010) (uses 1 bit, over capacity)

35 Insertion in Extensible Hash Table
Directory (i=2):
  00, 01 → block [0(010)] (uses 1 bit)
  10 → block [10(11), 10(10)] (uses 2 bits)
  11 → block [11(10)] (uses 2 bits)

36 Insertion in Extensible Hash Table
Now insert 0000, then 0101
Need to split block
Directory (i=2):
  00, 01 → block [0(010), 0(000)] + 0(101) (uses 1 bit, over capacity)
  10 → block [10(11), 10(10)] (uses 2 bits)
  11 → block [11(10)] (uses 2 bits)

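37 Insertion in Extensible Hash Table
After splitting the block
Directory (i=2):
  00 → block [00(10), 00(00)] (uses 2 bits)
  01 → block [01(01)] (uses 2 bits)
  10 → block [10(11), 10(10)] (uses 2 bits)
  11 → block [11(10)] (uses 2 bits)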
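The insertion sequence from the preceding slides can be replayed with a compact sketch. It assumes k = 4 hash bits, 2 keys per block, a directory indexed by the first i most significant bits, and that the keys are already hash values. The class names and the per-block depth field are illustrative, not the lecture's own code.

```python
K = 4                # hash values have k = 4 bits
BLOCK_CAPACITY = 2   # keys per block


class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of leading bits this block distinguishes
        self.keys = []


class ExtensibleHashTable:
    def __init__(self):
        self.i = 1                                   # directory uses first i bits
        self.directory = [Bucket(1), Bucket(1)]      # entries 0 and 1

    def _slot(self, h):
        return h >> (K - self.i)                     # first i most significant bits

    def insert(self, h):
        while True:
            bucket = self.directory[self._slot(h)]
            if len(bucket.keys) < BLOCK_CAPACITY:
                bucket.keys.append(h)
                return
            if bucket.depth == self.i:
                if self.i == K:
                    raise RuntimeError("cannot split further")
                # Extend the table: every entry becomes two adjacent entries.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)

    def _split(self, bucket):
        new_depth = bucket.depth + 1
        low, high = Bucket(new_depth), Bucket(new_depth)
        for key in bucket.keys:
            bit = (key >> (K - new_depth)) & 1       # the next bit decides the side
            (high if bit else low).keys.append(key)
        for slot, b in enumerate(self.directory):
            if b is bucket:                          # redirect entries of the old block
                bit = (slot >> (self.i - new_depth)) & 1
                self.directory[slot] = high if bit else low


table = ExtensibleHashTable()
for h in (0b0010, 0b1011, 0b1110, 0b1010, 0b0000, 0b0101):
    table.insert(h)

for slot, bucket in enumerate(table.directory):
    print(format(slot, "02b"), [format(key, "04b") for key in bucket.keys])
```

A split touches only the overflowing block and the directory entries that point to it; the directory itself doubles only when that block already uses all i bits.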

38 Extensible Hash Table
How many buckets (blocks) do we need to touch after an insertion?
How many entries in the hash table do we need to touch after an insertion?

39 Performance Extensible Hash Table
No overflow blocks: access always O(1)
–More precisely: exactly one disk I/O
BUT:
–Extensions can be costly and disruptive
–After an extension table may no longer fit in memory

