Chapter 11. Hashing
Contents
  Introduction
  A Simple Hashing Algorithm
  Hashing Functions and Record Distributions
  How Much Extra Memory Should Be Used?
  Collision Resolution by Progressive Overflow
  Storing More Than One Record per Address: Buckets
  Making Deletions
  Other Collision Resolution Techniques
  Patterns of Record Access
1. Introduction
  O-notation
    O(1)
    O(N): sequential searching
    O(log2 N)
    O(logk N): B-tree (k: leaf node size)
  What is hashing?
    a = h(K)
    h: hash function, K: key, a: home address
  Example
    K = BASS
    h = (first char * second char) mod 1000, using the ASCII codes of the characters
    a = h(K) = (66 * 65) mod 1000 = 4,290 mod 1000 = 290
Introduction
  Collision
    Example keys
      LOWELL  => a = (76 * 79) mod 1000 = 6,004 mod 1000 = 4
      OLIVIER => a = (79 * 76) mod 1000 = 6,004 mod 1000 = 4
  Several ways to reduce the number of collisions
    1. Spread out the records: good hashing algorithms
    2. Use extra memory
    3. Put more than one record at a single address: buckets
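
The introductory hash function can be sketched in a few lines of Python. The function name `first_two_hash` is a hypothetical label chosen for this illustration; the rule itself (product of the first two character codes, mod 1000) is the one on the slide.

```python
def first_two_hash(key: str, modulus: int = 1000) -> int:
    """Multiply the character codes of the first two characters and keep the remainder."""
    return (ord(key[0]) * ord(key[1])) % modulus


# BASS gets its own address, while LOWELL and OLIVIER collide,
# because multiplication ignores the order of the two characters.
for key in ("BASS", "LOWELL", "OLIVIER"):
    print(key, first_two_hash(key))
```

Because the function looks only at two characters and is symmetric in them, any keys that share the same first two letters in either order are synonyms, which is why the slide treats it as a poor hashing algorithm.
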
2. A Simple Hashing Algorithm
  3 steps
    1. Represent the key in numerical form
    2. Fold and add
    3. Divide by a prime number and use the remainder as the address
  Example
    Step 1. Represent the key in numerical form
      LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32
               L  O  W  E  L  L  (blanks)
A Simple Hashing Algorithm
  Example (continued)
    Step 2. Fold and add
      76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
      7679 + 8769 + 7676 + 3232 + 3232 = 30588
      Adding the last pair gives 30588 + 3232 = 33820, which exceeds the 2-byte maximum value of 32767, so take mod 19937 after each addition:
        7679 + 8769  = 16448 => 16448 mod 19937 = 16448
        16448 + 7676 = 24124 => 24124 mod 19937 = 4187
        4187 + 3232  = 7419  => 7419 mod 19937  = 7419
        7419 + 3232  = 10651 => 10651 mod 19937 = 10651
        10651 + 3232 = 13883 => 13883 mod 19937 = 13883
    Step 3. Divide by the size of the address space
      a = s mod n (n: number of addresses in the file)
      a = 13883 mod 100 = 83
      a = 13883 mod 101 = 46
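
The three steps can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `fold_and_add_hash`, the 12-character padded width, and the trick of joining two codes as `code * 100 + code` are assumptions made to reproduce the slide's arithmetic, and the joining trick only matches the slide for characters whose codes are below 100 (uppercase letters, digits, blanks).

```python
def fold_and_add_hash(key: str, n_addresses: int,
                      width: int = 12, limit: int = 19937) -> int:
    """Fold-and-add hash following the three steps on the slide."""
    # Step 1: pad with blanks to a fixed width and use the character codes.
    padded = key.ljust(width)[:width]
    total = 0
    # Step 2: fold the codes two at a time (76, 79 -> 7679) and add them,
    # reducing mod `limit` after every addition so the running sum
    # never exceeds the 2-byte maximum of 32767.
    for i in range(0, width, 2):
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + pair) % limit
    # Step 3: divide by the size of the address space and keep the remainder.
    return total % n_addresses


print(fold_and_add_hash("LOWELL", 100))
print(fold_and_add_hash("LOWELL", 101))
```
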
3. Hashing Functions and Record Distributions
  Distributing records among addresses
    <Figure 11.3> Different distributions of records A-G among addresses 1-10. (a) Uniform distribution (best) (b) Worst case (c) Random distribution (acceptable)
Hashing Functions and Record Distributions
  Some other hashing methods
    Better than random
      Examine keys for a pattern (e.g., resident registration numbers)
      Divide the key by a prime number
    Random
      Square the key and take the middle: 453^2 = 205209, e.g., the middle two digits 52
      Radix transformation
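
The square-and-take-the-middle method can be sketched as below. The function name `mid_square` and the choice of extracting two digits are assumptions for illustration; the slide only shows the square of 453.

```python
def mid_square(key: int, digits: int = 2) -> int:
    """Square the key and extract `digits` digits from the middle of the result."""
    squared = str(key * key)
    # Start position that centers the extracted digits in the squared value.
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


print(mid_square(453))
```
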
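4. How Much Extra Memory Should Be Used?
  Packing density
    Packing density = number of records / number of addresses = r / N
  Example
    r = 75 records, N = 100 addresses
    Packing density = 75 / 100 = 0.75 = 75%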
How Much Extra Memory Should Be Used?
  Predicting collisions for different packing densities
    <Table 11.2> Effect of packing density on the proportion of records not stored at their home addresses
      Packing density (%)   Synonyms (%)
      10                    4.8
      40                    17.6
      70                    28.1
      90                    34.1
      100                   36.8
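
The percentages in Table 11.2 can be reproduced with a short calculation. The sketch below assumes that the number of records hashing to each address follows a Poisson distribution with mean r/N; that model is an assumption here, since the slide lists only the resulting figures. The function name `overflow_percent` is hypothetical.

```python
import math


def overflow_percent(packing_density: float) -> float:
    """Percent of records not stored at their home address (one record per address)."""
    # Under the Poisson assumption, the fraction of addresses that receive
    # at least one record is 1 - e^(-r/N).
    occupied = 1.0 - math.exp(-packing_density)
    # Each occupied address stores exactly one record at home;
    # every other record is an overflow record.
    return 100.0 * (1.0 - occupied / packing_density)


for density in (0.10, 0.40, 0.70, 0.90, 1.00):
    print(f"{density:.0%}: {overflow_percent(density):.1f}")
```
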
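5. Collision Resolution by Progressive Overflow
  Open addressing
  Linear probing
  [Figure: progressive overflow. h(K) sends Novak to address 2 and York to address 3. Address 2 holds Rosen (Novak's home address), address 3 holds Jasper (York's home address), and York is stored at address 4.]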
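Collision Resolution by Progressive Overflow
  Search length
    Key     Home address   Number of accesses (search length)
    Adams   0              1
    Bates   1              1
    Cole    1              2
    Dean    2              2
    Evans   0              5
    File contents by address: 0 Adams, 1 Bates, 2 Cole, 3 Dean, 4 Evans, 5 (empty)
    Average search length = (1 + 1 + 2 + 2 + 5) / 5 = 2.2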
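
Progressive overflow and the search lengths in the table can be sketched as follows. This is a minimal in-memory illustration: the function names are hypothetical, the "file" is a Python list of six slots, and the home addresses are supplied directly from the table rather than computed by a hash function.

```python
from typing import List, Optional


def insert_linear(table: List[Optional[str]], key: str, home: int) -> int:
    """Store key at its home address or at the next free address, wrapping around."""
    n = len(table)
    for step in range(n):
        addr = (home + step) % n
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("file is full")


def search_length(table: List[Optional[str]], key: str, home: int) -> int:
    """Count the addresses examined before key is found."""
    n = len(table)
    for step in range(n):
        addr = (home + step) % n
        if table[addr] == key:
            return step + 1
        if table[addr] is None:
            break  # an empty slot ends the search
    raise KeyError(key)


# Home addresses taken from the slide's table.
homes = {"Adams": 0, "Bates": 1, "Cole": 1, "Dean": 2, "Evans": 0}
table: List[Optional[str]] = [None] * 6
for key, home in homes.items():
    insert_linear(table, key, home)

lengths = [search_length(table, key, home) for key, home in homes.items()]
average = sum(lengths) / len(lengths)
print(table, lengths, average)
```
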
Collision Resolution by Progressive Overflow
  Search length (continued)
    Example
      <Figure 11.7> Average search length versus packing density in a hashed file
6. Storing More Than One Record per Address: Buckets
    Key     Home address
    Green   0
    Hall    0
    Jenks   2
    King    3
    Land    3
    Marks   3
    Nutt    3
  Bucket contents by address
    0   Green, Hall
    1   (empty)
    2   Jenks
    3   King, Land, Marks
    4   Nutt (overflow record from bucket 3)
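
Bucket insertion with progressive overflow can be sketched as below. The bucket size of three is inferred from the figure (bucket 3 holds three records and the fourth, Nutt, moves on to bucket 4) and is an assumption, as is the function name `insert_into_buckets`.

```python
from typing import List


def insert_into_buckets(buckets: List[List[str]], key: str,
                        home: int, bucket_size: int) -> int:
    """Place key in its home bucket, or in the next bucket that still has room."""
    n = len(buckets)
    for step in range(n):
        addr = (home + step) % n
        if len(buckets[addr]) < bucket_size:
            buckets[addr].append(key)
            return addr
    raise RuntimeError("all buckets are full")


# Keys and home addresses from the slide's table.
records = [("Green", 0), ("Hall", 0), ("Jenks", 2), ("King", 3),
           ("Land", 3), ("Marks", 3), ("Nutt", 3)]
buckets: List[List[str]] = [[] for _ in range(5)]
for key, home in records:
    insert_into_buckets(buckets, key, home, bucket_size=3)

for addr, bucket in enumerate(buckets):
    print(addr, bucket)
```
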
Storing More Than One Record per Address: Buckets
  Effects of buckets on performance
    r: number of records
    N: number of addresses
    b: number of records in a bucket
    Packing density = r / (b * N)
                                      File without buckets   File with buckets
    Number of records                 r = 750                r = 750
    Number of addresses               N = 1000               N = 500
    Bucket size                       b = 1                  b = 2
    Packing density                   0.75                   0.75
    Ratio of records to addresses     r/N = 0.75             r/N = 1.5
Storing More Than One Record per Address: Buckets
  <Table 11.4> Synonyms causing collisions as a percent of records for different packing densities and different bucket sizes
                        Bucket size
    Packing density     1       2       5       10
    20 %                9.4     2.2     0.1     0.0
    50 %                21.3    10.4    2.5     0.4
    80 %                31.2    20.4    10.3    5.3
    100 %               36.8    27.1    17.6    12.5
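
The entries of Table 11.4 can be approximated with the sketch below. It assumes the number of records hashing to an address is Poisson distributed with mean r/N = packing density * b; the model and the function name `bucket_overflow_percent` are assumptions, since the slide gives only the table.

```python
import math


def bucket_overflow_percent(packing_density: float, bucket_size: int) -> float:
    """Percent of records that overflow their home bucket under a Poisson model."""
    mean = packing_density * bucket_size  # r/N, records per address
    # Expected unused slots per address: sum over x <= b of (b - x) * p(x).
    unused = sum(
        (bucket_size - x) * math.exp(-mean) * mean ** x / math.factorial(x)
        for x in range(bucket_size + 1)
    )
    # Expected overflow per address = E[max(X - b, 0)] = mean - b + unused.
    overflow_per_address = mean - bucket_size + unused
    return 100.0 * overflow_per_address / mean


for b in (1, 2):
    print(b, [round(bucket_overflow_percent(d, b), 1) for d in (0.2, 0.5, 0.8, 1.0)])
```

With b = 1 the same function reduces to the single-record case, so larger buckets at the same packing density can be compared directly.
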
7. Making Deletions
  Initial state
    Table columns: Key, Home address, Actual address (keys: Adams, Jones, Morris, Smith)
    File contents by address
      0   Adams
      1   Jones
      2   Morris
      3   Smith
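Making Deletions
  (1) Tombstones for handling deletions
    Before deletion:         0 Adams, 1 Jones, 2 Morris, 3 Smith
    After deleting Morris:   0 Adams, 1 Jones, 2 ###, 3 Smith
    If address 2 were simply left empty, a search for Smith that probes through address 2 would stop there: "Smith cannot be found"
    ###: tombstone. This mark indicates that a record once lived there but no longer does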
Making Deletions
  (2) Implications of tombstones for insertions
    Example: inserting "Smith"
  (3) Effects of deletions and additions on performance
    Solution to the problem of deteriorating average search length: reorganization
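
Tombstone handling under progressive overflow can be sketched as follows. The function names and the five-slot table are hypothetical, and the home addresses (Adams 0, Jones 1, Morris 1, Smith 0) are assumed for illustration only. The sketch searches the whole probe sequence before reusing a tombstone, so that a key already stored further along is not inserted twice.

```python
from typing import List, Optional

TOMBSTONE = "###"  # marks a slot whose record has been deleted


def find(table: List[Optional[str]], key: str, home: int) -> Optional[int]:
    """Probe from the home address; tombstones do not stop the search."""
    n = len(table)
    for step in range(n):
        addr = (home + step) % n
        if table[addr] is None:
            return None  # a truly empty slot ends the search
        if table[addr] == key:
            return addr
    return None


def delete(table: List[Optional[str]], key: str, home: int) -> bool:
    """Replace the record with a tombstone instead of emptying the slot."""
    addr = find(table, key, home)
    if addr is None:
        return False
    table[addr] = TOMBSTONE
    return True


def insert(table: List[Optional[str]], key: str, home: int) -> int:
    """Insert key, reusing the first empty or tombstone slot after a duplicate check."""
    if find(table, key, home) is not None:
        raise ValueError("duplicate key")
    n = len(table)
    for step in range(n):
        addr = (home + step) % n
        if table[addr] is None or table[addr] == TOMBSTONE:
            table[addr] = key
            return addr
    raise RuntimeError("file is full")


table: List[Optional[str]] = [None] * 5
for key, home in (("Adams", 0), ("Jones", 1), ("Morris", 1), ("Smith", 0)):
    insert(table, key, home)
delete(table, "Morris", 1)
smith_addr = find(table, "Smith", 0)  # still reachable through the tombstone
print(table, smith_addr)
```
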
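8. Other Collision Resolution Techniques
  (1) Double hashing
    Apply a second hashing function to the key to produce an increment c
    Add c to the home address repeatedly until a free address is found
    Drawback: seek time overhead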
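
The probe sequence of double hashing can be sketched as below. The generator name `double_hash_probes` is hypothetical, and the increment c is passed in directly because the slide does not specify the second hashing function. The sketch assumes c and n share no common factor (for example, n prime and 0 < c < n), so that every address is visited once.

```python
from typing import Iterator


def double_hash_probes(home: int, c: int, n: int) -> Iterator[int]:
    """Yield the addresses examined: home, home + c, home + 2c, ... (mod n)."""
    for i in range(n):
        yield (home + i * c) % n


print(list(double_hash_probes(home=3, c=4, n=7)))
```

Compared with progressive overflow, consecutive probes land far apart in the file, which is the source of the extra seek time noted on the slide.
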
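Other Collision Resolution Techniques
  (2) Chained progressive overflow
    Search length (1): progressive overflow without chaining
    Search length (2): chained progressive overflow
    Key     Home address   Actual address   Search length (1)   Search length (2)
    Adams   0              0                1                   1
    Bates   1              1                1                   1
    Cole    0              2                3                   2
    Dean    1              3                3                   2
    Evans   4              4                1                   1
    Flint   0              5                6                   3
    Average search length: (1) 15 / 6 = 2.5, (2) 10 / 6 = 1.7
    File without links, by address: 0 Adams, 1 Bates, 2 Cole, 3 Dean, 4 Evans, 5 Flint
    File with links (address, record, address of next synonym)
      0   Adams   2
      1   Bates   3
      2   Cole    5
      3   Dean    -1
      4   Evans   -1
      5   Flint   -1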
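
The chained search lengths can be checked with a short sketch that follows the link fields instead of probing every address. The list layout and the function name `chained_search_length` are assumptions for illustration; the records, links, and home addresses are those shown on the slide.

```python
from typing import Dict, List, Tuple

# (record, address of next synonym); -1 marks the end of a chain.
records: List[Tuple[str, int]] = [
    ("Adams", 2), ("Bates", 3), ("Cole", 5),
    ("Dean", -1), ("Evans", -1), ("Flint", -1),
]
homes: Dict[str, int] = {"Adams": 0, "Bates": 1, "Cole": 0,
                         "Dean": 1, "Evans": 4, "Flint": 0}


def chained_search_length(key: str, home: int) -> int:
    """Count accesses when the search follows synonym links from the home address."""
    addr = home
    accesses = 0
    while addr != -1:
        accesses += 1
        name, link = records[addr]
        if name == key:
            return accesses
        addr = link
    raise KeyError(key)


lengths = [chained_search_length(key, home) for key, home in homes.items()]
print(lengths, round(sum(lengths) / len(lengths), 1))
```
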
Other Collision Resolution Techniques
  (3) Chaining with a separate overflow area
    Primary data area (home address, record, link into overflow area)
      0   Adams   0
      1   Bates   1
      2
      3
      4   Evans   -1
    Overflow area (address, record, link)
      0   Cole    2
      1   Dean    -1
      2   Flint   -1
      3
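Other Collision Resolution Techniques
  (4) Scatter tables: indexing revisited
    Hashed index: one entry per home address (0-4), pointing to the first record of that address's chain
    Data file (record number, record, link to next synonym)
      0   Adams   1
      1   Cole    3
      2   Bates   4
      3   Flint   -1
      4   Dean    -1
      5   Evans   -1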
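
A scatter table lookup can be sketched as below: the hash value selects an index entry, and the search then follows the links through the data file. The index contents (0 -> 0, 1 -> 2, 4 -> 5) are assumptions inferred from the chains in the data file together with the home addresses used earlier in this chapter, and the names `index`, `records`, and `lookup` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hashed index: home address -> record number of the first record in the chain.
index: Dict[int, int] = {0: 0, 1: 2, 4: 5}

# Data file: (record, record number of next synonym); -1 ends a chain.
records: List[Tuple[str, int]] = [
    ("Adams", 1), ("Cole", 3), ("Bates", 4),
    ("Flint", -1), ("Dean", -1), ("Evans", -1),
]


def lookup(key: str, home: int) -> Optional[int]:
    """Return the record number holding key, or None if it is not in the chain."""
    rrn = index.get(home, -1)
    while rrn != -1:
        name, link = records[rrn]
        if name == key:
            return rrn
        rrn = link
    return None


print(lookup("Flint", 0), lookup("Dean", 1), lookup("Evans", 4))
```

Because only the index is hashed, the data file itself can be kept in any order, for example in order of arrival.
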
Patterns of Record Access
  A small percentage of the records in a file account for a large percentage of the accesses
  80/20 rule: 80% of the accesses are performed on 20% of the records