File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.


1 File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11

2 Indirect Address Translation
• Direct translation
  » When the primary key (PK) and the relative record position (RRP) are the same, we say there is a direct translation.
  » Simple direct-access file systems use this technique.

3 Indirect Address Translation
• Direct translation - problems
  » The PKs may not be numeric.
    – Names
    – Alphanumeric IDs

4 Indirect Address Translation
• Direct translation - problems
  » Only a small percentage of the possible range of PKs may actually have records assigned to them:
    – Consider an employee file whose key field is a 9-digit ID number (e.g. Social Security Number).
    – The company has 200 employees.
    – Since the IDs may take any of the 10^9 values, the file will have to be huge (10^9 records!). Thus the file will have a packing density of:
      200 records used / 10^9 records allocated = 0.0000002 = 0.00002%

5 Indirect Address Translation
• Hashing
  » A common technique of indirect translation is hashing.
  » It is a solution in which the broad range of PK values is transformed into the smaller range of RRP values.
  » Hashing uses a hashing function to translate the key values into the smaller range of RRP values.

6 Indirect Address Translation
• Hashing algorithms
  » Development of a hashing function requires careful attention:
    – The algorithm should distribute the keys as evenly as possible across the range of addresses.
    – Because there are more possible keys than addresses, some different keys must necessarily map to the same address.

7 Key Transformation Algorithms
• Three general steps to convert a key to an RRP address:
  1) If the key is not numeric, convert it into a numeric form without losing information.
  2) Operate on the numeric key using an algorithm that converts the keys into a spread of numbers of the order of magnitude of the address numbers required.
  3) Multiply the resulting numbers by a constant that compresses them into the precise range of addresses.

8 Key Transformation Algorithms
• Example:
  » Key is a 9-digit number.
  » Destination file has 7000 records.
  » Step 1 - Not needed (already a number).
  » Step 2 - Divide the key by [divisor missing in transcript] to get a remainder between [range missing in transcript].
  » Step 3 - Multiply the value from step 2 by .7 to put the number within the range 0000 to 6999.
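The three steps can be sketched in Python as follows. The divisor for step 2 is missing from the transcript; this sketch assumes 10,000, because multiplying by .7 maps the remainders 0000 to 9999 onto 0000 to 6999. The function names and the byte-based conversion for non-numeric keys are illustrative assumptions, not part of the slides.

```python
def to_numeric(key: str) -> int:
    """Step 1 (assumed approach): read the key's ASCII bytes as one big integer."""
    return int.from_bytes(key.encode("ascii"), "big")


def three_step_address(key: int, divisor: int = 10_000, file_size: int = 7_000) -> int:
    """Steps 2 and 3 for a numeric key; divisor=10_000 is an assumption."""
    spread = key % divisor                 # step 2: remainder in 0 .. divisor - 1
    return spread * file_size // divisor   # step 3: INT(spread * .7) for the defaults


# Remainder of 123456789 is 6789; 6789 * .7 = 4752.3, truncated to 4752.
print(three_step_address(123456789))
```

Integer arithmetic is used for step 3 so that the truncation does not depend on floating-point rounding.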

9 Key Transformation Algorithms
• Example:
  » What would happen if we skipped step 2 and simply compressed the number from step 1?
  » What about clustered insertions (keys with contiguous values)?
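One way to explore these questions is to run a cluster of contiguous keys through both variants. This is a sketch under stated assumptions: the file size of 7000 comes from the preceding example, while the divisor of 10,000 and the sample keys are chosen only for illustration.

```python
FILE_SIZE = 7_000
KEY_SPACE = 10 ** 9  # number of possible 9-digit keys


def compress_only(key: int) -> int:
    """Skip step 2: scale the whole 9-digit key straight into 0 .. 6999."""
    return key * FILE_SIZE // KEY_SPACE


def spread_then_compress(key: int, divisor: int = 10_000) -> int:
    """Keep step 2 (assumed divisor 10_000), then compress as before."""
    return (key % divisor) * FILE_SIZE // divisor


cluster = range(123_456_000, 123_456_100)   # 100 contiguous, hypothetical keys
direct = {compress_only(k) for k in cluster}
spread = {spread_then_compress(k) for k in cluster}
print(len(direct), len(spread))  # prints: 1 70
```

With compression alone, all 100 clustered keys land on a single address, because only the leading digits of the key influence the result. Taking the remainder first lets the low-order digits, which are the ones that differ within a cluster, drive the address.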

10 Key Transformation Algorithms - Division
• The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP.
• A prime number, or a number with no small factors, is used as the divisor.

11 Key Transformation Algorithms - Division
• Example:
  » Records have a 6-digit key; 5000 RRPs desired.
  » Divide by 4997 and use the remainder.
  » Consider key 142536:
  » 142536 / 4997 = 28 remainder 2620
  » Use 2620 as RRP.
• How do you suppose this method would work with clustered insertions?
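A minimal Python sketch of the division method follows. The key 142536 is the one that reproduces the slide's quotient and remainder (28 x 4997 + 2620 = 142536); the function name is an assumption made for illustration.

```python
def division_hash(key: int, divisor: int = 4997) -> int:
    """Division method: the remainder of key / divisor is the RRP."""
    return key % divisor


quotient, rrp = divmod(142536, 4997)
print(quotient, rrp)  # prints: 28 2620

# A cluster of contiguous keys maps to contiguous, distinct addresses.
print([division_hash(k) for k in range(142536, 142541)])
```

Since consecutive keys produce consecutive remainders, a run of clustered keys spreads over distinct addresses instead of piling onto one.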

12 Key Transformation Algorithms - Extraction
• Select digits from different parts of the key.
• Example:
  » Records with 10-digit key; 5000 RRPs desired.
  » Choose the 3rd, 5th, 8th and 9th digits.
  » Consider key = [value missing in transcript], whose selected digits give 8625.
  » Compress into RRP range: INT(8625 * .5) = 4312. Use 4312 as RRP.
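The extraction method might look like this in Python. The key on the slide is missing from the transcript, so 1283674259 is a hypothetical 10-digit key picked only because its 3rd, 5th, 8th and 9th digits are 8, 6, 2 and 5; the function name is likewise an assumption.

```python
def extraction_hash(key: str, positions=(3, 5, 8, 9), n_addresses: int = 5000) -> int:
    """Pick the digits at the given 1-based positions, then compress."""
    extracted = int("".join(key[p - 1] for p in positions))
    span = 10 ** len(positions)              # 4 digits -> values 0000 .. 9999
    return extracted * n_addresses // span   # INT(extracted * .5) for the defaults


print(extraction_hash("1283674259"))  # digits 8625 -> INT(8625 * .5) = 4312
```

The key is handled as a string so that leading zeros and digit positions are preserved.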

13 Key Transformation Algorithms - Folding
• Digits in the key are folded inward, like folding paper; then the digits are added.
• Folding tends to be more appropriate for large keys.

14 Key Transformation Algorithms - Folding
• Example:
  » Let key be [value missing in transcript].
  » Fold left at the 4th digit, right at the 3rd digit.
  » Results in 4137 and 735.
  » Add the two resulting values: 4137 + 735 = 4872
  » Compress into RRP range: 4872 x .5 = 2436. Use 2436 as RRP.
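The following Python sketch shows one possible reading of folding, not necessarily the one used on the slide. It assumes that folding a part inward reverses its digits, that only the two folded parts are added, and that any carry beyond four digits is discarded. The key 7314537 is hypothetical, chosen because it reproduces the slide's intermediate values 4137 and 735 under those assumptions.

```python
def fold_hash(key: str, left: int = 4, right: int = 3, n_addresses: int = 5000) -> int:
    """Fold the outer digits inward (assumed: reverse them), add, compress."""
    left_part = int(key[:left][::-1])      # "7314" folded -> 4137
    right_part = int(key[-right:][::-1])   # "537" folded  -> 735
    span = 10 ** left
    total = (left_part + right_part) % span   # assumed: drop any carry digit
    return total * n_addresses // span        # INT(total * .5) for the defaults


print(fold_hash("7314537"))  # 4137 + 735 = 4872, then 4872 * .5 = 2436
```

Any digits between the two folded parts are ignored here, since the slide adds only two values.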

15 Key Transformation Algorithms - Mid-square method
• Square the key, and use the central digits of the result.
• Example:
  » Let records have a 6-digit key, with 5000 RRPs desired.
  » Key value of [value missing in transcript]
  » Squared: [value missing in transcript]
  » Central digits: 1651
  » Compress into RRP range: 1651 x .5 = 825.5, truncated to 825. Use 825 as RRP.
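A sketch of the mid-square method in Python follows. The slide's key is missing from the transcript, so 123456 is a hypothetical key. Padding the square to 12 digits and taking the middle four is one convention assumed here; the slide does not say how "central" is defined.

```python
def compress(value: int, n_addresses: int = 5000, span: int = 10_000) -> int:
    """Scale a 4-digit value into 0 .. n_addresses - 1 (INT(value * .5) by default)."""
    return value * n_addresses // span


def mid_square_hash(key: int, key_digits: int = 6, n_addresses: int = 5000) -> int:
    """Square the key, take the 4 central digits, then compress."""
    squared = str(key * key).zfill(2 * key_digits)   # pad to a fixed 12 digits
    start = (len(squared) - 4) // 2
    central = int(squared[start:start + 4])
    return compress(central, n_addresses)


print(compress(1651))            # the slide's central digits: prints 825
print(mid_square_hash(123456))   # 123456**2 = 15241383936, central digits 4138
```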

16 Key Transformation Algorithms - Selection
• The best way to choose a transform is to take the key set for the file and simulate using different transforms.
• Choose the one that distributes the records most evenly.
• The division method seems to be the best general transform.
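Such a simulation can be quite small. The sketch below counts how many records would collide under each candidate transform and picks the one with the fewest. The key set (3000 six-digit keys spaced 7 apart) and the two candidate transforms are assumptions for illustration; a real evaluation would use the file's actual keys.

```python
def count_synonym_overflows(keys, transform) -> int:
    """Number of records that hash to an address some earlier record already took."""
    addresses = [transform(k) for k in keys]
    return len(addresses) - len(set(addresses))


# Candidate transforms for 6-digit keys and about 5000 addresses.
transforms = {
    "division": lambda k: k % 4997,
    "compression": lambda k: k * 5000 // 10 ** 6,
}

keys = range(100_000, 121_000, 7)   # hypothetical key set
results = {name: count_synonym_overflows(keys, t) for name, t in transforms.items()}
best = min(results, key=results.get)
print(results, "best:", best)
```

For this key set the division transform produces no synonyms at all, while plain compression squeezes the keys into a narrow band of addresses.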

17 Important hashing considerations
• When designing a practical hashing scheme, several important issues must be addressed.
• Record distribution
  » A hashing function must be picked that will evenly distribute the records throughout the RRP range.
  » Different key sets will have different distribution patterns.
  » Thus the hashing function chosen will depend on the patterns of keys in the data set.

18 Important hashing considerations
• Synonyms
  » Two or more PKs that transform to the same RRP address.
  » The goal is to devise a hashing function for a given set of keys that will minimize synonyms.
  » It is, however, statistically beyond reason to totally avoid synonyms.
  » Not only would all keys need to be known in advance, but only one algorithm in [number missing in transcript] will work!

19 Important hashing considerations
• Collisions
  » Occur when a new record hashes to an address already in use by another record.
  » The new record and the existing record are called synonyms.
  » The result is called an overflow.
  » A scheme must be devised to handle overflows efficiently.

20 Important hashing considerations
• Packing density
  » The ratio of records stored in a file to addresses available in the file.
  » Typically the best packing density is 80-90%.
  » The larger the file (the more addresses allocated for the same records), the lower the probability of an overflow.
  » There is thus a trade-off between space and efficiency.
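The ratio is easy to compute, and it can be inverted to size a file for a target density. This is a minimal sketch; the function names are assumptions, and the 200-record figure reuses the employee-file example from earlier in the deck.

```python
def packing_density(records_stored: int, addresses_available: int) -> float:
    """Ratio of records stored to addresses available."""
    return records_stored / addresses_available


def addresses_needed(records: int, density_percent: int) -> int:
    """Smallest number of addresses that keeps density at or below the target."""
    return -(-records * 100 // density_percent)   # integer ceiling division


print(packing_density(200, 10 ** 9))   # direct translation on a 9-digit key: 2e-07
print(addresses_needed(200, 80))       # 200 records at 80% density: 250
```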

21 Techniques for handling collisions
• Strategies for collision resolution:
  1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). These are called composite records or buckets.
  2. Develop algorithms for relocating records that collide.

22 Composite Records or buckets
• Reduce the number of RRPs, but increase the size of each to hold several records.
• Each RRP (called a bucket) now holds several logical records.

23 Composite Records or buckets
• Buckets are arrays of logical records.
• Bucket size - number of records per bucket.
• There is now room for several synonyms in each bucket.
• Probability of overflow is reduced.
• Overflow now occurs only when a bucket is full.
• Overall file size need not increase: if the bucket size is 5, reduce the number of physical records by a factor of 5.

24 Composite Records or buckets
• May be implemented by having each file record be an array of logical records.
• Example: Consider two half-full files. [Figure not reproduced in transcript.] Probability of overflow?
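The effect of bucket size on overflow can be estimated with a small simulation. The sketch below compares two half-full files with the same total capacity of 1000 logical records: one with 1000 buckets of size 1, one with 200 buckets of size 5. These sizes are illustrative, and a seeded uniform random choice stands in for the hashing function, which is an assumption rather than anything stated on the slides.

```python
import random


def count_overflows(n_records: int, n_buckets: int, bucket_size: int, seed: int = 0) -> int:
    """Count records that arrive at a bucket that is already full."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    fill = [0] * n_buckets
    overflows = 0
    for _ in range(n_records):
        home = rng.randrange(n_buckets)   # stand-in for a well-spread hash
        if fill[home] < bucket_size:
            fill[home] += 1
        else:
            overflows += 1
    return overflows


single = count_overflows(500, n_buckets=1000, bucket_size=1)
grouped = count_overflows(500, n_buckets=200, bucket_size=5)
print(single, grouped)
```

Both files are 50% full, yet the file with larger buckets sees far fewer overflows, since a bucket only overflows once all of its slots are taken.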

25 Composite Records or buckets
• Trade-offs
  » As bucket size increases, the probability of an overflow is greatly reduced.
  » As bucket size increases, the time to read in and scan a bucket increases.
  » Typical bucket sizes range from 5 to 30.
  » The ideal bucket size is often a multiple of the disk sector or track size.
  » What is the extreme case of having the longest possible bucket?

26 Handling overflows
• Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with.
• Many algorithms exist for handling overflows, including:
  1. Progressive overflow
  2. Separate overflow area
  3. Chained progressive overflow

27 Progressive overflow
• Adding a new record
  » If the home address is full, try the next address.
  » If the next address is full, try the next, and so on.
  » If at the end of the file, wrap around to record 0.
  » If the search continues until the home address is reached again, the file is full.

28 Progressive overflow
• Finding a record
  » If in home bucket, success!
  » Else if home bucket is not full, the search fails.
  » Else if home bucket is full, go search the next bucket.
  » Keep searching successive buckets until either the record is found or a non-full bucket is searched.

29 Progressive overflow
• Finding a record
  » Note that as the file fills, search length will increase.
  » What are some enhancements?
    – Each bucket has a flag indicating whether the bucket has really overflowed.

30 Progressive overflow
• Deleting a record
  » Can't simply remove it, or find may not work correctly.
  » Must mark each record as used, unused, or deleted.
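Adding, finding and deleting under progressive overflow can be sketched together. This is an in-memory illustration under simplifying assumptions: bucket size 1, integer keys assumed unique, and a division hash on the file size. The class and method names are hypothetical.

```python
UNUSED, USED, DELETED = "unused", "used", "deleted"


class ProgressiveOverflowFile:
    def __init__(self, n_addresses: int):
        self.n = n_addresses
        self.state = [UNUSED] * n_addresses
        self.keys = [None] * n_addresses

    def _probe(self, key: int):
        """Yield the home address, then each next address, wrapping to 0."""
        home = key % self.n
        for step in range(self.n):
            yield (home + step) % self.n

    def add(self, key: int) -> int:
        for addr in self._probe(key):
            if self.state[addr] != USED:     # unused or deleted slots are free
                self.state[addr], self.keys[addr] = USED, key
                return addr
        raise RuntimeError("file full")      # probed all the way back to home

    def find(self, key: int):
        for addr in self._probe(key):
            if self.state[addr] == UNUSED:   # never-used slot ends the search
                return None
            if self.state[addr] == USED and self.keys[addr] == key:
                return addr
            # DELETED slots are skipped so later synonyms stay reachable
        return None

    def delete(self, key: int) -> bool:
        addr = self.find(key)
        if addr is None:
            return False
        self.state[addr], self.keys[addr] = DELETED, None
        return True
```

For example, with 7 addresses the keys 10, 17 and 24 all have home address 3 and end up at 3, 4 and 5. After deleting 17, a search for 24 still succeeds because address 4 is marked deleted instead of unused.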

31 Progressive overflow
• Evaluation
  » Simple
  » Robust
  » Searches may get very long
  » Clustering

32 Progressive overflow
• Alternate version - skip x records each time, where x is relatively prime to the number of records.
• Reduces the problem of record clustering.
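The reason x should be relatively prime to the number of records can be checked directly: only then does the probe sequence visit every address before returning home. The sketch below uses a 10-record file as an illustrative assumption; the function name is hypothetical.

```python
from math import gcd


def probe_sequence(home: int, skip: int, n_records: int) -> list:
    """Addresses visited when stepping `skip` records at a time, with wraparound."""
    return [(home + i * skip) % n_records for i in range(n_records)]


for skip in (3, 4):
    visited = set(probe_sequence(0, skip, 10))
    print(skip, gcd(skip, 10), len(visited))
# skip 3 shares no factor with 10 and reaches all 10 addresses;
# skip 4 shares the factor 2 and reaches only 5 of them.
```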

33 Separate overflow area
• Buckets contain pointers that may point to a record in a special overflow area.
• Records (or buckets) are linked together in the overflow area as a linked list.
• What happens if there are a lot of synonyms for a few home addresses?
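A separate overflow area can be sketched with two arrays for the home file and a list for the overflow records. This is an in-memory illustration assuming bucket size 1, integer keys and a division hash; the class name and the choice to have find return the number of record accesses are assumptions made to show how chain length affects search cost.

```python
class OverflowAreaFile:
    def __init__(self, n_addresses: int):
        self.n = n_addresses
        self.home = [None] * n_addresses   # one record per home address
        self.head = [None] * n_addresses   # pointer into the overflow area
        self.overflow = []                 # entries are [key, next_pointer]

    def add(self, key: int) -> None:
        addr = key % self.n
        if self.home[addr] is None:
            self.home[addr] = key
            return
        # Home address taken: link the synonym at the front of its chain.
        self.overflow.append([key, self.head[addr]])
        self.head[addr] = len(self.overflow) - 1

    def find(self, key: int):
        """Return the number of records read to find key, or None if absent."""
        addr = key % self.n
        accesses = 1
        if self.home[addr] == key:
            return accesses
        link = self.head[addr]
        while link is not None:
            accesses += 1
            stored_key, next_link = self.overflow[link]
            if stored_key == key:
                return accesses
            link = next_link
        return None
```

When many synonyms share a few home addresses, their chains grow long and each search degenerates into walking a linked list, which is the situation the last bullet asks about.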

34 Separate overflow area

35 Chained Progressive overflow
• Similar to progressive overflow, but pointers link synonyms together for quicker searches.

