Folk/Zoellick/Riccardi, File Structures
Chapter 6: Organizing Files for Performance
Objectives: to get familiar with:
  –Data compression
  –Storage management
  –Internal sorting and binary search
Outline
Data compression
Reclaiming space in files
  –Record deletion
  –Dynamic space reclaiming for fixed-length records
  –Dynamic space reclaiming for variable-length records
  –Storage fragmentation
Internal sorting and binary search
Keysorting
Data Compression
Data compression: encoding a file so that it takes up less space.
  –Uses less storage,
  –Can be transmitted faster,
  –Can be processed faster sequentially.
Encoding with a different notation
  –The “State” field in the address file requires two bytes (16 bits). However, 50 states can be encoded using 6 bits (2^6 = 64), which fit in a single byte: a 50% space saving for each occurrence of the state field.
  –The compact notation is a redundancy reduction technique.
  –Costs:
    »The file is not readable by humans.
    »The overhead of encoding and decoding operations.
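The compact notation above can be sketched in a few lines of Python. This is only an illustration: the ordering of the state table and the helper names `encode_state` and `decode_state` are assumptions, not part of the slides. Any fixed ordering works as long as the encoder and decoder share it.

```python
# Two-letter state abbreviations; a state's position in this list is its code.
# The ordering is an arbitrary assumption shared by encoder and decoder.
STATES = (
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY"
).split()
CODE_OF = {abbr: code for code, abbr in enumerate(STATES)}


def encode_state(abbr: str) -> bytes:
    """Replace a two-byte abbreviation by a one-byte code (values 0..49)."""
    return bytes([CODE_OF[abbr]])


def decode_state(code: bytes) -> str:
    """Look the abbreviation up again; needed whenever a human reads the field."""
    return STATES[code[0]]
```

Every code is below 64, so it fits in 6 bits; storing it in a whole byte keeps the field byte-aligned and still halves its size.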
Data Compression (cont’d)
Suppressing repeating sequences
  –Suitable for sparse arrays or images with regions of the same color.
  –Run-length encoding: choose an unused byte value to indicate that a run-length code follows that byte.
  –Encoding algorithm:
    »Read through the data (pixels or values) that make up the image or data content, copying the data values to the file in sequence, except where the same data value occurs more than once in succession.
    »Where the same value occurs more than once in succession, substitute the following three entries: the special run-length code indicator, the data value that is repeated, and the number of times that the value is repeated.
    »Example (only the run-length code indicators survive in this transcript): the encoded sequence is: ff ... ff ... ff ... ff ...
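A minimal Python sketch of the run-length encoding steps described above. It assumes the indicator byte is 0xff and that this value never occurs in the data; the function names and the 255-item limit on a run (so the count fits in one byte) are illustrative choices, not taken from the slides.

```python
RUN_MARKER = 0xFF  # assumed to be unused in the data being compressed


def rle_encode(data: bytes, marker: int = RUN_MARKER) -> bytes:
    """Copy single values; replace each run by (marker, value, count)."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        if value == marker:
            raise ValueError("marker byte occurs in the data")
        run = 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run > 1:
            out += bytes([marker, value, run])
        else:
            out.append(value)
        i += run
    return bytes(out)


def rle_decode(encoded: bytes, marker: int = RUN_MARKER) -> bytes:
    """Expand every (marker, value, count) triple back into a run."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == marker:
            value, run = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * run
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)
```

Following the rule literally, a run of only two values costs three bytes, so a practical encoder might encode only runs of three or more.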
Data Compression (cont’d)
Variable-length encoding
  –Letters with high frequency are encoded using shorter symbols.
  –Letters with low frequency are encoded using longer symbols.
  –Huffman code (for a set of seven letters):
    »at most four bits per letter (a fixed-length code needs a minimum of 3 bits per letter).
  –The string “abefd” is encoded as “ ”.
  –Huffman codes are used in some UNIX systems for data compression.
Irreversible compression techniques
  –Voice coding
  –Some image coding schemes that change pixel granularity or reduce color quality
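To make the variable-length idea concrete, here is a small Python sketch that builds a Huffman code with a priority queue. The letter frequencies in `FREQ` are made-up values for seven letters, and the resulting bit patterns depend on how ties are broken, so they need not match the code table used in the slides.

```python
import heapq
from itertools import count

# Hypothetical frequencies: "a" is common, the other six letters are rare.
FREQ = {"a": 4, "b": 1, "c": 1, "d": 1, "e": 1, "f": 1, "g": 1}


def huffman_codes(freq: dict) -> dict:
    """Repeatedly merge the two lightest subtrees, prefixing 0 and 1."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict) -> str:
    """Works because no code word is a prefix of another."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

With these assumed frequencies the frequent letter gets the shortest code word, and the average length drops below the 3 bits per letter that a fixed-length code would need.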
Reclaiming Space in Files
File organization with the following operations:
  –record insertion
  –record deletion
  –record modification
Space reclaiming is needed when
  –deleting fixed-length and variable-length records
  –modifying variable-length records
    »can be treated as a deletion followed by an insertion
Record Deletion
Identifying deleted records
  –Place a special mark in each deleted record. E.g., place an asterisk (*) as the first field in a deleted record.
    »Before deletion
      Ames|John|123 Maple|Stillwater|OK|74075|...
      Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|
      Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
    »After deletion
      Ames|John|123 Maple|Stillwater|OK|74075|...
      *|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|
      Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
  –Keep the deleted records around for some time.
    »This delays the disk compaction.
    »Programs must be able to ignore the deleted records.
    »Records can be “undeleted”.
Record Deletion (cont’d)
Space reclamation:
  –Happens after a number of deleted records have accumulated.
  –A simple solution is to copy the file, skipping the deleted records.
    »Suitable for both fixed-length and variable-length records.
    »After space reclamation
      Ames|John|123 Maple|Stillwater|OK|74075|...
      Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
  –In-place space reclamation (without copying the file) is more complicated and time-consuming.
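The mark-then-compact scheme can be sketched as follows. For simplicity this sketch assumes one record per text line, which is not how the slides store records; `mark_deleted` and `compact` are illustrative names.

```python
import io

DELETED_MARK = "*"


def mark_deleted(record: str) -> str:
    """Overwrite the first two characters with '*|', as in the slide example."""
    return DELETED_MARK + "|" + record[2:]


def compact(src, dst) -> int:
    """Copy src to dst, skipping marked records. Returns the number kept."""
    kept = 0
    for line in src:
        if line.startswith(DELETED_MARK):
            continue  # deleted record: its space is reclaimed by not copying it
        dst.write(line)
        kept += 1
    return kept


# Small in-memory demonstration using the records from the slides.
records = [
    "Ames|John|123 Maple|Stillwater|OK|74075|\n",
    "Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|\n",
    "Brown|Martha|625 Kimbark|Des Moines|IA|50311|\n",
]
records[1] = mark_deleted(records[1])
compacted = io.StringIO()
kept_count = compact(io.StringIO("".join(records)), compacted)
```

Until `compact` runs, the marked record is still in the file, which is what makes an undelete possible.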
Dynamic Space Reclaiming -- Fixed-Length Records
A naive approach: when inserting a new record,
  –search the file record by record;
  –if a deleted record is found, insert the new record in the place of the deleted record;
  –otherwise, insert the new record at the end of the file.
Issues in reclaiming space quickly:
  –How to know immediately if there are empty slots in the file?
  –How to jump to one of those slots, if they exist?
Linking all deleted records together using a linked list:
  Head pointer -> deleted record (pointer) -> deleted record (pointer) -> deleted record (pointer) -> ...
Dynamic Space Reclaiming -- Fixed-Length Records (cont’d)
  –Use the linked list of the deleted records as a stack:
      Head pointer -> RRN 5 (link: 2) -> RRN 2 (end of list)
  –Add (push) a recently deleted record of RRN 3 to the top of the stack:
      Head pointer -> RRN 3 (link: 5) -> RRN 5 (link: 2) -> RRN 2 (end of list)
  –Remove (pop) the free space at the top of the stack, RRN 3, for an inserted record:
      Head pointer -> RRN 5 (link: 2) -> RRN 2 (end of list)
Dynamic Space Reclaiming -- Fixed-Length Records (cont’d)
  –Use the linked list of the deleted records as a stack:
  –Add (push) a recently deleted record of RRN 3 to the top of the stack:
  –Insert three new records into the space of the deleted records:
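The stack of deleted slots can be sketched in Python as below. The file is modelled as an in-memory list indexed by RRN; the class name `FixedLengthFile`, the `("*", link)` representation of a deleted slot, and the use of -1 as the end-of-list value are assumptions made for the illustration.

```python
END_OF_LIST = -1  # assumed sentinel meaning "no more free slots"


class FixedLengthFile:
    """Fixed-length records addressed by RRN, with a stack of free slots."""

    def __init__(self, records):
        self.slots = list(records)
        self.head = END_OF_LIST  # RRN of the most recently deleted record

    def delete(self, rrn: int) -> None:
        # Push: the deleted slot stores the old head, then becomes the head.
        self.slots[rrn] = ("*", self.head)
        self.head = rrn

    def insert(self, record) -> int:
        # No free slot: append at the end of the file.
        if self.head == END_OF_LIST:
            self.slots.append(record)
            return len(self.slots) - 1
        # Pop: reuse the slot on top of the stack and follow its link.
        rrn = self.head
        _, self.head = self.slots[rrn]
        self.slots[rrn] = record
        return rrn
```

Deleting RRN 2, then 5, then 3 leaves the head pointing at RRN 3, and three subsequent insertions reuse RRN 3, 5, and 2 in that order, with no search through the file.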
Dynamic Space Reclaiming -- Variable-Length Records
An available list to store the deleted variable-length records:
  –How to link the deleted records together into a list?
  –How to add newly deleted records to the available list?
  –How to find and remove records from the available list when space is reclaimed?
An available list of variable-length records (each number is the byte count of the record that follows it)
  HEAD.FIRST_AVAILABLE:
  Ames|John|123 Maple|Stillwater|OK|74075|
  64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|
  45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
Delete the second record:
  HEAD.FIRST_AVAILABLE:
  Ames|John|123 Maple|Stillwater|OK|74075|
  64 *| |
  45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
Dynamic Space Reclaiming -- Variable-Length Records (cont’d)
When inserting a new record, we need to search the available list for a deleted record whose length is large enough:
  –The current available list: slots of size 47, 38, 72, and 68.
  –Insert a record of 55 bytes:
    »Removed record: size 72 (the first slot that is large enough).
    »A new link joins the size 38 slot to the size 68 slot, leaving sizes 47, 38, and 68 on the list.
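The search above can be sketched with a Python list standing in for the linked available list. Each entry is an `(offset, size)` pair; the byte offsets are invented for the example, and only the sizes come from the slide.

```python
from typing import Optional


def take_first_fit(avail: list, needed: int) -> Optional[tuple]:
    """Remove and return the first (offset, size) slot with size >= needed.

    Returns None when no slot is big enough, in which case the caller
    would append the new record at the end of the file.
    """
    for i, (offset, size) in enumerate(avail):
        if size >= needed:
            return avail.pop(i)  # unlinking the slot from the list
    return None


# Hypothetical offsets; sizes as in the slide example.
avail_list = [(0, 47), (100, 38), (200, 72), (300, 68)]
chosen = take_first_fit(avail_list, 55)
```

In a real file the removal is done by rewriting one link field, which is the "new link" in the slide, rather than by shifting list elements.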
Storage Fragmentation
Internal fragmentation caused by fixed-length records:
  Ames|John|123 Maple|Stillwater|OK|74075|
  Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|
  Brown|Martha|625 Kimbark|Des Moines|IA|50311|
Internal fragmentation caused by variable-length records:
  –The inserted record is shorter than the deleted record
    HEAD.FIRST_AVAILABLE:
    Ames|John|123 Maple|Stillwater|OK|74075|
    64 Ham|Al|28 Elm|Ada|OK|70332| |
    45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
  –Reclaim the unused part of the deleted record:
    HEAD.FIRST_AVAILABLE:
    Ames|John|123 Maple|Stillwater|OK|74075|
    35 *| |
    Ham|Al|28 Elm|Ada|OK|70332|
    45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
Storage Fragmentation (cont’d)
External fragmentation: as records continue to be inserted, the remaining free space becomes too fragmented to be useful:
  –Insert a record of 25 bytes
    HEAD.FIRST_AVAILABLE:
    Ames|John|123 Maple|Stillwater|OK|74075|
    8 *| |
    Lee|Ed|Rt 2|Ada|OK|
    Ham|Al|28 Elm|Ada|OK|70332|
    45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
How to handle external fragmentation:
  –storage compaction: regenerate the file when external fragmentation becomes intolerable.
  –coalescing the holes: combine two record slots on the available list if they are physically adjacent.
  –placement strategy: adopt a placement strategy to minimize fragmentation.
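Coalescing can be sketched as a pass over the free slots sorted by position. This is a simplified illustration: slots are `(offset, size)` pairs with made-up offsets, and the function rebuilds the whole list instead of patching links in place.

```python
def coalesce(holes: list) -> list:
    """Merge free slots that are physically adjacent in the file.

    Two slots are adjacent when the first one ends exactly where the
    second one begins (offset + size == next offset).
    """
    merged = []
    for offset, size in sorted(holes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged
```

The sort is the expensive part: the available list is ordered by deletion time or by size, not by position, which is why finding physically adjacent slots takes extra work.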
Placement Strategies
First-fit placement strategy: use the first available slot that is large enough for the inserted record.
  –Least amount of work when we place a newly available slot on the list.
Best-fit placement strategy: use the smallest available slot that is large enough for the inserted record.
  –Order the available list in ascending order by size, then use the first-fit placement strategy.
  –After inserting the new record, the free area left over may be too small to be useful. This may cause serious external fragmentation.
  –The small free slots collect at the beginning of the available list, which makes the search for the first-fit slot increasingly long as time goes on.
Worst-fit placement strategy: use the largest available slot.
  –Order the available list in descending order by size, then use the first-fit placement strategy.
    »Always insert the new record into the first slot. If the first slot is not large enough, the new record is inserted at the end of the file.
    »Decreases the chance of external fragmentation.
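The three strategies differ mainly in how the available list is ordered, which the following Python sketch tries to show. Slots are represented by their sizes only, and the function names `release` and `allocate` and the strategy labels are assumptions for the illustration.

```python
from typing import Optional


def release(avail: list, size: int, strategy: str) -> None:
    """Put a newly freed slot on the available list."""
    if strategy == "first":
        avail.insert(0, size)        # cheapest: push on the front
    elif strategy == "best":
        avail.append(size)
        avail.sort()                 # ascending by size
    elif strategy == "worst":
        avail.append(size)
        avail.sort(reverse=True)     # descending by size
    else:
        raise ValueError(f"unknown strategy: {strategy}")


def allocate(avail: list, needed: int, strategy: str) -> Optional[int]:
    """Take a slot for a record of `needed` bytes, or None if none fits."""
    # Worst fit only ever looks at the first (largest) slot.
    candidates = avail[:1] if strategy == "worst" else avail
    for i, size in enumerate(candidates):
        if size >= needed:
            return avail.pop(i)
    return None


def demo(strategy: str, needed: int = 55) -> Optional[int]:
    avail = []
    for size in (47, 38, 72, 68):    # slots freed in this order
        release(avail, size, strategy)
    return allocate(avail, needed, strategy)
```

Because the ordering is maintained on release, allocation is the same first-fit scan in every case; best fit and worst fit pay for their policy when a slot is freed.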
Binary Search
Search by guessing.
  –Use the RRN to jump around the file.
Searching a file of n records:
  –the worst case: ⌊log2 n⌋ + 1 comparisons,
  –the average case: about ⌊log2 n⌋ + 1/2 comparisons.
Requirements
  –Works only for fixed-length records.
  –The records must be sorted on the search field.
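A minimal Python sketch of binary search over fixed-length records, using `seek` to jump to a record by its RRN. The record layout (an 8-byte key followed by 8 bytes of data, both space-padded) and the in-memory `io.BytesIO` file are assumptions made so the example is self-contained.

```python
import io

KEY_SIZE = 8       # assumed width of the key field
RECORD_SIZE = 16   # assumed total record length


def read_record(f, rrn: int) -> bytes:
    """Fixed-length records let us compute the byte offset from the RRN."""
    f.seek(rrn * RECORD_SIZE)
    return f.read(RECORD_SIZE)


def binary_search(f, n: int, key: bytes) -> int:
    """Return the RRN of the record with this key, or -1 if absent."""
    low, high = 0, n - 1
    while low <= high:
        mid = (low + high) // 2
        record_key = read_record(f, mid)[:KEY_SIZE].rstrip()
        if record_key == key:
            return mid
        if record_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1


def make_file(keys) -> io.BytesIO:
    """Build a sorted file of padded records for the demonstration."""
    records = [k.encode().ljust(KEY_SIZE) + b"data".ljust(RECORD_SIZE - KEY_SIZE)
               for k in sorted(keys)]
    return io.BytesIO(b"".join(records))
```

Each loop iteration costs one `seek` and one `read`, so on a real disk the number of comparisons is also roughly the number of disk accesses.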
Sorting a Disk File in RAM
If the records are not in order, they must be sorted before we can use binary search.
Consider any internal sorting algorithm: bubble sort, quick sort, bucket sort, etc.
  –If applied directly to data stored on disk, these algorithms require many disk accesses (seeking, rotational delay) and multiple passes over the list, which is extremely slow.
  –If the entire file fits into RAM, load the entire contents of the file into RAM and perform the internal sort there.
    »The records can be read sequentially.
    »Much faster if the file is stored sequentially.
    »This is an example of a general rule: minimize disk access cost by forcing disk accesses into a sequential mode and performing complex, direct access in memory.
Limitations of Binary Searching and Internal Sorting
Binary searching requires more than one or two disk accesses
  –When accessing records by relative record number (RRN), we can retrieve a record with a single disk access.
  –Ideally, we would combine RRN retrieval (single access) with search by key (ease of use).
Keeping a file sorted is very expensive
  –If record insertion is as frequent as record search, it is expensive to keep the records sorted.
  –The alternative is to keep the records unsorted and use sequential search.
An internal sort works only on small files
  –It is not possible to read all records of a large file into main memory.
  –Instead, load only the keys into main memory -- keysorting.
Keysorting
Load only the record keys into RAM.
The KEYNODES[ ] array has one entry per record, and each entry has two fields: KEY and RRN.
Each KEYNODES[ ] entry corresponds, through its RRN, to one record in the actual file.
The actual sorting process simply sorts the KEYNODES[ ] array by the KEY field.
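A short Python sketch of the idea, with the file modelled as a list of records indexed by RRN. The `KeyNode` type mirrors the KEY and RRN fields of KEYNODES[ ]; the `key_of` helper, which takes the first `|`-delimited field as the key, is an assumption for the example.

```python
from typing import Callable, NamedTuple


class KeyNode(NamedTuple):
    key: str
    rrn: int


def keysort(records: list, key_of: Callable[[str], str]) -> list:
    """Build one KeyNode per record and sort the nodes, not the records."""
    nodes = [KeyNode(key_of(record), rrn) for rrn, record in enumerate(records)]
    nodes.sort()  # tuples compare by key first, then by RRN
    return nodes


def key_of(record: str) -> str:
    """Assumed key: the first '|'-delimited field (the last name)."""
    return record.split("|")[0]


records = [
    "Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|",
    "Ames|John|123 Maple|Stillwater|OK|74075|",
    "Brown|Martha|625 Kimbark|Des Moines|IA|50311|",
]
sorted_nodes = keysort(records, key_of)
# Visiting the records in key order means following each node's RRN.
in_key_order = [records[node.rrn] for node in sorted_nodes]
```

Only the small key/RRN pairs move during the sort; the records themselves stay where they are and are reached through the RRN.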
Limitation of Keysorting
The keysort method requires two reads and one write for each record.
  –The first pass of reads can be done sequentially, sector by sector.
  –The second pass of reads cannot be done sequentially. It may require many random seeks.
  –Since the write operations interleave with the reads in the second pass, these writes also require separate seeks.
If only one copy of the records is kept on the disk, it is not an easy job to create a sorted version of the file from the KEYNODES[ ] array.
Solution:
  –Do not write the sorted file back to the disk.
  –Only write the KEYNODES[ ] array back to the disk, as an index file.