File Processing - Organizing file for Performance MVNC1 Organizing Files for Performance Chapter 6 Jim Skon.

1 Organizing Files for Performance
Chapter 6
Jim Skon

2 Organizing Files for Performance
• Data Compression
• Reclaiming space in files
• Fast Searching
• Keysorting

3 Data Compression
• Making files smaller
  » Use less storage, save space
  » Faster transmission
  » Processed faster
• Data compression
  » Encoding information more efficiently
  » Many techniques exist

4 Data Compression
• Consider fields with a fixed length or a fixed set of values
• A binary representation can save space
  » States: 50 values fit in 6 bits (one byte)
  » Zip: 0 to 99999 fits in 17 bits (three bytes)
• Called compact notation
  » Redundancy reduction
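
The bit counts above can be checked with a small sketch. The state table here is a hypothetical, abbreviated stand-in (a real one would list all 50 states), and the layout of one byte for the state plus three bytes for the zip code simply follows the slide:

```python
# Hypothetical, abbreviated state table; a real one would list all 50 states.
STATES = ["AL", "AK", "AZ", "OH", "OK"]


def pack(state: str, zip_code: int) -> bytes:
    """Store the state as a 1-byte index and the zip code as a 3-byte integer."""
    if not 0 <= zip_code <= 99999:
        raise ValueError("zip code out of range")
    index = STATES.index(state)              # 0..49 needs only 6 bits
    return bytes([index]) + zip_code.to_bytes(3, "big")


def unpack(data: bytes):
    """Reverse pack(): return (state, zip_code)."""
    return STATES[data[0]], int.from_bytes(data[1:4], "big")


text_form = b"OH43050"                       # 7 bytes as readable text
packed = pack("OH", 43050)                   # 4 bytes in compact notation
print(len(text_form), len(packed), unpack(packed))
```

The packed form is no longer readable as text, which is the cost discussed on the next slide.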

5 Data Compression
• Cost of binary representations
  » File is not readable as text
  » Processing time for conversion
  » All software must include appropriate, compatible encoding and decoding routines
  » Potential loss of flexibility

6 Data Compression
• Suppressing repeating sequences
  » Consider a picture
    – Series of pixels, each a color
    – Colors represented by an 8-bit value
    – Equal values usually come in bunches, e.g.
    – 24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66
  » Run-length encoding
    – Represent long runs with a prefix (FF) followed by the count, followed by the color
    – 24 23 FF 05 22 FF 06 25 65 65 FF 04 66
  » Simple images would be small; busy images would be no bigger, as long as FF is reserved and never used as a color.
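
A minimal sketch of the run-length scheme on this slide, treating the pixel values as hexadecimal bytes. Two details are assumptions not stated on the slide: runs shorter than four are stored literally, and a pixel whose value happens to be FF is always written as a run so the decoder cannot confuse it with the prefix.

```python
RUN_MARKER = 0xFF   # prefix byte announcing a run, as on the slide
MIN_RUN = 4         # assumption: shorter runs are stored literally


def rle_encode(pixels: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(pixels):
        value = pixels[i]
        run = 1
        # Count how many times the value repeats (a count must fit in one byte).
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        if run >= MIN_RUN or value == RUN_MARKER:
            out += bytes([RUN_MARKER, run, value])   # prefix, count, color
        else:
            out += bytes([value]) * run              # copy short runs as-is
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == RUN_MARKER:
            count, value = data[i + 1], data[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


pixels = bytes.fromhex("24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66")
encoded = rle_encode(pixels)
print(encoded.hex().upper())   # 2423FF0522FF06256565FF0466
```

Under these assumptions the 19 input bytes shrink to 13, matching the encoded sequence shown on the slide.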

7 Data Compression
• Assigning variable-length codes
  » Some values are more likely than others
  » Use shorter codes for frequently used values, longer ones for less frequently used values
  » Each code must have the unique-prefix property
    – No code is the prefix of any other code
    – Thus we always know when we have reached the end of a given code

8 Variable length codes
• Example:
  Letter:  a    b    c    d     e     f     g
  Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
  Code:    1    010  011  0000  0001  0010  0011
• Can be decoded with a binary tree!
• Called Huffman code
  » Algorithm exists to easily create an optimal code
  » Requires that a table of codes be maintained with the file
  » Most often used for fixed codes
  » Example: Group 3 FAX
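
The remark about decoding with a binary tree can be sketched as follows. The code table is the one from the example on this slide; the nested-dictionary tree and the function names are illustrative choices, not part of the original material.

```python
CODES = {"a": "1", "b": "010", "c": "011", "d": "0000",
         "e": "0001", "f": "0010", "g": "0011"}
PROBS = {"a": 0.4, "b": 0.1, "c": 0.1, "d": 0.1, "e": 0.1, "f": 0.1, "g": 0.1}


def build_tree(codes):
    """Build a binary tree as nested dicts; leaves are the letters."""
    root = {}
    for letter, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = letter
    return root


def encode(text):
    return "".join(CODES[ch] for ch in text)


def decode(bits, tree):
    out = []
    node = tree
    for bit in bits:
        node = node[bit]               # follow the 0 or 1 branch
        if isinstance(node, str):      # reached a leaf: one full code consumed
            out.append(node)
            node = tree                # restart at the root for the next code
    return "".join(out)


tree = build_tree(CODES)
bits = encode("abefd")
average_bits = sum(PROBS[ch] * len(CODES[ch]) for ch in CODES)
print(bits, decode(bits, tree), round(average_bits, 2))
```

Because no code is a prefix of another, the decoder never needs a separator between codes. With these probabilities the average is about 2.6 bits per letter, compared with 3 bits for a fixed-length code over seven letters.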

9 Data Compression
• Irreversible compression
  » Compression which loses some information
  » Example: compress a 400x400 image into a 100x100 image by averaging groups of 16 adjacent pixels
  » Saves space, but the resolution of the picture is reduced
  » Used most often for visual or audio information (which has inherent redundancy)
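
The 400x400 to 100x100 example can be sketched as below, assuming the 16 adjacent pixels form a 4x4 block and that the image is a list of rows of integer gray values (both are assumptions; the slide does not specify the layout).

```python
def downsample(image, block=4):
    """Average each block x block group of pixels into one output pixel.

    Assumes the image dimensions are exact multiples of `block`.
    """
    height, width = len(image), len(image[0])
    result = []
    for r in range(0, height, block):
        row = []
        for c in range(0, width, block):
            total = sum(image[r + dr][c + dc]
                        for dr in range(block) for dc in range(block))
            row.append(total // (block * block))
        result.append(row)
    return result


# Synthetic 400x400 test image with values 0..255.
image = [[(r + c) % 256 for c in range(400)] for r in range(400)]
small = downsample(image)
print(len(small), len(small[0]))   # 100 100
```

The original pixels cannot be recovered from the averages, which is what makes the compression irreversible.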

10 Data Compression
• Compression in UNIX
  » pack and unpack programs
    – Use Huffman coding
    – 25% to 40% savings on text files
    – Much less on binary files
    – Use the ".z" file suffix
  » compress and uncompress programs
    – Use Lempel-Ziv compression
    – No coding table needed: self-coding
    – Use the ".Z" file suffix

11 Reclaiming space in files
• Suppose a variable-length record in the middle of a file is modified so it is:
  » Longer?
  » Shorter?
• Suppose a record is
  » Added to the middle?
  » Deleted from the middle?

12 Reclaiming space in files
• Record deletion and storage compaction
• Storage compaction
  » Recovering unused space in a file
  » Space freed by deletion or by a change in record size
• Consider deleted records
  » Must be able to recognize deleted records
  » Have a special mark for the record
    – e.g., an asterisk in the first character of the key field
    – May be undeleted if not overwritten!

13 Dealing with Deleted records
• Occasional compaction
• Dynamic maintenance

14 Occasional compaction
• A process, run periodically, which reads the file and rewrites it with no empty space.
• Could run automatically every night, week, or month.
• File is unavailable while the operation is underway.

15 Dynamic maintenance
• Delete records by marking
• Reuse deleted records as new records are added or updated
• Need:
  » A way of knowing if deleted records exist
  » Where the deleted records are, so we can jump right to them

16 Dynamic maintenance
• Solution: linked list of deleted records
  » Each deleted record contains a mark and a pointer to the next deleted record
  » The file header contains a pointer to the first deleted record

17 Linked list of deleted records
• Fixed-length records
• Variable-length records

18 Linked list of deleted records
• Fixed-length records
  » Simply maintain a stack of deleted records rooted in the header record
  » Deletion: add to front of list
  » Addition: use record at front of list
  » Minimal list maintenance cost
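
A minimal sketch of the stack of deleted fixed-length records. The record size, the 4-byte header, the use of -1 for "no deleted records", and the class and method names are all assumptions made for illustration; an in-memory `io.BytesIO` stands in for a real file.

```python
import io
import struct

RECORD_SIZE = 16                 # assumed fixed record length in bytes
HEADER = struct.Struct("<i")     # RRN of the first deleted record, -1 if none
LINK = struct.Struct("<ci")      # deletion mark b"*" plus RRN of next deleted record


class FixedFile:
    def __init__(self):
        self.f = io.BytesIO()    # stands in for a real file opened for update
        self._write_head(-1)

    def _write_head(self, rrn):
        self.f.seek(0)
        self.f.write(HEADER.pack(rrn))

    def _read_head(self):
        self.f.seek(0)
        return HEADER.unpack(self.f.read(HEADER.size))[0]

    def _seek(self, rrn):
        self.f.seek(HEADER.size + rrn * RECORD_SIZE)

    def delete(self, rrn):
        """Push the slot on the front of the list of deleted records."""
        old_head = self._read_head()
        self._seek(rrn)
        self.f.write(LINK.pack(b"*", old_head).ljust(RECORD_SIZE, b" "))
        self._write_head(rrn)

    def add(self, data):
        """Reuse the slot at the front of the list, or append at the end."""
        rrn = self._read_head()
        if rrn != -1:
            self._seek(rrn)
            _, next_rrn = LINK.unpack(self.f.read(LINK.size))
            self._write_head(next_rrn)          # pop the slot off the list
        else:
            end = self.f.seek(0, io.SEEK_END)
            rrn = (end - HEADER.size) // RECORD_SIZE
        self._seek(rrn)
        self.f.write(data.ljust(RECORD_SIZE, b" "))
        return rrn

    def read(self, rrn):
        self._seek(rrn)
        return self.f.read(RECORD_SIZE).rstrip(b" ")


f = FixedFile()
for name in (b"Ames", b"Brown", b"Cole"):
    f.add(name)                  # stored at RRN 0, 1, 2
f.delete(0)
f.delete(2)                      # list of deleted slots is now 2 -> 0
reused = [f.add(b"Dunn"), f.add(b"Evans"), f.add(b"Ford")]
print(reused)                    # [2, 0, 3]
```

Both deletion and addition touch only the header and one record, which is why the list maintenance cost is minimal.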

19 Linked list of deleted records
• Variable-length records
  » Store for each deleted record
    – Deletion marker
    – Link to next deleted record
    – Record size indicator

20 Variable-length records
• Insertion
  » Which deleted record?
• Deletion
  » Add records to list (stack?)
  » Where?

21 Variable-length records - Insertion
• Select and use a deleted record
• Break up records
  » Pick a record
  » If the deleted record is bigger than needed, break it into two: a record to use and a new, smaller, deleted record
  » Put the smaller deleted record back in the list
• Leave empty space at end
  » Pick a record
  » If the deleted record is bigger than needed, just leave empty space at the end
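
The two options on this slide can be sketched with the list of deleted records modelled as a Python list of `(offset, size)` pairs. This representation and the function name `use_slot` are illustrative assumptions; a real file would also need the leftover piece to be large enough to hold its own deletion mark, link, and size.

```python
def use_slot(avail, index, need, split=True):
    """Take `need` bytes from the deleted record avail[index].

    Returns (offset, wasted_bytes). With split=True the unused tail goes
    back on the list as a smaller deleted record; with split=False it is
    left as empty space at the end of the new record.
    """
    offset, size = avail[index]
    if size < need:
        raise ValueError("deleted record is too small")
    del avail[index]
    leftover = size - need
    if split and leftover > 0:
        avail.insert(0, (offset + need, leftover))   # put remainder back
        return offset, 0
    return offset, leftover


avail = [(100, 40), (300, 25)]
print(use_slot(avail, 0, 30), avail)                 # (100, 0) [(130, 10), (300, 25)]

avail_no_split = [(100, 40)]
print(use_slot(avail_no_split, 0, 30, split=False))  # (100, 10)
```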

22 Variable-length records - Fragmentation
• Recall fragmentation in fixed-length records
  » At the end of fields, with fixed-length fields
  » At the end of records, with variable-length fields
  » Called internal fragmentation
• Leaving space at the end of a variable-length record also leads to internal fragmentation.
• Breaking up variable-length records gets rid of fragmentation, right? Wrong!

23 Variable-length records - Fragmentation
• As records get broken up, smaller and smaller pieces get left over.
• These pieces are external fragmentation

24 Variable-length records - Insertion strategy
• How to pick the record to use?
• First Fit
  » Use the first deleted record found in the list that is large enough
• Best Fit
  » Use the deleted record closest in size
• Worst Fit
  » Use the deleted record that is largest
  » No good when not breaking up records!
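
The three strategies can be compared on a small example. As an illustrative assumption, the list of deleted records is modelled as `(offset, size)` pairs in list order, and each function returns the chosen pair or `None` if nothing is large enough.

```python
def first_fit(avail, need):
    """Return the first deleted record that is large enough."""
    for slot in avail:
        if slot[1] >= need:
            return slot
    return None


def best_fit(avail, need):
    """Return the smallest deleted record that is still large enough."""
    fits = [slot for slot in avail if slot[1] >= need]
    return min(fits, key=lambda slot: slot[1], default=None)


def worst_fit(avail, need):
    """Return the largest deleted record, if it is large enough."""
    largest = max(avail, key=lambda slot: slot[1], default=None)
    if largest is not None and largest[1] >= need:
        return largest
    return None


avail = [(0, 30), (100, 12), (200, 50), (300, 20)]
print(first_fit(avail, 15))   # (0, 30)
print(best_fit(avail, 15))    # (300, 20)
print(worst_fit(avail, 15))   # (200, 50)
```

Best fit and worst fit scan the whole list here, which is the "search them all" cost raised on the next slide.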

25 Variable-length records - Insertion
• How do we find the record with the desired size?
  » Search them ALL!
  » Keep the records in sorted order by record size
    – Increasing size facilitates best fit
    – Decreasing size facilitates worst fit (just pick the first in the list)
    – This increases deletion time!

26 Variable-length records - Reducing fragmentation
• Merge adjacent free records
• How do we know if a newly deleted record is adjacent to a free record?
  » Search the deleted list
  » Keep deleted records sorted by position in file
    – This makes finding adjacent free space trivial
    – Costs more at deletion time
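
A sketch of merging with the list kept sorted by position in the file. The `(offset, size)` representation and the function name `free` are assumptions for illustration.

```python
import bisect


def free(avail, offset, size):
    """Add a newly deleted record to `avail`, a list of (offset, size)
    pairs sorted by offset, merging it with adjacent free records."""
    i = bisect.bisect_left(avail, (offset, size))
    # Merge with the free record just before it, if the two touch.
    if i > 0 and avail[i - 1][0] + avail[i - 1][1] == offset:
        i -= 1
        offset, size = avail[i][0], avail[i][1] + size
        del avail[i]
    # Merge with the free record just after it, if the two touch.
    if i < len(avail) and offset + size == avail[i][0]:
        size += avail[i][1]
        del avail[i]
    avail.insert(i, (offset, size))


avail = [(0, 10), (30, 10)]
free(avail, 10, 20)      # fills the gap between the two free records
print(avail)             # [(0, 40)]
```

Because the list is sorted, only the immediate neighbours need to be checked; the extra cost is finding the insertion point at deletion time.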

27 Fast Searching
• Binary Searching
  » O(log n), where n is the number of records
  » Requires that the file be sorted
• Question: how do we sort the file?
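
A sketch of binary search over a sorted file of fixed-length records, counting how many records are read. The record layout (a 4-byte key followed by 8 bytes of data) is a made-up example, and `io.BytesIO` stands in for a file on disk.

```python
import io

RECORD_SIZE = 12   # assumed layout: 4-byte key + 8 bytes of data
KEY_SIZE = 4


def binary_search(f, n_records, key):
    """Return (rrn, reads); rrn is None if the key is not in the file."""
    low, high = 0, n_records - 1
    reads = 0
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * RECORD_SIZE)          # one seek and one read per probe
        record_key = f.read(RECORD_SIZE)[:KEY_SIZE]
        reads += 1
        if record_key == key:
            return mid, reads
        if record_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return None, reads


# 1000 records with sorted keys 0000, 0002, ..., 1998.
keys = [b"%04d" % k for k in range(0, 2000, 2)]
f = io.BytesIO(b"".join(k + b"payload!" for k in keys))

print(binary_search(f, len(keys), b"0500"))
print(binary_search(f, len(keys), b"0501"))
```

Each probe lands at a different place in the file, so on a disk every probe may cost a seek.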

28 File Sorting
• Sort in RAM
  » Read in the entire file, then sort
  » Called internal sorting
  » Limited by size of memory

29 Binary Search - Problems
• Binary searching requires more than one or two accesses
  » Accesses are VERY expensive
  » Accesses are very random (much seek time)
  » 100,000 records require an average of 16.5 accesses
  » We would like to approach the speed of a direct lookup!

30 Binary Search - Problems
• Keeping a file sorted is expensive
  » Every record added must be entered in sorted order
  » Reordering is costly
• Internal sorting is limited to small files
  » We will see that there are methods to sort a file that will not fit in memory. But it is still expensive!

31 Keysorting
• Rather than sorting the file, we could sort an array of primary keys, where each key is accompanied by the address of the associated record.
• The pointer could be a byte offset from the start, or (if records are fixed length) an RRN.
• After sorting the keys, the file can be rewritten in order.
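
A sketch of keysorting for fixed-length records, using an RRN as the pointer. The record layout (a 6-byte key at the start of a 16-byte record) and the names are assumptions; `io.BytesIO` objects stand in for the input and output files.

```python
import io

RECORD_SIZE = 16   # assumed fixed record length
KEY_SIZE = 6       # assumed: the key occupies the first 6 bytes


def keysort(src, dst, n_records):
    """Sort (key, RRN) pairs in memory, then rewrite the file in key order."""
    keynodes = []
    for rrn in range(n_records):
        src.seek(rrn * RECORD_SIZE)
        keynodes.append((src.read(KEY_SIZE), rrn))   # read only the key
    keynodes.sort()                                  # swaps small pairs, not records
    for _, rrn in keynodes:
        src.seek(rrn * RECORD_SIZE)                  # one seek per record
        dst.write(src.read(RECORD_SIZE))
    return keynodes


names = [b"SMITH", b"ADAMS", b"JONES"]
src = io.BytesIO(b"".join(n.ljust(KEY_SIZE) + b"x" * 10 for n in names))
dst = io.BytesIO()
keynodes = keysort(src, dst, len(names))
print(keynodes)
```

The second loop reads the source in key order rather than file order, which is the seek-heavy access pattern noted under the disadvantages.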

32 Keysorting
• Advantages
  » Keys can be sorted in a smaller space than the whole file
  » Faster to sort (swap!) keys than entire records

33 Keysorting
• Disadvantages
  » Still limited in size to key lists which fit in memory
  » Sequential processing cannot take advantage of buffering!

34 Keysorting
• Alternative: keep the sorted list of (key, pointer) pairs around.
• It is a type of index file!
• Can be read in and searched in memory!

35 Key Sorted Index
• Advantages
  » Keys and pointers can be searched in memory. Only one I/O per lookup!
  » File can be maintained in ANY order. Searching and key-order sequential processing are still possible.
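
A sketch of a key-sorted index held in memory over a data file kept in arrival order. The length-prefixed record format, the sample records, and the function name `lookup` are made up for illustration; `io.BytesIO` stands in for the data file.

```python
import bisect
import io

data = io.BytesIO()        # data file, records stored in arrival order
index = []                 # (key, byte offset) pairs

for key, body in [("SMITH", "Smith|Ann|Dayton"),
                  ("ADAMS", "Adams|Bob|Akron"),
                  ("JONES", "Jones|Cy|Toledo")]:
    offset = data.tell()
    record = body.encode()
    data.write(len(record).to_bytes(2, "big") + record)   # 2-byte length prefix
    index.append((key, offset))

index.sort()                            # only the small index is sorted
keys = [key for key, _ in index]


def lookup(key):
    """Binary-search the in-memory index, then do one seek into the file."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    data.seek(index[i][1])
    length = int.from_bytes(data.read(2), "big")
    return data.read(length).decode()


print(lookup("JONES"))                  # Jones|Cy|Toledo
print([lookup(k) for k in keys])        # key-order sequential processing
```

The offsets stored in the index are only valid while records stay where they are, which is the pinned-record problem on the next slide.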

36 Key Sorted Index
• Disadvantages
  » Sequential processing cannot take advantage of buffering!
  » Pinned records
    – Records in the main file cannot change location without invalidating the index file!
    – Must either maintain the index in parallel, or rebuild it!

