

1 File Structure
- File as a stream of characters
  - No structure
  - Consider students registered in a course:
    320587Joe SmithSC953184923Kathy LiEN923249793Albert ChanSC943
- File as a structured collection of related data
  - A set of related data forms a record; a file consists of records
  - Information about each student forms a record:
    320587Joe SmithSC953
    184923Kathy LiEN923
    249793Albert ChanSC943
  - What is the meaning of each piece of information about each student?

2 DBMS Structures & Files
  DBMS Structures    File Structures
  Attribute          Field
  Tuple              Record
  Relation           File

3 Fields
- Each record consists of a set of fields
- Fields separate data units
  - Identification of the pieces of data in a record
    - 320587Joe SmithSC953
- Usually the same fields exist in all records in a file

4 Field Separation Alternatives
- Fixed length fields
  - A given field (e.g., NAME) is the same size for all records
  - Easy and fast reading but wastes space
    320587Joe SmithSC953
    184923Kathy LeeEN923
    249793Albert ChanSC943
- Length indicator at the beginning of each field
  - Also wastes space (at least 1 byte per field)
  - You have to know the length before you store
    63205879Joe Smith2SC29513
    61849238Kathy Li2EN29213
    624979311Albert Chan2SC29413
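The length-indicator layout above can be sketched in a few lines. This is a minimal illustration, assuming each indicator is a single binary byte (so a field holds at most 255 bytes) and that fields are ASCII text; the function names `encode_fields` and `decode_fields` are hypothetical and not from the slides.

```python
def encode_fields(fields):
    """Join text fields, each preceded by a one-byte length indicator."""
    out = bytearray()
    for field in fields:
        data = field.encode("ascii")
        if len(data) > 255:
            raise ValueError("field too long for a one-byte length indicator")
        out.append(len(data))  # the length must be known before the field is stored
        out.extend(data)
    return bytes(out)


def decode_fields(record):
    """Split a length-prefixed record back into its text fields."""
    fields = []
    pos = 0
    while pos < len(record):
        length = record[pos]          # one-byte length indicator
        start = pos + 1
        end = start + length
        if end > len(record):
            raise ValueError("truncated field")
        fields.append(record[start:end].decode("ascii"))
        pos = end
    return fields


record = encode_fields(["249793", "Albert Chan", "SC", "94", "3"])
print(len(record))            # 22 data bytes plus 5 length bytes
print(decode_fields(record))
```

The five extra bytes are the space overhead the slide mentions: one byte per field, regardless of how short the field is.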

5 Field Separation Alternatives
- Separate fields with delimiters
  - Use white space characters (blank, new line, tab)
  - Easy to read and uses one byte per field, but the delimiter must be chosen carefully so that it never occurs inside a field
    |320587|Joe Smith|SC|95|3|
    |184923|Kathy Li|EN|92|3|
    |249793|Albert Chan|SC|94|3|
- Use keywords
  - Each field has a keyword that indicates what the field is
  - Self-describing but high space overhead
    ID=320587NAME=Joe SmithFACULTY=SCDEG=95YEAR=3
    ID=184923NAME=Kathy LiFACULTY=ENDEG=92YEAR=3
    ID=249793NAME=Albert ChanFACULTY=SCDEG=94YEAR=3
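As a rough sketch of the delimiter approach, the following packs and unpacks records in the `|field|field|` layout shown in the example. The helper names are hypothetical, and the sketch assumes the simplest policy for the delimiter problem: reject any field that contains the delimiter.

```python
DELIMITER = "|"


def pack_delimited(fields):
    """Build a record of the form |f1|f2|...|fn|."""
    for field in fields:
        if DELIMITER in field:
            # The delimiter must never occur inside a field.
            raise ValueError("field contains the delimiter: %r" % field)
    return DELIMITER + DELIMITER.join(fields) + DELIMITER


def unpack_delimited(record):
    """Split a |f1|f2|...|fn| record into its fields."""
    if len(record) < 2 or record[0] != DELIMITER or record[-1] != DELIMITER:
        raise ValueError("record must start and end with the delimiter")
    return record[1:-1].split(DELIMITER)


print(unpack_delimited("|320587|Joe Smith|SC|95|3|"))
print(pack_delimited(["184923", "Kathy Li", "EN", "92", "3"]))
```

Real formats often escape the delimiter instead of rejecting it; that choice is outside what the slides cover.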

6 Record Organization Alternatives
- Fixed length records
  - All records are the same length
    320587Joe SmithSC953
    184923Kathy LeeEN923
    249793Albert ChanSC943
  - The number and size of fields in each record may be variable
    |320587|Joe Smith|SC|95|3|    Padding
    |184923|Kathy Li|EN|92|3|     Padding
    |249793|Albert Chan|SC|94|3|  Padding

7 Record Organization Alternatives
- Variable Length Records
  - Fixed number of fields
    - Count the fields to detect the end of record
  - Length field at the beginning
    - Put the length of each record in front of it
    - You have to buffer the record before writing
    24320587|Joe Smith|SC|95|323184923|Kathy Li|EN|92|326249793|Albert Chan|SC|94|3
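The length-field layout can be read back with a simple loop. The sketch below assumes the length is written as exactly two ASCII digits, which is how the example stream above appears to be laid out (24, 23, 26); the function names are illustrative only.

```python
LENGTH_DIGITS = 2  # assumption: record length stored as two ASCII digits


def write_records(records):
    """Concatenate records, each preceded by its length."""
    parts = []
    for rec in records:
        if len(rec) >= 10 ** LENGTH_DIGITS:
            raise ValueError("record too long for the length field")
        # The whole record is buffered first so its length is known.
        parts.append(f"{len(rec):0{LENGTH_DIGITS}d}{rec}")
    return "".join(parts)


def read_records(stream):
    """Split a length-prefixed stream back into records."""
    records = []
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + LENGTH_DIGITS])
        pos += LENGTH_DIGITS
        records.append(stream[pos:pos + length])
        pos += length
    return records


stream = ("24320587|Joe Smith|SC|95|3"
          "23184923|Kathy Li|EN|92|3"
          "26249793|Albert Chan|SC|94|3")
for rec in read_records(stream):
    print(rec)
```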

8 Record Organization Alternatives
- Variable Length Records (cont'd)
  - Index the beginning
    - Build a secondary index that shows where each record begins
    Data:  320587|Joe Smith|SC|95|3184923|Kathy Li|EN|92|3249793|Albert …
    Index: 00 24 47
  - End-of-record markers
    - Put a special end-of-record marker
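The offsets 00, 24 and 47 in the example follow directly from the record lengths. A minimal sketch, with hypothetical helper names, that builds such an offset index and uses it to fetch one record:

```python
def build_offset_index(records):
    """Return the starting offset of each record in the concatenated data."""
    offsets = []
    pos = 0
    for rec in records:
        offsets.append(pos)
        pos += len(rec)
    return offsets


def fetch(data, offsets, i):
    """Return record i; it ends where the next record begins."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[start:end]


records = ["320587|Joe Smith|SC|95|3",
           "184923|Kathy Li|EN|92|3",
           "249793|Albert Chan|SC|94|3"]
data = "".join(records)
offsets = build_offset_index(records)
print(offsets)                  # [0, 24, 47]
print(fetch(data, offsets, 1))
```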

9 Summary
[Diagram: a file system consists of files; a file consists of a header and records; a record consists of fields.]

10 Accessing a File
- Sequential access
  - Based on key values
  - Useful when file is small or most (all) of the file needs to be searched
  - Complexity O(n) where n is the number of disk reads
  - Block records to reduce n
  - Block size should match physical disk organization
    - multiples of sector size
- Direct access
  - Based on relative record number (RRN)
  - Record-based file systems can jump to the record directly
  - Stream-based systems calculate byte offset = RRN * record length
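The byte-offset calculation for direct access on a stream-based system might look like the sketch below. It assumes records are padded with spaces to a fixed length of 24 bytes (an arbitrary choice for this illustration), uses 0-based RRNs, and uses an in-memory `io.BytesIO` in place of a real file.

```python
import io

RECORD_LEN = 24  # assumption: every record is padded to 24 bytes


def pack(text):
    """Pad a record with spaces to the fixed record length."""
    data = text.encode("ascii")
    if len(data) > RECORD_LEN:
        raise ValueError("record longer than RECORD_LEN")
    return data.ljust(RECORD_LEN, b" ")


def read_by_rrn(f, rrn):
    """Jump straight to a record: byte offset = RRN * record length."""
    f.seek(rrn * RECORD_LEN)
    data = f.read(RECORD_LEN)
    if len(data) < RECORD_LEN:
        raise IndexError("no record at RRN %d" % rrn)
    return data.rstrip(b" ").decode("ascii")


students = ["320587Joe SmithSC953",
            "184923Kathy LeeEN923",
            "249793Albert ChanSC943"]
f = io.BytesIO(b"".join(pack(s) for s in students))
print(read_by_rrn(f, 2))
```

One seek and one read replace the O(n) scan of sequential access, at the cost of the padding bytes.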

11 Header Records
- May be the same or different length than the rest of the records in the file
- May contain information about the file
  - Number of records
  - Size of records
  - Date of file creation
  - Date of last file modification
  - Name of file creator/owner
  - Meta information
    - Formats of data
    - Origin of data
    - Units used
    - …

12 File Organization Issues
Primary concern: Organizing files for improving performance
- Data compression
- Reclaiming space in files
- Search and sorting
- Indexing

13 Data Compression
Encoding information to reduce size of files
- Reversible compression
  - redundancy reduction
    - short notations: AB for Alberta
  - suppressing repeating sequences (images)
    - 22 23 24 24 24 24 24 24 25 becomes 22 23 ff 24 06 25
  - variable length coding (Huffman)
    - most frequently used letters get the shortest codes
- Irreversible compression
  - from GIF to JPEG
  - saves 20 to 90%
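The repeating-sequence example (six copies of 24 replaced by `ff 24 06`) is a form of run-length encoding. The sketch below reproduces that example under some assumptions that the slides do not state: the marker byte 0xFF never occurs in the input, only runs of at least four bytes are encoded, and a run is capped at 255 so the count fits in one byte.

```python
MARKER = 0xFF   # run marker, as in the example
MIN_RUN = 4     # assumption: shorter runs are cheaper to store as-is


def rle_encode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        if value == MARKER:
            raise ValueError("input must not contain the marker byte")
        run = 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run >= MIN_RUN:
            out += bytes([MARKER, value, run])   # marker, value, count
        else:
            out += bytes([value]) * run
        i += run
    return bytes(out)


def rle_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == MARKER:
            value, count = data[i + 1], data[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


raw = bytes.fromhex("222324242424242425")
packed = rle_encode(raw)
print(packed.hex(" "))             # 22 23 ff 24 06 25
print(rle_decode(packed) == raw)   # True
```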

14 Reclaiming Space in Files
- File updates
  - record addition
  - record deletion
  - record modification
- Requirements
  - how to recognize deleted records:
    - tombstone: *
  - how to utilize space left by deleted records
    - storage compaction
      - reconstruct the file to reclaim space occupied by all deleted records
      - how often?
    - Available List

15 Available List
Consider fixed length records
- Available list is a linked list of deleted records
- Implemented as a stack
- Use relative record number (RRN) for physical addresses
[Figure: records at RRNs 0 to 7 (Adam, Barb, Peter, Susan, Brenda, Sue, Tim, Jack). With Susan deleted, the list head is 3. With Tim also deleted, the list head is 6 and slot 6 links to 3. After Tamer is inserted into the freed slot, the list head is 3 again.]
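A small in-memory sketch of the available list as a stack of RRNs. The class name `SlotFile`, the use of -1 for an empty list, and the `("*", next_rrn)` tuple standing in for a tombstoned slot are all assumptions made for this illustration; a real file would store the tombstone and link inside the record bytes.

```python
class SlotFile:
    """Fixed-length record slots with a stack-style available list."""

    def __init__(self, records):
        self.slots = list(records)
        self.head = -1  # -1 means the available list is empty

    def delete(self, rrn):
        if isinstance(self.slots[rrn], tuple):
            raise ValueError("slot %d is already deleted" % rrn)
        # Tombstone the slot and link it to the previous list head (push).
        self.slots[rrn] = ("*", self.head)
        self.head = rrn

    def insert(self, record):
        if self.head == -1:
            # No free slot: append at the end of the file.
            self.slots.append(record)
            return len(self.slots) - 1
        rrn = self.head
        _, self.head = self.slots[rrn]   # pop the most recently freed slot
        self.slots[rrn] = record
        return rrn


f = SlotFile(["Adam", "Barb", "Peter", "Susan", "Brenda", "Sue", "Tim", "Jack"])
f.delete(3)                 # Susan: list head becomes 3
f.delete(6)                 # Tim: list head becomes 6, slot 6 links to 3
print(f.insert("Tamer"))    # reuses slot 6
print(f.head)               # back to 3
```

Because the list is a stack, both deletion and insertion touch only the list head, so neither needs to scan the file.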

16 Variable Length Records Case
- Problems
  - RRNs cannot be used
  - Fitting
  - Fragmentation
    - internal fragmentation: occurs if variable length records are stored in fixed size slots with padding
    - external fragmentation: split record leftover may be too small to hold any record
- Solutions
  - An available list with the byte offset
  - Placement strategies
  - Storage compaction
  - Coalescing holes
    - combining adjacent slots to form a bigger one
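Coalescing can be sketched as a single pass over the holes sorted by byte offset. Here a hole is represented as an `(offset, size)` tuple, which is an assumption of this example rather than something the slides specify.

```python
def coalesce(holes):
    """Merge holes that touch each other into larger holes.

    holes: iterable of (byte_offset, size) tuples, in any order.
    """
    merged = []
    for offset, size in sorted(holes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This hole starts exactly where the previous one ends.
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged


print(coalesce([(47, 26), (0, 24), (24, 23)]))   # one hole of 73 bytes
print(coalesce([(0, 10), (20, 5)]))              # not adjacent, unchanged
```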

17 Placement Strategies
- First fit
  - unsorted list, the newly deleted record is put at the front
  - insertion uses the first one on the list that fits
- Best fit
  - the list is sorted in ascending order
  - insertion uses the first one on the list that fits
  - too much fragmentation
- Worst fit
  - the list is sorted in descending order
  - insertion always uses the first one if possible
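The three strategies differ only in which hole they pick. The sketch below selects from an unsorted list of `(offset, size)` holes by scanning it, rather than by keeping the list sorted as the slide describes; the chosen hole is the same, but the sorted-list versions avoid the full scan for best fit and worst fit. All names are illustrative.

```python
def first_fit(holes, size):
    """Index of the first hole in list order that is large enough."""
    for i, (_, hole_size) in enumerate(holes):
        if hole_size >= size:
            return i
    return None


def best_fit(holes, size):
    """Index of the smallest hole that is large enough."""
    candidates = [i for i, (_, s) in enumerate(holes) if s >= size]
    return min(candidates, key=lambda i: holes[i][1], default=None)


def worst_fit(holes, size):
    """Index of the largest hole, if it is large enough."""
    candidates = [i for i, (_, s) in enumerate(holes) if s >= size]
    return max(candidates, key=lambda i: holes[i][1], default=None)


holes = [(0, 30), (100, 24), (200, 50)]   # (byte offset, size)
print(first_fit(holes, 24))   # 0: leaves a 6-byte leftover
print(best_fit(holes, 24))    # 1: exact fit
print(worst_fit(holes, 24))   # 2: leaves a 26-byte leftover
```

Best fit tends to leave leftovers too small to reuse, which is the fragmentation the slide warns about; worst fit leaves the largest possible leftover.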

18 Search Problem
Find a record with a given key value
- Sequential search: O(n)
- Binary search: O(log n)
  - the file must be sorted
  - how to maintain the sorting order?
    - deletion, insertion
  - variable length records
- Sorting
  - RAM sort: read the whole file into RAM, sort it, and then write it back to disk
  - Keysort: read the keys into RAM, sort keys in RAM and then rearrange records according to sorted keys
  - Index

19 Keysorting
Records in the file (same order before and after the key sort):
  320587Joe SmithSC953
  184923Kathy LeeEN923
  249793Albert ChanSC943
Keys before sorting:
  Key     RRN
  320587  1
  184923  2
  249793  3
Keys after sorting:
  Key     RRN
  184923  2
  249793  3
  320587  1
Problem: Now the physical file has to be rearranged
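A keysort over the three example records might be sketched as follows. RRNs start at 1 to match the table; the `key_of` parameter and the assumption that the key is the first six characters are choices made for this illustration.

```python
def keysort(records, key_of):
    """Sort (key, RRN) pairs in RAM; the records themselves are not moved."""
    return sorted((key_of(rec), rrn) for rrn, rec in enumerate(records, start=1))


records = ["320587Joe SmithSC953",
           "184923Kathy LeeEN923",
           "249793Albert ChanSC943"]

pairs = keysort(records, lambda rec: rec[:6])
print(pairs)

# Rearranging the physical file means reading the records in sorted-key
# order, which costs one seek per record on a real disk.
rearranged = [records[rrn - 1] for _, rrn in pairs]
print(rearranged)
```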

20 Indexing
- A tool used to find things
  - book index, student record indexes
  - A function from keys to addresses
- A record consisting of two fields
  - key: on which the index is searched
  - reference: location of data record associated with the key
- Advantages
  - smaller size of the index file makes RAM index possible
  - binary search from files of variable length records
  - rearrange keys without moving records
  - multiple indexes
    - primary and secondary
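One way to picture an index as "a function from keys to addresses" is a pair of parallel sorted lists searched with the standard `bisect` module. The class name `Index` and its methods are hypothetical; the references are byte offsets into a string of variable-length records, so binary search works even though the records have different lengths.

```python
import bisect


class Index:
    """In-RAM index: sorted keys with a parallel list of references."""

    def __init__(self):
        self.keys = []
        self.refs = []

    def add(self, key, ref):
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            raise KeyError("duplicate key: %r" % key)
        # Only the small index entry moves; the data record stays put.
        self.keys.insert(pos, key)
        self.refs.insert(pos, ref)

    def find(self, key):
        """Binary search on the keys; return the reference or None."""
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.refs[pos]
        return None


data = ("320587|Joe Smith|SC|95|3"
        "184923|Kathy Li|EN|92|3"
        "249793|Albert Chan|SC|94|3")
index = Index()
for key, offset in [("320587", 0), ("184923", 24), ("249793", 47)]:
    index.add(key, offset)

offset = index.find("184923")
print(offset, data[offset:offset + 23])
```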

21 Operations With an Indexed File
- Create original index and data file
- Load index file into RAM before using it
- Rewrite index file after using it
  - file header
- Update
  - insertion
  - deletion
  - update

22 Secondary Index
- Provides multiple views of records
- Example: Consider a collection of music CDs
  Primary index:   CD #       physical location
                   ABG379     ...
  Composer index:  composer   CD #
                   Beethoven  ABG379
  Title index:     title      CD #
                   Symphony   ABG379

23 Primary vs Secondary Keys
- Uniqueness
  - a primary key is a unique identification of a record
  - a secondary key may be associated with many records
- Binding: association of key and address
We may retrieve records using combinations of secondary keys:
  FIND all records WHERE Composer = "Beethoven" AND Title = "Symphony 9"
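A combined query on two secondary keys can be sketched as a set intersection over the primary keys that each secondary index returns. Only the CD number ABG379 comes from the slides; the other two catalogue entries (HYP001, HYP002) are hypothetical data added so that the intersection has something to filter out.

```python
from collections import defaultdict


def build_secondary(records, field):
    """Map each value of a secondary key to the set of primary keys."""
    index = defaultdict(set)
    for primary_key, rec in records.items():
        index[rec[field]].add(primary_key)
    return index


cds = {
    "ABG379": {"composer": "Beethoven", "title": "Symphony 9"},
    "HYP001": {"composer": "Beethoven", "title": "Symphony 5"},  # hypothetical
    "HYP002": {"composer": "Dvorak", "title": "Symphony 9"},     # hypothetical
}
composer_index = build_secondary(cds, "composer")
title_index = build_secondary(cds, "title")


def find_and(composer, title):
    """Composer = ... AND Title = ...: intersect the two primary-key sets."""
    matches = composer_index.get(composer, set()) & title_index.get(title, set())
    return sorted(matches)


print(find_and("Beethoven", "Symphony 9"))   # ['ABG379']
```

The secondary indexes hold primary keys rather than physical addresses, so moving a record only requires updating the primary index.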

24 Binding
- Association between a key and a physical address
- Tight binding
  - bind early
  - the binding takes place when the file is constructed
    - advantage: high performance
    - disadvantage: updates
- Lazy binding
  - bind later
  - the binding takes place when the keys are actually used
    - advantage: easy updates
    - safer: consistency
- Primary index: tight binding; secondary index: lazy binding

