Fundamental File Structure Concepts

Record and Field Structure
A record is a collection of fields, and a field stores information about one attribute. The question: when we write records, how do we organize the fields so that the information can be recovered, space is saved, processing is efficient, and the record structure stays as flexible as possible?

Field Structure issues
What if field values vary greatly in length? What if some fields are optional?

Field Delineation methods
Four methods: fixed-length fields; include the length with each field; separate fields with a delimiter; include a keyword expression to identify each field.

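Fixed length fields
Easy to implement: use the language's record structures (no parsing). Each field must be declared at the maximum length needed.
Field order: last, first, address, city, state, zip
Example record: “Yeakus Bill Pine Utica OH “ (each field is padded with blanks to its declared width)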

Include length with field
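Begin each field with a length indicator. If the maximum field length is less than 256, a single byte can be used for the length.
Field order: last, first, address, city, state, zip
Example: each value (Yeakus, Bill, Pine, ...) is stored with its length byte in front of it.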
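The length-byte scheme can be sketched as follows. This is a minimal illustration, not the presentation's own code: the function names `appendField` and `unpackFields` are hypothetical, and the record is held in a `std::string` buffer for convenience.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Append one field to a record buffer, preceded by a single length byte.
// Assumes every field is shorter than 256 bytes, as the slide requires.
void appendField(std::string& record, const std::string& field) {
    assert(field.size() < 256);
    record.push_back(static_cast<char>(field.size()));
    record.append(field);
}

// Split a record built by appendField back into its fields.
std::vector<std::string> unpackFields(const std::string& record) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < record.size()) {
        // The length byte tells us how many characters belong to this field.
        std::size_t len = static_cast<unsigned char>(record[pos]);
        fields.push_back(record.substr(pos + 1, len));
        pos += 1 + len;
    }
    return fields;
}
```

An optional field that is absent costs one byte: a length of zero.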

Optional Fields
Fixed length: leave the field blank.
Field length: write a zero-length field.
Delimiter: write adjacent delimiters.
Keywords: just leave the field out.

Reading a stream of fields
We need to break each record into fields. Fixed-length fields can simply be read into a record structure. The other field structures must be “parsed” with a parse algorithm.
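As a sketch of what such a parse algorithm might look like for delimited fields, the function below splits one record on a delimiter character. The name `parseDelimited` and the choice of `|` as the default delimiter are assumptions made for illustration; each field is assumed to be terminated by the delimiter.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one record into fields on a delimiter character.
// Adjacent delimiters produce an empty string, i.e. an omitted optional field.
std::vector<std::string> parseDelimited(const std::string& record, char delim = '|') {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(record);
    while (std::getline(in, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}
```

For the record `Yeakus|Bill||Utica|OH|`, the third field comes back empty because its two delimiters are adjacent.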

Record Structures How do we organize records in a file?
Records can be fixed length or variable length. Fixed length allows simple direct-access lookup, but may waste space. With variable length, how do we find a record's position?

Record Structures
Options: fixed-length records; a fixed number of fields in each record; variable-length records, which can be organized by prefixing each record with a length, using a second file to keep track of record start positions, or placing a delimiter between records.

Fixed Length Records: all records have the same length
Record positions can be calculated for direct-access reads. This does not imply that the sizes or the number of fields are fixed. Storing variable-length data in fixed-length records leads to unused space.

Fixed number of fields in records
Field size could be fixed or variable. Fixed-size fields result in fixed-size records that can simply be read directly into a “struct”. Variable-sized fields are marked by delimiters or field lengths; simply count fields while parsing to find the end of the record.

Variable length Records
Three approaches: prefix each record with a length; use a second file to keep track of record start positions; place a delimiter between records.

Prefix records with a length
Allows true variable-length records. The prefix can be a character number (fixed length) or a binary number (the integer written without conversion). The maximum record length must be considered when choosing the prefix size. No direct access (great for sequential access).

Index of record start addresses
A second file is simply a list of offsets to successive records. Since the offsets are fixed length, this file allows direct access, thereby allowing direct access to the main file. Problems: maintaining the file (adding and deleting records) and the cost of the index.
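A minimal sketch of the idea, under some simplifying assumptions: the records are delimited by '\n', and the index is kept in memory as a `std::vector` instead of being written to a second file. The names `buildIndex` and `readRecord` are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Scan the data once and remember the byte offset where each record starts.
std::vector<std::streamoff> buildIndex(std::istream& data) {
    std::vector<std::streamoff> offsets;
    std::string line;
    data.clear();
    data.seekg(0, std::ios::beg);
    while (true) {
        std::streamoff start = data.tellg();
        if (!std::getline(data, line)) break;  // no more records
        offsets.push_back(start);
    }
    return offsets;
}

// Fetch record number rrn directly by seeking to its stored offset.
std::string readRecord(std::istream& data,
                       const std::vector<std::streamoff>& offsets,
                       std::size_t rrn) {
    std::string line;
    data.clear();  // reset the eof/fail state left by an earlier scan
    data.seekg(offsets.at(rrn), std::ios::beg);
    std::getline(data, line);
    return line;
}
```

Because every index entry has the same size, entry n can itself be located by arithmetic, which is what gives the variable-length main file its direct access.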

Place delimiter between records
Use a special character that is not used in any record. Allows efficient variable-size records. No direct access. Bible files: use ‘\n’ as the delimiter.

Binary data in files
Binary reals and integers can be written to, and read from, a file. We need to know the byte size of the variables used; the sizeof operator returns the data size.

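Binary data in files
#include <cstring>
#include <fstream>
using namespace std;

const int MAX = 256;   // assumed buffer size; MAX is not defined on the slide

int main() {
    int rsize;
    char rec_buf[MAX];
    fstream mf;

    mf.open("myfile.bin", ios::binary | ios::out);
    // …
    strcpy(rec_buf, "this is a test record");
    rsize = strlen(rec_buf);
    mf.write((char*)&rsize, sizeof(int));   // write the size
    mf.write(rec_buf, rsize);               // write the record
    mf.close();

    mf.open("myfile.bin", ios::binary | ios::in);
    mf.read((char*)&rsize, sizeof(int));    // read the size
    mf.read(rec_buf, rsize);                // read the record
    rec_buf[rsize] = '\0';                  // terminate the string just read
    mf.close();
    return 0;
}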

Viewing Binary file data
Use the file dump utility od (octal dump): od -xc <filename>
x: hex output
c: character output
Useful for viewing what is actually in the file.

Using Classes to Manipulate Buffer
Three classes: delimited fields, length-based fields, fixed-length fields.

Record Access - Keys
A key is an attribute used to identify records, and is often used to find records. A standard or canonical form is a set of rules that keys must conform to; it prevents missing a record because its key was entered in a different form. Examples: names in all capitals; phone numbers in the form (nnn) nnn-nnnn.
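The two example rules can be sketched as small normalizing functions that are applied both when a key is stored and when it is searched for. The names `canonicalKey` and `canonicalPhone` are hypothetical, and collapsing runs of blanks is an extra rule assumed here for illustration.

```cpp
#include <cctype>
#include <string>

// Canonical form for a name key: all capitals, no leading or trailing
// blanks, and runs of blanks collapsed to a single space.
std::string canonicalKey(const std::string& raw) {
    std::string key;
    for (unsigned char c : raw) {
        if (std::isspace(c)) {
            if (!key.empty() && key.back() != ' ') key.push_back(' ');
        } else {
            key.push_back(static_cast<char>(std::toupper(c)));
        }
    }
    if (!key.empty() && key.back() == ' ') key.pop_back();
    return key;
}

// Canonical form for a phone key: (nnn) nnn-nnnn.
// Returns an empty string if the input does not contain exactly 10 digits.
std::string canonicalPhone(const std::string& raw) {
    std::string digits;
    for (unsigned char c : raw) {
        if (std::isdigit(c)) digits.push_back(static_cast<char>(c));
    }
    if (digits.size() != 10) return "";
    return "(" + digits.substr(0, 3) + ") " + digits.substr(3, 3) + "-" + digits.substr(6, 4);
}
```

With these rules, "yeakus  bill" and "Yeakus Bill" map to the same key, so a search cannot miss the record because of capitalization or spacing.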

Record Access - Keys
Keys can be distinct, uniquely identifying records. These are primary keys: a one-to-one relationship between a key value and the possible entities represented (SSN, student ID). Keys can also identify a collection of records. These are secondary keys: a one-to-many relationship (city, position, department).

Record Access - Keys: desired characteristics of a primary key
Unique among the collection of entities. Dataless: what if some entities have no value of this type (e.g. SSN)? Unchanging.

Record access: performance of an access method
How do we compare techniques? We must be careful about which events we count. “Big-oh” notation gives us a way to factor out all but the most significant factors.

Record Access - timing: sequential searching
Consider a file of 4000 records. What if no blocking is done, with one record per block (500-byte records, 512-byte blocks)? What if the cluster size is set to 8? Sequential search always requires O(n) reads, but with clustering the search is faster by a constant factor.
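The arithmetic behind the constant factor can be worked out with a short sketch. It assumes that one read brings in one cluster, so a cluster size of 8 delivers 8 one-record blocks per read; the function names are hypothetical.

```cpp
#include <cassert>

// Reads needed to scan every record when each read brings in
// recordsPerRead records (the worst case for a sequential search).
long worstCaseReads(long records, long recordsPerRead) {
    assert(records >= 0 && recordsPerRead > 0);
    return (records + recordsPerRead - 1) / recordsPerRead;  // round up
}

// Expected reads for a successful search when every record is equally
// likely and the file divides evenly into reads: (m + 1) / 2 for m reads.
double averageReads(long records, long recordsPerRead) {
    return (worstCaseReads(records, recordsPerRead) + 1) / 2.0;
}
```

Under these assumptions, the 4000-record file needs up to 4000 reads with one record per read and up to 500 reads with a cluster size of 8. Both grow linearly with the file size; clustering only divides the count by 8.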

Sequential searching
Usually NOT the best method. Sometimes it is best: when searching for some ASCII pattern (grep); for small files; for files that are rarely searched; when searching on a secondary key where a large percentage of records match (say 25%).

Unix Tools for sequential file processing
cat: display a file
wc: count lines, words, and characters
grep: find lines in file(s) that match a regular expression

Direct Access
Move “directly” to a record without scanning the preceding data. Different languages and operating systems support different models.
Byte offset model: the programmer must specify the offset to the record and the record size to read. Supports variable-size records and lets a program skip sequential processing.
Relative Record Number (RRN) model: the file has a fixed record size (declared at creation time), and records are specified by a record number. The file is modeled as a collection of components, a higher level of abstraction.

Direct Access: different language support
RRN support: PL/I, COBOL, Pascal (files are modeled as a collection of components (records)), FORTRAN
Byte offset: C

Choosing Record Sizes for Direct Access
Fixed Length Fields: very easy to parse records, just read into a record structure! Each field must be the maximum length needed, so the record must be as long as all the maximum field lengths combined.
Field order: last, first, address, city, state, zip
Example record: “Yeakus Bill Pine Utica OH “

Header Records
The first record in a direct file may be used to store special information: the number of records used; the location of the first record in key-order sequence; the location of the first empty record; the file's record structure (meta-data). In languages with the RRN model, such as Pascal, the variant record facility must be used. In C++, the header record can be of a different size from the rest of the file's records.

Header Records Consider a file of persons
The header record contains a 2-byte record count. Header size is 32 bytes; record size is 67 bytes.
class Person {
public:
    char LastName[11];
    char FirstName[11];
    char Address[16];
    char City[16];
    char State[3];
    char ZipCode[10];
};
class head {
public:
    short rec_count;   // number of records in the file
    char fill[30];     // padding to bring the header to 32 bytes
};

Header Records
The header must be written when the file is created, rewritten when the file is changed, and read when the file is opened.

IOS - I/O streams in C++

IOS - I/O streams in C++
#include <iostream.h> (the pre-standard header; standard C++ uses <iostream>)
As the iostream class hierarchy diagram shows, ios is the base class for all the input/output stream classes. You will not use ios directly; rather, you will be using many of the inherited member functions and data members.

IOS - I/O streams in C++
Data Members (static), Public Members
basefield: mask for obtaining the conversion base flags (dec, oct, or hex).
adjustfield: mask for obtaining the field padding flags (left, right, or internal).
floatfield: mask for obtaining the numeric format (scientific or fixed).

IOS - I/O streams in C++
Flag and Format Access Functions, Public Members
flags: sets or reads the stream’s format flags.
setf: manipulates the stream’s format flags.
unsetf: clears the stream’s format flags.
fill: sets or reads the stream’s fill character.
precision: sets or reads the stream’s floating-point format display precision.
width: sets or reads the stream’s output field width.
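A short sketch of these member functions in use, together with the basefield, adjustfield, and floatfield masks. It is written against the standard C++ headers (`<sstream>`, `<ios>`), not the older `<iostream.h>`, and the function names `formatAmount` and `toHex` are hypothetical.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Format a value with two decimals, right-justified in a
// 10-character field padded with '*'.
std::string formatAmount(double value) {
    std::ostringstream out;
    out.setf(std::ios::fixed, std::ios::floatfield);   // fixed, not scientific
    out.setf(std::ios::right, std::ios::adjustfield);  // pad on the left
    out.precision(2);
    out.fill('*');
    out.width(10);  // applies to the next inserted field only
    out << value;
    return out.str();
}

// Format an integer in hexadecimal by changing the conversion base flag.
std::string toHex(int value) {
    std::ostringstream out;
    out.setf(std::ios::hex, std::ios::basefield);
    out << value;
    return out.str();
}
```

The two-argument form of setf clears the other flags in the mask before setting the new one, which is why the masks exist.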

IOS - I/O streams in C++
Status-Testing Functions, Public Members
good: indicates good stream status.
bad: indicates a serious I/O error.
eof: indicates end of file.
fail: indicates a serious I/O error or a possibly recoverable I/O formatting error.
rdstate: returns the stream’s error flags.
clear: sets or clears the stream’s error flags.
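The difference between a recoverable formatting error and end of file can be seen in a small reading loop. This is an illustrative sketch using standard C++ streams; `readInts` is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated integers, skipping tokens that are not numbers.
std::vector<int> readInts(std::istream& in) {
    std::vector<int> values;
    int v;
    while (true) {
        if (in >> v) {                      // extraction succeeded
            values.push_back(v);
            continue;
        }
        if (in.bad() || in.eof()) break;    // serious error or end of input
        in.clear();                         // fail only: a formatting error, recoverable
        std::string junk;
        in >> junk;                         // discard the offending token
    }
    return values;
}
```

After a failed extraction every later read also fails until clear is called, so the loop must reset the flags before skipping the bad token.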

IOS - I/O streams in C++
ios Manipulators
dec: causes the interpretation of subsequent fields in decimal format (the default mode).
hex: causes the interpretation of subsequent fields in hexadecimal format.
oct: causes the interpretation of subsequent fields in octal format.
binary: sets the stream’s mode to binary (stream must have an associated filebuf buffer).
text: sets the stream’s mode to text, the default mode (stream must have an associated filebuf buffer).
The binary and text manipulators belong to the older iostream library; standard C++ streams select binary mode with the ios::binary flag when the file is opened.

IOS - I/O streams in C++
Parameterized Manipulators (#include <iomanip.h> required)
setiosflags: sets the stream’s format flags.
resetiosflags: resets the stream’s format flags.
setfill: sets the stream’s fill character.
setprecision: sets the stream’s floating-point display precision.
setw: sets the stream’s field width (for the next field only).
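The manipulators can be chained inside a single output expression. The sketch below uses the standard header `<iomanip>` (the modern name for `<iomanip.h>`); the function names `hexOffset` and `tableRow` are hypothetical.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Render a byte offset as 8 hexadecimal digits with leading zeros,
// the way a dump listing might label each line.
std::string hexOffset(unsigned long offset) {
    std::ostringstream out;
    out << std::hex << std::setfill('0') << std::setw(8) << offset;
    return out.str();
}

// One row of a report: a left-justified name in 10 columns, then a
// right-justified price with two decimals in 8 columns.
std::string tableRow(const std::string& name, double price) {
    std::ostringstream out;
    out << std::setiosflags(std::ios::left) << std::setw(10) << name
        << std::resetiosflags(std::ios::left)
        << std::setiosflags(std::ios::fixed) << std::setprecision(2)
        << std::setw(8) << price;
    return out.str();
}
```

Note that setw has to be repeated before each field, while the fill character, base, and precision stay in effect until changed.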

File Access and Organization
File organization: variable-length records; fixed-length records; field structures (size bytes, delimiters, fixed).
File access: sequential access; direct access; indexed access.

Interaction between organization and access
Can the file be divided into fields?
Is there a higher level of organization to the file (meta-data)?
Do all records have to have the same number of fields or bytes?
How do we distinguish one record from the next?
How do we recognize whether a fixed-length record holds real data or not?

There is often a trade-off between space and time. Fixed-length records allow direct access but waste space; variable-length records require a sequential search. We must also consider the typical use of the file: what are the desired access patterns? The selection of a particular organization has implications for the allowable types of access.

Portability and Standardization
Differences among languages: fixed-size records versus byte-addressable access.
Differences among machine architectures: the byte order of binary data may be high-order or low-order byte first.

Byte order of binary data
High order first (Big Endian): a 4-byte long int, say 45 (hex 2D), is stored in memory as 00 00 00 2D. Used by Suns and network protocols.
Low order first (Little Endian): the same value is stored as 2D 00 00 00. Used by PCs and VAXes.

Byte order of binary data
If binary data is written to a file, it is written in the order in which it is stored in memory. If the data is later read by a system with a different byte ordering, the number will be incorrect! For the sake of portability, files should be written in an agreed-upon format (probably Big Endian).
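One way to follow an agreed-upon format is to write the bytes explicitly instead of copying the integer's in-memory image. The sketch below assumes a 32-bit unsigned value and Big Endian as the file format; the function names are hypothetical.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Write a 32-bit value high-order byte first, whatever the host byte order.
void writeBigEndian32(std::ostream& out, std::uint32_t value) {
    char bytes[4] = {
        static_cast<char>((value >> 24) & 0xFF),
        static_cast<char>((value >> 16) & 0xFF),
        static_cast<char>((value >> 8) & 0xFF),
        static_cast<char>(value & 0xFF)
    };
    out.write(bytes, 4);
}

// Read a 32-bit value that was written high-order byte first.
// Returns false if four bytes could not be read.
bool readBigEndian32(std::istream& in, std::uint32_t& value) {
    unsigned char bytes[4];
    if (!in.read(reinterpret_cast<char*>(bytes), 4)) return false;
    value = (static_cast<std::uint32_t>(bytes[0]) << 24) |
            (static_cast<std::uint32_t>(bytes[1]) << 16) |
            (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
    return true;
}
```

Because the shifts operate on the value and not on its memory layout, the same code produces the bytes 00 00 00 2D for the value 45 on both Big Endian and Little Endian machines.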
