Fundamental File Structure Concepts

Record and Field Structure
A record is a collection of fields. A field stores information about one attribute.
The question: when we write records, how do we organize the fields within them
- so that the information can be recovered,
- so that we save space,
- so that we can process the data efficiently,
- so that record structure flexibility is maximized?

Field Structure Issues
What if
- field values vary greatly in length?
- fields are optional?

Field Delineation Methods
- Fixed-length fields
- Include the length with each field
- Separate fields with a delimiter
- Include a keyword expression to identify each field

Fixed-length fields
- Easy to implement: use the language's record structures (no parsing)
- Fields must be declared at the maximum length needed

  Field:  last  first  address  city  state  zip
  Width:  10    10     15       15    2      9

  "Yeakus    Bill      123 Pine       Utica          OH43050    "
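The fixed-width layout above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the helper names padField, packPerson, and fieldAt are hypothetical, and blank padding is assumed as the fill convention.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pad a value with blanks (or truncate it) to exactly `width` characters.
std::string padField(const std::string& value, std::size_t width) {
    std::string out = value.substr(0, width);
    out.append(width - out.size(), ' ');
    return out;
}

// Build one fixed-length record with the widths 10, 10, 15, 15, 2, 9.
std::string packPerson(const std::string& last, const std::string& first,
                       const std::string& address, const std::string& city,
                       const std::string& state, const std::string& zip) {
    return padField(last, 10) + padField(first, 10) + padField(address, 15) +
           padField(city, 15) + padField(state, 2) + padField(zip, 9);
}

// Recover a field by position alone: no parsing, only an offset and a width.
std::string fieldAt(const std::string& record, std::size_t offset,
                    std::size_t width) {
    assert(offset + width <= record.size());
    std::string field = record.substr(offset, width);
    field.erase(field.find_last_not_of(' ') + 1);  // drop the blank padding
    return field;
}
```

Every record built this way is 61 bytes long, so the city field, for example, always starts at offset 35 regardless of the data.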

Include length with field
- Begin each field with a length indicator
- If the maximum field length is less than 256, a single byte can hold the length

  Fields: last, first, address, city, state, zip
  Bytes (hex, length byte first):
    06 59 65 61 6B 75 73              (6, "Yeakus")
    04 42 69 6C 6C                    (4, "Bill")
    08 31 32 33 20 50 69 6E 65 . . .  (8, "123 Pine")
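A minimal sketch of the length-byte scheme, assuming one unsigned byte per length as the slide suggests. The function names appendField and splitFields are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Append one field as: 1 length byte followed by the field's characters.
void appendField(std::string& record, const std::string& field) {
    assert(field.size() <= 255);  // one byte can describe at most 255 chars
    record.push_back(static_cast<char>(static_cast<unsigned char>(field.size())));
    record += field;
}

// Split a record back into fields by following the length bytes.
std::vector<std::string> splitFields(const std::string& record) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < record.size()) {
        std::size_t len = static_cast<unsigned char>(record[pos]);
        ++pos;
        if (pos + len > record.size()) {
            throw std::runtime_error("truncated field");
        }
        fields.push_back(record.substr(pos, len));
        pos += len;
    }
    return fields;
}
```

An absent optional field costs exactly one byte here: a length of zero followed by no data.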

Separate fields with a delimiter
- Use a special character that does not occur in the data, e.g. space, comma, or tab
- ASCII also provides special characters for this, e.g. Field Separator (FS), hex 1C
- Here we use "|"
- An end-of-record delimiter is also needed: "#"

  "Yeakus|Bill|123 Pine|Utica|OH|43050#"
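One way to write and parse such a record is sketched below. The names joinDelimited and parseDelimited are hypothetical, and the code assumes, as the slide does, that "|" and "#" never occur inside field data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const char kFieldDelim = '|';
const char kRecordDelim = '#';

// Join fields into "f1|f2|...|fn#".
std::string joinDelimited(const std::vector<std::string>& fields) {
    std::string record;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        // The scheme only works if the delimiters never occur in the data.
        assert(fields[i].find(kFieldDelim) == std::string::npos);
        assert(fields[i].find(kRecordDelim) == std::string::npos);
        if (i > 0) record.push_back(kFieldDelim);
        record += fields[i];
    }
    record.push_back(kRecordDelim);
    return record;
}

// Parse one record; stops at the end-of-record delimiter.
// If no '#' is seen, only the fields completed so far are returned.
std::vector<std::string> parseDelimited(const std::string& record) {
    std::vector<std::string> fields;
    std::string current;
    for (char c : record) {
        if (c == kFieldDelim) {
            fields.push_back(current);
            current.clear();
        } else if (c == kRecordDelim) {
            fields.push_back(current);
            return fields;
        } else {
            current.push_back(c);
        }
    }
    return fields;
}
```

Two adjacent delimiters simply produce an empty string, which is how this scheme represents an absent optional field.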

Include keyword expression
- Keywords label each field
- A self-describing structure
- Allows a lot of flexibility
- Uses a lot of space

  "LAST=Yeakus|FIRST=Bill|ADDRESS=123 Pine|CITY=Utica|STATE=OH|ZIP=43050#"
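A self-describing record of this kind might be parsed into a keyword-to-value map as sketched here. The function name parseKeyword is hypothetical, and the code assumes that "=", "|", and "#" do not appear inside keywords or values.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Parse "KEY=value|KEY=value|...#" into a map from keyword to value.
std::map<std::string, std::string> parseKeyword(const std::string& record) {
    std::map<std::string, std::string> out;
    std::size_t pos = 0;
    while (pos < record.size() && record[pos] != '#') {
        // A pair ends at the next field delimiter or at the record delimiter.
        std::size_t end = record.find_first_of("|#", pos);
        if (end == std::string::npos) end = record.size();
        std::string pair = record.substr(pos, end - pos);
        std::size_t eq = pair.find('=');
        if (eq != std::string::npos) {
            out[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
        if (end >= record.size() || record[end] == '#') break;
        pos = end + 1;
    }
    return out;
}
```

Because every value carries its keyword, fields may appear in any order, and an optional field is represented by leaving its pair out.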

Optional Fields
  Method         How an absent field is represented
  Fixed length   Leave blank
  Field length   Zero-length field
  Delimiter      Adjacent delimiters
  Keywords       Just leave out

Reading a stream of fields
- The record must be broken into fields
- Fixed-length fields can simply be read into a record structure
- The other methods must be "parsed" with a parsing algorithm

Record Structures
How do we organize records in a file?
- Records can be fixed length or variable length
- Fixed length allows simple direct-access lookup
- Fixed length may waste space
- Variable length: how do we find a record's position?

Record Structures
- Fixed-length records
- Fixed number of fields in records
- Variable length
  - Prefix each record with a length
  - Use a second file to keep track of record start positions
  - Place a delimiter between records

Fixed Length Records
- All records are the same length
- Record positions can be calculated for direct-access reads
- Does not imply that the sizes or the number of fields are fixed
- Variable-length fields inside a fixed-length record leave unused space

Fixed number of fields in records
- Field size can be fixed or variable
- Fixed-size fields result in fixed-size records: simply read directly into a "struct"
- Variable-size fields are delimited or carry field lengths: simply count fields while parsing

Variable length Records
- Prefix each record with a length
- Use a second file to keep track of record start positions
- Place a delimiter between records

Prefix records with a length
- Allows true variable-length records
- Form of the prefix:
  - Character number (fixed length)
  - Binary number (write the integer without conversion)
- Must consider the maximum length
- No direct access (great for sequential access)

Index of record start addresses
- A second file is simply a list of offsets to successive records
- Since the offsets are fixed length, this file allows direct access, thereby allowing direct access to the main file
- Problems:
  - Maintaining the file (adding and deleting records)
  - Cost of the index

Place delimiter between records
- Use a special character that does not occur in the record
- Allows efficient variable-size records
- No direct access
- Bible files: use '\n' as the delimiter

Binary data in files
- Binary reals and integers can be written to, and read from, a file
- Need to know the byte size of the variables used
- The "sizeof" operator returns the data size

Binary data in files

    #include <cstring>
    #include <fstream>
    using namespace std;

    int rsize;
    char rec_buf[MAX];
    fstream mf;
    mf.open("myfile.bin", ios::binary | ios::out);
    …
    strcpy(rec_buf, "this is a test record");
    rsize = static_cast<int>(strlen(rec_buf));
    // write() and read() take a char*, so the int's address must be cast
    mf.write(reinterpret_cast<char*>(&rsize), sizeof(int));  // write the size
    mf.write(rec_buf, rsize);                                 // write the record
    mf.close();
    mf.open("myfile.bin", ios::binary | ios::in);
    mf.read(reinterpret_cast<char*>(&rsize), sizeof(int));   // read the size
    mf.read(rec_buf, rsize);                                  // read the record

Viewing Binary file data
- Use the file dump utility (od, octal dump): od -xc <filename>
  - x: hex output
  - c: character output
- Useful for viewing what is actually in the file

Using Classes to Manipulate Buffers
Three classes:
- Delimited fields
- Length-based fields
- Fixed-length fields

Record Access - Keys
- A key is an attribute used to identify records
- Often used to find records
- Standard or canonical form
  - Rules to which keys must conform
  - Prevents missing a record because its key is in a different form
  - Examples: all capitals; phone numbers in the form (nnn) nnn-nnnn
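As a small illustration of a canonical form, the sketch below applies the "all capitals" rule and also trims surrounding blanks. The trimming rule and the name canonicalKey are assumptions made for this example, not part of the slides.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Reduce a key to canonical form: no surrounding blanks, all upper case.
std::string canonicalKey(const std::string& raw) {
    std::size_t first = raw.find_first_not_of(' ');
    if (first == std::string::npos) return "";  // blank or empty key
    std::size_t last = raw.find_last_not_of(' ');
    std::string key = raw.substr(first, last - first + 1);
    for (char& c : key) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return key;
}
```

If both the stored key and the search key pass through the same function, "yeakus", "Yeakus " and "YEAKUS" all find the same record.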

Record Access - Keys
- Keys can be distinct: they uniquely identify records
  - Primary keys
  - One-to-one relationship between key value and the possible entities represented
  - SSN, student ID
- Keys can identify a collection of records
  - Secondary keys
  - One-to-many relationship
  - City, position, department

Record Access - Keys
Desired characteristics of a primary key:
- Unique among the collection of entities
- Dataless: what if some entities have no value of this type (e.g. SSN)?
- Unchanging

Record access
- Performance of an access method: how do we compare techniques?
- We must be careful about which events we count
- "Big-oh" notation gives us a way to factor out all but the most significant factors

Record Access - timing
- Sequential searching
- Consider a file of 4000 records
- What if no blocking is done, and there is one record per block? (500-byte records, 512-byte blocks)
- What if the cluster size is set to 8?
- Searching always requires O(n), but blocking makes the search faster by a constant factor
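The two questions above can be worked through with a little arithmetic. The sketch assumes that a cluster of 8 blocks delivers 8 records per read and that a successful search is equally likely to end at any record. The function names are hypothetical.

```cpp
#include <cstddef>

// Worst case: the wanted record is in the last group, so every group is read.
std::size_t worstCaseReads(std::size_t records, std::size_t recordsPerRead) {
    return (records + recordsPerRead - 1) / recordsPerRead;  // ceiling division
}

// Approximate average for a successful search: about half the worst case.
double averageReads(std::size_t records, std::size_t recordsPerRead) {
    return worstCaseReads(records, recordsPerRead) / 2.0;
}
```

For 4000 records this gives 4000 reads in the worst case with one record per read, and 500 with eight records per read. The growth is still linear in the file size, only the constant changes.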

Sequential searching
- Usually NOT the best method
- Sometimes it is best:
  - Searching for some ASCII pattern (grep)
  - Small files
  - Files that are rarely searched
  - Searching on a secondary key when a large percentage of records match (say 25%)

Unix Tools for sequential file processing
- cat - display a file
- wc - count lines, words, and characters
- grep - find lines in file(s) that match a regular expression

Direct Access
- Move "directly" to a record without scanning the preceding data
- Different languages and operating systems support different models:
  - Byte offset model
    - The programmer must specify the offset to the record and the record size to read
    - Supports variable-size records and skip sequential processing
  - Relative Record Number (RRN) model
    - The file has a fixed record size (declared at creation time)
    - Records are specified by a record number
    - The file is modeled as a collection of components
    - Higher level of abstraction

Direct Access
Different language support:
- RRN model: PL/I, COBOL, Pascal (files are modeled as a collection of components, i.e. records), FORTRAN
- Byte offset model: C

Choosing Record Sizes for Direct Access
Fixed-length fields
- Very easy to parse records: just read into a record structure!
- Each field must be the maximum length needed!
- Thus the record must be as long as all the maximum field lengths combined

  Field:  last  first  address  city  state  zip
  Width:  10    10     15       15    2      9

  "Yeakus    Bill      123 Pine       Utica          OH43050    "

Choosing Record Sizes for Direct Access
Variable-length fields
- Each field can be any length
- Since some fields can be long and others short, the overall record size may be shorter
- This gives more flexibility in field lengths
- Records must be parsed, and space is used for delimiters or length bytes

  Yeakus|Bill|123 Pine|Utica|OH|43050
  Snivenloppinsky|Helmut|12232 Galmentary Avenue|Spotsdale|NY|11232

Header Records
- The first record in a direct-access file may be used to store special information:
  - Number of records used
  - Location of the first record in key-order sequence
  - Location of the first empty record
  - File record structure (metadata)
- In languages with the RRN model (e.g. Pascal), a variant record facility must be used
- In C++, the header record can be a different size from the rest of the file's records

Header Records
- Consider a file of persons
- The header record contains a 2-byte record count
- Header size is 32, record size is 67

    class Person {
    public:
        char LastName[11];
        char FirstName[11];
        char Address[16];
        char City[16];
        char State[3];
        char ZipCode[10];
    };  // 11 + 11 + 16 + 16 + 3 + 10 = 67 bytes

    class head {
    public:
        short rec_count;
        char fill[30];
    };  // 2 + 30 = 32 bytes

Header Records
- Must be written when the file is created
- Must be rewritten when the file is changed
- Must be read when the file is opened

IOS - I/O streams in C++

IOS - I/O streams in C++
#include <iostream.h> (pre-standard header; standard C++ uses <iostream>)
As the iostream class hierarchy diagram shows, ios is the base class for all the input/output stream classes. You will not use ios directly; rather, you will use many of its inherited member functions and data members.

IOS - I/O streams in C++
Data Members (static) - Public Members
- basefield: mask for obtaining the conversion base flags (dec, oct, or hex)
- adjustfield: mask for obtaining the field padding flags (left, right, or internal)
- floatfield: mask for obtaining the numeric format (scientific or fixed)

IOS - I/O streams in C++
Flag and Format Access Functions - Public Members
- flags: sets or reads the stream's format flags
- setf: manipulates the stream's format flags
- unsetf: clears the stream's format flags
- fill: sets or reads the stream's fill character
- precision: sets or reads the stream's floating-point format display precision
- width: sets or reads the stream's output field width

IOS - I/O streams in C++
Status-Testing Functions - Public Members
- good: indicates good stream status
- bad: indicates a serious I/O error
- eof: indicates end of file
- fail: indicates a serious I/O error or a possibly recoverable I/O formatting error
- rdstate: returns the stream's error flags
- clear: sets or clears the stream's error flags

IOS - I/O streams in C++
ios Manipulators: dec, hex, oct, binary, text
- dec: causes the interpretation of subsequent fields in decimal format (the default mode)
- hex: causes the interpretation of subsequent fields in hexadecimal format
- oct: causes the interpretation of subsequent fields in octal format
- binary: sets the stream's mode to binary (stream must have an associated filebuf buffer)
- text: sets the stream's mode to text, the default mode (stream must have an associated filebuf buffer)

IOS - I/O streams in C++
Parameterized Manipulators (#include <iomanip.h> required)
- setiosflags: sets the stream's format flags
- resetiosflags: resets the stream's format flags
- setfill: sets the stream's fill character
- setprecision: sets the stream's floating-point display precision
- setw: sets the stream's field width (for the next field only)
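A few of these manipulators in use, written against the standard headers <iomanip> and <sstream> (an assumption, since the slides name the older <iomanip.h>). The helper names are made up for this sketch.

```cpp
#include <iomanip>
#include <ios>
#include <sstream>
#include <string>

// Fixed-width, zero-padded decimal field, e.g. for a character length prefix.
std::string zeroPadded(int value, int width) {
    std::ostringstream out;
    out << std::setfill('0') << std::setw(width) << value;
    return out.str();
}

// Hexadecimal text form of a value.
std::string asHex(unsigned value) {
    std::ostringstream out;
    out << std::hex << value;
    return out.str();
}

// Fixed-point text form with a chosen number of decimals.
std::string fixedPoint(double value, int digits) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(digits) << value;
    return out.str();
}

// setw affects only the next field: `b` is written with no padding.
std::string twoFields(int a, int b) {
    std::ostringstream out;
    out << std::setw(4) << a << b;
    return out.str();
}
```

A zero-padded width is one way to produce the fixed-length "character number" prefix used for variable-length records.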

File Access and Organization
- File organization
  - Variable-length records
  - Fixed-length records
  - Field structures (size bytes, delimiters, fixed)
- File access
  - Sequential access
  - Direct access
  - Indexed access

File Access and Organization
Interaction between organization and access:
- Can the file be divided into fields?
- Is there a higher level of organization to the file (metadata)?
- Do all records have to have the same number of fields or bytes?
- How do we distinguish one record from the next?
- How do we recognize whether a fixed-length record holds real data or not?

File Access and Organization
- There is often a trade-off between space and time
  - Fixed-length records allow direct access but waste space
  - Variable-length records require sequential search
- We must also consider the typical use of the file: what are the desired access patterns?
- The choice of a particular organization has implications for the allowable types of access

Portability and Standardization
- Differences among languages
  - Fixed-size records versus byte-addressable access
- Differences among machine architectures
  - Byte order of binary data
  - May be high-order or low-order byte first

Byte order of binary data
Suppose a long int, say 45, is stored in memory.
- High-order byte first (big endian): stored as 00 00 00 2D (Sun workstations, network protocols)
- Low-order byte first (little endian): stored as 2D 00 00 00 (PCs, VAXes)

Byte order of binary data
- If binary data is written to a file, it is written in the order in which it is stored in memory
- If the data is later read by a system with a different byte ordering, the number will be incorrect!
- For the sake of portability, files should be written in an agreed-upon format (probably big endian)
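One way to follow this advice is to build the byte sequence explicitly with shifts, so the result does not depend on the machine's own ordering. The sketch assumes a 32-bit unsigned value, and the names toBigEndian and fromBigEndian are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a 32-bit value as 4 bytes, high-order byte first.
std::string toBigEndian(std::uint32_t value) {
    std::string bytes(4, '\0');
    bytes[0] = static_cast<char>((value >> 24) & 0xFF);
    bytes[1] = static_cast<char>((value >> 16) & 0xFF);
    bytes[2] = static_cast<char>((value >> 8) & 0xFF);
    bytes[3] = static_cast<char>(value & 0xFF);
    return bytes;
}

// Decode 4 big-endian bytes back into a 32-bit value.
std::uint32_t fromBigEndian(const std::string& bytes) {
    assert(bytes.size() >= 4);
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[0])) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[1])) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[2])) << 8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[3]));
}
```

Writing the 4 bytes returned by toBigEndian(45) produces 00 00 00 2D in the file on any machine, and fromBigEndian recovers 45 regardless of the reader's native order.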