Subject Name: File Structures


1 Subject Name: File Structures
Subject Code: 10IS63 Engineered for Tomorrow

2 Organization of Files for Performance, Indexing
Chapter 3 Organization of Files for Performance, Indexing Prepared By: Swetha G Department: Information Science & Engg Date : 02 / 03 / 2015

3 Overview
Data compression
Internal sorting and binary search
Indexing
Object-oriented support for indexing
Secondary indexes and related theory

4 This can be done through,
This chapter deals with organizing files to improve performance. This can be done through:
a) compression, which makes files smaller by encoding them;
b) reclaiming unused space to improve performance;
c) reclaiming space created by deleting or updating records;
d) reordering files by sorting for better performance.

5 Data Compression Reasons for data compression
Uses less storage space, which saves cost. Data transmission is faster, decreasing access time, and sequential processing is faster. Compression involves encoding. Different encoding techniques are:
1) Using a different notation
2) Suppressing repeated sequences
3) Assigning variable-length codes
4) Irreversible compression techniques
5) UNIX compression tools

6 Using a different notation
Can be used for fixed-length fields in records. Data is compressed by finding an optimal size for the field (compact notation). Example: a state field stored as two ASCII characters takes 16 bits, but since there are only 50 states it can be encoded in 6 bits. Cons: with binary encoding, the text becomes unreadable by humans; encoding and decoding cost time; incorporating encoding and decoding modules increases the complexity of the software; the encoding works for one particular application only.
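The compact-notation idea above can be sketched as follows. This is a minimal illustration, not the textbook's code: the state table here is abbreviated (a real one would hold all 50 entries), and the names `EncodeState`/`DecodeState` are made up for this example. Since 50 values fit in 6 bits (64 possibilities), the table index itself is the compact code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative (abbreviated) state table; the 6-bit code for a state is
// simply its index in this table. A full table would list all 50 states.
static const std::vector<std::string> kStates = {
    "AK", "AL", "AR", "AZ", "CA", "OK", "IA"};

// Encode a two-character abbreviation (16 bits as ASCII) as a table
// index that fits in 6 bits; returns -1 if the abbreviation is unknown.
int EncodeState(const std::string& abbrev) {
    for (size_t i = 0; i < kStates.size(); ++i)
        if (kStates[i] == abbrev) return static_cast<int>(i);
    return -1;
}

// Decode a 6-bit code back to the human-readable abbreviation.
std::string DecodeState(int code) {
    return kStates.at(static_cast<size_t>(code));
}
```

Note the trade-off the slide mentions: the encoded value is just a small integer, unreadable without the table, and both directions pay a lookup cost.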

7 Suppressing repeating sequences
Consider the run-length encoding algorithm: read through the pixels, copying pixel values to the file in sequence, except when the same pixel value occurs more than once in succession. When the same value occurs more than once in succession, substitute the following three bytes: a special run-length code indicator (say ff), the pixel value that is repeated, and the number of times that value is repeated. Example: the sequence 22 23 24 24 24 24 24 24 24 25 is encoded as 22 23 ff 24 07 25.

8 Suppressing repeating sequences
Pros: reduces redundancy; the algorithm is simple; used for text, sparse matrices, and instrument data. Cons: does not guarantee any particular amount of space savings; under some circumstances the compressed image is larger than the original image.

9 Assigning variable-length codes
Assign shorter codes to values that occur more frequently. Morse code: for values occurring more frequently than others, the frequently occurring value takes the least amount of space; for example, "e", the most common letter, is a single dot, while rare letters get longer codes. Here the table of letters and their codes is built once and used for all communication (a static table). Huffman coding is similar to Morse code but uses a dynamically built table: determine the probability of each value occurring, then build a binary tree with a search path for each value, where more frequently occurring values get shorter search paths in the tree. The tree is rebuilt for each communication.

10 Example of Huffman coding
Letter:  a    b    c    d     e     f     g
Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code:    1    010  011  0000  0001  0010  0011
Since "a" occurs most frequently, it gets the shortest code. To encode the string "abde": 1 + 010 + 0000 + 0001 = 101000000001 (12 bits, instead of 32 bits with 8-bit characters).
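Once the Huffman tree has been built, encoding is just a table lookup. The sketch below uses an assumed code table (letters a..g, with "a" most frequent and therefore shortest; the specific bit strings are one valid prefix code, not necessarily the slide's); `HuffmanEncode` is an illustrative name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed Huffman code table for letters a..g. "a" (the most probable
// symbol) gets a 1-bit code; the codes form a prefix code, so no code
// is a prefix of another and decoding is unambiguous.
static const std::map<char, std::string> kHuffmanCode = {
    {'a', "1"},    {'b', "010"},  {'c', "011"}, {'d', "0000"},
    {'e', "0001"}, {'f', "0010"}, {'g', "0011"}};

// Concatenate the code of each symbol; throws if a symbol is not in
// the table (at() performs a checked lookup).
std::string HuffmanEncode(const std::string& text) {
    std::string bits;
    for (char ch : text) bits += kHuffmanCode.at(ch);
    return bits;
}
```

With this table, "abde" encodes to 12 bits rather than the 32 bits four 8-bit characters would take.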

11 Irreversible compression techniques
A technique in which data, once encoded, cannot be regenerated; some information is lost. Less common for data files, more common for images. Examples: 1) shrinking a 400-by-400-pixel raster image to 100-by-100 pixels, keeping 1 pixel for every 16 pixels; 2) speech compression by voice coding (the lost information is of little or no value).

12 Compression in UNIX System V uses
pack and unpack, which use Huffman codes to compress files; after compressing a file, ".z" is appended to the packed file's name. Berkeley UNIX uses compress and uncompress, based on the Lempel-Ziv method; after compressing a file, ".Z" is appended to the compressed file's name.

13 Reclaiming Spaces in files
When a record is deleted, space is created. When a variable-length record is updated to be larger than the original record, the updated record is moved to the end of the file, leaving space behind. When a variable-length record is updated to be smaller than the original record, space is again left over. This leftover space is known as fragmentation. Fragmentation costs performance in terms of both storage and access, so there should be a provision to reclaim the space.

14 Record deletion and Storage compaction
Storage compaction makes files smaller by looking for space with no data and recovering it. This situation occurs when a record is deleted. To reuse the space: 1. Mark the deleted record with a special symbol (fig. b). 2. Have a program identify this symbol so the space can be reused for a new record. 3. The program also needs logic to free or clean up the space used by the deleted record.

15 Deleting Fixed-length Records for Reclaiming Space Dynamically
To reuse the space from deleted records as soon as possible, deleted records must be marked in a special way. Finding the deleted space: to make record reuse quick, we need a way to know immediately whether there are empty slots in the file, and a way to jump directly to one of those slots if they exist. To do this, one can use an avail list. * Avail list: a linked list of deleted records that works as a stack.

16 Avail List Stack

17 Deleting Fixed-length Records for Reclaiming Space Dynamically(3)
Linking and stacking deleted records: arranging and rearranging links makes one available record slot point to the next; the second field of a deleted record points to the next deleted record.

18 Sample file showing linked list of deleted records
Initially the file contains 7 records. When RRN 3 is deleted, it is marked with *. If RRN 5 is then deleted, 3 was already on the avail list and now 5 is on it too, so the node at RRN 5 holds RRN 3 in its link field.

19 To insert a new Record, one needs to check, AVAIL list
To insert a new record, one needs to check the avail list. If the header node's link is -1, there are no deleted records; otherwise it contains the address of the most recently deleted node. In this example, node RRN 1 was deleted last and is the first one to be reused, so the list works like a stack. The figure below shows the order in which nodes are reused.
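The delete-then-reuse behavior described above can be sketched with an in-memory stand-in for the record file. This is an illustrative sketch, not the textbook's classes: `Slot` and `FixedFile` are made-up names, a vector plays the role of the file's fixed-length slots, and the header's link field is a plain integer.

```cpp
#include <cassert>
#include <vector>

// One fixed-length record slot: when deleted, its link field holds the
// RRN of the next free slot (-1 marks the end of the avail list).
struct Slot { bool deleted = false; int link = -1; int data = 0; };

struct FixedFile {
    std::vector<Slot> slots;
    int availHead = -1;   // header node's link field; -1 = no free slots

    // Delete: push the freed slot onto the avail list (stack push).
    void Delete(int rrn) {
        slots[rrn].deleted = true;
        slots[rrn].link = availHead;
        availHead = rrn;
    }
    // Insert: reuse the most recently freed slot if one exists
    // (stack pop); otherwise append at the end of the file.
    int Insert(int value) {
        int rrn;
        if (availHead != -1) {
            rrn = availHead;
            availHead = slots[rrn].link;
        } else {
            rrn = static_cast<int>(slots.size());
            slots.push_back(Slot{});
        }
        slots[rrn] = Slot{false, -1, value};
        return rrn;
    }
};
```

Deleting RRN 3 and then RRN 5 leaves 5 at the head with its link pointing to 3, exactly the last-in, first-out order the slide describes.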

20 Deleting Variable-length Records
As record sizes vary, an avail list of variable-length records keeps the byte count of each record at its beginning and uses byte offsets instead of RRNs. To insert a new record, search through the avail list, comparing the size of the new record with the size of each deleted slot. When a slot of the right size is found (at least big enough), use its byte offset to jump to that location and reuse it.

21 Storage Fragmentation
Internal fragmentation (with fixed-length records): wasted space within a record; variable-length records minimize wasted space by doing away with internal fragmentation. External fragmentation (with variable-length records): unused space outside or between individual records. Three possible solutions: storage compaction, coalescing the holes, and minimizing fragmentation by adopting a placement strategy.

22 Internal Fragmentation in Fixed-length Records
Unused space -> internal fragmentation. 64-byte fixed-length records:
Ames | John | 123 Maple | Stillwater | OK | |
Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | |
Brown | Martha | 625 Kimbark | Des Moines | IA | |

23 External Fragmentation Example
Coalescing the holes: smaller external fragments mostly go unused, so smaller fragments on the avail list are combined into a larger fragment that can be reused. Fragments are combined only if they are physically adjacent.

24 Placement Strategies
First-fit: select the first available record slot; suitable when lost space is due to internal fragmentation.
Best-fit: select the available record slot closest in size; keep the avail list in ascending size order.
Worst-fit: select the largest record slot; keep the avail list in descending size order; suitable when lost space is due to external fragmentation.

25 Binary Search Goal: Minimize the number of disk accesses
int BinarySearch(FixedRecordFile& file, RecType& obj, KeyType& key)
// binary search for key
{
    int low = 0;
    int high = file.NumRecs() - 1;
    while (low <= high)
    {
        int guess = low + (high - low) / 2;
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1;        // record found
        if (key < obj.Key()) high = guess - 1; // search before guess
        else low = guess + 1;                  // search after guess
    }
    return 0; // loop ended without finding key
}

26 Classes and Methods for Binary Search
class KeyType { public:
    int operator == (KeyType &);
    int operator < (KeyType &);
};
class RecType { public:
    KeyType Key();
};
class FixedRecordFile { public:
    int NumRecs();
    int ReadByRRN (RecType & record, int RRN);
};
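To see the search run end to end, here is a self-contained sketch in which the record file is simulated by a sorted in-memory array of integer keys; this simplified `FixedRecordFile` is a stand-in for the class above, with `ReadByRRN` playing the role of one disk access.

```cpp
#include <cassert>
#include <vector>

// Stand-in for a fixed-length record file: records are reduced to
// integer keys, stored in ascending key order.
struct FixedRecordFile {
    std::vector<int> keys;
    int NumRecs() const { return static_cast<int>(keys.size()); }
    int ReadByRRN(int& obj, int rrn) const { obj = keys[rrn]; return 1; }
};

// Returns 1 and leaves the record in obj if key is present, else 0.
// Each loop iteration costs one simulated disk access.
int BinarySearch(const FixedRecordFile& file, int& obj, int key) {
    int low = 0, high = file.NumRecs() - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2;   // midpoint, overflow-safe
        file.ReadByRRN(obj, guess);
        if (obj == key) return 1;             // record found
        if (key < obj) high = guess - 1;      // search before guess
        else low = guess + 1;                 // search after guess
    }
    return 0;                                 // key not in file
}
```

A file of n records needs at most about log2(n) + 1 reads, which is exactly why the next slide complains that this is still "more than one or two accesses" compared with a single read by RRN.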

27 Limitations of Binary search
Sorting a disk file in RAM: read the entire file from disk into memory, then use an internal sort (a sort performed in memory); the UNIX sort utility uses an internal sort. Limitations of binary search and internal sorting: binary search requires more than one or two disk accesses (compare a single access by RRN); keeping a file sorted is very expensive; an internal sort works only on small files.

28 [Figure: internal sort. The entire unsorted file is read from disk into memory and sorted there.]

29 Key Sorting & Its Limitations
Also called "tag sort": only the keys are sorted. Sorting procedure: read only the keys into memory, sort the keys, then rearrange the records in the file according to the sorted keys. Advantage: needs less RAM than an internal sort. Disadvantages (limitations): the records on disk must be read twice, and constructing the new (sorted) file requires a lot of seeking for records.

30 [Figure: the file before sorting and after sorting.]

31 Pseudocode for keysort
Program: keysort
    open input file as IN_FILE
    create output file as OUT_FILE
    read header record from IN_FILE and write a copy to OUT_FILE
    REC_COUNT := record count from header record
    /* read in records; set up KEYNODES array */
    for i := 1 to REC_COUNT
        read record from IN_FILE into BUFFER
        extract canonical key and place it in KEYNODES[i].KEY
        KEYNODES[i].RRN := i
    /* sort KEYNODES[] by KEY, thereby ordering RRNs correspondingly */
    sort(KEYNODES, REC_COUNT)
    /* read in records according to sorted order, and write them out in this order */
    for i := 1 to REC_COUNT
        seek in IN_FILE to record with RRN of KEYNODES[i].RRN
        read record from IN_FILE into BUFFER
        write BUFFER contents to OUT_FILE
    close IN_FILE and OUT_FILE
end PROGRAM
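The pseudocode above can be sketched in C++ with the record file simulated by an in-memory vector. This is an illustrative sketch with made-up names (`KeyNode`, `KeySort`); for simplicity the whole record serves as its own key, and indexing by RRN into the input vector stands in for the seek-per-record phase.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A key node pairs the (canonical) key with the RRN of its record.
struct KeyNode { std::string key; int rrn; };

// Keysort: sort only the (key, RRN) pairs, then fetch the full records
// in sorted-key order, as in the second loop of the pseudocode.
std::vector<std::string> KeySort(const std::vector<std::string>& records) {
    std::vector<KeyNode> keynodes;
    for (int i = 0; i < static_cast<int>(records.size()); ++i)
        keynodes.push_back({records[i], i});  // key = whole record here
    std::sort(keynodes.begin(), keynodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });
    std::vector<std::string> out;
    for (const KeyNode& kn : keynodes)
        out.push_back(records[kn.rrn]);       // "seek" by RRN into the input
    return out;
}
```

The second loop makes the limitation on the previous slide visible: each output record needs a random access by RRN, which on a real disk means one seek per record.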

32 Pinned records: records whose physical location is referenced by other records. Such records are not free to have their physical location altered, since moving them would leave dangling references. Pinned records make sorting more difficult and sometimes impossible. Solution: use an index file while keeping the actual data file in its original order.

33 Indexing Index: a data structure that associates given key values with corresponding record numbers. It is usually physically separate from the file (unlike the tight binding of indexed sequential files). Linear indexes (like the indexes found at the back of books): index records are ordered by key value, as in an ordered relative file; the best algorithm for finding a record with a specific key value is binary search; additions require reorganization.

34 Advantages of Indexing
Fast Random Accesses Uniform Access Speed Allow users to impose order on a file without actually rearranging the file Provide multiple access paths to a file Give user keyed access to variable-length record files

35 Features of Index file :
The index file contains a key and a reference field. It holds fixed-size, sorted records. The data file is not sorted because it is entry sequenced. Record addition is quick (faster than in a sorted file). Record search can be done more quickly with an index file than with a sorted data file. The index file can be kept in memory for faster access. Class TextIndex encapsulates the index data and index operations.

36 class TextIndex { public:
    TextIndex(int maxKeys = 100, int unique = 1);
    int Insert(const char* key, int recAddr);  // add to index
    int Remove(const char* key);               // remove key from index
    int Search(const char* key) const;         // search for key, return recAddr
    void Print(ostream &) const;
protected:
    int MaxKeys;       // maximum num of entries
    int NumKeys;       // actual num of entries
    char** Keys;       // array of key values
    int* RecAddrs;     // array of record references
    int Find(const char* key) const;
    int Init(int maxKeys, int unique);
    int Unique;        // if true, each key must be unique
};

37 Template Class for I/O Object
Using a template, we can work with objects of two different classes. Class IOBuffer provides common methods for the Pack and Unpack functions. With a template class RecordFile, we want to make the following code possible:
    Person p; RecordFile pFile; pFile.Read(p);
    Recording r; RecordFile rFile; rFile.Read(r);
It is difficult to support files for different record types without modifying the class. The actual declarations and calls are:
    RecordFile<Person> pFile; pFile.Read(p);
    RecordFile<Recording> rFile; rFile.Read(r);

38 Template Class RecordFile
template <class RecType> class RecordFile : public BufferFile { public: int Read(RecType& record, int recaddr = -1); int Write(const RecType& record, int recaddr = -1); int Append(const RecType& record); RecordFile(IOBuffer& buffer) : BufferFile(buffer) {} };

39 Basic Operations on Index File
Create the original empty index and data files Load the index file into memory Rewrite the index file from memory Add records to the data file and index Delete records from the data file Update records in the data file Update the index to reflect changes in the data file Retrieve records

40 Creating the files: the index file and data file are created as empty files with header records, using the Create method of class BufferFile. Loading the index into memory: loading and storing objects are supported by the IOBuffer classes; we need to choose a particular buffer class to use for the index file. Rewriting the index file from memory: part of the Close operation on an IndexedFile; writes the index object back to the index file; should protect the index against failure by writing changes when the index is out of date (using a status flag). Implementation: the Rewind and Write operations of class BufferFile.

Record Deletion: using TextIndex::Delete
Record deletion, using TextIndex::Delete: in the data file, the records need not be moved; in the index, the entry can be deleted for real or just marked. Record updating (2 categories): if the update changes the value of the key field, take a delete/add approach and reorder both the index and the data file; if the update does not affect the key field, the index file needs no rearrangement, though the data file may need to be reconstructed. Record addition: add an entry to the index, which requires rearrangement (if the index is in memory, no file access is needed), using TextIndex::Insert; add the new record to the data file using RecordFile<Recording>::Write.

42 Template class supporting all the operations.
template <class RecType>
class TextIndexedFile { public:
    int Read(RecType& record);             // read next record
    int Read(char* key, RecType& record);  // read by key
    int Append(const RecType& record);
    int Update(char* oldKey, const RecType& record);
    int Create(char* name, int mode = ios::in | ios::out);
    int Open(char* name, int mode = ios::in | ios::out);
    int Close();
    TextIndexedFile(IOBuffer& buffer, int keySize, int maxKeys = 100);
    ~TextIndexedFile();                    // close and delete
protected:
    TextIndex Index;
    BufferFile IndexFile;
    TextIndexBuffer IndexBuffer;
    RecordFile<RecType> DataFile;
    char* FileName;                        // base file name for file
    int SetFileName(char* fName, char*& dFileName, char*& IdxFName);
};

43 Indexes that are too large to hold in memory
Store large indexes on secondary storage (a large linear index). Disadvantages: binary searching the index requires several seeks; rearranging the index requires shifting or sorting records on secondary storage. Alternatives: a hashed organization, or a tree-structured index (e.g., a B-tree). Advantages over using a data file sorted by key, even when the index is on secondary storage: binary search can be used; sorting and maintaining the index is less expensive than sorting and maintaining the data file; keys can be rearranged without moving the data records.

44 Index by Multiple Keys People do not want to search only by primary key; therefore, secondary keys are used. Secondary key entries can be duplicated. A secondary key index in turn uses, or refers to, the primary key index. Operations on a secondary key index: Record addition: adding a new record requires inserting an entry into the secondary index table; the secondary index is stored in canonical form and can contain duplicate keys; when many duplicate secondary keys are found, perform a local ordering on the primary key.

45 Record Deletion (2 cases)
a) The secondary index references the record directly: delete entries from both the primary index and the secondary index, and rearrange both indexes. b) The secondary index references the primary key: delete only the primary index entry and leave the reference to the deleted record intact. Advantage: fast. Disadvantage: deleted records take up space.

46 Record Updating The primary key index serves as a kind of protective buffer. a) The secondary index references the record directly: all files containing the record's location must be updated. b) The secondary index references the primary key: the secondary index is affected only when either the primary or the secondary key changes. When the secondary key changes, rearrange the secondary key index. When the primary key changes, update all reference fields, which may require reordering the secondary index. When the change is confined to other fields, the secondary key index is not affected.

47 Problems with secondary Index
A secondary key leads to a set of one or more primary keys. Disadvantages of the secondary index structure: it must be rearranged when a new record is added, and duplicated entries waste space. These problems can be solved in two ways: Solution A: an array of references. Solution B: linking the list of references (an inverted list).

48 Revised composer index: secondary key -> set of primary key references
Array of references: this method avoids rearranging the index when a new reference is added; only the reference array for that key changes. Disadvantages: * the reference array is limited in size * internal fragmentation
[Figure: revised composer index. Each secondary key (BEETHOVEN, COREA, DVORAK, PROKOFIEV, RIMSKY-KORSAKOV, SPRINGSTEEN, SWEET HONEY IN THE R) maps to an array of primary key (Label ID) references, e.g. COREA -> WAR23699, DVORAK -> COL31809, PROKOFIEV -> LON2312.]

49 Inverted lists Guidelines for better solution
no reorganization when adding; no limitation on duplicate keys; no internal fragmentation. Solution B: linking the list of references. The figure below shows inverted lists: since secondary keys can be duplicated, each primary key reference becomes a node in a linked list hanging off the secondary key, which accommodates many duplicates (e.g., PROKOFIEV -> ANG36193 -> LON2312).

50 Advantages of Inverted Lists:
The secondary index is rearranged only when a secondary key changes, and that rearrangement is quick. There is less penalty for keeping the secondary index file on secondary storage (less need for sorting). The Label ID list file does not need to be sorted, and reusing the space of deleted records is easy. Disadvantages of inverted lists: references for the same secondary key may not be physically grouped; this lack of locality can involve a large amount of seeking. Solution: keep the list file in memory; the same Label ID list file can hold the lists of a number of secondary index files, and if it is too large for memory, only part of it can be loaded.
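The inverted-list structure above can be sketched as follows. This is an illustrative sketch with made-up names (`ListNode`, `InvertedIndex`): the secondary index holds only a head pointer per key, and the list nodes live in a separate entry-sequenced "Label ID" file, so adding a reference appends a node and updates one head, never reordering the secondary index.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One node of the Label ID list file: a primary key reference plus the
// index of the next node for the same secondary key (-1 ends the list).
struct ListNode { std::string primaryKey; int next; };

struct InvertedIndex {
    std::vector<std::pair<std::string, int>> secondary;  // key -> list head
    std::vector<ListNode> labelIdFile;                   // entry sequenced

    // Add a primary key reference under a secondary key: push the new
    // node onto the front of that key's list (most recent first).
    void Add(const std::string& secKey, const std::string& primKey) {
        for (auto& entry : secondary)
            if (entry.first == secKey) {
                labelIdFile.push_back({primKey, entry.second});
                entry.second = static_cast<int>(labelIdFile.size()) - 1;
                return;
            }
        labelIdFile.push_back({primKey, -1});            // new secondary key
        secondary.push_back({secKey, static_cast<int>(labelIdFile.size()) - 1});
    }
    // Walk the linked list for a secondary key, collecting references.
    std::vector<std::string> Lookup(const std::string& secKey) const {
        std::vector<std::string> refs;
        for (const auto& entry : secondary)
            if (entry.first == secKey)
                for (int n = entry.second; n != -1; n = labelIdFile[n].next)
                    refs.push_back(labelIdFile[n].primaryKey);
        return refs;
    }
};
```

Note how the sketch exhibits the stated disadvantage too: nodes for the same secondary key are scattered through the entry-sequenced list file in arrival order, so on disk each link traversal could mean a seek.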

51 Selective Index: indexes on a subset of records
A selective index contains only part of the entire index, providing a selective view. It is useful when the contents of a file fall into several categories, e.g. 20 < Age < 30 and $1000 < Salary. Binding: at what point is a key bound to the physical address of its associated record? 1. Primary keys are bound at the time of file construction (tight, in-the-data binding). 2. Secondary keys are bound at the time of need.

52 Tight binding provides faster access.
But if the binding is directly to the data file (tight binding), then changes to the data file force modifications to all the bound index files. Hence, postpone binding: bind only when a record is actually retrieved (retrieval-time binding). This is a safe approach with minimal reorganization, used mostly for secondary keys.

53 Thank you

