CPSC 231 Organizing Files for Performance (D.H.) 1 LEARNING OBJECTIVES Data compression. Reclaiming space in files. Compaction. Searching. Sorting, Keysorting.

Presentation transcript:

CPSC 231 Organizing Files for Performance (D.H.) 1 LEARNING OBJECTIVES Data compression. Reclaiming space in files. Compaction. Searching. Sorting, Keysorting. Indexing.

CPSC 231 Organizing Files for Performance (D.H.) 2 Data Compression Data compression is a technique for encoding the information in a file in such a way that it takes up less space. Why perform data compression? –using less storage results in cost savings –smaller files take less time to access

CPSC 231 Organizing Files for Performance (D.H.) 3 Data Compression Techniques Using a Different Notation Suppressing Repeating Sequences Assigning Variable-length codes Irreversible Compression Techniques

CPSC 231 Organizing Files for Performance (D.H.) 4 Redundancy Reduction Example: struct Person { char firstName[15]; char lastName[15]; char address[24]; char city[15]; char state[3]; }; In this example the field state takes up three bytes per record. But since there are only 50 states, there is no need to use 24 bits (=3*8); 6 bits are sufficient. WHY? (6 bits can represent 2^6 = 64 distinct values, and 64 >= 50.)

CPSC 231 Organizing Files for Performance (D.H.) 5 Redundancy reduction -cont. Thus we can use one byte (instead of 3 or 2) to encode the state name and save 2/3 or 1/2 of the space for this field. Fixed-length fields that take only a small set of possible values are good candidates for this technique.
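The one-byte state code could be sketched as a simple table lookup. This is only an illustration: the names `STATE_TABLE`, `bits_needed`, `encode_state` and `decode_state` are hypothetical, and the choice of "position in an alphabetical table" as the code is an assumption, not something the slides prescribe.

```c
#include <assert.h>
#include <string.h>

/* Two-letter postal abbreviations of the 50 U.S. states, ordered by state
   name. The position in this table serves as the one-byte code (assumption). */
static const char *const STATE_TABLE[] = {
    "AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA",
    "HI","ID","IL","IN","IA","KS","KY","LA","ME","MD",
    "MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
    "NM","NY","NC","ND","OH","OK","OR","PA","RI","SC",
    "SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"
};
enum { STATE_COUNT = sizeof STATE_TABLE / sizeof STATE_TABLE[0] };

/* Smallest number of bits that can distinguish n different values. */
unsigned bits_needed(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Returns the one-byte code for a state abbreviation, or -1 if unknown. */
int encode_state(const char *abbrev)
{
    for (int i = 0; i < STATE_COUNT; i++)
        if (strcmp(STATE_TABLE[i], abbrev) == 0)
            return i;
    return -1;
}

/* Returns the abbreviation for a code; the code must be in range. */
const char *decode_state(int code)
{
    assert(code >= 0 && code < STATE_COUNT);
    return STATE_TABLE[code];
}
```

Here `bits_needed(50)` evaluates to 6, which matches the slide's claim; storing the code in a whole byte wastes two bits but keeps the field byte-aligned.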

CPSC 231 Organizing Files for Performance (D.H.) 6 Pros and Cons of the Redundancy Reduction Cons: –Encoding is binary, thus unreadable by humans –Encoding/decoding modules are required when processing data, which adds complexity to the processing software Pros: –It can prove very beneficial (i.e. it can save a lot of space) for a particular application: if the files are large; if the files are mostly processed by just one application

CPSC 231 Organizing Files for Performance (D.H.) 7 Suppressing Repeating Sequences Example: a black and white image of the sky. Most of the sky is black. If the picture is represented as an array of pixels, then the black parts would be represented by 0's (no color, or no brightness). Instead of repeating 0's a lot of times, we can use an encoding that keeps track of the number of 0's. This picture is a sparse array (an array in which most entries are 0).

CPSC 231 Organizing Files for Performance (D.H.) 8 Suppressing Repeating Sequences -cont. Run-length encoding is a compression method in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated. Example (here written as indicator, value, count): –a sequence such as 13 14 00 00 00 00 00 00 00 00 –can be encoded as: –13 14 ff 00 08 where ff is a run-length encoding indicator, 00 is the (pixel) value repeated, and 08 is the number of repetitions.
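A minimal sketch of run-length encoding with an `ff` indicator, a value byte and a count byte, in that order. Several details are assumptions made for the illustration: the function names, the rule that only runs of four or more bytes are encoded (a run costs three bytes), the 255 cap on a run, and above all the requirement that `ff` never occurs as ordinary data.

```c
#include <assert.h>
#include <stddef.h>

#define RLE_MARK 0xFF   /* run-length indicator byte */
#define RLE_MIN_RUN 4   /* a run costs 3 bytes, so shorter runs are copied */

/* Encodes in[0..n) into out and returns the encoded length.
   out must have room for at least n bytes.
   Assumption: RLE_MARK never occurs as a data byte in the input. */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t run = 1;
        assert(in[i] != RLE_MARK);
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        if (run >= RLE_MIN_RUN) {
            out[o++] = RLE_MARK;            /* indicator */
            out[o++] = in[i];               /* repeated value */
            out[o++] = (unsigned char)run;  /* number of repetitions */
        } else {
            for (size_t k = 0; k < run; k++)
                out[o++] = in[i];
        }
        i += run;
    }
    return o;
}

/* Decodes in[0..n) into out and returns the decoded length.
   out must be large enough for the expanded data. */
size_t rle_decode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        if (in[i] == RLE_MARK && i + 2 < n) {
            for (unsigned k = 0; k < in[i + 2]; k++)
                out[o++] = in[i + 1];
            i += 3;
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```

With this sketch, the ten bytes `13 14` followed by eight `00` bytes encode to the five bytes `13 14 ff 00 08`. Input without long runs is copied unchanged, which is why the method does not guarantee savings.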

CPSC 231 Organizing Files for Performance (D.H.) 9 Pros and Cons of Run-Length Encoding Cons: it does not guarantee space savings. Pros: it is simple; in some applications (such as image processing) the space savings can be substantial.

CPSC 231 Organizing Files for Performance (D.H.) 10 Variable - Length Encoding Variable length encoding is a scheme in which the codes are of different lengths. More frequently occurring codes are given shorter lengths and less frequently occurring codes are given longer lengths. Example: –Morse Code (the letters e and t are the most frequent in English, thus they are assigned a single dot (.) and a single dash (-) respectively) –Huffman encoding - a variable length encoding in which the lengths of the codes are based on the probability of their occurrence (binary tree structure)
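To make the link between frequency and code length concrete, here is a sketch that computes Huffman code lengths from symbol frequencies. It uses the standard construction (repeatedly merge the two lightest trees), but with a plain quadratic search rather than a priority queue to keep it short. The function name, the `MAX_SYMBOLS` limit and any frequencies fed to it are illustrative assumptions.

```c
#include <assert.h>

#define MAX_SYMBOLS 64

/* Computes Huffman code lengths for n symbols with the given frequencies.
   Uses a simple O(n^2) search for the two lightest trees instead of a heap. */
void huffman_code_lengths(const unsigned long *freq, int n, int *length)
{
    unsigned long weight[2 * MAX_SYMBOLS];
    int parent[2 * MAX_SYMBOLS];
    int active[2 * MAX_SYMBOLS];   /* 1 while the node is the root of a tree */
    int count = n;

    assert(n >= 1 && n <= MAX_SYMBOLS);
    for (int i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        active[i] = 1;
    }
    /* Repeatedly merge the two lightest trees; n - 1 merges leave one tree. */
    for (int merges = 0; merges < n - 1; merges++) {
        int a = -1, b = -1;        /* a = lightest root, b = second lightest */
        for (int i = 0; i < count; i++) {
            if (!active[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }
        weight[count] = weight[a] + weight[b];
        parent[count] = -1;
        active[count] = 1;
        parent[a] = parent[b] = count;
        active[a] = active[b] = 0;
        count++;
    }
    /* A leaf's code length is its depth: the number of links up to the root. */
    for (int i = 0; i < n; i++) {
        int depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            depth++;
        length[i] = depth;
    }
}
```

For the made-up frequencies 45, 13, 12, 16, 9, 5 the resulting lengths are 1, 3, 3, 3, 4, 4 bits: the most frequent symbol gets the shortest code.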

CPSC 231 Organizing Files for Performance (D.H.) 11 Irreversible Compression Irreversible compression techniques are based on losing (sacrificing) some information. –Example: a 400-by-400 pixel image is compressed to 100-by-100 pixels. The original information cannot be restored once the data have been compressed using an irreversible compression technique.
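One way such a size reduction might be done is by averaging blocks of pixels. The slides do not say how the 400-by-400 image is shrunk, so the averaging rule, the 8-bit grayscale layout and the function name below are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Shrinks a w-by-h 8-bit grayscale image by `factor` in each dimension,
   replacing every factor-by-factor block with its average brightness.
   src holds w*h pixels in row-major order;
   dst receives (w/factor)*(h/factor) pixels. */
void downsample(const unsigned char *src, size_t w, size_t h,
                size_t factor, unsigned char *dst)
{
    assert(factor > 0 && w % factor == 0 && h % factor == 0);
    size_t out_w = w / factor, out_h = h / factor;
    for (size_t oy = 0; oy < out_h; oy++) {
        for (size_t ox = 0; ox < out_w; ox++) {
            unsigned long sum = 0;
            for (size_t dy = 0; dy < factor; dy++)
                for (size_t dx = 0; dx < factor; dx++)
                    sum += src[(oy * factor + dy) * w + (ox * factor + dx)];
            dst[oy * out_w + ox] = (unsigned char)(sum / (factor * factor));
        }
    }
}
```

Calling it with `w = h = 400` and `factor = 4` produces the 100-by-100 image from the example. Sixteen source pixels collapse into one average, so the individual values cannot be recovered.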

CPSC 231 Organizing Files for Performance (D.H.) 12 Reclaiming Space in Files Problem –Once a variable length record is deleted from a file, the space it occupied is left behind but cannot easily be reused. WHY? File modification can take one of the following forms: –Record addition –Record updating –Record deletion

CPSC 231 Organizing Files for Performance (D.H.) 13 Record Deletion and Storage Compaction This method consists of two steps: –marking the record for deletion –compacting the file later on, once a number of deleted records have accumulated –Example: (file of records storing colors) –Original file: blue|magenta|red|green|yellow –After deleting magenta record : blue|*agenta|red|green|yellow –After deleting green record: blue|*agenta|red|*reen|yellow –After compaction: blue|red|yellow
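The compaction step from the color example can be sketched on an in-memory buffer. This is a simplification: a real implementation would copy from the old file to a new one. The function name is hypothetical, and the sketch assumes that a live record never begins with `*`.

```c
#include <assert.h>
#include <string.h>

/* Compacts a buffer of '|'-delimited records in place, dropping every record
   whose first character is '*' (the deletion mark). Returns the new length.
   Assumption: live records never begin with '*'. */
size_t compact(char *buf)
{
    assert(buf != NULL);
    const char *src = buf;
    char *dst = buf;
    int first = 1;                      /* no delimiter before the first kept record */

    while (*src != '\0') {
        size_t len = strcspn(src, "|"); /* length of the current record */
        if (*src != '*') {
            if (!first)
                *dst++ = '|';
            memmove(dst, src, len);     /* slide the record toward the front */
            dst += len;
            first = 0;
        }
        src += len;
        if (*src == '|')
            src++;                      /* skip the delimiter */
    }
    *dst = '\0';
    return (size_t)(dst - buf);
}
```

Applied to `blue|*agenta|red|*reen|yellow`, the buffer becomes `blue|red|yellow`.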

CPSC 231 Organizing Files for Performance (D.H.) 14 External Fragmentation and Compaction External fragmentation is wasted space in a file that occurs outside of or in between records. (See the previous example.) Compaction is a method of eliminating external fragmentation by sliding all the records together so that there is no space between them. (See the previous example.)

CPSC 231 Organizing Files for Performance (D.H.) 15 Deleting Fixed-Length Records This method consists of the following steps: –marking the record for deletion –placing the deleted record on a list of available records, kept as a linked list (e.g. implemented as a queue or a stack); RRNs can serve as the links since the records are of fixed size Example: –Head -> RRN=3->RRN=5->RRN=-1 (EOL)

CPSC 231 Organizing Files for Performance (D.H.) 16 Deleting Fixed-Length Records-Cont. A list of available records is called an avail list. You can dedicate the first field of the deleted record to indicate that the record is deleted by placing a special character there (e.g. “*”) and you can use another field to keep a pointer to (or an RRN of) the next available record.
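A sketch of an avail list for fixed-length records, kept as a stack of RRNs inside the deleted slots themselves. An in-memory array stands in for the file. The record size, the struct layout and all function names are assumptions for illustration, and the sketch does not guard against deleting the same record twice.

```c
#include <assert.h>
#include <string.h>

#define RECORD_SIZE 16
#define MAX_RECORDS 8
#define END_OF_LIST (-1)

/* A small in-memory stand-in for a file of fixed-length records. */
struct RecordFile {
    char records[MAX_RECORDS][RECORD_SIZE];
    int count;        /* number of record slots in use (live or deleted) */
    int avail_head;   /* RRN of the first free slot, or END_OF_LIST */
};

void file_init(struct RecordFile *f)
{
    memset(f, 0, sizeof *f);
    f->avail_head = END_OF_LIST;
}

/* Marks the slot as deleted and pushes it on the avail list (used as a stack).
   Byte 0 holds '*'; the following bytes hold the RRN of the next free slot. */
void delete_record(struct RecordFile *f, int rrn)
{
    assert(rrn >= 0 && rrn < f->count);
    f->records[rrn][0] = '*';
    memcpy(&f->records[rrn][1], &f->avail_head, sizeof f->avail_head);
    f->avail_head = rrn;
}

/* Stores a record, reusing the most recently freed slot if there is one.
   Returns the RRN used, or -1 if the file is full. */
int add_record(struct RecordFile *f, const char *data)
{
    int rrn;
    if (f->avail_head != END_OF_LIST) {
        rrn = f->avail_head;                      /* pop the avail list */
        memcpy(&f->avail_head, &f->records[rrn][1], sizeof f->avail_head);
    } else if (f->count < MAX_RECORDS) {
        rrn = f->count++;                         /* append at the end */
    } else {
        return -1;
    }
    strncpy(f->records[rrn], data, RECORD_SIZE - 1);
    f->records[rrn][RECORD_SIZE - 1] = '\0';
    return rrn;
}
```

Deleting record 5 and then record 3 yields the list Head -> 3 -> 5 -> -1; the next two additions reuse slots 3 and 5 before the file grows again.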

CPSC 231 Organizing Files for Performance (D.H.) 17 Internal Fragmentation Internal fragmentation is wasted (unused) space inside records or sectors. Fixed-length record structures often result in internal fragmentation.

CPSC 231 Organizing Files for Performance (D.H.) 18 Deleting Variable-Length Records This method consists of the following steps: –marking the record for deletion (e.g. use “*”) –placing the deleted record on the list of available records (the avail list) –recording the size of the record in the avail list –using the byte offset in the file to locate the record (not the RRN) WHY? Head -> (offset, size)->(offset, size) ->(-1, -1).

CPSC 231 Organizing Files for Performance (D.H.) 19 Storage Fragmentation As stated before, a fixed size record structure causes internal fragmentation. A variable size record structure does not cause internal fragmentation, but it causes external fragmentation.

CPSC 231 Organizing Files for Performance (D.H.) 20 Eliminating and reducing external fragmentation Compaction (explained earlier). Coalescing the holes = combining adjacent available (deleted) records into one larger available record. Using a suitable placement policy: first fit (use the first available record that is big enough) - O.K. when dealing with internal fragmentation; best fit (use the smallest available record that is big enough) - O.K. when dealing with internal fragmentation; worst fit (use the biggest available record, and put the unused rest of this record back on the avail list).
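The three placement policies can be compared in a short sketch. For simplicity the avail list is an array of (offset, size) entries rather than a linked list; the struct, the function names and the sample sizes mentioned afterwards are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of the avail list: a free slot in the file. */
struct Slot {
    long offset;   /* byte offset of the slot in the file */
    long size;     /* size of the slot in bytes */
};

/* Each policy returns the index of the chosen slot, or -1 if none fits. */

int first_fit(const struct Slot *avail, size_t n, long needed)
{
    for (size_t i = 0; i < n; i++)
        if (avail[i].size >= needed)
            return (int)i;
    return -1;
}

int best_fit(const struct Slot *avail, size_t n, long needed)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (avail[i].size >= needed &&
            (best < 0 || avail[i].size < avail[best].size))
            best = (int)i;
    return best;
}

int worst_fit(const struct Slot *avail, size_t n, long needed)
{
    int worst = -1;
    for (size_t i = 0; i < n; i++)
        if (avail[i].size >= needed &&
            (worst < 0 || avail[i].size > avail[worst].size))
            worst = (int)i;
    return worst;
}

/* Takes `needed` bytes from the front of a slot; the remainder stays on the
   avail list as a smaller slot. Returns the offset where the record goes. */
long take_from_slot(struct Slot *slot, long needed)
{
    assert(needed > 0 && slot->size >= needed);
    long where = slot->offset;
    slot->offset += needed;
    slot->size -= needed;
    return where;
}
```

With free slots of 40, 72, 25 and 30 bytes and a 28-byte record to place, first fit picks the 40-byte slot, best fit the 30-byte slot and worst fit the 72-byte slot, whose 44-byte remainder is still large enough to be useful.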

CPSC 231 Organizing Files for Performance (D.H.) 21 Finding records quickly in files using keys Sequential search = reading the records in the file in serial order until the sought record is found. It is slow, but good for small files. On average it requires reading n/2 records before the sought record is found (for n=2000, this is 1000).

CPSC 231 Organizing Files for Performance (D.H.) 22 Binary Search Binary search = locating the sought record in a sorted list of records by repeatedly selecting the middle element of the list and dividing the list in half until the sought record is found. It is much faster than sequential search: it requires reading at most $1 + \lfloor \log_2 n \rfloor$ records before the sought record is found (for n=2000 this is 11). COMPARE THIS WITH SEQUENTIAL SEARCH!
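A sketch of binary search that also counts how many keys it examines, so that the figure of 11 for n=2000 can be checked. It works on an in-memory array of integer keys; a file-based version would read the record at the middle RRN instead. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Searches the sorted array keys[0..n) for target.
   Returns its index or -1, and stores the number of keys examined in *probes. */
long binary_search(const long *keys, size_t n, long target, unsigned *probes)
{
    assert(probes != NULL);
    size_t low = 0, high = n;          /* search the half-open range [low, high) */
    *probes = 0;
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        (*probes)++;
        if (keys[mid] == target)
            return (long)mid;
        if (keys[mid] < target)
            low = mid + 1;
        else
            high = mid;
    }
    return -1;
}

/* Worst-case number of probes for n sorted records: 1 + floor(log2 n). */
unsigned max_probes(size_t n)
{
    unsigned probes = 0;
    while (n > 0) {
        probes++;
        n /= 2;
    }
    return probes;
}
```

`max_probes(2000)` evaluates to 11, against roughly 1000 reads on average for a sequential search of the same file.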

CPSC 231 Organizing Files for Performance (D.H.) 23 Cons of Binary Search It may require a lot of seek time because the records read are NOT adjacent on the disk. The file has to be sorted - keeping it sorted might prove expensive, especially if a lot of new records are being added. An in-memory sort can be performed only on relatively small files.

CPSC 231 Organizing Files for Performance (D.H.) 24 Keysort Keysort - a method of sorting a file in which only the keys and pointers to the records are held in main memory, NOT the entire file. The sorted list of keys is then used to sort the file on the disk by rewriting it to a new file. Keysorting’s main disadvantage is that rewriting the file means reading the records in key order, which requires a seek for each record and can be much slower than reading the file sequentially.
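A minimal sketch of the in-memory part of keysort, using the standard library's `qsort`. An array of strings stands in for the data file, and each string is treated as its own key. Both of these, along with the `KeyNode` struct and the function names, are simplifying assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_SIZE 16

/* What keysort keeps in memory for each record: its key and where it lives. */
struct KeyNode {
    char key[KEY_SIZE];
    long rrn;            /* relative record number in the data file */
};

static int compare_keys(const void *a, const void *b)
{
    const struct KeyNode *x = a, *y = b;
    return strcmp(x->key, y->key);
}

/* Builds one KeyNode per record and sorts the nodes by key.
   records[i] stands in for reading record i from the file; only its key
   (assumed here to be the whole string) is copied into memory. */
void keysort(const char *const *records, size_t n, struct KeyNode *nodes)
{
    for (size_t i = 0; i < n; i++) {
        assert(strlen(records[i]) < KEY_SIZE);
        strcpy(nodes[i].key, records[i]);
        nodes[i].rrn = (long)i;
    }
    qsort(nodes, n, sizeof nodes[0], compare_keys);
}
```

After the sort, writing the new file means fetching record `nodes[0].rrn`, then `nodes[1].rrn`, and so on. Those RRNs are in key order rather than file order, which is where the extra seeks come from.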

CPSC 231 Organizing Files for Performance (D.H.) 25 Pinned Records A record is pinned when there are other records pointing to its physical location. Another disadvantage of keysorting is that it might move pinned records, resulting in the phenomenon called “dangling pointers”, i.e. pointers that no longer point to the records they were meant to reference.

CPSC 231 Organizing Files for Performance (D.H.) 26 Indexing Instead of rearranging the entire file, it is sufficient to write the sorted list of keys, with pointers to the records, to secondary storage. This list of sorted keys with pointers to the records in the data file is called an index. Indexing solves most of the problems associated with binary searching and keysorting.
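An index of this kind can be sketched as a sorted array of (key, byte offset) pairs that is searched with the standard library's `bsearch`, while the data file stays in its original order. The struct, the function names and the fixed `KEY_SIZE` are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_SIZE 16

/* One index entry: a key and the byte offset of its record in the data file. */
struct IndexEntry {
    char key[KEY_SIZE];
    long offset;
};

static int compare_entries(const void *a, const void *b)
{
    const struct IndexEntry *x = a, *y = b;
    return strcmp(x->key, y->key);
}

/* Sorts the index by key; the data file itself is left untouched. */
void index_sort(struct IndexEntry *index, size_t n)
{
    qsort(index, n, sizeof index[0], compare_entries);
}

/* Binary-searches the sorted index and returns the record's byte offset,
   or -1 if the key is not present. */
long index_lookup(const struct IndexEntry *index, size_t n, const char *key)
{
    struct IndexEntry probe;
    assert(strlen(key) < KEY_SIZE);
    strcpy(probe.key, key);
    probe.offset = -1;
    const struct IndexEntry *hit =
        bsearch(&probe, index, n, sizeof index[0], compare_entries);
    return hit ? hit->offset : -1;
}
```

For a data file containing `blue|magenta|red|green|yellow`, the entries would carry offsets 0, 5, 13, 17 and 23. Looking up `green` returns 17 without moving any record, so pinned records stay where they are.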