February 1 & 3, Csci 2111: Data and File Structures, Week 4, Lectures 1 & 2: Organizing Files for Performance

2 Overview In this lecture, we continue to focus on file organization, but with a different motivation. This time we look at ways to organize or re-organize files in order to improve performance.

3 Outline We will be looking at four different issues: –Data Compression: how to make files smaller –Reclaiming space in files that have undergone deletions and updates –Sorting Files in order to support binary searching ==> Internal Sorting –A better Sorting Method: KeySorting

4 Data Compression I: An Overview Question: Why do we want to make files smaller? Answer: –To use less storage, i.e., to save costs –To transmit files faster, decreasing access time, or to use the same access time with lower, cheaper bandwidth –To process the file sequentially faster.

5 Data Compression II: Suppressing Repeating Sequences ==> Redundancy Compression When the data is represented in a sparse array, we can use a type of compression called run-length encoding. Procedure: –Read through the array in sequence, copying each value, except where the same value occurs more than once in succession. –When the same value occurs more than once in succession, substitute the following 3 bytes in order: The special run-length code indicator; The value that is repeated; and The number of times that the value is repeated. –See next slide

6 Example: the encoded sequence is: ff ff ff ff. Note that there is no guarantee that space will be saved!
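The three-byte substitution described above can be sketched as follows. This is a minimal illustration, not the textbook's code; it assumes 0xFF serves as the run-length code indicator, that no literal 0xFF byte appears in the data, and that every run fits in a single count byte.

```python
def rle_encode(data, marker=0xFF):
    """Run-length encode bytes with the 3-byte substitution described
    above: (indicator, repeated value, repeat count). Simplified sketch:
    assumes no literal `marker` byte appears in the data and that every
    run fits in one count byte (<= 255)."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # extend the current run
        run = j - i
        if run > 3:                     # only runs longer than the 3-byte code pay off
            out += bytes([marker, data[i], run])
        else:
            out += data[i:j]            # short runs are copied verbatim
        i = j
    return bytes(out)

encoded = rle_encode(bytes([34, 34, 34, 34, 34, 34, 34, 35, 36]))
# seven 34s collapse to the 3 bytes (0xFF, 34, 7); 35 and 36 are copied as-is
```

Note that a sequence with no long runs encodes to at least its original length, which is exactly the "no guarantee that space will be saved" caveat above.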

7 Data Compression III: Assigning Variable-Length Codes Principle: Assign short codes to the most frequently occurring values and long ones to the least frequent ones. The code size cannot be fully optimized, as one wants codes to occur in succession, without delimiters between them, and still be recognizable. This is the principle used in Morse Code. It is also used in Huffman Coding ==> used for compression in Unix (see next slide).

8 Variable-length encoding –Letters with high frequency are encoded using shorter symbols. –Letters with low frequency are encoded using longer symbols. –Huffman code (for a set of seven letters): at most four bits per letter (a fixed-length code would need a minimum of 3 bits per letter). The string “abefd” is encoded as “ ”. –Huffman codes are used in some UNIX systems for data compression.

9 Data Compression IV: Irreversible Compression Techniques Irreversible compression is based on the assumption that some information can be sacrificed. [Irreversible compression is also called entropy reduction.] Example: shrinking a raster image from 400-by-400 pixels to 100-by-100 pixels. The new image contains 1 pixel for every 16 pixels in the original image, and there is usually no way to determine what the original pixels were from that one new pixel. Irreversible compression is seldom used with data files, but it is common in image and speech processing.

10 Data Compression V: Compression in Unix I: Huffman Coding (pack and unpack) Suppose messages are made of the letters a, b, c, d, and e, which appear with probabilities .12, .4, .15, .08, and .25, respectively. We wish to encode each character into a sequence of 0’s and 1’s so that no code for a character is the prefix of another. Answer (using Huffman’s algorithm, given on the next slide): a=1111, b=0, c=110, d=1110, e=10.

11 Constructing Huffman Codes (A FOREST is a collection of TREES; each TREE has a root and a weight)

While there is more than one TREE in the FOREST {
    i = index of the TREE in FOREST with the smallest weight;
    j = index of the TREE in FOREST with the 2nd smallest weight;
    Create a new node with left child FOREST(i)-->root and right child FOREST(j)-->root;
    Replace TREE i in FOREST by a tree whose root is the new node and whose weight is FOREST(i)-->weight + FOREST(j)-->weight;
    Delete TREE j from FOREST;
}
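The forest-merging loop above can be sketched in Python, using a heap to find the two smallest-weight trees at each step. This is an illustrative implementation (not the textbook's code); with the probabilities from the previous slide it reproduces the codes given there.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman (prefix-free) code from a {symbol: weight} dict."""
    # Each heap entry: (weight, tie-breaker, tree); a tree is either a
    # symbol or a (left, right) pair. The tie-breaker keeps tuples comparable.
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                      # more than one TREE in the FOREST
        w1, _, t1 = heapq.heappop(heap)       # smallest weight
        w2, _, t2 = heapq.heappop(heap)       # 2nd smallest weight
        # Replace the two trees by one whose weight is their sum:
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")       # left child  -> 0
            walk(tree[1], prefix + "1")       # right child -> 1
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": 0.12, "b": 0.40, "c": 0.15, "d": 0.08, "e": 0.25})
# yields a=1111, b=0, c=110, d=1110, e=10, matching slide 10
```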

12 Reclaiming Space in Files I: Record Deletion and Storage Compaction Two problems: recognizing deleted records, and reusing the space they occupied ==> Storage Compaction. Storage Compaction: after deleted records have accumulated for some time, a special program is used to reconstruct the file with all the deleted records squeezed out. Storage Compaction can be used with both fixed- and variable-length records.

13 Record Deletion (cont’d) Space reclamation: –A simple solution is to copy the file, skipping the deleted records. This is suitable for both fixed-length and variable-length records. After space reclamation: Ames|John|123 Maple|Stillwater|OK|74075|... Brown|Martha|625 Kimbark|Des Moines|IA|50311|... –In-place space reclamation (without copying the file) is more complicated and time consuming.

14 A naive approach: when inserting a new record, –search the file record by record; –if a deleted record is found, insert the new record in the place of the deleted record; –otherwise, insert the new record at the end of the file. Issues in reclaiming space quickly: –How do we know immediately whether there are empty slots in the file? –How do we jump directly to one of those slots, if they exist? Answer: link all deleted records together in a linked list: Head pointer → deleted record → deleted record → ...

15 –Use the linked list of deleted records as a stack. Initially: Head pointer → RRN 5 → RRN 2. –Add (push) a recently deleted record with RRN 3 to the top of the stack: Head pointer → RRN 3 → RRN 5 → RRN 2. –Remove (pop) a free slot from the top of the stack for an inserted record: Head pointer → RRN 5 → RRN 2.
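The push/pop behavior above can be sketched with an in-memory stand-in for the record file; the class and its names are hypothetical, with -1 marking the end of the avail list.

```python
class FixedRecordFile:
    """In-memory stand-in for a file of fixed-length records (hypothetical).
    Deleted slots form a linked stack: each deleted slot stores the RRN of
    the next free slot, and head_rrn points at the top (-1 = empty list)."""

    def __init__(self, records):
        self.slots = list(records)
        self.head_rrn = -1                 # top of the avail stack

    def delete(self, rrn):
        # Push: the freed slot records the old head, then becomes the head.
        self.slots[rrn] = ("*deleted*", self.head_rrn)
        self.head_rrn = rrn

    def insert(self, record):
        if self.head_rrn == -1:            # no free slot: append to the file
            self.slots.append(record)
            return len(self.slots) - 1
        rrn = self.head_rrn                # pop the top free slot
        _, self.head_rrn = self.slots[rrn]
        self.slots[rrn] = record
        return rrn

f = FixedRecordFile(["Ames", "Brown", "Chen", "Davis", "Evans", "Fong"])
f.delete(2)                    # avail stack: RRN 2
f.delete(5)                    # avail stack: RRN 5 -> RRN 2
rrn = f.insert("Garcia")       # reuses RRN 5, the top of the stack
```

Because the link is stored inside the deleted slot itself, the avail list costs no extra space beyond the head pointer.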

16 –Use the linked list of deleted records as a stack. –Add (push) a recently deleted record with RRN 3 to the top of the stack. –Insert three new records into the space of the deleted records.

17 An available list can store the deleted variable-length records: –How do we link the deleted records together into a list? –How do we add newly deleted records to the available list? –How do we find and remove records from the available list when space is reclaimed? An available list of variable-length records: HEAD.FIRST_AVAILABLE: Ames|John|123 Maple|Stillwater|OK|74075|64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311| Delete the second record: HEAD.FIRST_AVAILABLE: Ames|John|123 Maple|Stillwater|OK|74075|64 *| |45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

18 When inserting a new record, we search the available list for a deleted record with a large enough record length: –The current available list: Size 72 → Size 68 → Size 38 → Size 47. –Insert a record of 55 bytes: the Size 72 slot (the first one large enough) is removed and reused; the new list is Size 68 → Size 38 → Size 47.

19 Reclaiming Space in Files II: Deleting Fixed-Length Records for Reclaiming Space Dynamically RECAPPING: In some applications, it is necessary to reclaim space immediately. To do so, we can: –Mark deleted records in some special way. –Find the space that deleted records once occupied, so that we can reuse that space when we add records. –Come up with a way to know immediately whether there are empty slots in the file and to jump directly to them. Solution: Use an avail linked list in the form of a stack. Relative Record Numbers (RRNs) play the role of pointers.

20 Reclaiming Space in Files III: Deleting Variable-Length Records for Reclaiming Space Dynamically The same ideas apply as for fixed-length records, but a different implementation must be used. In particular, we must keep a byte count of each record, and the links to the next records on the avail list can no longer be RRNs (byte offsets are used instead). As well, the data structure used for the avail list cannot be a simple stack, since when re-using a slot we have to make sure it is of the right size.

21 Reclaiming Space in Files IV: Storage Fragmentation Wasted space within a record is called internal fragmentation. Variable-length records do not suffer from internal fragmentation; however, external fragmentation is not avoided. Three ways to deal with external fragmentation: –Storage compaction –Coalescing the holes –Using a clever placement strategy

22 First-fit placement strategy: (unsorted) search for the first available space that is large enough for the inserted record. –Requires the least amount of work when we place a newly available space on the list. Best-fit placement strategy: (sorted) search for the smallest available space that is large enough for the inserted record. –Order the available list in ascending order by size, then use the first-fit placement strategy. –After inserting the new record, the free area left over may be too small to be useful, which may cause serious external fragmentation. –The small free slots accumulate at the beginning of the available list, making first-fit searches increasingly long as time goes on.

23 Worst-fit placement strategy: –(sorted, descending) Order the available list in descending order by size, then use the first-fit placement strategy: always try to insert the new record into the first slot. If the first slot is not large enough, the new record is appended to the end of the file. –Decreases the chance of external fragmentation.
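The three strategies can be contrasted in a small sketch on the slide-18 size list; `choose_slot` is a hypothetical helper that models the sorted avail lists by sorting a plain list of slot sizes.

```python
def choose_slot(avail, need, strategy="first"):
    """Pick the avail-list entry to reuse for a record of `need` bytes.
    Returns the slot's index in `avail`, or None (meaning: append the
    record to the end of the file). `avail` holds free-slot sizes in
    list order."""
    if strategy == "first":          # scan the list in its stored order
        candidates = list(enumerate(avail))
    elif strategy == "best":         # list kept sorted ascending by size
        candidates = sorted(enumerate(avail), key=lambda p: p[1])
    elif strategy == "worst":        # sorted descending: try only the largest
        candidates = sorted(enumerate(avail), key=lambda p: -p[1])[:1]
    else:
        raise ValueError(strategy)
    for i, size in candidates:
        if size >= need:             # first fit over the (re)ordered list
            return i
    return None

avail = [72, 68, 38, 47]
first = choose_slot(avail, 55, "first")   # slot 0 (size 72, first big enough)
best = choose_slot(avail, 55, "best")     # slot 1 (size 68, tightest fit)
worst = choose_slot(avail, 55, "worst")   # slot 0 (size 72, the largest)
```

Best fit leaves the smallest leftover fragment (68 - 55 = 13 bytes here), which is exactly the source of the external-fragmentation concern raised above.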

24 Reclaiming Space in Files V: Placement Strategies II Some general remarks about placement strategies: –Placement strategies only apply to variable-length records. –If space is lost due to internal fragmentation, the choice is between first fit and best fit; a worst-fit strategy truly makes internal fragmentation worse. –If space is lost due to external fragmentation, one should give careful consideration to a worst-fit strategy.

25 Finding Things Quickly I: Overview I The cost of seeking is very high, and it has to be taken into consideration when determining a strategy for searching a file for a particular piece of information. The same question also arises with respect to sorting, which is often the first step toward searching efficiently. Rather than simply trying to sort and search, we concentrate on doing so in a way that minimizes the number of seeks.

26 Finding Things Quickly II: Overview II So far, the only way we have to retrieve or find records quickly is by using their RRN (in the case of fixed-length records). Without an RRN, or in the case of variable-length records, the only way, so far, to look for a record is by doing a sequential search, which is very inefficient. We are interested in more efficient ways to retrieve records based on their key value.

27 Finding Things Quickly III: Binary Search Binary search: assume that the file is sorted and that we are looking for the record whose key is Kelly in a file of 1000 fixed-length records. Comparison 1: the middle record (Johnson); Kelly comes after Johnson, so search the upper half. Comparison 2: record 750 (Monroe); Kelly comes before Monroe, so the next comparison falls between Johnson and Monroe, and so on.

28 Finding Things Quickly IV: Binary Search versus Sequential Search When sequential search is used, doubling the number of records in the file doubles the number of comparisons required. When binary search is used, doubling the number of records in the file only adds one more guess to our worst case. In order to use binary search, though, the file first has to be sorted, which can be very expensive.
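The comparison counts above can be checked with a short sketch; the probe counter models one disk access per comparison, and the names are illustrative.

```python
def binary_search(records, key):
    """Binary search over a sorted list of (key, record) pairs.
    Also counts probes: each probe models one disk access."""
    lo, hi, probes = 0, len(records) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if records[mid][0] == key:
            return mid, probes          # found
        elif records[mid][0] < key:
            lo = mid + 1                # search the upper half
        else:
            hi = mid - 1                # search the lower half
    return None, probes                 # not found

# A sorted "file" of 1000 fixed-length records, keyed by number:
records = [(i, f"record {i}") for i in range(1000)]
idx, probes = binary_search(records, 999)   # worst case: ~log2(1000) probes
```

Doubling the file to 2000 records raises the worst case by exactly one probe, whereas a sequential scan would double from up to 1000 to up to 2000 probes.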

29 Finding Things Quickly V: Sorting a Disk File in Memory If the entire contents of a file can be held in memory, we can perform an internal sort; sorting in memory is very efficient. However, if the file does not fit entirely in memory, any sorting algorithm will require a large number of seeks, and sorting would thus be extremely slow. Unfortunately, this is often the case, and solutions have to be found.

30 Finding Things Quickly VI: The Limitations of Binary Search and Internal Sorting Binary search requires more than one or two accesses, whereas accessing a record by RRN can be done with a single access ==> we would like to achieve RRN retrieval performance while keeping the advantage of key access. Keeping a file sorted is very expensive: in addition to searching for the right location for the insertion, once this location is found, we have to shift records to open up the space for insertion.

31 Finding Things Quickly VII: KeySorting Overview: when sorting a file in memory, the only things that really need sorting are the record keys. Internal sorting only works on small files ==> Keysorting.

32 Only load the record keys into RAM. A KEYNODES[ ] array has two fields: KEY and RRN. There is a correspondence between KEYNODES[ ] and the records in the actual file. In the actual sorting process, we simply sort the KEYNODES[ ] array according to the key field.

33 Keysort algorithms work like internal sorts, but with 2 important differences: –Rather than read an entire record into a memory array, we simply read each record into a temporary buffer, extract the key, and then discard the rest. –If we want to write the records in sorted order, we have to read them a second time.

34 Finding Things Quickly VIII: Limitations of the KeySort Method Writing the records in sorted order requires as many random seeks as there are records, and since writing is interspersed with reading, reading also requires as many seeks as there are records. Solution: why bother to write the file of records in key order? Simply write back the sorted index.
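The solution above (write back the sorted index rather than rewriting the records) can be sketched as follows; the record layout and key length are hypothetical.

```python
def keysort(records, key_len=5):
    """Keysort sketch: read each record once, keep only (key, RRN)
    pairs in memory, sort those, and return the sorted index instead
    of rewriting the record file itself."""
    keynodes = []
    for rrn, rec in enumerate(records):   # one sequential pass over the file
        key = rec[:key_len]               # extract the key, discard the rest
        keynodes.append((key, rrn))
    keynodes.sort()                       # internal sort of the keys only
    return keynodes

file_records = ["EVANS|...", "AMES |...", "CHEN |...", "BROWN|..."]
index = keysort(file_records)
# RRNs in key order: AMES (1), BROWN (3), CHEN (2), EVANS (0)
```

The data file is left untouched; a later lookup binary-searches the small index and then makes a single direct access by RRN, which is the retrieval behavior the previous slides were aiming for.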
