External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.

Slides:



Advertisements
Similar presentations
Csci 2111: Data and File Structures Week2, Lecture 1 & 2
Advertisements

Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
CS 400/600 – Data Structures External Sorting.
The Memory Hierarchy fastest, but small under a microsecond, random access, perhaps 512Mb Typically magnetic disks, magneto­ optical (erasable), CD­ ROM.
CS4432: Database Systems II Data Storage - Lecture 2 (Sections 13.1 – 13.3) Elke A. Rundensteiner.
Input/Output Management and Disk Scheduling
1 Advanced Database Technology February 12, 2004 DATA STORAGE (Lecture based on [GUW ], [Sanders03, ], and [MaheshwariZeh03, ])
Disk Access Model. Using Secondary Storage Effectively In most studies of algorithms, one assumes the “RAM model”: –Data is in main memory, –Access to.
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
1 Storing Data: Disks and Files Yanlei Diao UMass Amherst Feb 15, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
CPSC 231 Sorting Large Files (D.H.)1 LEARNING OBJECTIVES Sorting of large files –merge sort –performance of merge sort –multi-step merge sort.
1 Storage Hierarchy Cache Main Memory Virtual Memory File System Tertiary Storage Programs DBMS Capacity & Cost Secondary Storage.
SECTIONS 13.1 – 13.3 Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin SECONDARY STORAGE MANAGEMENT.
CS4432: Database Systems II Lecture 2 Timothy Sutherland.
CPSC-608 Database Systems Fall 2009 Instructor: Jianer Chen Office: HRBB 309B Phone: Notes #5.
Disks.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
CPSC-608 Database Systems Fall 2010 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #5.
CPSC 231 Secondary storage (D.H.)1 Learning Objectives Understanding disk organization. Sectors, clusters and extents. Fragmentation. Disk access time.
Storage. The Memory Hierarchy fastest, but small under a microsecond, random access, perhaps 512Mb Access times in milliseconds, great variability. Unit.
Secondary Storage Management Hank Levy. 8/7/20152 Secondary Storage • Secondary Storage is usually: –anything outside of “primary memory” –storage that.
Introduction to Database Systems 1 The Storage Hierarchy and Magnetic Disks Storage Technology: Topic 1.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Chapter 8 File Processing and External Sorting. Primary vs. Secondary Storage Primary storage: Main memory (RAM) Secondary Storage: Peripheral devices.
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
1 Lecture 7: Data structures for databases I Jose M. Peña
CS4432: Database Systems II Data Storage (Better Block Organization) 1.
CS 346 – Chapter 10 Mass storage –Advantages? –Disk features –Disk scheduling –Disk formatting –Managing swap space –RAID.
Lecture 11: DMBS Internals
Chapter Trees and External Storage © John Urrutia 2014, All Rights Reserved1.
1 6 Further System Fundamentals (HL) 6.2 Magnetic Disk Storage.
Introduction to Database Systems 1 Storing Data: Disks and Files Chapter 3 “Yea, from the table of my memory I’ll wipe away all trivial fond records.”
1 Chapter 17 Disk Storage, Basic File Structures, and Hashing Chapter 18 Index Structures for Files.
I/O Management and Disk Structure Introduction to Operating Systems: Module 14.
ICS 321 Fall 2011 Overview of Storage & Indexing (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 11/9/20111Lipyeow.
OSes: 11. FS Impl. 1 Operating Systems v Objectives –discuss file storage and access on secondary storage (a hard disk) Certificate Program in Software.
Database Management Systems,Shri Prasad Sawant. 1 Storing Data: Disks and Files Unit 1 Mr.Prasad Sawant.
Indexing.
IDA / ADIT Databasteknik Databaser och bioinformatik Data structures and Indexing (I) Fang Wei-Kleiner.
Chapter 8 External Storage. Primary vs. Secondary Storage Primary storage: Main memory (RAM) Secondary Storage: Peripheral devices  Disk drives  Tape.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and Chapter 11 (Sec ): G-M et al. (R2) OR.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Indexing CS 400/600 – Data Structures. Indexing2 Memory and Disk  Typical memory access: 30 – 60 ns  Typical disk access: 3-9 ms  Difference: 100,000.
CPSC-608 Database Systems Fall 2015 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #5.
Lecture 6 : External Sorting Bong-Soo Sohn Assistant Professor School of Computer Science and Engineering Chung-Ang University.
Internal and External Sorting External Searching
FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
DMBS Architecture May 15 th, Generic Architecture Query compiler/optimizer Execution engine Index/record mgr. Buffer manager Storage manager storage.
CS399 New Beginnings Jonathan Walpole. Disk Technology & Secondary Storage Management.
Chapter 9 I/O System. 2 Input/Output System I/O Major objectives are: Take an application I/O request and send it to the physical device. Then, take whatever.
CPSC 231 Secondary storage (D.H.)1 Learning Objectives Understanding disk organization. Sectors, clusters and extents. Fragmentation. Disk access time.
Programmer’s View of Files Logical view of files: –An a array of bytes. –A file pointer marks the current position. Three fundamental operations: –Read.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
Lecture 3 Secondary Storage and System Software I
CENG 3511 External Sorting. CENG 3512 Outline Introduction Heapsort Multi-way Merging Multi-step merging Replacement Selection in heap-sort.
File organization Secondary Storage Devices Lec#7 Presenter: Dr Emad Nabil.
File Organization Record Storage and Primary File Organization
Chapter 2: Computer-System Structures
Lecture 16: Data Storage Wednesday, November 6, 2006.
CPSC-608 Database Systems
Lecture 11: DMBS Internals
Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin
Secondary Storage Management Brian Bershad
Secondary Storage Management Hank Levy
These notes were largely prepared by the text’s author
CENG 351 Data Management and File Structures
Presentation transcript:

External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary storage is SLOW, about 250,000 times slower. (i.e. can do 250,000 CPU instructions in one I/O time). Primary storage is volatile. Secondary storage is permanent.

Golden Rule of File Processing Minimize the number of disk accesses!! Arrange information so that you get what you want with few disk accesses. Arrange information so you minimize future disk accesses. An organization for data on disk is often called a file structure. Disk based space/time tradeoff: Compress information to save time by reducing disk accesses.

Disk Drives Track - a circle around a disk that holds information. Sector - arc portion of a track. Interleaving factor - Physical distance between logically adjacent sectors on a track, to allow for processing of sector data. Locality of Reference - If a record is read from a disk, the next request is likely to come from near the same place in the file. Cluster - smallest unit of allocation; usually several sectors. Extent - group of physically contiguous clusters Internal Fragmentation - wasted space within a sector if the record size is not the sector size.

Access Time Seek Time - time for I/O head to reach desired track. Based on distance between I/O head and desired track. –F(n)=t*n+s where t is time to traverse one track and s is the startup time for the I/O head. Rotational delay (latency) - time for data to rotate to I/O head position. Determined by disk RPM. Transfer time - time for data to move under the I/O head. Determined by RPM and size of information to transfer.

Disk Access time example 675 Mbyte disk drive –15 platters --> 45 Mbyte/platter –612 tracks/platter –150 sectors/track --> 512 bytes/sector –8 sectors/cluster -> 4K/cluster -> 18 clusters/track –1 track seek -->.08ms –seek startup 3 ms –1 revolution --> 16.7ms –Interleaving factor of 3 -> 3 revolutions to read 1 track (50.1 msec) How long to read a file of 128K divided into 256 records of 512 bytes? –Uses 2 tracks (150 sectors on one 106 on other) –The average distance for seek is disk size/3 (not 2!)

Access Time total time = initial seek + second seek +2*(rotational delay+transfer time) total = (612/3*.08+3) + (.08+3) + 2*(.5+3)*16.7= Assumes clusters are contiguous and tracks are contiguous If clusters are randomly spread across disk –time = 32*(612/3* /2+24/150*16.7) = msec –24/150 is the part of the disk to read (8 sectors with an interleaving factor of 3).

Buffers Read time for one track –612/3* (3+.5)*16.7 = 77.8 Read time for one sector –612/3* (.5+1/150)*16.7 = 27.8 Read time for one byte –612/3* (.5)*16.7 = 27.7 Nearly all disk drives read/write one sector every time there is an I/O access. The information in a sector is stored in a buffer in the operating system.

Buffer Pools A series of buffers used by an application to cache disk data is called a buffer pool. Double buffering - read data from disk while CPU is processing the previous buffer. Same with writing; store written data into buffer while previous buffer is being physically placed on disk. Many buffers - allows for large differences between I/O time and CPU time. Sometimes I/O may fill a buffer faster than CPU can empty it.

Programmer’s View of Files Logical view of files: –An array of bytes. –A file pointer marks the current position. Three fundamental operations: –Read bytes from current position (move file pointer). –Write byes to current position (move file pointer). –Set file pointer to specified position.

External Sorting Problem: sort data sets too large to fit in main memory. Assume the data is stored on a disk drive. To sort, portions of the data must be brought into main memory, processed, and returned to disk. An external sort should minimize disk accesses.

External Computation Model Secondary storage is divided into equal sized blocks. The basic I/O operation transfers one block of information. Under certain circumstances, reading blocks of a file in sequential order is more efficient. –Minimize seek time. –File physically sequential. –Head does not move between accesses - no timesharing

More Model Typically, the time to perform a single block I/O operation is enough to Quicksort the contents of the block. Most systems have single drive, so must sort on a single drive. Need to minimize the number of block I/O operations.

Key Sorting Often records are large while keys are small Approach 1 –read in records, sort them, write them out Approach 2 (resembles pointer sort) –Read in only the key values –Store with each key the location on disk of its associated record. –Read in records in key order and write them out in sorted order

External sort: Simple Mergesort Quicksort requires random access to the entire set of records A better process for external data is a modified Mergesort algorithm This processes n elements in O(log n) passes. A group of sorted records is called a run.

External Mergesort Algorithm Split the file into two files. Read a block from each file. Take first record from each block, output them in sorted order. Take next record from each block, output them to a second file in sorted order. Repeat until finished, alternating between output files. Read new input blocks as needed. Now have runs of size 2 Repeat steps 2-5 except the input files have groups of 2 that need to be merged. Each pass provides runs of twice the size.

Problems Is each pass through then input and output files truly sequential? –If the files are all on the same disk, then the heads will be moving from one input file to another input file to an output file to another output file. –Very helpful if each file has its own disk head. How do we reduce the number of passes. –At the beginning, read in as much data as possible and sort it internally and just write it out. Can merge more than 2 runs at a time. - multiway merging.

THE MERGESORT The Mergesort process has 2 phases: –Break the file into large initial runs. –Merge the runs together to make a single sorted list

Replacement Selection This method tries to maximize the size of initial runs Break available memory into an array for a heap, an input buffer and an output buffer. Fill the array from disk. Make a min-heap Send the smallest value to the output buffer Read next key If new key is greater than last output value –replace the root with this key and “heapify” else –replace the root with the last key and “heapify” –add next record to a new heap (end of the array).