Chapter 9, Disks and Files


1 Chapter 9, Disks and Files
The Storage Hierarchy; Disks (mechanics, performance, RAID); Disk Space Management; Buffer Management; Files of Records (format of a heap file, of a data page, and of records).

2 Learning objectives Given disk parameters, compute storage needs and read times. Given a reminder about what each level means, be able to derive any figure on the RAID performance slide. Describe the pros and cons of alternative structures for files, pages, and records.

3 A (Very) Simple Hardware Model
[Diagram] The CPU chip (register file, ALU) connects through its bus interface and the system bus to an I/O bridge, which links the memory bus (main memory) with the I/O bus; on the I/O bus sit the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.

4 Storage Options

Option            Capacity           Access Time       Cost
Registers         1K-2K bytes        1 Tc              way expensive
Caches            10s-1000s of KB    2-20 Tc           $10/MB
Main Memory       GBs                300-1000 Tc       $0.03/MB (eBay)
Hard Disk/Flash   100s of GBs        10 ms = 30M Tc    $0.10/GB (eBay)
Tape              infinite           forever           way cheap

(Tc = processor clock cycle.)

5 Memory “Hierarchy”
Upper level: faster, smaller, more expensive; lower level: larger, cheaper. Each level stages transfers to the level above:
Registers: 1K-2K bytes, 1 Tc, way expensive; staged by the program/compiler as 1-8 byte instruction operands.
Cache (SRAM; may be multiple levels!): 10s-1000s of KB, 2-20 Tc, $10/MB; staged by the cache controller in 8-128 byte blocks.
Memory (DRAM): GBs, 300-1000 Tc, $0.03/MB (eBay); staged by the OS in 4K+ byte pages.
Disk: 100s of GBs, 10 ms = 30M Tc, $0.10/GB (eBay); staged by the user/operator in GB-sized files.
Tape: infinite capacity, access takes forever, way cheap.

6 Why Does “Hierarchy” Work?
Locality: programs access a relatively small portion of the address space at any instant of time. Two different types:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

7 9.1 The Memory Hierarchy Typical storage hierarchy as used by an RDBMS:
Primary storage: main memory (RAM) for currently used data.
Secondary storage: disk or flash memory for the main database.
Tertiary storage: tapes, DVDs for archiving older versions of the data.
Other factors: caches at every level; controllers and protocols; network connections.
What are other reasons besides cost to use disk? Persistence (we want databases to stay around) and size (32-bit addressing is insufficient for many databases).

8 What is FLASH Memory, Anyway?
Floating-gate transistor; presence of charge => “0”. Erase electrically or with UV light (EPROM). Performance: reads like DRAM (~ns); writes like disk (~ms), since a write is a complex operation.

9 Components of a Disk
[Diagram: spindle, platters, tracks, sectors, disk head, arm assembly, arm movement]
The platters are always spinning (say, 120 rps; 120 r/s x 60 s/min = 7,200 rpm). Only one head reads/writes at any one time. To read a record: position the arm (seek), engage the head, wait for the data to spin by, then read (transfer the data).

10 More terminology
Each track is made up of fixed-size sectors. Page size is a multiple of sector size. A platter typically has data on both surfaces. All the tracks you can reach from one position of the arm form a cylinder (imaginary!).

11 Disk Technology Background

                  CDC Wren I, 1983     Seagate, 2003
Spindle speed     3,600 RPM            15,000 RPM (4X)
Capacity          0.03 GB              73.4 GB (2,500X)
Tracks/inch       800                  64,000 (80X)
Bits/inch         9,550                533,000 (60X)
Platters          three 5.25"          four 2.5" (in a 3.5" form factor)
Bandwidth         0.6 MB/sec           86 MB/sec (140X)
Latency           48.3 ms              5.7 ms (8X)
Cache             none                 8 MB

12 Typical Disk Drive Statistics (2008)
Sector size: 512 bytes.
Seek time: average, and track-to-track, in ms.
Average rotational delay: 3 to 5 ms (rotational speeds of 10,000 RPM down to 5,400 RPM).
Transfer time: sustained data rate, in msec per 8K page, or MB/second.
Density: 12-18 GB/in².

13 Disk Capacity Capacity: maximum number of bits that can be stored.
Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes. Capacity is determined by:
Recording density (bits/in): the number of bits that can be squeezed into a 1-inch segment of a track.
Track density (tracks/in): the number of tracks that can be squeezed into a 1-inch radial segment.
Areal density (bits/in²): the product of recording density and track density.
Modern disks partition tracks into disjoint subsets called recording zones. Each track in a zone has the same number of sectors, determined by the circumference of the zone's innermost track; each zone has a different number of sectors/track.
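To make the capacity formula concrete, here is a small Python sketch; the drive geometry below is an illustrative assumption, not a figure from this chapter.

```python
# Capacity = bytes/sector x avg sectors/track x tracks/surface
#            x surfaces/platter x platters   (hypothetical geometry)
def disk_capacity(bytes_per_sector, sectors_per_track,
                  tracks_per_surface, surfaces_per_platter, platters):
    return (bytes_per_sector * sectors_per_track *
            tracks_per_surface * surfaces_per_platter * platters)

# 512 B sectors, 400 sectors/track on average, 100,000 tracks/surface,
# 2 surfaces/platter, 4 platters:
print(disk_capacity(512, 400, 100_000, 2, 4) / 1e9, "GB")   # 163.84 GB
```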

14 Cost of Accessing Data on Disk
Time to access (read/write) a disk block: Taccess = Tavg seek + Tavg rotation + Tavg transfer.
Seek time: moving the arm to position the disk head on the track.
Rotational delay: waiting for the block to rotate under the head; half a rotation, on average.
Transfer time: actually moving data to/from the disk surface.
Key to lower I/O cost: reduce seek/rotation delays! There is no way to avoid transfer time.
The textbook measures query cost by the NUMBER of page I/Os, which implies that all I/Os have the same cost and that CPU time is free. This is a common simplification. A real DBMS (in its optimizer) would distinguish sequential from random disk reads, because sequential reads are much faster, and would count CPU time.

15 Disk Parameters Practice
A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB. How many cylinders are required to store an 8-gigabyte file? What is the average rotational delay, in milliseconds?
2^33 = 2^3 x 2^30 bytes in the file; 2^18 = 2^8 x 2^10 bytes per track; so the file occupies 2^33 / 2^18 = 2^15 = 32K tracks. With 2 platters x 2 surfaces = 4 tracks per cylinder, that is 2^13 = 8K cylinders.
1/7200 minutes/rotation x 60 seconds/minute = 1/120 seconds/rotation. Average rotational delay is half a rotation, or 1/240 seconds ≈ 4.2 msecs.
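The same arithmetic, checked in Python (no assumptions beyond the numbers given in the problem):

```python
file_bytes  = 8 * 2**30      # 8 GB file = 2^33 bytes
track_bytes = 256 * 2**10    # 256 KB per track = 2^18 bytes
tracks_per_cylinder = 4      # 2 platters x 2 surfaces

tracks    = file_bytes // track_bytes       # 2^15 = 32,768 tracks
cylinders = tracks // tracks_per_cylinder   # 2^13 = 8,192 cylinders

rpm = 7200
avg_rot_delay_ms = 0.5 * (60 / rpm) * 1000  # half a rotation, in ms
print(cylinders, round(avg_rot_delay_ms, 1))  # 8192 4.2
```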

16 Disk Access Time Example
Given: rotational rate = 7,200 RPM; average seek time = 9 ms; average sectors/track = 400.
Derived:
Tavg rotation = 1/2 x (60 secs / 7,200 RPM) x 1000 ms/sec ≈ 4 ms.
Tavg transfer = (60 secs / 7,200 RPM) x (1/400) x 1000 ms/sec ≈ 0.02 ms.
Taccess = 9 ms + 4 ms + 0.02 ms ≈ 13 ms.
Important points: access time is dominated by seek time and rotational latency; the first bit in a sector is the most expensive, the rest are free. SRAM access time is about 4 ns/doubleword, DRAM about 60 ns. Disk is about 40,000 times slower than SRAM and 2,500 times slower than DRAM.

17 So, How far away is the data?

18 Block, page and record sizes
Block: according to the text, the smallest unit of I/O. Page: often used in place of block. “Typical” record size: commonly hundreds, sometimes thousands, of bytes, unlike the toy records in textbooks. “Typical” page size: 4K or 8K.

19 Effect of page size on read time
Suppose the rotational delay is 4 ms, the average seek time is 6 ms, and the transfer speed is 0.5 msec per 8K. [Graph: time required to read 1 GB of data, for different page sizes]

20 Why the difference? What accounts for the difference, in times to read one gigabyte, on the previous graph? Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec per 8K.
Transfer time: (2^30 / 2^13 8K blocks) x (0.5 msec / 8K) ≈ 66 secs, about one minute, regardless of page size.
How many reads? With a page size of 8K there are 2^30 / 2^13 = 2^17 = 128K reads; with a page size of 64K there are 1/8th as many, 16K reads.
Time taken by rotational delays and seeks: each read requires a rotational delay and a seek, totalling 10 msec.
8K: (128K reads) x (10 msec/read) ≈ 1,311 secs, about 22 minutes. 64K: 1/8 of that, about 164 secs, or 3 minutes.
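A short Python sketch that recomputes this comparison under the stated assumptions (10 ms of seek plus rotational delay per read; 0.5 ms transfer per 8K):

```python
def read_time_secs(total_bytes, page_bytes, overhead_ms=10.0, xfer_ms_per_8k=0.5):
    reads    = total_bytes // page_bytes              # one seek + rotation each
    transfer = (total_bytes // 8192) * xfer_ms_per_8k # same for any page size
    return (reads * overhead_ms + transfer) / 1000

GB = 2**30
for page_kb in (8, 64):
    print(f"{page_kb}K pages: {read_time_secs(GB, page_kb * 1024):.0f} s")
# 8K pages: 1376 s (~23 min);  64K pages: 229 s (~4 min)
```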

21 Moral of the Story As page size increases, read (and write) time falls toward pure transfer time, a big savings. So why not use a huge page size? It wastes memory space, and read time, if you don't need everything that is read. What applications could use a large page size? Those that access data sequentially. The problem with a small page size is that pages get scattered across the disk. Turn the page… Page size is set by the OS because of the virtual memory system's importance; most server-class OSes support larger pages, up to megabytes in size.

22 Faster I/O, even with a small page size
Even if the page size is small, you can achieve fast I/O by storing a file's data as follows: consecutive pages on the same track, followed by consecutive tracks on the same cylinder, followed by consecutive cylinders adjacent to each other. The first two incur no seek time or rotational delay; the seek for the third is only one track. What is saved with this storage pattern? How is it obtained? By the disk defragmenter and its relatives/predecessors, which also place frequently used files near the spindle. When data is in this storage pattern, the application can do sequential I/O; otherwise it must do random I/O.

23 More Hardware Issues: Disk Controllers
The interface from the disks to the bus: checksums, remapping bad sectors, driver management, etc.
Interface protocols and transfer rates in MB per second: IDE/EIDE/ATA/PATA and SATA (133); SCSI (640). BUT for a single device, SCSI is inferior. Faster network technologies such as Fibre Channel.
Storage Area Networks (SANs): a disk farm networked to servers. Servers can be heterogeneous (a primary advantage), and management is centralized.

24 Dependability Module reliability = a measure of continuous service accomplishment (or of time to failure). Two metrics:
Mean Time To Failure (MTTF) measures reliability. Failures In Time (FIT) = 1/MTTF, the rate of failures, traditionally reported as failures per billion hours of operation.
Mean Time To Repair (MTTR) measures service interruption. Mean Time Between Failures (MTBF) = MTTF + MTTR.
Module availability measures service as alternating between the two states of accomplishment and interruption; it is a number between 0 and 1, e.g. 0.9: Module availability = MTTF / (MTTF + MTTR).

25 Example calculating reliability
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules. Example: calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF).

26 Example calculating reliability
Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):
Failure rate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000 = (10 + 2 + 5)/1,000,000 = 17/1,000,000 failures per hour = 17,000 FIT.
MTTF = 1/failure rate = 1,000,000/17 ≈ 59,000 hours (about 6.7 years).
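A quick check of the arithmetic in Python (rates add because lifetimes are assumed to be exponential):

```python
mttf_hours = {"disk": 1_000_000, "controller": 500_000, "power_supply": 200_000}

failure_rate = (10 / mttf_hours["disk"]            # 10 disks
                + 1 / mttf_hours["controller"]
                + 1 / mttf_hours["power_supply"])  # failures per hour

print(round(failure_rate * 1e9))   # 17000 FIT (failures per billion hours)
print(round(1 / failure_rate))     # 58824 hours MTTF, about 6.7 years
```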

27 9.2 RAID [587] Disk array: an arrangement of several disks that gives the abstraction of a single, large disk. Goals: increase performance and reliability. Two main techniques:
Data striping: data is partitioned; the size of a partition is called the striping unit. Partitions are distributed over several disks.
Redundancy: more disks => more failures. Redundant information allows reconstruction of data if a disk fails.

28 Data Striping CPUs go fast, disks don’t. How can disks keep up?
CPUs do work in parallel. Can disks? Answer: partition the data across D disks (see the next slide). If the partition unit is a page: a single-page I/O request is no faster, but multiple I/O requests can run at their aggregated bandwidth. The number of pages in a partition unit is called the depth of the partition. Contrary to the text, partition units of a bit are almost never used, and partition units of a byte are rare.

29 Data Striping (RAID Level 0)
[Diagram] Blocks are laid out round-robin across the array: Disk 0 holds blocks 0, D, 2D, …; Disk 1 holds 1, D+1, 2D+1, …; Disk 2 holds 2, D+2, 2D+2, …; Disk D-1 holds D-1, 2D-1, 3D-1, …

30 Redundancy Striping is seductive, but remember reliability!
The MTTF of a disk is about 6 years. If we stripe over 24 disks, what is the MTTF of the array? (Failure rates add, so roughly 6 years / 24 ≈ 3 months.) Solution: redundancy.
Parity corrects single failures, but needs to be told where the failure is; that location is provided by the controller. Other codes can detect where the failure is and correct multiple failures, but such redundancy may require more than one check bit. Redundancy makes writes slower (why?).

31 RAID Levels Standardized by SNIA (www.snia.org); they vary in practice.
For each level, decide (assume a single user):
The number of disks required to hold D disks' worth of data.
The speedup s (compared to 1 disk) for S/R (sequential/random) R/W (reads/writes). Random: each I/O is one block. Sequential: each I/O is one stripe.
The number of disks/blocks that can fail without data loss.
Level 0: block striped, no redundancy. The picture is 2 slides back.

32 JBOD, RAID Level 1
JBOD: Just a Bunch of Disks. [Diagram] Disks 0 … D-1, each holding its own blocks; no striping.
Level 1: Mirrored (two identical JBODs; no striping).

33 RAID Level 0+1: Stripe + Mirror
[Diagram] Disks 0 … D-1 hold the striped layout (Disk 0: blocks 0, D, 2D, …; Disk 1: 1, D+1, 2D+1, …; Disk D-1: D-1, 2D-1, 3D-1, …); Disks D … 2D-1 hold an identical striped copy.

34 RAID Level 4: Block-Interleaved Parity (not common)
One check disk, holding bitwise parity. How to tell if there is a failure, or which disk failed? Writes require read-modify-write. Disk D, the parity disk, is a bottleneck.
[Diagram] Disks 0 … D-1 hold the striped data blocks (Disk 0: 0, D, 2D, …; Disk D-1: D-1, 2D-1, 3D-1, …); Disk D holds a parity block P for each stripe.

35 RAID Level 5: Block-Interleaved Distributed Parity
[Diagram] Parity blocks are distributed in rotation across all D+1 disks, so no single disk is a parity bottleneck: each stripe holds D data blocks plus one parity block P, with P's position shifting from disk to disk.
Level 6: like 5, but with 2 parity blocks per stripe; it can survive the loss of 2 disks/blocks.

36 Notation on the next slide
#Disks: the number of disks required to hold D disks' worth of data using this RAID level.
Read/write speedup of blocks in a single file: SR = sequential read, RR = random read, SW = sequential write, RW = random write.
Failure tolerance: how many disks can fail without loss of data.
s = the number of blocks transferred in the time it takes to transfer one block of data from one disk.
These numbers are theoretical! YMMV… and vary significantly!

37 RAID Performance

Level   #Disks   SR       RR        SW       RW        Failure Tolerance
0       D        s=D      1≤s≤D     s=D      1≤s≤D     0
1       2D       s=2      s=2       s=1**    s=1**     D*
0+1     2D       s=2D     2≤s≤2D    s=D**    1≤s≤D**   D*
5       D+1      varies   varies    varies   varies    1

* If no two failed disks are copies of each other.
** Note: can't write both mirrors at once (why?).

38 Small Writes on Levels 4 and 5
Levels 4 and 5 require a read-modify-write cycle for every write, since the parity block must be read and modified. On small writes this can be very expensive. This is another justification for log-based file systems (see your OS course).
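A minimal sketch of the parity update behind that cycle (the function name is an illustration, not RAID firmware): new parity = old parity XOR old data XOR new data, so one small logical write costs two reads plus two writes.

```python
def new_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    # Recompute parity from the one changed block, without touching
    # the other D-1 data blocks in the stripe.
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# The read-modify-write cycle for a small write:
#   1. read old data block      2. read old parity block
#   3. write new data block     4. write new_parity(...)
```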

39 Which RAID Level is best?
If data loss is not a problem: Level 0. If storage cost is not a problem: Level 0+1. Else: Level 5.
Software support: Linux: 0, 1, 4, 5. Windows: 0, 1, 5.

40 9.3, 9.4.1: Covered earlier

41 9.4.2 DBMS vs. OS File System The OS does disk space and buffer management: why not let the OS manage these tasks? [715]
Differences in OS support raise portability issues, and there are limitations, e.g., files can't span disks.
Buffer management in a DBMS requires the ability to: pin a page in the buffer pool; force a page to disk (important for implementing concurrency control and recovery); adjust the replacement policy; and pre-fetch pages based on the access patterns of typical DB operations (see the sketch below).
Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit in the buffer pool.
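Here is a minimal sketch of that pin/force interface in Python; the class and method names are illustrative, not any real DBMS's API, and a dict stands in for the disk.

```python
class BufferPool:
    def __init__(self, num_frames, disk):
        self.disk = disk        # page_id -> bytes, standing in for the disk
        self.frames = {}        # page_id -> bytearray (cached copy)
        self.pins = {}          # page_id -> pin count
        self.capacity = num_frames

    def pin(self, pid):
        """Fetch the page (reading from disk if needed) and forbid eviction."""
        if pid not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict_one()
            self.frames[pid] = bytearray(self.disk[pid])
        self.pins[pid] = self.pins.get(pid, 0) + 1
        return self.frames[pid]

    def unpin(self, pid):
        self.pins[pid] -= 1

    def force(self, pid):
        """Write the page to disk NOW; needed for recovery, unlike lazy write-back."""
        self.disk[pid] = bytes(self.frames[pid])

    def _evict_one(self):
        # The replacement policy (LRU, MRU, ...) plugs in here; only
        # unpinned pages are candidates.
        victim = next(p for p, c in self.pins.items() if c == 0)
        self.disk[victim] = bytes(self.frames.pop(victim))   # write back
        del self.pins[victim]
```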

42 9.5 Files of Records A page or block is fine for doing I/O, but higher levels of a DBMS operate on records and files of records.
FILE: a collection of pages, each containing a collection of records. It must support: insert/delete/modify a record; read a particular record (specified by record id); scan all records (possibly with some conditions on the records to be retrieved).

43 9.5.1 Unordered (Heap) Files
The simplest file structure contains records in no particular order. As the file grows and shrinks, disk pages are allocated and de-allocated. To support record-level operations, we must: keep track of the pages in a file; keep track of the free space on pages; keep track of the records on a page. There are at least two alternatives for keeping track of heap files.

44 Heap File Implemented as a List
[Diagram] The header page points to two linked lists of data pages: one of full pages and one of pages with free space. The header page id and the heap file name must be stored someplace. Each page contains 2 `pointers' plus data.

45 Heap File Using a Page Directory
[Diagram] The header page starts a DIRECTORY whose entries point to data pages 1 … N. The entry for a page can include the number of free bytes on the page. The directory is itself a collection of pages; a linked list implementation is just one alternative. It is much smaller than a linked list of all the heap file's pages!

46 Comparing Heap File Implementations
Assume 100 directory entries per page, U full pages, E pages with free space, and D directory pages. Then D = (U + E)/100; note that D is two orders of magnitude smaller than U or E.
Cost to find a page with enough free space: list: E/2; directory: D/2 + 1.
Cost to move a page from full to free (e.g., when a record is deleted): list: 3; directory: 1.
Can you think of some other operations?
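Plugging in assumed numbers (U = E = 10,000 pages, 100 entries per directory page) to show the gap:

```python
U = E = 10_000
D = (U + E) // 100    # 200 directory pages

print(E / 2)          # list: 5000 page reads expected to find free space
print(D / 2 + 1)      # directory: 101 page reads
```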

47 9.6 Page Formats: Fixed Length Records
[Diagram] Two layouts for fixed-length records:
PACKED: slots 1 … N stored contiguously, followed by free space; a footer field holds N, the number of records.
UNPACKED, BITMAP: slots 1 … M with holes allowed; a footer bitmap of M bits records which slots are occupied, and M is the number of slots.

48 Packed vs Unpacked Page Formats
Record ID (RID, also TID) = (page#, slot#), in all page formats. Note that indexes are filled with RIDs: data entries in alternatives 2 and 3 are (key, RID) or (key, RID-list).
Packed: stores more records, but RIDs change when a record is deleted, which may not be acceptable.
Unpacked: the RID does not change, and there is less data movement when deleting.

49 Page Formats: Variable Length Records
[Diagram] Page i with records at Rid = (i,1), (i,2), …, (i,N). A SLOT DIRECTORY at the end of the page holds one entry per record (the 20, 16, 24 in the picture), plus N (the number of slots) and a pointer to the start of free space.

50 Slotted Page Format The intergalactic standard, used for fixed-length records also. How to deal with free-space fragmentation? Pack records, lazily; note that RIDs don't change. How are updates handled which expand the size of a record? A forwarding flag points to the new location. See postgresql-8.3.1\src\include\storage\bufpage.h.
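A minimal slotted-page sketch in Python, loosely modeled on the idea behind bufpage.h; the layout details here are assumptions, and free-space checks, compaction, and forwarding are omitted.

```python
class SlottedPage:
    def __init__(self, size=8192):
        self.data = bytearray(size)
        self.slots = []          # slot# -> (offset, length); the slot directory
        self.free_end = size     # record data grows from the back of the page

    def insert(self, record: bytes) -> int:
        """Returns a slot number; RID = (page#, slot#) stays stable even if
        records are later packed, because only the stored offsets change."""
        off = self.free_end - len(record)
        self.data[off:self.free_end] = record
        self.free_end = off
        self.slots.append((off, len(record)))
        return len(self.slots) - 1

    def get(self, slot: int) -> bytes:
        off, length = self.slots[slot]
        return bytes(self.data[off:off + length])
```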

51 9.7 Record Formats: Fixed Length
[Diagram] Fields F1 F2 F3 F4 with lengths L1 L2 L3 L4, stored at base address B; the address of F3 is B + L1 + L2.
Information about field types is the same for all records in a file and is stored in the system catalogs. Finding the i'th field does not require a scan of the record.

52 Record Formats: Variable Length
Two alternative formats (the number of fields is fixed):
(1) Fields delimited by special symbols: a field count (4), then F1 $ F2 $ F3 $ F4 $.
(2) An array of field offsets, then F1 F2 F3 F4.
The second offers direct access to the i'th field, efficient storage of NULLs (the special “don't know” value), and only a small directory overhead.
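A sketch of the offset-array format in Python. The layout here (an array of field end-offsets, with a zero-length field meaning NULL) is one plausible reading of the slide, not the textbook's exact figure.

```python
def field(record: bytes, end_offsets: list, i: int):
    """Direct access to the i'th field: no scan of earlier fields needed."""
    start = end_offsets[i - 1] if i > 0 else 0
    end = end_offsets[i]
    return None if start == end else record[start:end]

rec  = b"alicecs1997"
ends = [5, 7, 7, 11]          # "alice", "cs", NULL, "1997"
print(field(rec, ends, 0))    # b'alice'
print(field(rec, ends, 2))    # None -- a NULL costs no data bytes
```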

