6/3/20141 PSUs CS 587 Chapter 9, Disks and Files The Storage Hierarchy Disks Mechanics Performance RAID Disk Space Management Buffer Management Files of Records Format of a Heap File Format of a Data Page Format of Records
6/3/20142 PSUs CS 587 Learning objectives Given disk parameters, compute storage needs and read times. Given a reminder about what each level means, be able to derive any figure on the RAID performance slide. Describe the pros and cons of alternative structures for files, pages and records.
6/3/20143 PSUs CS 587 A (Very) Simple Hardware Model [Figure: CPU chip (register file, ALU, bus interface); system bus; I/O bridge; memory bus; main memory; I/O bus with USB controller (mouse, keyboard), graphics adapter (monitor), disk controller (disk), and expansion slots for other devices such as network adapters.]
6/3/20144 PSUs CS 587 Storage Options
Registers: capacity 1k-2k bytes; access time 1 Tc; cost: way expensive
Caches: capacity 10s-1000s K Bytes; access time 2-20 Tc; cost $10 / MByte
Main Memory: capacity G Bytes; access time 300-1000 Tc; cost $0.03 / MB (eBay)
Hard Disk / Flash: capacity 100s G Bytes; access time 10 ms = 30M Tc; cost $0.10 / GB (eBay)
Tape: capacity infinite; access time forever; cost: way cheap
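6/3/20145 PSUs CS 587 Memory Hierarchy (upper levels are faster, lower levels are larger)
Registers: capacity 1k-2k bytes; access time 1 Tc; cost: way expensive
Cache (SRAM; may be multiple levels!): capacity 10s-1000s K Bytes; access time 2-20 Tc; cost $10 / MByte
Memory (DRAM): capacity G Bytes; access time 300-1000 Tc; cost $0.03 / MB (eBay)
Disk: capacity 100s G Bytes; access time 10 ms = 30M Tc; cost $0.10 / GB (eBay)
Tape: capacity infinite; access time forever; cost: way cheap
Staging between levels (unit, who manages it, transfer size):
Registers <-> Cache: instructions and operands; program/compiler; 1-8 bytes
Cache <-> Memory: blocks; cache controller; 8-128 bytes
Memory <-> Disk: pages; OS; 4K+ bytes
Disk <-> Tape: files; user/operator; Gbytes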
6/3/20146 PSUs CS 587 Why Does Hierarchy Work? Locality: Programs access a relatively small portion of the address space at any instant of time. Two different types: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse). Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
6/3/20147 PSUs CS 587 9.1 The Memory Hierarchy Typical storage hierarchy as used by a RDBMS: Primary storage: Main memory (RAM) for currently used data Secondary storage: Disk, Flash Memory for the main database http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf What are other reasons besides cost to use disk? Tertiary storage Tapes, DVDs for archiving older versions of the data Other factors Caches at every level Controllers, protocols Network connections
6/3/20148 PSUs CS 587 What is FLASH Memory, Anyway? Floating gate transistor. Presence of charge => 0. Erase electrically or by UV (EPROM). Performance: reads like DRAM (~ns), writes like DISK (~ms). Write is a complex operation.
6/3/20149 PSUs CS 587 Components of a Disk The platters are always spinning (say, 120 rps). One head reads/writes at any one time. To read a record: position arm (seek), engage head, wait for data to spin by, read (transfer data). [Figure labels: platters, spindle, disk head, arm movement, arm assembly, tracks, sector.]
6/3/201410 PSUs CS 587 More terminology Each track is made up of fixed-size sectors. Page size is a multiple of sector size. A platter typically has data on both surfaces. All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!). [Figure labels: platters, spindle, disk head, arm movement, arm assembly, tracks, sector.]
6/3/201411 PSUs CS 587 Disks Technology Background
CDC Wren I, 1983: 3600 RPM; 0.03 GBytes capacity; Tracks/Inch: 800; Bits/Inch: 9550; three 5.25-inch platters; Bandwidth: 0.6 MBytes/sec; Latency: 48.3 ms; Cache: none
Seagate 373453, 2003: 15000 RPM (4X); 73.4 GBytes (2500X); Tracks/Inch: 64000 (80X); Bits/Inch: 533,000 (60X); four 2.5-inch platters (in 3.5-inch form factor); Bandwidth: 86 MBytes/sec (140X); Latency: 5.7 ms (8X); Cache: 8 MBytes
6/3/201412 PSUs CS 587 Typical Disk Drive Statistics (2008) Sector size: 512 bytes. Seek time: average 4-10 ms, track to track 0.6-1.0 ms. Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM). Transfer time: sustained data rate 0.3-0.1 msec per 8K page, or 25-75 MB/second. Density: 12-18 GB/in^2.
6/3/201413 PSUs CS 587 Disk Capacity Capacity: maximum number of bits that can be stored. Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes. Capacity is determined by: Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track. Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment. Areal density (bits/in^2): product of recording and track density. Modern disks partition tracks into disjoint subsets called recording zones. Each track in a zone has the same number of sectors, determined by the circumference of the innermost track. Each zone has a different number of sectors/track.
6/3/201414 PSUs CS 587 Cost of Accessing Data on Disk Time to access (read/write) a disk block: Taccess = Tavg seek + Tavg rotation + Tavg transfer. Seek time: moving arms to position disk head on track. Rotational delay: waiting for block to rotate under head (half a rotation, on average). Transfer time: actually moving data to/from disk surface. Key to lower I/O cost: reduce seek/rotation delays! No way to avoid transfer time… Textbook measures query cost by NUMBER of page I/Os. This implies all I/Os have the same cost, and that CPU time is free. This is a common simplification. Real DBMSs (in the optimizer) would distinguish sequential from random disk reads, because sequential reads are much faster, and would also count CPU time.
6/3/201415 PSUs CS 587 Disk Parameters Practice A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB. How many cylinders are required to store an 8 Gigabyte file? What is the average rotational delay, in milliseconds?
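One way to work the practice problem above is sketched below. It assumes, beyond what the slide states, that both surfaces of each platter hold data (as the terminology slide says is typical) and that KB and Gigabyte are binary units; the constant names are illustrative only.

```python
# Assumptions (not stated on the slide): both surfaces of each platter hold
# data, and KB / GB are binary units (2**10 and 2**30 bytes).
PLATTERS = 2
SURFACES_PER_PLATTER = 2
TRACK_BYTES = 256 * 2**10
FILE_BYTES = 8 * 2**30
RPM = 7200

# A cylinder is one track on every surface.
cylinder_bytes = PLATTERS * SURFACES_PER_PLATTER * TRACK_BYTES
cylinders_needed = -(-FILE_BYTES // cylinder_bytes)  # ceiling division

# Average rotational delay is half of one full rotation.
rotation_ms = 60_000 / RPM
avg_rotational_delay_ms = rotation_ms / 2

print(cylinders_needed)
print(round(avg_rotational_delay_ms, 2))
```

With decimal units or single-sided platters the cylinder count changes, so the assumptions matter more than the arithmetic.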
6/3/201416 PSUs CS 587 Disk Access Time Example Given: Rotational rate = 7,200 RPM. Average seek time = 9 ms. Avg # sectors/track = 400. Derived: Tavg rotation = 1/2 x (60 secs / 7200 RPM) x 1000 ms/sec ≈ 4 ms. Tavg transfer = (60 secs / 7200 RPM) x (1 / 400 sectors/track) x 1000 ms/sec ≈ 0.02 ms. Taccess = 9 ms + 4 ms + 0.02 ms ≈ 13 ms. Important points: Access time is dominated by seek time and rotational latency. The first bit in a sector is the most expensive, the rest are free. SRAM access time is about 4 ns/doubleword, DRAM about 60 ns. For a 512-byte sector, disk is about 40,000 times slower than SRAM and 2,500 times slower than DRAM.
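The derivation above can be checked with a short sketch. The function name and argument names are illustrative, not part of the course material; the unrounded result is slightly above the slide's rounded figures.

```python
def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
    """Return (Taccess, Tavg rotation, Tavg transfer) in milliseconds."""
    rotation_ms = 60_000 / rpm                    # time for one full rotation
    t_rotation = rotation_ms / 2                  # half a rotation on average
    t_transfer = rotation_ms / sectors_per_track  # one sector passes under the head
    return avg_seek_ms + t_rotation + t_transfer, t_rotation, t_transfer

total, t_rot, t_xfer = access_time_ms(rpm=7200, avg_seek_ms=9, sectors_per_track=400)
print(round(t_rot, 2), round(t_xfer, 3), round(total, 2))
```

Seek and rotation together account for almost all of the total, which is the point the slide makes.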
6/3/201417 PSUs CS 587 So, How far away is the data? From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc
6/3/201418 PSUs CS 587 Block, page and record sizes Block – According to text, smallest unit of I/O. Page – often used in place of block. typical record size: commonly hundreds, sometimes thousands of bytes Unlike the toy records in textbooks typical page size 4K, 8K
6/3/201419 PSUs CS 587 Effect of page size on read time Suppose rotational delay is 4 ms, average seek time is 6 ms, and transfer speed is 0.5 msec/8K. This graph shows the time required to read 1 Gigabyte of data for different page sizes.
6/3/201420 PSUs CS 587 Why the difference? What accounts for the difference, in times to read one Gigabyte, on the previous graph? Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K. Transfer time: (2^30/2^13 8K blocks) x (0.5 msec/8K) = 66 secs ~= one minute. How many reads? Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads. Page size 64K: there are 1/8th that many reads = 16K reads. Time taken by rotational delays and seeks: each read requires a rotational delay and a seek, totalling 10 msec. 8K: (128K reads) x (10 msec/read) = 1,311 secs ~= 22 minutes. 64K: 1/8 of that, or 164 secs ~= 3 minutes.
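The arithmetic on this slide can be reproduced with a small sketch. It assumes every page read pays one seek and one rotational delay, as the slide does; the function and parameter names are illustrative.

```python
def read_time_secs(total_bytes, page_bytes,
                   seek_ms=6.0, rotation_ms=4.0, transfer_ms_per_8k=0.5):
    """Total time to read total_bytes when every page read pays seek + rotation."""
    reads = total_bytes // page_bytes
    positioning_ms = reads * (seek_ms + rotation_ms)
    transfer_ms = (total_bytes / 8192) * transfer_ms_per_8k  # independent of page size
    return (positioning_ms + transfer_ms) / 1000

GIG = 2**30
t_8k = read_time_secs(GIG, 8 * 1024)
t_64k = read_time_secs(GIG, 64 * 1024)
print(round(t_8k), round(t_64k))
```

The transfer term is the same for both page sizes; only the positioning term shrinks as pages grow.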
6/3/201421 PSUs CS 587 Moral of the Story As page size increases, read (and write) time reduces to transfer time, a big savings. So why not use a huge page size? It wastes memory space if you don't need all that is read. It wastes read time if you don't need all that is read. What applications could use a large page size? Those that sequentially access data. The problem with a small page size is that pages get scattered across the disk. Turn the page….
6/3/201422 PSUs CS 587 Faster I/O, even with a small page size Even if the page size is small, you can achieve fast I/O by storing a file's data as follows: consecutive pages on the same track, followed by consecutive tracks on the same cylinder, followed by consecutive cylinders adjacent to each other. The first two incur no seek time or rotational delay; the seek for the third is only one track. What is saved with this storage pattern? How is this storage pattern obtained? Disk defragmenter and its relatives/predecessors. It also places frequently used files near the spindle. When data is in this storage pattern, the application can do sequential I/O. Otherwise it must do random I/O.
6/3/201423 PSUs CS 587 More Hardware Issues Disk Controllers: interface from disks to bus; checksums, remap bad sectors, driver mgt, etc. Interface protocols and MB per second transfer rates: IDE/EIDE/ATA/PATA, SATA: 133; SCSI: 640, BUT for a single device, SCSI is inferior. Faster network technologies such as Fibre Channel. Storage Area Networks (SANs): disk farm networked to servers. Servers can be heterogeneous, a primary advantage. Centralized management. 9. Disks
6/3/201424 PSUs CS 587 Dependability Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics: 1. Mean Time To Failure (MTTF) measures reliability. 2. Failures In Time (FIT) = 1/MTTF, the rate of failures, traditionally reported as failures per billion hours of operation. Mean Time To Repair (MTTR) measures service interruption. Mean Time Between Failures (MTBF) = MTTF + MTTR. Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9). Module availability = MTTF / (MTTF + MTTR).
6/3/201425 PSUs CS 587 Example calculating reliability If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules Example: Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk) 1 disk controller (0.5M hour MTTF) and 1 power supply (0.2M hour MTTF)
6/3/201426 PSUs CS 587 Example calculating reliability Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk) 1 disk controller (0.5M hour MTTF) and 1 power supply (0.2M hour MTTF):
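A sketch of the calculation requested above, assuming independent components with exponentially distributed lifetimes so that failure rates add, as the previous slide states. Variable names are illustrative.

```python
# (count, MTTF in hours) for each kind of component in the example.
components = [
    (10, 1_000_000),  # disks
    (1, 500_000),     # disk controller
    (1, 200_000),     # power supply
]

# With exponential lifetimes, the system failure rate is the sum of the rates.
failure_rate_per_hour = sum(count / mttf for count, mttf in components)

fit = failure_rate_per_hour * 1e9          # failures per billion hours
system_mttf_hours = 1 / failure_rate_per_hour

print(round(fit), round(system_mttf_hours))
```

The rates are 10 + 2 + 5 = 17 failures per million hours, so the system MTTF is a little under 59,000 hours, far below that of any single component.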
6/3/201427 PSUs CS 587 9.2 RAID Disk Array: Arrangement of several disks that gives the abstraction of a single, large disk. Goals: Increase performance and reliability. Two main techniques: Data striping: Data is partitioned; the size of a partition is called the striping unit. Partitions are distributed over several disks. Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails.
6/3/201428 PSUs CS 587 Data Striping CPUs go fast, disks don't. How can disks keep up? CPUs do work in parallel. Can disks? Answer: Partition data across D disks (see next slide). If the partition unit is a page: a single page I/O request is no faster, but multiple I/O requests can run at aggregated bandwidth. The number of pages in a partition unit is called the depth of the partition. Contrary to the text, partition units of a bit are almost never used and partition units of a byte are rare.
6/3/201429 PSUs CS 587 Data Striping (RAID Level 0)
Disk 0: blocks 0, D, 2D, …
Disk 1: blocks 1, D+1, 2D+1, …
Disk 2: blocks 2, D+2, 2D+2, …
...
Disk D-1: blocks D-1, 2D-1, 3D-1, …
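The layout above amounts to a simple mapping from a logical block number to a disk and a position on that disk. A minimal sketch, assuming a striping unit of one block; the function name is illustrative.

```python
def locate(block, num_disks):
    """Map a logical block number to (disk, offset on that disk) under RAID 0."""
    return block % num_disks, block // num_disks

D = 4
for b in (0, 1, D, D + 1, 2 * D + 3):
    print(b, locate(b, D))
```

Consecutive blocks land on different disks, which is why a sequential read of a whole stripe can use all D disks at once.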
6/3/201430 PSUs CS 587 Redundancy Striping is seductive, but remember reliability! MTTF of a disk is about 6 years If we stripe over 24 disks, what is MTTF? Solution: redundancy –Parity: corrects single failures –Others: detect where the failure is, and corrects multiple failures –But failure location is provided by controller –Redundancy may require more than one check bit Redundancy makes writes slower – why?
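A rough answer to the MTTF question above, assuming independent disks with exponentially distributed lifetimes (the same assumption used on the reliability slide) and no redundancy, so that the first disk failure loses data.

```python
DISK_MTTF_YEARS = 6
NUM_DISKS = 24

# Failure rates add, so the array fails NUM_DISKS times as often as one disk.
array_mttf_years = DISK_MTTF_YEARS / NUM_DISKS
array_mttf_months = array_mttf_years * 12

print(array_mttf_years, array_mttf_months)
```

Under these assumptions the striped array is expected to lose data within months rather than years, which motivates the redundancy schemes that follow.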
6/3/201431 PSUs CS 587 RAID Levels Standardized by SNIA (www.snia.org). Vary in practice. For each level, decide (assume single user): Number of disks required to hold D disks of data. Speedup s (compared to 1 disk) for S/R (Sequential/Random) R/W (Reads/Writes). Random: each I/O is one block. Sequential: each I/O is one stripe. Number of disks/blocks that can fail w/o data loss. Level 0: Block Striped, No redundancy. Picture is 2 slides back.
6/3/201432 PSUs CS 587 JBOD, RAID Level 1 JBOD: Just a Bunch of Disks. Level 1: Mirrored (two identical JBODs, no striping). [Figure: Disk 0, Disk 1, Disk 2, ..., Disk D-1, each numbering its own blocks 0, 1, 2, 3, ….]
6/3/201433 PSUs CS 587 RAID Level 0+1: Stripe + Mirror
Disk 0 and its mirror Disk D: blocks 0, D, 2D, …
Disk 1 and its mirror Disk D+1: blocks 1, D+1, 2D+1, …
Disk 2 and its mirror Disk D+2: blocks 2, D+2, 2D+2, …
...
Disk D-1 and its mirror Disk 2D-1: blocks D-1, 2D-1, 3D-1, …
6/3/201434 PSUs CS 587 RAID Level 4 Block-Interleaved Parity (not common). One check disk, uses one bit of parity. How to tell if there is a failure, or which disk failed? Read-modify-write. Disk D is a bottleneck.
Disk 0: blocks 0, D, 2D, …
Disk 1: blocks 1, D+1, 2D+1, …
Disk 2: blocks 2, D+2, 2D+2, …
...
Disk D-1: blocks D-1, 2D-1, 3D-1, …
Disk D: P, P, P, P, …
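How a single parity block allows a lost block to be rebuilt can be sketched with bytewise XOR. This is a toy illustration with made-up two-byte blocks, assuming (as the Redundancy slide notes) that the controller reports which disk failed.

```python
from functools import reduce

def parity(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks (hypothetical contents) and their parity block.
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
p = parity(data)

# Suppose the disk holding data[1] fails: XOR the surviving blocks with parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])
```

XOR-ing the parity with the surviving blocks cancels them out and leaves the missing block, so one parity disk covers the loss of any single disk in the stripe.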
6/3/201435 PSUs CS 587 RAID Level 5 Level 5: Block-Interleaved Distributed Parity
Disk 0: 0, D, 2D, …
Disk 1: 1, D+1, 2D+1, …
...
Disk D-2: D-2, 2D-2, P, …
Disk D-1: D-1, P, 3D-2, …
Disk D: P, 2D-1, 3D-1, …
Level 6: Like 5, but 2 parity bits/disks. Can survive loss of 2 disks/blocks.
6/3/201436 PSUs CS 587 Notation on the next slide #Disks: number of disks required to hold D disks' worth of data using this RAID level. Read/write speedup of blocks in a single file: SR: sequential read; RR: random read; SW: sequential write; RW: random write. Failure tolerance: how many disks can fail without loss of data. Internal data: s = blocks transferred in the time it takes to transfer one block of data from one disk. These numbers are theoretical! YMMV…and vary significantly!
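6/3/201437 PSUs CS 587 RAID Performance
Level 0: #Disks D; SR speedup s=D; RR speedup 1 ≤ s ≤ D; SW speedup s=D; RW speedup 1 ≤ s ≤ D; failure tolerance 0
Level 1: #Disks 2D; read speedup s=2; write speedup s=1**; failure tolerance D*
Level 0+1: #Disks 2D; SR speedup s=2D; RR speedup 2 ≤ s ≤ 2D; SW speedup s=D**; RW speedup 1 ≤ s ≤ D**; failure tolerance D*
Level 5: #Disks D+1; SR speedup s=D; RR speedup 1 ≤ s ≤ D; SW speedup s=D; RW speedup varies; failure tolerance 1
* If no two are copies of each other
** Note: can't write both mirrors at once. Why?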
6/3/201438 PSUs CS 587 Small Writes on Levels 4 and 5 Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified. On small writes this can be very expensive This is another justification for Log Based File Systems (see your OS course)
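The read-modify-write cycle described above can be sketched as follows: the new parity is computed from the old data block, the old parity block, and the new data block, without touching the rest of the stripe. Block contents and helper names are made up for illustration.

```python
def xor(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data, old_parity, new_data):
    """New parity after overwriting one data block (2 reads, then 2 writes)."""
    return xor(xor(old_parity, old_data), new_data)

stripe = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
old_parity = xor(xor(stripe[0], stripe[1]), stripe[2])

new_block = b"\xff\x00"
new_parity = small_write_parity(stripe[1], old_parity, new_block)

# Same result as recomputing parity over the whole updated stripe.
full_recompute = xor(xor(stripe[0], new_block), stripe[2])
print(new_parity == full_recompute)
```

A one-block logical write therefore costs four physical I/Os (read old data, read old parity, write new data, write new parity), which is why small writes are expensive on these levels.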
6/3/201439 PSUs CS 587 Which RAID Level is best? If data loss is not a problem: Level 0. If storage cost is not a problem: Level 0+1. Else: Level 5. Software Support: Linux: 0,1,4,5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html). Windows: 0,1,5 (http://www.techimo.com/articles/index.pl?photo=149).
6/3/201441 PSUs CS 587 9.4.2 DBMS vs. OS File System OS does disk space & buffer mgmt: why not let the OS manage these tasks? Differences in OS support: portability issues. Some limitations, e.g., files can't span disks. Buffer management in a DBMS requires the ability to: pin a page in the buffer pool, force a page to disk (important for implementing CC & recovery), adjust the replacement policy, and pre-fetch pages based on access patterns in typical DB operations. Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit.
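The claim that MRU can beat LRU for a loop that does not fit in the buffer pool can be illustrated with a toy simulation. This is a sketch only: it ignores pinning and dirty pages, and the function name and workload are made up.

```python
def count_misses(num_frames, references, policy):
    """Simulate a buffer pool of num_frames pages; policy is "LRU" or "MRU"."""
    frames = []  # ordered from least recently used to most recently used
    misses = 0
    for page in references:
        if page in frames:
            frames.remove(page)      # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(frames) == num_frames:
                # Evict the oldest page for LRU, the newest page for MRU.
                frames.pop(0 if policy == "LRU" else -1)
        frames.append(page)
    return misses

loop = list(range(5)) * 4            # scan pages 0..4 four times
lru = count_misses(4, loop, "LRU")
mru = count_misses(4, loop, "MRU")
print(lru, mru)
```

With 4 frames and a 5-page loop, LRU always evicts the page that is needed next, so every reference misses, while MRU keeps most of the loop resident.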
6/3/201442 PSUs CS 587 9.5 Files of Records Page or block is OK when doing I/O, but higher levels of the DBMS operate on records, and files of records. FILE: A collection of pages, each containing a collection of records. Must support: insert/delete/modify record; read a particular record (specified using record id); scan all records (possibly with some conditions on the records to be retrieved).
6/3/201443 PSUs CS 587 9.5.1 Unordered (Heap) Files Simplest file structure; contains records in no particular order. As the file grows and shrinks, disk pages are allocated and de-allocated. To support record-level operations, we must: keep track of the pages in a file; keep track of free space on pages; keep track of the records on a page. There are at least two alternatives for keeping track of heap files.
6/3/201444 PSUs CS 587 Heap File Implemented as a List The header page id and heap file name must be stored someplace. Each page contains 2 'pointers' plus data. [Figure: a header page linked to two lists of data pages, one of pages with free space and one of full pages.]
6/3/201445 PSUs CS 587 Heap File Using a Page Directory The entry for a page can include the number of free bytes on the page. The directory is a collection of pages; a linked list implementation is just one alternative. Much smaller than a linked list of all HF pages! [Figure: a DIRECTORY, starting at the header page, with entries pointing to Data Page 1, Data Page 2, ..., Data Page N.]
6/3/201446 PSUs CS 587 Comparing Heap File Implementations Assume 100 directory entries per page. U full pages, E pages with free space D directory pages Then D = (U+E) /100 Note that D is two orders of magnitude less than U or E Cost to find a page with enough free space List: E/2 Directory: (D/2) + 1 Cost to Move a page from Full to Free (e.g., when a record is deleted) List: 3, Directory: 1 Can you think of some other operations?
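The cost formulas on this slide can be tried on sample numbers. In this sketch the page counts are made up, costs are in page I/Os as on the slide, and the directory size is rounded up to whole pages.

```python
import math

def heap_file_costs(full_pages, free_pages, entries_per_dir_page=100):
    """Average I/O cost to find a page with enough free space, per the slide."""
    dir_pages = math.ceil((full_pages + free_pages) / entries_per_dir_page)
    return {
        "directory_pages": dir_pages,
        "find_free_list": free_pages / 2,           # walk half the free list
        "find_free_directory": dir_pages / 2 + 1,   # half the directory, then the page
    }

costs = heap_file_costs(full_pages=9000, free_pages=1000)
print(costs)
```

For this hypothetical file the directory needs roughly a tenth of the I/Os of the list, and the gap widens as the number of pages with free space grows.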
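6/3/201447 PSUs CS 587 9.6 Page Formats: Fixed Length Records [Figure: two page layouts. PACKED: Slot 1, Slot 2, ..., Slot N, followed by free space, with the number of records N stored on the page. UNPACKED, BITMAP: Slot 1, Slot 2, ..., Slot N, ..., Slot M, with a bitmap of M bits (numbered M ... 3 2 1, one per slot) and the number of slots M stored on the page.]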
6/3/201448 PSUs CS 587 Packed vs Unpacked Page Formats Record ID (RID, TID) = (page#, slot#), in all page formats. Note that indexes are filled with RIDs: data entries in alternatives 2 and 3 are (key, RID..). Packed: stores more records, but RIDs change when a record is deleted. This may not be acceptable. Unpacked: RID does not change, and there is less data movement when deleting.
6/3/201449 PSUs CS 587 Page Formats: Variable Length Records [Figure: Page i holds records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N). The SLOT DIRECTORY has entries N ... 2 1 (shown with values 20, 16, 24), the number of slots N, and a pointer to the start of free space.]
6/3/201450 PSUs CS 587 Slotted Page Format Intergalactic Standard, for fixed length records also. How to deal with free space fragmentation? Pack records lazily. Note that RIDs don't change. How are updates handled which expand the size of a record? Forwarding flag to new location. http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html postgresql-8.3.1\src\include\storage\bufpage.h
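A toy sketch of the slotted page idea follows. It is not PostgreSQL's layout: the slot directory here is a plain Python list rather than bytes at the end of the page, its space is not charged against the page, and all names are hypothetical. The point is only that deleting and compacting move record bytes without changing slot numbers, so RIDs stay valid.

```python
class SlottedPage:
    """Toy slotted page: record bytes in a buffer, slot directory kept separately."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.free_start = 0          # pointer to start of free space
        self.slots = []              # slot number -> (offset, length) or None

    def insert(self, record):
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full (try compact() first)")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1   # slot number, the second half of the RID

    def read(self, slot):
        entry = self.slots[slot]
        if entry is None:
            raise KeyError("record was deleted")
        offset, length = entry
        return bytes(self.data[offset:offset + length])

    def delete(self, slot):
        self.slots[slot] = None      # leave a hole; other slot numbers are untouched

    def compact(self):
        """Pack live records toward the front; slot numbers do not change."""
        packed, pos = bytearray(len(self.data)), 0
        for i, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            packed[pos:pos + length] = self.data[offset:offset + length]
            self.slots[i] = (pos, length)
            pos += length
        self.data, self.free_start = packed, pos


page = SlottedPage(size=64)
a, b, c = page.insert(b"alice"), page.insert(b"bob"), page.insert(b"carol")
page.delete(b)
page.compact()                       # "lazy" packing, done only when wanted
print(page.read(c), page.free_start)
```

After compaction the record in slot 2 has moved, but it is still reached through slot 2, which is why indexes that store RIDs do not need to be updated.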
6/3/201451 PSUs CS 587 9.7 Record Formats: Fixed Length Information about field types is the same for all records in a file; it is stored in system catalogs. Finding the ith field does not require a scan of the record. [Figure: a record at base address (B) with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4; the address of F3 is B+L1+L2.]
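The address computation for fixed-length records is a prefix sum over the field lengths. A minimal sketch; the field lengths and base address are hypothetical values chosen for illustration.

```python
from itertools import accumulate

def field_offsets(lengths):
    """Offset of each field from the record's base address."""
    return [0] + list(accumulate(lengths))[:-1]

lengths = [4, 8, 2, 16]   # hypothetical L1..L4 in bytes, as read from the catalog
base = 1000               # hypothetical base address B

offsets = field_offsets(lengths)
address_f3 = base + offsets[2]   # B + L1 + L2
print(offsets, address_f3)
```

Because the lengths come from the catalog and are the same for every record, the offsets can be computed once per file rather than once per record.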
6/3/201452 PSUs CS 587 Record Formats: Variable Length Two alternative formats (# fields is fixed): (1) Field count followed by fields delimited by special symbols: 4 F1 $ F2 $ F3 $ F4 $. (2) Array of field offsets followed by the fields: F1 F2 F3 F4. * The second offers direct access to the ith field and efficient storage of nulls (special don't know value), with small directory overhead.
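The second format, an array of field offsets, can be sketched as below. The encoding choices are assumptions made for illustration: two-byte little-endian offsets, one extra offset marking the end of the last field, and a zero-length field standing in for a null (so this toy version cannot distinguish a null from an empty value).

```python
import struct

def encode(fields):
    """Header of len(fields)+1 two-byte offsets, followed by the field bytes."""
    header_size = 2 * (len(fields) + 1)
    offsets, pos = [], header_size
    for f in fields:
        offsets.append(pos)
        pos += len(f)
    offsets.append(pos)               # end of the last field
    return struct.pack(f"<{len(offsets)}H", *offsets) + b"".join(fields)

def get_field(record, i):
    """Direct access to field i: read two adjacent offsets, then slice."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return record[start:end]

rec = encode([b"Ann", b"", b"Portland", b"42"])
print(get_field(rec, 2), len(rec))
```

Reaching field i takes two offset reads and a slice, with no scan over the earlier fields, and the null in position 1 costs only its entry in the offset array.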