Presentation on theme: "Chapter 9, Disks and Files"— Presentation transcript:
1 Chapter 9, Disks and Files
The Storage Hierarchy
Disks
Mechanics
Performance
RAID
Disk Space Management
Buffer Management
Files of Records
Format of a Heap File
Format of a Data Page
Format of Records
2 Learning objectives
Given disk parameters, compute storage needs and read times.
Given a reminder about what each level means, be able to derive any figure on the RAID performance slide.
Describe the pros and cons of alternative structures for files, pages and records.
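3 A (Very) Simple Hardware Model
[Figure: a CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge. The I/O bridge connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]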
4 Storage Options
Level: capacity; access time; cost
Registers: 1k-2k bytes; 1 Tc; way expensive
Caches: 10s-1000s of KBytes; 2-20 Tc; $10 / MByte
Main Memory: GBytes; 300-1000 Tc; $0.03 / MB (eBay)
Hard Disk / Flash: 100s of GBytes; 10 ms = 30M Tc; $0.10 / GB (eBay)
Tape: infinite; forever; way cheap
5 Memory "Hierarchy"
Levels run from the upper level (faster) to the lower level (larger).
Level: capacity; access time; cost
Registers: 1k-2k bytes; 1 Tc; way expensive
Cache (SRAM, may be multiple levels!): 10s-1000s of KBytes; 2-20 Tc; $10 / MByte
Memory (DRAM): GBytes; 300-1000 Tc; $0.03 / MB (eBay)
Disk: 100s of GBytes; 10 ms = 30M Tc; $0.10 / GB (eBay)
Tape: infinite; forever; way cheap
Staging between adjacent levels (unit; managed by; transfer size):
Registers and cache: instruction operands; program/compiler; 1-8 bytes
Cache and memory: blocks; cache controller; 8-128 bytes
Memory and disk: pages; OS; 4K+ bytes
Disk and tape: files; user/operator; GBytes
6 Why Does "Hierarchy" Work?
Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
7 9.1 The Memory Hierarchy
Typical storage hierarchy as used by an RDBMS:
Primary storage: main memory (RAM) for currently used data
Secondary storage: disk or flash memory for the main database
What are other reasons besides cost to use disk? Persistence: we want databases to stay around. Size: 32-bit addressing is insufficient for many databases.
Tertiary storage: tapes, DVDs for archiving older versions of the data
Other factors: caches at every level; controllers and protocols; network connections
8 What is Flash Memory, Anyway?
Floating-gate transistor
Presence of charge => "0"
Erase electrically or with UV (EPROM)
Performance
Reads like DRAM (~ns)
Writes like disk (~ms); a write is a complex operation
9 Components of a Disk
[Figure labels: spindle, disk head, tracks, sector, platters, arm movement, arm assembly]
Platters are always spinning (say, 120 rps).
One head reads/writes at any one time.
To read a record:
  position arm (seek)
  engage head
  wait for data to spin by
  read (transfer data)
120 rps = 120 rotations/second x 60 seconds/minute = 7,200 rpm
10 More terminology
Each track is made up of fixed-size sectors.
Page size is a multiple of sector size.
A platter typically has data on both surfaces.
All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).
12 Typical Disk Drive Statistics (2008)
Sector size: 512 bytes
Seek time
  Average ms
  Track to track ms
Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM)
Transfer time (sustained data rate): msec per 8K page, or MB/second
Density: 12-18 GB/in^2
13 Disk Capacity
Capacity: maximum number of bits that can be stored. Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes.
Capacity is determined by:
  Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
  Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
  Areal density (bits/in^2): product of recording and track density.
Modern disks partition tracks into disjoint subsets called recording zones.
  Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
  Each zone has a different number of sectors/track.
14 Cost of Accessing Data on Disk
Time to access (read/write) a disk block:
  Taccess = Tavg seek + Tavg rotation + Tavg transfer
  seek time (moving arms to position disk head on track)
  rotational delay (waiting for block to rotate under head): half a rotation, on average
  transfer time (actually moving data to/from disk surface)
Key to lower I/O cost: reduce seek/rotation delays! There is no way to avoid transfer time.
The textbook measures query cost by the NUMBER of page I/Os.
  This implies that all I/Os have the same cost and that CPU time is free.
  This is a common simplification.
  Real DBMSs (in the optimizer) would distinguish sequential from random disk reads, because sequential reads are much faster, and would count CPU time.
15 Disk Parameters Practice
A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB.
How many cylinders are required to store an 8 Gigabyte file?
What is the average rotational delay, in milliseconds?
Cylinders:
  2**33 = 2**3 x 2**30 bytes in the file
  2**18 = 2**8 x 2**10 bytes per track
  So 2**33 / 2**18 = 2**15 = 32K tracks in the file
  4 tracks per cylinder (2 platters x 2 surfaces)
  So 8K cylinders per file
Rotational delay:
  1/7200 minutes/rotation x 60 seconds/minute = 1/120 seconds/rotation
  Average rotational delay is half a rotation, or 1/240 seconds = 4.2 msecs
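The practice problem can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes data on both surfaces of every platter and 1 GB = 2**30 bytes. It evaluates both an 8 GB and a 1 GB file, since a 1 GB file is the size that yields 4K tracks and 1K cylinders.

```python
def cylinders_needed(file_bytes, track_bytes, platters, surfaces_per_platter=2):
    """Cylinders needed to hold file_bytes, rounded up to whole cylinders."""
    tracks_per_cylinder = platters * surfaces_per_platter
    cylinder_bytes = track_bytes * tracks_per_cylinder
    return -(-file_bytes // cylinder_bytes)  # ceiling division


def avg_rotational_delay_ms(rpm):
    """Average rotational delay is half of one rotation, in milliseconds."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2


KB, GB = 2**10, 2**30
print(cylinders_needed(8 * GB, 256 * KB, platters=2))  # 8 GB file
print(cylinders_needed(1 * GB, 256 * KB, platters=2))  # 1 GB file
print(round(avg_rotational_delay_ms(7200), 2))         # milliseconds
```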
16 Disk Access Time Example
Given:
  Rotational rate = 7,200 RPM
  Average seek time = 9 ms
  Avg # sectors/track = 400
Derived:
  Tavg rotation = 1/2 x (60 secs/7200 RPM) x 1000 ms/sec = 4 ms
  Tavg transfer = 60/7200 RPM x 1/400 secs/track x 1000 ms/sec = 0.02 ms
  Taccess = 9 ms + 4 ms + 0.02 ms
Important points:
  Access time is dominated by seek time and rotational latency.
  The first bit in a sector is the most expensive; the rest are free.
  SRAM access time is about 4 ns/doubleword, DRAM about 60 ns.
  Disk is about 40,000 times slower than SRAM, 2,500 times slower than DRAM.
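The same derivation as a small Python sketch (the function name is hypothetical). It keeps full precision, so the rotational term comes out slightly above the rounded 4 ms used on the slide.

```python
def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
    """Return (total, rotation, transfer) times in ms for reading one sector."""
    ms_per_rotation = 60_000 / rpm
    rotation = ms_per_rotation / 2                  # half a rotation on average
    transfer = ms_per_rotation / sectors_per_track  # one sector passes the head
    return avg_seek_ms + rotation + transfer, rotation, transfer


total, rotation, transfer = access_time_ms(7200, 9, 400)
print(f"rotation {rotation:.2f} ms, transfer {transfer:.3f} ms, total {total:.2f} ms")
```

Seek and rotation together account for almost all of the total, which is the point of the slide.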
18 Block, page and record sizes
Block: according to the text, the smallest unit of I/O.
Page: often used in place of block.
"Typical" record size: commonly hundreds, sometimes thousands of bytes, unlike the toy records in textbooks.
"Typical" page size: 4K, 8K
19 Effect of page size on read time
Suppose rotational delay is 4 ms, average seek time is 6 ms, and transfer speed is .5 msec/8K.
This graph shows the time required to read 1 Gigabyte of data for different page sizes.
20 Why the difference?
What accounts for the difference in times to read one Gigabyte on the previous graph?
Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed .5 msec/8K
Transfer time
  (2**30 / 2**13 8K blocks) x (.5 msec/8K) = 66 secs ~= one minute
How many reads?
  Page size 8K: there are 2**30 / 2**13 = 2**17 = 128K reads
  Page size 64K: there are 1/8th that many reads = 16K reads
Time taken by rotational delays and seeks
  Each read requires a rotational delay and a seek, totalling 10 msec.
  8K: (128K reads) x (10 msec/read) = 1,311 secs ~= 22 minutes
  64K: 1/8 of that, or 164 secs ~= 3 minutes
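A short Python sketch of this calculation, using the slide's parameters as defaults (the function name is illustrative, and every read is assumed to pay one full seek and one rotational delay):

```python
def read_time_secs(total_bytes, page_bytes, seek_ms=6.0, rotation_ms=4.0,
                   transfer_ms_per_8k=0.5):
    """Return (positioning_secs, transfer_secs) for reading total_bytes."""
    reads = total_bytes // page_bytes
    positioning = reads * (seek_ms + rotation_ms) / 1000
    transfer = (total_bytes / 8192) * transfer_ms_per_8k / 1000
    return positioning, transfer


GB = 2**30
for page_kb in (8, 64, 1024):
    positioning, transfer = read_time_secs(GB, page_kb * 1024)
    print(f"{page_kb:>5}K pages: positioning {positioning:8.1f} s, "
          f"transfer {transfer:5.1f} s")
```

Transfer time is the same for every page size; only the positioning cost shrinks as pages grow.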
21 Moral of the Story
As page size increases, read (and write) time reduces to transfer time, a big savings.
So why not use a huge page size?
  Wastes memory space if you don't need all that is read
  Wastes read time if you don't need all that is read
What applications could use a large page size?
  Those that sequentially access data
The problem with a small page size is that pages get scattered across the disk. Turn the page...
Page size is set by the OS because of the virtual memory system's importance. Most server-class OSes support larger page sizes, up to megabytes in size.
22 Faster I/O, even with a small page size
Even if the page size is small, you can achieve fast I/O by storing a file's data as follows:
  Consecutive pages on the same track, followed by
  Consecutive tracks on the same cylinder, followed by
  Consecutive cylinders adjacent to each other
The first two incur no seek time or rotational delay; the seek for the third is only one track.
What is saved with this storage pattern?
How is this storage pattern obtained?
  Disk defragmenter and its relatives/predecessors
  Also places frequently used files near the spindle
When data is in this storage pattern, the application can do sequential I/O. Otherwise it must do random I/O.
23 More Hardware Issues
Disk controllers
  Interface from disks to bus
  Checksums, remap bad sectors, driver mgt, etc.
Interface protocols and MB per second transfer rates
  IDE/EIDE/ATA/PATA, SATA: 133
  SCSI: 640
  BUT for a single device, SCSI is inferior
  Faster network technologies such as Fibre Channel
Storage Area Networks (SANs)
  Disk farm networked to servers
  Servers can be heterogeneous, a primary advantage
  Centralized management
24 Dependability
Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
  Mean Time To Failure (MTTF) measures reliability
  Failures In Time (FIT) = 1/MTTF, the rate of failures; traditionally reported as failures per billion hours of operation
Mean Time To Repair (MTTR) measures service interruption
  Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
  Module availability = MTTF / (MTTF + MTTR)
25 Example calculating reliability
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules.
Example: calculate FIT and MTTF for
  10 disks (1M hour MTTF per disk)
  1 disk controller (0.5M hour MTTF)
  and 1 power supply (0.2M hour MTTF)
26 Example calculating reliability
Calculate FIT and MTTF for
  10 disks (1M hour MTTF per disk)
  1 disk controller (0.5M hour MTTF)
  and 1 power supply (0.2M hour MTTF):
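Under the stated assumption that failure rates simply add, the calculation can be sketched in Python as follows. The function name and the (count, MTTF) tuple layout are illustrative choices, not taken from the slides.

```python
def system_failure_rate(components):
    """components: list of (count, mttf_hours). Returns failures per hour."""
    return sum(count / mttf_hours for count, mttf_hours in components)


rate = system_failure_rate([
    (10, 1_000_000),  # 10 disks, 1M hour MTTF each
    (1, 500_000),     # 1 disk controller
    (1, 200_000),     # 1 power supply
])
fit = rate * 1e9       # failures per billion hours
mttf_hours = 1 / rate  # MTTF of the whole system
print(f"FIT = {fit:.0f}, MTTF = {mttf_hours:.0f} hours")
```

The sum is 10/1M + 2/1M + 5/1M = 17 failures per million hours, so the system's MTTF is far shorter than that of any single component.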
27 9.2 RAID
Disk array: arrangement of several disks that gives the abstraction of a single, large disk.
Goals: increase performance and reliability.
Two main techniques:
  Data striping: data is partitioned; the size of a partition is called the striping unit. Partitions are distributed over several disks.
  Redundancy: more disks => more failures. Redundant information allows reconstruction of data if a disk fails.
28 Data Striping
CPUs go fast, disks don't. How can disks keep up? CPUs do work in parallel. Can disks?
Answer: partition data across D disks (see next slide).
If the partition unit is a page:
  A single page I/O request is no faster
  Multiple I/O requests can run at aggregated bandwidth
The number of pages in a partition unit is called the depth of the partition.
Contrary to the text, partition units of a bit are almost never used and partition units of a byte are rare.
29 Data Striping (RAID Level 0)
Disk 0: blocks 0, D, 2D, ...
Disk 1: blocks 1, D+1, 2D+1, ...
Disk 2: blocks 2, D+2, 2D+2, ...
...
Disk D-1: blocks D-1, 2D-1, 3D-1, ...
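The round-robin placement on this slide amounts to a modulo and an integer division. A minimal sketch, assuming a striping unit of one block and hypothetical function names:

```python
def locate_block(block, num_disks):
    """Return (disk number, position on that disk) for a logical block."""
    return block % num_disks, block // num_disks


def layout(num_blocks, num_disks):
    """List the logical blocks stored on each disk, in order."""
    disks = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disk, _ = locate_block(block, num_disks)
        disks[disk].append(block)
    return disks


for disk, blocks in enumerate(layout(12, 4)):
    print(f"Disk {disk}: {blocks}")
```

With D = 4, disk 0 holds blocks 0, 4, 8 (that is, 0, D, 2D), matching the picture.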
30 Redundancy
Striping is seductive, but remember reliability!
  MTTF of a disk is about 6 years
  If we stripe over 24 disks, what is MTTF?
Solution: redundancy
  Parity: corrects single failures
  Others: detect where the failure is, and correct multiple failures
  But failure location is provided by the controller
Redundancy may require more than one check bit.
Redundancy makes writes slower. Why?
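Both ideas on this slide fit in a few lines of Python. The sketch assumes independent disks with exponentially distributed lifetimes, so failure rates add, and it uses bytewise XOR as the parity function. The names are illustrative.

```python
from functools import reduce


def array_mttf(disk_mttf, num_disks):
    """MTTF of an array that is lost when any one disk fails (rates add)."""
    return disk_mttf / num_disks


def parity(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


print(array_mttf(6, 24) * 12, "months")  # 6-year disks striped over 24 disks

data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]  # one block per data disk
check = parity(data)                            # stored on the check disk
rebuilt = parity([data[0], data[2], check])     # suppose disk 1 has failed
print(rebuilt == data[1])
```

XOR-ing the surviving blocks with the parity block reproduces the lost block, but only because the controller has already reported which disk failed.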
31 RAID Levels
Standardized by SNIA (www.snia.org); vary in practice.
For each level, decide (assume single user):
  Number of disks required to hold D disks of data.
  Speedup s (compared to 1 disk) for S/R (Sequential/Random) R/W (Reads/Writes)
    Random: each I/O is one block
    Sequential: each I/O is one stripe
  Number of disks/blocks that can fail without data loss
Level 0: block striped, no redundancy
  Picture is 2 slides back
32 JBOD, RAID Level 1
JBOD: Just a Bunch of Disks
[Figure: Disk 0 through Disk D-1, each holding its own blocks]
Level 1: Mirrored (two identical JBODs, no striping)
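33 RAID Level 0+1: Stripe + Mirror
First copy, striped over Disk 0 through Disk D-1:
  Disk 0: blocks 0, D, 2D, ...
  Disk 1: blocks 1, D+1, 2D+1, ...
  Disk 2: blocks 2, D+2, 2D+2, ...
  Disk D-1: blocks D-1, 2D-1, 3D-1, ...
Mirror copy, striped the same way over Disk D through Disk 2D-1:
  Disk D: blocks 0, D, 2D, ...
  Disk D+1: blocks 1, D+1, 2D+1, ...
  Disk 2D-1: blocks D-1, 2D-1, 3D-1, ...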
34 RAID Level 4
Block-Interleaved Parity (not common)
One check disk, uses one bit of parity.
How to tell if there is a failure, or which disk failed?
Read-modify-write
Disk D is a bottleneck
Disk 0: blocks 0, D, 2D, ...
Disk 1: blocks 1, D+1, 2D+1, ...
Disk 2: blocks 2, D+2, 2D+2, ...
Disk D-1: blocks D-1, 2D-1, 3D-1, ...
Disk D: parity blocks P, P, P, ...
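35 RAID Level 5
Level 5: Block-Interleaved Distributed Parity
[Figure: blocks striped across D+1 disks, with the parity block P of each stripe on a different disk. On the last three disks the first stripe holds D-2, D-1, P; the second holds 2D-2, P, 2D-1; the third holds P, 3D-2, ...]
Level 6: like 5, but 2 parity bits/disks
  Can survive loss of 2 disks/blocks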
36 Notation on the next slide
#Disks: number of disks required to hold D disks worth of data using this RAID level
Read/write speedup of blocks in a single file:
  SR: sequential read
  RR: random read
  SW: sequential write
  RW: random write
Failure tolerance: how many disks can fail without loss of data
Internal data: s = blocks transferred in the time it takes to transfer one block of data from one disk.
These numbers are theoretical! YMMV... and vary significantly!
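37 RAID Performance
Columns: Level; #Disks; speedup for SR, RR, SW, RW; Failure Tolerance
Level 0: #Disks D; SR s=D; RR 1≤s≤D
Level 1: #Disks 2D; SR s=2; writes s=1**; Failure Tolerance D*
Level 0+1: SR s=2D; RR 2≤s≤2D; SW s=D**; RW 1≤s≤D**
Level 5: #Disks D+1; speedup varies
* If no two are copies of each other
** Note: can't write both mirrors at once. Why?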
38 Small Writes on Levels 4 and 5
Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified.
On small writes this can be very expensive.
This is another justification for log-based file systems (see your OS course).
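A sketch of the read-modify-write cycle, assuming XOR parity and hypothetical helper names. The new parity is computed from the old data block, the new data block and the old parity block, so a one-block write costs two reads and two writes regardless of how many disks are in the stripe.

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data, new_data, old_parity):
    """New parity after overwriting one data block in a stripe.

    Reads: old_data, old_parity. Writes: new_data, the returned parity.
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
old_parity = xor_blocks(xor_blocks(stripe[0], stripe[1]), stripe[2])

new_block = b"\xff\x00"  # overwrite the block on disk 1
new_parity = small_write_parity(stripe[1], new_block, old_parity)

# Same result as recomputing parity from the whole updated stripe
full = xor_blocks(xor_blocks(stripe[0], new_block), stripe[2])
print(new_parity == full)
```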
39 Which RAID Level is best?
If data loss is not a problem: Level 0
If storage cost is not a problem: Level 0+1
Else: Level 5
Software support
  Linux: 0, 1, 4, 5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
  Windows: 0, 1, 5 (http://www.techimo.com/articles/index.pl?photo=149)
41 9.4.2 DBMS vs. OS File System
OS does disk space & buffer mgmt: why not let the OS manage these tasks?
  Differences in OS support: portability issues
  Some limitations, e.g., files can't span disks.
  Buffer management in a DBMS requires the ability to:
    pin a page in the buffer pool, force a page to disk (important for implementing CC & recovery),
    adjust the replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit in the buffer pool.
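The MRU remark can be illustrated with a toy simulation. This is a sketch only: `count_misses` is a made-up helper, pinning is not modelled, and the workload is a loop over 5 pages with a 4-frame buffer pool.

```python
from collections import OrderedDict


def count_misses(refs, frames, policy):
    """Count buffer misses for a page reference string under LRU or MRU."""
    pool = OrderedDict()  # ordered from least to most recently used
    misses = 0
    for page in refs:
        if page in pool:
            pool.move_to_end(page)  # mark as most recently used
            continue
        misses += 1
        if len(pool) >= frames:
            # LRU evicts the oldest entry, MRU evicts the newest
            pool.popitem(last=(policy == "MRU"))
        pool[page] = True
    return misses


refs = list(range(5)) * 4  # scan pages 0..4 four times
print("LRU misses:", count_misses(refs, 4, "LRU"))
print("MRU misses:", count_misses(refs, 4, "MRU"))
```

With LRU the page about to be needed is always the one just evicted, so every reference misses. MRU keeps most of the loop resident.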
42 9.5 Files of Records
Page or block is OK when doing I/O, but higher levels of the DBMS operate on records, and files of records.
FILE: a collection of pages, each containing a collection of records. Must support:
  insert/delete/modify record
  read a particular record (specified using record id)
  scan all records (possibly with some conditions on the records to be retrieved)
43 9.5.1 Unordered (Heap) Files
Simplest file structure; contains records in no particular order.
As the file grows and shrinks, disk pages are allocated and de-allocated.
To support record-level operations, we must:
  keep track of the pages in a file
  keep track of free space on pages
  keep track of the records on a page
There are at least two alternatives for keeping track of heap files.
44 Heap File Implemented as a List
[Figure: a header page heads two linked lists of data pages, one of full pages and one of pages with free space]
The header page id and heap file name must be stored someplace.
Each page contains 2 'pointers' plus data.
45 Heap File Using a Page Directory
[Figure: a directory, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N]
The entry for a page can include the number of free bytes on the page.
The directory is a collection of pages; linked list implementation is just one alternative.
  Much smaller than linked list of all HF pages!
46 Comparing Heap File Implementations
Assume
  100 directory entries per page
  U full pages, E pages with free space
  D directory pages
  Then D = (U+E)/100
  Note that D is two orders of magnitude less than U+E
Cost to find a page with enough free space
  List: E/2; Directory: (D/2) + 1
Cost to move a page from Full to Free (e.g., when a record is deleted)
  List: 3; Directory: 1
Can you think of some other operations?
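Plugging numbers into these formulas makes the gap concrete. The sketch below uses a hypothetical helper and an arbitrary example file with 9,000 full pages and 1,000 pages with free space. Costs are expected page I/Os, as on the slide.

```python
import math


def heap_file_costs(full_pages, free_pages, entries_per_dir_page=100):
    """Expected page I/Os to find a page with free space, per the slide."""
    dir_pages = math.ceil((full_pages + free_pages) / entries_per_dir_page)
    return {
        "directory_pages": dir_pages,
        "find_free_list": free_pages / 2,          # walk half the free list
        "find_free_directory": dir_pages / 2 + 1,  # half the directory + the page
    }


print(heap_file_costs(full_pages=9000, free_pages=1000))
```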
48 Packed vs Unpacked Page Formats
Record ID (RID, TID) = (page#, slot#), in all page formats
  Note that indexes are filled with RIDs
  Data entries in alternatives 2 and 3 are (key, RID..)
Packed
  stores more records
  RIDs change when a record is deleted
  This may not be acceptable.
Unpacked
  RID does not change
  Less data movement when deleting
50 Slotted Page Format
Intergalactic standard, for fixed-length records also.
How to deal with free space fragmentation?
  Pack records, lazily
  Note that RIDs don't change
How are updates handled which expand the size of a record?
  Forwarding flag to new location
See postgresql-8.3.1\src\include\storage\bufpage.h
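A toy slotted page in Python shows why RIDs survive lazy packing: the slot directory is the only thing that knows record offsets. This is a teaching sketch under stated assumptions and not PostgreSQL's layout. The class, its method names and the 4-byte slot entry size are all illustrative.

```python
SLOT_BYTES = 4  # assumed size of one (offset, length) directory entry


class SlottedPage:
    def __init__(self, size=4096):
        self.buf = bytearray(size)
        self.free_start = 0  # next free byte in the record area
        self.slots = []      # slot number -> (offset, length), or None

    def free_bytes(self):
        return len(self.buf) - self.free_start - SLOT_BYTES * len(self.slots)

    def insert(self, record):
        reuse = None in self.slots  # a deleted slot number can be reused
        need = len(record) + (0 if reuse else SLOT_BYTES)
        if self.free_bytes() < need:
            self.compact()          # pack lazily, only when space runs out
        if self.free_bytes() < need:
            raise ValueError("page full")
        if reuse:
            slot = self.slots.index(None)
        else:
            slot = len(self.slots)
            self.slots.append(None)
        offset = self.free_start
        self.buf[offset:offset + len(record)] = record
        self.free_start += len(record)
        self.slots[slot] = (offset, len(record))
        return slot

    def delete(self, slot):
        self.slots[slot] = None     # bytes are reclaimed later by compact()

    def read(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.buf[offset:offset + length])

    def compact(self):
        """Slide live records together; slot numbers (RIDs) do not change."""
        packed, pos = bytearray(len(self.buf)), 0
        for slot, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            packed[pos:pos + length] = self.buf[offset:offset + length]
            self.slots[slot] = (pos, length)
            pos += length
        self.buf, self.free_start = packed, pos
```

After a delete followed by an insert that triggers compaction, a record's slot number is unchanged even though its offset inside the page has moved.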
51 9.7 Record Formats: Fixed Length
[Figure: fields F1, F2, F3, F4 with lengths L1, L2, L3, L4, starting at base address B; the address of F3 is B+L1+L2]
Information about field types is the same for all records in a file; it is stored in system catalogs.
Finding the i'th field does not require a scan of the record.
52 Record Formats: Variable Length
Two alternative formats (# fields is fixed):
  Fields delimited by special symbols: a field count (4 in the figure), then fields F1 to F4, each followed by the delimiter $
  Array of field offsets: an offset array at the start of the record points to fields F1 to F4
The second offers direct access to the i'th field and efficient storage of nulls (special "don't know" value), with small directory overhead.
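The second format can be sketched with Python's `struct` module. The encoding choices here are assumptions made for illustration: 2-byte little-endian offsets, one extra offset marking the end of the last field, and a null stored as a zero-length field (so an empty value and a null look the same).

```python
import struct


def encode(fields):
    """fields: list of bytes or None. Returns offset array + field data."""
    n = len(fields)
    pos = 2 * (n + 1)      # data starts after n+1 two-byte offsets
    offsets, body = [], b""
    for value in fields:
        offsets.append(pos)
        if value is not None:
            body += value
            pos += len(value)
    offsets.append(pos)    # end of the last field
    return struct.pack(f"<{n + 1}H", *offsets) + body


def get_field(record, i):
    """Direct access to the i'th field; no scan of earlier fields."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return record[start:end]


rec = encode([b"Ann", None, b"42"])
print(get_field(rec, 0), get_field(rec, 1), get_field(rec, 2))
```

A null field costs only its offset entry, which is the "efficient storage of nulls" noted on the slide.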