Presentation on theme: "5. Disk, Pages and Buffers Why Not Store Everything in Main Memory"— Presentation transcript:
1 5. Disk, Pages and Buffers: Why Not Store Everything in Main Memory?
Why use disks at all?
Main memory costs too much. $1000 will buy you either 1 GB of RAM or 1 TB of disk today (maybe more!), i.e., roughly 1000 times as much disk as RAM per dollar.
Main memory is volatile. We want data to be saved between program runs (data persistence or residuality).
Main memory is smaller. System disks typically hold orders of magnitude more data than RAM (TBs vs. GBs).
The typical storage hierarchy is:
Main memory (RAM) for data in current use.
Disk for the main database (secondary storage).
Tapes for archiving older versions of the data (tertiary storage).
2 Disks
Secondary storage device of choice. Main advantage over tapes: random access vs. sequential access.
Data is stored and retrieved in units called disk blocks or pages.
Unlike RAM, the time to retrieve a disk page varies depending upon its location on disk.
Caveat: in NUMA (non-uniform memory access) machines, e.g., NDSU CHPC's SGI ALTIX, even in RAM the time to retrieve depends upon location (which brick or quad).
Therefore, the relative placement of pages on disk has a major impact on DBMS performance!
3 Components of a Disk
The arm assembly moves in/out to position a head on a desired track.
The collection of tracks under the heads at any one time is a cylinder.
Only one head reads/writes at any one time.
Block size (the smallest unit of transfer) is a multiple of the sector size (which is fixed).
[Figure: components of a disk, labeled spindle, platters, tracks (both sides of platter), sector, cylinder, disk head, arm assembly, arm movement.]
4 Accessing a Disk Page
Delay time to access (read/write) data on a disk block:
seek time (moving the arms to position the disk head on the track)
rotational delay (waiting for the block to rotate under the head)
transfer time (actually moving the data to/from the disk surface)
Seek time and rotational delay dominate.
Seek times can vary from about 1 to 20 msec.
Rotational delay can vary from 0 to 10 msec.
Transfer time can be about 1 msec per 4KB page.
Key to lower I/O cost: reduce seek/rotation delays! Hardware vs. software solutions?
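The cost components above can be added up in a small sketch. The figures used here (10 msec seek, 5 msec rotational delay) are assumed mid-range values picked from the slide's ranges, not measurements, and `access_time_ms` is a hypothetical helper name.

```python
def access_time_ms(seek_ms, rotation_ms, transfer_ms_per_page, pages=1):
    """Time to read `pages` contiguous pages after one seek and one rotational wait."""
    return seek_ms + rotation_ms + transfer_ms_per_page * pages


# Assumed averages: 10 msec seek, 5 msec rotational delay, 1 msec transfer per 4KB page.
random_read = access_time_ms(seek_ms=10, rotation_ms=5, transfer_ms_per_page=1)

# 100 pages laid out contiguously: one seek, one rotational wait, 100 transfers.
sequential_100 = access_time_ms(10, 5, 1, pages=100)

# 100 pages scattered over the disk: every page pays the full positioning cost.
random_100 = 100 * random_read

print(random_read, sequential_100, random_100)
```

Under these assumptions the scattered reads cost more than ten times as much as the contiguous ones, which is why reducing seek and rotational delay is the key to lower I/O cost.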
5 Arranging Pages on Disk (clustering)
'Next' block concept:
blocks on the same track, followed by
blocks on the same cylinder, followed by
blocks on an adjacent cylinder.
Blocks in a file should be arranged sequentially on disk (by the above notions of 'next') to minimize seek and rotational delay.
For a sequential scan, pre-fetching several pages at a time is a big win!
6 RAID (redundant array of independent disks)
RAID disk array: an arrangement of several disks that gives the abstraction of a single, large disk.
RAID goals: increase performance (more concurrent read/write heads) and reliability (redundant data copies are kept).
RAID's two main techniques:
Data striping: data is partitioned and "striped" across disks; the size of a partition is called the striping unit. Partitions are distributed over several disks, allowing more read/write heads to operate in parallel.
Redundancy: redundant information allows reconstruction of the data if 1 disk fails.
7 RAID Levels
Level 0: Block striping but no redundancy (e.g., blocks 1,2,3,4 on disks 1,2,3,4, resp.). Faster reads (more r/w heads working in parallel).
Level 1: Mirroring (2 identical copies). Each disk has a mirror image disk (check disk). Parallel reads, but a write involves 2 disks. Improved durability.
Level 0+1 (sometimes called level 10): Block striping and mirroring. Faster reads plus improved durability.
Level 2: Simple bit-striping. Not used these days.
[Figures: each level is illustrated with a row of four disks, Disk1 through Disk4.]
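A minimal sketch of the block-striping idea in Level 0, assuming simple round-robin placement and 0-based numbering (the slide counts blocks and disks from 1). The function names `locate_block` and `mirrored_locations` are hypothetical, and the pairing of each data disk with a mirror at the next index is just one possible layout.

```python
def locate_block(block_no, num_disks):
    """Round-robin striping: map a 0-based logical block to (disk index, offset on that disk)."""
    return block_no % num_disks, block_no // num_disks


def mirrored_locations(block_no, num_pairs):
    """Striping over mirrored pairs (assumed layout): disks 2k and 2k+1 hold identical copies."""
    pair, offset = locate_block(block_no, num_pairs)
    return [(2 * pair, offset), (2 * pair + 1, offset)]


# Blocks 0..3 land on disks 0..3; block 4 wraps around to disk 0 at the next offset.
for b in range(6):
    print(b, locate_block(b, 4))
```

Consecutive blocks fall on different disks, so a multi-block read can keep several heads busy at once; with mirroring, a read may use either copy while a write must update both.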
8 RAID Levels 3,4,5
Level 3: Bit-Interleaved Parity. Striping unit = 1 bit (e.g., bits 1,2,3,4 on disks 1,2,3,4, resp.). 1 check disk. Each read/write request involves all disks; the disk array can process 1 request at a time (but very rapidly).
Level 4: Block-Interleaved Parity. Striping unit = 1 block (e.g., blocks 1,2,3,4 on disks 1,2,3,4, resp.). 1 check disk. Parallel reads are possible for small requests; large requests can utilize the full bandwidth. Writes involve modifying the block and the check disk.
Level 5: Block-Interleaved Distributed Parity. Similar to Level 4, but parity blocks are distributed over all disks (striping unit = block). This eliminates the parity disk hot-spot.
[Figures: Levels 3 and 4 are illustrated with Disk1 through Disk4 plus a separate parity disk; Level 5 with the parity spread over all disks.]
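How a single check block allows reconstruction can be sketched with XOR parity. This is an illustration only: `parity`, `reconstruct` and `parity_disk` are hypothetical names, and the rotation used in `parity_disk` is one possible Level 5 layout, assumed for the example.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the one missing data block from the surviving blocks and the parity block."""
    return parity(list(surviving_blocks) + [parity_block])


def parity_disk(stripe_no, num_disks):
    """Assumed Level 5 layout: the parity block moves to a different disk for each stripe."""
    return (num_disks - 1 - stripe_no) % num_disks


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # one stripe over three data disks
check = parity(data)                              # stored on the check disk

# Suppose the second disk fails: XOR of everything that survives yields its block.
recovered = reconstruct([data[0], data[2]], check)
print(recovered == data[1])
```

Because every write must also update the parity block, keeping all parity on one disk (Level 4) makes that disk a bottleneck; rotating its position, as `parity_disk` does, spreads that load.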
9 Buffer Management in a DBMS
[Figure: page requests from higher levels arrive at the BUFFER POOL (page frames, some occupied, some free) in MAIN MEMORY; the DB resides on DISK.]
The Disk_mgr transfers pages between page frames and disk (page frame <-> disk).
The choice of frame for a page is dictated by the replacement policy.
Data must be in RAM (buffer) for the DBMS to operate on it!
A LookupTable of <frame#, pageid> pairs is maintained.
10 When a Page is Requested
If the requested page (from a higher level) is not in the buffer pool:
Choose a frame for replacement.
If the frame is dirty (has been changed while in RAM), write it to disk first.
Read the requested page into that frame (and update the LookupTable).
Pin the page (designate it as temporarily non-replaceable) and return its address to the requesting higher-level layer process.
If requests can be predicted (e.g., in sequential scans), pages can be pre-fetched several at a time!
11 More on Buffer Management
The requestor of a page, when it is done with that page, must unpin it (actually, decrement its pin count) and set the dirty bit if the page has been modified.
Because a page in the buffer pool may be requested concurrently by many higher-layer processes, a pin count is used (the LookupTable has <frame#, pg-ID, pincnt>).
A page is a candidate for replacement iff pin count = 0.
A note: the CC & recovery subsystem may force additional I/O when a frame is chosen for replacement (e.g., to implement a Write-Ahead Log protocol; more later on that).
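The request, pin and unpin steps can be put together in a minimal sketch. Everything here is an assumption made for illustration: the class name `BufferPool`, the use of a Python dict as a stand-in for the disk manager, and the placeholder replacement policy (first free frame, else first frame with pin count 0). Concurrency control and write-ahead logging are left out.

```python
class BufferPool:
    def __init__(self, num_frames, disk):
        self.disk = disk                      # dict page_id -> bytes; stands in for the disk manager
        self.frames = [None] * num_frames     # each frame: dict with page_id, data, pin_count, dirty
        self.lookup = {}                      # LookupTable: page_id -> frame number

    def pin(self, page_id):
        """Return the page's in-memory data, reading it from disk first if necessary."""
        if page_id not in self.lookup:
            frame_no = self._choose_victim()
            old = self.frames[frame_no]
            if old is not None:
                if old["dirty"]:              # dirty frame: write it to disk before reuse
                    self.disk[old["page_id"]] = bytes(old["data"])
                del self.lookup[old["page_id"]]
            self.frames[frame_no] = {"page_id": page_id,
                                     "data": bytearray(self.disk[page_id]),
                                     "pin_count": 0, "dirty": False}
            self.lookup[page_id] = frame_no
        frame = self.frames[self.lookup[page_id]]
        frame["pin_count"] += 1               # pinned: temporarily non-replaceable
        return frame["data"]

    def unpin(self, page_id, dirty=False):
        """Decrement the pin count and record whether the requestor modified the page."""
        frame = self.frames[self.lookup[page_id]]
        if frame["pin_count"] == 0:
            raise ValueError("page is not pinned")
        frame["pin_count"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def _choose_victim(self):
        """Placeholder policy: a free frame if any, else the first frame with pin count 0."""
        for i, f in enumerate(self.frames):
            if f is None:
                return i
        for i, f in enumerate(self.frames):
            if f["pin_count"] == 0:
                return i
        raise RuntimeError("all frames are pinned")
```

A short usage example: with two frames, modifying page 1 and unpinning it as dirty means that a later request for a third page writes page 1 back before its frame is reused.

```python
disk = {1: b"aa", 2: b"bb", 3: b"cc"}
pool = BufferPool(2, disk)
page = pool.pin(1)
page[0:1] = b"z"
pool.unpin(1, dirty=True)
pool.pin(2)
pool.pin(3)          # evicts page 1 (pin count 0) and flushes it
print(disk[1])
```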
12 Buffer Replacement
A frame is chosen for replacement by a replacement policy: Least-Recently-Used (LRU) or Most-Recently-Used (MRU) or…
An example shows that knowledge of the access pattern by the buffer manager can be important, e.g., with LRU:
[Figure: a BUFFER POOL of 6 frames, with read and write arrows to and from DISK pages …]
These 6 reads fill the buffer. With LRU, every new read requires a write, flushing a frame (assuming all 6 pages have been changed (dirtied)).
Extent (multi-block) pre-fetching (and extent writing) would alleviate this situation considerably.
13 Buffer Replacement Policy
The policy can have a big impact on the number of I/Os; it depends on the access pattern.
Sequential flooding is the bad situation caused by LRU + repeated sequential scans.
It can happen when # buffer frames < # pages in the sequential scan.
Each page request causes a flush, whereas MRU + repeated sequential scans would not.
Example: given a file with 7 blocks to be read sequentially and repeatedly (and a buffer pool of 6 frames):
Pages 1-6 are read in order. To read page 7, LRU flushes page 1.
The second scan begins, requiring page 1, but it was just flushed!
The 2nd scan now needs page 2, but it was just flushed! Etc.
Note that, after a while, every page to be read was just flushed.
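The 7-block example can be replayed with a small simulation. This is a sketch under the slide's setup (6 frames, 7 pages, repeated sequential scans); `count_misses` is a hypothetical helper that counts only reads from disk and ignores pinning and dirty pages.

```python
def count_misses(num_frames, num_pages, scans, policy):
    """Count disk reads for repeated sequential scans under the 'LRU' or 'MRU' policy."""
    frames = []          # buffered page ids, ordered from least to most recently used
    misses = 0
    for _ in range(scans):
        for page in range(1, num_pages + 1):
            if page in frames:
                frames.remove(page)           # hit: will be re-appended as most recent
            else:
                misses += 1
                if len(frames) == num_frames:
                    # LRU evicts the oldest entry, MRU the newest.
                    frames.pop(0 if policy == "LRU" else -1)
            frames.append(page)
    return misses


lru = count_misses(num_frames=6, num_pages=7, scans=3, policy="LRU")
mru = count_misses(num_frames=6, num_pages=7, scans=3, policy="MRU")
print(lru, mru)
```

With LRU, all 21 requests in three scans go to disk, since the page needed next is always the one that was just flushed. With MRU most of the file stays resident between scans, so far fewer reads are needed.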
14 DBMS vs. OS File System
The OS can do disk space and buffer management. Why not let the OS manage these tasks?
Differences in OS support: portability issues.
Some OS limitations, e.g., files can't span disks.
Buffer management in a DBMS requires the ability to:
pin a page in the buffer pool,
force a page to disk (important for implementing CC & recovery),
adjust the replacement policy, and
pre-fetch pages based on access patterns in typical DB operations.
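15 Record Formats: Fixed Length
[Figure: a record with fields F1, F2, F3, F4 of field lengths L1, L2, L3, L4, starting at base address B; e.g., the third field starts at address B+L1+L2.]
Information about field types is the same for all records in a file, so fields can be accessed via offsets stored in the system catalogs.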
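The address rule in the figure (Address = B+L1+L2 for the third field) generalizes to any field. The sketch below assumes 1-based field numbers, as on the slide; `field_address` and `read_field` are hypothetical names, and the field lengths are made-up example values standing in for what the system catalog would supply.

```python
def field_address(base, field_lengths, i):
    """Address of field i (1-based): base plus the lengths of all fields before it."""
    return base + sum(field_lengths[: i - 1])


def read_field(record, field_lengths, i):
    """Slice field i (1-based) out of a fixed-length record held as bytes."""
    start = field_address(0, field_lengths, i)
    return record[start : start + field_lengths[i - 1]]


lengths = [4, 20, 8, 2]            # L1..L4, assumed example values
print(field_address(1000, lengths, 3))   # B + L1 + L2 with B = 1000

record = b"0042" + b"Ann".ljust(20) + b"20240101" + b"ND"
print(read_field(record, lengths, 4))
```

Since the lengths are the same for every record in the file, they are stored once rather than in each record.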
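16 Record Formats: Variable Length
Two alternative formats (assuming the number of fields is fixed):
1. Fields delimited by special symbols: a field count (e.g., 4) followed by fields F1 through F4, separated by a delimiter such as $.
2. Array of field offsets: a directory of offsets at the start of the record points to fields F1 through F4.
The 2nd alternative offers direct access to the ith field and efficient storage of nulls (a special "don't know" value), with a small directory overhead.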
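A minimal sketch of the second alternative, the array of field offsets. The concrete encoding is an assumption made for the example: n+1 little-endian 2-byte offsets (the last one marks the end of the record), 0-based field numbers, and a null represented by two equal adjacent offsets, which means this sketch cannot tell a null from an empty string.

```python
import struct


def pack_record(fields):
    """Build a record: header of len(fields)+1 offsets, then the field bytes. None means null."""
    header_size = 2 * (len(fields) + 1)
    offsets, body, pos = [], b"", header_size
    for value in fields:
        offsets.append(pos)
        if value is not None:          # a null takes no space in the body
            body += value
            pos += len(value)
    offsets.append(pos)                # end-of-record offset
    return struct.pack(f"<{len(offsets)}H", *offsets) + body


def get_field(record, i):
    """Direct access to field i (0-based) using only its two surrounding offsets."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return None if start == end else record[start:end]


rec = pack_record([b"Ann", None, b"Fargo", b"ND"])
print(get_field(rec, 2), get_field(rec, 1))
```

Reading field i never scans the earlier fields, which is the direct access the delimiter-based format lacks.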
17 Page Formats: Fixed Length Records
[Figure, PACKED page format: record slots 1 through N stored contiguously, followed by free space; the page also stores N, the number of records.]
[Figure, UNPACKED, BITMAP page format: slots 1 through M, with a bitmap of M bits marking which slots hold records.]
Record ID (RID) = <page id, slot #>.
In PACKED, moving records for free space management changes the RID. That may not be acceptable (RIDs are to be permanent IDs).
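The UNPACKED, BITMAP format can be sketched as follows. The class name `BitmapPage`, the 0-based slot numbers, and the use of Python lists in place of a real byte layout are all assumptions for illustration.

```python
class BitmapPage:
    def __init__(self, page_id, num_slots):
        self.page_id = page_id
        self.slots = [None] * num_slots        # M fixed-size record slots
        self.occupied = [False] * num_slots    # bitmap: one bit per slot

    def insert(self, record):
        """Store the record in the first free slot and return its RID (page id, slot #)."""
        if False not in self.occupied:
            raise ValueError("page is full")
        slot = self.occupied.index(False)
        self.slots[slot] = record
        self.occupied[slot] = True
        return (self.page_id, slot)

    def delete(self, rid):
        """Clear the slot's bit; no other record moves, so all other RIDs stay valid."""
        self.occupied[rid[1]] = False
        self.slots[rid[1]] = None

    def read(self, rid):
        return self.slots[rid[1]] if self.occupied[rid[1]] else None


page = BitmapPage(page_id=7, num_slots=4)
rid_a, rid_b, rid_c = page.insert(b"a"), page.insert(b"b"), page.insert(b"c")
page.delete(rid_b)
print(page.read(rid_c), page.insert(b"d"))
```

Deleting a record leaves a hole that a later insert fills, so no RID ever changes, in contrast to the PACKED format.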
18 UNPACKED, RECORD POINTER Page Format (for Variable Length Records)
[Figure: Page i holds records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N); a SLOT DIRECTORY holds a pointer to each record, the number of record slots N, and a pointer to the start of free space.]
Records can be moved on the page without changing their RID; so, this format is attractive for fixed-length records too.
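Why records can move without their RIDs changing is easiest to see in code. The sketch below is a simplified model, not a real on-disk layout: `SlottedPage` is a hypothetical class, slot numbers are 0-based, the slot directory is kept as a Python list rather than inside the page bytes, and the space taken by the directory itself is ignored.

```python
class SlottedPage:
    def __init__(self, page_id, size=4096):
        self.page_id = page_id
        self.data = bytearray(size)
        self.free_start = 0        # pointer to the start of free space
        self.slots = []            # slot directory: (offset, length), or None if deleted

    def insert(self, record):
        """Append the record at the start of free space and return its RID (page id, slot #)."""
        if self.free_start + len(record) > len(self.data):
            raise ValueError("not enough free space")
        offset = self.free_start
        self.data[offset : offset + len(record)] = record
        self.free_start += len(record)
        self.slots.append((offset, len(record)))
        return (self.page_id, len(self.slots) - 1)

    def read(self, rid):
        offset, length = self.slots[rid[1]]
        return bytes(self.data[offset : offset + length])

    def delete(self, rid):
        self.slots[rid[1]] = None          # the slot number is kept, so later RIDs do not shift

    def compact(self):
        """Slide live records together; only directory offsets change, never slot numbers."""
        new_data, pos = bytearray(len(self.data)), 0
        for i, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            new_data[pos : pos + length] = self.data[offset : offset + length]
            self.slots[i] = (pos, length)
            pos += length
        self.data, self.free_start = new_data, pos


page = SlottedPage(page_id=3, size=64)
r1, r2, r3 = page.insert(b"alpha"), page.insert(b"be"), page.insert(b"gamma!")
page.delete(r2)
page.compact()
print(r3, page.read(r3))
```

After compaction the third record sits at a new offset, yet it is still found through the same RID, because the RID names a directory slot rather than a byte position.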