Presentation on theme: "5. Disk, Pages and Buffers Why Not Store Everything in Main Memory? Why use disks at all? Main memory costs too much. $1000 will buy you either – 1 GB."— Presentation transcript:
5. Disk, Pages and Buffers Why Not Store Everything in Main Memory? Why use disks at all? Main memory costs too much. $1000 will buy you either – 1 GB of RAM or 1 TGB of disk today (maybe more!) (~1000 as much disk as RAM per dollar). Main memory is volatile. –We want data to be saved between program runs (data persistence or residuality) Main memory is smaller. System disks typically hold many orders of magnitude more data than RAM (TBs vs GBs) The typical storage hierarchy is: – Main memory (RAM) for data in current use. – Disk for the main database (secondary storage). – Tapes for archiving older versions of the data (tertiary storage).
Disks Secondary storage device of choice. Main advantage over tapes: random access vs. sequential Data stored and retrieved in units called disk blocks or pages Unlike RAM, time to retrieve a disk page varies depending upon location on disk. Caveat: In NUMA (non-uniform memory access) machines, e.g., NDSU CHPCs SGI ALTIX, even in RAM, time to retrieve depends upon location (which brick or quad) Therefore, relative placement of pages on disk has a major impact on DBMS performance!
Components of a Disk Platters Arm assembly moves in/out to position a head on a desired track. Only one head reads/writes at any one time. Spindle Sector Block size i(smallest unit of transfer) is a multiple of sector size (which is fixed). Tracks (both sides of platter) Disk head Arm movement Arm assembly cylinder Collection of tracks under the heads at any one time is a cylinder.
Accessing a Disk Page Delay time to access (read/write) data on a disk block: – seek time ( moving arms to position disk head on track ) – rotational delay ( waiting for block to rotate under head ) – transfer time ( actually moving of electronic data to/from disk surface ) Seek time and rotational delay dominate. – Seek times can vary from about 1 to 20msec – Rotational delay can vary from 0 to 10msec – Transfer rate can be 1msec per 4KB page Key to lower I/O cost: reduce seek/rotation delays! Hardware vs. software solutions?
Arranging Pages on Disk (clustering) `Next block concept: – blocks on same track, followed by – blocks on same cylinder, followed by – blocks on adjacent cylinder Blocks in a file should be arranged sequentially on disk (by the above notions of `next), to minimize seek and rotational delay. For a sequential scan, pre-fetching several pages at a time is a big win!
RAID (redundant array of independent disks) RAID Disk Array: Arrangement of several disks that gives the abstraction of a single, large disk. RAID Goals: Increase performance (more concurrent read/write heads) and reliability (redundant data copies are kept) RAID's two main techniques: – Data striping: Data is partitioned and striped across disks; size of a partition is called the striping unit. Partitions are distributed over several disks allowing more read/write heads to operate in parallel. – Redundancy: Redundant info allows reconstruction of data if 1 disk fails.
RAID Levels Level 0: Block striping but no redundancy (e.g., Blocks 1,2,3,4 on Disks 1,2,3,4 resp.) Faster reads (more r/w heads working in parallel) Disk1 Disk2 Disk3 Disk4 1 2 3 4 Disk1 Disk2 Disk3 Disk4 4 3 4 3 2 2 1 1 Disk1 Disk2 Disk3 Disk4 3 4 3 4 1 2 1 2 Level 1: Mirroring (2 identical copies) – Each disk has a mirror image disk (check disk) – Parallel reads, but a write involves 2 disks. – Improved durability Level 2: simple bit-striping, Not used these days. Level 0+1: (sometimes called level 10) Block Striping and Mirroring – Faster reads plus improved durability
RAID Levels 3,4,5 Level 3: Bit-Interleaved Parity – Striping Unit = 1 bit bits 1,2,3,4 e.g., on disks 1,2,3,4, resp. Disk0 Disk1 Disk2 Disk3 Disk4 ParityDisk Disk1 Disk2 Disk3 Disk4 1 2 3 4 Level 4: Block-Interleaved Parity – Striping Unit= 1 block. blocks 1,2,3,4 – 1 check disk – Parallel reads possible for small requests – large requests can utilize full bandwidth – Writes involve modifying block and check disk ParityDisk Disk1 Disk2 Disk3 Disk4 1 2 3 4 Level 5: Block-Interleaved Distributed Parity – Similar to Level 4, but parity blocks distributed over all disks (striping unit = block) – eliminates Parity Disk hot-spot 1 check disk. Each read/write request involves all disks; disk array can process 1 request at a time (but very rapidly)
Buffer Management in a DBMS Data must be in RAM (buffer) for DBMS to operate on it! LookupTable of pairs is maintained. Page Requests from Higher Levels Occupied frame free frame BUFFER POOL (page frames) DB MAIN MEMORY DISK Disk_mgr transfers pages between page-frame > disk choice of frame for a page is dictated by replacement policy
When a Page is Requested If requested page (from a higher level) is not in buffer pool: – Choose a frame for replacement – If frame is dirty (has been changed while in RAM), write it to disk first – Read requested page into that frame (and update the LookupTable) Pin the page (designate it as temporarily non-replaceable) and return its address to requesting higher level layer process. * If requests can be predicted (e.g., in sequential scans), pages can be pre-fetched several at a time!
More on Buffer Management The requestor of a page, when it is done with that page, must unpin it (actually decrement its pin count) and set dirty bit if page has been modified. Because a page in the buffer pool may be requested concurrently by many higher layer processes, – pin count is used (LookupTable has ) – A page is a candidate for replacement iff pin count = 0. A note: CC & recovery subsystem may force additional I/O when a frame is chosen for replacement. (e.g., to implement a Write-Ahead Log protocol; more later on that.)
Buffer Replacement Frame is chosen for replacement by a replacement policy: – Least-recently-used (LRU) or Most-Recently-Used (MRU) or… – An example is given below, showing that knowledge of access pattern by the buffer manager, can be important – e.g., with LRU: DISK pages 1 2 3 4 5 6 7 8 9 10 11 12 13 14… BUFFER POOL 1 2 3 4 5 6 7 8 9 10 11 12 13 14 write read These 6 reads fill the buffer. With LRU, every new read requires a write, flushing a frame (assuming all 6 pages have been change (dirtied)) – Extent (multi-block) pre-fetching (and extent writing) would alleviate this situation considerably.
Buffer Replacement Policy Policy can have big impact on # of I/Os; depends on access pattern Sequential flooding is the bad situation caused by LRU + repeated sequential scans – Can happen when # buffer frames < # pages in sequentially scan. – Each page request causes a flush, whereas, – MRU + repeated sequential scans would not. – Given a file with 7 blocks to be read sequentially and repeatedly. – Note that, after a while, every page to be read, was just flushed. 1 2 3 4 5 6 7 BUFFER POOL 1 2 3 4 5 6 7 1 2 Pgs 1-6 read in order. To read page-7, LRU flushes page 1 Second scan begins, requiring page-1, but it was just flushed! 2nd scan now needs page-2, but it was just flushed! Etc.
DBMS vs. OS File System OS can do disk space and buffer management. Why not let OS manage these tasks? Differences in OS support: portability issues Some OS limitations, e.g., files cant span disks. Buffer management in DBMS requires ability to: – pin a page in buffer pool, force a page to disk (important for implementing CC & recovery), – adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
Record Formats: Fixed Length Info about field types same for all records in a file –can access via offsets stored in system catalogs Base address (B) L1L1 L2L2 L3L3 L4L4 F1F1 F2F2 F3F3 F4F4 Address = B+L 1 +L 2 fields Field lengths
Record Formats: Variable Length Two alternative formats (assuming # fields is fixed): The 2 nd alternative offers direct access to i th field, efficient storage of nulls (special dont know value); small directory overhead. 4$$$$ Field Count Fields Delimited by Special Symbols F1 F2 F3 F4 Array of Field Offsets
Page Formats: Fixed Length Records *Record ID (RID) =. *In PACKED, moving records for free space mgmt changes RID. That may not be acceptable (RIDs are to be permanent IDs). RecSlot 1 RecSlot 2 RecSlot N... N PACKED pg format Free Space number of records... M1 0 M... 3 2 1 UNPACKED, BITMAP Slot 1 Slot 2 Slot N Slot M 11 number of records
UNPACKED, RECORD POINTER Page Format ( for Variable Length Records) *Can move records on page without changing RID; so, attractive for fixed-length records too. Page i SLOT DIRECTORY N... 2 1 Rid = (i,N) * Rid = (i,2) * Rid = (i,1) * Pointer to start of free space N # of record slots …