Data Storage and Disk Access

1 Data Storage and Disk Access
CMPT 454

2 Data Storage and Disk Access
Memory hierarchy Hard disks Architecture Processing requests Writing to disk Hard disk reliability and efficiency RAID Solid State Drives Buffer management Data storage

3 Memory

4 DBMS and Memory
[Diagram: DBMS, Cache, Main Memory, Virtual Memory, File System, Disk, Tertiary Storage]

5 Memory Hierarchy
Primary memory: volatile
  Main memory
  Cache
Secondary memory: non-volatile
  Solid State Drive (SSD)
  Magnetic Disk (Hard Disk Drive, HDD)
Tertiary memory: non-volatile
  CD/DVD
  Tape - sequential access
  Usually used as backup or for long-term storage
[Diagram axes: cost, speed]

6 Main Memory vs Secondary Storage
Speed
  Main memory is much faster than secondary memory: nanoseconds compared to milliseconds
  10 – 100 nanoseconds to move data in main memory
  10 milliseconds to read a block from an HDD
  0.1 milliseconds to read a block from an SSD
Cost
  Main memory is around 100 times more expensive than secondary memory
  SSDs are more expensive than HDDs

7 Main Memory vs Secondary Storage
System limitations On a 32 bit system only 2^32 bytes (4 GiB) can be directly referenced Many databases are larger than that Volatility Data must be maintained between program executions which requires non-volatile memory Nonvolatile storage retains its contents when the device is turned off, or if there is a power failure Main memory is volatile, secondary storage is not

8 Hard Disk Drives

9 Magnetic Disks Database data is usually stored on disks
A database will often be too large to be retained in main memory When a query is processed data will need to be retrieved from storage Data is stored on disk blocks Also referred to as blocks, or in relation to the OS, pages A contiguous sequence of bytes and The unit in which data is written to, and read from Block size is typically between 4 and 16 kilobytes

10 Magnetic Disk Structure
A hard disk consists of a number of platters A platter can store data on one or both of its surfaces, and is referred to as single-sided or double-sided accordingly Surfaces are composed of concentric rings called tracks The set of all tracks with the same diameter is called a cylinder Sectors are arcs of a track And are typically 4 kilobytes in size Block size is set when the disk is initialized, usually a small multiple of the sector size (hence 4 to 16 kilobytes)

11 Diagram of a Disk
[Diagram labels: surfaces, platter, track, cylinder; values marked 3* and 2*]
* statistics for Western Digital Caviar Black 1 TB hard drive

12 Disk Heads Data is transferred to or from a surface by a disk head
There is one disk head for each surface These disk heads are moved as a unit (called a disk head array) Therefore all the heads are in identical positions with respect to their surfaces To read or write a block a disk head must be positioned over it Only one disk head can read or write at a time

13 Disk Anatomy
[Diagram: platters and a track; the disk spins at around 7,200 rpm; the disk head array moves in and out]

14 Disk Controller Disk drives are controlled by a processor called a disk controller which Controls the actuator that moves the head assembly Selects sectors and determines when the disk has rotated to a sector Transfers data between the disk and main memory Some controllers buffer data from tracks in the expectation that the data will be required

15 Accessing Data in a Disk
The disk constantly spins 7,200 rpm* The head pivots over the desired track The desired block is read as it passes underneath the head * Western Digital Caviar Black 1 TB hard drive (again)

16 Accessing A Block The disk head is moved in or out to the track
This seek time is typically around 10 milliseconds WD Caviar Black 1TB: 8.9 ms Wait until the block rotates under the disk head This rotational delay is typically around 4 milliseconds WD Caviar Black 1TB: 4.2 ms The data on the block is transferred to memory This transfer time is the time it takes for the block to completely rotate past the disk head Typically less than 1 millisecond

17 Transfer Time The seek time and rotational delay depend on
Where the disk head is before the request, Which track is being requested, and How far the disk has to rotate The transfer time depends on the request size The transfer time (in ms) for one block equals (60,000 / disk rpm) / blocks per track The transfer time (in ms) for an entire track equals (60,000 / disk rpm)
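
The two transfer time formulas above can be written as a short calculation. This is a minimal sketch, assuming the 7,200 rpm drive used in the earlier slides; the function names and the figure of 100 blocks per track are illustrative assumptions, not values from the slides. The average rotational delay is taken to be half a rotation.

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one full rotation (one entire track), in milliseconds."""
    return 60_000 / rpm


def transfer_time_ms(rpm: float, blocks_per_track: int, blocks: int = 1) -> float:
    """Time for `blocks` consecutive blocks on one track to pass under the head."""
    return rotation_time_ms(rpm) / blocks_per_track * blocks


def average_rotational_delay_ms(rpm: float) -> float:
    """On average the disk has to turn half a rotation to reach a block."""
    return rotation_time_ms(rpm) / 2


RPM = 7200                 # from the slides
BLOCKS_PER_TRACK = 100     # assumed for illustration only

print(f"entire track:       {rotation_time_ms(RPM):.2f} ms")
print(f"one block:          {transfer_time_ms(RPM, BLOCKS_PER_TRACK):.3f} ms")
print(f"average rot. delay: {average_rotational_delay_ms(RPM):.2f} ms")
```

At 7,200 rpm the half-rotation estimate comes out close to the 4.2 ms rotational delay quoted for the example drive.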

18 Main Memory versus Disk
Typical access time for a block on a hard disk 15 milliseconds Typical access time for a main memory frame 60 nanoseconds What’s the difference? 1 millisecond = 1,000,000 nanoseconds 60 ns = 0.000060 ms Accessing a hard drive is around 250,000 times slower than accessing main memory

19 Reducing Disk Access Time
Disk latency (access time) has three components
  seek time + rotational delay + transfer time
The overall access time can be shortened by reducing, or even eliminating, seek time and rotational delay
Related data should be stored in close proximity
Accessing two records in adjacent blocks on a track
  Seek the desired track, rotate to the first block, and transfer two blocks = 10 + 4 + 2*1 = 16 ms
Accessing two records on different tracks
  Seek the desired track, rotate to the block, and transfer the block, then repeat = (10 + 4 + 1)*2 = 30 ms
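
The comparison above can be checked with a few lines of code. This is a sketch only, assuming the typical values from the earlier slides (10 ms seek, 4 ms rotational delay, 1 ms transfer per block); the function name and the way a request is described are hypothetical.

```python
SEEK_MS = 10       # typical seek time
ROTATE_MS = 4      # typical rotational delay
TRANSFER_MS = 1    # typical transfer time for one block


def access_time_ms(blocks_per_positioning):
    """Total access time for a list of head positionings.

    Each entry is the number of adjacent blocks transferred after one
    seek and one rotational delay.
    """
    return sum(SEEK_MS + ROTATE_MS + TRANSFER_MS * blocks
               for blocks in blocks_per_positioning)


adjacent_blocks = access_time_ms([2])       # one positioning, two blocks
different_tracks = access_time_ms([1, 1])   # two positionings, one block each
print(adjacent_blocks, different_tracks)    # 16 30
```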

20 Order of Closeness What does it mean to say that related data should be stored close to each other? The term close refers not to physical proximity but to how the access time is affected In order of closeness: Same block Adjacent blocks on the same track Same track Same cylinder, but different surfaces Adjacent cylinders

21 Which is Closer
Is 2, or 3 "closer" to 1?
2 is in the adjacent track
  And is clearly physically closer, but
  The disk head must be moved to access it
3 is in the same cylinder
  The disk head does not have to be moved
  Which is why 3 is closer
[Diagram: positions 1, 2 and 3 marked on the disk]

22 Fulfilling Disk Requests
A fair algorithm would take a first-come, first-served approach
Insert requests in a queue and process them in the order in which they are received
Requests, as cylinder (order received): 2,000 (1), 4,000 (4), 6,000 (2), 10,000 (6), 14,000 (3), 16,000 (5)

Cylinder  Received  Complete   Moved   Total
   2,000         0         5   2,000   2,000
   6,000         0        14   4,000   6,000
  14,000         0        27   8,000  14,000
   4,000        10        43  10,000  24,000
  16,000        20        60  12,000  36,000
  10,000        30        72   6,000  42,000

23 Elevator Algorithm
The elevator algorithm usually performs better than FIFO
Requests are buffered and the disk head moves in one direction, processing requests
The arm then reverses direction
Requests, as cylinder (order received): 2,000 (1), 4,000 (4), 6,000 (2), 10,000 (6), 14,000 (3), 16,000 (5)

Cylinder  Received  Complete   Moved   Total
   2,000         0         5   2,000   2,000
   6,000         0        14   4,000   6,000
  14,000         0        27   8,000  14,000
  16,000        20        35   2,000  16,000
  10,000        30        46   6,000  22,000
   4,000        10        58   6,000  28,000
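
The head movement totals in the FIFO and elevator examples (42,000 and 28,000 cylinders) can be reproduced with a short sketch. The two service orders are read from the examples: FIFO serves 2,000, 6,000, 14,000, 4,000, 16,000, 10,000 and the elevator serves 2,000, 6,000, 14,000, 16,000, 10,000, 4,000. The starting position at cylinder 0 is an assumption chosen because it matches those totals, and `elevator_sweep` is a simplified, hypothetical helper that orders a static set of pending requests without modelling arrival times.

```python
def head_movement(start, service_order):
    """Total number of cylinders the head crosses serving requests in order."""
    total, position = 0, start
    for cylinder in service_order:
        total += abs(cylinder - position)
        position = cylinder
    return total


def elevator_sweep(position, pending):
    """Order pending requests for one upward pass, then one downward pass."""
    upward = sorted(c for c in pending if c >= position)
    downward = sorted((c for c in pending if c < position), reverse=True)
    return upward + downward


fifo_order = [2000, 6000, 14000, 4000, 16000, 10000]
elevator_order = [2000, 6000, 14000, 16000, 10000, 4000]

print(head_movement(0, fifo_order))       # 42000
print(head_movement(0, elevator_order))   # 28000

# With the head at cylinder 14,000 and the three later requests pending,
# a sweep continues up to 16,000 before reversing.
print(elevator_sweep(14000, [4000, 16000, 10000]))  # [16000, 10000, 4000]
```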

24 Requests – Discussion The elevator algorithm gives much better performance than FIFO on average And is a relatively fair algorithm The elevator algorithm is not optimal The shortest-seek first algorithm is closer to optimal but can result in a high variance in response time And may even result in starvation for distant requests In some cases the elevator algorithm can perform worse than FIFO

25 Modifying a Record To modify an existing record (on a disk) the following steps must be taken Read the record Modify the record in main memory Write the modified record back to disk It is important to remember that the smallest unit of transfer to / from a disk is a block A single disk block usually contains many records

26 Read – Modify – Write Cycle
Read one block into main memory …
[Diagram: the disk block (other records … Landis#winner#Phonak#...) and its copy in main memory]

27 Read – Modify – Write Cycle
Read one block into main memory …
… modify the desired record …
[Diagram: the main memory copy now holds other records … Landis#disq.#none#...; the disk block still holds other records … Landis#winner#Phonak#...]

28 Read – Modify – Write Cycle
Read one block into main memory …
… modify the desired record …
… and write it back.
[Diagram: the main memory copy holds other records … Landis#disq.#none#...; the disk block changes from other records … Landis#winner#Phonak#... to other records … Landis#disq.#none#...]
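
The read – modify – write cycle can be sketched with a simulated disk. Everything here is an illustrative assumption: the disk is a Python dictionary of blocks, the block size is 4 KB, and records are given a hypothetical fixed size of 64 bytes. The point is only that a single record cannot be written on its own; the whole block is read, changed in memory, and written back.

```python
BLOCK_SIZE = 4096    # bytes per block (assumed)
RECORD_SIZE = 64     # hypothetical fixed record size

# Simulated disk: block number -> block contents.
disk = {0: bytearray(BLOCK_SIZE)}


def read_block(block_no):
    """Copy a whole block from 'disk' into a main memory buffer."""
    return bytearray(disk[block_no])


def write_block(block_no, buffer):
    """Write a whole block back; partial writes are not possible."""
    if len(buffer) != BLOCK_SIZE:
        raise ValueError("a block must be written in full")
    disk[block_no] = bytearray(buffer)


def update_record(block_no, slot, record):
    """Read the block, modify one record in memory, write the block back."""
    if len(record) > RECORD_SIZE:
        raise ValueError("record too large for its slot")
    buffer = read_block(block_no)                                # read
    start = slot * RECORD_SIZE
    buffer[start:start + RECORD_SIZE] = record.ljust(RECORD_SIZE, b"\x00")  # modify
    write_block(block_no, buffer)                                # write


update_record(0, 0, b"Landis#winner#Phonak")
update_record(0, 1, b"other record")
update_record(0, 0, b"Landis#disq.#none")
print(bytes(disk[0][:17]))
```

Because the whole block is rewritten, the record in slot 1 survives the update to slot 0 only because it was read into the buffer first.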

29 Inserting Records Consider creating a new record
The user enters the data for the record Through some application interface The record is created in main memory And then written to disk Does this process require a read-modify-write process? YES! Because, otherwise, the existing contents of the disk block will be overwritten

30 Disk Failures
Intermittent failure
  Multiple attempts are required to read or write a sector
Media decay
  A bit or a number of bits are permanently corrupted and it is impossible to read a sector
Write failure
  A sector cannot be written to or retrieved
  Often caused by a power failure during a write
Disk crash
  The entire disk becomes unreadable

31 Checksums An intermittent failure may result in incorrect data being read by the disk controller Such incorrect data can be detected by a checksum Each sector contains additional bits whose values are based on the data bits in the sector A simple single-bit checksum is to maintain an even parity on the sector If there is an odd number of 1s the parity is odd If there is an even number of 1s the parity is even

32 Parity Bits Assume that there are seven data bits and a single checksum bit Data bits – parity is odd Checksum bit is set to 1 so that the overall parity is even Using a single checksum bit allows errors of only one bit to be detected reliably Several checksum bits can be maintained to reduce the chance of failing to notice an error e.g. maintain 8 checksum bits, one for each bit position in the data bytes
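
Even parity and the eight-bit variant can be illustrated in a few lines. This is a sketch with made-up data: the seven data bits below are a hypothetical example (the bit values on the original slide are not reproduced here), and the function names are assumptions.

```python
from functools import reduce


def even_parity_bit(bits):
    """Checksum bit that makes the total number of 1s even."""
    return sum(bits) % 2


def passes_parity_check(bits_with_checksum):
    """True if the overall parity (data bits plus checksum bit) is even."""
    return sum(bits_with_checksum) % 2 == 0


def column_parity(data: bytes) -> int:
    """Eight checksum bits, one per bit position across all data bytes."""
    return reduce(lambda a, b: a ^ b, data, 0)


data_bits = [1, 0, 1, 1, 0, 1, 1]             # five 1s, so parity is odd
stored = data_bits + [even_parity_bit(data_bits)]
print(stored[-1], passes_parity_check(stored))  # 1 True

one_error = stored.copy()
one_error[0] ^= 1                              # a single flipped bit is detected
two_errors = one_error.copy()
two_errors[1] ^= 1                             # a second flipped bit hides the error
print(passes_parity_check(one_error), passes_parity_check(two_errors))  # False True
```

With `column_parity`, two flipped bits go unnoticed only if they fall in the same bit position, which is why several checksum bits reduce the chance of missing an error.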

33 Stable Storage Checksums can detect errors but can't correct them
Stable storage can be implemented on a disk to allow errors to be corrected Sectors are paired, with each pair representing a single sector Pairs are usually referred to as Left and Right Errors in a sector (L or R) are detected using checksums Stable storage can cope with media failures and write failures

34 Stable Storage Policy For writing, write the value of some sector X into XL Check that the value is correct (using checksums) If the value is not correct after a given number of attempts then assume that the sector has failed A spare sector should be substituted for XL Repeat the process for XR For reading, XL and XR are read from in turn until a correct value is returned
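
The writing and reading policy above can be sketched as follows. This is a simulation under stated assumptions, not a description of real controller firmware: `Sector`, `StableSector`, the toy additive checksum, and the way a faulty sector corrupts its data are all hypothetical.

```python
def checksum(data: bytes) -> int:
    """Toy additive checksum; a real disk would use a stronger code."""
    return sum(data) % 65536


class Sector:
    """Simulated sector that stores data together with a checksum."""

    def __init__(self, faulty: bool = False):
        self.faulty = faulty
        self.data = b""
        self.stored_checksum = checksum(b"")

    def write(self, value: bytes) -> None:
        # A faulty sector stores one extra byte, so its checksum never matches.
        self.data = value + b"\x01" if self.faulty else value
        self.stored_checksum = checksum(value)

    def is_valid(self) -> bool:
        return checksum(self.data) == self.stored_checksum


class StableSector:
    """A logical sector X represented by the pair XL and XR."""

    def __init__(self, left, right, spares, attempts=3):
        self.left, self.right = left, right
        self.spares = list(spares)
        self.attempts = attempts

    def _write_copy(self, sector, value):
        for _ in range(self.attempts):
            sector.write(value)
            if sector.is_valid():
                return sector
        # Assume the sector has failed and substitute a spare for it.
        return self._write_copy(self.spares.pop(), value)

    def write(self, value: bytes) -> None:
        self.left = self._write_copy(self.left, value)    # XL first
        self.right = self._write_copy(self.right, value)  # then XR

    def read(self) -> bytes:
        for copy in (self.left, self.right):
            if copy.is_valid():
                return copy.data
        raise IOError("both copies are unreadable")


x = StableSector(Sector(faulty=True), Sector(), spares=[Sector()])
x.write(b"v1")
print(x.read())
```

Writing XL and XR one after the other, rather than together, means that a failure during a write can damage at most one of the two copies.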


36 Problems with Hard Disks
Hard disks act as bottlenecks for processing DB data is stored on disks, and must be fetched into main memory to be processed, and Disk access is considerably slower than main memory processing There are also reliability issues with disks Disks contain mechanical components that are more prone to failure than electronic components One solution is to use multiple disks

37 Multiple Disks
Multiple disks
  Each disk contains multiple platters
  Disks can be read in parallel, and
  Different disks can read from different cylinders
  e.g. the first disk can access data from cylinder 6,000, while the second disk is accessing data from cylinder 11,000
Single disk
  Multiple platters
  Disk heads are always over the same cylinder

38 Improving Efficiency Using multiple disks to store data improves efficiency as the disks can be read in parallel To satisfy a request the physical disks and disk blocks that the data resides on must be identified The data may be on a single disk, or it may be split over multiple disks The way in which data is distributed over the disks affects the cost of accessing it In the same way that related data should be stored close to each other on a single disk

39 Data Striping A disk array gives the user the abstraction of a single, large, disk When an I/O request is issued the physical disk blocks to be retrieved have to be identified How the data is distributed over the disks in the array affects how many disks are involved in an I/O request Data is divided into partitions called striping units The striping unit is usually either a block or a bit Striping units are distributed over the disks using a round robin algorithm

40 Striping
Notional File – the data is divided into striping units of a given size
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
The striping units are distributed across a RAID system in a round robin fashion
disk 1: 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65
disk 2: 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66
disk 3: 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67
disk 4: 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68
The size of the striping unit has an impact on the behaviour of the system

41 Striping Units – Block Striping
Assume that a file is to be distributed across a four disk RAID system, using block striping, and that, purely for the sake of illustration, the block size is just one byte!
Notional File – the numbers represent a sequence of individual bits in the file
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Distribute these bits across a 4 disk RAID system using BLOCK striping:

        Block 1                  Block 2                  Block 3
Disk 1  1 2 3 4 5 6 7 8          33 34 35 36 37 38 39 40  65 66 67 68 69 70 71 72
Disk 2  9 10 11 12 13 14 15 16   41 42 43 44 45 46 47 48  73 74 75 76 77 78 79 80
Disk 3  17 18 19 20 21 22 23 24  49 50 51 52 53 54 55 56  81 82 83 84 85 86 87 88
Disk 4  25 26 27 28 29 30 31 32  57 58 59 60 61 62 63 64  89 90 91 92 93 94 95 96

42 Striping Units – Bit Striping
Here is the same file to be distributed across a four disk RAID system, this time using bit striping, and again remember that, purely for the sake of illustration, the block size is just one byte!
Notional File – the numbers represent a sequence of individual bits in the file
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Distribute these bits across a 4 disk RAID system using BIT striping:

        Block 1                  Block 2                  Block 3
Disk 1  1 5 9 13 17 21 25 29     33 37 41 45 49 53 57 61  65 69 73 77 81 85 89 93
Disk 2  2 6 10 14 18 22 26 30    34 38 42 46 50 54 58 62  66 70 74 78 82 86 90 94
Disk 3  3 7 11 15 19 23 27 31    35 39 43 47 51 55 59 63  67 71 75 79 83 87 91 95
Disk 4  4 8 12 16 20 24 28 32    36 40 44 48 52 56 60 64  68 72 76 80 84 88 92 96
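
The two layouts can be expressed as a mapping from a bit of the file to the disk that stores it. This is a minimal sketch of round robin placement using the slides' illustrative numbers (four disks, one-byte blocks, bits numbered from 1); the function names are assumptions.

```python
def disk_for_unit(unit: int, disks: int) -> int:
    """Round robin: striping unit number (from 1) -> disk number (from 1)."""
    return (unit - 1) % disks + 1


def disk_for_bit(bit: int, disks: int, bits_per_block: int, striping: str) -> int:
    """Disk that stores a given bit of the file under bit or block striping."""
    if striping == "bit":
        return disk_for_unit(bit, disks)
    block = (bit - 1) // bits_per_block + 1      # which block of the file
    return disk_for_unit(block, disks)


def disks_touched(first_bit, last_bit, striping, disks=4, bits_per_block=8):
    """Set of disks needed to read a range of bits from the file."""
    return {disk_for_bit(b, disks, bits_per_block, striping)
            for b in range(first_bit, last_bit + 1)}


# Reading the first block of the file (bits 1 to 8):
print(disks_touched(1, 8, "block"))   # {1}
print(disks_touched(1, 8, "bit"))     # {1, 2, 3, 4}
```

The last two lines show why block striping suits random reads: one block comes from one disk, whereas under bit striping every disk takes part in every read.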

43 Disk Array Performance
Assume that a disk array consists of D disks Data is distributed across the disks using data striping How does it perform compared to a single disk? To answer this question we must specify the kinds of requests that will be made Random read – reading multiple, unrelated records Random write Sequential read – reading a number of records (such as one file or table), stored on more than D blocks Sequential write

44 The Basic Idea … Use all D disks to improve efficiency, and distribute data using block striping Random read performance Very good – up to D different records can be read at once Depending on which disks the records reside on Random write performance – same as read performance Sequential read performance Very good – as related data are distributed over all D disks performance is D times faster than a single disk Sequential write performance – same as read performance But what about reliability …

45 Reliability Hard disks contain mechanical components and are less reliable than other, purely electronic, components Increasing the number of hard disks decreases reliability, reducing the mean-time-to-failure (MTTF) The MTTF of a hard disk is approximately 50,000 hours, or 5.7 years In a disk array the overall MTTF decreases Because the number of disks is greater MTTF of a 100 disk array is 21 days – (50,000/100) / 24 This assumes that failures occur independently and The failure probability does not change over time Reliability is improved by storing redundant data
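
The MTTF arithmetic above is short enough to check directly. A sketch under the slide's own assumptions (independent failures, constant failure probability); the function name is hypothetical.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365


def array_mttf_hours(disk_mttf_hours: float, disks: int) -> float:
    """MTTF of an array without redundancy: any one disk failing loses data."""
    return disk_mttf_hours / disks


single_disk_years = 50_000 / HOURS_PER_YEAR
array_days = array_mttf_hours(50_000, 100) / HOURS_PER_DAY

print(round(single_disk_years, 1))   # 5.7
print(round(array_days))             # 21
```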

46 Redundancy Reliability of a disk array can be improved by storing redundant data If a disk fails the redundant data can be used to reconstruct the data lost on the failed disk The data can either be stored on a separate check disk or Distributed uniformly over all the disks Redundant data is typically stored using one of two methods Mirroring, where each disk is duplicated A parity scheme, where sufficient redundant data is maintained to recreate the data in any one disk Other redundancy schemes provide greater reliability

47 Parity Scheme For each bit on the data disks there is a parity bit on a check disk If the sum of the data disks bits is even the parity bit is set to zero If the sum of the bits is odd the parity bit is set to one The data on any one failed disk can be recreated bit by bit
[Diagram: a 4 data disk system showing individual bit values, and a 5th check disk containing parity data]
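
Setting the parity bit to the sum of the data bits modulo 2 is the same as taking their exclusive or, which also gives the recovery rule. The sketch below uses made-up bit patterns for a system of four data disks and one check disk; each integer holds four bit positions of one disk, and the function names are assumptions.

```python
from functools import reduce
from operator import xor


def parity(values):
    """Bitwise parity of several disks: 1 where the count of 1s is odd."""
    return reduce(xor, values, 0)


def reconstruct(surviving_data, check):
    """Recreate the one failed data disk from the others and the check disk."""
    return parity(surviving_data) ^ check


data_disks = [0b1011, 0b0110, 0b1100, 0b0101]   # hypothetical contents
check_disk = parity(data_disks)
print(f"{check_disk:04b}")                        # 0100

# Suppose the second data disk fails:
survivors = data_disks[:1] + data_disks[2:]
print(f"{reconstruct(survivors, check_disk):04b}")  # 0110
```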

48 Parity Scheme Read and Write
Reading The parity scheme does not affect reading Writing A naïve approach would be to calculate the new value of the parity bit from all the data disks A better approach is to compare the old and new values of the disk that is written to And change the value of a parity bit if the corresponding bits have changed

49 Introducing RAID A RAID system consists of several disks organized to increase performance and improve reliability Performance is improved through data striping Reliability is improved through redundancy RAID stands for Redundant Arrays of Independent Disks There are several RAID schemes or levels The levels differ in terms of their Read and write performance, Reliability, and Cost

50 RAID Level 0 All D disks are used to improve efficiency, and data is distributed using block striping No redundant information is kept Read and write performance is very good But, reliability is poor Unless data is regularly backed up a RAID 0 system should only be used when the data is not important A RAID 0 system is the cheapest of all RAID levels As there are no disks used for storing redundant data

51 Level 1: Mirrored An identical copy is kept of each disk in the system, hence the term mirroring Read performance is similar to a single disk No data striping, but parallel reads of the duplicate disks can be made which improves random read performance Write performance is worse than a single disk as the duplicate disk has to be written to Writes to the original and mirror should not be performed simultaneously in case there is a global system failure But write performance is superior to most other RAID levels Very reliable but costly With D data disks, a level 1 RAID system has 2D disks

52 Level 1+0: Striping + Mirroring
Sometimes referred to as RAID level 10, combines both striping and mirroring Very good read performance Similar to RAID level 0 2D times the speed of a single disk for sequential reads Up to 2D times the speed of a single disk for random reads Allows parallel reads of blocks that, conceptually, reside on the same disk Poor write performance Similar to RAID level 1 Very reliable but the most expensive RAID level

53 Writing and Redundant Data
Writing data is the Achilles heel of RAID systems Data and check disks should not be written to simultaneously Parity information may have to be read before check disks can be written to In many RAID systems writing is less efficient than with a single disk!

54 Writing Parity Data Sequential writes, or random writes in a RAID system using bit striping: Write to all D data disks, using a read-modify-write cycle Calculate the parity information from the written data Write to the check disk(s) A read-modify-write cycle is not required Random writes in a system using block striping: Write to the data disk using a read-modify-write cycle Read the check disk(s), and calculate the new parity data

55 Striping Units Performance
A RAID system with D disks can read data up to D times faster than a single disk system For sequential reads there is no performance difference between bit striping and block striping Block striping is more efficient for random reads With bit striping all D disks have to be read to recreate a single record (and block) of the data file With block striping, a complete record is stored on one disk, so only one disk is required to satisfy a single random read Write performance is similar except that it is also affected by the parity scheme

56 Levels 2 and 3 Level 2 does not use the standard parity scheme
Uses a scheme that allows the failed disk to be identified increasing the number of disks required However the failed disk can be detected by the disk controller so this is unnecessary Can tolerate the loss of a single disk Level 3 is Bit Interleaved Parity The striping unit is a single bit Random read and write performance is poor as all disks have to be accessed for each request

57 RAID Level 4 Uses block striping to distribute data over disks
Uses one redundant disk containing parity data The ith block on the redundant disk contains parity checks for the ith blocks of all data disks Good sequential read performance D times single disk speed Very good random read performance Disks can be read independently, up to D times single disk speed

58 RAID Level 4: Writing When data is written the affected block and the redundant disk must both be written to To calculate the new value of the redundant disk Read the old value of the changed block Read the corresponding redundant disk block Write the new data block Recalculate the block of the redundant disk To recalculate the redundant data consider the changes in the bit pattern of the written data block
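
The four steps above amount to two reads and two writes, with the new parity computed from the change in the data block. The sketch below simulates this with lists of byte strings; the disk contents, the function names and the two-byte block size are illustrative assumptions.

```python
def xor_bytes(*blocks: bytes) -> bytes:
    """Bytewise exclusive or of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def small_write(data_disks, parity_disk, disk, block, new_data):
    """Update one data block and the matching block on the redundant disk."""
    old_data = data_disks[disk][block]     # 1. read the old data block
    old_parity = parity_disk[block]        # 2. read the redundant disk block
    data_disks[disk][block] = new_data     # 3. write the new data block
    # 4. flip exactly the parity bits whose data bits changed
    parity_disk[block] = xor_bytes(old_parity, old_data, new_data)


# Three data disks with two blocks each (hypothetical contents).
data_disks = [
    [b"\x0f\x01", b"\xaa\x00"],
    [b"\xf0\x02", b"\x55\x00"],
    [b"\x33\x04", b"\xff\x10"],
]
parity_disk = [xor_bytes(*(d[i] for d in data_disks)) for i in range(2)]

small_write(data_disks, parity_disk, disk=1, block=0, new_data=b"\x12\x34")
print(parity_disk[0].hex())   # 2e31
```

Only the written disk and the redundant disk are touched, so the other data disks remain free to serve reads; the redundant disk, however, takes part in every write.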

59 RAID Level 4: Performance
Cost is moderate Only one check disk is required The system can tolerate the loss of one drive Write performance is poor for random writes Where different data disks are written independently For each such write a write to the redundant disk is also required Performance can be improved by distributing the redundant data across all disks – RAID level 5

60 Level 5: Block-Interleaved Distributed Parity
The dedicated check disk in RAID level 4 tends to act as a bottleneck for random writes RAID level 5 does not have a dedicated check disk but distributes the parity data across all disks This removes the bottleneck thus increasing the performance of random writes Sequential write performance is similar to level 4 Cost is moderate, with the same effective space utilization as level 4 The system can tolerate the loss of one drive

61 Multiple Disk Crashes RAID levels 4 and 5 can only cope with single disk crashes Therefore if multiple disks crash at the same time (or before a failed disk can be replaced) data will be lost RAID level 6 allows systems to deal with multiple disk crashes These systems use more sophisticated error correcting codes One of the simpler error correcting codes is the Hamming Code

62 Hamming Code Consider a system with seven disks which can be identified with numbers from 1 to 7 Four of the disks are data disks, disks 1 to 4 Three of the disks are redundant disks, disks 5 to 7 Each of the three check disks contain parity data for three of the four data disks Disk 5 contains parity data for disks 1, 2 and 3 Disk 6 contains parity data for disks 1, 2 and 4 Disk 7 contains parity data for disks 1, 3 and 4

63 Hamming Code Example
Disk   Contents
1-4    Data
5      Redundant data: parity of disks 1, 2, 3
6      Redundant data: parity of disks 1, 2, 4
7      Redundant data: parity of disks 1, 3, 4

64 RAID Level 6 Reads are performed as normal
Only the data disks are used Writes are performed in a similar way to RAID level 4 Except that multiple redundant disks may be involved Cost is high, as more check disks are required

65 RAID Level 6 Recovery If one disk fails use the parity data to restore the failed disk like level 4 If two disks fail then both disks can be rebuilt using three of the other disks, e.g. If disks 1 and 2 fail Rebuild disk 1 using disks 3, 4 and 7 Rebuild disk 2 using disks 1, 3 and 5 If disks 3 and 5 fail Rebuild disk 3 using disks 1, 4 and 7 Rebuild disk 5 using disks 1, 2 and 3
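
The recovery examples above follow one rule: find a check disk whose group (the check disk plus the three data disks it covers) contains exactly one failed disk, and rebuild that disk from the other three. A minimal sketch of that rule, using the parity assignments from the Hamming code slide; the disk contents and function names are hypothetical.

```python
from functools import reduce
from operator import xor

# Check disk -> the data disks it holds parity for (from the slides).
PARITY_GROUPS = {5: (1, 2, 3), 6: (1, 2, 4), 7: (1, 3, 4)}


def add_check_disks(data):
    """Given data disks 1 to 4, return all seven disks."""
    disks = dict(data)
    for check, members in PARITY_GROUPS.items():
        disks[check] = reduce(xor, (data[m] for m in members))
    return disks


def recover(disks, failed):
    """Rebuild the failed disks in place, one parity group at a time."""
    missing = set(failed)
    while missing:
        for check, members in PARITY_GROUPS.items():
            group = (check, *members)
            unknown = [d for d in group if d in missing]
            if len(unknown) == 1:
                target = unknown[0]
                disks[target] = reduce(
                    xor, (disks[d] for d in group if d != target))
                missing.remove(target)
                break
        else:
            raise ValueError("too many failed disks to recover")
    return disks


original = add_check_disks({1: 0b1010, 2: 0b0111, 3: 0b1100, 4: 0b0001})
damaged = dict(original)
damaged[1] = damaged[2] = None            # disks 1 and 2 fail
print(recover(damaged, [1, 2]) == original)   # True
```

For disks 1 and 2, the first usable group is the one for check disk 7, so disk 1 is rebuilt from disks 3, 4 and 7, and disk 2 is then rebuilt from disks 1, 3 and 5, as in the example.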

66 Parity Scheme and Reliability
In real-life RAID systems the disk array is partitioned into reliability groups A reliability group consists of a set of data disks and a set of check disks The number of check disks depends on the reliability level that is selected Consider a RAID system with 100 data disks and 10 check disks i.e. 10 reliability groups The MTTF is increased from 21 days to 250 years!

67 Which RAID Level Level 0 improves performance at the lowest cost but does not improve reliability Level 1+0 is better than level 1 and has the best write performance Levels 2 and 4 are always inferior to 3 and 5 Level 3 is good for large transfer requests of several contiguous blocks but bad for many small requests of a single disk block Level 5 is a good general-purpose solution Level 6 is appropriate if higher reliability is required In practice the choice is usually between 0, 1, and 5

68 RAID Levels Comparison
The table that follows compares RAID levels using RAID level 0 as a baseline Comparisons of RAID systems vary dependent on the metric used and the measurements of those metrics The three primary metrics are reliability, performance and cost These can be measured in I/Os per second, bytes per second, response time and so on The comparison uses throughput per dollar for systems of equivalent file capacity File capacity is the amount of information that can be stored on the system, which excludes redundant data

69 RAID Levels Comparison
[Table, partially recoverable.
Columns: Random Read, Random Write, Sequential Read, Sequential Write, Storage Efficiency
Rows: RAID 0, RAID 1, RAID 3, RAID 4, RAID 5, RAID 6
Recoverable entries: RAID 0: 1; RAID 3: 1/G, (G-1)/G; RAID 4: max(1/G, 1/4); RAID 6: max(1/G, 1/6), (G-2)/G]
G refers to the number of disks in a reliability group (both data disks and check disks)
RAID levels 10 and 2 not shown

70 Solid State Drives

71 Solid State Drives Solid State Drives (SSDs) use NAND flash memory and do not contain moving parts like an HDD Accessing an SSD does not require seek time or rotational latency and they are therefore considerably faster Flash memory is non-volatile memory that is used by smart-phones, mp3 players and thumb (or USB) drives NAND flash architecture is similar to a NAND (negated and) logic gate hence the name NAND flash architecture is only able to read and write data one page at a time There are two types of SSD Multi-level cell (MLC) Single-level cell (SLC)

72 MLC SSD MLC cells can store multiple different charge levels
And therefore more than one bit With four charge levels a cell can store 2 bits Multiple threshold voltages make reading more complex but allow more data to be stored per cell MLC SSDs are cheaper than SLC SSDs However write performance is worse And their lifetimes are shorter

73 SLC SSD SLC cells can only store a single charge level
They are therefore on or off, and can contain only one bit SLC drives are less complex They are more reliable and have a lower error rate They are faster since it is easier to read or write a single charge value SLC drives are more expensive And typically used for enterprise rather than home use

74 SSD Performance Reads are much faster than HDDs since there are no moving parts Writes are also faster than HDDs However flash memory must be erased before it is written, and entire blocks must be erased Referred to as write amplification The performance increase is greatest for random reads

75 Storing Data on a Disk

76 DBMS Structure
[Diagram: the DBMS comprises Query Evaluation, Transaction and Lock Manager, File and Access Code, Recovery Manager, Buffer Manager and Disk Space Manager, above the Database]

77 Accessing Data When an SQL command is evaluated a request may be made for a DB record Such a request is passed to the buffer manager If the record is not stored in the (main memory) buffer the page must be fetched from disk The disk space manager provides routines for allocating, de-allocating, reading and writing pages

78 Disk Space Management The disk space manager (DSM) keeps track of available disk space The lowest level of the DBMS architecture Supports the allocation and de-allocation of disk pages Pages are abstract units of storage, mapped to disk blocks Reading and writing to a page is performed in one disk I/O Sequences of pages are allocated to a contiguous sequence of blocks to increase access speed The DSM hides the underlying details of storage Allowing higher level processes to consider the data to be a collection of pages

79 Tracking Free Blocks A DB increases and decreases in size over time
In addition to mapping pages to blocks the DSM has to record which disk blocks are in use As time goes on, gaps in sequences of allocated blocks appear Free blocks need to be recorded so that they can be allocated in the future, using either A linked list, the head points to the first free block A bitmap, each bit corresponds to a single block Allows for fast identification, and therefore allocation, of contiguous areas of free space
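
The bitmap option can be sketched as a list of flags, one per block. This is an illustrative assumption about representation (a real disk space manager would pack the bits), and the function names are hypothetical; it shows why a bitmap makes contiguous free areas easy to find.

```python
def find_free_run(bitmap, length):
    """Index of the first run of `length` free blocks, or None.

    bitmap[i] is True when block i is in use.
    """
    run_start, run_length = 0, 0
    for i, in_use in enumerate(bitmap):
        if in_use:
            run_length = 0
            continue
        if run_length == 0:
            run_start = i
        run_length += 1
        if run_length == length:
            return run_start
    return None


def allocate(bitmap, length):
    """Mark a contiguous run of free blocks as used and return its start."""
    start = find_free_run(bitmap, length)
    if start is not None:
        for i in range(start, start + length):
            bitmap[i] = True
    return start


def free(bitmap, start, length):
    """Mark blocks as free again."""
    for i in range(start, start + length):
        bitmap[i] = False


blocks = [True, False, True, False, False, False, True, False]
print(find_free_run(blocks, 2))   # 3
print(allocate(blocks, 3))        # 3
print(find_free_run(blocks, 2))   # None
```

With a linked list of free blocks, the same search would mean following pointers from block to block, which is why the bitmap suits contiguous allocation better.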

80 Using OS as the DSM An OS is required to manage space on a disk
Typically an OS abstracts a file as a sequence of bytes While possible to build a DSM with the OS many DBMS perform their own disk management This makes the DBMS more portable across platforms Using the OS may impose technical limitations such as maximum file size In addition, OS files cannot typically be stored on separate disks, which may be necessary in a DBMS

81 Record and Page Format

82 Record Formats Attributes, or fields, must be organized within records
Information that is common to all records of a particular type is stored in the system catalog Including the number and type of fields Records of a single table can vary from each other In addition to differences in data (obviously) Different records may contain different number of fields, or Fields of varying length

83 Examples of Field Types
INTEGER, represented by two or four bytes FLOAT, represented by four or eight bytes CHAR(n), fixed length character strings of n bytes Unused characters are occupied with a pad character e.g. if a CHAR(5) stored "elm" it would be stored as elm VARCHAR(n), character strings of varying lengths Stored as arrays of n+1 bytes i.e. even though a VARCHAR's contents can vary, n+1 bytes are dedicated to them The length of a VARCHAR is stored in the first byte, or Its end is specified by a null character

84 Fixed Length Records Fields are a fixed length, and the number of fields is fixed Fields may then be stored consecutively And given the address of a record, the address of a particular field can be found By referring to the field size in the system catalog It is common to begin all fields at a multiple of 4 or 8 bytes

85 Fixed Length Record Format
Consider an employee record: {name CHAR(30), address VARCHAR(255), salary FLOAT}
Fields can be found by looking up the field size in the schema and performing an offset calculation
[Diagram: a header (pointer to schema, length, timestamp) followed by name at offset 12, address at offset 44 and salary at offset 300; the record ends at byte 308]
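
The offsets 12, 44, 300 and 308 in the diagram can be reproduced by an offset calculation. This sketch assumes a 12-byte header, fields starting at multiples of 4 bytes, 256 bytes reserved for VARCHAR(255) (n+1 bytes, as described earlier) and an 8-byte FLOAT; these sizes are inferred from the figures on the slides, and the names are hypothetical.

```python
def align(position: int, boundary: int = 4) -> int:
    """Round a byte position up to the next multiple of `boundary`."""
    return (position + boundary - 1) // boundary * boundary


HEADER_BYTES = 12    # pointer to schema, length, timestamp (assumed size)

# (field name, bytes reserved), as a system catalog might record them.
EMPLOYEE_SCHEMA = [
    ("name", 30),       # CHAR(30)
    ("address", 256),   # VARCHAR(255): n + 1 bytes
    ("salary", 8),      # FLOAT
]


def field_offsets(schema, header_bytes=HEADER_BYTES, boundary=4):
    """Return ({field: offset}, total record size) for a fixed length record."""
    offsets = {}
    position = align(header_bytes, boundary)
    for name, size in schema:
        offsets[name] = position
        position = align(position + size, boundary)
    return offsets, position


offsets, record_size = field_offsets(EMPLOYEE_SCHEMA)
print(offsets, record_size)
# {'name': 12, 'address': 44, 'salary': 300} 308
```

The 30-byte name occupies 32 bytes because the address must begin at a multiple of 4.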

86 Variable Length Fields
In the relational model each record contains the same number of fields However fields may be of variable length If a record contains both fixed and variable length fields, store the fixed length fields first The fixed length fields are easy to locate To store variable length fields include additional information in the record header The length of the record Pointers to the beginning of each variable length field A pointer to the end of the record

87 Variable Length Record
Consider an employee record with name and salary being fixed length and address being variable length
The pointer to the first variable length field may be omitted
[Diagram: a header holding other header information, the record length and an address pointer, followed by name, salary and address; the end of the record is marked]
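
A record of this shape can be packed and unpacked with Python's `struct` module. This is a sketch of one possible layout, not the layout of any particular DBMS: the header here holds only the record length and the offset of the address field (both 2-byte integers), the name is CHAR(30), and the salary is an 8-byte float.

```python
import struct

HEADER = struct.Struct("<HH")    # record length, offset of the address field
FIXED = struct.Struct("<30sd")   # name CHAR(30), salary (8-byte float)


def encode(name: str, salary: float, address: str) -> bytes:
    """Fixed length fields first, then the variable length address."""
    address_bytes = address.encode("utf-8")
    address_offset = HEADER.size + FIXED.size
    record_length = address_offset + len(address_bytes)
    return (HEADER.pack(record_length, address_offset)
            + FIXED.pack(name.encode("utf-8"), salary)  # name is NUL padded
            + address_bytes)


def decode(record: bytes):
    """Use the header to locate the end of the record and the address."""
    record_length, address_offset = HEADER.unpack_from(record, 0)
    name, salary = FIXED.unpack_from(record, HEADER.size)
    address = record[address_offset:record_length].decode("utf-8")
    return name.rstrip(b"\x00").decode("utf-8"), salary, address


record = encode("Ada", 5000.0, "1 Main St")
print(len(record), decode(record))
```

Because the address is the first and only variable length field, its offset is always the same, which is why that pointer could be omitted.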

88 Repeating Fields Records in a relational DB have the same number of fields But it is possible to have repeating fields For example a many to many relationship in a record that represents an object References to other objects will have to be stored The references (or pointers) to other objects suggest that different records will have different lengths There are three alternatives for recording such data

89 Storing Repeating Fields
Store the entire record in one block Maintain a pointer to the first reference Store the fixed length portion in one location, and the variable length portion in another The header contains a pointer to the variable length portion (the references to other objects), and The number of such objects Store a fixed length record with a fixed number of occurrences of the repeating fields, and A pointer to (and count of) any additional occurrences

90 Fixed vs. Variable Length
There are many advantages of keeping records (and therefore fields) fixed length More efficient for search Lower overhead (the header contains less data) Easier to move records around The main advantage of using variable length fields is that it can save space This can result in fewer disk I/O operations

91 Variable Length Fields Issues
Modifying a variable field in a record may make it larger Later fields in the same record have to be moved, and Other records may also have to be moved When a variable field is modified the record’s size may increase to the extent that it no longer fits on the page The record must then be moved to another page, but A "forwarding address" has to be maintained on the old page, so that external references to the rid are still valid A record may grow larger than the page size The record must then be broken into sections and connected by pointers

92 Forwarding Addresses Forwarding addresses may need to be maintained
When a record grows too large, or When records are maintained in order (clustered) When maintaining ordered data Provide a forwarding address if a record has to be moved to a new page to maintain the ordering Delete records by inserting a NULL value, or tombstone pointer in the header The record slot can be re-used when another record is inserted

93 Insertion The numbers represent the primary keys of the records
[Figure: records with keys 1, 3, and 8, before and after inserting 6, 17, and 21.]

94 Tombstone Pointer
[Figure: records with keys 1, 3, and 8; delete 3, and insert 5.]

95 Other Data Types There are other data types that require special treatment in terms of record storage Pointers, and reference variables Large objects such as text, images, video, sound etc.

96 Records with Pointers If a record represents an object, the object may contain pointers to or addresses of some other object Such pointers need to be managed by a DBMS A data item may have two addresses Database address on disk, usually 8 bytes Memory address in main memory, usually 4 bytes When an item is on the disk (i.e. secondary storage) its database address must be used And when an item is in the buffer pool it can be referred to by either its database or memory address It is more efficient to use the memory address

97 Translation Tables Database addresses of items in main memory should be translated to their current memory addresses To avoid unnecessary disk I/O It is possible to create a translation table that maps database addresses to memory addresses However when using such a table addresses may have to be repeatedly translated Whenever a pointer of a record in main memory is accessed the translation table must be used Pointer swizzling is used to avoid repeated translation table look-up

98 Pointer Swizzling Whenever a block is moved from secondary to main memory pointers in that block may be swizzled i.e. translated from the database to the memory address A pointer in main memory consists of A bit that indicates whether the pointer is a database or a memory address, and The memory address (four byte) or database address (8 byte) as appropriate Space is always reserved for the database address There are several strategies to decide when a pointer should be swizzled

99 Pointer Swizzling Example
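[Figure: Block 1 is read from disk into memory. Pointers to records in Block 1 are swizzled; pointers to records in Block 2, which is still on disk, remain unswizzled.]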

100 Swizzling Strategies When a new block is brought into main memory, pointers related to that block may be swizzled The block may contain pointers to records in the same block, or other blocks, and Pointers in records in other blocks, already in main memory, may point to records in the newly copied block There are four main swizzling strategies Automatic swizzling Swizzling on demand No swizzling – i.e. just use the translation table Programmer controlled swizzling – when access patterns are known

101 Automatic Swizzling Enter the address of the block and its records into the translation table Enter the address of any pointers in the records in the block into the translation table If such an address is already in the table, swizzle the pointer giving it the appropriate memory address If the address is not already in the table, copy its block into memory and swizzle the pointer This ensures that all pointers in the new block are swizzled when the block is loaded, which may save time However, it is possible that some of the pointers may not be followed, hence time spent swizzling them is wasted

102 Swizzling on Demand Enter the address of the block and its records into the translation table Leave all pointers in the block unswizzled When an unswizzled pointer is followed, look up the address in the translation table If the address is in the table, swizzle the pointer If the address is not in the table, copy the appropriate block into main memory, and swizzle the pointer Unlike automatic swizzling, this strategy does not result in unnecessary swizzling
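Swizzling on demand can be sketched as below. This is a simplified illustration under several assumptions: it works at record rather than block granularity, a Python dict stands in for the disk, and the names `Pointer`, `Store`, `load` and `follow` are hypothetical. Each pointer always keeps its database address and gains a memory reference only when it is first followed.

```python
class Pointer:
    """A pointer field: the database address is always kept."""

    def __init__(self, db_addr):
        self.db_addr = db_addr
        self.target = None          # in-memory record once swizzled

    @property
    def swizzled(self):
        return self.target is not None


class Store:
    def __init__(self, disk):
        self.disk = disk            # db address -> {"key": ..., "next": db address or None}
        self.table = {}             # translation table: db address -> in-memory record
        self.reads = 0              # number of simulated disk reads

    def load(self, db_addr):
        """Copy a record into memory, leaving its pointers unswizzled."""
        if db_addr not in self.table:
            self.reads += 1
            raw = self.disk[db_addr]
            nxt = raw["next"]
            self.table[db_addr] = {
                "key": raw["key"],
                "next": Pointer(nxt) if nxt is not None else None,
            }
        return self.table[db_addr]

    def follow(self, ptr):
        """Swizzle the pointer the first time it is followed."""
        if not ptr.swizzled:
            ptr.target = self.load(ptr.db_addr)   # table lookup, disk read if absent
        return ptr.target


store = Store({100: {"key": 1, "next": 200}, 200: {"key": 2, "next": None}})
first = store.load(100)
second = store.follow(first["next"])
print(second["key"], store.reads)
```

Following the same pointer a second time uses the stored memory reference directly, with no translation table lookup and no disk read.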

103 Returning Blocks to Disk
When a block is written to disk its pointers must first be unswizzled That is, the pointers to memory addresses must be replaced by the appropriate database addresses The translation table can be searched (by memory address) to find the database address This is potentially time consuming The translation table should therefore be indexed to allow efficient lookup of both memory and database addresses

104 Pinned Records and Blocks
Pointer swizzling may result in blocks being pinned A block is pinned if it cannot safely be written back to disk A block that is pointed to by a swizzled pointer should be pinned Otherwise, the pointer can no longer be followed to the block at the specified memory address If a block is unpinned pointers to it must be unswizzled The translation table must also include the memory addresses of pointers that refer to an entry As a linked list attached to an entry in the translation table, or As a (pointer to a) linked list in the record's pointer field

105 Large Object Blocks How are large data objects stored in records?
Video clips, or sound files or the text from a book LOB data types store and manipulate large blocks of unstructured data Tables can contain multiple LOB columns The maximum size of a LOB is large At least 8 terabytes in Oracle 10g LOB data must be processed by application programs LOB data is stored as either binary or character data BLOB – unstructured binary data CLOB, NCLOB – character data BFILE – unstructured binary data in OS files

106 LOB Storage LOBs have to be stored on a sequence of blocks
Ideally the blocks should be contiguous for efficient retrieval, but It is possible to store the LOB on a linked list of blocks Where each block contains a pointer to the next block If fast retrieval of LOBs is required they can be striped across multiple disks for parallel access It may be necessary to provide an index to a LOB For example indexing by seconds for a movie to allow a client to request small portions of the movie

107 Page Formats Records are organized on pages
Pages can be thought of as a collection of slots, each of which contains a single record A record can be identified by its record id (rid) The rid is the {page ID, slot number} pair Before considering different organizations for managing slots it is important to know if Records are fixed length or Variable length There are two organizations based on how records are deleted

108 Packed Page Format Records are stored consecutively in slots
When a record is deleted the last record on the page is moved to its location Records are found by an offset calculation All empty space is at the bottom of the page But the rid includes the slot number As records are moved external references become invalid
[Figure: slots 1 to N stored consecutively, followed by free space and N, the number of records.]

109 Unpacked Page Format
The page header contains a bitmap Each bit represents a single slot A slot's bit is turned off when the slot is empty New records are inserted in empty slots A record's slot number doesn't change
[Figure: slots 1 to M, followed by the bitmap showing slot occupancy and M, the number of slots.]

110 Variable Length Records
With variable length records a page cannot be divided into fixed length slots If a new record is larger than the slot it cannot be inserted If a new record is smaller it wastes space To avoid wasting space, records must be moved so that all the free space is contiguous without changing the rids One solution is to maintain a directory of page slots at the end of each page which contains A pointer (an offset value) to each record and The length of each record

111 Organizing Variable Length Records
Pointers are offsets to records Moving a record on the page has no impact on its rid Its pointer changes but its slot number does not A pointer to the start of the free space is required Records are deleted by setting the offset to -1 New records can be inserted in vacant slots Pages should be periodically reorganized to remove gaps The directory "grows" into the free space
[Figure: a page holding records of length 24, 16, and 20, followed by free space; at the end of the page, the slot directory (one pointer and length per slot), the number of slots N, and the pointer to the start of free space.]
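A slotted page along these lines might look like the sketch below. It is an illustration only: the class name `SlottedPage`, the 8-byte directory entry size, and keeping the directory as a Python list rather than encoding it at the end of the byte array are all simplifying assumptions.

```python
class SlottedPage:
    SLOT_BYTES = 8   # assumed size of one directory entry (offset + length)

    def __init__(self, size=4096):
        self.size = size
        self.data = bytearray(size)
        self.free_start = 0     # pointer to the start of free space
        self.slots = []         # slot directory: [offset, length]; offset -1 = vacant

    def free_space(self):
        # The directory grows into the free space from the end of the page.
        return self.size - self.free_start - self.SLOT_BYTES * len(self.slots)

    def insert(self, record):
        vacant = [i for i, (off, _) in enumerate(self.slots) if off == -1]
        needed = len(record) + (0 if vacant else self.SLOT_BYTES)
        if needed > self.free_space():
            self.compact()      # reclaim gaps left by deleted records
            if needed > self.free_space():
                raise ValueError("record does not fit on this page")
        if vacant:
            slot = vacant[0]    # reuse a vacant slot
        else:
            slot = len(self.slots)
            self.slots.append([-1, 0])
        self.data[self.free_start:self.free_start + len(record)] = record
        self.slots[slot] = [self.free_start, len(record)]
        self.free_start += len(record)
        return slot

    def delete(self, slot):
        self.slots[slot] = [-1, 0]

    def read(self, slot):
        off, length = self.slots[slot]
        if off == -1:
            raise KeyError(slot)
        return bytes(self.data[off:off + length])

    def compact(self):
        """Slide live records together; offsets change, slot numbers do not."""
        live = sorted((off, i) for i, (off, _) in enumerate(self.slots) if off != -1)
        pos = 0
        for off, i in live:
            length = self.slots[i][1]
            self.data[pos:pos + length] = self.data[off:off + length]
            self.slots[i][0] = pos
            pos += length
        self.free_start = pos


page = SlottedPage(128)
a = page.insert(b"A" * 24)
b = page.insert(b"B" * 16)
page.delete(a)
d = page.insert(b"D" * 10)   # reuses the vacant slot
page.compact()
print(a, b, d, page.read(b))
```

Since a rid is the pair {page ID, slot number}, external references stay valid across `compact()` even though every record may have moved.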

112 Files and Records A page can be considered as a collection of records
Pages containing related records are organized into collections, or files One file usually represents a single table One file may span several pages It is therefore necessary to be able to access all of the pages that make up a file The basic file structure is a heap file

113 Heap Files Heap files are not ordered in any way
But they do guarantee that all of the records in a file can be retrieved by repeatedly requesting the next record Each record in a file has a unique record ID (rid) And each page in the file is the same size Heap files support the following operations: Creating and destroying files Inserting and deleting records Scanning all the records in the file To support these operations it is necessary to: Keep track of the pages in the file Keep track of which of those pages contain free space

114 Heap File Organization 1
Maintain the heap file as a pair of doubly linked lists of pages One list for pages with free space and One list for pages that are full The DBMS can record the first page in the list in a table with one entry for each file If records are of variable length most pages will end up in the list of pages with free space It may be necessary to search several pages on the free space list to find one with enough free space

115 Heap File Organization 2
Maintain the heap file as a directory of pages Each directory entry identifies a page (or a sequence of pages) in the heap file The entries are kept in data page order and record, for each page: Whether or not the page is full, or The amount of free space If the amount of free space is recorded there is no need to visit a page to determine if it contains enough space
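The benefit of recording free space in the directory can be sketched as follows. The class `HeapFileDirectory` and its first-fit search are assumptions made for illustration; the point is that a page with enough room is found by scanning directory entries only, without visiting any data page.

```python
class HeapFileDirectory:
    """Directory of pages: entry i holds the free bytes on page i."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.free = []

    def insert(self, record_size):
        """Return the page chosen for a record of the given size."""
        if record_size > self.page_size:
            raise ValueError("record is larger than a page")
        # First fit: only the directory is examined, not the pages.
        for page_id, free in enumerate(self.free):
            if free >= record_size:
                self.free[page_id] -= record_size
                return page_id
        # No existing page has room: allocate a new one.
        self.free.append(self.page_size - record_size)
        return len(self.free) - 1


directory = HeapFileDirectory(page_size=100)
placements = [directory.insert(n) for n in (60, 60, 30, 40)]
print(placements, directory.free)
```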

116 Managing Data in Main Memory

117 The Buffer Manager The buffer manager is responsible for bringing pages from disk to main memory as required Main memory is partitioned into a collection of pages called the buffer pool Main memory pages are referred to as frames Other processes must tell the buffer manager if a page is no longer required and whether or not it has been modified A DB may be many times larger than the buffer pool Accessing the entire DB (or performing queries that require joins) can easily fill up the buffer pool When the buffer pool is full, the buffer manager must decide which pages to replace by following a replacement policy

118 Buffer Pool Management
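[Figure: Program 1 and Program 2 request pages from the buffer manager. The buffer pool in main memory consists of frames, each either holding a disk page or free; the database resides on disk.]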

119 Page Frames
Buffer pool frames are the same size as disk pages The buffer manager records two pieces of information for each frame dirty bit – on if the page has been modified pin-count – the number of times the page has been requested but not released
[Figure: a main memory frame holding a data page together with its dirty bit and pin count.]

120 Requesting (Allocating) a Page
If the page is already in the buffer pool Increment the frame's pin-count (called pinning) Otherwise Choose a frame to replace (using the policy) A frame is only chosen for replacement if its pin-count is zero If there is no frame with a pin-count of zero the transaction must either wait or be aborted If the chosen frame is dirty write it to the disk Read requested page into replacement frame and set its pin-count to 1 Return the address of the frame

121 Releasing a Page When a process releases (de-allocates) a page its pin-count is reduced, known as unpinning The program indicates if the page has been modified, if so the buffer manager sets the dirty bit to on Processes for requesting and releasing pages are affected by concurrency and crash recovery policies These will be discussed at a later date
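The request and release steps can be sketched together as below. This is a minimal illustration that ignores concurrency and crash recovery: a dict stands in for the disk, the names `Frame`, `BufferManager`, `request` and `release` are hypothetical, and the replacement policy is a placeholder that takes an empty frame if there is one and otherwise the first frame with a pin-count of zero.

```python
class Frame:
    def __init__(self):
        self.page_id = None
        self.data = None
        self.pin_count = 0
        self.dirty = False


class BufferManager:
    def __init__(self, disk, num_frames):
        self.disk = disk                       # page id -> bytes (stands in for the disk)
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}                   # page id -> frame index

    def _choose_victim(self):
        # Only frames with a pin-count of zero may be replaced.
        candidates = [i for i, f in enumerate(self.frames) if f.pin_count == 0]
        if not candidates:
            raise RuntimeError("all frames are pinned: wait or abort")
        # Placeholder policy: prefer an empty frame, else the first unpinned one.
        return min(candidates, key=lambda i: self.frames[i].page_id is not None)

    def request(self, page_id):
        if page_id in self.page_table:         # already in the buffer pool
            frame = self.frames[self.page_table[page_id]]
            frame.pin_count += 1               # pinning
            return frame
        index = self._choose_victim()
        frame = self.frames[index]
        if frame.page_id is not None:
            if frame.dirty:                    # write a modified page back first
                self.disk[frame.page_id] = frame.data
            del self.page_table[frame.page_id]
        frame.page_id, frame.data = page_id, self.disk[page_id]
        frame.pin_count, frame.dirty = 1, False
        self.page_table[page_id] = index
        return frame

    def release(self, page_id, modified=False):
        frame = self.frames[self.page_table[page_id]]
        frame.pin_count -= 1                   # unpinning
        frame.dirty = frame.dirty or modified


disk = {1: b"a", 2: b"b", 3: b"c"}
manager = BufferManager(disk, num_frames=2)
frame = manager.request(1)
frame.data = b"A"
manager.release(1, modified=True)
manager.request(2)
manager.request(3)        # evicts page 1 and writes it back
print(disk[1])
```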

122 Buffer Replacement Policies
The policy used to replace frames can affect the efficiency of database operations Ideally a frame should not be replaced if it will be needed again in the near future Least Recently Used (LRU) replacement policy Assumes that frames that haven't been used recently are no longer required Uses a queue to keep track of frames with pin-count of zero Replaces the frame at the front of the queue Requires main memory space for the queue

123 Clock Replacement A variant of the LRU policy with less overhead
Instead of a queue the policy requires one bit per frame, and a single variable, called current Assume that the frames are numbered from 0 to B-1 Where B is the number of frames Each frame has an associated referenced bit The referenced bit is initially set to off, and is Set to on when the frame's pin-count reaches zero current is initially set to 0, and is used to indicate the next frame to be considered for replacement

124 Clock Replacement Process
Consider the current frame for replacement If pin-count > 0, increment current If pin-count = 0 and referenced bit is on Switch referenced to off and increment current If pin-count = 0 and referenced is off Replace the frame When current is incremented past B-1 it wraps around to 0 Only replace frames with pin-counts of zero Frames with a pin-count of zero are only replaced after all older candidates are replaced
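The clock process can be sketched as a single function. This is an illustrative version under stated assumptions: the name `clock_choose` is hypothetical, frame state is held in two plain lists, the hand is advanced past the chosen frame, and the search gives up after two full sweeps, which is enough to find any unpinned frame because the first sweep switches every referenced bit off.

```python
def clock_choose(pin_count, referenced, current):
    """Pick a frame to replace.

    pin_count and referenced are lists indexed by frame number (0 to B-1).
    Returns (victim frame, new value of current).
    """
    num_frames = len(pin_count)
    for _ in range(2 * num_frames):
        frame = current
        current = (current + 1) % num_frames   # wrap from B-1 back to 0
        if pin_count[frame] > 0:
            continue                           # pinned: never replaced
        if referenced[frame]:
            referenced[frame] = False          # give it a second chance
            continue
        return frame, current
    raise RuntimeError("all frames are pinned")


pins = [1, 0, 0, 0]
refs = [False, True, False, True]
victim, hand = clock_choose(pins, refs, current=0)
print(victim, hand, refs)
```

Frame 0 is skipped because it is pinned, frame 1 has its referenced bit switched off, and frame 2 is chosen.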

125 Is LRU the Right Policy? LRU and clock replacement are fair schemes
They are not always the best strategies for a DB system It is common for some DB operations to require repeated sequential scans of data (e.g. Cartesian products, joins) With LRU such operations may result in sequential flooding An alternative is the Most Recently Used policy This prevents sequential flooding but is generally poor Most systems use some variant of LRU Some systems will identify certain operations, and apply MRU for those operations

126 No Sequential Flooding
Assume that a process requests sequential scans of a file The file has nine pages (p1 to p9) Assume that the buffer pool has ten frames Read page 1 first, then page 2, … then page 9 All the pages are in the buffer, when the next scan of the file is requested, no further disk access is required!
[Figure: the buffer pool fills with p1 to p9, leaving one frame free.]

127 Sequential Flooding Assume that a process requests sequential scans of a file This file has eleven pages (p1 to p11) Assume that the buffer pool still has ten frames Read pages 1 to 10 first, page 11 is still to be read
[Figure: the buffer pool holds p1 to p10.]

128 Sequential Flooding Using LRU, replace the appropriate frame, which contains p1, with p11
[Figure: the buffer pool holds p11 and p2 to p10.]

129 Sequential Flooding The first scan is complete, start the second scan by reading p1 from the file Replace the LRU frame (containing p2) with p1
[Figure: the buffer pool holds p11, p1, and p3 to p10.]

130 Sequential Flooding Continue the scan by reading p2, …
[Figure: p2 replaces p3; the buffer pool holds p11, p1, p2, and p4 to p10.]

131 Sequential Flooding Each scan of the file requires that every page is read from the disk! In this case LRU is the WORST possible replacement policy!
[Figure: each page read replaces the page that will be needed next.]
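The effect can be checked with a small simulation. This sketch counts disk reads for repeated sequential scans under LRU and MRU; the function name `count_disk_reads` is hypothetical, and it assumes every page is unpinned as soon as it has been read, so any frame may be replaced.

```python
from collections import OrderedDict


def count_disk_reads(num_pages, num_frames, scans, policy="LRU"):
    """Count disk reads for `scans` sequential scans of pages 1..num_pages."""
    pool = OrderedDict()            # keys ordered from least to most recently used
    reads = 0
    for _ in range(scans):
        for page in range(1, num_pages + 1):
            if page in pool:
                pool.move_to_end(page)          # hit: now the most recently used
                continue
            reads += 1                          # miss: read the page from disk
            if len(pool) == num_frames:
                # LRU evicts the oldest entry, MRU the newest.
                pool.popitem(last=(policy == "MRU"))
            pool[page] = None
    return reads


print(count_disk_reads(9, 10, scans=3))                  # file fits in the pool
print(count_disk_reads(11, 10, scans=3))                 # LRU, one page too many
print(count_disk_reads(11, 10, scans=3, policy="MRU"))   # MRU on the same workload
```

With nine pages only the first scan touches the disk. With eleven pages and ten frames, LRU reads every page on every scan, while MRU keeps most of the file resident.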

132 OS Buffer Management There are similarities between OS virtual memory and DBMS buffer management Both have the goal of accessing more data than will fit in main memory Both bring pages from disk to main memory as needed and replace unneeded pages A DBMS requires its own buffer management To increase the efficiency of database operations To control when a page is written to disk

133 DBMS Buffer Management
A DBMS can often predict patterns in the way in which pages are referenced Most page references are generated by processes such as query processing with known patterns of page accesses Knowledge of these patterns allows for a better choice of pages to replace and Allows prefetching of pages, where the page requests can be anticipated and performed before they are requested A DBMS requires the ability to force a page to disk To ensure that the page is updated on a disk This is necessary to implement crash recovery protocols where the order in which pages are written is critical

134 Prefetching Some DBMS buffer managers are able to predict page requests And fetch pages into the buffer before they are requested The pages are then available in the buffer pool as soon as they are requested, and If the pages to be prefetched are contiguous, the retrieval will be faster than if they had been retrieved individually If the pages are not contiguous, retrieval may still be faster as access to them can be efficiently scheduled The disadvantage of prefetching (aka double-buffering) is that it requires extra main memory buffers

135 Performance Strategies
Organizing data by cylinders Related data should be stored "close to" each other Using a RAID system to improve efficiency or reliability Multiple disks and striping improves efficiency Mirroring or redundancy improves reliability Scheduling requests using the elevator algorithm Reduces disk access time for random reads and writes Most effective when there are many requests waiting Prefetching (or double-buffering) data in large chunks Speeds up access when needed blocks can be predicted but requires more main memory buffers
