1
Chapter 2. Data Storage Chapter 2
2
Outline Memory hierarchy Hardware: Disks Access Times
Example - Megatron 747 Optimizations Disk failure RAIDs Chapter 2
3
Hardware - Data Storage
Users DBMS’s Operating Systems Hardware - Data Storage Chapter 2
4
The Memory Hierarchy
[Figure: the memory hierarchy. Levels: cache, main memory, disk (as virtual memory and as file system), tertiary storage. Users of the levels: programs and main-memory DBMS's, DBMS's.]
5
Cache
The cache is an integrated circuit, or part of the processor's chip
Holds data or machine instructions copied from main memory
If data being expelled from the cache has been modified, the new value must be copied back into main memory
Typical performance
  Capacities up to a megabyte
  Access time: 10 nanoseconds (10^-8 seconds)
  Moving data between cache and main memory: 100 nanoseconds (10^-7 seconds)
6
Main Memory Everything that happens in the computer is resident in main memory Capacity: around 100 Mbyte to 10 Gbyte Random access Typical access time is nanoseconds Chapter 2
7
Virtual Memory
Is a part of disk
In a 32-bit address machine, virtual memory grows up to 2^32 bytes (4 Gbytes)
Data is moved between disk and main memory in entire blocks, which are also called pages in main memory
Main-memory database systems
8
Secondary Storage (1)
Slower and more capacious than main memory
Random access
Magnetic, optical, and magneto-optical disks
Disk reads and writes are done by moving a chunk of bytes called a block (or page)
File buffer
9
Secondary Storage (2) Accessing a block: 10-30 milliseconds
Recently, one disk unit can store data ranging from 10 to 32 Gbytes A machine can have several disk units Chapter 2
10
Tertiary Storage (1) Have been developed to hold data volumes measured in terabytes Compared with secondary storage, it offers Higher read/write times Larger capacities and smaller cost per byte Not random access in general Chapter 2
11
Tertiary Storage (2)
Kinds of tertiary storage devices
  Ad-hoc tape storage
  Optical-disk juke boxes: CD-ROMs
  Tape silo: an automated version of ad-hoc tape storage
Capacities
  CD: 2/3 Gbytes, 2.3 Gbytes
  Tapes: 50 Gbytes
Access time: about 1000 times slower than secondary memory
12
Volatile and Nonvolatile
Volatile vs. nonvolatile storage
Flash memory
  A form of main memory
  Nonvolatile
  Becoming economical
RAM disk
  A battery-backed main memory
13
Access Time vs. Capacity
[Figure: access time (10^x seconds, x from 2 down to -9) plotted against capacity (10^y bytes, y from 5 to 13) for cache, main memory, secondary storage (including floppy and zip disks), and tertiary storage]
14
Moore's Law
Gordon Moore observed that the following improve by a factor of 2 every 18 months
  The speed of processors, i.e., the number of instructions executed per second, and the ratio of the speed to the cost of a processor
  The cost of main memory per bit and the number of bits that can be put on one chip
  The cost of disk per bit and the number of bytes that a disk can hold
Not applicable to
  Main-memory access time, disk access time
15
Disks … A typical disk Terms: Platter, Head, Actuator Cylinder, Track
Sector, Block, Gap A typical disk Chapter 2
16
Disks: A Top View
[Figure: top view of a disk surface]
Cylinder, track, sector, gap
Gaps often represent about 10% of the total track
An entire sector cannot be used if a portion of it is destroyed
Typically a block consists of one or more sectors
17
The Disk Controller
[Figure: processor, main memory, and disk controller on a bus, with the disks attached to the disk controller]
Controls one or more disk drives
  Controlling the mechanical actuator
  Selecting a surface or a sector on that surface
  Transferring bits via a data bus
18
Disk Storage Characteristics (as of 1999)
Rotation speed of the disk assembly
  5400 RPM (one rotation every 11 milliseconds)
Number of platters per unit
  Typical disk drive: 5 platters (10 surfaces)
  Floppy/zip disk: 1 platter (2 surfaces)
Number of tracks per surface
  As many as 10,000 tracks
  3.5-inch diskette: 40 tracks
Number of bytes per track
  Common disk: 10^5 or more bytes
  3.5-inch diskette: 150K
19
Megatron 747 Disk (1)
Characteristics
  4 platters (8 surfaces)
  8192 (2^13) tracks per surface
  On average 256 (2^8) sectors per track
  512 (2^9) bytes per sector
Diameters of tracks
  Outermost track: 3.5 inches
  Innermost track: 1.5 inches
A track consists of two parts
  Gaps: 10%
  Data: 90%
20
Megatron 747 Disk (2)
The capacity of the disk
  8 surfaces * 8192 tracks * 256 sectors * 512 bytes = 8 Gbytes
A single track, on average
  256 sectors * 512 bytes = 128 Kbytes = 1 Mbit
A cylinder holds 1 Mbyte on average
If a block is 4096 (2^12) bytes
  A block uses 8 sectors (= 4096 bytes / 512 bytes)
  A track consists of 32 blocks (= 256 sectors / 8)
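The capacity arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the constant names are illustrative and not part of the slides; the values are the Megatron 747 figures quoted above.

```python
# Megatron 747 parameters as quoted on the slides (names are illustrative).
SURFACES = 8
TRACKS_PER_SURFACE = 2**13   # 8192
SECTORS_PER_TRACK = 2**8     # 256 on average
BYTES_PER_SECTOR = 2**9      # 512
BLOCK_BYTES = 2**12          # 4096

track_bytes = SECTORS_PER_TRACK * BYTES_PER_SECTOR       # one track
cylinder_bytes = SURFACES * track_bytes                  # one track per surface
disk_bytes = TRACKS_PER_SURFACE * cylinder_bytes         # whole disk
sectors_per_block = BLOCK_BYTES // BYTES_PER_SECTOR
blocks_per_track = SECTORS_PER_TRACK // sectors_per_block

print("disk:", disk_bytes // 2**30, "Gbytes")
print("track:", track_bytes // 2**10, "Kbytes =", track_bytes * 8 // 2**20, "Mbit")
print("cylinder:", cylinder_bytes // 2**20, "Mbyte")
print("sectors per block:", sectors_per_block)
print("blocks per track:", blocks_per_track)
```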
21
Megatron 747 Disk (3)
If each track had the same number (i.e. 256) of sectors, the density of bits would be greater on the inner tracks
  Length of the outermost track: 0.9 * 3.5 * π ≈ 9.9 inches
    1 megabit / 9.9 ≈ 100,000 bits per inch
  Length of the innermost track: 0.9 * 1.5 * π ≈ 4.2 inches
    1 megabit / 4.2 ≈ 250,000 bits per inch
Tracks in the Megatron 747 have different numbers of sectors
  Outer: 320 sectors
  Middle: 256 sectors
  Inner: 192 sectors
  The outermost track: 1,801,800 bits / 9.9 ≈ 182,000 bpi
  The innermost track: 478,800 bits / 4.2 ≈ 114,000 bpi
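A short sketch of the uniform-sector case (every track holds 256 sectors) may make the density argument concrete. The function name and constants are assumptions for illustration; the computation uses the 90% data fraction and the track diameters stated on the slides, and the slide rounds the results to 100,000 and 250,000 bits per inch.

```python
import math

DATA_FRACTION = 0.9             # 10% of each track is gaps
TRACK_BITS = 256 * 512 * 8      # 256 sectors of 512 bytes = 1 megabit

def bits_per_inch(diameter_inches, track_bits=TRACK_BITS):
    """Bit density along the data-bearing part of a track."""
    data_length = DATA_FRACTION * math.pi * diameter_inches
    return track_bits / data_length

outer = bits_per_inch(3.5)   # outermost track
inner = bits_per_inch(1.5)   # innermost track
print(f"outermost: {outer:,.0f} bits per inch")
print(f"innermost: {inner:,.0f} bits per inch")
```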
22
The Latency of the Disk
[Figure: a request "I want block X" goes to the disk; after the disk access time, block X is in memory]
Disk access time = seek time + rotational delay + transfer time + others
23
Seek Time
The time to position the head assembly at the proper cylinder
  Zero if the heads are already at the proper cylinder
  Otherwise, the time to move the heads to the proper cylinder
[Figure: seek time plotted against cylinders traveled (1 to Max); the time axis is marked x and 3 or 20x]
24
Rotational Latency
The time for the disk to rotate until the first of the sectors containing the block reaches the head
One rotation takes 10 ms, so the rotational latency is 5 ms on average
[Figure: a track showing where the head is and where the wanted block is]
25
Transfer Time/Other Delays
Transfer time: the time to read or write the data on the appropriate disk surface
  10 Mbytes per second
Other delays (neglected here)
  Time taken by the processor and disk controller
  Delays due to contention for the disk controller
  Other delays due to contention
26
Modifying Blocks Not possible to modify a block on disk directly
Sequence of procedures Read block (time: rt) Modify in memory (time: mt) Write block (time: wt) Verify (time: vt) if appropriate Total time rt + mt + wt + vt Chapter 2
27
Example 2.3 (1)
Let us examine the time to read a 4096-byte block from the Megatron 747 disk
Characteristics
  4 platters (8 surfaces), 1 surface = 8192 tracks
  1 track = 256 sectors, 1 sector = 512 bytes
  Disk rotates at 3840 RPM, one rotation = 1/64 of a second
To move the head assembly
  1 ms (to start and stop) + 1 ms for every 500 cylinders traveled
  Heads move one track in 1 + 1/500 = 1.002 ms
  To move the heads from the innermost to the outermost track: 1 + (8192 / 500) = 17.4 ms
28
Example 2.3 (2)
Minimum time (the best case)
  No seek time, no rotational latency, only transfer time
Note: 1 track = 256 sectors, 1 sector = 512 bytes
  4096 bytes / 512 bytes = 8 sectors (plus the 7 gaps between them)
  Gaps and sectors occupy 10% and 90% of a track (36 and 324 degrees)
  A track has 256 gaps and 256 sectors
  36 * 7/256 + 324 * 8/256 = 11.109 degrees
  (11.109/360)/64 = 4.8e-4 seconds ≈ 0.5 ms
29
Example 2.3 (3)
Maximum time (the worst case)
  Full seek time and rotational latency, plus transfer time
  Full seek time: 17.4 ms
  Full rotational time: 1/64 of a second = 15.6 ms
  Transfer time: 0.5 ms
  Total: 17.4 + 15.6 + 0.5 = 33.5 ms
30
Example 2.3 (4)
Average time
  Transfer time: 0.5 ms
  Average rotational time: half of a full rotation = 7.8 ms
  Average seek time
    Average distance traveled = 1/3 of the disk = 2730 cylinders
    1 + 2730/500 = 6.5 ms
  Total: 0.5 + 7.8 + 6.5 = 14.8 ms
[Figure: average travel as a function of the starting track; axis ticks at 2048, 4096, and 8192]
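The best, worst, and average cases of Example 2.3 can be recomputed as a sketch. The function and constant names below are assumptions made for illustration; the seek formula, rotation speed, and gap/sector split are the ones stated in the example.

```python
ROTATION_MS = 1000 / 64      # 3840 RPM: one rotation takes 1/64 second
TRACKS = 8192
SECTORS_PER_TRACK = 256

def seek_ms(cylinders):
    """1 ms to start and stop, plus 1 ms per 500 cylinders traveled."""
    return 0.0 if cylinders == 0 else 1 + cylinders / 500

def transfer_ms(sectors=8):
    """Time for the head to pass over `sectors` sectors and the gaps between them."""
    gap_degrees = 36 * (sectors - 1) / SECTORS_PER_TRACK   # gaps cover 10% of 360 degrees
    data_degrees = 324 * sectors / SECTORS_PER_TRACK       # sectors cover 90%
    return (gap_degrees + data_degrees) / 360 * ROTATION_MS

best = transfer_ms()
worst = seek_ms(TRACKS) + ROTATION_MS + transfer_ms()
average = seek_ms(TRACKS / 3) + ROTATION_MS / 2 + transfer_ms()

print(f"best {best:.1f} ms, worst {worst:.1f} ms, average {average:.1f} ms")
```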
31
RAM model vs. I/O model of computation
Dominance of I/O cost
  Remember, a large number of in-memory operations take the same time as one disk I/O
  Should minimize the number of block accesses
Data structures vs. file processing
32
Using Secondary Storage Effectively
In general database Whole databases are much too large to fit in main memory Key parts of databases are buffered in main memory Disk I/O’s occur frequently Main memory sorts (such as “Quick sort”) are inadequate Chapter 2
33
Merge Sort

Step    List 1        List 2        Output
start   1, 3, 4, 9    2, 5, 7, 8    none
1)      3, 4, 9       2, 5, 7, 8    1
2)      3, 4, 9       5, 7, 8       1, 2
3)      4, 9          5, 7, 8       1, 2, 3
4)      9             5, 7, 8       1, 2, 3, 4
5)      9             7, 8          1, 2, 3, 4, 5
6)      9             8             1, 2, 3, 4, 5, 7
7)      9             none          1, 2, 3, 4, 5, 7, 8
8)      none          none          1, 2, 3, 4, 5, 7, 8, 9
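The merge traced on this slide can be written as a short function. This is a minimal sketch of the standard two-list merge; the function name is illustrative.

```python
def merge(list1, list2):
    """Merge two sorted lists by repeatedly taking the smaller front element."""
    output = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            output.append(list1[i])
            i += 1
        else:
            output.append(list2[j])
            j += 1
    # One list is exhausted; the rest of the other is already in order.
    output.extend(list1[i:])
    output.extend(list2[j:])
    return output

print(merge([1, 3, 4, 9], [2, 5, 7, 8]))
```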
34
Two-Phase, Multiway Merge-Sort (1)
Sort main-memory-sized pieces of the data Fill all available main memory with blocks Sort the records in main memory Write the sorted records Chapter 2
35
Two-Phase, Multiway Merge-Sort (2)
Merge all the sorted sublists into a single sorted list Find the smallest key among the first remaining elements of all the lists Move the smallest element to the first available position of the output block If output block is full, write it to disk and reinitialize the same buffer Repeat until all input blocks become exhausted. Chapter 2
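Both phases can be sketched in a few lines. This is a simplified in-memory model, not the slides' implementation: Python lists stand in for disk blocks, and the parameter names `memory_records` and `block_records` are assumptions introduced for the example.

```python
def two_phase_multiway_merge_sort(records, memory_records, block_records):
    # Phase 1: sort memory-sized pieces; each sorted piece is one sublist "on disk".
    sublists = [sorted(records[i:i + memory_records])
                for i in range(0, len(records), memory_records)]

    # Phase 2: merge all sublists, keeping a pointer to the first unchosen record of each.
    positions = [0] * len(sublists)
    output_blocks = []   # stands in for output blocks written to disk
    buffer = []          # the single output buffer
    while True:
        remaining = [k for k in range(len(sublists)) if positions[k] < len(sublists[k])]
        if not remaining:
            break
        # Find the smallest key among the first remaining elements of all the lists.
        k = min(remaining, key=lambda c: sublists[c][positions[c]])
        buffer.append(sublists[k][positions[k]])
        positions[k] += 1
        if len(buffer) == block_records:   # output block full: write it, reuse the buffer
            output_blocks.append(buffer)
            buffer = []
    if buffer:
        output_blocks.append(buffer)
    return output_blocks

blocks = two_phase_multiway_merge_sort([9, 3, 7, 1, 8, 2, 5, 4, 6, 0],
                                       memory_records=4, block_records=3)
print(blocks)
```

A linear scan over the list heads is used here because it mirrors the slide's wording; with many sublists a priority queue would be the usual choice.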
36
Main-memory Organization
Pointers to first unchosen record Input buffers, one for each sorted list Select smallest for output Output buffers Chapter 2
37
Merge Sort Example (1)
Assumptions
  10,000,000 tuples, 1 tuple = 100 bytes, so 1 Gbyte of data
  50 Mbytes of main memory available
  4096-byte blocks, so each block contains 40 records
  Total # of blocks: 250,000
  # of blocks in main memory: 12,800 (= 50 * 2^20 / 2^12)
Number of sublists
  19 sublists (12,800 blocks each) + 1 sublist (6,800 blocks)
Each block read or write: 15 ms
38
Merge Sort Example (2)
Computation
First phase
  Read each of the 250,000 blocks once
  Write 250,000 new blocks
  Total time: (250,000 * 15 ms) * 2 = 7500 seconds = 125 minutes
Second phase
  Similar to the first phase
  Total time: 125 minutes
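The block counts and the per-phase time follow directly from the stated assumptions. A small sketch, with illustrative variable names:

```python
import math

TUPLES = 10_000_000
TUPLES_PER_BLOCK = 40                  # 4096-byte blocks, 100-byte tuples
MEMORY_BLOCKS = 50 * 2**20 // 2**12    # 50 Mbytes of memory in 4096-byte blocks
BLOCK_IO_MS = 15                       # each block read or write

total_blocks = TUPLES // TUPLES_PER_BLOCK
sublists = math.ceil(total_blocks / MEMORY_BLOCKS)
last_sublist_blocks = total_blocks - (sublists - 1) * MEMORY_BLOCKS

# Each phase reads every block once and writes every block once.
phase_minutes = total_blocks * 2 * BLOCK_IO_MS / 1000 / 60

print(total_blocks, MEMORY_BLOCKS, sublists, last_sublist_blocks, phase_minutes)
```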
39
Improving the Access Time of Secondary Storage
Place blocks on the same cylinder Divide the data among several small disks Mirroring disks Use a disk-scheduling algorithm Prefetch blocks to main memory in anticipation of their later use Chapter 2
40
Organizing Data by Cylinders
Use several adjacent cylinders Read all the blocks on a single track or on a cylinder consecutively Neglect all but the first seek time and the first rotational latency Chapter 2
41
Example 2.9 (1)
Recall Examples 2.3 and 2.7
Original data may be stored on consecutive cylinders
  Total # of cylinders: 1000 (= 1 Gbyte / 1 Mbyte)
  Main memory can hold 50 cylinders (i.e. 50 Mbytes)
To read 50 cylinders of data into main memory
  6.5 ms for the average seek time
  49 ms for 49 one-cylinder seeks (1 ms each)
  6.4 seconds for the transfer of 12,800 blocks: (12,800 * 0.5 ms) / 1000 = 6.4 seconds
  So, 6.5 + 49 + 6,400 = 6,455.5 ms
42
Example 2.9 (2)
First phase
  Read: (6.5 ms + 49 ms + 6.4 seconds) * 20 times = 2.15 minutes
  Write: the same as reading
  Total time: 4.3 minutes
Second phase
  Still takes about 125 minutes (WHY?)
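The first-phase figures of Example 2.9 can be reproduced as follows. This is a sketch under the example's own assumptions (one average seek per memory load, 1 ms one-cylinder seeks, 0.5 ms per block transfer); the constant names are illustrative.

```python
AVG_SEEK_MS = 6.5
ONE_CYLINDER_SEEK_MS = 1.0
BLOCK_TRANSFER_MS = 0.5
CYLINDERS_PER_LOAD = 50      # main memory holds 50 cylinders
BLOCKS_PER_LOAD = 12_800
LOADS = 20                   # memory is filled 20 times

# One load: one average seek, 49 one-cylinder seeks, then pure transfer time.
load_ms = (AVG_SEEK_MS
           + (CYLINDERS_PER_LOAD - 1) * ONE_CYLINDER_SEEK_MS
           + BLOCKS_PER_LOAD * BLOCK_TRANSFER_MS)
read_minutes = load_ms * LOADS / 1000 / 60
first_phase_minutes = 2 * read_minutes   # writing costs the same as reading

print(f"{load_ms} ms per load, read {read_minutes:.2f} min, "
      f"first phase {first_phase_minutes:.1f} min")
```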
43
Using Multiple Disks in place of One
Use several disks with their independent heads Transfer data at a higher rate Roughly speaking, total time could be divided by the number of disks Chapter 2
44
Example 2.10 (1) Replace one 747 by four 737’s which have one platter and two surfaces Assumption Divide the given records among the four disks Occupy 1000 adjacent cylinders on each disk Fill ¼ of main memory each disk Recall previous examples Average seek time and rotational latency: 0 Number of full memory blocks: 12,800 ¼ memory size: 3,200 blocks Chapter 2
45
Example 2.10 (2)
Computation: first phase
  Transfer time: 3200 * 0.5 ms = 1.6 seconds
  Read: (6.5 ms + 49 ms + 1.6 seconds) * 20 = 33 seconds
  Write: similar to reading
  Total time: about 1 minute
46
Example 2.10 (3)
Second phase
  Apply delicate techniques (?) to reduce disk I/O time
    Start comparisons among the 20 lists as soon as the first element of the block appears in main memory
    Use four output buffers
    …
  Total time: about 1 hour (?)
47
Mirroring Disks
Two or more disks hold identical copies of data
Data survives a head crash on either disk
If we make n copies of a disk, we can read any n blocks in parallel
Using mirror disks does not speed up writing, but neither does it slow writing down (to some extent)
48
Scheduling Requests by the Elevator Algorithm
The disk controller chooses which of several requests to execute first, to increase throughput
Elevator algorithm
  Proceed in the same direction until the next cylinder with blocks to access is encountered
  When there are no requests ahead in the direction of travel, reverse direction
49
Example 2.11

Arrival times for six block-access requests

Cylinder of request   First time available
1000                  0
3000                  0
7000                  0
2000                  20
8000                  30
5000                  40

Finishing times for block accesses using the elevator algorithm

Cylinder of request   Time completed
1000                  8.3
3000                  21.6
7000                  38.9
8000                  50.2
5000                  65.5
2000                  80.8

Finishing times for block accesses using the first-come-first-served algorithm

Cylinder of request   Time completed
1000                  8.3
3000                  21.6
7000                  38.9
2000                  58.2
8000                  79.5
5000                  94.8
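The two schedules in Example 2.11 can be reproduced with a small simulation. This is a sketch under assumptions that are not spelled out on the slide: the head starts at cylinder 1000 moving toward higher cylinders, the first three requests arrive at time 0, and each access costs the Megatron 747 seek time plus 7.8 ms average rotational latency and 0.5 ms transfer. All function names are illustrative.

```python
ROTATE_AND_TRANSFER_MS = 7.8 + 0.5   # average rotational latency + transfer time

REQUESTS = [(1000, 0), (3000, 0), (7000, 0), (2000, 20), (8000, 30), (5000, 40)]

def access_ms(head, cylinder):
    """Seek (1 ms + 1 ms per 500 cylinders, or 0 if no move) plus rotation and transfer."""
    distance = abs(cylinder - head)
    seek = 0.0 if distance == 0 else 1 + distance / 500
    return seek + ROTATE_AND_TRANSFER_MS

def first_come_first_served(requests, head=1000):
    now, finished = 0.0, []
    for cylinder, arrival in sorted(requests, key=lambda r: r[1]):
        now = max(now, arrival) + access_ms(head, cylinder)
        head = cylinder
        finished.append((cylinder, round(now, 1)))
    return finished

def elevator(requests, head=1000):
    pending = list(requests)
    now, direction, finished = 0.0, 1, []   # direction +1 means toward higher cylinders
    while pending:
        ready = [r for r in pending if r[1] <= now]
        if not ready:                        # idle until the next request arrives
            now = min(arrival for _, arrival in pending)
            continue
        ahead = [r for r in ready if (r[0] - head) * direction >= 0]
        if not ahead:                        # nothing ahead in this direction: reverse
            direction = -direction
            continue
        request = min(ahead, key=lambda r: abs(r[0] - head))
        now += access_ms(head, request[0])
        head = request[0]
        pending.remove(request)
        finished.append((head, round(now, 1)))
    return finished

print("elevator:", elevator(REQUESTS))
print("FCFS:    ", first_come_first_served(REQUESTS))
```

Under these assumptions the last request finishes at 80.8 ms with the elevator algorithm and at 94.8 ms with first-come-first-served, matching the tables.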
50
Prefetching Data on Track- or Cylinder-sized Chunks
Can we predict the order in which blocks will be requested from disk ? For example, Devote two block buffers to each list when merged (when there is plenty of memory) When a buffer is exhausted, switch to the other buffer for the same list Chapter 2
51
Single Buffering
  Read B1 into the buffer
  Process the data in the buffer
  Read B2 into the buffer
  Process the data in the buffer
  ...
Computation
  P = time to process one block
  R = time to read in one block
  n = # of blocks
  Single buffering time = n(P + R)
52
Single Buffering vs. Double Buffering
[Figure: blocks A through G on disk and two buffers in memory; the process works on one block in memory while another block is being read from disk]
53
Double Buffering
Computation
  P = processing time per block
  R = I/O time per block
  n = # of blocks
  Double buffering time: R + nP
  Single buffering time: n(R + P)
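The two formulas can be compared against a small timeline simulation. This is a sketch; the function names are illustrative, and the simulation assumes one disk and two buffers. The formula R + nP holds when processing is at least as slow as reading (P >= R), which is an assumption the slide leaves implicit.

```python
def single_buffering_time(n, P, R):
    """Read and process strictly alternate: n(P + R)."""
    return n * (P + R)

def double_buffering_time(n, P, R):
    """Slide formula; assumes P >= R so every read after the first is hidden."""
    return R + n * P

def simulate_double_buffering(n, P, R):
    """Timeline with two buffers: the next read overlaps the current processing."""
    read_done = R        # time at which block i is available in its buffer
    prev_done = 0.0      # time at which block i-1 finished processing
    for _ in range(n):
        current_done = max(read_done, prev_done) + P
        # The next read starts once the disk is free (block i read) and the
        # other buffer is free (block i-1 processed).
        read_done = max(read_done, prev_done) + R
        prev_done = current_done
    return prev_done

n, P, R = 1000, 5.0, 2.0
print(single_buffering_time(n, P, R))       # 7000.0
print(double_buffering_time(n, P, R))       # 5002.0
print(simulate_double_buffering(n, P, R))   # 5002.0
```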
54
Prefetching Combine prefetching with the cylinder-based strategy
Store the sorted sublists on whole, consecutive cylinders Read whole tracks or cylinders whenever we need some records from a given list Chapter 2
55
Example 2.14 (1)
Consider the second phase of the sort
Have in main memory two track-sized buffers per sorted sublist
  A track: 128 KB
  Total space requirement: 128 KB * 20 lists * 2 = 5 Mbytes
Read all the blocks on 1000 cylinders (8000 tracks)
Computation
  Average seek time: 6.5 ms
  The time for the disk to rotate once: 15.6 ms
  Total time (for reading): (6.5 + 15.6) * 8000 = 2.95 minutes
56
Example 2.14 (2)
Have in main memory two cylinder-sized buffers per sorted sublist
  1 cylinder = 8 tracks = 128K * 8 = 1M
  Use 40 buffers of a megabyte each
  50 megabytes of main memory available
Need only do a seek once per cylinder
Read all the blocks on 1000 cylinders (8000 tracks)
Total time (for reading): (6.5 + 8 * 15.6) * 1000 cylinders = 2.19 minutes
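The two prefetching variants of Example 2.14 differ only in how often a seek is paid. A short sketch of the arithmetic, with illustrative names:

```python
AVG_SEEK_MS = 6.5
ROTATION_MS = 15.6           # time to read one full track
TRACKS_PER_CYLINDER = 8
CYLINDERS = 1000

# Track-sized buffers: one seek plus one rotation for every track.
track_minutes = ((AVG_SEEK_MS + ROTATION_MS)
                 * CYLINDERS * TRACKS_PER_CYLINDER / 1000 / 60)

# Cylinder-sized buffers: one seek per cylinder, then eight rotations.
cylinder_minutes = ((AVG_SEEK_MS + TRACKS_PER_CYLINDER * ROTATION_MS)
                    * CYLINDERS / 1000 / 60)

print(f"track-sized buffers:    {track_minutes:.2f} minutes")
print(f"cylinder-sized buffers: {cylinder_minutes:.2f} minutes")
```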
57
Block Size Selection
Big blocks amortize I/O cost
Big blocks read in more useless stuff and take longer to read
As memory prices drop, blocks get bigger…
58
Disk Failures
Intermittent failure
  An attempt to read or write a sector is unsuccessful, but with repeated tries we are able to read or write successfully
Media decay
  A bit or bits are permanently corrupted, and the sector becomes unreadable
Write failure
  We can neither write successfully nor retrieve the previously written sector
Disk crash
  The disk becomes permanently unreadable
59
Checksums (1)
Each sector has additional bits, called the checksum, to check reading or writing operations
A read returns a pair (w, s)
  w: the data that is read
  s: a status bit
A simple form of checksum: parity
60
Checksums (2) Example 1 (even parity) Example 2 (even parity)
The sequence of bits in a sector : The parity bit is 1 Data becomes Example 2 (even parity) The sequence of bits in a sector : The parity bit is 0 Data becomes Chapter 2
61
Checksums (3)
It is possible that we cannot detect an error if more than one bit of the sector is corrupted
If we use n independent bits as a checksum, then the chance of missing an error is only 1/2^n (WHY?)
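A single even-parity bit can be sketched as below. The bit strings are hypothetical examples chosen for illustration (they are not the sequences from the slides), and the function names are assumptions.

```python
def with_even_parity(bits):
    """Append one bit so that the total number of 1's is even."""
    parity = bits.count("1") % 2
    return bits + str(parity)

def passes_parity_check(stored):
    """Status is good when the stored bits contain an even number of 1's."""
    return stored.count("1") % 2 == 0

def flip(stored, position):
    """Corrupt one bit (hypothetical helper used to model media decay)."""
    flipped = "1" if stored[position] == "0" else "0"
    return stored[:position] + flipped + stored[position + 1:]

stored = with_even_parity("01101000")        # three 1's, so the parity bit is 1
print(stored)                                # 011010001
print(passes_parity_check(stored))           # True
print(passes_parity_check(flip(stored, 0)))  # False: a one-bit error is detected
print(passes_parity_check(flip(flip(stored, 0), 1)))  # True: a two-bit error is missed
```

The last line shows why one parity bit is not enough when more than one bit may be corrupted.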
62
Stable Storage (1) How to correct errors ?
Stable storage is a technique for organizing a disk so that media decays or failed writes do not result in permanent loss. The general idea is that sectors are paired, and each pair represents one sector-contents X As the left (XL) and right (XR) copies Chapter 2
63
Stable Storage (2)
Writing policy
  Write the value of X into XL
    If the status is good, the write is done
    If the status is bad, repeat the write
    If the write still fails after a number of tries, assume a media failure in the sector
  Repeat the above scheme for XR
Reading policy (to obtain the value of X)
  Read XL
    If status bad is returned, repeat the read
    If status good is returned, take that value as X
  If XL cannot be read, repeat the above with XR
64
Recovery from Disk Crashes
A disk crash is fatal in mission-critical applications
RAID (redundant arrays of independent disks)
  Here, we discuss levels 1, 4, 5, and 6
  These RAID schemes also handle the failures discussed previously
65
The Failure Model of Disks
Mean time to failure represents the length of time by which 50% of a population of disks will have failed catastrophically
For modern disks, it is about 10 years
[Figure: fraction of disks surviving as a function of time]
66
RAID Level 1 To protect against data loss
Use mirroring disks The only way data can be lost is if there is a second disk crash while the first crash is being repaired. Chapter 2
67
How often will a data loss occur?
Assume
  The process of replacing the failed disk takes 3 hours (1/8 day, or 1/2920 year)
  A failure rate of 5% per year
Probability that the mirror disk will fail during copying
  (1/20) * (1/2920) = 1/58,400
Mean time to a failure involving data loss
  One of the two disks will fail once in 5 years on the average
  5 * 58,400 = 292,000 years
68
RAID Level 4 (1)
Use one redundant disk no matter how many data disks there are
In the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks
Use the modulo-2 sum: even parity
Data disks: disk 1, disk 2, disk 3
Redundant disk: disk 4
69
The Algebra of Modulo-2 Sums
The commutative law: x ⊕ y = y ⊕ x
The associative law: x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z
The all-0 vector of the appropriate length, Ō, is the identity for ⊕: x ⊕ Ō = Ō ⊕ x = x
⊕ is its own inverse: x ⊕ x = Ō
If x ⊕ y = z, then y = x ⊕ z
70
RAID Level 4: Reading (2)
Read disks normally
We could also make use of the redundant disk when reading!
Example: read disks 2, 3, and 4, and get the contents of disk 1 using the modulo-2 sum
71
RAID Level 4: Writing (3)
When a block is written, we need to change the redundant disk
Naïve approach
  N-1 reads of the blocks not being rewritten
  One write of the new block
  One rewrite of the redundant block
  In total, N+1 disk I/O's
There is a better way to do that!
72
Writing Example (4)
When disk 2 changes from 10101010 to 11001100
  Modulo-2 sum of the old and new bits of disk 2: 01100110
  New redundant disk 4: modulo-2 sum of the old redundant disk and the modulo-2 sum of disk 2's old and new bits
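The parity computation, the recovery of a lost block, and the cheaper write can all be sketched with bitwise XOR. The old and new values of disk 2 are the ones named on this slide; the contents of disks 1 and 3 are illustrative assumptions, and the helper name `modulo2_sum` is hypothetical.

```python
from functools import reduce

def modulo2_sum(*blocks):
    """Bitwise modulo-2 sum (XOR) of any number of blocks."""
    return reduce(lambda a, b: a ^ b, blocks)

# Illustrative contents for disks 1 and 3; disk 2 is the old value from the slide.
disk1, disk2, disk3 = 0b11110000, 0b10101010, 0b00111000
disk4 = modulo2_sum(disk1, disk2, disk3)     # redundant disk

# Reading or recovery: disk 1 is the modulo-2 sum of all the other disks.
assert modulo2_sum(disk2, disk3, disk4) == disk1

# Writing: update the redundant disk from the old and new disk 2 only.
new_disk2 = 0b11001100
change = modulo2_sum(disk2, new_disk2)       # bits that differ
new_disk4 = modulo2_sum(disk4, change)

# Same result as recomputing parity from every data disk (the naive approach).
assert new_disk4 == modulo2_sum(disk1, new_disk2, disk3)

print(f"change    = {change:08b}")
print(f"new disk4 = {new_disk4:08b}")
```

The update path touches only disk 2 and the redundant disk: two reads and two writes, regardless of how many data disks there are.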
73
RAID Level 4: Failure Recovery (5)
Recomputing any missing data is simple, and does not depend on which disk (data or redundant) is failed. Chapter 2
74
RAID Level 5 We could treat each disk as the redundant disk for some of the blocks That is, do not have to treat one disk as the redundant disk and the others as data disks When there are n+1 disks (disk 0 – disk n) If (i mod n+1) = j, then we can treat the ith cylinder of disk j as redundant Chapter 2
75
Example 2.21 (1)
How are the redundant blocks assigned for 4 disks (n = 3)?
  Disk 0: redundant for blocks 4, 8, 12, …
  Disk 1: redundant for blocks 1, 5, 9, …
  Disk 2: redundant for blocks 2, 6, 10, …
  Disk 3: redundant for blocks 3, 7, 11, …
76
Example 2.21 (2)
The reading and writing load for each disk is the same
If all blocks are equally likely to be written
  Each disk has a 1/4 chance of holding the block being written
  If it does not, it has a 1/3 chance of holding the redundant block for that block
Each of the four disks is involved in 1/2 of the writes
  1/4 + 3/4 * 1/3 = 1/2
77
RAID Level 6 (1)
To deal with any number of disk crashes, data or redundant
Here, we focus on a simple example, where two simultaneous crashes are correctable and the strategy is based on a simple error-correcting code, the Hamming code
Consider a system with seven disks
  Data disks: disks 1-4
  Redundant disks: disks 5-7
78
RAID Level 6 (2)
The relationship between data and redundant disks

              Data          Redundant
Disk number   1  2  3  4    5  6  7
              1  1  1  0    1  0  0
              1  1  0  1    0  1  0
              1  0  1  1    0  0  1

Note
  The columns are every possible column of three 0's and 1's, except for the all-0 column
  The columns for the redundant disks have a single 1
  The columns for the data disks each have at least two 1's
79
RAID Level 6 (3)

              Data          Redundant
Disk number   1  2  3  4    5  6  7
              1  1  1  0    1  0  0
              1  1  0  1    0  1  0
              1  0  1  1    0  0  1

The disks with 1 in a given row are treated as if they were the entire set of disks in a RAID level 4 scheme
  The bits of disk 5 are the modulo-2 sum of the bits of disks 1, 2, and 3
  The bits of disk 6 are the modulo-2 sum of the bits of disks 1, 2, and 4
  The bits of disk 7 are the modulo-2 sum of the bits of disks 1, 3, and 4
80
RAID Level 6 – Read/Write
Reading: Just read data from any data disk normally Writing Need to recalculate several redundant disks Chapter 2
81
A Writing Example (1)
Writing: disk 2 is changed to 00001111
Corresponding redundant disks: disks 5 and 6
Use the modulo-2 sum
  Between the old and new disk 2
  Between that modulo-2 sum of disk 2's and disk 5
  Between that modulo-2 sum of disk 2's and disk 6
82
A Writing Example (2) Disk Contents 1 11110000 2 00001111 3 00111000 4
5 6 7 (old disk 2) (new disk 2) (modulo-2 sum ) (modulo-2 sum) (disk 5) (new disk 5) (modulo-2 sum) (disk 6) (new disk 6) Chapter 2
83
RAID Level 6 – Failure Recovery
Assume that disks a and b fail simultaneously
Find a row r in which the columns for a and b are different
  For example, a has 0 in row r and b has 1 in row r
Compute the correct b by taking the modulo-2 sum of the corresponding bits from all the disks other than b that have 1 in row r
Then compute the correct a
84
A Recovery Example
Disks 2 and 5 fail (the contents of disk 2 are shown as ????????)
Pick the second row
  Disk 2: modulo-2 sum of disks 1, 4, and 6
  Disk 5: modulo-2 sum of disks 1, 2, and 3
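The recovery rule can be sketched for the seven-disk scheme. The three parity groups come from the relationships stated earlier (disk 5 covers disks 1, 2, 3; disk 6 covers 1, 2, 4; disk 7 covers 1, 3, 4). The data values are illustrative assumptions, in particular the contents of disk 4, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor

# Each row lists the disks with a 1 in that row of the matrix.
ROWS = [(1, 2, 3, 5), (1, 2, 4, 6), (1, 3, 4, 7)]

def build(data):
    """Given data disks 1-4, compute redundant disks 5-7 as modulo-2 sums."""
    disks = dict(data)
    for row in ROWS:
        *members, redundant = row
        disks[redundant] = reduce(xor, (disks[d] for d in members))
    return disks

def recover(surviving, a, b):
    """Rebuild failed disks a and b from the surviving disks."""
    disks = dict(surviving)
    for first, second in ((a, b), (b, a)):
        # A row where the columns differ: it contains `first` but not `second`.
        row = next((r for r in ROWS if first in r and second not in r), None)
        if row is None:
            continue
        disks[first] = reduce(xor, (disks[d] for d in row if d != first))
        # With `first` restored, any row containing `second` now suffices.
        row = next(r for r in ROWS if second in r)
        disks[second] = reduce(xor, (disks[d] for d in row if d != second))
        return disks
    raise ValueError("no distinguishing row; cannot happen with distinct columns")

full = build({1: 0b11110000, 2: 0b00001111, 3: 0b00111000, 4: 0b01000001})
surviving = {d: v for d, v in full.items() if d not in (2, 5)}
restored = recover(surviving, 2, 5)
print(f"disk 2 = {restored[2]:08b}, disk 5 = {restored[5]:08b}")
```

For disks 2 and 5 the function picks the second row, so disk 2 is rebuilt from disks 1, 4, and 6 and disk 5 from disks 1, 2, and 3, as on the slide.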