Disk Arrays COEN 180
Large Storage Systems Collection of disks to store large amounts of data. Performance advantage: Each drive can satisfy only so many I/Os per second, so data spread across more drives is more accessible. JBOD: Just a Bunch Of Disks
Large Storage Systems Principal difficulty: Reliability. Data needs to be stored redundantly. Mirroring, replication: simple; expensive (double, triple, … storage costs); good performance. Erasure correcting codes: complex; save storage; moderate performance.
Large Storage Systems Mirrored Disks. Used by Tandem (1970 – 1997, bought by Compaq). NonStop architecture: used redundancy (CPU, storage) for fail-over capacity. Data is replicated on both drives. Performance: Writes are as fast as in the single disk model. Reads: slightly faster, since we can serve the read from the drive with the best expected service time.
Disk Performance Modeling Basics Service Time: Time to satisfy a request if the system is otherwise idle. Response Time: Time to satisfy a request at a given system load. Response time = service time + waiting time. Utilization: Fraction of time the system is busy.
Disk Performance Modeling Basics M/M/1 queue: single server. Assume Poisson arrivals and exponential service times. Arrival rate $\lambda$, service time $S$, utilization $U = \lambda S$ (Little's law), response time $R$. Determine $R$ by $R = S + UR$, hence $R = S/(1-U) = S/(1-\lambda S)$. [Figure: $R$ versus $U$, plotted for $S = 1$, hence $U = \lambda$.]
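A minimal Python sketch of the M/M/1 response-time formula on this slide, to show how quickly response time grows with utilization. The function name `mm1_response_time` and the 10 msec service time are illustrative assumptions, not taken from the slides.

```python
def mm1_response_time(arrival_rate, service_time):
    """Return R = S / (1 - U), where U = arrival_rate * service_time."""
    utilization = arrival_rate * service_time
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


S = 0.010  # assumed service time of 10 msec, in seconds
for target_utilization in (0.5, 0.8, 0.9):
    arrival_rate = target_utilization / S  # requests per second
    r = mm1_response_time(arrival_rate, S)
    print(f"U = {target_utilization:.0%}: R = {r * 1000:.1f} msec")
```

Under this model the response time is twice the service time at 50% utilization and five times the service time at 80%.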
Disk Performance Modeling Basics Need to determine service time of disk request. Service time = seek time + latency + transfer time Industrial (but wrong) determination: Seek time = time to travel one third of a disk. Why?
Disk Performance Modeling Basics Assume that the head is positioned on a random track. Assume that the target track is another random track. Given $x \in [0,1]$, calculate $D(x)$ = distance of a random point in $[0,1]$ from $x$.
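Disk Performance Modeling Basics Given $x \in [0,1]$, calculate $D(x)$ = distance of a random point in $[0,1]$ from $x$: $D(x) = \int_0^1 |x-y|\,dy = \frac{x^2}{2} + \frac{(1-x)^2}{2} = x^2 - x + \frac{1}{2}$.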
Disk Performance Modeling Basics Now calculate the average distance from a random point to a random point in $[0,1]$: $\int_0^1 \int_0^1 |x-y|\,dy\,dx = \frac{1}{3}$.
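The one-third figure can be checked numerically. The sketch below, with hypothetical function names and an arbitrary sample count and seed, evaluates the closed form of $D(x)$ and estimates the average distance between two uniformly random points by simulation.

```python
import random


def mean_distance_from(x):
    """Closed form of D(x): the integral of |x - y| for y in [0, 1]."""
    return x * x - x + 0.5


def average_seek_distance(samples=200_000, seed=1):
    """Monte Carlo estimate of the mean distance between two random points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += abs(rng.random() - rng.random())
    return total / samples


print("D(0)   =", mean_distance_from(0.0))   # head at the edge of the disk
print("D(0.5) =", mean_distance_from(0.5))   # head in the middle of the disk
print("average distance (simulated):", round(average_seek_distance(), 3))
```

The simulated average should land close to 1/3; the exact digits depend on the sample count and seed.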
Disk Performance Modeling Basics Is Average Seek Time = Seek Time for Average Distance? NO: Seek time does not depend linearly on seek distance. Seek time consists of: acceleration; cruising (if the seek distance is long); braking; exact positioning.
Disk Performance Modeling Basics Is Average Seek Time = Seek Time for Average Distance? Practical measurements suggest that seek time grows roughly as the square root of the seek distance.
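To illustrate why the two quantities differ, here is a rough sketch that assumes a pure square-root seek model, seek time = full-stroke time × sqrt(distance), with distances normalized to [0, 1]. This model and the names `seek_time` and `average_seek_time` are simplifying assumptions; a real drive also has settle time and other fixed costs.

```python
import math


def seek_time(distance, full_stroke_time=1.0):
    """Assumed model: seek time grows as the square root of the distance."""
    return full_stroke_time * math.sqrt(distance)


def average_seek_time(steps=100_000):
    """Average seek time over random start and target positions.

    The distance d between two uniform points in [0, 1] has density 2(1 - d),
    so the average is the integral of seek_time(d) * 2(1 - d) over [0, 1],
    evaluated here with the midpoint rule.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        d = (i + 0.5) * h
        total += seek_time(d) * 2.0 * (1.0 - d) * h
    return total


print("average seek time:             ", round(average_seek_time(), 4))
print("seek time for average distance:", round(seek_time(1.0 / 3.0), 4))
```

Under this assumed model the average seek time comes out lower than the seek time for the average distance of one third, which is why quoting the one-third seek is misleading.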
Disk Performance Modeling Basics Rules of Thumb Keep utilization of disks between 50% and 80%.
Disk Arrays Dealing with reliability: RAID, Redundant Array of Inexpensive (Independent) Disks. RAID Levels: RAID Level 0: JBOD (striping). RAID Level 1: Mirroring. RAID Level 2: Encodes symbols (bytes) with a Hamming code and stores each bit of a symbol on a different disk. Not used in practice.
Disk Arrays Dealing with reliability RAID Levels RAID Level 3: Encodes symbols (bytes) with the simple parity code. Breaks a file up into n stripes. Calculates parity stripes. Stores all n + 1 stripes on n + 1 disks.
Disk Arrays Dealing with Reliability RAID Levels RAID Level 4: Maintains n data drives. Files are stored completely on one drive, or perhaps in stripes if files become very large. An additional drive stores the byte-wise parity of the data drives.
Disk Arrays Level 4 RAID Uneven load of parity drive and data drives
Disk Arrays Dealing with Reliability RAID Level 5 No dedicated parity disk Data in blocks Blocks in parallel positions on disks form reliability stripe. One block in each reliability stripe is the parity of the others. No performance bottleneck
Disk Arrays Dealing with Reliability RAID Level 6 Like RAID Level 5, but every stripe has two parity blocks Lower write performance 2-failure resilience RAID Level 7 Proprietary name for a RAID Level 3 with lots of caching. (Marketing bogus)
Disk Arrays Disk Array Operations Reads: Directly from data in RAID Level 3-6 Writes: Large Writes: Writes to all blocks in a single reliability stripe. Calculate parity from data and write it. Small Writes: Need to maintain parity. Option 1: Write data, then read all other blocks in the stripe and recalculate parity. Option 2: Read old data, then overwrite it. Calculate the difference (XOR) between old and new data. Then read old parity, XOR it with the result of the previous operation and overwrite with it the parity block.
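Option 2 for small writes can be sketched in a few lines of Python. The helper names (`xor_blocks`, `parity_of`, `small_write`) and the in-memory list standing in for the disks are illustrative assumptions; a real array would issue two reads and two writes to the drives.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks):
    """Parity block of a full stripe (what a large write computes)."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = xor_blocks(parity, block)
    return parity


def small_write(stripe, parity, index, new_data):
    """Option 2: read old data and old parity, write new data and new parity."""
    old_data = stripe[index]                  # read old data
    delta = xor_blocks(old_data, new_data)    # difference between old and new
    stripe[index] = new_data                  # overwrite the data block
    return xor_blocks(parity, delta)          # new contents of the parity block


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = parity_of(stripe)
parity = small_write(stripe, parity, 1, b"\xaa\xbb")
print(parity == parity_of(stripe))  # parity still matches the whole stripe
```

Option 2 touches only the data block and the parity block, whatever the stripe width, while option 1 has to read every other block in the stripe.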
Disk Arrays Disk Array Operations Reconstruction (RAID Level 4-5): Systematically: Reconstruct only lost data. Read all surviving blocks in the reliability stripe. Calculate its parity. This is the lost data block. Write data block in place of parity. Out of order reconstruction for data that is being read.
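Reconstruction of a lost block under simple parity is the XOR of all surviving blocks in the reliability stripe. A minimal sketch, with hypothetical helper names and byte strings standing in for disk blocks:

```python
from functools import reduce


def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_lost_block(surviving_blocks):
    """XOR of all surviving blocks (data and parity) of one stripe."""
    return reduce(xor_blocks, surviving_blocks)


data = [b"abcd", b"wxyz", b"1234"]
parity = reduce(xor_blocks, data)

# Suppose the disk holding data[1] fails.
surviving = [data[0], data[2], parity]
print(reconstruct_lost_block(surviving) == data[1])
```

The same function recovers a lost parity block when all data blocks survive, since the parity is itself the XOR of the data blocks.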
Disk Arrays Performance Analysis Assume that read and write service times are the same: seek + latency (+ transfer). A small write involves a read-modify-write operation: seek + latency + transfer + two latencies + transfer. This takes about twice as long as a read or write service time.
Disk Arrays Performance Analysis Level 4 RAID: offered read load $r$, offered write load $w$, $n$ disks. Utilization at a data disk: $rS/(n-1) + 2wS/(n-1)$. Utilization at the parity disk: $2wS$. Equal utilization only if $r = 2(n-2)w$.
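Disk Arrays Performance Analysis Level 4 RAID: offered load $\lambda$ (IO/sec). Assume only small writes and a read ratio $\rho$ (fraction of requests that are reads), so $r = \rho\lambda$ and $w = (1-\rho)\lambda$ with $n$ disks. Utilization at a data disk: $\rho\lambda S/(n-1) + 2(1-\rho)\lambda S/(n-1)$. Utilization at the parity disk: $2(1-\rho)\lambda S$. [Figure: utilization of parity disk and data disk versus offered load (IO/sec). Parameters: 4+1 layout, 70% reads, service time 10 msec.]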
Disk Arrays Performance Analysis RAID Level 5: offered load $\lambda$, read ratio $\rho$, $n$ disks. Read load per disk: $\rho\lambda S/n$. Write load per disk: $(1-\rho)\,4\lambda S/n$. Every write leads to two read-modify-write ops.
Disk Arrays Level 4 RAID vs Level 5 RAID [Figure: utilization versus offered load for the RAID Level 4 parity drive and data drive, for RAID Level 5, and for a JBOD without parity disk. Parameters: 4+1 layout, 70% reads, service time 10 msec.]
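The comparison can be reproduced numerically from the utilization formulas on the Level 4 and Level 5 performance slides. The sketch below uses the stated parameters (4+1 layout, so 5 disks; 70% reads; 10 msec service time). The function names and the sample load of 100 IO/sec are assumptions chosen for illustration.

```python
def raid4_utilization(load, read_ratio, service_time, disks):
    """Return (data disk utilization, parity disk utilization) for RAID Level 4."""
    reads = read_ratio * load
    writes = (1.0 - read_ratio) * load
    # Reads and read-modify-writes (cost 2S each) spread over the data disks.
    data = (reads + 2.0 * writes) * service_time / (disks - 1)
    # Every small write also costs a read-modify-write on the parity disk.
    parity = 2.0 * writes * service_time
    return data, parity


def raid5_utilization(load, read_ratio, service_time, disks):
    """Per-disk utilization for RAID Level 5: each write costs 4S in total."""
    per_request = read_ratio + 4.0 * (1.0 - read_ratio)
    return per_request * load * service_time / disks


LOAD = 100.0  # assumed offered load in IO/sec
data_u, parity_u = raid4_utilization(LOAD, 0.7, 0.010, 5)
raid5_u = raid5_utilization(LOAD, 0.7, 0.010, 5)
print(f"RAID 4 data disk:   {data_u:.3f}")
print(f"RAID 4 parity disk: {parity_u:.3f}")
print(f"RAID 5 any disk:    {raid5_u:.3f}")
```

With these parameters the Level 4 parity disk is the busiest component, so it limits the load the array can sustain, while Level 5 spreads the same parity work over all disks.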
Disk Arrays Performance Small writes are expensive. Parity logging (Daniel Stodolsky, Garth Gibson, Mark Holland). Write operation: Read old data, write new data, send the XOR of the two to a parity log file. Whenever the parity log file becomes too big, process it by updating the parity information.
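The idea of deferring parity updates can be shown with a toy in-memory model. This is only a sketch of the principle as described on the slide, not the published design: the class name `ParityLoggedStripe`, the `log_limit` parameter, and the use of a Python list as the log are all assumptions.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityLoggedStripe:
    """One reliability stripe whose parity updates are deferred via a log."""

    def __init__(self, data_blocks, log_limit=4):
        self.data = list(data_blocks)
        self.parity = bytes(len(self.data[0]))
        for block in self.data:
            self.parity = xor_blocks(self.parity, block)
        self.log = []              # pending XOR differences
        self.log_limit = log_limit

    def write(self, index, new_data):
        """Read old data, write new data, append the XOR to the log."""
        delta = xor_blocks(self.data[index], new_data)
        self.data[index] = new_data
        self.log.append(delta)
        if len(self.log) >= self.log_limit:
            self.apply_log()

    def apply_log(self):
        """Fold all logged differences into the parity block."""
        for delta in self.log:
            self.parity = xor_blocks(self.parity, delta)
        self.log.clear()


stripe = ParityLoggedStripe([b"\x01", b"\x02", b"\x04"])
stripe.write(0, b"\x10")
stripe.write(2, b"\x20")
print("pending log entries:", len(stripe.log))
stripe.apply_log()
print("parity:", stripe.parity.hex())
```

Until the log is applied, the parity block is stale, so the log itself has to be kept safe; the sketch ignores that aspect.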
Disk Arrays Reliability Accurately given by the probability of failure at every moment in time.
Disk Arrays Reliability Often given by Mean Time To Data Loss MTTDL Warning: MTTDL numbers can be deceiving. Red line is more reliable during Design Life, but has lower MTTDL
Disk Arrays Use Markov Model to model system in various states. States describe system. Assumes constant rates of transitions. Transitions correspond to: component failure component repair
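Disk Arrays One component system. States: Initial State; Failure State (absorbing). Failure transition at rate $\lambda$. MTTDL = MTTF = $1/\lambda$
Disk Arrays Two component system without repair. States: Initial State: 2 components working; 1 component working, one failed; Failure State (absorbing). Failure transitions at rates $2\lambda$ and $\lambda$, with $\lambda$ the failure rate of one component.
Disk Arrays Two component system with repair. States: Initial State: 2 components working; 1 component working, one failed; Failure State (absorbing). In addition to the failure transitions, a repair transition leads from the one-failed state back to the initial state.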
Disk Arrays How to calculate MTTF: Start with the original Markov model. Remove the failure state. Replace the transition(s) to the failure state with failure transitions to the initial state. This models a meta-system where we replace a failed system immediately with a new one. Now calculate the steady-state solution of the Markov model, which typically has become ergodic. Use this to calculate the average rate at which a failure transition is taken. The inverse of this rate gives the MTTF.
Disk Arrays One component system. Initial State only: the system is in the initial state all the time. The failure transition is taken at rate $\lambda$. Loss rate $L = \lambda$. MTTDL $= 1/L = 1/\lambda$
Disk Arrays Two component system without repair. States: Initial State: 2 components working; 1 component working, one failed. Steady-state solution: Let $x$ be the probability to be in state 2, $y$ the probability to be in state 1. Then: Outflow from state 2 = inflow into state 2: $2\lambda x = \lambda y$. Total sum of probabilities is 1: $x + y = 1$.
Disk Arrays Two component system without repair. States: Initial State: 2 components working; 1 component working, one failed. Steady-state solution: $2\lambda x = \lambda y$, $x + y = 1$. Solution is: $x = 1/3$, $y = 2/3$. Loss rate is $L = (2/3)\lambda$. MTTF $= 1/L = 1.5\,(1/\lambda)$ (1.5 times better than before).
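The steady-state calculation for the two component system without repair can be replayed with exact fractions. The function name `two_component_mttf` and the failure rate of one failure per 100,000 hours are assumptions for illustration; `lam` stands for the failure rate of one component.

```python
from fractions import Fraction


def two_component_mttf(failure_rate):
    """Steady-state method for two components without repair.

    Returns (x, y, mttf), where x is the probability of state 2
    (both working) and y the probability of state 1 (one failed).
    """
    lam = Fraction(failure_rate)
    # Balance equation 2*lam*x = lam*y combined with x + y = 1.
    x = lam / (2 * lam + lam)
    y = 1 - x
    loss_rate = lam * y        # rate at which the failure transition is taken
    return x, y, 1 / loss_rate


lam = Fraction(1, 100_000)     # assumed: one failure per 100,000 hours
x, y, mttf = two_component_mttf(lam)
print("x =", x, " y =", y, " MTTF =", mttf, "hours")

# Cross-check: time until the first failure plus time until the second.
print(mttf == 1 / (2 * lam) + 1 / lam)
```

The cross-check adds the expected time until the first of two components fails to the expected remaining life of the survivor, which gives the same 1.5 × (1/λ).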
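Disk Arrays Two component system with repair. States: Initial State: 2 components working; 1 component working, one failed.
Disk Arrays RAID Level 4/5 Reliability. States: Initial State: $n$ disks; $n-1$ disks; Failure State (absorbing). Failure transitions: $n\lambda$ from $n$ to $n-1$ disks, $(n-1)\lambda$ from $n-1$ disks to the failure state. Repair transition from $n-1$ disks back to $n$ disks.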
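Applying the steady-state method to the RAID Level 4/5 model gives a closed-form MTTDL. The sketch below assumes a per-disk failure rate `failure_rate` and a constant repair rate `repair_rate`. The repair-rate symbol did not survive in the transcript, so its name and the example values (MTTF of 100,000 hours, 24 hour repair) are assumptions.

```python
def raid5_mttdl(disks, failure_rate, repair_rate):
    """MTTDL of an n-disk RAID Level 4/5 array via the steady-state method."""
    n, lam, mu = disks, failure_rate, repair_rate
    # Failure state removed; its incoming transition goes to the initial state.
    # x = P(n disks working), y = P(n - 1 disks working), x + y = 1.
    # Balance for the initial state: n*lam*x = (mu + (n-1)*lam) * y.
    y = n * lam / (n * lam + mu + (n - 1) * lam)
    loss_rate = (n - 1) * lam * y   # rate of the data-loss transition
    return 1.0 / loss_rate


lam = 1.0 / 100_000   # assumed: disk MTTF of 100,000 hours
mu = 1.0 / 24         # assumed: a failed disk is rebuilt in 24 hours on average
print(f"4+1 array with repair: {raid5_mttdl(5, lam, mu):,.0f} hours")
print(f"4+1 array, no repair:  {raid5_mttdl(5, lam, 0.0):,.0f} hours")
```

Setting `disks=2` and `repair_rate=0` reproduces the 1.5 × (1/λ) result of the two component system without repair.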
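Disk Arrays RAID Level 6 Reliability. States: Initial State: $n$ disks; $n-1$ disks; $n-2$ disks; Failure State (absorbing). Failure transitions: $n\lambda$ from $n$ to $n-1$ disks, $(n-1)\lambda$ from $n-1$ to $n-2$ disks, $(n-2)\lambda$ from $n-2$ disks to the failure state. Repair transitions from $n-1$ back to $n$ disks and, at twice that rate, from $n-2$ back to $n-1$ disks.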
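For chains with more states, such as the RAID Level 6 model, the mean time to reach the failure state can be accumulated state by state. The sketch below uses a standard first-passage recursion in place of the steady-state method. The function names, the repair rate `mu`, and the assumption that two failed disks are repaired at rate `2 * mu` are illustrative readings of the diagram, not confirmed by the transcript.

```python
def mean_time_to_absorption(failure_rates, repair_rates):
    """Mean time from state 0 to the absorbing state of a linear chain.

    failure_rates[i] is the rate from state i to state i + 1,
    repair_rates[i] the rate from state i back to state i - 1
    (repair_rates[0] must be 0). The state after the last one is absorbing.
    """
    total = 0.0
    passage = 0.0   # mean time to move from state i - 1 to state i
    for fail, repair in zip(failure_rates, repair_rates):
        # Mean time to move from state i to state i + 1 for the first time.
        passage = 1.0 / fail + (repair / fail) * passage
        total += passage
    return total


def raid6_mttdl(disks, lam, mu):
    """States: n, n - 1, n - 2 disks working, then data loss."""
    n = disks
    return mean_time_to_absorption(
        [n * lam, (n - 1) * lam, (n - 2) * lam],
        [0.0, mu, 2.0 * mu],   # assumed: repairs proceed in parallel
    )


lam = 1.0 / 100_000   # assumed: disk MTTF of 100,000 hours
mu = 1.0 / 24         # assumed: 24 hour mean repair time
print(f"RAID 6, 6 disks: {raid6_mttdl(6, lam, mu):.3e} hours")
```

Passing only two failure rates and the repair rates `[0, mu]` gives the RAID Level 4/5 model, which makes it easy to compare the two levels under the same assumptions.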
Disk Arrays Sparing Create more resilience by adding a hot spare. Failover to hot spare reconstructs and replaces contents of the lost disk on spare disk. Distributed sparing (Menon et al.): Distribute the spare space throughout the disk array.