Availability in Globally Distributed Storage Systems


1 Availability in Globally Distributed Storage Systems
Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan
Presented By Ala`a Ibrahim

2 OUTLINE
Introduction
Disk failures
Correlated Failures
Fault Tolerance Mechanisms
Markov Model of Stripe Availability
Markov Model Findings
Conclusions

3 Data Center

4 Data Center Components
Server Components, Interconnects, Racks, Cluster of Racks

5 Data Center Components
ALL THESE COMPONENTS CAN FAIL
Server Components, Interconnects, Racks, Cluster of Racks

6 Cell, Stripe and Chunk
[Diagram: two cells (CELL 1, CELL 2) and two GFS instances (GFS Instance 1, GFS Instance 2), each showing Stripe 1 and Stripe 2 made up of chunks]

7 Failure Sources and Availability
Failure Sources
  Hardware: disks, memory, etc.
  Software: chunk server process
  Network interconnect
  Power distribution unit
Availability
  Reasons a node becomes unavailable:
  Overloaded
  Crash or restart
  Hardware error
  Automated repair processes

8

9 Disk failures
Node restarts
Planned machine reboots
Unplanned machine reboots
Unknown

10 Fault Tolerance Mechanisms
Replication (R = n)
  ‘n’ identical chunks (‘n’ is the replication factor) are placed across storage nodes in different racks/cells/DCs
Erasure Coding (RS (n, m))
  ‘n’ distinct data blocks and ‘m’ code blocks
  Can recover at most ‘m’ lost blocks from the remaining ‘n’ blocks
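The trade-off between the two mechanisms can be put in numbers. The sketch below is illustrative only: the class and function names are hypothetical, and the parameter values R = 3 and RS(6, 3) are chosen here for the example, not taken from the slides. It computes how many blocks are stored per block of user data and how many lost blocks a stripe survives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Hypothetical helper describing one stripe layout."""
    name: str
    data_blocks: int   # blocks holding original data in one stripe
    total_blocks: int  # all blocks stored for that stripe

    @property
    def storage_overhead(self) -> float:
        # Blocks stored per block of user data.
        return self.total_blocks / self.data_blocks

    @property
    def tolerated_losses(self) -> int:
        # Blocks that may be lost while the stripe stays recoverable.
        return self.total_blocks - self.data_blocks


def replication(r: int) -> Scheme:
    # R = r: one data block stored r times.
    return Scheme(f"R={r}", data_blocks=1, total_blocks=r)


def reed_solomon(n: int, m: int) -> Scheme:
    # RS(n, m): n data blocks plus m code blocks.
    return Scheme(f"RS({n},{m})", data_blocks=n, total_blocks=n + m)


if __name__ == "__main__":
    for scheme in (replication(3), reed_solomon(6, 3)):
        print(scheme.name, scheme.storage_overhead, scheme.tolerated_losses)
```

With these example values, RS(6, 3) survives one more lost block than R = 3 while storing half as many blocks per block of data.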

11 Replication
1 chunk, 5 replicas
Fast encoding / decoding
Very space inefficient

12 Erasure Coding
Encode: ‘n’ data blocks → ‘n + m’ blocks (‘n’ data blocks, ‘m’ code blocks)

13 Erasure Coding
Encode: ‘n’ data blocks → ‘n + m’ blocks (‘n’ data blocks, ‘m’ code blocks)

14 Erasure Coding
Encode: ‘n’ data blocks → ‘n + m’ blocks (‘n’ data blocks, ‘m’ code blocks)
Decode: ‘n + m’ blocks → ‘n’ data blocks
Highly space efficient
Slow encoding / decoding
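To make the encode/decode cycle concrete, here is a minimal sketch of the simplest possible erasure code: a single XOR parity block, which corresponds to m = 1. This is not Reed-Solomon, which needs finite-field arithmetic to support m > 1. The function names are hypothetical, and all blocks are assumed to have equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the n data blocks followed by one parity (code) block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def decode(stripe):
    """Recover the n data blocks from a stripe with at most one block lost.

    A lost block is represented by None.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair at most one lost block")
    stripe = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks reproduces the lost one.
        stripe[missing[0]] = xor_blocks(survivors)
    return stripe[:-1]  # drop the parity block


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stripe = encode(data)
    stripe[1] = None  # simulate losing one chunk
    assert decode(stripe) == data
```

Even in this toy version, decoding has to read every surviving block to rebuild one lost block, which hints at why erasure coding is slower than reading a replica.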

15 Correlated Failures
Failure Domain
  Set of machines that fail simultaneously from a common source of failure
Failure Burst
  Sequence of node failures, each occurring within a time window ‘w’ of the next
  Window: w = 120 s
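The failure-burst definition translates directly into a grouping pass over sorted failure timestamps. The sketch below assumes timestamps in seconds and treats a gap of exactly ‘w’ as still belonging to the same burst; the function name and the sample data are hypothetical.

```python
def group_failure_bursts(failure_times, window=120.0):
    """Group node-failure timestamps (seconds) into bursts.

    A failure joins the current burst when it occurs within `window`
    seconds of the previous failure; otherwise it starts a new burst.
    """
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    times = [0, 50, 100, 400, 1000, 1100]
    print(group_failure_bursts(times))
```

In this sketch an isolated failure comes back as a group of size one, so the burst sizes can be tallied directly from the result.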

16 Correlated Failures… Failure Burst (Window Size)

17 Markov Model
Chunk placement policy
Cell simulation
  Trace-based simulation
  Priority queue

18 Markov Chain
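The slide gives no transition details, so the following is only a generic birth-death sketch of a stripe-availability chain, not the model from the talk. It assumes that a state is the number of available chunks, that each available chunk fails independently at a constant rate, that one chunk is recovered at a time at a constant rate, and that there are no correlated bursts. All names and parameters are hypothetical. It returns the mean time until a fully healthy stripe first drops below the number of chunks it needs.

```python
def stripe_mttf(total_chunks, min_chunks, failure_rate, recovery_rate):
    """Mean time for a healthy stripe to become unavailable.

    States are the number of available chunks, from total_chunks down to
    min_chunks; falling below min_chunks is the absorbing "unavailable" state.
    Assumed rates: state k moves to k-1 at rate k * failure_rate and to
    k+1 at rate recovery_rate (no recovery in the fully healthy state).
    """
    step_down = 0.0  # expected time to go from state k+1 to state k
    mttf = 0.0
    for k in range(total_chunks, min_chunks - 1, -1):
        down = k * failure_rate
        up = recovery_rate if k < total_chunks else 0.0
        # D_k = 1/down + (up/down) * D_{k+1}: expected time from k to k-1.
        step_down = (1.0 + up * step_down) / down
        mttf += step_down
    return mttf


if __name__ == "__main__":
    # Two replicas, one needed; rates are in arbitrary "per time unit" terms.
    print(stripe_mttf(2, 1, failure_rate=1.0, recovery_rate=10.0))
```

Under these assumptions, raising the recovery rate lengthens the time to unavailability, which is the kind of sensitivity such a chain is meant to expose.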

19 Conclusion
The findings provide feedback for improving:
  Replication and encoding schemes
  Recovery rate

20

