Availability in Globally Distributed Storage Systems. Presented by Ala`a Ibrahim. Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely,


1 Availability in Globally Distributed Storage Systems. Presented by Ala`a Ibrahim. Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan

2 OUTLINE: Introduction; Disk Failures; Correlated Failures; Fault Tolerance Mechanisms; Markov Model of Stripe Availability; Markov Model Findings; Conclusions

3 Data Center

4 Data Center Components: Server Components; Racks; Interconnects; Cluster of Racks

5 Data Center Components: Server Components; Racks; Interconnects; Cluster of Racks. ALL THESE COMPONENTS CAN FAIL

6 Cell, Stripe and Chunk [Figure labels: CELL 1 (GFS Instance 1) and CELL 2 (GFS Instance 2), each holding Stripe 1 and Stripe 2; the stripes are made up of chunks]

7 Failure Sources: Hardware (disks, memory, etc.); Software (chunk server process); Network interconnect; Power distribution unit. Availability: reasons a node becomes unavailable are that it is overloaded, it crashes or restarts, it has a hardware error, or it is taken down by automated repair processes.

8

9 Disk Failures. Categories shown: node restarts; planned machine reboots; unplanned machine reboots; unknown

10 Fault Tolerance Mechanisms. Replication (R = n): n identical chunks (n is the replication factor) are placed across storage nodes in different racks, cells, or data centers. Erasure Coding (RS(n, m)): each stripe has n distinct data blocks and m code blocks; at most m lost blocks can be recovered from the remaining n blocks.
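The trade-off between the two mechanisms can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the slides: the function names and the example parameters (R = 3, RS(6, 3)) are illustrative assumptions. It reports how many stored bytes each scheme needs per byte of data and how many chunk losses a stripe survives.

```python
def replication_profile(r):
    """R = r: r identical copies of each chunk (hypothetical helper)."""
    return {
        "stored_per_data_byte": float(r),  # r full copies are stored
        "tolerated_losses": r - 1,         # one surviving copy is enough
    }


def reed_solomon_profile(n, m):
    """RS(n, m): n data blocks plus m code blocks (hypothetical helper)."""
    return {
        "stored_per_data_byte": (n + m) / n,  # n + m blocks hold n blocks of data
        "tolerated_losses": m,                # any n of the n + m blocks suffice
    }


if __name__ == "__main__":
    print("R = 3    :", replication_profile(3))
    print("RS(6, 3) :", reed_solomon_profile(6, 3))
```

Under these example parameters the erasure-coded stripe tolerates more lost chunks than three-way replication while storing half as many bytes, which is the space-efficiency point the following slides make.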

11 Replication: 1 chunk, 5 replicas. Fast encoding/decoding; very space inefficient

12 Erasure Coding [Figure: n data blocks are encoded into n + m blocks, adding m code blocks]

13 Erasure Coding [Figure: n data blocks are encoded into n + m blocks, adding m code blocks]

14 Erasure Coding: highly space efficient; slow encoding/decoding [Figure: n data blocks are encoded into n + m blocks (m code blocks), which are decoded back into the n data blocks]
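The encode/decode cycle in the figure can be illustrated with the simplest possible erasure code. This sketch deliberately swaps Reed-Solomon for single XOR parity, which behaves like an (n, 1) code: n data blocks plus one code block, able to rebuild any one lost block. The function names are hypothetical, and all blocks are assumed to have equal length.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data_blocks):
    """Return the stripe: the n data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the other n blocks of the stripe."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stripe = encode(data)
    # Pretend block 1 was lost and rebuild it from the survivors.
    print(recover(stripe, 1))
```

A real RS(n, m) code generalizes this to m code blocks using finite-field arithmetic, which is where the slower encoding and decoding noted on the slide comes from.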

15 Correlated Failures. Failure domain: a set of machines that fail simultaneously from a common source of failure. Failure burst: a sequence of node failures, each occurring within a time window w of the next. Window: w = 120 s
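The failure-burst definition above translates directly into a grouping pass over sorted failure timestamps. This is a minimal sketch under stated assumptions: timestamps are in seconds, the function name is hypothetical, and a gap of exactly w is treated as belonging to the same burst (the slide does not say whether the boundary is inclusive).

```python
def group_into_bursts(failure_times, window=120.0):
    """Split node-failure timestamps (seconds) into failure bursts.

    A failure joins the current burst when it occurs within `window`
    seconds of the previous failure; otherwise it starts a new burst.
    """
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    times = [0, 100, 200, 500, 1000, 1050]
    for burst in group_into_bursts(times):
        print(len(burst), burst)
```

Note that a burst can span much longer than w in total, because only the gap between consecutive failures is compared against the window.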

16 Correlated Failures… Failure Burst (Window Size)

17 Markov Model: Chunk placement policy; Cell simulation; Trace-based simulation; Priority queue

18 Markov Chain
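Since the chain diagram itself did not survive in this transcript, here is a rough sketch of how a Markov chain can yield a stripe's mean time to unavailability. It is a simplification and not the model from the presentation: it assumes independent chunk failures at a constant per-chunk rate, a single constant recovery rate that restores one chunk at a time, and no correlated (multi-chunk) transitions. All names and parameters are hypothetical. The state is the number of available chunks; the stripe becomes unavailable when that count drops below the minimum needed.

```python
def stripe_mttf(total_chunks, min_chunks, failure_rate, recovery_rate):
    """Mean time until fewer than min_chunks chunks are available.

    Birth-death chain on s = number of available chunks:
      s -> s - 1 at rate s * failure_rate
      s -> s + 1 at rate recovery_rate (only when s < total_chunks)
    Starting state is s = total_chunks.
    """
    mttf = 0.0
    time_down_from_above = 0.0  # expected time to fall from s + 1 to s
    for s in range(total_chunks, min_chunks - 1, -1):
        down = s * failure_rate
        up = recovery_rate if s < total_chunks else 0.0
        # First-passage recurrence: h_s = (1 + up * h_{s+1}) / down
        time_down = (1.0 + up * time_down_from_above) / down
        mttf += time_down
        time_down_from_above = time_down
    return mttf


if __name__ == "__main__":
    # Illustrative rates only (per day); not measurements from the slides.
    print("R = 3    :", stripe_mttf(3, 1, failure_rate=0.01, recovery_rate=10.0))
    print("RS(6, 3) :", stripe_mttf(9, 6, failure_rate=0.01, recovery_rate=10.0))
```

Raising the recovery rate increases the result, which matches the conclusion slide's emphasis on recovery rate as a lever for availability.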

19 Conclusion: The findings provide feedback for improving replication and encoding schemes and the recovery rate.

20

