Failures in the System: Two major components in a node (applications and system)

2 Failures in the System. Two major components in a node: applications and system.

3 Failures in the System. [Diagram: two parallel stacks with layers labeled Application and System. Google: Bigtable, GFS, File Systems, Hard Drive. Nebraska: Cluster Scheduler, Hadoop, File Systems, Hard Drive.]

4 Failures in the System. Similar systems at Nebraska. [Same diagram.]

5 Failures in the System. Similar systems at Nebraska. [Same diagram, annotated: "Failure will cause unavailability".]

6 Failures in the System. Similar systems at Nebraska. [Same diagram, annotated: "Failure will cause unavailability" and "Could cause data loss".]

7 Unavailability: Defined. Data on a node is unreachable. Detection: periodic heartbeats are missing. Correction: unavailability lasts until the node comes back or the system recreates the data.
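
The detection rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the monitoring code of any real system: the function name `unreachable_nodes`, the dictionary of last-heartbeat timestamps, and the `timeout` threshold are all assumptions made for the example.

```python
def unreachable_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose most recent heartbeat is older than `timeout`.

    last_heartbeat: dict mapping node name -> time of last heartbeat (seconds)
    now:            current time (seconds, same clock as the heartbeats)
    timeout:        hypothetical threshold after which a node counts as unreachable
    """
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


if __name__ == "__main__":
    heartbeats = {"node-a": 100.0, "node-b": 40.0}
    print(unreachable_nodes(heartbeats, now=110.0, timeout=30.0))
```

In this sketch a node stays in the returned list until a fresh heartbeat arrives, which mirrors the "lasts until the node comes back" case on the slide; re-replication of the data is not modeled.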

8 Unavailability: Measured

9 Replication Starts

10 Unavailability: Measured. [Figure annotation: "Replication Starts".] Question: after replication starts, why does recovery take so long?

11 Node Availability. [Figure label: Storage Software Restart.]

12 Node Availability. [Figure label: Storage Software Restart.] Software is fast to restart.

13 Node Availability: Time. [Figure label: Planned Reboots.]

14 Node Availability: Time. [Figure label: Planned Reboots.] Node updates (planned reboots) cause the most downtime.

15 MTTF for Components. Even though disk failure can cause data loss, node failure is much more frequent. Conclusion: node failure matters more to system availability.

16 Correlated Failures. A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes. Losing nodes before replication can start can make data unavailable.
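
One way to make "failing in a burst" concrete is to group node-failure timestamps so that each failure in a group occurs within a fixed window of the previous one. The slides do not define a burst, so the grouping rule, the function name `group_into_bursts`, and the 120-second window used below are assumptions for illustration only.

```python
def group_into_bursts(failure_times, window):
    """Group failure timestamps into bursts.

    A failure joins the current burst when it occurs within `window`
    seconds of the previous failure (an assumed definition); otherwise
    it starts a new burst.
    """
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    times = [0, 50, 100, 500, 1000, 1060]  # made-up timestamps in seconds
    for burst in group_into_bursts(times, window=120):
        print(len(burst), burst)
```

Bursts containing many nodes are the ones that threaten replication, since several replicas of the same data may be lost before re-replication begins.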

17 Correlated Failures

18 Rolling Reboots of cluster

19 Correlated Failures Oh s*!t, datacenter on fire! (maybe not that bad)

20 Coping with Failure

21 Replication Encoding

22 Coping with Failure. [Figure labels: Replication; Encoding; 27.3 M years; 27,000 years.] 3 replicas is standard in large clusters.

23 Coping with Failure Cell Replication (Datacenter Replication)

24 Cell Replication. [Diagram: Block A stored in Cell 1 and in Cell 2.]

28 Modeling Failures. We've seen the data; now let's model the behavior.

29 Modeling Failures. A chunk of data can be in one of many states. Consider replication = 3. [State diagram: 3 → 2 → 1 → 0 available replicas.] Lose a replica, but still 2 available.

30 Modeling Failures. A chunk of data can be in one of many states. Consider replication = 3. [State diagram: 3 → 2 → 1 → 0 available replicas, with recovery transitions.] 0 replicas = service unavailable.

31 Modeling Failures. Each loss of a replica has a probability. The recovery rate is also known. [State diagram: 3 → 2 → 1 → 0 available replicas, with recovery transitions.] 0 replicas = service unavailable.

32 Markov Model. ρ = recovery rate; λ = failure rate; s = block replications; r = minimum replication.
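
As a rough sketch of how such a model yields a mean time to unavailability, the Python below treats the number of available replicas as a birth-death chain. The transition rates are assumptions, not taken from the slides: from a state with k replicas, one replica is lost at rate k·λ, and one is recovered at rate ρ whenever k < s. Data becomes unavailable when fewer than r replicas remain. The function name and parameter names are hypothetical.

```python
def mean_time_to_unavailability(failure_rate, recovery_rate, s, r=1):
    """Expected time to drop from s replicas to fewer than r replicas.

    Assumed chain: state k -> k-1 at rate k * failure_rate,
                   state k -> k+1 at rate recovery_rate (only when k < s).
    """
    if not 1 <= r <= s:
        raise ValueError("need 1 <= r <= s")
    if failure_rate <= 0 or recovery_rate < 0:
        raise ValueError("need failure_rate > 0 and recovery_rate >= 0")

    total = 0.0
    h_above = 0.0  # expected time to step down from the state above k
    for k in range(s, r - 1, -1):
        rho = recovery_rate if k < s else 0.0  # no recovery from a full set
        # Expected first-passage time from k to k-1 in a birth-death chain:
        # h_k = (1 + rho * h_{k+1}) / (k * lambda)
        h_k = (1.0 + rho * h_above) / (k * failure_rate)
        total += h_k
        h_above = h_k
    return total


if __name__ == "__main__":
    # Arbitrary units: faster recovery lengthens the time to unavailability.
    print(mean_time_to_unavailability(1.0, 0.0, s=2))
    print(mean_time_to_unavailability(1.0, 1.0, s=2))
```

With λ = 1 and s = 2, r = 1, the sketch gives 1.5 time units with no recovery and 2.0 with ρ = 1; real systems have ρ far larger than λ, which is what pushes the result into years.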

33 Modeling Failures  Using Markov models, we can find:

34 Modeling Failures  Using Markov models, we can find: Nebraska 402 Years

35 Modeling Failures  For Multi-Cell Implementations

36 Paper Conclusions. Given the enormous amount of data from Google, we can say: failures are typically short; node failures can happen in bursts and are not independent; in modern distributed file systems, disk failure is the same as node failure. The authors built a Markov model for failures that can accurately reason about past and future availability.

37 My Conclusions. This paper contributed greatly by showing data from very large-scale distributed file systems. If Reed-Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook? Complicated code? Complicated administration?
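
The efficiency claim about Reed-Solomon striping can be illustrated with simple arithmetic on storage overhead. The sketch below compares full replication with a Reed-Solomon layout of `data_blocks` data blocks plus `parity_blocks` parity blocks; the (6, 3) parameters are an illustrative assumption, not a configuration stated in the slides, and the function names are hypothetical.

```python
def replication_overhead(replicas):
    """Bytes stored per byte of user data, and block losses tolerated."""
    return float(replicas), replicas - 1


def reed_solomon_overhead(data_blocks, parity_blocks):
    """Bytes stored per byte of user data, and block losses tolerated."""
    stored = (data_blocks + parity_blocks) / data_blocks
    return stored, parity_blocks


if __name__ == "__main__":
    print("3 replicas:", replication_overhead(3))      # 3.0x storage, survives 2 losses
    print("RS(6, 3):  ", reed_solomon_overhead(6, 3))  # 1.5x storage, survives 3 losses
```

The storage saving comes at a price the slide hints at: rebuilding a lost block under Reed-Solomon requires reading several surviving blocks and decoding, whereas replication only copies one.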

