Resilience at Scale: The importance of real world data. Bianca Schroeder, Computer Science Department, University of Toronto.

Bianca Schroeder © December 15

Reliability is important: Failures are frustrating and expensive. They might get worse in the future with increasing scale and component count. Why has there not been more progress?

Failures are not very well understood: “Much academic and corporate research is based on anecdotes and back of the envelope calculations” [Schwarz06]. “Most papers use simplistic assumptions about component failures..” [Patterson99]. Why? No publicly available data on failures in real systems.

Examples from real-world data. Types of failures covered:
Cluster node outages (records of more than 23,000 outages)
Storage failures (data covering more than 100,000 drives)
DRAM errors
Publications:
[FAST 07] Joint w/ Gibson. Best paper award.
[SciDAC 07] Joint w/ Gibson.
[FAST 08] Joint w/ Bairavasundaram. Best paper award.
[TOS 08] Joint w/ Bairavasundaram et al.
[DSN 06] Joint w/ Gibson.
[TDSC 08] Joint w/ Gibson.
[Sigmetrics 09] Joint w/ Pinheiro, Weber. Best presentation award.

The data. Hard drive failures: data covers > 100,000 drives; SATA, FC, SCSI; enterprise and HEC environments. Errors in DRAM: data read differently from how it was written; both correctable & uncorrectable, soft & hard; data covers all of Google’s fleet; DDR1, DDR2, FBDIMM; 5 different manufacturers, 6 hardware platforms.

Frequency of errors in today’s systems. Example 1 [Sigmetrics’09]: DRAM errors in the field. Dominated by hard errors, not soft errors. Not getting worse with newer generations. [Figure: number of correctable errors (CEs) per year by hardware platform, field vs. lab tests.] Example 2 [FAST’06,TOS’07]: HDD replacements in the field. SATA not less reliable than SCSI & FC. [Figure: data sheet vs. field.] Accelerated lab tests and vendor data sheets are not enough. Need real field data!
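To make the data sheet versus field comparison concrete, here is a minimal Python sketch. It assumes the common approximation that a datasheet MTTF implies a nominal annual failure rate of roughly hours-per-year divided by MTTF, and it compares that with an observed annual replacement rate. The function names and all numbers are hypothetical illustrations, not values from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 power-on hours, assuming continuous operation


def datasheet_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate implied by a datasheet MTTF.

    Uses the usual small-rate approximation AFR ~= hours per year / MTTF.
    """
    return HOURS_PER_YEAR / mttf_hours


def observed_arr(replacements: int, drive_years: float) -> float:
    """Annual replacement rate seen in the field: replacements per drive-year."""
    return replacements / drive_years


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    nominal = datasheet_afr(1_000_000)   # datasheet MTTF of 1,000,000 hours
    field = observed_arr(300, 10_000)    # 300 replacements over 10,000 drive-years
    print(f"datasheet AFR: {nominal:.2%}")
    print(f"field ARR:     {field:.2%}")
    print(f"field / datasheet ratio: {field / nominal:.1f}x")
```

A gap between the two numbers is the kind of discrepancy the slide points to; only field data can show how large it is for a given population.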

Effect of age? Nominal lifetime: 5 years. Theory: little effect during nominal lifetime. Practice [FAST’06,sigmetrics’09]: surprisingly early wear-out; infant mortality no concern. [Figures: HDD replacements; DRAM errors.]

Effect of temperature? Theory: effect known from lab experiments. Practice [FAST’06,sigmetrics’09]: unclear effect in the field. Similar results for latent sector errors in hard drives. [Figures: HDD replacements; DRAM errors (axes: time, error rate).]

Statistical properties? Theory: Poisson process (independent failures, exponential time between failures). Practice [FAST’06,sigmetrics’09]: correlations (autocorrelation, long-range dependence); long tails in time between failures. [Figure: expected number of failures in a week vs. number of failures in the previous week (small, medium, large), comparing the data with the independence assumption.]
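The independence check described above can be sketched in a few lines of Python. The sketch computes the lag-1 autocorrelation of weekly failure counts and the mean count in weeks that follow a quiet week versus a busy week. Under independence the two means should be about equal and the autocorrelation near zero. The function names, the threshold, and the weekly counts are made up for illustration and are not the data from the talk.

```python
from statistics import mean


def lag1_autocorrelation(counts):
    """Lag-1 sample autocorrelation of a series of weekly failure counts."""
    m = mean(counts)
    den = sum((c - m) ** 2 for c in counts)
    if den == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((a - m) * (b - m) for a, b in zip(counts, counts[1:]))
    return num / den


def expected_given_previous(counts, threshold):
    """Mean count in weeks following a quiet week (< threshold) and a busy week (>= threshold)."""
    pairs = list(zip(counts, counts[1:]))
    after_low = [b for a, b in pairs if a < threshold]
    after_high = [b for a, b in pairs if a >= threshold]
    return (
        mean(after_low) if after_low else None,
        mean(after_high) if after_high else None,
    )


if __name__ == "__main__":
    # Made-up weekly failure counts with visible bursts.
    weekly = [0, 0, 1, 4, 5, 3, 0, 0, 1, 0, 6, 4, 2, 0, 0]
    low, high = expected_given_previous(weekly, threshold=3)
    print("lag-1 autocorrelation:", round(lag1_autocorrelation(weekly), 3))
    print("mean after quiet week:", low)
    print("mean after busy week: ", high)
```

On real logs, a clearly positive autocorrelation or a large gap between the two conditional means would argue against the Poisson assumption.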

Failures are not very well understood: Failures often look different from common assumptions, even for basic properties such as frequency, the impact of factors such as age, workload, and environmental factors, and statistical properties. Found this to be true for various types of errors: hard drive replacements, memory errors, cluster node outages, latent sector errors, data corruption. Does it matter?

Probability of a RAID failure: Depends on probability of second failure during reconstruction. Approach 1: Use datasheet MTTF and exponential distribution. Approach 2: Use measured MTTF and exponential distribution. Approach 3: Use Weibull distribution fit to data. [Figure: probability (%) of a RAID failure vs. reconstruction time for the three approaches; y-axis scale on the order of 10^-3 to 10^-2.]
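The three approaches can be compared with a small Python sketch. It rests on simplifying assumptions that are not from the talk: the time until the next drive failure in the array is modeled as a single distribution with mean MTTF divided by the number of drives, and the Weibull variant keeps that mean while using a shape parameter below 1 (decreasing hazard rate). All parameter values and function names are hypothetical.

```python
import math


def p_second_failure_exponential(recon_hours, mean_tbf_hours):
    """P(next failure within the reconstruction window), exponential model."""
    return -math.expm1(-recon_hours / mean_tbf_hours)


def weibull_scale_for_mean(mean_hours, shape):
    """Weibull scale that yields the given mean: mean = scale * Gamma(1 + 1/shape)."""
    return mean_hours / math.gamma(1.0 + 1.0 / shape)


def p_second_failure_weibull(recon_hours, mean_tbf_hours, shape):
    """P(next failure within the reconstruction window), Weibull model with the same mean."""
    scale = weibull_scale_for_mean(mean_tbf_hours, shape)
    return -math.expm1(-((recon_hours / scale) ** shape))


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    n_drives = 8
    recon_hours = 12.0
    datasheet_mttf = 1_000_000.0  # hours, as a vendor might quote
    measured_mttf = 300_000.0     # hours, assumed field value
    shape = 0.7                   # assumed Weibull shape (< 1: decreasing hazard)

    a1 = p_second_failure_exponential(recon_hours, datasheet_mttf / n_drives)
    a2 = p_second_failure_exponential(recon_hours, measured_mttf / n_drives)
    a3 = p_second_failure_weibull(recon_hours, measured_mttf / n_drives, shape)
    for name, p in [("datasheet + exponential", a1),
                    ("measured + exponential", a2),
                    ("measured + Weibull", a3)]:
        print(f"{name:26s} {100 * p:.4f} %")
```

With a shape of 1 the Weibull model reduces to the exponential one, so any difference between the second and third results comes entirely from the assumed shape parameter.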

Conclusion: Failures are often not well understood. It matters when designing systems! Need real-world data!

The computer failure data repository (CFDR): Gather & publish real failure data. Community effort. Usenix clearinghouse. Data on all aspects of system failure. Anonymized as needed (www.cfdr.usenix.org). Do you have any data to contribute?
