Disk Failures Eli Alshan. Agenda Articles survey – Failure Trends in a Large Disk Drive Population – Article review – Conclusions – Criticism – Disk failure.

1 Disk Failures Eli Alshan

2 Agenda
Articles survey
– Failure Trends in a Large Disk Drive Population
  – Article review
  – Conclusions
  – Criticism
– Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
  – Article review
  – Conclusions
  – Criticism
Further research suggestion

3 Definitions
Disk failure - a drive is considered to have failed if it was replaced as part of a repair procedure
– 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturer
MTTF - Mean Time To Failure
AFR - Annualized Failure Rate
ARR - Annual Replacement Rate

4 Failure Trends in a Large Disk Drive Population
Analysis of drive self-monitoring data collected from a large disk drive population
Attempt to isolate parameters highly correlated with disk failures

5 Results - Utilization
Only the very young and very old age groups show the expected behavior (higher utilization associated with higher failure rates)
Possible explanation - infant mortality

6 Results - Temperature Lower temperatures are associated with higher failure rates

7 Results – SMART
Scan errors – errors found during background surface scans
Reallocation count – count of sector reallocations triggered by recurring errors in a sector
Offline reallocations – the subset of reallocation counts that includes only reallocated sectors found during background scrubbing
Probational counts – sectors "on probation" until they either fail permanently and are reallocated, or continue to work without problems

8 Results – SMART

9 Scan errors affect the survival probability of young drives dramatically, but after the first month the curve flattens out
Older drives decline steadily in survival probability throughout the 8-month period
This behavior could be another manifestation of the infant mortality phenomenon

10 Results – SMART

11 Conclusions
No consistent pattern of higher failure rates was found for drives at higher temperatures or at higher utilization levels
A few SMART parameters are well correlated with higher failure probabilities
Over 36% of all failed drives showed no count in any of the SMART signals, and no temperature or utilization indication, before failure

12 Criticism
Analyzing complex, correlated input data one parameter at a time might be misleading
Temperature and utilization should be time-windowed, so that readings closer to the failure receive more weight
Physical proximity between the tested drives must be taken into account, since neighboring drives experience similar environmental conditions

13 Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
An analysis of seven data sets, with a focus on storage-related failures
– Disk replacement rates observed in the field, compared with the common predictors and models used by vendors
– Statistical properties of disk replacement rates

14 Disk Replacement Rates
The measured average ARR was 3.4 times larger than the 0.88% given in the datasheet
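
The 0.88% datasheet figure follows from the 1,000,000-hour MTTF in the article's title: dividing the hours in a year by the MTTF gives the nominal annualized failure rate. A minimal sketch of that arithmetic (the function name is illustrative, not taken from the article):

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def afr_from_mttf(mttf_hours: float) -> float:
    """Nominal annualized failure rate (as a fraction) implied by an MTTF."""
    return HOURS_PER_YEAR / mttf_hours


# Datasheet MTTF of 1,000,000 hours, as in the article's title.
datasheet_afr = afr_from_mttf(1_000_000)

# The slide's claim: measured average ARR is 3.4 times the 0.88% datasheet value.
observed_arr = 3.4 * 0.0088

print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"observed ARR:  {observed_arr:.2%}")
```

Under this conversion the datasheet value rounds to 0.88% per year, while the measured average works out to roughly 3% of drives replaced per year.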

15 Disk Replacement Rates Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation, but steadily increase over time

16 Statistical properties of disk failures
The hypothesis that the time between disk replacements follows an exponential distribution can be rejected with high confidence
The distribution of time between disk replacements exhibits decreasing hazard rates
Disk replacements are fit best by the gamma and Weibull distributions
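
To see why a Weibull fit can capture decreasing hazard rates while an exponential cannot, it helps to evaluate the Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1). The sketch below uses illustrative parameters only (shape 0.7, scale 100); they are assumptions for the example, not values fitted to the article's data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


times = [1.0, 10.0, 100.0, 1000.0]  # arbitrary time units

# Shape < 1: the hazard rate falls as time since the last replacement grows.
decreasing = [weibull_hazard(t, shape=0.7, scale=100.0) for t in times]

# Shape == 1 reduces to the exponential distribution: constant hazard 1 / scale.
constant = [weibull_hazard(t, shape=1.0, scale=100.0) for t in times]

for t, h_dec, h_const in zip(times, decreasing, constant):
    print(f"t={t:7.1f}  shape 0.7: {h_dec:.5f}  shape 1.0: {h_const:.5f}")
```

With shape 1 the hazard is flat, which is the memoryless behavior the exponential model assumes; any shape below 1 gives the decreasing hazard described on the slide.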

17 Statistical properties of disk failures
The statistical analysis presents strong evidence for the existence of correlations between disk replacement intervals
In particular, the empirical data exhibit significant levels of autocorrelation and long-range dependence
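
Autocorrelation here means that the replacement count in one period is correlated with the count a fixed number of periods later. The following sketch computes the sample autocorrelation at a given lag; the weekly counts are made-up illustrative numbers, not data from the article.

```python
def autocorrelation(xs: list[float], lag: int) -> float:
    """Sample autocorrelation of xs at the given lag (lag 0 gives 1.0)."""
    n = len(xs)
    mean = sum(xs) / n
    variance_sum = sum((x - mean) ** 2 for x in xs)
    covariance_sum = sum(
        (xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Hypothetical replacements per week, clustered in bursts.
weekly_replacements = [0, 0, 1, 5, 6, 4, 1, 0, 0, 0, 1, 4, 6, 5, 1, 0]

for lag in (1, 2, 3):
    value = autocorrelation(weekly_replacements, lag)
    print(f"lag {lag}: {value:+.3f}")
```

For independent, exponentially distributed replacement intervals the autocorrelation at non-zero lags would be close to zero; a clearly positive lag-1 value, as in this bursty example, indicates that busy weeks tend to follow busy weeks.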

18 Conclusions
The article demonstrates the unreliability of the MTTF and AFR figures provided by disk vendors
Based on the data analysis, the paper's authors find a significant correlation between disk failure intervals
The paper shows with high statistical confidence that the commonly made assumption of exponentially distributed time between failures is not realistic
The article identifies higher levels of variability and decreasing hazard rates as the key features that distinguish the empirical distribution of time between disk replacements from the exponential distribution

19 Criticism
The data set is relatively small, which might invalidate the thorough statistical analysis performed on it
The statistical model suggested in the article seems too simplistic to describe a system as complex as a disk drive population

20 Further research suggestion
State machine disk health model (HMM)
State estimation:
– Vector of drive health indicators
– Current state of the drives physically close to the drive
Parameter estimation:
– BIC + EM (Baum-Welch Algorithm)
…
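
As a rough illustration of what state estimation in such a model could look like, the sketch below runs the HMM forward (filtering) recursion over a sequence of health observations. Everything in it is a hypothetical assumption made for the example: the two states, the single binary indicator, and all probabilities. It does not model neighboring drives, and it takes the parameters as given rather than estimating them with Baum-Welch.

```python
# Hypothetical two-state disk health HMM; all numbers are illustrative.
STATES = ("healthy", "degraded")
START = {"healthy": 0.95, "degraded": 0.05}
TRANS = {
    "healthy": {"healthy": 0.98, "degraded": 0.02},
    "degraded": {"healthy": 0.0, "degraded": 1.0},  # degradation is permanent
}
EMIT = {
    "healthy": {"ok": 0.99, "error": 0.01},
    "degraded": {"ok": 0.60, "error": 0.40},
}


def forward_filter(observations: list[str]) -> dict[str, float]:
    """Return P(state | observations so far) after the last observation."""
    belief = dict(START)
    for step, obs in enumerate(observations):
        if step > 0:
            # Predict: push the current belief through the transition model.
            belief = {
                s: sum(belief[p] * TRANS[p][s] for p in STATES) for s in STATES
            }
        # Update: weight by the likelihood of the observation, then normalize.
        weighted = {s: belief[s] * EMIT[s][obs] for s in STATES}
        total = sum(weighted.values())
        belief = {s: w / total for s, w in weighted.items()}
    return belief


print(forward_filter(["ok", "ok", "ok"]))
print(forward_filter(["ok", "error", "error"]))
```

In a fuller version of the idea, the single indicator would be replaced by the vector of drive health indicators from the slide, and the transition probabilities could be conditioned on the estimated state of physically nearby drives.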

