Disk Failures. Eli Alshan.

Presentation transcript:

Disk Failures Eli Alshan

Agenda
Articles survey
– Failure Trends in a Large Disk Drive Population (article review, conclusions, criticism)
– Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (article review, conclusions, criticism)
Further research suggestion

Definitions
Disk failure: a drive is considered to have failed if it was replaced as part of a repair procedure
– 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturer
MTTF: Mean Time To Failure
AFR: Annualized Failure Rate
ARR: Annual Replacement Rate

Failure Trends in a Large Disk Drive Population
Analysis of the drives' self-monitoring data, collected from a large disk drive population
Attempt to isolate parameters highly correlated with disk failures

Results - Utilization
Only the very young and very old age groups appear to show the expected behavior (higher failure rates at higher utilization)
Possible explanation: infant mortality

Results - Temperature
Lower temperatures are associated with higher failure rates

Results – SMART
Scan errors: errors found during background surface scans
Reallocation count: number of sectors reallocated after recurring errors on the original sector
Offline reallocations: the subset of the reallocation count that covers only sectors reallocated during background scrubbing
Probational counts: sectors put “on probation” until they either fail permanently and are reallocated or continue to work without problems

Results – SMART

Scan errors affect the survival probability of young drives dramatically, but after the first month the curve flattens out
Older drives decline steadily in survival probability throughout the 8-month period
This behavior could be another manifestation of the infant mortality phenomenon

Results – SMART

Conclusions
No consistent pattern of higher failure rates was found for drives at higher temperatures or at higher utilization levels
Only a few SMART parameters are well correlated with higher failure probabilities
Of all failed drives, over 36% showed no count in any of the SMART signals and no temperature or utilization indication before failure

Criticism
Analyzing complex, correlated input data one parameter at a time might be misleading
Temperature and utilization should be time-windowed, so that readings closer to the failure receive more weight
Physical proximity between the tested drives must be taken into account, since nearby drives experience similar environmental conditions

Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
An analysis of seven data sets, with a focus on storage-related failures
– Disk replacement rates observed in the field, compared with the common predictors and models used by vendors
– Statistical properties of disk replacement rates

Disk Replacement Rates
The measured average ARR was 3.4 times larger than the 0.88% AFR given in the datasheet
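As a rough worked example of where the 0.88% figure comes from: a datasheet MTTF is commonly converted to a nominal AFR by dividing the hours in a year by the MTTF. The sketch below assumes continuous (24x7) operation and a 1,000,000-hour MTTF, as in the article title; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 power-on hours, assuming continuous operation


def afr_from_mttf(mttf_hours: float) -> float:
    """Nominal annualized failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


# MTTF of 1,000,000 hours gives 8760 / 1,000,000 = 0.876%, i.e. about 0.88%.
datasheet_afr = afr_from_mttf(1_000_000)

# An ARR 3.4 times larger than 0.88% is roughly 3% of drives replaced per year.
observed_arr = 3.4 * 0.0088

print(f"datasheet AFR: {datasheet_afr:.3%}")
print(f"observed ARR:  {observed_arr:.2%}")
```

Under these assumptions, about 3 out of every 100 drives are replaced each year instead of fewer than 1.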

Disk Replacement Rates
Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation, but steadily increase over time

Statistical properties of disk failures
The hypothesis that the time between disk replacements follows an exponential distribution can be rejected with high confidence
The distribution of time between disk replacements exhibits decreasing hazard rates
The time between disk replacements is fit best by gamma and Weibull distributions
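To illustrate what a decreasing hazard rate means, the sketch below evaluates the Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1). With shape k = 1 the Weibull reduces to the exponential distribution and the hazard is constant; with k < 1 the hazard falls as the time since the last replacement grows. The shape and scale values are illustrative assumptions, not numbers taken from the slides.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Hours since the previous replacement (illustrative values).
times = (1_000.0, 10_000.0, 100_000.0)

# shape < 1: the hazard decreases with time since the last replacement.
decreasing = [weibull_hazard(t, shape=0.7, scale=100_000.0) for t in times]

# shape == 1: exponential case, the hazard is the same at every t.
constant = [weibull_hazard(t, shape=1.0, scale=100_000.0) for t in times]

for t, d, c in zip(times, decreasing, constant):
    print(f"t={t:>9.0f} h  weibull(k=0.7)={d:.3e}  exponential={c:.3e}")
```

A decreasing hazard means that the longer it has been since the last replacement, the less likely the next one is to happen soon, which a memoryless exponential model cannot express.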

Statistical properties of disk failures
The statistical analysis presents strong evidence for the existence of correlations between disk replacement intervals
In particular, the empirical data exhibit significant levels of autocorrelation and long-range dependence
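A minimal sketch of how autocorrelation in a replacement series could be measured: the sample autocorrelation at a given lag compares each value with the value that many periods earlier. The function name and the two toy series are hypothetical; real input would be something like replacement counts per period from a field data set.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Toy data: a steadily rising series is positively correlated with itself
# at lag 1, while an alternating series is negatively correlated.
rising = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

print(autocorrelation(rising, 1))
print(autocorrelation(alternating, 1))
```

For independent, exponentially distributed intervals the autocorrelation would be close to zero at every lag, so values that stay well away from zero argue against that model.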

Conclusions
The article demonstrates that the MTTF and AFR figures provided by disk vendors are unreliable
Based on the data analysis, the paper's authors find a significant correlation between disk failure intervals
The paper shows with high statistical confidence that the commonly made assumption of exponentially distributed time between failures is not realistic
The article identifies higher levels of variability and decreasing hazard rates as the key features that distinguish the empirical distribution of time between disk replacements from the exponential distribution

Criticism
The data set is relatively small, which might undermine the thorough statistical analysis performed
The statistical model suggested in the article seems too simplistic to describe a system as complex as a disk in a drive population

Further research suggestion
State machine disk health model (HMM)
State estimation:
– Vector of drive health indicators
– Current state of the drives physically close to the drive
Parameter estimation:
– BIC + EM (Baum-Welch algorithm) …
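The slide names only the building blocks (HMM, BIC, EM), so the following is a sketch under stated assumptions rather than the proposed model. It shows the scaled forward algorithm, which computes the log-likelihood of an observation sequence under a discrete HMM, and the BIC score used to compare models of different sizes. The two hidden states ("healthy", "degraded"), the binary observation (SMART error seen or not) and all probabilities are hypothetical, and Baum-Welch training itself is not shown.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of obs under a discrete HMM (scaled forward algorithm)."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    states = range(len(start))
    # Initialization: probability of each state emitting the first symbol.
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Recursion: move one step through the state machine, then emit.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs[t]]
                for s in states
            ]
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


# Hypothetical model: state 0 = healthy, state 1 = degraded (absorbing).
start = [0.9, 0.1]
trans = [[0.95, 0.05], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.3, 0.7]]  # columns: 0 = no SMART error, 1 = error seen
obs = [0, 0, 1]

log_lik = forward_log_likelihood(obs, start, trans, emit)
# Free parameters for 2 states and 2 symbols: 1 start + 2 transition + 2 emission.
print(log_lik, bic(log_lik, n_params=5, n_obs=len(obs)))
```

In an EM setting, a log-likelihood like this would be recomputed after each Baum-Welch iteration, and BIC would then be used to choose the number of hidden states.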