Ensieea Rizwani: Disk Failures in the Real World

Presentation transcript:

Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? Bianca Schroeder and Garth A. Gibson, Carnegie Mellon University. Presented by Ensieea Rizwani. (MTTF: mean time to failure; 1,000,000 = one million.)

Motivation: Storage failures cause not only temporary data unavailability but, in the worst case, permanent data loss. Technology trends and market forces may make storage system failures occur more frequently in the future. With ever larger server clusters, maintaining high levels of reliability and availability is a growing problem for many sites, including high-performance computing systems and internet service providers. Finally, the size of storage systems in modern, large-scale IT installations has grown to a scale of thousands of storage devices, making component failures the norm rather than the exception.

What is Disk Failure? Common assumption: disk failures follow a simple "fail-stop" model, in which a disk either works perfectly or fails absolutely, and in an easily detectable manner. In reality, disk failures are much more complex. Why?

Complexities of Disk Failure: Often it is hard to correctly attribute the root cause of a problem to a particular hardware component. Example: if a drive experiences a latent sector fault or a transient performance problem, it is often hard to pinpoint the root cause. (Analogy: a headache that sends you to the doctor may never be traced to a single cause.)

Complexities of Disk Failure: There is no single definition of when a drive is faulty; customers and vendors might use different definitions.
1. A common way for a customer to test a drive is to read all of its sectors to see if any reads experience problems, and to declare the drive faulty if any one operation takes longer than a certain threshold. Many sites follow a "better safe than sorry" mentality and use even more rigorous testing. The outcome of such a test depends on how the thresholds are chosen.
2. As a result, it cannot be ruled out that a customer declares a disk faulty while its manufacturer sees it as healthy. This also means that the definition of "faulty" that a drive customer uses does not necessarily match the definition that a drive manufacturer uses to make drive reliability projections.
3. In fact, one disk vendor has reported that for 43% of all disks returned by customers, it finds no problem with the disk.

Challenge! Unfortunately, many aspects of disk failures in real systems are not well understood, probably because the owners of such systems are reluctant to release failure data or do not gather such data.

Research Data for This Paper: The authors sent requests to a number of large production sites and were able to convince several of them to provide failure data from some of their systems. The paper analyzes seven data sets, collected at high-performance computing (HPC) sites and at large internet service sites.

Data Sets: The data sets consist primarily of hardware replacement logs, vary in duration from one month to five years, cover a population of more than 100,000 drives from at least four different vendors, and include drives with SCSI, FC, and SATA interfaces. Although 100,000 drives is a very large sample relative to previously published studies, it is small compared to the estimated 35 million enterprise drives, and 300 million total drives, built in 2006. SCSI and FC interfaces are commonly represented as the most reliable types of disk drives, while SATA interfaces are common in desktop and nearline systems.

Research Data Sets: The research is based on hardware replacement records and logs. The paper analyzes records from a number of large production systems; the logs contain a record, with the observed conditions, for every disk that was replaced in the system during the data collection period.

Data Set Analysis: After a disk drive is identified as the likely culprit in a problem, the operations staff perform a series of tests on the drive to assess its behavior. If the behavior qualifies as faulty according to the customer's definition, the disk is replaced and a corresponding entry is made in the hardware replacement log.

Factors That Failure Depends On: operating conditions; environmental factors such as temperature and humidity; data handling procedures; workloads; and duty cycles or powered-on-hours patterns.

Effect of a Bad Batch: The failure behavior of disk drives may differ even when they are of the same model. Changes in the manufacturing process and parts can play a huge role, including a drive's hardware or firmware components and the assembly line on which a drive was manufactured. A bad batch can lead to unusually high drive failure rates or high media error rates.

Bad Batch Example: At HPC3, one of the data set customers, 11,000 SATA drives were replaced in Oct. 2006 because of a high frequency of media errors during writes. It took a year to resolve; the customer and vendor eventually agreed that these drives did not meet warranty conditions. The cause was attributed to the breakdown of a lubricant during manufacturing, leading to unacceptably high head flying heights.

Bad Batch: The effect of batches is not analyzed in the paper. The research reports on the field experience, in terms of disk replacement rates, of a set of drive customers, and customers usually do not have the information necessary to determine which of the drives they are using come from the same or different batches. However, since the data spans a large number of drives (more than 100,000) and comes from a diverse set of customers and systems, it likely averages over many different batches.

Research Failure Data Sets (summary of the paper's data set table): the data come from supercomputers, internet service providers, warranty services, and large external storage systems used by internet services. The drives have SATA, FC, and SCSI interfaces; the systems include compute nodes and file system nodes; drive rotation speeds are given in revolutions per minute (e.g., 10K RPM).

Reliability Metrics: Vendors specify the annualized failure rate (AFR), the failure rate scaled to a yearly estimate, and the mean time to failure (MTTF), related by MTTF = yearly power-on hours / AFR. Field data is summarized by the annual replacement rate (ARR). MTTFs specified for today's highest-quality disks range from 1,000,000 to 1,500,000 hours. Note that ARR is not the same as AFR: ARR counts replacements observed in the field, which need not match the vendor's definition of a failure.
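
To make the gap between datasheet and field numbers concrete, here is a minimal Python sketch (not from the paper) that converts a datasheet MTTF into the AFR it implies, under the assumption that a drive is powered on around the clock (8,760 hours per year); the names HOURS_PER_YEAR and datasheet_afr are my own.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes the drive is powered on 24x7

def datasheet_afr(mttf_hours: float) -> float:
    """Annualized failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours

for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} hours  ->  implied AFR {datasheet_afr(mttf):.2%}")
# Prints roughly 0.88% and 0.58%; the field ARRs reported in the paper are often
# several times higher (the conclusion slide cites a factor of 2 to 10).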

Hazard Rate h(t): For the time between disk replacements, the hazard rate is h(t) = f(t) / (1 - F(t)), where t denotes the time since the most recently observed failure, f(t) is the probability density of the time between failures, and F(t) is its cumulative distribution function. h(t) describes the instantaneous failure rate at time t after the last failure.

Hazard Rate Analysis: A constant hazard rate implies that the probability of failure at a given point in time does not depend on how long it has been since the most recent failure. An increasing hazard rate means that the probability of a failure grows the longer it has been since the last failure. A decreasing hazard rate means that the probability of a failure shrinks the longer it has been since the last failure.
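
As an illustrative sketch (not from the paper), the Weibull family makes these three cases concrete: with the scale parameter fixed, a shape parameter below 1 gives a decreasing hazard rate, exactly 1 gives a constant hazard rate (the exponential case), and above 1 gives an increasing hazard rate. The 10,000-hour scale below is a made-up value for illustration.

def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Hazard rate h(t) = f(t) / (1 - F(t)) for a Weibull(shape, scale);
    for the Weibull this simplifies to (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Evaluate the hazard rate at a few ages for three shape parameters.
for shape, label in [(0.7, "decreasing"), (1.0, "constant"), (1.5, "increasing")]:
    rates = [weibull_hazard(t, shape, 10_000) for t in (1_000, 5_000, 20_000)]
    print(f"shape {shape} ({label} hazard):", [f"{r:.2e}" for r in rates])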

Comparing Disk Replacement Frequency with That of Other Hardware Components: The reliability of a system depends on all of its components, not just the hard drives. How does the frequency of hard drive replacements compare to that of other hardware failures?

Hardware Failure Comparison: While the table suggests that disks are among the most commonly replaced hardware components, it does not necessarily imply that disks are less reliable or have a shorter lifespan than other hardware components. The number of disks in the systems might simply be much larger than that of other hardware components. In order to compare the reliability of different hardware components, we need to normalize the number of component replacements by the component's population size.
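
A minimal sketch of the normalization described above, using made-up replacement and population counts rather than the paper's actual data: dividing replacements by each component's population size yields a per-unit replacement rate that can be compared across component types.

# Hypothetical counts, for illustration only; not the paper's actual data.
replacements = {"hard drive": 120, "memory": 80, "cpu": 30}
population   = {"hard drive": 3_000, "memory": 8_000, "cpu": 1_000}

for component, count in replacements.items():
    rate = count / population[component]
    print(f"{component}: {rate:.1%} of units replaced")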

Responsible Hardware: Node outages that were attributed to hardware problems, broken down by the responsible hardware component. This includes all outages, not only those that required replacement of a hardware component. For a more complete picture, ideally we would like to compare the frequency of hardware problems reported above with the frequency of other types of problems, such as software failures, network problems, etc. Unfortunately, we do not have this type of information for the systems in Table 1. However, in recent work [27] the authors analyzed failure data covering any type of node outage, including those caused by hardware, software, network problems, environmental problems, or operator mistakes. The data was collected over a period of 9 years on more than 20 HPC clusters and contains detailed root-cause information. For most HPC systems in that data, more than 50% of all outages are attributed to hardware problems and around 20% of all outages are attributed to software problems. Consistent with the data in Table 2, the two most common hardware components to cause a node outage are memory and CPU.

It is interesting to observe that for these data sets there is no significant discrepancy between the replacement rates for SCSI and FC drives, commonly represented as the most reliable types of disk drives, and SATA drives, frequently described as lower quality. Note that HPC4 consists exclusively of SATA drives.

Why are the observed field ARRs so much higher than the datasheet MTTFs suggest? Field AFRs are more than a factor of two higher than the datasheet AFRs. Possible reasons: the customer's and the vendor's definitions of "faulty" differ, and datasheet MTTFs are determined from accelerated stress tests that make certain assumptions about operating conditions, such as temperature, workloads and duty cycles, and data center handling procedures. Operating conditions in the field might not always be ideal.

Age-Dependent Replacement Rate: Failure rates of hardware products typically follow a "bathtub curve", with high failure rates at the beginning (infant mortality) and at the end (wear-out) of the lifecycle. Failure rates in real life are therefore not constant and cannot be captured by a single MTTF or AFR.

MTTF & AFR: The observed ARR is larger than the datasheet MTTF suggests in all years except the first. Replacement rates increase with drive age, suggesting early onset of wear-out. This disagrees with the flat "bottom of the bathtub" expectation.
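
As a hedged sketch of how an age-dependent replacement rate can be read out of a replacement log (hypothetical numbers, not the paper's data): count the replacements that fall in each year of drive age and normalize by the number of drive slots in service.

# Hypothetical numbers for illustration; assumes all drive slots were deployed
# at the same time and each replaced drive is immediately swapped for a new one.
population = 5_000                                             # drive slots in service
replacements_by_age = {1: 40, 2: 110, 3: 160, 4: 210, 5: 260}  # counts per year of age

for year, count in replacements_by_age.items():
    print(f"year {year}: ARR = {count / population:.1%}")
# A rate that keeps climbing after year 1, as in this made-up example, is the
# early wear-out pattern described above.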

Distribution of Time Between Replacements: the distribution of time between disk replacements across all nodes in HPC1.

Many have pointed out the need for a better understanding of what disk failures look like in the field. Yet hardly any published work exists that provides a large-scale study of disk failures in production systems.

Conclusion: Field usage appears to differ from datasheet MTTF conditions. For drives less than five years old, the ARR was larger than the datasheet MTTF suggested by a factor of 2 to 10, even though this age range is often expected to be in steady state (the bottom of the "bathtub curve"). In the data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage, and environmental factors, affect replacement rates more than component-specific factors.

Conclusion: A common concern is that MTTF underrepresents infant mortality; however, the data suggest that early onset of wear-out is more important than underrepresentation of infant mortality. The empirical distributions of time between disk replacements are best fit by Weibull and gamma distributions, not by exponential distributions.
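
As a hedged sketch of how such a comparison can be made (synthetic inter-replacement times, not the paper's data), one can fit exponential, Weibull, and gamma distributions with scipy and compare their log-likelihoods; under the paper's finding, the Weibull and gamma fits come out ahead of the exponential.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic times between replacements (hours); a Weibull with shape < 1 has a
# decreasing hazard rate (see the hazard-rate slides above).
gaps = rng.weibull(0.7, size=500) * 10_000

candidates = {
    "exponential": stats.expon,
    "weibull":     stats.weibull_min,
    "gamma":       stats.gamma,
}
for name, dist in candidates.items():
    params = dist.fit(gaps, floc=0)          # location fixed at 0 for lifetime data
    loglik = np.sum(dist.logpdf(gaps, *params))
    print(f"{name:12s} log-likelihood = {loglik:.1f}")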

Thank You !

Citation: Bianca Schroeder and Garth A. Gibson. "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" FAST '07. http://www.cs.cmu.edu/~bianca/fast07.pdf