Resilience at Scale: The Importance of Real-World Data. Bianca Schroeder, Computer Science Department, University of Toronto.

Bianca Schroeder © December 15
Reliability is important. Failures are frustrating and expensive, and the problem may get worse in the future with increasing scale and component count. Why has there not been more progress?

Failures are not very well understood. "Much academic and corporate research is based on anecdotes and back-of-the-envelope calculations" [Schwarz06]. "Most papers use simplistic assumptions about component failures." [Patterson99]. Why? There is no publicly available data on failures in real systems.

Examples from real-world data. Types of failures covered: cluster node outages (records of more than 23,000 outages), storage failures (data covering more than 100,000 drives), and DRAM errors.
[FAST 07] Joint w/ Gibson. Best paper award. [SciDAC 07] Joint w/ Gibson. [FAST 08] Joint w/ Bairavasundaram. Best paper award. [TOS 08] Joint w/ Bairavasundaram et al. [DSN 06] Joint w/ Gibson. [TDSC 08] Joint w/ Gibson. [Sigmetrics 09] Joint w/ Pinheiro, Weber. Best presentation award.

The data.
Hard drive failures: data covers > 100,000 drives; SATA, FC, and SCSI; enterprise and HEC environments.
Errors in DRAM: data read back differently from how it was written; both correctable and uncorrectable, soft and hard. Data covers all of Google's fleet; DDR1, DDR2, and FBDIMM; 5 different manufacturers; 6 hardware platforms.

Frequency of errors in today's systems.
Example 1 [Sigmetrics'09]: DRAM errors in the field. [Chart: number of correctable errors (CEs) per year by hardware platform, field data vs. lab tests and vendor data sheets.] Findings: dominated by hard errors, not soft errors; not getting worse with newer generations.
Example 2 [FAST'07, TOS'07]: HDD replacements in the field. Finding: SATA is not less reliable than SCSI & FC.
Accelerated lab tests and vendor data sheets are not enough. We need real field data!
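The gap between datasheet numbers and field observations can be sketched numerically (a hypothetical calculation with illustrative counts, not figures from the datasets above): a datasheet MTTF implies a nominal annualized failure rate (AFR), while field data yields an annualized replacement rate (ARR) from replacement counts and drive-years.

```python
HOURS_PER_YEAR = 8760.0

def datasheet_afr(mttf_hours):
    """Nominal annualized failure rate (%) implied by a datasheet MTTF."""
    return 100.0 * HOURS_PER_YEAR / mttf_hours

def field_arr(replacements, drive_years):
    """Annualized replacement rate (%) observed in the field."""
    return 100.0 * replacements / drive_years

# A 1,000,000-hour MTTF implies an AFR below 1%...
print(round(datasheet_afr(1_000_000), 2))  # 0.88
# ...while a hypothetical population with 30 replacements over
# 1,000 drive-years shows a much higher rate.
print(field_arr(30, 1_000))  # 3.0
```

The point of the comparison is that ARR measured in the field can be several times the AFR the datasheet implies, which is why field data is needed.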

Effect of age? Nominal lifetime: 5 years. Theory: little effect during the nominal lifetime. Practice [FAST'07, Sigmetrics'09]: surprisingly early wear-out; infant mortality is not a concern. [Charts: HDD replacements and DRAM errors as a function of age.]

Effect of temperature? Theory: the effect is known from lab experiments. Practice [FAST'07, Sigmetrics'09]: unclear effect in the field. [Charts: error rate over time for HDD replacements and DRAM errors.] Similar results hold for latent sector errors in hard drives.

Statistical properties? Theory: a Poisson process (independent failures, exponential time between failures). Practice [FAST'07, Sigmetrics'09]: correlations (autocorrelation, long-range dependence) and long tails in the time between failures. [Chart: expected number of failures in a week vs. number of failures in the previous week (small/medium/large), data vs. the independence assumption.]
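The difference between the independence assumption and correlated failures can be illustrated with a small simulation (synthetic data, not the measurements behind the slide): lag-1 autocorrelation is near zero for an independent weekly series, but large for a series where each week carries over part of the previous week's level.

```python
import random

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

random.seed(0)

# Independent weekly failure counts (the Poisson-style assumption).
indep = [random.randint(0, 10) for _ in range(5000)]

# Correlated counts: each week retains 70% of the previous week's level.
corr, level = [], 5.0
for _ in range(5000):
    level = 0.7 * level + random.randint(0, 10)
    corr.append(level)

print(abs(lag1_autocorr(indep)) < 0.1)  # True: no week-to-week correlation
print(lag1_autocorr(corr) > 0.5)        # True: strong week-to-week correlation
```

Under independence, last week's failure count tells you nothing about this week's; with correlated failures, a bad week predicts another bad week, which is exactly what the field data shows.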

Failures are not very well understood. Failures often look different from common assumptions, even for basic properties such as frequency, the impact of factors such as age, workload, and environmental conditions, and statistical properties. We found this to be true for various types of errors: hard drive replacements, memory errors, cluster node outages, latent sector errors, and data corruption. Does it matter?

Probability of a RAID failure. This depends on the probability of a second failure during reconstruction.
Approach 1: use the datasheet MTTF and an exponential distribution.
Approach 2: use the measured MTTF and an exponential distribution.
Approach 3: use a Weibull distribution fit to the data.
[Chart: probability (%) of a second failure vs. reconstruction time, under each approach.]
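A minimal sketch of the calculation behind the chart, with illustrative numbers rather than the fitted parameters from the data: the probability that at least one of the surviving drives fails during the reconstruction window, under a Weibull survival model. Setting the shape to 1.0 recovers the exponential (Poisson) assumption; for simplicity the Weibull scale is set equal to the MTTF in both cases.

```python
import math

def weibull_survival(x, shape, scale):
    """Weibull survival function S(x) = exp(-(x/scale)^shape)."""
    return math.exp(-((x / scale) ** shape))

def p_second_failure(k, window_h, age_h, shape, scale):
    """Probability that at least one of k surviving drives, each already
    age_h hours old, fails within a reconstruction window of window_h hours."""
    cond_survival = (weibull_survival(age_h + window_h, shape, scale)
                     / weibull_survival(age_h, shape, scale))
    return 1.0 - cond_survival ** k

MTTF = 1_000_000.0  # illustrative datasheet MTTF in hours

# shape=1.0 is the exponential assumption; shape<1.0 gives a
# decreasing-hazard Weibull, as fits of field data tend to report.
expo = p_second_failure(k=7, window_h=24.0, age_h=20_000.0, shape=1.0, scale=MTTF)
weib = p_second_failure(k=7, window_h=24.0, age_h=20_000.0, shape=0.7, scale=MTTF)

print(expo)  # ~1.7e-4 under the exponential assumption
print(weib)  # larger under this Weibull at this drive age
```

With these (hypothetical) parameters the Weibull model yields a noticeably higher second-failure probability than the exponential assumption, which is the design-relevant gap the slide illustrates.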

Conclusion. Failures are often not well understood, and that matters when designing systems. We need real-world data!

The Computer Failure Data Repository (CFDR). Goal: gather and publish real failure data. A community effort, with Usenix as the clearinghouse. Data on all aspects of system failure, anonymized as needed. Do you have any data to contribute?