Failure Trends in a Large Disk Drive Population. Authors: Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso. Presented by Vinuthna & Arjun.


Motivation
– More than 90% of all new information is stored on magnetic media, most of it on hard disk drives (HDDs).
– Goal: study failure patterns and the key factors that affect drive lifetime.
– Analyze the correlation between failures and the parameters believed to affect HDD lifetime.
– Why? To enable better design and maintenance of storage systems.

Previous studies
– Mostly accelerated aging experiments, which are poor predictors of failure rates in the field.
– Moderate population sizes.
– Statistics based on returned units from warranty databases.
– No insight into what actually happened to a drive during operation.

Our study
– Large study of hard drives in Google’s infrastructure: more than 100,000 disk drives.
– Beyond the large population, the study offers depth and detail from an end user's point of view.
– Why? Manufacturers quote failure rates below 2%, but end users experience much higher failure rates.
– Some studies report that in 20-30% of cases the manufacturer finds no problem in a drive that failed in the field.

System Health Infrastructure
– Collection layer: collects data from each server and writes it to a repository.
– Storage is based on Bigtable, which is built on GFS. Data sits in 2-D cells, with a third dimension for time-versioned values.
– The database holds the complete history of environmental, error, configuration and repair events.
– A lightweight daemon runs on each machine and reports to the collectors.
– Large-scale analysis is done with MapReduce, which makes the computation readily available so that the user can focus on the analysis algorithm.
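The storage model described above (2-D cells plus a time dimension) can be illustrated with a small in-memory sketch. This is not Bigtable's API; the class name `VersionedTable`, its methods, and the row and column names are hypothetical and exist only to show how a cell addressed by (row, column) can keep a history of timestamped values.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table: each (row, column) cell holds timestamped versions.

    Hypothetical illustration of a 2-D cell layout with a time dimension;
    it does not reproduce any real Bigtable interface.
    """

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells[(row, column)][timestamp] = value

    def latest(self, row, column):
        """Return the most recent value of a cell, or None if it is empty."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def history(self, row, column):
        """Return all (timestamp, value) pairs of a cell, oldest first."""
        return sorted(self._cells.get((row, column), {}).items())


# Example: two readings of one SMART counter for one (made-up) drive.
table = VersionedTable()
table.put("drive-001", "smart:scan_errors", 100, 0)
table.put("drive-001", "smart:scan_errors", 200, 3)
print(table.latest("drive-001", "smart:scan_errors"))
print(table.history("drive-001", "smart:scan_errors"))
```

Keeping every version rather than overwriting is what lets a later analysis ask when a counter first became non-zero, which the SMART results depend on.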

Some other info
– Data collected over nine months.
– Mix of HDDs of different ages, manufacturers and models.
– Failure information mined from repair databases going back up to 5 years.
– Monitored signals: temperature, activity levels and SMART parameters.
– Results are not affected by the population mix.

Results: Utilization
– Previous notion: high duty cycles affect disk drives negatively.

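Utilization and AFR (annualized failure rate)
– Higher utilization means more failures only in the infant mortality stage and at end of life.
– After the first year, the AFR of high-utilization drives is only moderately higher than that of low-utilization drives.
– How is this possible? One explanation is survival of the fittest: drives that get through heavy early use are the ones least susceptible to utilization-related failures. Another is that the earlier correlation came from accelerated life tests, which mainly model early-life failures, and the same early-life effect is seen here.
– Conclusion: utilization has a much weaker correlation with failure than previously assumed.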
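As a rough illustration of how an AFR per age and utilization group could be computed, here is a minimal sketch. It assumes AFR is taken as failures per drive-year of observation, expressed as a percentage; the slides do not give the exact formula, and the record layout `(age_years, utilization, drive_days, failed)` and the function names are hypothetical.

```python
from collections import defaultdict


def annualized_failure_rate(failures, drive_days):
    """AFR in percent: failures per drive-year of observation (assumed definition)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


def afr_by_group(records):
    """Group records by (age_years, utilization) and compute AFR per group.

    Each record is a tuple (age_years, utilization, drive_days, failed),
    a made-up layout used only for this sketch.
    """
    failures = defaultdict(int)
    days = defaultdict(float)
    for age_years, utilization, drive_days, failed in records:
        key = (age_years, utilization)
        days[key] += drive_days
        if failed:
            failures[key] += 1
    return {key: annualized_failure_rate(failures[key], days[key]) for key in days}


# Example with illustrative numbers, not data from the study.
sample = [
    (1, "low", 365, False),
    (1, "low", 365, True),
    (1, "high", 365, False),
]
print(afr_by_group(sample))
```

Normalizing by observed drive-days rather than by drive count keeps groups comparable when drives enter or leave the population partway through the observation period.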

Temperature
– Previous belief: a temperature change of 15 °C can double the failure rate.
– Temperature distribution (PDF) plot: failure rate does not increase with temperature. In fact, lower temperatures may be associated with higher failure rates.
– AFR by age: the failure rate is flat for mid-range temperatures, with a modest increase at low temperatures.
– High temperature is not associated with a high failure rate, except for old drives.
– Conclusion: within a moderate temperature range, temperature is not a strong factor in failure rate.

SMART Data Analysis
– Some signals are more relevant to disk failures than others.
– Parameters examined: scan errors, reallocation counts, offline reallocations, probational counts, miscellaneous signals.

Scan errors
– Errors reported when a drive scans the disk surface in the background.
– Indicative of surface defects.
– Consistent impact on AFR.
– After their first scan error, drives are 39 times more likely to fail within 60 days than drives with no scan errors.

Reallocation counts
– The number of times a faulty sector has been remapped to a new physical sector.
– Consistent impact on AFR.
– After their first reallocation, drives are 14 times more likely to fail within 60 days than drives with no reallocations.

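Offline reallocations
– A subset of reallocation counts: reallocated sectors found during background scrubbing.
– Survival probability is worse than for total reallocations.
– After their first offline reallocation, drives are 21 times more likely to fail within 60 days than drives with none.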

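Probational counts
– Suspect sectors are put on ‘probation’ until they either fail permanently or continue to work without problems.
– The critical threshold is 1: after the first probational event, drives are 16 times more likely to fail within 60 days.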
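The "N times more likely to fail" figures for the four SMART signals above are ratios of failure fractions between two groups of drives. A minimal sketch of that ratio follows, assuming each drive is a dictionary with a signal count and a boolean `failed_within_window` flag; those field names and the function names are hypothetical, and the numbers in the example are illustrative only.

```python
def failure_fraction(drives):
    """Fraction of the given drives that failed within the observation window."""
    if not drives:
        return 0.0
    failed = sum(1 for d in drives if d["failed_within_window"])
    return failed / len(drives)


def relative_failure_likelihood(drives, signal):
    """Ratio of failure fractions: drives with signal > 0 versus signal == 0.

    Returns None when no signal-free drive failed, because the ratio is
    then undefined.
    """
    with_signal = [d for d in drives if d[signal] > 0]
    without_signal = [d for d in drives if d[signal] == 0]
    baseline = failure_fraction(without_signal)
    if baseline == 0.0:
        return None
    return failure_fraction(with_signal) / baseline


# Example with illustrative numbers, not data from the study.
fleet = (
    [{"scan_errors": 2, "failed_within_window": True}]
    + [{"scan_errors": 1, "failed_within_window": False}]
    + [{"scan_errors": 0, "failed_within_window": True}]
    + [{"scan_errors": 0, "failed_within_window": False}] * 3
)
print(relative_failure_likelihood(fleet, "scan_errors"))
```

A ratio like this says how much a signal raises the risk relative to clean drives. It does not say how many failing drives show the signal at all, which is a separate question.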

Miscellaneous signals
– Seek errors
– CRC errors
– Power cycles
– Calibration retries
– Spin retries
– Power-on hours
– Vibration

Conclusion
– A larger population was used than in previous studies.
– There is no consistent pattern of higher failure rates at high temperatures or high utilization levels.
– SMART parameters are well correlated with failure probabilities.
– Prediction models based only on SMART parameters are limited in accuracy.
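One way to see why a SMART-only predictor can be limited is to measure how many failed drives it would have flagged at all. The sketch below assumes a very simple rule: flag a drive when any of the four signals discussed earlier reaches a threshold. Applying the threshold of 1 to every signal is an assumption made here (the slides state it only for probational counts), and the field names and function names are hypothetical.

```python
# Signal names are hypothetical labels for the four SMART parameters in the slides.
STRONG_SIGNALS = (
    "scan_errors",
    "reallocations",
    "offline_reallocations",
    "probational_count",
)


def is_flagged(drive, threshold=1):
    """Flag a drive when any strong signal is at or above the threshold."""
    return any(drive.get(name, 0) >= threshold for name in STRONG_SIGNALS)


def coverage_of_failures(drives, threshold=1):
    """Fraction of failed drives that the rule flagged.

    Returns None when no drive in the input failed.
    """
    failed = [d for d in drives if d["failed"]]
    if not failed:
        return None
    flagged = sum(1 for d in failed if is_flagged(d, threshold))
    return flagged / len(failed)


# Example with illustrative numbers, not data from the study.
fleet = [
    {"failed": True, "scan_errors": 4},
    {"failed": True},                      # failed with no SMART signal at all
    {"failed": False, "reallocations": 1},
    {"failed": False},
]
print(coverage_of_failures(fleet))
```

If a sizeable share of failed drives never shows any of these signals, no threshold on them can predict those failures, however strongly the signals correlate with failure when they do appear.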