Failures in the System  Two major components in a Node Applications System.

Presentation transcript:

Failures in the System  Two major components in a Node Applications System

Failures in the System  Similar systems at Nebraska. Google and Nebraska run comparable stacks on each node: Bigtable over GFS over local file systems over hard drives (Google), and a cluster scheduler over Hadoop over local file systems over hard drives (Nebraska). A failure at any layer causes unavailability; a failure at the file-system or hard-drive layer could cause data loss.

Unavailability: Defined  Data on a node is unreachable.  Detection: periodic heartbeats from the node go missing.  Correction: unavailability lasts until the node comes back, or until the system recreates the data elsewhere.
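The detection rule above can be sketched as a small monitor: a node is declared unavailable once several consecutive heartbeats are missed. The interval and threshold below are illustrative assumptions, not values from the paper.

```python
import time

HEARTBEAT_INTERVAL = 3.0   # seconds between expected heartbeats (illustrative)
MISSED_THRESHOLD = 5       # consecutive misses before declaring unavailability

class HeartbeatMonitor:
    """Tracks when each node was last heard from and flags nodes
    whose heartbeats have stopped arriving."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        # Record a heartbeat from `node` (now defaults to wall-clock time).
        self.last_seen[node] = time.time() if now is None else now

    def unavailable_nodes(self, now=None):
        # A node is unavailable if it has missed MISSED_THRESHOLD
        # heartbeat intervals in a row.
        now = time.time() if now is None else now
        cutoff = MISSED_THRESHOLD * HEARTBEAT_INTERVAL
        return [n for n, t in self.last_seen.items() if now - t > cutoff]
```

In a real system the monitor would also trigger re-replication after the node stays on this list past a timeout, matching the correction step above.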

Unavailability: Measured  (chart: unavailability events by duration)  Replication of the missing data starts only after a timeout, so transient outages do not trigger unnecessary re-replication.  Question: after replication starts, why does it take so long to recover?

Node Availability  (chart: causes of node unavailability)  Storage software restarts are a frequent cause, but the software is fast to restart.

Node Availability: Time  Node updates (planned reboots) cause the most downtime.

MTTF for Components  Even though disk failure can cause data loss, node failure happens much more often.  Conclusion: node failure is more important to system availability.

Correlated Failures  Large number of nodes failing in a burst can reduce effectiveness of replication and encoding schemes  Losing nodes before replication can start can cause unavailability of data

Correlated Failures

Rolling reboots of the cluster appear as large, planned failure bursts.

Correlated Failures  Oh s*!t, the datacenter is on fire! (maybe not that bad)

Coping with Failure

Replication vs. Encoding

Coping with Failure  Replication: 3 replicas is standard in large clusters; stripe MTTF ≈ 27,000 years.  Encoding: stripe MTTF ≈ 27.3 M years.
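The trade-off behind those numbers is storage overhead versus failures tolerated. This sketch compares n-way replication with a Reed-Solomon-style (k data, m parity) code; the RS(6, 3) parameters are illustrative, not necessarily what Google or Hadoop deploy.

```python
def replication_cost(n):
    """n full copies: tolerates n - 1 losses, stores n bytes per byte of data."""
    return {"overhead": float(n), "failures_tolerated": n - 1}

def erasure_cost(k, m):
    """k data + m parity fragments: any m fragment losses are tolerated,
    at (k + m) / k bytes stored per byte of data."""
    return {"overhead": (k + m) / k, "failures_tolerated": m}

# 3-way replication vs. an illustrative RS(6, 3) code
print(replication_cost(3))  # 3.0x storage, tolerates 2 failures
print(erasure_cost(6, 3))   # 1.5x storage, tolerates 3 failures
```

The code tolerates more simultaneous losses at half the storage cost, which is why the encoded stripe MTTF is so much higher; the price is reconstruction work (reading k fragments) on every repair.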

Coping with Failure Cell Replication (Datacenter Replication)

Cell Replication  Block A is stored in Cell 1 and replicated to Cell 2; if either cell fails or loses its copy, the block is restored from the other cell.
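A minimal sketch of cell-level placement: each block is deterministically assigned to two distinct cells, so every block survives the loss of any single cell. The cell names and the hash-based `placement` helper are hypothetical, purely to illustrate the idea.

```python
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3"]  # illustrative datacenter names

def placement(block_id, copies=2):
    """Pick `copies` distinct cells for a block, derived deterministically
    from the block id so every replica agrees on the placement."""
    h = int(hashlib.md5(block_id.encode()).hexdigest(), 16)
    start = h % len(CELLS)
    return [CELLS[(start + i) % len(CELLS)] for i in range(copies)]

print(placement("Block-A"))  # two distinct cells, same answer every run
```

Deterministic placement matters here: any node can recompute where Block A's other copy lives without consulting a central directory.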

Modeling Failures  We've seen the data; now let's model the behavior.

Modeling Failures  A chunk of data can be in one of many states.  Consider when Replication = Lose a replica, but still 2 available

Modeling Failures  A chunk of data can be in one of many states.  Consider when Replication = replicas = service unavailable Recovery

Modeling Failures  Each loss of a replica has a probability  The recovery rate is also known replicas = service unavailable Recovery

Markov Model  ρ = recovery rate  λ = per-replica failure rate  s = number of block replicas  r = minimum replication
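With those parameters, the chain above is a birth-death process that can be solved for the expected time until fewer than r replicas remain (data loss). This is a sketch using standard absorbing-chain hitting-time equations; the rates in the example are illustrative, not the paper's measured values.

```python
import numpy as np

def mttf(s=3, r=1, lam=0.1, rho=12.0):
    """Expected time until fewer than r replicas remain, starting from s
    replicas.  From state k: failures occur at rate k*lam (k -> k-1),
    recovery at rate rho (k -> k+1, only when k < s)."""
    states = list(range(r, s + 1))  # transient states; state r-1 is absorbing
    n = len(states)
    A = np.zeros((n, n))
    b = np.ones(n)
    # Hitting-time equations: (total out-rate of k) * t_k
    #   = 1 + rho * t_{k+1} + k*lam * t_{k-1}, with t_{r-1} = 0.
    for i, k in enumerate(states):
        A[i, i] = k * lam + (rho if k < s else 0.0)
        if k < s:
            A[i, i + 1] = -rho        # recovery: k -> k+1
        if k > r:
            A[i, i - 1] = -k * lam    # failure to another transient state
        # failure from state r goes to the absorbing state; no t-term needed
    t = np.linalg.solve(A, b)
    return t[-1]  # expected time starting fully replicated (state s)

print(mttf(s=3, r=1, lam=0.1, rho=12.0))  # MTTF for 3-way replication
```

As a sanity check, with s = r = 1 and no recovery the answer reduces to 1/λ, and adding replicas or raising ρ always increases the MTTF, matching the intuition behind the 27,000-year figure.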

Modeling Failures  Using Markov models, we can find:

Modeling Failures  Using Markov models, we can find: Nebraska 402 Years

Modeling Failures  For Multi-Cell Implementations

Paper Conclusions  Given the enormous amount of data from Google, we can say:  Failures are typically short.  Node failures happen in bursts and are not independent.  In modern distributed file systems, a disk failure amounts to a node failure.  The authors built a Markov model of failures that accurately reasons about past and future availability.

My Conclusions  This paper contributed greatly by showing data from very large scale distributed file systems.  If Reed-Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook?  Complicated code?  Complicated administration?