Fault Tolerance In Operating System

Fault Tolerance In Operating System Zihao Mao

Fault-Tolerance Definition
Fault tolerance refers to the ability of a system or component to continue normal operation despite the presence of hardware or software faults.
It is a fault-management technique that builds a component in such a way that it can meet its specifications in the presence of faults.

Properties
Reliability: R(t) is the probability that the system operates correctly up to time t, given that it was operating correctly at time t = 0. Related measures are Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR).
Availability: the fraction of time the system is available to service users' requests.
Safety: the ability to minimize the impact of small failures.
Maintainability: how easy it is to repair faults.
High reliability does not imply high availability.
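
The slide names MTTF and MTTR but gives no formula. A minimal sketch, assuming the standard steady-state relation availability = MTTF / (MTTF + MTTR); the function name `availability` and the sample figures are illustrative and do not come from the slides:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system that rarely fails but takes a long time to repair.
rare_but_slow = availability(mttf_hours=1000.0, mttr_hours=500.0)

# Hypothetical system that fails often but is repaired almost instantly.
frequent_but_fast = availability(mttf_hours=10.0, mttr_hours=0.01)
```

With these made-up numbers the first system is available about 66.7% of the time and the second about 99.9%, which is one way to see that a long time between failures does not by itself give high availability.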

Fault Definitions
Fault: an erroneous hardware or software state resulting from component failure, operator error, physical interference from the environment, design error, program error, or data structure error. A fault may be a defect in a hardware device or component, or an incorrect step, process, or data definition in a computer program.
Failure: occurs when the system fails to meet its promises. Failures are caused by faults.

Typical Failure Types
Crash failure: the system halts, but it behaves correctly before halting.
Omission failure: the system fails to respond.
Timing failure: the output is correct, but the time taken to respond exceeds the specification.
Response failure: the output is wrong.
Arbitrary (Byzantine) failure: the output is arbitrary or malicious.
Severity rises from top to bottom in the list.
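
The ordering by severity can be sketched as an ordered enumeration, together with a classifier for a single observed reply. This is only an illustration: the names `Failure` and `classify_reply` and the integer reply type are assumptions, and one reply alone cannot distinguish a crash from an omission or a response failure from a Byzantine one, so the classifier reports only the three cases it can observe.

```python
from enum import IntEnum
from typing import Optional


class Failure(IntEnum):
    """Failure types from the slide, ordered by increasing severity."""
    NONE = 0
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    RESPONSE = 4
    BYZANTINE = 5


def classify_reply(reply: Optional[int], expected: int,
                   elapsed_s: float, deadline_s: float) -> Failure:
    """Classify one reply (hypothetical helper, not from the slides)."""
    if reply is None:
        return Failure.OMISSION   # no response at all
    if reply != expected:
        return Failure.RESPONSE   # wrong output
    if elapsed_s > deadline_s:
        return Failure.TIMING     # correct output, but too late
    return Failure.NONE


late_but_correct = classify_reply(4, expected=4, elapsed_s=2.0, deadline_s=1.0)
```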

Fault Categories
Temporary: a fault that is not present all the time under all operating conditions.
  Transient: a fault that occurs only once.
  Intermittent: a fault that occurs at multiple, unpredictable times.
Permanent: a fault that, after it occurs, is always present.

Fault Detection Techniques
Fail-stop: the crash is detected.
Fail-silent: detected when the system remains silent (after crashing).
Fail-safe: wrong outputs are detected.
Byzantine failure: hard to detect.

Fault-Tolerant Techniques
Redundancy: hide the effects of faults.
Recovery: bring the system to a fault-free state to remove the effects of faults.

Redundancy
Physical redundancy: the use of multiple components that either perform the same function simultaneously or are configured so that one component is available as a backup in case another fails. Examples: extra CPUs, multiple parallel circuits, multi-version software, a backup name server.
Temporal redundancy: repeating a function or operation when an error is detected. Examples: re-execution, executing a backup copy, retransmission.
Information redundancy: replicating or coding data in such a way that bit errors can be detected and, with a suitable code, corrected. Examples: parity, Hamming codes.
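
Information redundancy can be sketched with the two examples the slide names. A single even-parity bit only detects one flipped bit, while a Hamming(7,4) code can also locate and correct it. The function names and the bit layout (parity bits at positions 1, 2 and 4) are choices made for this sketch, not something the slides specify.

```python
from typing import List


def even_parity_bit(bits: List[int]) -> int:
    """Extra bit that makes the total number of 1s even (detection only)."""
    return sum(bits) % 2


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as 7 bits; parity bits sit at positions 1, 2, 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1                            # simulate a single bit error
corrected = hamming74_decode(received)
```

In this run `corrected` equals `word` again even though one bit was flipped in transit. The sketch assumes at most one bit error per codeword; two flipped bits would be miscorrected.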

Triple Modular Redundancy (TMR)
If A2 fails: V1 takes a majority vote, so all B components still get a good result.
What if V1 fails?
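
The voting step can be sketched as follows. The module functions, the `tmr` wrapper and the input value are hypothetical; the sketch models only one voter in front of three replicated modules.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    return value


def a_good(x: int) -> int:
    """A correctly working module (illustrative computation)."""
    return x + 1


def a_faulty(x: int) -> int:
    """A module that has failed and produces a wrong value."""
    return x + 100


def tmr(modules: List[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


# A2 is the faulty replica; the voter still passes on the correct value.
result = tmr([a_good, a_faulty, a_good], 41)
```

Here `result` is 42 despite the faulty middle module. In this sketch the voter itself is a single point of failure, which is the issue the slide's closing question raises; one common answer is to replicate the voters as well.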

Redundancy Level
How many faults can be tolerated in the system?
A k-fault-tolerant system handles k faults.
TMR is 1-fault tolerant.
For silent faults, (k + 1) components are required.
For Byzantine faults, (2k + 1) components are required.
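
The two component counts can be written as a small helper, shown here with its inverse. The function names are assumptions made for this sketch.

```python
def components_needed(k: int, byzantine: bool) -> int:
    """Components required to tolerate k faults, per the slide's two rules."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1


def faults_tolerated(n: int, byzantine: bool) -> int:
    """Largest k that n components can tolerate under the same rules."""
    if n < 1:
        raise ValueError("at least one component is required")
    return (n - 1) // 2 if byzantine else n - 1


tmr_components = components_needed(1, byzantine=True)
```

`tmr_components` is 3, which matches the statement that TMR is 1-fault tolerant: two correct modules outvote one arbitrarily wrong one.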

Recovery
Forward recovery: move the system to a new state from which it can continue operating. Example: error correction.
Backward recovery: bring the system back to a previous fault-free state. Examples: checkpoints, message logging, the Unix Targon/32 system.

Checkpoints
Periodically store the system state on stable storage while the system is operating without suffering the effects of faults.
At recovery, bring the system back to the last state stored in a checkpoint.
Problem: the stored checkpoints may form an inconsistent cut.
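
A minimal single-process sketch of checkpoint and rollback, assuming a local file stands in for stable storage. The class name `CheckpointedCounter`, the state layout and the file name are hypothetical. The checkpoint is written to a temporary file and then renamed, so a crash during the write does not destroy the previous checkpoint.

```python
import os
import pickle
import tempfile


class CheckpointedCounter:
    """Toy computation whose state can be saved and restored."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.state = {"step": 0, "total": 0}

    def step(self, value: int) -> None:
        self.state["step"] += 1
        self.state["total"] += value

    def checkpoint(self) -> None:
        """Write the state to disk, then atomically replace the old file."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)

    def recover(self) -> None:
        """Roll back to the last state stored in the checkpoint file."""
        with open(self.path, "rb") as f:
            self.state = pickle.load(f)


with tempfile.TemporaryDirectory() as workdir:
    job = CheckpointedCounter(os.path.join(workdir, "state.ckpt"))
    for value in (1, 2, 3):
        job.step(value)
    job.checkpoint()       # fault-free state: step 3, total 6
    job.step(1000)         # pretend a fault corrupted this step
    job.recover()          # backward recovery to the checkpoint
    recovered = dict(job.state)
```

After recovery the state is back at step 3 with total 6, and the work done after the checkpoint is lost. The inconsistent-cut problem does not show up here because it only arises when several communicating processes checkpoint separately.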

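Independent Checkpoints
Each process periodically takes checkpoints independently of the others.
The problem of inconsistency then has to be fixed at recovery: processes roll back until their checkpoints form a consistent cut.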

Message Logging
Checkpointing is expensive: it periodically saves states and restarts from the last consistent state.
Message logging: take infrequent checkpoints and log all messages between checkpoints to local stable storage.
At recovery, simply replay the messages logged since the previous checkpoint. This avoids re-computation.
Problem: inconsistency.
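
Checkpoint plus replay can be sketched for one process as follows. The class `LoggedProcess`, the integer messages and the in-memory stand-ins for stable storage are assumptions for illustration; a real system would write both the checkpoint and the log to disk.

```python
import copy
from typing import Dict, List, Optional


class LoggedProcess:
    """Process that checkpoints rarely and logs every delivered message."""

    def __init__(self) -> None:
        self.state: Optional[Dict[str, int]] = {"balance": 0}
        # The two attributes below stand in for stable storage.
        self._checkpoint: Dict[str, int] = copy.deepcopy(self.state)
        self._log: List[int] = []

    def _apply(self, amount: int) -> None:
        self.state["balance"] += amount

    def deliver(self, amount: int) -> None:
        """Log the message first, then apply it to the volatile state."""
        self._log.append(amount)
        self._apply(amount)

    def checkpoint(self) -> None:
        """Save the state; older log entries are no longer needed."""
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def crash(self) -> None:
        """Lose the volatile state; checkpoint and log survive."""
        self.state = None

    def recover(self) -> None:
        """Restore the checkpoint, then replay the logged messages in order."""
        self.state = copy.deepcopy(self._checkpoint)
        for amount in self._log:
            self._apply(amount)


process = LoggedProcess()
process.deliver(10)
process.checkpoint()
process.deliver(5)
process.deliver(-3)
process.crash()
process.recover()
final_balance = process.state["balance"]
```

The balance after recovery is 12, the same as before the crash, even though the last checkpoint recorded only 10. The sketch assumes that message handling is deterministic, so replaying the log in the original order reproduces the lost state.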

Summary
Fault tolerance is the ability of a system or component to continue normal operation despite the presence of hardware or software faults.
Reliability: Mean Time To Failure (MTTF), Mean Time To Repair (MTTR).
Availability.
Techniques:
  Redundancy: Triple Modular Redundancy (TMR).
  Recovery: checkpoints, independent checkpoints, message logging.