Fault Tolerance & Reliability CDA 5140 Spring 2006


Fault Tolerance & Reliability CDA 5140 Spring 2006 Chapter 1 Overview & Definitions

Topics
basic concepts of Fault Tolerance (FT)
reliability & availability of systems, both hardware & software
tools to compare & contrast FT designs

What is FT?
Computing in the presence of errors
Some techniques come from analog systems of the 1940s - 1960s
Digital technology adds to these to be faster, better & cheaper
Investigate architecture keeping in mind the tradeoff of cost, weight & volume
Becoming more important as digital systems become more & more prevalent

Why Have FT?
Needed more in the 21st century since:
Harsher environments
Many novice users
Increasing repair costs
Larger systems
Digital systems more prevalent
More users dependent on digital systems, from business to government to home to school

How is FT Obtained?
Add redundancy in the form of:
Hardware, e.g. RAID
Software, e.g. 2 algorithms for the same task
Information, e.g. coding theory
Time, e.g. on the Internet, if there is a fault, then a new route is tried
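The software-redundancy item ("2 algorithms for the same task") can be sketched in a few lines. This is only an illustration: the function names and the choice of task (summing 1..n) are hypothetical, and comparing two versions detects a disagreement but cannot by itself decide which version is right.

```python
def sum_by_loop(n: int) -> int:
    """Version 1: add the integers 1..n one at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_by_formula(n: int) -> int:
    """Version 2: closed form n(n + 1)/2, an independent algorithm."""
    return n * (n + 1) // 2


def checked_sum(n: int) -> int:
    """Run both versions and accept the result only if they agree."""
    a, b = sum_by_loop(n), sum_by_formula(n)
    if a != b:
        # A mismatch signals a fault in at least one version.
        raise RuntimeError(f"versions disagree: {a} != {b}")
    return a


print(checked_sum(100))
```

With three or more independent versions, a majority vote could mask a single faulty version instead of only detecting it.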

Definitions & Terminology
Failure - departure from correct operation
Fault - flaw in hardware or software resulting in failure, e.g. physical problems, design flaws or defects in hardware; design or implementation flaws in software
Error - incorrect response from a module, leading to system failure if there is no FT
Fault type - hardware or software
Fault cause - improper design, hardware failure, external disturbance

Definitions continued
Permanent fault - always present, needs repair to remove
Intermittent fault - not always present but still needs repair to remove
Transient fault - will disappear without repair
Fault latency - time during which a fault is present but undetected & has not yet caused an error
Fault-avoidance - use of high quality components & careful design to avoid faults
Fault-tolerance - use of redundancy (hardware, software, information or time) to correct system operation after a fault occurs

Definitions continued
Graceful degradation - system still performs correctly after faults, but at a degraded level
Fail-safe - system can fail, but only to a safe state, to avoid catastrophes
Reliability - probability of not failing within time t, given correct operation at time 0
Availability - probability that the system is operating correctly at time t
Maintainability - probability that the system can be restored to operation by time t, given that it is not operational at time 0
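The reliability definition above becomes concrete once a failure model is chosen. The sketch below assumes a constant failure rate (an exponential failure law, which the slides do not state), so that R(t) = exp(-rate * t). The function name and the sample rate are illustrative only.

```python
import math


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """Probability of surviving to time t, ASSUMING a constant failure rate."""
    if t_hours < 0 or failure_rate_per_hour < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate_per_hour * t_hours)


# Hypothetical component with one expected failure per 10,000 hours.
rate = 1.0 / 10_000
for t in (0, 1_000, 10_000):
    print(f"R({t}) = {reliability(t, rate):.4f}")
```

Under this assumption reliability starts at 1 and only decreases with time, which matches the "given correct operation at time 0" condition in the definition.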

Definitions continued
Mean-time-to-failure (MTTF) - expected value of system failure time
Mean-time-to-repair (MTTR) - expected value of system repair time
Mean-time-between-failure (MTBF) - expected time between successive system failures, MTBF = MTTF + MTTR
Fault detection - method used to detect the presence of a fault
Fault confinement - technique to confine the damage of a fault to as small an area as possible
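The three mean-time measures above combine directly. As a minimal sketch, the code below computes MTBF from MTTF and MTTR and also the long-run (steady-state) availability MTTF/(MTTF + MTTR). That ratio is a standard relation rather than something stated on this slide, and the sample figures are hypothetical.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: one up period plus one repair period."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical system: fails on average every 999 hours, repaired in 1 hour.
print(mtbf(999.0, 1.0))
print(steady_state_availability(999.0, 1.0))
```

Shortening MTTR raises availability even when MTTF stays fixed, which is why both measures are tracked.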

Definitions continued
Fault diagnosis - automatic identification of faulty modules
Recovery - system put into an operating state, possibly degraded
Hardware redundancy - extra hardware to detect, mask or diagnose faults
Passive hardware redundancy - fault masking to hide faults & prevent faults from resulting in errors; no action by the system
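Fault masking can be illustrated with a majority voter over three replicated modules (triple modular redundancy, a common instance of passive redundancy that this slide does not name explicitly). The sketch simulates the modules in software, and all names in it are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


def good_module(x):
    return x * x


def faulty_module(x):
    return x * x + 1  # simulated permanent fault


# Two healthy replicas outvote the faulty one, so the fault is masked.
modules = [good_module, good_module, faulty_module]
print(majority_vote([m(4) for m in modules]))
```

The voter neither locates nor repairs the faulty module; it only keeps the fault from reaching the output, which is the "no action by the system" property of passive redundancy.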

Definitions continued
Information redundancy - use of coding theory techniques (addition of bits)
Software redundancy - use of diagnostic software or extra modules, each with a distinct algorithm
Temporal redundancy - repeating bus cycles or whole programs, new route on the Internet
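The simplest case of adding bits for information redundancy is a single even-parity bit. The sketch below is an illustrative choice of code, not one named in the slides. It detects any single-bit error but cannot correct it, and it misses errors that flip an even number of bits.

```python
def add_even_parity(bits):
    """Append one bit so that the codeword has an even number of 1s."""
    return bits + [sum(bits) % 2]


def parity_ok(codeword):
    """True if the codeword still has an even number of 1s."""
    return sum(codeword) % 2 == 0


codeword = add_even_parity([1, 0, 1, 1])
print(codeword, parity_ok(codeword))

corrupted = codeword.copy()
corrupted[2] ^= 1  # simulate a single-bit fault
print(corrupted, parity_ok(corrupted))
```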

Microelectronic Growth
Density of chips has dramatically increased &, concomitantly, so has the use of digital systems
Obvious need for FT in the space shuttle & nuclear power plants, but with increased use in homes, more faults are likely, so FT will be needed there too
Interesting observations:
1999: typical home had 40-60 microprocessors
2004: expected to be 280

Reliability & Availability
Goal: high reliability & availability based on sound analysis & not conjecture!
Use both reliability & availability as measures

Air Traffic Control Example
ATC fails once/year, so MTTF = 8766 hours
Airline Reservation System (ARS) down 5 times/year, so MTTF = 1753 hours
Availability (A) = uptime/(uptime + downtime)
ATC down 1 hour, so A = 8765/(8765 + 1) = 0.999886
ARS down for 1 minute, 5 times, or 0.083333 hours, so A = 8765.91667/8766 = 0.9999905

Air Traffic Control Example cont’d
Unavailability U = 1 - A
So, comparing the two systems for U: (1 - 0.999886)/(1 - 0.9999905) = 12
The unavailability of the ARS is 12 times lower than that of the ATC, so the ARS is the better system in terms of availability.
Homework 1: 1.13, 1.14, 1.17 (3 examples)
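The air traffic control comparison can be checked numerically. The sketch below recomputes availability and unavailability for both systems from their yearly downtime, using the 8766-hour year from the example. The function and variable names are illustrative.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours, as used in the example


def availability(downtime_hours: float,
                 period_hours: float = HOURS_PER_YEAR) -> float:
    """A = uptime / (uptime + downtime) over one observation period."""
    uptime = period_hours - downtime_hours
    return uptime / (uptime + downtime_hours)


atc_a = availability(1.0)         # ATC: one 1-hour outage per year
ars_a = availability(5 * 1 / 60)  # ARS: five 1-minute outages per year
atc_u, ars_u = 1 - atc_a, 1 - ars_a

print(f"ATC: A = {atc_a:.6f}, U = {atc_u:.3e}")
print(f"ARS: A = {ars_a:.7f}, U = {ars_u:.3e}")
print(f"U(ATC) / U(ARS) = {atc_u / ars_u:.1f}")
```

Because both systems are observed over the same period, the unavailability ratio reduces to the ratio of total downtimes: 60 minutes against 5 minutes.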