3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.

Forms of Redundancy
- Hardware redundancy: extra hardware added for detecting or tolerating faults
- Software redundancy: extra software added for detecting and possibly tolerating faults
- Information redundancy: extra information, e.g. error-detecting codes
- Time redundancy: extra time spent performing tasks to achieve fault tolerance

Types of Hardware Redundancy
Fault tolerance requires redundancy.
1- Static (passive) redundancy
- uses fault masking to hide the occurrence of faults
- does not require reconfiguration
- examples: TMR, voting
2- Dynamic (active) redundancy
- uses comparison for fault detection and/or diagnosis
- requires reconfiguration to remove the faulty hardware from the system
- example: stand-by system
3- Hybrid redundancy
- a combination of static and dynamic redundancy

1- Static Redundancy
A class of redundancy techniques that can tolerate faults without reconfiguration (failover). Static redundancy divides into two major subclasses:
- Masking redundancy
- Active redundancy

Masking Redundancy
- Uses majority voting to mask faults
- Requires 2f + 1 modules to tolerate f faulty modules
- N-Modular Redundancy (NMR): N independent modules perform the same function in parallel and their results are voted on; requires N >= 3
- TMR (Triple Modular Redundancy) is the N = 3 case
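To make the 2f + 1 relation concrete, here is a minimal sketch of NMR majority voting in Python; the module outputs and the fault-injection scenario are hypothetical, not part of the original slides:

```python
from collections import Counter

def nmr_vote(outputs):
    """Majority-vote over N module outputs; N = 2f + 1 tolerates f faults."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many faulty modules")
    return value

# TMR (N = 3, f = 1): one faulty module is masked by the voter
print(nmr_vote([42, 42, 7]))  # -> 42
```

With five modules (f = 2), two faulty outputs are still outvoted, matching the 2f + 1 bound.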

Triple Modular Redundancy (TMR)
Example: majority voting with a 1-bit majority voter (three AND gates feeding an OR gate).
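The 1-bit voter described above reduces to three AND terms ORed together; as a sketch using Python's bitwise operators:

```python
def majority_bit(a, b, c):
    """1-bit majority voter: three AND gates (a&b, b&c, a&c) ORed together."""
    return (a & b) | (b & c) | (a & c)

# Any single flipped input is masked by the other two
print(majority_bit(1, 1, 0))  # -> 1
print(majority_bit(0, 0, 1))  # -> 0
```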

Triple Modular Redundancy (TMR)

Masking Redundancy: TMR with triplicated voting

Masking Redundancy: multi-stage TMR

N-Modular Redundant system (NMR)

Active Redundancy
- Two or more units are active and produce replicated results simultaneously
- Relies on fail-stop units
- Fail-stop property: a unit produces correct results or no results at all
- Requires f + 1 modules to tolerate f faulty modules
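Under the fail-stop assumption, f + 1 replicas suffice because any result that is produced is correct. A minimal sketch, where the replica functions and the convention that a stopped unit returns None are illustrative assumptions:

```python
def first_result(replicas, x):
    """Active redundancy over fail-stop units: take the first result produced.
    Fail-stop means a unit returns a correct result or nothing (None) at all,
    so f + 1 replicas tolerate f faulty units."""
    for replica in replicas:
        result = replica(x)
        if result is not None:  # a fail-stop unit never returns a wrong value
            return result
    raise RuntimeError("all replicas failed silently")

ok = lambda x: x * 2
dead = lambda x: None  # a fail-stop unit that has stopped producing results
print(first_result([dead, ok], 21))  # -> 42
```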

Fail-stop Nodes
Nodes 1 and 2 send their results individually to nodes 3 and 4. All nodes are fail-stop: they send correct results or no results at all.

2- Dynamic Redundancy
- Relies on error detection and reconfiguration
- Requires f + 1 modules to tolerate f faulty modules
- May require recovery of system or application state
- May require outage time

Example: Duplicate and Compare
- Can only detect faults, not diagnose them, i.e. fault detection without fault tolerance
- May order a shutdown on mismatch
- The comparator is a single point of failure
- Simple implementation: a 2-input XOR gate for a single-bit comparison
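A duplicate-and-compare checker can be sketched as follows; the module functions are hypothetical, and for single bits the comparison is exactly the XOR the slide mentions:

```python
def duplicate_and_compare(module_a, module_b, x):
    """Run two copies of a computation and compare the results.
    Detects a fault in either copy but cannot tell which copy is faulty."""
    a, b = module_a(x), module_b(x)
    if a != b:                       # for single bits: a XOR b
        raise SystemExit("mismatch detected: ordering shutdown")
    return a

print(duplicate_and_compare(lambda x: x + 1, lambda x: x + 1, 1))  # -> 2
```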

Example: Stand-by System
- Only one module drives the outputs
- The other modules are either idle (hot spares) or shut down (cold spares)
- On error detection (e.g. via communication checksums or memory parity bits), the system switches to a new module (a hot or cold spare)

Types of Stand-by Systems
- Hot standby
- Warm standby
- Cold standby

Hot Stand-by Characteristics
Spare is updated simultaneously with the primary module.
Advantages:
+ Very short or no outage time
+ Does not require recovery of the application state
Drawbacks:
- High failure rate (fault rate)
- High power consumption

Warm Stand-by Characteristics
Spare is up and running but needs to recover the application state.
Advantages:
+ Does not require simultaneous updating of spare and primary module
Drawbacks:
- Requires recovery of application state
- High fault rate
- High power consumption

Cold Stand-by Characteristics
Spare is powered down (e.g. satellite applications).
Advantages:
+ Low failure rate (fault rate)
+ Low power consumption
Drawbacks:
- Very long outage time
- Needs to boot the kernel/operating system and recover the application state
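The outage-time trade-off across the three standby types can be sketched as a switch-over routine; the sleep durations standing in for OS boot and state recovery are purely illustrative assumptions:

```python
import time

# Hypothetical stand-ins for the real recovery steps
def boot_os():
    time.sleep(0.02)             # only a cold spare must boot the kernel/OS

def recover_application_state():
    time.sleep(0.01)             # warm and cold spares must recover state

def switch_over(kind):
    """Return the outage time for switching to a spare of the given type."""
    start = time.monotonic()
    if kind == "cold":
        boot_os()
        recover_application_state()
    elif kind == "warm":
        recover_application_state()
    # "hot": spare already mirrors the primary, so no recovery is needed
    return time.monotonic() - start

print(switch_over("hot") < switch_over("warm") < switch_over("cold"))  # True
```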

3- Hybrid Redundancy
N-Modular Redundancy with spares:
- N active modules plus S spare modules (off-line)
- Voting and comparison
- Replaces an erroneous module with one from the spare pool
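Hybrid redundancy combines the voter's masking with dynamic replacement from the spare pool; a minimal sketch, where the good/bad module functions are hypothetical:

```python
from collections import Counter

def hybrid_step(active, spares, x):
    """Vote over the active modules, then replace any module whose output
    disagrees with the majority by a spare from the pool."""
    outputs = [m(x) for m in active]
    majority = Counter(outputs).most_common(1)[0][0]
    for i, out in enumerate(outputs):
        if out != majority and spares:       # reconfigure: swap in a spare
            active[i] = spares.pop()
    return majority

good = lambda x: x + 1
bad = lambda x: -1                           # permanently faulty module
active = [good, good, bad]
spare_pool = [good]
print(hybrid_step(active, spare_pool, 1))    # -> 2, faulty module swapped out
```

After the step, the faulty module has been replaced, so the system regains its full fault-tolerance margin for the next fault.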

N-Modular Redundancy with spares

Coding Checks / Exception Checks
Coding checks:
- Error-detecting codes are formed by adding check bits to a data word
- A cyclic redundancy check was used in the disk store of the ESS
- A parity bit was used in the RAM
Exception checks (hardware constraints):
- Usually result from the inability of the hardware to provide the service needed by the software
- Examples: improper address alignment, unequipped memory locations, unused op-codes, stack overflow
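As a sketch of the simplest coding check above, even parity adds one check bit so that any single-bit error is detected:

```python
def add_parity(word):
    """Append an even-parity check bit to a list of data bits."""
    return word + [sum(word) % 2]

def parity_ok(codeword):
    """A codeword is valid iff its total number of 1s is even."""
    return sum(codeword) % 2 == 0

cw = add_parity([1, 0, 1, 1])      # -> [1, 0, 1, 1, 1]
print(parity_ok(cw))               # True: no error
cw[2] ^= 1                         # inject a single-bit error
print(parity_ok(cw))               # False: error detected
```

Note that parity detects any odd number of bit errors but cannot locate or correct them, which is why it is a coding *check* rather than a fault-tolerance mechanism on its own.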

Watchdog Timers
So far we have seen how to detect when something goes wrong, but how do we detect when the system is not doing anything at all?
- A watchdog timer monitors a module and triggers a recovery if the module does nothing within a given amount of time (e.g. a watchdog timer on a microprocessor bus)
Who watches the watchdog?
- Under a single-fault assumption, this usually is not a problem
- But what if the watchdog has a hard fault that prevents it from ever timing out and triggering a recovery?
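The kick-or-recover behavior described above can be sketched with a resettable timer; the class name and the callback are illustrative, and a real watchdog would be a separate hardware counter rather than a thread:

```python
import threading
import time

class Watchdog:
    """Triggers a recovery callback if kick() is not called within timeout."""
    def __init__(self, timeout, on_timeout):
        self.timeout, self.on_timeout = timeout, on_timeout
        self._timer = None

    def kick(self):
        """The monitored module calls this periodically to show it is alive."""
        if self._timer:
            self._timer.cancel()           # reset the countdown
        self._timer = threading.Timer(self.timeout, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer:
            self._timer.cancel()

fired = []
wd = Watchdog(0.05, lambda: fired.append("recover"))
wd.kick()            # module is alive once...
time.sleep(0.2)      # ...then hangs: no further kicks arrive
print(fired)         # -> ['recover']
wd.stop()
```

If the watchdog itself suffers a hard fault and never fires, this scheme silently loses its coverage, which is exactly the "who watches the watchdog" concern raised in the slide.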