Oct. 2007 Combinational Modeling Slide 1
Fault-Tolerant Computing: Motivation, Background, and Tools

Presentation transcript:

Oct. 2007 Combinational Modeling Slide 1
Fault-Tolerant Computing: Motivation, Background, and Tools

Oct. 2007 Combinational Modeling Slide 2
About This Presentation
Edition: First. Released: Oct. 2006. Revised: Oct.
This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Oct. 2007 Combinational Modeling Slide 3
Combinational Modeling

Oct. 2007 Combinational Modeling Slide 4
When model does not match reality.

Oct. 2007 Combinational Modeling Slide 5
A Motivating Example
Five-site distributed computer system. Data files are to be stored on the sites so that they remain available despite site and link malfunctions.
$S$ = site availability ($a_S$ in the handout)
$L$ = link availability ($a_L$ in the handout)
Some possible strategies:
 Duplication on home site and mirror site
 Triplication on home site and 2 backups
 Data dispersion through coding
Here, we ignore the important problem of keeping the replicas consistent and do not worry about malfunction detection and the attendant recovery actions.

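Oct. 2007 Combinational Modeling Slide 6
Data Availability with Home and Mirror Sites
Assume the data file must be obtained directly from a site that holds it.
Combinational modeling: Consider all combinations of circumstances that lead to availability/success (unavailability/failure).
Analysis by considering mutually exclusive subcases: the home copy is reachable (probability $SL$, file available with probability 1); the home link is down (probability $1 - L$, mirror reachable with probability $SL$); the home link is up but the home site is down (probability $(1 - S)L$, mirror reachable with probability $SL$). Because $(1 - L) + (1 - S)L = 1 - SL$, these combine to
$A = SL + (1 - SL)\,SL = 2SL - (SL)^2$
For example, with $S = 0.99$ and $L = 0.95$, $A = 0.9965$. With no redundancy, $A = 0.99 \times 0.95 = 0.9405$.
[Figure: requester R connected to home and mirror sites, each holding a copy D, with one panel per subcase]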

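Oct. 2007 Combinational Modeling Slide 7
Data Availability with Triplication
$A = SL + (1 - SL)\,SL + (1 - SL)^2\,SL = 3SL - 3(SL)^2 + (SL)^3$
The subcases are the same as for duplication, applied first to the home site and then to backup 1 (weights $SL$, $1 - L$, and $(1 - S)L$ at each level). The two cases in which the home copy is unreachable can be merged:
$A = SL + (1 - SL)\,[SL + (1 - SL)\,SL]$
For example, with $S = 0.99$ and $L = 0.95$, $A = 0.9998$. With duplication, $A = 0.9965$. With no redundancy, $A = 0.9405$.
[Figure: requester R connected to the home site, backup 1, and backup 2, each holding a copy D, with one panel per subcase]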

Oct. 2007 Combinational Modeling Slide 8
Data Availability with File Dispersion
Encode an $l$-bit file into about $5l/3$ bits (67% redundancy).
Break the encoded file into 5 pieces of length $l/3$.
Store each piece on one of the 5 sites.
Any 3 of the 5 pieces can be used to reconstruct the original file.
The requester holds one piece locally, so the file is accessible if 2 out of the 4 other sites are accessible:
$A = (SL)^4 + 4(1 - SL)(SL)^3 + 6(1 - SL)^2 (SL)^2 = 6(SL)^2 - 8(SL)^3 + 3(SL)^4$
For example, with $S = 0.99$ and $L = 0.95$:
 File dispersion: $A = 0.9992$, redundancy = 67%
 Duplication: $A = 0.9965$, redundancy = 100%
 Triplication: $A = 0.9998$, redundancy = 200%
 No redundancy: $A = 0.9405$
[Figure: requester R and four other sites, each holding one piece d (pieces 1 through 5)]
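The three storage strategies can be compared numerically with a single helper. The sketch below is illustrative only: the function name `availability_k_of_n` is made up for this example, and it assumes that each remote copy or piece is reachable independently with probability $SL$, as the formulas on the preceding slides do.

```python
from math import comb

def availability_k_of_n(p: float, k: int, n: int) -> float:
    """Probability that at least k of n independent sites are reachable,
    each with probability p (hypothetical helper, not from the slides)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

S, L = 0.99, 0.95      # site and link availability from the slide example
p = S * L              # probability that one remote site is reachable

results = {
    "no redundancy": p,
    "duplication": availability_k_of_n(p, 1, 2),
    "triplication": availability_k_of_n(p, 1, 3),
    "dispersion (2 of 4)": availability_k_of_n(p, 2, 4),
}

for name, a in results.items():
    print(f"{name:22s} A = {a:.4f}")
```

Duplication and triplication are the 1-out-of-2 and 1-out-of-3 special cases, which is why one binomial sum covers all of the strategies.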

Oct. 2007 Combinational Modeling Slide 9
Series System
A system composed of $n$ units, all of which must be healthy for the system to function properly.
$R = \prod_i R_i$
Example: A redundant system of valves connected in series is a series system with regard to stuck-on-shut malfunctions (it tolerates stuck-on-open valves).
Example: A redundant system of valves connected in parallel is a series system with regard to stuck-on-open malfunctions (it tolerates stuck-on-shut valves).

Oct. 2007 Combinational Modeling Slide 10
Series System: Implications to Design
Assume the exponential reliability law $R_i = \exp[-\lambda_i t]$. Then
$R = \prod_i R_i = \exp[-(\sum_i \lambda_i)\, t]$
Given the reliability goal $r$, find the required value of $\sum_i \lambda_i$.
Assign a failure rate "budget" to each unit and proceed with its design.
May have to reallocate budgets if the design proves impossible or costly.
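As a rough illustration of the budgeting step, the total failure-rate budget follows from solving $\exp[-(\sum_i \lambda_i) t] = r$ for the sum. The goal, mission time, unit count, and equal split below are all made-up values chosen for the example; they do not come from the slides.

```python
import math

def total_failure_rate_budget(r: float, t: float) -> float:
    """Largest sum of failure rates for which exp(-sum * t) still meets goal r."""
    return -math.log(r) / t

def series_reliability(rates, t: float) -> float:
    """Reliability of a series system of units with exponential lifetimes."""
    return math.exp(-sum(rates) * t)

# Hypothetical design goal: r = 0.99 over a 1000-hour mission, 4 units.
r_goal, mission_hours, n_units = 0.99, 1000.0, 4
budget = total_failure_rate_budget(r_goal, mission_hours)

# One possible allocation: split the budget equally among the units.
rates = [budget / n_units] * n_units

print(f"total budget  = {budget:.3e} failures/hour")
print(f"per-unit rate = {rates[0]:.3e} failures/hour")
print(f"achieved R    = {series_reliability(rates, mission_hours):.4f}")
```

An equal split is only a starting point; reallocating the budget means changing the entries of `rates` while keeping their sum within `budget`.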

Oct. 2007 Combinational Modeling Slide 11
Parallel System
A system composed of $n$ units, the health of one of which is enough for the system to function properly.
$1 - R = \prod_i (1 - R_i) \;\Rightarrow\; R = 1 - \prod_i (1 - R_i)$
That is, the system fails only if all units malfunction.
Example: A redundant system of valves connected in series is a parallel system with regard to stuck-on-open malfunctions (it tolerates stuck-on-open valves).
Example: A redundant system of valves connected in parallel is a parallel system with regard to stuck-on-shut malfunctions (it tolerates stuck-on-shut valves).

Oct. 2007 Combinational Modeling Slide 12
Parallel System: Implications to Design
Assume the exponential reliability law $R_i = \exp[-\lambda_i t]$, with
$1 - R = \prod_i (1 - R_i)$
Given the reliability goal $r$, the unit unreliabilities must satisfy $1 - r = \prod_i (1 - R_i)$.
Assign a failure probability "budget" to each unit.
For example, with identical units, $1 - R_m = \sqrt[n]{1 - r}$.
Assume $r = 0.9999$ and $n = 4$; then $1 - R_m = 0.1$ (module reliability must be 0.9).
Conversely, for $r = 0.9999$ and $R_m = 0.9$, $n = 4$ is needed.
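Both directions of the identical-unit calculation can be sketched in a few lines. The example uses a goal of $r = 0.9999$, which is the value for which $n = 4$ and $1 - R_m = 0.1$ are mutually consistent ($0.1^4 = 10^{-4}$). The function names and the small tolerance guarding against floating-point round-off are choices made for this illustration.

```python
def required_module_reliability(r: float, n: int) -> float:
    """Module reliability R_m needed so that n parallel identical units reach goal r."""
    return 1 - (1 - r) ** (1 / n)

def required_units(r: float, r_m: float, rel_tol: float = 1e-9) -> int:
    """Smallest n with (1 - r_m)^n <= 1 - r, allowing for round-off via rel_tol."""
    q_target = 1 - r      # allowed system unreliability
    q_m = 1 - r_m         # unreliability of one module
    n = 1
    while q_m ** n > q_target * (1 + rel_tol):
        n += 1
    return n

print(required_module_reliability(0.9999, 4))  # close to 0.9
print(required_units(0.9999, 0.9))             # 4
```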

Oct. 2007 Combinational Modeling Slide 13
The Concept of Coverage
For $r = 0.9999$ and $R_i = 0.9$, $n = 4$ is needed.
Standby sparing: One unit works; the others are either active concurrently or inactive (spares).
When a malfunction of the main unit is detected, it is removed from service and an alternate unit is brought on-line. Our analysis thus far assumes perfect malfunction detection and reconfiguration:
$R = 1 - (1 - R_m)^n = R_m \, \dfrac{1 - (1 - R_m)^n}{1 - (1 - R_m)}$
Let the probability of correct malfunction detection and successful reconfiguration be $c$ (coverage factor, $c < 1$). Then (see [Siew92])
$R = R_m \, \dfrac{1 - c^n (1 - R_m)^n}{1 - c\,(1 - R_m)}$

Oct. 2007 Combinational Modeling Slide 14
Impact of Coverage on System Reliability
$c$: probability of correct malfunction detection and successful reconfiguration
$R = R_m \, \dfrac{1 - c^n (1 - R_m)^n}{1 - c\,(1 - R_m)}$
Assume $R_m = 0.95$. Plot $R$ as a function of $n$ for $c$ = 0.9, 0.95, 0.99, 0.999, [...], 1.
[Plot: $R$ versus $n$, one curve per value of $c$]
Unless $c$ is near-perfect, adding more spares has no significant effect on reliability.
In practice, $c$ is not a constant and may deteriorate with more spares, so too many spares may be detrimental to reliability.
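The saturation effect can be reproduced by tabulating the coverage formula instead of plotting it. This is a minimal sketch: the function name is invented here, and only three of the coverage values are shown.

```python
def standby_reliability(r_m: float, n: int, c: float) -> float:
    """R = R_m * (1 - c^n (1 - R_m)^n) / (1 - c (1 - R_m)) for n units, coverage c."""
    q = c * (1 - r_m)     # probability that a unit fails AND the switchover succeeds
    return r_m * (1 - q**n) / (1 - q)

r_m = 0.95
print(" n   c=0.9      c=0.99     c=1.0")
for n in range(1, 7):
    row = "  ".join(f"{standby_reliability(r_m, n, c):.6f}" for c in (0.9, 0.99, 1.0))
    print(f"{n:2d}   {row}")

# With imperfect coverage, R can never exceed R_m / (1 - c (1 - R_m)),
# no matter how many spares are added.
print("limit for c = 0.9:", r_m / (1 - 0.9 * (1 - r_m)))
```

For $c = 1$ the expression reduces to $1 - (1 - R_m)^n$, the ordinary parallel-system result, whereas for $c = 0.9$ the value is capped near 0.9948 however large $n$ becomes.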

Oct. 2007 Combinational Modeling Slide 15
Series-Parallel System
The system functions properly if a string of healthy units connects one side of the diagram to the other.
Example: Parallel connection of series pairs of valves (tolerates one stuck-on-shut and one stuck-on-open valve)
$1 - R = (1 - R_1 R_2)(1 - R_3 R_4)$
Example: Series connection of parallel pairs of valves (tolerates one stuck-on-shut and one stuck-on-open valve)
$R = [1 - (1 - R_1)(1 - R_3)] \times [1 - (1 - R_2)(1 - R_4)]$

Oct. 2007 Combinational Modeling Slide 16
More Complex Series-Parallel Systems
The system functions properly if a string of healthy units connects one side of the diagram to the other.
Condition on the state of Unit 2:
$R = R_2 \times \text{prob(system OK} \mid \text{Unit 2 OK)} + (1 - R_2) \times \text{prob(system OK} \mid \text{Unit 2 not OK)}$
$R_{2\,\mathrm{OK}} = [1 - (1 - R_1)(1 - R_6)] \times [1 - (1 - R_3)(1 - R_5)] \times R_4$
$R_{2\,\neg\mathrm{OK}} = [1 - (1 - R_1 R_5)(1 - R_3 R_6)] \times R_4$
We can think of Unit 5 as being able to replace Units 2 and 3.
[Figure: block diagram of Units 1 through 6, with the reduced diagrams for the two cases]

Oct. 2007 Combinational Modeling Slide 17
Analysis Using Success Paths
$R \le 1 - \prod_i (1 - R_{i\text{th success path}})$
$R \le 1 - (1 - R_1 R_5 R_4)(1 - R_1 R_2 R_3 R_4)(1 - R_6 R_3 R_4)$   [*]
This yields an upper bound on reliability because it treats the paths as independent, even though they share units.
With equal module reliabilities: $R \le 1 - (1 - R_m^3)^2 (1 - R_m^4)$
If we expand [*] by multiplying out and replacing any power of a reliability term by the term itself (e.g., $R_4^2 \to R_4$), we get an exact reliability expression:
$R = 1 - (1 - R_1 R_4 R_5)(1 - R_3 R_4 R_6 - R_1 R_2 R_3 R_4 + R_1 R_2 R_3 R_4 R_6)$
$\;\; = R_3 R_4 R_6 + R_1 R_2 R_3 R_4 - R_1 R_2 R_3 R_4 R_6 + R_1 R_4 R_5 - R_1 R_3 R_4 R_5 R_6 - R_1 R_2 R_3 R_4 R_5 + R_1 R_2 R_3 R_4 R_5 R_6$
(Verify for the case of equal $R_j$.)
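One way to carry out the suggested verification is to compare the success-path bound with an exact value obtained by brute-force enumeration of all unit states. The sketch below assumes that the three success paths named on the slide are the only ones and that units fail independently; the function names are made up for this example.

```python
from itertools import product
from math import prod

# Success paths as listed on the slide (unit numbers).
PATHS = [(1, 5, 4), (1, 2, 3, 4), (6, 3, 4)]

def upper_bound(rel: dict) -> float:
    """Success-path bound: 1 - prod over paths of (1 - path reliability)."""
    return 1 - prod(1 - prod(rel[u] for u in path) for path in PATHS)

def exact(rel: dict) -> float:
    """Exact reliability: sum the probabilities of all unit-state
    combinations in which at least one success path is fully healthy."""
    units = sorted(rel)
    total = 0.0
    for state in product([True, False], repeat=len(units)):
        up = dict(zip(units, state))
        if any(all(up[u] for u in path) for path in PATHS):
            total += prod(rel[u] if up[u] else 1 - rel[u] for u in units)
    return total

r_m = 0.9                                   # equal module reliabilities
rel = {u: r_m for u in range(1, 7)}
print(f"upper bound = {upper_bound(rel):.6f}")
print(f"exact       = {exact(rel):.6f}")
```

Because every path contains Unit 4, the exact value cannot exceed $R_4$, which gives a quick sanity check on any expanded expression.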

Oct. 2007 Combinational Modeling Slide 18
k-out-of-n System
There are $n$ modules, any $k$ of which are adequate for proper system functioning.
Example: System with 2-out-of-3 voting (triple-modular redundancy, TMR)
[Figure: three modules feeding a voter V]
Assume a perfect voter:
$R = R_1 R_2 R_3 + R_1 R_2 (1 - R_3) + R_2 R_3 (1 - R_1) + R_3 R_1 (1 - R_2)$
With all units having the same reliability $R_m$ and an imperfect voter:
$R = (3 R_m^2 - 2 R_m^3)\, R_v$
k-out-of-n system in general:
$R = \sum_{j=k}^{n} \binom{n}{j} R_m^j (1 - R_m)^{n-j}$
Assuming that any 2 malfunctions in TMR lead to failure is pessimistic. With binary outputs, we can model compensating errors (when two malfunctioning modules produce 0 and 1 outputs).
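The general k-out-of-n sum and the TMR special case can be checked against each other. The following is a minimal sketch with invented function names; it assumes identical, independent modules, as the formulas above do.

```python
from math import comb

def k_out_of_n(r_m: float, k: int, n: int) -> float:
    """Reliability of a k-out-of-n system of identical independent modules."""
    return sum(comb(n, j) * r_m**j * (1 - r_m)**(n - j) for j in range(k, n + 1))

def tmr(r_m: float, r_v: float = 1.0) -> float:
    """TMR reliability (3 R_m^2 - 2 R_m^3) R_v; r_v = 1 means a perfect voter."""
    return (3 * r_m**2 - 2 * r_m**3) * r_v

for r_m in (0.99, 0.9, 0.5, 0.4):
    print(f"R_m = {r_m:4.2f}  simplex = {r_m:.4f}  TMR = {tmr(r_m):.4f}")
```

The table shows that TMR improves on a single module only when $R_m > 0.5$; below that point, voting among three unreliable modules is worse than using one.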

Oct. 2007 Combinational Modeling Slide 19
Consecutive k-out-of-n:G (k-out-of-n:F) System
Units are ordered, and the functioning (failure) of $k$ consecutive units leads to proper system function (system failure).
Ordering may be linear (usual case) or circular.
Example: A system of street lights may be considered a consecutive 2-out-of-n:F system.
Example: The following redundant bus reconfiguration scheme is a consecutive 2-out-of-4:G system.
[Figure: redundant bus lines from a module, with common control for shift-switch settings: up, straight, or down]
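For the linear consecutive k-out-of-n:F case, reliability can be computed by tracking the length of the current run of failed units. The sketch below assumes identical, independent units with reliability `r`; the dynamic-programming formulation and the function name are choices made for this illustration and do not come from the slides.

```python
def consecutive_k_out_of_n_F(r: float, k: int, n: int) -> float:
    """Reliability of a linear consecutive-k-out-of-n:F system.

    state[j] is the probability that, after the units processed so far,
    the system has not failed and the last j units are failed (j < k).
    """
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(n):
        new = [0.0] * k
        new[0] = sum(state) * r                 # next unit works: run resets to 0
        for j in range(k - 1):
            new[j + 1] = state[j] * (1 - r)     # next unit fails: run grows, still < k
        state = new                             # runs reaching length k are dropped
    return sum(state)

# Street-light example: failure means 2 adjacent lights are out.
for n in (2, 3, 10, 50):
    print(n, round(consecutive_k_out_of_n_F(0.9, 2, n), 6))
```

With $k = 1$ the recurrence reduces to the series-system product $r^n$, and with $k = n$ it reduces to the parallel-system value $1 - (1 - r)^n$.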

Oct. 2007 Combinational Modeling Slide 20
n-Modular Redundancy with Replicated Voters
Voters (all but the final one in a chain) are no longer critical components.
[Figure: chain of module stages separated by triplicated voters V]
Can model as a series system of 2-out-of-3 subsystems.

Oct. 2007 Combinational Modeling Slide 21
Fault Tree Analysis: Introduction
Resources cited on the slide: quick guide to fault trees; fault tree handbook; fault-tree tutorial.
Top-down approach to failure analysis: Start at the top (tree root) with an undesirable event called a "top event" and then determine all the possible ways that the top event can occur.
Analysis proceeds by determining how the top event can be caused by individual or combined lower-level undesirable events.
Example: The top event is "being late for work". Possible causes: clock radio not turning on, family emergency, bus not running on time. The clock radio won't turn on if there is a power failure and the battery is dead.

Oct. 2007 Combinational Modeling Slide 22
Fault Tree Analysis: The Process
1. Identify the "top event"
2. Identify 1st-level contributors to the top event
3. Use a logic gate to connect the 1st level to the top
4. Identify 2nd-level contributors
5. Link the 2nd level to the 1st level
6. Repeat until done
Event symbols: basic events (leaf, atomic); composite events; external event; enabling condition
Gate symbols: AND gate; OR gate
Other symbols: XOR (not used in reliability analysis); k/n (k-out-of-n gate); inhibit gate

Oct. 2007 Combinational Modeling Slide 23
Fault Tree Analysis: Cut Set
A cut set is any set of initiators such that the failure of all of them induces the top event.
Minimal cut set: A cut set for which no proper subset is also a cut set.
Minimal cut sets for this example: {a, b}, {a, d}, {b, c}
[Figure: fault tree over the initiators a, b, c, d]
Just as logic circuits can be transformed to different (simpler) ones, fault trees can be manipulated to obtain equivalent forms.
Path set: Any set of initiators such that if all are failure-free, the top event is inhibited (to derive path sets, exchange AND gates and OR gates and then find cut sets).
What are the path sets for this example?
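A brute-force route to the path sets uses the observation that a set of failure-free initiators inhibits the top event exactly when it contains at least one member of every minimal cut set. The sketch below searches for the smallest such sets. It assumes the three minimal cut sets listed on the slide are complete, and the function name is invented for this example.

```python
from itertools import combinations

def minimal_path_sets(cut_sets):
    """Minimal sets of initiators that intersect every given cut set."""
    events = sorted(set().union(*cut_sets))
    found = []
    for size in range(1, len(events) + 1):          # smallest candidates first
        for cand in combinations(events, size):
            s = set(cand)
            hits_all = all(s & cs for cs in cut_sets)
            is_minimal = not any(p <= s for p in found)
            if hits_all and is_minimal:
                found.append(s)
    return found

cut_sets = [{"a", "b"}, {"a", "d"}, {"b", "c"}]
for ps in minimal_path_sets(cut_sets):
    print(sorted(ps))
```

The search is exponential in the number of initiators, which is acceptable for classroom-size trees; exchanging AND and OR gates, as the slide suggests, reaches the same sets symbolically.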

Oct. 2007 Combinational Modeling Slide 24
Converting Fault Trees to Reliability Block Diagrams
Minimal cut sets for this example: {a, b}, {a, d}, {b, c}
[Figure: fault tree over the initiators a, b, c, d and the corresponding reliability block diagram]
Another example: Minimal cut sets {a, b}, {a, c}, {a, d}, {c, d, e, f}
 Construct a fault tree for the above.
 Derive a reliability block diagram.
 What are the path sets for this example?
Applications of cut sets:
1. Evaluation of reliability
2. Common-cause failure assessment
3. Small cut set implies high vulnerability
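The first application, evaluation of reliability, can be illustrated by computing the top-event probability directly from the minimal cut sets. The sketch below assumes independent initiators and uses a made-up failure probability of 0.1 for each; it compares an exact enumeration with the simple sum of cut-set probabilities, which is an upper bound.

```python
from itertools import product
from math import prod

def top_event_probability(cut_sets, q: dict) -> float:
    """Exact top-event probability by enumerating all initiator states.
    q maps each initiator to its (assumed independent) failure probability."""
    events = sorted(q)
    total = 0.0
    for state in product([True, False], repeat=len(events)):   # True = failed
        failed = {e for e, f in zip(events, state) if f}
        if any(cs <= failed for cs in cut_sets):                # some cut set fully failed
            total += prod(q[e] if e in failed else 1 - q[e] for e in events)
    return total

def cut_set_sum_bound(cut_sets, q: dict) -> float:
    """Upper bound: sum over cut sets of the product of failure probabilities."""
    return sum(prod(q[e] for e in cs) for cs in cut_sets)

example_1 = [{"a", "b"}, {"a", "d"}, {"b", "c"}]
example_2 = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"c", "d", "e", "f"}]

q1 = {e: 0.1 for e in "abcd"}       # hypothetical failure probabilities
q2 = {e: 0.1 for e in "abcdef"}

for name, cs, q in (("example 1", example_1, q1), ("example 2", example_2, q2)):
    print(name, top_event_probability(cs, q), cut_set_sum_bound(cs, q))
```

In the second example the four-element cut set contributes far less to the sum than the two-element ones, which illustrates why small cut sets indicate high vulnerability.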