Presentation is loading. Please wait.

Presentation is loading. Please wait.

ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1.

Similar presentations


Presentation on theme: "ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1."— Presentation transcript:

1 ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1

2 ECE 753 Fault Tolerant Computing 2 Overview Motivation About the Course and the Instructor –Conduct, Outline, CoursepackConduct, Outline, Coursepack Introduction Terminology and definitions –Sources, Overview and CommentsSources, Overview and Comments –System definedSystem defined Dependability/Security and their attributes Threat to dependability and modeling FEF chain Means to attain dependability Fundamental Principles

3 ECE 753 Fault Tolerant Computing 3 Motivation Informal Definition Key Attributes Who, What and Why Study Examples

4 ECE 753 Fault Tolerant Computing 4 Motivation What is Fault-Tolerance? A “fault-tolerant system” is one that continues to perform at desired level of service in spite of failures in some components that constitute the system.A “fault-tolerant system” is one that continues to perform at desired level of service in spite of failures in some components that constitute the system.

5 ECE 753 Fault Tolerant Computing 5 Motivation (contd.) Key attributes Fault - Error - Failure Performance - Availability - Reliability More recently concept of “survivability” More recently concept of “survivability” Inclusions of these constraints at design stage is likely to be more cost effective.Inclusions of these constraints at design stage is likely to be more cost effective.

6 ECE 753 Fault Tolerant Computing 6 Motivation (contd.) Who is concerned about fault-tolerance? –System Users – irrespective of the application but some are a lot more concerned than othersSystem Users – irrespective of the application but some are a lot more concerned than others Who is concerned at design stages? –UniversitiesUniversities R, d, and a (Research, development, applications) –IndustryIndustry r, D, and A (research, Development, Applications) Issues –Design, Analysis/Validation, Implementation, Testing/Validation, EvaluationDesign, Analysis/Validation, Implementation, Testing/Validation, Evaluation

7 ECE 753 Fault Tolerant Computing 7 Motivation (contd.) Examples General Purpose Systems –PCs: RAMs with parity checks and possibly ECCPCs: RAMs with parity checks and possibly ECC (consideration of re-execution on failure detection is being investigated) –Workstations/Servers: error detection (HW), occasional corrective action (SW), Even ECC (HW), keeping log (SW)Workstations/Servers: error detection (HW), occasional corrective action (SW), Even ECC (HW), keeping log (SW)

8 ECE 753 Fault Tolerant Computing 8 Motivation (contd.) Examples Reliable Systems –Telephone systemsTelephone systems –Banking systems e.g. ATMBanking systems e.g. ATM –Stock marketStock market –CAE - exams/projectsCAE - exams/projects –Football games - display/ticketingFootball games - display/ticketing

9 ECE 753 Fault Tolerant Computing 9 Motivation (contd.) Examples Critical and Life Critical Systems –Manned and unmanned space borne systemsManned and unmanned space borne systems –Aircraft control systemsAircraft control systems –Nuclear reactor control systemsNuclear reactor control systems –Life support systemsLife support systems

10 ECE 753 Fault Tolerant Computing 10 Motivation (contd.) Examples Reliable -> Critical Systems –911 telephone switching system911 telephone switching system –Traffic light control systemTraffic light control system –Automotive control systems (ABS, Fuel injection system)Automotive control systems (ABS, Fuel injection system)

11 ECE 753 Fault Tolerant Computing 11 About the Course and the Instructor Conduct –homeworks, exam, project, gradinghomeworks, exam, project, grading Outline Coursepack –references and reading listreferences and reading list

12 ECE 753 Fault Tolerant Computing 12 Introduction –Historical perspective and major pushHistorical perspective and major push –New initiativesNew initiatives –Goals of fault-toleranceGoals of fault-tolerance –Applications of fault-toleranceApplications of fault-tolerance

13 ECE 753 Fault Tolerant Computing 13 Introduction (contd.) Historical Perspective –not a new conceptnot a new concept –first use by J. van Neumann 1956first use by J. van Neumann 1956 probabilistic logic and synthesis of reliable organism from unreliable components, Annals of mathematical studies, Princeton University Pressprobabilistic logic and synthesis of reliable organism from unreliable components, Annals of mathematical studies, Princeton University Press Major push –Space programSpace program –HW Fault tolerance - thenHW Fault tolerance - then –SW Fault tolerance laterSW Fault tolerance later –Merge the twoMerge the two

14 ECE 753 Fault Tolerant Computing 14 Introduction (contd.) New initiatives Density of devices more failures likely Power issue – schedular, on-chip sensors Failures due to soft-errors, life time degradations - hardening, re-exection, - on-chip ECC - erconfiguration - microarchitectural solutions - architectural solutions

15 ECE 753 Fault Tolerant Computing 15 Introduction (contd.) New initiatives (contd.) Deep submicron technology and time to market pressure designs not fully verified Implementation of numerous functionalities on chip/board/system possibility of system hang-up Speculative execution results may need to be re- checked Low cost of HW and SW affordable/ecnomical Hot issues: Soft errors, Life-time failures, Power and Thermal ManagementHot issues: Soft errors, Life-time failures, Power and Thermal Management

16 ECE 753 Fault Tolerant Computing 16 Introduction (contd.) Goals - different goals for different applicationsGoals - different goals for different applications The key word is “reliability” – has different meaning for different users and applications Intuitive explanations –DependabilityDependability –ServiceService –SpecificationSpecification

17 ECE 753 Fault Tolerant Computing 17 Introduction (contd.) Intuitive concepts –Reliability – continues to workReliability – continues to work –Availability – works when I need itAvailability – works when I need it –Safety – does not put me in jeopardySafety – does not put me in jeopardy –PerformabilityPerformability –MaintainabilityMaintainability –TestabilityTestability –Survivability – will the system survive catastrophic events?Survivability – will the system survive catastrophic events? –SecuritySecurity

18 ECE 753 Fault Tolerant Computing 18 Introduction (contd.) Applications –Space borne systemSpace borne system long life system –Airplane control systemAirplane control system critical system –Transaction processing systemTransaction processing system high availability system –Switching systemSwitching system high availability over certain level of performance

19 ECE 753 Fault Tolerant Computing 19 Terminology and definitions Reliability and concept of probability –R(t): conditional probability that a system provides continuous proper service in the interval [0,t] given that it provided desired service at time 0.R(t): conditional probability that a system provides continuous proper service in the interval [0,t] given that it provided desired service at time 0. Availability Performabiltiy –An ExampleAn Example Dependability Security

20 Sources, Overview and Comments (1/4) Key reference: Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004. Other references: Israel Koren and C. Mani Krishna, Fault Tolerant Systems, Elsevier, 2007. D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice- Hall, 1996.D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice- Hall, 1996. B. W. Johnson, Design and analysis of fault tolerant digital systems, Addison-Wesley, First edition, 1989.B. W. Johnson, Design and analysis of fault tolerant digital systems, Addison-Wesley, First edition, 1989. My course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.htmlMy course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.html ECE 753 Fault Tolerant Computing

21 Sources, Overview and Comments (2/4) What does the paper cover? –Very basic definitions of the terminologies used in dependable computing –It categorizes definitions in three groups System, attributes of dependability, threats to dependability –Covers very briefly methods to attain dependability ECE 753 Fault Tolerant Computing

22 Sources, Overview and Comments (3/4) How to read the paper? –It is easy to read – scan it first and then read it –I have organized the material differently – you may find it helpful What is not covered? –One attribute almost missing - survivability –Basic methods of Fault Tolerance and their characterization ECE 753 Fault Tolerant Computing

23 Sources, Overview and Comments (4/4) Chronology of Developments –Need for fault-tolerance - inception of the space program (recall “Voyager” launched in 1977 is still sending signals) –First standard glossary in 1985 –Integration of performance etc into fault tolerance – and hence the term “Dependability” – book published in 1992 –Recognition of “Security” as a basic attribute of dependability – this paper in 2004 ECE 753 Fault Tolerant Computing

24 System Defined (1/4) “... an entity that interacts with other entities” –First entity (system) – limited to be “electronic (mostly digital)” or “computer based” –Second entity Hardware, software, human, other systems,.. (can also be called “environment”) Characterization and fundamental properties –Functionality –Performance –Dependability and security –Cost (usuability, managability, adaptabilty : not directly included in the paper) ECE 753 Fault Tolerant Computing

25 System Defined (2/4) Function – “ what the system is intended to do” –functional specifications: describe it in terms of functionality and performance –behavior – described as a sequence of states to implement the functionality –Total states – set of states as system evolves Internal states External states – as viewed by the environment and users Structure – “What enables system behavior (function)” –Interconnected components – recursively defined to “atomic” level ECE 753 Fault Tolerant Computing

26 System Defined (3/4) System Life Cycle –Development phase –Use phase Service – what is delivered by the system to its “environment” (user) –Environment sees only the “external states” –Development Phase – activities from concept to decision that system is ready for “use phase” –Use Phase - More meaningful and includes service delivery, service outage, service shutdown, maintenance ECE 753 Fault Tolerant Computing

27 System Defined (4/4) Development phase environment –Physical world –Human developers –Development tools –Production and test facilities User phase environment –Physical world –Administrators – maintainers –Users and intruders –Providers and infrastructure ECE 753 Fault Tolerant Computing

28 Dependability/Security Attributes (1/6) Original definition: “ability to deliver service that can justifiably be trusted” Encompassing the following attributes –Availability –Reliability –Safety –Integrity –maintainability ECE 753 Fault Tolerant Computing

29 Dependability/Security Attributes (2/6) New definition: “ability to avoid service failures that are more frequent or more severe than is acceptable” - deliver service that can justifiably be trusted Reason for modification –Security related issues –This recognizes that a system can fail and it usually does fail and it still can be called dependable –This definition also enables a connection with “development failures” ECE 753 Fault Tolerant Computing

30 Dependability/Security Attributes (3/6) Dependability availability: readiness for correct service. reliability: continuity of correct service. safety: absence of catastrophic consequences on the user(s) and the environment. integrity: absence of improper system alterations. maintainability: ability to undergo modification and repairs When addressing security, an additional attribute confidentiality: the absence of unauthorized disclosure ECE 753 Fault Tolerant Computing

31 Security is concurrent existence of composite of the attributes 1) availability (for authorized actions only), 2) confidentiality, and 3) integrity (with “improper” meaning “unauthorized”) Dependability/Security Attributes (4/6) ECE 753 Fault Tolerant Computing

32 F Dependability/Security Attributes (5/6) ECE 753 Fault Tolerant Computing

33 Other related concepts – summarized in table (Fig 15) - these are –Dependability –High confidence –Survivability –Trustworthiness Example: all these have similar goals such as 1): ability to deliver service, 2): predictable service, 3): fulfill mission, 4): assurance of expected service delivery Dependability/Security Attributes (6/6) ECE 753 Fault Tolerant Computing

34 Threats and modeling threats (1/12) Different phases are open to different types of threats – generally termed as “faults” Faults lead to “errors” – a total state of the system different from the “true total state” Errors can lead to “failure” – the service deviates from the desired service This creates a FEF chain – a hierarchical phenomenon (see next and more later) ECE 753 Fault Tolerant Computing

35 faul t error failu re Fault activation – Error manifestation – Failure Threats and modeling threats (2/12) Fault – active or dormant Error – masked or latent Failure – incorrect response

36 Threats and modeling threats (3/12) FEF Chain in an hierarchy

37 Threats and modeling threats (4/12) Fault classes Groups (not exclusive) –Development, Physical – (that affect hardware - I disagree with this definition ), Interaction Viewpoints: –phase, system boundary, cause, dimension, objective, intent, capability, persistence

38 Threats and modeling threats (5/12) Fault Taxonomy and Examples Production defect: physical, hardware, natural Bug: physical, software, natural Omission (absence of an action): Humam made, system generated Melicious (meant to cause harm): Human made, Hardware or software Notes: 1. Paper has a classification – Fig 4 and 5 2. Examples and definition of many other faults given. Some listed on next slide

39 Threats and modeling threats (6/12) Fault Taxonomy (contd.) Permanent faults Intermittent faults – repeat at some interval Transient faults – no specific interval Malicious logic faults – caused be natural faults Intrusion attempts – caused by humans Interaction faults – may be development phase or use phase Configuration faults – incorrect setting of parameters

40 Threats and modeling threats (7/12) Errors classes Detected Latent An example –An adder gives incorrect sum for certain operands –Fault is active when those operands appear, otherwise it is dormant –Incorrect sum is latent unless used or checked for correctness

41 Threats and modeling threats (8/12) Failure classes Development failures Service failures Security failures

42 Threats and modeling threats (9/12) Development failures – introduced during the development phase –Human developers –Tools –Production facility –Budgetary reasons –Scheduling issue (time to market) (basically the system delivered is a downgraded system)

43 Threats and modeling threats (10/12) Service failures - delivery of incorrect service – Four viewpoints 1. Failure domain –Content failure –Timing failure – early or late delivery of the service(s) Special case: silent failure, halt failure, crash failure Erratic failure (like Byzantine failure)

44 Threats and modeling threats (11/12) 2. Failure detectability –Signal provided by some checking mechanism Signaled failure Unsignaled failure False alarm 3. Consistency –Consistent failure – all services see the same data –Inconsistent – different services see different data (like Byzantine failure)

45 Threats and modeling threats (12/12) 4. Consequence of failure –Need to rate the failure and hence develop criteria – examples: Outage of duration (availability related) Lives being endangered (safely related) Extent of corrupted service (integrity related) Amount of information disclosed (confidentiality related)

46 Means to attain dependability (1/6) Fault Prevention or Fault Avoidance Improvement of development process Elimination of causes that can induce faults Fault Tolerance Techniques and implementations (more later)

47 Means to attain dependability (2/6) Fault Removal Remove faults during development phase – extensive simulation and validation Testing Deterministic testing Random and statistical testing Back to back testing Test/validation quality: fault injection, design for test/verification

48 Means to attain dependability (3/6) Fault Forecasting – evaluate the system behavior and then use one or more methods previously discussed to improve dependability Qualitative evaluation Quantitative evaluation Use benchmarks Use of simulators Examples: 1) Error and failure logs 2) when and where commissioned

49 Means to attain dependability (4/6) Fault Tolerance Techniques Error detection - need redundancy Duplicate execution Use of parity Checker programs and/or hardware More later

50 Means to attain dependability (5/6) Recovery - Key is redundancy Error handling Masking and compensation Rollback Rollforward Fault handling Diagnosis Isolation Reconfiguration Initialization

51 Means to attain dependability (6/6) Key to fault tolerance Break FEF chain Use “redundancy” to improve “use phase” dependability and security See next “fundamental principles”

52 ECE 753 Fault Tolerant Computing 52 Fundamental Principles Hardware redundancy Low level High level Software Redundancy Time Redundancy Information Redundancy

53 ECE 753 Fault Tolerant Computing 53 Fundamental Principles (contd.) Hardware Redundancy - Low level –logic levellogic level Example 1 - Self checking circuits Example 2 - Arithmetic code A modular adder using the mathematical principle (A+B) mod k = ((A mod k) + (B mod k)) mod k Hardware Redundancy - High level –Triplicate or 5-copies as in space shuttleTriplicate or 5-copies as in space shuttle

54 ECE 753 Fault Tolerant Computing 54 Fundamental Principles (contd.) Software Redundancy –Use two different programs/algorithmsUse two different programs/algorithms Time Redundancy –Re-compute or redo the task and compare the resultsRe-compute or redo the task and compare the results –May or may not use the same hardware/softwareMay or may not use the same hardware/software Information Redundancy –backup informationbackup information –Use of ECCUse of ECC Question - What kind of FT is achieved?

55 ECE 753 Fault Tolerant Computing 55 Fault-Error-Failure Intuitive definitions Origins of faults Methods to break FEF chain Attribute of faults

56 ECE 753 Fault Tolerant Computing 56 Fault-Error-Failure concept (contd.) Intuitive definitions Fault - –An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or un-intentional), desgin flaw, …An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or un-intentional), desgin flaw, … –CausesCauses Error - Effect of activation of a fault Failure - over-all system effect of an error Fault -> Error -> Failure

57 ECE 753 Fault Tolerant Computing 57 Fault-Error-Failure concept (contd.) Origins of faults Physical device level (HW) Logic level (HW) Chip level (HW) System level (HW/SW) –interfacing, specifications, …interfacing, specifications, … Why systems fail

58 ECE 753 Fault Tolerant Computing 58 Fault-Error-Failure concept (contd.) Methods to break FEF chain Flow FEF Barriers –Fault avoidanceFault avoidance –Fault maskingFault masking –Fault removalFault removal –Fault forecastingFault forecasting

59

60 ECE 753 Fault Tolerant Computing 60 Fault-Error-Failure concept (contd.) Attribute of faults Cause Nature Duration Extent Value


Download ppt "ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1."

Similar presentations


Ads by Google