ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Low Level Fault-Tolerance: Watchdog and Re-execution.


1 ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja, Department of Electrical and Computer Engineering
Low Level Fault-Tolerance: Watchdog and Re-execution

2 Overview
Introduction
Watchdog techniques
– Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
Re-execution for fault-tolerance
– Basic techniques: RESO concept, program re-execution, instruction re-execution
– Case studies: fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
Summary

3 Introduction
References
Watchdog - [mahm:88]
Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.

4 Introduction (contd.)
Somewhat higher level than ECC and masking at circuit level
Bordering between hardware and software (hardware often assisted by software)
These are some of the very first fault-tolerance methods

5 Watchdog techniques
Key concept
– The actions of a process or processor are checked by another unit, normally a hardware unit. The checks include whether the process is still active (alive) and whether it is executing incorrect paths during execution.
[Figure: Processor, watchdog]

6 Watchdog: Timers
Check for aliveness
– Processor resets the timer at certain intervals or on certain conditions
– Timer raises an error flag if it is not reset before it overruns
[Figure: Processor, timer, Error]
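
The aliveness check can be sketched as a small simulation. This is only an illustration under stated assumptions: the class `WatchdogTimer`, the helper `run`, and the tick-based timing are hypothetical names and simplifications, not something taken from the slides.

```python
class WatchdogTimer:
    """Counts down once per clock tick and latches an error flag on overrun."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.error = False

    def reset(self):
        # Called by the monitored processor to show that it is still alive.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Called by a clock source that is independent of the processor.
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.error = True


def run(processor_resets_at, total_ticks, timeout_ticks=3):
    """Simulate the timer; the processor resets it at the given tick numbers.

    Returns the tick at which the error flag was raised, or None.
    """
    timer = WatchdogTimer(timeout_ticks)
    for t in range(total_ticks):
        if t in processor_resets_at:
            timer.reset()
        timer.tick()
        if timer.error:
            return t
    return None


healthy = run({0, 2, 4, 6, 8}, total_ticks=10)   # processor keeps resetting
hung = run({0, 2}, total_ticks=10)               # processor stops after tick 2
```

In this sketch a processor that keeps resetting the timer never triggers the flag, while one that stops resetting is flagged a few ticks later.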

7 Watchdog: Timers (contd.)
Check for timeout
– Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)
[Figure: Processor A, Processor B, Timer]
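
A software version of the timeout check might look like the following sketch. It assumes two threads standing in for Processor A and Processor B and uses the standard `queue.Queue.get(timeout=...)` call as the timer; the function names and the `("ack", msg)` reply format are hypothetical.

```python
import queue
import threading
import time


def processor_b(inbox, outbox, delay):
    """Stand-in for Processor B: replies to one message after `delay` seconds."""
    msg = inbox.get()
    time.sleep(delay)
    outbox.put(("ack", msg))


def send_and_wait(msg, delay_b, timeout):
    """Processor A: send a message, start the timer, and wait for the reply."""
    to_b, from_b = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=processor_b, args=(to_b, from_b, delay_b), daemon=True
    )
    worker.start()
    to_b.put(msg)
    try:
        # get() blocks for at most `timeout` seconds, then raises queue.Empty.
        return from_b.get(timeout=timeout)
    except queue.Empty:
        return ("timeout", msg)


on_time = send_and_wait("ping", delay_b=0.0, timeout=1.0)
too_late = send_and_wait("ping", delay_b=1.0, timeout=0.05)
```

What the sender does after a timeout (repeat the message, raise an error, switch to a spare) is a separate design decision that this sketch leaves open.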

8 Watchdog: Timers (contd.)
Applications
– Processor control systems (chemical, mechanical and other control systems)
– Switching systems: messages sent or received often wait a certain length of time before they are repeated
– Networks: email messages often have timeouts associated with them

9 Watchdog: Processors
Architecture: can be complex, but let us consider the following simple architecture
[Figure: Processor and Memory connected by a BUS (data, address, control), with the Watchdog (observer)]

10 Watchdog: Processors (contd.)
What can it achieve?
– Observe the address bus
    Can observe the data
    Can observe instructions
    Can check the flow of program control
– Need to know what kind of errors can occur to determine the capability of this method

11 Watchdog: Error models
Experimental setup to develop error models applicable at this level
– Processor-memory architecture
– Inject faults (random errors) in the I/O processor, within the processor (register file, states), and within memory
– Simulate
– Hardware was also designed to inject such faults and study the impact/behavior
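
A simulated fault-injection experiment of this kind can be sketched as below. Everything here is an assumption made for illustration: the single-bit-flip fault model, the 32-bit word width, and the stand-in workload `checksum_program` are not taken from the studies the slide refers to.

```python
import random


def flip_bit(word, bit, width=32):
    """Return `word` with one bit inverted (single-bit-flip fault model)."""
    return (word ^ (1 << bit)) & ((1 << width) - 1)


def checksum_program(memory):
    """Stand-in workload: sum of all memory words modulo 2**32."""
    return sum(memory) & 0xFFFFFFFF


def inject_and_run(memory, rng):
    """Flip one random bit in one random memory word, then run the workload."""
    faulty = list(memory)          # leave the fault-free image untouched
    addr = rng.randrange(len(faulty))
    bit = rng.randrange(32)
    faulty[addr] = flip_bit(faulty[addr], bit)
    return addr, bit, checksum_program(faulty)


memory_image = [10, 20, 30, 40]
golden = checksum_program(memory_image)            # fault-free reference run
addr, bit, observed = inject_and_run(memory_image, random.Random(0))
fault_visible = observed != golden
```

Repeating the injection many times and classifying how each run deviates from the golden run is what yields an error model; the sketch shows a single trial only.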

12 Watchdog: Error models (contd.)
Conclusions of the studies
– Program flow could change (branch to no branch, or vice versa)
– Instruction fetched from data space
– Access to nonexistent memory space
– Data fetched from instruction space
– Illegal instruction
– Writing in protected area (ROM)
60% of all faults could be detected by monitoring control flow. Thus we need to develop methods that are good at monitoring control flow.

13 Watchdog: Control flow checking
Basic principle
– Analyze the program and extract control information
    Branch free intervals
    Subroutine calls
– Assign signatures to branch free intervals and provide these signatures to the watchdog processor so that it can check these values

14 Watchdog: Control flow checking (contd.)
A simple example
Program                Watchdog
start             -->  receive start
branch free code       observe bus cont. to form signature
check sig X       -->  check X against collected sig
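
The exchange in this example can be sketched in code. The signature function (rotate left and XOR over 16-bit words), the class `Watchdog`, and the instruction words in `BLOCK` are hypothetical choices made for illustration; the slides do not specify a particular signature scheme.

```python
def rotl16(value):
    """Rotate a 16-bit value left by one bit."""
    return ((value << 1) | (value >> 15)) & 0xFFFF


class Watchdog:
    def __init__(self):
        self.collecting = False
        self.sig = 0

    def receive_start(self):
        # "start" message from the program: begin a new signature.
        self.collecting = True
        self.sig = 0

    def observe(self, word):
        # Called for every instruction word seen on the bus.
        if self.collecting:
            self.sig = rotl16(self.sig) ^ (word & 0xFFFF)

    def check(self, expected):
        # "check sig X" message: compare X against the collected signature.
        self.collecting = False
        return self.sig == expected


def run_block(words_on_bus, expected_sig):
    """Replay one branch free interval past the watchdog and check it."""
    wd = Watchdog()
    wd.receive_start()
    for word in words_on_bus:
        wd.observe(word)
    return wd.check(expected_sig)


def reference_signature(block):
    """What a compiler would precompute for a fault-free block."""
    wd = Watchdog()
    wd.receive_start()
    for word in block:
        wd.observe(word)
    return wd.sig


BLOCK = [0x1A2B, 0x3C4D, 0x5E6F]          # hypothetical instruction words
X = reference_signature(BLOCK)
CORRUPTED = [0x1A2B, 0x3C4C, 0x5E6F]      # one bit flipped in the second word
```

With this signature function any single corrupted word changes the collected signature, so `run_block(CORRUPTED, X)` fails the check while `run_block(BLOCK, X)` passes.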

15 Watchdog: Control flow checking (contd.)
Details and variations
– Structural integrity checking
    Analyze the program control flow and create a program control flow graph
    Assign a unique identifier to each node of the graph
    Provide the control flow graph to the watchdog along with the identifiers
    In case of branches, the watchdog expects one of the many possible identifiers
    Limitations
    – Performance impact: insertion of special instructions
    – Inability to detect data processing variations (e.g., an add changed to a sub)
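
Structural integrity checking can be illustrated with a small sketch. The graph `CFG`, its node identifiers, and the function `check_path` are hypothetical; the idea shown is only that the watchdog accepts a reported identifier if it is a legal successor of the previous node.

```python
# Hypothetical control flow graph: node identifier -> set of legal successors.
CFG = {
    "A": {"B", "C"},   # A ends in a branch, so two identifiers are acceptable
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}


def check_path(cfg, entry, reported_ids):
    """Return the position of the first illegal transition, or None if all are legal."""
    current = entry
    for step, node in enumerate(reported_ids):
        if node not in cfg[current]:
            return step
        current = node
    return None


legal = check_path(CFG, "A", ["B", "D"])          # A -> B -> D
illegal = check_path(CFG, "A", ["B", "C"])        # B -> C is not an edge
wrong_but_legal = check_path(CFG, "A", ["C", "D"])
```

The last call shows a limit of this kind of check: if a fault makes the program take the other side of a branch, the path is still a legal path in the graph and the watchdog accepts it.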

16 Watchdog: Control flow checking (contd.)
Details and variations (contd.)
– Derived signature checking
    Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
    At run time these signatures are provided to the watchdog, using tag bits to differentiate between regular instructions and watchdog messages
    Watchdog monitors the bus, generates its own signatures, and compares them with the signatures captured from the bus (compiled signatures)
    Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures. When a tag for a signature appears on the bus, the watchdog captures the tagged word and forces a NOP on the bus for the regular processor
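
The tagged-memory example can be sketched as follows. This is a simplification under stated assumptions: only two tag values are modeled (`TAG_INSTR`, `TAG_SIG`) rather than two full tag bits, the signature is a 16-bit additive checksum, and `NOP` is encoded as 0; all of these names and encodings are hypothetical.

```python
TAG_INSTR, TAG_SIG = 0, 1   # hypothetical tag values
NOP = 0x0000                # hypothetical NOP encoding


def checksum(words):
    return sum(words) & 0xFFFF


def compile_with_signatures(blocks):
    """Emit (tag, word) memory words, with a signature after each branch free block."""
    memory = []
    for block in blocks:
        memory.extend((TAG_INSTR, word) for word in block)
        memory.append((TAG_SIG, checksum(block)))
    return memory


def run_with_watchdog(memory):
    """Return (words delivered to the processor, index of first mismatch or None)."""
    running = 0
    delivered = []
    for index, (tag, word) in enumerate(memory):
        if tag == TAG_SIG:
            delivered.append(NOP)          # processor sees a NOP, not the signature
            if running != word:
                return delivered, index    # run-time signature differs from compiled one
            running = 0
        else:
            delivered.append(word)
            running = (running + word) & 0xFFFF
    return delivered, None


memory = compile_with_signatures([[1, 2, 3], [4, 5]])
delivered, error_at = run_with_watchdog(memory)

corrupted = list(memory)
corrupted[1] = (TAG_INSTR, 2 ^ 0x10)       # flip one bit in the second instruction
_, corrupted_error_at = run_with_watchdog(corrupted)
```

The delivered stream contains one NOP per signature word, which is where the performance overhead of this scheme comes from.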

17 Watchdog: Control flow checking (contd.)
Details and variations (contd.)
– Derived signature checking (contd.)
    Coverage
    – Can detect random errors in instructions in branch free intervals (but aliasing can occur)
    Overheads
    – Memory width increase due to tag bits
    – Memory increase due to signature insertions
    – Performance impact due to NOPs
    Solutions
    – Using the path signature method reduces the number of signatures needed
    – Branch address hashing: merge signature and branch address
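
Aliasing is easy to see with a deliberately weak signature. The sketch below assumes an 8-bit additive checksum and made-up instruction words; real schemes may use stronger signatures, but any fixed-width signature maps many different blocks to the same value.

```python
def additive_signature(words, width=8):
    """Sum of the words modulo 2**width (a deliberately simple signature)."""
    return sum(words) % (1 << width)


good = [0x12, 0x34, 0x56]
single_error = [0x12, 0x35, 0x56]    # one word off by one: detected
compensating = [0x13, 0x33, 0x56]    # +1 and -1 cancel: aliasing
reordered = [0x34, 0x12, 0x56]       # same words, wrong order: aliasing

reference = additive_signature(good)
detected = additive_signature(single_error) != reference
aliased_pair = additive_signature(compensating) == reference
aliased_order = additive_signature(reordered) == reference
```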

18 Watchdog: Mem access and assertion checks
What to do about memory/data errors
– Use ECC
– A few other methods using the watchdog
    Check for nonexistent memory addresses
    Check for out of range addresses
    Capability based checking for objects is also possible
    Assertion based checking and sanity checks using the watchdog (independent hardware) are also possible
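
A memory access check of this kind can be sketched as a lookup against a memory map. The map in `REGIONS`, the access kinds, and the returned strings are all hypothetical; a hardware watchdog would apply the same test to each address and control signal it observes on the bus.

```python
# Hypothetical memory map: (name, first address, last address, allowed access kinds).
REGIONS = [
    ("code", 0x0000, 0x3FFF, {"fetch"}),
    ("data", 0x4000, 0x7FFF, {"read", "write"}),
    ("rom", 0x8000, 0x8FFF, {"read"}),
]


def check_access(addr, kind):
    """Classify one bus access as ok, illegal for its region, or nonexistent."""
    for name, low, high, allowed in REGIONS:
        if low <= addr <= high:
            if kind in allowed:
                return "ok"
            return f"illegal {kind} in {name}"
    return "nonexistent address"


results = [
    check_access(0x0100, "fetch"),   # normal instruction fetch
    check_access(0x4100, "fetch"),   # instruction fetched from data space
    check_access(0x0100, "read"),    # data fetched from instruction space
    check_access(0x8000, "write"),   # write to protected area (ROM)
    check_access(0x9000, "read"),    # access to nonexistent memory
]
```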

19 Re-execution for fault-tolerance
Key concept
– Execute a program/instruction twice (or more times) and then compare the results.
– A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique
– Can detect transient faults. It can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used.

20 Re-execution: Basic Techniques
RESO concept
– Re-execution of an instruction with shifted operands
    Already discussed early in the course
    Can detect transient faults
    Can also detect many permanent faults
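
The reason shifted operands expose a permanent fault can be shown with a toy adder. The fault model (one output bit stuck at 0), the helper names `make_alu` and `reso_add`, and the use of unbounded Python integers in place of an adder with spare high-order bits are assumptions made for this sketch.

```python
def make_alu(stuck_at_zero_bit=None):
    """Build an adder; optionally one output bit is permanently stuck at 0."""
    def add(x, y):
        result = x + y
        if stuck_at_zero_bit is not None:
            result &= ~(1 << stuck_at_zero_bit)
        return result
    return add


def reso_add(alu_add, a, b, shift=1):
    """Add normally, then again with shifted operands, and compare the results."""
    first = alu_add(a, b)
    # The shifted computation uses different bit positions of the same hardware.
    second = alu_add(a << shift, b << shift) >> shift
    return first, first == second


good_alu = make_alu()
faulty_alu = make_alu(stuck_at_zero_bit=3)

fault_free = reso_add(good_alu, 5, 3)      # 5 + 3 = 8, both passes agree
fault_hit = reso_add(faulty_alu, 5, 3)     # bit 3 of the sum is forced to 0
fault_dormant = reso_add(faulty_alu, 1, 0) # sum never uses bit 3
```

Plain re-execution on the same faulty adder would give the same wrong sum twice; the shift moves the result to other bit positions, so the two passes disagree whenever the stuck bit affects either pass.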

21 Re-execution: Basic Techniques (contd.)
Program Re-execution
– Make two copies of the program
    Execute them serially
    – Can use RESO if the hardware platform is the same for both executions
    Execute them in parallel if sufficient hardware redundancy is available
– May take twice as long or twice the hardware
– When/how to compare: impacts the system complexity
– Performance impact
    Serial computation: high latency
    Parallel computation: complex implementation, and hence possible loss of performance
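
Serial program re-execution with a comparison at the end can be sketched as below. The wrapper `TransientFault` is a hypothetical fault model that corrupts exactly one call, and retrying after a mismatch is an added assumption; the slide itself only calls for executing twice and comparing.

```python
class TransientFault:
    """Hypothetical fault model: corrupts the result of exactly one call."""

    def __init__(self, program, faulty_call):
        self.program = program
        self.faulty_call = faulty_call
        self.calls = 0

    def __call__(self, data):
        self.calls += 1
        result = self.program(data)
        if self.calls == self.faulty_call:
            result ^= 1          # flip the low bit of this one result
        return result


def execute_with_compare(program, data, max_attempts=3):
    """Run the program twice per attempt; accept the result when both runs agree."""
    for attempt in range(1, max_attempts + 1):
        first = program(data)
        second = program(data)
        if first == second:
            return first, attempt
    raise RuntimeError("results never agreed")


clean = execute_with_compare(sum, [1, 2, 3])
flaky = execute_with_compare(TransientFault(sum, faulty_call=1), [1, 2, 3])
```

Comparing only at the end keeps the checker simple but delays detection until the program finishes; comparing more often shortens that delay at the cost of extra complexity.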

22 Re-execution: Basic Techniques (contd.)
Instruction Re-execution: fine grain parallelism
– Re-execute every instruction on the same or different hardware, depending upon the redundancy available
    May use RESO if the same hardware is used for instruction re-execution
– If sufficient resources are available, this method may have little impact on performance

23 Re-execution: Case studies
Introduction to case studies
– CRAY
    Instruction re-execution
– SMT architecture
    Two copies of the program are interleaved as two threads for simultaneous execution
– Multiscalar architecture
    Two copies of the program are executed on many processing elements simultaneously
– Chip multiprocessor
    With critical value forwarding (DSN-2010)

24 Re-execution: Case studies (contd.)
CRAY
Instruction re-execution
Duplication of instruction in hardware
Sufficient resources and pipelining available for re-execution without doubling the execution time
Consider a generic fine grain parallel architecture (OH)
Consider executing a code segment (OH)
Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
Some experimental results

25 Re-execution: Case studies (contd.)
AR-SMT
– High level view of the technique (OH)
    Concept of execution (Active) streams
    Re-execution of the instruction stream: Redundant stream
– Issue of delay buffer length and latency
– Implementation issues and coverage
– Performance impact
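
The role of the delay buffer can be illustrated with a rough software model. This sketch assumes that the active stream places each result in a bounded FIFO and that the redundant stream re-executes the same instruction later and compares; the function names, the tuple instruction format, and the stall policy are hypothetical and are not claimed to match the actual AR-SMT hardware.

```python
from collections import deque


def check_with_delay_buffer(instructions, execute_a, execute_r, buffer_len=4):
    """Return the index of the first instruction whose two results differ, else None."""
    delay_buffer = deque()   # FIFO of (instruction index, active-stream result)

    def verify_oldest():
        index, a_result = delay_buffer.popleft()
        r_result = execute_r(instructions[index])
        return None if r_result == a_result else index

    for index, instr in enumerate(instructions):
        if len(delay_buffer) == buffer_len:
            # Buffer full: the active stream waits until one entry is verified.
            bad = verify_oldest()
            if bad is not None:
                return bad
        delay_buffer.append((index, execute_a(instr)))

    while delay_buffer:      # the redundant stream finishes the remaining entries
        bad = verify_oldest()
        if bad is not None:
            return bad
    return None


def add(instr):
    a, b = instr
    return a + b


def add_with_fault(instr):
    # Hypothetical fault: one specific instruction produces a wrong result.
    a, b = instr
    return a + b + (1 if instr == (3, 4) else 0)


program = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
no_fault = check_with_delay_buffer(program, add, add)
with_fault = check_with_delay_buffer(program, add_with_fault, add)
```

In this model a longer buffer lets the active stream run further ahead before it has to wait, but a faulty result also sits in the buffer longer before the redundant stream reaches it, which is the length versus latency trade-off named on the slide.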

26 Re-execution: Case studies (contd.)
Multiscalar
– Concept of control flow graph (OH)
– Basic architecture (OH)
– Static division of PUs and performance impact (OH)
– Dynamic division of PUs and performance impact (OH)

27 Re-execution: Case studies (contd.)
Chip Multiprocessor (See slide set)
– Intro
– Design overview and concept
– Evaluation
– Conclusion

28 Watchdog and Re-execution: Comments
Concepts discussed here can be used to design high performance processors
– Performance improvement via speculation
    Have a very high performance speculative processor
    Verify the control flow using a watchdog, or use a second processor to fully verify the stream executed by the speculative processor
    This will lead to a processor with high performance (throughput), albeit with high latency

29 Summary
Watchdog
– Timer
– Processor
– Control flow checking
Re-execution
– Basic techniques
– Case studies: CRAY, AR-SMT, Multiscalar

