Design and Simulation of an EM-Fault-Tolerant Processor with Micro-Rollback, Control-Flow Checking and ECC
Franco Trovo, Shantanu Dutt & Hasan Arslan


1 Design and Simulation of an EM-Fault-Tolerant Processor with Micro-Rollback, Control-Flow Checking and ECC
Franco Trovo, Shantanu Dutt & Hasan Arslan, Univ. of Illinois at Chicago

2 Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion

3 Assumptions/Scenarios of Past FD/FT Work
- Past work on general fault detection:
  - Random single (sometimes double) faults
  - Deterministic faults
  - Types of faults: permanent, transient, intermittent; intermittent type not generally tackled
- Past work on EM-induced faults:
  - No how/why/what analysis and classification of computer failure due to EM interference

4 Broad Goals of Our Work
- Determine and classify the following types of computer system behavioral errors (i.e., program errors) caused by different patterns, extents, durations and locations of EM-type faults:
  - Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  - Data errors. Causes: computation errors, memory & bus faults
  - Termination errors (hung processor & crashes). Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts
  - Note: error types are NOT mutually exclusive
- Provide recipes for FT and reliable operation

5 In This Work
- Will detect:
  - Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
  - Raw bus errors using ECC
- Provide an FT mechanism using these detections for reliable operation

6 Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion

7 FD/FT Solutions
- Fault detection:
  - Control flow checking (CFC) by concurrent error detection using a watchdog (WD) processor
  - Hamming ECC (2-error detecting) on data & address buses
- Fault tolerance:
  - Instruction micro rollback triggered by:
    - Hamming ECC
    - WD-monitored CFC
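As a rough illustration of double-error-detecting Hamming protection on a bus, the sketch below encodes 4 data bits into an 8-bit extended Hamming word (Hamming(7,4) plus an overall parity bit). The word width, bit layout and the names `encode` and `check` are assumptions made for this example; the slides do not give the code parameters used on the 68040 buses.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming word (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def check(code):
    """Classify a received word as 'ok', 'single' or 'double'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        return "single"                     # odd number of flips: one-bit error
    return "double"                         # even parity but nonzero syndrome
```

In a detect-and-rollback scheme such as the one on this slide, either non-"ok" result would simply raise the error line that triggers the micro rollback, rather than attempting a correction.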

8 General Structure of a System with a Watchdog
[Figure: block diagram with main processor, main memory, data bus, address bus, and a watchdog processor that performs various checks (CFC, address, etc.)]

9 General Structure of a WD-Monitored System with On-Chip Cache
[Figure: block diagram with CPU, cache, main memory (MM), watchdog (WD), address bus and data bus]

10 Control Flow Checking [Mahmood, et al., IEEE TC '88]
- Hybrid solution for detecting execution of a wrong block sequence
- Starting from a program, it extracts a Control Flow Graph
- Each node is associated with a block of branch-free instructions plus a branch at the end
- Each edge is associated with a possible branch between two blocks
Example program:
  Block A
  If cond1 then
    Block B
    If cond2 then Block D else Block E
  Else
    Block C
  End if
  Block F
[Figure: control flow graph with nodes A, B, C, D, E, F]
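The control flow graph of the example program on this slide can be written down directly and used to check a trace of executed blocks. This is only a sketch of the idea; the dictionary layout and the function name `first_illegal_transition` are illustrative choices, not part of the presented design.

```python
# Successor sets read off the example program: A branches to B or C,
# B branches to D or E, and C, D, E all fall through to F.
CFG = {
    "A": {"B", "C"},
    "B": {"D", "E"},
    "C": {"F"},
    "D": {"F"},
    "E": {"F"},
    "F": set(),
}


def first_illegal_transition(trace):
    """Return the first (src, dst) pair in the trace that is not a CFG edge."""
    for src, dst in zip(trace, trace[1:]):
        if dst not in CFG.get(src, set()):
            return (src, dst)
    return None
```

For example, the trace A, B, D, F follows edges of the graph, while A, D, F is flagged at the pair (A, D), which is the kind of wrong block sequence that control flow checking is meant to catch.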

11 Control Flow Checking
- Block: branch-free set of instructions
- Signature: information added to a block in order to distinguish it from other blocks
- Block augmentation & signature insertion
[Figure: control flow graph (A to F) with block layout labels: jump-free set of instructions, JUMP, JUMP sign 1, JUMP sign 2; branch-free set of instructions, BLOCK sign, sign of 1st branch, sign of 2nd branch, branch]
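The signature idea can be sketched as follows, assuming each augmented block carries its own signature plus the signatures of its possible successor blocks. The XOR-fold signature function, the 16-bit word size, the `Block` layout and the error strings are assumptions for illustration; the slides do not specify how signatures are computed.

```python
from dataclasses import dataclass
from functools import reduce


def signature(words):
    """XOR-fold the instruction words of a block (assumed signature function)."""
    return reduce(lambda a, b: a ^ b, words, 0) & 0xFFFF


@dataclass
class Block:
    header_sig: int      # signature stored in the block header
    branch_sigs: tuple   # signatures of the blocks this block may branch to
    words: tuple         # the branch-free instruction words


def make_block(words, branch_sigs=()):
    """Augment a block with its own signature and its successors' signatures."""
    return Block(signature(words), tuple(branch_sigs), tuple(words))


def check_block(block, expected_sigs=None):
    """Return an error description, or None if the block passes both checks."""
    if expected_sigs is not None and block.header_sig not in expected_sigs:
        return "wrong branch or faulted signature"
    if signature(block.words) != block.header_sig:
        return "wrong computed signature"
    return None
```

A watchdog built this way would pass the previous block's `branch_sigs` as `expected_sigs` when the next block begins, so both an illegal branch and a corrupted instruction stream show up as a signature mismatch.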

12 CFC Implemented State Diagram
[Figure: watchdog state diagram. States: Reset, Begin Block, Header, Signature 1, Signature 2, Middle Block, Branch, Signature Expected, GET1S, GET2S. Error states: Error Wrong Branch; Error Wrong Jump or Faulted Signature; Error Wrong Computed Signature. Decisions (Y/N): "Computed sign. eq. header sign?" and "Header sign eq. branch signatures?". The control flow graph (A to F) and the block layouts from the previous slide are repeated, together with a block that has no branch signatures.]

13 Micro Rollback [Tamir, et al., IEEE TC '90]
- Individual state registers (RAM based)
- Register file, caches, main memory (DWB based)

14 Support for Micro Rollback for Register File - example
  MOVE 0000, D0
  ADD 000F, D0
  MOVE 0001, A3 (f)
  SUB 0002, D0
  …

15 Support for Micro Rollback for Register File - example (same instruction sequence; micro rollback 2 levels)
[Figure: register file and buffer contents: 1 000 00 D0 XX 0000 XXXX]

16 Support for Micro Rollback for Register File - example (same instruction sequence; micro rollback 2 levels)
[Figure: register file and buffer contents: 1 100 00 D0 XX D0 000F XXXX 0000]

17 Support for Micro Rollback for Register File - example (same instruction sequence; micro rollback 2 levels)
[Figure: register file and buffer contents: 1 110 00 A3 XX D0 0101 XXXX 0000 000F]

18 Support for Micro Rollback for Register File - example (same instruction sequence; micro rollback 2 levels)
[Figure: register file and buffer contents: 00 XX XXXX 1 111 D0 A3 D0 0000 000D 0101 000F]

19 Support for Micro Rollback for Register File - example (same instruction sequence; micro rollback 2 levels)
[Figure: register file and buffer contents: 00 XX XXXX 1 100 D0 A3 D0 0000 000D 0101 000F]

20 Support for Micro Rollback for Register File - example (same instruction sequence)
[Figure: register file and buffer contents: 00 XX XXXX 1 1 D0 0000 10 D0 A3 000D 0001 000F]
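The register-file example on slides 14 to 20 can be replayed with a small model of a delayed write buffer (DWB): writes wait in a FIFO before reaching the register file, so the most recent ones can be discarded on a rollback. This is a simplified sketch, not the hardware described in the talk. The class name, the buffer depth of 3 and the one-write-per-instruction model are assumptions, and "(f)" is read here as marking the instruction hit by a fault, since the figures show 0101 written to A3 instead of 0001.

```python
from collections import deque


class RollbackRegisterFile:
    """Register file whose writes are held in a delayed write buffer."""

    def __init__(self, depth):
        self.depth = depth       # maximum number of levels that can be undone
        self.committed = {}      # register file proper
        self.dwb = deque()       # pending (register, value) writes, oldest first

    def write(self, reg, value):
        if len(self.dwb) == self.depth:
            # The oldest pending write can no longer be undone: commit it.
            old_reg, old_val = self.dwb.popleft()
            self.committed[old_reg] = old_val
        self.dwb.append((reg, value))

    def read(self, reg):
        # The newest pending write to the register wins over committed state.
        for r, v in reversed(self.dwb):
            if r == reg:
                return v
        return self.committed.get(reg)

    def rollback(self, levels):
        if levels > len(self.dwb):
            raise ValueError("cannot roll back past the buffer depth")
        for _ in range(levels):
            self.dwb.pop()       # discard the most recent write


rf = RollbackRegisterFile(depth=3)
rf.write("D0", 0x0000)                    # MOVE 0000, D0
rf.write("D0", rf.read("D0") + 0x000F)    # ADD 000F, D0
rf.write("A3", 0x0101)                    # MOVE 0001, A3 with a faulted operand
rf.write("D0", rf.read("D0") - 0x0002)    # SUB 0002, D0
rf.rollback(2)                            # undo SUB and the faulty MOVE
rf.write("A3", 0x0001)                    # re-execute MOVE 0001, A3
rf.write("D0", rf.read("D0") - 0x0002)    # re-execute SUB 0002, D0
```

After the replay the committed register file holds D0 = 0000 and the buffer holds 000F, 0001 and 000D, which agrees with the values visible on slide 20.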

21 CFC with Micro Rollback - Priority
- Two concurrent fault detection techniques can request a micro rollback from the processor
- They generally request different numbers of rollback levels
- Which technique should have priority when both HC and WD detect a fault simultaneously?
  - We assign priority to the Hamming code
  - Reason: shorter jump backs
  - Although a rationale exists for WD priority
[Figure: HC (uRB=1) and WD (uRB=3) both send rollback requests to the MRB unit (??)]
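The priority rule on this slide amounts to a fixed-priority arbiter in front of the micro rollback unit. A minimal sketch, where the function name `arbitrate` and the use of `None` for "no request" are hypothetical conventions chosen for the example:

```python
def arbitrate(hc_levels, wd_levels):
    """Pick the rollback request to serve; the Hamming code (HC) wins ties.

    Each argument is the number of rollback levels requested by that
    detector, or None if it raised no request in this cycle.
    """
    if hc_levels is not None:
        return ("HC", hc_levels)
    if wd_levels is not None:
        return ("WD", wd_levels)
    return None
```

With the values shown in the figure, `arbitrate(1, 3)` serves the Hamming code request for 1 level, and the watchdog request is only served when it arrives alone.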

22 CFC with Instruction Micro Rollback – State Diagram
[Figure: watchdog state diagram with multiple points of micro rollback. States: Reset, Begin Block, Header, Signature 1, Signature 2, Middle Block, Branch, GET1S, GET2S. Error states: Error Wrong Branch; Error Wrong Branch or Faulted Signatures; Error Wrong Computed Signature. Decisions (Y/N): "Computed sign. eq. header sign?" and "Header sign eq. jump signatures?". Rollback distances marked on the diagram: urb_d = 1, urb_d = 2, urb_d = 3, urb_d = bsize. The control flow graph (A to F) and block layouts are repeated.]
- t = number of times the same error state is encountered
- t < t1: urb to BEGIN_BLOCK (1 instr), read the header signature again
- t1 <= t < t2: urb to "Branch" (2 instr), re-execute the previous block's branch
- t >= t2: urb to MIDDLE BLOCK (3 instr), re-read the 2 branch signatures of the previous block
- Hamming code: urb_d = 1 (re-execute previous branch)
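The escalation rule listed on this slide maps the retry count t to a rollback distance. A minimal sketch, assuming t1 < t2; the slides do not give the threshold values, so any numbers passed in are hypothetical, as is the function name.

```python
def rollback_distance(t, t1, t2):
    """Rollback distance (in instructions) for the t-th visit to an error state."""
    if t < t1:
        return 1    # back to BEGIN_BLOCK: read the header signature again
    if t < t2:
        return 2    # back to the branch of the previous block
    return 3        # back to MIDDLE BLOCK: re-read both branch signatures
```

For instance, with the made-up thresholds t1 = 2 and t2 = 4, the first two encounters (t = 0, 1) roll back 1 instruction, the next two roll back 2, and later ones roll back 3, so a persistent error is retried from progressively earlier points.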

23 Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion

24 Improved VHDL Model of 68040 + Watchdog connections
[Figure: block diagram with labels: WD, Hamming code error-detection bits, control lines, data buses]

25 Outline
- Goals
- Solution Adopted
  - Control Flow Checking
  - Hamming encoding on the buses
  - Instruction Micro rollback
- Motorola 68040 and VHDL description
- Simulation results
- Conclusion

26 Simulation Environment
- Total Fault Injection Time: the total duration of the intermittent fault on the bus or buses considered
- Delay Time: the time that the FG waits before starting the fault injection
- Period Time: the period of the intermittent fault
- Fault Time: the duration of the injection of a given fault
[Figure: timing diagram of the Fault Enable signal, marking Start Fault Injection, First Fault Injected, Second Fault Injected, Delay Time, Period Time, Fault Time and Total Fault Injection Time]
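The four timing parameters defined on this slide determine when the Fault Enable signal is active. The sketch below is one plausible reading: the injection window is assumed to start right after the delay, time is counted in clock cycles, and the function name and argument names are illustrative.

```python
def fault_enable(t, delay, period, fault_time, total):
    """Return True if a fault is being injected at clock cycle t."""
    if t < delay or t >= delay + total:
        return False                        # outside the injection window
    return (t - delay) % period < fault_time
```

With `delay=5`, `period=4`, `fault_time=1` and `total=12`, the signal is active at cycles 5, 9 and 13, i.e. 3 of the 12 cycles in the window, which corresponds to a 25% duty cycle (`fault_time / period`).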

27 Fault Parameters Values
- Simulations run on the model:
  - Faults injected on all cache buses
  - Fault types:
    - Random: double, triple, quadruple faults
    - Clustered: 1 cluster of 2 bits, 1 cluster of 4 bits, 2 clusters of 2 bits
  - Three values of repeat frequency:
    - Low (100 clock cycles = 100 kHz)
    - Medium (10 clock cycles = 1 MHz)
    - High (1 clock cycle = 10 MHz)
  - Three values of duty cycle:
    - 25%: all the simulations
    - 50%: all except high freq and 4 faults
    - 75%: all 2 faults and 3 faults middle frequencies
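The two fault families listed on this slide can be pictured as bit masks over a bus word. The sketch below generates such masks; the 32-bit bus width in the usage note, the function names, and the requirement that separate clusters be at least one bit apart are assumptions for illustration, and how a mask is applied to the bus is left open because the slides do not state it.

```python
import random


def random_fault_mask(width, n_faults, rng):
    """Mask with n_faults bits set at distinct, randomly chosen positions."""
    mask = 0
    for bit in rng.sample(range(width), n_faults):
        mask |= 1 << bit
    return mask


def clustered_fault_mask(width, cluster_size, n_clusters, rng):
    """Mask with n_clusters separate runs of cluster_size adjacent bits."""
    while True:
        starts = sorted(rng.sample(range(width - cluster_size + 1), n_clusters))
        # Retry until every pair of clusters is at least one clear bit apart.
        if all(b - a > cluster_size for a, b in zip(starts, starts[1:])):
            break
    mask = 0
    for start in starts:
        mask |= ((1 << cluster_size) - 1) << start
    return mask
```

For example, `random_fault_mask(32, 3, random.Random(1))` gives a triple random fault, while `clustered_fault_mask(32, 2, 2, random.Random(1))` gives the "2 clusters of 2 bits" pattern.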

28 Simulation Results (contd.)

29 NOTE:
- HC has better error coverage for cluster faults
- Block signature check (part of CFC) has better error coverage for random faults

30 Simulation Results (contd.)

31 Conclusions
- Micro-rollback coupled with FD for the first time
- Micro-rollable WD state diagram for the first time
- More extensive fault patterns than previous work
- Good reliability for our FD/FT solutions (correct or fail-safe execution):
  - 3 faults: 94% low freq, 90% mid freq & 90% high freq
  - 4 faults: 86% low freq, 80% mid freq & 80% high freq
- Average execution time is linear with duty cycle and almost quadratic with the fault injection frequency:
  - Time overhead, 3 faults: 11% low, 12% med, 64% high freq
  - Time overhead, 4 faults: 16% low, 32% med, 182% high freq
- Data buses are less tolerant to faults than address buses (faults on the latter cause more CFC errors and so are detected more easily)

32 Future Work
- Introduction of other fault detection techniques as triggers for micro rollback:
  - Lower-level fault detection, such as micro-instruction control flow checking, which can detect internal processor faults
  - Higher-level fault detection, such as algorithm-based fault tolerance (ABFT) for checking data errors, which can detect external & internal faults affecting data

