Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment. Shantanu Dutt, Hasan Arslan, ECE Dept., University of Illinois.


1 Evaluation of Processor Faults Due to EM Interference: Concepts and Simulation Environment
Shantanu Dutt, Hasan Arslan
ECE Dept., University of Illinois at Chicago

2 Outline
- Past work: general fault detection and tolerance
- Past work: EMI-induced faults
- Fault types and fault injection methods
- Proposed work and system
- Methodologies to detect faults
- Questions and future outlook

3 Past Work: General Fault Detection and Tolerance
- Off-line testing of digital circuits
  - Self-diagnosis
  - Tests each functional block
  - Not a good fit for real-time applications
- Redundancy
  - Hardware, software, or time
  - High overhead penalty

4 Past Work: General Fault Detection and Tolerance
- Concurrent on-line testing: adding external hardware that monitors data, address, and control lines
  - Memory: error-detecting and error-correcting codes
  - Computer systems
    - Watchdog processor: detects control flow errors in program execution [Mahmood & McCluskey, TC'88]
    - Algorithm-based fault tolerance: uses some property of the computation for self-checking [Huang & Abraham, TC'84], [Dutt & Assad, TC'96]

5 Past Work: General Fault Detection and Tolerance (contd.)
- Concurrent on-line testing (contd.)
  - Reconfigurable systems: on-line testing and fault tolerance using dynamic circuit reconfiguration
  - FPGA-based systems: on-line testing and fault tolerance [Verma, M.S. Thesis, UIC'01], [Dutt et al., ICCAD'99], [Mahapatra & Dutt, FTCS'99], [Abramovici et al., ITC'99]

6 EM-Induced Faults
- High-level computer failure detection under different types of EM signals [Mojert et al., EMC'01]
- Radiation therapy machine overdoses patients
- Space Shuttle cannot launch due to a synchronization error in redundant computers
- Failures in real-time communication and control systems from communication line errors due to EM signals [Kohlberg & Carter, EMC'01]
- SEUs (single-event upsets): a potential threat to the reliability of integrated circuits operating in radiation environments
  - Space/avionics applications, due to high-energy particles
    - Hubble Space Telescope
    - NASA space-based astronomical observatory
  - Ground level (atmospheric neutrons)

7 Fault Types & Fault Injection Methods
Error types
- Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
- Data errors. Causes: computation errors, memory and bus faults
- Hung processor and crashes. Causes: control unit (C.U.) transitions to dead-end states, invalid instructions, out-of-bound addresses, divide-by-zero
Error types are NOT mutually exclusive

8 Fault Types & Fault Injection Methods
Fault injection methods
- Hardware fault injection
  - With contact (voltage or current changes, using pin-level probes and sockets): Messaline [Arlat et al., FTC'89]
  - Without contact (heavy-ion radiation and EMI): FIST [Gunneflo et al., FTC'89], MARS [Karlsson et al., DCCA'95]
- Software fault injection
  - Compile-time injection (modifying program instructions): Doctor [Han et al., CPD'95]
  - Runtime injection (triggers a fault injection mechanism)
    - Mechanisms: time-out, exception/trap, code insertion
    - Tools: Xception [Carreira et al., DCCA'95], Ferrari [Kanawati et al., FTC'92], Ftape [Tsai et al., FTC'96]

9 Fault Types & Fault Injection Methods
Software fault injection (contd.)
- Advantages
  - Does not require expensive hardware
  - Can target applications and operating systems, which is difficult with hardware fault injection
- Disadvantages
  - Changes the structure of the original software
  - Cannot inject faults into locations that are inaccessible to software

10 Fault Types & Fault Injection Methods
[Figure: block diagram of a fault injection system. Components: controller, fault injector, fault library, workload generator, workload library, monitor, data collector, data analyzer, target system.]

11 Characteristics of Fault Injection Methods

                                   Hardware                         Software
                                   With contact   Without contact   Compilation          Runtime
Cost                               High           High              Low                  Low
Damage                             High           Low               None                 None
Trigger                            Yes            No                Yes                  Yes
Repeatability                      High           Low               High                 High
Controllability                    High           Low               High                 High
Accessible fault injection points  Chip pin       Chip internal     Reg., mem., soft.    Reg., mem., I/O cont./port

12 Proposed Work
- VHDL modeling of a modern microprocessor (using an available VHDL description of the DLX microprocessor, with appropriate modifications)
- VHDL-based introduction of fault injection logic in the CPU, memory, and external buses to simulate different fault patterns likely to be caused by EMI
- Develop techniques for detecting program errors due to these faults
- Classification of the fault types into data, control, and hung/crashed processor
- Preliminary results for simulation of faults in the external memory address and data buses

13 Proposed Work
[Figure: block diagram. Labels: Memory, DLX CPU, Address Bus, Data Bus, Fault Generator, Counter_1, Counter_2, variable-width variable-period pulse generator, data, signal line.]
Fault generator parameters
- Fault types (stuck-at-0, stuck-at-1, single random, clustered, multiple random, etc.)
- Location and values of faults
- Start times of faults: [0, Texc(workload)], where Texc is the execution time without faults
- Duration of faults: [0, 50T], where T is the CPU clock cycle
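The fault types listed on this slide can be illustrated with a small software model. The following Python sketch is only an illustration, not the authors' VHDL fault generator: the function names (`inject_fault`, `fault_active`), the default cluster length, and the default flip count are assumptions made for the example.

```python
import random

def inject_fault(word, fault_type, width=32, bit=0, cluster_len=4, count=3, rng=None):
    """Return `word` (a `width`-bit bus value) with one fault applied.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    word &= (1 << width) - 1
    if fault_type == "stuck_at_0":          # force one line to 0
        return word & ~(1 << bit)
    if fault_type == "stuck_at_1":          # force one line to 1
        return word | (1 << bit)
    if fault_type == "single_random":       # flip one randomly chosen line
        return word ^ (1 << rng.randrange(width))
    if fault_type == "clustered":           # flip `cluster_len` adjacent lines
        start = rng.randrange(width - cluster_len + 1)
        return word ^ (((1 << cluster_len) - 1) << start)
    if fault_type == "multiple_random":     # flip `count` distinct random lines
        flips = 0
        for pos in rng.sample(range(width), count):
            flips |= 1 << pos
        return word ^ flips
    raise ValueError(f"unknown fault type: {fault_type}")

def fault_active(cycle, start, duration):
    """True while the fault window [start, start + duration) covers `cycle`."""
    return start <= cycle < start + duration
```

In a simulation loop, `fault_active` would gate the call to `inject_fault` so that the bus value is corrupted only during the chosen window, with the start drawn from [0, Texc] and the duration from [0, 50T].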

14 Proposed Work (contd.)
- Will include similar fault-injection capability for on-chip wires, with a probabilistic component based on the analysis of EM effects on power/ground (p/g) lines from the circuit analysis component
- The processor will be partitioned into 4 main modules (control unit, ALU, register file, and cache), with separate or common p/g lines, to determine their different degrees of susceptibility
[Figure: cache, control unit, register file, and ALU connected to p/g lines.]

15 Methodologies: Control Flow Checking
- Compares information gathered concurrently with information provided previously
- Complexity lies between current circuit-level and system-level techniques
- A watchdog is a small co-processor that monitors the behavior of the system
- It is provided beforehand with information about the processor to be checked (memory accesses, control flow, control signals, ...)
[Figure: processor, memory hierarchy, and watchdog on the memory bus; the watchdog also receives a signal from the branch circuit.]

16 Methodologies: Control Flow Checking

    _fibo:  sw   -4(r14),r30
            ..
            seq  r1,r3,r4
            bnez r1,L3
            ..
            seq  r1,r3,r4
            bnez r1,L3
            j    L2
    L3:     .
            addi r1,r0,#1
            j    L1
    L2:     ....

- A node is a block of instructions with a branch at the end
- The derived signature of a node is a function (e.g., XOR, LFSR) of all its instructions
- A program graph has an arc from node u to node v if the branch at u can lead to node v
- Based on signature computation, error coverage is high (>90%) even with multiple faults [Mahmood & McCluskey, FTCS'85]
[Figure: program graph with nodes n1 to n5 and label L1; the watchdog (WD) receives Sign(n4) and BRT L1.]
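The node-signature idea can be sketched in a few lines of Python. This is a minimal illustration assuming an XOR signature (one of the two functions the slide names) and a hypothetical `Watchdog` class; the instruction words used in the usage note are made-up values, not real DLX encodings.

```python
from functools import reduce

def node_signature(instr_words):
    """XOR of all instruction words in a node (LFSR would be an alternative)."""
    return reduce(lambda acc, w: acc ^ w, instr_words, 0)

class Watchdog:
    """Hypothetical watchdog model: checks node signatures and program-graph arcs."""

    def __init__(self, nodes, arcs):
        # nodes: node name -> fault-free instruction words (known before the run)
        # arcs:  node name -> names of nodes its final branch can lead to
        self.ref = {name: node_signature(words) for name, words in nodes.items()}
        self.arcs = arcs
        self.prev = None

    def check(self, name, fetched_words):
        """Return True if the executed node is a legal successor with a matching signature."""
        flow_ok = self.prev is None or name in self.arcs.get(self.prev, ())
        sig_ok = self.ref.get(name) == node_signature(fetched_words)
        self.prev = name
        return flow_ok and sig_ok
```

For example, with `nodes = {"n1": [0x1001, 0x2002], "n2": [0x3003]}` and `arcs = {"n1": ("n2",), "n2": ()}`, executing n1 and then n2 with unmodified words passes both checks, while a flipped bit in any fetched word or a jump from n2 back to n1 is reported as an error.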

17 Examples of Error Types

    L1:  .
         lw   r3,0(r30)
         addi r0,r0,#1
         seq  r1,r3,r0
         bnez r1,L1
    L2:  ..
         subi r2,r2,#1
         seq  r1,r3,r2
         bnez r1,L2
         j    L4
    L3:  .
         addi r1,r0,#1
         j    L1
    L4:  ....

Error types
- Segmentation fault: r0=24, r3=25
- Hung processor: r2=1, r3=0
- Out-of-bound address: L4=256
- Invalid instruction: the instruction code can be changed

18 Analysis of Error

19 Program never finished (47%); program terminated incorrectly (23%); terminated with incorrect result (23%); terminated with correct result (7%)

20 Methodologies: Algorithm-Based Fault Tolerance
- Instruction execution errors
  - Difficult to detect: they occur inside the microprocessor and are not observable to an external watchdog processor
  - Off-line scheme for detecting execution errors due to permanent faults [K.K. Saluja et al., IEEE ITC 1983]
- Transient faults occur more frequently than permanent faults in digital systems
- Transient faults must be detected in real time

21 Methodologies: Algorithm-Based Fault Tolerance
- Use properties of the computation to check the correctness of computed data
- E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f() can be used to check it:
  - Pre-compute v' = v1 + v2 + ... + vk (input checksum)
  - Compute f(v1), ..., f(vk)
  - Compute u = f(v1) + f(v2) + ... + f(vk) (output checksum)
  - Check whether f(v') = u; inequality indicates computation error(s)
- Can be used for linear computations such as matrix multiplication, matrix addition, and Gaussian elimination [Huang & Abraham, TC'84], [Dutt & Assad, TC'96]
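The checksum test on this slide can be tried out directly. The sketch below is an illustration only: it assumes f is multiplication by a fixed integer matrix (so exact equality is meaningful), and the helper names `vec_sum`, `make_matvec`, and `abft_check` are invented for the example.

```python
def vec_sum(vectors):
    """Element-wise sum of equal-length vectors (the checksum vector)."""
    return [sum(col) for col in zip(*vectors)]

def make_matvec(A):
    """Return the linear function f(v) = A * v for a matrix given as a list of rows."""
    def f(v):
        return [sum(a * x for a, x in zip(row, v)) for row in A]
    return f

def abft_check(f, inputs, outputs):
    """True if f(input checksum) equals the output checksum, i.e. f(v') == u."""
    return f(vec_sum(inputs)) == vec_sum(outputs)

# Usage: compute the outputs, then corrupt one element to mimic a data error.
f = make_matvec([[1, 2], [3, 4]])
inputs = [[1, 0], [0, 1], [2, 3]]
outputs = [f(v) for v in inputs]
ok_before = abft_check(f, inputs, outputs)   # checksums agree
outputs[1][0] ^= 1                           # single bit flip in one result
ok_after = abft_check(f, inputs, outputs)    # checksums now disagree
```

With floating-point data the equality test would need a tolerance, which this integer sketch sidesteps.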

22 Methodologies: Algorithm-Based Fault Tolerance
- Use a watchdog to monitor the bus and fetch the instruction opcodes along with the main processor
- Calculate the expected execution parameters of each instruction
- Store this information in the watchdog processor (instruction parameter table)
- Compare the fetched instruction parameters with the stored data
- If the parameters do not match, give an error message
- Error coverage varies with the program and the microprocessor; for the 8086 instruction set, it is around 85% for single-bit errors [Khan & Tront, IEEE TC, 1989]
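The table lookup and comparison steps can be sketched as follows. The slide does not say which execution parameters are stored, so the two parameters used here (cycle count and number of memory accesses), their values, and the names `PARAM_TABLE` and `check_instruction` are all assumptions made for illustration.

```python
# Hypothetical instruction parameter table: opcode -> expected execution parameters.
# The parameter choice and the numbers are illustrative, not taken from the slides.
PARAM_TABLE = {
    "lw":   {"cycles": 5, "mem_accesses": 1},
    "addi": {"cycles": 4, "mem_accesses": 0},
    "bnez": {"cycles": 3, "mem_accesses": 0},
}

def check_instruction(opcode, observed, table=PARAM_TABLE):
    """Compare observed parameters with the stored ones.

    Returns None on a match, otherwise an error message string.
    """
    expected = table.get(opcode)
    if expected is None:
        return f"error: unknown opcode {opcode!r}"
    bad = [name for name, value in expected.items() if observed.get(name) != value]
    if bad:
        return f"error: {opcode} mismatch in {', '.join(bad)}"
    return None
```

A call such as `check_instruction("lw", {"cycles": 5, "mem_accesses": 0})` returns a mismatch message, which stands in for the error signal the watchdog would raise.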

23 Goals, Questions & Future Outlook
- Q: Are there patterns of errors that lead to computer crashes with high probability?
- Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (saving state and data for later resumption)?
- Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?
- Q: If so, can these be used as "early detection & warning" of EM interference?
- Future: based on the correlation of system errors to EM faults, determine fault tolerance/error minimization techniques for EM-induced faults.

