Software Fault Tolerance (SWFT): Software Testing. Dependable Embedded Systems & SW Group, www.deeds.informatik.tu-darmstadt.de. Prof. Neeraj Suri, Constantin Sârbu.

Presentation transcript:

1 Software Fault Tolerance (SWFT) Software Testing Dependable Embedded Systems & SW Group Prof. Neeraj Suri Constantin Sârbu Dept. of Computer Science TU Darmstadt, Germany

2 Fault Removal: Software Testing
- So far: checkpointing, recovery blocks, NVP, NCP, microreboots, …
- Verification & Validation
- Testing techniques
  - Static vs. dynamic
  - Black-box vs. white-box
- Today: testing of dependable systems
  - Modeling
  - Fault injection (FI / SWIFI)
  - Some existing tools for fault injection
- Next 2 lectures: testing of operating systems
  - Fault injection aspects in OSs (WHEN / WHAT to inject)
  - Profiling the OS extensions (state runtime)

3 Why is PERFECT testing impossible?
- HW/OS/SW/protocols
  - Our fault/error models are speculative
  - Failure modes and associated failure distributions are probabilistic
  - Sequences (# of data cascades, # of temporal links) do not follow any meaningful distributions
- State space: fault classes only condense equivalent behavior states, nothing more
- Lack of available detail [processor level, gate, device, transistor, VHDL?]
- Fixing bugs often causes more bugs (bug re-injection)
- The cause of bugs is more important: complex spec? complex dependency?
- How good are our system models?

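4 Dependability Modeling
- Simplex: $R(t) = e^{-\lambda t}$
- Series: $R(\mathrm{sys}) = R_1 R_2 R_3 \cdots R_n = e^{-(\lambda_1 + \lambda_2 + \cdots + \lambda_n)t}$, $MTTF = 1/\lambda_{\mathrm{sys}}$ with $\lambda_{\mathrm{sys}} = \lambda_1 + \lambda_2 + \cdots + \lambda_n$
  - Example 1: n = 5, $R_1 = R_2 = R_3 = R_4 = R_5 = .98$, $R(\mathrm{sys}) = .90$
  - Example 2: n = 10, $R_1 = R_2 = \cdots = R_9 = R_{10} = .98$, $R(\mathrm{sys}) = .82$
- Parallel: $R(\mathrm{sys}) = 1 - (1 - R_1)(1 - R_2)$
  - Example: $R_1 = R_2 = .98$, $U_1 = U_2 = 1 - .98 = .02$ (unreliability)
  - $U(\mathrm{sys}) = U_1 U_2 = .0004$, $R(\mathrm{sys}) = 1 - U(\mathrm{sys}) = .9996$
[Figures: series chain R1, R2, R3, …, Rn; parallel pair R1, R2]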
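The series and parallel formulas on this slide can be checked numerically. The following is a minimal sketch in C; the function names `series_reliability` and `parallel_reliability` are illustrative and not part of the lecture material, and the components are assumed to fail independently.

```c
#include <assert.h>
#include <stddef.h>

/* Series system: every component must work, so reliabilities multiply. */
double series_reliability(const double *r, size_t n)
{
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        assert(r[i] >= 0.0 && r[i] <= 1.0);
        product *= r[i];
    }
    return product;
}

/* Parallel system: it fails only if every component fails,
 * so the unreliabilities (1 - R_i) multiply. */
double parallel_reliability(const double *r, size_t n)
{
    double unreliability = 1.0;
    for (size_t i = 0; i < n; i++) {
        assert(r[i] >= 0.0 && r[i] <= 1.0);
        unreliability *= 1.0 - r[i];
    }
    return 1.0 - unreliability;
}
```

Called with five components of reliability .98, `series_reliability` gives about .904, which the slide rounds to .90; `parallel_reliability` with two such components gives .9996.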

5 Dependability Modeling
- TMR: is this a parallel system?
  - Works as long as two units are fault-free
  - Assumes independent faults
  - Perfect voter
  - No repair!
- Reliability: where did this come from?
[Figure: TMR block diagram, units P1, P2, P3 feeding a voter ("=") that produces the output (o/p)]
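The reliability expression shown on this slide did not survive in the transcript. Under the assumptions listed (three identical units of reliability $R$, independent faults, a perfect voter, no repair), the textbook derivation is sketched below; treating all three units as having the same $R$ is an assumption made here for brevity.

```latex
\begin{align*}
R_{\mathrm{TMR}} &= \Pr[\text{all three units work}] + \Pr[\text{exactly two units work}] \\
                 &= R^3 + \binom{3}{2} R^2 (1 - R) \\
                 &= 3R^2 - 2R^3
\end{align*}
```

For $R = .98$ this gives $R_{\mathrm{TMR}} \approx .9988$. A purely parallel arrangement of the same three units, which needs only one working unit, would give $1 - (.02)^3 = .999992$, so TMR is a two-out-of-three structure and not a parallel system in the sense of the previous slide.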

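6 Modeling
[Figure: TMR block diagram (P1, P2, P3, voter "=", o/p) and state diagram with states 3, 2, F]
- (rate out of state 3) ≈ "probability of one out of three failing"
- …
- How about repair?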

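7 Modeling (Markov)
[Figure: TMR block diagram (P1, P2, P3, voter "=", o/p) and Markov chain with states 3, 2, F, including repair]
- Solving this system gives the MTTF, for λ = 0.001; µ = 0.1:
  - $MTTF_{\mathrm{TMR+repair}}$ = … h
  - but $MTTF_{\mathrm{simplex}} = 1/\lambda$ = 1000 h
  - and $MTTF_{\mathrm{TMR}} = 5/(6\lambda)$ = 833 h
- Do we always have perfect detection?
- Can the system go directly from 3 to F?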
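The MTTF figures on this slide can be reproduced from a small Markov chain. The sketch below assumes the usual TMR chain, which the transcript does not spell out: state 3 goes to state 2 at rate 3λ, state 2 goes to F at rate 2λ, and repair returns state 2 to state 3 at rate µ. Solving the expected-time equations for that chain gives MTTF = (5λ + µ) / (6λ²). The function names are illustrative.

```c
#include <assert.h>

/* MTTF of a single unit with constant failure rate lambda (per hour). */
double mttf_simplex(double lambda)
{
    assert(lambda > 0.0);
    return 1.0 / lambda;
}

/* TMR without repair: mean time in state 3 (rate 3*lambda)
 * plus mean time in state 2 (rate 2*lambda), i.e. 5 / (6*lambda). */
double mttf_tmr(double lambda)
{
    assert(lambda > 0.0);
    return 1.0 / (3.0 * lambda) + 1.0 / (2.0 * lambda);
}

/* TMR with repair rate mu from state 2 back to state 3 (assumed chain):
 * T3 = 1/(3*lambda) + T2
 * T2 = 1/(2*lambda + mu) + mu/(2*lambda + mu) * T3
 * which solves to T3 = (5*lambda + mu) / (6*lambda^2). */
double mttf_tmr_repair(double lambda, double mu)
{
    assert(lambda > 0.0 && mu >= 0.0);
    return (5.0 * lambda + mu) / (6.0 * lambda * lambda);
}
```

With λ = 0.001 and µ = 0.1 these functions return 1000 h, about 833 h, and 17,500 h. The last number is what this assumed chain yields; it is not a value recovered from the slide.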

8 Coverage in models
- New structure: two-out-of-four
[Figure: block diagram with units P1, P2, P3, P4, voter "=" and output o/p; state diagram with states 2, F, 1, 0]

9 Coverage in models
- New structure: two-out-of-four
- We add the coverage factor C
[Figure: block diagram with units P1, P2, P3, P4, voter "=" and output o/p; state diagram with states 2, F, 1, 0]

10 Fault Injection in One Sentence
Experimental evaluation using fault injection is the process of analyzing a system's response to exceptional conditions by intentionally (and artificially) inserting abnormal states during normal operation and monitoring the reaction(s).
The brute-force approach for evaluating and validating the provisioning of dependability.

11 Faults → Errors → Failures
[Figure: state diagram from "Good" to "Bad". States: No Faults, Fault, Error, Detection & Recovery, Failure. Transition labels: fault appears, fault disappears, fault activated, error overwritten, error detected, error activated, recovery successful, recovery incomplete, recovery failed. Entry points marked: Fault Injection, Error Injection.]

12 Basics of Fault Injection
- Where: to apply change (location, abstraction/system level)
- What: to inject (what should be injected/corrupted?)
- Which: trigger to use (event, instruction, timeout, exception, code mutation?)
- When: to inject (corresponding to type of fault)
- How: often to inject (corresponding to type of fault)
- …
- What to record and interpret? To what purpose?
- How is the system loaded at the time of the injection?
  - Applications running and their load (workload)
  - System resources
  - Real → realistic → synthetic workload

13 Various FI Approaches
- Physical fault injection
  - EMI, radiation, …
- Simulated fault injection
  - Injections into VHDL model
- Hardware fault injection
  - Pin-level injection
  - Scan chains
- Software implemented fault injection (SWIFI)
  - Bit-flips, mutations
  - Code and data segments
  - APIs, …

14 Coverage and Latency
- Aim is to find characteristics of Event X
  - Event X may be detection, recovery, etc.
- Coverage of Event X
  - Conditional probability of Event X occurring
  - E.g. probability of error detection given that an error exists in the system
- Latency of Event X
  - Time from the earliest (theoretically) possible occurrence of Event X to the actual monitored occurrence
  - E.g. time from error occurrence to error detection

15 Estimating Metrics in FI
- Detection coverage = #detections / #injections
- Detection latency = mean(detection times)
- Recovery coverage = #recoveries / #detections
- Recovery latency = mean(recovery times)
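The four estimators above can be computed directly from a log of injection experiments. This is a minimal sketch; the record layout `struct fi_run` and the function `fi_estimate` are hypothetical and do not correspond to any particular tool's log format.

```c
#include <assert.h>
#include <stddef.h>

/* One injection experiment (hypothetical record layout). */
struct fi_run {
    int    detected;           /* 1 if the injected error was detected */
    int    recovered;          /* 1 if recovery succeeded (only counted when detected) */
    double detect_latency_ms;  /* meaningful only when detected */
    double recover_latency_ms; /* meaningful only when recovered */
};

struct fi_metrics {
    double detection_coverage;
    double recovery_coverage;
    double mean_detect_latency_ms;
    double mean_recover_latency_ms;
};

/* Estimate coverage and latency from n injection runs. */
struct fi_metrics fi_estimate(const struct fi_run *runs, size_t n)
{
    struct fi_metrics m = {0.0, 0.0, 0.0, 0.0};
    size_t detections = 0, recoveries = 0;
    double detect_sum = 0.0, recover_sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (!runs[i].detected)
            continue;
        detections++;
        detect_sum += runs[i].detect_latency_ms;
        if (runs[i].recovered) {
            recoveries++;
            recover_sum += runs[i].recover_latency_ms;
        }
    }

    m.detection_coverage = (double)detections / (double)n;
    if (detections > 0) {
        m.recovery_coverage = (double)recoveries / (double)detections;
        m.mean_detect_latency_ms = detect_sum / (double)detections;
    }
    if (recoveries > 0)
        m.mean_recover_latency_ms = recover_sum / (double)recoveries;
    return m;
}
```

Note that recovery coverage is conditioned on detection, as on the slide, so runs without a detection do not enter its denominator.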

16 Physical Fault Injection
- Reproduce extreme environmental conditions
  - EMI
  - Radiation
  - Heat
  - Shock
  - Voltage drops/spikes, etc.
- Advantages
  - "Real" faults
  - Tangible
  - Simple "test cases"
- Disadvantages
  - Difficult to control/repeat
  - Needs at least a prototype

17 Simulation-based Fault Injection
- Using a model of the system
  - VHDL
  - MatLab
  - SystemC
  - Spice
- Advantages
  - Usable during design
  - Controllable
- Disadvantages
  - Requires a model
  - Model accuracy?
  - Slow

18 Simulated Fault Injection
Fault injection at three levels:
- Electrical level (electrical circuits; physical process): change current, change voltage
- Logical level (logic gates; logic operation): stuck at 0 or 1, inverted fault
- Functional level (functional units): change CPU register, flip memory bits, etc.

19 Hardware-based Fault Injection
- Inject faults using hardware (similar to physical)
  - Pin-level injection
  - Scan chains
- Advantages
  - Controllable
  - Close to "real" faults
- Disadvantages
  - Requires special equipment
  - Reachability?

20 SoftWare Implemented Fault Injection: SWIFI
- Manipulate bits in memory locations and registers
  - Emulation of HW faults
- Change text segment of processes
  - Emulation of SW faults (bugs, defects)
  - Dynamic: e.g. op-code switch during operation
  - Static: change source code and recompile (a.k.a. mutation)
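The most basic SWIFI operation, manipulating bits in a memory location, amounts to an XOR. The sketch below is illustrative only: `inject_bit_flip` and `inject_mask32` are hypothetical helper names, and a real injector would also need a trigger and access to the target's address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flip a single bit inside a memory region of `size` bytes.
 * bit_index counts from bit 0 of byte 0. */
void inject_bit_flip(void *target, size_t size, size_t bit_index)
{
    unsigned char *bytes = target;
    assert(bit_index < size * 8);
    bytes[bit_index / 8] ^= (unsigned char)(1u << (bit_index % 8));
}

/* Flip every bit selected by a 32-bit mask, e.g. to emulate a
 * multi-bit error in a register-sized value. */
uint32_t inject_mask32(uint32_t value, uint32_t mask)
{
    return value ^ mask;
}
```

Because XOR is its own inverse, applying the same injection twice restores the original value, which is convenient for emulating transient faults.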

21 SWIFI
- PROS:
  - No special hardware instrumentation
  - Inexpensive and easy to control
  - High observability (down to variables)
- CONS:
  - Only into locations accessible to software
  - Instrumentation may disturb workload
  - Difficult to observe short-latency faults
- Open questions:
  - Is the injected fault representative of a "real" fault?
  - Is the emulated/simulated environment (ops., load, tests) representative of the real system?

22 A Generic View of SWIFI-Tools
[Figure: generic SWIFI tool architecture. Components: Controller, Stimuli generator, Injector, Target, Monitor/Data collector, Data analyzer. Data flows: Setup, Readouts.]
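The roles in this generic view can be illustrated with a toy campaign. Everything in the sketch is hypothetical: the target is a small summing routine with a range check as its only error-detection mechanism, the injector flips one input bit per run, the monitor compares against a fault-free ("golden") run, and the controller loops over all bit positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum fi_outcome { FI_NO_EFFECT, FI_DETECTED, FI_WRONG_OUTPUT };

/* Target (hypothetical workload): sums the inputs.
 * Returns -1 if the built-in range check detects a value above 100. */
static int workload_sum(const uint8_t *in, size_t n, unsigned *out)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] > 100)
            return -1;
        sum += in[i];
    }
    *out = sum;
    return 0;
}

/* Injector + monitor: flip one input bit, run the target, classify. */
enum fi_outcome fi_run_once(const uint8_t *input, size_t n, size_t bit_index)
{
    uint8_t faulty[64];
    unsigned golden = 0, observed = 0;
    int rc;

    assert(n <= sizeof faulty && bit_index < n * 8);

    rc = workload_sum(input, n, &golden);   /* golden (fault-free) run */
    assert(rc == 0);
    (void)rc;

    for (size_t i = 0; i < n; i++)
        faulty[i] = input[i];
    faulty[bit_index / 8] ^= (uint8_t)(1u << (bit_index % 8));

    if (workload_sum(faulty, n, &observed) != 0)
        return FI_DETECTED;
    return observed == golden ? FI_NO_EFFECT : FI_WRONG_OUTPUT;
}

/* Controller: one experiment per input bit; counts[] is indexed by outcome. */
void fi_campaign(const uint8_t *input, size_t n, size_t counts[3])
{
    counts[FI_NO_EFFECT] = counts[FI_DETECTED] = counts[FI_WRONG_OUTPUT] = 0;
    for (size_t bit = 0; bit < n * 8; bit++)
        counts[fi_run_once(input, n, bit)]++;
}
```

For the input {10, 20}, the 16 single-bit injections yield 2 detected errors and 14 undetected wrong outputs, so the range check alone has a detection coverage of 2/16 for this fault model and workload.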

23 Many Tools Available
- DEPEND, MEFISTO
  - Evaluating HW/SW architectures using simulations
- FERRARI, DOCTOR, RIFLE, Xception
  - Evaluate tolerance against HW faults
- DEFINE, FIAT, FTAPE
  - Evaluate tolerance against HW and SW faults
- MAFALDA, NFTAPE, PROPANE
  - Evaluate effects of HW and SW faults and analyze error propagation
- Ballista
  - OS robustness testing

24 DEPEND and MEFISTO
- Evaluation of system architectures
  - E.g. validate TMR recovery protocols, synchronization protocols, etc.
  - Simulate system and components using SW
- DEPEND
  - Uses object-oriented design for flexibility
  - Models a system, its interactions, and its fault-tolerance mechanisms (FTMs)
- MEFISTO
  - Uses VHDL
  - Testing of FTMs
  - Support for HW-based FI (validating fault models)

25 FERRARI, DOCTOR and Xception
- Evaluate system-level effects of HW faults using SWIFI
  - E.g. bit errors in registers, address bus errors, etc.
- FERRARI (Fault and ERRor Automatic Real-time Injector)
  - Injects errors while applications are running
  - Compares with a golden run
  - Targets: registers, PC, instruction type, branch and CC
- DOCTOR
  - Injects CPU, memory and network faults
  - Uses timeouts, traps and code mutations
  - Used on distributed real-time systems
- Xception (example on next slides)
  - Uses debugging facilities in CPUs

26 Xception
- Goal: SWIFI using HW debugging support
  - Minimizing intrusion using debugging interfaces
  - Many fault triggers
  - Detailed performance monitoring can be used
  - Can affect any SW process (including kernel)
  - No source code needed
[Figure: Xception architecture. User space: Experiment Manager Module, Fault Setup, Target App, Fault Archive, Outputs, Logs, Results. Kernel space: Injector. Data flow label: Faults.]

27 Xception's Fault Model
- Duration
  - Transient
- Location
  - Components inside the processor: integer unit, FPU, MMU, buses, registers, branch processing
- Trigger
  - Temporal
  - Opcode fetch, operand load/store
- Types
  - Bit-flips
  - Masks based on register/bus/memory sizes (e.g. 32 bits)

28 Xception
- Data to collect
  - Fault information
  - System state information: instruction pointer, etc.
  - Kernel and application deviations: kernel error codes, output of applications (workload)
  - Error detection status
  - Performance monitoring information

29 Xception
[Figure: results for a 4-node parallel computer running a Linda π calculation benchmark. © J. Carreira et al, TOSE 24(2) 1998]
[Figure: results for a 4-node parallel computer running a Linda matrix multiplication benchmark (with FT algorithm). © J. Carreira et al, TOSE 24(2) 1998]

30 DEFINE, FIAT and FTAPE
- Evaluate system-level effects of HW and SW faults
  - E.g. bit errors in data and code defects
- DEFINE
  - HW and SW faults for distributed systems
  - Memory, CPU, buses and communication channels
  - Synthetic workload (WL)
  - Studied the impact of missing/corrupted messages and client failures
- FIAT (Fault Injection Automated Testing)
  - Measures impact on WL applications
  - Bit-level errors in target workload
  - Limited fault manifestations

31 MAFALDA, NFTAPE and PROPANE
- Evaluate effects of HW and SW faults, and analyze error propagation
  - From system level down to variable level
  - Need instrumentation, but no HW support
- MAFALDA focuses on micro-kernels
  - Bit-flips in memory/data and APIs
- NFTAPE tries to do everything in one tool!
- PROPANE is purely software

32 Instrumentation Example (PROPANE)

Original code:

    #include <math.h>   /* pow(); PI is assumed to be defined elsewhere */

    double spherical_volume( double radius )
    {
        double volume;
        volume = 4.0 * (PI * pow(radius, 3.0)) / 3.0;
        return volume;
    }

Instrumented code:

    double spherical_volume( double radius )
    {
        double volume;
        /* Injection location for radius */
        propane_inject( IL_SPHERE_VOL, &radius, PROPANE_DOUBLE );
        /* Probe the value of radius */
        propane_log_var( P_RADIUS, &radius );
        volume = 4.0 * (PI * pow(radius, 3.0)) / 3.0;
        /* Probe the value of volume */
        propane_log_var( P_VOLUME, &volume );
        return volume;
    }

33 PROPANE
- PROPANE = PROPagation ANalysis Environment
[Figure: error propagation graph shaded from highest to lowest error rate. Signals shown: ms_slot_nbr, i, mscnt, pulscnt, slow_speed, stopped, IsValue, OutValue, TOC2, ADC, TCNT, TIC1, PACNT, SetValue. Modules shown: CLOCK, PRES_S, V_REG, PRES_A, CALC, DIST_S.]

34 Code Mutations
- Idea: try to simulate real faults in binary code
  1. Search real SW for faults
  2. Identify the fault patterns in the binaries
  3. Inject the patterns into your SW

35 When Do I Use Approach X?
Study | Main tools
Architecture & high-level FI mechanisms | DEPEND, Loki
Low-level FI mechanisms | All (except perhaps DEPEND, Loki)
OS robustness | FERRARI, DEFINE (both are for UNIX), MAFALDA (for kernels), Ballista
Propagation analysis | NFTAPE, PROPANE

36 Fault Injection
- Fault injection is experimental: it provides a statistical basis for establishing a desired level of confidence in the system.
- Keep in mind that:
  a) the statistical basis does not always apply to real systems, especially SW
  b) statistically significant injections have little meaning if (a) applies
  c) the injected fault is NOT the real fault

37 More Information
- Iyer R., Tang D., "Experimental Analysis of Computer System Dependability", Chapter 5 in Pradhan's book Fault-Tolerant Computer System Design, 1996
- [Check papers on EPIC, Propane, M. Hiller's PhD thesis]