Ultrascale Systems Research Center, Los Alamos National Laboratory


1 DECAF-FSEFI: A Fine-grained, Accountable, Flexible, and Efficient Soft Error Injection Framework for Profiling Application Vulnerability (PFSEFI 2.0)
LA-UR
Qiang Guan
Ultrascale Systems Research Center, Los Alamos National Laboratory
SIAM Annual Meeting 2017

2 Acknowledgement
Sponsored by DoE ASC program.
Participants: Nathan DeBardeleben (PI and technical lead, LANL), Qiang Guan (LANL), Mike Lang (LANL), Xunchao Hu (Syracuse U.), Heng Yin (UC Riverside), Panruo Wu (ORNL), Bo Fang (UBC), Kai Wu (UC Merced), David Rusty (Clemson U.), Terry Grave (CCU)

3 Motivation
Soft errors pose a serious threat to the prospect of exascale systems.
Fault injection systems serve as the research tool:
  Source code instrumentation
  Dynamic binary instrumentation
Limitations of existing tools:
  Designed for different injection tasks
  Performance overhead is usually high
  No user-friendly interfaces

4 Design Goal
Fine-grained: inject faults into a designated application and instruction
Accountable: trace how the injected faults are propagated
Flexible: easy to customize the fault injector
Efficient: reasonable performance overhead

5 DECAF-FSEFI
Dynamic binary instrumentation and virtual machine based fault injection.
Based on DECAF (Dynamic Executable Code Analysis Framework) [1], a tool built on top of QEMU.
TEMU inserts a kernel module into the guest OS; this kernel module hooks several system events, captures the OS-level information and semantics, and passes them to the hypervisor through a special channel.
[1] Andrew Henderson, Aravind Prakash, Lok Kwong Yan, Xunchao Hu, Xujiewen Wang, Rundong Zhou, and Heng Yin, Make It Work, Make It Right, Make It Fast: Building a Platform-Neutral Whole-System Dynamic Binary Analysis Platform, In Proceedings of International Symposium on Software Testing and Analysis (ISSTA'14), San Jose, CA, July 2014.

6 DECAF-FSEFI Overall Architecture

7 DECAF-FSEFI – Just-In-Time Fault Injection
Dynamic binary translation in QEMU:
  Guest instructions -> Tiny Code Generator (TCG) IR -> host instructions
Placement of the fault injector:
  Use the Virtual Machine Introspection (VMI) technique to extract the current context information
  Flush the translation cache and insert fault injectors during the code translation process
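
The flush-and-retranslate idea can be illustrated with a toy translate-and-cache loop. This is only a sketch under simplifying assumptions: it does not use QEMU, TCG, or DECAF APIs, and every name in it (ToyTranslator, set_injection_point, the "r0" register) is hypothetical. Each guest "instruction" is a Python function, the translation cache maps a program counter to a translated callable, and selecting an injection point flushes the cache so the target is re-translated with the injector attached.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, int]            # toy register file
Op = Callable[[State], None]      # one toy "guest instruction"


class ToyTranslator:
    """Toy model of a translate-and-cache execution loop (not QEMU's API)."""

    def __init__(self, program: List[Op]) -> None:
        self.program = program
        self.cache: Dict[int, Op] = {}        # pc -> translated op
        self.target_pc: Optional[int] = None
        self.injector: Optional[Op] = None
        self.translations = 0

    def set_injection_point(self, pc: int, injector: Op) -> None:
        self.target_pc = pc
        self.injector = injector
        self.cache.clear()                    # flush so the target is re-translated

    def translate(self, pc: int) -> Op:
        self.translations += 1
        op = self.program[pc]
        if pc == self.target_pc and self.injector is not None:
            injector = self.injector

            def instrumented(state: State) -> None:
                op(state)
                injector(state)               # runs right after the target op

            return instrumented
        return op

    def run(self, state: State) -> State:
        for pc in range(len(self.program)):
            if pc not in self.cache:          # translate only on a cache miss
                self.cache[pc] = self.translate(pc)
            self.cache[pc](state)
        return state


def load(state: State) -> None:
    state["r0"] = 8


def double(state: State) -> None:
    state["r0"] *= 2


def flip_bit0(state: State) -> None:
    state["r0"] ^= 1                          # single-bit fault in r0


vm = ToyTranslator([load, double])
clean = vm.run({})                            # {'r0': 16}
vm.set_injection_point(0, flip_bit0)
faulty = vm.run({})                           # load -> 8, flip -> 9, double -> 18
```

Without the flush, the cached, uninstrumented translation of the target would keep running and the injector would never execute, which is why the cache is cleared when the injection point changes.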

8 DECAF-FSEFI – Fault Propagation Trace
Dynamic taint analysis:
  Runs a program and observes which computations are affected by predefined taint sources, such as user input
  Injected faults are marked as taint sources
Our design:
  We rely on DECAF, which supports whole-system tainting.
  We extend DECAF to support tainting of floating-point instructions and MPI applications.

9 DECAF-FSEFI – Fault Propagation Trace
Fault propagation for MPI applications:
  Local propagation and cross-process propagation.
Fault propagation log:
  An instruction-level log is too expensive.
  Only tainted memory operations are logged:
    An instruction writes tainted data to memory
    An instruction reads tainted data from memory
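
The logging policy above can be sketched with a toy shadow-state tracker. This is an illustration only, not DECAF's taint engine: the class, the register names, and the addresses are hypothetical, taint is tracked per register and per address rather than per bit, and cross-process (MPI) propagation is left out. The point it shows is that register-to-register propagation is tracked silently, while a log entry is written only when tainted data crosses the memory boundary.

```python
from typing import List, Set, Tuple


class TaintTracker:
    """Toy shadow-state taint tracker (hypothetical, coarse-grained)."""

    def __init__(self) -> None:
        self.tainted_regs: Set[str] = set()
        self.tainted_mem: Set[int] = set()
        self.log: List[Tuple[str, int]] = []   # ("read" | "write", address)

    def inject(self, reg: str) -> None:
        """Mark a register that received an injected fault as a taint source."""
        self.tainted_regs.add(reg)

    def alu(self, dst: str, *srcs: str) -> None:
        """dst <- f(srcs): dst is tainted if any source is; nothing is logged."""
        if any(s in self.tainted_regs for s in srcs):
            self.tainted_regs.add(dst)
        else:
            self.tainted_regs.discard(dst)

    def store(self, addr: int, src: str) -> None:
        """mem[addr] <- src: logged only when tainted data is written."""
        if src in self.tainted_regs:
            self.tainted_mem.add(addr)
            self.log.append(("write", addr))
        else:
            self.tainted_mem.discard(addr)     # clean data overwrites the taint

    def load(self, dst: str, addr: int) -> None:
        """dst <- mem[addr]: logged only when tainted data is read."""
        if addr in self.tainted_mem:
            self.tainted_regs.add(dst)
            self.log.append(("read", addr))
        else:
            self.tainted_regs.discard(dst)


t = TaintTracker()
t.inject("xmm0")                 # fault injected into xmm0
t.alu("xmm1", "xmm0", "xmm2")    # taint spreads to xmm1, no log entry
t.store(0x1000, "xmm1")          # tainted write: logged
t.store(0x2000, "xmm2")          # clean write: not logged
t.load("xmm3", 0x1000)           # tainted read: logged
t.load("xmm4", 0x2000)           # clean read: not logged
```

After this sequence the log holds two entries, the tainted write and the tainted read at address 0x1000, even though six operations were executed.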

10 DECAF-FSEFI - Flexible Fault Injection Interfaces
Fine-grained control:
  What application
  When to inject
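
The two control knobs, which application and when, can be sketched as a small trigger policy. The transcript does not show the real interface, so everything here is an assumption: the class names loosely echo the ProbabilisticInjector and DeterministicInjector listed on slide 13, but their actual semantics are not described in the slides, and the should_inject and watch names are hypothetical.

```python
import random
from typing import List


class DeterministicTrigger:
    """Assumed semantics: fire once, on the n-th execution of the watched instruction."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.count = 0

    def should_inject(self) -> bool:
        self.count += 1
        return self.count == self.n


class ProbabilisticTrigger:
    """Assumed semantics: fire on each execution independently with probability p."""

    def __init__(self, p: float, seed: int = 0) -> None:
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must be in [0, 1]")
        self.p = p
        self.rng = random.Random(seed)         # seeded for repeatable campaigns

    def should_inject(self) -> bool:
        return self.rng.random() < self.p


def watch(process_name: str, target: str, trigger, executions: int) -> List[int]:
    """Return the 1-based execution indices at which a fault would be injected.

    Only the designated application is considered; other processes are never touched.
    """
    if process_name != target:
        return []
    return [i for i in range(1, executions + 1) if trigger.should_inject()]


hits = watch("clamr", "clamr", DeterministicTrigger(3), 10)    # [3]
skipped = watch("bash", "clamr", DeterministicTrigger(3), 10)  # []
```

Keeping the "when" decision behind a single method is one way a new injection policy could stay small, which is the kind of effort slide 13 measures in lines of code.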

11 Evaluation
Goal:
  Performance
  Flexibility of the injection interfaces
  Fine-grained and accountable fault injection using case studies
Setup:
  8-core 3.60GHz Intel Core i CPU desktop with 32GB of RAM
  Host system: Ubuntu LTS
  DECAF-FSEFI uses an Ubuntu LTS image with 512MB of RAM

12 Performance
Performance overhead over bare metal:
  PFSEFI        ~300x
  DECAF         ~109x
  DECAF-FSEFI   109x-270x
DECAF-FSEFI over DECAF:
  Best case: almost 0% overhead
  Worst case: 2.48x
  2.48% if only 1% of the code is inspected
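
The worst-case figure for DECAF-FSEFI over DECAF appears to follow from the bare-metal slowdowns on this slide. A quick check, assuming the approximate values ~109x and 270x are the inputs (the slide does not state how the ratio was derived):

```python
decaf_over_bare = 109        # ~109x slowdown of DECAF over bare metal (from the slide)
fsefi_over_bare_best = 109   # lower end of the DECAF-FSEFI range
fsefi_over_bare_worst = 270  # upper end of the DECAF-FSEFI range

best_ratio = fsefi_over_bare_best / decaf_over_bare     # 1.0, i.e. almost 0% overhead
worst_ratio = fsefi_over_bare_worst / decaf_over_bare   # about 2.48

print(f"best case:  {best_ratio:.2f}x over DECAF")
print(f"worst case: {worst_ratio:.2f}x over DECAF")
```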

13 Flexibility
How much effort does it take to develop a new fault injector?
  Injector Name           LOC   Time (hours)
  ProbabilisticInjector   97    2
  DeterministicInjector   100
  GroupInjector           98    2.5

14 Error Propagation
Matvec - 5k times, two nodes, random faults
Propagation between nodes:
  Total   No Output   MPI_ERROR   SDC
  208     147         55          6

15 Case Study - CLAMR Experiment 1 – Fault Injection Analysis
Run CLAMR 5195 times; when floating-point instruction i has been executed n times, inject x-bit transient errors into the registers used by i.
  Total runs: 5195
  Detected faults: 4349 (83.71%)
  Undetected faults: 846 (16.28%)
    Incorrect results: 228 (4.38%)
    Correct results: 618 (11.89%)
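
The counts on this slide can be cross-checked against each other. The sketch below assumes that 846 is the total of undetected faults, split into 228 runs with incorrect results and 618 with correct results; that reading is inferred from the sums, not stated on the slide. The slide's percentages match the counts when truncated (not rounded) to two decimals, so integer arithmetic is used to reproduce them.

```python
total = 5195
detected = 4349
undetected_incorrect = 228
undetected_correct = 618
undetected = undetected_incorrect + undetected_correct   # 846


def truncated_percent(count: int, out_of: int) -> str:
    """Percentage truncated to two decimals, using integer arithmetic only."""
    hundredths = count * 10000 // out_of
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


shares = {
    "detected": truncated_percent(detected, total),
    "undetected": truncated_percent(undetected, total),
    "undetected, incorrect results": truncated_percent(undetected_incorrect, total),
    "undetected, correct results": truncated_percent(undetected_correct, total),
}
for label, value in shares.items():
    print(f"{label}: {value}")
```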

16 Case Study - CLAMR Experiment 2 – Fault Propagation Analysis
Run CLAMR 2973 times; besides registers, also inject x-bit transient errors into the memory accessed by instruction i.
[Figure: tainted bytes in the propagation]

17 Case Study - CLAMR Memory Operation in the Propagation
[Figure: distribution of the number of tainted memory reads]
[Figure: distribution of the number of tainted memory writes]

18 Case Study - CLAMR Memory Operation in the Propagation
[Figure: ratio of tainted memory reads in memory operations]
118 (3.97%) runs: only tainted memory read
444 (14.93%): only tainted memory write

19 Success Stories
Coding teams:
  CLAMR: DoE mini-app
  FleCSALE: continuum dynamics
  Legion: parallel programming model
Algorithm-Based Fault Tolerance (ABFT):
  HPDC’17, PPoPP’17, HPDC’16

20 Conclusion
We propose DECAF-FSEFI, a fine-grained, accountable, flexible, and efficient soft error injection framework built on top of QEMU.
We implement DECAF-FSEFI and evaluate its performance and flexibility.
We demonstrate the usage of DECAF-FSEFI with case studies.

21 Thanks !! Find us via http://ultrascale.org/
Find PFSEFI via me via

22 Current Status and Future Work
The implications of SDC on the file system
A fault injector on the file system driver:
  Locate the kernel driver of the file system.
  Inject faults into the kernel driver.
Fault injection experiments on the file system:
  Inject faults into different file systems.
  Study the results.
Resilient file system design:
  Vulnerability of the file system
  Protection approach study
  Prototype implementation and evaluation

23 References [1] Andrew Henderson, Aravind Prakash, Lok Kwong Yan, Xunchao Hu, Xujiewen Wang, Rundong Zhou, and Heng Yin, Make It Work, Make It Right, Make It Fast: Building a Platform-Neutral Whole-System Dynamic Binary Analysis Platform, In Proceedings of International Symposium on Software Testing and Analysis (ISSTA'14), San Jose, CA, July 2014.

24 Backups

25 Fault Injection Experiments
Rodinia - 5K times per program, random faults

