Soft Error Detection for Iterative Applications Using Offline Training


Soft Error Detection for Iterative Applications Using Offline Training
Jiaqi Liu and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
12/2/2018

Bigger Picture
- Exploiting the monotonic behavior of residuals in direct solvers to detect SDC
- Algorithm-level fault tolerance for Molecular Dynamics
- Offline training for SDC detection
- Resilience for "Simulation + In-Situ Analytics"

High-level View
- Targets iterative solvers whose residuals are not monotonic
- SDC leaves a signature on the time series of residual values
- This signature is application-dependent but problem-size/dataset-independent
- Train the application offline, and attach the resulting models to the application

Motivating Study – Impact of Soft Errors
- Inspect the impact of soft errors on iterative applications
- Inject bit flips into different bits of a variable at different execution stages
- Observe how a flip in a given bit affects the output
- Observe how the execution stage (denoted as a percentage of total iterations) affects the output
- Mimics a Single Event Upset (SEU): only one bit flip per execution
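The injection step above can be sketched in a few lines. This is a minimal, hypothetical sketch (the function name and interface are illustrative, not the authors' actual injector): it flips one chosen bit of an IEEE-754 double, mimicking an SEU in a floating-point variable.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip a single bit (0 = mantissa LSB, 63 = sign) of an IEEE-754 double,
    mimicking a Single Event Upset in a floating-point variable."""
    # Reinterpret the double's bytes as a 64-bit unsigned integer, XOR the
    # chosen bit, and reinterpret back as a double.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))[0]

# Low-order mantissa bits barely perturb the value, which is consistent with
# the observation that flips in bits 0-32 caused no visible change; flips in
# the high exponent bits change the value drastically.
small = flip_bit(1.0, 4)    # near 1.0
large = flip_bit(1.0, 62)   # high exponent bit flipped
```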

Impact of SEUs on Iterative Applications
- With no bit flip, the residual value gradually decreases and converges relatively quickly
- Under the impact of a soft error, the run takes longer to complete, or may not complete at all
- Bit flips in bits 0-32 did not lead to any observable change
- The bit flips shown are injected at around 20%, 40%, and 60% of the runtime, respectively
Impact of an SEU on CG: residual values over the run, with bits flipped in different bit ranges

Impact of SEUs on Iterative Applications
Impact of an SEU on CG, measured as the Normalized Relative Difference against the output of a normal execution

Design
- Observation: the residual value can serve as a signature of soft errors in iterative convergent applications
- Create an input-independent solution by applying machine learning techniques:
  - Collect the behavior of the residual on sample inputs, with and without bit flips
  - Train models with a machine learning technique that can classify correct vs. incorrect behavior
  - Apply the models at runtime to verify the correctness of the current execution

Design – Overview
- Sampling Stage: generate the data set for profiling by running the application both with correct execution and with injected soft errors
- Training Stage: scale the input data and use a classifier algorithm from a machine learning library (neural networks, decision trees, support vector machines, others) to train models on both correct runs and runs with bit flips; one model per execution point (10%, 20%, ..., 90%) is stored in a model pool
- Execution Stage: at each execution point, invoke the corresponding model; if all models agree, the application completes normally, otherwise recovery is performed

Design – Sampling Stage
- Goal: generate sufficient records of residual values for training the models
- Execute the application on a set of different input sizes, multiple times, with bit flips injected nondeterministically into the critical data structures
- CORRECT runtime data: soft-error-free runs, plus runs where the SDC does not lead to a significant change in the final result
- CORRUPTED runtime data: all other runs
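The sampling stage above can be sketched as follows. This is a hedged stand-in, not the authors' code: a synthetic, geometrically decaying residual replaces a real CG run, and the injected "fault" is a simple multiplicative jump, used only to show how labeled CORRECT/CORRUPTED traces could be produced.

```python
import random

def run_solver(iters=100, flip_at=None, magnitude=1e3):
    """Synthetic residual trace: decays geometrically when fault-free; an
    injected fault perturbs the residual from iteration flip_at onward."""
    residuals, r = [], 1.0
    for i in range(iters):
        r *= 0.9
        if flip_at is not None and i == flip_at:
            r *= magnitude            # corruption leaves a visible signature
        residuals.append(r)
    return residuals

def sample(n_runs=50, seed=0):
    """Generate labeled traces, as in the sampling stage: roughly half
    fault-free, half with a bit flip at a random iteration."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_runs):
        if rng.random() < 0.5:
            data.append((run_solver(), "CORRECT"))
        else:
            data.append((run_solver(flip_at=rng.randrange(100)), "CORRUPTED"))
    return data
```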

Design – Training Stage
- Goal: generate models that classify the runtime sequence of residual values as CORRECT or CORRUPTED
- Apply a classifier algorithm to train models that determine runtime correctness
- Discard iterations beyond the maximum iteration count, to keep irrelevant values from delayed or non-terminating runs out of the training data
- For each runtime time step, collect the corresponding residual values, scale them, and train a model that classifies them into the two classes
- Store the models in a model pool for use in the execution stage

Design – Training Stage
Algorithm example for training the models; a Support Vector Machine is used in this example
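A minimal sketch of the per-checkpoint training idea, under stated assumptions: a one-feature threshold classifier stands in for the SVM named in the slide (the slide's actual model), and the toy traces below are fabricated for illustration only.

```python
def train_checkpoint_model(traces, checkpoint):
    """Train a threshold model for one checkpoint: the largest residual seen
    among CORRECT runs at that iteration becomes the cutoff. (A stand-in for
    the SVM classifier used in the slides' example.)"""
    correct_vals = [t[checkpoint] for t, label in traces if label == "CORRECT"]
    threshold = max(correct_vals)
    return lambda residual: "CORRECT" if residual <= threshold else "CORRUPTED"

# Toy labeled traces: residual values at iterations 0..4, plus a label.
traces = [
    ([1.0, 0.9, 0.81, 0.73, 0.66], "CORRECT"),
    ([1.0, 0.9, 0.81, 0.72, 0.65], "CORRECT"),
    ([1.0, 0.9, 810.0, 729.0, 656.0], "CORRUPTED"),
]

# One model per checkpoint, stored in a pool as in the design overview.
model_pool = {cp: train_checkpoint_model(traces, cp) for cp in range(5)}
```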

Design – Execution Stage
- Execute the application together with the model pool
- Invoke the available model at runtime to verify the correctness of the execution
- The model classifies the current run into one of two classes: CORRECT or CORRUPTED
- CORRECT: the current execution is either soft-error free, or the observed impact is negligible
- CORRUPTED: a significant amount of computation has been corrupted by the soft error; recovery is needed

Design – Execution Stage
Algorithm example for the execution stage, invoking the corresponding models from the model pool
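The execution-stage check can be sketched as below. This is an illustrative sketch, not the actual implementation: the model pool here is a hand-built toy (threshold models keyed by iteration), and "recovery" is reduced to returning the iteration at which a CORRUPTED classification first occurs.

```python
def execute_with_detection(residual_stream, model_pool):
    """At each checkpoint for which a model exists, classify the current
    residual. Return the first iteration flagged CORRUPTED (where recovery,
    e.g. a rollback, would start), or None if all checkpoints pass."""
    for i, r in enumerate(residual_stream):
        model = model_pool.get(i)
        if model is not None and model(r) == "CORRUPTED":
            return i
    return None

# Toy model pool: checkpoints at iterations 2 and 4, with cutoffs taken
# from fault-free residual behavior (values fabricated for illustration).
model_pool = {
    2: lambda r: "CORRECT" if r <= 0.81 else "CORRUPTED",
    4: lambda r: "CORRECT" if r <= 0.66 else "CORRUPTED",
}

clean = execute_with_detection([1.0, 0.9, 0.81, 0.73, 0.66], model_pool)
bad = execute_with_detection([1.0, 0.9, 810.0, 729.0, 656.0], model_pool)
```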

Experimental Setup
- Applications: miniFE, CG, HPCCG
- Evaluation: accuracy, latency, overhead, generalized fault injection
- Datasets: separate data sets for training and for performance evaluation

Experimental Results – Accuracy
Detection results with offline training on different problem sizes. The figure shows bit flips occurring in 5% intervals (20%-45% and 55%-80% are omitted due to their similarity to the 50% case)

Experimental Results – Latency
Detection rate of different models at different execution stages

Experimental Results – Overhead
Overhead shown as slowdown percentage compared to AID
Overhead of the Model60 configuration on different input sizes for CG, shown in absolute time (top) and cost percentage (bottom)

Experimental Results – Generalized Fault Injection
- Accuracy: detection rate with double flips occurring in different 5% intervals (20%-45% and 55%-80% are omitted due to their similarity to the 50% case)
- Latency: detection rate of different models at different execution stages under double flips; each model shows its detection rate for bit injections within the execution range it covers
- Input sizes: miniFE 150*150, CG 6.2k*6.2k, HPCCG 6.6k*6.6k

Thanks for your attention! Q & A