Soft Error Detection for Iterative Applications Using Offline Training


1 Soft Error Detection for Iterative Applications Using Offline Training
Jiaqi Liu and Gagan Agrawal, Department of Computer Science and Engineering, The Ohio State University (12/2/2018)

2 Bigger Picture
Exploiting the monotonic behavior of residuals in direct solvers to detect silent data corruption (SDC)
Algorithm-level fault tolerance for Molecular Dynamics
Offline training for SDC detection
Resilience for "Simulation + In-Situ Analytics"

3 High-level View
For iterative solvers that do not have monotonic residuals:
SDC leaves a signature on the time series of residual values
This signature is application-dependent but independent of problem size and dataset
Train the application offline, and attach the resulting models to the application

4 Motivating Study – Impact of Soft Error
Inspect the impact of soft errors on iterative applications:
Inject bit flips into different bits of a variable, at different execution stages
Observe how a flip in each bit position affects the output
Observe how the execution stage (denoted as a percentage of total iterations) affects the output
Mimics a Single Event Upset (SEU): only one bit flip per execution

5 Impact of SEU on Iterative Applications
With no bit flip, the residual value decreases gradually and converges relatively fast.
Under a soft error, the run takes longer to complete, or may not complete at all.
Bit flips within bits 0–32 did not lead to any change in the output.
The bit flips shown were injected at roughly 20%, 40%, and 60% of the runtime, respectively.
Impact of SEU on CG: residual values over the run, with bits flipped in different bit ranges.
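The bit-position sensitivity described above can be reproduced with a small experiment. `flip_bit` is a hypothetical helper (not from the talk) that flips one bit of a 64-bit IEEE-754 double, mimicking an SEU: flipping a low-order mantissa bit barely perturbs the value, while flipping a high-order exponent bit changes it drastically.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 binary64 representation of x.

    Bits 0-51 are the mantissa, 52-62 the exponent, 63 the sign.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits))
    return flipped

# Low-order mantissa bits: negligible change, consistent with the
# observation that flips in bits 0-32 did not change the output.
low = flip_bit(1.0, 0)     # 1.0 plus one unit in the last place
# High-order exponent bits: catastrophic change.
high = flip_bit(1.0, 62)   # blows up to infinity
```

Flipping the same bit twice restores the original value, which is what makes XOR a convenient injection mechanism for fault-injection studies.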

6 Impact of SEU on Iterative Applications
Impact of SEU on CG: measured as Normalized Relative Difference against the output from a fault-free execution.
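The talk does not give the formula for Normalized Relative Difference; a common definition (an assumption here) compares the element-wise deviation of the faulty output against the norm of the fault-free reference:

```python
def normalized_relative_difference(output, reference):
    """Sum of element-wise absolute differences, normalized by the
    reference's absolute sum. Formula is an assumption, not from the talk."""
    num = sum(abs(o - r) for o, r in zip(output, reference))
    den = sum(abs(r) for r in reference)
    return num / den if den else float("inf")
```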

7 Design
Observation: the residual value can serve as a signature of soft error for iterative convergent applications.
Create an input-independent solution by applying machine learning techniques:
Collect the behavior of residuals on sample inputs, with and without bit flips
Train models, using a machine learning technique that can classify correct vs. incorrect behavior
Apply the models at runtime to verify the correctness of the current execution

8 Design – Overview
Sampling Stage: generate the data set for profiling; run the application both in correct executions and with injected bit flips.
Training Stage: scale the input data set and use a classifier algorithm from a machine learning library (neural networks, decision trees, support vector machines, others) to generate one model per execution point (10%, 20%, ..., 90%), stored in a model pool.
Execution Stage: at each execution point, invoke the corresponding model; if all models agree the run is correct, the application completes; otherwise, perform recovery.

9 Design – Sampling Stage
Goal: generate sufficient records of residual values for training models.
Execute the application on a set of different input sizes, multiple times each, with bit flips injected nondeterministically into the critical data structures.
CORRECT runtime data: soft-error-free runs, plus runs with SDC that does not lead to a significant change in the final result.
CORRUPTED runtime data: the remaining faulty runs.

10 Design – Training Stage
Goal: generate models that classify a runtime sequence of residual values as CORRECT or CORRUPTED.
Apply a classifier algorithm to train models that determine runtime correctness.
Discard iterations beyond the maximum iteration count, to avoid irrelevant values from delays or infinite loops.
For each runtime time step, collect the corresponding residual values, scale them, and train a model that classifies into the two classes.
Store the models in a model pool for use in the execution stage.
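A minimal sketch of per-step training. The paper uses Support Vector Machines; here a nearest-centroid classifier stands in so the sketch has no external dependencies, and the log-scaling of residuals is an assumption (the talk only says residuals are scaled).

```python
import math

def scale(trace):
    # Log-scale residuals so values spanning many orders of magnitude
    # compare sensibly (an assumed scaling, not from the talk).
    return [math.log10(abs(r) + 1e-300) for r in trace]

def train_centroid_model(labeled_traces, step):
    """Train one model for a given execution point ('step'): a
    nearest-centroid classifier standing in for the paper's SVM."""
    sums = {"CORRECT": [0.0, 0], "CORRUPTED": [0.0, 0]}
    for trace, label in labeled_traces:
        if step < len(trace):                 # discard runs past max iterations
            sums[label][0] += scale(trace)[step]
            sums[label][1] += 1
    centroids = {k: s / n for k, (s, n) in sums.items() if n}

    def classify(residual):
        v = math.log10(abs(residual) + 1e-300)
        return min(centroids, key=lambda k: abs(v - centroids[k]))
    return classify
```

One such model is trained per execution point (10%, 20%, ...) and placed in the model pool.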

11 Design – Training Stage
Algorithm example for training the models; a Support Vector Machine is used in this example.

12 Design – Execution Stage
Execute the application with the model pool; invoke the available model at runtime to verify runtime correctness.
The model classifies the current run into one of two classes:
CORRECT: the current execution is either soft-error free, or only a negligible impact is observed.
CORRUPTED: a significant amount of computation has been corrupted by the soft error; recovery is needed.
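The execution stage can be sketched as follows. `solver_step`, the checkpoint fractions, and the recovery signal are hypothetical names for illustration; the idea is that at each execution point, the matching model from the pool checks the current residual and a CORRUPTED verdict triggers recovery.

```python
def execute_with_detection(solver_step, model_pool, n_iters, checkpoints):
    """Run the solver; at each checkpoint (a fraction of total iterations),
    consult the corresponding model from the pool.

    model_pool maps a checkpoint fraction (e.g. 0.5) to a classifier that
    returns "CORRECT" or "CORRUPTED" for the current residual.
    """
    r = 1.0
    for i in range(1, n_iters + 1):
        r = solver_step(r)
        frac = i / n_iters
        if frac in checkpoints:
            if model_pool[frac](r) == "CORRUPTED":
                return ("RECOVER", i)   # hand off to rollback/recovery
    return ("DONE", n_iters)
```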

13 Design – Execution Stage
Algorithm example of the execution stage, invoking the corresponding models from the model pool.

14 Experimental Setup
Applications: miniFE, CG, HPCCG
Evaluation metrics: accuracy, latency, overhead, generalized fault injection
Datasets: separate input sets for training and for performance evaluation.

15 Experimental Results – Accuracy
Detection results with offline training on different problem sizes. The figure shows bit flips occurring in each 5% interval (20%–45% is omitted due to its similarity to the 50% case).

16 Experimental Results – Latency
Detection rate with different models at different execution stages.

17 Experimental Results – Overhead
Overhead shown as slowdown percentage, compared to AID.
Overhead of Model60 on different input sizes for CG, shown in absolute time (top) and cost percentage (bottom).

18 Experimental Results – Generalized Fault Injection
Accuracy: detection rate with double flips occurring in different 5% intervals (20%–45% and 55%–80% are omitted due to their similarity to the 50% case).
Latency: detection rate with different models at different execution stages under double flips; each model shows its detection rate for bit injection within its covered execution range.
Input sizes: miniFE 150×150, CG 6.2k×6.2k, HPCCG 6.6k×6.6k.

19 Thanks for your attention! Q & A
