Harnessing Soft Computation for Low-Budget Fault Tolerance
Daya S Khudia, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor


Soft Errors
Soft errors, also called single-event upsets (SEUs), occur because of:
- High-energy particle strikes
- Electrical noise
- Circuit crosstalk
Image credit: Certichip

Increasing Soft Error Rate
Parameters affecting soft error rates:
- Shrinking dimensions
- Voltage scaling
[Figure: Oracle (Sun)’s neutron beam experiments over the past 10 years, showing a 14x increase in soft error rate from 250nm to 40nm, with the rate marked as increasing toward 10nm] [Dixit, IRPS’11]

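Instruction Duplication
Traditional way:
- Duplicate the producer chain
- Compare at strategic points: global stores and function calls
- On a mismatch, trigger recovery; otherwise continue execution
The average overhead is ~50%.
[Figure: an instruction chain beginning at a load (the start point), showing original instructions, duplicated instructions, and the inserted compares and branches]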
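The duplicate-then-compare pattern can be sketched at source level as below. This is only an illustration: in the talk the transformation is applied by the compiler to instructions, and an optimizing C compiler would normally fold the two identical computations together. The names `produce`, `checked_store`, and the `fault_mask` test hook are assumptions introduced here, not part of the original work.

```c
#include <stdbool.h>
#include <stdint.h>

/* Producer chain whose result feeds a store (hypothetical example). */
static int32_t produce(int32_t a, int32_t b, int32_t max_val) {
    return (a * b) / max_val;
}

/*
 * Store guarded by duplication. fault_mask is a hypothetical test hook that
 * flips bits in the original result to emulate a soft error; 0 means no fault.
 * Returns true if the store was performed, false if the compare failed
 * (the point where a real system would trigger recovery).
 */
bool checked_store(int32_t *dst, int32_t a, int32_t b, int32_t max_val,
                   int32_t fault_mask) {
    int32_t orig = produce(a, b, max_val) ^ fault_mask; /* original chain */
    int32_t dup  = produce(a, b, max_val);              /* duplicated chain */
    if (orig != dup) {
        return false;   /* mismatch before the store: do not commit */
    }
    *dst = orig;        /* values agree: commit the store */
    return true;
}
```

Calling `checked_store(&out, 6, 10, 3, 0)` commits the value 20, while a nonzero `fault_mask` makes the compare fail and leaves `out` untouched.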

Soft Applications
100% accuracy is not always required:
- Image Processing
- Computer Vision
- Data Analytics
- Media Applications
- Robotics

Acceptable vs. Unacceptable Outputs
Fault sources shown: particle strike, electrical noise
- ✓ Acceptable: PSNR > thr
- ✗ Unacceptable: PSNR < thr
Goal: reduce unacceptable outputs efficiently.

Classification Refinement
Error in a bit: does it affect program output?
- No: Masked (benign)
- Yes: Silent Data Corruption (SDC). Is the error > thr?
  - No: Acceptable Silent Data Corruptions (ASDCs)
  - Yes: Unacceptable Silent Data Corruptions (USDCs)
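The decision tree above can be written as a small classifier. The sketch below assumes the output is a byte buffer and uses mean squared error as the error metric; both are assumptions made for illustration (the talk uses PSNR for images and does not prescribe this function). The names `fault_outcome` and `classify` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { MASKED, ASDC, USDC } fault_outcome;

/*
 * Compare a faulty output against the fault-free ("golden") output.
 * - identical outputs            -> MASKED
 * - different, error <= thr      -> ASDC (acceptable SDC)
 * - different, error >  thr      -> USDC (unacceptable SDC)
 * The error metric here is mean squared error (an assumption).
 */
fault_outcome classify(const uint8_t *golden, const uint8_t *faulty,
                       size_t n, double thr) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)golden[i] - (double)faulty[i];
        sum += d * d;
    }
    if (sum == 0.0) {
        return MASKED;              /* output unaffected by the fault */
    }
    double mse = sum / (double)n;   /* n > 0 here, since sum != 0 */
    return (mse > thr) ? USDC : ASDC;
}
```

With a threshold of 1.0, a one-level change in a single element of a four-element buffer is classified as an ASDC, while a change of 100 levels is a USDC.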

Acceptable vs. Unacceptable
No need to pay the cost of detection for acceptable outputs.

Hierarchy of Protection Needs
    wIdx = 1;
    int i, data1, data2, result;
    for (i = 0; i < N; i += 1) {
        wIdx = wIdx << 2;                     /* carried from one iteration to the next */
        data1 = read_table( );
        data2 = read_table( );
        result = (data1 * data2) / maxVal;    /* value written out each iteration */
        write(wIdx, result);
    }
- Not all variables are created equal
- Different variables require varying levels of protection
- Protection levels: soft checks, hard checks, no checks

Proposed Solution
Duplication is expensive
- Use it sparingly for critical variables
Instructions and outputs of code regions have value locality
- Exploit value locality and check for deviations
- Example: op2 = op3 * op4 produces 0 more than thr times, so insert a value comparison
Efficient to check for deviations from expectations.

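Expected Value Checks
- Single value: the instruction produces V1 frequently. Compare R1 with V1; on a mismatch, trigger recovery, otherwise continue execution.
- Value range: the instruction produces values between V1 and V2 frequently. Check whether R1 < V1 or R1 > V2; if so, trigger recovery, otherwise continue execution.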
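The two checks reduce to a pair of predicates. The sketch below is a source-level illustration only; the function names are hypothetical, and in the talk the checks are compare-and-branch instructions inserted by the compiler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Single-value check: true when r1 deviates from the expected value v1. */
bool deviates_from_value(int32_t r1, int32_t v1) {
    return r1 != v1;
}

/* Range check: true when r1 falls outside the expected range [v1, v2]. */
bool deviates_from_range(int32_t r1, int32_t v1, int32_t v2) {
    return r1 < v1 || r1 > v2;
}
```

A caller would branch to its recovery path when either predicate returns true and continue execution otherwise.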

Opt 1: Reducing Value Checks
[Figure: dataflow graph over R1 to R5, highlighting the instructions amenable to value checks]
Naïve way: insert a check for all the amenable instructions.
[Figure: the same graph with compare-and-branch checks of R1, R4, and R5 against expected values V1, V4, and V5]
Sufficient to insert a check for the dominating instruction:
- An early large variation might get subdued
- Otherwise, it gets caught at the dominating instruction
[Figure: the same graph with a single compare-and-branch check of R5 against V5]

Opt 2: Reducing Duplication
- Target instruction: op1 = op2 + 1
- Producer: op2 = op3 * op4, which produces 0 more than thr times
[Figure: the original instructions beside the duplicated chain (D op1 = D op2 + 1, D op2 = D op3 * D op4), with compares and branches that trigger recovery; a comparison of op2 against 0 is shown in place of D op2]
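One reading of this slide is that the duplication chain stops at op2: a value check against the frequent value 0 stands in for duplicating op2's producers, and only the target instruction is duplicated. The sketch below follows that reading, which is an interpretation of the figure rather than a confirmed description of the method. The struct, the function name, and the two `fault_mask` test hooks are hypothetical. Note that a legitimate nonzero op2 also fails the check; how recovery handles that case is outside this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t op1;     /* result of the target instruction (valid if !recover) */
    bool    recover; /* true when a check failed and recovery would trigger */
} op1_result;

/* fault_mask_op2 / fault_mask_op1 emulate bit flips; 0 means no fault. */
op1_result compute_op1(int32_t op3, int32_t op4,
                       int32_t fault_mask_op2, int32_t fault_mask_op1) {
    op1_result out = {0, false};

    int32_t op2 = (op3 * op4) ^ fault_mask_op2;  /* original producer */
    if (op2 != 0) {              /* value check replaces the duplicate of op2 */
        out.recover = true;
        return out;
    }

    int32_t op1   = (op2 + 1) ^ fault_mask_op1;  /* original target instr */
    int32_t d_op1 = op2 + 1;                     /* duplicated target instr */
    if (op1 != d_op1) {          /* compare original against duplicate */
        out.recover = true;
        return out;
    }

    out.op1 = op1;
    return out;
}
```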

Value Profiling and Value Ranges
Key observations:
- Recording and storing all values is time consuming
- An instruction produces a compact range of values
  - Found with a greedy algorithm
Algorithms and full details are in the paper.
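The slide defers the actual algorithm to the paper, so the following is not the authors' method. It is one possible greedy scheme, shown only to make the idea of a compact range concrete: values are assumed to be pre-binned into a histogram, the range starts at the most frequent bin, and it grows toward whichever neighbouring bin holds more samples until a target share of samples is covered. The names `bin_range`, `greedy_range`, and `coverage_pct` are assumptions.

```c
#include <stddef.h>

typedef struct {
    size_t lo;  /* first bin in the range (inclusive) */
    size_t hi;  /* last bin in the range (inclusive) */
} bin_range;

/*
 * Grow a contiguous bin range greedily until it covers at least
 * coverage_pct percent of all profiled samples, or spans every bin.
 */
bin_range greedy_range(const unsigned *hist, size_t nbins,
                       unsigned coverage_pct) {
    bin_range r = {0, 0};
    if (nbins == 0) return r;

    unsigned long total = 0;
    size_t best = 0;
    for (size_t i = 0; i < nbins; i++) {
        total += hist[i];
        if (hist[i] > hist[best]) best = i;  /* most frequent bin */
    }

    r.lo = r.hi = best;
    unsigned long covered = hist[best];
    while (covered * 100 < (unsigned long)coverage_pct * total &&
           (r.lo > 0 || r.hi + 1 < nbins)) {
        int can_left  = r.lo > 0;
        int can_right = r.hi + 1 < nbins;
        unsigned left  = can_left  ? hist[r.lo - 1] : 0;
        unsigned right = can_right ? hist[r.hi + 1] : 0;
        if (can_right && (!can_left || right >= left)) {
            r.hi++;                 /* extend toward the heavier right bin */
            covered += right;
        } else {
            r.lo--;                 /* extend toward the heavier left bin */
            covered += left;
        }
    }
    return r;
}
```

Storing only a histogram and the resulting pair of bounds keeps the profile small, which matches the slide's point that recording every value is too costly.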

Compilation Flow
No annotations required
- Identify state (critical) variables by loop-carried dependence
- State variables have a snowball effect
Expected value checks for less critical variables
Flow: Application source code -> Intermediate Representation (IR) -> Code analysis and intelligent duplication (IR to IR) -> Code generation -> Application binary
The IR-to-IR stage comprises analyses and optimizations, classification, duplication, and value checks.

Evaluation Methodology
Program analysis and duplication/checks
- Implemented as a compiler pass in the LLVM compiler
Statistical fault injection (SFI) experiments
- GEM5 simulator in ARM syscall emulation mode
Random (single) bit flip faults
- Simulated entire benchmarks after fault injection
- Results classification after completion
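The fault model, a single random bit flip, is simple to state in code. The sketch below shows only the flip itself on a 32-bit value; it is not how the GEM5-based injector is implemented, and the function names and the use of `rand()` are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Flip one bit of value; bit is taken modulo 32. */
uint32_t flip_bit(uint32_t value, unsigned bit) {
    return value ^ (UINT32_C(1) << (bit & 31u));
}

/* Flip one uniformly chosen bit position (uses the C library rand()). */
uint32_t inject_random_fault(uint32_t value) {
    return flip_bit(value, (unsigned)rand() % 32u);
}
```

Flipping the same bit twice restores the original value, and the injected value always differs from the original in exactly one bit position.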

Benchmarks
- Image Processing: JPEG encoding/decoding, tiff to BW
- Audio/Video Processing: G721 encoder/decoder, MP3 encoding/decoding, H264 encoding/decoding
- Robotics: Kmeans clustering, Support vector machine
- Computer Vision: Image segmentation, Texture synthesis
Diversified set of benchmarks for evaluation.

Performance Overhead
[Figure: performance overhead for three configurations: Traditional, Selective, and Selective + Checks. Labeled values: 53% and 20%]

Fault Coverage Analysis
Unacceptable Silent Data Corruptions (USDCs)
- Worst of the worst outcomes
[Figure: fault outcome breakdown by category: Masked + ASDCs, Detects, Failures, USDCs. Annotation: 2.8x reduction in USDCs]

Conclusion
Transient faults are important
- Output classification refinement
- Selective duplication with value checks
In comparison to traditional duplication
- Reduces performance overhead from 53% down to 20%
- Fault coverage is comparable (98.6% vs 98.8%)

Fault Outcome Classification
- Masked: acceptable outputs
- SWDetects: detected by duplication
- HWDetects: produces a symptom, such as a page fault, within 1000 cycles of fault injection
- Failures: fail status on program termination, or the program did not terminate in a reasonable time
- USDCs (Unacceptable Silent Data Corruptions): faults that result in unacceptable outputs