University of Michigan Electrical Engineering and Computer Science: Shoestring: Probabilistic Reliability on the Cheap

Presentation transcript:

University of Michigan, Electrical Engineering and Computer Science
Slide 1: Shoestring: Probabilistic Reliability on the Cheap
Authors: Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott Mahlke (University of Michigan)

Slide 2: The Problem...
• Soft errors / transient faults / single event upsets (SEU)
• Caused by a variety of phenomena
  • Cosmic radiation (high-energy neutrons)
  • Alpha particles (packaging impurities)
  • Voltage fluctuations (dI/dt)
  • Timing speculation
• Can alter the value stored in state elements as well as the output of combinational logic
• Attenuated by masking at several levels (circuit, logic, μarch, etc.)
• Soft error rates (SER) are on the rise

Slide 3: Soft error rate (SER)
[Figure: SER trend across Past, Present, and Future under aggressive voltage scaling (near-threshold computing). Annotated rates: one failure per MONTH per 100 chips, one failure per DAY per 100 chips, one failure per DAY per chip. Source: Shivakumar'02]
• At high error rates, reliability cannot be reserved only for mission-critical systems
• There is a need for low-cost mainstream solutions

Slide 4: Traditional Architecture Solutions
[Figure: techniques placed along two axes, increasing fault coverage and increasing overheads (area, power, performance, etc.)]
• n-Modular Redundancy
  • Traditional dual/triple-modular redundancy
  • Run on separate hardware and compare results
  • Mission-critical reliability with high hardware costs [IBM Z-series, HP NonStop]
• Redundant Multi-threading
  • Utilize multiple threads (temporal) instead of separate hardware (spatial)
  • Retain high coverage but sacrifice performance to save area [AR-SMT, Reunion]
• Invariant Checking
  • Perform selective checking of software invariants and critical μarch structures [ARGUS, DIVA, Reddy:DSN'08]
• Instruction Duplication
  • Redundant execution in a single-threaded context
  • Compiler interleaves original and redundant instructions
  • "Tunable" coverage [SWIFT, EDDI]
• Symptom-based
  • Relies on anomalous behavior to identify faults
  • Extremely cheap, moderate coverage [RESTORE, SWAT]
• Shoestring
  • Bridge the gap between symptom-based schemes and instruction duplication
  • "Reliability for the masses"
  • Sacrifice a little on coverage to maintain very low costs

Slide 5: Shoestring
[Figure: fault coverage plotted against increasing overheads (performance). Labeled regions: fault coverage from different sources of masking (92%), symptom-based detection (hardware exceptions, branch mispredicts, cache misses), and instruction duplication-based detection, with Shoestring marked on the curve. Annotated failure rates: one failure per DAY, one failure per WEEK, one failure per MONTH.]

Slide 6: Shoestring
[Figure: the coverage-versus-overhead chart from slide 5, repeated.]
• Not all applications/domains require enterprise-level fault tolerance
• The cost of "five-nines" reliability is very high
  • Spent in pursuit of the last few "nines"
Provide affordable reliability for commodity systems on a "shoestring" budget
• Exploit cheap symptom-based fault detection
• Judiciously apply sw-level instruction duplication to improve coverage

Slide 7: Shoestring: System Components
[Diagram labels: Selective Duplication, Compilation, OS / Virtual Machine, Physical Hardware, Runtime Environment]
Compilation
• Analyzes program structure
• Generates enhanced binary w/ selective duplication for vulnerable code
Runtime Environment
• Traps runtime exceptions
  • Selected symptoms
  • Faults in duplicated code
• Recovers w/ pipeline flush / lightweight rollback

Slide 8: Shoestring: Compilation
[Figure: compilation flow from Application Source Code to Application Binary. Stages shown: Front-end; Machine-independent Passes (Traditional Optimizations); Code Generation Passes (Instruction Selection, Register Allocation, Code Emission); Shoestring Passes (Preliminary Classification, Vulnerability Analysis, Code Duplication).]

Slide 9: Preliminary Classification
[Shoestring passes: Preliminary Classification (current), Vulnerability Analysis, Code Duplication]
Identify instructions that are …
• Symptom-generating
  • Instructions that can aid in detecting the presence of a soft error
  • Used during vulnerability analysis
• High-value
  • Instructions that affect "correct" program execution (user-visible output)
  • Used during selective code duplication

Slide 10: Preliminary Classification: Symptom-generating
    r0 = mem[old_base]
    r1 = mem[new_base]
    r2 = 128
    r4 = 1
    r5 = r2 / r4
    r6 = mem[r0+r5]
    r4 = r4 + 1
    mem[r1+r5] = r6
    r7 = mem[new_base]
    printf(r7,…)
    stack[n_copied] = r4
If an input operand is corrupted, the instruction is likely to cause anomalous behavior
• segmentation fault
• divide by zero
• cache miss
• branch mispredict
• ...
• Limit consideration to symptoms that rarely, if ever, occur during normal operation
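The classification on this slide can be pictured with a small sketch. It is not the authors' compiler pass: the `Instr` record, the opcode names, and the rule "memory accesses with register-derived addresses and divides with a register divisor are symptom-generating" are all assumptions made for illustration, and the sketch covers only the exception-style symptoms (segmentation fault, divide by zero).

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    # Hypothetical record type; field names are assumptions for this sketch.
    text: str                         # pseudo-assembly from the slide, for display
    opcode: str                       # "load", "store", "div", "add", "const", "call"
    dest: Optional[str]               # destination register, if any
    data_srcs: Tuple[str, ...] = ()   # registers consumed as plain data
    addr_srcs: Tuple[str, ...] = ()   # registers used to form a memory address


def trap_sensitive_operands(instr):
    """Registers whose corruption could plausibly raise a hardware exception."""
    if instr.opcode in ("load", "store"):
        return instr.addr_srcs        # bad address -> possible segmentation fault
    if instr.opcode == "div":
        return instr.data_srcs[1:]    # corrupted divisor -> possible divide by zero
    return ()


def is_symptom_generating(instr):
    return len(trap_sensitive_operands(instr)) > 0


# The slide's listing, transcribed by hand into the hypothetical record type.
PROGRAM = [
    Instr("r0 = mem[old_base]", "load", "r0"),
    Instr("r1 = mem[new_base]", "load", "r1"),
    Instr("r2 = 128", "const", "r2"),
    Instr("r4 = 1", "const", "r4"),
    Instr("r5 = r2 / r4", "div", "r5", ("r2", "r4")),
    Instr("r6 = mem[r0+r5]", "load", "r6", (), ("r0", "r5")),
    Instr("r4 = r4 + 1", "add", "r4", ("r4",)),
    Instr("mem[r1+r5] = r6", "store", None, ("r6",), ("r1", "r5")),
    Instr("r7 = mem[new_base]", "load", "r7"),
    Instr("printf(r7,...)", "call", None, ("r7",)),
    Instr("stack[n_copied] = r4", "store", None, ("r4",)),
]

symptom_generating = [i.text for i in PROGRAM if is_symptom_generating(i)]

for line in symptom_generating:
    print(line)
```

Under these assumed rules the divide and the two register-addressed memory accesses are flagged. Which instructions the original slide actually highlighted is not recoverable from the transcript.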

Slide 11: Preliminary Classification: High-value
    r0 = mem[old_base]
    r1 = mem[new_base]
    r2 = 128
    r4 = 1
    r5 = r2 / r4
    r6 = mem[r0+r5]
    r4 = r4 + 1
    mem[r1+r5] = r6
    r7 = mem[new_base]
    printf(r7,…)
    stack[n_copied] = r4
Instructions that can corrupt program output after consuming an erroneous operand
• stores to global memory
• arguments to function/library calls
Ignore local bookkeeping variables
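A minimal sketch of the high-value rule follows. The `Instr` record, the `region` tag, and the assumption that the compiler already knows whether a store targets global memory or the stack are all hypothetical; the slide does not say how that distinction is made.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    # Hypothetical record type for this sketch only.
    text: str
    kind: str          # "store", "call", or "other"
    region: str = ""   # for stores only: "global" or "stack" (assumed to be known)


def is_high_value(instr):
    if instr.kind == "call":
        return True                       # operands become function/library call arguments
    if instr.kind == "store":
        return instr.region == "global"   # stack stores are treated as local bookkeeping
    return False


# Tail of the slide's listing; the "global"/"stack" tags are assumptions.
PROGRAM = [
    Instr("r6 = mem[r0+r5]", "other"),
    Instr("r4 = r4 + 1", "other"),
    Instr("mem[r1+r5] = r6", "store", "global"),
    Instr("r7 = mem[new_base]", "other"),
    Instr("printf(r7,...)", "call"),
    Instr("stack[n_copied] = r4", "store", "stack"),
]

high_value = [i.text for i in PROGRAM if is_high_value(i)]

for line in high_value:
    print(line)
```

The store to `stack[n_copied]` is skipped, matching the slide's note about ignoring local bookkeeping.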

Slide 12: Vulnerability Analysis
[Shoestring passes: Preliminary Classification, Vulnerability Analysis (current), Code Duplication]
Identify safe instructions
• Instructions with outputs inherently "covered" by symptom-generating instructions
• Data/control flow analysis
• Duplication can safely ignore these instructions without losing fault coverage

Slide 13: Vulnerability Analysis
[Figure: dataflow graph (DFG) built on the machine-dependent representation, with symptom-generating instructions marked.]

Slide 14: Vulnerability Analysis
[Figure: dataflow graph (DFG) with symptom-generating nodes marked; instructions 3, 6, 7, and 9 are labeled.]
For each instruction …
• Count the # of symptom-generating consumers (N_tot)
• Consider an instruction with "enough" (S_t) symptom-generating consumers as safe
• Limit to instructions within a fixed distance (S_lat) in the statically scheduled code
• Scale indirect consumers by S_iscale
Parameters: S_t = 2, S_lat = 100, S_iscale = 0.8
Worked example (built up step by step):
    N_tot = 1
    N_tot = 1 + 1(0.8)
    N_tot = 1 + 1(0.8) + 1(0.8)^2 = 2.44, SAFE!
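The counting heuristic can be sketched as a breadth-first walk over consumer edges. This is an illustrative reading of the slide, not the authors' implementation: the function names, the use of instruction index difference as a stand-in for distance in the static schedule, and the three-node chain used as input are all assumptions. The chain is chosen only so that the arithmetic matches the slide's 1 + 0.8 + 0.8^2.

```python
from collections import deque


def n_tot(dfg, symptom, start, s_lat=100, s_iscale=0.8):
    """Weighted count of symptom-generating consumers reachable from `start`.

    dfg maps an instruction index to the indices of its direct consumers.
    A direct consumer counts 1; a consumer reached through d intermediate
    instructions counts s_iscale ** d. Consumers farther than s_lat
    instructions from `start` (by index, an assumed proxy) are ignored.
    """
    total = 0.0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for consumer in dfg.get(node, ()):
            if consumer in seen or consumer - start > s_lat:
                continue
            seen.add(consumer)
            if consumer in symptom:
                total += s_iscale ** depth
            queue.append((consumer, depth + 1))
    return total


def is_safe(dfg, symptom, start, s_t=2, s_lat=100, s_iscale=0.8):
    return n_tot(dfg, symptom, start, s_lat, s_iscale) >= s_t


# Hypothetical DFG: 3 feeds 6, 6 feeds 7, 7 feeds 9; all three consumers
# are assumed to be symptom-generating.
dfg = {3: [6], 6: [7], 7: [9]}
symptom = {6, 7, 9}

score = n_tot(dfg, symptom, start=3)
print(round(score, 2), is_safe(dfg, symptom, start=3))
```

With the slide's parameters (S_t = 2, S_iscale = 0.8) the score for instruction 3 comes to about 2.44, which clears the threshold. Shrinking `s_lat` drops the more distant consumers and can flip the result.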

Slide 15: Vulnerability Analysis
[Figure: dataflow graph (DFG) with each node marked safe (S), vulnerable (V), or symptom-generating and annotated with its score; scores shown include N_tot = 0, N_tot = 1, N_tot = 1.8, and N_tot = 2.44.]
For each instruction …
• Count the # of symptom-generating consumers (N_tot)
• Consider an instruction with "enough" (S_t) symptom-generating consumers as safe
• Limit to instructions within a fixed distance (S_lat) in the statically scheduled code
• Scale indirect consumers by S_iscale
Parameters: S_t = 2, S_lat = 100, S_iscale = 0.8
What about control flow...
• Symptom-generating consumers may not be on the common execution path
• Scale N_tot based on execution probabilities (profiled)
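The slide does not say how the profile enters the score. One plausible reading, sketched here purely as an assumption, is to weight each symptom-generating consumer's contribution by the profiled probability that it executes once the producer has executed. The function name and the input format are hypothetical.

```python
def n_tot_profiled(consumers, s_iscale=0.8):
    """Probability-weighted variant of the N_tot score (assumed formulation).

    consumers: iterable of (depth, exec_probability) pairs, where depth is the
    number of intermediate instructions between producer and consumer
    (0 for a direct consumer) and exec_probability comes from profiling.
    """
    return sum(prob * s_iscale ** depth for depth, prob in consumers)


S_T = 2  # safety threshold from the slide

# Same three consumers as the slide's worked example, but the two indirect
# ones sit on less frequently executed paths (made-up probabilities).
always_taken = [(0, 1.0), (1, 1.0), (2, 1.0)]
rarely_taken = [(0, 1.0), (1, 0.5), (2, 0.25)]

score_always = n_tot_profiled(always_taken)
score_rarely = n_tot_profiled(rarely_taken)

print(round(score_always, 2), score_always >= S_T)
print(round(score_rarely, 2), score_rarely >= S_T)
```

With every consumer on the hot path the score is unchanged at about 2.44. With the made-up probabilities it falls to about 1.56, so the same instruction would be treated as vulnerable and become a candidate for duplication.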

Slide 16: Selective Code Duplication
[Shoestring passes: Preliminary Classification, Vulnerability Analysis, Code Duplication (current)]
Protect high-value instructions
• Ensure high-value instructions operate on correct data (dataflow)
  • (Partially) duplicate input operands
• Ensure they execute when they are supposed to (control flow)

Slide 17: Selective Code Duplication
[Figure: dataflow graph (DFG) marking safe (S), load (L), vulnerable, and high-value nodes, with duplicated instructions 1', 2', 3', 4' and equality checks (=) ahead of high-value instruction 5.]
Starting at high-value instructions …
• Duplicate backward slice of input operands, terminating at
  • Load instructions ("L")
  • Safe instructions ("S")
• Insert checks prior to use (compare and branch)
  • Each use has its own check
What about control flow...
• Ideally, protect all branches potentially impacting high-value instructions (transitive closure)
• Inspired by work on Y-branches [Wang'03], we relax constraints and only protect the subset of branches with control dependence edges
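The dataflow half of this slide can be sketched as a backward walk over producer edges. The graph, the node numbering, and the `kind` labels are hypothetical, and the sketch assumes that loads and safe instructions end the slice without being duplicated themselves, which the transcript does not state explicitly.

```python
def backward_slice(producers, kind, high_value):
    """Instructions to duplicate for one high-value instruction.

    producers maps an instruction to the instructions that produce its inputs.
    The walk stops at loads and at instructions already classified as safe.
    """
    to_duplicate = set()
    stack = list(producers.get(high_value, ()))
    while stack:
        node = stack.pop()
        if node in to_duplicate:
            continue
        if kind[node] in ("load", "safe"):
            continue  # slice terminates here; the original value is used as-is
        to_duplicate.add(node)
        stack.extend(producers.get(node, ()))
    return to_duplicate


def checks_for(producers, duplicated, high_value):
    """One compare-and-branch per duplicated direct input of the high-value use."""
    return [(p, f"{p}'") for p in producers.get(high_value, ()) if p in duplicated]


# Hypothetical DFG loosely modeled on the slide: instruction 5 is high-value,
# 1..4 are vulnerable, 10 and 12 are loads, 11 is a safe instruction.
producers = {5: [3, 4], 3: [1, 2], 1: [10], 2: [11], 4: [12]}
kind = {1: "alu", 2: "alu", 3: "alu", 4: "alu", 5: "high_value",
        10: "load", 11: "safe", 12: "load"}

duplicated = backward_slice(producers, kind, high_value=5)
checks = checks_for(producers, duplicated, high_value=5)

print(sorted(duplicated))
print(checks)
```

Each pair in `checks` stands for a comparison between an original value and its duplicate, inserted just before the high-value instruction consumes it. Control-flow protection is not modeled here.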

Slide 18: Evaluation Methodology
• Program analysis and selective duplication
  • Implemented as additional backend passes in the LLVM compiler
• Statistical fault injection (SFI) campaign
  • Instrumented the PTLsim (x86) simulator (AMD-K8 model)
  • Random (single) bit flip in physical register file
  • Simulated 10M instructions in detail after fault injection
  • Simulated remainder of the program on the host machine in native mode
• Performance overhead measured on real hardware (Intel Core 2)
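The fault model in the injection campaign, a single random bit flip in the register file, is simple enough to sketch. This is a standalone toy, not PTLsim code: the register file is a plain Python list, and the register count and 64-bit width are assumptions.

```python
import random


def inject_single_bit_flip(register_file, rng, width=64):
    """Flip one randomly chosen bit in one randomly chosen register, in place."""
    reg = rng.randrange(len(register_file))
    bit = rng.randrange(width)
    register_file[reg] ^= 1 << bit
    return reg, bit


rng = random.Random(0)           # fixed seed so a trial can be replayed
golden = [0] * 16                # fault-free register contents (toy values)
faulty = list(golden)

reg, bit = inject_single_bit_flip(faulty, rng)

# XOR against the golden copy exposes exactly which bits differ.
diff = [g ^ f for g, f in zip(golden, faulty)]
print(f"flipped bit {bit} of register {reg}")
```

In a statistical campaign, one such injection would be performed per trial, and the outcome (masked, detected by a symptom or check, or user-visible corruption) would be recorded by comparing against a fault-free run.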

Slide 19: Coverage: w/ Symptoms Only
• Could be the difference between one fault per day/week vs. month

Slide 20: Coverage: w/ Symptoms and Duplication

Slide 21: Runtime Overhead
[Figure: runtime overhead for increasing values of S_t (fewer safe instructions), with a comparison against SWIFT (CGO'05).]

Slide 22: Conclusions
• Unlike traditional schemes, Shoestring provides mainstream, commodity systems with reliability on the cheap.
• Shoestring can augment symptom-based schemes with selective duplication to cover an additional 33.9% of unmasked faults.
• Overall, Shoestring allows just 1.6% of all faults to manifest as user-visible corruptions (~5x reduction).
• 15.8% performance penalty (vs. ~80%)
• Developing more sophisticated heuristics (vulnerability analysis, high-value instructions) presents opportunities for improvement.

Slide 23: Questions?