
Soft Error Benchmarking of L2 Caches with PARMA Jinho Suh Mehrtash Manoochehri Murali Annavaram Michel Dubois

Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 2

Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 3

Soft Errors
Random errors on otherwise fault-free circuits; they mostly affect SRAMs
- From alpha particles (electrical noise)
- From neutron strikes in cosmic rays
A severe problem, especially with power-saving techniques
- Superlinear increase with voltage & capacitance scaling
- Near-/sub-threshold Vdd operation for power savings
Concerns in:
- Large servers
- Avionic or space electronics
- SRAM caches with drowsy modes
- eDRAM caches with reduced refresh rates

Why Benchmark Soft Errors
Designers need a good estimate of expected errors to incorporate a 'just-right' solution at design time.
Good estimation is non-trivial:
- Multi-Bit Errors are expected
- Masking effects: not every Single Event Upset leads to an error [Mukherjee'03]; faults become errors only when they propagate to the outer scope, and they can be masked off at various levels
Design decisions:
- Is the protection under consideration too much or too little?
- Is a newly proposed protection scheme better?
The impact of soft errors needs to be addressed at design time; estimating soft error rates for target application domains is an important issue.

Evaluating Soft Errors: Some Reliability Benchmarking Approaches
Fundamental difficulty: soft errors happen very rarely.
Approaches: Field Analysis, Life Testing, Accelerated Testing, Fault Injection, Analytical Modeling (AVF, SoftArch, Intrinsic SER)
Noted limitations across these approaches: require massive experiments; distortion in measurement/interpretation; better for estimating SER in a short time; complexity determines preciseness; difficulty in collecting data; obsolete for design iteration [Ziegler]
Intrinsic FIT (Failure-in-Time) rate: highly pessimistic (no consideration of masking effects); unclear for protected caches.
AVF [Mukherjee'03] and SoftArch [Li'05]: quickly compute SDC without protection, or DUE under parity; ignore temporal/spatial MBEs; cannot account for error detection/correction schemes.

Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 7

Two Components of PARMA (Precise Analytical Reliability Model for Architecture)
1. Fault generation model: a Poisson Single Event Upset model giving the probability distribution of having k faulty bit(s) in a domain (set of bits) during multiple cycles.
2. Fault propagation model: a fault becomes an error when the faulty bit is consumed (an instruction with a faulty bit commits, or a load commits and its operand has a faulty bit).
PARMA measures: generated faults -> propagated faults -> expected errors -> error rate.

Using Vulnerability Clock Cycles to Track Bit Lifetime
A vulnerability clock (VC), kept per word, tracks the cycles any bit spends in the vulnerable component, here the L2 cache:
- Ticks while a bit resides in L2; stops while the bit stays outside L2
- Similar to the lifetime analysis in the AVF method
- Accesses to L1 determine the REAL impact of a soft error on the system
- An L2 block is NOT dead when it is evicted to memory, because it can be refilled into L2 later; when the block is refilled, its VCs resume ticking from where they stopped
- When an L1 block is evicted, consumption of the faulty bits is finalized
- When a word is updated to hold new data, its VC resets to zero
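A minimal bookkeeping sketch of such a per-word vulnerability clock (class and method names are assumptions for illustration, not PARMA's implementation):

```python
class WordVC:
    """Hedged sketch of per-word vulnerability-clock bookkeeping."""

    def __init__(self):
        self.vc = 0          # vulnerability cycles accumulated so far
        self.in_l2 = False   # the clock only ticks while the word resides in L2

    def tick(self, cycles=1):
        if self.in_l2:
            self.vc += cycles    # ticks while the bits sit in the L2 cache

    def enter_l2(self):
        self.in_l2 = True        # refill into L2: the clock resumes

    def leave_l2(self):
        self.in_l2 = False       # evicted / outside L2: the clock pauses

    def write_new_data(self):
        self.vc = 0              # word overwritten with new data: VC resets
```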

Probability of a Bit Flip in One Cycle (SEU Model)
p: probability that one bit is flipped during one cycle period.
The Poisson probability mass function gives p, where λ is the Poisson rate of SEUs (e.g., for a 3 GHz CPU at the modeled feature size).
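A minimal way to write this, assuming λ is the per-bit SEU arrival rate and T_cyc the clock period (the slide's own equation is not reproduced here):

\[ p \;=\; \Pr\{\text{one upset in } T_{cyc}\} \;=\; \lambda\, T_{cyc}\, e^{-\lambda T_{cyc}} \;\approx\; \lambda\, T_{cyc}. \]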

Temporal Expansion: Probability of a Bit Flip in N_c Vulnerability Cycles
q(N_c): probability of a bit being faulty after N_c vulnerability cycles.
To be faulty at the end of N_c cycles, a bit must flip an odd number of times within those N_c cycles.
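Assuming independent cycles, each flipping the bit with probability p, the odd-number-of-flips condition gives the closed form (a sketch consistent with the description above):

\[ q(N_c) \;=\; \sum_{i\ \mathrm{odd}} \binom{N_c}{i}\, p^{i}\,(1-p)^{N_c-i} \;=\; \frac{1-(1-2p)^{N_c}}{2}. \]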

Spatial Expansion: from a Bit to the Protection Domain (Word)
S_Q(k): probability of the set of bits S having k faulty bits inside (during N_c cycles).
- Choose the cases where there are k faulty bits in S; S contains [S] bits.
- Assumes all the bits in the word have the same VCs; otherwise, a discrete convolution must be used.
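Under the equal-VC assumption above (all [S] bits share the same q), a binomial form follows (a sketch; the slide's own equation is an image):

\[ {}^{S}Q(k) \;=\; \binom{[S]}{k}\, q(N_c)^{k}\, \bigl(1-q(N_c)\bigr)^{[S]-k}. \]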

Faults in the Access Domain (Block)
D_Q(k): probability of k faulty bits in any protection domain S_m inside D (the union of the S_m).
- Choose the cases where there are k faulty bits in each S_m, and sum over all S_m in D.
So far the masking effect has not been considered: only the expected number of intrinsic faults/errors has been calculated.
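One reading of "sum for all S_m in D" (a hedged sketch; the sum accumulates the expected number of protection domains in D that hold k faulty bits, which is what additive FIT accounting needs):

\[ {}^{D}Q(k) \;=\; \sum_{S_m \subseteq D} {}^{S_m}Q(k). \]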

Considering the Masking Effect: Separating TRUE Faults from Intrinsic Faults
If all faults occur in unconsumed bits, they do not matter (FALSE events).
TRUE faults = {all faults in S} - {all faults confined to unconsumed bits}.
From the probability that the protection domain S has k faulty bits, deduct the probability that ALL k faulty bits lie in the unconsumed bytes (FALSE or masked faults, i.e., S has k faults and the consumed set C has none). The result is the probability of TRUE faults, which become SDCs or TRUE DUEs.
The consumed set C (and the associated vulnerability cycles) is obtained through simulation.
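A sketch of the subtraction under the same equal-q assumption, with [C] consumed bits in the protection domain S:

\[ P_{\mathrm{TRUE}}(k) \;=\; {}^{S}Q(k) \;-\; \binom{[S]-[C]}{k}\, q^{k}\,(1-q)^{[S]-k}, \]

where the second term is the probability that exactly k bits of S are faulty and all of them are unconsumed.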

Using PARMA to Measure Errors in a Block Protected by Block-Level SECDED
- Undetected error that affects reliability (SDC): three or more faulty bits in the block, with at least one faulty bit among the consumed bits.
- Detected error that affects reliability (TRUE DUE): exactly two faulty bits in the block, with at least one faulty bit among the consumed bits.
See the paper for how to apply PARMA to the different protection schemes.
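A hedged sketch of these two rules in code; the default bit counts (256 data bits plus 10 assumed check bits, 64 consumed bits) are illustrative only and not values from the paper:

```python
from math import comb

def block_secded_masses(q, block_bits=266, consumed_bits=64):
    """Per-access SDC and TRUE-DUE probability masses for a block-level SECDED
    protection domain, following the rules on this slide.

    q -- per-bit probability of being faulty after Nc vulnerability cycles
         (all bits in the block assumed to share the same vulnerability clock)
    """
    unconsumed = block_bits - consumed_bits

    # TRUE DUE: exactly two faulty bits in the block, not all of them unconsumed
    p_due = (comb(block_bits, 2) - comb(unconsumed, 2)) * q**2 * (1 - q)**(block_bits - 2)

    # SDC: three or more faulty bits in the block, not all of them unconsumed
    p_sdc = sum(
        (comb(block_bits, k) - comb(unconsumed, k)) * q**k * (1 - q)**(block_bits - k)
        for k in range(3, block_bits + 1)
    )
    return p_sdc, p_due
```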

Four Contributions
Modeling:
1. Development of the rigorous analytical model called PARMA
Applications:
2. Measuring SERs on structures protected by various schemes
3. Observing possible distortions in accelerated studies, quantitatively and qualitatively
4. Verifying approximate models

Measuring SERs on Structures Protected by Various Schemes
Target Failures-In-Time of IBM Power6 – SDC: 114, DUE: 4,566.
Average L2 (256KB, 32B block) cache FITs, from SimPoint simulations (100M-instruction samples) of 18 SPEC2K benchmarks on sim-outorder (columns: SDC (TRUE+FALSE), DUE, latency, check bits per 256 bits):
- No Protection: SDC 155.66, DUE N/A, latency 10, check bits 0
- 1-bit Odd Parity: SDC 2.53E…
- Block-level SECDED: SDC 8.34E…, DUE …E…
- Word-level SECDED: SDC 2.92E…, DUE …E…
Implications:
- Word-level SECDED might be overkill in most cases.
- Increasing the protection domain size, partially protected caches, or caches with adaptive protection schemes need to be carefully quantified for their FITs.
- PARMA provides a comprehensive framework that can measure the effectiveness of such schemes.
Results were verified with AVF simulations.

Observing Possible Distortions in Accelerated Tests
Highly accelerated tests: SPEC2K benchmarks end in several minutes of wall-clock time, so the SEU rate must be accelerated by a very large factor to observe a reasonable number of faults.
- SDC > DUE? Having more than two errors overwhelms the cases of having exactly two errors; this can be misleading qualitatively.
- How to scale down the results? Simply scaling the measured rates back by the acceleration factor can distort results quantitatively.
Results were verified with fault-injection simulations.

Verifying Approximate Models
PARMA provides rigorous reliability measurements; hence it is useful for verifying faster, simpler approximate models.
Example: a model for a word-level SECDED protected cache.
- Methods for determining cache scrubbing rates [Mukherjee'04][Saleh'90] ignore the cleaning effect of accesses: by how much do they overestimate?
- New model: a geometric distribution of Bernoulli trials (the mean of the geometric distribution gives the expected number of accesses to failure).
  - Assumption: at most two bits are flipped between two accesses to the same word.
  - Every access results either in a detected error or in no error (corrected).
  - T_AVG: average interval between two accesses to the same word (made up of an average ACE interval and an average unACE interval).
  - P_DUE: pmf of two Poisson arrivals within T_AVG.
Results (FIT): AVF x FIT from previous method: …; new approximate model: …E-14 FIT; PARMA: …E-16 FIT.
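A hedged sketch of this approximate model as described above; all symbols and the 39-bit word width (32 data bits plus 7 assumed SECDED check bits) are illustrative assumptions, not the paper's exact formulation:

```python
import math

def approx_word_secded_due_fit(t_avg_cycles, clk_hz, seu_rate_per_bit_per_sec,
                               bits_per_word=39):
    """Each access to a word is a Bernoulli trial that 'fails' (a DUE) when two
    SEUs hit the word during the average inter-access interval T_AVG; the mean
    of the geometric distribution then gives the expected time to failure."""
    t_avg_sec = t_avg_cycles / clk_hz
    mu = seu_rate_per_bit_per_sec * bits_per_word * t_avg_sec  # expected SEUs per interval
    p_due = (mu ** 2 / 2.0) * math.exp(-mu)     # Poisson pmf: exactly two arrivals
    mttf_hours = (t_avg_sec / p_due) / 3600.0   # (1/p_due) trials x T_AVG per trial
    return 1e9 / mttf_hours                     # FIT for this word; sum over words for a cache
```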

Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 20

Conclusion and Future Work
Summary
+ PARMA is a rigorous model for measuring Soft Error Rates in architectures
+ PARMA works with a wide range of SEU rates without distortion
+ PARMA handles temporal MBEs
+ PARMA quantifies SDC and DUE rates under various error detection/protection schemes
- PARMA does not address spatial MBEs yet
- PARMA does not model the tag array yet
- Due to its complexity, PARMA is slow
Future Work
- Extend PARMA to account for spatial MBEs and tag vulnerability
- Develop sampling methods to accelerate PARMA

THANK YOU! QUESTIONS?

(Some) References
[Biswas'05] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. Mukherjee, and R. Rangan. Computing Architectural Vulnerability Factors for Address-Based Structures. In Proceedings of the 32nd International Symposium on Computer Architecture, 2005.
[Li'05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks, 2005.
[Mukherjee'03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A Systematic Methodology to Calculate the Architectural Vulnerability Factors for a High-Performance Microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, pages 29-40, 2003.
[Mukherjee'04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, pages 37-42, 2004.
[Saleh'90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. IEEE Transactions on Reliability, 39(1), 1990.
[Ziegler] J. F. Ziegler and H. Puchner. "SER – History, Trends and Challenges." Cypress Semiconductor Corp.

Addendum

Some Definitions
SDC = Silent Data Corruption
DUE = Detected Unrecoverable Error
SER = Soft Error Rate = SDC + DUE
Errors are measured as:
- MTTF = Mean Time To Failure
- FIT = Failures In Time; 1 FIT = 1 failure in one billion hours
  A 1-year MTTF = 1 billion / (24 * 365) = 114,155 FIT
FIT is commonly used since FITs are additive.
Vulnerability Factor = fraction of faults that become errors (also called derating factor or soft error sensitivity).
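The MTTF-to-FIT conversion above is a simple reciprocal; a small worked example (Python) reproducing the slide's number:

```python
HOURS_PER_YEAR = 24 * 365

def mttf_years_to_fit(mttf_years):
    """FIT = failures per 1e9 device-hours. Interchanging FIT and MTTF like this
    assumes a constant failure rate, as noted above."""
    return 1e9 / (mttf_years * HOURS_PER_YEAR)

print(mttf_years_to_fit(1))  # ~114,155 FIT, matching the 1-year-MTTF example
```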

Soft Errors and Technology Scaling
Hazucha & Svensson model, for a specific size of SRAM array:
- Flux depends on altitude & geomagnetic shielding (environmental factor)
- (Bit) area is process-technology dependent (technology factor)
- Q_coll is the charge collection efficiency, technology dependent
- Q_crit ∝ C_node × Vdd
According to scaling rules, both C and V decrease, so Q_crit decreases rapidly.
Static power-saving techniques on caches (drowsy mode, near-/sub-threshold Vdd) make cells more vulnerable to soft errors.
Hazucha et al., "Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate."
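The commonly cited form of the Hazucha-Svensson model, with the symbols from the bullets above (K is a technology-dependent constant; this is the general published form, not necessarily the exact expression on the original slide):

\[ \mathrm{SER} \;=\; K \times \mathrm{Flux} \times \mathrm{Area} \times e^{-Q_{crit}/Q_{coll}}, \qquad Q_{crit} \propto C_{node}\cdot V_{dd}. \]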

Error Classification
- Silent Data Corruption (SDC)
- TRUE and FALSE Detected Unrecoverable Errors (DUE)
Whether a fault becomes an SDC, a TRUE DUE, or a FALSE DUE depends on whether the affected bits are consumed.
C. Weaver et al., "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," ISCA 2004.

Soft Error Rate (SER)
Intrinsic SER – more from the component's view
- Assumes all bits are important all the time
- Intrinsic SER projections from ITRS 2007 (High Performance model)
- Intrinsic SER of caches protected by a SECDED code? There is a cleaning effect on every access.
Realistic SER – more from the system's view
- Some soft errors are masked and do not cause system failure
- e.g., AVF x Intrinsic SER: but what about caches with a protection code?
ITRS 2007 projections (table columns: year of production, feature size [nm], gate length [nm], soft error rate [FIT per Mb]):
- Failure rate in 1 Mb [fails/hour]: 1.2E-6, 1.25E-6, 1.3E-6, 1.35E-6, 1.4E-6
- % Multi-Bit Upsets among Single Event Upsets: 32%, 64%, 100%

Soft Error Estimation Methodologies: Industry
Field analysis
- Statistically analyzes reported soft errors in market products, using repair records and sales of replacement parts
- Provides obsolete data
Life testing
- A tester constantly cycles through 1,000 chips looking for errors; takes around six months
- Expensive, not fast enough for the chip design process
- Usually used to confirm the accuracy of accelerated testing (x2 rule)
Accelerated testing
- Chips are placed under various particle beams, under a well-defined test protocol
- Terrestrial neutrons: particle accelerators (protons); thermal neutrons: nuclear reactors; radioactive contamination: radioactive materials
Hardship
- Data is rarely published: potential liability problems for products
- Comparisons of accelerated testing vs. life testing are even rarer; IBM and Cypress published small amounts of data showing correlation
J. F. Ziegler and H. Puchner, "SER – History, Trends and Challenges," Cypress Semiconductor Corp.

Soft Error Estimation Methodologies: Common Ways in Research
Fault injection: generate artificial faults based on the fault model
+ Applicable to a wide range of designs (from RTL to system simulations)
- A massive number of simulations is necessary to be statistically valid
- A highly accelerated Single Event Upset (SEU) rate is required for soft errors
- How to scale the measurements down to the 'real environment' is unclear
Architectural Vulnerability Factor: find the derating factor (Faults -> Errors) as {ACE bits}/{total bits} per cycle
SoftArch: extrapolate the average of TTFs from one program run to MTTF assuming infinite executions
AVF and SoftArch use a simplified Poisson fault generation model
+ Work well for small-scale systems in current technology at the earth's surface (a single-bit-error-dominant environment)
- Cannot account for error protection/detection schemes (ECC)
- Unable to address temporal & spatial MBEs
AVF is NOT an absolute metric for reliability: FIT_structure = intrinsic_FIT_structure × AVF_structure
M. Li et al., "Accurate Microarchitecture-Level Fault Modeling for Studying Hardware Faults," HPCA 2009
S. S. Mukherjee et al., "A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor," MICRO 2003
X. Li et al., "SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors," DSN 2005

Evaluating Soft Errors: Some Reliability Benchmarking Approaches
Intrinsic FIT (Failure-in-Time) rate: highly pessimistic
- Every bit is vulnerable in every cycle
- Unclear how to compute intrinsic FIT rates for protected caches
Architectural Vulnerability Factor [Mukherjee'03]
- Lifetime analysis of Architecturally Correct Execution (ACE) bits
- De-rating factor (Faults -> Errors); realistic FIT = AVF x Intrinsic FIT
SoftArch [Li'05]
- Computes TTF for one program run and extrapolates to MTTF
AVF and SoftArch
+ Quickly compute SDC with no parity, or DUE under parity
- Ignore temporal MBEs: two SEUs on one word become two faults instead of one, and two SEUs on the same bit become two faults instead of zero
- Ignore spatial MBEs
- Cannot account for error detection/correction schemes
To compare the SERs of various error-correcting schemes, temporal and spatial MBEs must be accurately counted.

Prior State-of-the-Art Reliability Model: AVF
Architectural Vulnerability Factor (AVF)
- AVF_bit = probability that a bit matters (for a user-visible error) = # of bits that affect the user-visible outcome / total # of bits
- If we assume AVF = 100%, we will over-design the system; AVF must be estimated to optimize the system design for reliability
AVF equation for a target structure:
AVF_structure = (sum over all cycles of the ACE bits resident in the structure) / (total bits in the structure × total cycles)  ......(Eq. 1)
AVF is NOT an absolute metric for reliability: FIT_structure = intrinsic_FIT_structure × AVF_structure
Shubu Mukherjee, "Architecture Design for Soft Errors"

ACEness of a Bit
ACE (Architecturally Correct Execution) bit: an ACE bit affects the program outcome; correctness is subjective (user-visible).
Microarchitectural ACE bits
- Invisible to the programmer, but affect the program outcome
- Easier to consider un-ACE bits:
  - Idle/Invalid/Misspeculated state
  - Predictor structures
  - Ex-ACE state (architecturally dead or invisible states)
Architectural ACE bits
- Visible to the programmer
- Transitive (an ACE bit in a word makes the Load instruction that reads it ACE)
- Easier to consider un-ACE bits:
  - NOP instructions
  - Performance-enhancing operations (non-opcode field of a non-binding prefetch, branch prediction hint)
  - Predicated-false instructions (except the predicate status bit)
  - Dynamically dead instructions
  - Logical masking
The AVF framework is a lifetime analysis that correctly finds the ACEness of the bits in the target structure for every operating cycle.
Shubu Mukherjee, "Architecture Design for Soft Errors"

Rigorous Failure/Error Rate Modeling
Existing methodologies (such as AVF multiplied by the intrinsic rate):
- Estimation is simple and easy
- Imprecise, but a safe overestimation
Downsides of the classical (AVF-based) approach:
- An SEU is a very rare event while program execution time is rather short: in a 3 GHz processor, the per-bit probability of being hit by an SEU and becoming faulty within one cycle is on the order of E-25
- Simplified assumption that one SEU directly results in one fault/error, whereas the same bit may be hit multiple times and/or multiple bits may become faulty in a word
- In space, or when an extremely low Vdd is supplied to the SRAM cells, the SEU rate can rise sharply (by more than 10E6 times) and second-order effects become significant
- With data protection schemes, how to measure vulnerability is uncertain under the simplified assumption

Reliability Theory (1)
Fundamental definition of probability in reliability theory:
- Number(Event)/Number(Trials) approximates the true Prob(Event); the true probability is rarely known, and the approximation approaches it as the number of trials goes to infinity, by the Law of Large Numbers
- Two events in reliability theory: survival and failure of a component/system
Reliability functions (component/system):
- Reliability R(t) and Probability of Failure Q(t): Prob(Event) up to and at time t, a conditional probability; note that R(t) and Q(t) are time dependent in general
- (Conditional) instantaneous failure rate λ(t), a.k.a. the hazard function h(t)
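The standard definitions behind these terms (a sketch, with T denoting the time to failure; the slide's own equations are not reproduced here):

\[ R(t) = \Pr\{T > t\}, \qquad Q(t) = 1 - R(t), \qquad \lambda(t) = \lim_{\Delta t \to 0}\frac{\Pr\{t < T \le t+\Delta t \mid T > t\}}{\Delta t} = -\frac{R'(t)}{R(t)}. \]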

Reliability Theory (2)
Reliability functions (cont'd):
- (Unconditional) failure density function f(t)
- Average failure rate from time 0 to T
- Discrete dual of λ(t): the hazard probability mass function h(j)
- Average failure rate from timeslot 0 to T
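Standard forms for the quantities named above (a sketch, assuming R(0) = 1):

\[ f(t) = \frac{dQ(t)}{dt} = -\frac{dR(t)}{dt}, \qquad \bar{\lambda}(0,T) = \frac{1}{T}\int_{0}^{T} \lambda(t)\,dt = \frac{-\ln R(T)}{T}, \]
\[ h(j) = \Pr\{\text{failure in timeslot } j \mid \text{survival through timeslot } j-1\}, \qquad \bar{\lambda}(0,T) \approx \frac{1}{T}\sum_{j=1}^{T} h(j). \]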

Reliability Theory (3)
How to measure reliability:
- R(t) itself
- Events with constant failure rate: MTTF
Sampling issue: usually no test can aggregate total test time toward infinity
- (Right) censoring with no replacement, then Maximum Likelihood Estimation (B. Epstein, 1954): at the end of the test time t_r, measure the TTFs (t_i) of the samples that failed and truncate the lifetime of all surviving samples to t_r; the MLE of MTTF follows from these values
FIT – one intuitive form of failure rate
- Failures in 1E9 hours of operation
- Interchangeable with MTTF only when the failure rate is constant
- Additive between independent components
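Standard forms consistent with the bullets above (a sketch; r failures out of n samples observed by time t_r):

\[ \mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt; \qquad \text{constant rate: } R(t) = e^{-\lambda t},\ \mathrm{MTTF} = 1/\lambda; \]
\[ \widehat{\mathrm{MTTF}} = \frac{\sum_{i=1}^{r} t_i + (n-r)\,t_r}{r}; \qquad \mathrm{FIT} = \frac{10^{9}}{\mathrm{MTTF\,[hours]}}. \]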

Vulnerability Clock
Used to track the cycles that any bit spends in the vulnerable component, the L2 cache: the clock ticks while a bit resides in L2 and stops while the bit stays outside L2.
(The slide animates the VC updates through a cold miss, a store, a writeback, an eviction to memory, and a later refill: the VC value follows the data, e.g. VC_L1 := VC_L2 on an L1 fill, VC_L1 := 0 on a store, VC_L2 := VC_L1 on a writeback, VC_MEM := VC_L2 on an eviction, and VC_L2 := VC_MEM on a refill.)

PARMA Model: Measuring Soft Error FIT with PARMA
PARMA measures the failure rate by accumulating failure probability mass.
- Index processor cycles by j (1 ≤ j ≤ T_exe)
- Total failures observed during T_exe (failure rate): equivalent to the expected number of failures of type ERR
- FIT extrapolation under an infinite-program-execution assumption
How is the per-cycle failure probability mass calculated? Start with p, the probability that one bit is flipped during one cycle period, obtained from the Poisson SEU model.

PARMA Model: Fault Generation Model (SEU Model)
Assumptions:
- All clock cycles are independent with respect to SEUs
- All bits are independent with respect to SEUs (spatial MBEs are not accounted for)
A widely accepted model for SEUs is the Poisson model.
- p: probability that one bit is flipped during one cycle period (in the single-bit-error case)
- Spatial MBE case: probability that multiple bits become faulty during one cycle
The Poisson probability mass function gives p, where λ is the Poisson rate of SEUs (e.g., for a 3 GHz CPU at the modeled feature size).

PARMA Model: Measuring Soft Error FIT with PARMA
PARMA measures the failure rate by accumulating failure probability mass.
- Index processor cycles by j (1 ≤ j ≤ T_exe)
- A (conditional) failure probability mass at cycle j
- Total failures observed during T_exe (failure rate): equivalent to the expected number of failures of type ERR
- FIT extrapolation under an infinite-program-execution assumption
- Average FIT over multiple programs
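One way to write the accumulation and the FIT extrapolation described above (symbols are assumptions: P_ERR(j) is the failure probability mass at cycle j, f_clk the clock frequency in Hz):

\[ \mathrm{E}[\text{failures of type ERR}] = \sum_{j=1}^{T_{exe}} P_{ERR}(j), \qquad \mathrm{FIT} \approx \frac{\sum_{j=1}^{T_{exe}} P_{ERR}(j)}{T_{exe}/f_{clk}} \times 3600 \times 10^{9}. \]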

Failures Measured in PARMA
No protection, 1-bit parity, 1-bit ECC (SECDED) per word, and 1-bit ECC per block:
- No parity, SDC: access domain = block; protection domain = N/A; ≥1 faulty bit in C
- 1-bit parity, TRUE DUE: access domain = block; protection domain = block; any odd number of faulty bits in S, with ≥1 in C
- 1-bit parity, SDC: access domain = block; protection domain = block; any even number (>0) of faulty bits in S, with ≥1 in C
- 1-bit ECC, word-level, TRUE DUE: access domain = block containing M words; protection domain = word; exactly 2 faulty bits in any S_m, with ≥1 in that C_m
- 1-bit ECC, block-level, TRUE DUE: access domain = block; protection domain = block; exactly 2 faulty bits in S, with ≥1 in C
- 1-bit ECC, word-level, SDC: access domain = block containing M words; protection domain = word; ≥3 faulty bits in any S_m, with ≥1 in that C_m
- 1-bit ECC, block-level, SDC: access domain = block; protection domain = block; ≥3 faulty bits in S, with ≥1 in C
Notation: S is the protection domain and C the consumed bits within it; S_m and C_m are the protection domain and consumed bits of the m-th word.

Spatial Expansion: From a Bit to a Byte in N_c Vulnerability Cycles
q_b(k): probability of a byte having k faulty bits (in N_c vulnerability cycles).
From the 8 bits in the byte, choose the k faulty bits.
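With the per-bit probability q = q(N_c), the byte-level distribution described above is binomial (a sketch of the formula the slide shows as an image):

\[ q_b(k) \;=\; \binom{8}{k}\, q^{k}\,(1-q)^{8-k}, \qquad k = 0,\dots,8. \]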

Spatial Expansion: from a Byte to the Protection Domain (Word)
S_Q(k): probability of the set of bits S having k faulty bits inside (during N_c cycles).
Choose the cases where there are k faulty bits in S by enumerating all combinations of faulty bits across the bytes of S such that their total number equals k.
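The enumeration above is a discrete convolution of the per-byte distributions; a hedged sketch (function and variable names are assumptions):

```python
import numpy as np

def word_fault_distribution(byte_dists):
    """Combine per-byte fault-count distributions into the distribution for a
    whole protection domain (word) by discrete convolution, as needed when the
    bytes have different vulnerability-cycle counts.

    byte_dists -- sequence of arrays; byte_dists[b][k] is the probability that
                  byte b contains exactly k faulty bits (k = 0..8)
    """
    dist = np.array([1.0])           # distribution of "0 faulty bits so far"
    for d in byte_dists:
        dist = np.convolve(dist, d)  # add this byte's faulty-bit count
    return dist                      # dist[k] = P(word has exactly k faulty bits)
```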

Faults in the Access Domain (Block)
D_Q(k): probability of k faulty bits in any protection domain S_m inside D (the union of the S_m).
- Choose the cases where there are k faulty bits in S_m, and sum over all S_m in D.
So far the masking effect has not been considered: only the expected number of intrinsic faults/errors has been calculated.

PARMA Model: Failures Measured in PARMA (1)
Unprotected cache
- Without protection, any non-zero number of faulty bits will cause an SDC failure
- SDCs: at least one faulty bit among the consumed bits
Odd parity per block
- SDCs: at least one faulty bit among the consumed bytes, with a nonzero, even number of faulty bits in the block
- TRUE DUEs: at least one faulty bit among the consumed bytes, with an odd number of faulty bits in the block

PARMA Model: Failures Measured in PARMA (2)
SECDED per block
- SDCs: at least one faulty bit among the consumed bits, with more than two faulty bits in the block
- TRUE DUEs: at least one faulty bit among the consumed bits, with exactly two faulty bits in the block
SECDED per word
- Same as the per-block case except that the protection domain is a word
- Because the access domain is a block, all the words in the same block are addressed by adding their FITs, which is valid because the FITs from each word are independent and counted separately

PARMA Simulations
Target processor: 4-wide OoO processor, 64-entry ROB, 32-entry LSQ, McFarling's hybrid branch predictor.
sim-outorder was modified and executed with the Alpha ISA; 18 benchmarks from SPEC2000 were used with SimPoint sampling of 100M-instruction samples.
Cache configuration:
- IL1: 32B block, 16KB, 1-way, latency 2 cycles
- DL1: 32B block, 16KB, 4-way, latency 3 cycles
- UL2: 32B block, 256KB, 8-way, latency NP/P1b: 10, SW(4B): 13, SB(64B): 14 cycles

Evaluating Soft Errors: AVF or Fault Injection, Why Not?
AVF cannot handle scenarios with error protection schemes. Why not use fault injection for such scenarios? Because of possible distortion in the interpretation of results due to the highly accelerated experiments.

Simulations with PARMA: Results in FIT (1)
Per-benchmark FIT results for the 18 SPEC2K benchmarks (ammp, art, crafty, eon, facerec, galgel, gap, gcc, gzip, mcf, mesa, parser, perlbmk, sixtrack, twolf, vortex, vpr, wupwise, plus the average), one column per scheme/error type:
(a) NP_SDC: no protection / SDC (≈ AVF_SDC)
P1B_TRUE_DUE: odd parity / TRUE DUE
(b) P1B_FALSE_DUE: odd parity / FALSE DUE
(c) P1B_SDC: odd parity / SDC
(d) SB_TRUE_DUE: block-level SECDED / TRUE DUE
(e) SB_FALSE_DUE: block-level SECDED / FALSE DUE
(f) SB_SDC: block-level SECDED / SDC
(g) SW_TRUE_DUE: word-level SECDED / TRUE DUE
(h) SW_FALSE_DUE: word-level SECDED / FALSE DUE
(i) SW_SDC: word-level SECDED / SDC

PARMA Application: a Gold Standard for Developing New Approximate Models (3) – Results
(Table columns: Name, AVF, AVF x FIT from previous method, FIT from new approximate model, FIT from PARMA.)
Per-benchmark AVF: ammp 40.977%, art 2.849%, crafty 61.078%, eon 99.049%, facerec 4.319%, galgel 6.010%, gap 7.118%, gcc 27.658%, gzip 83.466%, mcf 1.267%, mesa 30.070%, parser 22.983%, perlbmk 31.621%, sixtrack 3.916%, twolf 26.750%, vortex 53.171%, vpr 24.232%, wupwise 12.183%; average 27.33%.
With PARMA, we can verify newly developed approximate models.

Simulation with PARMA: Overhead
Need to track the entire memory footprint: vulnerability clock cycles for the L1, L2, and memory copies.
- Data structure: binary search tree (quick search and insertion); the memory footprint never decreases
- Memory overhead: ~17 bytes to track 1 byte of memory footprint
Computation overhead: O(n^3) with non-parallelized code, where n is the number of bits in the block.
- The probability calculation for having k specific faulty bits is O(n^2), and the probability distribution over k in [0, n] is needed
Overall: ~25x slowdown in simulation time over the base sim-outorder, which is still much faster than running a massive number of fault-injection tests.

PARMA Application: a Gold Standard for Developing New Approximate Models (1)
PARMA provides rigorous reliability measurements; hence it is useful for verifying faster, simpler approximate models.
Example: a model for a word-level SECDED protected cache.
Known methods for determining cache scrubbing rates [Mukherjee'04][Saleh'90] ignore the cleaning effect of accesses:
- That is okay for determining cache scrubbing rates because it overestimates
- But by how much does it overestimate?

PARMA Application: a Gold-Standard for Developing New Approximate Model (2) 54

PARMA Application: a Gold Standard for Developing New Approximate Models (3)
Word-level SECDED average vulnerability, converted to a FIT rate: AVF x Intrinsic FIT from previous method: …; new approximate model: …E-14 FIT; PARMA: …E-16 FIT.
With PARMA, we can verify newly developed approximate models.