Soft Error Benchmarking of L2 Caches with PARMA Jinho Suh Mehrtash Manoochehri Murali Annavaram Michel Dubois.

1 Soft Error Benchmarking of L2 Caches with PARMA Jinho Suh Mehrtash Manoochehri Murali Annavaram Michel Dubois

2 Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 2

3 Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 3

4 Soft Errors Random errors in otherwise-correct circuits, mostly affecting SRAMs From alpha particles (electrical noise) From neutron strikes in cosmic rays A severe problem, especially with power-saving techniques Superlinear increase as voltage and capacitance scale down Near-/sub-threshold Vdd operation for power savings Concerns in: large servers; avionic or space electronics; SRAM caches with drowsy modes; eDRAM caches with reduced refresh rates

5 Why Benchmark Soft Errors Designers need a good estimate of expected errors to incorporate a 'just-right' solution at design time Good estimation is non-trivial: Multi-Bit Errors are expected Masking effects: not every Single Event Upset leads to an error [Mukherjee'03] Faults become errors when they propagate to the outer scope; faults can be masked at various levels Design decisions: Is the protection under consideration too much or too little? Is a newly proposed protection scheme better? The impact of soft errors needs to be addressed at design time Estimating soft error rates for target application domains is an important issue

6 Evaluating Soft Errors: Some Reliability Benchmarking Approaches Fundamental difficulty: soft errors happen very rarely Field Analysis, Life Testing, Accelerated Testing, Fault Injection: require massive experiments; distortion in measurement/interpretation; better for estimating SER in a short time; complexity determines preciseness; difficulty in collecting data; obsolete for design iteration [Ziegler] Analytical Modeling: AVF, SoftArch, intrinsic SER Intrinsic FIT (Failure-in-Time) rate: highly pessimistic, with no consideration of masking effects; unclear for protected caches AVF [Mukherjee'03] and SoftArch [Li'05]: quickly compute SDC without protection, or DUE under parity, but ignore temporal/spatial MBEs and can't account for error detection/correction schemes

7 Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 7

8 Two Components of PARMA (Precise Analytical Reliability Model for Architecture) 1. Fault generation model: Poisson Single Event Upset model; probability distribution of having k faulty bit(s) in a domain (set of bits) during multiple cycles 2. Fault propagation model: a fault becomes an error when the faulty bit is consumed, i.e. an instruction with a faulty bit commits, or a load commits and its operand has a faulty bit PARMA measures: Generated faults → Propagated faults → Expected errors → Error rate

9 Using Vulnerability Clock Cycles to Track Bit Lifetime Used to track the cycles any bit spends in the vulnerable component: the L2 cache Ticks while a bit resides in L2; stops while a bit stays outside L2 Similar to lifetime analysis in the AVF method Accesses to L1 determine the REAL impact of a soft error on the system An L2 block is NOT dead even when it is evicted to memory, because it can be refilled into L2 later; when the block is refilled, its VCs resume ticking from their saved values When an L1 block is evicted, consumption of the faulty bits is finalized When a word is updated to hold new data, its VC resets to zero (The slide's figure steps per-word VC values for words 0-3 through 0, 100, 200, and 300/500 cycles as the block sits in L2.)
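The tick/stop/reset rules above can be sketched as a tiny bookkeeping class. This is an illustrative sketch, not the PARMA simulator's code; the method names and the per-word granularity are assumptions.

```python
class VulnerabilityClock:
    """Per-word vulnerability-clock bookkeeping (illustrative sketch):
    the clock ticks while the word's bits sit in the vulnerable L2,
    pauses while the block is outside L2, resumes on refill, and
    resets when the word is overwritten with new data."""

    def __init__(self):
        self.cycles = 0      # accumulated vulnerability cycles
        self.in_l2 = True    # whether the bits currently reside in L2

    def tick(self, n=1):
        if self.in_l2:       # clock only advances while in L2
            self.cycles += n

    def leave_l2(self):      # e.g. block evicted to memory: clock pauses
        self.in_l2 = False

    def refill_l2(self):     # block refilled into L2: clock resumes
        self.in_l2 = True

    def store(self):         # word overwritten: vulnerability history resets
        self.cycles = 0
```

A word evicted to memory keeps its accumulated count, matching the slide's point that an evicted L2 block is not dead.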

10 Probability of a Bit Flip in One Cycle SEU Model p: probability that one bit is flipped during one cycle period The Poisson probability mass function gives p λ: Poisson rate of SEUs, e.g. 10^-25 per bit per cycle for a 3GHz CPU at 65nm
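The per-cycle flip probability can be read directly off the Poisson pmf at k=1. A minimal sketch; the numeric rate below is the slide's example value, and at that magnitude p is numerically indistinguishable from λ itself.

```python
import math

def per_cycle_flip_probability(lam):
    """Poisson pmf at k=1: probability of exactly one SEU hitting a bit
    in one clock cycle, given per-bit per-cycle SEU rate lam."""
    return lam * math.exp(-lam)  # lam**1 * e**-lam / 1!

# Slide's example rate: ~1e-25 per bit per cycle (65nm, 3GHz).
p = per_cycle_flip_probability(1.0155e-25)
```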

11 Temporal Expansion: Probability of a Bit Flip in N_c Vulnerability Cycles q(N_c): probability of a bit being faulty after N_c vulnerability cycles To be faulty at the end of N_c cycles, a bit must flip an odd number of times within the N_c cycles
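The odd-number-of-flips condition can be computed two ways: by summing the odd binomial terms, or via the standard closed-form identity for the probability of an odd number of successes (the closed form is a well-known identity, not stated on the slide).

```python
from math import comb

def q_faulty(p, n_cycles):
    """P(bit is faulty after n_cycles) = P(odd number of flips),
    using the closed form (1 - (1 - 2p)**n) / 2."""
    return (1.0 - (1.0 - 2.0 * p) ** n_cycles) / 2.0

def q_faulty_direct(p, n_cycles):
    """Same quantity by explicitly summing the odd binomial terms
    (only practical for small n)."""
    return sum(comb(n_cycles, k) * p**k * (1.0 - p) ** (n_cycles - k)
               for k in range(1, n_cycles + 1, 2))
```

For the tiny per-cycle p of the SEU model the closed form is numerically safe, whereas the direct sum is useful as a cross-check.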

12 Spatial Expansion: from a Bit to the Protection Domain (Word) Q_S(k): probability of the set of bits S having k faulty bits inside (during N_c cycles) Choose the cases where there are k faulty bits among the |S| bits of S Assumes all the bits in the word have the same VCs; otherwise, discrete convolution should be used

13 Faults in the Access Domain (Block) Q_D(k): probability of k faulty bits in any protection domain S_m inside D Choose the cases where there are k faulty bits in each S_m; sum over all S_m in D So far the masking effect has not been considered: only the expected number of intrinsic faults/errors has been calculated
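Under the equal-vulnerability-clock assumption the two expansions above reduce to a binomial for the protection domain and a sum over the M domains of the block. A sketch under that assumption; PARMA itself handles unequal clocks via convolution.

```python
from math import comb

def Q_S(k, q, s_bits):
    """Probability that a protection domain S of s_bits bits has exactly
    k faulty bits, assuming each bit is independently faulty with the
    same probability q (i.e. identical vulnerability clocks)."""
    return comb(s_bits, k) * q**k * (1.0 - q) ** (s_bits - k)

def expected_domains_with_k_faults(k, q, s_bits, m_domains):
    """Access domain D holds M protection domains S_1..S_M; summing Q_S
    over them gives the expected number of domains with exactly k faulty
    bits, which PARMA accumulates as the intrinsic fault count."""
    return m_domains * Q_S(k, q, s_bits)
```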

14 Considering the Masking Effect: Separating TRUE from Intrinsic Faults If all faults occur in unconsumed bits, they don't matter (FALSE events) TRUE faults = {all faults in S} − {all faults in unconsumed bits} The probability that S has k faults while the consumed set C has 0 faults gives the FALSE (masked) faults Deduct the probability that ALL k faulty bits are in the unconsumed bytes from the probability that the protection domain S has k faulty bits, to obtain the probability of TRUE faults, which become SDCs or TRUE DUEs The consumed set C and the unconsumed bits are obtained through simulation
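The subtraction on this slide can be written out directly. A sketch assuming a uniform per-bit fault probability q; in PARMA the split into consumed and unconsumed bits comes from simulation.

```python
from math import comb

def true_fault_probability(k, q, s_bits, unconsumed_bits):
    """P(domain S has exactly k faulty bits AND at least one is consumed):
    P(k faults in S) minus P(all k faults land in the unconsumed bits),
    the latter being the FALSE/masked case."""
    p_k_faults = comb(s_bits, k) * q**k * (1.0 - q) ** (s_bits - k)
    if k > unconsumed_bits:
        return p_k_faults  # k faults cannot all hide in unconsumed bits
    # All k faults in unconsumed bits, and the remaining bits fault-free:
    p_all_masked = comb(unconsumed_bits, k) * q**k * (1.0 - q) ** (s_bits - k)
    return p_k_faults - p_all_masked
```

When every bit of S is unconsumed the two terms cancel: every fault pattern is masked.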

15 Using PARMA to Measure Errors in a Block Protected by Block-level SECDED Undetected error that affects reliability (SDC): three or more faulty bits in the block, with at least one faulty bit among the consumed bits Detected error that affects reliability (TRUE DUE): exactly two faulty bits in the block, with at least one faulty bit among the consumed bits If all the faulty bits are unconsumed, the event is FALSE See the paper for how to apply PARMA to the different protection schemes
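The SECDED classification above (k = 2 with a consumed bit → TRUE DUE; k ≥ 3 with a consumed bit → SDC) can be sketched by summing the TRUE-fault probabilities per fault count. Again a sketch under an assumed uniform per-bit fault probability q.

```python
from math import comb

def _true_prob(k, q, s_bits, unconsumed):
    """P(exactly k faults in the block, at least one consumed)."""
    p_k = comb(s_bits, k) * q**k * (1.0 - q) ** (s_bits - k)
    if k > unconsumed:
        return p_k
    return p_k - comb(unconsumed, k) * q**k * (1.0 - q) ** (s_bits - k)

def secded_block_probs(q, block_bits, unconsumed_bits):
    """Block-level SECDED, per the slide's classification:
    TRUE DUE = exactly 2 faulty bits with >=1 consumed;
    SDC = 3 or more faulty bits with >=1 consumed."""
    true_due = _true_prob(2, q, block_bits, unconsumed_bits)
    sdc = sum(_true_prob(k, q, block_bits, unconsumed_bits)
              for k in range(3, block_bits + 1))
    return sdc, true_due
```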

16 Four Contributions 1. Development of the rigorous analytical model called PARMA 2. Measuring SERs on structures protected by various schemes 3. Observing possible distortions in accelerated studies, both quantitatively and qualitatively 4. Verifying approximate models (Contribution 1 is modeling; 2-4 are applications)

17 Measuring SERs on Structures Protected by Various Schemes Target Failures-In-Time of IBM Power6: SDC 114, DUE 4,566 Average L2 (256KB, 32B block) cache FITs, from 100M-instruction SimPoint simulations of 18 SPEC2K benchmarks on sim-outorder:

Scheme | SDC | (TRUE+FALSE) DUE | Latency | Check bits per 256 bits
No Protection | 155.66 | N/A | 10 | 0
1-bit Odd Parity | 2.53E-15 | 372.83 | 10 | 1
Block-level SECDED | 8.34E-31 | 7.04E-15 | 14 | 10
Word-level SECDED | 2.92E-33 | 6.32E-16 | 13 | 56

Implies word-level SECDED might be overkill in most cases Implies the protection domain size can be increased: e.g. CPPC @ISCA2011 Partially protected caches, or caches with adaptive protection schemes, need their FITs carefully quantified PARMA provides a comprehensive framework that can measure the effectiveness of such schemes Results were verified with AVF simulations

18 Observing Possible Distortions in Accelerated Tests Highly accelerated tests: SPEC2K benchmarks end in several minutes of wall-clock time Need to accelerate the SEU rate 10^17 times to see a reasonable number of faults SDC > DUE? Having more than two errors overwhelms the cases of having exactly two errors: this can be misleading qualitatively How to scale down the results? Multiplying them by 10^-17 can distort results quantitatively Results were verified with fault-injection simulations

19 Verifying Approximate Models PARMA provides rigorous reliability measurements, hence it is useful to verify faster, simpler approximate models Example: a model for a word-level SECDED protected cache Methods for determining cache scrubbing rates [Mukherjee'04][Saleh'90] ignore cleaning effects at accesses: they overestimate, but by how much? New model with a geometric distribution of Bernoulli trials Assumptions: at most two bits flip between two accesses to the same word; every access results in a detected error or in no error (corrected) T_AVG: average interval between two accesses to the same word; P_DUE: pmf of two Poisson arrivals; the mean of the geometric distribution gives the expected number of accesses to failure AVF x FIT from previous method: 2.1454 FIT New approximate model: 2.8246E-14 FIT PARMA: 6.3170E-16 FIT
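The approximate model described above can be sketched as a few lines of arithmetic. This is a reconstruction from the slide's stated assumptions, not the paper's exact formula; all parameter names and the example values are illustrative.

```python
import math

def approx_word_secded_fit(lam_per_bit, word_bits, t_avg_cycles, cycle_s):
    """Sketch of the approximate model: assume at most two flips between
    consecutive accesses to a word, so an access sees a DUE iff exactly
    two SEUs arrived in the interval (Poisson pmf at k=2). Accesses are
    Bernoulli trials, so failures are geometric with mean 1/p_due
    accesses, giving MTTF = T_AVG / p_due."""
    mu = lam_per_bit * word_bits * t_avg_cycles      # expected arrivals
    p_due = (mu**2 / 2.0) * math.exp(-mu)            # Poisson pmf, k = 2
    mttf_hours = (t_avg_cycles / p_due) * cycle_s / 3600.0
    return 1e9 / mttf_hours                          # FIT
```

Because p_due grows with the square of the SEU rate here, doubling the rate roughly quadruples the predicted FIT, which is one way such approximations diverge from exact accounting.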

20 Outline Introduction Soft Errors Evaluating Soft Errors PARMA: Precise Analytical Reliability Model for Architecture Model Applications Conclusion and Future work 20

21 Conclusion and Future Work Summary + PARMA is a rigorous model for measuring Soft Error Rates in architectures + PARMA works with a wide range of SEU rates without distortion + PARMA handles temporal MBEs + PARMA quantifies SDC and DUE rates under various error detection/protection schemes - PARMA does not address spatial MBEs yet - PARMA does not model the TAG array yet - Due to its complexity, PARMA is slow Future Work Extend PARMA to account for spatial MBEs and TAG vulnerability Develop sampling methods to accelerate PARMA

22 THANK YOU! QUESTIONS?

23 (Some) References [Biswas'05] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. Mukherjee, and R. Rangan. Computing Architectural Vulnerability Factors for Address-Based Structures. In Proceedings of the 32nd International Symposium on Computer Architecture, 532-543, 2005. [Li'05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks, 496-505, 2005. [Mukherjee'03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A Systematic Methodology to Calculate the Architectural Vulnerability Factors for a High-Performance Microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, 29-40, 2003. [Mukherjee'04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42, 2004. [Saleh'90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. IEEE Transactions on Reliability, 39(1), 114-122, 1990. [Ziegler] J. F. Ziegler and H. Puchner. SER – History, Trends and Challenges. Cypress Semiconductor Corp.

24 Addendum

25 Some Definitions SDC = Silent Data Corruption DUE = Detected Unrecoverable Error SER = Soft Error Rate = SDC + DUE Errors are measured as: MTTF = Mean Time To Failure FIT = Failures In Time; 1 FIT = 1 failure in a billion hours A 1-year MTTF = 10^9/(24*365) ≈ 114,155 FIT FIT is commonly used since FIT is additive Vulnerability Factor = fraction of faults that become errors; also called derating factor or soft error sensitivity
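The MTTF-to-FIT conversion on this slide is a one-liner, valid when the failure rate is constant (the condition under which FIT and MTTF are interchangeable):

```python
def mttf_years_to_fit(years):
    """FIT = failures per 1e9 device-hours. With a constant failure
    rate, FIT = 1e9 / MTTF_hours."""
    hours = years * 365 * 24
    return 1e9 / hours

# The slide's example: a 1-year MTTF is about 114,155 FIT.
```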

26 Soft Errors and Technology Scaling Hazucha & Svensson model: for a specific size of SRAM array, SER scales with particle flux × (bit) area × exp(−Qcrit/Qcoll) Flux depends on altitude and geomagnetic shielding (environmental factor) (Bit) area is process-technology dependent (technology factor) Qcoll is the charge collection efficiency, technology dependent Qcrit ∝ Cnode × Vdd According to scaling rules both C and V decrease, hence Qcrit decreases rapidly Static power-saving techniques on caches, such as drowsy mode or near-/sub-threshold Vdd, make cells more vulnerable to soft errors Hazucha et al, "Impact of CMOS technology scaling on the atmospheric neutron soft error rate"
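The exponential dependence on Qcrit is the key point of the scaling argument; it can be made concrete with the proportionality alone. The constant of proportionality is omitted, so this sketch only compares relative rates under assumed parameter values.

```python
import math

def relative_ser(flux, bit_area, q_crit, q_coll):
    """Hazucha-Svensson-style scaling: SER proportional to
    flux * area * exp(-Qcrit/Qcoll). With Qcrit ~ Cnode * Vdd,
    lowering Vdd (drowsy or sub-threshold operation) raises the
    SER exponentially."""
    return flux * bit_area * math.exp(-q_crit / q_coll)

# Halving Qcrit multiplies the relative rate by exp(Qcrit / (2 * Qcoll)).
```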

27 Error Classification Silent Data Corruption (SDC) TRUE and FALSE Detected Unrecoverable Errors (DUE) The slide's figure classifies each outcome by whether the faulty data is consumed C. Weaver et al, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," ISCA 2004

28 Soft Error Rate (SER) Intrinsic SER, more from the component's view: assumes all bits are important all the time What is the intrinsic SER of caches protected by a SECDED code, given the cleaning effect on every access? Realistic SER, more from the system's view: some soft errors are masked and do not cause system failure E.g. AVF x intrinsic SER: but what about caches with a protection code? Intrinsic SER projections from ITRS2007 (High Performance model):

Year of production | 2010 | 2013 | 2016 | 2019 | 2022
Feature size [nm] | 45 | 35 | 25 | 18 | 13
Gate length [nm] | 18 | 13 | 9 | 6 | 4.5
Soft Error Rate [FIT per Mb] | 1200 | 1250 | 1300 | 1350 | 1400
Failure rate in 1Mb [fails/hour] | 1.2E-6 | 1.25E-6 | 1.3E-6 | 1.35E-6 | 1.4E-6
% Multi-Bit Upsets in Single Event Upsets | 32% | 64% | 100%

29 Soft Error Estimation Methodologies: Industry Field analysis: statistically analyzes reported soft errors in market products, using repair records and sales of replacement parts; provides obsolete data Life testing: a tester constantly cycles through 1,000 chips looking for errors; takes around six months; expensive and not fast enough for the chip design process; usually used to confirm the accuracy of accelerated testing (x2 rule) Accelerated testing: chips under various beams of particles, under a well-defined test protocol Terrestrial neutrons: particle accelerators (protons) Thermal neutrons: nuclear reactors Radioactive contamination: radioactive materials Difficulties: data is rarely published, due to potential product-liability problems; comparisons of accelerated testing vs. life testing are even rarer; IBM and Cypress published small amounts of data showing correlation J. F. Ziegler and H. Puchner, "SER – History, Trends and Challenges," Cypress Semiconductor Corp

30 Soft Error Estimation Methodologies: Common Ways in Research Fault injection: generate artificial faults based on the fault model +Applicable to a wide range of designs (from RTL to system simulations) -A massive number of simulations is necessary to be statistically valid -A highly accelerated Single Event Upset (SEU) rate is required for soft errors -How to scale the measurements down to the 'real environment' is unclear Architectural Vulnerability Factor: find the derating factor (Faults → Errors) as {ACE bits}/{total bits} per cycle SoftArch: extrapolate AVG(TTFs) from one program run to MTTF assuming infinite executions AVF and SoftArch use a simplified Poisson fault generation model +Work well for small-scale systems in current technology at the earth's surface: a single-bit-error dominant environment -Can't account for error protection/detection schemes (ECC) -Unable to address temporal & spatial MBEs AVF is NOT an absolute metric for reliability: FIT_structure = intrinsic_FIT_structure × AVF_structure M. Li et al, "Accurate Microarchitecture-Level Fault Modeling for Studying Hardware Faults," HPCA 2009 S. S. Mukherjee et al, "A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor," MICRO 2003 X. Li et al, "SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors," DSN 2005

31 Evaluating Soft Errors: Some Reliability Benchmarking Approaches Intrinsic FIT (Failure-in-Time) rate: highly pessimistic; every bit is vulnerable in every cycle; unclear how to compute intrinsic FIT rates for protected caches Architectural Vulnerability Factor [Mukherjee'03]: lifetime analysis on Architecturally Correct Execution bits; derating factor (Faults → Errors); realistic FIT = AVF x intrinsic FIT SoftArch [Li'05]: computes TTF for one program run and extrapolates to MTTF AVF and SoftArch + Quickly compute SDC with no parity, or DUE under parity - Ignore temporal MBEs: two SEUs on one word become two faults instead of one; two SEUs on the same bit become two faults instead of zero - Ignore spatial MBEs - Can't account for error detection/correction schemes To compare the SERs of various error-correcting schemes, temporal/spatial MBEs must be accurately counted

32 Prior State-of-the-Art Reliability Model: AVF Architectural Vulnerability Factor (AVF) AVF_bit = probability a bit matters (for a user-visible error) = # of bits that affect the user-visible outcome / total # of bits If we assume AVF = 100%, we will overdesign the system; we need to estimate AVF to optimize the system design for reliability AVF equation for a target structure (Eq. 1) AVF is NOT an absolute metric for reliability: FIT_structure = intrinsic_FIT_structure × AVF_structure Shubu Mukherjee, "Architecture design for soft errors"

33 ACEness of a bit ACE (Architecturally Correct Execution) bit ACE bit affects program outcome: correctness is subjective (user-visible) Microarchitectural ACE bits Invisible to programmer, but affects program outcome Easier to consider Un-ACE bits –Idle/Invalid/Misspeculated state –Predictor structures –Ex-ACE state (architecturally dead or invisible states) Architectural ACE bits Visible to programmer Transitive (ACE bit in the word makes the Load instruction ACE) Easier to consider Un-ACE bits –NOP instructions –Performance-enhancing operations (non-opcode field of non-binding prefetch, branch prediction hint) –Predicated false instructions (except predicate status bit) –Dynamically dead instructions –Logical masking AVF framework = lifetime analysis to correctly find ACEness of bits in the target structure for every operating cycle 33 Shubu Mukherjee, “Architecture design for soft errors”

34 Rigorous Failure/Error Rate Modeling In existing methodologies, such as AVF multiplied by the intrinsic rate, estimation is simple and easy: imprecise, but a safe overestimate Downside of the classical approach (i.e. AVF-based methodology): an SEU is a very rare event, while program execution time is rather short In a 3GHz processor, the SEU rate is 1.0155E-25 per bit within one cycle; equivalently, the probability that a bit is hit by an SEU and becomes faulty in one cycle is 1.0155E-25 Simplified assumption that one SEU directly results in one fault/error: but the same bit may be hit multiple times, and/or multiple bits may become faulty in a word In space, or when an extremely low Vdd is supplied to the SRAM cell, the SEU rate can rise by more than 10^6 times, and second-order effects become significant With a data protection methodology, how to measure vulnerability is uncertain under the simplified assumption

35 Reliability Theory (1) Fundamental definition of probability in reliability theory: Number(Event)/Number(Trials) approximates the true Prob(Event) The true probability is barely known; the approximation → truth as trials → ∞, by the Law of Large Numbers Two events in reliability theory: survival and failure of a component/system Reliability functions: (component/system) reliability R(t), and probability of failure Q(t) Prob(Event) up to and at time t: a conditional probability Note that R(t) and Q(t) are time dependent in general (Conditional) instantaneous failure rate λ(t), a.k.a. the hazard function h(t)

36 Reliability Theory (2) Reliability functions (cont’d) (Unconditional) Failure Density Function f(t) Average Failure Rate from time 0 to T Discrete dual of λ(t) - Hazard Probability Mass Function h(j) Average Failure Rate from timeslot 0 to T 36

37 Reliability Theory (3) How to measure reliability R(t) itself Events with constant failure rate MTTF Sampling issue: usually no test can aggregate total test time to ∞ (Right) censorship with no replacement, then Maximum Likelihood Estimation (B. Epstein, 1954) At the end of the test time t_r, measure the TTFs (t_i) of the samples that failed, and truncate the lifetime of all surviving samples to t_r Then the MLE of MTTF is (Σ t_i + (n − r)·t_r)/r, for n samples of which r failed FIT: one intuitive form of failure rate, failures in 10^9 hours Interchangeable with MTTF only when the failure rate is constant Additive between independent components
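The censored-sample estimator referenced above is short enough to write out. This is the standard censored-exponential form; treat it as a reconstruction of what the slide cites, not a quote of it.

```python
def epstein_mttf_mle(failure_times, n_units, t_censor):
    """MLE of MTTF for exponential lifetimes under right censoring
    without replacement: total accumulated test time (observed TTFs,
    plus t_censor credited to each surviving unit) divided by the
    number of observed failures."""
    r = len(failure_times)                       # number of failures
    total_time = sum(failure_times) + (n_units - r) * t_censor
    return total_time / r
```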

38 Vulnerability Clock Used to track the cycles any bit spends in the vulnerable component (L2): ticks when a bit resides in L2, stops when a bit stays outside L2 Events in the slide's diagram: Cold miss: VC_MEM = 0, VC_L2 = 0, VC_L1 := VC_L2 = 0 Store: VC_L1 := 0 while VC_L2 keeps ticking (e.g. 100) Writeback: VC_L2 := VC_L1 = 0 Eviction: VC_MEM := VC_L2 (e.g. 80); refill: VC_L2 := VC_MEM = 80

39 PARMA Model: Measuring Soft Error FIT with PARMA PARMA measures the failure rate by accumulating failure probability mass Index processor cycles by j (1 ≤ j ≤ T_exe) Total failures observed during T_exe (failure rate): equivalent to the expected number of failures of type ERR FIT extrapolation with an infinite program-execution assumption How do we calculate the per-cycle failure probability mass? Start with p, the probability that one bit is flipped during one cycle period, obtained from the Poisson SEU model
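The accumulate-then-extrapolate accounting can be sketched in a few lines. This is an illustrative sketch of the bookkeeping, not PARMA's implementation; variable names are assumptions.

```python
def fit_from_failure_masses(masses, cycle_s):
    """Sum the per-cycle failure probability masses over the run to get
    the expected number of failures in T_exe, then extrapolate to
    failures per 1e9 hours under the slide's assumption that the
    program repeats indefinitely."""
    expected_failures = sum(masses)                # E[#failures] over run
    t_exe_hours = len(masses) * cycle_s / 3600.0   # wall-clock run time
    return expected_failures / t_exe_hours * 1e9   # FIT
```

Because the masses are summed, the result is an expectation, which is why FITs from independent words can later be added.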

40 PARMA Model: Fault Generation Model SEU Model Assumptions: all clock cycles are independent with respect to SEUs; all bits are independent with respect to SEUs (spatial MBEs are not accounted for) Widely accepted model for SEUs: the Poisson model p: probability that one bit is flipped during one cycle period (in SBE cases) Spatial MBE case: probability that multiple bits become faulty during one cycle The Poisson probability mass function gives p λ: Poisson rate of SEUs, e.g. 10^-25 per bit per cycle for a 3GHz CPU at 65nm

41 PARMA Model: Measuring Soft Error FIT with PARMA PARMA measures failure rate by accumulating failure probability mass Index processor cycle by j (1 ≤ j ≤ Texe) A (conditional) failure probability mass at cycle j : Total failures observed during Texe (failure rate): Equivalent to expected number of failures of type ERR FIT extrapolation with infinite program execution assumption Average FIT with multiple programs 41

42 Failures Measured in PARMA No protection, 1-bit parity, 1-bit ECC on a word, and 1-bit ECC on a block (S = protection domain, C = its consumed bits; S_m/C_m = the m-th word's domain and consumed bits):

Scheme / failure | Access domain | Protection domain | Faulty-bit condition
No parity, SDC | block | N/A | ≥1 in C
1-bit parity, TRUE DUE | block | block | odd # in S, ≥1 in C
1-bit parity, SDC | block | block | even # >0 in S, ≥1 in C
1-bit ECC word-level, TRUE DUE | block containing M words | word | 2 in any S_m, ≥1 in that C_m
1-bit ECC block-level, TRUE DUE | block | block | 2 in S, ≥1 in C
1-bit ECC word-level, SDC | block containing M words | word | ≥3 in any S_m, ≥1 in that C_m
1-bit ECC block-level, SDC | block | block | ≥3 in S, ≥1 in C

43 Spatial Expansion: from a Bit to a Byte in N_c Vulnerability Cycles q_b(k): probability of a byte having k faulty bits (in N_c vulnerability cycles) From the 8 bits in the byte, choose the k faulty bits

44 Spatial Expansion: from a Byte to the Protection Domain (Word) Q_S(k): probability of the set of bits S having k faulty bits inside (during N_c cycles) Choose the cases where there are k faulty bits in S Enumerate all possibilities of faulty bits in the bytes of S such that their total number = k

45 Faults in the Access Domain (Block) Q_D(k): probability of k faulty bits in any protection domain S_m inside D Choose the cases where there are k faulty bits in S_m; sum over all S_m in D So far the masking effect has not been considered: only the expected number of intrinsic faults/errors has been calculated

46 PARMA Model: Failures Measured in PARMA (1) Unprotected cache: without protection, any non-zero number of faulty bits causes an SDC failure SDCs: at least one faulty bit among the consumed bits Odd parity per block SDCs: at least one faulty bit in the consumed bytes, given a nonzero, even number of faulty bits in the block TRUE DUEs: at least one faulty bit in the consumed bytes, given an odd number of faulty bits in the block

47 PARMA Model: Failures Measured in PARMA (2) SECDED per block SDCs: at least one faulty bit among the consumed bits, given more than two faulty bits in the block TRUE DUEs: at least one faulty bit among the consumed bits, given exactly two faulty bits in the block SECDED per word: same as the per-block case, except the protection domain is the word Because the access domain is the block, all the words in the same block are accounted for by adding their FITs: additive because the FITs from each word are independent and counted separately

48 PARMA Simulations Target processor: 4-wide OoO, 64-entry ROB, 32-entry LSQ, McFarling's hybrid branch predictor sim-outorder was modified and executed with the Alpha ISA 18 benchmarks from SPEC2000 were used, with SimPoint sampling of 100M-instruction samples Cache configuration:

Cache | Size | Associativity | Latency [cyc]
IL1 (32B block) | 16KB | 1-way | 2
DL1 (32B block) | 16KB | 4-way | 3
UL2 (32B block) | 256KB | 8-way | NP/P1b: 10, SW(4B): 13, SB(64B): 14

49 Evaluating Soft Errors: AVF or Fault Injection, Why Not? AVF fails to handle scenarios with error protection schemes Why not use fault injection for such scenarios? Possible distortion in the interpretation of results, due to the highly accelerated experiments

50 Simulations with PARMA: Results in FIT (1)

Bench | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (i)
ammp | 320.32 | 419.27 | 2.50E-14 | 2.53E-14 | 1.53E-14 | 1.32E-29 | 1.99E-15 | 2.92E-15 | 1.02E-31
art | 48.76 | 16.74 | 1.22E-16 | 1.22E-16 | 3.70E-17 | 3.89E-34 | 1.03E-17 | 9.01E-18 | 3.21E-36
crafty | 429.47 | 716.45 | 1.99E-14 | 2.34E-14 | 4.74E-14 | 4.24E-30 | 1.85E-15 | 6.41E-15 | 2.83E-32
eon | 382.25 | 298.23 | 1.45E-14 | 1.63E-14 | 6.03E-15 | 2.72E-30 | 1.69E-15 | 9.77E-16 | 3.57E-32
facerec | 98.08 | 0.59 | 9.80E-17 | 9.79E-17 | 1.37E-18 | 8.88E-35 | 1.18E-17 | 2.45E-19 | 1.22E-36
galgel | 60.35 | 77.61 | 9.52E-17 | 9.52E-17 | 8.53E-17 | 3.54E-34 | 8.15E-18 | 1.38E-17 | 2.72E-36
gap | 138.59 | 22.27 | 3.32E-16 | 3.94E-16 | 5.26E-17 | 2.34E-33 | 4.30E-17 | 8.80E-18 | 2.59E-35
gcc | 349.11 | 229.96 | 3.76E-15 | 4.86E-15 | 3.25E-15 | 1.89E-31 | 4.65E-16 | 4.68E-16 | 1.86E-33
gzip | 547.53 | 1115.56 | 1.05E-14 | 1.17E-14 | 9.56E-15 | 2.86E-31 | 1.30E-15 | 1.24E-15 | 3.67E-33
mcf | 14.71 | 14.43 | 1.28E-17 | 1.28E-17 | 3.23E-17 | 2.93E-35 | 1.13E-18 | 4.35E-18 | 1.89E-37
mesa | 460.52 | 112.19 | 4.50E-15 | 5.03E-15 | 8.57E-16 | 4.01E-32 | 5.44E-16 | 1.52E-16 | 4.73E-34
parser | 138.54 | 380.34 | 1.86E-15 | 2.00E-15 | 4.12E-15 | 4.59E-32 | 1.47E-16 | 5.82E-16 | 2.97E-34
perlbmk | 100.82 | 315.37 | 2.74E-15 | 2.97E-15 | 8.85E-15 | 6.96E-32 | 2.36E-16 | 1.17E-15 | 4.46E-34
sixtrack | 76.24 | 7.92 | 3.75E-16 | 3.91E-16 | 1.39E-16 | 9.70E-33 | 4.26E-17 | 2.14E-17 | 1.12E-34
twolf | 193.40 | 419.25 | 1.52E-15 | 1.56E-15 | 2.88E-15 | 1.45E-32 | 1.24E-16 | 4.11E-16 | 1.06E-34
vortex | 831.26 | 324.74 | 8.57E-15 | 9.63E-15 | 2.80E-15 | 1.70E-31 | 1.04E-15 | 4.31E-16 | 1.93E-33
vpr | 184.31 | 369.25 | 2.16E-15 | 2.18E-15 | 3.27E-15 | 3.04E-32 | 1.83E-16 | 4.79E-16 | 2.45E-34
wupwise | 146.69 | 130.00 | 1.99E-15 | 2.02E-15 | 7.28E-16 | 1.75E-32 | 1.63E-16 | 1.70E-16 | 1.42E-34
Average | 155.66 | 217.17 | 2.53E-15 | 3.45E-15 | 3.59E-15 | 8.34E-31 | 2.25E-16 | 4.07E-16 | 2.92E-33

(a) NP_SDC: no protection / SDC (≈ AVF_SDC) (b) P1B_TRUE_DUE: odd parity / TRUE DUE (c) P1B_SDC: odd parity / SDC (d) SB_TRUE_DUE: block-level SECDED / TRUE DUE (e) SB_FALSE_DUE: block-level SECDED / FALSE DUE (f) SB_SDC: block-level SECDED / SDC (g) SW_TRUE_DUE: word-level SECDED / TRUE DUE (h) SW_FALSE_DUE: word-level SECDED / FALSE DUE (i) SW_SDC: word-level SECDED / SDC

51 PARMA Application: a Gold-Standard for Developing New Approximate Models (3) Results

Name | AVF | AVF x FIT from previous method | FIT from new approximate model | FIT from PARMA
ammp | 40.977% | 2.9374 | 8.4182E-14 | 4.9114E-15
art | 2.849% | 0.2042 | 4.5179E-16 | 1.93476E-17
crafty | 61.078% | 4.3784 | 3.3463E-14 | 8.25685E-15
eon | 99.049% | 7.1003 | 1.3441E-13 | 2.67121E-15
facerec | 4.319% | 0.3096 | 1.3138E-15 | 1.20444E-17
galgel | 6.010% | 0.4308 | 7.0577E-16 | 2.19513E-17
gap | 7.118% | 0.5103 | 4.5248E-16 | 5.18293E-17
gcc | 27.658% | 1.9827 | 7.7612E-15 | 9.33375E-16
gzip | 83.466% | 5.9832 | 6.9763E-14 | 2.53328E-15
mcf | 1.267% | 0.0908 | 2.9364E-16 | 5.47892E-18
mesa | 30.070% | 2.1555 | 1.6881E-14 | 6.96557E-16
parser | 22.983% | 1.6475 | 1.8796E-14 | 7.28453E-16
perlbmk | 31.621% | 2.2667 | 2.9209E-14 | 1.41053E-15
sixtrack | 3.916% | 0.2807 | 5.9788E-16 | 6.39658E-17
twolf | 26.750% | 1.9176 | 2.2392E-14 | 5.35443E-16
vortex | 53.171% | 3.8115 | 3.6437E-14 | 1.4704E-15
vpr | 24.232% | 1.7371 | 4.0074E-14 | 6.61992E-16
wupwise | 12.183% | 0.8733 | 1.1242E-14 | 3.33145E-16
Average | 27.33% | 2.1454 | 2.8246E-14 | 6.31707E-16

With PARMA, we can verify newly developed approximate models

52 Simulation with PARMA: Overhead Need to track the entire memory footprint: vulnerability clock cycles for the L1, L2 and memory copies Data structure: binary search tree, for quick search and insertion; the memory footprint never decreases Memory overhead: ~17 bytes to track 1 byte of memory footprint Computation overhead: O(n^3) with non-parallelized code, where n is the number of bits in the block The probability calculation for having k specific faulty bits is O(n^2), and the probability distribution over k in [0, n] is needed Overall ~25x slowdown in simulation time over the base sim-outorder, still much faster than a massive number of fault-injection tests

53 PARMA Application: a Gold-Standard for Developing New Approximate Model (1) PARMA provides rigorous reliability measurements Hence, it is useful to verify faster, simpler approximate models Example: model for word-level SECDED protected cache Known methods for determining cache scrubbing rates Model from previous work [Mukherjee’04][Saleh’90] Ignores cleaning effects at accesses –Okay for determining cache scrubbing rates because it overestimates –But by how much does it overestimate? 53

54 PARMA Application: a Gold-Standard for Developing New Approximate Model (2) 54

55 PARMA Application: a Gold-Standard for Developing New Approximate Models (3) Word-level SECDED average vulnerability, converted to a FIT rate: AVF x intrinsic FIT from previous method: 2.1454 FIT New approximate model: 2.8246E-14 FIT PARMA: 6.3170E-16 FIT With PARMA, we can verify newly developed approximate models

