Published by Isaac Martin Wilkins. Modified over 9 years ago.
Copyright 2005, M. Tahoori
Soft Error Modeling and Mitigation
Mehdi B. Tahoori
Northeastern University
mtahoori@ece.neu.edu

Outline
- Soft Error Introduction
- Soft Error Modeling for Memory Hierarchy
- Soft Error Modeling in Random Logic
  - Combinational logic
  - Sequential logic
- More Issues

Soft Error: Introduction

Evidence of Cosmic Ray Strikes
- Documented strikes in large servers, found in error logs
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
- Sun Microsystems, 2000 (R. Baumann, 2002 IRPS Workshop talk)
  - Cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to crash suddenly and mysteriously
  - Companies affected: Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations
  - Verisign moved to IBM Unix servers (for the most part)

Reactions from Companies
- Fujitsu SPARC in 130 nm technology
  - 80% of 200k latches protected with parity
  - Compare with the very few latches protected in McKinley
  - ISSCC, 2003
- IBM declared a 1000-year system MTBF as the product goal for the Power4 line
  - Very hard to achieve this goal in a cost-effective way
  - Bossen, 2002 IRPS Workshop talk

Impact of Neutron Strike on a Si Device
- A strike creates electron-hole pairs that can be absorbed by source/diffusion areas and change the state of the device
- Figure 3, Ziegler, et al., “IBM experiments in soft fails in computer electronics (1978 - 1994),” IBM J. of R. & D., Vol. 40, No. 1, Jan. 1996.

Impact of Elevation
- 3x - 5x increase in Denver at 5,000 feet
- 100x increase in airplanes at 30,000+ feet
- Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics (1978 - 1994),” IBM J. of R. & D., Vol. 40, No. 1, Jan. 1996.

Physical Solutions Are Hard
- Shielding?
  - No practical absorbent (e.g., approximately > 10 ft of concrete), unlike alpha particles
- Technology solution: SOI?
  - Partially-depleted SOI is probably no help in 250 nm and beyond
  - Fully-depleted SOI can help, but is very hard to manufacture in high volumes
- Radiation-hardened cells?
  - 10x improvement possible, with a significant penalty in performance, area, and cost
  - 2-4x improvement may be possible with less penalty
- Some of these techniques will help alleviate the impact of soft errors, but will not completely remove it

[Flowchart: outcomes of a particle strike on a state bit (e.g., in a register file)]
- Decision nodes (yes/no): Bit read? Bit has error protection? Does bit matter?
- Protection cases: error is only detected (e.g., parity + no recovery); error can be corrected (e.g., ECC)
- Outcomes: benign fault (no error); no error; Silent Data Corruption (SDC); Detected, but Unrecoverable Error (DUE)

Definitions
- Interval-based
  - MTTF = Mean Time to Failure
  - MTTR = Mean Time to Repair
  - MTBF = Mean Time Between Failures = MTTF + MTTR
  - Availability = MTTF / MTBF
- Rate-based
  - FIT = Failure in Time = 1 failure in a billion hours
  - 1 year MTTF = 10^9 / (24 * 365) FIT = 114,155 FIT
  - SER FIT = SDC FIT + DUE FIT
- Hypothetical example
  - Cache: 40K FIT + IQ: 100K FIT + FU: 58K FIT = total of 198K FIT

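The unit conversions on the Definitions slide are easy to check numerically. The sketch below is illustrative only: the function names are made up for this example, and the component values are the slide's own hypothetical cache/IQ/FU numbers.

```python
HOURS_PER_YEAR = 24 * 365          # same year length the slide uses
FIT_HOURS = 1e9                    # 1 FIT = 1 failure per 10^9 device-hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Convert a mean time to failure (years) into a FIT rate."""
    return FIT_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Convert a FIT rate back into a mean time to failure (years)."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR (same time unit)."""
    return mttf / (mttf + mttr)


# FIT rates of independent components add (hypothetical example from the slide).
component_fit = {"cache": 40_000, "iq": 100_000, "fu": 58_000}
total_fit = sum(component_fit.values())

one_year_fit = mttf_years_to_fit(1.0)
total_mttf_years = fit_to_mttf_years(total_fit)

print(f"1 year MTTF  = {one_year_fit:,.0f} FIT")
print(f"total        = {total_fit:,} FIT")
print(f"total MTTF   = {total_mttf_years:.2f} years")
```

Because FIT is a rate, component contributions add directly, whereas MTTF values do not; that is why the slide's example sums FIT rather than MTTF.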
Number of Vulnerable Bits Growing with Moore’s Law
- Fujitsu SPARC has 20% of 200k latches vulnerable in 2003
- Additional SDC FIT from RAM cells, static logic, and dynamic logic
- Higher SDC FIT in multiprocessor systems
- Figure annotations: 12x gap; gap ~= 100x for an 8-processor system
- A data center with 300 such systems will encounter a data corruption almost every week

Soft Error Issues
1. Why is soft error a problem today?
   - Industry is at the cross-over point
   - Future is worse, IF we don’t do anything
2. What about system FIT contribution?
   - System FIT decreased dramatically (e.g., RAID, ECC on DRAM)
   - Large part of the system is moving on-chip (e.g., memory controller)
3. Is this a server problem or a desktop problem?
   - Definitely a server (e.g., data center) problem
   - Desktop problem from the IT manager’s point of view
4. How do software bugs compare to soft error rates?
   - Limited number of bugs in mature software (e.g., servers, company environment)
   - If we don’t do anything, soft errors will be your dominant failure rate

Balancing Reliability and Performance in Memory Hierarchy

Goals
- Accurately estimate reliability in cache memory early in the design cycle
- Methods to increase reliability of cache memory
- Minimize power / performance impacts

Motivation
- Memory elements are the components most vulnerable to soft errors (Gaisler 97)
- Previous method: fault injection (Faure 03)
  - Software-based (during the design cycle): time-consuming
  - Radiation-based: only possible after chip fabrication

Cache Reliability Model
- Separately measures reliability of:
  - Data array
  - Tag array
  - Status bits
    - Valid bit
    - Dirty bit
    - Can be extended to other status bits (coherency, etc.)
- Model provides an upper bound on error rate for a given workload

Errors in Data RAM
- Critical Words (CW)
  - Read by CPU or written to memory
- Critical Time (CT)

Reliability Computation (I)
- Vulnerability factor
  - Fraction of faults that become errors
- M: cache size
- TT: total execution time
- CT: critical time

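The formula on this slide did not survive in the transcript. One plausible reading of the listed symbols, stated here as an assumption rather than the author's exact definition, is that the vulnerability factor is the total critical time summed over all cache words, divided by M × TT. The sketch below computes that ratio; the function name, the interval representation, and the sample numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, end) of one critical-time window


def vulnerability_factor(
    critical_intervals: Dict[int, List[Interval]],
    cache_words: int,      # M: cache size, counted in words (assumption)
    total_time: float,     # TT: total execution time, same unit as intervals
) -> float:
    """Assumed form: VF = (sum of CT over all words) / (M * TT)."""
    if cache_words <= 0 or total_time <= 0:
        raise ValueError("cache_words and total_time must be positive")
    total_ct = sum(
        end - start
        for intervals in critical_intervals.values()
        for start, end in intervals
    )
    return total_ct / (cache_words * total_time)


# Hypothetical 4-word cache observed for 100 cycles.
# Word 0 holds data that will be read for 30 cycles; word 2 for 20 + 10 cycles.
sample = {0: [(10.0, 40.0)], 2: [(0.0, 20.0), (50.0, 60.0)]}
vf = vulnerability_factor(sample, cache_words=4, total_time=100.0)
print(f"vulnerability factor = {vf:.2f}")
```

Under this reading, a strike only matters if it lands on a word during one of its critical windows, so the ratio is the fraction of (word, time) pairs that are exposed.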
Reliability Computation (II)
- Define vulnerability as follows:
  - Independent of environment and cache size
- Goal: decrease vulnerability while not impacting power and performance

Experimental Setup
- SimpleScalar 4.0
- SPEC2000 benchmarks
- Programs run for 500M instructions

Experiments
- Examined three methods to reduce vulnerability
  - Flushing: periodically flush the entire cache
  - Write policy: change to a write-through policy (from write-back)
  - Refreshing: periodically refetch cache blocks from the L2 cache

D-Cache: Flushing
- 4x reduction in vulnerability

D-Cache: Write Policy
- 10x reduction in vulnerability

D-Cache: Refresh
- 3x reduction in vulnerability using write-through (30x total)

Summary
- Reliability estimation of cache hierarchy
  - Based on critical words and critical times
- Several methods to reduce vulnerability
  - Flushing
  - Write-through policy
  - Refreshing
- 30x decrease in vulnerability with minimal IPC impact

Soft Error Modeling at Logic-Level

Exponential Increase of Soft Errors
- e^(-Qcrit/Qs) trend with technology scaling (Shivakumar, DSN 2002)
  - Qcrit: the critical charge (depends on characteristics of the circuit)
  - Qs: the charge collection efficiency of a particle strike on the device
- Particles of lower energy occur far more frequently than particles of higher energy

Soft Error Rate
- Error rate of node n = Nominal FIT × Logic Derating × Timing Derating
- Nominal FIT
  - Occurrence rate of SEUs at node n causing a glitch
- Logic Derating
  - Propagation of the error from node n to system bistables or outputs
- Timing Derating
  - Propagated transient captured in system bistables

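A minimal sketch of how the three factors combine, assuming the per-node error rate is their product and that node contributions add. The function name and the sample values are hypothetical and not taken from the talk.

```python
from typing import Iterable, Tuple


def node_ser(nominal_fit: float, logic_derating: float, timing_derating: float) -> float:
    """Error rate of one node: raw upset rate scaled by the two derating factors."""
    for name, value in (("logic", logic_derating), ("timing", timing_derating)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} derating must be a probability in [0, 1]")
    return nominal_fit * logic_derating * timing_derating


def circuit_ser(nodes: Iterable[Tuple[float, float, float]]) -> float:
    """Sum the per-node rates; assumes node contributions are independent."""
    return sum(node_ser(fit, ld, td) for fit, ld, td in nodes)


# Hypothetical nodes: (nominal FIT, logic derating, timing derating).
example_nodes = [(100.0, 0.14, 0.25), (50.0, 0.50, 0.10)]
total = circuit_ser(example_nodes)
print(f"circuit SER = {total:.2f} FIT")
```

A node with a high raw upset rate can still contribute little if its glitches are usually masked logically or arrive outside the latching window.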
Combinational Logic (Logic Derating)

SER Estimation in Combinational Logic
- The main idea:
  - Traverse structural paths from the fault's origin to the primary outputs (POs)
  - Use signal probabilities for SER estimation

Example: Simple Path
- EPP(gate C) = 1 × 0.2 = 0.2
- EPP(gate D) = 0.2 × (1 - SP_B) = 0.2 × 0.7 = 0.14

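The arithmetic of the simple-path example can be reproduced with a few lines. The circuit figure is not part of the transcript, so the gate types below are assumptions chosen to match the numbers: gate C is taken to be an AND whose off-path input has signal probability 0.2, and gate D an OR whose off-path input B has signal probability 0.3.

```python
from typing import Iterable


def through_and(epp_in: float, off_path_sps: Iterable[float]) -> float:
    """An AND gate passes the error only when every off-path input is 1."""
    epp = epp_in
    for sp in off_path_sps:
        epp *= sp
    return epp


def through_or(epp_in: float, off_path_sps: Iterable[float]) -> float:
    """An OR gate passes the error only when every off-path input is 0."""
    epp = epp_in
    for sp in off_path_sps:
        epp *= 1.0 - sp
    return epp


# Error originates at the input of gate C with probability 1.
epp_c = through_and(1.0, [0.2])      # assumed AND, off-path SP = 0.2
epp_d = through_or(epp_c, [0.3])     # assumed OR, off-path SP_B = 0.3
print(f"EPP(C) = {epp_c:.2f}, EPP(D) = {epp_d:.2f}")
```

This shortcut only works on a single path with independent off-path signals; reconvergent fanout needs the four-valued rules described on the following slides.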
Propagation Rules
- P_a(U_i) + P_ā(U_i) + P_1(U_i) + P_0(U_i) = 1
- Need 4 logic values
  - 0, 1: no propagation
  - a, ā: propagation (same and opposite polarities)

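The per-gate rule table from this slide is not in the transcript. The sketch below derives rules for AND, OR, and NOT gates from the four-valued definition, under the assumption that gate inputs are statistically independent; treat it as an illustration of the idea, not as the author's exact rule set. All names are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence


@dataclass(frozen=True)
class SignalState:
    p_a: float      # carries the error with the same polarity
    p_abar: float   # carries the error with inverted polarity
    p_1: float      # error-free logic 1
    p_0: float      # error-free logic 0

    @property
    def epp(self) -> float:
        """Probability that the error is visible on this signal."""
        return self.p_a + self.p_abar


def error_site() -> SignalState:
    return SignalState(p_a=1.0, p_abar=0.0, p_1=0.0, p_0=0.0)


def off_path(sp: float) -> SignalState:
    """Off-path signal described only by its signal probability."""
    return SignalState(p_a=0.0, p_abar=0.0, p_1=sp, p_0=1.0 - sp)


def and_gate(inputs: Sequence[SignalState]) -> SignalState:
    p_1 = prod(s.p_1 for s in inputs)
    # Output is 'a' when every input is 1 or a, but not all are 1.
    p_a = prod(s.p_1 + s.p_a for s in inputs) - p_1
    p_abar = prod(s.p_1 + s.p_abar for s in inputs) - p_1
    return SignalState(p_a, p_abar, p_1, 1.0 - (p_1 + p_a + p_abar))


def or_gate(inputs: Sequence[SignalState]) -> SignalState:
    p_0 = prod(s.p_0 for s in inputs)
    # Output is 'a' when every input is 0 or a, but not all are 0.
    p_a = prod(s.p_0 + s.p_a for s in inputs) - p_0
    p_abar = prod(s.p_0 + s.p_abar for s in inputs) - p_0
    return SignalState(p_a, p_abar, 1.0 - (p_0 + p_a + p_abar), p_0)


def not_gate(s: SignalState) -> SignalState:
    return SignalState(p_a=s.p_abar, p_abar=s.p_a, p_1=s.p_0, p_0=s.p_1)


# Error site feeds an AND (off-path SP 0.2), then an OR (off-path SP 0.3).
c_out = and_gate([error_site(), off_path(0.2)])
d_out = or_gate([c_out, off_path(0.3)])
print(f"EPP after AND = {c_out.epp:.2f}, after OR = {d_out.epp:.2f}")
```

Tracking a and ā separately matters once paths reconverge: an input carrying a and another carrying ā cancel at an AND or OR gate, which a single "error present" probability cannot express.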
Algorithm
For any gate g_i:
1. Extract all on-path signals (and gates) from g_i to any reachable primary output PO_j and/or flip-flop ff_j
2. Levelize signals on these paths
3. Traverse the paths in order
   - Use signal probabilities for off-path signals
   - Use propagation rules for on-path signals

Example: Reconvergent Fanouts

Sequential Logic (Timing Derating)

Glitch Propagation: Simple Path
- Duration and time of propagated glitch
  - Depends on propagation and transition delays along the path
- Glitch propagation probability
  - Depends on signal probabilities of off-path signals
- Error propagation probability (EPP)
  - Propagation probability (PP)
  - Latching probability (LP)

Latching Probability
- LP = (S + H + W) / T
  - S, H: setup and hold time
  - W: glitch width
  - T: clock period

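The latching-probability formula translates directly into code. The clamp to 1.0 is an addition made here for glitches wider than the clock period, and the sample timing values are hypothetical.

```python
def latching_probability(setup: float, hold: float, glitch_width: float,
                         clock_period: float) -> float:
    """LP = (S + H + W) / T, with all arguments in the same time unit."""
    if clock_period <= 0:
        raise ValueError("clock_period must be positive")
    if min(setup, hold, glitch_width) < 0:
        raise ValueError("setup, hold, and glitch_width must be non-negative")
    # Clamp (not on the slide): a probability cannot exceed 1.
    return min(1.0, (setup + hold + glitch_width) / clock_period)


# Hypothetical values in picoseconds.
lp = latching_probability(setup=20.0, hold=10.0, glitch_width=70.0,
                          clock_period=1000.0)
print(f"latching probability = {lp:.2f}")
```

The formula reflects that a glitch is captured if any part of it overlaps the setup-and-hold window around the clock edge, so wider glitches and shorter clock periods both raise LP.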
Reconvergent Paths
- Propagated waveforms
  - Multiple waveforms, not simple glitches

Approach
- Find all possible propagated waveforms
  - Enhanced static timing analysis
  - All possible transitions at each reachable gate
    - Due to the glitch at the error site
- Find the probability of each waveform
  - Using an approach similar to logic derating
- Compute time-logic derating at each node
- Compute overall SER

Example

Validation
- Monte-Carlo simulation
  - Inject glitches at the outputs of random gates
    - At random times
  - Perform timing-accurate simulation
  - Identify whether the error is captured in a flip-flop
  - Compute soft error rate
  - Stop when the computed value reaches the confidence interval
    - E.g., 3% error margin
    - Or if the simulation doesn’t converge after N iterations
- Too time-consuming

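The stopping rule on the Validation slide can be sketched as a generic Monte-Carlo loop. This is not the authors' simulator: the `trial` callback stands in for one glitch injection plus timing-accurate simulation, and the normal-approximation confidence interval, the 95% z-value, and the minimum sample count are assumptions made for this example.

```python
import math
import random
from typing import Callable, Tuple


def monte_carlo_rate(
    trial: Callable[[random.Random], bool],  # True if the error was latched
    rng: random.Random,
    rel_margin: float = 0.03,   # stop at a 3% relative error margin
    z: float = 1.96,            # ~95% confidence (assumption)
    min_iters: int = 1_000,     # avoid stopping on a tiny sample (assumption)
    max_iters: int = 1_000_000, # N: give up and report "not converged"
) -> Tuple[float, int, bool]:
    """Return (estimated capture probability, iterations used, converged?)."""
    hits = 0
    for n in range(1, max_iters + 1):
        if trial(rng):
            hits += 1
        if n >= min_iters and hits > 0:
            p = hits / n
            half_width = z * math.sqrt(p * (1.0 - p) / n)
            if half_width <= rel_margin * p:
                return p, n, True
    return hits / max_iters, max_iters, False


def fake_injection(rng: random.Random) -> bool:
    """Placeholder for a real simulation: error captured 20% of the time."""
    return rng.random() < 0.2


estimate, iterations, converged = monte_carlo_rate(fake_injection, random.Random(0))
print(f"estimate={estimate:.3f} after {iterations} trials, converged={converged}")
```

Even this toy case needs on the order of ten thousand trials to reach a 3% margin, and rarer capture events need far more, which illustrates why the slide calls the approach too time-consuming.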
Results: Timing Derating

                    Run time (sec)                 w=50ns               w=70ns
Circuit   #Gates   MC Sim   Proposed   Speedup   % Diff  Sim Status   % Diff  Sim Status
s298         119     4751       0.26     18272     0.47  C              5.03  NC
s344         160     2517       0.35      7191     1.81  C              3.3   NC
s386         159     4187       0.15     27913     0.91  C              0.51  C
s400         164     4237       1.5       2825     2.45  NC             1.01  NC
s526         193     7848       1.6       4905     2.59  NC             3.48  NC
s1196        529     3328       2.2       1513     0.45  C              1.47  C
s1238        508     9174       3.3       2780     1.49  C              0.92  C
s1423        657    17238       3.9       4420     1.11  C              2.25  C
s1488        653    11562       4.8       2409     1.12  C              1.29  C
s35932     16065      N/A     537            -     -     NC             -
s38417     19253      N/A     671            -     -     NC             -
s38584     22179      N/A     645            -     -     NC             -
average        -     7204     155         8025     1.37  -              2.14  -

SER vs Glitch Width

Summary
- Analytical method for SER estimation at logic-level
  - Logic and timing derating
  - Based on signal probabilities
  - Traversing topological paths in the netlist
- Very fast and accurate
  - Compared to Monte-Carlo fault injection
  - 4-5 orders of magnitude faster
  - More than 96% accurate
- Application
  - Reliability measurement
  - Cost-effective soft error hardening

More Issues
- Soft error hardening
  - For individual gate
    - Gate sizing
    - Using isolation device (c-pass transistors)
    - …
  - For entire design
- Balancing reliability and overheads
  - Area, power, delay

References
- J. Kumar, M.B. Tahoori, “Use of Pass Transistor Logic to Minimize the Impact of Soft Errors in Combinational Circuits”, in Workshop on System Effects of Logic Soft Errors (SELSE), 2005.
- G. Asadi, M.B. Tahoori, “An Analytical Approach for Soft Error Rate Estimation in Digital Circuits”, in IEEE International Symposium on Circuits and Systems (ISCAS), 2005.
- G. Asadi, V. Sridharan, M.B. Tahoori, D. Kaeli, “Balancing Performance and Reliability in the Memory Hierarchy”, in IEEE Boston Area Architecture (BARC) Workshop, 2005.
- G. Asadi, M.B. Tahoori, “Soft Error Mitigation for SRAM-based FPGAs”, in VLSI Test Symposium (VTS), 2005.
- G. Asadi, V. Sridharan, M.B. Tahoori, D. Kaeli, “Balancing Performance and Reliability in the Memory Hierarchy”, in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2005.
- G. Asadi, M.B. Tahoori, “An Accurate SER Estimation Method Based on Propagation Probability”, in Design Automation and Test in Europe (DATE) Conference, 2005.
- G. Asadi, M.B. Tahoori, “Soft Error Rate Estimation and Mitigation for SRAM-based FPGAs”, in ACM International Conference on Field Programmable Gate Arrays (FPGA), 2005.

Questions?

Origin of Cosmic Rays
- Cosmic rays come from deep space
- Figure 2, Ziegler, et al., “IBM experiments in soft fails in computer electronics (1978 - 1994),” IBM J. of R. & D., Vol. 40, No. 1, Jan. 1996.

Computing FIT Rate of a Chip
- FIT Rate Law: the FIT rate of a system is the sum of the FIT rates of its individual components
- Vulnerable Bit Law: the FIT rate of a chip is the sum of the FIT rates of the vulnerable bits in that chip
- Total Soft Error FIT = Σ over each vulnerable device i of (intrinsic error rate_i × vulnerability factor_i)
- Vulnerability factor = fraction of faults that become errors
- Vulnerability factor is also known as “derating factor” and “soft error sensitivity (SES)”

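As a rough illustration of the Vulnerable Bit Law, the sketch below sums per-structure contributions, with each structure described by a bit count, a per-bit intrinsic FIT rate, and a vulnerability factor. The structure names and every number are hypothetical placeholders, not measurements from the talk.

```python
from typing import NamedTuple, Sequence


class Structure(NamedTuple):
    name: str
    bits: int                    # number of storage bits in the structure
    intrinsic_fit_per_bit: float # raw upset rate of one bit, in FIT
    vulnerability_factor: float  # fraction of faults that become errors


def chip_fit(structures: Sequence[Structure]) -> float:
    """Total soft error FIT = sum_i (intrinsic rate_i * vulnerability factor_i)."""
    return sum(
        s.bits * s.intrinsic_fit_per_bit * s.vulnerability_factor
        for s in structures
    )


# Hypothetical chip description.
chip = [
    Structure("register_file", bits=4_096, intrinsic_fit_per_bit=0.001,
              vulnerability_factor=0.25),
    Structure("instruction_queue", bits=8_192, intrinsic_fit_per_bit=0.001,
              vulnerability_factor=0.5),
]
total_chip_fit = chip_fit(chip)
print(f"chip soft error rate = {total_chip_fit:.3f} FIT")
```

Grouping bits by structure is a convenience: bits in the same array usually share an intrinsic rate, while the vulnerability factor depends on how the workload uses that array.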
Issues: Output Dependency
- The same error can propagate to multiple outputs
- Solution: forward the signal probability of one PO to the other PO
  - The SP of PO_k is forwarded to the next stages instead of the EPP of PO_k
