Radiation-Induced Soft Errors: An Architectural Perspective
Shubu Mukherjee, FACT Group


1 Radiation-Induced Soft Errors: An Architectural Perspective
Shubu Mukherjee (1), Joel Emer (2), & Steven K. Reinhardt (1,3)
“If a problem has no solution, it may not be a problem, but a FACT, not to be solved, but to be coped with over time,” Shimon Peres, Nobel Laureate
(1) Fault Aware Computing Technology (FACT) Group, Intel
(2) VSSAD, Intel
(3) University of Michigan, Ann Arbor
11th International Symposium on High-Performance Computer Architecture (HPCA), 2005

2 Evidence of Cosmic Ray Strikes
- Documented strikes in large servers found in error logs
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
- Sun Microsystems, 2000 (R. Baumann, Workshop talk)
  - Cosmic ray strikes on L2 cache with defective error protection
    - caused Sun’s flagship servers to suddenly and mysteriously crash!
  - Companies affected
    - Baby Bell (Atlanta), America Online, eBay, & dozens of other corporations
    - Verisign moved to IBM Unix servers (for the most part)

3 Reactions from Companies
- Typical server system data corruption target around 1000 years MTBF
  - very hard to achieve this goal in a cost-effective way
  - Bossen, 2002 IRPS Workshop Talk
- Fujitsu SPARC in 130 nm technology (2003)
  - 80% of 200k latches protected with parity
  - compare with very few latches protected in McKinley
  - ISSCC, 2003

4 Evolution of a Product Team’s Psyche
- Shock
  - “SER is the crabgrass in the lawn of computer design”
- Denial
  - “We will do the SER work two months before tapeout”
- Anger
  - “Our reliability target is too ambitious”
- Acceptance
  - “You can deny physics only for so long”

5 Outline
- Faults from Cosmic Rays
- Terminology
- Computing a chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

6 Strike Changes State of a Single Bit
[Figure: a stored bit shown with values 0 and 1]

7 Impact of Neutron Strike on a Si Device
- Secondary source of upsets: alpha particles from packaging
- Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
[Figure: neutron strike on a transistor device, with source and drain labeled]

8 Cosmic Rays Come From Deep Space
- Neutron flux is higher at higher altitudes
[Figure: protons (p) and neutrons (n) cascading down to Earth’s surface]

9 Impact of Elevation
- 3x - 5x increase in Denver at 5,000 feet
- 100x increase in airplanes at 30,000+ feet
Source: Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics ( ),” IBM J. of R. & D., Vol. 40, No. 1, Jan

10 Physical Solutions are hard
- Shielding?
  - No practical absorbent (e.g., approximately > 10 ft of concrete)
  - unlike alpha particles
- Technology solution: SOI?
  - Partially-depleted SOI of some help, effect on logic unclear
  - Fully-depleted SOI may help, hard to manufacture in high volumes
- Radiation-hardened cells?
  - 10x improvement possible with significant penalty in performance, area, cost
  - 2-4x improvement may be possible with less penalty
- We think some of these techniques will help alleviate the impact of Soft Errors, but not completely remove it

11 Outline
- Faults from Cosmic Rays
- Terminology
- Computing a chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

12 Strike Changes State of a Single Bit
[Figure: a stored bit shown with values 0 and 1]

13 Strike on state bit (e.g., in register file)
[Flowchart]
- Bit read?
  - no: benign fault, no error
  - yes: Bit has error protection?
    - no: Does bit matter?
      - yes: Silent Data Corruption (SDC)
      - no: benign fault, no error
    - Error is only detected (e.g., parity + no recovery): Detected, but unrecoverable error (DUE)
    - Error can be corrected (e.g., ECC): no error
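The decision flow on this slide can be sketched as a small classifier. This is a minimal illustration of one reading of the flowchart, not code from the talk; the names `Protection` and `classify_strike` are hypothetical.

```python
from enum import Enum


class Protection(Enum):
    """Hypothetical labels for the three protection cases on the slide."""
    NONE = "none"                # unprotected bit
    DETECT_ONLY = "detect_only"  # e.g., parity with no recovery
    CORRECT = "correct"          # e.g., ECC


def classify_strike(bit_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Classify the outcome of a particle strike on a single state bit."""
    if not bit_read:
        # A flipped bit that is never read cannot affect anything.
        return "benign fault (no error)"
    if protection is Protection.CORRECT:
        return "no error (corrected)"
    if protection is Protection.DETECT_ONLY:
        return "DUE"
    # Unprotected bit: the outcome depends on whether the bit matters.
    return "SDC" if bit_matters else "benign fault (no error)"


if __name__ == "__main__":
    print(classify_strike(True, Protection.NONE, True))          # SDC
    print(classify_strike(True, Protection.DETECT_ONLY, True))   # DUE
    print(classify_strike(False, Protection.NONE, True))         # benign fault (no error)
```

Note that in this reading a detect-only bit reports a DUE whether or not the bit matters, because the hardware cannot tell the difference at detection time.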

14 Definitions 1
- SDC = Silent Data Corruption
- DUE = Detected & unrecoverable error
- SER = Soft Error Rate = Total of SDC & DUE

15 Definitions 2
- Interval-based
  - MTTF = Mean Time to Failure
  - MTTR = Mean Time to Repair
  - MTBF = Mean Time Between Failures = MTTF + MTTR
  - Availability = MTTF / MTBF
- Rate-based
  - FIT = Failure in Time = 1 failure in a billion hours
  - 1 year MTTF = 10^9 / (24 * 365) FIT = 114,155 FIT
  - SER FIT = SDC FIT + DUE FIT
- Hypothetical Example
  - Cache: 0 FIT + IQ: 100K FIT + FU: 58K FIT = Total of 158K FIT
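The interval-based and rate-based definitions convert into each other with one line of arithmetic. The sketch below reproduces the slide's 114,155 FIT figure; the function names are illustrative, and a 365-day year is assumed as on the slide.

```python
HOURS_PER_YEAR = 24 * 365   # the slide uses a 365-day year
BILLION_HOURS = 1e9         # 1 FIT = 1 failure per 10^9 hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Convert a mean time to failure (in years) to a FIT rate."""
    return BILLION_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Convert a FIT rate back to a mean time to failure (in years)."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR (same time unit)."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    print(round(mttf_years_to_fit(1)))     # 114155
    print(round(mttf_years_to_fit(1000)))  # 114
    # The slide's hypothetical 158K FIT total corresponds to roughly 0.72 years MTTF.
    print(fit_to_mttf_years(158_000))
```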

16 Typical Server System Reliability Goals (D.C. Bossen, 2002 IRPS Tutorial Reliability Notes)
Error Type | System MTBF Goal
SDC (Silent Data Corruption) | 1000 years (114 FIT)
DUE for system crash | 25 years
DUE for application crash | 10 years

17 Outline
- Faults from Cosmic Rays
- Terminology
- Computing a chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

18 Measuring a Chip’s FIT
- Like performance measurement
- Chip: physically bombard with neutrons in neutron accelerators; expose to alpha particles in radioactive foils
- Chip: study error logs of running machines
- Circuit Models + RTL: obtain raw error rate; statistical fault injection
- Circuit Models + Performance Model: obtain raw error rate; work in progress in FACT group

19 Computing FIT rate of a Chip
- FIT Rate Law: FIT rate of a system is the sum of the FIT rates of its individual components
- Vulnerable Bit Law: FIT rate of a chip is the sum of the FIT rate of vulnerable bits in that chip!
- Total Soft Error FIT: $\text{FIT}_{\text{total}} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
  - Vulnerability Factor = fraction of faults that become errors
  - Vulnerability Factor is also known as “derating factor” and “soft error sensitivity (SES).”
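Both laws on this slide are plain sums, which a short sketch makes concrete. The component values come from the deck's own hypothetical example (Cache 0 FIT, IQ 100K FIT, FU 58K FIT); the per-device numbers in the second call are made up for illustration, and the function names are not from the talk.

```python
from typing import Iterable, Mapping, Tuple


def system_fit(component_fits: Mapping[str, float]) -> float:
    """FIT Rate Law: the system FIT rate is the sum of its components' FIT rates."""
    return sum(component_fits.values())


def total_soft_error_fit(devices: Iterable[Tuple[float, float]]) -> float:
    """Sum of intrinsic error rate times vulnerability factor over vulnerable devices.

    Each device is an (intrinsic_fit, vulnerability_factor) pair.
    """
    return sum(intrinsic * vf for intrinsic, vf in devices)


if __name__ == "__main__":
    # Hypothetical example from the Definitions 2 slide.
    print(system_fit({"cache": 0, "IQ": 100_000, "FU": 58_000}))  # 158000

    # Made-up devices: (intrinsic FIT, vulnerability factor).
    print(total_soft_error_fit([(100.0, 0.5), (200.0, 0.25)]))    # 100.0
```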

20 FIT Equation: Raw Soft Error Rate
$\text{FIT}_{\text{total}} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
- SRAM cells
  - FIT/bit decreasing slightly across generations with the usual voltage scaling
  - FIT/chip increasing overall
- Latch cells
  - FIT/bit constant across generations with the usual voltage scaling
- Static Logic Gates
  - see later
- Dynamic Logic
  - keeper similar to latches, but extra reduction due to specific function implemented

21 FIT Equation: Vulnerability Factors
$\text{FIT}_{\text{total}} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
- Vulnerability Factor = Timing Vulnerability Factor * Architectural Vulnerability Factor
- Timing Vulnerability Factor
  - fraction of time bit is vulnerable
- Architectural Vulnerability Factor (AVF)
  - fraction of time bit matters for final output of a program
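To make the decomposition concrete, the sketch below computes the FIT contribution of one structure as bits times raw FIT/bit times TVF times AVF. Every number in the example is an assumption chosen for illustration (it is not measured data), and the `Structure` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    """A hardware structure described by illustrative, assumed parameters."""
    name: str
    bits: int               # number of vulnerable (e.g., unprotected) bits
    raw_fit_per_bit: float  # intrinsic error rate per bit
    tvf: float              # timing vulnerability factor, in [0, 1]
    avf: float              # architectural vulnerability factor, in [0, 1]

    def vulnerability_factor(self) -> float:
        """Vulnerability factor = TVF * AVF."""
        return self.tvf * self.avf

    def fit(self) -> float:
        """FIT contribution = bits * raw FIT/bit * vulnerability factor."""
        return self.bits * self.raw_fit_per_bit * self.vulnerability_factor()


if __name__ == "__main__":
    # Assumed values: 40,000 latch bits, 0.001 FIT/bit, TVF 50%, AVF 20%.
    latches = Structure("latches", bits=40_000, raw_fit_per_bit=0.001, tvf=0.5, avf=0.2)
    print(latches.vulnerability_factor())  # about 0.1
    print(latches.fit())                   # about 4 FIT
```

Keeping TVF and AVF as separate fields mirrors the slide: the first is a circuit-level property, the second depends on the architecture and the workload.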

22 Timing Vulnerability Factor
- SRAM cells
  - 100%
- Latch cells
  - ~ 50%
  - depends on min. delay of signal propagation through logic chain (ref: Norbert Seifert, Intel)
- Static Logic Gates
  - Shivakumar, et al. (DSN 2002) predict near zero today
  - signal attenuation, latch window, & logical masking
  - may be a problem in future
- Dynamic Logic
  - same as latches

23 Architectural Vulnerability Factor
Does a bit matter?
- Branch Predictor
  - Doesn’t matter at all (AVF = 0%)
- Program Counter
  - Almost always matters (AVF ~ 100%)
- Computing AVF for complex structures
  - Statistical Fault Injection
  - ACE Analysis (next)
  - Other methods being researched

24 Architecturally Correct Execution (ACE)
- ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
- Anything else (un-ACE path) can be derated away
[Figure: data flow graph from Program Input to Program Outputs]

25 Example of un-ACE instruction: Dynamically Dead Instruction
- Most bits of an un-ACE instruction do not affect program output
[Figure: Dynamically Dead Instruction]
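One simple way to spot dynamically dead instructions in a trace is to look for register results that are overwritten before anything reads them. The sketch below is an assumption about how such a check could look, not the analysis used in the talk. It handles only register results, treats results still unread at the end of the trace as live, and ignores transitively dead instructions. The trace format and `find_dynamically_dead` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A trace entry is (destination register or None, list of source registers).
Instruction = Tuple[Optional[str], Sequence[str]]


def find_dynamically_dead(trace: Sequence[Instruction]) -> List[int]:
    """Return indices of instructions whose result is overwritten before being read."""
    dead: List[int] = []
    unread_writer: Dict[str, int] = {}  # register -> index of its latest unread write

    for index, (dest, sources) in enumerate(trace):
        # Reads happen before the write, so "r1 = r1 + 1" keeps the old r1 live.
        for src in sources:
            unread_writer.pop(src, None)
        if dest is not None:
            if dest in unread_writer:
                # The previous value of dest was never read: its writer is dead.
                dead.append(unread_writer[dest])
            unread_writer[dest] = index
    return sorted(dead)


if __name__ == "__main__":
    trace = [
        ("r1", []),            # 0: r1 = load ...
        ("r2", ["r1"]),        # 1: r2 = r1 + 4
        ("r1", []),            # 2: r1 = load ...   (never read: dead)
        ("r1", []),            # 3: r1 = load ...
        (None, ["r1", "r2"]),  # 4: store r1 -> [r2]
    ]
    print(find_dynamically_dead(trace))  # [2]
```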

26 Dynamic Instruction Breakdown
[Figure: dynamic instruction breakdown, average across Spec2K slices]

27 Mapping ACE & un-ACE Instructions to the Instruction Queue
[Figure: instruction queue entries labeled Wrong-Path Inst, Idle, NOP, Prefetch, ACE Inst, and Ex-ACE Inst, grouped under Architectural un-ACE and Micro-architectural un-ACE]

28 Instruction Queue
- ACE percentage = AVF = 29%
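The equality "ACE percentage = AVF" can be read as: the AVF of a structure is the fraction of its bit-cycles that hold ACE state. The sketch below computes that fraction from a per-cycle count of ACE bits. The queue size and occupancy numbers are invented for illustration, and `structure_avf` is a hypothetical helper rather than the tool behind the 29% figure.

```python
from typing import Sequence


def structure_avf(ace_bits_per_cycle: Sequence[int], total_bits: int) -> float:
    """AVF = (sum of ACE bits over all cycles) / (total bits * number of cycles)."""
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive structure size")
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


if __name__ == "__main__":
    # Invented example: a 32-bit structure observed for 4 cycles.
    # Entries holding wrong-path, idle, or otherwise un-ACE state count as 0 ACE bits.
    print(structure_avf([8, 16, 8, 0], total_bits=32))  # 0.25
```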

29 Punchline: Simple Conceptual Model
- FIT rate = sum of FIT rate of “vulnerable” bits
- Vulnerable bits (RAM & latch cells)
  - for SDC, this means unprotected bits
- Rule of thumb: vulnerability factor
  - architectural vulnerability factor ~= 20%
  - timing vulnerability factor = 50% for latches & 13% dynamic
- Rule of thumb: raw FIT rate
  - 0.001 – FIT/bit (Normand 1996, Tosaka 1996)

30 # Vulnerable Bits Growing with Moore’s Law
- Fujitsu SPARC has 20% of 200k latches vulnerable in 2003
  - aggressive designs have significantly higher number of vulnerable latches
- Additional SDC FIT from RAM cells, static logic, & dynamic logic
- Higher SDC FIT in multiprocessor systems
  - Gap ~= 100x for 8 processor system!
  - A data center with 300 such systems will encounter a data corruption almost every week
[Figure label: 12x GAP]

31 Outline
- Faults from Cosmic Rays
- Terminology
- Computing a chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

32 The Soft Error Opportunity
- Key differences with classical fault tolerance
  - FIT budget 100x – 1000x more than Tandem-style machines
  - Traditional “big hammer” solutions too expensive for volume market & can be overkill
- Why does architecture play a critical role?
  - error often defined in architecture & microarchitecture
    - e.g., strike on a branch predictor doesn’t cause an error
  - architectural solutions are often more cost-effective
    - one bit of parity can protect 64 bits, overhead < 2%
    - radiation-hardened cells can have overhead around 20-40%
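The parity claim is easy to check in a few lines: one parity bit over a 64-bit word costs 1/64, about 1.6% storage, and detects any single-bit flip. The sketch below uses even parity and hypothetical helper names. It also shows the known limitation that parity alone misses an even number of flips and cannot correct anything.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the 64-bit word has an odd number of 1 bits."""
    return bin(word & WORD_MASK).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """Return True if the word still matches the parity stored with it."""
    return parity_bit(word) == stored_parity


if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    stored = parity_bit(word)

    struck = word ^ (1 << 17)          # a single-bit upset
    print(parity_ok(word, stored))     # True
    print(parity_ok(struck, stored))   # False: the upset is detected

    double = word ^ 0b11               # two flipped bits
    print(parity_ok(double, stored))   # True: parity misses an even number of flips

    print(1 / WORD_BITS)               # 0.015625, i.e. under 2% storage overhead
```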

33 Research Directions
1. AVF characterization of processor structures
  - architectural abstraction for soft errors
2. AVF reduction techniques & tradeoff with performance
  - reduce exposure
  - reduce false errors
  - fault detection & recovery techniques
3. Protecting un-core components
  - data flows unchanged
  - microarchitectural state changes
4. Software solutions
  - e.g., the Princeton CRAFT approach
  - but, software doesn’t have full visibility into hardware
5. AVF vs. AF (activity factor) tradeoff
  - structures with high AF and low AVF may require a closer look
6. Other sources of soft errors, definitions carry over
  - timing errors, Vcc reduction errors, etc.

34 Summary
- Soft Errors: real problem today
  - Primary culprit: neutrons from deep space
  - Industry seeing this now
- Major problem in next few technology generations
  - Problem scales with Moore’s Law, die size, & system size
  - Industry will have a hard time making chips reliable
- SER effort across Intel
  - number of projects aimed at modeling, measuring, detecting, and correcting soft errors

35 BACKUPS FOLLOW

36 Faults, Errors, Failures (From Pradhan, “Fault-Tolerant Computer System Design”)
- Fault
  - defect in hardware or software component
  - defect for cosmic ray = upset from high-energy neutron strike
- Error
  - manifestation of a fault, resulting in deviation from accuracy
  - faults cause errors (but not vice versa)
  - a masked fault is not an error!
  - vulnerability factor = fraction of faults that cause errors (Intel term)
- Failure
  - non-performance of expected action
  - errors cause failures (but not vice versa)
  - a corrected error doesn’t cause a failure

37 References
- Documented Strikes
  - (Sun Microsystems) R. Baumann, “Soft Errors in Commercial Semiconductor Technology,” 2002 IRPS Tutorial Notes
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
- Raw soft error rate: – FIT/bit
  - Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G.A. Woffinden, and S.A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” VLSI Symposium on VLSI Technology Digest of Technical Papers, 1996
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
- Typical Server System Goals
  - D.C. Bossen, “CMOS Soft Errors and Server Design,” IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pp. 121_07.1 – 121_07.6, April 7, 2002.

38 FIT/bit for SRAM Cells decreasing
- Shivakumar, et al., “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinatorial Logic,” DSN, 2002
  - FIT/bit decreasing, FIT/chip increasing
- Hareland, et al., “Impact of CMOS Process Scaling and SOI on the soft error rates of logic processes,” 2001 Symposium on VLSI Technology Digest of Technical Papers
  - FIT/bit decreasing
- R. Baumann, 2002 IRPS Tutorial Notes
  - FIT/bit decreasing because of voltage saturation
  - FIT/bit increasing in products with B10

39 FIT/bit for Latches Constant
- Shivakumar, et al., “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinatorial Logic,” DSN, 2002
  - prediction using models
  - FIT/bit constant (within 2x error range)
- Karnik, et al., “Scaling Trends of Cosmic Rays induced Soft Errors in Static Latches beyond 0.18µ,” 2001 Symposium on VLSI Circuits Digest of Technical Papers
  - Neutron beam experiment
  - FIT/bit constant

40 Raw FIT Equation
- Raw Neutron FIT rate
  - $\text{FIT} \propto \text{Neutron Flux} \times \text{Area} \times e^{-Q_{crit}/Q_s}$
- When $Q_{crit} \gg Q_s$
  - exponential dominates
  - we are still in this region
- When $Q_{crit} \le Q_s$
  - reached saturation
  - area dominates, so FIT/bit will continue to decrease with area
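Because the slide gives only a proportionality, the sketch below computes a relative raw FIT in arbitrary units, which is enough to compare two design points or two altitudes. The function name and all input values are assumptions for illustration, and the proportionality constant is deliberately left out.

```python
import math


def relative_raw_fit(neutron_flux: float, area: float, q_crit: float, q_s: float) -> float:
    """Relative raw neutron FIT: flux * area * exp(-Qcrit / Qs), in arbitrary units."""
    if q_s <= 0:
        raise ValueError("q_s must be positive")
    return neutron_flux * area * math.exp(-q_crit / q_s)


if __name__ == "__main__":
    baseline = relative_raw_fit(neutron_flux=1.0, area=1.0, q_crit=10.0, q_s=1.0)

    # Raw FIT scales linearly with flux, so 3x the flux gives 3x the raw FIT.
    high_altitude = relative_raw_fit(neutron_flux=3.0, area=1.0, q_crit=10.0, q_s=1.0)
    print(high_altitude / baseline)

    # Made-up scaling step: the cell area halves while Qcrit drops by one unit of Qs.
    # The exponential term grows by e, so the per-cell rate rises despite the smaller area.
    scaled = relative_raw_fit(neutron_flux=1.0, area=0.5, q_crit=9.0, q_s=1.0)
    print(scaled / baseline)
```

The second comparison illustrates the "exponential dominates" regime: when Qcrit is much larger than Qs, a small drop in Qcrit can outweigh a large drop in area.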

41 $e^{-Q_{crit}/Q_s}$ trends (Shivakumar et al., DSN 2002)
[Figure: $e^{-Q_{crit}/Q_s}$ increasing; area decreasing quadratically]

42 SRAM: FIT/bit decreasing
[Figure] Source: Shivakumar, et al., DSN 2002

43 Latch: FIT/bit roughly constant
[Figure] Source: Shivakumar, et al., DSN 2002

44 Timing Vulnerability Factor for latches
- Timing vulnerability factor = latch time / clock time ~= 50%
[Figure: flow-through latch timing diagram showing data, setup time, and hold time]

45 Energy Spectrum of Cosmic Ray Particles
- Neutrons constitute > 96% of cosmic ray particles at sea level
- Higher # of lower energy particles (significant)
Source: Figure 4, Ziegler, et al., “Terrestrial Cosmic Rays,” IBM J. of R. & D., Vol. 40, No. 1, Jan

46 SFI vs. ACE analysis
Criterion | SFI | ACE
Accuracy of microarchitectural un-ACE | Better than ACE analysis | Conservative
Accuracy of architectural un-ACE | Conservative | Better than SFI (e.g., covers dynamically dead instructions)
Insight | Per-structure insights harder | Little’s Law & per-structure breakdown easier
# of experiments | Large # required to be statistically significant | Small # of experiments can give good accuracy

