Microprocessor Reliability


1 Microprocessor Reliability
Robert Pawlowski ECE 570 – 2/19/2013

2 Reliability
Involves different aspects of a processor that can affect performance and functionality.
Reliability problems can ultimately reduce the lifetime of the processor.
Issues typically manifest themselves at the device level.
Solutions can be implemented at multiple design levels.

3 Why the concern?
Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variabilities: gate length/doping concentration variations, temperature, and supply voltage droops.
This decreases processor yield.
Decreasing device sizes lead to an increased effect of external issues.

4 Outline
Error Classification
Hard Errors
Soft Errors
Sources of radiation
Device/Circuit approaches
Architectural approaches
Error detection
Error correction
System level impact

5 Processor Error Classification
Hard errors will result in permanent processor failure. Processor lifetime is inversely proportional to the hard error rate.
Soft errors do not permanently damage the device.
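
The inverse relation between lifetime and hard error rate can be illustrated with a minimal sketch. It assumes a constant failure rate (an exponential lifetime model), which the slide does not state, and expresses the rate in FIT (failures per 10^9 device-hours). The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate given in FIT.

    Assumption: 1 FIT = 1 failure per 1e9 device-hours, constant over time.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return 1e9 / fit


def mttf_years(fit: float) -> float:
    """Same quantity expressed in years."""
    return mttf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    # Doubling the hard error rate halves the expected lifetime.
    for rate in (1000, 2000, 4000):
        print(rate, "FIT ->", round(mttf_years(rate), 1), "years")
```

Under this model, halving the hard error rate doubles the expected lifetime. Wear-out failures with an increasing rate would need a different lifetime distribution.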

6 Hard Errors
Extrinsic failures: caused by process and manufacturing defects; occur with decreasing rate over time; no impact from micro-architecture.
Intrinsic failures: related to processor wear-out; occur with increasing rate over time; related to wafer packaging, process parameters, and processor design.

7 Hard Errors

8 Soft Errors
Occur in both memory and logic.
External radiation is the main issue in memory: alpha particles, high-energy neutrons, thermal neutrons.
Different causes of transient errors in logic: external radiation; supply voltage droop; power supply fluctuations; ground bounce, cross-talk; process variation, temperature.
Affect delay of computational paths.

10 Radiation-Induced Soft Errors
An ionized particle strike causes a state change, with no permanent damage (unlike a hard error).
Combinational logic: Single Event Transients (SET).
Memory cells: Single Bit Upset (SBU), Multi Bit Upset (MBU).
Three causes of soft errors: alpha particles, thermal neutrons, high-energy neutrons.

11 Alpha-Particles
Emitted from impurities in packaging materials.
Create electron-hole pairs through direct ionization.
Range for a 10 MeV particle: < 100 µm.
Typical energy: 4-9 MeV.
Improved manufacturing trends have reduced the effect: purified materials, shielding layers.

12 Neutrons
Result of cosmic ray reactions with the atmosphere.
High-energy neutrons react with chip materials.
Concrete is the only shielding material: 1.4x lower flux per foot of thickness.
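
The shielding figure above can be turned into a small sketch. It assumes the 1.4x reduction compounds with each additional foot of concrete (exponential attenuation), which the slide implies but does not state. The function name and default parameter are hypothetical.

```python
def relative_neutron_flux(thickness_ft: float, factor_per_ft: float = 1.4) -> float:
    """Fraction of the unshielded high-energy neutron flux remaining
    behind thickness_ft feet of concrete.

    Assumption: the flux drops by factor_per_ft for every foot of
    thickness, compounding exponentially.
    """
    if thickness_ft < 0:
        raise ValueError("thickness must be non-negative")
    return factor_per_ft ** (-thickness_ft)


if __name__ == "__main__":
    for feet in (0, 1, 3, 10):
        print(feet, "ft ->", round(relative_neutron_flux(feet), 3))
```

Even several feet of concrete leave a substantial fraction of the flux under this assumption, which suggests why shielding alone is not a practical fix for most systems.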

13 Neutrons
Thermal neutrons (<<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer.
Produce ionized particles that can cause soft errors.
Solution: remove BPSG from advanced processes.
Mostly solved, though single event upsets (SEUs) are still found in 45 nm and 90 nm processes.

15 Device-level solutions
Larger device sizes give larger capacitance, which increases the amount of charge necessary to flip a bit (critical charge).
Multiple VT design: sensitivity to variation at low VDD may limit effectiveness.
Body biasing is also common to both radiation hardening and variation tolerance.
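
The link between node capacitance and critical charge can be sketched with the common first-order approximation Q_crit ≈ C_node · VDD. This approximation is an assumption added here, not something the slides give, and it ignores the restoring current of the feedback devices. All names are hypothetical.

```python
def critical_charge_fc(node_capacitance_ff: float, vdd_volts: float) -> float:
    """First-order critical charge in femtocoulombs (fF * V = fC).

    Assumption: Q_crit ~ C_node * VDD, ignoring feedback restoring current.
    """
    return node_capacitance_ff * vdd_volts


def is_upset(collected_charge_fc: float,
             node_capacitance_ff: float,
             vdd_volts: float) -> bool:
    """True if a particle strike deposits more charge than the node can absorb."""
    return collected_charge_fc > critical_charge_fc(node_capacitance_ff, vdd_volts)


if __name__ == "__main__":
    strike_fc = 1.5  # hypothetical collected charge
    print(is_upset(strike_fc, node_capacitance_ff=1.0, vdd_volts=1.0))  # small node
    print(is_upset(strike_fc, node_capacitance_ff=2.0, vdd_volts=1.0))  # larger node
```

In this model, the same strike that flips the smaller node is absorbed by the larger one, and lowering VDD reduces the critical charge for a fixed capacitance.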

16 Circuit-level solutions
DICE (dual interlocked storage cell): used for SRAM, flip-flops, and latches.
Built-in current sensors on supply lines of memory cells.

18 Modular redundancy
Dual Modular Redundancy (DMR)
Triple Modular Redundancy (TMR)
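
As a minimal behavioural sketch of the two schemes: DMR compares two copies and can only detect a mismatch, while TMR takes a bitwise majority vote over three copies and can mask a fault in any single copy. The function names are hypothetical, and the model ignores faults in the comparator or voter itself.

```python
def dmr_mismatch(a: int, b: int) -> bool:
    """DMR: two copies are compared; a mismatch signals an error
    but does not tell which copy is wrong."""
    return a != b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority of three copies.

    Each output bit takes the value held by at least two of the inputs,
    so an error confined to one copy is masked.
    """
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    good = 0b1010
    faulty = good ^ 0b1000  # one copy suffers a single-bit flip
    print(dmr_mismatch(good, faulty))             # error detected
    print(bin(tmr_vote(good, good, faulty)))      # error corrected by voting
```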

19 Redundant Circuits
Redundancy increases area/power.
DMR/TMR in sub/near-VT: timing variation between circuits increases.
Utilization of redundant lanes for parallel operation can increase throughput at low VDD.

20 Self-Checking Circuits
Partition the circuit into smaller blocks, with an error checker for each block.
Use error detection codes: Berger codes, arithmetic codes.
Increases circuit delay for error computation.
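
A Berger code, one of the error detection codes named above, can be sketched as follows. The check symbol is the count of 0 bits in the information word, which lets a checker detect any unidirectional error (all flips in the same direction). The function names and the fixed word width are hypothetical choices for illustration.

```python
def berger_check(data: int, n_bits: int) -> int:
    """Berger check symbol: the number of 0 bits among the n_bits information bits."""
    mask = (1 << n_bits) - 1
    ones = bin(data & mask).count("1")
    return n_bits - ones


def berger_is_valid(data: int, check: int, n_bits: int) -> bool:
    """Recompute the check symbol and compare it with the stored one."""
    return berger_check(data, n_bits) == check


if __name__ == "__main__":
    word, width = 0b1011, 4
    check = berger_check(word, width)      # stored alongside the data
    corrupted = word & 0b1001              # a 1 -> 0 flip in the data
    print(berger_is_valid(word, check, width))
    print(berger_is_valid(corrupted, check, width))
```

A 1-to-0 flip in the data raises the zero count, while a 1-to-0 flip in the check symbol lowers the stored value, so unidirectional errors can never cancel out.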

21 Circuit-Level Speculation
Uses an approximated circuit implementation.
Goal is to reduce the critical path.

22 Tunable Replica Circuits
Mirrors the delay of the critical path.
Monitors for errors over voltage/frequency changes.

23 Timing Speculation
Razor timing error detection.
Designed for transient faults.
Effective against SETs and SBUs on flip-flops.
Requires error recovery.
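
The idea behind Razor-style detection can be sketched behaviourally: a main flip-flop samples at the clock edge, a shadow latch samples the same signal slightly later, and a disagreement flags a timing error that recovery logic must then handle. This is a simplified timing model, not the actual circuit, and all names and parameters are hypothetical.

```python
def razor_sample(arrival_time: float, old_value: int, new_value: int,
                 clock_period: float, shadow_delay: float):
    """Return (main, shadow, error) for one clock cycle.

    The main flip-flop captures new_value only if the data arrives by the
    clock edge; the shadow latch captures it if it arrives by the delayed
    edge. Assumption: data always arrives before the shadow latch closes
    in a correctly designed system.
    """
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value if arrival_time <= clock_period + shadow_delay else old_value
    error = main != shadow
    return main, shadow, error


if __name__ == "__main__":
    # On-time data: both samples agree, no error.
    print(razor_sample(0.9, old_value=0, new_value=1, clock_period=1.0, shadow_delay=0.5))
    # Late data: main holds the stale value, shadow holds the correct one.
    print(razor_sample(1.2, old_value=0, new_value=1, clock_period=1.0, shadow_delay=0.5))
```

When the error flag is raised, the shadow value is the correct one, which is why the scheme needs a recovery mechanism to restore it into the pipeline.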

25 Error Recovery Options in Scalar Processors
Clock gating: a global error signal gates the clock, with a 1-cycle penalty.

26 Error Recovery Options in Scalar Processors
Multiple issue: error signals are propagated to the control unit.
Instructions must be flushed.
The erroneous instruction is then replayed.
2N-cycle penalty.

27 Error Recovery Options in Scalar Processors
Counter-flow pipelining
Micro-rollback

28 Error correcting codes for memories
Most common is the Hamming code.
Check bits are stored when data is written.
Identifies an error and the erroneous bit position.
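
As an illustration of how check bits identify the erroneous bit position, here is a minimal sketch of the textbook Hamming(7,4) code. Real memories typically use wider codes, so the word size here is chosen only for readability, and the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4,
    where each parity bit covers the positions whose index has that bit set.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error found.

    The syndrome, read as a binary number, is the 1-based position of a
    single flipped bit, which is then corrected.
    """
    c = [0] + list(codeword)  # shift to 1-based indexing
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position] ^= 1
    return [c[3], c[5], c[6], c[7]], position


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])   # check bits computed on write
    stored[4] ^= 1                            # soft error flips position 5
    print(hamming74_decode(stored))
```

This code corrects one flipped bit per word. Two flips in the same word produce a misleading syndrome.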

29 Error correcting codes for memories
Single-bit ECC adds area/power and delay.
Low VDD increases delay; hybrid VDD operation will reduce delay.
Overhead increases for multi-bit ECC.
Increased memory density gives a higher probability of MBU.
Current research: increase in the ratio of MBU to total soft error rate (SER) in sub-VT.

31 System-Level Impact
Soft errors can have a large effect on processor functionality.
Increasing issue with further device scaling.
All methods of error detection/correction are costly.
Need to be added to system blocks wisely: SEU distribution, effects of process variation.

32 System-Level Impact
How to determine which blocks have the highest system-level impact? Mostly through simulation.
For radiation: all-encompassing. Includes fault circuit level.
Different models have been developed.
ReStore (University of Illinois at Urbana-Champaign): focuses on the system-level effect of radiation-induced errors.
RAMP (IBM): directed more towards hard errors and processor failure.

33 Questions?

