Microprocessor Reliability Robert Pawlowski ECE 570 – 2/19/2013 1.


1 Microprocessor Reliability
Robert Pawlowski
ECE 570 – 2/19/2013

2 Reliability
Involves different aspects of a processor that can affect performance and functionality.
– Ultimately can reduce the lifetime of the processor.
Issues typically manifest themselves at the device level.
– Solutions can be implemented at multiple design levels.

3 Why the concern?
Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variability.
– Gate length/doping concentration variations
– Temperature
– Supply voltage droops
This decreases processor yield.
Decreasing device sizes → increased effect of external issues

4 Outline
Error Classification
– Hard Errors
– Soft Errors
Sources of radiation
Device/Circuit approaches
Architectural approaches
– Error detection
– Error correction
System level impact

5 Processor Error Classification
Hard errors result in permanent processor failure.
– Processor lifetime is inversely proportional to hard error rate.
Soft errors do not permanently damage the device.
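
The inverse relationship between lifetime and hard error rate can be illustrated with a small sketch. It assumes a constant failure rate expressed in FIT (failures per 10^9 device-hours), which is a standard reliability unit but is not named in the slides; the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def mttf_hours(fit_rate):
    """Mean time to failure in hours, assuming a constant failure rate.

    fit_rate is in FIT: failures per 10**9 device-hours (an assumed unit,
    not taken from the slides).
    """
    if fit_rate <= 0:
        raise ValueError("FIT rate must be positive")
    return 1e9 / fit_rate


def mttf_years(fit_rate):
    """Same quantity expressed in years."""
    return mttf_hours(fit_rate) / HOURS_PER_YEAR


if __name__ == "__main__":
    # Doubling the hard error rate halves the expected lifetime.
    for fit in (1000, 2000, 4000):
        print(fit, "FIT ->", round(mttf_years(fit), 1), "years")
```

Under this assumption, doubling the failure rate halves the expected lifetime.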

6 Hard Errors
Extrinsic failures
– Caused by process and manufacturing defects
– Occur with decreasing rate over time
– No impact from micro-architecture
Intrinsic failures
– Related to processor wear-out
– Occur with increasing rate over time
– Related to wafer packaging, process parameters, and processor design
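
One way to picture the decreasing and increasing failure rates is a Weibull hazard function. The Weibull model is an assumption here (it is widely used in reliability work but is not mentioned in the slides), and the shape and scale values below are purely illustrative.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1).

    shape < 1 gives a failure rate that falls over time (extrinsic-like),
    shape > 1 gives a failure rate that rises over time (wear-out-like).
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    for t in (1.0, 5.0, 10.0):
        extrinsic = weibull_hazard(t, shape=0.5, scale=10.0)  # illustrative values
        intrinsic = weibull_hazard(t, shape=3.0, scale=10.0)  # illustrative values
        print(t, extrinsic, intrinsic)
```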

7 Hard Errors

8 Soft Errors
Occur in both memory and logic
– External radiation is the main issue in memory: alpha particles, high-energy neutrons, thermal neutrons
Different causes of transient errors in logic
– External radiation
– Supply voltage droop, power supply fluctuations
– Ground bounce, cross-talk
– Process variation, temperature
– Affect delay of computational paths

10 Radiation-Induced Soft Errors
Ionized particle strike causes a state change
– No permanent damage (unlike a hard error)
Combinational logic – Single Event Transients (SET)
Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU)
Three causes of soft errors
– Alpha particles
– Thermal neutrons
– High-energy neutrons

11 Alpha Particles
Emitted from impurities in packaging materials.
Create electron-hole pairs through direct ionization
Range for a 10 MeV particle < 100 µm
– Typical energy 4-9 MeV
Improved manufacturing trends → reduced effect
– Purified materials
– Shielding layers

12 Neutrons
Result of cosmic ray reactions with the atmosphere
High-energy neutrons react with chip materials.
Concrete is the only practical shielding material
– 1.4x lower flux per foot of thickness
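
The shielding figure can be turned into a quick estimate. This sketch assumes the 1.4x reduction compounds with each additional foot of concrete (exponential attenuation); the slide only gives the per-foot factor, so the compounding and the function name are assumptions.

```python
def neutron_flux_behind_concrete(flux_outside, thickness_ft, reduction_per_ft=1.4):
    """Estimate neutron flux after passing through concrete.

    Assumes each foot divides the flux by reduction_per_ft (1.4 per the
    slide) and that the reduction compounds with thickness.
    """
    if thickness_ft < 0:
        raise ValueError("thickness must be non-negative")
    return flux_outside / (reduction_per_ft ** thickness_ft)


if __name__ == "__main__":
    for feet in (0, 1, 2, 5, 10):
        remaining = neutron_flux_behind_concrete(1.0, feet)
        print(feet, "ft ->", round(remaining, 3), "of the unshielded flux")
```

Under these assumptions, several feet of concrete are needed for even an order-of-magnitude reduction, which is why shielding is rarely a practical fix for high-energy neutrons.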

13 Neutrons
Thermal neutrons (<<< 1 MeV) react with the boron-doped phosphosilicate glass (BPSG) dielectric layer.
– Produce ionized particles that can cause soft errors
Solution → remove BPSG from advanced processes
Mostly solved
– SEUs still found in 45 nm, 90 nm

15 Device-level solutions
Larger device sizes → larger capacitance
– Increases the amount of charge necessary to flip a bit (critical charge)
Multiple-V_T design
– Sensitivity to variation at low V_DD may limit effectiveness.
Body biasing is also common to both radiation hardening and variation tolerance
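
The link between capacitance and critical charge can be sketched with the first-order approximation Q_crit ≈ C_node × V_DD. That formula is a common simplification and an assumption here (the slides do not give one); the function names and numeric values are illustrative only.

```python
def critical_charge_fc(node_capacitance_ff, vdd_volts):
    """First-order critical charge in femtocoulombs: Q_crit ~ C * V_DD.

    Capacitance is in femtofarads, so fF * V gives fC. This ignores
    restoring currents and pulse shape (a deliberate simplification).
    """
    return node_capacitance_ff * vdd_volts


def is_upset(collected_charge_fc, node_capacitance_ff, vdd_volts):
    """True if the charge collected from a particle strike exceeds Q_crit."""
    return collected_charge_fc > critical_charge_fc(node_capacitance_ff, vdd_volts)


if __name__ == "__main__":
    strike_fc = 1.5  # illustrative collected charge
    print(is_upset(strike_fc, node_capacitance_ff=1.0, vdd_volts=1.0))  # small node
    print(is_upset(strike_fc, node_capacitance_ff=2.0, vdd_volts=1.0))  # larger node
```

In this model the same strike flips the smaller node but not the larger one, and lowering V_DD reduces the margin in the same way as shrinking the capacitance.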

16 Circuit-level solutions
DICE (dual interlocked storage cell)
– Used for SRAM, flip-flops, latches
Built-in current sensors on supply lines of memory cells.

18 Modular redundancy
Dual Modular Redundancy (DMR)
Triple Modular Redundancy (TMR)
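
A minimal behavioral sketch of the two schemes, assuming the usual arrangement in which DMR compares two copies to detect a mismatch and TMR takes a bitwise majority vote over three copies to mask a single faulty one. The function names are hypothetical.

```python
def dmr_check(copy_a, copy_b):
    """Dual modular redundancy: detect (but not correct) a mismatch.

    Returns (output, error_flag). The output is only trustworthy when
    error_flag is False.
    """
    return copy_a, copy_a != copy_b


def tmr_vote(copy_a, copy_b, copy_c):
    """Triple modular redundancy: bitwise majority of three integer results."""
    return (copy_a & copy_b) | (copy_b & copy_c) | (copy_a & copy_c)


if __name__ == "__main__":
    good = 0b1010
    faulty = 0b0010  # one bit flipped by a transient fault
    print(dmr_check(good, faulty))            # mismatch flagged
    print(bin(tmr_vote(good, good, faulty)))  # fault masked by the vote
```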

19 Redundant Circuits
Redundancy increases area/power
DMR/TMR in sub/near-V_T
– Timing variation between circuits increases
Utilization of redundant lanes for parallel operation can increase throughput at low V_DD

20 Self-Checking Circuits
Partition circuit into smaller blocks
– Error checker for each block
Use error detection codes
– Berger codes
– Arithmetic codes
Increases circuit delay for error computation
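
As an illustration of one of the codes named above, here is a minimal sketch of a Berger code, which appends the count of zero bits in the data word as its check symbol and detects unidirectional errors. The function names are hypothetical and the checker is a software stand-in for what would be a hardware block.

```python
def berger_check_symbol(data_bits):
    """Berger check symbol: the number of 0 bits in the data word."""
    return list(data_bits).count(0)


def berger_encode(data_bits):
    """Return (data_bits, check_symbol) as the protected word."""
    bits = list(data_bits)
    return bits, berger_check_symbol(bits)


def berger_is_valid(data_bits, check_symbol):
    """The checker recomputes the zero count and compares it."""
    return berger_check_symbol(data_bits) == check_symbol


if __name__ == "__main__":
    data, check = berger_encode([1, 0, 1, 1, 0])
    print(check, berger_is_valid(data, check))
    corrupted = [1, 0, 0, 1, 0]  # a single 1 -> 0 error
    print(berger_is_valid(corrupted, check))
```

The recomputation in the checker is the extra work that shows up as added circuit delay.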

21 Circuit-Level Speculation
Uses approximated circuit implementation
– Goal is to reduce critical path

22 Tunable Replica Circuits
Mirrors delay of critical path
Monitors for errors over voltage/frequency changes

23 Timing Speculation
Razor timing error detection
– Designed for transient faults
– Effective against SETs and SBUs on flip-flops
Requires error recovery
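
A rough behavioral model of Razor-style detection, assuming the commonly described arrangement of a main flip-flop sampled at the clock edge plus a shadow latch sampled on a delayed clock. Timing is reduced to a single arrival time, and all names and numbers are illustrative assumptions rather than details from the slides.

```python
def razor_sample(old_value, new_value, arrival_time, clock_period, shadow_delay):
    """Return (main_ff, shadow_latch, error) for one clock cycle.

    The main flip-flop captures new_value only if it arrives by the clock
    edge; otherwise it holds old_value. The shadow latch samples
    shadow_delay later and is assumed to always see the correct value.
    """
    if arrival_time > clock_period + shadow_delay:
        raise ValueError("data arrives after the shadow latch closes")
    main_ff = new_value if arrival_time <= clock_period else old_value
    shadow_latch = new_value
    error = main_ff != shadow_latch
    return main_ff, shadow_latch, error


if __name__ == "__main__":
    print(razor_sample(0, 1, arrival_time=0.9, clock_period=1.0, shadow_delay=0.3))
    print(razor_sample(0, 1, arrival_time=1.2, clock_period=1.0, shadow_delay=0.3))
```

When the error flag is raised, the shadow value is the one to restore, which is why the scheme depends on an error recovery mechanism.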

25 Error Recovery Options in Scalar Processors
Clock gating:
– Global error signal
– Clock gating
– 1-cycle penalty

26 Error Recovery Options in Scalar Processors
Multiple issue:
– Error signals propagated to control unit
– Instructions must be flushed
– Error instruction then replayed
– 2N-cycle penalty

27 Error Recovery Options in Scalar Processors
Counter-flow pipelining
Micro-rollback

28 Error correcting codes for memories
Most common is Hamming code
Check bits stored when data written
Identifies error and erroneous bit position
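
To make the check-bit idea concrete, here is a minimal sketch of the classic Hamming(7,4) code, in which the syndrome directly gives the position of a single flipped bit. Real memories use wider words; the 4-bit word and the function names are assumptions for illustration.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-bit upset at position 5
    print(hamming74_correct(word))
```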

29 Error correcting codes for memories
Single-bit ECC adds area/power and delay
– Low V_DD → increased delay
– Hybrid V_DD operation will reduce delay
Overhead increases for multi-bit ECC
– Increased memory density → higher probability of MBU
– Current research: increase in ratio of MBU to total SER in sub-V_T
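
The storage part of the single-bit ECC overhead can be estimated from the Hamming bound: r check bits protect m data bits when 2^r ≥ m + r + 1. The sketch below assumes a plain single-error-correcting Hamming code; the function names are hypothetical, and area, power, and delay costs are not modeled.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    if data_bits < 1:
        raise ValueError("need at least one data bit")
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def storage_overhead(data_bits):
    """Check bits as a fraction of the data bits."""
    return sec_check_bits(data_bits) / data_bits


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(width, sec_check_bits(width), round(storage_overhead(width), 3))
```

Wider words amortize the check bits better, which is one reason ECC is usually applied per word or per cache line rather than per byte.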

31 System-Level Impact
Soft errors can have a large effect on processor functionality
– Increasing issue with further device scaling
All methods of error detection/correction are costly
– Need to be added to system blocks wisely
– SEU distribution
– Effects of process variation

32 System-Level Impact
How to determine which blocks have the highest system-level impact?
– Mostly through simulation
– For radiation: all-encompassing, includes circuit-level faults
Different models have been developed
– ReStore (University of Illinois at Urbana-Champaign): focuses on the system-level effect of radiation-induced errors
– RAMP (IBM): directed more towards hard errors and processor failure

33 Questions?

