Presentation on theme: "Microprocessor Reliability"— Presentation transcript:
1 Microprocessor Reliability. Robert Pawlowski, ECE 570 – 2/19/2013
2 Reliability: Involves different aspects of a processor that can affect performance and functionality. Ultimately can reduce the lifetime of the processor. Issues typically manifest themselves at the device level. Solutions can be implemented at multiple design levels.
3 Why the concern? Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variabilities: gate length/doping concentration variations; temperature; supply voltage droops. This decreases processor yield. Decreasing device sizes lead to an increased effect of external issues.
4 Outline: Error Classification; Hard Errors; Soft Errors; Sources of radiation; Device/Circuit approaches; Architectural approaches; Error detection; Error correction; System level impact
5 Processor Error Classification: Hard errors will result in permanent processor failure. Processor lifetime is inversely proportional to hard error rate. Soft errors do not permanently damage the device.
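The inverse relation between lifetime and hard error rate can be sketched numerically. This is a minimal illustration, assuming a constant failure rate expressed in FIT (failures per 10^9 device-hours), so that mean time to failure is simply the reciprocal of the rate. The function names and the example FIT values are hypothetical and do not come from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def mttf_hours(fit):
    """Mean time to failure in hours for a constant rate given in FIT.

    Assumption: 1 FIT = 1 failure per 1e9 device-hours, rate constant over time.
    """
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / fit


def mttf_years(fit):
    """Same quantity expressed in years."""
    return mttf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative rates only: doubling the hard error rate halves the lifetime.
    for fit in (1000, 2000, 4000):
        print(f"{fit} FIT -> MTTF of {mttf_years(fit):.1f} years")
```

Real wear-out mechanisms do not have a constant rate (the next slide notes that intrinsic failures increase over time), so this only captures the first-order proportionality.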
6 Hard Errors. Extrinsic failures: caused by process and manufacturing defects; occur with decreasing rate over time; no impact from micro-architecture. Intrinsic failures: related to processor wear-out; occur with increasing rate over time; related to wafer packaging, process parameters, and processor design.
8 Soft Errors: Occur in both memory and logic. External radiation is the main issue in memory: alpha particles; high-energy neutrons; thermal neutrons. Different causes of transient errors in logic: external radiation; supply voltage droop; power supply fluctuations; ground bounce, cross-talk; process variation, temperature. Affect delay of computational paths.
10 Radiation-Induced Soft Errors: Ionized particle strike causing a state change. No permanent damage (not a hard error). Combo logic – Single Event Transients (SET). Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU). Three causes of soft errors: alpha particles; thermal neutrons; high-energy neutrons.
11 Alpha Particles: Emitted from impurities in packaging materials. Create electron-hole pairs through direct ionization. Range for a 10 MeV particle < 100 um. Typical energy 4-9 MeV. Improved manufacturing trends have reduced the effect: purified materials; shielding layers.
12 Neutrons: Result of cosmic ray reactions with the atmosphere. High-energy neutrons react with chip materials. Concrete is the only shielding material: 1.4x lower flux per foot of thickness.
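The shielding figure above can be turned into a quick estimate. The sketch below assumes the 1.4x reduction compounds with each additional foot of concrete (simple exponential attenuation); the slide gives only the per-foot factor, so the compounding and the function names are assumptions.

```python
import math

ATTENUATION_PER_FOOT = 1.4  # from the slide: 1.4x lower flux per foot of concrete


def attenuated_flux(flux, feet_of_concrete):
    """Neutron flux after the shield, assuming the per-foot factor compounds."""
    if feet_of_concrete < 0:
        raise ValueError("thickness must be non-negative")
    return flux / (ATTENUATION_PER_FOOT ** feet_of_concrete)


def feet_for_reduction(factor):
    """Concrete thickness (feet) needed to cut the flux by the given factor."""
    if factor < 1:
        raise ValueError("reduction factor must be >= 1")
    return math.log(factor) / math.log(ATTENUATION_PER_FOOT)


if __name__ == "__main__":
    print(f"10x reduction needs about {feet_for_reduction(10):.1f} ft of concrete")
```

Under that assumption a 10x reduction takes roughly 6.8 feet of concrete, which suggests why shielding is rarely a practical fix for high-energy neutrons.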
13 Neutrons: Thermal neutrons (<<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer. Produce ionized particles that can cause soft errors. Solution: remove BPSG from advanced processes. Mostly solved – SEUs still found in 45nm, 90nm.
15 Device-level solutions: Larger device sizes give larger capacitance; this increases the amount of charge necessary to flip a bit (critical charge). Multiple VT design. Sensitivity to variation at low VDD may limit effectiveness. Body biasing is also common to both radiation hardening and variation tolerance.
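The link between capacitance and critical charge can be sketched with the common first-order approximation Qcrit ≈ C_node × VDD. This ignores the restoring current of the feedback transistors, so treat it as an assumption and not as the presenter's model. The capacitance and voltage values below are hypothetical.

```python
def critical_charge_fc(node_cap_ff, vdd_volts):
    """First-order critical charge in femtocoulombs.

    Assumption: Qcrit ~= C_node * VDD (fF * V = fC); feedback current is ignored.
    """
    if node_cap_ff < 0 or vdd_volts < 0:
        raise ValueError("capacitance and voltage must be non-negative")
    return node_cap_ff * vdd_volts


if __name__ == "__main__":
    # Hypothetical storage nodes: upsizing raises Qcrit, lowering VDD cuts it.
    print(critical_charge_fc(1.0, 1.0))  # baseline node
    print(critical_charge_fc(2.0, 1.0))  # larger devices, more capacitance
    print(critical_charge_fc(2.0, 0.5))  # same node at low VDD
```

The third case shows why low-VDD operation can cancel out the benefit of upsizing.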
16 Circuit-level solutions: DICE cell; used for SRAM, FFs, latches. Built-in current sensors on supply lines of memory cells.
19 Redundant Circuits: Redundancy increases area/power. DMR/TMR in sub/near-VT: timing variation between circuits increases. Utilization of redundant lanes for parallel operation can increase throughput at low VDD.
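A behavioral sketch of the two redundancy schemes named above: DMR can only detect a mismatch between two copies, while TMR masks a fault in one copy with a bitwise majority vote. The function names are illustrative, and the integers stand in for the output buses of the redundant circuits.

```python
def dmr_mismatch(a, b):
    """Dual modular redundancy: flag an error when the two copies disagree."""
    return a != b


def tmr_vote(a, b, c):
    """Triple modular redundancy: bitwise majority of three copies."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    good = 0b1010
    faulty = 0b0010  # one copy with a single flipped bit
    print(dmr_mismatch(good, faulty))             # detection only
    print(bin(tmr_vote(good, good, faulty)))      # faulty copy is outvoted
```

The voter masks any fault confined to a single copy, but not the same bit failing in two copies at once.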
20 Self-Checking Circuits: Partition circuit into smaller blocks. Error checker for each block. Use error detection codes: Berger codes; arithmetic codes. Increases circuit delay for error computation.
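As an illustration of one of the codes named above, a Berger code appends a check symbol equal to the number of 0 bits in the data word, which detects every unidirectional error (all flips in the same direction). The sketch below is a software model with hypothetical function names, not a description of the checker hardware in the presentation.

```python
def berger_check(data, width):
    """Berger check symbol: count of zero bits in a data word of given width."""
    masked = data & ((1 << width) - 1)
    ones = bin(masked).count("1")
    return width - ones


def berger_encode(data, width):
    """Return the (data, check) pair that would be stored or transmitted."""
    return data, berger_check(data, width)


def berger_ok(data, check, width):
    """True when the received data still matches its check symbol."""
    return berger_check(data, width) == check


if __name__ == "__main__":
    word, check = berger_encode(0b10110000, 8)
    corrupted = 0b00010000  # two 1 -> 0 flips, a unidirectional error
    print(check, berger_ok(word, check, 8), berger_ok(corrupted, check, 8))
```

A 1-to-0 error in the data raises the zero count, while the same kind of error in the check symbol can only lower its value, so the two can never compensate for each other.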
21 Circuit-Level Speculation: Uses approximated circuit implementation. Goal is to reduce critical path.
22 Tunable Replica Circuits: Mirrors delay of critical path. Monitors for errors over voltage/frequency changes.
23 Timing Speculation: Razor timing error detection. Designed for transient faults. Effective against SETs and SBUs on flip-flops. Requires error recovery.
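A simplified behavioral model of Razor-style detection: the main flip-flop samples at the clock edge, a shadow latch samples the same signal a little later, and a mismatch between the two flags a timing error that can be repaired from the shadow value. The timing parameters and names here are assumptions for illustration; the model also assumes the data always settles before the shadow latch closes, which a Razor design has to guarantee.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    main_value: int       # what the main flip-flop captured at the clock edge
    shadow_value: int     # what the delayed shadow latch captured
    error: bool           # True when the two disagree
    recovered_value: int  # value restored from the shadow latch


def razor_sample(old_value, new_value, arrival_time, clock_period, shadow_delay):
    """Model one Razor flip-flop for a single cycle (times in arbitrary units)."""
    if arrival_time > clock_period + shadow_delay:
        raise ValueError("data arrived after the shadow latch window closed")
    # A late signal leaves the stale value in the main flip-flop.
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value  # shadow latch is assumed to always see settled data
    return RazorResult(main, shadow, main != shadow, shadow)


if __name__ == "__main__":
    print(razor_sample(0, 1, arrival_time=0.9, clock_period=1.0, shadow_delay=0.5))
    print(razor_sample(0, 1, arrival_time=1.2, clock_period=1.0, shadow_delay=0.5))
```

In the second call the data misses the clock edge, so the error flag is raised and the pipeline would need one of the recovery mechanisms to replay with the shadow value.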
26 Error Recovery Options in Scalar Processors. Multiple issue: error signals propagated to control unit; instructions must be flushed; error instruction then replayed; 2N-cycle penalty.
27 Error Recovery Options in Scalar Processors: Counter-flow pipelining; micro-rollback.
28 Error correcting codes for memories: Most common is the Hamming code. Check bits are stored when data is written. Identifies the error and the erroneous bit position.
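The "identifies the erroneous bit position" property can be shown with the textbook Hamming(7,4) code: check bits sit at positions 1, 2 and 4, and the syndrome (XOR of the positions of all set bits) is zero for a valid word or equal to the position of a single flipped bit. This is a minimal sketch with illustrative function names; memory ECC in practice typically uses wider SEC-DED variants.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (list index 0 = position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d     # data bits
    code[1] = code[3] ^ code[5] ^ code[7]      # check bit covering 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]      # check bit covering 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]      # check bit covering 4,5,6,7
    return code[1:]


def hamming74_syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    return syndrome


def hamming74_correct(word):
    """Return (corrected word, syndrome), fixing at most one flipped bit."""
    fixed = list(word)
    syndrome = hamming74_syndrome(fixed)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


def hamming74_decode(word):
    """Correct a single-bit error if present and return the 4 data bits."""
    fixed, _ = hamming74_correct(word)
    data_bits = [fixed[2], fixed[4], fixed[5], fixed[6]]
    return sum(bit << i for i, bit in enumerate(data_bits))


if __name__ == "__main__":
    stored = hamming74_encode(0b1011)
    stored[4] ^= 1  # simulate a single-bit upset at position 5
    print(hamming74_syndrome(stored), bin(hamming74_decode(stored)))
```

A double-bit upset produces a non-zero syndrome pointing at the wrong position, which is why multi-bit upsets need stronger codes.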
29 Error correcting codes for memories: Single-bit ECC adds area/power and delay. Low VDD leads to increased delay; hybrid VDD operation will reduce delay. Overhead increases for multi-bit ECC. Increased memory density leads to a higher probability of MBU. Current research: increase in ratio of MBU to total SER in sub-VT.
31 System-Level Impact: Soft errors can have a large effect on processor functionality. Increasing issue with further device scaling. All methods of error detection/correction are costly. Need to be added to system blocks wisely: SEU distribution; effects of process variation.
32 System-Level Impact: How to determine what blocks have the highest system-level impact? Mostly through simulation. For radiation: all-encompassing; includes faults at the circuit level. Different models have been developed. ReStore (University of Illinois at Urbana-Champaign): focuses on system-level effect of radiation-induced errors. RAMP (IBM): directed more towards hard errors and processor failure.