University of Michigan

Sidestepping performance bottlenecks and design crises with Better-Than-Worst-Case design
Todd Austin, Valeria Bertacco, Krisztian Flautner
Dept. of EECS, University of Michigan, Ann Arbor, MI, USA
ARM, Ltd., Cambridge, UK
Munich, March 7th, 2005

Introduction
Limitations of traditional design approaches in light of current technology trends

Pressing Design Challenges in the Nanometer Regime
Design complexity: billions of transistors lead to untenable designs…
Uncertainty in design parameters: process and temperature variation, supply noise…
Soft errors upset logic and memory: cosmic rays, alpha particles, neutrons, etc…
Power demands: bounding performance, area, battery life

Design Complexity Trends
[Chart labels: PowerIV, GeForceFX, Radeon97, Prescott, PentiumIV, GeForceIV, Crusoe, K7, GeforceII, PentiumPro, Riva TNT2, PentiumII, Pentium, Nvidia NV2, i486, i386, i286]

The Burden of Verification
Immense test space
  Impossible to fully test the system
  Example: 32 regs, 8k caches, 300 pins = states
Done with respect to an ill-defined reference
  What is correct? Often defined by old designs + gurus
Expensive
  Large fraction of design team dedicated to verification
  Increases time-to-market, often by as much as 1-2 years
High-risk
  Typically only one chance to “get it right”
  Failures can be costly: replacement parts, bad PR, lawsuits, fatalities

Extreme Variations
[Figure panels: heat flux (W/cm²), which results in Vcc variation; temperature variation (°C), which results in hot spots; random dopant fluctuations]

Uncertainty in Design Parameters
[Diagram: temperature, Si variation, noise, Vdd and model uncertainty combine to determine f, yield and MTTF]
Uncertainty leads to performance and power overheads
Increasing uncertainty with design scaling
  Intra-die process/temperature variations, inductive noise, deep
Key observation: the worst-case scenario is highly improbable
  Significant gain for circuits optimized for the common case
  Efficient mechanisms needed to tolerate worst-case scenarios
…all of this uncertainty becomes costly because we have to add margins to the design to guarantee correct operation…

Impact of Neutron Strike on a Si Device
[Diagram: transistor device with source and drain, showing the + and - charges released by a strike]
Strikes release electron & hole pairs that can be absorbed by the source & drain to alter the state of the device
Secondary source of upsets: alpha particles from packaging

Soft-Error Trends SER per chip of logic circuits
[P. Shivakumar et al., DSN 2002]
The plot shows how the soft error rate (SER) changes as technology scales from 600 nm to 50 nm. The x-axis shows the minimum feature size of each technology generation (600 nm, 350 nm, 250 nm, 180 nm, 130 nm, 100 nm, 70 nm, 50 nm), while the y-axis plots the soft error rate of the chip. The contributions of SRAM, latches and logic chains to the SER of a chip are plotted separately; the SER contribution of the logic chain is shown by the red line. The SER of the logic chain increases by nine orders of magnitude from 600 nm to 50 nm. At 50 nm, the SER contribution of logic is expected to exceed that of latches because of the large number of logic nodes used in a chip.
SER per chip of logic circuits
  Nine orders of magnitude increase from 600 nm to 50 nm
  Dominant source of soft errors after 50 nm

Fried Egg a la Athlon XP1500+
Source: The New York Times, 25 June 2002

Power Density Trends
[Plot: power density (Watts/cm², log scale from 1 to 1000) versus process generation (1.5µ, 1µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ, 0.1µ, 0.07µ). Processors plotted: i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4. Reference levels: hot plate, nuclear reactor, rocket nozzle, Sun's surface]
Power doubles every 4 years
5-year projection: 200 W total, 125 W/cm²!
P=VI: 1.5V = 50 A!
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro32 conference keynote. Courtesy Avi Mendelson, Intel.

Better-Than-Worst-Case (BTWC) design
Traditional worst-case design works to avoid errors/faults by assuming worst-case conditions for design validation
Better-than-worst-case design couples a complex design with a checker component that validates correctness during operation
  Reduces design effort and enables typical-case optimizations

What is this tutorial about
BTWC design
  Basic Concepts
  DIVA Checker
  Razor Logic
  Other BTWC solutions
CAD challenges and opportunities
  Typical-Case design Optimization (TCO)
  Circuit-level observability and system-level performance
Open discussion
Conclusion

Goals of this tutorial
Introduce and motivate the concept of Better-Than-Worst-Case design
Familiarize the attendees with a number of BTWC designs (ours and others)
Introduce efforts (circuit-aware architectural simulation, typical-case optimization) that highlight the challenges and opportunities that BTWC poses to CAD
Facilitate an open discussion on the implications of BTWC design on CAD

Better-Than-Worst-Case design

Traditional Worst-Case Design
Design-Time Verification and Optimization
[Diagram: low-to-high (L to H) gauges for time-to-market and performance]

Better-Than-Worst-Case (BTWC) design
Online Checker Hardware
Run-Time Verification
Typical Case Optimization
[Diagram: low-to-high (L to H) gauges for time-to-market and performance]

Outline DIVA Checker Razor Logic Other BTWC designs

Motivating Observations
Online functional verification covers most faults
  Single-event upsets and noise-related faults
  Design faults and incomplete implementation
  Untestable silicon defects and in-field circuit failures
  Utilize N(2)-version hardware to detect and correct faults
Increasing speculation reduces exposure to faults
  Predictors need not be correct, functionally or electrically
  Approach leverages a maximally speculative architecture
While complex, processors have simple semantics
  Need not validate all internals, only exposed semantics
  Only check instruction semantics for low overheads

Example BTWC Design: DIVA Checker [Austin’99]
[Diagram: the core (performance; IF, ID, REN, REG, SCHEDULER, EX/MEM) passes speculative instructions in-order, with PC, inst, inputs and addr, to the online checker hardware (correctness; CHK, CT)]
All core function is validated by the checker
  Simple checker detects and corrects faulty results, restarts core
Checker relaxes burden of correctness on core processor
  Tolerates design errors, electrical faults, defects, and failures
Core has burden of accurate prediction, as checker is 15x slower
  Core does heavy lifting, removes hazards that slow checker
…DIVA stands for “Dynamic Implementation Verification Architecture”…

Checker Processor Architecture
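[Diagram: checker pipeline (IF, ID, EX, MEM, CT) fed by the core processor prediction stream. Each stage compares its own value with the core's: PC = core PC (I-cache), inst = core inst, regs = core regs (RF), res/addr = core res/addr/nextPC (D-cache). The OK signals go to CT; other labels: result, WT]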

Check Mode
[Diagram: checker pipeline (IF, ID, EX, MEM, CT) fed by the core processor prediction stream. Each stage compares its own value with the core's: PC = core PC (I-cache), inst = core inst, regs = core regs (RF), res/addr = core res/addr/nextPC (D-cache). The OK signals go to CT; other labels: result, WT]
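The check-mode comparison can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the DIVA implementation: the two-operation ISA, the `Prediction` fields and the `check_and_commit` function are all hypothetical names chosen for illustration. The idea shown is only that the checker recomputes each instruction from checked register state, commits its own value, and reports a mismatch with the core's prediction.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """One entry of the core's prediction stream (field names are assumptions)."""
    op: str       # "add" or "sub" in this toy ISA
    src1: int     # source register index
    src2: int     # source register index
    dest: int     # destination register index
    result: int   # result the core computed speculatively

def check_and_commit(regs, pred):
    """Recompute the instruction from checked state and commit the checker's value.

    Returns True if the core's result matched, False if the checker corrected it
    (a real design would also flush and restart the core at that point).
    """
    a, b = regs[pred.src1], regs[pred.src2]
    correct = a + b if pred.op == "add" else a - b
    regs[pred.dest] = correct   # architected state only ever holds checked values
    return correct == pred.result

regs = [0, 5, 7, 0]
ok = check_and_commit(regs, Prediction("add", 1, 2, 3, 12))   # core was right
bad = check_and_commit(regs, Prediction("sub", 3, 1, 0, 99))  # core was wrong
```

Because the register file is only written with the checker's own result, a wrong core prediction costs performance but never corrupts architected state in this sketch.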

Recovery Mode
[Diagram: checker pipeline (IF, ID, EX, MEM, CT) with PC, inst, regs and result flowing between stages and the I-cache, RF and D-cache; no core prediction stream or comparators are shown]

How Can the Simple Checker Keep Up?
Slipstream
  Slipstream reduces power requirements of trailing car
  Checker processor executes inside core processor’s slipstream
  Fast moving air ⇒ branch predictions and cache prefetches
  Core processor slipstream reduces complexity requirements of checker
  Checker rarely sees branch mispredictions, data hazards, or cache misses


Checker Performance Impacts
Checker throughput bounds core IPC
  Only cache misses stall checker pipeline
  Core warms cache, leaving few stalls
Checker latency stalls retirement
  Stalls decode when speculative state buffers fill (LSQ, ROB)
  Stalled instructions mostly nuked!
Storage hazards stall core progress
  Checker may stall core if it lacks resources
Faults flush core to recover state
  Small impact if faults are infrequent

REMORA: Physical Checker Design
Physical checker design effort underway
  Alpha integer ISA subset
  4-wide checker, 0.5k I-cache, 4k D-cache
  Synthesized design (using Synopsys)
Physical design estimates
  950 MHz clock speed (degree-8 pipe)
  12 mm² total area in 0.25 µm technology
  941 mW worst-case power
Design also includes:
  Pipelined checker design, simple core
  Clock/voltage tuning infrastructure
  Extensive BIST support
[Figure: die-area comparison in 0.25 µm, Alpha 21264 (205 mm²) and REMORA checker (12 mm²); the checker floorplan shows data cache, inst pipeline and BIST]

Verifying the Checker Processor
[Diagram: unspecified core predictions X drive both the checker model (uArch model) and the reference model (ISA sim); their outputs are compared for identical state, which is always true if uArch model == Ref model]
Simple checker permits complete functional verification
  In-order blocking pipelines (trivial scheduler, no rename/reorder/commit)
  No “internal” non-architected state
Fully verified design using Sakallah’s GRASP SAT-solver
  For Alpha integer ISA without exceptions
  With small register file and memory, and small data types


Motivating Study: Voltage vs. Circuit Error Rate

Circuit Under Test
[Diagram: 18x18 multipliers (18-bit inputs, 36-bit outputs) driven by 48-bit LFSRs. Slow pipeline A and slow pipeline B run at clk/2; the fast pipeline runs at clk with a stabilize stage. Comparators (!=) feed a 40-bit error counter]

Error Rate Studies – Empirical Results
35% energy savings with a 1.3% error rate
22% savings with one error every 20 seconds!

Error Rate Studies – SPICE-Level Simulations
Based on SPICE-level simulations of a Kogge-Stone adder
[Plot: error rate versus voltage for different inputs, with a 200 mV decrease marked]

Another BTWC Design: Razor Logic [Ernst’03]
[Diagram: pipeline logic between main flip-flops (clk), with a shadow latch (clk_del) alongside the main FF acting as the online checker hardware; numbers on the logic paths are example delays]
Double-sampling, metastability-tolerant latches detect timing errors
  Second sample is correct-by-design
Microarchitectural support restores state
  Timing errors treated like branch mispredictions
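A behavioral Python sketch of the double-sampling idea follows. It is a simplified model under assumptions, not the Razor circuit: the function name `razor_sample` and its parameters are hypothetical, metastability is ignored, and the shadow latch is assumed to always see the settled value as long as the data arrives before the delayed clock edge.

```python
def razor_sample(old_value, new_value, arrival_ns, clk_period_ns, shadow_delay_ns):
    """Model one cycle of a double-sampling flip-flop.

    The main flip-flop samples at the clock edge (clk_period_ns); the shadow
    latch samples shadow_delay_ns later. Returns (main, shadow, error).
    """
    if arrival_ns > clk_period_ns + shadow_delay_ns:
        # The scheme relies on the design guaranteeing this never happens,
        # so that the second sample is correct by construction.
        raise ValueError("data arrived after the shadow latch sampled")
    # Late data means the main flip-flop captured the stale value.
    main = new_value if arrival_ns <= clk_period_ns else old_value
    shadow = new_value
    return main, shadow, main != shadow

on_time = razor_sample(0, 1, arrival_ns=4.0, clk_period_ns=5.0, shadow_delay_ns=2.5)
late = razor_sample(0, 1, arrival_ns=6.0, clk_period_ns=5.0, shadow_delay_ns=2.5)
```

In the late case the error flag is raised and the shadow value is the one a recovery mechanism would restore into the pipeline.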

Razor Short Path Constraint
[Diagram: pipeline logic between main flip-flops (clk) with a shadow latch (clk_del); the hold constraint is about 1/2 cycle; numbers on the logic paths are example delays]
Short-path timing constraint prevents shadow latch corruption

Razor Flip-Flop
[Schematic: Razor flip-flop with inputs D_in, Clk_p, Clk_n, Restore_p and Restore_n; P-skewed and N-skewed stages; a driver; node Rbar_latched; outputs Q and Error]

Centralized Pipeline Recovery Control
[Diagram: pipeline stages IF, ID, EX, MEM, WB (reg/mem) and the PC, separated by Razor FFs; each Razor FF has error and recover signals tied to central clock control; inst1 to inst6 are shown advancing over cycles 1 to 6]
One cycle penalty for timing failure
Global synchronization may be difficult for fast, complex designs

Distributed Pipeline Recovery
[Diagram: pipeline stages IF, ID, EX, MEM (read-only), WB (reg/mem) and the PC, separated by Razor FFs, with a stabilizer FF before WB; each stage has error, bubble and recover signals, and flush control distributes flushID; inst1 to inst8 are shown over cycles 1 to 9]
Builds on existing branch prediction framework
Multiple cycle penalty for timing failure
Scalable design as all communication is local

Shaving Voltage Margins with Razor
Goal: reduce voltage margins with in-situ error detection and correction
Approach:
  Tune processor voltage based on error rate
  Eliminate margins, run below critical voltage
Trade-off: power savings vs. overhead of correction
[Diagram: voltage scale marking traditional DVS, zero margin and sub-critical operation]
…this work focuses primarily on the uncertainty and software challenges…

Razor Opportunity: Typical-Case Energy Reduction
[Diagram: control loop. The pipeline's error signals are accumulated into Esample; Ediff = Eref - Esample feeds the voltage control function, which drives the voltage regulator that sets Vdd; a reset signal is also shown]
Energy reduction can be realized with a simple proportional control function
Control algorithm implemented in software
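A proportional controller of the kind named on this slide could look like the following Python sketch. The gain, the reference error rate and the voltage limits are illustrative assumptions, not values from the talk, and `next_vdd` is a hypothetical name.

```python
def next_vdd(vdd, e_sample, e_ref=0.001, gain=2.0, vmin=1.2, vmax=1.8):
    """One step of a proportional voltage controller (all constants are assumptions).

    e_sample is the measured error rate and e_ref the target error rate.
    Ediff = Eref - Esample: a positive difference means there is slack, so the
    supply is lowered; a negative difference means too many errors, so it is raised.
    """
    e_diff = e_ref - e_sample
    return min(vmax, max(vmin, vdd - gain * e_diff))

lowered = next_vdd(1.6, e_sample=0.0)     # no errors observed: step down
raised = next_vdd(1.6, e_sample=0.011)    # error rate above target: step up
```

With these assumed constants, an error-free interval lowers the supply by 2 mV and an error rate 1% above the target raises it by 20 mV, with the result clamped to the allowed range.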

Energy/Performance Characteristics
[Plot: energy and pipeline throughput (IPC) versus decreasing supply voltage. Curves: total energy, Etotal = Eproc + Erecovery; energy of processor operations, Eproc; energy of pipeline recovery, Erecovery; energy of processor w/o Razor support. Annotations: optimal Etotal, 1%, 50%]
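The shape of the Etotal = Eproc + Erecovery trade-off can be reproduced with a toy model in Python. Everything numeric here is an assumption made for illustration: the exponential error-rate curve, the 1.6 V point of first failure, and energy normalized to V². The recovery cost of 18 instruction-equivalents borrows the "18x an add" figure quoted on the optimal voltage sweep slide.

```python
import math

def error_rate(v, v_first_failure=1.6, steepness=30.0):
    """Hypothetical error-rate curve: zero above the point of first failure,
    rising exponentially below it (capped at 1)."""
    if v >= v_first_failure:
        return 0.0
    return min(1.0, 1e-4 * math.exp(steepness * (v_first_failure - v)))

def total_energy(v, recovery_cost=18.0):
    """Etotal = Eproc + Erecovery per instruction, in normalized units."""
    e_proc = v * v                                    # dynamic energy scales with V^2
    e_recovery = error_rate(v) * recovery_cost * e_proc
    return e_proc + e_recovery

voltages = [1.80 - 0.01 * i for i in range(61)]       # sweep 1.80 V down to 1.20 V
best_v = min(voltages, key=total_energy)
```

In this model the minimum of Etotal lies below the point of first failure: a small error rate is worth paying for because Eproc keeps falling quadratically, until recovery energy starts to dominate.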

Simulation Results: Optimal Voltage Sweep
Recovery cost includes energy to recover entire pipeline (18x an add)

Voltage Controller Performance
[Plot: percentage error rate and voltage output of controller versus runtime samples, at 120 MHz and 27 °C]

Simulation Results: Razor DVS Performance
[Bar chart: relative energy (%) and IPC for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr and the average; series: total energy, DVS energy, IPC, DVS IPC]
Energy bars compare DVS against the fixed optimal voltage
DVS can be better because it can take advantage of dynamic program information

Razor Prototype Silicon
4-stage 64-bit Alpha pipeline
  MHz operation
  0.18 µm technology, 1.8 V
Razor overhead:
  Total of 192 Razor flip-flops out of 2408 total (9%)
Error-free power overhead:
  Razor flip-flops: < 1%
  Short path buffer: 2.1%
Recovery power overhead:
  Razor latch power overhead: 2% at 10% error rate
  Additional power overhead due to re-execution of instructions
[Die photo: dimensions labeled 3 mm and 3.3 mm; blocks: I-Cache, D-Cache, register file, IF, ID, EX, MEM, WB]

Razor Prototype Testbed

Error Rate and Normalized Energy Savings
[Plot: percentage error rate and normalized energy versus voltage (in volts), at 120 MHz and 140 MHz]

Point of 0.1% Error Rate and Point of First Failure
[Plot: voltage at first failure and voltage at 0.1% error rate, at 120 MHz and 140 MHz]

Razor Prototype Testbed

Temperature Margins
[Plot: percentage error rate versus voltage at 120 MHz, for 45 °C, 65 °C and 95 °C]

Razor Energy Savings@120MHz,45C
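[Bar chart: measured power (in mW) for chip1 and chip2]
Power with supply, temperature and process margins: 160.5 mW (chip1), 162.8 mW (chip2)
Power with Razor DVS when operating at the point of first failure: 104.5 mW (chip1), 119.4 mW (chip2)
Power with Razor DVS when operating at the point of 0.1% error rate: 89.7 mW (chip1), 99.6 mW (chip2)
Margin labels: power supply integrity 27.3 mW (180 mV), temp 11.3 mW (80 mV), process 17.3 mW (130 mV); further labels: 4.2 mW (30 mV), 11.5 mW, 27.7 mW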


Other Better Than Worst-Case designs
Algorithmic-Noise Tolerance, Shanbhag et al.: converting circuit faults to S/N component
Approximate Circuits, Lu et al.: architecture-level speculation on computation
TEAtime Adaptive Clock, Uht et al.: adaptive clock control
On-Chip Self-Calibrating Busses, Worm et al.: error recovery logic for on-chip busses
Self-Tuning Circuits, Kehl et al.: early work on dynamic timing error avoidance
Time Based Transient Fault Detection, Anghel et al.: double sampling latches for speed testing
…for the remainder of the talk, I want to introduce you to two “Better Than Worst-Case Designs”…
March 2004

Algorithmic Noise Tolerance
[Shanbhag ’04]

Approximate Circuits [Lu ’04]

TEAtime Adaptive Clock
[Uht ’04]

On-Chip Self-Calibrating Busses
FIFO Controller Encoder Decoder Ack errors [Worm ’04]

Self-Tuning Circuits [Kehl ’93]

Time Redundancy Based Transient Fault Detection
[Anghel ’00]

CAD opportunities for BTWC

Key observation
In a BTWC context, infrequent faults in the core design are tolerable.
A fault is infrequent if:
  Environmental conditions trigger it rarely
  Normal system operation activates a faulty configuration rarely

CAD opportunities
Synthesis
  Optimize performance/power for the most common scenarios (typical-case optimization)
  Flexible synthesis tools: optimization constraints can be relaxed
  Finer granularity of synthesis objectives: probability density curves
Verification
  Accurate evaluation: statistical analysis of execution scenarios
  Verification focuses on “frequent” transactions/paths of execution
  Verification focuses on critical components
  Safety mechanisms detect “rare” problems at a performance cost
Performance and verification are intertwined

Outline
Synthesis - Typical-Case Optimized Adders
Performance/verification - Circuit-Aware simulation
Verification - Beta-release processors

Parallel Prefix Computation
[Diagram: 16-bit Kogge-Stone adder. The generate/propagate pairs (G0, P0) through (G15, P15) enter a parallel prefix computation that produces the carries C1 through C16]
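The parallel prefix computation in the diagram can be written out as a short bit-level Python sketch. It follows the standard Kogge-Stone recurrence (combine each position with the one `dist` bits below, doubling `dist` each level); the function name and the way the carry-in is folded into bit 0 are choices made for this illustration.

```python
def kogge_stone_add(a, b, width=16, cin=0):
    """Add two width-bit integers with a Kogge-Stone parallel prefix network.

    Returns (sum modulo 2**width, carry_out).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate bits
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)             # fold the carry-in into bit 0
    dist = 1
    while dist < width:                    # log2(width) prefix levels
        new_g, new_p = G[:], P[:]
        for i in range(dist, width):
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
        G, P = new_g, new_p
        dist *= 2
    carries = [cin] + G                    # carries[i] is the carry into bit i
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]
```

For 16 bits the loop runs four levels, and after the last level `G[i]` is the carry out of bit `i`, which corresponds to C1 through C16 in the diagram.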

BTWC Opportunity: Typical-Case Optimized Adder
Kogge-Stone Adder G0 P0 G1 P1 G2 P2 G3 P3 G4 P4 G5 P5 G6 P6 G7 P7 G8 P8 G9 P9 G10 P10 G11 P11 G12 P12 G13 P13 G14 P14 G15 P15 Cin …

Carry Propagations for Random Data
[Plot: probability versus bit position and carry distance]

Carry Propagations for Typical Data
[Plot: probability versus bit position and carry distance]
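Carry distance, the quantity on these plots, can be measured for any operand pair with a few lines of Python. This sketch is an assumption about how such a measurement could be made, not the authors' methodology: `longest_carry_chain` is a hypothetical helper, and distance is counted as one bit for the generating position plus one for each propagate position the carry passes through.

```python
import random

def longest_carry_chain(a, b, width=16):
    """Longest distance (in bit positions) that any carry travels in a + b."""
    longest = run = 0
    carry_live = False
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                      # generate: a new carry starts here
            run, carry_live = 1, True
        elif (ai ^ bi) and carry_live:   # propagate: the live carry moves on
            run += 1
        else:                            # kill, or propagate with no carry
            run, carry_live = 0, False
        longest = max(longest, run)
    return longest

rng = random.Random(0)                   # fixed seed so the sample is repeatable
samples = [longest_carry_chain(rng.getrandbits(16), rng.getrandbits(16))
           for _ in range(10000)]
mean_chain = sum(samples) / len(samples)
```

Even for uniformly random operands the average chain is far shorter than the 16-bit worst case, which is the gap a typical-case optimized adder exploits; replacing the random operands with values traced from a real program would give the "typical data" curve.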

Typical Case Optimized Adder
G0 P0 G1 P1 G2 P2 G3 P3 G4 P4 G5 P5 G6 P6 G7 P7 G8 P8 G9 P9 G10 P10 G11 P11 G12 P12 G13 P13 G14 P14 G15 P15 Cin … ripple carry circuit carry-lookahead circuit

Benefits of Typical Case Optimization
Latency (in gate delays)
Adder Topology   Worst-Case   Typical-Case   Random
Kogge-Stone      8            5.08           7.09
TCO Adder        128          3.03           3.69
Typical-case performance much better than worst case, relevant in a TCO context

Performance/verification - Circuit-Aware simulation Verification - Beta-release processors

Simulation for BTWC
Electrical, transient phenomena affect the performance of BTWC designs
  Simulation tools need to be “electrically accurate”
Functional correctness is re-defined in a statistical context
  Need ability to gather statistical simulation data
Example: Razor latches
[Diagram: main FF (clk) and shadow latch (clk_del) around pipeline logic]

Key Challenges: Speed and Fidelity
GOAL: balance between accuracy and speed of simulation
[Plot: fidelity versus speed. SPICE: ~4 hr/cycle; circuit-aware simulation: 5 hrs/prog; analytical circuit model: 30 min/prog]

Circuit-Aware is not only for BTWC designs
There is a recent trend in computer architecture design toward systems that can adapt to circuit-level phenomena
  e.g., di/dt, thermal throttling
These novel circuit-aware architectural optimizations share a requirement for detailed circuit modeling
  There needs to be interaction between architectural state and circuit behavior (e.g., device switching activity, detailed timing information of pipeline states)
Analytical circuit modeling has been widely used
  Simple and fast
  At the cost of accuracy

Circuit-Aware Architectural Simulation Platform Overview
[Diagram: the architectural simulator (speed and scope; IF, ID, EX, MEM, WB) takes the app and arch config and produces output and arch metrics. It passes inputs, voltage and constraints to the circuit simulator (fidelity and observability), which uses module circuit models and tech models and returns delay, power and switching as circuit metrics]

Architectural Simulator Structure
User Programs SimpleScalar Program Binary Prog/Sim Interface SimpleScalar ISA POSIX System Calls Functional Core Machine Definition Proxy Syscall Handler BPred Simulator Core Stats Performance Core Resource Dlite! Cache Loader Regs Memory

Standard Circuit Simulation Approach
SPICE-characterized Standard Cell extracted wire cap Gate input cap from typical.lib output capacitance voltage slew slew delay power a: U7 b: n45: U9 U10 c: U8 f: n47: n46: U11 g: d:

Event Driven Implementation
[Diagram: transition events propagating through gates U7 to U11 and nets n45 to n47 toward outputs f and g, including a glitch]
Event-driven simulation can capture glitch behavior
Validation against a set of SPICE simulations
  Error rates are consistently less than 11%, with most less than 3%
The initial speed of simulation without optimization is 150 insts/sec

Some serious performance boosters
Constraint-based pruning
  Static pruning
  Dynamic pruning
Circuit timing memoization (a.k.a. caching of electrical simulation results)
SimPoints

Constraint-Based Pruning – An overview
[Diagram: pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB with logic between them; for the logic under simulation, check constraints: if (delay < Tclk) stop, else simulate; the other stages are marked “constraints not violated”]
Some analyses have a “don’t care” scenario, e.g., “less than X switches”, “less than Tclk”
Architecture does not react to “don’t care” sets

Static Constraint Pruning
At each new voltage and temperature, domain-specific STA computes worst-case values of the constrained measures; logic can be pruned wherever the constraints cannot be violated
Whenever the supply voltage is changed, constraint-based pruning is re-evaluated before simulation continues
[Diagram: domain-specific STA at 1.8 V over paths a to e; labels: t_max_req 0.87, clock cycle time: 1, 1.32 ns, 0.65 ns, 0.83 ns; paths below the clock threshold are statically pruned]

Dynamic Pruning
During simulation, an event can be dropped if a particular input vector causes a transition such that:
  tevent + Tmax2output < Tconstraint
  Guarantees that simulation will reach the output net without a timing violation
Must still perform logic simulation to compute circuit state
Example: clock cycle time 5 ns
[Diagram: fast logic simulation with events at 2.93 ns, 3.23 ns and 4.23 ns; Tmax2output timing budgets of 3.2 ns, 2.5 ns, 2 ns and 1 ns]
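The dynamic pruning test is a single comparison, sketched here in Python. The function name `can_drop_event` is hypothetical, and the pairings of event times with Tmax2output values below are made up for illustration; they reuse numbers that appear on the slide but are not claimed to be the slide's own pairings.

```python
def can_drop_event(t_event_ns, t_max_to_output_ns, t_constraint_ns):
    """True if detailed timing simulation of this event can be skipped.

    The event is safe to drop when, even along the slowest remaining path to
    the output, it arrives before the constraint (here, the clock period).
    Logic values must still be simulated; only the timing work is pruned.
    """
    return t_event_ns + t_max_to_output_ns < t_constraint_ns

T_CLK_NS = 5.0
# (event time, worst-case remaining delay to the output), both in ns.
events = [(2.93, 2.0), (3.23, 2.0), (4.23, 0.5)]
needs_timing_sim = [e for e in events if not can_drop_event(e[0], e[1], T_CLK_NS)]
```

Only the middle event survives pruning in this example, since 3.23 ns plus a 2 ns worst-case path exceeds the 5 ns budget.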

Constraint-Based Circuit Pruning
In our case study of a 200 MHz Razor system:
  At the 1.8 V nominal voltage, pruning eliminated 64% of the prime inputs of the circuit
  At the most highly constrained voltage, 1.4 V, 24% of the prime inputs of the circuit are eliminated
  Static and dynamic pruning achieve 445 instructions per second

Circuit Timing Memoization
Remember previous circuit evaluations, reuse results if they recur
Leverage value locality by recording switching history
  Previous vector encodes the internal node values of the circuit; input vector indicates the new input transition
[Diagram: 16-bit operands opA and opB index a hash table; a check that hits returns the stored result, e.g. (2.5 ns, 250 pJ); otherwise full circuit simulation runs. Example vectors: 0x01FC, 0x00AE, 0x003B, 0x0012]

Circuit Timing Memoization
Size of hash table is limited to 256 MB
Dynamic reordering of hash bucket chains
  Bringing the most recently used (MRU) element to the head of the chain reduces the average number of hops
  At most 50% hit rate on average
Per-opcode input vector filtering mechanism
  Observation: load/store instructions ignore “operand B”
  Each instruction opcode indicates with a mask which inputs do not influence stage logic evaluation
  70% hit rate on average
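The memoization scheme can be sketched in Python as a bounded cache keyed on the previous vector and the filtered inputs. This is an illustration under assumptions, not the simulator's code: the class `TimingMemo`, the stub `toy_simulate` and the entry-count capacity are hypothetical, and an `OrderedDict` with least-recently-used eviction stands in for the hash bucket chains with MRU reordering described above.

```python
from collections import OrderedDict

class TimingMemo:
    """Cache circuit-simulation results keyed by switching history and inputs."""

    def __init__(self, simulate, capacity=1024, ignore_op_b=("load", "store")):
        self.simulate = simulate              # fallback: full circuit simulation
        self.capacity = capacity              # bound on entries (stand-in for 256 MB)
        self.ignore_op_b = set(ignore_op_b)   # opcodes whose operand B is masked out
        self.table = OrderedDict()
        self.hits = self.misses = 0

    def evaluate(self, opcode, prev_vector, op_a, op_b):
        if opcode in self.ignore_op_b:
            op_b = 0                          # per-opcode input filtering
        key = (opcode, prev_vector, op_a, op_b)
        if key in self.table:
            self.hits += 1
            self.table.move_to_end(key)       # keep recently used entries
            return self.table[key]
        self.misses += 1
        result = self.simulate(opcode, prev_vector, op_a, op_b)
        self.table[key] = result
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)    # evict the least recently used entry
        return result

def toy_simulate(opcode, prev_vector, op_a, op_b):
    """Stub for the gate-level simulator; returns (delay in ns, energy in pJ)."""
    return (2.5, 250.0)

memo = TimingMemo(toy_simulate)
memo.evaluate("add", 0x00AE, 0x01FC, 0x0012)    # miss: runs the simulator
memo.evaluate("add", 0x00AE, 0x01FC, 0x0012)    # hit: same history and inputs
memo.evaluate("load", 0x00AE, 0x01FC, 0x0012)   # miss: first load
memo.evaluate("load", 0x00AE, 0x01FC, 0x003B)   # hit: operand B is ignored
```

The last call shows why input filtering raises the hit rate: two loads that differ only in an operand the stage logic never reads map to the same cache entry.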

SimPoint Analysis
After marshalling all optimization techniques, we achieve approximately 1000 instructions per second
We deploy a recently developed simulation sampling technique, SimPoint
  Uses Basic Block Distribution Analysis to extract representative samples (10 million insts) of the original benchmark (1 billion insts)
  Drastically reduces the number of instructions observed to characterize the program’s performance
  Error analysis indicates an error of less than 10% (typically less than 3%) for a wide variety of benchmarks
A full benchmark execution can be completed within 5 hours

Circuit-aware simulation to evaluate Razor
Initial simulators utilized a hand-generated EX-stage circuit model
  Insufficient performance
Challenge: instruction latency (in cycles) depends on circuit evaluation latency
  Cycle count may vary with input, voltage, temperature, process variation
Circuit-Aware Architectural Simulation combines architectural and circuit simulation
  SimpleScalar architectural-level simulation
  Gate-level timing simulation of per-stage logic blocks

Case Study: Razor Timing Speculation

Circuit-Aware simulation in summary
Challenge is integration between architectural simulation and circuit simulation
  Must balance fidelity and speed of simulation
Three optimizations were utilized to enhance speed of simulation
  Constraint-based pruning
  Circuit simulation memoization
  SimPoint simulation sampling
Demonstrated with Razor pipeline simulations
  Razor architecture reacts cycle-by-cycle to circuit-level phenomena

Performance/verification - Circuit-Aware simulation Verification - Beta-release designs

Beta-Release Designs
[Timeline: traditional verification versus checked processor verification, marking tape out, beta, launch and step]
Traditional verification stalls launch until debug complete
Checked processor verification could overlap with launch
  Beta-release when checker works
  Launch when performance stable
  Step as needed without recalls

Additional CAD Opportunities
For synthesis:
  Typical-case library characterization (e.g., pdf of delay)
  Synthesize design for target performance, power, etc…
  TCO-style optimizations possible for macro-modules
For verification:
  Full formal verification for checker components
  Profile-directed simulation-based verification for core
For testing:
  Checker component can facilitate software-based manufacturing test of core components

Open discussion

Conclusion

Conclusions
Better-than-worst-case design abandons traditional worst-case design constraints
  Couples complex designs with checkers
  Enables CAD opportunities for typical-case optimization
  Requires tool support for observability, synthesis and verification
For more information:

References
T. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” ACM/IEEE 32nd Annual Symposium on Microarchitecture (MICRO-32), November 1999.
D. Ernst, N. S. Kim, S. Das, S. Pant, T. Pham, R. Rao, C. Ziesler, D. Blaauw, T. Austin, T. Mudge, and K. Flautner, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” 36th Annual Int’l Symposium on Microarchitecture (MICRO-36), December 2003.
N. R. Shanbhag, “Reliable and efficient system-on-chip design,” Computer, Vol. 37, Iss. 3, March 2004.
A. K. Uht, “Going beyond worst-case specs with TEAtime,” Computer, Vol. 37, Iss. 3, March 2004.
T. Austin, D. Blaauw, T. Mudge, and K. Flautner, “Making typical silicon matter with Razor,” Computer, Vol. 37, Iss. 3, March 2004.
S.-L. Lu, “Speeding up processing with approximation circuits,” Computer, Vol. 37, Iss. 3, March 2004.
F. Worm, P. Ienne, P. Thiran, and G. DeMicheli, “A Robust Self-Calibrating Transmission Scheme for On-Chip Networks,” IEEE Trans. on VLSI Systems, Vol. 13, Iss. 1, January 2005.
T. Kehl, “Hardware self-tuning and circuit performance monitoring,” Proceedings of International Conference on Computer Design, 1993.
L. Anghel and M. Nicolaidis, “Cost reduction and evaluation of temporary faults detecting technique,” Proceedings of the conference on Design, Automation and Test in Europe (DATE-2000), March 2000.

Supplemental Materials

More Details on Meta-Stability
Sub-critical operation invites meta-stability
Meta-stability detector itself can become meta-stable
  Double-latch the error signal to obtain a sufficiently small probability
[Schematic: flip-flop with meta-stability detector; signals clk, clk_b, clk_del, clk_del_b, D, Q, pos, neg, restore, fail, error; dynamic OR / latch]
[Diagram labels: restore, bubble, flush; flush entire pipe; no forward progress; reduce frequency]

Razor Short Path Constraint
[Diagram: pipeline logic between main flip-flops (clk) with a shadow latch (clk_del); the hold constraint is about 1/2 cycle; numbers on the logic paths are example delays]
Double-sampling, metastability-tolerant latches detect timing errors
  Second sample correct-by-design, use guarantees forward progress
Microarchitectural support restores correct program state
  Timing errors treated in the same way as branch mispredictions

Overcoming Short Path Constraints
Delayed clock imposes a short-path constraint
  Min. path delay > tdelay + thold
Razor necessary only for latches on slow paths
Pad fast path for latches with mixed path delays
Trade-off between DVS headroom and short path constraints
[Diagram: clock and clock_del waveforms marking tdelay, thold and the minimum path delay for the intended path and the short path; long paths and short paths feed a Razor_ff or a plain ff, with short paths padded with extra delay]
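The short-path constraint translates directly into a check that could be run over the minimum path delays into each Razor flip-flop. The Python sketch below uses hypothetical function names and made-up delay values; it only restates the inequality Min. path delay > tdelay + thold.

```python
def meets_short_path(min_path_delay_ns, t_delay_ns, t_hold_ns):
    """True if the fastest path cannot corrupt the shadow latch."""
    return min_path_delay_ns > t_delay_ns + t_hold_ns

def shortfall_ns(min_path_delay_ns, t_delay_ns, t_hold_ns):
    """How far the fastest path falls short of tdelay + thold (0 if it does not).

    Padding must add strictly more than this amount, since the constraint is
    a strict inequality; a real flow would also add margin.
    """
    return max(0.0, t_delay_ns + t_hold_ns - min_path_delay_ns)

ok_path = meets_short_path(3.0, t_delay_ns=2.0, t_hold_ns=0.5)
short_path = meets_short_path(1.0, t_delay_ns=2.0, t_hold_ns=0.5)
needed = shortfall_ns(1.0, t_delay_ns=2.0, t_hold_ns=0.5)
```

A larger tdelay widens the window in which late data can still be caught, which gives more DVS headroom, but it also raises the bar every short path must clear; that is the trade-off the slide names.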

Power Overhead of the Razor Flip-Flop
Error-free operation
  Standard flip-flop energy (static/switching): 49 fJ / 125 fJ
  RFF energy (static/switching): 60 fJ / 203 fJ
Error detection and recovery
  Energy of RFF per error event: 260 fJ
38% error-free latch overhead (assuming 20% switching activity)
42% latch overhead with errors (20% switching, 1% error rate)
Overhead mitigated by latch-frugal architecture
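The two overhead percentages can be reproduced from the per-event energies with a short calculation in Python. One interpretation is assumed here: "static" is taken to be the energy of a cycle in which the data does not switch, so the average is weighted 80% static and 20% switching. Under that assumption the arithmetic lands on the quoted 38% and 42%.

```python
def average_energy_fj(static_fj, switching_fj, activity):
    """Average energy per cycle for a latch that switches in a fraction
    `activity` of cycles (assumed interpretation of static/switching)."""
    return (1 - activity) * static_fj + activity * switching_fj

ACTIVITY = 0.20      # 20% switching activity
ERROR_RATE = 0.01    # 1% of cycles incur an error event

std_ff = average_energy_fj(49, 125, ACTIVITY)    # standard flip-flop
razor_ff = average_energy_fj(60, 203, ACTIVITY)  # Razor flip-flop, no errors
razor_ff_err = razor_ff + ERROR_RATE * 260       # add per-error recovery energy

overhead_error_free = razor_ff / std_ff - 1
overhead_with_errors = razor_ff_err / std_ff - 1
```

Written out, the averages are 64.2 fJ for the standard flip-flop and 88.6 fJ for the Razor flip-flop, and the error term adds 2.6 fJ per cycle at a 1% error rate.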

Simultaneous Events
[Schematic: two-input NOR gate (inputs A and B, output F, transistor widths WP and WN, CL = 10 pF) between Vdd and GND; simultaneous events on A and B produce a software glitch on F]
Cancel a pair of close events that may cause a software static glitch, which cannot occur in real circuits; source of inaccuracy

Accuracy of Simulators
Accuracy of simulation
  Validation against a set of SPICE simulations with a number of circuit topologies at varied voltages and input slew rates
  Error rates are consistently less than 11%, with most less than 3%
The initial speed of simulation without optimization is 150 instructions per second (comparable to VCS, the Verilog Compiler Simulator)

Speeding the Checker with Core Computation
Example code:
  ld f1,(X)
  f4 = f1 * f2 + f3
  br f4 < 0, skip
  r8 = r8 + 1
  skip: ...
[Diagram: core processor execution of ld, *, +, br, + is slowed by a cache miss, a long operation and a misprediction; checker execution of the same instructions completes each one with “ok”]
Checker executes in wake of core
  Leverages non-binding predictions & prefetches
Virtually no stalls remain to slow checker
  Control hazards resolved during core execution
  Data hazards eliminated by prefetches and input value predictions
Complex microarchitectural structures only necessary in core

Motivating Observations
Speculative execution is fault-tolerant
  Design errors, timing errors, and electrical faults only manifest as performance divots
  Correct checking mechanism will fix errors
What if all computation, communication, control, and progress were speculative?
  Maximally speculative: any incorrect computation fixed
  Minimally correct: any core fault fixed
[Diagram: branch predictor array indexed by PC with a stuck-at fault (X), behaving as always not taken]

Optimized System Architecture
Performance impacts eliminated
  Checker RF allows core commit
  No storage hazards
  Few checker cache misses
Less expensive core storage architecture (same as baseline)
Core cache failures affect checker
[Diagram: core (IF, ID, REN, REG, SCHEDULER, EX/MEM) with L1 I-cache 64 KB, 1 port; L1 D-cache 64 KB, 2 ports; RF 8 ports. Checker (CHK, CT) with L0 inst 0.5 KB, 2 ports; L0 data 4 KB, 4 ports; RF 8 ports. L2 unified cache 256 KB, 1 port]
Slowdowns: -0.4% (best: -3.2%, worst: 0.2%)

Fully Decoupled System Architecture
Checker fully decoupled
  Core L1 caches may fail
  All L2 writebacks from checker
  Core caches flushed on fault
  Core accesses and misses warm up checker caches
Eliminates common mode core cache failures
  But, generates more L2 traffic
Further optimizations possible
[Diagram: core (IF, ID, REN, REG, SCHEDULER, EX/MEM) with L1 I-cache 64 KB, 1 port; L1 D-cache 64 KB, 2 ports; RF 8 ports; a prefetch stream to the checker. Checker (CHK, CT) with L1 inst 0.5 KB, 2 ports; L1 data 4 KB, 4 ports; RF 8 ports. L2 unified cache 256 KB, 1 port]
Slowdowns: 1.2% (best: 0%, worst: 6.7%)

Intelligent Energy Management
ARM’s Intelligent Energy Manager (IEM)
  Automatic tuning mechanism
  Processor voltage automatically tuned to external ambient conditions
  Inverter chain designed to track most restrictive critical path, margin still required
[Floorplan: slack detector, L2 cache control, L2 tags, floating point and graphics, data cache, cache, Ex unit, control unit, mem control]

EX-Stage Analysis – Optimal Voltage Sweep

Simulation Results: Energy-Optimal Voltage
[Graph: relative energy (bars), error rate (line) and IPC hit (line)]

Low-Cost SER and Noise Protection
[Diagram: core pipeline (IF, ID, REN, REG, SCHEDULER, EX/MEM) feeding the checker pipeline (CHK IF, CHK ID/REG, CHK EX, CHK MEM, CT) with triplicated control (CTL) and a “3rd opinion” path]
Only need to address transients in checker
  Checker detects and corrects noise-related faults in core
  Core processor designed without regard to strikes (e.g., no ECC…)
Recycle checker inputs on a suspected core fault
  If no error on third execution, a transient strike occurred in the checker processor
  If error on third execution, a core processor fault occurred (e.g., SER, design error)
Protect critical checker control with triple-modular redundant (TMR) logic
  TMR on simple control results in only 1.3% larger checker (synthesized design)

Fully Testable Microprocessor Designs
Checker structure facilitates manufacturing tests
  All checker inputs exposed to built-in-self-test logic
  Checker provides built-in test signature compression
Checker can be fully tested with a small BIST module
  Less than 0.5% area increase
Reduces burden of testing on core
  Missed core defects corrected
  Checker acts as core tester
[Diagram: checker pipeline (IF, ID, EX, MEM, CT, WT with I-cache, RF and D-cache) driven by BIST ROM and control; the OK result answers “defect free?”]

Error Rate and Normalized Energy Savings for Chip1
[Plot: percentage error rate and normalized energy versus voltage (in volts), at 120 MHz and 140 MHz]

Measured Power and Energy (120 MHz, 27 °C)
        Point of First Failure                      Point of 0.1% Error Rate
        Power (mW)   Energy per Instruction (pJ)    Power (mW)   Energy per Instruction (pJ)
Chip1   104.5        870                            89.7         740
Chip2   119.4        990                            99.6         830
Energy per instruction = Power/IPC/Freq
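The energy-per-instruction column follows from the stated formula, Power/IPC/Freq. The Python sketch below works the unit conversion for one entry; the IPC of 1.0 is an assumption, since the measured IPC is not given on the slide, and `energy_per_instruction_pj` is a hypothetical helper name.

```python
def energy_per_instruction_pj(power_mw, ipc, freq_mhz):
    """Energy per instruction in picojoules.

    power / (ipc * freq) in mW per MHz gives nanojoules per instruction
    (1e-3 W / 1e6 Hz = 1e-9 J), so multiply by 1000 to get picojoules.
    """
    return power_mw / (ipc * freq_mhz) * 1000.0

# Chip1 at the point of first failure, assuming IPC = 1.0 (not from the slide).
chip1_poff_pj = energy_per_instruction_pj(104.5, ipc=1.0, freq_mhz=120.0)
```

Under that assumption the result is about 870.8 pJ, in line with the 870 pJ entry for Chip1 at the point of first failure.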
