Enabling System-Level Modeling of Variation-Induced Faults in Networks-on-Chips Konstantinos Aisopos (Princeton, MIT) Chia-Hsin Owen Chen (MIT) Li-Shiuan.

1 Enabling System-Level Modeling of Variation-Induced Faults in Networks-on-Chips
Konstantinos Aisopos (Princeton, MIT), Chia-Hsin Owen Chen (MIT), Li-Shiuan Peh (MIT)

2 The Tale of Resilient NoCs
Silicon technologies move into the nanometer regime
→ Devices become unreliable due to Process Variation (PV)
→ System designers propose resilient NoC architectures
From 1994 to 2011…
- Dally’s Reliable Router (1994)
- RoCo (ISCA’06)
- BulletProof (HPCA’06)
- Vicis (DAC’09)
What fault model are these proposals evaluated with?
A uniform random fault distribution across gates: >50% inaccuracy in capturing fault locations.
Can we do better?

3 Methodology for Accurate Fault Modeling
What is the golden reference of the expected PV maps?
The SPICE models of the Standard Cells of the technology.
How do we use them to capture variation-induced faults?
[Flow diagram]
Layouts of Standard Cells → (extraction) → SPICE models of Standard Cells
router RTL → (synthesis) → synthesized netlist (list of standard cells and their interconnections)
SPICE model netlist → Monte Carlo simulations

4 Methodology for Accurate Fault Modeling
Challenge: duration of simulation
Solution: hybrid timing / circuit-level simulation
Step 1. Find the critical paths and the inputs that result in their longest delays (with Static Timing Analysis)
Step 2. Perform Monte Carlo circuit-level simulations only for these paths / input permutations to capture variation-induced timing violations
Step 3. Map circuit-level timing violations back to system-level faults
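Step 2 can be illustrated with a minimal Python sketch. It is not the authors' flow: the slides run Monte Carlo simulations on SPICE models, whereas this sketch swaps in a simple Gaussian gate-delay model as a stand-in. The function name, the delay values, and the sigma_fraction parameter are all hypothetical.

```python
import random


def count_timing_violations(nominal_delays_ps, sigma_fraction,
                            clock_period_ps, num_trials, seed=0):
    """Count Monte Carlo trials in which one path misses the clock period.

    ASSUMPTION: each gate delay is drawn independently from a Gaussian
    centred on its nominal delay. The slides use SPICE-level simulation;
    this Gaussian model is only a stand-in for illustration.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    violations = 0
    for _ in range(num_trials):
        # One sampled "die": perturb every gate on the critical path.
        path_delay = sum(rng.gauss(d, sigma_fraction * d)
                         for d in nominal_delays_ps)
        if path_delay > clock_period_ps:
            violations += 1
    return violations


# Hypothetical critical path of three gates, nominally 100 ps each.
path = [100.0, 100.0, 100.0]
violations = count_timing_violations(path, sigma_fraction=0.05,
                                     clock_period_ps=320.0, num_trials=100)
print(f"{violations}/100 trials violated timing")
```

The ratio of violations to trials is the kind of per-path figure that Step 3 then maps to system-level faults.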

5 Methodology for Accurate Fault Modeling
Step 3: mapping circuit-level violations → system-level faults
Each Verilog signal piggybacks a vector of system-level faults.
[Diagram: critical path 1 and critical path 2, with X marks showing which fault types (unfair arbitration, data corruption, packet loss) each path can trigger]
Number of timing violations in 100 Monte Carlo simulations: 1/100 and 3/100 for the two critical paths
P(fault type = packet loss) = (1/100) ∪ (3/100)
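The union on this slide can be worked through numerically. The slide does not say how the union is evaluated, so the sketch below assumes the two paths fail independently; that assumption and the function name union_probability are illustrative, not taken from the tool.

```python
def union_probability(path_probabilities):
    """Probability that at least one path causes the fault.

    ASSUMPTION: timing violations on different critical paths are
    independent events. The slide only writes the union symbol.
    """
    no_fault = 1.0
    for p in path_probabilities:
        no_fault *= (1.0 - p)  # probability that this path does not violate
    return 1.0 - no_fault


# Violation rates from the slide: 1/100 and 3/100.
p_packet_loss = union_probability([1 / 100, 3 / 100])
print(round(p_packet_loss, 4))  # 0.0397
```

Under this assumption the result is slightly below the simple sum of 0.04, which is an upper bound on the union however the two paths are correlated.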

6 Probability / System Impact of Faults?
(1) for fixed configuration and fixed runtime conditions
configuration: 5-input / 5-output router, 4-stage pipeline, 4 private VCs, 3 buffers/VC, 64-bit wires
runtime conditions: 2.8GHz, 27°C
[Chart: probability of occurrence per fault type. Fault types: data corruption, packet loss, misrouting, credit generation, credit loss, erroneous allocation, unfair arbitration, packet duplication. Category labels: packet conservation, flow control]

7 Probability / System Impact of Faults?
(2) for dynamic runtime conditions

Router configuration / system and network configuration:
data vnet: num VCs 2, num buff/VC 3
control vnet: num VCs 2, num buff/VC 1
channel width (bits): 64
num inputs: 5 (4 directions, network interface)
num outputs: 5 (4 directions, network interface)
frequency: 75% synthesis frequency (2.85GHz)
temperature: not fixed (input argument)
core power: 1 watt
topology: 8x8 mesh, 4 memory controllers at corners
floorplan: 256mm^2, 2mm x 2mm cores, 0.2mm x 0.2mm routers
L1 cache: 32KB/node, private unified, 2W, MESI
L2 cache: 1MB/node, shared distributed, 16W
workload: uniform random traffic, PARSEC suite

Process parameters:
- threshold voltage (μ,σ)
- transistor width (μ,σ)
- transistor length (μ,σ)
- oxide thickness (μ,σ)

[Tool-flow diagram labels: GEMS multi-core simulator, Garnet network simulator, Orion 2.0 power model (power), Hotspot 5.0 thermal model (floorplan, temperature), Fault Model (temperature, process parameters, probability of faults), temperature = fixed °C]

8 Probability / System Impact of Faults?
(2) for dynamic runtime conditions
- 8%-10% fault probabilities for high traffic
- up to 1% fault probabilities for real workloads

9 Conclusions
Presented a fault modeling tool for system-level simulators.
Accurate + easy to integrate into any network simulator (already available in GEMS and GARNET).
Do you need a fault model to accurately evaluate…
- …a resilient coherence protocol (tolerating lost messages)?
- …a resilient routing algorithm (tolerating misrouted packets)?
- …an Error Correction Code (protecting data bits)?
…then consider integrating our tool into your simulator to accurately model faults!
Download here: www.mit.edu/~kaisopos/FaultModel
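As a rough illustration of what integration into a simulator could look like, the Python sketch below samples which fault types fire for one router event, given per-fault-type probabilities. The slides do not show the tool's actual interface, so the FaultInjector class, its sample method, and the probability values are all hypothetical.

```python
import random


class FaultInjector:
    """Samples system-level faults from per-fault-type probabilities.

    HYPOTHETICAL interface: the slides do not describe the tool's API.
    """

    def __init__(self, fault_probabilities, seed=0):
        for name, p in fault_probabilities.items():
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"probability for {name!r} must be in [0, 1]")
        self._probs = dict(fault_probabilities)
        self._rng = random.Random(seed)  # seeded for reproducible runs

    def sample(self):
        """Return the set of fault types that fire for one router event."""
        return {name for name, p in self._probs.items()
                if self._rng.random() < p}


# Placeholder probabilities, not results from the presentation.
injector = FaultInjector({"packet_loss": 0.01, "misrouting": 0.005})
faults = injector.sample()  # e.g. a router model would call this per flit
```

A simulator would call sample() at the point where a flit or credit is processed and then apply each returned fault type, for example by dropping the packet or corrupting its route.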

