
1 Assessing SEU Vulnerability via Circuit-Level Timing Analysis
WAR-1, Barcelona, Spain, November 13, 2005
Kypros Constantinides ‡, Stephen Plaza ‡, Jason Blome ‡, Bin Zhang †, Valeria Bertacco ‡, Scott Mahlke ‡, Todd Austin ‡, Michael Orshansky †
‡ Advanced Computer Architecture Lab, University of Michigan
† Department of Electrical and Computer Engineering, University of Texas at Austin

2 Introduction
- Concern about transient faults in combinational logic has been growing
- Numerous techniques already exist that deal with the effects of transient faults:
  - Error Correction Codes (ECC)
  - DIVA
  - Simultaneous Redundant Threading (SRT)
  - and many others
- However, these techniques come at a cost in performance, power, die size, and design time

3 Introduction
- Designers have to trade off the reliability provided against the implementation cost:
  - Inadequate soft-error protection: the design may be useless due to poor reliability
  - Excessive soft-error protection: the design is uncompetitive in cost and/or performance
- To balance this trade-off, system designers need accurate soft-error rates (SERs) for their designs
- The device community provides raw SERs for devices of current technologies and projections for devices of future technologies
- However, architecture-level and circuit-level phenomena derate the raw SER
- Accurately assessing a design’s SER requires an analysis infrastructure with circuit-level detail

4 In This Work…
We introduce a high-fidelity, high-performance simulation infrastructure for estimating soft-error rates. It:
- asynchronously injects voltage pulses of various durations at the gate level
- accurately gauges detailed circuit phenomena to model:
  - fault introduction
  - fault propagation
  - possible fault masking
- simulates fast enough to permit the examination of entire workloads on complex designs (thousands of gates)

5 Soft Error Masking
- Fortunately, not all transient faults cause an error
  - Circuit and architectural phenomena prevent the fault from propagating to the design’s output and causing an error
- Types of masking:
  - Logic masking
  - Timing masking
  - Electrical masking
  - Microarchitecture masking
  - Software masking

6 Soft Error Masking
- Logic Masking: the fault is blocked by a following gate whose output is completely determined by its other inputs
- Timing Masking: the fault affects the input of a latch only during the period in which the latch is not sensitive to its input
- Electrical Masking: the fault’s pulse is attenuated by subsequent logic gates due to their electrical properties, and does not affect any latch’s input
- Microarchitectural Masking: the fault alters the value of at least one flip-flop, but the incorrect values are overwritten without being used in any computation affecting the design’s output
- Software Masking: the fault propagates to the design’s output but is subsequently masked by software without affecting the application’s correct execution
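The first two definitions can be illustrated with a small Python sketch. It is only a toy, not the authors' simulator: the helper names `is_logic_masked` and `is_timing_masked`, the single AND gate, and the latch window expressed as fractions of a clock period are all assumptions made for illustration.

```python
def and_gate(a, b):
    """Two-input AND gate on 0/1 values."""
    return a & b


def is_logic_masked(gate, inputs, faulty_index):
    """True if flipping one input leaves the gate output unchanged.

    Hypothetical helper: a fault on `faulty_index` is logically masked
    when the other inputs alone determine the output.
    """
    faulty = list(inputs)
    faulty[faulty_index] ^= 1
    return gate(*inputs) == gate(*faulty)


def is_timing_masked(pulse_start, pulse_end, window_start, window_end):
    """True if the pulse never overlaps the latch's sensitive window.

    All times are fractions of one clock period (an assumed convention).
    """
    return pulse_end <= window_start or pulse_start >= window_end


# Other AND input is 0: the output is forced to 0, so the fault is masked.
print(is_logic_masked(and_gate, (1, 0), 0))
# Other AND input is 1: the flipped value reaches the output.
print(is_logic_masked(and_gate, (1, 1), 0))
# Pulse ends before the assumed sensitive window [0.8, 1.0) opens.
print(is_timing_masked(0.1, 0.3, 0.8, 1.0))
```

Electrical masking depends on analog pulse attenuation and is not captured by a logic-level toy like this one.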

7 Simulation Infrastructure
- Design Under Test: gate-level description of the design (netlist)
  - Fault-Exposed Model: subjected to fault injection
  - Golden Model: no faults injected
- Fault Generator: injects voltage pulses of various durations at any gate in the design and flips the value of any flip-flop in the design
  - faults are uniformly distributed in time, location, and duration
- Fault Analyzer: monitors manifested errors and tracks all the possible ways a fault can be masked
- Model Stimuli: workload traces that exercise the design under test
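The golden versus fault-exposed comparison can be sketched as follows. This is a behavioral toy under stated assumptions, not the gate-level infrastructure described on the slide: the one-register design in `step`, the `write`/`read`/`idle` trace format, and the `classify` helper are all hypothetical.

```python
def step(data, op, value):
    """One clock cycle of a hypothetical one-register design.

    Returns (next_register_value, output). The register is only
    visible at the output on a "read".
    """
    if op == "write":
        return value, 0
    if op == "read":
        return data, data
    return data, 0  # "idle": hold the value, drive nothing


def simulate(trace, flip_cycle=None, flip_bit=0):
    """Run the trace; optionally flip one register bit before a given cycle."""
    data, outputs = 0, []
    for cycle, (op, value) in enumerate(trace):
        if cycle == flip_cycle:
            data ^= 1 << flip_bit  # injected flip-flop upset
        data, out = step(data, op, value)
        outputs.append(out)
    return outputs


def classify(trace, flip_cycle, flip_bit=0):
    """Compare the fault-exposed run against the golden run."""
    golden = simulate(trace)
    faulty = simulate(trace, flip_cycle, flip_bit)
    return "masked" if golden == faulty else "error"


trace = [("write", 5), ("idle", 0), ("write", 9), ("read", 0)]
print(classify(trace, flip_cycle=1))  # corrupted value is overwritten before use
print(classify(trace, flip_cycle=3))  # corrupted value is read out
```

A flip at cycle 1 is overwritten by the write at cycle 2, which corresponds to microarchitectural masking; a flip at cycle 3 reaches the output and counts as an error.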

8 Statistical Model for Transient Faults
- Pulse-based model for transient faults caused by energetic particle strikes
- Faults injected into combinational logic are classified by their duration
  - 20%, 40%, 60%, 80%, and 100% of the design’s clock period
- Faults injected into sequential elements flip their value
- The arrival rate of each type of fault is modeled by a separate random variable
- The mean inter-arrival times for each fault type are derived from previously published data and detailed SPICE simulations
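A fault schedule along these lines might be generated as in the sketch below. Several things here are assumptions rather than details from the slides: the exponential inter-arrival distribution, the mean values in `HYPOTHETICAL_MEAN_CYCLES`, and the function name `fault_schedule`. Only the five pulse-duration classes and the uniform choice of location come from the presentation, and flip-flop upsets are left out for brevity.

```python
import random

# Pulse durations as fractions of the clock period (from the slide).
PULSE_FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)

# Hypothetical mean inter-arrival times, in clock cycles, per pulse class.
# The real values come from published data and SPICE runs not shown here.
HYPOTHETICAL_MEAN_CYCLES = {0.2: 40.0, 0.4: 60.0, 0.6: 80.0, 0.8: 100.0, 1.0: 120.0}


def fault_schedule(mean_interarrival, horizon, num_gates, seed=0):
    """Return sorted (time, pulse_fraction, gate_index) fault events.

    Each pulse class gets its own arrival process (assumed exponential
    inter-arrival times); the struck gate is chosen uniformly.
    """
    rng = random.Random(seed)
    events = []
    for fraction in PULSE_FRACTIONS:
        rate = 1.0 / mean_interarrival[fraction]
        t = rng.expovariate(rate)
        while t < horizon:
            events.append((t, fraction, rng.randrange(num_gates)))
            t += rng.expovariate(rate)
    return sorted(events)


schedule = fault_schedule(HYPOTHETICAL_MEAN_CYCLES, horizon=1000.0, num_gates=9000)
for time, fraction, gate in schedule[:3]:
    print(f"cycle {time:.2f}: pulse of {fraction:.0%} of the period at gate {gate}")
```

Fixing the seed keeps a run reproducible, which helps when the same schedule has to be replayed against different workloads.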

9 Design Under Test – CMP Switch
- The design under test is a single-chip multiprocessor (CMP) interconnection switch (baseline provided by Li-Shiuan Peh)
  - Much less complex than a microprocessor, yet not too simplistic (it includes finite state machines, buffers, control logic, and buses)
- Wormhole switch pipelined at the flit level
- Specified in Verilog and synthesized to a gate-level netlist
- ~9K logic gates and 1700 sequential elements
- Realistic workload
  - Communication traces derived from the TRIPS architecture

10 Characterization per Fault Type
- High microarchitectural masking
  - 95% of the faults that flip a flip-flop’s value are masked
- Timing masking is significant only for faults with small pulse durations
- Logic masking increases as the fault’s pulse duration decreases
- Fault outcome breakdown (from the slide’s chart): 51.7% logic masking, 2.2% timing masking, 42.9% μarch masking, 3.2% error

11 Derating Factor
- Derating factor = 1 / error rate
  - e.g. a derating factor of 30 means that one of every 30 injected faults will cause an error (corresponds to an error rate of 3.3%)
- Average derating factor for realistic workloads is 31 (error rate: 3.2%)
- A synthetic high-utilization workload leads to a derating factor of 12 (error rate: 8.3%)
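The relation between error rate and derating factor is simple enough to check directly. The sketch below uses the two error rates quoted on the slide; the function name `derating_factor` is just an illustrative choice.

```python
def derating_factor(error_rate):
    """Derating factor as the reciprocal of the error rate (a fraction in (0, 1])."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return 1.0 / error_rate


realistic = derating_factor(0.032)  # realistic workloads, 3.2% error rate
synthetic = derating_factor(0.083)  # synthetic high-utilization workload, 8.3%

print(round(realistic))  # 31
print(round(synthetic))  # 12
```

Rounded to whole numbers, the results agree with the factors of 31 and 12 reported on the slide.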

12 Failure Rate Projections
- Taking into account projections from ITRS and raw SER estimates for future process technologies, we make failure rate projections that consider the transient-fault derating effects
- The design architecture is kept intact for future process technologies
- Two different designs:
  - one clocked at the projected clock frequencies for microprocessors
  - one clocked at the projected clock frequencies for interconnection networks

13 Transient-fault Vulnerability per Component
- Each switch component exhibited a different vulnerability to transient faults
- Derating effects depend greatly on the component’s characteristics
- Most vulnerable component: Switch Arbiter (12.8% error)
  - 6% of the switch’s area
- Input Controllers dominate the switch design
  - 86% of the switch’s area
- The switch’s overall vulnerability matches that of the input controllers

14 Effects of Multi-fault Strikes
- A single strike can cause multiple faults on neighbouring gates or flip-flops
  - there is a lack of data about the frequency of such events, and of models for multi-fault strikes on logic gates and flip-flops
  - we assume that every strike causes multiple faults (extremely pessimistic)
  - even under this severe environment the failure rates are relatively low

15 Conclusions – Directions for Future Work
Conclusions
- For complex designs there is significant fault masking, with derating factors as high as 30
- Soft-error derating effects depend highly on the design’s characteristics and utilization
- Our observations suggest that the soft-error reliability threat might have been overstated by the computer architecture community
  - Designers need to evaluate their design’s soft-error tolerance with detailed analysis tools that consider circuit-level derating effects, and to better trade off the protection provided against the implementation cost
Future Work
- Study the soft-error derating effects for several designs with different degrees of complexity and different characteristics
- Enhance our simulation infrastructure so that it can simulate large, high-complexity systems (millions of gates) with short simulation runs

16 Questions?

