Qiang XU CUhk REliable computing laboratory (CURE)

1 Fault-Tolerant Computing – It’s Time to Cross the Layer for Cost-Effectiveness
Qiang XU, CUhk REliable computing laboratory (CURE), Department of Computer Science & Engineering, The Chinese University of Hong Kong

2 Technology Scaling Continues…
Feature size shrinks to tens of atoms across!
Effects: manufacturing defects, process variation, transient errors from radiation, noise fluctuations, and fragile devices with shortened lifetimes.

3 Ever-Increasing Defect Density
IBM’s 8-core Cell processor chips: 10-20% yield

4 Defective Chip Identification
Testing is responsible for ensuring the quality of shipped products.
In the past …
[Figure: occurrence frequency of the GOOD and BAD populations, with the decision threshold marked. Redrawn from [O’Neill-itc07]]

5 Where is the Decision Threshold?
Nowadays …
[Figure: occurrence frequency of the GOOD and BAD populations, with the decision threshold marked. Process variation and functional/test mode discrepancy lead to TEST ESCAPE and FALSE REJECT regions. Redrawn from [O’Neill-itc07]]
Manufacturing test is NOT reliable any more!

6 Current Solution for Yield Improvement
Yield-driven redundancy:
Cisco’s 192-core Metro network processor contains 4 spares.
nVidia’s 128-core GeForce 8800 GPU can be degraded to a 96-core version if some cores are faulty.
A simple solution, but …
More and more redundant circuitry is necessary.
It requires precise offline testing.
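The effect of spare cores on yield can be illustrated with a simple binomial model. This is a sketch under stated assumptions, not a model from the slides: each core is assumed to fail independently with the same probability, defects outside the cores are ignored, and the function name and the defect probability in the usage example are hypothetical.

```python
from math import comb


def chip_yield(total_cores, required_cores, core_defect_prob):
    """Probability that at least `required_cores` of `total_cores` are fault-free.

    Assumes independent, identically distributed core defects (hypothetical).
    """
    max_faulty = total_cores - required_cores
    p = core_defect_prob
    return sum(
        comb(total_cores, k) * p**k * (1.0 - p) ** (total_cores - k)
        for k in range(max_faulty + 1)
    )


if __name__ == "__main__":
    # Assumed reading of the slide: 192 cores in total, 4 of them spares.
    print("no spares:", chip_yield(192, 192, 0.01))
    print("4 spares: ", chip_yield(192, 188, 0.01))
```

The same function covers the degraded-product case by lowering `required_cores`, for example `chip_yield(128, 96, p)`.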

7 Other Reliability Threats
Hard errors: time-dependent dielectric breakdown (TDDB), electromigration (EM), negative bias temperature instability (NBTI), stress migration (SM).
Soft errors: alpha particles, neutrons.
Intermittent faults.
Fault durations labeled on the slide: permanent, transient, burst for a period of time.
The hardware solution, again, means more redundant circuitry!

8 The Impact of Reliability Threats with Scaling
[Figure: failure rate versus time, with burn-in and useful-life regions. Annotations: difficult burn-in, higher failure rate, faster aging]

9 To Keep Scaling …
[Figure: cost per transistor versus year, with curves for transistor cost, reliability cost, and total cost]

10 To Achieve Cost-Effective Scaling
Unlike in the old days, defective/vulnerable ICs will be shipped to customers!
A cross-layer solution is the remedy for resilient system design!

11 Cross-Layer Reliability
Tolerate critical defects and soft/hard errors with high failure rates at the hardware level.
Mask non-critical defects and soft/hard errors with low failure rates at the hardware-dependent software level.
Take advantage of error tolerance at the application level.
[Figure: layer stack of Applications, Hw.-dependent Sw., and Defective/Vulnerable ICs]

12 Key Questions in Cross-Layer Reliability
Differentiate the impact of various reliability threats and tackle them at different layers!
@ Circuit level:
Which defects and soft/hard errors are critical enough to require hardware redundancy?
At which granularity should we protect?
The traditional pass/fail testing methodology no longer stands; what would be the new metrics for testing?
Online test and diagnosis are ever more important.

13 Key Questions in Cross-Layer Reliability
Differentiate the impact of various reliability threats and tackle them at different layers!
@ Hardware-dependent software level:
How to model various hardware faults accurately at this level?
How to allocate workloads intelligently to mitigate such errors?
@ Application level:
How to take application reliability requirements into account?
Is it possible to generalize such solutions?

14 Key Questions in Cross-Layer Reliability
Differentiate the impact of various reliability threats and tackle them at different layers!
@ System level: low-cost resilient designs under performance, power, and reliability constraints.
How to monitor the system’s reliability changes?
How do we evaluate the cross-layer reliability for the entire system?
Can we separate the layers clearly with only FIT or BER information?

15 High-Level Lifetime Reliability Modeling and Simulation Framework
[Figure: framework inputs]
DPM / DTM: DVFS, timeout, thermal throttling, power gating.
Redundancy: level, quantity.
Task allocation: round-robin, energy-driven.
Specification / IC design: functionality, expected service life, power consumption, area constraint, thermal issue.

16 The Challenge
Wear-out effects of hard errors: reliability at a specific time point depends on current reliability-related factors (e.g., temperature) and on aging effects due to past usage.
Significant temperature variation. [Figure: temperature variation example]
Temperature simulation is time-consuming: only short simulation time is affordable!

17 The Challenge – Simulation Framework
Clearly, it is not possible to trace temperature and aging-related execution parameters in a fine-grained manner throughout the entire lifetime.
What if we conduct coarse-grained tracing and compute lifetime reliability with the average operational temperature? Ignoring temperature variation results in a lack of accuracy.
How to achieve efficient yet accurate lifetime reliability simulation with limited fine-grained trace information, when failure mechanisms follow arbitrary failure distributions?
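Why the average operational temperature loses accuracy can be shown with a small numeric sketch. It is an illustration only, not the framework from the slides: it assumes the wear-out rate follows an Arrhenius-type law, exp(-Ea / (k·T)), and the activation energy and the temperature trace are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_rate(temp_kelvin, activation_energy_ev=0.7):
    """Relative wear-out rate at a given temperature (assumed Arrhenius law)."""
    return math.exp(-activation_energy_ev / (BOLTZMANN_EV * temp_kelvin))


def rate_from_average_temperature(trace):
    """Coarse-grained estimate: evaluate the rate once, at the mean temperature."""
    return arrhenius_rate(sum(trace) / len(trace))


def average_rate_over_trace(trace):
    """Fine-grained estimate: average the rate over every traced sample."""
    return sum(arrhenius_rate(t) for t in trace) / len(trace)


if __name__ == "__main__":
    trace = [320.0, 335.0, 350.0, 365.0, 380.0]  # hypothetical temperatures in K
    coarse = rate_from_average_temperature(trace)
    fine = average_rate_over_trace(trace)
    print(f"coarse/fine ratio: {coarse / fine:.3f}")
```

Because the assumed rate grows faster than linearly with temperature in this range, the coarse estimate comes out lower than the fine-grained one, so hot intervals contribute more aging than the average temperature suggests.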

18 Aging Rate Calculation
The key issue is to compute a time-independent aging rate Ω effectively with limited fine-grained trace information.
Given a general failure distribution R(t), e.g., the Weibull distribution, express it as R(t) = R(Θ·Ω·t); we then have [equation not preserved in the transcript].
Two steps:
Derive a closed-form lifetime reliability function with time-varying operational states and temperature.
Extract the time-independent aging rate parameter from this function.
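The idea of a time-independent aging rate can be sketched as follows. This is not the derivation from the slides, whose equation is not preserved in the transcript. It assumes a Weibull lifetime distribution, a short fine-grained trace that repeats over the whole lifetime, and an aging rate taken as the time-weighted mean of per-interval relative rates; all names and numbers are hypothetical.

```python
import math


def aging_rate(trace):
    """Time-weighted mean of relative aging rates.

    `trace` is a list of (duration, relative_rate) pairs from a short,
    fine-grained simulation that is assumed to repeat over the lifetime.
    """
    total_time = sum(duration for duration, _ in trace)
    return sum(duration * rate for duration, rate in trace) / total_time


def weibull_reliability(t, omega, scale, shape):
    """Weibull reliability with time stretched by the aging rate `omega`."""
    return math.exp(-((omega * t / scale) ** shape))


if __name__ == "__main__":
    # Hypothetical trace: 1 time unit idle (rate 1.0), 3 units busy (rate 2.0).
    omega = aging_rate([(1.0, 1.0), (3.0, 2.0)])
    for years in (1, 5, 10):
        r = weibull_reliability(years, omega, scale=20.0, shape=2.0)
        print(f"t={years} years: R={r:.4f}")
```

The point of the sketch is that the expensive fine-grained trace is consumed once to produce `omega`; after that, reliability at any time is a cheap closed-form evaluation.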

19 Lifetime Reliability Simulation Framework – AgeSim
Evaluates lifetime reliability under various usage strategies and workloads: DPM / DTM, trigger mechanism, load-sharing strategy, redundancy scheme.
Applicable to any failure distribution.
Also outputs performance and energy consumption.

20 Asymmetry-Aware Processor Allocation for Chip Multiprocessor
Chip multiprocessors come with an increasing number of processor cores.
However, technology scaling also results in defective cores on-chip and cores with distinct performance.

21 Asymmetric Chip Multiprocessor
Performance asymmetry: process variation (significant frequency deviation on a chip, up to 40%); dynamic power-performance adaptation.
Topology asymmetry: manufacturing defects; wearout effect.

22 Hide Hardware Defects @ OS Level
[Figure: applications see a unified topology presented by the OS on top of the chip multiprocessor. The underlying hardware consists of fault-free cores, faulty cores, and routers]

23 Asymmetry-Aware Processor Allocation
We propose two contiguous processor allocation methodologies with different computing power representations, considering:
Performance, including communication overhead.
Processor allocation time.
System Load = Mean Application Service Rate / Mean Application Arrival Rate
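What contiguous allocation around faulty cores means can be illustrated with a generic first-fit search over a 2D mesh. This is not either of the two proposed methodologies, which the transcript does not describe; the mesh representation and the function name are hypothetical, and performance asymmetry between cores is ignored.

```python
def first_fit_submesh(mesh, height, width):
    """Find the first height x width block of usable cores.

    `mesh` is a list of rows; True marks a usable core (fault-free and idle),
    False marks a faulty or busy one. Returns the (row, col) of the block's
    top-left corner, or None if no contiguous block fits.
    """
    rows, cols = len(mesh), len(mesh[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            if all(mesh[r + i][c + j] for i in range(height) for j in range(width)):
                return (r, c)
    return None


if __name__ == "__main__":
    mesh = [
        [True, False, True],   # core (0, 1) is faulty
        [True, True, True],
        [True, True, True],
    ]
    print(first_fit_submesh(mesh, 2, 2))
    print(first_fit_submesh(mesh, 3, 3))
```

A scan like this costs time proportional to the mesh size times the block size, which gives a sense of why processor allocation time is one of the criteria listed above.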

24 Thank you for your attention!

