SDC is in the eye of the beholder: A Survey and preliminary study


1 SDC is in the eye of the beholder: A Survey and preliminary study
Bo Fang*, Panruo Wu✝, Qiang Guan☨, Nathan DeBardeleben☨, Laura Monroe☨, Sean Blanchard☨, Zizong Chen✝, Karthik Pattabiraman* and Matei Ripeanu*
* The University of British Columbia, Canada
☨ Ultrascale System Research Center, Los Alamos National Lab, USA
✝ The University of California Riverside, USA
This is a position paper on how to think about characterizing and detecting silent data corruptions (SDCs).

2 VS But many of us still do!
Do not compare apples and oranges: they are obviously different things. Less obviously, in computer science we sometimes still do exactly that. I will give you an example in a couple of slides.

3 Error Resilience Fault Error Failure
SoC soft error trends: the overall FIT rate per SoC is increasing [DATE 2014, Chandra AMD].
A very important concept is the fault-error-failure chain.
Fault: in our context, a hardware fault caused by particle strikes, neutrons, hardware defects, etc. It can cause, for example, bit-flips.
Error: deviation of system behavior from the fault-free run.
Failure: violation of the system's specification, e.g. a crash.
As fault rates keep rising, the study of error resilience becomes more and more important.

4 Error Protection Space
How large is the space? Errors can appear in any layer. Only a fraction of the errors at the circuit level impact the application. Protection cost differs across layers. Protection with software-based techniques is essential for modern systems. The apples-and-oranges comparison happens when we try to compare the efficiency of two FT techniques that are designed for different layers.

5 Focus: Silent Data Corruption
A fault can lead to a crash, a hang, an SDC (no sign of incorrect execution), or normal execution. The preliminary study is reported here.
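The outcome categories on this slide can be sketched as a small classifier for a single fault-injection run. This is a minimal illustration rather than the tooling used in the study: the names `Outcome` and `classify_run` are hypothetical, and treating a nonzero exit code as a crash is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    CRASH = "crash"
    HANG = "hang"
    SDC = "sdc"
    NORMAL = "normal"


def classify_run(exit_code, timed_out, output, golden_output):
    """Classify one fault-injection run against a fault-free (golden) run."""
    if timed_out:
        return Outcome.HANG    # the run exceeded its time budget
    if exit_code != 0:
        return Outcome.CRASH   # assumed convention: nonzero exit code means crash
    if output != golden_output:
        return Outcome.SDC     # finished cleanly but the output is wrong
    return Outcome.NORMAL      # the fault was masked
```

Note that the SDC branch here uses exact output comparison; a later slide argues that this choice is itself one of several possible definitions.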

6 Preliminary Study: Cross-layer Data Corruptions
Error propagation: how much fault masking is there across different perspectives of the system? Fault injection: PINFI [Wei DSN14] on DOE mini apps. Fault model: a single bit-flip in the computation units of the processor.
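To make the single bit-flip fault model concrete, here is a minimal sketch that flips one bit in the IEEE 754 binary64 encoding of a Python float. It is not PINFI and does not inject into a running program; the function name `flip_bit` is hypothetical, and the choice of a 64-bit floating-point value as the target is an assumption made for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the float as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-bit mask flips exactly that bit.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped
```

Flipping a low mantissa bit changes the value only slightly, while flipping an exponent or sign bit changes it drastically, which is one reason the same fault model can produce very different outcomes.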

7 Experimental Configuration
Application output and application-specific correctness check:
LULESH. Output: number of iterations, final origin energy, measures of symmetry. Check: number of iterations exactly the same; final origin energy correct to at least 6 digits; measures of symmetry smaller than 10^-8.
HPL. Output: solution vector x. Check: residual check on x.
CLAMR. Output: number of cell units, mass change per iteration. Check: threshold for the mass change per iteration.
We measure memory, output, and the application-specific correctness check.
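The LULESH and HPL checks listed on this slide can be sketched as follows. This is an illustration under stated assumptions, not the benchmarks' own verification code: the function names and dictionary keys are hypothetical, "correct to at least 6 digits" is interpreted as a relative tolerance of 10^-6, and the residual check is a simplified scaled residual with an assumed threshold of 16.0.

```python
import math
import sys


def lulesh_check(run, golden, digits=6, symmetry_tol=1e-8):
    """Apply the three LULESH criteria to a run, given a golden run."""
    same_iterations = run["iterations"] == golden["iterations"]
    # Assumption: "correct to N digits" means relative error below 10^-N.
    energy_ok = math.isclose(
        run["final_origin_energy"],
        golden["final_origin_energy"],
        rel_tol=10.0 ** (-digits),
    )
    symmetric = all(abs(s) < symmetry_tol for s in run["symmetry"])
    return same_iterations and energy_ok and symmetric


def residual_check(A, x, b, threshold=16.0):
    """Scaled residual ||Ax - b|| / (eps * (||A|| * ||x|| + ||b||) * n), infinity norms."""
    n = len(b)
    eps = sys.float_info.epsilon
    residual = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    norm_r = max(abs(v) for v in residual)
    norm_a = max(sum(abs(v) for v in row) for row in A)
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    scaled = norm_r / (eps * (norm_a * norm_x + norm_b) * n)
    return scaled < threshold
```

Both checks accept outputs that are not bit-identical to the golden run, so a run can differ at the output level and still count as correct at the application level.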

8 Cross-layer Error Resilience
Data corruption rate: 46% of faults cause memory corruptions but have no impact on final correctness. 50% of output corruptions do not lead to a deviation in final correctness. Error resilience estimates can therefore be misleading, depending on the layer at which they are measured.
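The way the corruption rate depends on the observation layer can be sketched with a per-run record that notes, for each injected fault, whether memory, the output, or the application-specific check was affected. The record type `Run` and the function `corruption_rates` are hypothetical names, and no data from the study is reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    memory_corrupted: bool   # memory state differs from the golden run
    output_corrupted: bool   # application output differs from the golden run
    check_failed: bool       # application-specific correctness check failed


def corruption_rates(runs):
    """Return the fraction of runs counted as corrupted at each layer."""
    n = len(runs)
    if n == 0:
        raise ValueError("at least one run is required")
    return {
        "memory": sum(r.memory_corrupted for r in runs) / n,
        "output": sum(r.output_corrupted for r in runs) / n,
        "application": sum(r.check_failed for r in runs) / n,
    }
```

The same set of runs yields three different rates, so a resilience figure is only meaningful together with the layer at which it was measured.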

9 No Sign for Incorrect Execution
SDC characterization of "no sign of incorrect execution" raises three questions. What: which parts, layers, or data of the system we want to check. How: how to check. When: when to check.

10 What: System-level Classification
Layers: memory, OS, system call, app, data path.
Our position is that we need a system-level classification of SDCs in the context of the whole system stack, for both characterization and detection.
Benefits:
1. Enable different points of view: hardware people, OS people, resilience scientists, application developers, users.
2. Understand and improve the effectiveness of FT techniques.
Error detection mechanisms can be improved by checking whether an error detected in a lower layer can really lead to an unacceptable outcome (selective detection); this needs cross-layer analysis. Checkpoint/recovery schemes based on anomaly data monitoring can determine whether a roll-back is needed by predicting the final outcome of an intermediate data corruption.

11 How: Precise vs Approximate
Precise: the application output differs from the golden run, e.g. [Feng ASPLOS2010] [Hari, ASPLOS2012] [Reis, CGO2005].
Approximate: the application output does not pass a check, e.g. [Lu CASE2014] [Huang, IEEE TC2006] [Reis, CGO2005].
Move that system level classification. How: precise or vague checking, at various layers.

12 Example of the Impact Affects sensitivity 01100110 01100111 01100110
Bit-by-bit equality vs application-specific check ✔️. Here is an example of why this is important: the choice of (or requirement on) how to determine an SDC affects the sensitivity of that determination. A gap between the two can be expected.
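The bit patterns on this slide can be used to sketch the gap between the two kinds of check. Reading the patterns as unsigned 8-bit integers and using a relative tolerance as the application-specific check are both assumptions made for illustration; the function names are hypothetical.

```python
import math


def differs_bitwise(output: bytes, golden: bytes) -> bool:
    """Precise check: any differing bit counts as an SDC."""
    return output != golden


def fails_tolerance_check(output: bytes, golden: bytes, rel_tol: float) -> bool:
    """Approximate check: an SDC only if some value is outside the tolerance."""
    if len(output) != len(golden):
        return True
    return any(
        not math.isclose(o, g, rel_tol=rel_tol) for o, g in zip(output, golden)
    )


golden = bytes([0b01100110])  # 102
faulty = bytes([0b01100111])  # 103, differs in the least significant bit

print(differs_bitwise(faulty, golden))              # True
print(fails_tolerance_check(faulty, golden, 0.01))  # False
```

The same faulty value is an SDC under bit-by-bit equality and acceptable under a 1% tolerance, so the measured SDC rate depends on which definition is applied.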

13 When: Intermediate vs Final
Final: most studies. Intermediate: violation of application states (intermediate or final), e.g. ABFT algorithms and the internal states of a linear solver. Make sure the state does not violate mathematical invariants or algorithm-specific requirements, e.g. [Berrocal, HPDC2015] [Chen, SIGPLAN2013] [Sloan, DSN2012].
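As an illustration of checking a mathematical invariant on intermediate state, here is a minimal sketch of the classic checksum test for matrix multiplication: for C = A·B, the column sums of C must equal the column sums of A multiplied by B. This is a textbook ABFT invariant chosen for brevity, not the method of any of the papers cited on the slide; the function names and the tolerance are assumptions.

```python
import math


def column_sums(M):
    """Return the vector of column sums of a matrix given as a list of rows."""
    return [sum(col) for col in zip(*M)]


def checksum_holds(A, B, C, tol=1e-9):
    """Check the invariant (e^T A) B == e^T C, where e is the all-ones vector."""
    ea = column_sums(A)  # row vector e^T A
    expected = [
        sum(ea[k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))
    ]
    actual = column_sums(C)  # row vector e^T C
    return all(
        math.isclose(e, a, rel_tol=tol, abs_tol=tol)
        for e, a in zip(expected, actual)
    )
```

A check like this can run in the middle of a computation, which is what allows a corruption to be caught before it reaches the final output.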

14 Conclusion
SDCs are the most important failure type for modern systems. SDC characterization depends on multi-dimensional knowledge. SDC protection needs cross-layer analysis. Advertising: please attend the talk on our paper in the regular session: ePVF: An Enhanced Program Vulnerability Factor Methodology for Cross-layer Resilience Analysis (Wednesday, June 29th, 2016, 16:00 – 17:30).

15 Outline
What are SDCs. Classification of SDCs. Impact on Fault Tolerance Design. Skip this one.

