Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan,

1 Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan, Ann Arbor

2 Soft Errors
Soft errors, also called single-event upsets (SEUs), occur because of:
- High-energy particle strikes
- Electrical noise
- Circuit crosstalk
Image credit: Certichip

3 Increasing Soft Error Rate
Parameters affecting soft error rates:
- Shrinking dimensions
- Voltage scaling
14x increase in soft error rate (250nm to 40nm), from Oracle (Sun)'s neutron beam experiments over the past 10 years [Dixit, IRPS'11]
[Figure: soft error rate across technology nodes, with the trend marked as increasing toward 10nm]

4 Instruction Duplication
Traditional way:
- Duplicate producer chain
- Compare at strategic points
  - Global stores
  - Function calls
The average overhead is ~50%.
[Figure: a producer chain beginning at a load (start point), its duplicate, and an == comparison that leads to recovery or continued execution. Legend: original instrs, duplicated, cmps and branches]
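As a rough illustration of the duplicate-and-compare pattern on this slide, the C sketch below recomputes a short producer chain on shadow copies and compares the two results just before the value would be stored. The names `scale`, `scale_checked`, and `fault_detected` are hypothetical and do not come from the talk; the actual transformation is applied by a compiler to instructions, not written by hand in source code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical recovery hook: this sketch only records that a mismatch was seen. */
bool fault_detected = false;

/* Producer chain feeding a store: multiply, then divide. */
int scale(int a, int b, int max_val) {
    assert(max_val != 0);
    int prod = a * b;
    return prod / max_val;
}

/* Duplicated version: the chain runs twice and the results are compared
 * right before the store, which plays the role of the "strategic point". */
void scale_checked(int a, int b, int max_val, int *out) {
    assert(out != NULL);
    /* volatile shadow copies keep the compiler from merging the two chains */
    volatile int a_dup = a, b_dup = b, max_dup = max_val;

    int result     = scale(a, b, max_val);          /* original instrs   */
    int result_dup = scale(a_dup, b_dup, max_dup);  /* duplicated instrs */

    if (result != result_dup) {                     /* cmp and branch    */
        fault_detected = true;                      /* trigger recovery  */
        return;
    }
    *out = result;                                  /* store             */
}
```

Every instruction in the chain is executed twice here, which is where the roughly 50% average overhead quoted on the slide comes from.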

5 Soft Applications
100% accuracy is not always required:
- Image Processing
- Computer Vision
- Data Analytics
- Media Applications
- Robotics

6 Acceptable Vs. Unacceptable Outputs
Fault sources: particle strike, electrical noise
- ✓ Acceptable: PSNR > thr
- ✗ Unacceptable: PSNR < thr
Reduce unacceptable outputs efficiently

7 Classification Refinement
Error in a bit: does it affect program output?
- No: Masked (Benign)
- Yes: Silent Data Corruption (SDC). Is the error > thr?
  - No: Acceptable Silent Data Corruption (ASDC)
  - Yes: Unacceptable Silent Data Corruption (USDC)
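The decision tree above can be sketched as a small C function. This is only an illustration: the type `outcome_t`, the function `classify_outcome`, and its parameters are hypothetical names, and the slides do not say how an output exactly at the threshold is treated, so the sketch assumes it counts as unacceptable. Quality is expressed as PSNR, where a higher value means a smaller error.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome classes from the slide. */
typedef enum { OUTCOME_MASKED, OUTCOME_ASDC, OUTCOME_USDC } outcome_t;

/* output_changed: did the fault alter the program output at all?
 * psnr_db:        quality of the faulty output against the fault-free one
 * thr_db:         acceptability threshold (application specific, assumed > 0) */
outcome_t classify_outcome(bool output_changed, double psnr_db, double thr_db) {
    assert(thr_db > 0.0);
    if (!output_changed) {
        return OUTCOME_MASKED;        /* benign: output is unaffected */
    }
    /* Output differs: acceptable only if quality stays above the threshold.
     * Treating psnr_db == thr_db as unacceptable is an assumption. */
    return (psnr_db > thr_db) ? OUTCOME_ASDC : OUTCOME_USDC;
}
```

For example, with a made-up threshold of 30 dB, a corrupted image at 42 dB would be classified as an ASDC and one at 12 dB as a USDC.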

8 Acceptable Vs Unacceptable
No need to pay the cost of detection for acceptable outputs.

9 Hierarchy of Protection Needs
    wIdx = 1;
    int i, data1, data2, result;
    for (i = 0; i < N; i += 1) {
        wIdx = wIdx << 2;
        data1 = read_table( );
        data2 = read_table( );
        result = (data1 * data2) / maxVal;
        write(wIdx, result);
    }
Not all variables are created equal
- Varying levels of protection are required for different variables
- Legend: Soft Checks, Hard Checks, No Checks

10 Proposed Solution
Duplication is expensive
- Use it sparingly for critical variables
Instructions and outputs of code regions have value locality
- Exploit value locality and check for deviations
Example: op2 = op3 * op4
- Produces 0 more than thr times
- Insert value comparison
Efficient to check for deviations from expectations
[Figure legend: Duplicated, Original]

11 Expected Value Checks
- Instruction produces V1 frequently: compare R1 with V1 (cmp R1, V1), then either trigger recovery or continue execution
- Instruction produces values between V1 and V2 frequently: test R1 < V1 or R1 > V2, then either trigger recovery or continue execution
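The two kinds of check on this slide can be written as simple predicates. The sketch below is a hand-written C illustration with hypothetical function names; in the talk these checks are comparisons and branches inserted by the compiler. Each function returns `true` when the produced value matches the expectation, so a `false` result is the case that would branch to recovery.

```c
#include <assert.h>
#include <stdbool.h>

/* Single expected value: the instruction produces v1 frequently. */
bool check_expected_value(int r1, int v1) {
    return r1 == v1;
}

/* Expected range: the instruction frequently produces values in [v1, v2]. */
bool check_expected_range(int r1, int v1, int v2) {
    assert(v1 <= v2);
    /* Deviation if r1 < v1 or r1 > v2, as in the slide's comparison. */
    return !(r1 < v1 || r1 > v2);
}
```

A check of this kind costs one or two comparisons and a branch, compared with re-executing the whole chain of instructions that produced the value.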

12 Opt 1: Reducing Value Checks
[Figure: dataflow graph of instructions producing R1 to R5. Legend: amenable for value checks]

13 Opt 1: Reducing Value Checks
Naïve way: insert a check (cmp, br) for all the amenable instructions
[Figure: same graph with checks of R1 against V1, R4 against V4, and R5 against V5]

14 Opt 1: Reducing Value Checks
Sufficient to insert a check for the dominating instruction
- An early large variation might get subdued
- If not, it gets caught at the dominating instruction
[Figure: same graph with a single check (cmp, br) of R5 against V5]

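15 Opt 2: Reducing Duplication
Target instr: op1 = op2 + 1, where op2 = op3 * op4 and op2 produces 0 more than thr times
- Duplicated chain: D_op1 = D_op2 + 1 and D_op2 = D_op3 * D_op4, followed by a cmp and a br that triggers recovery on a mismatch
- With a value check: cmp op2 with 0 and br to trigger recovery on a mismatch; 0 then stands in for D_op2, so the duplicate chain stops there
Legend: original instrs, duplicated, cmps and branches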
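One way to read this slide is sketched below in C: an expected-value check on `op2` replaces the duplicated multiply, and the duplicate of `op1` is then computed from the expected value instead of from a duplicated `op2`. This is an interpretation of the slide, not the authors' code. The function name, the `EXPECTED_OP2` constant, and the choice to report a failed check to the caller are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed profile result: op2 produces this value more than thr times. */
#define EXPECTED_OP2 0

/* Returns true and writes op1 if all checks pass; returns false where
 * the slide's flow would trigger recovery. */
bool compute_op1_checked(int op3, int op4, int *op1_out) {
    assert(op1_out != NULL);

    int op2 = op3 * op4;              /* original instr, not duplicated */
    if (op2 != EXPECTED_OP2) {        /* value check: cmp and br        */
        return false;                 /* trigger recovery               */
    }

    int op1   = op2 + 1;              /* original target instr          */
    int d_op2 = EXPECTED_OP2;         /* expected value stands in for   */
    int d_op1 = d_op2 + 1;            /* the duplicated multiply        */
    if (op1 != d_op1) {               /* cmp and br on the target       */
        return false;                 /* trigger recovery               */
    }

    *op1_out = op1;
    return true;
}
```

The duplicated multiply and everything feeding it (the duplicates of `op3` and `op4`) disappear. A fault-free run in which `op2` legitimately differs from the expected value would also fail the check, and the slides do not show how that case is handled.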

16 Value Profiling and Value Ranges
Key observations:
- Recording and storing all values is time consuming
- Compact range of values produced by an instruction
  - Greedy algorithm
Algorithms and full details are in the paper.
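The slide defers the actual algorithm to the paper, so the C sketch below is not the authors' method. It only illustrates the general idea of profiling a compact value range without storing every value: a range is grown greedily as values arrive, and any value that would stretch it beyond a width limit is counted as an outlier. The struct, the function names, and the `max_width` and `min_fraction` parameters are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Running profile for one instruction; no individual values are stored. */
typedef struct {
    long lo, hi;             /* current candidate range [lo, hi]       */
    long max_width;          /* widest range still considered compact  */
    unsigned long in_range;  /* values absorbed into the range         */
    unsigned long total;     /* all values observed                    */
    bool seeded;             /* has the first value been seen?         */
} range_profile_t;

void profile_init(range_profile_t *p, long max_width) {
    assert(max_width >= 0);
    p->lo = p->hi = 0;
    p->max_width = max_width;
    p->in_range = p->total = 0;
    p->seeded = false;
}

/* Greedy step: extend the range to cover v if it stays compact.
 * Assumes values are small enough that hi - lo cannot overflow. */
void profile_record(range_profile_t *p, long v) {
    p->total++;
    if (!p->seeded) {
        p->lo = p->hi = v;
        p->seeded = true;
        p->in_range++;
        return;
    }
    long new_lo = (v < p->lo) ? v : p->lo;
    long new_hi = (v > p->hi) ? v : p->hi;
    if (new_hi - new_lo <= p->max_width) {
        p->lo = new_lo;
        p->hi = new_hi;
        p->in_range++;
    }
    /* otherwise v is an outlier and the range is left unchanged */
}

/* An instruction is a candidate for a range check if enough of its
 * values fell inside the compact range. */
bool profile_is_amenable(const range_profile_t *p, double min_fraction) {
    return p->total > 0 &&
           (double)p->in_range >= min_fraction * (double)p->total;
}
```

A greedy scheme like this depends on the order in which values arrive, since the first value seeds the range. That is a limitation of this sketch and says nothing about the algorithm in the paper.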

17 Compilation Flow
No annotations required
- Identify state (critical) variables by loop-carried dependence
- State variables have a snowball effect
Expected value checks for less critical variables
Flow: application source code -> Intermediate Representation (IR) -> code analysis and intelligent duplication (IR to IR) -> code generation -> application binary
Analyses and optimizations: classification, duplication, value checks

18 Evaluation Methodology
Program analysis and duplication/checks
- Implemented as a compiler pass in the LLVM compiler
Statistical fault injection (SFI) experiments
- GEM5 simulator in ARM syscall emulation mode
Random (single) bit flip faults
- Simulated entire benchmarks after fault injection
- Results classification after completion
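The fault model named here, a single random bit flip, can be illustrated in a few lines of C. The sketch below is not how the injection was implemented inside GEM5, which the slides do not show. It assumes a toy register file of 32-bit words, uses hypothetical function names, and relies on the standard `rand()` as the random source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Flip exactly one bit of a 32-bit value. */
uint32_t flip_bit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    return value ^ (UINT32_C(1) << bit);
}

/* Pick a random register and a random bit position, then flip that bit.
 * regs is a toy stand-in for architectural state. */
void inject_single_bit_fault(uint32_t *regs, size_t nregs) {
    assert(regs != NULL && nregs > 0);
    size_t   r   = (size_t)rand() % nregs;
    unsigned bit = (unsigned)rand() % 32u;
    regs[r] = flip_bit(regs[r], bit);
}
```

In a statistical fault injection campaign, one such fault is injected per run, the benchmark then runs to completion, and the outcome is classified afterwards.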

19 Benchmarks
Diversified set of benchmarks for evaluation
- Image Processing: JPEG encoding/decoding, tiff to BW
- Audio/Video Processing: G721 encoder/decoder, MP3 encoding/decoding, H264 encoding/decoding
- Robotics: Kmeans clustering, Support vector machine
- Computer Vision: Image segmentation, Texture synthesis

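20 Performance Overhead
[Figure: performance overhead for Traditional, Selective, and Selective + Checks; bar labels 101, 116, and 208. Marked values: 53% for Traditional, 20% for Selective + Checks]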

21 Fault Coverage Analysis
Unacceptable Silent Data Corruptions (USDCs)
- Worst of the worst outcomes
- 2.8x reduction in USDCs
[Figure: fault outcome breakdown. Categories: Masked + ASDCs, Detects, Failures, USDCs]

22 Conclusion
Transient faults are important
- Output classification refinement
- Selective duplication with value checks
In comparison to traditional duplication
- Reduces performance overhead from 53% down to 20%
- Fault coverage is comparable (98.6% vs 98.8%)


25 Fault Outcome Classification
- Masked: acceptable outputs
- SWDetects: detected by duplication
- HWDetects: produces a symptom such as a page fault within 1000 cycles of fault injection
- Failures: fail status on program termination, or the program did not terminate in reasonable time
- USDCs (Unacceptable Silent Data Corruptions): faults that result in unacceptable outputs

