Presentation is loading. Please wait.

Presentation is loading. Please wait.

GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British.

Similar presentations


Presentation on theme: "GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British."— Presentation transcript:

1 GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications
Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia Sudhanva Gurumurthi AMD Research Distinguish between ubc and amd

2 Graphic Processing Unit
Main accelerators for general purpose applications (e.g. TITAN , 18,000 GPUs) Performance Trade-off Traditionally there are two aspects that people investigate for GPUs, Performance, increasing Power, decreasing The third dimension is un-explored. We want high performance, low power consumption and high reliability. It is not easy as there are trade-offs in hardware and software design But reliability under-explored Reliability Power

3 Reliability trends The soft error rate per chip will continue to increase because of larger arrays; probability of soft errors within the combination logic will also increase. [Constantinescu, Micro 2003] Source: final report for CCC cross-layer reliability visioning study ( Why the reliability is important? How bad this could be for GPGPUs when error rate increases? Show a graph for increasing

4 GPGPUs vs Traditional GPUs
Traditional: image rendering GPGPU: DNA matching Serious result ! ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT

5 Vulnerability and error resilience
Vulnerability: Probability that the system experiences a fault that causes a failure Not consider the behaviors of applications Error Resilience: Given a fault occurs, how resilient the application is ✔ Error resilience study Device/Circuit Level Architectural Level Operating System Level Application Level Now we know that GPGPU applications can be seriously affected. Then what? In order to answer it, let me first introduce two dependability metrics: Relate to our goal. Draw a diagram about hardware and software (avf stops at the point that a hardware fault becomes an error ) Architecture Vulnerability Factor study stops here Faults

6 Goal Investigate the error-resilience characteristics of applications
Find correlation with application properties Application-specific, software-based, fault tolerance mechanisms Base step Mention that there are other paths

7 GPU-Qin: a methodology and a tool
Fault injection: introduce faults in a systematic, controlled manner and study the system’s behavior GPU-Qin: on real hardware via cuda-gdb Advantages: Representativeness efficiency Before diving into the details of GPU-Qin, first let me talk about what the fault injection is and the requirements of fault injection methodology This has two advantages comparing to other types of fault injection, e.g. simulator based On real hardware, not like simulator, worry about the accuracy, faster (100X faster, refer to the paper for details)

8 Challenges High level: massive parallelism
Inject into all the threads: impossible! In practice: limitations in cuda-gdb Breakpoint on source code level, not instruction level Single-step to target instructions Explain why we need to single-step: set the breakpoint at the source code line where the instruction belongs In cuda-gdb, in order to inject the fault, we have to stop the application. Overcome those challenges by carefully design the methodology, which is more efficient than architecture-level fault injection

9 Fault Model Transient faults (e.g. soft errors) Single-bit flip
No cost to extend to multiple-bit flip Faults in arithmetic and logic unit(ALU), floating point Unit (FPU) and the load-store unit(LSU). (Otherwise ECC protected) Example of transient faults are soft errors

10 Overview of the FI methodology
Grouping Profiling Fault injection Grouping: use number of instructions executed by threads to tell the behaviors of applications Grouping and profiling are one-time phase for an application, fault injection is repeated What they are and Why we need to do these steps ( challenges) in next a couple of slides

11 Grouping Grouping Profiling Fault injection Detects groups of threads with similar behaviors (i.e. same number of dynamic insts) Pick representative groups (challenge 1) Category Benchmarks Groups Groups to profile % of threads in picked groups Category I AES,MRI-Q,MAT,Mergesort-k0, Transpose 1 100% Category II SCAN, Stencil, Monte Carlo, SAD, LBM, HashGPU 2-10 1-4 95%-100% Category III BFS 79 2 >60% In this context, similar behavior means that Mention the validation Up to 25% difference in crash rate Initution is that GPU programs are SIMT

12 Profiling Grouping Profiling Fault injection Finds the instructions executed by a thread and their corresponding CUDA source-code line (challenge2) Output: Instruction trace: consisting of the program counter values and the source line associated with each instruction consisting of the program counter values and the source line associated with each instruction

13 Fault injection Grouping Profiling Fault injection Breakpoint hit
PC hit Native execution Single-step execution Single-step execution Native execution Until end Activation window Fault injection

14 Heuristics for efficiency
A large number of iterations in loop Limit the number of loop iterations explored Single-step is time-consuming Limit activation window size Delete the graphs for validation Embed the validation into earlier slides

15 Characterization study
NVIDIA Tesla C 2070/2075 12 CUDA benchmarks, 15 kernels Rodinia, Parboil and CUDA SDK Outcomes: Benign: correct output Crash: exceptions from the system or application Silent Data Corruption (SDC): incorrect output Hang: finish in considerably long time SDC: compare to the golden run

16 Error Resilience Characteristics
SDC rates and crash rates vary significantly across benchmarks SDC: 1% - 38% Crash: 5% - 70% 2. Average failure rate is 67% 3. CPUs have 5% - 15% difference in SDC rates Put CPUs: the difference is 5% - 15% in the slide We need application-specific FT mechanisms

17 Hardware error detection mechanisms
Hardware exceptions 1. Exceptions due to memory-related error 2. Crashes are comparable on CPUs and GPUs AES (Crash rate: 43%) MAT (Crash rate: 30%)

18 Crash latency – measure the fault propagation
Device illegal address have highest latency in all benchmarks except Stencil Warp out-of-range has lower crash latency otherwise Crash latency in CPUs is shorter (thousands of cycles) [Gu, DSN04] Two examples: Crash latency measures the time interval between the moment a fault is activated and the moment a crash occurs. We measure crash latency for each exception type above, to Propagation understand how quickly the crash is detected We can use crash latency to design application-specific checkpoint/recovery techniques

19 Resilience Category Resilience Category Benchmarks Measured SDC Dwarfs
Search-based MergeSort 6% Backtrack and Branch+Bound Bit-wise Operation HashGPU, AES 25% - 37% Combinational Logic Average-out Effect Stencil, MONTE 1% - 5% Structured Grids, Monte Carlo Graph Processing BFS 10% Graph Traversal Linear Algebra Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD 15% - 25% Dense Linear Algebra, Sparse Linear Algebra, Structured Grids 5 categories: explain two of them Extend dwarfs (other than performance) Pick two examples Search-based and average-out effect are more resilient than others Predict the resilience of applications without conducting fault injection Design algorithmic specific FT techniques (key spots to put detectors) We also try to find the SDC rate with instruction classifications.

20 Conclusion GPU-Qin: a methodology and a fault injection tool to characterize the error resilience of GPU applications We find that there are significant variations in resilience characteristics of GPU applications Algorithmic characteristics of the application can help us understand the variation of SDC rates Backup slides Simulators Group the summary and what we find

21 Thank you ! Project homepage: Code: Github


Download ppt "GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British."

Similar presentations


Ads by Google