GPU-Qin: A Methodology for Evaluating Error Resilience of GPGPU Applications. Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia.


GPU-Qin: A Methodology for Evaluating Error Resilience of GPGPU Applications. Bo Fang, Karthik Pattabiraman, Matei Ripeanu (The University of British Columbia); Sudhanva Gurumurthi (AMD Research).

Graphics Processing Units (GPUs): the main accelerators for general-purpose applications (e.g., the TITAN supercomputer, with 18,000 GPUs). Traditionally, GPU research has investigated two dimensions: performance (increasing) and power (decreasing). The third dimension, reliability, remains under-explored. We want high performance, low power consumption, and high reliability at once; this is not easy, as there are trade-offs in both hardware and software design.

Reliability trends. The soft-error rate per chip will continue to increase because of larger arrays, and the probability of soft errors within combinational logic will also increase [Constantinescu, Micro 2003]. (Source: final report of the CCC cross-layer reliability visioning study, http://www.relxlayer.org/.) Why does reliability matter here, and how bad could it get for GPGPUs as the error rate increases?

GPGPUs vs. traditional GPUs. Traditional GPUs perform image rendering, where an occasional wrong pixel is tolerable. GPGPUs run general-purpose workloads such as DNA matching, where a single corrupted character can have serious results; note the one-character difference between the two sequences below:
ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGT
ATATTTTTTCTTGTTTTTTATATCCACAATCTCTTTTCGTACTTTTACACAGTATATCGTGT

Vulnerability and error resilience. Now that we know GPGPU applications can be seriously affected, two dependability metrics relate to our goal. Vulnerability is the probability that the system experiences a fault that causes a failure; it does not consider the behavior of applications. Error resilience asks: given that a fault occurs, how resilient is the application? Picture the stack from the device/circuit level up through the architectural and operating-system levels to the application level: an Architectural Vulnerability Factor (AVF) study stops at the point where a hardware fault becomes an architectural error, whereas an error-resilience study follows the fault all the way up to the application.

Goal. Investigate the error-resilience characteristics of applications and find correlations with application properties, as a base step toward application-specific, software-based fault-tolerance mechanisms (other paths toward reliability exist as well).

GPU-Qin: a methodology and a tool. Fault injection introduces faults in a systematic, controlled manner and studies the system's behavior under them. GPU-Qin injects faults on real hardware via cuda-gdb. Compared with other kinds of fault injection (e.g., simulator-based), this has two advantages: representativeness (no concerns about simulator accuracy) and efficiency (roughly 100X faster; see the paper for details).

Challenges. At a high level: massive parallelism, so injecting into every thread is impossible. In practice: limitations of cuda-gdb, since breakpoints can only be set at the source-code level, not the instruction level, we must set the breakpoint at the source line the target instruction belongs to and single-step to reach that instruction; moreover, injecting a fault requires stopping the application. We overcome these challenges by careful methodology design, which makes GPU-Qin more efficient than architecture-level fault injection.

Fault model. Transient faults (e.g., soft errors), modeled as single-bit flips; extending the model to multiple-bit flips incurs no extra cost. Faults are injected into the arithmetic and logic unit (ALU), the floating-point unit (FPU), and the load-store unit (LSU); other structures are assumed to be ECC-protected.
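The single-bit-flip model above can be sketched in a few lines. This is a minimal illustration, not GPU-Qin's implementation; the 32-bit register width and the uniformly random bit position are assumptions for the example.

```python
import random

def flip_single_bit(value, width=32, bit=None):
    """Flip one bit of an integer register value (the single-bit-flip
    fault model). If no bit position is given, pick one uniformly at
    random from the register width."""
    if bit is None:
        bit = random.randrange(width)
    return value ^ (1 << bit)

# Flipping bit 0 of 8 (0b1000) yields 9 (0b1001).
print(flip_single_bit(8, bit=0))  # -> 9
```

Because the flip is an XOR, applying it twice at the same bit position restores the original value, which is convenient when validating the injector itself.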

Overview of the FI methodology. Three phases: grouping, profiling, and fault injection. Grouping uses the number of instructions executed by each thread to characterize the application's behavior. Grouping and profiling are one-time phases per application; fault injection is repeated many times.

Grouping (phase 1 of grouping, profiling, fault injection). Detects groups of threads with similar behavior, where similar means the same number of dynamic instructions, and picks representative groups to inject into (challenge 1):

Category     | Benchmarks                                    | Groups | Groups to profile | % of threads in picked groups
Category I   | AES, MRI-Q, MAT, MergeSort-k0, Transpose      | 1      | 1                 | 100%
Category II  | SCAN, Stencil, Monte Carlo, SAD, LBM, HashGPU | 2-10   | 1-4               | 95%-100%
Category III | BFS                                           | 79     | 2                 | >60%

Validation shows up to a 25% difference in crash rate across groups; the intuition behind grouping is that GPU programs are SIMT, so threads in a group behave alike.
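The grouping criterion, threads with the same dynamic instruction count fall into the same group, can be sketched as follows. The input mapping and its names are illustrative, standing in for per-thread counts gathered during profiling.

```python
from collections import defaultdict

def group_threads(dynamic_inst_counts):
    """Group threads that executed the same number of dynamic
    instructions. `dynamic_inst_counts` maps thread id -> instruction
    count; the returned dict maps count -> list of thread ids."""
    groups = defaultdict(list)
    for tid, count in dynamic_inst_counts.items():
        groups[count].append(tid)
    return dict(groups)

# A regular (Category I) kernel: every thread executes the same number
# of instructions, so all threads land in a single group.
counts = {0: 120, 1: 120, 2: 120, 3: 120}
print(group_threads(counts))  # -> {120: [0, 1, 2, 3]}
```

An irregular kernel such as BFS would instead yield many groups, matching Category III in the table above.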

Profiling (phase 2). Finds the instructions executed by a thread and their corresponding CUDA source-code lines (challenge 2). Output: an instruction trace consisting of the program-counter values and the source line associated with each instruction.

Fault injection (phase 3). Each injection run proceeds as follows: the application executes natively until the breakpoint is hit; from there it is single-stepped through the activation window until the target PC is reached; the fault is injected; execution then resumes natively until the end.
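The run just described can be sketched as a generated cuda-gdb command sequence. This is only a sketch of the idea: the source location, step count, register name, and bit position below are hypothetical placeholders, not GPU-Qin's actual script.

```python
def make_injection_script(source_line, steps_to_target, register="R1", bit=5):
    """Emit a debugger command sequence for one injection run:
    break at the source line owning the target instruction, run,
    single-step to the target PC within the activation window,
    flip one bit in a register, then continue natively to the end."""
    cmds = [f"break {source_line}", "run"]
    cmds += ["stepi"] * steps_to_target                 # single-step phase
    cmds.append(f"set ${register} = ${register} ^ (1 << {bit})")  # bit flip
    cmds.append("continue")                             # native execution
    return "\n".join(cmds)

print(make_injection_script("kernel.cu:42", 3))
```

Driving the debugger from a script like this is what makes the injection systematic and repeatable across thousands of runs.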

Heuristics for efficiency. Loops may run for a large number of iterations, so we limit the number of loop iterations explored. Single-stepping is time-consuming, so we limit the size of the activation window.

Characterization study. Hardware: NVIDIA Tesla C2070/2075. Workloads: 12 CUDA benchmarks (15 kernels) from Rodinia, Parboil, and the CUDA SDK. Outcomes: Benign (correct output); Crash (exceptions raised by the system or the application); Silent Data Corruption, or SDC (incorrect output, determined by comparison with a golden run); Hang (takes considerably longer than normal to finish).
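The four-way outcome classification can be sketched as below. The `run` dictionary layout and the 10x hang threshold are assumptions for illustration, not values from the study.

```python
def classify_outcome(run, golden, hang_factor=10):
    """Classify one fault-injection run into Benign / Crash / SDC / Hang.
    `run` and `golden` are dicts with 'crashed', 'runtime', 'output';
    runs exceeding hang_factor times the golden runtime count as hangs."""
    if run["crashed"]:
        return "Crash"
    if run["runtime"] > hang_factor * golden["runtime"]:
        return "Hang"
    if run["output"] != golden["output"]:
        return "SDC"   # wrong output with no error signal
    return "Benign"

golden = {"crashed": False, "runtime": 1.0, "output": [1, 2, 3]}
print(classify_outcome({"crashed": False, "runtime": 1.1, "output": [1, 2, 9]},
                       golden))  # -> SDC
```

Note the ordering of the checks: a crash takes precedence over output comparison, since a crashed run produces no meaningful output to compare.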

Error resilience characteristics. SDC rates and crash rates vary significantly across benchmarks: SDC rates range from 1% to 38%, crash rates from 5% to 70%, and the average failure rate is 67%. By contrast, SDC rates on CPUs differ by only 5% to 15%. This variation is why we need application-specific fault-tolerance mechanisms.

Hardware error-detection mechanisms. Hardware exceptions are mostly due to memory-related errors, and crash rates are comparable on CPUs and GPUs. Examples: AES (crash rate: 43%) and MAT (crash rate: 30%).

Crash latency measures fault propagation: the time interval between the moment a fault is activated and the moment a crash occurs, i.e., how quickly the crash is detected. "Device illegal address" exceptions have the highest latency in all benchmarks except Stencil; "warp out-of-range" exceptions otherwise have lower crash latency. Crash latency on CPUs is shorter, on the order of thousands of cycles [Gu, DSN04]. Crash latency can inform the design of application-specific checkpoint/recovery techniques.

Resilience categories:

Resilience Category | Benchmarks                                  | Measured SDC | Dwarfs
Search-based        | MergeSort                                   | 6%           | Backtrack and Branch+Bound
Bit-wise Operation  | HashGPU, AES                                | 25%-37%      | Combinational Logic
Average-out Effect  | Stencil, MONTE                              | 1%-5%        | Structured Grids, Monte Carlo
Graph Processing    | BFS                                         | 10%          | Graph Traversal
Linear Algebra      | Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD | 15%-25%      | Dense Linear Algebra, Sparse Linear Algebra, Structured Grids

We identify five categories, extending the dwarfs classification beyond its original performance focus. Search-based and average-out-effect applications are more resilient than the others. These categories let us predict the resilience of an application without conducting fault injection and help design algorithm-specific fault-tolerance techniques (key spots to place detectors). We also correlate SDC rates with instruction classifications.

Conclusion. GPU-Qin is a methodology and a fault-injection tool for characterizing the error resilience of GPU applications. We find significant variation in the resilience characteristics of GPU applications, and the algorithmic characteristics of an application help explain the variation in its SDC rate.

Thank you! Project homepage: http://netsyslab.ece.ubc.ca/wiki/index.php/FTGPU Code: https://github.com/DependableSystemsLab/GPU-Injector