Evaluating the Error Resilience of Parallel Programs. Bo Fang, Karthik Pattabiraman, Matei Ripeanu (The University of British Columbia), Sudhanva Gurumurthi (AMD Research).

Evaluating the Error Resilience of Parallel Programs
Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia
Sudhanva Gurumurthi, AMD Research

Reliability trends
- The soft error rate per chip will continue to increase [Constantinescu, Micro 03; Shivakumar, DSN 02]
- The ASC Q supercomputer at LANL encountered 27.7 CPU failures per week [Michalak, IEEE Transactions on Device and Materials Reliability 05]

Goal
- Application-specific, software-based fault tolerance mechanisms
- Investigate the error-resilience characteristics of applications
- Find the correlation between error-resilience characteristics and application properties ✔

Previous Study on GPUs
- Understand the error resilience of GPU applications by building a methodology and tool called GPU-Qin [Fang, ISPASS 14]

Operations and Resilience of GPU Applications

Operations         | Benchmarks                                  | Measured SDC
Comparison-based   | MergeSort                                   | 6%
Bit-wise operation | HashGPU, AES                                | 25% - 37%
Average-out effect | Stencil, MONTE                              | 1% - 5%
Graph processing   | BFS                                         | 10%
Linear algebra     | Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD | 15% - 38%

This paper: OpenMP programs
- Thread model
- Program structure: input processing, pre-algorithm, parallel region, post-algorithm, output processing (a runnable sketch of these segments follows below)

Begin:
  read_graphics(image_orig);
  image = image_setup(image_orig);
  scale_down(image)
  #pragma omp parallel for shared(var_list1)
  ...
  scale_up(image)
  write_graphics(image);
End
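To make the five segments concrete, here is a minimal, self-contained C/OpenMP sketch modeled loosely on the image-processing pseudocode above; the pipeline, array size, and arithmetic are illustrative assumptions, not the paper's actual benchmark code.

```c
/* Minimal sketch of the five program segments in a typical OpenMP benchmark
 * (illustrative assumption; this pipeline is not the paper's code). */
#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main(void) {
    /* 1. Input processing: read or synthesize the input (serial). */
    float *image = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) image[i] = (float)i;

    /* 2. Pre-algorithm: serial set-up, e.g. scale_down(image). */
    for (int i = 0; i < N; i++) image[i] /= N;

    /* 3. Parallel region: the OpenMP kernel run by master + slave threads. */
    #pragma omp parallel for shared(image)
    for (int i = 0; i < N; i++)
        image[i] = image[i] * image[i] + 0.5f;   /* stand-in computation */

    /* 4. Post-algorithm: serial clean-up, e.g. scale_up(image). */
    for (int i = 0; i < N; i++) image[i] *= N;

    /* 5. Output processing: write the result, e.g. write_graphics(image). */
    printf("checksum = %f\n", image[N - 1]);

    free(image);
    return 0;
}
```

Compiled with -fopenmp, the pragma'd loop is the only parallel segment; everything else runs on the master thread, which is exactly the distinction the fault-injection methodology needs to expose.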

Evaluate the error resilience
- Naïve approach: randomly choose a thread and an instruction to inject into
  - Does NOT work: it biases the fault tolerance (FT) design
- Challenges:
  - C1: exposing the thread model
  - C2: exposing the program structure

Methodology
- Controlled fault injection: inject into master and slave threads separately (C1)
  - Integrate the injector with the thread ID (see the sketch below)
- Identify the boundaries of the segments (C2)
  - Identify the range of instructions for each segment by using a source-code to instruction-level mapping
- Work at the LLVM IR level (assembly-like, yet captures the program structure)
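As a rough illustration of the thread-level control (C1), an injected runtime check could consult the OpenMP thread ID to decide whether the currently executing thread is a valid injection target; the helper name and target encoding below are hypothetical, not part of the paper's tool.

```c
#include <omp.h>

/* Hypothetical helper honoring the thread selection (C1):
 * target = 0 -> inject only into the master thread,
 * target = 1 -> inject only into slave (worker) threads. */
static int thread_is_target(int target) {
    int tid = omp_get_thread_num();   /* thread 0 is the OpenMP master thread */
    return (target == 0) ? (tid == 0) : (tid != 0);
}
```

Injecting into "a random slave thread" then amounts to picking one non-zero thread ID per run and additionally requiring tid to match it.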

LLFI and our extension
- LLFI workflow (figure): at compile time, a fault-injection instruction/register selector instruments the IR code of the program with function calls, producing a profiling executable and a fault-injection executable; at runtime, a custom fault injector decides at each instrumented instruction whether to inject or to continue to the next instruction.
- Our extensions:
  - Add support for structure and thread selection (C1)
  - Collect statistics on a per-thread basis (C1)
  - Specify the code region and thread to inject into (C1 & C2)
A sketch of the runtime inject-or-not decision is shown below.
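Below is a minimal sketch of what the runtime side of such an instrumented hook might look like: a target dynamic-instruction index is chosen before the run (e.g. based on the profiling pass), and the hook flips one bit in an instruction's result when that index is reached. The function names, the environment-variable interface, and the single shared counter are assumptions for illustration, not LLFI's actual API.

```c
/* Sketch of a runtime hook called after each instrumented IR instruction.
 * Assumed interface: FI_TARGET_INDEX picks which dynamic instruction to corrupt.
 * A real tool would keep per-thread counters; one global counter is used here for brevity. */
#include <stdint.h>
#include <stdlib.h>

static long long fi_counter = 0;    /* dynamic instructions seen so far     */
static long long fi_target  = -1;   /* index chosen before the run starts   */

void fi_init(void) {
    const char *s = getenv("FI_TARGET_INDEX");
    fi_target = s ? atoll(s) : -1;  /* -1 means "never inject" (golden run) */
}

/* Takes the just-computed destination value, returns the possibly corrupted value. */
uint64_t fi_hook(uint64_t value) {
    if (++fi_counter == fi_target)
        value ^= 1ULL << (rand() % 64);   /* single-bit flip in the result */
    return value;
}
```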

Fault Model
- Transient faults
- Single-bit flips; extending to multiple-bit flips costs nothing extra (see the sketch below)
- Faults in the arithmetic and logic unit (ALU), the floating-point unit (FPU), and the load-store unit (LSU, memory-address computation only)
- Memory and caches are assumed to be ECC-protected; the control logic of the processor (e.g., the instruction-scheduling mechanism) is not considered
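To illustrate why the multiple-bit extension is essentially free, here is a small helper that flips k distinct, randomly chosen bits of a 64-bit value; k = 1 reproduces the single-bit model used in the paper. The helper itself is an illustrative assumption, not the tool's code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Flip k distinct, randomly chosen bits of a 64-bit value
 * (k = 1 is the single-bit model; assumes 0 <= k <= 64). */
uint64_t flip_bits(uint64_t value, int k) {
    uint64_t mask = 0;
    while (k > 0) {
        uint64_t bit = 1ULL << (rand() % 64);
        if (!(mask & bit)) {     /* ensure each flipped bit is distinct */
            mask |= bit;
            k--;
        }
    }
    return value ^ mask;
}
```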

Characterization study
- Platform: Intel Xeon CPU with 24 hardware cores
- Outcomes of an experiment:
  - Benign: correct output
  - Crash: an exception raised by the system or the application
  - Silent Data Corruption (SDC): incorrect output
  - Hang: the run takes considerably longer than normal to finish
- 10,000 fault-injection runs for each benchmark (to achieve an error bar below 1%)
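As a sanity check on that error bar, assuming a 95% confidence interval for an estimated rate p over n runs, the half-width is 1.96 * sqrt(p * (1 - p) / n), which is at most about 0.98% for n = 10,000; the confidence level is an assumption here, since the slide only states the bound.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 10000.0;   /* fault-injection runs per benchmark            */
    double p = 0.5;       /* worst case for the binomial variance p(1 - p) */
    double half_width = 1.96 * sqrt(p * (1.0 - p) / n);
    printf("95%% error bar <= %.2f%%\n", half_width * 100.0);   /* prints ~0.98% */
    return 0;
}
```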

Details about benchmarks
- 8 applications from the Rodinia benchmark suite

Benchmark            | Acronym
Breadth-first search | bfs
LU decomposition     | lud
K-nearest neighbor   | nn
Hotspot              | hotspot
Needleman-Wunsch     | nw
Pathfinder           | pathfinder
SRAD                 | srad
Kmeans               | kmeans

Difference of SDC rates
- Compare the SDC rates for:
  1. Injecting into the master thread only
  2. Injecting into a random slave thread
- Master threads have higher SDC rates than slave threads (except for lud)
  - Biggest: 16% in pathfinder; average: 7.8%
- It shows the importance of differentiating between the master and slave threads:
  - For correct characterization
  - For selective fault tolerance mechanisms

Differentiating between program segments
- Faults occur in each of the program segments (figure: injection points marked across the segments)

Mean and standard deviation of SDC rates for each segment
- Output processing has the highest mean SDC rate, and also the lowest standard deviation

Segment                | Input processing | Pre-algorithm | Parallel segment | Post-algorithm | Output processing
Number of applications |                  |               |                  |                |
Mean                   | 28.6%            | 20.3%         | 16.1%            | 27%            | 42.4%
Standard deviation     | 11%              | 23%           | 14%              | 13%            | 5%

Grouping by operations

OpenMP applications (this work):
Operations         | Benchmarks      | Measured SDC
Comparison-based   | nn, nw          | less than 1%
Grid computation   | hotspot, srad   | 23%
Average-out effect | kmeans          | 4.2%
Graph processing   | bfs, pathfinder | 9% ~ 10%
Linear algebra     | lud             | 44%

GPU applications (GPU-Qin):
Operations         | Measured SDC
Comparison-based   | 6%
Bit-wise operation | 25% - 37%
Average-out effect | 1% - 5%
Graph processing   | 10%
Linear algebra     | 15% - 38%

- Comparison-based and average-out operations show the lowest SDC rates in both cases
- Linear algebra and grid computation, which have regular computation patterns, give the highest SDC rates

Future work
- Understand the variance and investigate more applications
- Compare CPUs and GPUs:
  - Same applications
  - Same level of abstraction
(The slide repeats the OpenMP and GPU SDC-rate tables from the previous slide.)

Conclusion
- Error-resilience characterization of OpenMP programs needs to take into account the thread model and the program structure.
- Preliminary support for our hypothesis that error-resilience properties do correlate with the algorithmic characteristics of parallel applications.

Project website:

Backup slide
- SDC rates converge (plot)