Slide 1/14 Performance Debugging for Highly Parallel Accelerator Architectures
Saurabh Bagchi, ECE & CS, Purdue University
Joint work with: Tsungtai Yeh, Amit Sabne, Rudolf Eigenmann (Purdue)
Presentation available at: engineering.purdue.edu/dcsl

Slide 2/14 Emerging Trend
Heterogeneous computing is gaining ground as a way to accelerate the performance of parallel applications. The buzzword is “accelerators”:
– Graphics Processing Units (GPUs)
– Field Programmable Gate Arrays (FPGAs)
The attraction is a high degree of parallelism close to the main processor.
– Example: Dell PowerEdge servers have 2 Kepler GPUs with a total of 2 × 1536 CUDA cores

Slide 3/14 But … Not So Fast
Programming models for these architectures hide many of the architectural details, as they should. But these architectures are ripe for committing horrendous performance errors, even more so than traditional CPU architectures. Why?
– FPGA: constrained on-chip memory; a careless program can wipe out any performance improvement by pushing work back to the main processor
– GPU: multiple levels of memory hierarchy with widely different access latencies; identical control flow mandated for all threads within a warp
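
To make the GPU pitfall concrete, here is a minimal CUDA sketch (hypothetical kernels, not from the talk). The first kernel splits every warp between two paths, which the hardware serializes; the second branches on the warp index so that all 32 threads of a warp take the same path:

```cuda
// Divergent: the even/odd branch splits every warp, so the two paths
// execute serially and throughput roughly halves.
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

// Uniform: branching on the warp index keeps each warp on a single path.
__global__ void uniform(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}
```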

Slide 4/14 GPU Schematic
[Figure: CUDA hierarchy of threads, thread blocks, and grids, with per-thread private, per-block shared, and per-application global memory spaces; memory hierarchy]
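
The three memory spaces in the schematic map directly onto CUDA declarations; a minimal sketch (hypothetical kernel, assuming a launch with blockDim.x == 256):

```cuda
__device__ float g_data[256];           // per-application global memory

__global__ void memory_spaces(void) {
    float priv = 1.0f;                  // per-thread private (lives in a register)
    __shared__ float block_buf[256];    // per-block shared memory
    block_buf[threadIdx.x] = g_data[threadIdx.x] + priv;
    __syncthreads();                    // shared memory is visible block-wide
    g_data[threadIdx.x] = block_buf[255 - threadIdx.x];
}
```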

Slide 5/14 Specs Leading to Performance Problems
Shared memory and L1 cache are limited:
– Configurable split: 16 KB shared/48 KB L1, or 48 KB shared/16 KB L1
– Very fast access: 1+ TB/s
Global memory is accessible by all threads on the GPU:
– Larger capacity: 8 GB
– Slower access: 320 GB/s
If communication with host memory is required (over the PCI Express bus), it is much slower still.
– A PLDI '12 paper shows a 5X speedup from avoiding cyclic host-device communication
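
As a sketch of the cyclic-communication point (hypothetical kernel and driver, not the PLDI '12 code): keep intermediate data resident on the device and transfer once in each direction, rather than copying inside the iteration loop.

```cuda
#include <cuda_runtime.h>

// Hypothetical relaxation step: reads `in`, writes `out`.
__global__ void step_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * (in[i] + in[(i + 1) % n]);
}

void run_on_device(float *h_buf, int n, int iters) {
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_a, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);  // once
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iters; ++it) {
        // No cudaMemcpy in the loop: intermediates stay in device memory.
        step_kernel<<<blocks, threads>>>(d_b, d_a, n);
        float *tmp = d_a; d_a = d_b; d_b = tmp;                         // swap roles
    }
    cudaMemcpy(h_buf, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  // once
    cudaFree(d_a); cudaFree(d_b);
}
```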

Slide 6/14 Common Patterns of Performance Bugs
Memory bugs:
– Uncoalesced memory accesses
– Bank conflicts in shared memory
– Channel skew in global memory
– The schedule of host-to-device memory transfers
Multi-thread bugs:
– Block/thread configuration
– Branch divergence
Synchronization bugs
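
For one of the memory bugs above, a minimal sketch (hypothetical kernel, assuming a 32 × 32 thread block): a column read from a 32 × 32 shared-memory tile puts all 32 threads of a warp into the same bank; padding each row by one float removes the conflict.

```cuda
#define TILE 32

// Within a warp, threadIdx.x varies 0..31, so the column read below hits
// addresses strided by TILE floats: all in one bank (a 32-way conflict).
__global__ void column_read(float *out) {
    __shared__ float tile[TILE][TILE];   // change to [TILE][TILE + 1] to fix
    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```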

Slide 7/14 Performance Debugger Workflow
Program → Benchmarking (small scales or small data) → Profiling → Detect performance anomaly → Localize the problem (aided by static analysis) → Automatic program transformation → Re-benchmarking → Acceptable? If yes, stop; if no, go around the loop again.

Slide 8/14 Example of a Performance Bug
Matrix transpose on the GPU:
– The memory bandwidth of the GTX 280 is 140 GB/s
For a 2048 × 2048 matrix:
– Naïve transpose: 2.2 GB/s
– Coalesced transpose: 17.1 GB/s
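
The two variants, as minimal CUDA sketches of the classic pattern (not the talk's exact code; assumes a (TILE, TILE) thread block and a GPU that allows 1024-thread blocks): the naive kernel's writes are strided by the matrix height, while the staged kernel makes both the global read and the global write coalesced.

```cuda
#define TILE 32

// Naive: reads of `in` are coalesced, but writes to `out` are strided by h,
// so each warp's 32 writes land in 32 different memory segments.
__global__ void transpose_naive(float *out, const float *in, int w, int h) {
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h) out[x * h + y] = in[y * w + x];
}

// Coalesced: stage a tile in shared memory, then write it out row-wise.
// The +1 padding avoids bank conflicts on the transposed tile read.
__global__ void transpose_coalesced(float *out, const float *in, int w, int h) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h) tile[threadIdx.y][threadIdx.x] = in[y * w + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;   // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w) out[y * h + x] = tile[threadIdx.x][threadIdx.y];
}
```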

Slide 9/14 Can We Do This Automatically?
Some lessons from our prior work [HPDC '11] [HotDep '12]
Training phase (a series of small-scale testing runs):
– Instrumentation to record observational features
– Modeling to train a model that can predict observational features from control features
Deployment phase (large-scale production runs):
– Instrumentation to record the same features
– Detection to flag production runs with negative correlation
– Localization: use the trained model to reconstruct each observational feature, then rank features by reconstruction error

Slide 10/14 Can We Do This Automatically? Maybe
Some lessons from our prior work [HPDC '11] [HotDep '12]
Kernel canonical correlation analysis takes the observational feature X and the control feature Y and finds f and g such that f(X) and g(Y) are highly correlated.
[Figure: behavioral feature vs. scale of execution; corr(f(X), g(Y)) < 0 ⇒ BUG!]
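
A minimal host-side sketch of that detection rule (hypothetical helper; f and g are assumed to have been learned offline by the KCCA step): flag a production run when the two canonical projections are negatively correlated across observations.

```cuda
// Host-side C++ (compiles as part of a .cu file). fx[i] and gy[i] hold the
// projections f(X) and g(Y) for the i-th observation; the vectors are
// assumed non-empty, of equal length, and not constant.
#include <cmath>
#include <vector>

double pearson(const std::vector<double> &a, const std::vector<double> &b) {
    double ma = 0.0, mb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { ma += a[i]; mb += b[i]; }
    ma /= a.size(); mb /= b.size();
    double num = 0.0, da = 0.0, db = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        num += (a[i] - ma) * (b[i] - mb);
        da  += (a[i] - ma) * (a[i] - ma);
        db  += (b[i] - mb) * (b[i] - mb);
    }
    return num / std::sqrt(da * db);
}

bool looks_buggy(const std::vector<double> &fx, const std::vector<double> &gy) {
    return pearson(fx, gy) < 0.0;   // corr(f(X), g(Y)) < 0  =>  BUG!
}
```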

Slide 11/8 g’ -1 (f (x)) ABHRANTA: a Predictive Model for Program Behavior at Large Scale ABHRANTA replaced non-invertible transform g used by Vrisha with a linear transform g’ The new model provides an automatic way to reconstruct “bug-free” behavior at large scale, lifting the burden of manual analysis of program scaling behavior g’(*) x x f(x)

Slide 12/14 Results from an HPC Benchmark
AMG2006 is a parallel algebraic multigrid solver for linear systems, written in 104K lines of C code.
– The application is configured to solve the default 3D Laplace-type problem
Train on small-scale runs, test at larger scales (up to 4096 nodes)
Fault injection study: integer overflows, buffer overflows
Control features: X, Y, Z dimensions of the 3D grid
Observational features: all conditionals, indexed by calling context

Slide 13/14 Can We Do This for GPU Programs? We Think We Can
(Wild?) speculation! Features that make this approach more feasible:
1. More regular kernels than general-purpose programs
2. Good places to insert monitors to observe behavioral features
3. Often spare computational capacity close by
4. The types of performance bugs are limited
5. The types of program transformations are limited

Slide 14/14 Presentation available at the Dependable Computing Systems Lab (DCSL) web site: engineering.purdue.edu/dcsl