Performance Measurement of Applications with GPU Acceleration using CUDA
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile
Computer and Information Science Department, Performance Research Laboratory
University of Oregon

ParCo 2009

Outline
 Motivation
 Performance perspectives
   Acceleration, asynchrony, and concurrency
   CPU-GPU execution scenarios
 Performance measurement for GPGPUs
   Accelerator performance measurement in PGI compiler
   TAUcuda performance measurement
     operation and API
 TAUcuda tests and application case studies
 Conclusions and future work

Motivation
 Heterogeneous computing technology more accessible
   Multicore processors
   Manycore accelerators (e.g., NVIDIA Tesla GPU)
   High-performance processing engines (e.g., IBM Cell BE)
 Achieving performance potential is challenging
   Complexity of hardware operation and programming interface
   CUDA created to help in GPU accelerator code development
 Few performance tools for parallel accelerated applications
   Need to understand acceleration in context of whole program
   Need integration of accelerator measurements in scalable parallel performance tools
 Focus on GPGPU performance measurement using CUDA

Heterogeneous Performance Perspective
 Heterogeneous applications can have concurrent execution
   Main “host” path and “external” task paths
   Want to capture performance for all execution paths
   External execution may be difficult or impossible to measure
 “Host” creates measurement view for external entity
   Maintains local and remote performance data
   External entity may provide performance data to the host
 What perspective does the host have of the external entity?
   Determines the semantics of the measurement data
 Existing parallel performance tools are CPU(host)-centric
   Event-based sampling (not appropriate for accelerators)
   Direct measurement (through instrumentation of events)

CUDA Performance Perspective
 CUDA enables programming of kernels for GPU acceleration
   GPU acceleration acts as an external task
   Performance measurement appears straightforward
 Execution model complicates performance measurement
   Synchronous and asynchronous operation with respect to host
   Overlapping of data transfer and kernel execution
   Multiple GPU devices and multiple streams per device
 Different acceleration kernels used in parallel application
   Multiple application sections
   Multiple application threads/processes
 See performance in context: temporal, spatial, thread/process
 Two general approaches: synchronous and asynchronous
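The execution-model complications above can be seen in a minimal two-stream sketch (illustrative only; the kernel, sizes, and launch configuration are our own, not from the paper):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate asynchronous overlap.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void run_two_streams(float *h0, float *h1, float *d0, float *d1, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Copies and launches return immediately; the GPU may overlap the
    // transfer on one stream with the kernel on the other, so host-side
    // timestamps alone cannot attribute time to either stream.
    cudaMemcpyAsync(d0, h0, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d0, n);
    cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);

    cudaStreamSynchronize(s0);  // host learns of completion only here
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```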

CPU-GPU Execution / Measurement Scenarios
[Figure: synchronous and asynchronous CPU-GPU execution scenarios]

Approach
 Consider use of NVIDIA PerfKit and CUDA Profiler
   PerfKit provides low-level data for GPU driver interface
     limited for use with CUDA programming environment
   CUDA Profiler provides extensive stream-level measurements
     creates post-mortem event trace of kernel operation on streams
     difficult to merge with application performance data
 Goal is to produce profiles (traces) showing distribution of accelerator performance with respect to application events
 Approach 1: force all measurements to be synchronous
   Restricts CUDA usage, disallowing concurrent operation
   Create new thread for every CUDA invocation
 Approach 2: develop CUDA measurement mechanism
   Merge with TAU performance system
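Approach 1 amounts to bracketing each CUDA operation with a host timer and a forced synchronization, roughly as below (an assumed sketch, not the actual TAU implementation; `launch_kernel` is a placeholder):

```cuda
#include <cuda_runtime.h>
#include <sys/time.h>

static double host_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

// Time one asynchronous launch by forcing it to complete. This yields a
// valid host-side interval, but serializes the device: the concurrent
// stream behavior of the application is lost.
double measure_sync(void (*launch_kernel)(void)) {
    double t0 = host_seconds();
    launch_kernel();             // asynchronous: returns immediately
    cudaThreadSynchronize();     // CUDA 2.x-era blocking wait
    return host_seconds() - t0;  // interval now bounds the GPU work
}
```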

PGI Compiler for GPU (using CUDA)
 PGI accelerator compiler (PGI 9.x, C and Fortran, x64 Linux)
   Loop parallelization for acceleration on GPUs using CUDA
   Directive-based, presenting a GPGPU programming abstraction
   Compiler, not source translation – CUDA code hidden
 TAU measurement of PGI acceleration
   Wrappers of runtime system
   Track runtime system events as seen from the host processor
   Show source information associated with events
     Routine name
     File name, source line number for kernel
     Variable names in memory upload, download operations
     Grid sizes
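The directive-based abstraction looks roughly like the following (syntax recalled from the PGI 9.x accelerator model, the precursor to OpenACC; consult PGI documentation for the exact form). The generated CUDA is hidden, so TAU instead wraps the PGI runtime calls the compiler emits:

```cuda
// Matrix multiply offloaded by a PGI accelerator region directive.
void matmul(int n, const float *a, const float *b, float *c) {
    #pragma acc region
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;  // data movement handled by the runtime
            }
    }
}
```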

Matrix Multiplication Profile (3000x3000, ~22 GF)
[Figure: TAU profile of the PGI-accelerated matrix multiplication]

CUDA Programming for GPGPU
 PGI compiler represents GPGPU programming abstraction
   Performance tool uses runtime system wrappers
     essentially a synchronous call performance model!!!
 In general, programming of GPGPU devices is more complex
   CUDA environment
   Programming of multiple streams and GPU devices
     multiple streams execute concurrently
   Programming of data transfers to/from GPU device
   Programming of GPU kernel code
   Synchronization with streams
   Stream event interface

TAU CUDA Performance Measurement (TAUcuda)
 Build on CUDA stream event interface
   Allow “events” to be placed in streams and processed
     events are timestamped
   CUDA runtime reports GPU timing in event structure
 Events are reported back to CPU when requested
   use begin and end events to calculate intervals
 Want to associate TAU event context with CUDA events
   Get top of TAU event stack at begin (TAU context)
 CUDA kernel invocations are asynchronous
   CPU does not see actual CUDA “end” event
   CPU retrieves events in a non-blocking and blocking manner
   Want to capture “waiting time”
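The stream event mechanism TAUcuda builds on can be sketched with the plain CUDA calls below (a minimal sketch of the underlying interface, not TAUcuda internals; `work` is a placeholder kernel):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n);  // placeholder measured kernel

float time_on_stream(cudaStream_t stream, float *d, int n) {
    cudaEvent_t begin, end;
    cudaEventCreate(&begin);
    cudaEventCreate(&end);

    cudaEventRecord(begin, stream);       // GPU timestamps when dequeued
    work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaEventRecord(end, stream);

    // cudaEventQuery(end) would poll without blocking; here we block.
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, begin, end);  // GPU-side interval in ms
    cudaEventDestroy(begin);
    cudaEventDestroy(end);
    return ms;
}
```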

CPU-GPU Operation and TAUcuda Events

TAU CUDA Measurement API

void tau_cuda_init(int argc, char **argv);
 To be called when the application starts
 Initializes data structures and checks GPU status

void tau_cuda_exit();
 To be called before any thread exits at end of application
 Outputs all CUDA profile data for each thread of execution

void* tau_cuda_stream_begin(char *event, cudaStream_t stream);
 Called before CUDA statements to be measured
 Returns handle which should be used in the end call
 If the event is new, or the TAU context is new for the event, a new CUDA event profile object is created

void tau_cuda_stream_end(void *handle);
 Called immediately after CUDA statements to be measured
 Handle identifies the stream
 Inserts a CUDA event into the stream
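A hypothetical use of these calls around one kernel on one stream (the event name, kernel, and launch configuration are our own, not from the paper):

```cuda
tau_cuda_init(argc, argv);               // once, at application start

cudaStream_t s;
cudaStreamCreate(&s);

void *h = tau_cuda_stream_begin("scale_kernel", s);  // begin event in s
scale<<<blocks, threads, 0, s>>>(d_data, n);         // measured region
tau_cuda_stream_end(h);                              // end event in s
```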

TAU CUDA Measurement API (2)

vector<int> tau_cuda_update();
 Checks for completed CUDA events on all streams
 Non-blocking; returns # completed on each stream

int tau_cuda_update(cudaStream_t stream);
 Same as tau_cuda_update() except for a particular stream
 Non-blocking; returns # completed on the stream

vector<int> tau_cuda_finalize();
 Waits for all CUDA events to complete on all streams
 Blocking; returns # completed on each stream

int tau_cuda_finalize(cudaStream_t stream);
 Same as tau_cuda_finalize() except for a particular stream
 Blocking; returns # completed on the stream
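The retrieval calls fit a polling pattern like this hypothetical sketch, where the blocking wait at the end is exactly where TAUcuda can observe “waiting time”:

```cuda
// Overlap host work with the GPU, harvesting finished event intervals
// as they complete. cpu_has_work()/do_cpu_work() are placeholders.
while (cpu_has_work()) {
    do_cpu_work();
    tau_cuda_update(s);  // non-blocking: collect events completed on s
}
tau_cuda_finalize(s);    // blocking: waiting time attributable to the host

// Later, before each thread exits:
tau_cuda_exit();         // writes the per-thread CUDA profile data
```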

Scenario Results – One and Two Streams
 Run simple CUDA experiments to validate TAUcuda
 Tesla S1070 test system

Scenario Results – Two Devices, Two Contexts

TAUcuda Compared to CUDA Profiler
 CUDA Profiler integrated in CUDA runtime system
   Captures time measures for GPGPU kernel and memory tasks
   Creates a trace in memory and outputs at end of execution
 Can use to verify TAUcuda
   Slight time variation due to differences in mechanism

Case Study: TAUcuda in NAMD and ParFUM
 TAU integrated in Charm++ (ICPP 2009 paper)
 Charm++ applications
   NAMD is a molecular dynamics application
   Parallel Framework for Unstructured Meshing (ParFUM)
   Both have been accelerated with CUDA
 Demonstrate use of TAUcuda
   Observe the effect of CUDA acceleration
   Show scaling results for GPU cluster execution
 Experimental environments
   Two S1070 GPU servers (University of Oregon)
   AC cluster: 32 nodes, 4 Tesla GPUs per node (UIUC)

NAMD GPU Profile (Two GPU Devices)
 Test out TAUcuda with NAMD
 Two processes, with one Tesla GPU for each
[Figure: CPU profile; GPU profile (P0); GPU profile (P1)]

NAMD GPU Efficiency Gain (16 versus 32 GPUs)
 AC cluster: 16 and 32 processes
   dev_sum_forces: 50% improvement
   dev_nonbonded: 100% improvement
[Figure: per-event profile table with columns Event, TAU Context, Device, Stream]

NAMD GPU Scaling (4 to 64 GPUs)
 Strong scaling by event and device number
 Good scaling for non-bonded calculations
 Sum forces scales less well, but overall is small
[Figure: scaling efficiency vs. number of devices for non-bonded and sum-forces calculations]

ParFUM CUDA Speedup (Single CPU plus GPU)
 Problem size: 128 x 8 x 8 mesh
 With GPU acceleration, only 9 seconds in CUDA kernels

Case Study: HMPP-TAU
[Figure: HMPP-TAU measurement architecture – user application instrumented with TAU and TAUcuda over the HMPP runtime and HMPP CUDA codelets; measurements capture user events, HMPP events, codelet events, CUDA stream events, and waiting information]

HMPP Data/Overlap Experiment
[Figure: data-transfer/compute overlap experiment annotated with TAUcuda events]

Conclusions and Future Work
 Heterogeneous parallel computing will challenge parallel performance technology
   Must deal with diversity in hardware and software
   Must deal with richer parallelism and concurrency
 Developed and demonstrated TAUcuda
   TAU + CUDA measurement approach
   Showed case studies and integrated in HMPP
   Next targeting OpenCL (TAUopenCL)
 Better merge TAU and TAUcuda performance data
   Take advantage of other tools in TAU toolset
     Performance database (PerfDMF), data mining (PerfExplorer)
   Integrated in application and heterogeneous environments