Profiling & Tuning Applications
CUDA Course, July. István Reguly.

Overview
- Performance limiters: bandwidth, computations, latency
- Using the Visual Profiler
- Checklist
- Case study: molecular dynamics code
- Command-line profiling (MPI)
- Auto-tuning

Introduction
Why is my application running slow?
- Follow along on Emerald: module load cuda/
- NVIDIA Visual Profiler: works with CUDA, needs some tweaks to work with OpenCL
- nvprof: command-line tool, can be used with MPI applications

Identifying Performance Limiters
- CPU: setup, data movement
- GPU: bandwidth-, compute-, or latency-limited
- Number of instructions for every byte moved:
  - ~4.5 : 1 with ECC on
  - ~3.6 : 1 with ECC off
- Algorithmic analysis gives a good estimate, but actual code is likely different:
  - instructions for loop control, pointer math, etc.
  - memory access patterns
- How to find out?
  - Use the profiler (quick, but approximate)
  - Use source code modification (takes more work)

Analysis with Source Code Modification
- Time memory-only and math-only versions of the kernel
  - Not so easy for kernels with data-dependent control flow
  - Good for estimating time spent on accessing memory vs. executing instructions
  - Shows whether the kernel is memory- or compute-bound
- Put an if statement, controlled by a kernel argument, around the math/memory instructions (see the sketch below)
  - Use dynamic shared memory to keep the same occupancy in each variant
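
A minimal sketch of this trick, with illustrative names (not from the slides): a run-time flag selects the variant, and a store guarded by a flag that is never set at run time keeps the compiler from eliminating the math-only path.

__global__ void kernel_variant(const float *in, float *out, int n,
                               int do_mem, int do_math, int never_true)
{
    extern __shared__ float pad[];     // dynamic shared memory: pad to the full
                                       // version's usage so occupancy matches
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = 0.0f;
    if (do_mem) v = in[i];             // memory part

    if (do_math)                       // math part (stand-in arithmetic)
        for (int k = 0; k < 64; ++k)
            v = v * 1.0001f + 0.5f;

    // never_true is always passed as 0, but the compiler cannot know that,
    // so the arithmetic stays live even in the math-only run (do_mem == 0)
    if (do_mem || never_true) out[i] = v;
}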

Example scenarios (timelines of the mem-only, math-only, and full versions)
- Memory-bound: good overlap between memory and math; latency is not a problem
- Math-bound: good overlap between memory and math
- Well balanced: good overlap between memory and math
- Memory- and latency-bound: poor overlap; latency is a problem

NVIDIA Visual Profiler
- Launch with nvvp
- Collects metrics and events during execution:
  - calls to the CUDA API
  - memory transfers
  - kernels: occupancy, computations, memory accesses
- Requires deterministic execution!

Visual Profiler Demo

Concurrent kernels

Metrics vs. Events

Source correlation (CUDA 5.0)
- Debug builds: compile with -G
- Optimized builds: compile with -lineinfo

How to use the profiler
- Understand the timeline: where and when is your code running?
- Add annotations to your application with the NVIDIA Tools Extension (markers, names, etc.)
- Find obvious bottlenecks
- Focus profiling on the region of interest and dive into it
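
A small sketch of such an annotation with the NVIDIA Tools Extension (the range name and function are illustrative):

#include "nvToolsExt.h"          // link with -lnvToolsExt

void solver_step(void)
{
    nvtxRangePushA("solver step");   // named range appears on the profiler timeline
    // ... kernel launches, memcopies ...
    nvtxRangePop();
}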

Checklist
- cudaDeviceSynchronize()
  - Most API calls (e.g. kernel launches) are asynchronous
  - There is an overhead when launching kernels; get rid of unnecessary cudaDeviceSynchronize() calls to hide this latency
  - For timing, use events or callbacks (CUDA 5.0)
- Cache config: 16/48 or 48/16 kB L1/shared (the default is 48 kB shared!)
  - cudaDeviceSetCacheConfig / cudaFuncSetCacheConfig (see the sketch below)
  - Check whether shared memory usage is a limiting factor
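
A minimal sketch of setting the cache configuration (the kernel name my_kernel is illustrative):

cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);           // device-wide preference: 48 kB L1 / 16 kB shared
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);  // per-kernel override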

Checklist
- Occupancy
  - Max 1536 threads or 8 blocks per SM on Fermi (2048 threads / 16 blocks on Kepler)
  - Limited by the amount of registers and shared memory:
    - max 63 registers/thread, the rest is spilled to global memory (255 on K20 Keplers)
    - you can limit register usage explicitly (-maxrregcount=xx)
    - 48 kB / 16 kB shared/L1: don't forget to set it
  - The Visual Profiler tells you what the limiting factor is
  - In some cases, though, it is faster not to maximise occupancy (see Volkov's paper) -> auto-tuning!

Verbose compile
- Add -Xptxas=-v
- Feed the output into the Occupancy Calculator

ptxas info : Compiling entry function '_Z10fem_kernelPiS_' for 'sm_20'
ptxas info : Function properties for _Z10fem_kernelPiS_
    856 bytes stack frame, 980 bytes spill stores, 1040 bytes spill loads
ptxas info : Used 63 registers, 96 bytes cmem[0]

Checklist
- Precision mix (e.g. 1.0 vs. 1.0f): check with cuobjdump
  - F2F.F64.F32 conversions (6x the cost of a multiply)
  - The IEEE standard mandates converting to the higher precision
  - Integer multiplications are now expensive (6x)
- cudaMemcpy
  - Introduces explicit synchronisation and high latency
  - Is it necessary? It may be cheaper to launch a kernel which immediately exits
  - Could it be asynchronous? (Pin the memory!)
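
For example, a single stray double-precision literal is enough to trigger the conversions (a hypothetical kernel, shown for illustration):

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = x[i] * 0.5;    // 0.5 is a double: promotes to F64 and converts back (F2F in cuobjdump)
        // x[i] = x[i] * 0.5f; // single-precision literal avoids the conversions
    }
}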

Asynchronous Memcopy
- Start generating data; start the copy (Memcopy H2D) as soon as some data is ready
- Keep feeding the GPU with data; start computing on it as soon as there is enough
- Copy the results back (Memcopy D2H) while later chunks are still being computed
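
A minimal sketch of this pipeline with streams and pinned memory; the kernel, sizes, and names are illustrative, not from the slides:

#include <cuda_runtime.h>

#define N        (1 << 20)
#define NSTREAMS 4

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // stand-in computation
}

int main(void)
{
    float *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(float));   // pinned memory: required for async copies
    cudaMalloc(&d_data, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    cudaStream_t stream[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&stream[s]);

    // Chunked pipeline: copy in, compute, copy out; chunks in different
    // streams can overlap on hardware with copy/compute overlap
    int chunk = N / NSTREAMS;
    for (int s = 0; s < NSTREAMS; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}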

Case Study: Molecular Dynamics
- ~10,000 atoms
- Short-range interactions – Verlet lists
- Very long simulation times – the production code runs for ~1 month

Gaps between kernels: get rid of cudaDeviceSynchronize() – a free 8% speedup

None of the kernels use shared memory: set L1 to 48 kB – a free 10% speedup

With those fixed, forces_second_step is 89% of the runtime

Analysing kernels
- Fairly low runtime – launch latency?
- Few, small blocks – tail effect?
- Low theoretical occupancy: 63 registers/thread – spills??
- L1 configuration
- Analyze All

Memory
- Low efficiency; overhead because of spills
- But a very low total utilization (5.66 GB/s) – not really a problem

Instruction
- Very high branch divergence: threads in a warp doing different things
  - SIMD execution – all branches are executed sequentially
  - Need to look into the code
- The rest is okay

Occupancy
- Low occupancy; achieved is much lower than theoretical – load imbalance, tail
- The limiter is block size
  - In this case changing it doesn't help; there are already too few blocks
- Structural problem – need to look into the code

Structural problems
- 1 thread per atom, ~10k atoms – too few threads
- Force computation with each neighbour:
  - redundant computations
  - different numbers of neighbours – divergence
- Interaction-based computation instead (2x speedup):
  - exploit symmetry
  - lots more threads, a unit of work per thread

Memory-bound kernels
What can you do if a kernel is memory-bound?
- Check the access pattern – profiler: Global Load/Store Efficiency
- Struct of Arrays (xxx…yyy…zzz…) vs. Array of Structs (xyzxyzxyz…)
- Cache: every memory transaction is 128 bytes
- High occupancy is required to get close to the theoretical bandwidth
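
A sketch of the two layouts and their access patterns (struct and kernel names are illustrative):

struct AtomAoS { float x, y, z; };        // Array of Structs: xyzxyzxyz...

struct AtomsSoA { float *x, *y, *z; };    // Struct of Arrays: xxx...yyy...zzz...

// AoS: consecutive threads load .x at 12-byte strides, so each 128-byte
// transaction carries unwanted y/z data -> low load efficiency
__global__ void scale_aos(AtomAoS *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i].x *= s;
}

// SoA: consecutive threads load consecutive floats -> coalesced accesses
__global__ void scale_soa(AtomsSoA a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a.x[i] *= s;
}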

nvprof
- Command-line profiling tool
- Text output (CSV)
- CPU and GPU activity, trace
- Event collection (no metrics)
- Headless profile collection – can be used in a distributed setting
- Visualise results using the Visual Profiler

Usage
nvprof [nvprof_args] <app> [app_args]
Use --query-events to get a list of events you can profile.

Time(%),Time,Calls,Avg,Min,Max,Name
,us,,us,us,us,
58.02,,2,,,,"op_cuda_update()"
18.92,,2,,,,"op_cuda_res()"
18.38,,18,,,,"[CUDA memcpy HtoD]"
4.68,,3,,,,"[CUDA memcpy DtoH]"

Distributed Profiling
- Use the nvprof-script from the eIS wiki:
  mpirun [mpirun args] nvprof-script -o out [nvprof args] <app> [app args]
- This creates out.0, out.1, … files for the different processes
- Import them into the Visual Profiler: File / Import nvprof Profile

Auto-tuning
- Several parameters affect performance:
  - block size
  - amount of work per block
  - application-specific parameters
- Which combination performs best?
- Auto-tuning with Flamingo: #define or read the sizes, recompile/rerun the combinations (see the sketch below)
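
A sketch of this style of #define parameterisation; the names and the exact mechanism Flamingo uses are assumptions for illustration. The tuner overrides BLOCK_SIZE and WORK_PER_THREAD on the compile line, recompiles, and reruns each combination.

#ifndef BLOCK_SIZE
#define BLOCK_SIZE 128        // tuning parameter: threads per block
#endif
#ifndef WORK_PER_THREAD
#define WORK_PER_THREAD 4     // tuning parameter: elements per thread
#endif

__global__ void saxpy_tuned(float a, const float *x, float *y, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * WORK_PER_THREAD;
    for (int k = 0; k < WORK_PER_THREAD; ++k)
        if (base + k < n)
            y[base + k] += a * x[base + k];
}

// Host side: the grid size follows from the tuned parameters
// int blocks = (n + BLOCK_SIZE * WORK_PER_THREAD - 1) / (BLOCK_SIZE * WORK_PER_THREAD);
// saxpy_tuned<<<blocks, BLOCK_SIZE>>>(a, x, y, n);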

Auto-tuning Case Study
- Thread cooperation on sparse matrix-vector products:
  - multiple threads do partial dot products on each row
  - reduction in shared memory
- Auto-tune for different matrices – caching behaviour is difficult to predict
- Develop a heuristic for the cooperation width vs. the average row length
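
A sketch of such a cooperative SpMV kernel for CSR-format matrices; COOP is the auto-tuned cooperation width, and all names are illustrative. It assumes blockDim.x is a multiple of COOP.

#define COOP 4   // threads cooperating on one row (the tuning parameter)

__global__ void spmv_coop(const int *row_ptr, const int *col_idx,
                          const float *val, const float *x,
                          float *y, int nrows)
{
    extern __shared__ float partial[];          // blockDim.x floats
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / COOP;
    int lane = tid % COOP;

    // Each of the COOP threads accumulates a strided slice of the row's dot product
    float sum = 0.0f;
    if (row < nrows)
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += COOP)
            sum += val[j] * x[col_idx[j]];

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory across the row's COOP lanes
    for (int s = COOP / 2; s > 0; s >>= 1) {
        if (lane < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (row < nrows && lane == 0)
        y[row] = partial[threadIdx.x];
}

// Launch: spmv_coop<<<blocks, threads, threads * sizeof(float)>>>(...);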

Auto-tuning Case Study: results