SC’13: Hands-on Practical Hybrid Parallel Application Performance Engineering
Score-P Hands-On CUDA: Jacobi example

Presentation transcript:

Slide 1: Score-P Hands-On CUDA: Jacobi example

Slide 2: Jacobi Solver

Jacobi example:
- Iterative solver for a system of equations
- The code uses OpenMP, CUDA, and MPI for parallelization

Domain decomposition:
- Halo exchange at the boundaries:
  - Via MPI between processes
  - Via CUDA between hosts and accelerators
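The CUDA kernel source itself is not reproduced in this transcript. As a rough sketch of what one Jacobi sweep on the device could look like (an illustration only; the kernel name, data layout, and update formula are assumptions, not taken from the tutorial's jacobi_cuda_kernel.cu):

// Illustrative sketch, not the tutorial code: each thread updates one interior
// grid point from the four neighbours of the previous iterate (one Jacobi sweep).
__global__ void jacobi_sweep(const double *u_old, double *u_new, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        u_new[j * nx + i] = 0.25 * (u_old[j * nx + i - 1] + u_old[j * nx + i + 1] +
                                    u_old[(j - 1) * nx + i] + u_old[(j + 1) * nx + i]);
    }
}

After each sweep, the boundary rows of each process's subdomain are exchanged as described above: between device and host with CUDA memory copies, and between neighbouring processes with MPI. When instrumented, Score-P records the kernel launches and these memory copies alongside the MPI and OpenMP events.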

Slide 3: Jacobi Without Instrumentation

# Compile host code
% mpicc -O3 -fopenmp -DUSE_MPI -I -c jacobi_cuda.c -o jacobi_mpi+cuda.o

# Compile CUDA kernel
% nvcc -O3 -c jacobi_cuda_kernel.cu -o jacobi_cuda_kernel.o

# Link executable
% mpicc -fopenmp -lm -L -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o -o ./jacobi_mpi+cuda

Slide 4: Instrumentation with Score-P

# Compile host code
% scorep mpicc -O3 -fopenmp -DUSE_MPI -I -c jacobi_cuda.c -o jacobi_mpi+cuda.o

# Compile CUDA kernel
% scorep nvcc -O3 -c jacobi_cuda_kernel.cu -o jacobi_cuda_kernel.o

# Link executable
% scorep mpicc -fopenmp -lm -L -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o -o ./jacobi_mpi+cuda

Slide 5: CUDA Advanced Measurement Configuration

Enable recording of CUDA events with the CUPTI interface via the environment variable SCOREP_CUDA_ENABLE:
- Provide a list of recording types, e.g.:
  % export SCOREP_CUDA_ENABLE=runtime,driver,gpu,kernel,idle
- Or start with the default configuration:
  % export SCOREP_CUDA_ENABLE=yes

Adjust the CUPTI buffer size (in bytes) as needed:
  % export SCOREP_CUDA_BUFFER=100000

Slide 6: SCOREP_CUDA_ENABLE: Recording Types

Recording type   Remark
yes/DEFAULT/1    "runtime, kernel, concurrent, memcpy"
no               Disable CUDA measurement (same as unsetting SCOREP_CUDA_ENABLE)
runtime          CUDA runtime API
driver           CUDA driver API
kernel           CUDA kernels
kernel_counter   Fixed CUDA kernel metrics
concurrent       Concurrent kernel recording
idle             GPU compute idle time
pure_idle        GPU idle time (memory copies are not idle)
memcpy           CUDA memory copies
sync             Record implicit and explicit CUDA synchronization
gpumemusage      Record CUDA memory (de)allocations as a counter
stream_reuse     Reuse destroyed/closed CUDA streams
device_reuse     Reuse destroyed/closed CUDA devices
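For example, several of the types above can be combined in one comma-separated list. The following setting is one illustrative combination (not prescribed by the slides) that adds synchronization and GPU memory-usage recording on top of kernel and memory-copy events:

% export SCOREP_CUDA_ENABLE=runtime,kernel,memcpy,sync,gpumemusage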

Slide 7: Measurement (Profiling)

% export OMP_NUM_THREADS=6
% export SCOREP_CUDA_ENABLE=yes
% export SCOREP_CUDA_BUFFER=
% export SCOREP_EXPERIMENT_DIRECTORY=jacobi_cuda_profile
% mpirun -n 2 ./jacobi_mpi+cuda
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 6 threads + one Tesla T10 Processor for each process.
307 of 2049 local rows are calculated on the CPU to balance the load between the CPU and the GPU.
[per-iteration residual output for iterations 0 to 900 and the total runtime omitted]

Slide callouts: the 4096 x 4096 mesh is the problem size (x and y dimensions); "307 of 2049 local rows" is the load-balancing factor (in this example 15% of the computations are calculated on the CPU).

Slide 8: CUBE4 Analysis

% cube jacobi_cuda_profile/profile.cubex

Slide 9: Scoring

Do we need to filter (to limit measurement overhead and memory footprint)? This is a very small example, so no filtering is needed.

% scorep-score jacobi_cuda_profile/profile.cubex
Estimated aggregate size of event trace (total_tbc): ... bytes
Estimated requirements for largest trace buffer (max_tbc): ... bytes
(hint: When tracing set SCOREP_TOTAL_MEMORY > max_tbc to avoid intermediate flushes
 or reduce requirements using a file listing names of USR regions to be filtered.)

flt  type  max_tbc  time %  region
     ALL   ...      ...     ALL
     OMP   ...      ...     OMP
     USR   ...      ...     USR
     MPI   ...      ...     MPI
     COM   ...      ...     COM
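For larger runs that do need filtering, the hint above refers to a Score-P filter file listing the USR regions to exclude. A minimal sketch (the file and region names below are placeholders, not functions from the Jacobi code):

# jacobi.filt (hypothetical): exclude small, frequently called user functions
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    small_helper_*
SCOREP_REGION_NAMES_END

Its effect on the estimated trace size can be previewed with "scorep-score -f jacobi.filt jacobi_cuda_profile/profile.cubex" before rerunning the measurement.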

Slide 10: Measurement (Tracing)

% export OMP_NUM_THREADS=6
% export SCOREP_CUDA_ENABLE=yes
% export SCOREP_CUDA_BUFFER=
% export SCOREP_EXPERIMENT_DIRECTORY=jacobi_cuda_trace
% export SCOREP_ENABLE_PROFILING=false
% export SCOREP_ENABLE_TRACING=true
% mpirun -n 2 ./jacobi_mpi+cuda
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 6 threads + one Tesla T10 Processor for each process.
307 of 2049 local rows are calculated on the CPU to balance the load between the CPU and the GPU.
[per-iteration residual output for iterations 0 to 900 and the total runtime omitted]

Slide 11: Vampir Analysis

% vampir jacobi_cuda_trace/traces.otf2