Srinath Vadlamani, Youngsung Kim, and John Dennis, ASAP-TDD-CISL, NCAR. A COMPARISON OF CLIMATE APPLICATIONS ON ACCELERATED AND CONVENTIONAL ARCHITECTURES.
The overall picture of the acceleration effort and the different techniques should be understood. [Srinath V.] We use small investigative kernels to teach us. We use instrumentation tools to help us work with the larger code base. The investigative DG_KERNEL shows what is possible when everything is simple. [Youngsung K.] DG_KERNEL helped us understand the hardware. DG_KERNEL helped us understand the coding practices and software instructions needed to achieve superb performance. PRESENTATION HAS TWO PARTS
ASAP Personnel: Srinath Vadlamani, John Dennis, Youngsung Kim, Michael Arndt, Ben Jamroz, and Rich Loft. Active collaborators: Intel: Michael Greenfield, Ruchira Sasanka, Sergey Egorov, Karthik Raman, and Mark Lubin; NREL: Ilene Carpenter. APPLICATION AND SCALABILITY PERFORMANCE (ASAP) TEAM RESEARCHING MODERN MICRO-ARCHITECTURE FOR CLIMATE CODES
Climate simulations cover hundreds to thousands of years of activity. Currently, high-resolution climate simulations run at 2~3 simulated years per day (SYPD) [~40k PEs]. GPUs and coprocessors can help increase the SYPD rate. Having many collaborators mandates the use of many architectures. We must use these architectures efficiently for a successful SYPD speedup, which requires knowing the hardware! CLIMATE CODES ALWAYS NEED A FASTER SYSTEM.
Conventional CPU based: NCAR Yellowstone (Xeon: SNB) - CESM, HOMME; ORNL Titan (AMD: Interlagos) - benchmark kernel. Xeon Phi based: TACC Stampede - CESM; NCAR test system (SE10x changing to 7120) - HOMME. GPU based: NCAR Caldera cluster (M2070Q) - HOMME; ORNL Titan (K20x) - HOMME; TACC Stampede (K20) - benchmark kernels only. WE HAVE STARTED THE ACCELERATION EFFORT ON SPECIFIC PLATFORMS.
CESM is a large application, so we need to create benchmark kernels to understand the hardware. Smaller examples are easier to understand and manipulate. The first two kernels we have created are DG_KERNEL from HOMME [detailed by Youngsung] and a standalone driver for WETDEPA_V2. WE CAN LEARN HOW TO USE ACCELERATED HARDWARE FOR CLIMATE CODES BY CREATING REPRESENTATIVE EXAMPLES.
We created DG_KERNEL knowing it could be a well-vectorized code (with help). What if we want to start cherry-picking subroutines and loops to try the learned techniques? Instrumentation tools are available, with teams that are willing to support your efforts. Trace-based tools offer great detail. Profiling tools present summaries up front. A previous NCAR-SEA conference highlighted such tools. KNOWING WHAT CAN BE ACCELERATED IS HALF THE BATTLE.
Extrae is a tracing tool developed at the Barcelona Supercomputing Center (H. Servat, J. Labarta, J. Gimenez). The automatic performance identification process is a BSC research project. Extrae produces a time series of communication and hardware counter events. Paraver is the visualizer, which also performs statistical analysis. There are clustering techniques that use a folding concept plus the research identification process to create "synthetic" traces with fewer samples. EXTRAE TRACING CAN PICK OUT PROBLEMATIC REGIONS OF A LARGE CODE.
CLUSTERING GROUPS WITH SIMILAR BAD COMPUTATIONAL CHARACTERISTICS IS A GOOD GUIDE. Result of an Extrae trace of CESM on Yellowstone. Similar to exclusive execution time.
EXTRAE TRACING EXPOSED POSSIBLE WASTE OF CYCLES. Red: instruction count. Blue: d(INS)/dt.
PARAVER IDENTIFIED CODE REGIONS. Trace identifies what code is active when. We now examine code regions for characteristics amenable to acceleration.
Automatic Performance Identification highlighted these groups' subroutines (overall execution time %):
Group A: conden 2.7, compute_usschu 3.3, rtrnmc 1.75.
Group B: micro_gm_tend 1.36, wetdepa_v2 2.5.
Group C: reftra_sw 1.71, spcvmc_sw 1.21, vrtqdr_sw 1.43.
A small number of lines of code ready to be vectorized.
The subroutine has sections of doubly nested loops. These loops are very long and contain branches. Compilers have trouble vectorizing loops that contain branches. The restructuring started with breaking up the loops. We collected scalars into arrays for vector operations. We broke very long expressions into smaller pieces. WETDEPA_V2 CAN BE VECTORIZED WITH RECODING.
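A minimal sketch of the recoding pattern described above, not the actual WETDEPA_V2 code (the routine and variable names scavenge_sketch, precip, tracer, scav, tend, and frac_arr are hypothetical): the branch inside a long loop body is folded into a merge() over a temporary array, leaving a short, branch-free loop that the compiler can vectorize.

subroutine scavenge_sketch(ncol, precip, tracer, scav, tend)
   ! Hypothetical illustration of the recoding pattern, not the actual WETDEPA_V2 code.
   implicit none
   integer, intent(in)    :: ncol
   real(8), intent(in)    :: precip(ncol), tracer(ncol), scav
   real(8), intent(inout) :: tend(ncol)
   real(8) :: frac_arr(ncol)
   integer :: i
   ! Original pattern: a branch inside a long loop body blocks vectorization.
   !   do i = 1, ncol
   !      if (precip(i) > 0.0_8) then
   !         frac = scav*precip(i)
   !      else
   !         frac = 0.0_8
   !      end if
   !      tend(i) = tend(i) - frac*tracer(i)
   !   end do
   ! Restructured: the loop-carried scalar is collected into an array and the
   ! branch is expressed with merge(), leaving a branch-free vectorizable loop.
   frac_arr = merge(scav*precip, 0.0_8, precip > 0.0_8)
   do i = 1, ncol
      tend(i) = tend(i) - frac_arr(i)*tracer(i)
   end do
end subroutine scavenge_sketch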
MODIFICATION OF THE CODE COMPARES WELL WITH COMPILER OPTIMIZATION. Vectorization was checked with -vec-report=3,6. The code was optimized with a modification of only a small number of lines. Building the original code with -O3 -fast gave incorrect results.
MODIFIED WETDEPA_V2 PLACED BACK INTO CESM ON SNB SHOWS BETTER USE OF RESOURCES. From 2.5% to 0.7% of overall execution time in CESM on Yellowstone.
The CAM-SE configuration was profiled on Stampede at TACC using TAU. TAU provides different levels of introspection into subroutine and loop efficiency. This process taught us more about hardware counter metrics. The initial investigation fits into a core-count-to-core-count comparison. Profilers are also useful tools for understanding code efficiency in the BIG code.
Hot spots can be associated with the largest exclusive execution time. A long time may indicate a branchy section of code. LONG EXCLUSIVE TIME ON BOTH DEVICES IS A GOOD PLACE TO START LOOKING.
Low VI is a candidate for acceleration techniques. High VI could be misleading. Note: the VI metric is defined differently on SandyBridge and Xeon Phi (see the PAPI wiki page PAPITopics:SandyFlops). POSSIBLE SPEEDUP CAN BE ACHIEVED WITH A GAIN IN VECTORIZATION INTENSITY (VI)
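For reference, a common counter-based definition (an assumption based on standard Intel tuning guidance, not stated on the slide): on Xeon Phi, vectorization intensity is typically computed as VI = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, i.e. the average number of vector lanes doing useful work per VPU instruction, with a maximum of 8 for double precision on the 512-bit registers. On SandyBridge, the floating-point counters described on the PAPI SandyFlops page behave differently, which is why the two VI values are not directly comparable.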
CESM ON KNC NOT COMPETITIVE TODAY.
Device / avg. time step [s]:
SandyBridge -O
KNC -O
KNC -O2, "derivative_mod.F90" -O
KNC -O2, "derivative_mod.F90" -O1 -align array64bytes
Configuration: FC5 ne16_g37, 16 MPI ranks/node, 1 rank/core, 8 nodes, single thread.
HYBRID PARALLELISM IS PROMISING FOR CESM ON KNC.
Table columns: Device | MPI ranks/device | Threads/MPI rank | Avg. 1-day dt [s]; rows compare Dual SNB (two configurations) against Xeon Phi SE10x (six configurations).
FCIDEAL ne16_ne16. Stampede: 8 nodes. Fortran 2003 use of allocatable derived-type components to overcome a threading issue [all -O2]. Intel compiler and IMPI. KNC is 4.6x slower. Will get better with Xeon Phi tuning techniques.
CESM is running on the TACC Stampede KNC cluster. We are more familiar with the possibilities on GPUs and KNCs from using climate-code benchmark kernels. Kernels are useful for discovering acceleration strategies and for hardware investigations. Results are promising. We now have tracing and profiling tool knowledge to help identify acceleration possibilities within the large code base. We have strategies for symmetric operation as a very attractive mode of execution. Though CESM is not competitive on a KNC cluster today, the kernel experience shows what is possible. PART 1 CONCLUSION: WE ARE HOPEFUL TO SEE SPEEDUP ON ACCELERATED HARDWARE.
ASAP/TDD/CISL/NCAR. Youngsung Kim. PERFORMANCE TUNING TECHNIQUES FOR GPU AND MIC
Introduction; kernel-based approach; micro-architectures; MIC performance evolutions; CUDA-C performance evolutions; CPU performance evolutions along with MIC evolutions; GPU programming: OpenACC, CUDA Fortran, and F2C-ACC; one-source consideration; summary. CONTENTS
What is a kernel? A small, computation-intensive part of an existing large code that represents the characteristics of its computations. Benefits of the kernel-based approach: easy to manipulate and understand (CESM: >1.5M LOC); easy to convert to various programming technologies (CUDA-C, CUDA-Fortran, OpenACC, and F2C-ACC); easy to isolate issues for analysis; simplifies hardware counter analysis. MOTIVATION OF KERNEL-BASED APPROACH
Origin*: a kernel derived from the computational part of the gradient calculation in the Discontinuous Galerkin formulation of the shallow water equations from HOMME. Implementation from HOMME: similar to the "dg3d_gradient_mass" function in "dg3d_core_mod.F90". It calculates the gradient of the flux vectors and updates the flux vectors using the calculated gradient. DG KERNEL *: R. D. Nair, Stephen J. Thomas, and Richard D. Loft: A discontinuous Galerkin global shallow water model, Monthly Weather Review, Vol. 133.
Floating point operations can be calculated from the source code analytically. Ex.) When nit=1000, nelem=1024, nx=4 (npts=nx*nx): ≈ 2 GFLOP. No dependency between elements, so the # of elements sets the available parallelism. OpenMP: two OpenMP parallel regions for the DO loops over the element index (ie), as sketched below. DG KERNEL - source code
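The loop structure below is a minimal sketch of the kernel: the gradient expression and index arithmetic follow the code shown on the later MIC and CUDA-Fortran slides, while the program wrapper, declarations, and initial values are assumptions added for illustration.

program dg_kernel_sketch
   ! Sketch of the DG kernel: two OpenMP parallel regions over the element index ie.
   implicit none
   integer, parameter :: nx = 4, npts = nx*nx, nelem = 1024, nit = 1000
   real(8) :: flx(npts,nelem), fly(npts,nelem), grad(npts,nelem)
   real(8) :: der(nx,nx), gw(nx), delta(nx,nx), dt, s1, s2
   integer :: ie, ii, i, j, k, l, it

   ! Assumed initial values, for illustration only
   flx = 1.0_8; fly = 1.0_8; der = 0.5_8; gw = 0.25_8; dt = 1.0e-3_8
   delta = 0.0_8
   do i = 1, nx
      delta(i,i) = 1.0_8
   end do

   do it = 1, nit
      ! Parallel region 1: gradient of the flux vectors
      !$omp parallel do private(ie,ii,i,j,k,l,s1,s2)
      do ie = 1, nelem
         do l = 1, nx
            do k = 1, nx
               s2 = 0.0_8
               do j = 1, nx
                  s1 = 0.0_8
                  do i = 1, nx
                     s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                                delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
                  end do
                  s2 = s2 + s1*gw(j)
               end do
               grad(k+(l-1)*nx,ie) = s2
            end do
         end do
      end do
      !$omp end parallel do

      ! Parallel region 2: update the flux vectors with the gradient
      !$omp parallel do private(ie,ii)
      do ie = 1, nelem
         do ii = 1, npts
            flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
            fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
         end do
      end do
      !$omp end parallel do
   end do

   print *, 'grad(1,1) =', grad(1,1)
end program dg_kernel_sketch

Counting the arithmetic in this sketch gives roughly 2,000 floating point operations per element per time step (about 1,920 in the gradient loops plus 64 in the update loop), which for nit=1000 and nelem=1024 reproduces the ≈ 2 GFLOP estimate above.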
CPU: conventional multi-core: 1~16+ cores / ~256-bit vector registers. Many programming languages: Fortran, C/C++, etc. Intel SandyBridge E. Peak performance (2 sockets): DP GFLOPS (estimated by presenter).
MIC: based on Intel Pentium cores with extensions, including wider vector registers. Many cores and wider vectors: 60+ cores / 512-bit vector registers. Limited programming languages (extensions only from Intel): C/C++, Fortran. Intel KNC (a.k.a. MIC). Peak performance (7120): DP TFLOPS.
GPU: many lightweight threads: ~2680+ threads (threading & vectorization). Limited programming languages (extensions): CUDA-C, CUDA-Fortran, OpenCL, OpenACC, F2C-ACC, etc. Peak performances: Nvidia K20x: DP TFLOPS; Nvidia K20: DP TFLOPS; Nvidia M2070Q: GFLOPS.
MICRO-ARCHITECTURES
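As a note on how such peak numbers are usually estimated (a generic rule of thumb, not taken from the slide, and not an attempt to restore the missing values): peak DP FLOP/s ≈ cores × clock frequency × DP lanes per vector register × FP operations per lane per cycle. For example, SandyBridge's 256-bit AVX gives 4 DP lanes and can issue an add and a multiply each cycle (8 DP FLOPs/core/cycle), while KNC's 512-bit fused multiply-add gives 16 DP FLOPs/core/cycle.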
THE BEST PERFORMANCE RESULTS FROM CPU, GPU, AND MIC. [Chart: speedups relative to the best single-socket SandyBridge result: MIC 6.6x, GPU 5.4x.]
Compiler option: -mmic. Environment variables: OMP_NUM_THREADS=240, KMP_AFFINITY='granularity=fine,compact'. Native mode only: no cost of memory copy between CPU and MIC. Support from Intel (R. Sasanka). MIC EVOLUTION: 15.6x
Source code:
i = 1
s1 = (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
i = i + 1
s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
Compiler options: -mmic -openmp -O3
Performance considerations: complete unroll of the three nested loops; vectorized, but not efficiently enough.
MIC VER. 2
MIC VER. 3
MIC VER. 4
MIC VER. 5
CPU EVOLUTIONS WITH MIC EVOLUTIONS. Generally, performance tuning on one micro-architecture also helps to improve performance on another micro-architecture. However, this is not always true. GPU
Compiler options: -O3 -arch=sm_35, the same for all versions. "Offload mode" only; however, the time cost of data copy between CPU and GPU is not included, for comparison with MIC native mode. CUDA-C EVOLUTIONS: 14.2x
CUDA-C VER. 1
CUDA-C VER. 2
CUDA-C VER. 3
CUDA-C VER. 4
Source code:
ie = (blockidx%x - 1)*NDIV + (threadidx%x - 1)/(NX*NX) + 1
ii = MODULO(threadIdx%x - 1, NX*NX) + 1
IF (ie > SET_NELEM) RETURN
k = MODULO(ii-1,NX) + 1
l = (ii - 1)/NX + 1
s2 = 0.0_8
DO j = 1, NX
   s1 = 0.0_8
   DO i = 1, NX
      s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                 delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
   END DO ! i loop
   s2 = s2 + s1*gw(j)
END DO ! j
grad(ii,ie) = s2
flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
Performance considerations: maintains the source structure of the original Fortran. Requires an understanding of the CUDA threading model, especially for debugging and performance tuning. Supports implicit memory copy, which is convenient but can negatively impact performance if over-used.
CUDA-FORTRAN
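A sketch of how the kernel body above might be wrapped and launched as a CUDA Fortran global subroutine; only the body and index arithmetic come from the slide, while the module, subroutine name, array shapes, and the value NDIV (elements per thread block) are assumptions. Each block then runs NDIV*NX*NX threads and the grid needs SET_NELEM/NDIV blocks.

module dg_kernel_gpu_mod
   use cudafor
   implicit none
   ! NX and SET_NELEM follow the earlier slides; NDIV is an assumed blocking factor.
   integer, parameter :: NX = 4, NDIV = 16, SET_NELEM = 1024
contains
   attributes(global) subroutine dg_grad_update(flx, fly, grad, der, gw, delta, dt)
      ! Array dummy arguments of a global subroutine reside in device memory.
      real(8) :: flx(NX*NX,SET_NELEM), fly(NX*NX,SET_NELEM), grad(NX*NX,SET_NELEM)
      real(8) :: der(NX,NX), gw(NX), delta(NX,NX)
      real(8), value :: dt
      integer :: ie, ii, i, j, k, l
      real(8) :: s1, s2
      ! Kernel body from the slide: one thread per (element, output point) pair.
      ie = (blockidx%x - 1)*NDIV + (threadidx%x - 1)/(NX*NX) + 1
      ii = modulo(threadidx%x - 1, NX*NX) + 1
      if (ie > SET_NELEM) return
      k = modulo(ii-1, NX) + 1
      l = (ii-1)/NX + 1
      s2 = 0.0_8
      do j = 1, NX
         s1 = 0.0_8
         do i = 1, NX
            s1 = s1 + (delta(l,j)*flx(i+(j-1)*NX,ie)*der(i,k) + &
                       delta(i,k)*fly(i+(j-1)*NX,ie)*der(j,l))*gw(i)
         end do
         s2 = s2 + s1*gw(j)
      end do
      grad(ii,ie) = s2
      flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
      fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
   end subroutine dg_grad_update
end module dg_kernel_gpu_mod

! Host-side launch, assuming flx_d, fly_d, grad_d, der_d, gw_d, delta_d were
! declared with the DEVICE attribute and copied over from the host arrays:
!    call dg_grad_update<<<SET_NELEM/NDIV, NDIV*NX*NX>>>(flx_d, fly_d, grad_d, der_d, gw_d, delta_d, dt)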
One source is highly desirable. It is hard to manage versions for multiple micro-architectures and multiple programming technologies, and a performance enhancement can be applied to multiple versions simultaneously. Conditional compilation: macros insert and delete code for a particular technology; the user controls compilation by using compiler macros. It is hard to get one source for CUDA-C: many scientific codes are written in Fortran, and CUDA-C has a quite different code structure and must be written in C. Performance impact: the highest-performance tuning techniques hardly allow one source. ONE SOURCE
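A minimal sketch of the conditional-compilation idea (the macro USE_CUDA_FORTRAN and the routine names are hypothetical, not the actual CESM/HOMME build macros): one Fortran source fragment, assumed to sit inside the time-stepping loop, selects either a CUDA Fortran kernel launch or an OpenMP loop at compile time.

#ifdef USE_CUDA_FORTRAN
   ! GPU path: launch the CUDA Fortran kernel (hypothetical name and launch config)
   call dg_grad_update<<<SET_NELEM/NDIV, NDIV*NX*NX>>>(flx_d, fly_d, grad_d, &
                                                       der_d, gw_d, delta_d, dt)
#else
   ! CPU/MIC path: OpenMP-threaded loop over elements; the same source compiles
   ! natively for the host or, with -mmic, for KNC native mode
   !$omp parallel do private(ie)
   do ie = 1, nelem
      call dg_grad_element(flx(:,ie), fly(:,ie), grad(:,ie), der, gw, delta, dt)
   end do
   !$omp end parallel do
#endif

The path is selected at build time, e.g. by passing a preprocessor definition such as -DUSE_CUDA_FORTRAN to the compiler.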
Faster hardware provides us with the potential for performance. However, we can exploit that potential only through better software. Better software on accelerators generally means software that uses the many cores and wide vectors simultaneously and efficiently. In practice, this massive parallelism can be exploited effectively by, among other things, 1) re-using data that have been loaded into faster memory and 2) accessing successive array elements in an aligned, unit-stride manner. CONCLUSIONS
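As a small illustration of point 2 (hypothetical arrays a, b, c of size n-by-n, not taken from DG_KERNEL): Fortran stores arrays column-major, so looping over the first index innermost touches successive memory locations, while looping over the second index innermost strides through memory.

! Unit-stride (vectorizes and uses cache lines well): inner loop over the first index
do j = 1, n
   do i = 1, n
      c(i,j) = a(i,j) + b(i,j)
   end do
end do

! Strided (vectorizes poorly): inner loop over the second index
do i = 1, n
   do j = 1, n
      c(i,j) = a(i,j) + b(i,j)
   end do
end do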
Using those techniques, we have achieved a considerable amount of speedup for DG KERNEL. Speedups compared to the best one-socket SandyBridge performance: MIC 6.6x, GPU 5.4x. Speedups from the initial version to the best-performing version: MIC 15.6x, GPU 14.2x. Our next challenge is to apply the techniques we have learned from the kernel experiments to the real software package. CONCLUSIONS - CONTINUED
Contacts: ASAP: ; CESM: ; HOMME: ; Extrae: http://www.bsc.es/es/computer-sciences/performance-tools/trace-generation ; TAU: . THANK YOU FOR YOUR ATTENTION.