
1 A COMPARISON OF CLIMATE APPLICATIONS ON ACCELERATED AND CONVENTIONAL ARCHITECTURES
Srinath Vadlamani, Youngsung Kim, and John Dennis, ASAP-TDD-CISL, NCAR

2 PRESENTATION HAS TWO PARTS
- The overall picture of the acceleration effort and the different techniques should be understood. [Srinath V.]
  - We use small investigative kernels to help teach us.
  - We use instrumentation tools to help us work with the larger code set.
- The investigative DG_KERNEL shows what is possible if everything were simple. [Youngsung K.]
  - DG_KERNEL helped us understand the hardware.
  - DG_KERNEL helped us understand the coding practices and software instructions needed to achieve superb performance.

3 APPLICATION AND SCALABILITY PERFORMANCE (ASAP) TEAM RESEARCHING MODERN MICRO-ARCHITECTURE FOR CLIMATE CODES
- ASAP personnel: Srinath Vadlamani, John Dennis, Youngsung Kim, Michael Arndt, Ben Jamroz, and Rich Loft
- Active collaborators:
  - Intel: Michael Greenfield, Ruchira Sasanka, Sergey Egorov, Karthik Raman, and Mark Lubin
  - NREL: Ilene Carpenter

4 CLIMATE CODES ALWAYS NEED A FASTER SYSTEM.
- Climate simulations cover 100s to 1000s of years of activity.
- Current high-resolution climate simulations run at 2-3 simulated years per day (SYPD) [~40k PEs]; at that rate a 1000-year simulation at 2 SYPD takes roughly 500 wall-clock days.
- GPUs and coprocessors can help to increase SYPD.
- Having many collaborators mandates the use of many architectures.
- We must use these architectures efficiently for a successful SYPD speedup, which requires knowing the hardware!

5 WE HAVE STARTED THE ACCELERATION EFFORT ON SPECIFIC PLATFORMS.
- Conventional CPU based:
  - NCAR Yellowstone (Xeon: SNB) - CESM, HOMME
  - ORNL Titan (AMD Interlagos) - benchmark kernel
- Xeon Phi based:
  - TACC Stampede - CESM
  - NCAR test system (SE10x changing to 7120) - HOMME
- GPU based:
  - NCAR Caldera cluster (M2070Q) - HOMME
  - ORNL Titan (K20x) - HOMME
  - TACC Stampede (K20) - benchmark kernels only

6 WE CAN LEARN HOW TO USE ACCELERATED HARDWARE FOR CLIMATE CODES BY CREATING REPRESENTATIVE EXAMPLES.
- CESM is a large application, so we need to create benchmark kernels to understand the hardware.
- Smaller examples are easier to understand and manipulate.
- The first two kernels we have created are:
  - DG_KERNEL from HOMME [detailed by Youngsung]
  - a standalone driver for WETDEPA_V2

7 KNOWING WHAT CAN BE ACCELERATED IS HALF THE BATTLE.
- We created DG_KERNEL knowing it could be a well-vectorized code (with help).
- What if we want to start cherry-picking subroutines and loops to try the learned techniques?
- Instrumentation tools are available, with teams that are willing to support your efforts.
  - Trace-based tools offer great detail.
  - Profile tools present summaries up front.
- A previous NCAR-SEA conference highlighted such tools.

8 EXTRAE TRACING CAN PICK OUT PROBLEMATIC REGIONS OF A LARGE CODE.
- Extrae is a tracing tool developed at the Barcelona Supercomputing Center (H. Servat, J. Labarta, J. Gimenez).
- The automatic performance identification process is a BSC research project.
- Extrae produces a time series of communication and hardware-counter events.
- Paraver is the visualizer, which also performs statistical analysis.
- There are clustering techniques that use a folding concept plus the research identification process to create "synthetic" traces with fewer samples.

9 CLUSTERING GROUPS WITH SIMILARLY POOR COMPUTATIONAL CHARACTERISTICS IS A GOOD GUIDE.
[Result of an Extrae trace of CESM on Yellowstone; the grouping is similar to exclusive execution time.]

10 EXTRAE TRACING EXPOSED POSSIBLE WASTE OF CYCLES.
[Trace plot. Red: instruction count. Blue: d(INS)/dt]

11 PARAVER IDENTIFIED CODE REGIONS.
The trace identifies which code is active when. We now examine code regions for characteristics amenable to acceleration.

12 Automatic Performance Identification highlighted these groups' subroutines.

   Group   Subroutine        Overall execution time %
   A       conden            2.7
   A       compute_usschu    3.3
   A       rtrnmc            1.75
   B       micro_gm_tend     1.36
   B       wetdepa_v2        2.5
   C       reftra_sw         1.71
   C       spcvmc_sw         1.21
   C       vrtqdr_sw         1.43

These are a small number of lines of code ready to be vectorized.

13 WETDEPA_V2 CAN BE VECTORIZED WITH RECODING.
- The subroutine has sections of doubly nested loops.
- These loops are very long, with branches.
- Compilers have trouble vectorizing loops that contain branches.
- The restructuring started with breaking up the loops (see the sketch below).
- We collected scalars into arrays for vector operations.
- We broke very long expressions into smaller pieces.
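To make the restructuring concrete, here is a minimal, hypothetical sketch in the same spirit; the subroutine and variable names (scavenge_sketch, cond, rate, work) are invented for illustration and are not the actual WETDEPA_V2 code.

    ! Hypothetical illustration of the restructuring style described above;
    ! NOT the actual WETDEPA_V2 source.
    subroutine scavenge_sketch(ncol, cond, rate, q, scav)
      implicit none
      integer, intent(in)    :: ncol
      logical, intent(in)    :: cond(ncol)
      real(8), intent(in)    :: rate(ncol)
      real(8), intent(inout) :: q(ncol)
      real(8), intent(out)   :: scav(ncol)
      real(8) :: work(ncol)     ! scalar temporaries promoted to an array
      integer :: i

      ! Step 1: hoist the branch out of the arithmetic by turning it into a
      ! merge(), so the loop body is branch-free and vectorizable.
      do i = 1, ncol
        work(i) = merge(rate(i), 0.0_8, cond(i))
      end do

      ! Step 2: short, simple loops with small expressions vectorize well.
      do i = 1, ncol
        scav(i) = work(i) * q(i)
      end do
      do i = 1, ncol
        q(i) = q(i) - scav(i)
      end do
    end subroutine scavenge_sketch

Replacing the branch with merge() and keeping each loop short and simple is what lets the compiler generate clean vector code.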

14 MODIFICATION OF THE CODE DOES COMPARE WELL WITH COMPILER OPTIMIZATION.
[Timing chart: original vs. optimized code]
- Vectorization was checked with -vec-report=3,6.
- The modification was for a small number of lines.
- -O3 fast for the original gave incorrect results.

15 MODIFIED WETDEPA_V2 PLACED BACK INTO CESM ON SNB SHOWS BETTER USE OF RESOURCES.
The subroutine dropped from 2.5% to 0.7% of the overall execution time of CESM on Yellowstone.

16 Profilers are also useful tools for understanding code efficiency in the BIG code.
- A CAM-SE configuration was profiled on Stampede at TACC using TAU.
- TAU provides different levels of introspection into subroutine and loop efficiency.
- This process taught us more about hardware-counter metrics.
- The initial investigation fits into a core-count to core-count comparison.

17 LONG EXCLUSIVE TIME ON BOTH DEVICES IS A GOOD PLACE TO START LOOKING.
- Hot spots can be associated with the largest exclusive execution time.
- A long exclusive time may indicate a branchy section of code.

18 POSSIBLE SPEEDUP CAN BE ACHIEVED WITH A GAIN IN VECTORIZATION INTENSITY (VI)
- Low VI makes a routine a candidate for acceleration techniques.
- High VI could be misleading.
- Note: the VI metric is defined differently on Sandy Bridge and Xeon Phi (see, e.g., the PAPI wiki page PAPITopics:SandyFlops); a reference definition follows below.
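For reference (this is a paraphrase of Intel's Xeon Phi tuning guidance and the PAPI notes, not content from the slide): VI is essentially the average number of vector elements processed per vector instruction. On KNC it can be read directly from VPU hardware counters,

    VI (KNC) = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED

while on Sandy Bridge it has to be approximated from packed vs. scalar floating-point instruction counts, whose pitfalls the SandyFlops page describes.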

19 CESM ON KNC NOT COMPETITIVE TODAY.
[Table of average time step [s] per build; values not recoverable from the transcript. Builds: Sandybridge -O…; KNC -O…; KNC -O2 with "derivative_mod.F90" at -O…; KNC -O2 with "derivative_mod.F90" at -O1 -align array64bytes]
Configuration: FC5 ne16g37, 16 MPI ranks/node, 1 rank/core, 8 nodes, single thread.

20 HYBRID PARALLELISM IS PROMISING FOR CESM ON KNC
[Table: Device | MPI ranks/device | Threads/MPI rank | Avg. 1-day dt [s], with rows for Dual SNB and Xeon Phi SE configurations; values not recoverable from the transcript]
- FCIDEAL ne16ne16; Stampede: 8 nodes
- Fortran 2003 use of allocatable derived-type components to overcome a threading issue [all -O2]
- Intel compiler and IMPI
- KNC is 4.6x slower; this will get better with Xeon Phi tuning techniques.

21 PART 1 CONCLUSION: WE ARE HOPEFUL TO SEE SPEEDUP ON ACCELERATED HARDWARE.
- CESM is running on the TACC Stampede KNC cluster.
- We are more familiar with the possibilities on GPUs and KNCs from using climate-code benchmark kernels.
- Kernels are useful for discovering acceleration strategies and for hardware investigations; results are promising.
- We now have tracing- and profiling-tool knowledge to help identify acceleration possibilities within the large code base.
- We have strategies for symmetric operation, which is a very attractive mode of execution.
- Though CESM is not competitive on a KNC cluster today, the kernel experience shows what is possible.

22 PERFORMANCE TUNING TECHNIQUES FOR GPU AND MIC
Youngsung Kim, ASAP/TDD/CISL/NCAR

23 CONTENTS
- Introduction
  - Kernel-based approach
  - Micro-architectures
- MIC performance evolutions
- CUDA-C performance evolutions
- CPU performance evolutions along with the MIC evolutions
- GPU programming: OpenACC, CUDA Fortran, and F2C-ACC
- One-source consideration
- Summary

24 MOTIVATION FOR THE KERNEL-BASED APPROACH
- What is a kernel?
  - A small, computation-intensive part of an existing large code
  - Represents the characteristics of the computation
- Benefits of the kernel-based approach:
  - Easy to manipulate and understand (CESM: >1.5M LOC)
  - Easy to convert to various programming technologies (CUDA-C, CUDA-Fortran, OpenACC, and F2C-ACC)
  - Easy to isolate issues for analysis; simplifies hardware-counter analysis

25 DG KERNEL
- Origin*: a kernel derived from the computational part of the gradient calculation in the Discontinuous Galerkin formulation of the shallow-water equations from HOMME.
- Implementation from HOMME:
  - Similar to the "dg3d_gradient_mass" function in "dg3d_core_mod.F90"
  - Calculates the gradient of the flux vectors and updates the flux vectors using the calculated gradient
*: R. D. Nair, Stephen J. Thomas, and Richard D. Loft: A discontinuous Galerkin global shallow water model, Monthly Weather Review, Vol. 133.

26 DG KERNEL – source code
- Floating-point operations:
  - No dependency between elements.
  - The FLOP count (proportional to the number of elements) can be calculated analytically from the source code.
  - Ex.: when nit=1000, nelem=1024, nx=4 (npts = nx*nx), the kernel performs ≈ 2 GFLOP.
- OpenMP: two OpenMP parallel regions for the DO loops over the element index (ie); a baseline sketch follows.
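As a point of reference, here is a minimal baseline sketch of such a kernel with the two OpenMP parallel regions. The loop body follows the CUDA Fortran listing later in the deck (slide 41), but the subroutine itself (dg_grad_baseline) is an illustrative reconstruction, not the actual benchmark source, and the FLOP count in the comments is a rough estimate consistent with the ≈ 2 GFLOP figure above.

    ! Hypothetical baseline sketch of DG_KERNEL (not the actual benchmark source).
    ! Rough FLOP count: each (k,l) point does nx*(7*nx + 2) + 4 ≈ 124 flops, so one
    ! pass over nelem=1024 elements with npts=16 points is ~2.0e6 flops, and
    ! nit=1000 iterations give ~2 GFLOP.
    subroutine dg_grad_baseline(nit, nelem, flx, fly, grad, delta, der, gw, dt)
      implicit none
      integer, parameter :: nx = 4, npts = nx*nx
      integer, intent(in)    :: nit, nelem
      real(8), intent(inout) :: flx(npts,nelem), fly(npts,nelem)
      real(8), intent(out)   :: grad(npts,nelem)
      real(8), intent(in)    :: delta(nx,nx), der(nx,nx), gw(nx), dt
      integer :: it, ie, ii, i, j, k, l
      real(8) :: s1, s2

      do it = 1, nit
        ! First parallel region: gradient of the flux vectors.
        !$omp parallel do private(ii, i, j, k, l, s1, s2)
        do ie = 1, nelem
          do ii = 1, npts
            k = modulo(ii-1, nx) + 1
            l = (ii-1)/nx + 1
            s2 = 0.0_8
            do j = 1, nx
              s1 = 0.0_8
              do i = 1, nx
                s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                           delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
              end do
              s2 = s2 + s1*gw(j)
            end do
            grad(ii,ie) = s2
          end do
        end do
        ! Second parallel region: update the flux vectors with the gradient.
        !$omp parallel do private(ii)
        do ie = 1, nelem
          do ii = 1, npts
            flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
            fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)
          end do
        end do
      end do
    end subroutine dg_grad_baseline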

27 MICRO-ARCHITECTURES
- CPU
  - Conventional multi-core: 1 to 16+ cores, ~256-bit vector registers
  - Many programming languages: Fortran, C/C++, etc.
  - Intel SandyBridge E: peak performance (2 sockets) of … DP GFLOPS (estimated by the presenter)
- MIC
  - Based on Intel Pentium cores with extensions, including wider vector registers
  - Many cores and wider vectors: 60+ cores, 512-bit vector registers
  - Limited programming languages (extensions only from Intel): C/C++, Fortran
  - Intel KNC (a.k.a. MIC) 7120: peak performance of … DP TFLOPS
- GPU
  - Many lightweight threads: ~2680+ threads (threading & vectorization)
  - Limited programming languages (extensions): CUDA-C, CUDA-Fortran, OpenCL, OpenACC, F2C-ACC, etc.
  - Peak performances: Nvidia K20x … DP TFLOPS; Nvidia K20 … DP TFLOPS; Nvidia M2070Q … GFLOPS

28 THE BEST PERFORMANCE RESULTS FROM CPU, GPU, AND MIC
[Bar chart: the best MIC result is 6.6x and the best GPU result is 5.4x the best CPU result]

29 MIC EVOLUTION
[Chart: 15.6x speedup from the initial MIC version to the best one]
- Compiler options: -mmic
- Environment variables: OMP_NUM_THREADS=240, KMP_AFFINITY='granularity=fine,compact'
- Native mode only: no cost of memory copies between CPU and MIC
- Support from Intel: R. Sasanka

30 MIC VER. 1
- Source modification: NONE
- Compiler options: -mmic -openmp -O3

31 MIC VER. 2
- Source code:

    i = 1
    s1 = (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
    i = i + 1
    s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
    i = i + 1
    s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
    i = i + 1
    s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)

- Compiler options: -mmic -openmp -O3
- Performance considerations:
  - Complete unroll of the three nested loops
  - Vectorized, but not efficiently enough

32 MIC VER. 3

33 MIC VER. 4

34 MIC VER. 5

35 CPU EVOLUTIONS WITH THE MIC EVOLUTIONS
Generally, performance tuning on one micro-architecture also helps to improve performance on another micro-architecture. However, this is not always true.
[Chart comparing CPU, MIC, and GPU versions]

36 CUDA-C EVOLUTIONS
[Chart: 14.2x speedup from the initial CUDA-C version to the best one]
- Compiler options: -O3 -arch=sm_35 (the same for all versions)
- "Offload mode" only; however, the time cost of data copies between CPU and GPU is not included, for comparison with MIC native mode.

37 CUDA-C VER. 1

38 CUDA-C VER. 2

39 CUDA-C VER. 3

40 CUDA-C VER. 4

41 CUDA-FORTRAN
- Source code:

    ie = (blockidx%x - 1)*NDIV + (threadidx%x - 1)/(NX*NX) + 1
    ii = MODULO(threadIdx%x - 1, NX*NX) + 1
    IF (ie > SET_NELEM) RETURN
    k = MODULO(ii-1, NX) + 1
    l = (ii - 1)/NX + 1
    s2 = 0.0_8
    DO j = 1, NX
      s1 = 0.0_8
      DO i = 1, NX
        s1 = s1 + (delta(l,j)*flx(i+(j-1)*nx,ie)*der(i,k) + &
                   delta(i,k)*fly(i+(j-1)*nx,ie)*der(j,l))*gw(i)
      END DO ! i loop
      s2 = s2 + s1*gw(j)
    END DO ! j loop
    grad(ii,ie) = s2
    flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
    fly(ii,ie) = fly(ii,ie) + dt*grad(ii,ie)

- Performance considerations:
  - Maintains the source structure of the original Fortran
  - Requires an understanding of the CUDA threading model, especially for debugging and performance tuning
  - Supports implicit memory copy, which is convenient but can negatively impact performance if overused

42 OpenACC

43 F2C-ACC

44 ONE SOURCE
- One source is highly desirable:
  - It is hard to manage versions for multiple micro-architectures and multiple programming technologies.
  - A performance enhancement can be applied to multiple versions simultaneously.
- Conditional compilation (see the sketch below):
  - Macros insert and delete code for a particular technology.
  - The user controls compilation by using compiler macros.
- It is hard to get one source for CUDA-C:
  - Many scientific codes are written in Fortran.
  - CUDA-C has a quite different code structure and must be written in C.
- Performance impact: the highest-performing tuning techniques rarely allow one source.
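A minimal sketch of the conditional-compilation idea, assuming a preprocessed Fortran source (e.g., a .F90 file) and a user-supplied macro; USE_OPENACC and update_flux are invented names for illustration, not code from the presentation.

    ! Hypothetical one-source conditional compilation for the flux-update loop.
    ! USE_OPENACC is a user-defined macro (e.g., -DUSE_OPENACC at compile time).
    subroutine update_flux(nelem, npts, flx, grad, dt)
      implicit none
      integer, intent(in)    :: nelem, npts
      real(8), intent(inout) :: flx(npts,nelem)
      real(8), intent(in)    :: grad(npts,nelem), dt
      integer :: ie, ii

    #ifdef USE_OPENACC
      !$acc parallel loop collapse(2) copy(flx) copyin(grad)
    #else
      !$omp parallel do private(ii)
    #endif
      do ie = 1, nelem
        do ii = 1, npts
          flx(ii,ie) = flx(ii,ie) + dt*grad(ii,ie)
        end do
      end do
    end subroutine update_flux

Because directives for a technology that is not enabled are treated as comments, one Fortran file can serve the CPU, MIC, and GPU (OpenACC) builds; as noted above, CUDA-C cannot be folded in this way.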

45 CONCLUSIONS
- Faster hardware gives us the potential for performance; however, we can exploit that potential only through better software.
- Better software on accelerators generally means software that uses the many cores and wide vectors simultaneously and efficiently.
- In practice, this massive parallelism can be exploited effectively by, among other things, 1) re-using data that have been loaded into faster memory and 2) accessing successive array elements in an aligned, unit-stride manner (illustrated below).
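A small illustration of point 2 (an invented example, not from the slides): Fortran stores arrays column-major, so keeping the innermost loop on the first index yields contiguous, unit-stride accesses.

    ! Illustrative only: unit-stride vs. strided access in Fortran.
    subroutine scale_columns(n, m, a, s)
      implicit none
      integer, intent(in)    :: n, m
      real(8), intent(inout) :: a(n,m)
      real(8), intent(in)    :: s
      integer :: i, j

      ! Good: inner loop over the first index -> contiguous, unit-stride loads/stores.
      do j = 1, m
        do i = 1, n
          a(i,j) = s*a(i,j)
        end do
      end do

      ! Slower pattern to avoid: inner loop over the second index -> stride-n accesses.
      ! do i = 1, n
      !   do j = 1, m
      !     a(i,j) = s*a(i,j)
      !   end do
      ! end do
    end subroutine scale_columns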

46 CONCLUSIONS - CONTINUED
- Using those techniques, we have achieved considerable speed-ups for DG KERNEL.
  - Speed-ups compared to the best one-socket SandyBridge performance: MIC 6.6x, GPU 5.4x
  - Speed-ups from the initial version to the best-performing version: MIC 15.6x, GPU 14.2x
- Our next challenge is to apply the techniques that we have learned from the kernel experiments to a real software package.

47 THANK YOU FOR YOUR ATTENTION.
- Contacts:
  - ASAP:
  - CESM:
  - HOMME:
  - Extrae: http://www.bsc.es/es/computer-sciences/performance-tools/trace-generation
  - TAU:

