1
Stencil Framework for Portable High Performance Computing – Naoya Maruyama, RIKEN Advanced Institute for Computational Science – April, NCSA, IL, USA

2
Talk Outline – Physis stencil framework – MapReduce for K – Mini-app development

3
Multi-GPU Application Development Non-unified programming models – MPI for inter-node parallelism – CUDA/OpenCL/OpenACC for accelerators Optimization – Blocking – Overlapped computation and communication Good performance with low programmer productivity
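For contrast, a minimal hand-written sketch of the pattern described above: MPI handles the inter-node halo exchange while a CUDA kernel performs the on-device update. Every name, size, and decomposition choice below is illustrative and not taken from Physis.

  /* Hand-written MPI+CUDA halo exchange for a 3D diffusion stencil,
   * decomposed along Z. All names and sizes are illustrative. */
  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <stdlib.h>

  #define NX 64
  #define NY 64
  #define NZL 64                      /* interior Z planes owned by this rank */
  #define PLANE (NX * NY)             /* elements per XY plane */

  __global__ void diffuse(const float *in, float *out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= NX - 1 || y >= NY - 1) return;
    for (int z = 1; z <= NZL; ++z) {  /* z=0 and z=NZL+1 hold halo planes */
      int i = (z * NY + y) * NX + x;
      out[i] = (in[i] + in[i - 1] + in[i + 1] + in[i - NX] + in[i + NX] +
                in[i - PLANE] + in[i + PLANE]) / 7.0f;
    }
  }

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int up = (rank + 1) % np, down = (rank + np - 1) % np;

    size_t bytes = (size_t)PLANE * (NZL + 2) * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemset(d_in, 0, bytes);
    cudaMemset(d_out, 0, bytes);
    float *h_send = (float *)malloc(PLANE * sizeof(float));
    float *h_recv = (float *)malloc(PLANE * sizeof(float));

    dim3 block(16, 16), grid(NX / 16, NY / 16);
    for (int step = 0; step < 100; ++step) {
      /* send top interior plane up; receive bottom halo from down */
      cudaMemcpy(h_send, d_in + (size_t)NZL * PLANE, PLANE * sizeof(float),
                 cudaMemcpyDeviceToHost);
      MPI_Sendrecv(h_send, PLANE, MPI_FLOAT, up, 0,
                   h_recv, PLANE, MPI_FLOAT, down, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      cudaMemcpy(d_in, h_recv, PLANE * sizeof(float), cudaMemcpyHostToDevice);

      /* send bottom interior plane down; receive top halo from up */
      cudaMemcpy(h_send, d_in + PLANE, PLANE * sizeof(float),
                 cudaMemcpyDeviceToHost);
      MPI_Sendrecv(h_send, PLANE, MPI_FLOAT, down, 1,
                   h_recv, PLANE, MPI_FLOAT, up, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      cudaMemcpy(d_in + (size_t)(NZL + 1) * PLANE, h_recv, PLANE * sizeof(float),
                 cudaMemcpyHostToDevice);

      diffuse<<<grid, block>>>(d_in, d_out);
      cudaDeviceSynchronize();
      float *tmp = d_in; d_in = d_out; d_out = tmp;
    }
    MPI_Finalize();
    return 0;
  }

Even in this stripped-down sketch, most of the code is data movement and bookkeeping rather than the stencil itself, which is the productivity cost the following slides address.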

4
Goal: High performance, highly productive programming for heterogeneous clusters. Approach: High-level abstractions for structured parallel programming – Simplifying programming models – Portability across platforms – Without sacrificing too much performance

5
Stencil Computation Pc' = (Pc + Pn + Ps + Pw + Pe) / 5.0
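To make the update above concrete, here is a minimal plain-C sketch of the 5-point averaging stencil; array layout, names, and sizes are illustrative.

  /* Plain-C sketch of the 5-point update above: each interior point becomes
   * the average of itself and its N/S/W/E neighbors. */
  #include <stddef.h>

  void stencil_5pt(size_t nx, size_t ny, const float *p, float *p_new) {
    for (size_t j = 1; j < ny - 1; ++j) {
      for (size_t i = 1; i < nx - 1; ++i) {
        size_t c = j * nx + i;           /* center index */
        p_new[c] = (p[c]                 /* Pc */
                    + p[c - nx]          /* Pn */
                    + p[c + nx]          /* Ps */
                    + p[c - 1]           /* Pw */
                    + p[c + 1])          /* Pe */
                   * (1.0f / 5.0f);
      }
    }
  }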

6
Physis (Φύσις) Framework [SC’11]
Stencil DSL – Declarative – Portable – Global-view – C-based
  void diffusion(int x, int y, int z,
                 PSGrid3DFloat g1, PSGrid3DFloat g2) {
    float v = PSGridGet(g1,x,y,z)
            + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
            + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
            + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0);
  }
DSL Compiler – Target-specific code generation and optimizations – Automatic parallelization
Targets: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL

7
DSL Overview C + custom data types and intrinsics Grid data types – PSGrid3DFloat, PSGrid3DDouble, etc. Dense Cartesian domain types – PSDomain1D, PSDomain2D, and PSDomain3D Intrinsics – Runtime management – Grid object management (PSGridFloat3DNew, etc) – Grid accesses (PSGridCopyin, PSGridGet, etc) – Applying stencils to grids (PSGridMap, PSGridRun) – Grid reductions (PSGridReduce)

8
Writing Stencils
Stencil Kernel – C functions describing a single flow of scalar execution on one grid element – Executed over specified rectangular domains
  void diffusion(const int x, const int y, const int z,
                 PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
    float v = PSGridGet(g1,x,y,z)
            + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
            + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
            + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0*t);   /* issues a write to grid g2 */
  }
Offsets must be constant. Periodic access is possible with PSGridGetPeriodic.
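For periodic boundaries, a sketch of the same kernel with wrapping accesses in the x dimension, assuming PSGridGetPeriodic takes the same arguments as PSGridGet (its signature is not shown on these slides):

  /* Sketch only: PSGridGetPeriodic is assumed to mirror PSGridGet's argument
   * list; out-of-range x offsets wrap around instead of reading out of bounds. */
  void diffusion_periodic_x(const int x, const int y, const int z,
                            PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
    float v = PSGridGet(g1,x,y,z)
            + PSGridGetPeriodic(g1,x-1,y,z) + PSGridGetPeriodic(g1,x+1,y,z)
            + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
            + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0*t);
  }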

9
Applying Stencils to Grids
Map: Creates a stencil closure that encapsulates the stencil and its grids
Run: Iteratively executes stencil closures
  PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
  PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
  PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
  PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
               PSStencilMap(diffusion, d, g2, g1, 0.5),
               10);
Grouping by PSStencilRun: target for kernel fusion optimization
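Putting the host-side pieces together, a hedged sketch of a complete driver in the DSL style of the preceding slides; PSGridCopyout and PSGridFree are assumed names for the copy-out and cleanup intrinsics, which the slides do not spell out (only PSGridCopyin is listed).

  /* Hedged sketch of a Physis host program.  PSGridCopyout and PSGridFree are
   * assumed intrinsic names; the rest appear on the preceding slides. */
  void run_diffusion(float *host_in, float *host_out, int nx, int ny, int nz) {
    PSGrid3DFloat g1 = PSGrid3DFloatNew(nx, ny, nz);
    PSGrid3DFloat g2 = PSGrid3DFloatNew(nx, ny, nz);
    PSDomain3D d = PSDomain3DNew(0, nx, 0, ny, 0, nz);

    PSGridCopyin(g1, host_in);                /* host -> grid */

    /* Two maps per iteration ping-pong g1 and g2; 10 iterations total. */
    PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
                 PSStencilMap(diffusion, d, g2, g1, 0.5),
                 10);

    PSGridCopyout(g1, host_out);              /* grid -> host (assumed name) */
    PSGridFree(g1);                           /* assumed name */
    PSGridFree(g2);                           /* assumed name */
  }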

10
Implementation DSL translator – Translates intrinsic calls to RT API calls – Generates GPU kernels with boundary exchanges based on static analysis – Uses the ROSE compiler framework (LLNL) Runtime – Provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory (Toolchain figure: Physis code → implementation source code → executable code)

11
CUDA Thread Blocking Each thread sweeps points in the Z dimension; the X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)
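A generic CUDA sketch of this thread mapping (illustrative only, not the code Physis generates): a 64x4 thread block covers an XY tile and each thread marches along Z.

  /* Illustrative CUDA mapping: one thread per (x, y) column, sweeping z.
   * Block shape 64x4 matches the default above; layout and kernel body are
   * examples, not Physis-generated code. */
  __global__ void sweep_z(const float *in, float *out, int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    for (int z = 0; z < nz; ++z) {           /* each thread sweeps the Z dimension */
      size_t i = ((size_t)z * ny + y) * nx + x;
      out[i] = in[i];                        /* stencil body would go here */
    }
  }

  /* Launch with an AxB = 64x4 block and enough blocks to cover the XY plane. */
  void launch(const float *in, float *out, int nx, int ny, int nz) {
    dim3 block(64, 4, 1);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y, 1);
    sweep_z<<<grid, block>>>(in, out, nx, ny, nz);
  }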

12
Example: 7-point Stencil GPU Code
  __device__ void kernel(const int x, const int y, const int z,
                         __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
    float v = ((((((*__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
              + *__PSGridGetAddrFloat3D_0_fw(g,(x + 1),y,z))
              + *__PSGridGetAddrFloat3D_0_bw(g,(x - 1),y,z))
              + *__PSGridGetAddrFloat3D_1_fw(g,x,(y + 1),z))
              + *__PSGridGetAddrFloat3D_1_bw(g,x,(y - 1),z))
              + *__PSGridGetAddrFloat3D_2_bw(g,x,y,(z - 1)))
              + *__PSGridGetAddrFloat3D_2_fw(g,x,y,(z + 1)));
    *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
  }

  __global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                        __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x + offset0,
        y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
    if (x >= dom.local_max[0] || (y >= dom.local_max[1]))
      return;
    int z;
    for (z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
      kernel(x, y, z, &g, &g2);
    }
  }

13
Example: 7-point Stencil CPU Code
  static void __PSStencilRun_0(int iter, void **stencils) {
    struct dim3 block_dim(64, 4, 1);
    struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
    cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
    struct dim3 s0_grid_dim((int)(ceil(__PSGetLocalSize(0) / ((double)64))),
                            (int)(ceil(__PSGetLocalSize(1) / ((double)4))), 1);
    __PSDomainSetLocalSize(&s0->dom);
    s0->g = __PSGetGridByID(s0->__g_index);
    s0->g2 = __PSGetGridByID(s0->__g2_index);
    int i;
    for (i = 0; i < iter; ++i) {
      {
        int fw_width[3] = {1L, 1L, 1L};
        int bw_width[3] = {1L, 1L, 1L};
        __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
      }
      __PSStencilRun_kernel<<<s0_grid_dim, block_dim>>>(__PSGetLocalOffset(0),
          __PSGetLocalOffset(1), s0->dom,
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    }
    cudaThreadSynchronize();
  }

14
Optimization: Overlapped Computation and Communication
1. Copy boundaries from GPU to CPU (for non-unit-stride cases)
2. Compute interior points
3. Exchange boundaries with neighbors
4. Compute boundaries
(Timeline figure: boundary work proceeds concurrently with the interior computation)
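A hedged sketch of this four-step schedule in plain CUDA + MPI; kernel bodies, buffer layout, and stream setup are placeholders chosen only to show how the interior kernel overlaps with the boundary exchange.

  /* Illustrative overlap of interior compute with boundary exchange. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void update_interior(float *g) { /* inner points */ }
  __global__ void update_boundary(float *g) { /* boundary points */ }

  void timestep(float *d_grid, float *d_send_plane, float *d_halo_plane,
                float *h_send, float *h_recv, int plane_elems,
                int neighbor, cudaStream_t s_interior, cudaStream_t s_boundary) {
    size_t bytes = plane_elems * sizeof(float);

    /* 1. copy boundary plane GPU -> CPU on the boundary stream */
    cudaMemcpyAsync(h_send, d_send_plane, bytes, cudaMemcpyDeviceToHost, s_boundary);

    /* 2. interior points run concurrently on their own stream */
    update_interior<<<256, 256, 0, s_interior>>>(d_grid);

    /* 3. exchange boundaries with the neighbor while the interior kernel runs */
    cudaStreamSynchronize(s_boundary);
    MPI_Sendrecv(h_send, plane_elems, MPI_FLOAT, neighbor, 0,
                 h_recv, plane_elems, MPI_FLOAT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_halo_plane, h_recv, bytes, cudaMemcpyHostToDevice, s_boundary);

    /* 4. boundary points, once the fresh halo is back on the device */
    update_boundary<<<64, 128, 0, s_boundary>>>(d_grid);

    cudaDeviceSynchronize();   /* both streams finish before the next step */
  }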

15
Optimization Example: 7-Point Stencil CPU Code
  for (i = 0; i < iter; ++i) {
    /* compute interior points */
    __PSStencilRun_kernel_interior<<<…>>>(__PSGetLocalOffset(0), __PSGetLocalOffset(1),
        __PSDomainShrink(&s0->dom, 1),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    /* boundary exchange */
    int fw_width[3] = {1L, 1L, 1L};
    int bw_width[3] = {1L, 1L, 1L};
    __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
    /* compute boundary planes concurrently */
    __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0,
        stream_boundary_kernel[0]>>>(__PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 0),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0,
        stream_boundary_kernel[1]>>>(__PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 1),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    …
    __PSStencilRun_kernel_boundary_2_fw<<<1, (dim3(128,1,4)), 0,
        stream_boundary_kernel[11]>>>(__PSDomainGetBoundary(&s0->dom, 1, 1, 1, 1, 0),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    cudaThreadSynchronize();
  }
  cudaThreadSynchronize();
}

16
Local Optimization
Register blocking – Reuse loaded grid elements with registers
Original:
  for (int k = 1; k < n-1; ++k) {
    g[i][j][k] = a*(f[i][j][k]+f[i][j][k-1]+f[i][j][k+1]);
  }
Optimized:
  double kc = f[i][j][0];
  double kn = f[i][j][1];
  for (int k = 1; k < n-1; ++k) {
    double kp = kc;
    kc = kn;
    kn = f[i][j][k+1];
    g[i][j][k] = a*(kc+kp+kn);
  }

17
Local Optimization
Common subexpression elimination in offset computation – Eliminates intra- and inter-iteration common subexpressions
Original:
  for (int k = 1; k < n-1; ++k) {
    g[i][j][k] = a*(f[i+j*n+k*n*n]+f[i+j*n+(k-1)*n*n]+f[i+j*n+(k+1)*n*n]);
  }
Optimized:
  int idx = i+j*n+n*n;
  for (int k = 1; k < n-1; ++k) {
    g[i][j][k] = a*(f[idx]+f[idx-n*n]+f[idx+n*n]);
    idx += n*n;
  }

18
Evaluation Performance and productivity Sample code – 7-point diffusion kernel (#stencil: 1) – Jacobi kernel from Himeno benchmark (#stencil: 1) – Seismic simulation (#stencil: 15) Platform – Tsubame 2.0 Node: Westmere-EP 2.9GHz x 2 + M2050 x 3 Dual Infiniband QDR with full bisection BW fat tree

19
Productivity: Similar size to sequential code in C

20
Optimization Effect 7-point diffusion stencil on 1 GPU (Tesla M2050)

21
Diffusion Weak Scaling

22
Seismic Weak Scaling Problem size: 256x256x256 per GPU

23
Diffusion Strong Scaling Problem size: 512x512x4096

24
Himeno Strong Scaling Problem size XL (1024x1024x512)

25
Conclusion High-level abstractions for stencil computations – Portable – Declarative – Automatic parallelization Future work – Automatic and model-based tuning – Fault tolerance by automated checkpointing – Support for other architectures (OpenMP/OpenCL ongoing; porting to the K computer) – User-defined data types – Composability extension Acknowledgments – JST CREST, FP3C, NVIDIA – The ROSE project by Dan Quinlan et al. of LLNL

26
Ongoing Work Auto-tuning – Preliminary AT for the CUDA backend available Supporting different accelerators Supporting more complex problems – Stencil with limited data dependency – Hierarchically organized problems Work unit: Dense structured grids Overall problem domain: Sparsely connected work units Example – NICAM: An Icosahedral model of climate simulation – UPACS: Fluid simulation of engineering problems

27
Further Information Code is available at Maruyama et al., “Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers,” SC’11, 2011.

28
Optimization Effects

29
Talk Outline – Physis stencil framework – MapReduce for K – Mini-app development

30
MapReduce for K In-memory, MPI-based MapReduce for the K computer – Implemented as a C library – Provides (a subset of) Hadoop-like programming interfaces – Strong focus on scalable processing of large data sets on K – Also supports standard MPI clusters Application examples – Metagenome sequence analysis – Replica-exchange MD: runs hundreds of NAMD instances as Map tasks, with fast data loading over the Tofu network
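The slides describe a C library with Hadoop-like interfaces but do not show its API, so the following map/reduce callback pair (a word-count toy) is purely hypothetical; none of these names come from the actual library.

  /* Hypothetical C-style map/reduce callbacks, for illustration only;
   * the actual K MapReduce library API is not shown on these slides. */
  #include <string.h>
  #include <stddef.h>

  typedef void (*emit_fn)(const char *key, long value, void *ctx);

  /* Map: split a line of input into words and emit (word, 1). */
  void map_words(const char *line, emit_fn emit, void *ctx) {
    char buf[1024];
    strncpy(buf, line, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n"))
      emit(tok, 1, ctx);
  }

  /* Reduce: sum the counts collected for one key across all map tasks. */
  long reduce_sum(const char *key, const long *values, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i) total += values[i];
    return total;
  }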

31
Talk Outline – Physis stencil framework – MapReduce for K – Mini-app development

32
HPC Mini Applications A set of mini-apps derived from national key applications – Source code will be publicly released around Q1–Q2 ’14 – Expected final number of apps: < 20 Part of the national pre-exascale projects – Supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)

33
Mini-App Methodology 1. Request application submissions from the current users of the Japanese HPC systems – Documentation: mathematical background, target capability and capacity – Input data sets – Validation methods – Application code Full-scale applications – Capable of running complete end-to-end simulations – 17 submissions so far Stripped-down applications – Simplified applications containing only the essential parts of the code – 6 submissions so far

34
Mini-App Methodology 2. Deriving “mini” applications from the submitted applications – Understanding performance-critical patterns: computation, memory access, file I/O, network I/O – Reducing codebase size: removing code not used for target problems, applying standard software engineering practices (e.g., DRY) – Refactoring into reference implementations – Performance modeling, in collaboration with the ASPEN project (Jeff Vetter at ORNL) – (Optional) Versions optimized for specific architectures

35
Mini-App Example: Molecular Dynamics Kernels: pairwise force calculation + long-range updates Two alternative algorithms for solving equivalent problems – FFT-based Particle Mesh Ewald: bottlenecked by all-to-all communication at scale – Fast Multipole Method: tree-based formulation with no all-to-all communication Simplified problem settings – Only simulates water molecules in the NVE ensemble – Reduces the codebase significantly – Easier to create input data sets of different scales Two reference implementations to study the performance implications of algorithmic differences – MARBLE (20K SLOC) – MODYLAS (16K SLOC)
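As a rough illustration of the pairwise force kernel mentioned above, here is a plain-C O(N^2) loop with a Lennard-Jones potential as a stand-in; the actual MARBLE/MODYLAS force fields, cutoffs, and data layouts are not shown on these slides.

  /* Illustrative O(N^2) pairwise force loop with a Lennard-Jones potential;
   * real MD codes add cutoffs, neighbor lists, and PME/FMM for the long range. */
  void pairwise_forces(int n, const double (*pos)[3], double (*force)[3],
                       double epsilon, double sigma) {
    for (int i = 0; i < n; ++i)
      force[i][0] = force[i][1] = force[i][2] = 0.0;

    for (int i = 0; i < n; ++i) {
      for (int j = i + 1; j < n; ++j) {
        double dx = pos[i][0] - pos[j][0];
        double dy = pos[i][1] - pos[j][1];
        double dz = pos[i][2] - pos[j][2];
        double r2 = dx*dx + dy*dy + dz*dz;
        double s2 = sigma * sigma / r2;
        double s6 = s2 * s2 * s2;
        /* force magnitude expressed as a factor on the displacement vector */
        double f = 24.0 * epsilon * s6 * (2.0 * s6 - 1.0) / r2;
        force[i][0] += f * dx;  force[j][0] -= f * dx;
        force[i][1] += f * dy;  force[j][1] -= f * dy;
        force[i][2] += f * dz;  force[j][2] -= f * dz;
      }
    }
  }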
