
1 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC. LLNL-PRES-653032. Salishan Conference on High-Speed Computing, April 21-24, 2014.

2  Managing increased hardware complexity and diversity…
   Multi-level memory
   — High-bandwidth (e.g., stacked) memory on-package
   — High-capacity (e.g., NVRAM) main memory
   — Deeper cache hierarchies (user managed?)
   Heterogeneous processor hierarchy, changing core count configurations
   — Latency-optimized (e.g., fat cores)
   — Throughput-optimized (e.g., GPUs, MIC)
   Increased importance of vectorization / SIMD
   — 2–8 wide (double precision) on many architectures and growing; 32 wide on GPUs
   As # cores/chip increases, cache coherence across the full chip may not exist
  …requires pervasive, disruptive, architecture-specific software changes
   Data-specific changes
   — Data structure transformations (e.g., Struct of Arrays vs. Array of Structs), as in the sketch below
   — Inserting directives and intrinsics (e.g., restrict and align annotations) on individual loops
   Algorithm-specific changes
   — Loop inversion, strip mining, loop fusion, etc.
   — Which loops get which directives (e.g., OpenMP, OpenACC) may be architecture-specific
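A minimal sketch, not from the slides, of the AoS-to-SoA transformation and the kind of no-alias annotation the data-specific bullets mention; all names are illustrative:

   #include <vector>

   // Array of Structs (AoS): fields are interleaved, so per-field access
   // is strided and harder for the compiler to vectorize.
   struct ZoneAoS { double mass, volume, energy; };

   // Struct of Arrays (SoA): each field is contiguous, giving unit-stride
   // loads/stores that SIMD units handle well.
   struct ZonesSoA {
      std::vector<double> mass, volume, energy;
   };

   // A compiler-specific annotation of the kind mentioned above; the exact
   // spelling varies by compiler (__restrict__ works on GCC/Clang/ICC).
   void scale_energy(double* __restrict__ energy,
                     const double* __restrict__ mass,
                     double a, int n)
   {
      for (int i = 0; i < n; ++i) {   // unit stride, no aliasing: SIMD-friendly
         energy[i] += a * mass[i];
      }
   }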

3  Architecture/performance portability (our definition): map application memory and functional requirements to memory systems and functional units on a range of architectures, while maintaining a consistent programming style and control of all aspects of execution
  The problem is acute for large multi-physics codes
   O(10^5) – O(10^6) lines of code; O(10K) loops
   Mini-apps do not capture the scale of code changes needed
  Manageable portable performance requires separating platform-specific data management and execution concerns from numerical algorithms in our applications…
   …with code changes that are intuitive to developers

4  We need algorithms & programming styles that:
   Can express various forms of parallelism
   Enable high performance portably
   Can be explored in our codes incrementally
  There is no clear “best choice” for a future programming model / language.
  RAJA is based on standard C++ (which we rely on already)
   It supports constructs & extends concepts used heavily in LLNL codes
   It can be added to codes incrementally & used selectively
   It allows various programming models “under the covers”; it does not wed a code to a particular technology
   It is lightweight and offers developers customizable implementation choices

5 Data type encapsulation hides non-portable compiler directives, intrinsics, etc. (not required by RAJA, but a good idea in general; see the sketch below). Traversal templates encapsulate platform-specific scheduling & execution. Index sets encapsulate loop iteration patterns & data placement. C++ lambda functions enable the decoupling (this is essential for us!).

C-style for-loop:

   double* x ; double* y ; double a ;
   // ...
   for ( int i = begin; i < end; ++i ) {
      y[ i ] += a * x[ i ] ;
   }

RAJA-style loop:

   Real_ptr x ; Real_ptr y ; Real_type a ;
   // ...
   forall<exec_policy>( IndexSet, [&] (Index_type i) {
      y[ i ] += a * x[ i ] ;
   } );

Important: the loop body is the same. Transformations can be adopted incrementally, and each part can be customized for specific code needs.
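As one way to realize the data type encapsulation above, Real_type and Real_ptr can be defined in a single central header; a minimal sketch, with spellings that are assumptions rather than the actual RAJA definitions:

   // One central header: application code uses Real_type / Real_ptr
   // everywhere, and only this file changes per compiler or architecture.
   typedef double Real_type;

   #if defined(__GNUC__) || defined(__INTEL_COMPILER)
   typedef Real_type* __restrict__ Real_ptr;   // no-alias hint to the optimizer
   #else
   typedef Real_type* Real_ptr;
   #endif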

6 It is common to define arrays of indices to process; e.g., nodes, elements with a given material, etc.:

   int elems[] = {0, 1, 2, 3, 4, 5, 6, 7, 14, 27, 36,
                  40, 41, 42, 43, 44, 45, 46, 47, 87, 117};

Create a “hybrid” index set containing work segments; here the segments are {0…7}, {14, 27, 36}, {40…47}, and {87, 117}:

   HybridISet segments = createHybridISet( elems, nelems );

A traversal method dispatches segments according to execution policy:

   forall ( segments, loop_body );

Range segment:

   for (int i = begin; i < end; ++i) {
      loop_body( i ) ;
   }

Unstructured segment:

   for (int i = 0; i < seg_len; ++i) {
      loop_body( segment[i] ) ;
   }

Segments can be tailored to architecture features (e.g., SIMD hardware units): createHybridISet() methods coordinate runtime & compile-time optimizations. Platform-specific header files contain tailored traversal definitions (sketched below).
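A minimal sketch of how a traversal template might dispatch the two segment types; the types, and the simplified all-ranges-then-all-lists ordering, are illustrative rather than the actual RAJA implementation:

   #include <vector>

   // Illustrative segment representations.
   struct RangeSegment        { int begin, end; };        // stride-1 run [begin, end)
   struct UnstructuredSegment { std::vector<int> indices; };

   struct HybridISet {
      std::vector<RangeSegment>        ranges;
      std::vector<UnstructuredSegment> lists;
   };

   template <typename LOOP_BODY>
   void forall(const HybridISet& iset, LOOP_BODY loop_body)
   {
      // Range segments: the compiler sees a plain counted loop and can vectorize.
      for (size_t s = 0; s < iset.ranges.size(); ++s) {
         for (int i = iset.ranges[s].begin; i < iset.ranges[s].end; ++i) {
            loop_body(i);
         }
      }
      // Unstructured segments: irregular indices go through indirection.
      for (size_t s = 0; s < iset.lists.size(); ++s) {
         const std::vector<int>& seg = iset.lists[s].indices;
         for (size_t i = 0; i < seg.size(); ++i) {
            loop_body(seg[i]);
         }
      }
   }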

7 Code compiled with icpc v14.0.106 and run on one Intel E5-2670 node (TLCC2).

8 Understanding RAJA overhead at small thread counts requires more analysis.

9 (figure)

10 (figure)

11  Define RAJA index sets in routines where domain extents and indirection arrays are set up, in a single code location:

   domain->index_sets->real_zones = ... ;

 Then, replace loops with calls to API routines that translate to RAJA mechanics. Loops like these…

   for (int i = 0; i < domain->numRealZones; i++) {
      int zone = domain->Zones[i];
      // use “zone” as array index
   }

   for (int j = domain->jmin; j < domain->jmax; ++j) {
      for (int i = domain->imin; i < domain->imax; ++i) {
         int zone = i + j * domain->jp;
         // use “zone” as array index
      }
   }

…become this (see the sketch below for a plausible implementation):

   forEachRealZone (domain, [=](int zone) {
      // use “zone” as array index
   });
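A plausible shape for such an API routine, reusing the illustrative HybridISet/forall sketch from slide 6; the Domain_t fields here are assumptions based on the loops above:

   // Assumed containers, mirroring the fields used in the loops above.
   struct DomainIndexSets { HybridISet real_zones; };   // built once, at setup
   struct Domain_t        { DomainIndexSets* index_sets; /* ... */ };

   // Thin wrapper: an application-meaningful name in the physics code,
   // index set + traversal mechanics underneath.
   template <typename LOOP_BODY>
   void forEachRealZone(Domain_t* domain, LOOP_BODY loop_body)
   {
      forall(domain->index_sets->real_zones, loop_body);
   }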

12  Sedov problem
   Lagrangian hydro only
   3D, 192^3 = 7,077,888 zones
   Run 25 cycles
   Timed with “date \n run \n date” in the input file
  Evaluate strong scaling with different MPI-OpenMP combinations
   “Coarse-grained” threading: thread domain loops
   “Fine-grained” threading: thread numerical kernel loops
   (See the schematic sketch below.)
  Note: this problem is not representative of a real problem in terms of computational work, and it exercises a small portion of the code
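Schematically, the two threading strategies differ only in where the OpenMP parallel loop sits; a sketch under assumed function and type names:

   #include <vector>

   struct Domain { int numZones; /* ... */ };
   void lagrange_hydro_step(Domain& d);   // runs all kernels on one domain (assumed)
   void update_zone(Domain& d, int zone); // one kernel's per-zone work (assumed)

   // "Coarse-grained" threading: threads are spread across domain loops;
   // each thread advances whole domains with serial kernels inside.
   void advance_coarse(std::vector<Domain>& domains)
   {
      #pragma omp parallel for
      for (int d = 0; d < (int)domains.size(); ++d) {
         lagrange_hydro_step(domains[d]);
      }
   }

   // "Fine-grained" threading: one domain per MPI task; threads split the
   // iteration space of each numerical kernel loop.
   void advance_fine(Domain& domain)
   {
      #pragma omp parallel for
      for (int zone = 0; zone < domain.numZones; ++zone) {
         update_zone(domain, zone);
      }
   }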

13 Number of domains is N x M, where N = # MPI tasks, M = # threads/task:

   MPI tasks   Threads/task   Time (s)   Speedup   Parallel efficiency
       1            1           1628      1.00        1.000
       2            1            988      1.65        0.825
       4            1            512      3.18        0.795
       8            1            271      6.01        0.751
      16            1            149     10.93        0.683
      32            1            101     16.12        0.504
      64            1             77     21.14        0.330
      32            2             95     17.14        0.268
      16            4            103     15.81        0.247
       8            8            118     13.80        0.216
       4           16            140     11.63        0.182
       2           32            177      9.20        0.144
       1           64            255      6.38        0.099

Real problems run here, typically.
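Speedup and parallel efficiency are measured against the 1 task x 1 thread run; for example, the 32 tasks x 2 threads row works out as:

   S = \frac{T_{1 \times 1}}{T_{32 \times 2}} = \frac{1628}{95} \approx 17.14,
   \qquad
   E = \frac{S}{N \times M} = \frac{17.14}{64} \approx 0.268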

14 RAJA version: N MPI tasks x M threads, with N x M = 64. Original version: N MPI tasks. In either case, N domains.

15 RAJA version: N MPI tasks x M threads, N domains. Original version: N MPI tasks x M threads, N x M domains. In either case, N x M = 64.

16  Basic integration is straightforward
  The hard work is localized:
   Setting up & manipulating index sets
   Defining platform-specific execution policies for loop classes (see the sketch below)
  Converting loops is easy, but tedious:
   Replace the loop header with a call to an iteration template
   Identify the loop type (i.e., execution pattern)
   Determine whether the loop can and should be parallelized
   — Are other changes needed, e.g., variable scope, thread safety?
   — What is the appropriate execution policy? Platform-specific?
  Encapsulation of looping constructs also benefits software consistency, readability, and maintainability
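A minimal sketch of platform-specific execution policies via tag dispatch; the policy type names are illustrative and not necessarily RAJA's:

   // Illustrative policy tags and traversal overloads.
   struct seq_exec {};
   struct omp_parallel_for_exec {};

   template <typename LOOP_BODY>
   void forall(seq_exec, int begin, int end, LOOP_BODY loop_body)
   {
      for (int i = begin; i < end; ++i) loop_body(i);
   }

   template <typename LOOP_BODY>
   void forall(omp_parallel_for_exec, int begin, int end, LOOP_BODY loop_body)
   {
      #pragma omp parallel for
      for (int i = begin; i < end; ++i) loop_body(i);
   }

   // One alias per loop class, selected per platform in a single header.
   // Loops then name their class, never a mechanism.
   #if defined(USE_OPENMP)
   typedef omp_parallel_for_exec stream_loop_policy;
   #else
   typedef seq_exec stream_loop_policy;
   #endif

A converted loop would then read forall(stream_loop_policy(), begin, end, loop_body); retargeting the code means editing only the policy header.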

17  The RAJA version is sometimes faster, sometimes slower.
   We have converted only 421 Lagrange hydro loops (327 DPstream, 83 DPwork, 11 Seq).
   Is threading too aggressive?
   Other code transformations can expose additional parallelism opportunities and enable compiler optimizations (e.g., SIMD).
  We need to overcome the RAJA serial performance hit.
   Compiler optimization for template/lambda constructs? Is better inlining needed?
  Once RAJA is in place, exploring data layout and execution choices to improve performance is straightforward and centralized (we have done some of this already).

18  We have identified specific compiler deficiencies, developed concrete recommendations and evidence of feasibility, and engaged compiler teams
  We also created LCALS to study & monitor the issues
   A suite of loops implemented with various software constructs (the Livermore Loops modernized and expanded, this time in C++)
   Very useful for dialogue with compiler vendors
   — Generate test cases for vendors showing optimization/support issues
   — Try vendor solutions and report findings
   — Introduce & motivate encapsulation concepts not on vendors’ radar
   — Track version-to-version compiler performance
   Available at https://codesign.llnl.gov
 “A Case for Improved C++ Compiler Support to Enable Performance Portability in Large Physics Simulation Codes”, Rich Hornung and Jeff Keasler, LLNL-TR-653681 (https://codesign.llnl.gov/codesign-papers-presentations.php)

19 All runs on an Intel E5-2670 node, compiled with icpc: -O3 -mavx -inline-max-total-size=10000 -inline-forceinline -ansi-alias -std=c++0x

20 Illustrative RAJA use cases

21  Many multi-physics applications use indirection arrays to traverse unstructured meshes, to access elements with a given material, etc.
  Since indirection arrays are (re)defined at runtime, many compile-time optimizations cannot be applied (e.g., SIMD vectorization).
  ALE3D example: a large problem with 10 materials, 512 domains, 16+ million zones (many multi-material); most work is done in “long” stride-1 index ranges:

   Stride-1 range length   % of work
   >= 16                   84%
   >= 32                   74%
   >= 64                   70%
   >= 128                  69%
   >= 256                  67%
   >= 512                  64%

  Hybrid index sets can help recover some of the lost performance by exposing traversals of stride-1 ranges without indirection (a construction sketch follows).
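The recovered performance comes from scanning the indirection array once, when the index set is built, and splitting out contiguous runs; a minimal sketch of that scan, with an illustrative length threshold and types:

   #include <vector>

   struct RangeSeg { int begin, end; };   // stride-1 run [begin, end)

   // Split an index array into stride-1 ranges of length >= range_min,
   // plus a leftover list that still needs indirection.
   void splitStride1Runs(const int* elems, int nelems, int range_min,
                         std::vector<RangeSeg>& ranges,
                         std::vector<int>& leftover)
   {
      int i = 0;
      while (i < nelems) {
         int j = i + 1;
         while (j < nelems && elems[j] == elems[j-1] + 1) ++j;   // extend the run
         if (j - i >= range_min) {
            RangeSeg r = { elems[i], elems[i] + (j - i) };
            ranges.push_back(r);                 // traversed without indirection
         } else {
            for (int k = i; k < j; ++k) leftover.push_back(elems[k]);
         }
         i = j;
      }
   }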

22 A traversal template can encapsulate transient fault recovery; e.g.:

   template <typename LB>
   void forall(int begin, int end, LB loop_body)
   {
      bool done = false ;
      while ( !done ) {
         try {
            done = true ;
            for (int i = begin; i < end; ++i) {
               loop_body( i ) ;
            }
         } catch (Transient_fault) {
            cache_invalidate() ;
            done = false ;
         }
      }
   }

No impact on source code, and the recovery cost is commensurate with the scope of the fault. It requires: idempotence, an O/S that can signal processor faults, try/catch that can process O/S signals, etc. These requirements should not be showstoppers!

23 This code uses integer arrays for both control logic and indirection:

   void domixupvar( Domain_t* domain, MixedZone_t* mixzon, double* var, ... )
   {
      for ( int iz = 0 ; iz < nmixz ; iz++ ) {
         if ( mixzon->doflag[ iz ] == 1 ) {
            for ( int i = 0 ; i < domain->numlocalmats ; i++ ) {
               RegMixedZone_t& mzreg = ... ;
               if ( mzreg.ndxput[ iz ] >= 0 ) {
                  // ...
                  var[ mzreg.ndxput[ iz ] ] = mzreg.airnew[ iz ];
                  // etc...

24 Encoding the conditional logic in RAJA index sets simplifies the code and removes two levels of nesting (good for developers and compilers!):

   void domixupvar( Domain_t* domain, MixedZone_t* mixzon, double* var, ... )
   {
      for (int i = 0; i < domain->numlocalmats; i++) {
         int ir = domain->localmats[i] ;
         RAJA::forall<exec_policy>( *mixzon->reg[ir].ndxput_is, [&] (int iz) {
            // ...
            var[ mzreg.ndxput[ iz ] ] = mzreg.airnew[ iz ];
            // etc...

Serial speedup: 1.6x on Intel Sandy Bridge, 1.9x on IBM BG/Q (g++).

Aside: compiling the original code with g++ 4.7.2 (needed for lambdas) gives a 1.99x speedup over XLC. So, going from the original with XLC to RAJA with g++ yields a 3.78x total performance increase.

25  A common operation in staggered-mesh codes sums values to nodes from surrounding zones; i.e., nodal_val[ node ] += zonal_val[ zone ]
  Index set segments can be used to define independent groups of computation (colors)
  Option A (~8x speedup with 16 threads; sketched below):
   Iterate over groups sequentially (group 1 completes, then group 2, etc.)
   Operations within a group execute in parallel
  Option B (~17% speedup over Option A):
   Zones in a group (row) are processed sequentially
   Iterate over the rows of each color in parallel
  Note: no source code change is needed to switch between iteration / parallel execution patterns.

(Figure: mesh diagrams showing the zone colorings for Options A and B.)
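A schematic of Option A, assuming zones within one color touch disjoint node sets so the inner loop can run in parallel without atomics; the container and pragma choices are illustrative:

   #include <vector>

   typedef std::vector<int> ColorGroup;   // zone indices sharing no nodes

   void sumToNodes(const std::vector<ColorGroup>& colors,
                   const double* zonal_val, double* nodal_val,
                   const int* zoneToNode, int nodesPerZone)
   {
      // Option A: sequential over colors...
      for (size_t c = 0; c < colors.size(); ++c) {
         const ColorGroup& group = colors[c];
         // ...parallel within a color: writes cannot collide by construction.
         #pragma omp parallel for
         for (int z = 0; z < (int)group.size(); ++z) {
            int zone = group[z];
            for (int n = 0; n < nodesPerZone; ++n) {
               nodal_val[ zoneToNode[zone * nodesPerZone + n] ] += zonal_val[zone];
            }
         }
      }
   }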

26  A Hybrid Memory Cube has four quadrants. Latency is lowered (by as much as 50%) if access stays within a quadrant.
  Addressing rotates quickly through vaults, 4 vaults/quadrant.
  Quadrant access is striped in a way that requires non-contiguous array allocation.

(Figure: byte assignment of memory space to quadrants; e.g., an array “block” within a quadrant holds 64 doubles.)

27  A specialized traversal template can be written for the HMC to keep memory access within a quadrant; e.g.:

   #define QUADRANT_SIZE 64
   #define QUADRANT_MASK 63
   #define QUADRANT_STRIDE (QUADRANT_SIZE * 4)

   template <typename LOOP_BODY>
   void forall(int begin, int end, LOOP_BODY loop_body)
   {
      int beginQuad   = begin / QUADRANT_SIZE ;
      int endQuad     = (end - 1) / QUADRANT_SIZE ;
      int beginOffset = (beginQuad * QUADRANT_STRIDE) + (begin & QUADRANT_MASK) ;
      int endOffset   = (endQuad * QUADRANT_STRIDE) + ((end - 1) & QUADRANT_MASK) + 1 ;
      do {
         /* do at most QUADRANT_SIZE iterations: stop at the end of this
            quadrant block, or at the overall end on the last pass */
         int passEnd = (beginOffset & ~QUADRANT_MASK) + QUADRANT_SIZE ;
         if (passEnd > endOffset) passEnd = endOffset ;
         for (int ii = beginOffset; ii < passEnd; ++ii) {
            loop_body( ii ) ;
         }
         beginOffset = (beginOffset & ~QUADRANT_MASK) + QUADRANT_STRIDE ; /* next block */
      } while (beginQuad++ != endQuad) ;
   }

28  RAJA can encapsulate platform-specific implementation concerns in a large code.
   Insertion is not hard, but it can be tedious. (Many loops, but few patterns.)
   We are working with the ROSE team to see what can be done automatically.
  What are the benefits?
   Application code can be simpler: easier to read, write, and maintain.
   Developers can customize the model for code-specific constructs.
   Loop-level execution control is centralized.
   Code can be parameterized to run efficiently on different platforms.
  What are the concerns?
   Improving performance requires detailed analysis and other code changes (not unique to RAJA, and ROSE can help with this…)
   Extending the model to other programming models and architectures (we’re optimistic)
   Managing “memory spaces” on future architectures (a little help, anybody?)
   We can’t “do it all” via software engineering. We also need help from compilers, programming models, the O/S, language features, etc. (all have portability issues!)

29  Jeff Keasler, my collaborator on RAJA development
  Esteban Pauli, “guinea pig” for trying out RAJA in Ares

30 The end.

