This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.

Presentation transcript:

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC. LLNL-PRES. April 21-24, 2014, Salishan Conference on High-Speed Computing.

• Managing increased hardware complexity and diversity…
  Multi-level memory
    — High-bandwidth (e.g., stacked) memory on-package
    — High-capacity (e.g., NVRAM) main memory
    — Deeper cache hierarchies (user managed?)
  Heterogeneous processor hierarchy, changing core count configurations
    — Latency-optimized (e.g., fat cores)
    — Throughput-optimized (e.g., GPUs, MIC)
  Increased importance of vectorization / SIMD
    — 2–8 wide (double precision) on many architectures and growing; 32 wide on GPUs
  As the number of cores per chip increases, cache coherence across the full chip may not exist
• …requires pervasive, disruptive, architecture-specific software changes
  Data-specific changes
    — Data structure transformations (e.g., Struct of Arrays vs. Array of Structs; see the sketch below)
    — Need to insert directives and intrinsics (e.g., restrict and align annotations) on individual loops
  Algorithm-specific changes
    — Loop inversion, strip mining, loop fusion, etc.
    — Which loops and which directives (e.g., OpenMP, OpenACC) may be architecture-specific
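A minimal sketch of the Array-of-Structs vs. Struct-of-Arrays transformation mentioned above (the field names and update routine are hypothetical, for illustration only):

  #include <vector>

  // Array of Structs: the fields of one zone sit next to each other,
  // which tends to hinder SIMD vectorization of per-field loops.
  struct ZoneAoS { double density, energy, pressure; };

  // Struct of Arrays: each field is contiguous in memory, so a loop over
  // one field is stride-1 and vectorizes well.
  struct ZonesSoA {
    std::vector<double> density, energy, pressure;
  };

  void updatePressure(ZonesSoA& z, double gm1) {
    const int n = static_cast<int>(z.density.size());
    for (int i = 0; i < n; ++i) {       // stride-1 access over each field
      z.pressure[i] = gm1 * z.density[i] * z.energy[i];
    }
  }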

• Architecture/performance portability (our defn): map application memory and functional requirements to memory systems and functional units on a range of architectures, while maintaining a consistent programming style and control of all aspects of execution
• The problem is acute for large multi-physics codes
  O(10^5)–O(10^6) lines of code; O(10K) loops
  Mini-apps do not capture the scale of code changes needed
• Manageable portable performance requires separating platform-specific data management and execution concerns from numerical algorithms in our applications…
  …with code changes that are intuitive to developers

• We need algorithms & programming styles that:
  Can express various forms of parallelism
  Enable high performance portably
  We can explore in our codes incrementally
• There is no clear "best choice" for a future PM/language.
• RAJA is based on standard C++ (which we already rely on)
  It supports constructs & extends concepts used heavily in LLNL codes
  It can be added to codes incrementally & used selectively
  It allows various PMs "under the covers" – it does not wed a code to a particular technology
  It is lightweight and offers developers customizable implementation choices

Data type encapsulation hides non-portable compiler directives, intrinsics, etc. (not required by RAJA, but a good idea in general). Traversal templates encapsulate platform-specific scheduling & execution (a sketch of such a template follows below). Index sets encapsulate loop iteration patterns & data placement. C++ lambda functions enable decoupling (this is essential for us!).

C-style for-loop:

  double* x;
  double* y;
  double a;
  // …
  for ( int i = begin; i < end; ++i ) {
    y[ i ] += a * x[ i ];
  }

RAJA-style loop:

  Real_ptr x;
  Real_ptr y;
  Real_type a;
  // …
  forall( IndexSet, [&] (Index_type i) {
    y[ i ] += a * x[ i ];
  } );

Important: the loop body is the same. Transformations can be adopted incrementally and each part can be customized for specific code needs.
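A minimal sketch of how a traversal template of this kind might dispatch the same loop body under different execution policies (the policy tags and function signatures here are illustrative assumptions, not the actual RAJA API):

  struct seq_exec {};            // sequential execution policy tag (illustrative)
  struct omp_parallel_exec {};   // OpenMP execution policy tag (illustrative)

  // Sequential traversal: the loop body is supplied as a lambda.
  template <typename LOOP_BODY>
  void forall(seq_exec, int begin, int end, LOOP_BODY loop_body) {
    for (int i = begin; i < end; ++i) loop_body(i);
  }

  // OpenMP traversal: same loop body, different scheduling & execution.
  template <typename LOOP_BODY>
  void forall(omp_parallel_exec, int begin, int end, LOOP_BODY loop_body) {
    #pragma omp parallel for
    for (int i = begin; i < end; ++i) loop_body(i);
  }

Selecting the policy (here via the first argument) changes how iterations are scheduled, while the application-side loop body is untouched.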

It is common to define arrays of indices to process; e.g., nodes, elements with a given material, etc.:

  int elems[] = {0, 1, 2, 3, 4, 5, 6, 7, 14, 27, 36, 40, 41, 42, 43, 44, 45, 46, 47, 87, 117};

Create a "Hybrid" index set containing work segments (here: 0…7 | 14, 27, 36 | 40…47 | 87, 117):

  HybridISet segments = createHybridISet( elems, nelems );

The traversal method dispatches segments according to execution policy:

  forall ( segments, loop_body );

Range segment:

  for (int i = begin; i < end; ++i) { loop_body(i); }

Unstructured segment:

  for (int i = 0; i < seg_len; ++i) { loop_body( segment[i] ); }

Segments can be tailored to architecture features (e.g., SIMD hardware units). createHybridISet() methods coordinate runtime & compile-time optimizations, and platform-specific header files contain tailored traversal definitions (a sketch of how such a partition might be built appears below).
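A minimal sketch of how an index array might be partitioned into range and unstructured segments (the class and function names are assumptions for illustration; the real createHybridISet() applies additional heuristics, e.g., alignment and minimum range length):

  #include <vector>

  struct RangeSegment { int begin, end; };            // contiguous indices [begin, end)
  struct ListSegment  { std::vector<int> indices; };  // arbitrary (unstructured) indices

  struct HybridSegments {
    std::vector<RangeSegment> ranges;
    std::vector<ListSegment>  lists;
  };

  // Runs of consecutive indices at least minRangeLen long become range segments;
  // everything else is collected into list segments.
  HybridSegments buildSegments(const int* elems, int n, int minRangeLen = 8) {
    HybridSegments segs;
    std::vector<int> pending;   // indices waiting to be emitted as a list segment
    int i = 0;
    while (i < n) {
      int j = i;
      while (j + 1 < n && elems[j + 1] == elems[j] + 1) ++j;   // extend contiguous run
      if (j - i + 1 >= minRangeLen) {
        if (!pending.empty()) { segs.lists.push_back({pending}); pending.clear(); }
        segs.ranges.push_back({elems[i], elems[j] + 1});
      } else {
        pending.insert(pending.end(), elems + i, elems + j + 1);
      }
      i = j + 1;
    }
    if (!pending.empty()) segs.lists.push_back({pending});
    return segs;
  }

For the elems array above this yields range segments 0…7 and 40…47 plus list segments {14, 27, 36} and {87, 117}.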

Code compiled with icpc v and run on one Intel E5-2670 node (TLCC2).

Understanding RAJA overhead at small thread counts requires more analysis.

• Define RAJA index sets in routines where domain extents and indirection arrays are set up, in a single code location:

  domain->index_sets->real_zones = ...;

• Then, replace loops with calls to API routines that translate to RAJA mechanics (a sketch of how such a routine might be implemented follows below). Loops like these…

  for (int i = 0; i < domain->numRealZones; i++) {
    int zone = domain->Zones[i];
    // use "zone" as array index
  }

  for (int j = domain->jmin; j < domain->jmax; ++j) {
    for (int i = domain->imin; i < domain->imax; ++i) {
      int zone = i + j * domain->jp;
      // use "zone" as array index
    }
  }

…become this:

  forEachRealZone (domain, [=](int zone) {
    // use "zone" as array index
  });
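A minimal, self-contained sketch of what such an application-facing routine might look like (Domain_t here is a stand-in type, and a plain sequential traversal is shown; the real routine would forward to RAJA::forall over a prebuilt index set with a platform-chosen execution policy):

  #include <vector>

  // Stand-in for the application's domain type (illustrative only).
  struct Domain_t {
    std::vector<int> realZones;   // indices of real zones, set up once at problem setup
  };

  // Application-facing wrapper: physics code supplies only the loop body,
  // while the traversal mechanics and execution policy stay hidden here.
  template <typename LOOP_BODY>
  void forEachRealZone(Domain_t* domain, LOOP_BODY loop_body) {
    for (int zone : domain->realZones) {
      loop_body(zone);
    }
  }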

• Sedov problem
  Lagrangian hydro only
  3D, 192^3 = 7,077,888 zones
  Run 25 cycles
  Time run with "date \n run \date" in input file
• Evaluate strong scaling with different MPI-OpenMP combinations
  "Coarse-grained" threading: thread domain loops
  "Fine-grained" threading: thread numerical kernel loops
• Note: this problem is not representative of a real problem in terms of computational work, and it exercises a small portion of the code

[Table: MPI Tasks | Threads per MPI task | Time (seconds) | Speedup | Parallel efficiency]
Number of domains is N x M, where N = # MPI tasks, M = # threads/task. Real problems run here, typically.

RAJA version: N MPI tasks x M threads, N x M = 64
Original version: N MPI tasks
In either case, N domains

RAJA version: N MPI tasks x M threads, N domains
Original version: N MPI tasks x M threads, N x M domains
In either case, N x M = 64

• Basic integration is straightforward
• Hard work is localized
  Setting up & manipulating index sets
  Defining platform-specific execution policies for loop classes
• Converting loops is easy, but tedious
  Replace loop header with call to iteration template
  Identify loop type (i.e., execution pattern)
  Determine whether loop can and should be parallelized
    — Are other changes needed; e.g., variable scope, thread safety?
    — What is the appropriate execution policy? Platform-specific?
• Encapsulation of looping constructs also benefits software consistency, readability, and maintainability

• RAJA version is sometimes faster, sometimes slower.
  We have converted only 421 Lagrange hydro loops (327 DPstream, 83 DPwork, 11 Seq).
  Threading too aggressive?
  Other code transformations can expose additional parallelism opportunities and enable compiler optimizations (e.g., SIMD).
• We need to overcome the RAJA serial performance hit.
  Compiler optimization for template/lambda constructs? Need better inlining?
• Once RAJA is in place, exploring data layout and execution choices to improve performance is straightforward and centralized (we have done some of this already)

• We have identified specific compiler deficiencies, developed concrete recommendations and evidence of feasibility, and engaged compiler teams
• We also created LCALS to study & monitor the issues
  A suite of loops implemented with various s/w constructs (Livermore Loops modernized and expanded – this time in C++)
  Very useful for dialogue with compiler vendors
    — Generate test cases for vendors showing optimization/support issues
    — Try vendor solutions and report findings
    — Introduce & motivate encapsulation concepts not on vendors' radar
    — Track version-to-version compiler performance
  Available at: "A Case for Improved C++ Compiler Support to Enable Performance Portability in Large Physics Simulation Codes", Rich Hornung and Jeff Keasler, LLNL-TR

All runs on an Intel E5-2670 node, compiled with icpc: -O3 -mavx -inline-max-total-size= -inline-forceinline -ansi-alias -std=c++0x

Illustrative RAJA use cases

• Many multi-physics applications use indirection arrays to traverse unstructured meshes, to access elements with a given material, etc.
• Since indirection arrays are (re)defined at runtime, many compile-time optimizations cannot be applied (e.g., SIMD vectorization).
• ALE3D example: large problem with 10 materials, 512 domains, 16+ million zones (many multi-material); most work is done in "long" stride-1 index ranges:

  >= 16:  84%      >= 128: 69%
  >= 32:  74%      >= 256: 67%
  >= 64:  70%      >= 512: 64%

• Hybrid index sets can help recover some lost performance by exposing traversals of stride-1 ranges without indirection.

  template <typename LB>
  void forall(int begin, int end, LB loop_body)
  {
    bool done = false;
    while (!done) {
      try {
        done = true;
        for (int i = begin; i < end; ++i)
          loop_body(i);
      }
      catch (Transient_fault) {
        cache_invalidate();
        done = false;
      }
    }
  }

No impact on source code & recovery cost is commensurate with the scope of the fault. It requires: idempotence, an O/S that can signal processor faults, try/catch that can process O/S signals, etc. These requirements should not be showstoppers! (A brief usage sketch follows below.)
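A brief usage sketch under the slide's assumptions (Transient_fault and cache_invalidate() are the hooks assumed above; the loop body must be idempotent so that a retried iteration produces the same result):

  forall(0, n, [=](int i) {
    y[i] = a * x[i] + b[i];   // writes depend only on inputs, so re-execution after a fault is safe
  });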

This code uses integer arrays for both control logic and indirection:

  void domixupvar( Domain_t* domain, MixedZone_t* mixzon, double* var, ... )
  {
    for ( int iz = 0 ; iz < nmixz ; iz++ ) {
      if ( mixzon->doflag[ iz ] == 1 ) {
        for ( int i = 0 ; i < domain->numlocalmats ; i++ ) {
          RegMixedZone_t& mzreg = ...;
          if ( mzreg.ndxput[ iz ] >= 0 ) {
            // ...
            var[ mzreg.ndxput[ iz ] ] = mzreg.airnew[ iz ];
            // etc.

Encoding conditional logic in RAJA index sets simplifies the code and removes two levels of nesting (good for developers and compilers!):

  void domixupvar( Domain_t* domain, MixedZone_t* mixzon, double* var, ... )
  {
    for (int i = 0; i < domain->numlocalmats; i++) {
      int ir = domain->localmats[i];
      RAJA::forall( *mixzon->reg[ir].ndxput_is, [&] (int iz) {
        // ...
        var[ mzreg.ndxput[ iz ] ] = mzreg.airnew[ iz ];
        // etc.

Serial speedup: 1.6x on Intel Sandy Bridge, 1.9x on IBM BG/Q (g++).
Aside: compiling the original code with g++ (needed for lambdas) gives a 1.99x speedup over XLC. So, going from the original with XLC to RAJA with g++ yields a 3.78x total performance increase.

• A common operation in staggered-mesh codes sums values to nodes from surrounding zones; i.e., nodal_val[ node ] += zonal_val[ zone ]
• Index set segments can be used to define independent groups of computation (colors)
• Option A (~8x speedup w/ 16 threads):
  Iterate over groups sequentially (group 1 completes, then group 2, etc.)
  Operations within a group execute in parallel
• Option B (~17% speedup over Option A):
  Zones in a group (row) processed sequentially
  Iterate over rows of each color in parallel
• Note: No source code change is needed to switch between iteration / parallel execution patterns (a sketch of the Option A pattern follows below)
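A minimal sketch of the Option A pattern, assuming precomputed color groups and a zone-to-node connectivity (all names here are illustrative, not taken from the codes discussed above):

  #include <vector>

  // Each color group holds zones whose node sets are mutually disjoint,
  // so updates within a group cannot race.
  using ColorGroups = std::vector<std::vector<int>>;

  void sumZonesToNodes(const ColorGroups& colors,
                       const std::vector<std::vector<int>>& zoneNodes,  // nodes of each zone
                       const std::vector<double>& zonal_val,
                       std::vector<double>& nodal_val) {
    for (const auto& group : colors) {          // colors processed one after another
      #pragma omp parallel for                  // zones within a color run in parallel
      for (long k = 0; k < static_cast<long>(group.size()); ++k) {
        int zone = group[k];
        for (int node : zoneNodes[zone]) {
          nodal_val[node] += zonal_val[zone];   // safe: node sets disjoint within a color
        }
      }
    }
  }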

• A Hybrid Memory Cube has four quadrants. Latency is lowered (by as much as 50%) if access stays within a quadrant.
• Addressing rotates quickly through vaults, 4 vaults/quadrant.
• Quadrant access is striped in a way that requires non-contiguous array allocation.
• Byte assignment of memory space to quadrants (e.g., an array "block" within a quadrant holds 64 doubles)

• A specialized traversal template can be written for the HMC to keep memory access within a quadrant; e.g.,

  #define QUADRANT_SIZE 64
  #define QUADRANT_MASK 63
  #define QUADRANT_STRIDE (QUADRANT_SIZE * 4)

  template <typename LOOP_BODY>
  void forall(int begin, int end, LOOP_BODY loop_body)
  {
    int beginQuad = (begin / QUADRANT_SIZE);
    int endQuad = ((end - 1) / QUADRANT_SIZE);
    int beginOffset = (beginQuad * QUADRANT_STRIDE) + (begin & QUADRANT_MASK);
    int endOffset = (endQuad * QUADRANT_STRIDE) + ((end - 1) & QUADRANT_MASK) + 1;

    do {
      /* do at most QUADRANT_SIZE iterations, staying within one quadrant block */
      int blockEnd = (beginQuad * QUADRANT_STRIDE) + QUADRANT_SIZE;
      int stop = (endOffset < blockEnd) ? endOffset : blockEnd;
      for (int ii = beginOffset; ii < stop; ++ii) {
        loop_body(ii);
      }
      beginOffset = (beginQuad + 1) * QUADRANT_STRIDE;
    } while (beginQuad++ != endQuad);
  }

• RAJA can encapsulate platform-specific implementation concerns in a large code.
  Insertion is not hard, but can be tedious (many loops, but few patterns).
  We are working with the ROSE team to see what can be done automatically.
• What are the benefits?
  Application code can be simpler – easier to read, write, and maintain.
  Developers can customize the model for code-specific constructs.
  Centralized loop-level execution control.
  Code can be parameterized to run efficiently on different platforms.
• What are the concerns?
  Improving performance requires detailed analysis and other code changes (not unique to RAJA, and ROSE can help with this…)
  Extending the model to other PMs and architectures (we're optimistic)
  Managing "memory spaces" on future architectures (a little help, anybody?)
• We can't "do it all" via s/w engineering. We also need help from compilers, PMs, O/S, language features, etc. (all have portability issues!)

• Jeff Keasler, my collaborator on RAJA development
• Esteban Pauli, "guinea pig" for trying out RAJA in Ares

The end.