Presentation transcript: "Building the Next Generation of Parallel Applications: Co-Design Opportunities and Challenges," Michael A. Heroux, Scalable Algorithms Department, Sandia National Laboratories.

1 Building the Next Generation of Parallel Applications: Co-Design Opportunities and Challenges
Michael A. Heroux, Scalable Algorithms Department, Sandia National Laboratories.
Collaborators: SNL staff: [B.|R.] Barrett, E. Boman, R. Brightwell, H.C. Edwards, A. Williams. SNL postdocs: M. Hoemmen, S. Rajamanickam, M. Wolf (MIT Lincoln Lab). ORNL staff: Chris Baker, Greg Koenig, Geoffroy Vallee.
Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 Topics
1. Context: a library developer's perspective.
2. Observations on the success of MPI.
3. Scalable manycore approaches.
4. A new approach to soft faults.
5. Exascale concerns.

3 LLNL Compute Center View
Image: https://computing.llnl.gov/tutorials/bgp/images/nodeSoftwareStacks.gif (from the tutorial for users of Livermore Computing's Dawn BlueGene/P, by Blaise Barney).

4 FTB World View
CiFTS project: http://www.mcs.anl.gov/research/cifts/
Image: http://nowlab.cse.ohio-state.edu/projects/ftb-ib/FTB-IB_files/ftb.jpg

5 Math Libraries World View
[Figure: math libraries (e.g., Hypre, PETSc, SuperLU, Trilinos) shown against the full stack: applications, programming models, compilers, OS/runtime, network architecture, node architecture, processor architecture.]

6 Three Design Points
– Terascale laptop: uninode-manycore.
– Petascale deskside: multinode-manycore.
– Exascale center: manynode-manycore.

7 Basic Concerns: Trends, Manycore
Stein's Law: If a trend cannot continue, it will stop. (Herbert Stein, chairman of the Council of Economic Advisers under Nixon and Ford.)
Trends at risk:
– Power.
– Single-core performance.
– Node count.
– Memory size & BW.
– Concurrency expression in existing programming models.
– Resilience.
[Figure labels: "Status Quo" ~ MPI-only; Strong Scaling Potential. Source: H.C. Edwards, SAND2009-8196, Trilinos ThreadPool Library v1.1.]

8 Observations
MPI-only is not sufficient, except … much of the time.
Near-to-medium term:
– MPI+[OMP|TBB|Pthreads|CUDA|OCL|MPI]
– Long term, too?
– $100 wager: The first 10 exascale apps will call MPI_Init.
Concern:
– Best hybrid performance: 1 MPI rank per UMA core set.
– UMA core set size growing slowly → lots of MPI tasks.
Long term:
– Something hierarchical, global in scope.
Conjecture:
– Data-intensive apps need a non-SPMD model.
– Will develop a new programming model/environment.
– Rest of apps will adopt over time.
– Time span: 10-20 years.
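A minimal sketch of the MPI+X hybrid described above, using MPI+OpenMP in C++ (illustrative only, not code from the talk): one MPI rank per UMA core set, OpenMP threads doing the node-local work, and MPI_Init_thread requesting funneled threading.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
  // Ask for FUNNELED threading: only the main thread makes MPI calls.
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Intra-node parallelism: OpenMP threads share this rank's memory.
  double local = 0.0;
  #pragma omp parallel for reduction(+ : local)
  for (int i = 0; i < 1000000; ++i) {
    local += 1.0e-6;  // stand-in for real node-local work
  }

  // Inter-node parallelism: MPI combines the per-rank results.
  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("ranks=%d threads=%d sum=%g\n", nranks, omp_get_max_threads(), global);
  MPI_Finalize();
  return 0;
}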

9 What Can We Do Right Now?
– Study why MPI (really SPMD) was successful.
– Study the new parallel landscape.
– Try to cultivate an approach similar to MPI (and others).

10 MPI Impressions

11 Dan Reed, Microsoft: Workshop on the Road Map for the Revitalization of High End Computing, June 16-18, 2003. Tim Stitts, CSCS: SOS14 talk, March 2010. "MPI is often considered the 'portable assembly language' of parallel computing, …" (Brad Chamberlain, Cray, 2000).

12 MPI Reality

13 Tramonto WJDC Functional
– New functional; bonded systems. 552 lines of C code.
– WJDC-DFT (Wertheim, Jain, Dominik, and Chapman) theory for bonded systems. (S. Jain, A. Dominik, and W.G. Chapman. Modified interfacial statistical associating fluid theory: A perturbation density functional theory for inhomogeneous complex fluids. J. Chem. Phys., 127:244904, 2007.)
– Models stoichiometry constraints inherent to bonded systems.
– How much MPI-specific code? File: dft_fill_wjdc.c

14 dft_fill_wjdc.c: MPI-specific code

15 MFIX: Source Term for Pressure Correction
– MPI-callable, OpenMP-enabled.
– 340 Fortran lines.
– No MPI-specific code.
– Ubiquitous OpenMP markup (the red regions in the listing).
– MFIX: Multiphase Flows with Interphase eXchanges (https://www.mfix.org/). File: source_pp_g.f

16 Reasons for MPI/SPMD Success?
– Portability? Yes.
– Standardized? Yes.
– Momentum? Yes.
– Separation of many parallel and algorithmic concerns? Big yes.
Once the framework is in place:
– Sophisticated physics is added as serial code.
– Ratio of science experts to parallel experts: 10:1.
Key goal for new parallel apps: preserve this ratio.

17 Computational Domain Expert Writing MPI Code

18 Computational Domain Expert Writing Future Parallel Code

19 Evolving Parallel Programming Model

20 Parallel Programming Model: Multi-level/Multi-device
[Figure: a network of computational nodes, with inter-node/inter-device (distributed) parallelism and resource management handled by message passing; each computational node has manycore CPUs and/or GPGPU, with intra-node (manycore) parallelism and resource management handled by threading and node-local control flow kept serial; stateless, vectorizable computational kernels run on each core.]

21 Domain Scientist's Parallel Palette
MPI-only (SPMD) apps:
– Single parallel construct.
– Simultaneous execution.
– Parallelism of even the messiest serial code.
MapReduce:
– Plug-n-play data-processing framework (80% of Google cycles).
Pregel: graph framework (the other 20%).
Next-generation PDE and related applications:
– Internode: MPI, yes, or something like it. Composed with intranode.
– Intranode: much richer palette. More care required from the programmer.
What are the constructs in our new palette?

22 Obvious Constructs/Concerns
Parallel for:
– No loop-carried dependence.
– Rich loops.
– Use of shared memory for temporal reuse, efficient device data transfers.
Parallel reduce:
– Couple with other computations.
– Concern for reproducibility ('+' not associative).
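A hedged sketch of the two constructs above in OpenMP/C++ (the slide itself is model-agnostic; the function names here are illustrative assumptions):

#include <cstddef>
#include <vector>

// Parallel for: no loop-carried dependence; every iteration is independent.
void scale(std::vector<double>& y, const std::vector<double>& x, double a) {
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)x.size(); ++i)
    y[i] = a * x[i];
}

// Parallel reduce: floating-point '+' is not associative, so the result can
// vary with thread count and schedule; this is the reproducibility concern.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+ : sum)
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)x.size(); ++i)
    sum += x[i] * y[i];
  return sum;
}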

23 Other construct: Pipeline
– Sequence of filters.
– Each filter is either sequential (grab element ID, enter global assembly) or parallel (fill element stiffness matrix).
– Filters executed in sequence.
Programmer's concern:
– Determine (conceptually): can the filter execute in parallel?
– Write the filter (serial code).
– Register it with the pipeline.
Extensible:
– New physics feature.
– New filter added to the pipeline.
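A hedged sketch of the filter/pipeline construct, written against oneTBB's parallel_pipeline; the stage bodies are illustrative stand-ins for "grab element ID", "fill element stiffness", and "enter global assembly", not Sandia code:

#include <tbb/parallel_pipeline.h>
#include <cstdio>

int main() {
  const int num_elems = 1000;
  int next_elem = 0;
  double global_sum = 0.0;  // stands in for the global matrix

  tbb::parallel_pipeline(
      /*max_number_of_live_tokens=*/16,
      // Serial filter: grab the next element ID (cheap, ordered).
      tbb::make_filter<void, int>(tbb::filter_mode::serial_in_order,
          [&](tbb::flow_control& fc) -> int {
            if (next_elem >= num_elems) { fc.stop(); return 0; }
            return next_elem++;
          }) &
      // Parallel filter: independent per-element work ("fill element stiffness").
      tbb::make_filter<int, double>(tbb::filter_mode::parallel,
          [](int elem) -> double {
            double v = 0.0;
            for (int q = 1; q <= 8; ++q) v += 1.0 / (elem + q);
            return v;
          }) &
      // Serial filter: touches shared state ("enter global assembly").
      tbb::make_filter<double, void>(tbb::filter_mode::serial_in_order,
          [&](double v) { global_sum += v; }));

  std::printf("assembled value: %g\n", global_sum);
  return 0;
}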

24 TBB Pipeline for FE assembly
[Figure: an FE mesh (nodes 0-8, elements E1-E4) and the global matrix. Element-stiffness matrices are computed in parallel. Pipeline: launch element data from the mesh (serial filter) → compute stiffnesses & loads (parallel filter) → assemble rows of stiffness into the global matrix (several serial filters in series: "Assemble Rows 0,1,2", "Assemble Rows 3,4,5", "Assemble Rows 6,7,8"). Each assembly filter assembles certain rows from a stiffness, then passes it on to the next assembly filter.]
TBB work of Alan Williams.

25 Alternative TBB Pipeline for FE assembly
[Figure: the same FE mesh and global matrix. Pipeline: launch element data from the mesh (serial filter) → compute stiffnesses & loads (parallel filter) → assemble rows of stiffness into the global matrix (parallel filter). Each parallel call to the assembly filter assembles all rows from the stiffness, using locking to avoid race conflicts with other threads.]
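A hedged sketch of the locking idea in this alternative pipeline: the parallel assembly stage sums all of an element's rows into the global matrix, guarding each global row with its own lock. The toy storage and names below are assumptions; the real code referenced on these slides (A. Williams) is not shown here.

#include <cstddef>
#include <mutex>
#include <vector>

// Toy global matrix with one lock per row; sum_into_row is what a parallel
// assembly call would invoke for each row of its element stiffness.
struct GlobalMatrix {
  std::vector<std::vector<double>> rows;  // dense-row storage, for illustration
  std::vector<std::mutex> row_locks;      // one lock per global row

  GlobalMatrix(std::size_t nrows, std::size_t ncols)
      : rows(nrows, std::vector<double>(ncols, 0.0)), row_locks(nrows) {}

  void sum_into_row(std::size_t row, const std::vector<double>& vals) {
    std::lock_guard<std::mutex> guard(row_locks[row]);  // avoid race conflicts
    for (std::size_t j = 0; j < vals.size(); ++j) rows[row][j] += vals[j];
  }
};

An alternative to per-row locks would be atomic sum-into operations; either way, the single assembly stage can then run as a parallel filter.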

26 Baseline FE Assembly Timings
Num procs   Assembly time (Intel 11.1)   Assembly time (GCC 4.4.4)
1           1.80s                        1.95s
4           0.45s                        0.50s
8           0.24s                        0.28s
Problem size: 80x80x80 == 512000 elements, 531441 matrix rows. The finite-element assembly performs 4096000 matrix-row sum-into operations (8 per element) and 4096000 vector-entry sum-into operations. MPI-only, no threads. Linux dual quad-core workstation.

27 FE Assembly Timings
Num threads   Elem group size   Matrix conflicts   Vector conflicts   Assembly time
1             1                 0                  0                  2.16s
1             4                 0                  0                  2.09s
1             8                 0                  0                  2.08s
4             1                 9591               7959               1.01s
4             4                 7938               25                 0.74s
4             8                 3180               4                  0.69s
8             1                 64536              1306               0.87s
8             4                 5892               49                 0.45s
8             8                 1618               1                  0.38s
Problem size: 80x80x80 == 512000 elements, 531441 matrix rows. The finite-element assembly performs 4096000 matrix-row sum-into operations (8 per element) and 4096000 vector-entry sum-into operations. No MPI, only threads. Linux dual quad-core workstation.

28 Other construct: Thread team
– Multiple threads.
– Fast barrier.
– Shared, fast-access memory pool.
– Example: Nvidia SM (but we need 10-100X the local memory).
– The x86 analogue is more vague, but emerging more clearly in future designs.
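A hedged sketch of a thread team on a multicore CPU, using C++20 std::jthread and std::barrier as stand-ins for the fast barrier and the shared, fast-access memory pool (on an Nvidia SM the analogues would be shared memory and __syncthreads()):

#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int team_size = 4;
  std::vector<double> scratch(team_size, 0.0);  // shared, fast-access pool
  std::barrier sync(team_size);                 // team-wide barrier

  auto member = [&](int tid) {
    scratch[tid] = tid + 1.0;   // phase 1: each member fills its slot
    sync.arrive_and_wait();     // fast barrier between phases
    if (tid == 0) {             // phase 2: one member combines the results
      double sum = 0.0;
      for (double v : scratch) sum += v;
      std::printf("team sum = %g\n", sum);
    }
  };

  std::vector<std::jthread> team;
  for (int t = 0; t < team_size; ++t) team.emplace_back(member, t);
  return 0;  // jthreads join on destruction
}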

29 Preconditioners for Scalable Multicore Systems
Observe: iteration count increases with the number of subdomains. With scalable threaded smoothers (LU, ILU, Gauss-Seidel):
– Solve with fewer, larger subdomains.
– Better kernel scaling (threads vs. MPI processes).
– Better convergence, more robust.
Exascale potential: tiled, pipelined implementation.
Three efforts:
– Level-scheduled triangular sweeps (ILU solve, Gauss-Seidel).
– Decomposition by partitioning.
– Multithreaded direct factorization.
Strong scaling of Charon on TLCC (P. Lin, J. Shadid, 2009):
MPI ranks   Threads   Iterations
4096        1         153
2048        2         129
1024        4         125
512         8         117
256         16        117
128         32        111
Factors Impacting Performance of Multithreaded Sparse Triangular Solve, Michael M. Wolf, Michael A. Heroux, and Erik G. Boman, VECPAR 2010.
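A hedged sketch of the level-scheduled triangular sweep mentioned above (see the Wolf, Heroux, and Boman citation): rows of a lower-triangular CSR matrix are grouped into levels with no mutual dependencies, and each level is processed in parallel. This is a generic illustration that assumes the diagonal is stored last in each row; it is not the Trilinos implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

// Solve L*x = b for lower-triangular L in CSR (row_ptr, col_idx, val), with
// the diagonal entry stored last in each row.
void level_scheduled_trsv(int n, const std::vector<int>& row_ptr,
                          const std::vector<int>& col_idx,
                          const std::vector<double>& val,
                          const std::vector<double>& b, std::vector<double>& x) {
  // 1. Level of each row: one more than the deepest row it depends on.
  std::vector<int> level(n, 0);
  int num_levels = 1;
  for (int i = 0; i < n; ++i) {
    int lev = 0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1] - 1; ++k)  // skip the diagonal
      lev = std::max(lev, level[col_idx[k]] + 1);
    level[i] = lev;
    num_levels = std::max(num_levels, lev + 1);
  }
  // 2. Bucket rows by level.
  std::vector<std::vector<int>> rows_in_level(num_levels);
  for (int i = 0; i < n; ++i) rows_in_level[level[i]].push_back(i);
  // 3. Sweep the levels; rows within one level have no mutual dependencies.
  x.assign(n, 0.0);
  for (int lev = 0; lev < num_levels; ++lev) {
    const std::vector<int>& rows = rows_in_level[lev];
    #pragma omp parallel for
    for (std::ptrdiff_t r = 0; r < (std::ptrdiff_t)rows.size(); ++r) {
      const int i = rows[r];
      double sum = b[i];
      for (int k = row_ptr[i]; k < row_ptr[i + 1] - 1; ++k)
        sum -= val[k] * x[col_idx[k]];
      x[i] = sum / val[row_ptr[i + 1] - 1];  // divide by the diagonal
    }
  }
}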

30 Thread Team Advantages
Qualitatively better algorithm:
– Threaded triangular solve scales.
– Fewer MPI ranks means fewer iterations, better robustness.
Exploits:
– Shared data.
– Fast barrier.
– Data-driven parallelism.

31 Finite Elements/Volumes/Differences and Parallel Node Constructs
Parallel for, reduce, pipeline:
– Sufficient for the vast majority of node-level computation.
– Supports complex modeling expression and vanilla parallelism.
– Must be "stencil-aware" for temporal locality.
– If well done: OpenMP vs. TBB vs. CUDA vs. XYZ doesn't matter much; refactoring costs are manageable.
Thread team:
– Complicated.
– Requires substantial parallel algorithm knowledge.
– Useful in solvers.

32 Programming Today for Tomorrow's Machines

33 Programming Today for Tomorrow's Machines
Parallel programming in the small:
– Focus: writing sequential code fragments, wrapped in simple thread-safe functions.
– Programmer skills: 10% pattern/framework experts (domain-aware); 90% domain experts (pattern-aware).
Languages needed are already here:
– Maybe CAF will make it; others too.
– DSLs are useful (math libs are essentially DSLs).
– Exception: large-scale data-intensive graph?
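A hedged sketch of that division of labor (all names are illustrative): the domain expert writes a small, stateless, thread-safe fragment, and the pattern/framework expert owns the parallel traversal that invokes it.

#include <cstddef>

// Domain expert: a pure, stateless physics fragment (no hidden shared state),
// so it is trivially thread-safe.
inline double equation_of_state(double density, double energy) {
  const double gamma = 1.4;
  return (gamma - 1.0) * density * energy;  // pressure
}

// Pattern/framework expert: owns the parallel traversal. Swapping OpenMP for
// TBB, CUDA, etc. would not touch the fragment above.
void compute_pressure(std::size_t n, const double* density,
                      const double* energy, double* pressure) {
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i)
    pressure[i] = equation_of_state(density[i], energy[i]);
}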

34 FE/FV/FD Parallel Programming Today
for ((i,j,k) in points/elements on subdomain) {
  compute coefficients for point (i,j,k)
  inject into global matrix
}
Notes:
User in charge of:
– Writing physics code.
– Iteration space traversal.
– Storage association.
Pattern/framework/runtime in charge of:
– SPMD execution.

35 FE/FV/FD Parallel Programming Tomorrow
pipeline {
  filter(addPhysicsLayer1);
  ...
  filter(addPhysicsLayern);
  filter(injectIntoGlobalMatrix);
}
Notes:
User in charge of:
– Writing physics code (filter).
– Registering filter with framework.
Pattern/framework/runtime in charge of:
– SPMD execution.
– Iteration space traversal (sensitive to temporal locality).
– Filter execution scheduling.
– Storage association.
Better assignment of responsibility (in general).

36 Co-Design Opportunities
Better runtime systems:
– MPI still wins, a lot.
– Work, data placement & migration.
Storage association:
– Portability requires math and storage indexing to be distinct.
MPI shared-memory extensions:
– Allocate a shared buffer for ranks on a node.
Repeatable results:
– Need some way for the non-expert to trust results.
– Debug mode.
Data regularity:
– We need to exploit it as much as possible.
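The "shared buffer for ranks on a node" request was later standardized in MPI-3, after this talk, as shared-memory windows. A minimal sketch, with error handling omitted:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Group the ranks that share a node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int node_rank = 0, node_size = 0;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  // Each rank contributes one double to a node-wide shared buffer.
  MPI_Win win;
  double* my_slice = nullptr;
  MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                          node_comm, &my_slice, &win);
  *my_slice = 10.0 * node_rank;
  MPI_Win_fence(0, win);  // synchronize before reading neighbors' slices

  // Any rank on the node can address another rank's slice directly.
  if (node_rank == 0 && node_size > 1) {
    MPI_Aint size = 0;
    int disp_unit = 0;
    double* base = nullptr;
    MPI_Win_shared_query(win, 1, &size, &disp_unit, &base);
    std::printf("rank 1 on this node wrote %g\n", *base);
  }

  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}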

37 Resilient Algorithms: A little reliability, please.

38 Soft Error Futures
– Soft error handling: a legitimate algorithms issue.
– Programming model and runtime environment play a role.

39 Every Calculation Matters
Small PDE problem: ILUT/GMRES. Correct result: 35 iterations, 343M FLOPS. Two examples of a single bad op:
Description                                            Iters   FLOPS   Recursive residual error   Solution error
All correct calcs                                      35      343M    4.6e-15                    1.0e-6
Iter=2, y[1] += 1.0 (SpMV incorrect; ortho subspace)   35      343M    6.7e-15                    3.7e+3
Q[1][1] += 1.0 (non-ortho subspace)                    N/C     N/A     7.7e-02                    5.9e+5
Solvers:
– 50-90% of total app operations.
– Soft errors most likely in the solver.
Need new algorithms for soft errors:
– Well-conditioned with respect to errors.
– Decay proportional to the number of errors.
– Minimal impact when no errors.
Soft Error Resilience
– New programming model elements: SW-enabled, highly reliable data storage, data paths, and compute regions.
– Idea: new algorithms with minimal usage of high reliability.
– First new algorithm: FT-GMRES. Resilient to soft errors. Outer solve: highly reliable. Inner solve: "bulk" reliability.
– General approach applies to many algorithms.
M. Heroux, M. Hoemmen
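A hedged sketch of the selective-reliability idea only; this is a plain iterative-refinement loop, not FT-GMRES. The assumption is that the outer residual, update, and stopping test run on reliable hardware and storage, while the inner approximate solve runs at "bulk" reliability and may be silently corrupted; the reliable outer stopping test keeps a corrupted inner result from being reported as a converged answer.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Outer loop: assumed to run reliably. unreliable_solve is the "bulk
// reliability" inner solve and may return a corrupted correction.
Vec selective_reliability_solve(const std::function<Vec(const Vec&)>& apply_A,
                                const std::function<Vec(const Vec&)>& unreliable_solve,
                                const Vec& b, int max_outer, double tol) {
  Vec x(b.size(), 0.0);
  for (int k = 0; k < max_outer; ++k) {
    // Reliable: r = b - A*x and the stopping test.
    Vec Ax = apply_A(x);
    Vec r(b.size());
    double rnorm2 = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
      r[i] = b[i] - Ax[i];
      rnorm2 += r[i] * r[i];
    }
    if (std::sqrt(rnorm2) < tol) break;  // trustworthy convergence check
    // Unreliable: approximate correction; a soft error here costs extra outer
    // iterations but cannot make the loop accept a wrong answer as converged.
    Vec dx = unreliable_solve(r);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += dx[i];
  }
  return x;
}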

40 FT-GMRES Results

41 Exascale Concerns

42 If FLOPS are free, why are we making them cheaper?

43 Easy things should be easy, hard things should be possible. Why are we making easy things easier and hard things impossible?

44 Explicit/SIMT vs. Implicit/Recursive Algorithms
[Figure: time to solution vs. problem difficulty (easy to hard), comparing two algorithm families. Explicit/SIMT: explicit formulations, Jacobi preconditioning. Implicit/Recursive: implicit formulations, multilevel preconditioning.]

45 Problems with Accelerator-Based Exascale
Global SIMT is the only approach that really works well on GPUs, but:
– Many of our most robust algorithms have no apparent SIMT replacement.
– Working on it, but there is a lot to do, and fundamental issues are at play.
SMs might be useful to break the SIMT mold, but:
– Local store is way too small.
– No market reason to make it bigger.
Could consider SIMT approaches, but the broader apps community is moving the other way:
– Climate: looking at implicit formulations.
– Embedded UQ: coupled formulations.
Exascale apps at risk?
– Isolation from the broader app trends.
– Accelerators good, but in combination with a strong multicore CPU.

46 Summary
Building the next generation of parallel applications requires enabling domain scientists to:
– Write sophisticated methods.
– Do so with serial fragments.
– Have those fragments hoisted into a scalable, resilient framework.
Resilient algorithms will mitigate soft error impact.
Success on manycore will require breaking out of global-SIMT-only approaches.

