
Building the Next Generation of Parallel Applications: Co-Design Opportunities and Challenges
Michael A. Heroux, Scalable Algorithms Department, Sandia National Laboratories
Collaborators: SNL staff [B.|R.] Barrett, E. Boman, R. Brightwell, H.C. Edwards, A. Williams; SNL postdocs M. Hoemmen, S. Rajamanickam, M. Wolf (MIT Lincoln Lab); ORNL staff Chris Baker, Greg Koenig, Geoffroy Vallee.
Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Topics
1. Context: a library developer’s perspective.
2. Observations on the success of MPI.
3. Scalable manycore approaches.
4. A new approach to soft faults.
5. Exascale concerns.

LLNL Compute Center View
Source: tutorial for users of Livermore Computing’s Dawn BlueGene/P (Blaise Barney).

FTB World View (image: CiFTS project).

Math Libraries World View
Diagram labels: Math Libraries (e.g., Hypre, PETSc, SuperLU, Trilinos); Applications; Programming Models; Compilers; OS/Runtime; Network Architecture; Node Architecture; Processor Architecture.

Three Design Points
– Terascale laptop: uninode-manycore.
– Petascale deskside: multinode-manycore.
– Exascale center: manynode-manycore.

Basic Concerns: Trends, Manycore
Stein’s Law: “If a trend cannot continue, it will stop.” (Herbert Stein, chairman of the Council of Economic Advisers under Nixon and Ford.)
Trends at risk:
– Power.
– Single-core performance.
– Node count.
– Memory size & bandwidth.
– Concurrency expression in existing programming models.
– Resilience.
Figure: strong scaling potential of the “status quo” (~MPI-only) approach. Source: Edwards, SAND report, Trilinos ThreadPool Library v1.1.

Observations
MPI-only is not sufficient, except … much of the time.
Near-to-medium term:
– MPI+[OMP|TBB|Pthreads|CUDA|OCL|MPI].
– Long term, too?
– $100 wager: the first 10 exascale apps will call MPI_Init.
Concern:
– Best hybrid performance: 1 MPI rank per UMA core set.
– UMA core set size growing slowly ⇒ lots of MPI tasks.
Long term:
– Something hierarchical, global in scope.
Conjecture:
– Data-intensive apps need a non-SPMD model.
– They will develop a new programming model/environment.
– The rest of the apps will adopt it over time.
– Time span: years.

What Can We Do Right Now?
– Study why MPI (really SPMD) was successful.
– Study the new parallel landscape.
– Try to cultivate an approach similar to MPI (and others).

MPI Impressions

“MPI is often considered the ‘portable assembly language’ of parallel computing …” (Brad Chamberlain, Cray, 2000)
Other sources cited: Dan Reed, Microsoft (Workshop on the Road Map for the Revitalization of High End Computing, June 16-18, 2003); Tim Stitts, CSCS (SOS14 talk, March 2010).

MPI Reality

Tramonto WJDC Functional
– New functional.
– Bonded systems.
– 552 lines of C code.
WJDC-DFT (Wertheim, Jain, Dominik, and Chapman) theory for bonded systems. (S. Jain, A. Dominik, and W.G. Chapman. Modified interfacial statistical associating fluid theory: A perturbation density functional theory for inhomogeneous complex fluids. J. Chem. Phys., 127:244904, 2007.) Models stoichiometry constraints inherent to bonded systems.
How much MPI-specific code? See dft_fill_wjdc.c.

Figure: the MPI-specific code in dft_fill_wjdc.c.

MFIX: source term for pressure correction (source_pp_g.f)
– MPI-callable, OpenMP-enabled.
– 340 Fortran lines.
– No MPI-specific code.
– Ubiquitous OpenMP markup (red regions).
MFIX: Multiphase Flows with Interphase eXchanges.

Reasons for MPI/SPMD Success?
– Portability? Yes.
– Standardized? Yes.
– Momentum? Yes.
– Separation of many parallel and algorithmic concerns? Big yes.
Once the framework is in place:
– Sophisticated physics is added as serial code.
– Ratio of science experts to parallel experts: 10:1.
Key goal for new parallel apps: preserve this ratio.

Computational Domain Expert Writing MPI Code

Computational Domain Expert Writing Future Parallel Code

Evolving Parallel Programming Model

Parallel Programming Model: Multi-level/Multi-device
– Inter-node/inter-device (distributed) parallelism and resource management: message passing across the network of computational nodes.
– Node-local control flow (serial) on a computational node with manycore CPUs and/or GPGPU.
– Intra-node (manycore) parallelism and resource management: threading.
– Stateless, vectorizable computational kernels run on each core.
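
To make the layering concrete, here is a minimal sketch (not from the talk) of the multi-level model: MPI provides the inter-node, distributed level; OpenMP threads provide the intra-node level; and a small stateless kernel sits at the core level. All names and sizes are illustrative.

// Sketch of the multi-level model: MPI between nodes, OpenMP within a node,
// and a stateless kernel at the innermost level. Illustrative only.
#include <mpi.h>
#include <vector>
#include <cstdio>

// Innermost level: stateless kernel, safe to run on any core.
static inline double axpy_kernel(double a, double x, double y) { return a * x + y; }

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                        // inter-node (distributed) level
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1000000;                         // local portion owned by this rank
  std::vector<double> x(n, 1.0), y(n, 2.0);
  double local_dot = 0.0;

  // Intra-node level: threads share the node's memory.
  #pragma omp parallel for reduction(+ : local_dot)
  for (int i = 0; i < n; ++i) {
    y[i] = axpy_kernel(2.0, x[i], y[i]);
    local_dot += x[i] * y[i];
  }

  // Back at the distributed level: combine results across ranks.
  double global_dot = 0.0;
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0) std::printf("global dot = %g\n", global_dot);
  MPI_Finalize();
  return 0;
}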

Domain Scientist’s Parallel Palette MPI-only (SPMD) apps: –Single parallel construct. –Simultaneous execution. –Parallelism of even the messiest serial code. MapReduce: –Plug-n-Play data processing framework - 80% Google cycles. Pregel: Graph framework (other 20%) Next-generation PDE and related applications: –Internode: MPI, yes, or something like it. Composed with intranode. –Intranode: Much richer palette. More care required from programmer. What are the constructs in our new palette?

Obvious Constructs/Concerns
Parallel for:
– No loop-carried dependence.
– Rich loops.
– Use of shared memory for temporal reuse, efficient device data transfers.
Parallel reduce:
– Couple with other computations.
– Concern for reproducibility (floating-point ‘+’ is not associative).
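
A minimal sketch of these two constructs using TBB (illustrative, not code from the talk): a parallel for over independent iterations, and a parallel reduce whose result may vary slightly from run to run because the partial sums are combined in a non-deterministic order and floating-point ‘+’ is not associative.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>
#include <vector>
#include <cstdio>

int main() {
  const size_t n = 1000000;
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

  // Parallel for: no loop-carried dependence, so iterations may run on any core.
  tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
    [&](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i != r.end(); ++i) c[i] = a[i] + b[i];
    });

  // Parallel reduce: partial sums are combined pairwise; the grouping can change
  // between runs, which is the reproducibility concern noted above.
  const double sum = tbb::parallel_reduce(
    tbb::blocked_range<size_t>(0, n), 0.0,
    [&](const tbb::blocked_range<size_t>& r, double partial) {
      for (size_t i = r.begin(); i != r.end(); ++i) partial += c[i];
      return partial;
    },
    [](double x, double y) { return x + y; });

  std::printf("sum = %g\n", sum);
  return 0;
}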

Other construct: Pipeline Sequence of filters. Each filter is: –Sequential (grab element ID, enter global assembly) or –Parallel (fill element stiffness matrix). Filters executed in sequence. Programmer’s concern: –Determine (conceptually): Can filter execute in parallel? –Write filter (serial code). –Register it with the pipeline. Extensible: –New physics feature. –New filter added to pipeline.

TBB Pipeline for FE Assembly (TBB work of Alan Williams)
Figure: an FE mesh (elements E1-E4) feeds a pipeline of filters.
– Serial filter: launch element data from the mesh.
– Parallel filter: compute stiffnesses & loads (element-stiffness matrices computed in parallel).
– Several serial filters in series: assemble rows of stiffness into the global matrix (e.g., rows 0,1,2 / rows 3,4,5 / rows 6,7,8). Each assembly filter assembles certain rows from a stiffness matrix, then passes it on to the next assembly filter.
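
A rough sketch of this pattern with the current oneTBB parallel_pipeline API (which differs from the 2011-era TBB interface used in the original work). For brevity it uses a single serial assembly filter rather than several in series, and the element "physics" and matrix layout are placeholders.

#include <tbb/parallel_pipeline.h>
#include <array>
#include <vector>
#include <cstdio>

struct ElemData   { int id; };                              // element handed off by the mesh
struct ElemMatrix { int id; std::array<double, 64> k; };    // 8x8 element stiffness

int main() {
  const int num_elems = 1000;
  int next_elem = 0;
  std::vector<double> global_matrix(num_elems * 8, 0.0);    // stand-in for the global matrix

  tbb::parallel_pipeline(
      /*max_number_of_live_tokens=*/8,
      // Serial filter: launch element data from the mesh, in order.
      tbb::make_filter<void, ElemData>(tbb::filter_mode::serial_in_order,
        [&](tbb::flow_control& fc) -> ElemData {
          if (next_elem >= num_elems) { fc.stop(); return ElemData{0}; }
          return ElemData{next_elem++};
        }) &
      // Parallel filter: compute element stiffness; elements are independent.
      tbb::make_filter<ElemData, ElemMatrix>(tbb::filter_mode::parallel,
        [](ElemData e) {
          ElemMatrix m{}; m.id = e.id;
          for (double& v : m.k) v = 1.0;                    // placeholder physics
          return m;
        }) &
      // Serial filter: sum entries into the global matrix; serialized, so no locking needed.
      tbb::make_filter<ElemMatrix, void>(tbb::filter_mode::serial_in_order,
        [&](const ElemMatrix& m) {
          for (int r = 0; r < 8; ++r)
            global_matrix[m.id * 8 + r] += m.k[r * 8];      // one entry per row, as a placeholder sum-into
        }));

  std::printf("assembled %d elements\n", num_elems);
  return 0;
}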

Alternative TBB Pipeline for FE Assembly
Figure: the same FE mesh (elements E1-E4) feeds a pipeline with a single parallel assembly stage.
– Serial filter: launch element data from the mesh.
– Parallel filter: compute stiffnesses & loads (element-stiffness matrices computed in parallel).
– Parallel filter: assemble rows of stiffness into the global matrix. Each parallel call to the assembly filter assembles all rows from the stiffness matrix, using locking to avoid race conflicts with other threads.

Base-line FE Assembly Timings

Num procs | Assembly time, Intel 11.1 | Assembly time, GCC
    1     |             -             |       1.95s
    4     |           0.45s           |       0.50s
    8     |           0.24s           |       0.28s

Problem size: 80x80x80 elements. The finite-element assembly performs matrix-row sum-into operations (8 per element) and vector-entry sum-into operations. MPI-only, no threads. Linux dual quad-core workstation.

FE Assembly Timings
Table: assembly time as a function of number of threads and element-group size, with counts of matrix conflicts and vector conflicts.
Problem size: 80x80x80 elements. The finite-element assembly performs matrix-row sum-into operations (8 per element) and vector-entry sum-into operations. No MPI, only threads. Linux dual quad-core workstation.

Other construct: Thread team
– Multiple threads.
– Fast barrier.
– Shared, fast-access memory pool.
– Example: Nvidia SM (but need X local memory).
– x86 more vague, emerging more clearly in the future.
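
A minimal CPU-side sketch (not from the talk) of the thread-team pattern using an OpenMP thread team: shared data, a fast barrier between phases, and cooperative, data-driven work after the barrier. Buffer contents and sizes are illustrative.

#include <vector>
#include <cstdio>

int main() {
  const int n = 1 << 20;
  std::vector<double> shared_buf(n, 1.0);   // shared, fast-access data for the team
  double team_sum = 0.0;

  #pragma omp parallel
  {
    // Phase 1: each thread updates its slice of the shared buffer.
    // 'nowait' drops the loop's implicit barrier so the explicit one below is the sync point.
    #pragma omp for nowait
    for (int i = 0; i < n; ++i) shared_buf[i] *= 2.0;

    // Fast barrier: no thread starts phase 2 until every slice is updated.
    #pragma omp barrier

    // Phase 2: cooperative work on the now-consistent shared data.
    #pragma omp for reduction(+ : team_sum)
    for (int i = 0; i < n; ++i) team_sum += shared_buf[i];
  }

  std::printf("team_sum = %g (expected %g)\n", team_sum, 2.0 * n);
  return 0;
}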

Preconditioners for Scalable Multicore Systems
Observe: iteration count increases with the number of subdomains.
With scalable threaded smoothers (LU, ILU, Gauss-Seidel):
– Solve with fewer, larger subdomains.
– Better kernel scaling (threads vs. MPI processes).
– Better convergence, more robust.
Exascale potential: tiled, pipelined implementation.
Three efforts:
– Level-scheduled triangular sweeps (ILU solve, Gauss-Seidel).
– Decomposition by partitioning.
– Multithreaded direct factorization.
Figure: strong scaling of Charon on TLCC (P. Lin, J. Shadid, 2009); table of iterations vs. MPI tasks and threads, plotted against the number of MPI ranks.
Reference: “Factors Impacting Performance of Multithreaded Sparse Triangular Solve,” Michael M. Wolf, Michael A. Heroux, and Erik G. Boman, VECPAR.

Thread Team Advantages
Qualitatively better algorithm:
– Threaded triangular solve scales.
– Fewer MPI ranks means fewer iterations, better robustness.
Exploits:
– Shared data.
– Fast barrier.
– Data-driven parallelism.

Finite Elements/Volumes/Differences and parallel node constructs Parallel for, reduce, pipeline: –Sufficient for vast majority of node level computation. –Supports: Complex modeling expression. Vanilla parallelism. –Must be “stencil-aware” for temporal locality. –If well done: OpenMP vs. TBB vs. CUDA vs. XYZ doesn’t matter much. Refactoring costs are manageable. Thread team: –Complicated. –Requires substantial parallel algorithm knowledge. –Useful in solvers.

Programming Today for Tomorrow’s Machines

Programming Today for Tomorrow’s Machines
Parallel programming in the small:
– Focus: writing sequential code fragments, wrapped in simple thread-safe functions.
– Programmer skills: 10% pattern/framework experts (domain-aware), 90% domain experts (pattern-aware).
The languages needed are already here:
– Maybe CAF will make it; others too.
– DSLs are useful (math libs are essentially DSLs).
– Exception: large-scale data-intensive graph?

FE/FV/FD Parallel Programming Today

for ((i,j,k) in points/elements on subdomain) {
  compute coefficients for point (i,j,k)
  inject into global matrix
}

Notes:
User in charge of:
– Writing physics code.
– Iteration space traversal.
– Storage association.
Pattern/framework/runtime in charge of:
– SPMD execution.

FE/FV/FD Parallel Programming Tomorrow

pipeline {
  filter(addPhysicsLayer1);
  ...
  filter(addPhysicsLayerN);
  filter(injectIntoGlobalMatrix);
}

Notes:
User in charge of:
– Writing physics code (filter).
– Registering the filter with the framework.
Pattern/framework/runtime in charge of:
– SPMD execution.
– Iteration space traversal (sensitive to temporal locality).
– Filter execution scheduling.
– Storage association.
Better assignment of responsibility (in general).

Co-Design Opportunities
Better runtime systems:
– MPI still wins, a lot.
– Work and data placement & migration.
Storage association:
– Portability requires math and storage indexing to be distinct.
MPI shared memory extensions:
– Allocate a shared buffer for the ranks on a node (see the sketch after this slide).
Repeatable results:
– Need some way for non-experts to trust results.
– Debug mode.
Data regularity:
– We need to exploit it as much as possible.
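
The shared-buffer idea later became a standard capability in MPI-3 shared-memory windows. The sketch below uses those standard MPI-3 calls (not an API proposed in the talk) to give every rank on a node direct load/store access to one buffer.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Communicator containing only the ranks that share this node's memory.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int node_rank = 0;
  MPI_Comm_rank(node_comm, &node_rank);

  // Rank 0 on the node allocates the shared buffer; the others contribute size 0.
  const MPI_Aint bytes = (node_rank == 0) ? 1024 * sizeof(double) : 0;
  double* my_seg = nullptr;
  MPI_Win win;
  MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                          node_comm, &my_seg, &win);

  // Every node-local rank can get a direct pointer to rank 0's segment.
  MPI_Aint seg_size = 0; int disp_unit = 0; double* buf = nullptr;
  MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &buf);

  if (node_rank == 0) buf[0] = 42.0;   // written once per node ...
  MPI_Win_fence(0, win);               // ... then visible to all node-local ranks
  std::printf("node rank %d sees %g\n", node_rank, buf[0]);

  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}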

Resilient Algorithms: A little reliability, please.

Soft Error Futures
Soft error handling is a legitimate algorithms issue. The programming model and runtime environment play a role.

Every calculation matters
Small PDE problem: ILUT/GMRES. Correct result: 35 iters, 343M FLOPS. Two examples of a single bad op:

Description                                          | Iters | FLOPS | Recursive residual error | Solution error
All correct calcs                                    |  35   | 343M  | 4.6e-15                  | 1.0e-6
Iter=2, y[1] += 1.0 (SpMV incorrect, ortho subspace) |   -   |   -   | 6.7e-15                  | 3.7e+3
Q[1][1] += 1.0 (non-ortho subspace)                  |  N/C  |  N/A  | 7.7e-02                  | 5.9e+5

Solvers:
– 50-90% of total app operations.
– Soft errors most likely in the solver.
Need new algorithms for soft errors:
– Well-conditioned with respect to errors.
– Decay proportional to the number of errors.
– Minimal impact when no errors.

Soft Error Resilience
– New programming model elements: SW-enabled, highly reliable data storage, paths, and compute regions.
– Idea: new algorithms with minimal usage of high reliability.
– First new algorithm: FT-GMRES. Resilient to soft errors. Outer solve: highly reliable. Inner solve: “bulk” reliability.
– General approach applies to many algorithms.
M. Heroux, M. Hoemmen
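
To illustrate only the selective-reliability idea (this is not the FT-GMRES algorithm), the toy sketch below wraps an unreliable inner solve inside a reliable outer loop that recomputes the true residual from trusted data, so an injected soft error costs extra outer iterations rather than correctness. The matrix, inner solver, and error injection are all made up for the illustration.

#include <vector>
#include <cmath>
#include <cstdio>

using Vec = std::vector<double>;

// Toy operator: tridiagonal matrix, 4 on the diagonal, -1 off the diagonal.
Vec apply_A(const Vec& x) {
  const int n = static_cast<int>(x.size());
  Vec y(n);
  for (int i = 0; i < n; ++i)
    y[i] = 4.0 * x[i] - (i > 0 ? x[i - 1] : 0.0) - (i + 1 < n ? x[i + 1] : 0.0);
  return y;
}

// Inner solve at "bulk" reliability: Gauss-Seidel sweeps, with a soft error
// injected into the first call to mimic silent data corruption.
Vec unreliable_inner_solve(const Vec& r) {
  static int calls = 0;
  const int n = static_cast<int>(r.size());
  Vec z(n, 0.0);
  for (int sweep = 0; sweep < 10; ++sweep)
    for (int i = 0; i < n; ++i)
      z[i] = (r[i] + (i > 0 ? z[i - 1] : 0.0) + (i + 1 < n ? z[i + 1] : 0.0)) / 4.0;
  if (++calls == 1) z[0] += 1.0e3;   // injected soft error
  return z;
}

int main() {
  const int n = 100;
  Vec x(n, 0.0), b(n, 1.0);
  // Reliable outer loop: recompute the true residual each pass, so a corrupted
  // inner result only perturbs the iterate and is corrected by later passes.
  for (int k = 0; k < 20; ++k) {
    const Vec Ax = apply_A(x);
    Vec r(n);
    double nrm = 0.0;
    for (int i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; nrm += r[i] * r[i]; }
    nrm = std::sqrt(nrm);
    std::printf("outer %2d: ||r|| = %.3e\n", k, nrm);
    if (nrm < 1.0e-8) break;
    const Vec z = unreliable_inner_solve(r);   // may be corrupted
    for (int i = 0; i < n; ++i) x[i] += z[i];  // reliable correction step
  }
  return 0;
}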

FT-GMRES Results

Exascale Concerns

If FLOPS are free, why are we making them cheaper?

Easy things should be easy, hard things should be possible. Why are we making easy things easier and hard things impossible?

Explicit/SIMT vs. Implicit/Recursive Algorithms
Figure: time to solution vs. problem difficulty (easy to hard).
– Implicit/recursive: implicit formulations, multilevel preconditioners.
– Explicit/SIMT: explicit formulations, Jacobi preconditioners.

Problems with Accelerator-based Exascale
Global SIMT is the only approach that really works well on GPUs, but:
– Many of our most robust algorithms have no apparent SIMT replacement.
– Working on it, but there is a lot to do, and fundamental issues are at play.
SMs might be useful to break the SIMT mold, but:
– The local store is way too small.
– There is no market reason to make it bigger.
Could consider SIMT approaches, but:
– The broader apps community is moving the other way: climate is looking at implicit formulations; embedded UQ leads to coupled formulations.
Exascale apps at risk?
– Isolation from the broader app trends.
– Accelerators are good, but in combination with a strong multicore CPU.

Summary
Building the next generation of parallel applications requires enabling domain scientists to:
– Write sophisticated methods.
– Do so with serial fragments.
– Have those fragments hoisted into a scalable, resilient framework.
Resilient algorithms will mitigate soft-error impact.
Success of manycore will require breaking out of global-SIMT-only approaches.