The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/~keyes/talks.html
David E. Keyes
Department of Mathematics & Statistics, Old Dominion University
Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
Institute for Computer Applications in Science & Engineering, NASA Langley Research Center
15-minute presentation for the first DOE NGI PI meeting

Background of this Presentation
- Originally prepared for the Petaflops II Conference
- History of the Petaflops Initiative in the USA:
  - Enabling Technologies for Petaflops Computing, Feb 1994 (book by Sterling, Messina, and Smith, MIT Press, 1995)
  - Applications Workshop, Aug 1994
  - Architectures Workshop, Apr 1995
  - Systems Software Workshop, Jun 1996
  - Algorithms Workshop, Apr 1997
  - Systems Operations Workshop Review, Jun 1998
  - Enabling Technologies for Petaflops II, Feb 1999
  - Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)

Weighing in at the Bottom Line
- Characterization of a 1 Teraflop/s computer of today: about 1,000 processors of 1 Gflop/s (peak) each; due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
- How do we want to get to 1 Petaflop/s by 2007 (the original goal)?
  - 1,000,000 processors of 1 Gflop/s each (only wider)?
  - 10,000 processors of 100 Gflop/s each (mainly deeper)?
- From the point of view of PDE simulations on quasi-static Eulerian grids: either!
- Caveat: dynamic grid simulations are not covered in this talk, but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere

Perspective
- Many "Grand Challenges" in computational science are formulated as PDEs (possibly among alternative formulations)
- However, PDE simulations historically have not performed as well as other scientific simulations
- PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to "max out" on the standard LINPACK benchmark
- The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
- However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
- A memory-centric view of computation needs further promotion

Application Performance History “3 orders of magnitude in 10 years” – better than Moore’s Law

Bell Prize Performance History

Plan of Presentation
- General characterization of PDE requirements
- Four sources of performance improvement to get from the current 100's of Gflop/s (for PDEs) to 1 Pflop/s
- Each illustrated with examples from computational aerodynamics, offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
- Performance presented on up to thousands of processors of the T3E and ASCI Red (for parallel aspects) and on numerous uniprocessors (for memory hierarchy aspects)

Purpose of Presentation
- Not to argue for specific algorithms/programming models/codes in any detail (but see talks on Newton-Krylov-Schwarz (NKS) methods under the homepage above)
- To provide a requirements target for designers of today's systems: typical of several contemporary, successfully parallelized PDE applications, though not comprehensive of all important large-scale applications
- To speculate on a requirements target for designers of tomorrow's systems
- To promote attention to current architectural weaknesses, relative to the requirements of PDEs

Four Sources of Performance Improvement
- Expanded number of processors: arbitrarily large factor, through extremely careful attention to load balancing and synchronization
- More efficient use of processor cycles, and faster processor/memory elements: one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
- Algorithmic variants that are more architecture-friendly: approximately an order of magnitude, through improved locality and relaxed synchronization
- Algorithms that deliver more "science per flop": possibly a large problem-dependent factor, through adaptivity (this last does not contribute to raw flop/s!)

PDE Varieties and Complexities
- Varieties of PDEs:
  - evolution (time hyperbolic, time parabolic)
  - equilibrium (elliptic, spatially hyperbolic or parabolic)
  - mixed, varying by region
  - mixed, of multiple type (e.g., parabolic with elliptic constraint)
- Complexity parameterized by:
  - spatial grid points, Nx
  - temporal grid points, Nt
  - components per point, Nc
  - auxiliary storage per point, Na
  - grid points in stencil, Ns
- Memory: M ∝ Nx (Nc + Na + Nc^2 Ns)
- Work: W ∝ Nx Nt (Nc + Na + Nc^2 Ns)
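For concreteness, a small back-of-envelope calculator for these two estimates follows. The parameter values (a 512^3 grid, a 5-component Euler-like system, 4 auxiliary words per point, a 15-point stencil, and Nt ~ n iteration/time steps) are illustrative assumptions, not figures from the talk.

def pde_complexity(Nx, Nt, Nc, Na, Ns):
    per_point = Nc + Na + Nc**2 * Ns                 # state + auxiliaries + Jacobian block row
    return Nx * per_point, Nx * Nt * per_point       # (memory in words, work in point-updates)

n = 512
M, W = pde_complexity(Nx=n**3, Nt=n, Nc=5, Na=4, Ns=15)
print("memory ~ %.1e words (%.1f TB at 8 B/word), work ~ %.1e" % (M, M * 8 / 1e12, W))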

Resource Scaling for PDEs
- For 3D problems, (Memory) ∝ (Work)^(3/4):
  - for equilibrium problems, work scales with problem size × no. of iteration steps; for "reasonable" implicit methods, the latter is proportional to the resolution in a single spatial dimension
  - for evolutionary problems, work scales with problem size × no. of time steps; CFL-type arguments place the latter on the order of the resolution in a single spatial dimension
  - in either case W ~ Nx × Nx^(1/3) = Nx^(4/3), so M ~ Nx ~ W^(3/4)
- The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
- If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work

Typical PDE Tasks
- Vertex-based loops: state vector and auxiliary vector updates
- Edge-based "stencil op" loops: residual evaluation, approximate Jacobian evaluation, Jacobian-vector product (often replaced with a matrix-free form, involving residual evaluation), intergrid transfer (coarse/fine)
- Sparse, narrow-band recurrences: approximate factorization and back substitution, smoothing
- Vector inner products and norms: orthogonalization/conjugation, convergence progress and stability checks

Edge-based Loop
- Vertex-centered tetrahedral grid
- Traverse by edges: load vertex values, compute intensively, store contributions to flux at vertices
- Each vertex appears in approximately 15 flux computations
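A schematic of this gather-compute-scatter traversal, in Python for readability; the connectivity, the interlaced state layout, and the numerical_flux placeholder are invented for illustration and are not the actual flux routine.

import numpy as np

nv, nc = 1000, 5                                     # vertices and components per vertex
edges = np.array([(i % nv, (7*i + 3) % nv) for i in range(4*nv)])  # made-up connectivity
q = np.random.rand(nv, nc)                           # interlaced state, one row per vertex
res = np.zeros_like(q)                               # residual to be accumulated

def numerical_flux(ql, qr):                          # placeholder for the compute-intensive kernel
    return 0.5 * (ql + qr)

for a, b in edges:                                   # gather - compute - scatter
    f = numerical_flux(q[a], q[b])
    res[a] += f
    res[b] -= f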

Explicit PDE Solvers
- Concurrency is pointwise, O(N)
- Communication-to-computation ratio is surface-to-volume, O((N/P)^(-1/3))
- Communication range is nearest-neighbor, except for time-step computation
- Synchronization frequency is once per step, O((N/P)^(-1))
- Storage per point is low
- Load balance is straightforward for static quasi-uniform grids
- Grid adaptivity (together with the temporal stability limitation) makes load balance nontrivial

Domain-decomposed Implicit PDE Solvers
- Concurrency is pointwise, O(N), or subdomainwise, O(P)
- Communication-to-computation ratio is still mainly surface-to-volume, O((N/P)^(-1/3))
- Communication is still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
- Synchronization frequency is often more than once per grid-sweep, up to the Krylov dimension, O(K(N/P)^(-1))
- Storage per point is higher, by a factor of O(K)
- Load balance issues are the same as for explicit solvers

Source #1: Expanded Number of Processors
- Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together; true for most explicit or iterative implicit PDE solvers; popularized in the 1986 Karp Prize paper by Gustafson et al.
- Simple, back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
- Caveat: the processor network must also be scalable (this applies to protocols as well as to hardware)
- The remaining four orders of magnitude could be met by hardware expansion alone (but this does not mean that fixed-size applications of today would run 10^4 times faster)
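The point about defeating Amdahl's law can be illustrated with the textbook fixed-size (Amdahl) and scaled (Gustafson) speedup formulas; the serial fraction alpha = 0.01 below is an arbitrary illustrative value, not from the talk.

def amdahl(P, alpha):        # fixed-size speedup
    return 1.0 / (alpha + (1.0 - alpha) / P)

def gustafson(P, alpha):     # scaled speedup: problem size grows with P
    return P - alpha * (P - 1)

for P in (100, 1000, 10000):
    print(P, round(amdahl(P, 0.01), 1), round(gustafson(P, 0.01), 1))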

Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations
- Given complexity estimates of the leading terms of: the concurrent computation (per iteration phase), the concurrent communication, and the synchronization frequency
- And a model of the architecture including: internode communication (network topology and protocol, reflecting horizontal memory structure) and on-node computation (effective performance parameters, including vertical memory structure)
- One can estimate optimal concurrency and execution time, on a per-iteration basis or overall (by taking into account any granularity-dependent convergence rate): simply differentiate the time estimate, in terms of (N, P), with respect to P, equate to zero, and solve for P in terms of N

3D Stencil Costs (per Iteration)
- grid points in each direction n, total work N = O(n^3)
- processors in each direction p, total processors P = O(p^3)
- memory per node requirement O(N/P)
- execution time per iteration: A n^3/p^3
- grid points on a side of each processor subdomain: n/p
- neighbor communication per iteration: B n^2/p^2
- cost of global reductions in each iteration: C log p or C p^(1/d) (C includes synchronization frequency)
- same dimensionless units for measuring A, B, C, e.g., the cost of a scalar floating-point multiply-add

3D Stencil Computation Illustration: rich local network, tree-based global reductions
- Total wall-clock time per iteration: T(n, p) = A n^3/p^3 + B n^2/p^2 + C log p
- For optimal p, set dT/dp = 0: C p^3 = 3A n^3 + 2B n^2 p, whose positive root is p_opt = k n, with k the root of C k^3 - 2B k - 3A = 0 (or k = (3A/C)^(1/3) when the neighbor term is negligible)
- Thus, without "speeddown," p can grow with n in the limit as p ∝ n
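A quick numerical check of this claim, using illustrative cost constants A, B, C (not measured values): the brute-force optimal per-dimension processor count grows linearly with n under this model.

import math

A, B, C = 1.0, 5.0, 50.0                 # illustrative constants, not measured values

def T(n, p):
    return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

for n in (128, 256, 512, 1024):
    p_opt = min(range(2, 2 * n), key=lambda p: T(n, p))
    # the ratio p_opt/n tends to the root k of C*k^3 - 2*B*k - 3*A = 0 (about 0.56 here)
    print(n, p_opt, round(p_opt / n, 3))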

3D Stencil Computation Illustration: rich local network, tree-based global reductions (continued)
- Optimal running time: substituting p_opt = k n gives T(n, p_opt) = A/k^3 + B/k^2 + C log(k n)
- In the limit of infinite neighbor bandwidth and zero neighbor latency (B -> 0), k = (3A/C)^(1/3) and T(n, p_opt) = C/3 + C log((3A/C)^(1/3) n), which grows only logarithmically in n
- (This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)

Summary for Various Networks
- With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales linearly with problem size
- With 3D torus-based global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales as the three-fourths power of problem size (almost "scalable")
- With a common network bus (heavy contention): the optimal number of processors scales as the one-fourth power of problem size (not "scalable"); bad news for conventional Beowulf clusters, but see the 2000 Bell Prize "price-performance awards"
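These three cases can be reproduced numerically from the per-iteration model, under assumed cost forms for the network (tree reduction C log p, 3D-torus reduction ~ C p, and a shared bus on which all neighbor exchanges serialize, ~ B n^2 p); the constants are again illustrative, not measurements. The fitted exponents come out near 1, 3/4, and 1/4 in the problem size N = n^3.

import math

A, B, C = 1.0, 5.0, 50.0                 # illustrative constants

def t_tree(n, p):  return A*n**3/p**3 + B*n**2/p**2 + C*math.log(p)   # log-depth reduction
def t_torus(n, p): return A*n**3/p**3 + B*n**2/p**2 + C*p             # 3D-torus reduction
def t_bus(n, p):   return A*n**3/p**3 + B*n**2*p                      # serialized bus traffic

def p_opt(model, n):
    return min(range(2, 2 * n), key=lambda p: model(n, p))

n1, n2 = 4096, 65536
for name, model in (("tree", t_tree), ("torus", t_torus), ("bus", t_bus)):
    e = math.log(p_opt(model, n2) / p_opt(model, n1)) / math.log(n2 / n1)
    print("%-5s  p_opt ~ n^%.2f  =>  P = p^3 ~ N^%.2f" % (name, e, e))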

1999 Bell Prize Parallel Scaling Results on ASCI Red ONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors)

Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing

Transonic “Lambda” Shock Solution

Fixed-size Parallel Scaling Results (Flop/s)

Fixed-size Parallel Scaling Results (Time in seconds)

Algorithm: Newton-Krylov-Schwarz
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable
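For readers unfamiliar with the method, here is a minimal 1D sketch of the Newton-Krylov-Schwarz structure: an outer inexact Newton loop, GMRES as the Krylov accelerator, and a one-level overlapping additive Schwarz preconditioner. The test problem -u'' + u^3 = 1, the grid size, the subdomain count, and the overlap are all made-up illustrative choices; this is not the FUN3D/PETSc implementation reported in the talk.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

m, nblk, ovl = 200, 4, 4                 # interior grid points, subdomains, overlap (assumed)
h = 1.0 / (m + 1)

def F(u):                                # nonlinear residual of -u'' + u^3 = 1, u = 0 at both ends
    lap = (2*u - np.roll(u, 1) - np.roll(u, -1)) / h**2
    lap[0]  = (2*u[0]  - u[1])  / h**2
    lap[-1] = (2*u[-1] - u[-2]) / h**2
    return lap + u**3 - 1.0

def J(u):                                # analytic Jacobian: tridiagonal Laplacian + 3u^2 on the diagonal
    main = 2.0 / h**2 + 3.0 * u**2
    off = -np.ones(m - 1) / h**2
    return sp.diags([off, main, off], [-1, 0, 1], format='csr')

def schwarz_prec(A):                     # one-level additive Schwarz: sum of overlapping local solves
    size = m // nblk
    pieces = []
    for blk in range(nblk):
        lo, hi = max(0, blk*size - ovl), min(m, (blk+1)*size + ovl)
        idx = np.arange(lo, hi)
        pieces.append((idx, spla.splu(A[idx, :][:, idx].tocsc())))
    def apply(r):
        z = np.zeros_like(r)
        for idx, lu in pieces:
            z[idx] += lu.solve(r[idx])
        return z
    return spla.LinearOperator((m, m), matvec=apply, dtype=np.float64)

u = np.zeros(m)
for k in range(10):                      # outer (inexact) Newton loop
    r = F(u)
    if np.linalg.norm(r) < 1e-8:
        break
    A = J(u)
    du, info = spla.gmres(A, -r, M=schwarz_prec(A))   # inner Schwarz-preconditioned GMRES
    u += du                              # full step; a production code would globalize (line search, etc.)
print(k, np.linalg.norm(F(u)))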

Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis) ASCI runs: for grid of 3.1M vertices; T3E runs: for grid of 24.7M vertices

Source #2: More Efficient Use of Faster Processors
- Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
- PDEs have a periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
- Combined with "processors-in-memory" (PIM) technology for gather/scatter into densely used block transfers, and with multithreading, PDEs can approach full utilization of processor cycles
- Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints' worth of loads and stores
- One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range

Following the Clock 1999 Predictions from the Semiconductor Industry Association http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm A factor of 2-3 can be expected by 2007 by following the clock alone

Example of Multithreading
- Same ONERA M6 wing Euler code simulation on ASCI Red
- ASCI Red contains two processors per node, sharing memory
- The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
- The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

PDE Workingsets
- Smallest: data for a single stencil: Ns (Nc^2 + Nc + Na) (sharp)
- Largest: data for an entire subdomain: (Nx/P) Ns (Nc^2 + Nc + Na) (sharp)
- Intermediate: data for a neighborhood collection of stencils, reused as possible
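Plugging in the same illustrative parameters as before (Nc = 5, Na = 4, Ns = 15, Nx = 512^3), plus an assumed P = 1000 and 8-byte words, shows where the two sharp workingsets land relative to typical cache sizes.

Nc, Na, Ns, Nx, P, word = 5, 4, 15, 512**3, 1000, 8    # illustrative values

smallest = Ns * (Nc**2 + Nc + Na) * word               # one stencil's worth of data
largest = (Nx // P) * Ns * (Nc**2 + Nc + Na) * word    # one subdomain's worth of data
print("smallest workingset: %d bytes (fits easily in L1)" % smallest)
print("largest workingset:  %.0f MB (lives in main memory)" % (largest / 1e6))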

Cache Traffic for PDEs As successive workingsets "drop" into a level of memory, capacity misses (and, with effort, conflict misses) disappear, leaving only compulsory misses and reducing demand on main memory bandwidth

Strategies Based on Workingset Structure
- No performance value in memory levels larger than the subdomain
- Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
- After providing an L1 large enough for the smallest workingset (and multiple independent copies of it, to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
- Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM, and used to pack/unpack densely used cache lines during subdomain traversal

Costs of Greater Per-processor Efficiency
- Programming complexity of managing subdomain traversal
- Space to store gather/scatter tables in PIM
- Time to (re)build gather/scatter tables in PIM
- Memory bandwidth commensurate with the peak rates of all processors

Source #3: More "Architecture Friendly" Algorithms
- Algorithmic practice needs to catch up to architectural demands: several "one-time" gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
- "One-time" refers to improvements by small constant factors, nothing that scales in N or P; complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
- Caveat: remaining algorithmic performance improvements may cost extra space, or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
- Perhaps an order of magnitude of performance remains here

Raw Performance Improvement from Algorithms
- Spatial reorderings that improve locality:
  - interlacing of all related grid-based data structures
  - ordering gridpoints and grid edges for L1/L2 reuse
- Discretizations that improve locality:
  - higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
  - vertex-centering (for the same tetrahedral grid, leads to denser block rows than cell-centering)
- Temporal reorderings that improve locality:
  - block vector algorithms (reuse cached matrix blocks; vectors in a block are independent)
  - multi-step vector algorithms (reuse cached vector blocks; vectors have a sequential dependence)
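A small sketch of the first two spatial reorderings, using NumPy/SciPy purely for illustration: the random sparsity pattern stands in for real tetrahedral-grid connectivity, and reverse Cuthill-McKee stands in for whatever ordering a production code would actually use.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

nx, nc = 10000, 5                        # vertices and components per vertex (e.g., 3D Euler)

# Non-interlaced: one long array per component; the 5 unknowns of a vertex are
# nx entries apart, so each stencil update touches widely separated cache lines.
rho, u, v, w, E = (np.zeros(nx) for _ in range(nc))

# Interlaced: shape (nx, nc) in C order, so a vertex's unknowns share cache lines.
q = np.zeros((nx, nc))

# Vertex/edge reordering: build a vertex adjacency pattern (random here, as a
# stand-in for tetrahedral-grid connectivity) and renumber with reverse
# Cuthill-McKee so that most edges connect vertices that are close in memory.
A = sp.random(nx, nx, density=1e-3, format='csr')
A = (A + A.T).tocsr()                    # symmetrize the connectivity pattern
perm = reverse_cuthill_mckee(A)
q_reordered = q[perm]                    # traverse vertices in the new order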

Raw Performance Improvement from Algorithms (continued)
- Temporal reorderings that reduce synchronization penalty:
  - less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
  - less global methods that reduce synchronization range, by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
- Precision reductions that make bandwidth seem larger:
  - lower-precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full-precision extensions)

Improvements Resulting from Locality Reordering
[Table: sustained Mflop/s and percentage of peak for the original ordering, with interlacing only, with reordering only, and fully optimized, across a range of processors (Pentium Pro, Pentium II under NT and Linux, Ultra II, Ultra II/HPC, Alpha 21164, PowerPC 604e, P2SC 2-card and 4-card, P3, R10000). The original orderings sustain only a few percent of peak (roughly 1-8%); the fully optimized orderings reach roughly 8-25% of peak.]

Improvements from Blocking Vectors
- Same ONERA M6 Euler simulation, on an SGI Origin
- One vector represents standard GMRES acceleration
- Four vectors is a blocked Krylov method, not yet in the "production" version
- Savings arise from not reloading the matrix elements of the Jacobian for each new vector (a four-fold increase in matrix element use per load)
- The flop/s rate is effectively tripled; however, can the extra vectors be used efficiently from a numerical viewpoint?
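A toy timing experiment conveys why blocking helps: multiplying a sparse matrix by k vectors at once streams the matrix through memory once per block instead of once per vector. The matrix size, density, and k = 4 below are arbitrary choices, and the measured ratio will vary by platform.

import time
import numpy as np
import scipy.sparse as sp

n, k = 400000, 4                          # arbitrary sizes
A = sp.random(n, n, density=2e-5, format='csr') + sp.eye(n, format='csr')
X = np.random.rand(n, k)

t0 = time.perf_counter()
for j in range(k):                        # one SpMV per vector: A is streamed k times
    _ = A @ X[:, j]
t1 = time.perf_counter()
_ = A @ X                                 # blocked: A is streamed once for all k vectors
t2 = time.perf_counter()
print("separate: %.3f s   blocked: %.3f s" % (t1 - t0, t2 - t1))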

Improvements from Reduced Precision
- Same ONERA M6 Euler simulation, on an SGI Origin
- Standard (middle column) is double precision in all floating-point quantities
- Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor
- Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties
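A minimal sketch of the same idea in SciPy terms: the preconditioner's stored coefficients are kept in single precision, halving their memory traffic, while the arithmetic is promoted to double precision when they are applied. Here the "preconditioner" is just a few Jacobi sweeps on a stand-in tridiagonal Jacobian, not the actual preconditioner used in the talk.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50000
main = 2.0 + np.random.rand(n)                 # stand-in diagonally dominant Jacobian
off = -np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format='csr')
b = np.random.rand(n)

M32 = A.astype(np.float32)                     # preconditioner coefficients stored in single precision
Dinv = 1.0 / A.diagonal()                      # double-precision reciprocal diagonal

def apply_prec(r):
    # a few Jacobi sweeps; M32 @ z promotes to float64 on the fly, so only the
    # *stored* coefficients are single precision
    z = Dinv * r
    for _ in range(3):
        z = z + Dinv * (r - M32 @ z)
    return z

M = spla.LinearOperator((n, n), matvec=apply_prec, dtype=np.float64)
x, info = spla.gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))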

Source #4: Algorithms Packing More Science Per Flop
- Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
- Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
- It is desirable that petaflop/s machines be general purpose enough to run the "best" algorithms
- Not daunting, conceptually, but it puts an enormous premium on dynamic load balancing
- An order of magnitude or more can be gained here for many problems

Example of Adaptive Opportunities
- Spatial discretization-based adaptivity: change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
- Fidelity-based adaptivity: change the continuous formulation to accommodate required phenomena everywhere, without enriching in regions where nothing happens
- Stiffness-based adaptivity: change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions

Experimental Example of Opportunity for Advanced Adaptivity Driven cavity: Newton’s method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)

Status and Prospects for Advanced Adaptivity
- Metrics and procedures are well developed in only a few areas: method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
- Multi-model methods are used in ad hoc ways in production, e.g., the Boeing TRANAIR code
- Poly-algorithmic solvers have been demonstrated in principle, but rarely in the "hostile" environment of high-performance computing
- Requirements for progress: management of hierarchical levels of synchronization; user specification of hierarchical priorities of different threads

Summary of Suggestions for PDE Petaflops
- Algorithms that deliver more "science per flop": possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
- Algorithmic variants that are more architecture-friendly: expect half an order of magnitude, through improved locality and relaxed synchronization
- More efficient use of processor cycles, and faster processor/memory: expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
- Expanded number of processors: expect two orders of magnitude, through dynamic balancing and extreme care in implementation

Reminder about the Source of PDEs
- Computational engineering is not about individual large-scale analyses, done fast and "thrown over the wall"
- Both "results" and their sensitivities are desired; often multiple operation points to be simulated are known a priori, rather than sequentially
- Sensitivities may be fed back into an optimization process
- Full PDE analyses may also be inner iterations in a multidisciplinary computation
- In such contexts, "petaflop/s" may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s (clearly a less daunting challenge, and one with better synchronization properties for exploiting "The Grid," than 1 analysis running at 1 Pflop/s)

Summary Recommendations for Architects
- Support rich (mesh-like) interprocessor connectivity and fast global reductions
- Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
- Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
- Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
- Give the user optional control of L2 cache traffic through directives
- Develop at least gather/scatter processor-in-memory capability
- Support a variety of precisions in blocked transfers, and fast precision conversions

Recommendations for New Benchmarks
- The recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but it is explicit, structured, and relatively high-order
- A similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
- Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects:
  - uniprocessor ("vertical") memory system performance: a suite of problems of various grid sizes and multicomponent sizes, with different interlacings and edge orderings
  - parallel ("horizontal") network performance: problems of various subdomain sizes and synchronization frequencies

Bibliography
- High Performance Parallel CFD, Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
- Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, 1999, in "Proceedings of Parallel CFD'99," Elsevier
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in "Parallel Solution of Partial Differential Equations," Springer, pp. 247-278
- Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press, pp. 513-520
- How Scalable is Domain Decomposition in Practice, Keyes, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press, pp. 286-297
- On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, 1998, in "Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods," AMS, pp. 311-319

Related URLs
- Follow-up on this talk: http://www.mcs.anl.gov/petsc-fun3d
- ASCI platforms: http://www.llnl.gov/asci/platforms
- International Conferences on Domain Decomposition Methods: http://www.ddm.org
- SIAM Conferences on Parallel Processing: http://www.siam.org (Norfolk, USA, 12-14 Mar 2001)
- International Conferences on Parallel CFD: http://www.parcfd.org

Acknowledgments
- Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
- Sponsors: U.S. DOE, ICASE, NASA, NSF
- Computer resources: DOE, SGI-Cray
- Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)