The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/~keyes/talks.html
David E. Keyes

Presentation transcript:

1 The Next Four Orders of Magnitude in Parallel PDE Simulation Performance David E. Keyes Department of Mathematics & Statistics, Old Dominion University Institute for Scientific Computing Research, Lawrence Livermore National Laboratory Institute for Computer Applications in Science & Engineering, NASA Langley Research Center 15 minute presentation for first DOE NGI PI meeting

2 Background of this Presentation
Originally prepared for the Petaflops II Conference
History of the Petaflops Initiative in the USA:
- Enabling Technologies for Petaflops Computing, Feb 1994 (book by Sterling, Messina, and Smith, MIT Press, 1995)
- Applications Workshop, Aug 1994
- Architectures Workshop, Apr 1995
- Systems Software Workshop, Jun 1996
- Algorithms Workshop, Apr 1997
- Systems Operations Workshop Review, Jun 1998
- Enabling Technologies for Petaflops II, Feb 1999
- Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)

3 Weighing in at the Bottom Line
Characterization of a 1 Teraflop/s computer of today:
- about 1,000 processors of 1 Gflop/s (peak) each
- due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
How do we want to get to 1 Petaflop/s by 2007 (original goal)?
- 1,000,000 processors of 1 Gflop/s each (only wider)?
- 10,000 processors of 100 Gflop/s each (mainly deeper)?
From the point of view of PDE simulations on quasi-static Eulerian grids: either!
Caveat: dynamic grid simulations are not covered in this talk, but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere

4 Perspective
- Many "Grand Challenges" in computational science are formulated as PDEs (possibly among alternative formulations)
- However, PDE simulations historically have not performed as well as other scientific simulations
- PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to "max out" on the standard LINPACK benchmark
- The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
- However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
- The memory-centric view of computation needs further promotion

5 Application Performance History “3 orders of magnitude in 10 years” – better than Moore’s Law

6 Bell Prize Performance History

7 Plan of Presentation
- General characterization of PDE requirements
- Four sources of performance improvement to get from the current 100s of Gflop/s (for PDEs) to 1 Pflop/s
- Each illustrated with examples from computational aerodynamics, offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
- Performance presented on up to thousands of processors of the T3E and ASCI Red (for parallel aspects) and on numerous uniprocessors (for memory hierarchy aspects)

8 Purpose of Presentation
- Not to argue for specific algorithms/programming models/codes in any detail, but see talks on Newton-Krylov-Schwarz (NKS) methods under the homepage
- Provide a requirements target for designers of today's systems: typical of several contemporary successfully parallelized PDE applications, though not comprehensive of all important large-scale applications
- Speculate on a requirements target for designers of tomorrow's systems
- Promote attention to current architectural weaknesses, relative to the requirements of PDEs

9 Four Sources of Performance Improvement
- Expanded number of processors: arbitrarily large factor, through extremely careful attention to load balancing and synchronization
- More efficient use of processor cycles, and faster processor/memory elements: one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
- Algorithmic variants that are more architecture-friendly: approximately an order of magnitude, through improved locality and relaxed synchronization
- Algorithms that deliver more "science per flop": possibly a large problem-dependent factor, through adaptivity (this last does not contribute to raw flop/s!)

10 PDE Varieties and Complexities
Varieties of PDEs:
- evolution (time hyperbolic, time parabolic)
- equilibrium (elliptic, spatially hyperbolic or parabolic)
- mixed, varying by region
- mixed, of multiple type (e.g., parabolic with elliptic constraint)
Complexity parameterized by: spatial grid points Nx, temporal grid points Nt, components per point Nc, auxiliary storage per point Na, grid points in stencil Ns
Memory: M ∝ Nx · (Nc + Na + Nc² · Ns)
Work: W ∝ Nx · Nt · (Nc + Na + Nc² · Ns)
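The complexity parameterization above can be made concrete with a short sketch; the parameter values below are hypothetical, not taken from the talk:

```python
def per_point_data(nc, na, ns):
    """Data items per grid point: Nc state components, Na auxiliaries,
    and an Nc-by-Nc Jacobian block for each of Ns stencil legs."""
    return nc + na + nc * nc * ns

def memory(nx, nc, na, ns):
    """M ~ Nx * (Nc + Na + Nc^2 * Ns)."""
    return nx * per_point_data(nc, na, ns)

def work(nx, nt, nc, na, ns):
    """W ~ Nx * Nt * (Nc + Na + Nc^2 * Ns)."""
    return nt * memory(nx, nc, na, ns)

# Hypothetical 3D Euler-like setting: 100^3 grid points, 5 components,
# 2 auxiliary values per point, 7-point stencil, 100 time steps.
M = memory(100**3, nc=5, na=2, ns=7)      # 182 items per point
W = work(100**3, 100, nc=5, na=2, ns=7)   # W = Nt * M
```

Note that W/M = Nt here, consistent with the next slide's observation that work scales as problem size times the number of steps.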

11 Resource Scaling for PDEs
For 3D problems, (Memory) ∝ (Work)^(3/4):
- for equilibrium problems, work scales with problem size × no. of iteration steps; for "reasonable" implicit methods the latter is proportional to resolution in a single spatial dimension
- for evolutionary problems, work scales with problem size × no. of time steps; CFL-type arguments place the latter on the order of resolution in a single spatial dimension
The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work

12 Typical PDE Tasks
- Vertex-based loops: state vector and auxiliary vector updates
- Edge-based "stencil op" loops: residual evaluation; approximate Jacobian evaluation; Jacobian-vector product (often replaced with a matrix-free form, involving residual evaluation); intergrid transfer (coarse/fine)
- Sparse, narrow-band recurrences: approximate factorization and back substitution; smoothing
- Vector inner products and norms: orthogonalization/conjugation; convergence progress and stability checks

13 Edge-based Loop
Vertex-centered tetrahedral grid; traverse by edges:
- load vertex values
- compute intensively
- store contributions to flux at vertices
Each vertex appears in approximately 15 flux computations

14 Explicit PDE Solvers
- Concurrency is pointwise, O(N)
- Comm.-to-comp. ratio is surface-to-volume, O((N/P)^(-1/3))
- Communication range is nearest-neighbor, except for time-step computation
- Synchronization frequency is once per step, O((N/P)^(-1))
- Storage per point is low
- Load balance is straightforward for static quasi-uniform grids; grid adaptivity (together with the temporal stability limitation) makes load balance nontrivial
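The surface-to-volume claim can be checked numerically; this sketch assumes a cubic subdomain of N/P points and counts one halo value per face point:

```python
def comm_to_comp(N, P):
    """Ratio of halo-exchange traffic to interior work for a cubic
    subdomain holding N/P grid points: 6*(N/P)^(-1/3)."""
    side = (N / P) ** (1.0 / 3.0)   # subdomain points per side
    surface = 6.0 * side ** 2       # face points communicated
    volume = side ** 3              # points computed (= N/P)
    return surface / volume

# 10^6 points on 1000 processors: 10^3 points each, ratio 6/10 = 0.6
r = comm_to_comp(10**6, 1000)

# Growing N and P together leaves the ratio fixed (scalable);
# growing P alone at fixed N worsens it.
r_scaled = comm_to_comp(8 * 10**6, 8 * 1000)
```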

15 Domain-decomposed Implicit PDE Solvers
- Concurrency is pointwise, O(N), or subdomainwise, O(P)
- Comm.-to-comp. ratio still mainly surface-to-volume, O((N/P)^(-1/3))
- Communication still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
- Synchronization frequency often more than once per grid-sweep, up to the Krylov dimension, O(K·(N/P)^(-1))
- Storage per point is higher, by a factor of O(K)
- Load balance issues are the same as for explicit solvers

16 Source #1: Expanded Number of Processors
- Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together; true for most explicit or iterative implicit PDE solvers; popularized in the 1986 Karp Prize paper by Gustafson et al.
- Simple back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
- Caveat: the processor network must also be scalable (applies to protocols as well as to hardware)
- The remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 10^4 times faster)
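The Gustafson argument contrasted with Amdahl's law, as a minimal sketch (the serial fraction s = 0.01 is illustrative):

```python
def amdahl_speedup(s, P):
    """Fixed problem size: serial fraction s caps speedup near 1/s."""
    return 1.0 / (s + (1.0 - s) / P)

def gustafson_speedup(s, P):
    """Problem scaled with P: serial work is a fixed slice of the
    scaled run, so speedup grows nearly linearly in P."""
    return s + (1.0 - s) * P

s, P = 0.01, 1000
fixed = amdahl_speedup(s, P)        # ~91, capped near 1/s = 100
scaled = gustafson_speedup(s, P)    # ~990, nearly linear in P
```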

17 Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations
Given complexity estimates of the leading terms of:
- the concurrent computation (per iteration phase)
- the concurrent communication
- the synchronization frequency
And a model of the architecture including:
- internode communication (network topology and protocol, reflecting horizontal memory structure)
- on-node computation (effective performance parameters, including vertical memory structure)
One can estimate optimal concurrency and execution time, on a per-iteration basis or overall (by taking into account any granularity-dependent convergence rate): simply differentiate the time estimate in terms of (N, P) with respect to P, equate to zero, and solve for P in terms of N

18 3D Stencil Costs (per Iteration)
- grid points in each direction n; total work N = O(n³)
- processors in each direction p; total procs P = O(p³)
- memory per node requirements O(N/P)
- execution time per iteration: A·n³/p³
- grid points on a side of each processor subdomain: n/p
- neighbor communication per iteration: B·n²/p²
- cost of global reductions in each iteration: C·log p or C·p^(1/d); C includes synchronization frequency
- same dimensionless units for measuring A, B, C, e.g., the cost of a scalar floating point multiply-add

19 3D Stencil Computation Illustration: Rich Local Network, Tree-based Global Reductions
Total wall-clock time per iteration (from the terms on the previous slide): T(n, p) = A·n³/p³ + B·n²/p² + C·log p
For optimal p, set ∂T/∂p = 0, i.e., -3A·n³/p⁴ - 2B·n²/p³ + C/p = 0, or C·p³ = 3A·n³ + 2B·n²·p
Without "speeddown," p can grow with n: in the limit of large n, p_opt → (3A/C)^(1/3)·n
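The cost model and its optimal p can be checked numerically; A, B, C below are illustrative constants, and a brute-force search stands in for the differentiation:

```python
import math

def T(n, p, A=1.0, B=10.0, C=100.0):
    """Per-iteration wall-clock model: compute + halo exchange + reduction."""
    return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

def optimal_p(n):
    """Minimize T over integer p (the slide differentiates instead)."""
    return min(range(1, 8 * n), key=lambda p: T(n, p))

# p_opt grows roughly linearly with n, as the limit p ~ (3A/C)^(1/3)*n predicts
p50, p100, p200 = optimal_p(50), optimal_p(100), optimal_p(200)
```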

20 3D Stencil Computation Illustration: Rich Local Network, Tree-based Global Reductions
Optimal running time: substituting p_opt back into T gives T_opt = O(C·log n) in the limit of infinite neighbor bandwidth and zero neighbor latency (B = 0)
(This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)

21 Summary for Various Networks
- With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales linearly with problem size
- With 3D torus-based global reductions and scalable nearest-neighbor hardware: the optimal number of processors scales as the three-fourths power of problem size (almost "scalable")
- With a common network bus (heavy contention): the optimal number of processors scales as the one-fourth power of problem size (not "scalable"); bad news for conventional Beowulf clusters, but see the 2000 Bell Prize "price-performance awards"

22 1999 Bell Prize Parallel Scaling Results on ASCI Red ONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors)

23 Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing

24 Transonic “Lambda” Shock Solution

25 Fixed-size Parallel Scaling Results (Flop/s)

26 Fixed-size Parallel Scaling Results (Time in seconds)

27 Algorithm: Newton-Krylov-Schwarz
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable

28 Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)
ASCI runs: for a grid of 3.1M vertices; T3E runs: for a grid of 24.7M vertices

29 Source #2: More Efficient Use of Faster Processors
- Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
- PDEs have a periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
- Combined with "processors-in-memory" (PIM) technology for gather/scatter into densely used block transfers, and with multithreading, PDEs can approach full utilization of processor cycles
- Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints' worth of loads and stores
- One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range

30 Following the Clock 1999 Predictions from the Semiconductor Industry Association A factor of 2-3 can be expected by 2007 by following the clock alone

31 Example of Multithreading
- Same ONERA M6 wing Euler code simulation on ASCI Red
- ASCI Red contains two processors per node, sharing memory
- The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
- The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

32 PDE Workingsets
- Smallest: data for a single stencil: Ns·(Nc² + Nc + Na) (sharp)
- Largest: data for an entire subdomain: (Nx/P)·Ns·(Nc² + Nc + Na) (sharp)
- Intermediate: data for a neighborhood collection of stencils, reused as possible

33 Cache Traffic for PDEs
As successive workingsets "drop" into a level of memory, capacity misses (and, with effort, conflict misses) disappear, leaving only compulsory misses and reducing demand on main memory bandwidth

34 Strategies Based on Workingset Structure
- No performance value in memory levels larger than the subdomain
- Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
- After providing an L1 large enough for the smallest workingset (and multiple independent copies to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
- Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM, used to pack/unpack densely used cache lines during subdomain traversal

35 Costs of Greater Per-processor Efficiency
- Programming complexity of managing subdomain traversal
- Space to store gather/scatter tables in PIM
- Time to (re)build gather/scatter tables in PIM
- Memory bandwidth commensurate with the peak rates of all processors

36 Source #3: More "Architecture Friendly" Algorithms
- Algorithmic practice needs to catch up to architectural demands: several "one-time" gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
- "One-time" refers to improvements by small constant factors, nothing that scales in N or P; complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
- Caveat: remaining algorithmic performance improvements may cost extra space, or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
- Perhaps an order of magnitude of performance remains here

37 Raw Performance Improvement from Algorithms
- Spatial reorderings that improve locality: interlacing of all related grid-based data structures; ordering gridpoints and grid edges for L1/L2 reuse
- Discretizations that improve locality: higher-order methods (lead to larger, denser blocks at each point than lower-order methods); vertex-centering (for the same tetrahedral grid, leads to denser blockrows than cell-centering)
- Temporal reorderings that improve locality: block vector algorithms (reuse cached matrix blocks; vectors in the block are independent); multi-step vector algorithms (reuse cached vector blocks; vectors have sequential dependence)

38 Raw Performance Improvement from Algorithms
- Temporal reorderings that reduce synchronization penalty: less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection); less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
- Precision reductions that make bandwidth seem larger: lower-precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full-precision extensions)

39 Improvements Resulting from Locality Reordering
[Flattened table: for each processor tested (R10000 at 500 MHz, P3, P2SC 2-card and 4-card, 604e, Alpha 21164, Ultra II, Ultra II/HPC, Pentium II/LIN, Pentium II/NT, and Pentium Pro, with clocks ranging from 120 to 1200 MHz), the achieved Mflop/s and % of peak for the original ordering, interlacing only, reordering only, and the fully optimized version. Full locality optimization raises achieved % of peak from roughly 1-8% to roughly 8-25%.]

40 Improvements from Blocking Vectors
- Same ONERA M6 Euler simulation, on an SGI Origin
- One vector represents standard GMRES acceleration; four vectors is a blocked Krylov method, not yet in the "production" version
- Savings arise from not reloading matrix elements of the Jacobian for each new vector (a four-fold increase in matrix element use per load)
- The flop/s rate is effectively tripled; however, can the extra vectors be used efficiently from a numerical viewpoint?
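The reuse argument behind blocking can be illustrated with a dense stand-in for the Jacobian (the real code uses a sparse matrix, and the GMRES details are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 4
A = rng.standard_normal((n, n))   # stand-in for the Jacobian
V = rng.standard_normal((n, k))   # a block of 4 Krylov vectors

# Unblocked: A is streamed from memory once per vector (4 passes).
Y_unblocked = np.column_stack([A @ V[:, j] for j in range(k)])

# Blocked: one pass over A serves all 4 vectors, a four-fold increase
# in matrix-element use per load.
Y_blocked = A @ V

# Identical results; the win is memory traffic, not arithmetic.
```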

41 Improvements from Reduced Precision
- Same ONERA M6 Euler simulation, on an SGI Origin
- Standard (middle column) is double precision in all floating quantities
- Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor
- Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties
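A minimal sketch of the reduced-precision idea, with a diagonal preconditioner standing in for the real factorization (everything here is illustrative, not the talk's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
diag = rng.uniform(1.0, 2.0, n)        # hypothetical preconditioner diagonal

M32 = diag.astype(np.float32)          # stored form: half the memory traffic
r = rng.standard_normal(n)             # residual stays in full precision

# Promote to double before the arithmetic, as on the slide.
z = r / M32.astype(np.float64)

# The preconditioner was only approximate to begin with, so the ~1e-7
# relative perturbation from float32 storage is harmless.
rel_err = np.max(np.abs(z - r / diag) / np.abs(r / diag))
```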

42 Source #4: Algorithms Packing More Science Per Flop
- Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
- Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
- Desirable that petaflop/s machines be general purpose enough to run the "best" algorithms
- Not daunting, conceptually, but puts an enormous premium on dynamic load balancing
- An order of magnitude or more can be gained here for many problems

43 Example of Adaptive Opportunities
- Spatial discretization-based adaptivity: change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
- Fidelity-based adaptivity: change the continuous formulation to accommodate required phenomena everywhere, without enriching in regions where nothing happens
- Stiffness-based adaptivity: change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions

44 Experimental Example of Opportunity for Advanced Adaptivity
Driven cavity: Newton’s method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)

45 Status and Prospects for Advanced Adaptivity
- Metrics and procedures well developed in only a few areas: method-of-lines ODEs for stiff IBVPs and DAEs; FEA for elliptic BVPs
- Multi-model methods used in ad hoc ways in production, e.g., the Boeing TRANAIR code
- Poly-algorithmic solvers demonstrated in principle, but rarely in the "hostile" environment of high-performance computing
- Requirements for progress: management of hierarchical levels of synchronization; user specification of hierarchical priorities of different threads

46 Summary of Suggestions for PDE Petaflops
- Algorithms that deliver more "science per flop": possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
- Algorithmic variants that are more architecture-friendly: expect half an order of magnitude, through improved locality and relaxed synchronization
- More efficient use of processor cycles, and faster processor/memory: expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
- Expanded number of processors: expect two orders of magnitude, through dynamic balancing and extreme care in implementation

47 Reminder about the Source of PDEs
- Computational engineering is not about individual large-scale analyses, done fast and "thrown over the wall"
- Both "results" and their sensitivities are desired; often multiple operation points to be simulated are known a priori, rather than sequentially
- Sensitivities may be fed back into an optimization process
- Full PDE analyses may also be inner iterations in a multidisciplinary computation
- In such contexts, "petaflop/s" may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s (clearly a less daunting challenge, and one with better synchronization properties for exploiting "The Grid," than 1 analysis running at 1 Pflop/s)

48 Summary Recommendations for Architects
- Support rich (mesh-like) interprocessor connectivity and fast global reductions
- Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
- Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
- Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
- Give the user optional control of L2 cache traffic through directives
- Develop at least gather/scatter processor-in-memory capability
- Support a variety of precisions in blocked transfers, and fast precision conversions

49 Recommendations for New Benchmarks
- The recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but is explicit, structured, and relatively high-order
- A similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
- Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects: uniprocessor ("vertical") memory system performance, via a suite of problems of various grid sizes and multicomponent sizes with different interlacings and edge orderings; and parallel ("horizontal") network performance, via problems of various subdomain sizes and synchronization frequencies

50 Bibliography
- High Performance Parallel CFD, Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
- Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, 1999, in "Proceedings of Parallel CFD'99," Elsevier
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in "Parallel Solution of Partial Differential Equations," Springer
- Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press
- How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press
- On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, 1998, in "Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods," AMS

51 Related URLs
- Follow-up on this talk
- ASCI platforms
- Int. Conferences on Domain Decomposition Methods
- SIAM Conferences on Parallel Processing (Norfolk, USA, Mar 2001)
- International Conferences on Parallel CFD

52 Acknowledgments Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith Sponsors: U.S. DOE, ICASE, NASA, NSF Computer Resources: DOE, SGI-Cray Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)

