
1 Algorithms and Software for Terascale Computation of PDEs and PDE-constrained Optimization
David E. Keyes
Center for Computational Science, Old Dominion University
Institute for Scientific Computing Research (ISCR), Lawrence Livermore National Laboratory
University of Kentucky CCS Seminar, 6 November 2002

2 Plan of presentation
- Imperative of "optimal" algorithms for terascale computing
- Basic domain decomposition and multilevel algorithmic concepts
- Illustration of solver performance on ASCI platforms
- Terascale Optimal PDE Simulations (TOPS) software project of the U.S. DOE
- Conclusions and outlook

3 Related URLs
- Personal homepage: papers, talks, etc. http://www.math.odu.edu/~keyes
- SciDAC initiative http://www.science.doe.gov/scidac
- TOPS software project http://www.math.odu.edu/~keyes/scidac
- PETSc software project http://www.mcs.anl.gov/petsc
- Hypre software project http://www.llnl.gov/CASC/hypre
Slides from the 14-hour Peking University CS&E short course with Bill Gropp (August 2002) are now on-line.

4 Bibliography
- Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys.
- Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200
- High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
- Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci. 1900:1-17, Springer
- Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
- Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in "Parallel Solution of Partial Differential Equations," Springer, pp. 247-278
- How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press, pp. 286-297

11 Terascale simulation has been "sold"
Scientific simulation is an important complement to experiment in many areas:
- Environment: global climate, contaminant transport
- Lasers & Energy: combustion, ICF
- Engineering: crash testing, aerodynamics
- Biology: drug design, genomics
- Applied Physics: radiation transport, supernovae
Simulation is attractive where experiments are controversial, dangerous, prohibited or impossible, difficult to instrument, or expensive.
However, simulation is far from proven! To meet expectations, we need to handle problems of multiple physical scales.

12 Large platforms have been provided
- The ASCI program of the U.S. DOE has a roadmap to reach 100 Tflop/s by 2006 (www.llnl.gov/asci/platforms)
- [Roadmap chart, CY 1997-2006, sites Sandia, Los Alamos, Livermore: Red 1+ Tflop / 0.5 TB, Blue 3+ Tflop / 1.5 TB, White 10+ Tflop / 4 TB, with 30+ Tflop / 10 TB, 50+ Tflop / 25 TB, and 100+ Tflop / 30 TB systems planned; each platform moves through plan, develop, and use phases]

13 NSF's 13.6 TF TeraGrid coming on line
- TeraGrid sites: NCSA, SDSC, Caltech, Argonne (www.teragrid.org)
- [Site diagram: NCSA/PACI 8 TF, 240 TB; SDSC 4.1 TF, 225 TB; HPSS and UniTree archival storage; external networks and site resources]
c/o I. Foster

14 Bird's-eye view of the Earth Simulator system
[Machine-hall schematic, roughly 65 m x 50 m: processor node (PN) cabinets, interconnection network (IN) cabinets over a double floor for IN cables, disks, cartridge tape library system, power supply system, air conditioning system]

15 Building platforms is the "easy" part
- Algorithms must be
  - highly concurrent and straightforward to load balance
  - latency tolerant
  - cache friendly (temporal and spatial locality of reference)
  - highly scalable (in the sense of convergence)
- Goal for algorithmic scalability: fill up the memory of arbitrarily large machines while preserving constant* running times with respect to a proportionally smaller problem on one processor
- Domain-decomposed multilevel methods are "natural" for all of these
- Domain decomposition is also "natural" for software engineering
* or logarithmically growing

16 Algorithmic requirements from architecture
- Must run on physically distributed memory units connected by a message-passing network, each serving one or more processors with multiple levels of cache [figure: Cray T3E]
- "Horizontal" aspects: message passing, shared-memory threads
- "Vertical" aspects: register blocking, cache blocking, prefetching

17 Keyword: "Optimal"
- Convergence rate nearly independent of discretization parameters
  - Multilevel schemes for rapid linear convergence of linear problems
  - Newton-like schemes for quadratic convergence of nonlinear problems
- Convergence rate as independent as possible of physical parameters
  - Continuation schemes
  - Physics-based preconditioning
[Plot: time to solution vs. problem size (increasing with number of processors) for a steel/rubber composite; the unscalable solver's time grows, while parallel multigrid stays nearly flat out to 1000 processors; c/o M. Adams, Berkeley-Sandia]
The solver is a key part, but not the only part, of the simulation that needs to be scalable.

18 Why optimal algorithms?
- The more powerful the computer, the greater the importance of optimality
- Example:
  - Suppose Alg1 solves a problem in time CN^2, where N is the input size
  - Suppose Alg2 solves the same problem in time CN
  - Suppose the machine on which Alg1 and Alg2 have been parallelized has 10,000 processors
- In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger
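The arithmetic behind this example can be checked directly. A tiny sketch (mine, not from the talk), assuming perfect parallel speedup on P processors:

    def max_problem_growth(exponent, procs):
        """Largest factor k by which the problem can grow while the parallel time
        C*(k*N)**exponent / procs stays equal to the serial time C*N**exponent."""
        # C*(k*N)**e / procs == C*N**e   =>   k**e == procs   =>   k == procs**(1/e)
        return procs ** (1.0 / exponent)

    P = 10_000
    print(max_problem_growth(2, P))   # Alg1, O(N^2): 100x larger problem in constant time
    print(max_problem_growth(1, P))   # Alg2, O(N):   10,000x larger problem in constant time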

19 Why optimal?, cont.
- Alternatively, filling the machine's memory, Alg1 requires 100X time, whereas Alg2 runs in constant time
- Is 10,000 processors a reasonable expectation?
  - Yes, we have it today (ASCI White)!
- Could computational scientists really use 10,000X scaling?
  - Of course; we are approximating the continuum
  - A grid for weather prediction allows points every 1 km versus every 100 km on the earth's surface
  - In 2D, 10,000X disappears fast; in 3D, even faster
- However, these machines are expensive (the Earth Simulator is $0.5B, plus ongoing operating costs), and optimal algorithms are the only algorithms we can afford to run on them

20 Decomposition strategies for Lu = f in Ω
- Operator decomposition
- Function space decomposition
- Domain decomposition
Consider, e.g., the implicitly discretized parabolic case.

21 Operator decomposition
- Consider ADI
- The iteration matrix consists of four sequential ("multiplicative") substeps per timestep
  - two sparse matrix-vector multiplies
  - two sets of unidirectional bandsolves
- Parallelism within each substep
- But global data exchanges between bandsolve substeps
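As an illustrative sketch (not from the talk), here is what the substep structure looks like for one Peaceman-Rachford ADI timestep of the 2D heat equation; the equation, boundary conditions, and scaling r = dt/(2*h^2) are my assumptions, and the point is the alternation of explicit mat-vecs with unidirectional tridiagonal bandsolves:

    import numpy as np
    from scipy.linalg import solve_banded

    def second_difference(V, axis):
        """Apply the 1D second-difference operator (Dirichlet boundaries) along `axis`."""
        W = -2.0 * V
        if axis == 0:
            W[1:, :] += V[:-1, :]
            W[:-1, :] += V[1:, :]
        else:
            W[:, 1:] += V[:, :-1]
            W[:, :-1] += V[:, 1:]
        return W

    def adi_step(U, r):
        """One Peaceman-Rachford ADI step for u_t = u_xx + u_yy, with r = dt/(2*h^2):
        two sparse matrix-vector multiplies and two sets of unidirectional bandsolves."""
        n = U.shape[0]
        ab = np.zeros((3, n))        # tridiagonal factor (I - r*D2) in banded storage
        ab[0, 1:] = -r               # superdiagonal
        ab[1, :] = 1.0 + 2.0 * r     # diagonal
        ab[2, :-1] = -r              # subdiagonal
        # Substep 1: explicit in y (mat-vec), implicit in x (bandsolves along columns)
        rhs = U + r * second_difference(U, axis=1)
        Uhalf = solve_banded((1, 1), ab, rhs)
        # Substep 2: explicit in x (mat-vec), implicit in y (bandsolves along rows)
        rhs = Uhalf + r * second_difference(Uhalf, axis=0)
        return solve_banded((1, 1), ab, rhs.T).T

    # usage: one step on a 32x32 field
    U = np.random.rand(32, 32)
    U1 = adi_step(U, r=0.25)

In a row-distributed parallel setting, the second set of bandsolves is what forces the all-to-all (transpose) exchange mentioned on the comparison slide.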

22 Function space decomposition
- Consider a spectral Galerkin method
- It yields a system of ordinary differential equations in the expansion coefficients
- Perhaps the resulting matrices are diagonal
- Perfect parallelism across the spectral index
- But global data exchanges are needed to transform back to physical variables at each step
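A minimal sketch (mine, not from the slides) of this idea for the 1D periodic heat equation: each Fourier mode decouples, so the spectral update is perfectly parallel, while the FFT and inverse FFT are the all-to-all transforms back to physical variables.

    import numpy as np

    def spectral_heat_step(u, dt, nu=1.0):
        """One exact-in-time Fourier-Galerkin step for u_t = nu*u_xx on a
        2*pi-periodic grid. The mode-by-mode update is a diagonal ODE system;
        the FFTs are the global data exchanges."""
        n = u.size
        k = np.fft.fftfreq(n, d=1.0 / n)     # integer wavenumbers
        a = np.fft.fft(u)                    # transform to spectral coefficients
        a *= np.exp(-nu * k**2 * dt)         # perfectly parallel, mode by mode
        return np.fft.ifft(a).real           # transform back to physical variables

    # usage: smooth out a spike
    u = np.zeros(64); u[32] = 1.0
    for _ in range(10):
        u = spectral_heat_step(u, dt=1e-3)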

23 Domain decomposition
- Consider restriction and extension operators R_i, R_i^T for the subdomains, and R_0, R_0^T for a possible coarse grid
- Replace the discretized system Au = f with the preconditioned system B^-1 Au = B^-1 f
- Solve by a Krylov method, e.g., CG
- Matrix-vector multiplies with B^-1 A require
  - parallelism on each subdomain
  - nearest-neighbor exchanges, global reductions
  - a possible small global system (not needed for the parabolic case)

24 Comparison
- Operator decomposition (ADI)
  - natural row-based assignment requires all-to-all, bulk data exchanges in each step (for the transpose)
- Function space decomposition (Fourier)
  - natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for the transform)
- Domain decomposition (Schwarz)
  - natural domain-based assignment requires local (nearest-neighbor) data exchanges, global reductions, and an optional small global problem

25 Theoretical scaling of domain decomposition (for three common network topologies)
- With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest-neighbor interconnects:
  - optimal number of processors scales linearly with problem size ("scalable"; assumes one subdomain per processor)
- With power-law-time (3D torus-based) global reductions and scalable nearest-neighbor interconnects:
  - optimal number of processors scales as the three-fourths power of problem size ("almost scalable")
- With a linear-time (common bus) network:
  - optimal number of processors scales as the one-fourth power of problem size (*not* scalable)
  - bad news for conventional Beowulf clusters, but see the 2000 & 2001 Bell Prize "price-performance awards" using multiple commodity NICs per Beowulf node!

26 Three basic concepts
- Iterative correction
- Schwarz preconditioning
- Schur preconditioning
Some "advanced" concepts
- Polynomial combinations of Schwarz projections
- Schwarz-Schur combinations
  - Schwarz on the Schur-reduced system
  - Schwarz inside the Schur-reduced system (optimization)
- Nonlinear Schwarz (new!)

27 Iterative correction
- The most basic idea in iterative methods: evaluate the residual accurately, but solve approximately, u <- u + B(f - Au), where B is an approximate inverse to A
- A sequence of complementary solves can be used, e.g., first with B_1 and then with B_2; the combined error propagator is (I - B_2 A)(I - B_1 A)
- Optimal polynomials of BA lead to various preconditioned Krylov methods
- Scale recurrence, e.g., with a coarse-level solve complementing a fine-level smoother, leads to multilevel methods
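A minimal sketch of iterative correction (mine, not from the slides), with a Jacobi-style diagonal scaling standing in for the approximate inverse B:

    import numpy as np

    def iterative_correction(A, f, apply_Binv, u0=None, iters=50):
        """Stationary iteration u <- u + B(f - A u), where apply_Binv(r) applies
        an approximate inverse B of A to the residual r."""
        u = np.zeros_like(f) if u0 is None else u0.copy()
        for _ in range(iters):
            r = f - A @ u            # residual evaluated accurately
            u = u + apply_Binv(r)    # correction solved approximately
        return u

    # usage: Jacobi-like correction (B = diag(A)^-1) on a small SPD system
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    f = np.array([1.0, 2.0])
    u = iterative_correction(A, f, lambda r: r / np.diag(A))

Replacing the single B by a recursion over grid scales gives the multigrid V-cycle sketched after the next slide.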

28 Multilevel preconditioning: a multigrid V-cycle
[Diagram of a V-cycle: a smoother is applied on the finest grid; restriction transfers the problem from the fine grid to the first coarse grid (the coarser grid has fewer cells, hence less work and storage); this idea is applied recursively until we have an easy problem to solve; prolongation transfers the correction from coarse back to fine grid]
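A compact recursive sketch of one V-cycle (mine; dense matrices, a weighted-Jacobi smoother, full-weighting restriction, linear-interpolation prolongation, and Galerkin coarse operators are all assumptions made for brevity):

    import numpy as np

    def prolongation(n_fine, n_coarse):
        """Linear interpolation from n_coarse to n_fine interior points (n_fine = 2*n_coarse + 1)."""
        P = np.zeros((n_fine, n_coarse))
        for j in range(n_coarse):
            P[2 * j, j] += 0.5
            P[2 * j + 1, j] = 1.0
            P[2 * j + 2, j] += 0.5
        return P

    def v_cycle(A, f, u, nu=2, omega=2.0 / 3.0):
        """One V-cycle: smooth, restrict the residual, recurse on the coarse grid,
        prolongate the coarse correction, smooth again."""
        n = len(f)
        D = np.diag(A)
        for _ in range(nu):                      # pre-smoothing on the fine grid
            u = u + omega * (f - A @ u) / D
        if n <= 3:                               # coarsest grid: an easy problem, solve it
            return np.linalg.solve(A, f)
        P = prolongation(n, (n - 1) // 2)
        R = 0.5 * P.T                            # full-weighting restriction
        Ac = R @ A @ P                           # Galerkin coarse-grid operator
        ec = v_cycle(Ac, R @ (f - A @ u), np.zeros((n - 1) // 2))
        u = u + P @ ec                           # prolongate the coarse correction
        for _ in range(nu):                      # post-smoothing
            u = u + omega * (f - A @ u) / D
        return u

    # usage: 1D Poisson with 63 interior points
    n = 63
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    u = np.zeros(n)
    for _ in range(10):
        u = v_cycle(A, np.ones(n), u)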

29 Schwarz preconditioning
- Given Ax = b, partition x into subvectors x_i, corresponding to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x
- Let the Boolean rectangular matrix R_i extract the i-th subset of x: x_i = R_i x
- Let A_i = R_i A R_i^T; the additive Schwarz preconditioner is then B = sum_i R_i^T A_i^-1 R_i
- The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support
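A sketch of applying the 1-level additive Schwarz preconditioner (mine; dense matrices and explicit Boolean restriction matrices for clarity, whereas a real code would gather/scatter index sets and solve the local problems in parallel):

    import numpy as np

    def make_restriction(n, index_set):
        """Boolean gather matrix R_i selecting the entries of a global vector in index_set."""
        R = np.zeros((len(index_set), n))
        R[np.arange(len(index_set)), index_set] = 1.0
        return R

    def additive_schwarz_apply(A, r, index_sets):
        """Apply B = sum_i R_i^T (R_i A R_i^T)^-1 R_i to a residual r.
        Each subdomain solve is independent, i.e., parallel across subdomains."""
        n = len(r)
        z = np.zeros(n)
        for idx in index_sets:
            R = make_restriction(n, idx)
            Ai = R @ A @ R.T                        # local subdomain operator
            z += R.T @ np.linalg.solve(Ai, R @ r)   # scatter the local correction back
        return z

    # usage: 1D Laplacian split into two overlapping subdomains
    n = 20
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    subdomains = [np.arange(0, 12), np.arange(8, 20)]
    z = additive_schwarz_apply(A, np.ones(n), subdomains)

This `apply` is exactly what would be handed to a Krylov method (CG or GMRES) as the preconditioner in the Krylov-Schwarz framework of the preceding slides.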

30 Iteration count estimates from the Schwarz theory
- Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system
- In terms of N and P, where for d-dimensional isotropic problems N = h^-d and P = H^-d, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

  Preconditioning type          | in 2D        | in 3D
  Point Jacobi                  | O(N^1/2)     | O(N^1/3)
  Domain Jacobi (overlap = 0)   | O((NP)^1/4)  | O((NP)^1/6)
  1-level Additive Schwarz      | O(P^1/2)     | O(P^1/3)
  2-level Additive Schwarz      | O(1)         | O(1)

31 Comments on the Schwarz theory
- Basic Schwarz estimates are for:
  - self-adjoint operators with smooth coefficients
  - positive definite operators
  - exact subdomain solves
  - two-way overlapping communication
  - generous overlap, delta = O(H) (otherwise the 2-level result is O(1 + H/delta))
- Extensible to:
  - nonself-adjointness (e.g., convection) and jumping coefficients
  - indefiniteness (e.g., wave Helmholtz)
  - inexact subdomain solves
  - one-way overlap communication ("restricted additive Schwarz")
  - small overlap

32 Schur preconditioning
- Given a partition of the unknowns into interior (i) and interface (G) blocks, A = [A_ii, A_iG; A_Gi, A_GG]
- Condense: eliminate the interior block to obtain the Schur complement system S u_G = g, with S = A_GG - A_Gi A_ii^-1 A_iG and g = f_G - A_Gi A_ii^-1 f_i
- Let M be a good preconditioner for S
- Then the block factorization with M substituted for S is a preconditioner for A
- Moreover, solves with A_ii may be done approximately if all degrees of freedom are retained
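A small sketch of the condensation step (mine; dense algebra, and the interior/interface index sets are illustrative):

    import numpy as np

    def schur_reduce(A, f, interior, interface):
        """Condense a 2x2-blocked system onto the interface unknowns:
        S = A_GG - A_Gi A_ii^-1 A_iG,  g = f_G - A_Gi A_ii^-1 f_i."""
        Aii = A[np.ix_(interior, interior)]
        AiG = A[np.ix_(interior, interface)]
        AGi = A[np.ix_(interface, interior)]
        AGG = A[np.ix_(interface, interface)]
        S = AGG - AGi @ np.linalg.solve(Aii, AiG)
        g = f[interface] - AGi @ np.linalg.solve(Aii, f[interior])
        return S, g

    # usage: two 1D subdomains of 4 interior points coupled through one interface point
    n = 9
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    interior = np.r_[0:4, 5:9]
    interface = np.array([4])
    S, g = schur_reduce(A, np.ones(n), interior, interface)
    u_interface = np.linalg.solve(S, g)

In practice S is never formed explicitly; its action is applied within a Krylov iteration, and the "Schwarz-on-Schur" slides below are about preconditioning that reduced system.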

33 Schwarz polynomials
- Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
- We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively)
- This leads to algorithm "Hybrid II" in Smith, Bjorstad & Gropp '96
- Convenient for "SPMD" (single program/multiple data)

34 Schwarz-on-Schur
- Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
- "Neumann-Neumann" algorithm
- "Balancing Neumann-Neumann" algorithm
- Multigrid on the Schur complement

35 Schwarz-inside-Schur
- Consider Newton's method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization
- Constraint: c(x, u) = 0
- Objective: minimize phi(x, u)
- Lagrangian: L(x, u, lambda) = phi(x, u) + lambda^T c(x, u)
- Form the gradient of the Lagrangian with respect to each of x, u, and lambda

36 Schwarz-inside-Schur
- Equality-constrained optimization leads to the KKT system for states x, designs u, and multipliers lambda
- Newton's method is then applied to this coupled system (a sketch of the KKT structure follows below)
- Newton Reduced SQP solves the Schur complement system H du = g, where H is the reduced Hessian
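The KKT matrix itself did not survive the transcript; a standard form (my reconstruction, with W blocks denoting Hessians of the Lagrangian, J_x and J_u the constraint Jacobians with respect to states and designs, and Z a null-space basis for the constraints) is:

    \[
    \begin{pmatrix}
    W_{xx} & W_{xu} & J_x^{T}\\
    W_{ux} & W_{uu} & J_u^{T}\\
    J_x    & J_u    & 0
    \end{pmatrix}
    \begin{pmatrix}\delta x\\ \delta u\\ \delta\lambda\end{pmatrix}
    = -\begin{pmatrix}\nabla_x\mathcal{L}\\ \nabla_u\mathcal{L}\\ c\end{pmatrix},
    \qquad
    H = Z^{T}\begin{pmatrix}W_{xx} & W_{xu}\\ W_{ux} & W_{uu}\end{pmatrix}Z,
    \quad
    Z = \begin{pmatrix}-J_x^{-1}J_u\\ I\end{pmatrix}.
    \]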

37 Schwarz-inside-Schur, cont.
- Problems
  - J_x is the Jacobian of a PDE (huge!)
  - the W blocks involve Hessians of the objective and constraints (second derivatives, also huge)
  - H is unreasonable to form, store, or invert
- Solutions
  - Use Schur preconditioning on the full system
  - Form the forward action of the Hessians by automatic differentiation (a vector-to-vector map)
  - Form the approximate inverse action of the state Jacobian and its transpose by Schwarz

38 Example of PDE-constrained optimization (c/o G. Biros and O. Ghattas)
- Optimal control of laminar viscous flow
  - optimization variables are surface suction/injection
  - objective is minimum drag
  - 700,000 states; 4,000 controls
  - 128 Cray T3E processors
  - ~5 hrs for the optimal solution (~1 hr for analysis)
- Lagrange-Newton-Krylov-Schur, implemented in Veltisto/PETSc (www.cs.nyu.edu/~biros/veltisto/)
[Figures: wing-tip vortices without control (left) and with optimal control (right); optimal boundary controls shown as velocity vectors]

39 Nonlinear Schwarz preconditioning
- Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
- It replaces F(u) = 0 with a new nonlinear system Phi(u) = 0 possessing the same root
- Define a correction delta_i(u) to the i-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: R_i F(u - delta_i(u)) = 0, where delta_i(u) is nonzero only in the components of the i-th partition
- Then sum the corrections: Phi(u) = sum_i delta_i(u)

40 Nonlinear Schwarz, cont.
- It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then F(u) = 0 and Phi(u) = 0 have the same unique root
- To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any u:
  - the residual Phi(u)
  - the Jacobian-vector product Phi'(u) v
- Remarkably (Cai-Keyes, 2000), it can be shown that Phi'(u) is well approximated by sum_i (R_i^T J_i^-1 R_i) J, where J = F'(u) and J_i = R_i J R_i^T
- All required actions are available in terms of F itself!
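To make the "Newton inside" concrete, here is an illustrative sketch of evaluating the nonlinearly preconditioned residual Phi(u) by local Newton solves (my own reconstruction of the local-solve definition; dense linear algebra, a fixed number of local Newton steps, and the small test problem are all assumptions, and the outer Newton-Krylov loop on Phi is omitted):

    import numpy as np

    def aspin_residual(F, J, u, index_sets, newton_iters=5):
        """Evaluate Phi(u) = sum_i delta_i(u), where each delta_i is supported on one
        partition and solves R_i F(u - delta_i) = 0 by a few local Newton steps."""
        Phi = np.zeros_like(u)
        for idx in index_sets:
            delta = np.zeros_like(u)
            for _ in range(newton_iters):
                r = F(u - delta)[idx]                        # local residual
                Jloc = J(u - delta)[np.ix_(idx, idx)]        # local Jacobian block
                delta[idx] += np.linalg.solve(Jloc, r)       # local Newton update
            Phi += delta
        return Phi

    # usage: a small nonlinear system F(u) = A u + u^3 - b split into two partitions
    A = 2 * np.eye(6) - np.eye(6, k=1) - np.eye(6, k=-1)
    F = lambda u: A @ u + u**3 - np.ones(6)
    J = lambda u: A + np.diag(3 * u**2)
    Phi = aspin_residual(F, J, np.zeros(6), [np.arange(0, 3), np.arange(3, 6)])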

41 Example of nonlinear Schwarz
[Convergence comparison:]
- Newton's method: difficulty at the critical Re, stagnation beyond the critical Re
- Additive Schwarz Preconditioned Inexact Newton (ASPIN): convergence for all Re

42 "Unreasonable effectiveness" of Schwarz
- When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
- Let {r_i} be a complete set of orthonormal row eigenvectors for A: r_i A = a_i r_i, or equivalently A = sum_i r_i^T a_i r_i
- Then A^-1 = sum_i r_i^T a_i^-1 r_i = sum_i r_i^T (r_i A r_i^T)^-1 r_i, which is the Schwarz formula!
- Good decompositions are, in practice, a compromise between conditioning and parallel complexity

43 Newton-Krylov-Schwarz: a parallel PDE "workhorse"
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable
Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994); in PETSc since Balay's MS project at ODU (1995)

44 Jacobian-Free Newton-Krylov method
- In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
- These are approximated by Frechet derivatives, J(u) v ~ [F(u + eps*v) - F(u)] / eps, so that the actual Jacobian elements are never explicitly needed; eps is chosen with a fine balance between approximation error and floating-point rounding error
- Schwarz preconditions, using approximate elements
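A minimal matrix-free sketch of one JFNK step (mine; the eps heuristic is one common choice, and no preconditioner is attached, though GMRES would accept a Schwarz or physics-based one via its M argument):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    def jfnk_step(F, u):
        """One Jacobian-free Newton step: solve J(u) du = -F(u) by GMRES, where the
        action of J(u) on a vector is approximated by a finite difference of F."""
        Fu = F(u)
        n = len(u)

        def jacvec(v):
            # Frechet approximation J(u) v ~ [F(u + eps*v) - F(u)] / eps;
            # eps balances truncation error against floating-point rounding error
            eps = np.sqrt(np.finfo(float).eps) * max(1.0, np.linalg.norm(u)) / max(np.linalg.norm(v), 1e-30)
            return (F(u + eps * v) - Fu) / eps

        J = LinearOperator((n, n), matvec=jacvec)
        du, _ = gmres(J, -Fu)   # a Schwarz/physics-based preconditioner would go in here
        return u + du

    # usage: the same kind of small nonlinear system as above
    A = 2 * np.eye(4) - np.eye(4, k=1) - np.eye(4, k=-1)
    F = lambda u: A @ u + u**3 - np.ones(4)
    u = np.zeros(4)
    for _ in range(6):
        u = jfnk_step(F, u)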

45 Philosophy of Jacobian-free NK
- To evaluate the linear residual, we use the true F'(u), giving a true Newton step and asymptotic quadratic Newton convergence
- To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations:
  - combinations of operator-split Jacobians (for reasons of physics or reasons of numerics)
  - Jacobian of a related discretization (for "fast" solves)
  - Jacobian of a lower-order discretization (for more stability, less storage)
  - Jacobian with "lagged" values for expensive terms (for less computation per degree of freedom)
  - Jacobian stored in lower precision (for less memory traffic per preconditioning step)
  - Jacobian blocks decomposed for parallelism

46 Philosophy of Jacobian-free NK, cont.
- These motivations are not new; most large-scale application codes also take "short cuts" on the approximate Jacobian operator to be inverted, reflecting physical intuition
- The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian
- This leads to a weakly nonlinearly converging "defect correction method"
  - Defect correction: the approximate operator is inverted directly against the nonlinear residual, so the nonlinear convergence rate is limited by how well it approximates the true Jacobian
  - in contrast to preconditioned Newton: the true Jacobian action is retained in the Krylov solve and the approximate operator serves only as a preconditioner

47 Physics-based preconditioning
- Recall the example of the shallow-water wave splitting, treated earlier as a solver, leaving a first-order-in-time splitting error
- In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner
- The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations

48 Physics-based preconditioning
- In Newton iteration, one seeks a correction ("delta") to the solution by inverting the Jacobian matrix on (the negative of) the nonlinear residual
- A typical operator-split code also derives a "delta" to the solution, by some implicitly defined means, through a series of implicit and explicit substeps
- This implicitly defined mapping from residual to "delta" is a natural preconditioner
- Software must accommodate this!

49 1D shallow water preconditioning
- Define the continuity residual for each timestep
- Define the momentum residual for each timestep
- Continuity delta form (*)
- Momentum delta form (**)

50 1D shallow water preconditioning, cont.
- Solving (**) for the momentum delta and substituting into (*) yields a parabolic equation in the continuity variable alone
- After this parabolic equation is solved for the continuity delta, the momentum delta is recovered by back-substitution
- This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep
- Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
- See the paper by Mousseau et al. (2002) in Ref. [1] for impressive results for long-time weather integration

51 Operator-split preconditioning
- Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately
- Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain
- Suppose J = S + R and a preconditioner is to be constructed, where S and R are each "easy" to invert
- Form a preconditioned vector by applying the two "easy" inverses in sequence (see the sketch below); this amounts to replacing the sum S + R by a product of split factors
- First-order splitting error, yet often used as a solver!
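The slide's formula is not reproduced in the transcript, so here is one common concrete instance as a hedged sketch (mine): for a backward-Euler-type Jacobian J = I - dt*(S + R), approximate J by the product (I - dt*S)(I - dt*R), which differs from J by an O(dt^2) term, and apply the two factor inverses in sequence.

    import numpy as np

    def split_preconditioner(S, R, dt):
        """Operator-split preconditioner for J = I - dt*(S + R), applied as
        M^-1 v = (I - dt*R)^-1 (I - dt*S)^-1 v."""
        n = S.shape[0]
        IS = np.eye(n) - dt * S
        IR = np.eye(n) - dt * R
        def apply(v):
            w = np.linalg.solve(IS, v)     # first factor: e.g., convection-diffusion (multilevel solve)
            return np.linalg.solve(IR, w)  # second factor: e.g., reaction (block-diagonal solve)
        return apply

    # usage: 1D diffusion (S) plus pointwise linear reaction (R)
    n, dt = 16, 0.1
    S = -(2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
    R = -np.diag(np.linspace(0.5, 1.5, n))
    M_inv = split_preconditioner(S, R, dt)
    z = M_inv(np.ones(n))

Used as a solver, this splitting leaves the first-order-in-time error the slide mentions; used only as a preconditioner inside JFNK, the splitting error affects iteration counts but not the converged answer.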

52 Operator-split preconditioning, cont.
- Suppose J = S + R, where S is convection-diffusion and R is reaction, acting on a collection of fields stored as gridfunctions
- On a small regular 2D grid with a five-point stencil:
  - R is trivially invertible in block-diagonal form
  - S is invertible with one multilevel solve per field

53 Operator-split preconditioning, cont.
- Preconditioners assembled from just the "strong" elements of the Jacobian, alternating the source-term and diffusion-term operators, are competitive in convergence rates with block-ILU on the Jacobian
  - particularly since the decoupled scalar diffusion systems are amenable to simple multigrid treatment, which is not as trivial for the coupled system
- The decoupled preconditioners store many fewer elements, significantly reduce memory bandwidth requirements, and are expected to be much faster per iteration when carefully implemented
- See "alternative block factorization" by Bank et al. in Ref. [1]; incorporated into the SciDAC TSI solver by D'Azevedo

54 Using the Jacobian of a related discretization
- To precondition a variable-coefficient operator, such as div(kappa(x) grad), use a constant-coefficient operator based on an average of kappa, for which fast solvers are available
- Brown & Saad (1980) showed that, because of the availability of fast solvers, it may even be acceptable to use such a fast-solvable related operator to precondition an operator of rather different character

55 Using the Jacobian of a lower-order discretization
- Orszag popularized the use of linear finite element discretizations as preconditioners for high-order spectral element discretizations in the 1970s; both approach the same continuous operator
- It is common in CFD to employ first-order upwinded convective operators as approximate inversions for higher-order operators:
  - better factorization stability
  - smaller matrix bandwidth and complexity
- With Jacobian-free NK, we can have the best of both worlds: a stable factorization/cheap solve and a true Jacobian step

56 Using a Jacobian with lagged terms
- Newton-chord methods (e.g., papers by Smooke et al.) "freeze" the Jacobian matrices:
  - saves Jacobian evaluation and factorization, which can be up to 90% of the running time of the code in some apps
  - however, nonlinear convergence degrades to a linear rate
- In Jacobian-free NK, we can "freeze" some or all of the terms in the Jacobian preconditioner, while always accessing the action of the true Jacobian for the Krylov matrix-vector multiply:
  - still saves Jacobian work
  - maintains the asymptotically quadratic rate of nonlinear convergence
- See Ref. [1] for an example with coupled edge plasma and Navier-Stokes, in which JFNK with the preconditioner refreshed once every ten timesteps shows a five-fold improvement over full Newton with a constantly refreshed Jacobian on the left-hand side

57 Using a Jacobian with lower-precision elements
- Memory bandwidth is the critical architectural parameter for sparse linear algebra computations
- Storing the preconditioner elements in single precision effectively doubles memory bandwidth (and potentially halves runtime) for this critical phase
- We still form the Jacobian-vector product in full precision and "zero-pad" the preconditioner elements back to full length in the arithmetic unit, so the numerical quality of the Krylov subspace does not degrade
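A toy sketch of the idea (mine; a diagonal preconditioner stands in for the ILU-type preconditioners used in the NKS codes, and the promotion back to double at use time mirrors the "zero-padding" described above):

    import numpy as np

    def make_single_precision_preconditioner(J):
        """Store preconditioner data in float32 to halve its memory traffic;
        promote back to float64 inside the apply so the Krylov vectors stay in full precision."""
        d32 = np.asarray(np.diag(J), dtype=np.float32)    # preconditioner elements in single precision
        def apply(r):
            return r / d32.astype(np.float64)             # promoted at use time
        return apply

    # usage
    J = 2 * np.eye(8) - np.eye(8, k=1) - np.eye(8, k=-1)
    M_inv = make_single_precision_preconditioner(J)
    z = M_inv(np.ones(8))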

58 Memory BW bottleneck revealed via precision reduction
Execution times for an unstructured NKS Euler simulation on an Origin 2000: double-precision matrices versus a single-precision preconditioner.

  Number of   | Linear solve phase   | Overall
  processors  | Double     Single    | Double     Single
  16          | 223 s      136 s     | 746 s      657 s
  32          | 117 s       67 s     | 373 s      331 s
  64          |  60 s       34 s     | 205 s      181 s
  120         |  31 s       16 s     | 122 s      106 s

Note that times are nearly halved, along with precision, for the BW-limited linear solve phase, indicating that the BW can be at least doubled before hitting the next bottleneck!

59 Computational aerodynamics
- Unstructured mesh c/o D. Mavriplis, ICASE
- Implemented in PETSc (www.mcs.anl.gov/petsc)
[Figure: transonic "lambda" shock, Mach contours on surfaces]

60 Fixed-size parallel scaling results (c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith)
- Four orders of magnitude in 13 years
- 128 nodes: 43 min; 3072 nodes: 2.5 min, 226 Gf/s
- 11M unknowns, 15 microseconds/unknown, 70% efficient
- This scaling study, featuring our widest range of processor number, was done for the incompressible case

61 University of Kentucky CCS Seminar, 6 November 2002 l Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups l For FY2002, 51 new projects at $57M/year total n Approximately one-third for ISICs n A third for grid infrastructure and collaboratories n A third for applications groups l 5 Tflop/s IBM SP platforms “Seaborg” at NERSC (#5 in Jun’02 “Top 500” list) and “Cheetah” at ORNL (recently installed) available for SciDAC

62 Introducing the "Terascale Optimal PDE Simulations" (TOPS) ISIC
Nine institutions, $17M, five years, 24 co-PIs

63 Scope for TOPS
- Design and implementation of "solvers"
  - time integrators, with sensitivity analysis
  - nonlinear solvers, with sensitivity analysis
  - optimizers
  - linear solvers
  - eigensolvers
- Software integration
- Performance optimization
[Diagram: dependences among optimizer, sensitivity analyzer, time integrator, nonlinear solver, eigensolver, and linear solver]

64 TOPS philosophy on PDEs
- Solution of a system of PDEs is rarely a goal in itself
  - PDEs are solved to derive various functionals from specified inputs
  - the actual goal is characterization of a response surface or a design or control strategy
  - together with analysis, sensitivities and stability are often desired
- Therefore, tools for PDE solution should also support such related desires

65 Conclusions
- Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
- Several freely available software toolkits exist and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
- Concerted efforts are underway to make elements of these toolkits interoperate and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and, above all, adaptive!
- Tunability of NKS algorithmics allows solver adaptation to application/architecture combinations
- Next-generation software should incorporate "best practices" from applications as preconditioners

66 Acknowledgments
- Collaborators or contributors:
  - Xiao-Chuan Cai (Univ. Colorado, Boulder)
  - Omar Ghattas (Carnegie Mellon)
  - Dinesh Kaushik (ODU)
  - Dana Knoll (LANL)
  - Dimitri Mavriplis (ICASE)
  - PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
- Sponsors: DOE, NASA, NSF
- Computer resources: LLNL, LANL, SNL, NERSC

67 Related URLs
- Personal homepage: papers, talks, etc. http://www.math.odu.edu/~keyes
- SciDAC initiative http://www.science.doe.gov/scidac
- TOPS project http://www.math.odu.edu/~keyes/scidac
- PETSc project http://www.mcs.anl.gov/petsc
- Hypre project http://www.llnl.gov/CASC/hypre
- ASCI platforms http://www.llnl.gov/asci/platforms

68 Bibliography
- Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys.
- Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200
- High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
- Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci. 1900:1-17, Springer
- Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
- Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in "Parallel Solution of Partial Differential Equations," Springer, pp. 247-278
- How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press, pp. 286-297

69 EOF

