
1 Algorithms for Cluster Computing Based on Domain Decomposition
David E. Keyes
Center for Computational Science, Old Dominion University
Institute for Computer Applications in Science & Engineering, NASA Langley Research Center
Institute for Scientific Computing Research, Lawrence Livermore National Laboratory

2 NASA Langley Cluster SIG Plan of Presentation l Introduction (5) l Imperative of domain decomposition and multilevel methods for terascale computing (5) l Basic and “advanced” domain decomposition and multilevel algorithmic concepts (14) l Vignettes of domain decomposition and multilevel performance (13) (with a little help from my friends…) l Agenda for future research (6) l Terascale Optimal PDE Simulations (TOPS) software project (14) l Conclusions (4)

3 NASA Langley Cluster SIG Terascale simulation has been “sold” as a complement to experiments across many fields: Environment (global climate, groundwater flow); Lasers & Energy (combustion, ICF modeling), where experiments are dangerous; Engineering (structural dynamics, electromagnetics), where experiments are expensive; Chemistry & Bio (materials modeling, drug design), where experiments are controversial; Applied Physics (radiation transport, hydrodynamics), where experiments are prohibited. However, it is far from proven! To meet expectations, we need to handle problems of multiple physical scales.

4 Large platforms have been provided. The ASCI program of the U.S. DOE has a roadmap to reach 100 Tflop/s by 2006 (www.llnl.gov/asci/platforms). [Chart: capability vs. time (CY ’97-’06), with plan/develop/use phases for each machine: Red, 1+ Tflop / 0.5 TB (Sandia); Blue, 3+ Tflop / 1.5 TB (Los Alamos); White, 10+ Tflop / 4 TB (Livermore); then 30+ Tflop / 10 TB, 50+ Tflop / 25 TB, and 100+ Tflop / 30 TB (Livermore).]

5 NASA Langley Cluster SIG NSF’s 13.6 TF TeraGrid coming on line. [Diagram: TeraGrid site resources and external networks linking SDSC (4.1 TF, 225 TB, HPSS), NCSA/PACI (8 TF, 240 TB, UniTree), Caltech, and Argonne.] TeraGrid: NCSA, SDSC, Caltech, Argonne www.teragrid.org

6 NASA Langley Cluster SIG Algorithmic requirements from architecture l Must run on physically distributed memory units connected by a message-passing network, each serving one or more processors with multiple levels of cache [photo: Cray T3E] n “horizontal” aspects: message passing, shared memory threads n “vertical” aspects: register blocking, cache blocking, prefetching

7 NASA Langley Cluster SIG Building platforms is the “easy” part l Algorithms must be n highly concurrent and straightforward to load balance n latency tolerant n cache friendly (good temporal and spatial locality) n highly scalable (in the sense of convergence) l Domain decomposed multilevel methods “natural” for all of these l Domain decomposition also “natural” for software engineering l Fortunate that mathematicians built up its theory in advance of requirements!

8 NASA Langley Cluster SIG Decomposition strategies for Lu = f in Ω l Operator decomposition l Function space decomposition l Domain decomposition Consider the implicitly discretized parabolic case

9 NASA Langley Cluster SIG Operator decomposition l Consider ADI l Iteration matrix consists of four multiplicative substeps per timestep n two sparse matrix-vector multiplies n two sets of unidirectional bandsolves l Parallelism within each substep l But global data exchanges between bandsolve substeps
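To make the substep structure concrete, here is a minimal serial sketch of one Peaceman-Rachford ADI step for the 2D heat equation on a uniform grid; the model problem, grid, and routine names are illustrative choices, not taken from the slide. Each substep is explicit (a sparse matrix-vector multiply) in one direction and implicit (a set of unidirectional bandsolves) in the other, and the change of sweep direction between substeps is where the global data exchange noted above arises.

```python
# Minimal ADI (Peaceman-Rachford) sketch for u_t = u_xx + u_yy on an n-by-n
# interior grid with homogeneous Dirichlet boundaries; illustrative only.
import numpy as np
from scipy.linalg import solve_banded

def adi_step(U, dt, h):
    n = U.shape[0]
    r = dt / (2.0 * h * h)
    # Banded storage of the tridiagonal factor (I - r*D2) for solve_banded.
    ab = np.zeros((3, n))
    ab[0, 1:] = -r              # superdiagonal
    ab[1, :] = 1.0 + 2.0 * r    # main diagonal
    ab[2, :-1] = -r             # subdiagonal

    def d2(V):                  # second difference along axis 0 (sparse mat-vec)
        W = -2.0 * V
        W[1:, :] += V[:-1, :]
        W[:-1, :] += V[1:, :]
        return W

    # Substep 1: implicit in x, explicit in y -- one bandsolve per grid line.
    rhs = U + r * d2(U.T).T
    Uhalf = np.column_stack([solve_banded((1, 1), ab, rhs[:, j]) for j in range(n)])
    # Substep 2: implicit in y, explicit in x -- works on transposed data,
    # which is the global exchange between bandsolve substeps noted above.
    rhs = Uhalf + r * d2(Uhalf)
    return np.column_stack([solve_banded((1, 1), ab, rhs[j, :]) for j in range(n)]).T

U = np.zeros((64, 64))
U[24:40, 24:40] = 1.0
U = adi_step(U, dt=1e-3, h=1.0 / 65)
```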

10 NASA Langley Cluster SIG Function space decomposition l Consider a spectral Galerkin method: expand u(x,t) = Σ_j a_j(t) Φ_j(x) in basis functions Φ_j l Galerkin projection yields a system of ordinary differential equations M da/dt = K a + g, with M_ij = (Φ_i, Φ_j), K_ij = (Φ_i, L Φ_j), g_i = (Φ_i, f) l Perhaps M and K are diagonal matrices (e.g., for a Fourier basis) l Perfect parallelism across the spectral index l But global data exchanges to transform back to physical variables at each step

11 NASA Langley Cluster SIG Domain decomposition l Consider restriction and extension operators R_i, R_i^T for subdomains Ω_i, and R_0, R_0^T for a possible coarse grid l Replace the discretized system Au = f with the preconditioned system B^-1 Au = B^-1 f, where B^-1 = R_0^T A_0^-1 R_0 + Σ_i R_i^T A_i^-1 R_i and A_i = R_i A R_i^T l Solve by a Krylov method l Matrix-vector multiplies with B^-1 A require n parallelism on each subdomain n nearest-neighbor exchanges, global reductions n possible small global system (not needed for parabolic case)

12 NASA Langley Cluster SIG Comparison l Operator decomposition (ADI) n natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) l Function space decomposition (Fourier) n Natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) l Domain decomposition (Schwarz) n Natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem (Of course, domain decomposition can be interpreted as a special operator or function space decomposition)

13 NASA Langley Cluster SIG Theoretical Scaling of Domain Decomposition for Three Common Network Topologies l With tree-based (logarithmic) global reductions and scalable nearest neighbor hardware: n optimal number of processors scales linearly with problem size (scalable, assumes one subdomain per processor) l With 3D torus-based global reductions and scalable nearest neighbor hardware: n optimal number of processors scales as three-fourths power of problem size (almost scalable) l With common bus network (heavy contention): n optimal number of processors scales as one-fourth power of problem size (not scalable) n bad news for conventional Beowulf clusters, but see 2000 Bell Prize “price-performance awards” using multiple NICs per Beowulf node!

14 NASA Langley Cluster SIG Basic Concepts l Iterative correction l Schwarz preconditioning l Schur preconditioning “Advanced” Concepts l Polynomial combinations of Schwarz projections l Schwarz-Schur combinations n Neumann-Neumann/FETI (Schwarz on Schur) n LNKS (Schwarz inside Schur) l Nonlinear Schwarz

15 NASA Langley Cluster SIG Iterative correction l The most basic idea in iterative methods l Evaluate the residual accurately, but solve approximately: u ← u + B^-1 (f - Au), where B^-1 is an approximate inverse to A l A sequence of complementary solves can be used, e.g., with B_1^-1 first and then B_2^-1, one has u ← u + (B_1^-1 + B_2^-1 - B_2^-1 A B_1^-1)(f - Au) l Optimal polynomials of B^-1 A lead to various preconditioned Krylov methods l Scale recurrence, e.g., taking B_2^-1 to be an (approximate) solve on a coarsened version of the problem, leads to multilevel methods
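A minimal sketch of this correction loop, with B^-1 taken to be a simple Jacobi (diagonal) approximate inverse; the model matrix, tolerance, and iteration limit are illustrative. The plain loop converges slowly, which is exactly what the polynomial (Krylov) and multilevel accelerations named on the slide address.

```python
# Iterative correction sketch: accurate residual, approximate solve.
import numpy as np

def iterative_correction(A, f, apply_Binv, u0=None, tol=1e-8, maxit=500):
    u = np.zeros_like(f) if u0 is None else u0.copy()
    for _ in range(maxit):
        r = f - A @ u              # evaluate the residual accurately
        if np.linalg.norm(r) < tol * np.linalg.norm(f):
            break
        u += apply_Binv(r)         # but solve only approximately
    return u

# Example: B^-1 = diagonal (Jacobi) approximate inverse of a 1D Laplacian.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)
u = iterative_correction(A, f, lambda r: r / np.diag(A))
```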

16 NASA Langley Cluster SIG Multilevel Preconditioning [Figure: a Multigrid V-cycle. A smoother is applied on the Finest Grid; Restriction transfers the residual from the fine grid to a coarser grid, which has fewer cells (less work & storage); the idea is applied recursively until the problem on the First Coarse Grid is easy to solve; Prolongation transfers the correction from the coarse grid back to the fine grid.]
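The following is a small self-contained V-cycle for the 1D Poisson problem, intended only to illustrate the smoother/restriction/recursion/prolongation pattern in the figure; the damped-Jacobi smoother, injection restriction, and linear-interpolation prolongation are common textbook choices, not necessarily those used in the work reported here.

```python
# Minimal 1D multigrid V-cycle sketch for -u'' = f with u(0)=u(1)=0.
import numpy as np

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    for _ in range(sweeps):       # damped Jacobi smoother
        u[1:-1] += omega * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
    return u

def v_cycle(u, f, h):
    n = u.size - 1                        # number of intervals (a power of 2)
    u = smooth(u, f, h)                   # pre-smoothing on the fine grid
    if n <= 2:
        return smooth(u, f, h, sweeps=50) # "first coarse grid": solve cheaply
    r = residual(u, f, h)
    rc = r[::2].copy()                    # restriction: fine -> coarse
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)   # recurse on the coarse problem
    e = np.zeros_like(u)
    e[::2] = ec                           # prolongation: coarse -> fine
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])    # (linear interpolation)
    return smooth(u + e, f, h)            # correct, then post-smooth

n = 64
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.pi ** 2 * np.sin(np.pi * x)        # exact solution sin(pi x)
u = np.zeros(n + 1)
for _ in range(10):
    u = v_cycle(u, f, h)
```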

17 NASA Langley Cluster SIG Schwarz Preconditioning l Given Ax = b, partition x into subvectors x_i, corresponding to subdomains Ω_i of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x l Let the Boolean rectangular matrix R_i extract the i-th subset: x_i = R_i x l Let A_i = R_i A R_i^T l The Boolean matrices R_i, R_i^T are gather/scatter operators, mapping between a global vector and its subdomain support
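Below is a serial sketch of the gather/scatter operators and the resulting one-level additive Schwarz action B^-1 r = Σ_i R_i^T A_i^-1 R_i r, wrapped in a damped Richardson loop for brevity; the subdomain sizes, overlap, and model matrix are illustrative, and a Krylov method would normally be used in place of the Richardson iteration. Without a coarse level the convergence is slow, consistent with the one-level estimates on the next slide.

```python
# One-level additive Schwarz sketch on a 1D Laplacian; the R_i are realized
# as index sets (gather/scatter) rather than explicit Boolean matrices.
import numpy as np

n, nsub, overlap = 120, 4, 4
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

# Overlapping index sets playing the role of the Boolean matrices R_i.
width = n // nsub
subsets = [np.arange(max(0, i * width - overlap), min(n, (i + 1) * width + overlap))
           for i in range(nsub)]
local_inv = [np.linalg.inv(A[np.ix_(s, s)]) for s in subsets]   # A_i = R_i A R_i^T

def apply_schwarz(r):
    """B^-1 r = sum_i R_i^T A_i^-1 R_i r."""
    z = np.zeros_like(r)
    for s, Ai_inv in zip(subsets, local_inv):
        z[s] += Ai_inv @ r[s]      # gather, local solve, scatter-add
    return z

# Damped Richardson iteration preconditioned by additive Schwarz (a Krylov
# method such as CG or GMRES would normally wrap apply_schwarz instead).
u = np.zeros(n)
for _ in range(50):
    u += 0.7 * apply_schwarz(b - A @ u)
```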

18 NASA Langley Cluster SIG Iteration Count Estimates From the Schwarz Theory l In terms of N and P, where for d-dimensional isotropic problems N = h^-d and P = H^-d, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning Type      | in 2D        | in 3D
Point Jacobi              | Ο(N^1/2)     | Ο(N^1/3)
Domain Jacobi (δ=0)       | Ο((NP)^1/4)  | Ο((NP)^1/6)
1-level Additive Schwarz  | Ο(P^1/2)     | Ο(P^1/3)
2-level Additive Schwarz  | Ο(1)         | Ο(1)

l Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system

19 NASA Langley Cluster SIG Schur Preconditioning l Given a partition into interior (i) and interface (Γ) unknowns, [A_ii A_iΓ; A_Γi A_ΓΓ][u_i; u_Γ] = [f_i; f_Γ] l Condense: S u_Γ = g, where S = A_ΓΓ - A_Γi A_ii^-1 A_iΓ and g = f_Γ - A_Γi A_ii^-1 f_i l Let M be a good preconditioner for S l Then [A_ii 0; A_Γi M] is a preconditioner for A l Moreover, solves with A_ii may be done approximately if all degrees of freedom are retained
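A small dense sketch of the condensation step and the interface/interior back-substitution; here S is formed explicitly and solved directly, purely for illustration, whereas in practice S is never formed and a preconditioner M is applied instead.

```python
# Schur-complement sketch on a small dense system partitioned into interior
# (i) and interface (g) unknowns; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
ni, ng = 8, 3
A = rng.standard_normal((ni + ng, ni + ng))
A = A @ A.T + (ni + ng) * np.eye(ni + ng)      # make it SPD
b = rng.standard_normal(ni + ng)

Aii, Aig = A[:ni, :ni], A[:ni, ni:]
Agi, Agg = A[ni:, :ni], A[ni:, ni:]

# Condense: S u_g = g, with S = Agg - Agi Aii^-1 Aig.
S = Agg - Agi @ np.linalg.solve(Aii, Aig)
g = b[ni:] - Agi @ np.linalg.solve(Aii, b[:ni])

u_g = np.linalg.solve(S, g)                    # interface solve
u_i = np.linalg.solve(Aii, b[:ni] - Aig @ u_g) # back-substitute the interiors

assert np.allclose(A @ np.concatenate([u_i, u_g]), b)
```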

20 NASA Langley Cluster SIG Schwarz polynomials l Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations l We may solve the fine subdomains concurrently and follow with a coarse grid correction (redundantly/cooperatively) l This leads to the preconditioner called “Hybrid II” in Smith, Bjorstad & Gropp, in which the coarse solve is composed multiplicatively with the sum of the concurrent fine solves

21 NASA Langley Cluster SIG Schwarz-on-Schur l Preconditioning the Schur complement is complex in and of itself; Schwarz is used on the reduced problem l Neumann-Neumann l Balancing Neumann-Neumann l Multigrid on the Schur complement

22 NASA Langley Cluster SIG Schwarz-inside-Schur l Equality constrained optimization leads to the KKT system for states x, designs u, and multipliers λ l The Newton step on the KKT conditions couples the Hessian blocks of the Lagrangian with the constraint Jacobian blocks J_x, J_u (written out below) l Newton Reduced SQP solves the Schur complement system H δu = g, where H is the reduced Hessian
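The slide's KKT matrix did not survive extraction; the block form below is the standard way the system is written for min f(x,u) subject to c(x,u) = 0, with L = f + λ^T c, W the Hessian of L, and J = [J_x J_u] the constraint Jacobian. The notation is generic, and the talk's own symbols may differ in detail.

```latex
\[
\begin{pmatrix}
  W_{xx} & W_{xu} & J_x^T \\
  W_{ux} & W_{uu} & J_u^T \\
  J_x    & J_u    & 0
\end{pmatrix}
\begin{pmatrix} \delta x \\ \delta u \\ \delta \lambda \end{pmatrix}
= -
\begin{pmatrix} \nabla_x L \\ \nabla_u L \\ c \end{pmatrix},
\qquad
H \,\delta u = g, \quad
H = W_{uu} - W_{ux} J_x^{-1} J_u - J_u^T J_x^{-T} W_{xu}
    + J_u^T J_x^{-T} W_{xx} J_x^{-1} J_u .
\]
```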

23 NASA Langley Cluster SIG Schwarz-inside-Schur, cont. l Problems n J_x is the Jacobian of a PDE ⇒ huge! n The W blocks involve Hessians of the objective and constraints ⇒ second derivatives, and huge n H is unreasonable to form, store, or invert l Solutions n Use Schur preconditioning on the full system n Form the forward action of the Hessians by automatic differentiation n Form the approximate inverse action of the state Jacobian and its transpose by Schwarz

24 NASA Langley Cluster SIG Nonlinear Schwarz Preconditioning l Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free l It replaces F(u) = 0 with a new nonlinear system ℱ(u) = 0 possessing the same root l Define a correction δ_i(u) to the i-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: R_i F(u - δ_i(u)) = 0, where δ_i(u) is nonzero only in the components of the i-th partition l Then sum the corrections: ℱ(u) = Σ_i δ_i(u) (a sketch follows below)
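Here is an illustrative serial sketch of evaluating the nonlinearly preconditioned residual ℱ(u) = Σ_i δ_i(u) for a small 1D model problem with two overlapping subdomains; the model residual F, the finite-difference local Jacobians, and all sizes are assumptions made for the sketch, not the ASPIN implementation referenced in the talk.

```python
# Serial ASPIN-style sketch: F(u)=0 is replaced by script-F(u) = sum_i delta_i(u),
# where each delta_i solves a local nonlinear problem on an overlapping subdomain.
import numpy as np

def F(u):
    # 1D nonlinear model residual: -u'' + u^3 = 1 with Dirichlet boundaries.
    n = u.size
    up = np.concatenate(([0.0], u, [0.0]))
    return (2.0 * up[1:-1] - up[:-2] - up[2:]) * (n + 1) ** 2 + u ** 3 - 1.0

def local_correction(u, idx, tol=1e-10, maxit=20):
    """Solve R_i F(u - delta) = 0 for a correction supported on index set idx."""
    delta = np.zeros_like(u)
    eps = 1e-7
    for _ in range(maxit):
        r = F(u - delta)[idx]
        if np.linalg.norm(r) < tol:
            break
        # Local Jacobian J_i = R_i F'(u - delta) R_i^T by finite differences.
        Ji = np.empty((idx.size, idx.size))
        for k, j in enumerate(idx):
            e = np.zeros_like(u)
            e[j] = eps
            Ji[:, k] = (F(u - delta + e)[idx] - r) / eps
        delta[idx] += np.linalg.solve(Ji, r)   # inner Newton update
    return delta

def preconditioned_residual(u, subdomains):
    """script-F(u): sum of the local corrections (the inner Newton solves)."""
    return sum(local_correction(u, idx) for idx in subdomains)

n = 40
subdomains = [np.arange(0, 24), np.arange(16, 40)]   # two overlapping partitions
u = np.zeros(n)
print(np.linalg.norm(F(u)), np.linalg.norm(preconditioned_residual(u, subdomains)))
```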

25 NASA Langley Cluster SIG Nonlinear Schwarz, cont. l It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then F(u) = 0 and ℱ(u) = 0 have the same unique root l To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any u and v: n The residual ℱ(u) = Σ_i δ_i(u) n The Jacobian-vector product ℱ′(u) v l Remarkably (Cai-Keyes, 2000), it can be shown that ℱ′(u) v ≈ (Σ_i R_i^T J_i^-1 R_i) J v, where J = F′(u) and J_i = R_i J R_i^T l All required actions are available in terms of F!

26 NASA Langley Cluster SIG Experimental Example of Nonlinear Schwarz [Comparison: Newton’s method shows difficulty at the critical Re and stagnation beyond the critical Re; Additive Schwarz Preconditioned Inexact Newton (ASPIN) shows convergence for all Re.]

27 NASA Langley Cluster SIG “Unreasonable effectiveness” of Schwarz l When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right! l Good decompositions are a compromise between conditioning and parallel complexity, in practice l Let {r_i} be a complete set of orthonormal row eigenvectors for A: r_i A r_i^T = a_i, or A = Σ_i r_i^T a_i r_i l Then A^-1 = Σ_i r_i^T a_i^-1 r_i, the Schwarz formula!
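A quick numerical check of the identity on this slide, using the eigenvectors of a random symmetric positive definite matrix (purely illustrative):

```python
# With r_i the orthonormal (row) eigenvectors of a symmetric A and
# a_i = r_i A r_i^T, the sum of partial inverses equals the inverse of A.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)                 # symmetric positive definite
vals, vecs = np.linalg.eigh(A)              # columns of vecs are eigenvectors

Ainv = sum(np.outer(vecs[:, i], vecs[:, i]) / vals[i] for i in range(6))
assert np.allclose(Ainv, np.linalg.inv(A))
```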

28 NASA Langley Cluster SIG Some anecdotal successes l Newton-Krylov-Schwarz n Computational aerodynamics (Anderson et al., NASA, ODU, ANL) n Free convection (Shadid & Tuminaro, Sandia) l Pseudo-spectral Schwarz n Incompressible flow (Fischer & Tufo, ANL) l FETI n Computational structural dynamics (Farhat, CU-Boulder & Pierson, Sandia) l LNKS n PDE-constrained optimization (Biros, NYU & Ghattas, CMU)

29 NASA Langley Cluster SIG Newton-Krylov-Schwarz n Newton: nonlinear solver, asymptotically quadratic n Krylov: accelerator, spectrally adaptive n Schwarz: preconditioner, parallelizable Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994)

30 NASA Langley Cluster SIG Jacobian-Free Newton-Krylov Method l In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products l These are approximated by Fréchet derivatives, J(u)v ≈ [F(u + εv) - F(u)]/ε, so that the actual Jacobian elements are never explicitly needed; ε is chosen with a fine balance between approximation error and floating point rounding error l Schwarz preconditions, using approximate elements
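A minimal matrix-free Newton-Krylov sketch using this Fréchet-derivative approximation, with SciPy's GMRES as the Krylov solver and no preconditioner; the sample residual F, the ε heuristic, and the tolerances are illustrative assumptions, and in the setting described in the talk a Schwarz preconditioner would be supplied to the inner Krylov solve.

```python
# Jacobian-free Newton-Krylov sketch: GMRES solves J(u) du = -F(u) using only
# the Frechet-derivative approximation J(u) v ~ [F(u + eps*v) - F(u)] / eps.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def F(u):                                    # sample nonlinear residual
    n = u.size
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return A @ u + np.exp(u) - 1.0

def jfnk(F, u0, tol=1e-9, maxit=20):
    u = u0.copy()
    for _ in range(maxit):
        r = F(u)
        if np.linalg.norm(r) < tol:
            break
        def jv(v):
            # eps balances truncation error against floating-point rounding error
            eps = np.sqrt(np.finfo(float).eps) * max(1.0, np.linalg.norm(u)) \
                  / max(np.linalg.norm(v), 1e-30)
            return (F(u + eps * v) - r) / eps
        J = LinearOperator((u.size, u.size), matvec=jv, dtype=float)
        du, _ = gmres(J, -r)                 # inexact inner solve, default tolerance
        u += du
    return u

u = jfnk(F, 0.5 * np.ones(30))
```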

31 NASA Langley Cluster SIG Computational Aerodynamics mesh c/o D. Mavriplis, ICASE Implemented in PETSc www.mcs.anl.gov/petsc Transonic “Lambda” Shock, Mach contours on surfaces

32 NASA Langley Cluster SIG Fixed-size Parallel Scaling Results (c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith): 11M unknowns; 128 nodes, 43 min; 3072 nodes, 2.5 min at 226 Gf/s (15 µs/unknown, 70% efficient). Four orders of magnitude in 13 years. This scaling study, featuring our widest range of processor number, was done for the incompressible case.

33 NASA Langley Cluster SIG Fixed-size Parallel Scaling Results on ASCI Red ONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors)

34 NASA Langley Cluster SIG PDE Workingsets l Smallest: data for single stencil l Largest: data for entire subdomain l Intermediate: data for a neighborhood collection of stencils, reused as possible

35 NASA Langley Cluster SIG Improvements Resulting from Locality Reordering Factor of Five!

36 NASA Langley Cluster SIG Cache Traffic for PDEs l As successive workingsets “drop” into a level of memory, capacity misses (and, with effort, conflict misses) disappear, leaving only compulsory misses and reducing demand on main memory bandwidth l Traffic decreases as the cache gets bigger or the subdomains get smaller

37 NASA Langley Cluster SIG Transport Modeling: Thermal Convection Problem (Ra = 1000), MPSalsa/Aztec Newton-Krylov-Schwarz solver, www.cs.sandia.gov/CRF/MPSalsa, c/o J. Shadid and R. Tuminaro. Newton-Krylov solver with Aztec non-restarted GMRES, comparing a 1-level domain decomposition preconditioner with ILUT subdomain solver against an ML 2-level DD preconditioner with Gauss-Seidel subdomain solver. Coarse solver: “Exact” = SuperLU (1 proc), “Approx” = one step of ILU (8 procs in parallel). [Plot: average iterations per Newton step vs. total unknowns, 3D results on 512 procs: iterations grow roughly as N^0.45 for 1-level DD and as N^0.24 for 2-level DD with the approximate coarse solve, and are essentially constant (N^0) for 2-level DD with the exact coarse solve. Also shown: temperature iso-lines on a slice plane, velocity iso-surfaces and streamlines in 3D.]

38 NASA Langley Cluster SIG Incompressible Flow (Nek5000: unstructured spectral element code, www.mcs.anl.gov/appliedmath/Flow/cfd.html, c/o P. Fischer and H. Tufo) l unsteady Navier-Stokes solver n high-order tensor-product polynomials in space (N ~ 5-15) n high-order operator splitting in time l two-level overlapping additive Schwarz for pressure n spectral elements taken as subdomains n fast local solves (tensor-product diagonalization) n fast coarse-grid solver u sparse exact factorization yields x_0 = A_0^-1 b = X X^T b as a parallel matrix-vector product u low communication; scales to 10,000 processors l Results shown: transition near roughness element; transition in arterio-venous graft, Re_v = 2700; ASCI-Red scaling to P = 4096, 376 Gflops; 27 M gridpoint Navier-Stokes calculation; dual MPI multithreading

39 NASA Langley Cluster SIG FETI-DP for structural mechanics (c/o C. Farhat and K. Pierson) l Numerically scalable, hardware scalable solutions for realistic solid/shell models l Used in Sandia applications Salinas, Adagio, Andante [Plot: scaling on problems from 1 Mdof to 60 Mdof]

40 NASA Langley Cluster SIG PDE-constrained Optimization (c/o G. Biros and O. Ghattas) Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc l Optimal control of laminar viscous flow n optimization variables are surface suction/injection n objective is minimum drag n 700,000 states; 4,000 controls n 128 Cray T3E processors n ~5 hrs for optimal solution (~1 hr for analysis) [Figures: wing tip vortices, no control (left) vs. optimal control (right); optimal boundary controls shown as velocity vectors] www.cs.nyu.edu/~biros/veltisto/

41 NASA Langley Cluster SIG Agenda for future research l High concurrency (100,000 processors) l Asynchrony l Fault tolerance l Automated tuning l Integration of simulation with studies of sensitivity, stability, and optimization

42 NASA Langley Cluster SIG High Concurrency
Today: l 10,000 processors in a single room with tightly coupled network l DD computations scale well, when provided with n network rich enough for parallel near-neighbor communication n fast global reductions (complexity sublinear in processor count)
Future: l 100,000 processors, in a room or as part of a grid l Most phases of DD computations scale well n favorable surface-to-volume comm-to-comp ratio l However, latencies will nix frequent exact reductions l Paradigm: extrapolate data in retarded messages; correct (if necessary) when a message arrives, such as in the C(p,q,j) schemes of Garbey and Tromeur-Dervout

43 NASA Langley Cluster SIG Asynchrony
Today: l A priori partitionings for quasi-static meshes provide load-balanced computational tasks between frequent synchronization points l Good load balance is critical to parallel scalability on 1,000 processors and more
Future: l Adaptivity requirements and far-flung, nondedicated networks will lead to idleness and imbalance at synchronization points l Need algorithms with looser outer loops than global Newton-Krylov l Can we design algorithms that are robust with respect to incomplete convergence of inner tasks, like inexact Newton? l Paradigm: nonlinear Schwarz with regional (not global) nonlinear solvers where most execution time is spent

44 NASA Langley Cluster SIG Fault Tolerance (c/o A. Geist)
Today: l Fault tolerance is not a driver in most scientific application code projects l FT is handled as follows: n Detection that something is wrong u System: in hardware u Framework: by the runtime environment u Library: in the math or communication library n Notification of the application u Interrupt: signal sent to the job u Error code returned by the application process n Recovery u Restart from checkpoint u Migration of task to new hardware u Reassignment of work to remaining tasks
Future: l With 100,000 processors or worldwide networks, MTBF will be in minutes l Checkpoint-restart could take longer than the time to the next failure l Paradigm: naturally fault-tolerant algorithms, robust with respect to failure, such as a new FD algorithm at ORNL

45 NASA Langley Cluster SIG Automated Tuning
Today: l Knowledgeable user-developers parameterize their solvers with experience and theoretically informed intuition for: n problem size/processor ratio n outer solver type n Krylov solver type n DD preconditioner type n maximum subspace dimensions n overlaps n fill levels n inner tolerances n potentially many others
Future: l Less knowledgeable users required to employ parallel iterative solvers in taxing applications l Need safe defaults and automated tuning strategies l Paradigm: parallel direct search (PDS) derivative-free optimization methods, using overall parallel computational complexity as the objective function and algorithm tuning parameters as design variables, to tune the solver in preproduction trial executions

46 NASA Langley Cluster SIG Integrated Software
Today: l Each analysis is a “special effort” l Optimization, sensitivity analysis (e.g., for uncertainty quantification), and stability analysis to fully exploit and contextualize scientific results are rare
Future: l Analysis increasingly an “inner loop” around which more sophisticated science-driven tasks are wrapped l Need PDE task functionality (e.g., residual evaluation, Jacobian evaluation, Jacobian inverse) exposed to optimization/sensitivity/stability algorithms l Paradigm: integrated software based on common distributed data structures

47 NASA Langley Cluster SIG l Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups l For FY2002, 51 new projects at $57M/year total n Approximately one-third for ISICs n A third for grid infrastructure and collaboratories n A third for applications groups l 5 Tflop/s IBM SP platforms “Seaborg” at NERSC (#3 in latest “Top 500”) and “Cheetah” at ORNL (being installed now) available for SciDAC

48 NASA Langley Cluster SIG Introducing “Terascale Optimal PDE Simulations” (TOPS) ISIC Nine institutions, five years, 24 co-PIs

49 NASA Langley Cluster SIG TOPS l Not just algorithms, but vertically integrated software suites l Portable, scalable, extensible, tunable implementations, starring PETSc and hypre, among other existing packages l Driven by three SciDAC applications groups n LBNL-led “21st Century Accelerator” designs n ORNL-led core collapse supernovae simulations n PPPL-led magnetic fusion energy simulations and intended for many others

50 NASA Langley Cluster SIG Background of PETSc Library (in which the FUN3D example was implemented) l Developed by Balay, Gropp, McInnes & Smith (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by four additional staff (Buschelman, Kaushik, Knepley, Zhang) under SciDAC l Distributed data structures as fundamental objects: index sets, vectors/gridfunctions, and matrices/arrays l Iterative linear and nonlinear solvers, combinable modularly, recursively, and extensibly l Portable, and callable from C, C++, Fortran l Uniform high-level API, with multi-layered entry l Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter) See http://www.mcs.anl.gov/petsc
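For orientation, a minimal sketch of the solver call sequence through the petsc4py Python bindings; this is an assumption of this note (the talk-era interface was C/Fortran, and the linear solver object then called SLES has since been folded into KSP in current releases). The problem, options, and the choice of the additive Schwarz (ASM) preconditioner are illustrative.

```python
# Minimal petsc4py sketch (assumes petsc4py/PETSc are installed): assemble a
# 1D Laplacian, then solve with GMRES preconditioned by additive Schwarz (ASM).
from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n], nnz=3)
for i in range(n):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

b = A.createVecLeft()
b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType('gmres')
ksp.getPC().setType('asm')        # additive Schwarz preconditioner
ksp.setFromOptions()              # honor -ksp_type/-pc_type etc. from the command line
ksp.solve(b, x)
print('iterations:', ksp.getIterationNumber())
```

Command-line options such as -pc_asm_overlap or -sub_pc_type can then tune the Schwarz preconditioner without code changes, which is the sort of multi-layered entry point the bullets above refer to.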

51 NASA Langley Cluster SIG User Code/PETSc Library Interactions [Diagram: the PETSc main routine drives timestepping solvers (TS), nonlinear solvers (SNES), and linear solvers (SLES), the latter built from KSP and PC components; user code supplies application initialization, function evaluation, Jacobian evaluation, and post-processing.]

52 NASA Langley Cluster SIG User Code/PETSc Library Interactions [Same diagram as the previous slide, with some of the user-supplied evaluation routines marked “to be AD code,” i.e., generated by automatic differentiation.]

53 NASA Langley Cluster SIG Background of Hypre Library (to be combined with PETSc under SciDAC) l Developed by Chow, Cleary & Falgout (LLNL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by seven additional staff (Henson, Jones, Lambert, Painter, Tong, Treadway, Yang) under ASCI and SciDAC l Object-oriented design similar to PETSc l Concentrates on linear problems only l Richer in preconditioners than PETSc, with focus on algebraic multigrid l Includes other preconditioners, including sparse approximate inverse (Parasails) and parallel ILU (Euclid) See http://www.llnl.gov/CASC/hypre/

54 NASA Langley Cluster SIG Hypre’s “Conceptual Interfaces” [Diagram: linear system interfaces connect data layouts to linear solvers (structured: GMG, ...; composite: FAC, ...; block-structured: Hybrid, ...; unstructured: AMGe, ...; CSR: ILU, ...).] Slide c/o E. Chow, LLNL

55 NASA Langley Cluster SIG Sample of Hypre’s Scaled Efficiency Slide c/o E. Chow, LLNL

56 NASA Langley Cluster SIG Scope for TOPS l Design and implementation of “solvers” n Time integrators, with sensitivity analysis n Nonlinear solvers, with sensitivity analysis n Optimizers n Linear solvers n Eigensolvers l Software integration l Performance optimization [Diagram: dependences among the components: optimizer, sensitivity analyzer, time integrator, nonlinear solver, eigensolver, and linear solver.]

57 NASA Langley Cluster SIG Keyword: “Optimal” l Convergence rate nearly independent of discretization parameters n Multilevel schemes for linear and nonlinear problems n Newton-like schemes for quadratic convergence of nonlinear problems l Convergence rate as independent as possible of physical parameters n Continuation schemes n Physics-based preconditioning [Plot: time to solution vs. problem size (increasing with the number of processors), contrasting scalable with unscalable behavior; Prometheus parallel multigrid on a steel/rubber composite, c/o M. Adams, Berkeley-Sandia.] The solver is a key part, but not the only part, of the simulation that needs to be scalable

58 NASA Langley Cluster SIG TOPS Philosophy on PDEs l Solution of a system of PDEs is rarely a goal in itself n PDEs are solved to derive various outputs from specified inputs n Actual goal is characterization of a response surface or a design or control strategy n Together with analysis, sensitivities and stability are often desired ⇒ Tools for PDE solution should also support related desires

59 NASA Langley Cluster SIG TOPS Philosophy on Operators l A continuous operator may appear in a discrete code in many different instances n Optimal algorithms tend to be hierarchical and nested iterative n Processor-scalable algorithms tend to be domain-decomposed and concurrent iterative n Majority of progress towards the desired highly resolved, high-fidelity result occurs through cost-effective low-resolution, low-fidelity, parallel-efficient stages ⇒ Operator abstractions and recurrence must be supported

60 NASA Langley Cluster SIG Traditional Approach to Software Interoperability l Direct interfacing between different packages/libraries/apps n Public interfaces are unique l Many-to-many couplings require Many² interfaces n Often a heroic effort to understand the details of both codes n Not a scalable solution [Diagram: data/mesh software (SUMAA3d, DAs, GRACE, Overture) each directly coupled to linear solvers (Hypre, Trilinos, ISIS++, PETSc).] Slide c/o L. McInnes, ANL

61 NASA Langley Cluster SIG CCA Approach: Common Interface Specification l Reduces the Many-to-Many problem to a Many-to-One problem n Allows interchangeability and experimentation n Difficulties u Interface agreement u Functionality limitations u Maintaining performance [Diagram: data/mesh software (SUMAA3d, DAs, GRACE, Overture) and linear solvers (Hypre, Trilinos, ISIS++, PETSc) each coupled through the common ESI interface.] Slide c/o L. McInnes, ANL

62 NASA Langley Cluster SIG CCA Concept: SCMD (SPMD) Components [Diagram: an MPI application using CCA for interaction between components A and B within the same address space on each MPI process (Proc1, Proc2, Proc3, etc.): an adaptive mesh component written by user1 and a solver component written by user2, with the direct connection supplied by the framework at compile/runtime.] Slide c/o L. McInnes, ANL

63 NASA Langley Cluster SIG Conclusions l Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation l Several freely available software toolkits exist, and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes l Concerted efforts are underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all, adaptive! l Many challenges loom at the “next scale” of computation l Undoubtedly, new theory and algorithms will be part of the solution

64 NASA Langley Cluster SIG Acknowledgments This talk was prepared despite interruptions by the …

65 NASA Langley Cluster SIG Acknowledgments l Early supporters at ICASE: n Bob Voigt n Yousuff Hussaini l Early supporters at NASA: n Manny Salas n Jim Thomas

66 NASA Langley Cluster SIG Acknowledgments l Collaborators or Contributors: n George Biros (NYU) n Xiao-Chuan Cai (Univ. Colorado, Boulder) n Paul Fischer (ANL) n Al Geist (ORNL) n Omar Ghattas (Carnegie-Mellon) n Dinesh Kaushik (ODU) n Dana Knoll (LANL) n Dimitri Mavriplis (ICASE) n Kendall Pierson (Sandia) n Henry Tufo (ANL) n AZTEC team at Sandia National Laboratory: John Shadid, Ray Tuminaro n PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith l Sponsors: DOE, NASA, NSF l Computer Resources: LLNL, LANL, SNL, NERSC

67 NASA Langley Cluster SIG Related URLs l Personal homepage: papers, talks, etc. http://www.math.odu.edu/~keyes l SciDAC initiative http://www.science.doe.gov/scidac l TOPS project http://www.math.odu.edu/~keyes/scidac l PETSc project http://www.mcs.anl.gov/petsc l Hypre project http://www.llnl.gov/CASC/hypre l ASCI platforms http://www.llnl.gov/asci/platforms

68 NASA Langley Cluster SIG Bibliography l Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, to be submitted to J. Comp. Phys. l Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, to appear in SIAM J. Sci. Comp. l High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362 l Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17 l Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136 l Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99 l Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278 l How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297

