Gordon Bell Prize Finalist Presentation SC’99 Achieving High Sustained Performance in an Unstructured Mesh CFD Application

Presentation transcript:

Gordon Bell Prize Finalist Presentation SC’99
Achieving High Sustained Performance in an Unstructured Mesh CFD Application
Kyle Anderson, NASA Langley Research Center
William Gropp, Argonne National Laboratory
Dinesh Kaushik, Old Dominion University & Argonne
David Keyes, Old Dominion University, LLNL & ICASE
Barry Smith, Argonne National Laboratory

Gordon Bell Prize Finalist Talk SC’99 Application Performance History: 3 orders of magnitude in 10 years

Gordon Bell Prize Finalist Talk SC’99 Features of this 1999 Submission
- Based on a “legacy” (but contemporary) CFD application with significant F77 code reuse
- Portable, message-passing, library-based parallelization, run on everything from NT boxes through Tflop/s ASCI platforms
- Simple multithreaded extension (for ASCI Red)
- Sparse, unstructured data, implying memory indirection with only modest reuse; nothing in this category has ever before advanced to the Bell finalist round
- Wide applicability to other implicitly discretized, multiple-scale PDE workloads of interagency, interdisciplinary interest
- Extensive profiling has led to follow-on algorithmic research

Gordon Bell Prize Finalist Talk SC’99 Application Domain: Computational Aerodynamics

Gordon Bell Prize Finalist Talk SC’99 Background of FUN3D Application
- Tetrahedral, vertex-centered unstructured grid code developed by W. K. Anderson (LaRC) for steady compressible and incompressible Euler and Navier-Stokes equations (with one-equation turbulence modeling)
- Used in airplane, automobile, and submarine applications for analysis and design
- Standard discretization is 2nd-order Roe for convection and Galerkin for diffusion
- Newton-Krylov solver with global point-block-ILU preconditioning, with false timestepping for nonlinear continuation towards steady state; competitive with FAS multigrid in practice
- Legacy implementation/ordering is vector-oriented

Gordon Bell Prize Finalist Talk SC’99 Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing
- Wing surface outlined in green triangles
- Nearly 2.8 M vertices in this computational domain

Gordon Bell Prize Finalist Talk SC’99 Fixed-size Parallel Scaling Results (Flop/s)

Gordon Bell Prize Finalist Talk SC’99 Fixed-size Parallel Scaling Results (Time in seconds)

Gordon Bell Prize Finalist Talk SC’99 Inside the Parallel Scaling Results on ASCI Red
ONERA M6 wing test case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (each with dual 333 MHz Pentium Pro processors)

Gordon Bell Prize Finalist Talk SC’99 Algorithm: Newton-Krylov-Schwarz
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable

Gordon Bell Prize Finalist Talk SC’99 Merits of NKS Algorithm/Implementation
Relative characteristics: the “exponents” are naturally good
- Convergence scalability: weak (or no) degradation in problem size and parallel granularity (with use of small global problems in the Schwarz preconditioner)
- Implementation scalability: no degradation in the ratio of surface communication to volume work (in the problem-scaled limit); only modest degradation from global operations (for sufficiently richly connected networks)
Absolute characteristics: the “constants” can be made good
- Operation count complexity: residual reductions to near machine precision in roughly 10^3 “work units”
- Per-processor performance: up to 25% of theoretical peak
Overall, machine-epsilon solutions require as little as 15 microseconds per degree of freedom!

Gordon Bell Prize Finalist Talk SC’99 Primary PDE Solution Kernels
Vertex-based loops
- state vector and auxiliary vector updates
Edge-based “stencil op” loops
- residual evaluation
- approximate Jacobian evaluation
- Jacobian-vector product (often replaced with a matrix-free form, involving residual evaluation)
Sparse, narrow-band recurrences
- approximate factorization and back substitution
Vector inner products and norms
- orthogonalization/conjugation
- convergence progress and stability checks
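To make the memory-access pattern of the edge-based kernels concrete, here is a schematic flux-accumulation loop. It is a sketch, not FUN3D's actual code: the array names, the scalar state, and the half-sum standing in for a Roe flux are all illustrative.

/* Schematic edge-based residual accumulation (illustrative only).
 * Each edge gathers the states of its two endpoint vertices through an
 * index array and scatter-adds a flux back, so both reads and writes are
 * indirect; this is the indirection-with-modest-reuse pattern noted above. */
void accumulate_residual(int nedges, const int (*edge_node)[2],
                         const double *state, double *residual)
{
  for (int e = 0; e < nedges; e++) {
    int i = edge_node[e][0];                    /* first endpoint vertex   */
    int j = edge_node[e][1];                    /* second endpoint vertex  */
    double flux = 0.5 * (state[i] + state[j]);  /* stand-in for a Roe flux */
    residual[i] += flux;                        /* indirect scatter-add    */
    residual[j] -= flux;
  }
}

The vertex reordering and interlacing discussed later in this talk improve the locality of exactly these indirect accesses.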

Gordon Bell Prize Finalist Talk SC’99 Additive Schwarz Preconditioning for Au = f in Ω
Form the preconditioner B out of (approximate) local solves on (overlapping) subdomains.
Let R_i and R_i^T be Boolean gather and scatter operations, mapping between a global vector and its i-th subdomain support.
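In these terms, the one-level additive Schwarz preconditioner takes the standard form, with A_i the restriction of A to subdomain i and the tilde marking that the local solve may be approximate (e.g., ILU):

B = \sum_{i=1}^{P} R_i^{T}\,\tilde{A}_i^{-1} R_i, \qquad A_i = R_i A R_i^{T}

The Krylov method is then applied to the preconditioned operator BA (left preconditioning) or AB (right preconditioning).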

Gordon Bell Prize Finalist Talk SC’99 Iteration Count Estimates from the Schwarz Theory [ref: Smith, Bjorstad & Gropp, 1996, Camb. Univ. Pr.]
Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system.
In terms of N and P, where for d-dimensional isotropic problems N = h^(-d) and P = H^(-d), for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning Type          in 2D             in 3D
Point Jacobi                  Ο(N^(1/2))        Ο(N^(1/3))
Domain Jacobi                 Ο((NP)^(1/4))     Ο((NP)^(1/6))
1-level Additive Schwarz      Ο(P^(1/2))        Ο(P^(1/3))
2-level Additive Schwarz      Ο(1)              Ο(1)
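One way to read the one-level rows: with overlap δ and no coarse space, the standard Schwarz condition-number bound gives

\kappa(BA) \le C\left(1 + \frac{1}{H\,\delta}\right) = O(H^{-2}) \quad \text{for } \delta \sim H,
\qquad
\text{iterations} \sim \sqrt{\kappa(BA)} = O(H^{-1}) = O(P^{1/d}),

i.e., Ο(P^(1/2)) in 2D and Ο(P^(1/3)) in 3D, while adding a coarse (second) level removes the H-dependence and yields the Ο(1) row.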

Gordon Bell Prize Finalist Talk SC’99 Time-Implicit Newton-Krylov-Schwarz Method
For nonlinear robustness, NKS iteration is wrapped in time-stepping:

for (l = 0; l < n_time; l++) {              # n_time ~ 50
    select time step
    for (k = 0; k < n_Newton; k++) {        # n_Newton ~ 1
        compute nonlinear residual and Jacobian
        for (j = 0; j < n_Krylov; j++) {    # n_Krylov ~ 50
            forall (i = 0; i < n_Precon; i++) {
                solve subdomain problems concurrently
            } // End of loop over subdomains
            perform Jacobian-vector product
            enforce Krylov basis conditions
            update optimal coefficients
            check linear convergence
        } // End of linear solver
        perform DAXPY update
        check nonlinear convergence
    } // End of nonlinear loop
} // End of time-step loop

Gordon Bell Prize Finalist Talk SC’99 Separation of Concerns: User Code / PETSc Library
[Diagram: the user code supplies application initialization, function evaluation, Jacobian evaluation, and post-processing; the PETSc library supplies the main routine and the solver stack of timestepping solvers (TS), nonlinear solvers (SNES), and linear solvers (SLES), built on KSP and PC.]
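The sketch below shows how this division of labor looks in code: the user supplies residual and Jacobian callbacks, and the library drives the Newton-Krylov solve. It is a minimal toy problem, not the FUN3D interface; AppCtx, FormFunction, and FormJacobian are illustrative names, and the PetscCall/PETSC_SUCCESS macros assume a recent PETSc (3.19 or later), whose SNES/KSP layers have replaced the 1999-era SLES interface.

#include <petscsnes.h>

typedef struct { PetscReal target; } AppCtx;   /* illustrative user context */

/* User-side residual: f(x) = x^2 - target (stands in for the CFD residual). */
static PetscErrorCode FormFunction(SNES snes, Vec x, Vec f, void *ctx)
{
  AppCtx            *user = (AppCtx *)ctx;
  const PetscScalar *xx;
  PetscScalar       *ff;
  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(x, &xx));
  PetscCall(VecGetArray(f, &ff));
  ff[0] = xx[0]*xx[0] - user->target;
  PetscCall(VecRestoreArrayRead(x, &xx));
  PetscCall(VecRestoreArray(f, &ff));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* User-side Jacobian: J = 2x (stands in for the approximate flux Jacobian);
 * here the operator J and the preconditioning matrix P are the same object. */
static PetscErrorCode FormJacobian(SNES snes, Vec x, Mat J, Mat P, void *ctx)
{
  const PetscScalar *xx;
  PetscScalar        v;
  PetscInt           row = 0;
  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(x, &xx));
  v = 2.0*xx[0];
  PetscCall(MatSetValues(P, 1, &row, 1, &row, &v, INSERT_VALUES));
  PetscCall(VecRestoreArrayRead(x, &xx));
  PetscCall(MatAssemblyBegin(P, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(P, MAT_FINAL_ASSEMBLY));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  SNES   snes;
  Vec    x, r;
  Mat    J;
  AppCtx user = {4.0};

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, 1, &x));
  PetscCall(VecDuplicate(x, &r));
  PetscCall(MatCreateSeqAIJ(PETSC_COMM_SELF, 1, 1, 1, NULL, &J));
  PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));
  PetscCall(SNESSetFunction(snes, r, FormFunction, &user));   /* user residual  */
  PetscCall(SNESSetJacobian(snes, J, J, FormJacobian, &user)); /* user Jacobian  */
  PetscCall(SNESSetFromOptions(snes));   /* e.g. -ksp_type gmres -pc_type asm    */
  PetscCall(VecSet(x, 1.0));
  PetscCall(SNESSolve(snes, NULL, x));   /* library drives the Newton-Krylov solve */
  PetscCall(SNESDestroy(&snes));
  PetscCall(MatDestroy(&J));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&r));
  PetscCall(PetscFinalize());
  return 0;
}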

Gordon Bell Prize Finalist Talk SC’99 Key Features of Implementation Strategy
- Follow the “owner computes” rule under the dual constraints of minimizing the number of messages and overlapping communication with computation
- Each processor “ghosts” its stencil dependences in its neighbors
- Ghost nodes ordered after contiguous owned nodes
- Domain mapped from (user) global ordering into local orderings
- Scatter/gather operations created between local sequential vectors and global distributed vectors, based on runtime connectivity patterns (illustrated in the sketch below)
- Newton-Krylov-Schwarz operations translated into local tasks and communication tasks
- Profiling used to help eliminate performance bugs in communication and memory hierarchy
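A minimal sketch of that ghosting pattern, using PETSc's ghosted vectors, follows. The helper name refresh_and_use_ghosts, the mesh-specific sizes, and the surrounding edge loop are placeholders, and PetscCall/PETSC_SUCCESS again assume a recent PETSc.

#include <petscvec.h>

/* Refresh ghost copies of neighbor-owned values, then expose the local
 * form [owned entries | ghost entries] for the edge-based loops.
 * The ghosted vector would be created once with, e.g.,
 *   VecCreateGhost(PETSC_COMM_WORLD, n_owned, PETSC_DECIDE,
 *                  n_ghost, ghost_global_indices, &xglobal);            */
PetscErrorCode refresh_and_use_ghosts(Vec xglobal)
{
  Vec                xlocal;
  const PetscScalar *xv;
  PetscFunctionBeginUser;
  /* Begin/End split allows overlapping this exchange with purely local work. */
  PetscCall(VecGhostUpdateBegin(xglobal, INSERT_VALUES, SCATTER_FORWARD));
  PetscCall(VecGhostUpdateEnd(xglobal, INSERT_VALUES, SCATTER_FORWARD));

  PetscCall(VecGhostGetLocalForm(xglobal, &xlocal));
  PetscCall(VecGetArrayRead(xlocal, &xv));
  /* ... edge loops read xv[] for owned and ghost vertices alike ... */
  PetscCall(VecRestoreArrayRead(xlocal, &xv));
  PetscCall(VecGhostRestoreLocalForm(xglobal, &xlocal));
  PetscFunctionReturn(PETSC_SUCCESS);
}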

Gordon Bell Prize Finalist Talk SC’99 Background of PETSc
- Developed by Gropp, Smith, McInnes & Balay (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments
- Distributed data structures as fundamental objects: index sets, vectors/gridfunctions, and matrices/arrays
- Iterative linear and nonlinear solvers, combinable modularly, recursively, and extensibly
- Portable, and callable from C, C++, and Fortran
- Uniform high-level API, with multi-layered entry
- Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter)

Gordon Bell Prize Finalist Talk SC’99 Single-processor Performance of PETSc-FUN3D
[Table: per-processor results for a range of processors (P2SC 2-card and 4-card, Alpha, Ultra II, Ultra II/HPC, Pentium II under Linux and NT, Pentium Pro, and others), reporting clock rate (MHz), peak Mflop/s, and achieved Mflop/s for the original code, the interlacing-only and reordering-only variants, and the fully optimized code, together with percent of peak for the original and optimized versions; numerical entries omitted.]

Gordon Bell Prize Finalist Talk SC’99 Lessons for High-end Simulation of PDEs
- Unstructured (static) grid codes can run well on distributed hierarchical-memory machines, with attention to partitioning, vertex ordering, component ordering, blocking, and tuning
- Parallel solver libraries can give new life to the most valuable, discipline-specific modules of legacy PDE codes
- Parallel scalability is easy, but attaining high per-processor performance for sparse problems gets more challenging with each machine generation
- The NKS family of algorithms can be and must be tuned to an application-architecture combination; profiling is critical
- Some gains from hybrid parallel programming models (message passing and multithreading together) require little work; squeezing the last drop is likely much more difficult

Gordon Bell Prize Finalist Talk SC’99 Remaining Challenges
- Parallelization of the solver leaves mesh generation, I/O, and post-processing as Amdahl bottlenecks in overall time-to-solution
  - moving the finest mesh cross-country with ftp may take hours; the ideal software environment would generate and verify correctness of the mesh in parallel, from a relatively small geometry file
- Solution adaptivity of the mesh and parallel redistribution are important in the ultimate production environment
- Better multilevel preconditioners are needed in some applications
- In progress:
  - wrapping a parallel optimization capability (Lagrange-Newton-Krylov-Schwarz) around our parallel solver, with substantial code reuse (automatic differentiation tools will help)
  - integrating computational snooping and steering into the PETSc environment

Gordon Bell Prize Finalist Talk SC’99 Bibliography
- Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, 1999, in “Proceedings of Parallel CFD’99,” Elsevier (to appear)
- Implementation of a Parallel Framework for Aerodynamic Design Optimization on Unstructured Meshes, Nielsen, Anderson & Kaushik, 1999, in “Proceedings of Parallel CFD’99,” Elsevier (to appear)
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer
- Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press
- How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press
- On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, 1998, in “Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods,” AMS

Gordon Bell Prize Finalist Talk SC’99 Acknowledgments
- Accelerated Strategic Computing Initiative, DOE: access to ASCI Red and Blue machines
- National Energy Research Scientific Computing Center (NERSC), DOE: access to large T3E
- SGI-Cray: access to large T3E
- National Science Foundation: research sponsorship under the Multidisciplinary Computing Challenges Program
- U.S. Department of Education: graduate fellowship support for D. Kaushik

Gordon Bell Prize Finalist Talk SC’99 Related URLs
- Follow-up on this talk
- PETSc
- FUN3D
- ASCI platforms
- International Conferences on Domain Decomposition Methods
- International Conferences on Parallel CFD