
1 High Resolution Aerospace Applications using the NASA Columbia Supercomputer
Dimitri J. Mavriplis, University of Wyoming
Michael J. Aftosmis, NASA Ames Research Center
Marsha Berger, Courant Institute, NYU

2 Computational Aerospace Design and Analysis
Computational Fluid Dynamics (CFD) successes:
–Preliminary design estimates for various conditions
–Accurate prediction at cruise conditions (no flow separation)

3 Computational Aerospace Design and Analysis
Computational Fluid Dynamics (CFD) successes:
–Preliminary design estimates for various conditions
–Accurate prediction at cruise conditions (no flow separation)
CFD shortcomings:
–Poor predictive ability at off-design conditions (complex geometry, flow separation)
–High accuracy requirements: drag coefficient to 10⁻⁴ (i.e. comparable to wind tunnels)
–Production runs with 10⁹ grid points (vs. 10⁷ today) should become commonplace (NIA Congressional Report, 2005)

4 Computational Aerospace Design and Analysis
Computational Fluid Dynamics (CFD) successes:
–Preliminary design estimates for various conditions
–Accurate prediction at cruise conditions (no flow separation)
CFD shortcomings:
–Poor predictive ability at off-design conditions (complex geometry, flow separation)
–High accuracy requirements: drag coefficient to 10⁻⁴ (i.e. comparable to wind tunnels)
–Production runs with 10⁹ grid points (vs. 10⁷ today) should become commonplace (NIA Congressional Report, 2005)
Drive towards simulation-based design:
–Design optimization (20–100 analyses)
–Flight envelope simulation (10³–10⁶ analyses)
–Unsteady simulations: aeroelastics, digital flight (simulation of a maneuvering vehicle)

5 Motivation
CFD computational requirements for aerospace vehicle design are essentially insatiable for the foreseeable future
–Addressable through hardware and software (algorithmic) advances
The recently installed NASA Columbia Supercomputer provides a quantum leap in the agency's and its stakeholders' computing capability
Demonstrate advances in the state of the art using two production codes on Columbia:
–Cart3D (NASA Ames, NYU): lower fidelity, rapid-turnaround analysis
–NSU3D (ICASE/NASA Langley, U. Wyoming): higher fidelity, analysis and design

6 Cart3D: Cartesian Mesh Inviscid Flow Simulation Package
Unprecedented level of automation:
–Inviscid analysis package
–Surface modeling, mesh generation, data extraction
–Insensitive to geometric complexity
Aimed at:
–Aerodynamic database generation
–Parametric studies
–Preliminary design
Wide dissemination:
–NASA, DoD, DOE, intelligence agencies
–US aerospace industry, commercial and general aviation
Workflow: mesh generation → domain decomposition → flow solution → parametric analysis

7 Multigrid Scheme: Fast O(N) Convergence
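The slide only names the scheme, so here is a minimal, self-contained sketch (Python, not the Cart3D code) of a geometric multigrid V-cycle for a 1D Poisson problem. It illustrates why the cost per cycle stays O(N): each coarser level carries roughly half the unknowns, so the total work over all levels is a geometric series, and the number of cycles needed is essentially independent of N. The problem, smoother, and transfer operators are toy choices for illustration.

    import numpy as np

    def smooth(u, f, h, iters=3):
        # Damped Jacobi sweeps for -u'' = f on a uniform 1D grid (Dirichlet BCs).
        for _ in range(iters):
            u[1:-1] += 0.67 * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
        return u

    def residual(u, f, h):
        r = np.zeros_like(u)
        r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
        return r

    def v_cycle(u, f, h):
        # Pre-smooth, restrict residual, recurse, prolong correction, post-smooth.
        u = smooth(u, f, h)
        if len(u) <= 3:                      # coarsest level: smoothing suffices
            return smooth(u, f, h, iters=20)
        r = residual(u, f, h)
        rc = r[::2].copy()                   # restriction by injection
        ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
        e = np.zeros_like(u)
        e[::2] = ec                          # prolongation: copy coarse points,
        e[1::2] = 0.5 * (ec[:-1] + ec[1:])   # linear interpolation in between
        return smooth(u + e, f, h)

    n = 129                                  # 2^k + 1 points so coarse grids nest
    x = np.linspace(0.0, 1.0, n)
    f = np.sin(np.pi * x)                    # exact solution: sin(pi x) / pi^2
    u = np.zeros(n)
    for _ in range(10):
        u = v_cycle(u, f, x[1] - x[0])
    print(np.max(np.abs(u - np.sin(np.pi * x) / np.pi**2)))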

8 Cart3D Solver Programming Paradigm
Explicit subdomain communication (figure: fine and coarse levels of partitions 0 and 1)
–Space-filling-curve based partitioner and mesh coarsener
–Each subdomain has its own local grid hierarchy
–Good (not perfect) nesting – favors load balance at each level
–Restricted use of OpenMP constructs – MPI-like architecture
–Exchange via structure copy (OpenMP) or send/receive (MPI)
–Each subdomain resides in processor-local memory
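As an illustration of the space-filling-curve idea (a toy sketch, not the Cart3D partitioner): cells are ordered by a Morton (Z-order) key built by interleaving the bits of their integer coordinates, and the ordered list is cut into equal-sized contiguous chunks. Contiguous pieces of the curve tend to be spatially compact, which keeps subdomain surface area, and hence communication, small. The 10-bit coordinate range and the 8x8x8 toy mesh are arbitrary choices.

    import numpy as np

    def morton_key(i, j, k, bits=10):
        # Interleave the bits of integer cell coordinates (i, j, k) into a single
        # Morton (Z-order) key; cells close on the curve are close in space.
        key = 0
        for b in range(bits):
            key |= ((i >> b) & 1) << (3 * b)
            key |= ((j >> b) & 1) << (3 * b + 1)
            key |= ((k >> b) & 1) << (3 * b + 2)
        return key

    def sfc_partition(cells, nparts):
        # Sort cells along the Morton curve, then cut the curve into nparts
        # contiguous pieces with equal cell counts (load balance).
        order = sorted(range(len(cells)), key=lambda c: morton_key(*cells[c]))
        part = np.empty(len(cells), dtype=int)
        for rank, chunk in enumerate(np.array_split(order, nparts)):
            part[chunk] = rank
        return part

    # Toy example: cells of an 8x8x8 Cartesian mesh split into 4 subdomains.
    cells = [(i, j, k) for i in range(8) for j in range(8) for k in range(8)]
    print(np.bincount(sfc_partition(cells, 4)))   # -> [128 128 128 128]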

9 Flight-Envelope Data-Base Generation (Parametric Analysis)
Configuration space:
–Vary geometric parameters (control surface deflection, shape optimization)
–Requires remeshing
Wind-space parameters:
–Vary wind vector: Mach, α (incidence), β (sideslip)
–No remeshing
Completely automated:
–Hierarchical job launching and scheduling
–Data retrieval

10 Aerodynamic Database Generation Parametric Analysis

11 Flight-Envelope Data-Base Generation (Parametric Analysis)
Configuration space:
–Vary geometric parameters (control surface deflection, shape optimization)
–Requires remeshing
Wind-space parameters:
–Vary wind vector: Mach, α (incidence), β (sideslip)
–No remeshing
Completely automated:
–Hierarchical job launching and scheduling
–Data retrieval

12 Database Generation Parametric Analysis: Wind-Space
Wind-space: M∞ = {0.2–6.0}, α = {-5°–30°}, β = {0°–30°}
P has dimensions (38 x 25 x 5); 2900 simulations
Liquid glide-back booster: crank delta wing, canards, tail; wind-space only

13 Typically smaller-resolution runs, 32-64 cpus each, farmed out simultaneously (PBS); 2900 simulations
Wind-space: M∞ = {0.2–6.0}, α = {-5°–30°}, β = {0°–30°}; P has dimensions (38 x 25 x 5)
Liquid glide-back booster: crank delta wing, canards, tail; wind-space only
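To make the farming step concrete, below is a hedged sketch (Python) of how such a wind-space sweep could be scripted: enumerate a (Mach, alpha, beta) grid and write one PBS job script per case. The run_case.sh wrapper, the PBS resource line, and the uniform spacing of the sweep are illustrative assumptions, not the study's actual scripts or sampling (the slide reports ~2900 simulations, so the real sweep does not run every tensor-product point).

    import itertools
    import os
    import textwrap
    import numpy as np

    # Wind-space from the slide: Mach 0.2-6.0, alpha -5..30 deg, beta 0..30 deg,
    # sampled here on a uniform 38 x 25 x 5 grid (the real spacing may differ).
    mach  = np.linspace(0.2, 6.0, 38)
    alpha = np.linspace(-5.0, 30.0, 25)
    beta  = np.linspace(0.0, 30.0, 5)

    os.makedirs("jobs", exist_ok=True)
    count = 0
    for m, a, b in itertools.product(mach, alpha, beta):
        case = f"case_{count:04d}"
        # 'run_case.sh' is a hypothetical wrapper around the flow solver; the
        # 32-cpu resource request is a placeholder, not the real job setup.
        script = textwrap.dedent(f"""\
            #PBS -N {case}
            #PBS -l select=1:ncpus=32
            cd $PBS_O_WORKDIR
            ./run_case.sh --mach {m:.3f} --alpha {a:.2f} --beta {b:.2f} --out {case}
            """)
        with open(os.path.join("jobs", case + ".pbs"), "w") as fh:
            fh.write(script)
        count += 1
    # Each script would then be submitted independently, e.g. 'qsub jobs/case_0000.pbs'.
    print(f"wrote {count} job scripts")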

14 Need for higher resolution simulations: selected data-base points; general drive to higher accuracy
Good large-cpu-count scalability is important
Wind-space: M∞ = {0.2–6.0}, α = {-5°–30°}, β = {0°–30°}; P has dimensions (38 x 25 x 5); 2900 simulations
Liquid glide-back booster: crank delta wing, canards, tail; wind-space only

15 NSU3D: Unstructured Navier-Stokes Solver
High fidelity viscous analysis:
–Resolves the thin boundary layer to the wall: O(10⁻⁶) normal spacing, stiff discrete equations to solve
–Suite of turbulence models available
–High accuracy objective: 0.01% Cd
–50-100 times the cost of inviscid analysis (Cart3D)
Unstructured mixed-element grids for complex geometries (VGRID: NASA Langley)
Production use in commercial and general aviation industry
Extension to design optimization and unsteady simulations

16 Agglomeration Multigrid
Agglomeration multigrid solvers for unstructured meshes
–Coarse level meshes constructed by agglomerating fine grid cells/equations

17 Agglomeration Multigrid
–Automated graph-based coarsening algorithm
–Coarse levels are graphs
–Coarse level operator by Galerkin projection
–Grid-independent convergence rates (order of magnitude improvement)
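A minimal sketch (SciPy, not NSU3D) of a coarse-level operator built by Galerkin projection from an agglomeration: each fine unknown is assigned to an agglomerate, a piecewise-constant prolongation P is assembled, and the coarse operator is R A P with R = P^T. The 1D Laplacian and pairwise agglomeration are toy choices.

    import numpy as np
    import scipy.sparse as sp

    def galerkin_coarse(A, agg):
        # agg[i] = index of the agglomerate (coarse equation) containing fine row i.
        n_fine = A.shape[0]
        n_coarse = agg.max() + 1
        # Piecewise-constant prolongation: P[i, agg[i]] = 1.
        P = sp.csr_matrix((np.ones(n_fine), (np.arange(n_fine), agg)),
                          shape=(n_fine, n_coarse))
        R = P.T                      # restriction by summation over each agglomerate
        return R @ A @ P             # Galerkin coarse-level operator

    # Toy fine operator: 1D Laplacian on 8 unknowns, agglomerated in pairs.
    n = 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    agg = np.arange(n) // 2          # agglomerates {0,1}, {2,3}, ...
    Ac = galerkin_coarse(A, agg)
    print(Ac.toarray())              # again a (smaller) tridiagonal operator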

18 Anisotropy-Induced Stiffness
Convergence rates for RANS (viscous) problems are much slower than for inviscid flows
–Mainly due to grid stretching
–Thin boundary-layer and wake regions
–Mixed-element (prism-tet) grids
Use a directional solver to relieve stiffness
–Line solver in anisotropic regions

19 Method of Solution
–Line-implicit solver
–Strong coupling
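The line-implicit idea reduces, along each line of strongly coupled points in the stretched boundary-layer region, to a (block-)tridiagonal solve. Below is a minimal scalar Thomas-algorithm sketch in Python/NumPy; the production solver works with small blocks of coupled equations per point rather than scalars, but the recurrence is the same.

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a tridiagonal system: a = sub-diagonal (n-1), b = diagonal (n),
        c = super-diagonal (n-1), d = right-hand side (n)."""
        n = len(b)
        cp = np.empty(n - 1)
        dp = np.empty(n)
        cp[0] = c[0] / b[0]
        dp[0] = d[0] / b[0]
        for i in range(1, n):                 # forward elimination along the line
            denom = b[i] - a[i - 1] * cp[i - 1]
            if i < n - 1:
                cp[i] = c[i] / denom
            dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / denom
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):        # back-substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # Each implicit line in the stretched boundary-layer region yields one such solve.
    n = 6
    a = -np.ones(n - 1); c = -np.ones(n - 1); b = 2.0 * np.ones(n)
    d = np.ones(n)
    print(thomas(a, b, c, d))                 # matches np.linalg.solve on the full matrix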

20 Parallelization through Domain Decomposition
Intersected edges resolved by ghost vertices
–Generates communication between original and ghost vertex
–Handled using MPI and/or OpenMP (hybrid implementation)
–Local reordering within each partition for cache locality
Multigrid levels partitioned independently
–Match levels using a greedy algorithm
–Optimize intra-grid communication vs. inter-grid communication

21 Partitioning (Block) Tridiagonal Lines
Line solver is inherently sequential
–Contract graph along implicit lines
–Weight edges and vertices
–Partition contracted graph
–Decontract graph: lines are guaranteed never to be broken; possible small increase in imbalance/cut edges
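A small sketch (Python, not the NSU3D partitioner) of the contraction step described above: every implicit line collapses to one super-vertex weighted by its point count, edges between different lines are merged with summed weights, and the contracted, weighted graph is what would be handed to a standard graph partitioner (e.g. METIS). Because each line is a single vertex of the contracted graph, no partition boundary can ever cut a line. The toy graph at the bottom is made up for illustration.

    from collections import defaultdict

    def contract_along_lines(edges, line_of, n_lines):
        """edges: list of (u, v) mesh edges; line_of[v]: index of the implicit line
        containing vertex v (a singleton line for vertices outside line regions).
        Returns vertex weights and merged edge weights of the contracted graph."""
        vweight = [0] * n_lines
        for v, line in enumerate(line_of):
            vweight[line] += 1                  # work ~ number of points in the line
        eweight = defaultdict(int)
        for u, v in edges:
            lu, lv = line_of[u], line_of[v]
            if lu != lv:                        # edges inside a line disappear
                eweight[(min(lu, lv), max(lu, lv))] += 1
        return vweight, dict(eweight)

    # Toy example: 6 vertices; vertices 0-2 form line 0, vertices 3-4 form line 1,
    # vertex 5 is its own (singleton) line 2.
    edges = [(0, 1), (1, 2), (0, 3), (1, 4), (3, 4), (2, 5), (4, 5)]
    line_of = [0, 0, 0, 1, 1, 2]
    vw, ew = contract_along_lines(edges, line_of, 3)
    print(vw)   # [3, 2, 1]
    print(ew)   # {(0, 1): 2, (0, 2): 1, (1, 2): 1}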

22 Partitioning Example
32-way partition of a 30,562-point 2D grid
–Unweighted partition: 2.6% edges cut, 2.7% lines cut
–Weighted partition: 3.2% edges cut, 0% lines cut

23 Hybrid MPI-OMP (NSU3D)
–MPI master gathers/scatters to OMP threads
–OMP local thread-to-thread communication occurs during MPI Irecv wait time (attempt to overlap)
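A schematic sketch of the overlap pattern using mpi4py (an assumption for illustration; NSU3D itself is not Python): post non-blocking receives and sends for the ghost data first, do the intra-node work that stands in for the OpenMP thread-to-thread copies, and only then wait on the MPI requests. Run with something like 'mpiexec -n 4 python overlap.py'; the ring-neighbor layout and buffer sizes are made up.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Hypothetical halo layout: each rank exchanges a small buffer with both
    # neighbours on a ring; the 'local work' below stands in for the OpenMP
    # thread-to-thread exchanges done while the MPI receives are in flight.
    nbrs = [(rank - 1) % size, (rank + 1) % size]
    send = [np.full(4, rank, dtype=np.float64) for _ in nbrs]
    recv = [np.empty(4, dtype=np.float64) for _ in nbrs]

    reqs = []
    for buf, nbr in zip(recv, nbrs):
        reqs.append(comm.Irecv(buf, source=nbr))      # post receives first
    for buf, nbr in zip(send, nbrs):
        reqs.append(comm.Isend(buf, dest=nbr))        # then non-blocking sends

    local_work = np.arange(1000.0)                    # intra-node (thread) work,
    local_sum = local_work.sum()                      # overlapped with communication

    MPI.Request.Waitall(reqs)                         # wait only after local work
    print(rank, [b[0] for b in recv], local_sum)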

24 Simulation Strategy
NSU3D: isolated high resolution analyses and design optimization
Cart3D: rapid flight envelope data-base fill-in
Examine performance of each code individually on Columbia
–Both codes use customized multigrid solvers
–Domain-decomposition based parallelism
–Extensive cache-locality reordering optimization: 1.3 to 1.6 GFlops on a 1.6GHz Itanium2 cpu (pfmon utility)
–NSU3D: hybrid MPI/OpenMP
–Cart3D: MPI or OpenMP (exclusively)
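The slides do not say which cache-locality reordering the codes use; as one standard possibility, the sketch below applies reverse Cuthill-McKee (SciPy) to a stand-in connectivity graph and reports the bandwidth reduction, a rough proxy for how close coupled unknowns end up in memory after renumbering.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    # Random symmetric sparse matrix standing in for an unstructured-grid
    # connectivity graph (nonzero (i, j) means vertices i and j are coupled).
    n = 500
    A = sp.random(n, n, density=0.01, random_state=0, format="csr")
    A = (A + A.T).tocsr()
    A.data[:] = 1.0

    perm = reverse_cuthill_mckee(A, symmetric_mode=True)   # new vertex ordering
    B = A[perm][:, perm]                                   # renumbered graph

    def bandwidth(M):
        # Maximum |row - col| over the nonzeros of the sparsity pattern.
        coo = M.tocoo()
        return int(np.max(np.abs(coo.row - coo.col)))

    print("bandwidth before:", bandwidth(A), "after:", bandwidth(B))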

25 NASA Columbia Supercluster
20 SGI Altix nodes
–512 Itanium2 cpus each (1.5GHz / 1.6GHz)
–1 TByte memory each
–Total: 10,240 cpus
3 interconnects
–SGI NUMAlink (shared memory in node)
–Infiniband (across nodes)
–10Gig Ethernet (file I/O)
Subsystems
–8 nodes: double-density Altix 3700BX2
–4 nodes: NUMAlink4 interconnect between nodes, BX2 nodes, 1.6GHz cpus

26 NASA Columbia Subsystems
4-node subsystem: Altix 3700 BX2
–2048 cpus, 1.6GHz, 4 TBytes RAM
–NUMAlink4 between all cpus (6.4 GBytes/s): MPI, SHMEM, MLP across all cpus; OpenMP (OMP) within a node
–Infiniband using MPI across all cpus, limited to 1524 MPI processes
8-node subsystem: Altix 3700BX2
–4096 cpus, 1.6GHz / 1.5GHz, 8 TBytes RAM
–Infiniband using MPI required for cross-node communication
–Hardware limitation: 2048 MPI connections; requires hybrid MPI/OMP to access all cpus (2 OMP threads running under each MPI process)
Overall a well balanced machine: 1 GByte/sec I/O on each node

27 Cart3D Solver Performance
Test problem: full Space Shuttle Launch Vehicle
–Mach 2.6, AoA = 2.09 deg
–Mesh: 25M cells

28 Cart3D Solver Performance
Test problem: full Space Shuttle Launch Vehicle
–Mach 2.6, AoA = 2.09 deg
–Mesh: 25M cells
(Figure: pressure contours)

29 Cart3D Solver Performance: Comparing OpenMP and MPI
–Pure OpenMP restricted to a single 512-cpu node
–Ran from 32-504 cpus (node c18); perfect speed-up assumed on 32 cpus
–OpenMP shows a slight break at 128 cpus due to a change in the global addressing scheme; MPI unaffected (no global addressing)
–MPI achieves ~0.75 TFLOP/s on 496 cpus

30 Cart3D Solver Performance: Single Mesh vs. Multigrid
–Used NUMAlink on c17-c20; MPI only, from 32-2016 cpus
–Reducing multigrid de-emphasizes communication
–Single-grid scalability: ~1900 on 2016 cpus
–Coarsest mesh in the 4-level multigrid has ~16 cells/partition
–4-level multigrid shows parallel speedups of 1585 on 2016 cpus

31 Cart3D Solver Performance: Comparing NUMAlink and Infiniband
MPI only, from 32-2016 cpus, 4-level multigrid
–32-496 cpus run on 1 node (no interconnect)
–508-1000 cpus run on 2 nodes
–1000-2016 cpus run on 4 nodes
IB lags due to a decrease in delivered bandwidth; delivered bandwidth drops again when going from 2 to 4 nodes
NUMAlink on 2016 cpus achieves over 2.4 TFLOP/s

32 NSU3D Test Case
Wing-body configuration, 72 million grid points
Transonic flow: Mach = 0.75, incidence = 0 degrees, Reynolds number = 3,000,000

33 NSU3D Scalability
72M-point grid; perfect speedup assumed on 128 cpus
–Good scalability up to 2008 cpus using NUMAlink (superlinear!)
–Multigrid slowdown due to coarse grid communication
–~3 TFlops on 2008 cpus

34 NSU3D Scalability
Best convergence with the 6-level multigrid scheme
Importance of the fastest overall solution strategy
–5-level multigrid
–10 minutes wall clock time for a steady-state solution on the 72M-point grid

35 NUMAlink vs. Infiniband (IB)
–IB required for > 2048 cpus
–Hybrid MPI/OMP required due to MPI limitation under IB
–Slight drop-off using IB
–Additional penalty with increasing OMP threads: (locally) sequential nature during MPI-to-MPI communication
–128-cpu run split across 4 hosts

36 NUMAlink vs. Infiniband (IB) 2 OMP required for IB on 2048 Excellent scalability for single grid solver (non multigrid) Single Grid (no multigrid)

37 NUMAlink vs. Infiniband (IB) 2 OMP required for IB on 2048 Dramatic drop-off for 6 level multigrid 6 level multigrid

38 NUMAlink vs. Infiniband(IB) 2 OMP required for IB on 2048 Dramatic drop-off for 5 level multigrid 5 level multigrid

39 NUMAlink vs. Infiniband(IB) 2 OMP required for IB on 2048 Dramatic drop-off for 4 level multigrid 4 level multigrid

40 NUMAlink vs. Infiniband(IB) 2 OMP required for IB on 2048 Dramatic drop-off for 3 level multigrid 3 level multigrid

41 NUMAlink vs. Infiniband(IB) 2 OMP required for IB on 2048 Dramatic drop-off for 2 level multigrid 2 level multigrid

42 NUMAlink vs. Infiniband (IB)
Similar slowdowns with NUMAlink and Infiniband
2nd coarse grid level alone (< 1M pts)

43 NUMAlink vs. Infiniband (IB)
Inconsistent NUMAlink/Infiniband performance
Multigrid IB drop-off not due to coarse grid level communication; due to inter-grid communication
–Not bandwidth related
–More non-local communication pattern
–Sensitive to system ENV variable settings
Addressable through:
–More local fine-to-coarse partitioning
–Multi-level communication strategy

44 Single Grid Performance up to 4016 cpus
–1 OMP possible for IB on 2008 cpus (8 hosts)
–2 OMP required for IB on 4016 cpus (8 hosts)
–Good scalability up to 4016 cpus; 5.2 TFlops at 4016 cpus
–First real-world application on Columbia using > 2048 cpus

45 Inhomogeneous CPU Set 1.5GHz (6Mbyte L3) cpus vs 1.6GHz (9Mbyte L3) cpus –Responsible for part of slowdown

46 Concluding Remarks
NASA's Columbia Supercomputer is enabling advances in the state of the art for aerospace computing applications
–~100M grid point solutions (turbulent flow) in 15 minutes
–Design optimization overnight (20-100 analyses)
–Rapid flight envelope data-base generation
–Time dependent maneuvering solutions: digital flight

47 Conclusions
Much higher resolution analyses possible
–72M pts on 4016 cpus → only ~18,000 points per cpu
–10⁹ grid points feasible on 2000+ cpus; should become routine in the future; approximately 4-hour turnaround on Columbia
Other bottlenecks must be addressed
–I/O files: 35 GBytes for 72M pts, ~400 GBytes for 10⁹ pts
–Other sequential pre/post-processing (e.g. grid generation)
–Requires rethinking the entire process: on-demand parallel grid refinement and load balancing obviates large input files, but requires a link to the CAD database from the parallel machine

48 Conclusions
Columbia Supercomputer architecture validated on real-world applications (2 production codes)
–NUMAlink provides the best performance but is currently limited to 2048 cpus
–Infiniband practical for very large applications: requires hybrid MPI/OMP communication; issues remain concerning communication patterns; bandwidth adequate for current and larger applications
–Large applications on 8,000 to 10,240 cpus using IB and OMP=4 should be feasible and achieve ~12 TFlops
Special thanks to: Bob Ciotti (NASA), Bron Nelson (SGI)

