
1 Michael L. Norman, UC San Diego and SDSC mlnorman@ucsd.edu

2
- A parallel AMR application for astrophysics and cosmology simulations
- Hybrid physics: fluid + particle + gravity + radiation
- Block-structured AMR
- MPI or hybrid parallelism
- Under continuous development since 1994
  - Greg Bryan and Mike Norman @ NCSA
  - Shared memory -> distributed memory -> hierarchical memory
- C++/C/Fortran, >185,000 LOC
- Community code in widespread use worldwide
  - Hundreds of users, dozens of developers
- Version 2.0 @ http://enzo.googlecode.com

3

4 ASTROPHYSICAL FLUID DYNAMICS: supersonic turbulence | HYDRODYNAMIC COSMOLOGY: large-scale structure

5 Physics | Equations | Math type | Algorithm(s) | Communication
  Dark matter | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
  Gravity | Poisson | Elliptic | FFT, multigrid | Global
  Gas dynamics | Euler | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
  Magnetic fields | Ideal MHD | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
  Radiation transport | Flux-limited radiation diffusion | Nonlinear parabolic | Implicit finite difference, multigrid solves | Global
  Multispecies chemistry | Kinetic equations | Coupled stiff ODEs | Explicit BE, implicit | None
  Inertial, tracer, source, and sink particles | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
Physics modules can be used in any combination in 1D, 2D and 3D, making ENZO a very powerful and versatile code.

6
- Berger-Colella structured AMR
- Cartesian base grid and subgrids
- Hierarchical timestepping (see the code sketch after slide 7)

7 AMR = collection of grids (patches) at Levels 0, 1, 2, ...; each grid is a C++ object
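The following is a minimal, illustrative C++ sketch of slides 6-7 (not ENZO's actual classes): each grid patch is an object holding pointers to its subgrids, and hierarchical timestepping advances one level and then recursively takes two half-size substeps on the next finer level, assuming a refinement factor of 2.

```cpp
// Minimal sketch (not ENZO's actual classes): a patch-based AMR hierarchy
// with Berger-Colella hierarchical timestepping, where each refinement
// level advances with half the timestep of its parent and takes two
// substeps per parent step (assuming a refinement factor of 2).
#include <cstdio>
#include <vector>

struct Grid {                    // one rectangular patch at one level
  int level;                     // refinement level (0 = root)
  std::vector<Grid*> subgrids;   // finer patches nested inside this one
  void Solve(double dt) {        // stand-in for the hydro/gravity update
    std::printf("advance level %d patch by dt = %g\n", level, dt);
  }
};

// Advance every grid on 'level' by dt, then recursively advance the finer
// levels with two substeps of dt/2 each, so all levels reach the same time.
void EvolveLevel(const std::vector<Grid*>& grids, int level, double dt) {
  for (Grid* g : grids) g->Solve(dt);
  std::vector<Grid*> finer;
  for (Grid* g : grids)
    finer.insert(finer.end(), g->subgrids.begin(), g->subgrids.end());
  if (!finer.empty()) {
    EvolveLevel(finer, level + 1, dt / 2);   // first substep
    EvolveLevel(finer, level + 1, dt / 2);   // second substep
  }
}

int main() {
  Grid root{0, {}}, child{1, {}}, grandchild{2, {}};
  child.subgrids.push_back(&grandchild);
  root.subgrids.push_back(&child);
  EvolveLevel({&root}, 0, 1.0);   // one root-level timestep
}
```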

8 Unigrid = collection of Level 0 grid patches

9
- Shared memory (PowerC) parallel (1994-1998)
  - SMP and DSM architectures (SGI Origin 2000, Altix)
  - Parallel DO across grids at a given refinement level, including the block-decomposed base grid
  - O(10,000) grids
- Distributed memory (MPI) parallel (1998-2008)
  - MPP and SMP cluster architectures (e.g., IBM PowerN)
  - Level 0 grid partitioned across processors
  - Level >0 grids within a processor executed sequentially
  - Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing; sketch below)
  - O(100,000) grids
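A minimal sketch of the greedy load balancing idea from slide 9 (illustrative only; the grid work values and processor count below are made up): grids are considered largest-first and each is assigned to the currently least-loaded processor, which is where an overloaded processor would message its grids.

```cpp
// Minimal sketch of greedy load balancing (illustrative, not ENZO's code):
// grids are sorted by estimated work and each is assigned to the currently
// least-loaded processor; an overloaded processor would then ship its
// grids to the chosen owner via MPI messages.
#include <algorithm>
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

int main() {
  std::vector<double> grid_work = {9, 7, 6, 5, 4, 3, 2, 1};  // e.g. cell counts
  const int nproc = 3;

  // Largest grids first, so small grids fill in the remaining imbalance.
  std::sort(grid_work.begin(), grid_work.end(), std::greater<double>());

  // Min-heap of (accumulated load, processor rank).
  using Load = std::pair<double, int>;
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> procs;
  for (int p = 0; p < nproc; ++p) procs.push({0.0, p});

  std::vector<std::vector<double>> assigned(nproc);
  for (double w : grid_work) {
    auto [load, p] = procs.top();  // least-loaded processor so far
    procs.pop();
    assigned[p].push_back(w);
    procs.push({load + w, p});
  }
  for (int p = 0; p < nproc; ++p) {
    double total = 0;
    for (double w : assigned[p]) total += w;
    std::printf("processor %d: %zu grids, load %g\n", p, assigned[p].size(), total);
  }
}
```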

10

11 Projection of refinement levels: 160,000 grid patches at 4 refinement levels

12 1 MPI task per processor. Task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels.

13
- Hierarchical memory (MPI+OpenMP) parallel (2008-)
  - SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
  - Level 0 grid partitioned across shared-memory nodes/multicore processors
  - Parallel DO across grids at a given refinement level within a node
  - Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
  - O(1,000,000) grids

14 N MPI tasks per SMP node, M OpenMP threads per task. Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels; each grid is handled by an OpenMP thread (sketched below).
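A hedged sketch of the hybrid scheme on slides 13-14, assuming a standard MPI + OpenMP setup rather than quoting ENZO source: each MPI task walks its refinement levels in order, and within a level an OpenMP parallel loop spreads the task's grids over its threads.

```cpp
// Minimal sketch (assumptions, not ENZO source) of the hybrid scheme on
// slide 14: each MPI task owns a level-0 tile plus its subgrids, and an
// OpenMP "parallel for" spreads the grids of one refinement level across
// the threads of that task; levels are still processed in order.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

struct Grid { int level; /* field data would live here */ };

void SolveGrid(Grid& g) { /* hydro/gravity update for one patch */ }

void EvolveLevel(std::vector<Grid>& grids_on_level) {
  // Grids at the same level are independent, so threads may process
  // them concurrently; dynamic scheduling absorbs unequal grid sizes.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < (int)grids_on_level.size(); ++i)
    SolveGrid(grids_on_level[i]);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                 // N MPI tasks per SMP node
  int rank;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<std::vector<Grid>> levels(3);   // toy hierarchy for this task
  levels[0].resize(1);  levels[1].resize(8);  levels[2].resize(32);

  for (size_t l = 0; l < levels.size(); ++l)  // sequential across levels
    EvolveLevel(levels[l]);                   // concurrent within a level

  std::printf("task %d used up to %d OpenMP threads\n", rank, omp_get_max_threads());
  MPI_Finalize();
}
```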

15 ENZO ON CRAY XT5: 1% OF THE 6400^3 SIMULATION
- Non-AMR 6400^3, 80 Mpc box
- 15,625 (25^3) MPI tasks, 256^3 root grid tiles
- 6 OpenMP threads per task, 93,750 cores
- 30 TB per checkpoint/restart/data dump
- >15 GB/sec read, >7 GB/sec write
- Benefit of threading: reduce MPI overhead and improve disk I/O

16 ENZO ON CRAY XT5: 10^5 SPATIAL DYNAMIC RANGE
- AMR 1024^3, 50 Mpc box, 7 levels of refinement
- 4096 (16^3) MPI tasks, 64^3 root grid tiles
- 1 to 6 OpenMP threads per task = 4096 to 24,576 cores
- Benefit of threading: thread count increases with memory growth, reducing replication of grid hierarchy data

17 Using MPI+threads to access more RAM as the AMR calculation grows in size

18 ENZO-RHD ON CRAY XT5: COSMIC REIONIZATION
- Including radiation transport is 10x more expensive
- LLNL Hypre multigrid solver dominates run time
  - Near-ideal scaling to at least 32K MPI tasks
- Non-AMR 1024^3, 8 and 16 Mpc boxes
- 4096 (16^3) MPI tasks, 64^3 root grid tiles

19
- Cosmic reionization is a weak-scaling problem: large volumes at a fixed resolution to span the range of scales
- Non-AMR 4096^3 with ENZO-RHD
- Hybrid MPI and OpenMP
  - SMT and SIMD tuning
  - 128^3 to 256^3 root grid tiles
  - 4-8 OpenMP threads per task
- 4-8 TB per checkpoint/restart/data dump (HDF5)
- In-core intermediate checkpoints (?)
- 64-bit arithmetic, 64-bit integers and pointers
- Aiming for 64-128 K cores (see the arithmetic sketch below)
- 20-40 M hours (?)
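A quick arithmetic sketch of how the planned tile sizes and thread counts map to MPI tasks and cores (the specific pairings below are assumptions used to span the quoted ranges): with 128^3 tiles, the 4096^3 root grid gives 32^3 = 32,768 tasks, which at 4 threads per task is about 128 K cores.

```cpp
// Sketch of the task/core arithmetic for the planned 4096^3 run (slide 19):
// the root-grid tile size fixes the number of MPI tasks, and the OpenMP
// thread count per task then sets the total core count.
#include <cstdio>

int main() {
  const long root = 4096;                 // non-AMR 4096^3 root grid
  const long tiles[]   = {256, 128};      // root grid tile edge (cells)
  const long threads[] = {4, 8};          // OpenMP threads per task
  for (long tile : tiles)
    for (long t : threads) {
      long tasks = (root / tile) * (root / tile) * (root / tile);
      std::printf("%ld^3 tiles -> %6ld MPI tasks x %ld threads = %7ld cores\n",
                  tile, tasks, t, tasks * t);
    }
}
```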

20
- ENZO's AMR infrastructure limits scalability to O(10^4) cores
- We are developing a new, extremely scalable AMR infrastructure called Cello: http://lca.ucsd.edu/projects/cello
- ENZO-P will be implemented on top of Cello to scale to ...

21

22
- Hierarchical parallelism and load balancing to improve localization
- Relax global synchronization to a minimum
- Flexible mapping between data structures and concurrency
- Object-oriented design
- Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)

23
1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in terms of both size and depth (see the indexing sketch after this list)
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamic task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Address numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detect and handle hardware or software faults during run-time to improve software resilience and enable software self-management
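As an illustration of item 1 (not Cello's actual data structures), an octree-based AMR hierarchy can address a block by its level and integer coordinates at that level, so parent and child blocks are found with simple shifts; this is part of what makes a distributed octree cheap to navigate at scale.

```cpp
// Illustrative sketch (not Cello's actual classes) of octree-based AMR
// indexing: a block is addressed by its refinement level and integer
// coordinates within that level, so parent/child relations reduce to
// bit shifts on the coordinates.
#include <array>
#include <cstdio>

struct BlockIndex {
  int level;
  std::array<int, 3> i;   // block coordinates at this level

  BlockIndex parent() const {              // coarser block containing this one
    return {level - 1, {i[0] >> 1, i[1] >> 1, i[2] >> 1}};
  }
  BlockIndex child(int cx, int cy, int cz) const {   // one of 8 children
    return {level + 1, {2 * i[0] + cx, 2 * i[1] + cy, 2 * i[2] + cz}};
  }
};

int main() {
  BlockIndex b{3, {5, 2, 7}};
  BlockIndex p = b.parent();
  BlockIndex c = b.child(1, 0, 1);
  std::printf("block L%d (%d,%d,%d): parent L%d (%d,%d,%d), child L%d (%d,%d,%d)\n",
              b.level, b.i[0], b.i[1], b.i[2],
              p.level, p.i[0], p.i[1], p.i[2],
              c.level, c.i[0], c.i[1], c.i[2]);
}
```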

24

25

26

27 http://lca.ucsd.edu/projects/cello

28

29
- Enzo website (code, documentation): http://lca.ucsd.edu/projects/enzo
- 2010 Enzo User Workshop slides: http://lca.ucsd.edu/workshops/enzo2010
- yt website (analysis and visualization): http://yt.enzotools.org
- Jacques website (analysis and visualization): http://jacques.enzotools.org/doc/Jacques/Jacques.html

30

31 Example grid hierarchy, with grids labeled (level, index): Level 0 grid (0,0); Level 1 grid (1,0); Level 2 grids (2,0) and (2,1)

32 Tree diagram of grids labeled (level, sibling): scaling the AMR grid hierarchy in depth (level) and breadth (# siblings)

33 Level | Grids   | Memory (MB) | Work = Mem*(2^level)
   0     | 512     | 179,029     | 179,029
   1     | 223,275 | 114,629     | 229,258
   2     | 51,522  | 21,226      | 84,904
   3     | 17,448  | 6,085       | 48,680
   4     | 7,216   | 1,975       | 31,600
   5     | 3,370   | 1,006       | 32,192
   6     | 1,674   | 599         | 38,336
   7     | 794     | 311         | 39,808
   Total | 305,881 | 324,860     | 683,807
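The work column above follows from hierarchical timestepping with a refinement factor of 2: a level-L grid takes 2^L timesteps per root-level step, so its work is proportional to its memory (a proxy for cell count) times 2^L. A short check of the table's totals:

```cpp
// Sketch of the "Work = Mem * 2^level" column from slide 33: with a
// refinement factor of 2, a level-L grid takes 2^L timesteps for every
// root-level step. The per-level memory figures are the slide's numbers.
#include <cstdio>

int main() {
  const int    level[]  = {0, 1, 2, 3, 4, 5, 6, 7};
  const double mem_mb[] = {179029, 114629, 21226, 6085, 1975, 1006, 599, 311};

  double total_mem = 0, total_work = 0;
  for (int i = 0; i < 8; ++i) {
    double work = mem_mb[i] * (1 << level[i]);   // Mem * 2^level
    std::printf("level %d: mem %8.0f MB  work %8.0f\n", level[i], mem_mb[i], work);
    total_mem  += mem_mb[i];
    total_work += work;
  }
  // Reproduces the slide's totals: 324,860 MB and 683,807 work units.
  std::printf("total: mem %.0f MB  work %.0f\n", total_mem, total_work);
}
```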

34 Current MPI implementation: a real grid object holds grid metadata + physics data; a virtual grid object holds grid metadata only

35
- The flat MPI implementation is not scalable because grid hierarchy metadata is replicated in every processor
- For very large grid counts, this dominates the memory requirement (not the physics data!); see the estimate sketched below
- The hybrid parallel implementation helps a lot!
  - Hierarchy metadata is now only replicated in every SMP node instead of every processor
- We would prefer fewer SMP nodes (8192-4096) with bigger core counts (32-64) (= 262,144 cores)
- The communication burden is partially shifted from MPI to intra-node memory accesses
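A back-of-the-envelope sketch of the first point; the per-grid metadata size is an assumed value for illustration only, while the grid, core, and node counts are taken from slides 13 and 35.

```cpp
// Back-of-the-envelope sketch of the point on slide 35 (the byte count per
// grid-metadata object is an assumption for illustration, not a measured
// ENZO value): under flat MPI every one of P processes holds metadata for
// all N grids, while under MPI+OpenMP only one copy per SMP node is needed.
#include <cstdio>

int main() {
  const double n_grids        = 1.0e6;    // O(10^6) grids (slide 13)
  const double bytes_per_grid = 1.0e3;    // assumed ~1 KB of metadata per grid
  const double n_processes    = 262144;   // flat MPI: one rank per core
  const double n_nodes        = 8192;     // hybrid: one rank per 32-core node

  double flat_tb   = n_processes * n_grids * bytes_per_grid / 1e12;
  double hybrid_tb = n_nodes     * n_grids * bytes_per_grid / 1e12;

  std::printf("replicated hierarchy metadata, flat MPI : %8.1f TB\n", flat_tb);
  std::printf("replicated hierarchy metadata, hybrid   : %8.1f TB\n", hybrid_tb);
}
```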

36
- Targeted at fluid, particle, or hybrid (fluid + particle) simulations on millions of cores
- Generic AMR scaling issues:
  - Small AMR patches restrict available parallelism
  - Dynamic load balancing
  - Maintaining data locality for deep hierarchies
  - Re-meshing efficiency and scalability
  - Inherently global multilevel elliptic solves
  - Increased range and precision requirements for deep hierarchies

