
1 HFODD for Leadership Class Computers N. Schunck, J. McDonnell, Hai Ah Nam

2 HFODD

3 DFT AND HIGH-PERFORMANCE COMPUTING (HFODD for Leadership Class Computers)

4 Classes of DFT solvers
Configuration space: expansion of the solutions on a basis (HO)
– Fast and amenable to beyond-mean-field extensions
– Truncation effects: source of divergences/renormalization issues
– Wrong asymptotics unless different bases are used (WS, PTG, Gamow, etc.)
Coordinate space: direct integration of the HFB equations
– Accurate: provides the "exact" result
– Slow and CPU/memory intensive for 2D-3D geometries
Resources needed for a "standard HFB" calculation:
             1D              2D              3D
  R-space    1 min, 1 core   5 h, 70 cores   -
  HO basis   -               2 min, 1 core   5 h, 1 core

5 Why High Performance Computing?
The ground state of an even nucleus can be computed in a matter of minutes on a standard laptop: why bother with supercomputing?
Core of DFT: a global theory which averages out individual degrees of freedom
– From light nuclei to neutron stars: rich physics, fast and reliable
– Open questions: treatment of correlations? ~100 keV level precision? Extrapolability?
Large-scale DFT:
– Static: fission, shape coexistence, etc. – compute > 100k different configurations
– Dynamics: restoration of broken symmetries, correlations, time-dependent problems – combine > 100k configurations
– Optimization of extended functionals on larger sets of experimental data

6 Computational Challenges for DFT
Self-consistency = iterative process:
– Not naturally prone to parallelization (suggests: lots of thinking…)
– Computational cost: (number of iterations) × (cost of one iteration) + O(everything else)
Cost of symmetry breaking (triaxiality, reflection asymmetry, time-reversal invariance):
– Large dense matrices (LAPACK) constructed and diagonalized many times, with sizes of the order of (2,000 × 2,000) to (10,000 × 10,000) (suggests: message passing)
– Many long loops (suggests: threading)
Finite-range forces/non-local functionals (exact Coulomb, Yukawa-like, Gogny-like):
– Many nested loops (suggests: threading)
– Precision issues
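As an illustration of the dense-matrix step that dominates one iteration, here is a minimal sketch using LAPACK's divide-and-conquer routine DSYEVD; the matrix name h, its size, and the random fill are illustrative stand-ins, not HFODD code:

    ! Sketch: diagonalize one dense symmetric matrix with LAPACK DSYEVD.
    ! h(n,n) stands in for the mean-field Hamiltonian rebuilt at each iteration.
    program diag_sketch
      implicit none
      integer, parameter :: n = 2000
      double precision, allocatable :: h(:,:), eval(:), work(:)
      integer, allocatable :: iwork(:)
      integer :: lwork, liwork, info

      allocate(h(n,n), eval(n))
      call random_number(h)
      h = 0.5d0 * (h + transpose(h))      ! symmetrize the test matrix

      ! Workspace query, then the actual divide-and-conquer diagonalization
      allocate(work(1), iwork(1))
      call dsyevd('V', 'U', n, h, n, eval, work, -1, iwork, -1, info)
      lwork = int(work(1)); liwork = iwork(1)
      deallocate(work, iwork); allocate(work(lwork), iwork(liwork))
      call dsyevd('V', 'U', n, h, n, eval, work, lwork, iwork, liwork, info)
      ! On exit h holds the eigenvectors and eval the eigenvalues.
    end program diag_sketch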

7 HFODD
Solves the HFB equations in the deformed, Cartesian HO basis
Breaks all symmetries (if needed)
Zero-range and finite-range forces coded
Additional features: cranking, angular momentum projection, etc.
Technicalities:
– Fortran 77, Fortran 90
– BLAS, LAPACK
– I/O with standard input/output + a few files
Redde Caesari quae sunt Caesaris ("Render unto Caesar what is Caesar's")

8 OPTIMIZATIONS (HFODD for Leadership Class Computers)

9 Loop reordering
Fortran: matrices are stored in memory column-wise → elements must be accessed first by column index, then by row index (good stride)
The cost of bad stride grows quickly with the number of indexes and the dimensions
Ex.: accessing M(i,j,k)
    Bad stride:              Good stride:
    do i = 1, N              do k = 1, N
      do j = 1, N              do j = 1, N
        do k = 1, N              do i = 1, N
Figure: time of 10 HF iterations as a function of the model space (Skyrme SLy4, 208Pb, HF, exact Coulomb exchange)
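A self-contained sketch of the same comparison, timed with cpu_time; the array m and its size are illustrative:

    ! Sketch: compare bad vs good loop order on a rank-3 array.
    ! Fortran stores arrays column-major, so the leftmost index should
    ! vary fastest, i.e. sit in the innermost loop.
    program stride_sketch
      implicit none
      integer, parameter :: n = 200
      double precision, allocatable :: m(:,:,:)
      double precision :: s, t0, t1, t2
      integer :: i, j, k

      allocate(m(n,n,n))
      call random_number(m)

      s = 0.0d0
      call cpu_time(t0)
      do i = 1, n                    ! bad stride: leftmost index outermost
        do j = 1, n
          do k = 1, n
            s = s + m(i,j,k)
          end do
        end do
      end do
      call cpu_time(t1)
      do k = 1, n                    ! good stride: leftmost index innermost
        do j = 1, n
          do i = 1, n
            s = s + m(i,j,k)
          end do
        end do
      end do
      call cpu_time(t2)
      print *, 'sum =', s, '  bad:', t1 - t0, 's   good:', t2 - t1, 's'
    end program stride_sketch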

10 Threading (OpenMP)
OpenMP is designed to automatically parallelize loops
Ex.: calculation of the density matrix in the HO basis
    do j = 1, N
      do i = 1, N
        do mu = 1, N
Solutions:
– Thread it with OpenMP
– When possible, replace all such manual linear algebra with BLAS/LAPACK calls (threaded versions exist)
Figure: time of 10 HFB iterations as a function of the number of threads (Jaguar Cray XT5 – Skyrme SLy4, 152Dy, HFB, 14 full shells)
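A minimal sketch of both solutions, assuming the density matrix is built as rho(i,j) = sum over mu of v(i,mu)*v(j,mu); the array names v and rho are illustrative, and the commented DGEMM line performs the same contraction with a single, threaded BLAS call:

    ! Sketch: OpenMP-threaded density-matrix loop (illustrative names).
    program density_omp
      implicit none
      integer, parameter :: n = 500
      double precision, allocatable :: v(:,:), rho(:,:)
      double precision :: acc
      integer :: i, j, mu

      allocate(v(n,n), rho(n,n))
      call random_number(v)

      !$omp parallel do private(i, mu, acc) schedule(static)
      do j = 1, n
        do i = 1, n
          acc = 0.0d0
          do mu = 1, n
            acc = acc + v(i,mu) * v(j,mu)   ! rho_ij = sum_mu v_imu * v_jmu
          end do
          rho(i,j) = acc
        end do
      end do
      !$omp end parallel do

      ! Equivalent single BLAS call (threaded in vendor libraries):
      ! call dgemm('N', 'T', n, n, n, 1.0d0, v, n, v, n, 0.0d0, rho, n)

      print *, 'trace of rho =', sum([(rho(i,i), i = 1, n)])
    end program density_omp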

11 Parallel Performance (MPI)
DFT = naturally parallel: 1 core = 1 configuration (only if "all" fits into a core)
HFODD characteristics:
– Very little communication overhead
– Lots of I/O per processor (specific to that processor): 3 ASCII files/core
Scalability limited by:
– File system performance
– Usability of the results (handling of thousands of files)
ADIOS library being implemented
Figure: time of 10 HFB iterations as a function of the number of cores (Jaguar Cray XT5, no threads – Skyrme SLy4, 152Dy, HFB, 14 full shells)
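A minimal sketch of the "1 core = 1 configuration" task distribution with MPI, assuming n_conf independent configurations handed out round-robin; run_hfb is a hypothetical driver, not an HFODD routine:

    ! Sketch: embarrassingly parallel task distribution (illustrative names).
    program task_farm
      use mpi
      implicit none
      integer, parameter :: n_conf = 100000    ! e.g. points of a deformation grid
      integer :: ierr, rank, nprocs, iconf

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)

      ! Round-robin assignment: essentially no inter-rank communication;
      ! each configuration writes its own output files.
      do iconf = rank + 1, n_conf, nprocs
        ! call run_hfb(iconf)                  ! hypothetical per-configuration solver
      end do

      call mpi_finalize(ierr)
    end program task_farm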

12 ScaLAPACK
Multi-threading: more memory/core available
How about the scalability of the diagonalization for large model spaces?
ScaLAPACK successfully implemented for simplex-breaking HFB calculations (J. McDonnell)
Current issues:
– Needs detailed profiling, as no speed-up is observed: where is the bottleneck?
– Is the problem size adequate?
    N_shell    14       18       22       26
    N          680      1330     2300     3654
    4N         2720     5320     9200     14616
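A minimal sketch of what a distributed diagonalization involves (BLACS grid, block-cyclic descriptors, then the divide-and-conquer driver PDSYEVD), assuming a real symmetric matrix of global size 4N and a 4 × 4 process grid; grid shape, block size, and array names are illustrative:

    ! Sketch: ScaLAPACK diagonalization of a distributed symmetric matrix.
    ! Assumes it is run on exactly nprow*npcol MPI processes.
    program scalapack_sketch
      implicit none
      integer, parameter :: m = 14616, nb = 64       ! global size ~ 4N, block size
      integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
      integer :: locr, locc, desca(9), descz(9), info, lwork, liwork
      double precision, allocatable :: a(:,:), z(:,:), w(:), work(:)
      integer, allocatable :: iwork(:)
      integer, external :: numroc

      call blacs_pinfo(iam, nprocs)                  ! also initializes MPI if needed
      nprow = 4; npcol = 4                           ! example 4x4 process grid
      call blacs_get(-1, 0, ictxt)
      call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
      call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

      locr = numroc(m, nb, myrow, 0, nprow)          ! local rows on this process
      locc = numroc(m, nb, mycol, 0, npcol)          ! local columns on this process
      call descinit(desca, m, m, nb, nb, 0, 0, ictxt, max(1, locr), info)
      call descinit(descz, m, m, nb, nb, 0, 0, ictxt, max(1, locr), info)
      allocate(a(locr, locc), z(locr, locc), w(m))
      a = 0.0d0                                      ! ... fill local blocks of the HFB matrix here ...

      allocate(work(1), iwork(1))                    ! workspace query
      call pdsyevd('V', 'U', m, a, 1, 1, desca, w, z, 1, 1, descz, &
                   work, -1, iwork, -1, info)
      lwork = int(work(1)); liwork = iwork(1)
      deallocate(work, iwork); allocate(work(lwork), iwork(liwork))
      call pdsyevd('V', 'U', m, a, 1, 1, desca, w, z, 1, 1, descz, &
                   work, lwork, iwork, liwork, info)
      ! w now holds the eigenvalues, z the distributed eigenvectors.

      call blacs_gridexit(ictxt)
      call blacs_exit(0)
    end program scalapack_sketch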

13 Hybrid MPI/OpenMP Parallel Model
Three levels:
– Task management (MPI): MPI for task management across HFB calculations
– ScaLAPACK (MPI): optional MPI sub-communicator for very large bases needing ScaLAPACK; spread one HFB calculation across a few cores (<12-24)
– Threading (OpenMP): threads for loop optimization
Figure: schematic layout over cores and time, with each HFB calculation (HFB-i/N, HFB-(i+1)/N) assigned to its own group of cores
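A minimal sketch of how such a layout can be set up, assuming MPI_COMM_WORLD is split into groups of group_size ranks, each group handling one HFB calculation; group_size and run_hfb_on are illustrative, not HFODD names:

    ! Sketch: split MPI_COMM_WORLD into small groups; ScaLAPACK lives on the
    ! group communicator, OpenMP threads parallelize loops within each rank.
    program hybrid_split
      use mpi
      implicit none
      integer, parameter :: group_size = 16     ! cores per HFB calculation (example)
      integer :: ierr, world_rank, world_size, group, local_comm, local_rank

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, world_rank, ierr)
      call mpi_comm_size(mpi_comm_world, world_size, ierr)

      group = world_rank / group_size           ! which HFB calculation this rank serves
      call mpi_comm_split(mpi_comm_world, group, world_rank, local_comm, ierr)
      call mpi_comm_rank(local_comm, local_rank, ierr)

      ! The BLACS/ScaLAPACK grid would be built on local_comm, while the group
      ! index is used for task management over the whole machine.
      ! call run_hfb_on(local_comm, group)      ! hypothetical driver

      call mpi_comm_free(local_comm, ierr)
      call mpi_finalize(ierr)
    end program hybrid_split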

14 Conclusions
DFT codes are naturally parallel and can easily scale to 1M processors or more
High-precision applications of DFT are time- and memory-consuming computations → need for fine-grain parallelization
HFODD benefits from HPC techniques and code examination:
– Loop reordering gives a speed-up of N ≫ 1 (Coulomb exchange: N ~ 3; Gogny force: N ~ 8)
– Multi-threading gives an extra factor > 2 (only a few routines have been upgraded)
– ScaLAPACK implemented: very large bases (N_shell > 25) can now be used (e.g., near scission)
Scaling is only average on the standard Jaguar file system because of unoptimized I/O

15 Year 4-5 Roadmap
Year 4:
– More OpenMP, debugging of the ScaLAPACK routine
– First tests of the ADIOS library (at scale)
– First development of a prototype Python visualization interface
– Tests of large-scale, I/O-bound, multi-constrained calculations
Year 5:
– Full implementation of ADIOS
– Set up a framework for automatic restart (at scale)
SVN repository (ask Mario for an account): http://www.massexplorer.org/svn/HFODDSVN/trunk

