HFODD for Leadership Class Computers N. Schunck, J. McDonnell, Hai Ah Nam

HFODD

DFT AND HPC COMPUTING HFODD for Leadership Class Computers

Classes of DFT solvers

Configuration space: expansion of the solutions on a basis (HO)
– Fast and amenable to beyond mean-field extensions
– Truncation effects: source of divergences/renormalization issues
– Wrong asymptotic behavior unless different bases are used (WS, PTG, Gamow, etc.)

Coordinate space: direct integration of the HFB equations
– Accurate: provides the “exact” result
– Slow and CPU/memory intensive for 2D-3D geometries

Resources needed for a “standard HFB” calculation:

             1D              2D              3D
R-space      1 min, 1 core   5 h, 70 cores   -
HO basis     -               2 min, 1 core   5 h, 1 core

Why High Performance Computing?

The ground state of an even nucleus can be computed in a matter of minutes on a standard laptop: why bother with supercomputing?

Core of DFT: a global theory which averages out individual degrees of freedom – rich physics from light nuclei to neutron stars, fast and reliable. Open questions remain: treatment of correlations? ~100 keV level precision? Extrapolability?

Large-scale DFT:
– Static: fission, shape coexistence, etc. – compute > 100k different configurations
– Dynamics: restoration of broken symmetries, correlations, time-dependent problems – combine > 100k configurations
– Optimization of extended functionals on larger sets of experimental data

Computational Challenges for DFT

Self-consistency = iterative process:
– Not naturally prone to parallelization (suggests: lots of thinking…)
– Computational cost: (number of iterations) × (cost of one iteration) ≫ O(everything else)

Cost of symmetry breaking: triaxiality, reflection asymmetry, time-reversal invariance
– Large dense matrices (LAPACK) constructed and diagonalized many times – sizes of the order of (2,000 x 2,000) to (10,000 x 10,000) (suggests: message passing)
– Many long loops (suggests: threading)

Finite-range forces/non-local functionals: exact Coulomb, Yukawa-, Gogny-like
– Many nested loops (suggests: threading)
– Precision issues
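
To make the dense diagonalization step above concrete, here is a minimal Fortran sketch (not taken from the HFODD sources) of diagonalizing one symmetric matrix of the quoted size with LAPACK's DSYEVD, as done at every self-consistent iteration; a random symmetric test matrix stands in for the actual HFB matrix.

  program diag_sketch
    implicit none
    integer, parameter :: n = 2000
    double precision, allocatable :: h(:,:), w(:), work(:)
    integer, allocatable :: iwork(:)
    integer :: lwork, liwork, info

    ! Random symmetric test matrix standing in for the dense HFB matrix
    allocate(h(n,n), w(n))
    call random_number(h)
    h = 0.5d0 * (h + transpose(h))

    ! Workspace query (lwork = liwork = -1), then the actual diagonalization
    allocate(work(1), iwork(1))
    call dsyevd('V', 'L', n, h, n, w, work, -1, iwork, -1, info)
    lwork  = int(work(1))
    liwork = iwork(1)
    deallocate(work, iwork)
    allocate(work(lwork), iwork(liwork))
    call dsyevd('V', 'L', n, h, n, w, work, lwork, iwork, liwork, info)
    ! On exit, w holds the eigenvalues and h the eigenvectors
  end program diag_sketch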

HFODD

– Solves the HFB equations in the deformed, Cartesian HO basis
– Breaks all symmetries (if needed)
– Zero-range and finite-range forces coded
– Additional features: cranking, angular momentum projection, etc.

Technicalities:
– Fortran 77, Fortran 90
– BLAS, LAPACK
– I/O with standard input/output + a few files

Redde Caesari quae sunt Caesaris (“Render unto Caesar what is Caesar's”)

OPTIMIZATIONS HFODD for Leadership Class Computers

Loop reordering

Fortran matrices are stored in memory column-wise: for good stride, elements must be accessed with the column index in the outer loop and the row index in the innermost loop. The cost of bad stride grows quickly with the number of indexes and dimensions.

Ex.: accessing M(i,j,k) – good order: do k / do j / do i (innermost loop over the first index); bad order: do i / do j / do k (see the sketch below).

Figure: Time of 10 HF iterations as a function of the model space (Skyrme SLy4, 208Pb, HF, exact Coulomb exchange)
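
A minimal, self-contained Fortran illustration of the point above (the array name M and the size n are illustrative, not from HFODD): the same reduction over a 3-index array in bad and good loop order.

  program stride_sketch
    implicit none
    integer, parameter :: n = 200
    double precision, allocatable :: m(:,:,:)
    double precision :: s
    integer :: i, j, k

    allocate(m(n, n, n))
    call random_number(m)

    ! Bad stride: the innermost loop runs over the LAST index, so
    ! consecutive iterations jump n*n elements apart in memory.
    s = 0.0d0
    do i = 1, n
       do j = 1, n
          do k = 1, n
             s = s + m(i, j, k)
          end do
       end do
    end do

    ! Good stride: innermost loop over the FIRST index; Fortran stores
    ! arrays column-major, so these accesses are contiguous in memory.
    s = 0.0d0
    do k = 1, n
       do j = 1, n
          do i = 1, n
             s = s + m(i, j, k)
          end do
       end do
    end do
    print *, s
  end program stride_sketch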

Threading (OpenMP)

OpenMP is designed to automatically parallelize loops. Ex.: calculation of the density matrix in the HO basis, a triple nested loop of the form do j = 1, N / do i = 1, N / do μ = 1, N.

Solutions:
– Thread it with OpenMP
– When possible, replace all such manual linear algebra with BLAS/LAPACK calls (threaded versions exist) – see the sketch below

Figure: Time of 10 HFB iterations as a function of the number of threads (Jaguar Cray XT5 – Skyrme SLy4, 152Dy, HFB, 14 full shells)
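
The sketch below (assumed array shapes; not the actual HFODD routine) shows both options for a density-like matrix rho = V V^T: an OpenMP-threaded triple loop and the equivalent single call to a threaded BLAS (DGEMM).

  program density_sketch
    implicit none
    integer, parameter :: n = 1000
    double precision, allocatable :: v(:,:), rho(:,:)
    integer :: i, j, mu

    allocate(v(n,n), rho(n,n))
    call random_number(v)        ! stands in for the lower HFB components V

    ! (a) Manual triple loop, threaded over the outermost (column) index
    !$omp parallel do private(i, mu)
    do j = 1, n
       do i = 1, n
          rho(i, j) = 0.0d0
          do mu = 1, n
             rho(i, j) = rho(i, j) + v(i, mu) * v(j, mu)
          end do
       end do
    end do
    !$omp end parallel do

    ! (b) Same result with one BLAS call, rho = V * V^T; a threaded BLAS
    !     (e.g. Cray libsci, MKL) parallelizes this internally.
    call dgemm('N', 'T', n, n, n, 1.0d0, v, n, v, n, 0.0d0, rho, n)
    print *, 'rho(1,1) =', rho(1, 1)
  end program density_sketch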

Parallel Performance (MPI)

Figure: Time of 10 HFB iterations as a function of the number of cores (Jaguar Cray XT5, no threads – Skyrme SLy4, 152Dy, HFB, 14 full shells)

DFT = naturally parallel: 1 core = 1 configuration (only if ‘all’ fits into core).

HFODD characteristics:
– Very little communication overhead
– Lots of I/O per processor (specific to that processor): 3 ASCII files/core

Scalability limited by:
– File system performance
– Usability of the results (handling of thousands of files)

ADIOS library being implemented.
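
A minimal sketch of the "1 core = 1 configuration" layout (hypothetical file naming and solver routine, not the HFODD driver): each MPI rank selects its own configuration and writes to its own output file, which is why the I/O pattern scales with the number of cores.

  program task_sketch
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, iconf
    character(len=32) :: fname

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)
    call mpi_comm_size(mpi_comm_world, nprocs, ierr)

    iconf = rank + 1                            ! one configuration per core
    write(fname, '(a,i6.6,a)') 'hfodd_', iconf, '.out'

    ! call run_hfb_configuration(iconf, fname)  ! hypothetical solver call

    call mpi_finalize(ierr)
  end program task_sketch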

ScaLAPACK

Multi-threading already makes more memory per core available, but how about the scalability of the diagonalization for large model spaces?

ScaLAPACK successfully implemented for simplex-breaking HFB calculations (J. McDonnell) – see the sketch below.

Current issues:
– Needs detailed profiling, as no speed-up is observed: where is the bottleneck?
– Is the problem size adequate?
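
For reference, a minimal, self-contained sketch of a ScaLAPACK diagonalization with PDSYEVD on a block-cyclically distributed symmetric matrix (the test matrix, block size and grid choice are illustrative; this is not the HFODD implementation):

  program scalapack_sketch
    implicit none
    integer, parameter :: n = 1000, nb = 64
    integer :: ictxt, nprow, npcol, myrow, mycol, iam, nprocs
    integer :: locr, locc, lld, info, lwork, liwork, i, j, ig, jg
    integer :: desca(9), descz(9)
    double precision, allocatable :: a(:,:), z(:,:), w(:), work(:)
    integer, allocatable :: iwork(:)
    integer, external :: numroc, indxl2g

    ! 1. Build a (roughly square) BLACS process grid
    call blacs_pinfo(iam, nprocs)
    nprow = int(sqrt(dble(nprocs)))
    npcol = nprocs / nprow
    call blacs_get(-1, 0, ictxt)
    call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
    call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

    if (myrow >= 0 .and. mycol >= 0) then       ! ranks inside the grid
       ! 2. Allocate local blocks and build the array descriptors
       locr = numroc(n, nb, myrow, 0, nprow)
       locc = numroc(n, nb, mycol, 0, npcol)
       lld  = max(1, locr)
       allocate(a(lld, locc), z(lld, locc), w(n))
       call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)
       call descinit(descz, n, n, nb, nb, 0, 0, ictxt, lld, info)

       ! 3. Fill the local blocks of a simple symmetric test matrix
       !    (in HFODD these would hold the distributed HFB matrix)
       do j = 1, locc
          jg = indxl2g(j, nb, mycol, 0, npcol)
          do i = 1, locr
             ig = indxl2g(i, nb, myrow, 0, nprow)
             a(i, j) = 1.0d0 / dble(ig + jg)
          end do
       end do

       ! 4. Workspace query, then the distributed diagonalization
       allocate(work(1), iwork(1))
       call pdsyevd('V', 'L', n, a, 1, 1, desca, w, z, 1, 1, descz, &
                    work, -1, iwork, -1, info)
       lwork  = int(work(1))
       liwork = iwork(1)
       deallocate(work, iwork)
       allocate(work(lwork), iwork(liwork))
       call pdsyevd('V', 'L', n, a, 1, 1, desca, w, z, 1, 1, descz, &
                    work, lwork, iwork, liwork, info)
       ! w now holds the eigenvalues; z the distributed eigenvectors

       call blacs_gridexit(ictxt)
    end if
    call blacs_exit(0)
  end program scalapack_sketch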

Hybrid MPI/OpenMP Parallel Model

Three levels of parallelism:
– Task management (MPI): MPI distributes the HFB configurations (see the sketch below)
– Threading (OpenMP): threads for loop optimization
– ScaLAPACK (MPI): optional MPI sub-communicator for very large bases needing ScaLAPACK – spread one HFB calculation across a few cores (< 12-24)

Figure: scheduling of the HFB tasks (HFB – i/N, HFB – (i+1)/N) across cores as a function of time
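
A minimal sketch of the sub-communicator layer (group_size and the solver call are placeholders, not HFODD names): MPI_COMM_SPLIT carves the world communicator into small groups, each of which would run one HFB calculation with ScaLAPACK on its sub-communicator and OpenMP threads within each rank.

  program hybrid_sketch
    use mpi
    implicit none
    integer, parameter :: group_size = 12
    integer :: ierr, rank, color, subcomm, subrank, subsize

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)

    color = rank / group_size                   ! ranks 0..11 -> group 0, etc.
    call mpi_comm_split(mpi_comm_world, color, rank, subcomm, ierr)
    call mpi_comm_rank(subcomm, subrank, ierr)
    call mpi_comm_size(subcomm, subsize, ierr)

    ! Each group now solves its own configuration; inside the group,
    ! ScaLAPACK runs on 'subcomm' and OpenMP threads speed up the loops.
    ! call run_hfb_configuration(color, subcomm)   ! hypothetical call

    call mpi_comm_free(subcomm, ierr)
    call mpi_finalize(ierr)
  end program hybrid_sketch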

Conclusions

– DFT codes are naturally parallel and can easily scale to 1M processors or more
– High-precision applications of DFT are time- and memory-consuming computations: need for fine-grain parallelization
– HFODD benefits from HPC techniques and code examination:
  – Loop reordering gives N ≫ 1 speed-up (Coulomb exchange: N ~ 3; Gogny force: N ~ 8)
  – Multi-threading gives an extra factor > 2 (only a few routines have been upgraded)
  – ScaLAPACK implemented: very large bases (Nshell > 25) can now be used (Ex.: near scission)
– Scaling is only average on the standard Jaguar file system because of un-optimized I/O

Year 4 – 5 Roadmap

Year 4:
– More OpenMP, debugging of the ScaLAPACK routine
– First tests of the ADIOS library (at scale)
– First development of a prototype Python visualization interface
– Tests of large-scale, I/O-bridled, multi-constrained calculations

Year 5:
– Full implementation of ADIOS
– Set up a framework for automatic restart (at scale)

SVN repository (ask Mario for an account)