
PARATEC and the Generation of the Empty States (Starting point for GW/BSE) Andrew Canning Computational Research Division, LBNL and Chemical Engineering.


1 PARATEC and the Generation of the Empty States (Starting point for GW/BSE) Andrew Canning Computational Research Division, LBNL and Chemical Engineering and Materials Science Dept. UC Davis.

2 GW/BSE Method Overview
- DFT Kohn-Sham (SCF and NSCF): {φ^DFT_nk(r), E^DFT_nk}
- Compute Dielectric Function: { }
- GW: Quasiparticle Properties: {φ^QP_nk(r), E^QP_nk}
- BSE: Construct Kernel (coarse grid): K(k,c,v,k',c',v')
- Interpolate Kernel to Fine Grid / Diagonalize BSE Hamiltonian: {A^s_cvk, E^s_cvk}
- Expt.: G.E. Jellison, M.F. Chisholm, S.M. Gorbatkin, Appl. Phys. Lett. 62, 3348 (1993).

3 Computational Cost: GW Method for nanotube
- 80 carbon atoms, cell 80x80x4.6 au
- 160 occupied (valence) bands, 800 unoccupied (conduction) bands
- k-points: 1x1x32 (coarse), 1x1x256 (fine)
- Running on Cray XE6 Hopper
- Generation of empty states: ~30% of computational cost and the highest in terms of wall clock time
- Scaling issues when running DFT codes for a large number of bands (on a relatively small system)

4 Features of Different Codes for generation of empty states (what to use for GW/BSE?)
SIESTA (Spanish Initiative for Electronic Simulations with Thousands of Atoms)
- Basis set: LCAO (Linear Combination of Atomic Orbitals)
- Less accurate basis allows larger systems to be studied (thousands of atoms)
- Good for non-periodic systems, large molecules
- O(N) algorithms implemented in LCAO basis
PARSEC (Pseudopotential Algorithm for Real-Space Electronic structure Calculations)
- Grid-based real-space representation, finite-difference approach
- Easy to implement non-periodic boundary conditions
- Good for large molecules etc.
Quantum Espresso
- Plane Wave basis set (same as BerkeleyGW code)
- PAW (Projector Augmented Wavefunctions) option
- Hybrid Functionals
PARATEC (PARAllel Total Energy Code)
- Plane Wave basis set (same as BerkeleyGW code)
- Good for periodic systems (crystals etc., metallic systems)
- Hybrid Functionals, static-COHSEX
- OpenMP/MPI Hybrid implementation

5 PARATEC (PARAllel Total Energy Code)
- PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set
- Written in F90 and MPI
- Designed to run on large parallel machines (Cray, IBM etc.) but also runs on PCs
- PARATEC uses an all-band CG approach to obtain the wavefunctions of the electrons (blocks comms., specialized 3d FFTs)
- Generally obtains a high percentage of peak on different platforms (uses BLAS3 and 1d FFT libs)
- Developed by Louie and Cohen groups (UCB, LBNL) in collaboration with CRD, NERSC

6 Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC
Computational Task (CG solver)     Scaling
Orthogonalization                  MN^2
Subspace diagonalization           N^3
3d FFTs (most communications)      NM log M
Nonlocal pseudopotential           MN^2 (N^2 real space)
N: number of eigenpairs required (lowest in spectrum)
M: matrix (Hamiltonian) dimension, basis set size (M ~ N)
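The scaling table can be turned into a rough cost model. The following is a minimal sketch, not PARATEC code: the function names and the example sizes are hypothetical, all prefactors are omitted, and the logarithm base only changes the FFT term by a constant factor.

```python
import math


def cg_cost_terms(n_states, basis_size):
    """Leading-order operation counts from the scaling table, prefactors omitted."""
    n, m = n_states, basis_size
    return {
        "orthogonalization": m * n**2,
        "subspace_diagonalization": n**3,
        "fft_3d": n * m * math.log(m),
        "nonlocal_pseudopotential": m * n**2,  # Fourier-space form, not the N^2 real-space form
    }


def cost_shares(terms):
    """Fraction of the summed leading-order cost attributed to each task."""
    total = sum(terms.values())
    return {task: cost / total for task, cost in terms.items()}


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    terms = cg_cost_terms(n_states=200, basis_size=20000)
    for task, share in cost_shares(terms).items():
        print(f"{task:28s} {share:6.1%}")
```

Because real prefactors differ strongly between BLAS3 kernels and FFTs, the shares indicate trends with N and M rather than measured timings.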

7 Load Balancing, Parallel Data Layout
- Wavefunctions stored as spheres of points (due to energy cutoff)
- Data intensive parts (BLAS) proportional to number of Fourier components
- Pseudopotential calculation, Orthogonalization scale as N^3 (N-atom system)
- FFT part scales as N^2 log N
FFT Data distribution: load balancing constraints (Fourier Space):
- each processor should have the same number of Fourier coefficients (N^3 calcs.)
- each processor should have complete columns of Fourier coefficients (3d FFT)
- Give out sets of columns of data to each processor
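The two constraints above (equal numbers of Fourier coefficients, whole columns only) can be illustrated with a small sketch. It assumes a greedy heuristic, longest column first to the currently least-loaded processor, which the slide does not spell out; the function names and the integer cutoff radius are hypothetical.

```python
import heapq
import math


def sphere_columns(radius):
    """Map each (gx, gy) column inside an integer-radius sphere to its number of gz points."""
    cols = {}
    for gx in range(-radius, radius + 1):
        for gy in range(-radius, radius + 1):
            rem = radius * radius - gx * gx - gy * gy
            if rem >= 0:
                half = math.isqrt(rem)          # largest |gz| still inside the sphere
                cols[(gx, gy)] = 2 * half + 1   # gz runs from -half to +half
    return cols


def distribute_columns(cols, n_procs):
    """Assign whole columns to processors, longest first, always to the least-loaded one."""
    heap = [(0, p) for p in range(n_procs)]     # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    ordered = sorted(cols.items(), key=lambda kv: (-kv[1], kv[0]))
    for col, length in ordered:
        load, p = heapq.heappop(heap)
        assignment[p].append(col)
        heapq.heappush(heap, (load + length, p))
    loads = {p: sum(cols[c] for c in assignment[p]) for p in assignment}
    return assignment, loads


if __name__ == "__main__":
    cols = sphere_columns(8)
    _, loads = distribute_columns(cols, n_procs=4)
    print(loads)
```

With this heuristic the spread between the most and least loaded processor never exceeds the length of the longest column, so the imbalance shrinks relative to the total as the sphere grows.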

8 PARATEC: Performance
- All architectures generally achieve high performance due to computational intensity of code (BLAS3, FFT)
- ES achieves highest overall performance: 5.5 Tflop/s on 2048 procs (5.3 Tflop/s on XT4 on 2048 procs in single proc. node mode)
- FFT used for benchmark for NERSC procurements (run on up to 18K procs on Cray XT4, weak scaling)
- Vectorisation directives and multiple 1d FFTs required for NEC SX6
- Developed with Louie and Cohen's groups (UCB, LBNL), also work with L. Oliker, J. Carter
[Table: Gflops/proc and % of peak versus processor count for the 488-atom CdSe quantum dot problem on Bassi NERSC (IBM Power5), Jacquard NERSC (Opteron), Thunder (Itanium2), Franklin NERSC (Cray XT4), NEC ES (SX6) and IBM BG/L. The cell values are not recoverable from the transcript.]

9 Parallelization in PW DFT codes: four levels (k-points, bands, PWs, OpenMP)
- Band parallelization: n nodes divided into groups
- k-point parallelization: divide k-points among groups of nodes (limited for large systems, molecules, nanostructures etc.)
- PW parallelization: each group parallelizes over PWs
- OpenMP, threaded libs on the node/chip
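One way to picture the three MPI levels is as a mapping from a flat rank to (k-point group, band group, PW rank). The sketch below is only an illustration under assumed conventions: the rank ordering, the round-robin k-point assignment, and all names are hypothetical and are not taken from PARATEC.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    kpoint_group: int
    band_group: int
    pw_rank: int


def rank_to_coords(rank, n_kgroups, n_bgroups, n_pw):
    """Split a flat MPI-style rank into k-point group, band group and plane-wave rank."""
    total = n_kgroups * n_bgroups * n_pw
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} outside 0..{total - 1}")
    k, rest = divmod(rank, n_bgroups * n_pw)
    b, pw = divmod(rest, n_pw)
    return Coords(k, b, pw)


def kpoints_for_group(n_kpoints, n_kgroups, group):
    """Round-robin assignment of k-point indices to one k-point group."""
    return list(range(group, n_kpoints, n_kgroups))


if __name__ == "__main__":
    # 2 k-point groups x 2 band groups x 4 plane-wave ranks = 16 ranks.
    for rank in (0, 5, 13):
        print(rank, rank_to_coords(rank, 2, 2, 4))
    print(kpoints_for_group(32, 2, 1))
```

In this picture, the fourth level (OpenMP threads) would live inside each rank and does not appear in the mapping.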

10 OpenMP, Threading for on-node/chip parallelism
- fewer MPI messages avoids communication bottlenecks
- aggregation of messages per node reduces latency issues
- smaller memory footprint (from code and MPI buffers)
- no on-node MPI messaging
- extra level of parallelism to improve scaling to larger core counts
[Figure: Timing results for threaded version of PARATEC code used to generate VB and CB states for input to GW code. PARATEC (Cray XT5 Jaguar), 686 Si atoms.]
Jaguar Cray XT5 at ORNL (224,162 cores). Node: 2 AMD Istanbul 2.6 GHz 6-core chips (total 12 cores, 2x6 cores)

11 Non-SCF problem to generate empty CB states
- Solve self-consistently for N_VB valence states -> Output
- Solve non-self-consistently for N_VB + N_CB states -> Output for GW/BSE codes
- Non-SCF problem is like simulation of metallic system (no gap above top of spectrum)
- Slow convergence; requires convergence criteria for empty states
- N_VB + N_CB can be very large
- Operations on subspace matrix can dominate
- High percentage of eigenpairs calculated compared to SCF calc.
- Typically almost all the time is for the Non-SCF calc.

12 Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC
Computational Task (CG solver)     Scaling
Orthogonalization                  MN^2
Subspace diagonalization           N^3
3d FFTs (most communications)      NM log M
Nonlocal pseudopotential           MN^2 (N^2 real space)
N = N_VB + N_CB (N_VB): number of eigenpairs required
M: matrix (Hamiltonian) dimension, basis set size (M ~ 10-20N) (M ~ N)
NSCF calculation for GW/BSE (compared to standard SCF)
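To see why the NSCF step dominates, the scalings can be applied to the nanotube example from slide 3 (160 valence and 800 conduction bands). The sketch below assumes the basis size M is the same in the SCF and NSCF runs and ignores prefactors and iteration counts; the function name is hypothetical.

```python
def nscf_growth_factors(n_vb, n_cb):
    """Per-task cost growth when going from N = N_VB to N = N_VB + N_CB at fixed M."""
    ratio = (n_vb + n_cb) / n_vb
    return {
        "orthogonalization": ratio**2,         # M N^2
        "subspace_diagonalization": ratio**3,  # N^3
        "fft_3d": ratio,                       # N M log M
        "nonlocal_pseudopotential": ratio**2,  # M N^2
    }


if __name__ == "__main__":
    # Band counts from the nanotube example: 160 occupied, 800 unoccupied.
    for task, factor in nscf_growth_factors(160, 800).items():
        print(f"{task:28s} x{factor:g}")
```

For these band counts N grows by a factor of 6, so the MN^2 terms grow by 36 and the N^3 subspace term by 216, which is consistent with the remark that operations on the subspace matrix can dominate.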

13 PARATEC features for Non-SCF problem
- Efficient distributed implementation of operations on subspace matrix using ScaLAPACK
- Extra states calculated above the required number to improve convergence of CG solver
- Option for using direct solver on Hamiltonian: when the percentage of eigenpairs required is high (>10%) it can be faster than the CG iterative solver (P. Zhang)
- Scaling of iterative solver (e.g. CG) ~ N^2 M, compared to direct (LAPACK, ScaLAPACK) ~ M^3 (M = matrix size (basis, number of PWs), N = number of states)
- Block-block data layout; block size chosen for optimal performance
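The iterative-versus-direct trade-off follows from the two scalings: the cost ratio is proportional to (N/M)^2, so the direct solver wins once the fraction of eigenpairs N/M is large enough. A minimal sketch follows; the prefactors are hypothetical stand-ins for iteration counts and kernel efficiency, and the value 100 is chosen only because it reproduces a 10% crossover, not because it was measured.

```python
import math


def iterative_cost(n_states, basis_size, prefactor=1.0):
    """Leading-order cost of an iterative (e.g. CG) solver: prefactor * N^2 * M."""
    return prefactor * n_states**2 * basis_size


def direct_cost(basis_size, prefactor=1.0):
    """Leading-order cost of dense direct diagonalization: prefactor * M^3."""
    return prefactor * basis_size**3


def crossover_fraction(iter_prefactor, direct_prefactor=1.0):
    """Fraction N/M at which both costs are equal: sqrt(direct_prefactor / iter_prefactor)."""
    return math.sqrt(direct_prefactor / iter_prefactor)


if __name__ == "__main__":
    # Assumed prefactor ratio of 100; gives a crossover at N/M = 0.1.
    f = crossover_fraction(iter_prefactor=100.0)
    print(f"direct solver cheaper above N/M = {f:.2f}")
```

Above the crossover fraction the M^3 direct solve is cheaper in this model; below it the iterative solver is.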

14 PARATEC summary and future developments
- PARATEC optimized for large parallel machines (Cray, IBM)
- OpenMP/threaded version under development (important to get more parallelism, particularly for small systems for GW/BSE; gives faster time to solution)
- Hybrid Functionals, static-COHSEX (starting point for GW/BSE)
- Some optimization for generation of empty states for GW/BSE
- Direct diagonalization of H for cases when a high % of eigenstates is required for GW/BSE (to be in released version soon)

