HIGH PERFORMANCE ELECTRONIC STRUCTURE THEORY
Mark S. Gordon, Klaus Ruedenberg
Ames Laboratory, Iowa State University
BBG

OUTLINE
Methods and Strategies
  –Correlated electronic structure methods
  –Distributed Data Interface (DDI)
  –Approaches to efficient HPC in chemistry
  –Scalability with examples

CORRELATED ELECTRONIC STRUCTURE METHODS
Well Correlated Methods Needed for
  –Accurate relative energies, dynamics
  –Treatment of excited states, photochemistry
  –Structures of diradicals, complex species
Computationally demanding: Scalability important
HF Often Reasonable Starting Point for Ground States, Small Diradical Character
  –Single reference perturbation theory
    MP2/MBPT2
    Scales ~N^5
    Size consistent
    Higher order MBPT methods often perform worse

SINGLE REFERENCE COUPLED CLUSTER METHODS
  –Cluster expansion is more robust
    Can sum all terms in expansion
    Size-consistent
  –State-of-the-art single reference method
    CCSD, CCSDT, CCSDTQ, …
    CCSD(T), CR-CCSD(T): efficient compromise
  –Scales ~N^7
Methods often fail for bond-breaking: consider N2
  –Breaking 3 bonds:
  –Minimal active space = (6,6)

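To make the scaling exponents concrete, the short Python sketch below estimates how the cost of an MP2-like (~N^5) and a CCSD(T)-like (~N^7) calculation grows when the number of basis functions N is doubled. It assumes an idealized pure power law with no prefactor or lower-order terms, which production codes do not follow exactly; the function name `relative_cost` and the sizes 100 and 200 are illustrative only.

```python
def relative_cost(n_old: int, n_new: int, exponent: int) -> float:
    """Cost ratio for a method whose cost scales as N**exponent.

    Assumption: an idealized power law, ignoring prefactors and
    lower-order terms.
    """
    return (n_new / n_old) ** exponent


# Doubling the basis size under the idealized scaling laws.
mp2_factor = relative_cost(100, 200, 5)     # ~N^5 (MP2/MBPT2)
ccsd_t_factor = relative_cost(100, 200, 7)  # ~N^7 (CCSD(T))

print(f"MP2 cost grows by a factor of {mp2_factor:.0f}")
print(f"CCSD(T) cost grows by a factor of {ccsd_t_factor:.0f}")
```

Under these assumptions, doubling N costs 32 times more for the N^5 method and 128 times more for the N^7 method, which is why scalability matters for correlated methods.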
MCSCF METHODS
Single configuration methods can fail for
  –Species with significant diradical character
  –Bond breaking processes
  –Often for excited electronic states
  –Unsaturated transition metal complexes
Then an MCSCF-based method is necessary
Most common approach is
  –Complete active space SCF (CASSCF/FORS)
    Active space = orbitals + electrons involved in process
    Full CI within active space: optimize orbitals & CI coefficients
    Size-consistent

MULTI-REFERENCE METHODS
Multi reference methods, based on MCSCF
  –Second order perturbation theory (MRPT2)
    Relatively computationally efficient
    Size consistency depends on implementation
  –Multi reference configuration interaction (MRCI)
    Very accurate, very time-consuming
    Highly resource demanding
    Most common is MR(SD)CI
    Generally limited to (14,14) active space
    Not size-consistent
  –How to improve efficiency?

DISTRIBUTED PARALLEL COMPUTING
Distribute large arrays among available processors
Distributed Data Interface (DDI) in GAMESS
  –Developed by G. Fletcher, M. Schmidt, R. Olson
  –Based on one-sided message passing
  –Implemented on T3E using SHMEM
  –Implemented on clusters using sockets or MPI, and paired CPU/data server

The virtual shared-memory model. Each large box (grey) represents the memory available to a given CPU. The inner boxes represent the memory used by the parallel processes (rank in lower right). The gold region depicts the memory reserved for the storage of distributed data. The arrows indicate memory access (through any means) for the distributed operations: get, put and accumulate.
FULL shared-memory model: All DDI processes within a node attach to all the shared-memory segments. The accumulate operation shown can now be completed directly through memory.
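The three distributed operations named in the captions above (get, put and accumulate) can be illustrated with a toy, single-process Python model. This is only a sketch of the idea: it is not the DDI API, the class `ToyDistributedArray` and all of its method names are hypothetical, and the contiguous column-block layout is an assumption made for illustration. Each "process" owns one block of columns, and the caller addresses data by global column index without knowing which rank stores it.

```python
class ToyDistributedArray:
    """Toy single-process model of a column-distributed matrix.

    NOT the DDI API: names and layout are hypothetical and only mimic
    the get/put/accumulate idea.
    """

    def __init__(self, n_rows: int, n_cols: int, n_procs: int):
        self.n_rows = n_rows
        # Split columns into near-equal contiguous blocks, one per rank.
        base, extra = divmod(n_cols, n_procs)
        self.col_start = [0]
        for rank in range(n_procs):
            width = base + (1 if rank < extra else 0)
            self.col_start.append(self.col_start[-1] + width)
        # Each rank's segment stores only the columns it owns.
        self.segments = [
            [[0.0] * n_rows
             for _ in range(self.col_start[r + 1] - self.col_start[r])]
            for r in range(n_procs)
        ]

    def owner(self, col: int) -> int:
        """Rank whose segment stores the given global column."""
        for rank in range(len(self.segments)):
            if self.col_start[rank] <= col < self.col_start[rank + 1]:
                return rank
        raise IndexError(f"column {col} out of range")

    def _local(self, col: int) -> list:
        rank = self.owner(col)
        return self.segments[rank][col - self.col_start[rank]]

    def get(self, col: int) -> list:
        """Copy a column out of its owner's segment."""
        return list(self._local(col))

    def put(self, col: int, values: list) -> None:
        """Overwrite a column in its owner's segment."""
        if len(values) != self.n_rows:
            raise ValueError("wrong column length")
        self._local(col)[:] = values

    def accumulate(self, col: int, values: list) -> None:
        """Add values element-wise into a column in its owner's segment."""
        if len(values) != self.n_rows:
            raise ValueError("wrong column length")
        local = self._local(col)
        for i, v in enumerate(values):
            local[i] += v


arr = ToyDistributedArray(n_rows=3, n_cols=10, n_procs=4)
arr.put(7, [1.0, 2.0, 3.0])
arr.accumulate(7, [0.5, 0.5, 0.5])
print(arr.owner(7), arr.get(7))
```

In a real distributed setting the owner lookup would be followed by a one-sided transfer or, within a node, a direct memory access; here everything lives in one Python process so only the addressing logic is shown.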
CURRENTLY DDI ENABLED
Currently implemented
  –Closed shell MP2 energies & gradients
    Most efficient closed shell correlated method when appropriate (single determinant)
    Geometry optimizations
    Reaction path following
    On-the-fly “direct dynamics”
  –Unrestricted open shell MP2 energies & gradients
    Simplest correlated method for open shells
  –Restricted open shell (ZAPT2) energies & gradients
    Most efficient open shell correlated method
    No spin contamination through second order

CURRENTLY DDI ENABLED
  –CASSCF Hessians
    Necessary for vibrational frequencies, transition state searches, building potential energy surfaces
  –MRMP2 energies
    Most efficient correlated multi-reference method
  –Singles CI energies & gradients
    Simplest qualitative method for excited electronic states
  –Full CI energies
    Exact wavefunction for a given atomic basis
  –Effective fragment potentials
    Sophisticated model for intermolecular interactions

COMING TO DDI
In progress
  –Vibronic (derivative) coupling (Tim Dudley)
    Conical intersections, photochemistry
  –GVVPT2 energies & gradients: Mark Hoffmann
  –ORMAS energies, gradients: Joe Ivanic, Andrey Asadchev
    Subdivides CASSCF active space into subspaces
  –Coupled cluster methods: Ryan Olson, Ian Pimienta, Alistair Rendell
    Collaboration with Piotr Piecuch, Ricky Kendall
Key Point:
  –Must grow problem size to maximize scalability

FULL CI: ZHENGTING GAN
  –Full CI = exact wavefunction for given atomic basis
  –Extremely computationally demanding
    Scales ~e^N
    Can generally only be applied to atoms & small molecules
    Very important because all other approximate methods can be benchmarked against Full CI
    Can expand the size of applicable molecules by making the method highly scalable/parallel
    CI part of FORS/CASSCF

  –Parallel performance for FCI on IBM P3 cluster
* singlet state of H3COH:
  –14 electrons in 14 orbitals
  –11,778,624 determinants
** singlet state of H2O2:
  –14 electrons in 15 orbitals
  –41,409,225 determinants
JCP, 119, 47 (2003)

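The determinant counts quoted for the two singlet benchmarks can be checked with a short Python sketch. It assumes the standard counting of alpha and beta strings with no use of spatial symmetry, that is, C(n_orb, n_alpha) × C(n_orb, n_beta); the function name `n_determinants` is illustrative and is not taken from GAMESS.

```python
from math import comb


def n_determinants(n_electrons: int, n_orbitals: int, multiplicity: int = 1) -> int:
    """Number of Slater determinants with M_S = S in a full CI space.

    Assumption: spatial symmetry is ignored, so the count is simply the
    product of the numbers of alpha and beta occupation strings.
    """
    n_unpaired = multiplicity - 1
    if n_unpaired > n_electrons or (n_electrons - n_unpaired) % 2 != 0:
        raise ValueError("electron count and multiplicity are inconsistent")
    n_beta = (n_electrons - n_unpaired) // 2
    n_alpha = n_electrons - n_beta
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)


h3coh = n_determinants(14, 14)  # singlet, 14 electrons in 14 orbitals
h2o2 = n_determinants(14, 15)   # singlet, 14 electrons in 15 orbitals

print(f"{h3coh:,}")  # 11,778,624
print(f"{h2o2:,}")   # 41,409,225
```

Adding a single orbital to the 14-electron space raises the count by a factor of about 3.5, which illustrates the steep growth of full CI with problem size.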
  –Parallel performance for FCI on Cray X1 (ORNL)
O⁻
  –aug-cc-pVTZ atomic basis, O 1s orbitals frozen
  –7 valence electrons in 79 orbitals
  –14,851,999,576 determinants: ~8-10 Gflops of 12.5 theoretical
  –Latest results: aug-cc-pVTZ C2, 8 electrons in 68 orbitals
  –64,931,348,928 determinants, < 4 hours wall time!

–Comparison with Coupled Cluster
Full Potential Energy Surfaces
F2 potential energy curves: cc-pVTZ
MCSCF HESSIANS: TIM DUDLEY
  –Analytic Hessians generally superior to numerical or semi-numerical
  –Finite displacements frequently cause artificial symmetry breaking or root flipping
  –Necessary step for derivative coupling
  –Computationally demanding: Parallel efficiency desirable
  –DDI-based MCSCF Hessians
  –IBM clusters, 64-bit Linux

304 basis functions, small active space: dominated by calculation of derivative integrals
Large active space, small AO basis: dominated by calculation of CI blocks of H
216 basis functions, full active space: calculation is a mix of all bottlenecks
ZAPT2 BENCHMARKS
IBM p640 nodes connected by dual Gigabit Ethernet
  –4 Power3-II processors at 375 MHz
  –16 GB memory
Tested
  –Au3H4
  –Au3O4
  –Au5H4
  –Ti2Cl2Cp4
  –Fe-porphyrin: imidazole

Au3H4
Basis set
  –aug-cc-pVTZ on H
  –uncontracted SBKJC with 3f2g polarization functions and one diffuse sp function on Au
  –380 spherical harmonic basis functions
31 DOCC, 1 SOCC
9.5 MWords replicated
170 MWords distributed

Au3O4
Basis set
  –aug-cc-pVTZ on O
  –uncontracted SBKJC with 3f2g polarization functions and one diffuse sp function on Au
  –472 spherical harmonic basis functions
44 DOCC, 1 SOCC
20.7 MWords replicated
562 MWords distributed

Au5H4
Basis set
  –aug-cc-pVTZ on H
  –uncontracted SBKJC with 3f2g polarization functions and one diffuse sp function on Au
  –572 spherical harmonic basis functions
49 DOCC, 1 SOCC
30.1 MWords replicated
1011 MWords distributed

Fe-porphyrin: imidazole
Two basis sets
  –MIDI with d polarization functions (N = 493)
  –TZV with d,p polarization functions (N = 728)
110 DOCC, 2 SOCC
N = 493
  –32.1 MWords replicated
  –2635 MWords distributed
N = 728
  –52.1 MWords replicated
  –5536 MWords distributed

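The replicated and distributed memory figures can be turned into approximate per-process requirements with a small Python sketch. It rests on assumptions that are not stated on the slides: one word is taken as 8 bytes (64-bit), one MWord as 10^6 words, and the distributed array is taken to be spread evenly over all processes. The function names are illustrative.

```python
BYTES_PER_WORD = 8           # assumption: 64-bit words
WORDS_PER_MWORD = 1_000_000  # assumption: decimal "mega"


def mwords_to_gb(mwords: float) -> float:
    """Convert MWords to gigabytes (10**9 bytes) under the assumptions above."""
    return mwords * WORDS_PER_MWORD * BYTES_PER_WORD / 1e9


def per_process_gb(replicated_mwords: float, distributed_mwords: float,
                   n_procs: int) -> float:
    """Memory per process: the full replicated part plus an even share
    of the distributed part (assumption: perfectly even distribution)."""
    return mwords_to_gb(replicated_mwords + distributed_mwords / n_procs)


# Large Fe-porphyrin case (N = 728) on 64 processes.
total_distributed_gb = mwords_to_gb(5536)
large_fe_per_process_gb = per_process_gb(52.1, 5536, 64)

print(f"distributed total: {total_distributed_gb:.1f} GB")
print(f"per process:       {large_fe_per_process_gb:.2f} GB")
```

Under these assumptions the distributed data for the largest case is roughly 44 GB in aggregate, more than the 16 GB of a single node, while the share held by each of 64 processes is close to 1.1 GB.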
Load Balancing
Au3H4 on 64 processors
  –Total CPU time ranged from 1124 to 1178 sec.
  –Master spent 1165 sec.
  –average: 1147 sec.
  –standard deviation: 13.5 sec.
Large Fe-porphyrin on 64 processors
  –Total CPU time ranged from to sec.
  –Master spent sec.
  –average: sec.
  –standard deviation: 162 sec.

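The Au3H4 timings can be condensed into a few load-balance indicators. The Python sketch below is one possible way to summarize them; the choice of metrics and the function name `imbalance_metrics` are assumptions for illustration, not part of the original benchmark. The Fe-porphyrin case is not evaluated because most of its timings are missing from the transcript.

```python
def imbalance_metrics(t_min: float, t_max: float,
                      t_avg: float, t_std: float) -> dict:
    """Simple load-balance indicators from per-process CPU time statistics."""
    return {
        # Width of the CPU-time range relative to the average.
        "relative_spread": (t_max - t_min) / t_avg,
        # Standard deviation relative to the average.
        "coefficient_of_variation": t_std / t_avg,
        # How much longer the slowest process ran than the average one.
        "max_over_average": t_max / t_avg,
    }


# Au3H4 on 64 processors, CPU times in seconds as quoted above.
au3h4 = imbalance_metrics(t_min=1124.0, t_max=1178.0, t_avg=1147.0, t_std=13.5)

for name, value in au3h4.items():
    print(f"{name}: {value:.3f}")
```

For Au3H4 the standard deviation is a little over 1% of the average CPU time and the slowest process exceeds the average by under 3%, consistent with a well balanced run.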
THANKS!
GAMESS Gang
DOE SciDAC program
IBM SUR grants