Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Chemistry at Daresbury 16-22 November 2002 Computational Science and Engineering Department Daresbury Laboratory Computational Chemistry.

Similar presentations


Presentation on theme: "Computational Chemistry at Daresbury 16-22 November 2002 Computational Science and Engineering Department Daresbury Laboratory Computational Chemistry."— Presentation transcript:

1 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Computational Chemistry at Daresbury Laboratory Quantum Chemistry Group Martyn. F. Guest, Paul. Sherwood, and Huub J.J. van Dam Molecular Simulation Group Bill Smith and Maurice Leslie

2 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Outline Performance of Computational Chemistry Codes Serial Applications Benchmarks Serial Applications Benchmarks GAMESS-UK and DL_POLY GAMESS-UK and DL_POLY Parallel Performance on High-end and Commodity-class systems Parallel Performance on High-end and Commodity-class systems NWChem NWChem Global Array (GA) Tools Global Array (GA) Tools Parallel Eigensolver (PeIGS) Parallel Eigensolver (PeIGS) GAMESS-UK GAMESS-UK SCF, DFT, MP2 and 2nd Derivatives SCF, DFT, MP2 and 2nd Derivatives DL_POLY DL_POLY Version 2: Replicated data Version 2: Replicated data Version 3: Distributed data (domain decomposition) Version 3: Distributed data (domain decomposition) CHARMM and QM/MM Calculations CHARMM and QM/MM Calculations Thrombin and TIM benchmarks Thrombin and TIM benchmarks

3 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK and DL_POLY Serial Benchmarks 12 Typical QC Calculations ModuleBasis (GTOs) Species ModuleBasis (GTOs) Species 1. SCFSTO-3G (124)Morphine 2. SCF6-31G (154)C 6 H 3 (NO 2 ) 3 3. ECP GeometryECPDZ (70)Na 7 Mg + 4. Direct-SCF6-31G (82)Cytosine 5. CAS-geometryTZVP (52)H 2 CO 6. MCSCFEXT1 (74)H 2 CO 7. Direct-CIEXT2 (64)H 2 CO/H 2 +CO 8. MRD-CI (26M)ECP (59)TiCl 4 9. MP2-geometry6-31G* (70)H 3 SiNCO 10. SCF 2nd derivs.6-31G (64)C 5 H 5 N 11. MP2 2nd derivs.6-31G* (60)C Direct-MP2DZP (76)C 5 H 5 N Six Typical Simulations SimulationAtomsTime steps 1. Na-K disilicate glass Metallic Al with Sutton-Chen potential 3. Valinomycin in water molecules 4. Dynamic shell model water with 1024 sites 5. Dynamic shell model MgCl2 with 1280 sites 6. Model membrane, membrane chains, 202 solute and 2746 solvent molecules

4 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory The GAMESS-UK Benchmark Performance relative to the Compaq Alpha ES45/ minutes

5 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory The DL _ POLY Benchmark. Performance relative to the Compaq Alpha ES45/

6 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory High-End Systems Evaluated Cray T3E/1200E Cray T3E/1200E 816 processor system at Manchester (CSAR service) 816 processor system at Manchester (CSAR service) 600 Mz EV56 Alpha processor with 256 MB memory 600 Mz EV56 Alpha processor with 256 MB memory IBM SP/WH2-375 and SP/Regatta-H 32 CPU system at DL, 4-way Winterhawk2 SMP “thin nodes” with 2 GB memory, 375 MHz Power3-II processors with 8 MB L2 cache 32 CPU system at DL, 4-way Winterhawk2 SMP “thin nodes” with 2 GB memory, 375 MHz Power3-II processors with 8 MB L2 cache IBM Regatta-H (32-way node, 1.3 GHZ power4 CPUs) at Montepelier IBM Regatta-H (32-way node, 1.3 GHZ power4 CPUs) at Montepelier IBM SP/Regatta-H (8-way LPAR’d nodes, 1.3 GHZ) at ORNL IBM SP/Regatta-H (8-way LPAR’d nodes, 1.3 GHZ) at ORNL Compaq AlphaServer SC Compaq AlphaServer SC 4-way ES40/667 A21264A (APAC) and 833 MHz SMP nodes (2 GB RAM); 4-way ES40/667 A21264A (APAC) and 833 MHz SMP nodes (2 GB RAM); GB memory per node, 8MB L2 cache TCS1 system at PSC (comprising way ES45 nodes - 3,000 EV68 CPUs - with 4 GB memory per node, 8MB L2 cache Quadrics “fat tree” interconnect (5 usec latency, 250 MB/sec B/W) Quadrics “fat tree” interconnect (5 usec latency, 250 MB/sec B/W) SGI Origin 3800 SGI Origin 3800 SARA (1000 CPUs) - Numalink with R14k/500 & R12k/400 CPUs SARA (1000 CPUs) - Numalink with R14k/500 & R12k/400 CPUs Cray Supercluster at Eagen Cray Supercluster at Eagen Linux Alpha Cluster (96 X API CS20s - dual 833 MHz EV67 CPUs, Myrinet) Linux Alpha Cluster (96 X API CS20s - dual 833 MHz EV67 CPUs, Myrinet)

7 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Commodity Systems (CSx) Prototype / Evaluation Hardware SystemsLocationCPUsConfiguration CS1Daresbury32PentiumIII / 450 MHz + FE (EPSRC) CS2Daresbury6424 X dual UP2000/EV67-667, QSNet Alpha/LINUX cluster, 8 X dual CS20/EV (“loki”) CS3RAL16Athlon K7 850MHz + myrinet CS4Sara32Athlon K7 1.2 GHz + FE CS6CLiC528PentiumIII / 800 MHz; fast ethernet (Chemnitzer Cluster) CS7Daresbury64AMD K7/1000 MP + SCALI/SCI (“ukcp”) CS8NCSA dual IBM Itanium/800 + Myrinet 2k (“titan”) CS9Bristol96Pentium4 Xeon/ Myrinet 2k (“dirac”) Protoype Systems CS0Daresbury1010 CPUS, Pentium II/266 CS5Daresbury168 X dual Pentium III/933, SCALI

8 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory T (32-nodes Cray T3E/1200E) / T (32 CPUs ) CSx T (32-nodes Cray T3E/1200E) / T (32 CPUs ) CSx [ T 32-node T3E / T 32-node CS1 Pentium III/450 + FE ] T 32-node T3E / T 32-node CS6 Pentium III/800 + FE T 32-node T3E / T 32-CPU CS2 Alpha Linux Cluster + Quadrix T 32-node T3E / T 32-CPU CS2 Alpha Linux Cluster + Quadrix Performance Metrics: Attempted to quantify delivered performance from the Commodity-based systems against current MPP (CSAR Cray T3E/1200E)and ASCI-style SMP- node platforms (e.g. SGI Origin 3800) i.e. Performance Metric (% 32-node Cray T3E) Performance Metrics: 2002 Performance Metric (% 32-node AlphaServer SC [PSC]) T (32-CPUs AlphaServer SC ES45/1000) / T (32 CPUs ) CSx T 32-CPU AlphaServer ES45 / T 32-CPU CS9 Pentium 4 Xeon / Myrinet 2k

9 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Beowulf Comparisons with the T3E & O3800/R14k-500 CSx - Pentium III + FE % of 32-node Cray T3E/1200E GAMESS-UKCS1CS6 SCF53-69%96% DFT65-85% % DFT (Jfit)44-77%65-131% DFT Gradient90%130% MP2 Gradient44%73% SCF Forces80%127% NWChem (DFT Jfit)50-60% REALC 67% CRYSTAL % DL_POLY Ewald-based % % bond constraints34-56%69% CHARMM 96%172% CASTEP 33%42% CPMD 62% ANGUS 60%68% FLITE3D 104% CS2 - QSNet Alpha Linux Cluster % of 32-node Cray T3E and O3800/R14k-500 GAMESS-UK SCF256%99% DFT † %99% DFT (Jfit) % % DFT Gradient † 289%89% MP2 Gradient228%87% SCF Forces154%86% (DFT Jfit) † % -135% NWChem (DFT Jfit) † % % †349% CRYSTAL † 349% DL_POLY Ewald-based † %95% bond constraints %82% †404%78% CHARMM † 404%78% CASTEP 166%78% ANGUS 145% †480% FLITE3D † 480%

10 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory High-End Computational Chemistry The NWChem Software Capabilities (Direct, Semi-direct and conventional): Capabilities (Direct, Semi-direct and conventional): RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st and 2nd derivatives. RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st and 2nd derivatives. DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic 1st and 2nd derivatives. DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic 1st and 2nd derivatives. CASSCF; analytic 1st and numerical 2nd derivatives. CASSCF; analytic 1st and numerical 2nd derivatives. Semi-direct and RI-based MP2 calculations for RHF and UHF wave functions using up to 3,000 basis functions; analytic 1st derivatives and numerical 2nd derivatives. Semi-direct and RI-based MP2 calculations for RHF and UHF wave functions using up to 3,000 basis functions; analytic 1st derivatives and numerical 2nd derivatives. Coupled cluster, CCSD and CCSD(T) using up to 3,000 basis functions; numerical 1st and 2nd derivatives of the CC energy. Coupled cluster, CCSD and CCSD(T) using up to 3,000 basis functions; numerical 1st and 2nd derivatives of the CC energy. Classical molecular dynamics and free energy simulations with the forces obtainable from a variety of sources

11 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Global Arrays

12 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory (Solution of real symmetric generalized and standard eigensystem problems) Full eigensolution performed on a matrix generated in a charge density fitting procedure (966 fitting functions for a fluorinated biphenyl). uaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues Packed Storage Smaller scratch space requirements Number of processors Time (sec) IBM SP Cray T3E Features (not available elsewhere): Inverse iteration using Dhillon-Fann- Parlett’s parallel algorithm (fastest uniprocessor performance and good parallel scaling) Guaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues Packed Storage Smaller scratch space requirements PeIGS 3.0 Parallel Performance

13 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory SGI Origin 3800/R12k-400 (“green”) Scalability of Numerical Algorithms Real symmetric eigenvalue problems Number of processors Time (sec) Number of processors

14 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory CS7 AMD K7/1000 MP + SCALI Parallel Eigensolvers Real symmetric eigenvalue problems Number of processors Scalapack (PDSYEV) Measured Time (seconds) CS9 P4/2000 Xeon + Myrinet

15 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Case Studies - Zeolite Fragments Si 8 O 7 H /832 Si 8 O 25 H /1444 Si 26 O 37 H /2818 Si 28 O 67 H /3928 DFT Calculations with Coulomb Fitting DFT Calculations with Coulomb Fitting Basis (Godbout et al.) DZVP - O, Si DZVP - O, Si DZVP2 - H Fitting Basis: DGAUSS-A1 - O, Si DGAUSS-A2 - H NWChem & GAMESS-UK NWChem & GAMESS-UK Both codes use auxiliary fitting basis for coulomb energy, with 3 centre 2 electron integrals held in core.

16 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DFT Coulomb Fit - NWChem Number of CPUs Measured Time (seconds) Si 8 O 7 H /832 Si 8 O 25 H /1444 Measured Time (seconds) 76%,95% 88%,104%

17 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DFT Coulomb Fit - NWChem Number of CPUs Measured Time (seconds) Si 28 O 67 H /3928 Si 26 O 37 H /2818 T IBM-SP/P2SC-120 (256) = 1137 T IBM-SP/P2SC-120 (256) = %,210% 85%,227%

18 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Memory-driven Approaches: NWChem - DFT (LDA) 1. Performance on the SGI Origin 3800 DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis: DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis: AO basis: 3554 AO basis: 3554 CD basis:12713 CD basis:12713 MIPS R14k-500 CPUs (Teras) MIPS R14k-500 CPUs (Teras) Wall time (13 SCF iterations): 64 CPUs = 5,242 seconds 128 CPUs= 3,951 seconds Est. time on 32 CPUs = 40,000 secs Zeolite ZSM-5 3-centre 2e-integrals = 1.00 X centre 2e-integrals = 1.00 X Schwarz screening = 5.95 X Schwarz screening = 5.95 X % 3c 2e-ints. In core = 100% % 3c 2e-ints. In core = 100%

19 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Memory-driven Approaches: NWChem - DFT (LDA) 2. Performance on the HP/Compaq AlphaServer SC DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis: DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis: AO basis: 5457 AO basis: 5457 CD basis:12713 CD basis: EV67/6-667 CPUs (64 Compaq AlphaServer SC nodes) 256 EV67/6-667 CPUs (64 Compaq AlphaServer SC nodes) Wall time (10 SCF iterations) on 256 CPUs = 11,960 seconds (60% efficiency) Pyridine in Zeolite ZSM-5 3-centre 2e-integrals = 3.79 X centre 2e-integrals = 3.79 X Schwarz screening = 2.81 X Schwarz screening = 2.81 X % 3c 2e-ints. In core = 1.66% % 3c 2e-ints. In core = 1.66%

20 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Parallel Implementations of GAMESS-UK Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL) Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL) SCF and DFT SCF and DFT Replicated data, but … Replicated data, but … GA Tools for caching of I/O for restart and checkpoint files GA Tools for caching of I/O for restart and checkpoint files Storage of 2-centre 2-e integrals in DFT Jfit Storage of 2-centre 2-e integrals in DFT Jfit Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix) Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix) SCF second derivatives SCF second derivatives Distribution of and integrals via GAs Distribution of and integrals via GAs MP2 gradients MP2 gradients Distribution of and integrals via GAs Distribution of and integrals via GAs

21 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK  SCF Performance Cray T3E/1200E, High-end and Commodity-based Systems Number of CPUs Cyclosporin:(3-21G Basis, 1000 GTOS) Elapsed Time (seconds) 95%,135% T3E 128 = 436 Impact of Serial Linear Algebra: T IBM-SP (16) = 2656 [1289] T IBM-SP (32) = 2184 [ 821]

22 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK. DFT B3LYP Performance Cray T3E/1200, High-end and Commodity-based Systems Basis: 6-31G Elapsed Time (seconds) Cyclosporin, 1000 GTOs 70%,117% Number of CPUs

23 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK. DFT B3LYP Performance The Cray T3E/1200 and High-end Systems Basis: 6-31G Elapsed Time (seconds) Cyclosporin, 1000 GTOs S = Speed-up Number of CPUs

24 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DFT BLYP Gradient: Cray T3E/1200, High-end and Commodity-based Systems Number of CPUs Geometry optimisation of polymerisation catalyst Cl(C 3 H 5 O).Pd[(P(CMe 3 ) 2 ) 2.C 6 H 4 ] Basis 3-21G* (446 GTOs): 10 energy + gradient evaluations Speed-up Elapsed Time (seconds) linear 69%,114%

25 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Auxilliary Basis Coulomb Fit (I) Where V is the matrix of 2-centre 2-electron repulsion integrals in the charge density basis and b are the three centre electron repulsion integrals between the wavefunction basis set and the charge density basis. The approach is based on the expansion of the charge density in an auxiliary basis of Gaussian functions As suggested by Dunlap, a variational choice of the fitting coefficients C can be obtained as follows :

26 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory The number of 3-centre integrals is significantly smaller than the 4-centre integrals used in the conventional coulomb evaluation, but for large molecules additional screening is required. The number of 3-centre integrals is significantly smaller than the 4-centre integrals used in the conventional coulomb evaluation, but for large molecules additional screening is required. We make use of the Schwarz inequality We make use of the Schwarz inequality Auxilliary Basis Coulomb Fit (ii) Where p and q are AO basis functions and u are the fitting functions. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored. Where p and q are AO basis functions and u are the fitting functions. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored. Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core. Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core.

27 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK: DFT HCTH on Valinomycin. Impact of Coulomb Fitting Number of CPUs Measured Time (seconds) Basis: DZV_A2 (Dgauss) A1_DFT Fit: 882/ %,161% J EXPLICIT 61%,93% J FIT T T3E/1200E (128) = 2139 T T3E/1200E (128) = 995

28 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK: DFT HCTH on Valinomycin. Impact of Coulomb Fitting: Cray T3E/1200, Cray Super Cluster/833, Compaq AlphaServer SC/667 and SGI Origin R14k/500 Number of CPUs Measured Time (seconds) Basis: DZV_A2 (Dgauss) A1_DFT Fit: 882/3012 J EXPLICIT J FIT

29 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK: DFT HCTH on Valinomycin. Speedups for both Explicit and Coulomb Fitting. J EXPLICIT J FIT Number of CPUs Speedup 

30 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DFT HCTH Performance : Valinomycin Number of CPUs JFit XC SCF Cray T3E/1200E SCF XC JFit CS6 PIII/800 + FE SCF XC JFit CS2 QSNet Alpha Cluster/667

31 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK: DFT S-VWN Impact of Coulomb Fitting: Compaq AlphaServer SC /833 Number of CPUs Measured Time (seconds) Basis: DZVP, DZVP2 (DGAUSS) Fit: DGAUSS A1, A2 Fit: DGAUSS A1, A2 J EXPLICIT J FIT J EXPLICIT J FIT Si 26 O 37 H /2818 Si 28 O 67 H /3928

32 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DFT Coulomb Fit - GAMESS-UK Number of CPUs Measured Time (seconds) Si 28 O 67 H /3928 Si 26 O 37 H / %,104%62%,110%

33 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DFT JFit Performance : DFT JFit Performance : Si 26 O 37 H 36 Number of CPUs JFit XC SCF Cray T3E/1200E SCF XC JFit AlphaServer SC/833 SCF XC JFit SGI Origin 3000/R12k-400 Number of CPUs

34 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Memory-driven Approaches - SCF and DFT Number of CPUs HF and DFT Energy and Gradient calculation for NiSiMe 2 CH 2 CH 2 C 6 F 13 : Basis Ahlrichs DZ (1485 GTOs) Elapsed Time (seconds) Integrals written directly to memory, rather than to disk, and are not re- calculated SGI Origin 3800/R14k-500

35 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory MP2 Gradient Algorithms Conventional Conventional integrals written to disk integrals written to disk read back, transformed, written out, resorted etc. read back, transformed, written out, resorted etc. heavy I/O demands heavy I/O demands Direct/Semi-direct (Frisch, Head-Gordon & Pople, Hasse and Ahlrichs) Direct/Semi-direct (Frisch, Head-Gordon & Pople, Hasse and Ahlrichs) replace all/some I/O with batched integral recomputation replace all/some I/O with batched integral recomputation Poor I/O-to-compute performance of MPPs direct approach Current MPPs have large global memories Store subset of MO integrals reduce number of integral recomputations increase communication overhead Subset includes VOVO, VVOO, VOOO, VVVO-class too large to store compute VVVO-terms in separate step SerialParallel

36 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Mn(CO) 5 H - MP2 geometry optimisation BASIS: TZVP + f (217 GTOs Performance of MP2 Gradient Module Cray T3E/1200, High-end and Commodity- based Systems Number of CPUs Elapsed Time (seconds) Speed-up 59%,78%

37 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory SCF Analytic 2nd Derivatives Performance Cray T3E/1200, High-end and Commodity-based Systems (C 6 H 4 (CF 3 )) 2 : Basis 6-31G (196 GTO) Elapsed Time (seconds) Terms from MO 2e-integrals in GA storage (CPHF & pert. Fock matrices); Calculation dominated by CPHF: Gaussian98 - L1002 (CPU) - 32 nodes: 1181 secs; 64 nodes: 1058 secs. GAMESS-UK (total job time); 128 nodes: 499 secs. CPUs 92%,148% CPUs G98: 2271 G98: 1706 G98: 1490

38 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Vampir 2.5 V A Visualization and Analysis of MPI r MPI Programs GAMESS-UK on High-end and Commodity class machines extensions to handle GA applications Performance Analysis of GA- based Applications using Vampir

39 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK / Si 8 O 25 H 18 : 8 CPUs: One DFT Cycle

40 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK / Si 8 O 25 H 18 : 8 CPUs Q † HQ (GAMULT2) and PEIGS

41 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Materials Simulation Codes Plane Wave DFT Codes: CASTEP VASP CPMD These codes have similar functionality, power and problems. CASTEP is the flagship code of UKCP and hence subsequent discussions will focus on this. Local Gaussian Basis Set Codes: CRYSTAL This code presents a different set of problems when considering performance on HPC(x). SIESTA and CONQUEST: O(n) scaling codes which will be extremely attractive to users. O(n) scaling codes which will be extremely attractive to users. Both are currently development rather than production codes. Both are currently development rather than production codes.

42 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory CRYSTAL Distributed Data implementation Distributed Data implementation Benchmark: Benchmark: An Acid Centre in Zeolite-Y (Faujasite) An Acid Centre in Zeolite-Y (Faujasite) Single point energy Single point energy 145 atoms / cell, No symmetry / 8k-points 145 atoms / cell, No symmetry / 8k-points 2208 basis functions, (6-21G * ) 2208 basis functions, (6-21G * ) Elapsed Time (seconds) Speed-up Number of CPUs T 256 SGI Origin 3800 = 945 secs

43 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Materials Simulation. Plane Wave Methods: CASTEP Direct minimisation of the total energy (avoiding diagonalisation) Pseudopotentials must be used to keep the number of plane waves manageable Large number of basis functions N~10 6 (especially for heavy atoms). The plane wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space. These are distributed across the processors in various ways. The actual FFT routines are optimized for the cache size of the processor.

44 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory CASTEP Parallel Benchmark Chabazite Acid sites in a zeolite. (Si 11 O 24 Al H) Vanderbilt ulatrasoft pseudo- potential Pulay density mixing minimiser scheme single k point total energy, 96 bands plane waves on 3D FFT grid size = 54x54x54; convergence in 17 SCF cycles CPUs Measured Time (seconds) Time (comms) IBM SP/WH Cray T3E/1200E 90 CS6 PIII/800+FE600 CS7 AMD K7/ SCI242 CS9 P4/2000 +Myrinet115 CS2 QSNet Alpha111 SGI Origin 3800/R14k 71 67%,75% G

45 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory CASTEP Parallel Benchmark II. TiN: A TiN 32 atom slab, 8 k-points, single point energy calculation with Mulliken analysis, CPUs Measured Time (seconds) kG 47%,81%

46 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory CPMD - Car-Parrinello Molecular Dynamics CPMD Version 3.5.1: Hutter, Alavi, Deutsh, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello ( ) DFT, plane-waves, pseudo-potentials and FFT's Benchmark Example: Liquid Water Physical Specifications: 32 molecules, Simple cubic periodic box of length 9.86 A, Temperature 300K Electronic Structure; BLYP functional, Trouillier Martins pseudopotential, Recriprocal space cutoff 70 Ry = 952 eV CPMD is the base code for the new CCP1 flagship project CPUs Elapsed Time (secs.) Sprik and Vuilleumier (Cambridge) 30%,53%

47 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DL_POLY Parallel Benchmarks (Cray T3E/1200) Number of Nodes 4. NaCl; Ewald, 27,000 ions 5. NaK-disilicate glass; 8,640 atoms,MTS+ Ewald 8. MgO microcrystal: 5,416 atoms Number of Nodes 9. Model membrane/Valinomycin (MTS, 18,886) 7. Gramicidin in water (SHAKE, 12,390) 6. K/valinomycin in water (SHAKE, AMBER, 3,838) 1. Metallic Al (19,652 atoms, Sutton Chen) 3. Transferrin in Water (neutral groups + SHAKE, 27,593) 2. Peptide in water (neutral groups + SHAKE, 3993). Speed-up Linear V2: Replicated Data

48 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DL_POLY: Cray/T3E, High-end and Commodity-based Systems Bench 4. 27,000 ions, NaCl; 27,000 ions, Ewald, 75 time steps, Cutoff=24Å Number of CPUs Measured Time (seconds) T3E 128 =94 44%,71%

49 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DL_POLY: Scalability on the T3E, High-end & Commodity Systems Measured Time (seconds) Bench 5: NaK-disilicate glass; 8,640 atoms, MTS + Ewald: 270 time steps Number of CPUs T3E 128 = 75 Speed-up Number of CPUs 53%,84%

50 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory DL_POLY: Macromolecular Simulations Bench 7. Gramicidin in water; rigid bonds and SHAKE, 12,390 atoms, 500 time steps Number of CPUs Measured Time (seconds) T3E 128 =166 41%,64%

51 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Migration from Replicated to Distributed data DL_POLY-3 : Domain Decomposition Distribute atoms, forces across the nodes Distribute atoms, forces across the nodes More memory efficient, can address much larger cases ( ) More memory efficient, can address much larger cases ( ) Shake and short-ranges forces require only neighbour communication Shake and short-ranges forces require only neighbour communication communications scale linearly with number of nodes communications scale linearly with number of nodes Coulombic energy remains global Coulombic energy remains global strategy depends on problem and machine characteristics strategy depends on problem and machine characteristics Adopt Particle Mesh Ewald scheme includes Fourier transform smoothed charge density (reciprocal space grid typically 64x64x x128x128) includes Fourier transform smoothed charge density (reciprocal space grid typically 64x64x x128x128) A B C D

52 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Conventional routines (e.g. fftw) assume plane or column distributions Conventional routines (e.g. fftw) assume plane or column distributions A global transpose of the data is required to complete the 3D FFT and additional costs are incurred re-organising the data from the natural block domain decomposition. A global transpose of the data is required to complete the 3D FFT and additional costs are incurred re-organising the data from the natural block domain decomposition. An alternative FFT algorithm has been designed to reduce communication costs. An alternative FFT algorithm has been designed to reduce communication costs. the 3D FFT are performed as a series of 1D FFTs, each involving communications only between blocks in a given column the 3D FFT are performed as a series of 1D FFTs, each involving communications only between blocks in a given column More data is transferred, but in far fewer messages More data is transferred, but in far fewer messages Rather than all-to-all, the communications are column-wise only Rather than all-to-all, the communications are column-wise only Plane Block Migration from Replicated to Distributed data DL_POLY-3: Coulomb Energy Evaluation

53 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Migration from Replicated to Distributed data DL_POLY-3: Coulomb Energy Performance Speed-up Number of CPUs NaCl Simulation; Ewald summation DLPOLY_2.11, Ewald summation DLPOLY_3, PMES DLPOLY_ ,000 ions, 500 time steps, Cutoff=24Å DLPOLY_3 27,000 ions, 500 time steps, Cutoff=12Å DLPOLY_3 216,000 ions, 200 time steps, Cutoff=12Å

54 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Migration from Replicated to Distributed data DL_POLY-3: Macromolecular Simulations DLPOLY_3 792,960 ions, 50 time steps Speed-up Number of CPUs DL_POLY ,390 atoms, 500 time steps DL_POLY 3 99,120 atoms, 100 time steps Gramicidin in water; rigid bonds + SHAKE

55 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Parallel CHARMM Benchmark Benchmark MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules: atoms, 1000 steps (1 ps), A shift. Measured Time (seconds) Number of CPUs 95%,103% T3E 16 = 665, T3E 128 = 106

56 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Parallel CHARMM Benchmark: LAM MPI vs. MPICH Benchmark MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules: Measured Time (seconds) Number of CPUs

57 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory QM/MM Thrombin Benchmark QM/MM calculation on thrombin pocket: QM region (69 Atoms) is modelled by SCF 6-31G calculation (401 GTOs), total system size for the classical calculation is 16,659 atoms. The MM calculation was performed using all non-bonded interactions (i.e. without a pairlist), and the QM/MM interaction was cut off at 15 angstroms using a neutral group scheme (leading to the inclusion in the QM calculation of 1507 classical centres). Scaling for Thrombin + Inhibitor (NAPAP) + Water Number of Nodes

58 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory QM region 35 atoms (DFT BLYP) – – include residues with possible proton donor/acceptor roles – – GAMESS-UK, MNDO, TURBOMOLE MM region (4,180 atoms + 2 link) – – CHARMM force-field, implemented in CHARMM, DL_POLY Triosephosphate isomerase (TIM) Central reaction in glycolysis, catalytic interconversion of DHAP to GAP Demonstration case within QUASI (Partners UZH, and BASF) Triosephosphate isomerase (TIM) Central reaction in glycolysis, catalytic interconversion of DHAP to GAP Demonstration case within QUASI (Partners UZH, and BASF) QM/MM Applications Measured Time (seconds) T 128 (O3800/R14k-500) = 181 secs 89%,139%

59 Computational Chemistry at Daresbury November 2002 Computational Science and Engineering Department Daresbury Laboratory Commodity Comparisons with High-end Systems: Compaq AlphaServer SC ES45/1000 and the SGI O3800/R14k-500 CS9 - Pentium4/2000 Xeon + Myrinet 2k % of 32-CPU AlphaServer SC Origin 3800 GAMESS-UK SCF SCF95% 135% DFT DFT70-73% % DFT (Jfit) DFT (Jfit)59-70% % DFT Gradient DFT Gradient69% 114% MP2 Gradient MP2 Gradient59% 78% SCF Forces SCF Forces92% 148% NWChem (DFT Jfit) NWChem (DFT Jfit)65-88% % GAMESS / CHARMM GAMESS / CHARMM 89% 139%DL_POLY Ewald-based Ewald-based 44-53% 71-84% bond constraints bond constraints41% 64% CHARMM CHARMM 95% 103% CASTEP CASTEP 47-67% 75-81% CPMD CPMD 30% 53% ANGUS ANGUS 35-69% %


Download ppt "Computational Chemistry at Daresbury 16-22 November 2002 Computational Science and Engineering Department Daresbury Laboratory Computational Chemistry."

Similar presentations


Ads by Google