
1 Performance analysis of GA applications, 19th June 2001, Computational Science and Engineering Department, Daresbury Laboratory
Performance analysis of GA-based applications using the Vampir tool: NWChem and GAMESS-UK on high-end and commodity class machines.
H.J.J. van Dam, Martyn Guest and Paul Sherwood, Quantum Chemistry Group, CLRC Daresbury Laboratory (http://www.cse.clrc.ac.uk)
Miles Deegan, Compaq (Galway)

2 Outline
- Background: PNNL, Daresbury and PALLAS
- Tool for Performance Analysis: VAMPIR & VAMPIR Trace
  - VAMPIR: analysis of trace files
  - VAMPIR Trace
    - Trace Library for MPI applications
    - Extensions to handle GA applications
- Case Studies
  - DFT Calculations on Zeolite Fragments (347 - 1687 GTOs) with Coulomb Fitting
  - High-end Systems: Cray T3E/1200E, Compaq AlphaServer SC (667 & 833 MHz), SGI Origin 3000/R12k-400 and IBM SP/WH2-375
  - Commodity Clusters (IA32 and Alpha Linux)
- NWChem and GAMESS-UK
  - Distributed data (NWChem) and Replicated Data (GAMESS-UK)
  - Analysis of GAs and PeIGs
- Summary

3 PNNL - Daresbury - Pallas Collaborations
- PNNL - Daresbury Collaboration
  - Long term interaction between chemistry activities
  - Proposed developments around DFT derivative codes
  - UK Chemistry Collaboration Forum (CCP1)
  - DFT Flagship project and subsequent DL extensions
  - DFT Functional Repository (http://www.dl.ac.uk/DFTlib)
- Daresbury - Pallas Collaboration
  - Demonstrate that clusters of IA32 and Alpha processors are competitive with HPC servers (with low to medium processor numbers) for a wide range of applications
  - Evaluate the suitability of clusters for high-end computing
  - Analyse kernels and full applications (May 2000 - Sep. 2001)

4 Vampir 2.5: Visualization and Analysis of MPI Programs

5 Vampir Features
- Offline trace analysis for MPI (and others...)
- Traces generated by Vampirtrace tool (`ld... -lVT -lpmpi -lmpi`)
- Convenient user-interface
  - Scalability in time and processor-space
  - Excellent zooming and filtering
  - High-performance graphics
- Display and analysis of MPI and application events:
  - execution of MPI routines
  - point-to-point and collective communication
  - MPI-2 I/O operations
  - execution of application subroutines (optional)
- “Easy” customization

6 Vampir Displays
- Global displays show all selected processes
  - Summary Chart: aggregated profiling information
  - Activity Chart: presents per-process profiling information
  - Timeline: detailed application execution over time axis
  - Communication statistics: message statistics for each process pair
  - Global Comm. Statistics: collective operations statistics
  - I/O Statistics: MPI I/O operation statistics
  - Calling Tree: draws global or local dynamic calling trees
- Process displays show a single process per window
  - Activity Chart
  - Timeline
  - Calling Tree

7 Timeline Display (Message Info)
- Source-code references are displayed if recorded by Vampirtrace
[Figure annotations: click on message line; see message details; message send op; message receive op.]

8 Vampirtrace: Tracing of MPI and Application Events

9 Vampirtrace
- Current version: Vampirtrace 2.0
- Significant new features:
  - records collective communication
  - enhanced filter functions
  - extended API
  - records source-code information (selected platforms)
  - support for shmem (Cray T3E)
  - records MPI-2 I/O operations
- Available for all major MPI platforms
Library that records all MPI calls, point-to-point communication, and collective operations. Runtime filters are available to limit tracefile size. Provides an API for user instrumentation. Requires MPI to gather performance data. Uses the profiling interface of MPI and is therefore independent of the specifics of a given MPI implementation.

10 Vampirtrace API
- Switching tracing on/off
    SUBROUTINE VTTRACEOFF( )
    SUBROUTINE VTTRACEON( )
- Specifying user-defined states
    SUBROUTINE VTSYMDEF(ICODE, STATE, ACTIVITY, IERR)
- Entering/leaving user-defined states
    SUBROUTINE VTBEGIN(ICODE, IERR)
    SUBROUTINE VTEND(ICODE, IERR)
- Logging message send/receive events (undocumented)
    SUBROUTINE VTLOGSENDMSG(IME, ITO, ICNT, ITAG, ICOMMID, IERR)
    SUBROUTINE VTLOGRECVMSG(IME, IFRM, ICNT, ITAG, ICOMMID, IERR)
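
The define, begin, end calling order implied by the API on this slide can be illustrated with a small stand-in. The sketch below is not Vampirtrace and does not call it: `MockTracer` is a hypothetical Python class whose method names only mirror VTSYMDEF, VTBEGIN, VTEND, VTTRACEON and VTTRACEOFF. The event tuple layout, the return codes and the clock are all assumptions made for illustration.

```python
import time


class MockTracer:
    """Hypothetical stand-in for a trace library; NOT the Vampirtrace API."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock      # injectable so tests can use a fake clock
        self.enabled = True     # toggled by traceon()/traceoff()
        self.symbols = {}       # icode -> (state, activity)
        self.events = []        # (timestamp, "begin" | "end", icode)
        self.stack = []         # codes of the states currently open

    def symdef(self, icode, state, activity):
        """Register a user-defined state (role of VTSYMDEF). Returns 0."""
        self.symbols[icode] = (state, activity)
        return 0

    def traceoff(self):
        """Stop recording events (role of VTTRACEOFF)."""
        self.enabled = False

    def traceon(self):
        """Resume recording events (role of VTTRACEON)."""
        self.enabled = True

    def begin(self, icode):
        """Enter a state (role of VTBEGIN). Returns 1 for an unknown code."""
        if icode not in self.symbols:
            return 1
        self.stack.append(icode)
        if self.enabled:
            self.events.append((self.clock(), "begin", icode))
        return 0

    def end(self, icode):
        """Leave a state (role of VTEND). Returns 1 if it is not innermost."""
        if not self.stack or self.stack[-1] != icode:
            return 1
        self.stack.pop()
        if self.enabled:
            self.events.append((self.clock(), "end", icode))
        return 0


# Usage: define the state once, then bracket the region of interest.
tracer = MockTracer()
tracer.symdef(1, "ga_get", "GA")
tracer.begin(1)
sum(range(1000))        # placeholder for the instrumented work
tracer.end(1)
```

Switching tracing off between a begin and its matching end would leave an unbalanced record in this sketch, which is one way to see why on/off switching calls for care.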

11 Global Arrays

12 Instrumenting single-sided memory access
- Approach 1: Instrument the puts, gets and data server
  - Advantage: robust and accurate
  - Disadvantage: one does not always have access to the source of the data server
- Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages
  - Advantage: no instrumentation of the data server required
  - Disadvantage: timings of the messages are inaccurate in case of non-blocking operations
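
One plausible reading of Approach 2 is sketched below. It assumes the process that calls the put or get records both halves of the message itself and attributes the remote half to the owner of the data. The slides do not spell out the mechanism, so `MsgEvent`, `log_put` and `log_get` are hypothetical names, and the event fields only loosely follow the arguments of VTLOGSENDMSG and VTLOGRECVMSG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MsgEvent:
    kind: str     # "send" or "recv"
    proc: int     # process the event is attributed to
    peer: int     # process at the other end of the message
    count: int    # number of elements transferred
    tag: int      # message tag used to pair send and receive
    time: float   # timestamp taken by the calling process


def log_put(trace, me, owner, count, tag, now):
    """Record a put: data flows from the caller `me` to `owner`."""
    trace.append(MsgEvent("send", me, owner, count, tag, now))
    # The owner runs no instrumented code, so the caller logs its receive.
    trace.append(MsgEvent("recv", owner, me, count, tag, now))


def log_get(trace, me, owner, count, tag, now):
    """Record a get: data flows from `owner` to the caller `me`."""
    # The send is attributed to the owner although the caller records it.
    trace.append(MsgEvent("send", owner, me, count, tag, now))
    trace.append(MsgEvent("recv", me, owner, count, tag, now))


# Usage: process 0 fetches 100 elements owned by process 3.
trace = []
log_get(trace, me=0, owner=3, count=100, tag=1, now=2.5)
```

Both halves carry the caller's timestamp in this sketch, so the trace cannot show when the transfer really happened on the remote side. That matches the stated disadvantage for non-blocking operations.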

13 Runtime tracing options
- The tracing of activities can be modified at runtime through a configuration file.
- Tracing of messages cannot be changed.
- VTTRACEON and VTTRACEOFF should be used sparingly.
Example configuration file:
    Logfile-name /home/user/prog.bpv
    Symbol nnodes off
    Symbol nodeid off
    Symbol GA_Nnodes off
    Symbol GA_Nodeid off
Practical issues
- The vampirtrace library and evaluation licenses can be downloaded from http://www.pallas.com/
- Evaluation licenses are limited to 32 processors
- CPU cycle providers are not too keen to provide vampirtrace?
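
To make the structure of the configuration listing under "Runtime tracing options" concrete, here is a minimal reader for files of that shape. It is only a sketch: the grammar (one directive per line, whitespace-separated fields, `Symbol <name> on|off`) is inferred from the five example lines and is an assumption, not the documented Vampirtrace syntax. `parse_trace_config` is a hypothetical helper, not part of any library.

```python
def parse_trace_config(text):
    """Parse directives shaped like the slide's example (assumed grammar)."""
    config = {"logfile": None, "symbols": {}}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        key = parts[0].lower()
        if key == "logfile-name" and len(parts) == 2:
            config["logfile"] = parts[1]
        elif (key == "symbol" and len(parts) == 3
              and parts[2].lower() in ("on", "off")):
            # True means the symbol is traced, False means it is filtered.
            config["symbols"][parts[1]] = parts[2].lower() == "on"
        else:
            raise ValueError(f"line {lineno}: unrecognised directive {raw!r}")
    return config


example = """\
Logfile-name /home/user/prog.bpv
Symbol nnodes off
Symbol nodeid off
Symbol GA_Nnodes off
Symbol GA_Nodeid off
"""
settings = parse_trace_config(example)
```

In the example, the four symbols switched off are process-count and process-id queries, which are called often and say little about performance.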

14 Case Studies - Zeolite Fragments
- DFT Calculations with Coulomb Fitting
  - Basis (Godbout et al.): DZVP - O, Si; DZVP2 - H
  - Fitting Basis: DGAUSS-A1 - O, Si; DGAUSS-A2 - H
- NWChem & GAMESS-UK
  - Both codes use an auxiliary fitting basis for the Coulomb energy, with 3-centre 2-electron integrals held in core.
- Fragments shown: Si8O7H18 (347/832), Si8O25H18 (617/1444), Si26O37H36 (1199/2818), Si28O67H30 (1687/3928)

15 High-End and Commodity Systems
- Cray T3E/1200E
  - 816 processor system at Manchester (CSAR service)
  - 600 MHz EV56 Alpha processor with 256 MB memory
- IBM SP (32 CPU system at DL)
  - 4-way Winterhawk2 SMP “thin nodes” with 2 GB memory
  - 375 MHz Power3-II processors with 8 MB L2 cache
- Compaq AlphaServer SC - 667 (APAC) and 833 MHz CPUs
  - 4-way ES40/667 and /833 SMP nodes with 2 GB memory
  - Alpha 21264a (EV67) CPUs with 8 MB L2 cache
  - Quadrics “fat tree” interconnect (5 usec latency, 150 MB/sec B/W)
- SGI Origin 3800
  - SARA (1000 CPUs) - Numalink with R12k/400 CPUs
- Commodity Systems (DL)
  - 32 x IA32 single processor CPUs (Pentium III/450), fast ethernet
  - Linux Alpha Cluster (16 x UP2000/667 - Quadrics Interconnect)

16 Measured Parallel Efficiency for NWChem - DFT on IBM-SP; Wall Times to Solution for SCF Convergence
[Figure: speed-up versus number of nodes, both axes marked from 32 to 256.]
D.A. Dixon et al., HPC, Plenum, 1999, p. 215
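
For reference, speed-up and parallel efficiency can be derived from wall times as sketched below. The normalisation is an assumption: the smallest run, with p0 nodes, is taken to have speed-up p0, so S(p) = p0 * T(p0) / T(p) and E(p) = S(p) / p. The slide does not state its own convention, the function names are hypothetical, and the timings in the usage lines are made up for illustration and are not measurements from the talk.

```python
def relative_speedup(times, base=None):
    """Speed-up relative to the smallest run.

    `times` maps node count -> wall time in seconds. The base run with
    p0 nodes is assigned speed-up p0 (assumed convention).
    """
    if base is None:
        base = min(times)
    t0 = times[base]
    return {p: base * t0 / t for p, t in sorted(times.items())}


def parallel_efficiency(times, base=None):
    """Efficiency E(p) = S(p) / p, which is 1.0 for ideal scaling."""
    return {p: s / p for p, s in relative_speedup(times, base).items()}


# Hypothetical wall times in seconds; NOT data from the presentation.
wall = {32: 1000.0, 64: 500.0, 128: 400.0}
speedup = relative_speedup(wall)
efficiency = parallel_efficiency(wall)
```

Normalising to the smallest measured run is a common choice when the job is too large to run on one node, as may be the case for a plot whose axis starts at 32.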

17 DFT Coulomb Fit - NWChem
[Figure: measured time (seconds) versus number of CPUs; panels Si8O7H18 (347/832) and Si8O25H18 (617/1444).]

18 DFT Coulomb Fit - NWChem
[Figure: measured time (seconds) versus number of CPUs; panels Si28O67H30 (1687/3928) and Si26O37H36 (1199/2818). Annotations: T IBM-SP/P2SC-120 (256) = 1137; T IBM-SP/P2SC-120 (256) = 2766.]

19 NWChem: Si8O7H18 and Si26O37H36
[Figure panels: Si8O7H18; Si26O37H36.]

20 NWChem / Si8O25H18 / Cycle

21 NWChem / Si8O25H18 / Diag

22 NWChem / Si8O25H18 / subdiag

23 NWChem / Si8O25H18 / subdiag

24 Parallel Implementations of GAMESS-UK
- Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL)
- SCF and DFT energies and gradients
  - Replicated data, but …
    - GA Tools for caching of I/O for restart and checkpoint files
    - Storage of 3-centre 2-e integrals in DFT Jfit
    - Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix)
- SCF second derivatives
  - Distribution of and integrals via GAs
- MP2 gradients
  - Distribution of and integrals via GAs

25 GAMESS-UK: DFT S-VWN. Impact of Coulomb Fitting: Compaq AlphaServer SC /833
Basis: DZV_A2 (Dgauss) A1_DFT Fit
[Figure: measured time (seconds) versus number of CPUs, J EXPLICIT versus J FIT; panels Si26O37H36 (1199/2818) and Si28O67H30 (1687/3928).]

26 DFT Coulomb Fit - GAMESS-UK
[Figure: measured time (seconds) versus number of CPUs; panels Si28O67H30 (1687/3928) and Si26O37H36 (1199/2818).]

27 DFT JFit Performance: Si26O37H36
[Figure: JFit, XC and SCF series versus number of CPUs; panels Cray T3E/1200E, AlphaServer SC/833 and SGI Origin 3000/R12k-400.]

28 GAMESS-UK / Si8O25H18: 8 CPUs: One DFT Cycle

29 GAMESS-UK / Si8O25H18: 8 CPUs: Q†HQ (GAMULT2) and PEIGS

30 Summary
- PNNL, Daresbury and PALLAS collaborations
- Tool for Performance Analysis: VAMPIR & VAMPIR Trace
  - Extended to handle GA Applications
- Applied in a number of DFT Calculations on Zeolite Fragments on a variety of high-end and commodity-based platforms
- Instrumentation of both NWChem and GAMESS-UK:
  - Distributed data (NWChem)
  - Replicated Data (GAMESS-UK)
  - Analysis of GAs and PeIGs
- Findings
  - non-intrusive
  - Tracing of substantial runs possible
  - Size of trace files in distributed data applications
  - Use in quantifying scaling problems, e.g. GA_MULT2 in GAMESS-UK

31 Acknowledgements
- Bob Gingold, Australian National University Supercomputer Facility
- Mario Deilmann, Hans Plum, Heinrich Bockhorst, Pallas

