Slide 1: Benchmark performance on Bassi
Jonathan Carter, User Services Group Lead (jtcarter@lbl.gov)
NERSC User Group Meeting, June 12, 2006

Slide 2: Architectural Comparison

Node type | Where | Network    | CPU/node | Clock (MHz) | Peak (GFlop/s) | STREAM BW (GB/s/P) | Peak byte/flop | MPI BW (GB/s/P) | MPI latency (μsec) | Network topology
Power3    | NERSC | Colony     | 16       | 375         | 1.5            | 0.4                | 0.26           | 0.13            | 16.3               | Fat-tree
Itanium2  | LLNL  | Quadrics   | 4        | 1400        | 5.6            | 1.1                | 0.19           | 0.25            | 3.0                | Fat-tree
Opteron   | NERSC | InfiniBand | 2        | 2200        | 4.4            | 2.3                | 0.51           | 0.59            | 6.0                | Fat-tree
Power5    | NERSC | HPS        | 8        | 1900        | 7.6            | 6.8                | 0.85           | 0.69            | 4.7                | Fat-tree
X1E       | ORNL  | Custom     | 4        | 1130        | 18.0           | 9.7                | 0.54           | 2.9             | 5.0                | 4D hypercube
ES        | ESC   | IN         | 8        | 1000        | 8.0            | 26.3               | 3.29           | 1.5             | 5.6                | Crossbar
SX-8      | HLRS  | INX        | 8        | 2000        | 16.0           | 41.0               | 2.56           | 2.0             | 5.0                | Crossbar
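The peak byte/flop column is simply the STREAM bandwidth divided by the per-CPU peak. The short Fortran sketch below is not part of the original slides; it just recomputes that column from the table values, and shows the slide's rounded figures agree to within a few hundredths (e.g. Power5 comes out as 0.89 rather than the 0.85 shown).

    program machine_balance
      implicit none
      ! STREAM bandwidth (GB/s per CPU) and peak (GFlop/s per CPU) from the table above
      character(len=8), parameter :: name(7) = &
           (/ 'Power3  ', 'Itanium2', 'Opteron ', 'Power5  ', 'X1E     ', 'ES      ', 'SX-8    ' /)
      real, parameter :: stream_bw(7) = (/ 0.4, 1.1, 2.3, 6.8,  9.7, 26.3, 41.0 /)
      real, parameter :: peak(7)      = (/ 1.5, 5.6, 4.4, 7.6, 18.0,  8.0, 16.0 /)
      integer :: i
      do i = 1, 7
         ! machine balance: bytes of memory traffic available per peak flop
         print '(a, a, f5.2)', name(i), ' peak byte/flop: ', stream_bw(i) / peak(i)
      end do
    end program machine_balance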

Slide 3: NERSC 5 Application Benchmarks
CAM3 – Climate model, NCAR
GAMESS – Computational chemistry, Iowa State, Ames Lab
GTC – Fusion, PPPL
MADbench – Astrophysics (CMB analysis), LBL
MILC – QCD, multi-site collaboration
PARATEC – Materials science, developed at LBL and UC Berkeley
PMEMD – Computational chemistry, University of North Carolina-Chapel Hill

Slide 4: Application Summary

Application | Science area            | Basic algorithm           | Language   | Library use | Comment
CAM3        | Climate (BER)           | CFD, FFT                  | Fortran 90 | netCDF      | IPCC
GAMESS      | Chemistry (BES)         | DFT                       | Fortran 90 | DDI, BLAS   |
GTC         | Fusion (FES)            | Particle-in-cell          | Fortran 90 | FFT (opt)   | ITER emphasis
MADbench    | Astrophysics (HEP & NP) | Power spectrum estimation | C          | ScaLAPACK   | 1024 proc., 730 MB per task, 200 GB disk
MILC        | QCD (NP)                | Conjugate gradient        | C          | none        | 2048 proc., 540 MB per task
PARATEC     | Materials (BES)         | 3D FFT                    | Fortran 90 | ScaLAPACK   | Nanoscience emphasis
PMEMD       | Life science (BER)      | Particle Mesh Ewald       | Fortran 90 | none        |

Slide 5: CAM3
Community Atmospheric Model version 3
– Developed at NCAR with substantial DOE input, both scientific and software
The atmosphere model for CCSM, the coupled climate system model
– Also the most time-consuming part of CCSM
Widely used by both American and foreign scientists for climate research
– For example, carbon and bio-geochemistry models are built upon (integrated with) CAM3
– IPCC predictions use CAM3 (in part)
About 230,000 lines of code in Fortran 90
1D decomposition runs on up to 128 processors at T85 resolution (150 km)
2D decomposition runs on up to 1680 processors at 0.5 degree (60 km) resolution

Slide 6: CAM3 Performance

P   | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi
56  | 0.22 (15%)     | 0.35 (6%)        | --               | 0.93 (12%)
240 | 0.18 (13%)     | 0.38 (6%)        | --               | 0.83 (11%)
Values are GFlop/s per processor (% of peak).
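All of the per-processor tables in this talk can be read against the peak numbers on the architectural comparison slide: the % of peak is just GFlop/s per processor divided by the per-CPU peak. A minimal check, not from the original slides, using the Bassi CAM3 entry at P = 56:

    program percent_of_peak
      implicit none
      ! CAM3 on Bassi (Power5) at P = 56: 0.93 GFlop/s per processor, 7.6 GFlop/s peak per CPU
      real, parameter :: gflops_per_proc = 0.93, peak_gflops = 7.6
      print '(a, f5.1, a)', 'CAM3 on Bassi: ', 100.0 * gflops_per_proc / peak_gflops, ' % of peak'
    end program percent_of_peak

This prints 12.2%, reproducing the 12% entry in the table above.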

Slide 7: GAMESS
Computational chemistry application
– Variety of electronic structure algorithms available
About 550,000 lines of Fortran 90
Communication layer makes use of highly optimized vendor libraries
Many methods available within the code
– Benchmarks are DFT energy and gradient calculation, MP2 energy and gradient calculation
– Many computational chemistry studies rely on these techniques
Exactly the same as the DOD HPCMP TI-06 GAMESS benchmark
– Vendors will only have to do the work once

Slide 8: GAMESS Performance

P   | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi
64  | 0.02 (1%)      | 0.07 (1%)        | 0.07 (2%)        | 0.06 (1%)
384 | 0.03 (2%)      | 0.32 (5%)        | --               | 0.31 (4%)
Values are GFlop/s per processor (% of peak).

Small case: large, messy, low computational-intensity kernels are problematic for compilers
Large case depends on asynchronous messaging

Slide 9: GTC
Gyrokinetic Toroidal Code
An important code for the Fusion SciDAC project and for the international fusion collaboration ITER
Models transport of thermal energy via plasma microturbulence using a particle-in-cell (PIC) approach
(Figure: 3D visualization of the electrostatic potential in a magnetic fusion device)
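To illustrate why the PIC approach stresses memory systems, the toy Fortran sketch below shows a generic 1D charge-deposition loop. It is not GTC code, and the grid size, particle count, and positions are made up; it only demonstrates the data-dependent, indirect grid updates that the performance discussion on the next slide refers to.

    program pic_deposit
      implicit none
      integer, parameter :: np = 8, ng = 16
      real :: x(np), rho(0:ng-1), dx, w
      integer :: i, j
      dx = 1.0 / ng
      ! arbitrary particle positions in [0,1), purely for illustration
      do i = 1, np
         x(i) = real(i - 1) / np + 0.03
      end do
      rho = 0.0
      ! scatter phase: each particle deposits charge on its two nearest grid points;
      ! the grid index depends on the particle position, so the memory access is
      ! indirect and data-dependent (hard to vectorize, cache-unfriendly)
      do i = 1, np
         j = int(x(i) / dx)
         w = x(i) / dx - j
         rho(j)              = rho(j)              + (1.0 - w)
         rho(mod(j + 1, ng)) = rho(mod(j + 1, ng)) + w
      end do
      print '(8f8.3)', rho
    end program pic_deposit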

Slide 10: GTC Performance

P   | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi | X1E Phoenix | SX6 ES    | SX8 HLRS
64  | 0.15 (10%)     | 0.51 (9%)        | 0.64 (15%)       | 0.72 (9%)    | 1.7 (10%)   | 1.9 (23%) | 2.3 (14%)
256 | 0.13 (8%)      | 0.44 (7%)        | 0.58 (13%)       | 0.68 (9%)    | 1.7 (10%)   | 1.8 (22%) | 2.3 (15%)
Values are GFlop/s per processor (% of peak).

SX8 achieves the highest raw performance (ever) but lower efficiency than the ES
Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
Opteron/InfiniBand is 50% faster than Itanium2/Quadrics and only half the speed of the X1
– Opteron: on-chip memory controller and caching of FP data in L1
X1 suffers from the overhead of scalar code portions

Slide 11: MADbench
Cosmic microwave background radiation analysis tool (MADCAP)
– Used a large amount of time in FY04 and is one of the highest-scaling codes at NERSC
MADbench is a benchmark version of the original code
– Designed to be easily run with synthetic data for portability
– Used in a recent study in conjunction with the Berkeley Institute for Performance Studies (BIPS)
Written in C, making extensive use of the ScaLAPACK library
Has extensive I/O requirements

Slide 12: MADbench Performance

P    | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi
64   | 0.56 (37%)     | 2.6 (43%)        | 1.7 (40%)        | 4.1 (54%)
256  | 0.50 (34%)     | 2.2 (36%)        | 1.8 (40%)        | 3.2 (44%)
2048 | 0.70 (47%)     | 1.6 (27%)        | --               | --
Values are GFlop/s per processor (% of peak).

Dominated by
– BLAS3
– I/O

Slide 13: MILC
Quantum chromodynamics application
– Widespread community use, large allocation
– Easy to build, no dependencies, standards conforming
– Can be set up to run at a wide range of concurrencies
Conjugate gradient algorithm
Physics on a 4D lattice
Local computations are 3x3 complex matrix multiplies, with a sparse (indirect) access pattern
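The inner kernel is small: a 3x3 complex (SU(3)) matrix applied to a 3-vector gathered from a neighboring lattice site. The Fortran sketch below is a generic illustration of that kernel shape, not MILC source code; the identity links, toy vectors, and neighbor table are made up, and the point is only that each site does few flops per byte of gauge-link and vector data moved.

    program su3_matvec
      implicit none
      integer, parameter :: nsites = 8
      complex :: u(3, 3, nsites), v(3, nsites), w(3)
      integer :: neighbor(nsites), s, i, j
      ! toy data: identity gauge links, simple vectors, and a ring neighbor table
      u = (0.0, 0.0)
      do s = 1, nsites
         do i = 1, 3
            u(i, i, s) = (1.0, 0.0)
            v(i, s) = cmplx(s, 0)
         end do
         neighbor(s) = mod(s, nsites) + 1   ! indirect (gather) access pattern
      end do
      ! 3x3 complex matrix times a 3-vector from a neighboring site:
      ! 66 flops per site against one link matrix and two 3-vectors of data
      do s = 1, nsites
         w = (0.0, 0.0)
         do j = 1, 3
            do i = 1, 3
               w(i) = w(i) + u(i, j, s) * v(j, neighbor(s))
            end do
         end do
      end do
      print '(6f6.1)', w
    end program su3_matvec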

Slide 14: MILC Performance

P    | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi
64   | 0.18 (12%)     | 0.26 (4%)        | 0.60 (14%)       | 1.35 (18%)
256  | 0.14 (9%)      | 0.26 (4%)        | 0.51 (12%)       | 0.86 (11%)
2048 | 0.12 (8%)      | 0.25 (4%)        | 0.47 (11%)       | --
Values are GFlop/s per processor (% of peak).

Slide 15: PARATEC
Parallel Total Energy Code
Plane-wave DFT using a custom 3D FFT
70% of materials science computation at NERSC is done via plane-wave DFT codes
PARATEC captures the performance of a wide range of codes (VASP, CPMD, PETOT)

Slide 16: PARATEC Performance

P   | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi | X1E Phoenix | SX6 ES    | SX8 HLRS
64  | 0.60 (40%)     | 1.8 (29%)        | 2.3 (53%)        | 4.4 (58%)    | 3.8 (21%)   | 5.1 (64%) | 7.5 (49%)
256 | 0.41 (27%)     | 0.79 (13%)       | 1.7 (38%)        | 3.3 (43%)    | 3.3 (18%)   | 5.0 (62%) | 6.8 (43%)
Values are GFlop/s per processor (% of peak).

All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)
SX8 achieves the highest per-processor performance
X1/X1E shows the lowest % of peak
– Non-vectorizable code is much more expensive on X1/X1E (32:1)
– Lower bisection-bandwidth-to-computation ratio (4D hypercube)
– X1 performance is comparable to Itanium2
Itanium2 outperforms Opteron because
– PARATEC is less sensitive to memory access issues (BLAS3)
– Opteron lacks an FMA unit
– Quadrics shows better scaling of all-to-all at large concurrencies

Slide 17: PMEMD
Particle Mesh Ewald Molecular Dynamics
– An F90 code with advanced MPI coding; should test the compiler and stress asynchronous point-to-point messaging
PMEMD is very similar to the MD engine in AMBER 8.0, used in both chemistry and the biosciences
Test system is a 91K-atom blood coagulation protein
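The "asynchronous point-to-point messaging" referred to above is the non-blocking MPI_Isend/MPI_Irecv pattern sketched below in Fortran 90. This is a generic nearest-neighbor ring exchange written for this transcript, not PMEMD's actual communication code; buffer sizes and topology are arbitrary.

    program async_exchange
      use mpi
      implicit none
      integer, parameter :: n = 1000
      integer :: ierr, rank, nprocs, left, right, reqs(4)
      double precision :: sendl(n), sendr(n), recvl(n), recvr(n)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)
      sendl = rank
      sendr = rank

      ! post non-blocking receives and sends up front ...
      call MPI_Irecv(recvl, n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_Irecv(recvr, n, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, reqs(2), ierr)
      call MPI_Isend(sendr, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(3), ierr)
      call MPI_Isend(sendl, n, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, reqs(4), ierr)
      ! ... local computation could overlap with the messages here ...
      call MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE, ierr)

      if (rank == 0) print *, 'neighbor exchange complete on ', nprocs, ' ranks'
      call MPI_Finalize(ierr)
    end program async_exchange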

Slide 18: PMEMD Performance

P   | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | Power5 Bassi
64  | 0.13 (9%)      | 0.21 (3%)        | 0.46 (10%)       | 0.52 (7%)
256 | 0.05 (3%)      | 0.10 (2%)        | 0.19 (4%)        | 0.32 (4%)
Values are GFlop/s per processor (% of peak).

Slide 19: Summary

Slide 20: Summary

Benchmark | Seaborg | Bassi  | Jacquard | Thunder | Seaborg/Bassi
MILC M    | 1028.9  | 138    | 312      | 708.0   | 7.5
MILC L    | 9562.7  | 1496   | 2530     | 5069.0  | 6.4
MILC XL   | 12697.3 | 1945   | 3289     | 6129.0  |
GTC M     | 8236.9  | 1667   | 1876     | 2345.0  | 4.9
GTC L     | 9572.3  | 1790   | 2079     | 2759.0  | 5.3
PARA M    | 3306.4  | 451.0  | 861.0    | 1134.0  | 7.3
PARA L    | 6811.0  | 854.0  | 1654.0   | 3534.0  | 8.0
GAM M     | 18665.0 | 5837.0 | 5404.0   | 5277.0  | 3.2
GAM L     | 42167.0 | 4683.0 | --       | 4516.0  | 9.0
MAD M     | 8013.9  | 1094.0 | 2585.0   | 1727.0  | 7.3
MAD L     | 8421.6  | 1277.0 | 2417.0   | 1942.0  | 6.6
MAD XL    | 2943.9  | 447.0  | 846.0    | 1291.0  |
PME M     | 2080.0  | 538    | 606      | 1344.0  | 3.9
PME L     | 3020.0  | 475    | 782      | 1541.0  | 6.4
CAM M     | 7932.8  | 1886.0 | --       | 4988    | 4.2
CAM L     | 2439.0  | 527.0  | --       | 1158.0  | 4.6

Slide 21: Summary
The average ratio of Bassi to Seaborg is 6.0 for the NERSC 5 application benchmarks
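The 6.0 figure is the arithmetic mean of the Seaborg/Bassi column on the previous slide, with the XL rows (which list no ratio) left out. A quick Fortran check, not part of the original slides:

    program average_ratio
      implicit none
      ! Seaborg/Bassi ratios from the summary table (XL rows omitted, as on the slide)
      real, parameter :: r(14) = (/ 7.5, 6.4, 4.9, 5.3, 7.3, 8.0, 3.2, &
                                    9.0, 7.3, 6.6, 3.9, 6.4, 4.2, 4.6 /)
      print '(a, f4.1)', 'average Bassi speedup over Seaborg: ', sum(r) / size(r)
    end program average_ratio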

