1 Benchmark Performance on Bassi
Jonathan Carter, User Services Group Lead
NERSC User Group Meeting, June 12, 2006

2 Architectural Comparison
Node Type  Where  Network     Network Topology
Power3     NERSC  Colony      Fat-tree
Itanium2   LLNL   Quadrics    Fat-tree
Opteron    NERSC  InfiniBand  Fat-tree
Power5     NERSC  HPS         Fat-tree
X1E        ORNL   Custom      4D-Hypercube
ES         ESC    IN          Crossbar
SX-8       HLRS   IXS         Crossbar
[Additional per-CPU columns on the slide: CPUs per node, clock (MHz), peak GFlop/s, STREAM bandwidth (GB/s/P), peak byte/flop, MPI bandwidth (GB/s/P), and MPI latency (µsec).]
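The peak GFlop/s and peak byte/flop columns are derived quantities. As a reminder of how they relate, here is a minimal sketch in C; the clock rate, flops per cycle, and STREAM bandwidth below are hypothetical placeholders, not values from this table:

```c
#include <stdio.h>

/* Peak GFlop/s = clock (GHz) * floating-point operations per cycle.
 * Machine balance (bytes/flop) = sustained STREAM bandwidth / peak flop rate. */
int main(void) {
    double clock_ghz     = 2.0;  /* hypothetical clock rate, GHz            */
    double flops_per_clk = 4.0;  /* hypothetical FP ops per cycle (2 FMAs)  */
    double stream_gbs    = 4.0;  /* hypothetical per-CPU STREAM BW, GB/s    */

    double peak_gflops    = clock_ghz * flops_per_clk;
    double bytes_per_flop = stream_gbs / peak_gflops;

    printf("Peak: %.1f GFlop/s, balance: %.2f bytes/flop\n",
           peak_gflops, bytes_per_flop);
    return 0;
}
```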

3 NERSC 5 Application Benchmarks
CAM3 – Climate model, NCAR
GAMESS – Computational chemistry, Iowa State / Ames Lab
GTC – Fusion, PPPL
MADbench – Astrophysics (CMB analysis), LBL
MILC – QCD, multi-site collaboration
PARATEC – Materials science, developed at LBL and UC Berkeley
PMEMD – Computational chemistry, University of North Carolina, Chapel Hill

4 Application Summary
Application  Science Area             Basic Algorithm            Language    Library Use  Comment
CAM3         Climate (BER)            CFD, FFT                   Fortran 90  netCDF       IPCC
GAMESS       Chemistry (BES)          DFT                        Fortran 90  DDI, BLAS
GTC          Fusion (FES)             Particle-in-cell           Fortran 90  FFT (opt)    ITER emphasis
MADbench     Astrophysics (HEP & NP)  Power spectrum estimation  C           ScaLAPACK    1024 proc., 730 MB per task, 200 GB disk
MILC         QCD (NP)                 Conjugate gradient         C           none         2048 proc., 540 MB per task
PARATEC      Materials (BES)          3D FFT                     Fortran 90  ScaLAPACK    Nanoscience emphasis
PMEMD        Life Science (BER)       Particle Mesh Ewald        Fortran 90  none

5 CAM3
Community Atmospheric Model version 3
– Developed at NCAR with substantial DOE input, both scientific and software
– The atmosphere model for CCSM, the coupled climate system model; also the most time-consuming part of CCSM
– Widely used by both American and foreign scientists for climate research. For example, carbon and biogeochemistry models are built upon (integrated with) CAM3, and IPCC predictions use CAM3 (in part)
– About 230,000 lines of Fortran 90
1D decomposition: runs on up to 128 processors at T85 resolution (150 km)
2D decomposition: runs on up to 1680 processors at 0.5 degree (60 km) resolution
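The scaling limits above follow from the decomposition itself; the sketch below illustrates the idea, assuming the common 256x128 longitude-latitude grid for T85 (the grid dimensions and the 2D blocking factor are illustrative assumptions, not taken from the slide):

```c
#include <stdio.h>

int main(void) {
    /* Assumed grid sizes for illustration only. */
    int nlat = 128;   /* latitude bands at T85 resolution   */
    int nlon = 256;   /* longitude points at T85 resolution */

    /* 1D decomposition: each MPI task owns one or more whole latitude
     * bands, so the task count cannot exceed the number of latitudes. */
    int max_tasks_1d = nlat;

    /* 2D decomposition: latitude bands are further split in a second
     * dimension, raising the ceiling on usable tasks. */
    int blocks_2nd_dim = 16;   /* hypothetical blocking factor */
    int max_tasks_2d = nlat * blocks_2nd_dim;

    printf("Grid: %d x %d points\n", nlon, nlat);
    printf("1D decomposition limit: %d tasks\n", max_tasks_1d);
    printf("2D decomposition limit (x%d blocks): %d tasks\n",
           blocks_2nd_dim, max_tasks_2d);
    return 0;
}
```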

6 CAM3: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) at two concurrencies on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, and Power5 Bassi.]

7 GAMESS
Computational chemistry application
– Variety of electronic structure algorithms available
About 550,000 lines of Fortran 90
Communication layer makes use of highly optimized vendor libraries
Many methods available within the code
– Benchmarks are DFT energy and gradient calculations and MP2 energy and gradient calculations
– Many computational chemistry studies rely on these techniques
Exactly the same as the DoD HPCMP TI-06 GAMESS benchmark
– Vendors will only have to do the work once

8 GAMESS: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) for the small and large test cases on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, and Power5 Bassi.]
Small case: large, messy, low-computational-intensity kernels are problematic for compilers
Large case depends on asynchronous messaging
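"Depends on asynchronous messaging" refers to overlapping communication with computation via nonblocking MPI. A minimal sketch of that pattern (not GAMESS or DDI code itself) is:

```c
#include <mpi.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, size;
    static double sendbuf[N], recvbuf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Post the receive and send up front ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do useful local work here while the messages are in flight ... */

    /* ... then wait for completion before reusing the buffers. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```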

9 GTC
Gyrokinetic Toroidal Code
Important code for the fusion SciDAC project and for the international fusion collaboration ITER
Models transport of thermal energy via plasma microturbulence using the particle-in-cell (PIC) approach
[Figure: 3D visualization of the electrostatic potential in a magnetic fusion device]
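The characteristic PIC operation is scattering each particle's contribution onto a grid through position-dependent indices, which is the source of the irregular data access noted on the next slide. A schematic 1D charge-deposition loop (not GTC's actual toroidal geometry or data layout) is:

```c
#include <stdio.h>

#define NG 64     /* grid points (illustrative) */
#define NP 1000   /* particles  (illustrative)  */

/* Schematic 1D charge deposition with linear weighting: each particle
 * scatters its charge to the two nearest grid points. The grid index
 * depends on the particle position, so the stores are indirect and
 * effectively random -- hard on caches and vector units unless
 * gather/scatter hardware is available. */
static void deposit_charge(int np, const double *x, double q,
                           int ng, double dx, double *rho) {
    for (int p = 0; p < np; p++) {
        double g = x[p] / dx;        /* position in grid units        */
        int i = (int)g;              /* left grid point               */
        double w = g - i;            /* weight toward the right point */
        rho[i % ng]       += q * (1.0 - w);   /* indirect update */
        rho[(i + 1) % ng] += q * w;           /* indirect update */
    }
}

int main(void) {
    static double x[NP], rho[NG];
    for (int p = 0; p < NP; p++)
        x[p] = (double)(p % NG) + 0.3;   /* fake particle positions */
    deposit_charge(NP, x, 1.0, NG, 1.0, rho);
    printf("rho[0] = %f\n", rho[0]);
    return 0;
}
```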

10 GTC: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) at two concurrencies on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, Power5 Bassi, X1E Phoenix, the Earth Simulator (SX-6), and SX-8 HLRS.]
SX-8: highest raw performance ever, but lower efficiency than the ES
Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
Opteron/InfiniBand is 50% faster than Itanium2/Quadrics and only half the speed of the X1
– Opteron: on-chip memory controller and caching of FP data in L1
X1 suffers from the overhead of scalar code portions

11 MADbench
Cosmic microwave background radiation analysis tool (MADCAP)
– Used a large amount of time in FY04 and one of the highest-scaling codes at NERSC
MADbench is a benchmark version of the original code
– Designed to be easily run with synthetic data for portability
– Used in a recent study in conjunction with the Berkeley Institute for Performance Studies (BIPS)
Written in C, making extensive use of the ScaLAPACK library
Has extensive I/O requirements
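The dense linear algebra that dominates MADbench is level-3 BLAS reached through ScaLAPACK. As a point of reference only, a minimal serial DGEMM call through the standard CBLAS interface (not MADbench's own ScaLAPACK calls, and with an arbitrary matrix size) looks like:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    int n = 512;                          /* illustrative matrix size */
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C) return 1;

    A[0] = B[0] = 1.0;                    /* token data */

    /* C = 1.0*A*B + 0.0*C : a level-3 BLAS operation that runs at a high
     * fraction of peak because each matrix element is reused O(n) times
     * from cache. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0][0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```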

12 MADbench: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) at three concurrencies on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, and Power5 Bassi.]
Dominated by
– BLAS3
– I/O

13 MILC
Quantum chromodynamics application
– Widespread community use, large allocation
– Easy to build: no dependencies, standards-conforming
– Can be set up to run at a wide range of concurrencies
Conjugate gradient algorithm
Physics on a 4D lattice
Local computations are 3x3 complex matrix multiplies, with a sparse (indirect) access pattern
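The kernel described above, a 3x3 complex matrix applied to a 3-vector at a neighboring lattice site found through an index table, can be sketched as follows (a simplified illustration, not MILC's actual data structures or lattice layout):

```c
#include <complex.h>
#include <stdio.h>

typedef struct { double complex e[3][3]; } su3_matrix;
typedef struct { double complex c[3];    } su3_vector;

/* r = U * v : the 3x3 complex matrix-vector multiply at the heart of MILC. */
static void mult_su3_mat_vec(const su3_matrix *U, const su3_vector *v,
                             su3_vector *r) {
    for (int i = 0; i < 3; i++) {
        double complex s = 0.0;
        for (int j = 0; j < 3; j++)
            s += U->e[i][j] * v->c[j];
        r->c[i] = s;
    }
}

int main(void) {
    enum { NSITES = 8 };
    static su3_matrix link[NSITES];
    static su3_vector src[NSITES], dst[NSITES];
    /* Neighbor table: the indirect (gather) access pattern mentioned above.
     * Indices here are arbitrary placeholders, not a real 4D lattice. */
    static const int neighbor[NSITES] = {1, 2, 3, 0, 5, 6, 7, 4};

    for (int s = 0; s < NSITES; s++) {          /* token initialization */
        link[s].e[0][0] = 1.0;
        src[s].c[0] = s + 1.0;
    }
    for (int s = 0; s < NSITES; s++)
        mult_su3_mat_vec(&link[s], &src[neighbor[s]], &dst[s]);

    printf("dst[0].c[0] = %f\n", creal(dst[0].c[0]));
    return 0;
}
```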

14 MILC: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) at three concurrencies on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, and Power5 Bassi.]

15 PARATEC
Parallel Total Energy Code
Plane-wave DFT using a custom 3D FFT
70% of the materials science computation at NERSC is done via plane-wave DFT codes
PARATEC captures the performance of a wide range of such codes (VASP, CPMD, PEtot)
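PARATEC's central operation is the 3D FFT that moves wavefunctions between real and Fourier space. As a point of reference only (PARATEC uses its own custom parallel 3D FFT, not FFTW), a serial complex 3D transform with the FFTW3 library looks like:

```c
#include <fftw3.h>
#include <stdio.h>

int main(void) {
    int n = 64;                                   /* illustrative grid size */
    fftw_complex *grid = fftw_alloc_complex((size_t)n * n * n);

    /* Plan, then execute, a forward 3D complex-to-complex transform. */
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, grid, grid,
                                      FFTW_FORWARD, FFTW_ESTIMATE);
    for (size_t i = 0; i < (size_t)n * n * n; i++)
        grid[i][0] = grid[i][1] = 0.0;
    grid[0][0] = 1.0;                             /* token data: a delta */
    fftw_execute(plan);

    printf("First output element: %f + %fi\n", grid[0][0], grid[0][1]);

    fftw_destroy_plan(plan);
    fftw_free(grid);
    return 0;
}
```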

16 PARATEC: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) at two concurrencies on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, Power5 Bassi, X1E Phoenix, the Earth Simulator (SX-6), and SX-8 HLRS.]
All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)
SX-8 achieves the highest per-processor performance
X1/X1E shows the lowest % of peak
– Non-vectorizable code is much more expensive on X1/X1E (32:1)
– Lower bisection-bandwidth-to-computation ratio (4D hypercube)
– X1 performance is comparable to Itanium2
Itanium2 outperforms Opteron because
– PARATEC is less sensitive to memory access issues (BLAS3)
– Opteron lacks an FMA unit
– Quadrics shows better scaling of all-to-all at large concurrencies

17 PMEMD
Particle Mesh Ewald Molecular Dynamics
– An F90 code with advanced MPI coding; should test the compiler and stress asynchronous point-to-point messaging
PMEMD is very similar to the MD engine in AMBER 8.0, used in both chemistry and the biosciences
Test system is a 91K-atom blood coagulation protein

18 PMEMD: Performance
[Table: per-processor performance (GFlop/s per processor and % of peak) at two concurrencies on Power3 Seaborg, Itanium2 Thunder, Opteron Jacquard, and Power5 Bassi.]

19 Summary

20 Summary
[Table: per-benchmark results on Seaborg, Bassi, Jacquard, and Thunder, with the Seaborg/Bassi ratio (s/b), for MILC (M, L, XL), GTC (M, L), PARATEC (M, L), GAMESS (M, L), MADbench (M, L, XL), PMEMD (M, L), and CAM3 (M, L) problem sizes.]

21 Summary
The average Bassi-to-Seaborg performance ratio across the NERSC 5 application benchmarks is 6.0.