Evaluation of Ultra-Scale Applications on Leading Scalar and Vector Platforms
Leonid Oliker
Computational Research Division, Lawrence Berkeley National Laboratory

Overview
- Stagnating application performance is a well-known problem in scientific computing
- By the end of the decade, mission-critical applications are expected to have 100X the computational demands of current levels
- Many HEC platforms are poorly balanced for the demands of leading applications
  - Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology
- Traditional superscalar trends are slowing down
  - Most of the benefits of ILP and pipelining have been mined; clock frequency is limited by power concerns
- To continue increasing computing power and reap its benefits, major strides are necessary in architecture development, software infrastructure, and application development

Application Evaluation
- Microbenchmarks, algorithmic kernels, and performance modeling and prediction are important components of understanding and improving architectural performance
- However, full-scale application performance is the final arbiter of system utility and is necessary as a baseline to support all complementary approaches
- Our evaluation work emphasizes full applications, with real input data, at the appropriate scale
- This requires coordination of computer scientists and application experts from highly diverse backgrounds
- Our initial efforts have focused on comparing performance between high-end vector and scalar platforms
- Effective code vectorization is an integral part of the process

Benefits of Evaluation
- Full-scale application evaluation leads to more efficient use of community resources, both in current installations and in future designs
- Head-to-head comparisons on full applications:
  - Help identify the suitability of a particular architecture for a given site or set of users
  - Give application scientists information about how well various numerical methods perform across systems
  - Reveal performance-limiting system bottlenecks that can aid designers of next-generation systems
- In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale to achieve high performance

Application Overview

Name      Discipline          Problem/Method    Structure
MADCAP    Cosmology           CMB analysis      Dense matrix
CACTUS    Astrophysics        Theory of GR      Grid
LBMHD     Plasma Physics      MHD               Lattice
GTC       Magnetic Fusion     Vlasov-Poisson    Particle/Grid
PARATEC   Material Science    DFT               Fourier/Grid
FVCAM     Climate Modeling    AGCM              Grid

We examine a set of applications with the potential to run at ultra-scale and with abundant data parallelism.

IPM Overview
Integrated Performance Monitoring:
- Portable, lightweight, scalable profiling
- Fast hash method
- Profiles MPI topology
- Profiles code regions
- Open source

Code regions are delimited with MPI_Pcontrol calls (a minimal usage sketch follows this slide):

  MPI_Pcontrol(1, "W");
  ...code...
  MPI_Pcontrol(-1, "W");

[Sample IPM report: IPMv0.7 output for madbench.x (completed 10/27/04 14:45:56) on ES/ESOS, with a per-region breakdown for region "W" listing time, %mpi, and %wall for MPI_Reduce, MPI_Recv, MPI_Send, MPI_Testall, and MPI_Isend]
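Below is a minimal, self-contained sketch of the MPI_Pcontrol region instrumentation shown above. The region label "W" and the dummy Allreduce loop are illustrative; with IPM linked in, the calls delimit a named profiling region, and without IPM they are no-ops, so the program still runs.

```c
/* Minimal sketch of IPM-style region profiling with MPI_Pcontrol.
 * The region label "W" and the dummy work loop are illustrative only;
 * IPM (when linked in) intercepts these calls and reports per-region
 * MPI time. Without IPM, MPI_Pcontrol is a no-op. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Pcontrol(1, "W");               /* open region "W" */

    double local = (double)rank, global = 0.0;
    for (int i = 0; i < 100; i++) {     /* ...code being profiled... */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Pcontrol(-1, "W");              /* close region "W" */

    if (rank == 0)
        printf("sum over %d ranks = %g\n", size, global);

    MPI_Finalize();
    return 0;
}
```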

Plasma Physics: LBMHD
- LBMHD uses a Lattice Boltzmann method to model magnetohydrodynamics (MHD)
- Performs 2D/3D simulations of high-temperature plasma
- Evolves from initial conditions, decaying to form current sheets
- The spatial grid is coupled to an octagonal streaming lattice (a simplified stream-and-collide update is sketched after this slide)
- Block-distributed over the processor grid
- Developed by George Vahala's group at the College of William & Mary; ported by Jonathan Carter

[Figure: evolution of vorticity into turbulent structures]
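To make the lattice structure concrete, here is a generic 2D stream-and-collide step in the style of a D2Q9 BGK lattice Boltzmann code. It is not the LBMHD kernel (which also carries magnetic-field distributions); the grid sizes, relaxation time, and velocity set are illustrative. What it does show is the access pattern that makes this class of codes vectorize well: streaming is a regular shifted copy, and collision is a long independent per-site loop.

```c
/* Illustrative 2D lattice Boltzmann stream-and-collide step (BGK relaxation).
 * NOT the LBMHD kernel; NX, NY, TAU, and the D2Q9 velocity set are
 * illustrative. Periodic boundaries are assumed for simplicity. */
#define NX 256
#define NY 256
#define Q  9
static const int cx[Q] = {0, 1, 0,-1, 0, 1,-1,-1, 1};
static const int cy[Q] = {0, 0, 1, 0,-1, 1, 1,-1,-1};
static const double w[Q] = {4./9., 1./9., 1./9., 1./9., 1./9.,
                            1./36., 1./36., 1./36., 1./36.};
#define TAU 0.8

/* f, ftmp: distribution functions, indexed [q][x][y] */
void stream_and_collide(double (*f)[NX][NY], double (*ftmp)[NX][NY])
{
    /* Streaming: shift each distribution along its lattice velocity. */
    for (int q = 0; q < Q; q++)
        for (int x = 0; x < NX; x++)
            for (int y = 0; y < NY; y++)
                ftmp[q][(x + cx[q] + NX) % NX][(y + cy[q] + NY) % NY] = f[q][x][y];

    /* Collision: relax each site toward a local equilibrium. */
    for (int x = 0; x < NX; x++)
        for (int y = 0; y < NY; y++) {
            double rho = 0.0, ux = 0.0, uy = 0.0;
            for (int q = 0; q < Q; q++) {
                rho += ftmp[q][x][y];
                ux  += cx[q] * ftmp[q][x][y];
                uy  += cy[q] * ftmp[q][x][y];
            }
            ux /= rho;  uy /= rho;
            for (int q = 0; q < Q; q++) {
                double cu  = cx[q]*ux + cy[q]*uy;
                double feq = w[q]*rho*(1.0 + 3.0*cu + 4.5*cu*cu
                                       - 1.5*(ux*ux + uy*uy));
                f[q][x][y] = ftmp[q][x][y] - (ftmp[q][x][y] - feq) / TAU;
            }
        }
}
```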

LBMHD-3D: Performance

[Table: Gflop/s per processor and % of peak for LBMHD-3D on the NERSC Power3, Thunder (Itanium2), Phoenix (X1), and the Earth Simulator (SX6*), across several grid sizes and processor counts]

- It is not unusual to see the vector systems achieve >40% of peak while the superscalar architectures achieve <10%
- There is plenty of computation, but the large working set causes register spilling on the scalar architectures
- Large vector register sets hide latency
- ES sustains 68% of peak up to 4800 processors (26 Tflop/s), by far the highest performance ever attained for this code

Astrophysics: CACTUS
- Numerical solution of Einstein's equations from the theory of general relativity
- Among the most complex in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
- CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
- Evolves the PDEs on a regular grid using finite differences (a simplified stencil update is sketched after this slide)
- Developed at the Max Planck Institute; vectorized by John Shalf

[Figure: visualization of a grazing collision of two black holes]
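As a minimal illustration of the finite-difference pattern, the sketch below applies a 7-point stencil on a regular 3D grid. It is not a CACTUS kernel (the real code evolves Einstein's equations with thousands of terms), but it shows why vector performance tracks the x-dimension: the unit-stride inner loop over x becomes the vector loop. The 250x80x80 extents mirror the per-processor problem size used in the measurements; the coefficient is illustrative.

```c
/* Illustrative 7-point finite-difference (Laplacian-style) update on a
 * regular 3D grid. Not a CACTUS kernel; NX, NY, NZ, and C are illustrative.
 * On a vector machine the innermost x loop is the vector loop, so the
 * x-extent sets the vector length. */
#define NX 250
#define NY 80
#define NZ 80
#define C  (1.0 / 6.0)

void fd_update(const double (*u)[NY][NX], double (*unew)[NY][NX])
{
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            /* unit-stride inner loop: vectorizes over x */
            for (int i = 1; i < NX - 1; i++)
                unew[k][j][i] = C * (u[k][j][i-1] + u[k][j][i+1]
                                   + u[k][j-1][i] + u[k][j+1][i]
                                   + u[k-1][j][i] + u[k+1][j][i]);
}
```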

CACTUS: Performance
- ES achieves the fastest performance to date: 45X faster than the Power3!
- Vector performance is related to the x-dimension (vector length)
- Excellent scaling on ES using fixed data size per processor (weak scaling)
- Opens the possibility of computations at unprecedented scale
- X1 is surprisingly poor (4X slower than ES): low scalar-to-vector ratio
  - Unvectorized boundary code required 15% of the runtime on ES and 30+% on X1, versus <5% for the scalar version: unvectorized code can quickly dominate cost
- Poor superscalar performance despite high computational intensity
  - Register spilling due to the large number of loop variables
  - Prefetch engines inhibited by multi-layer ghost-zone calculations

[Table: Gflop/s per processor and % of peak for CACTUS on Power3, Itanium2, X1, and ES (SX6*) at a fixed problem size of 250x80x80 per processor]

Magnetic Fusion: GTC
- Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
- The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
- GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
- PIC scales as N instead of N²: particles interact with the electromagnetic field on a grid (a scatter/gather sketch follows this slide)
- Allows solving the equations of particle motion as ODEs (instead of nonlinear PDEs)
- Developed at the Princeton Plasma Physics Laboratory; vectorized by Stephane Ethier

[Figure: electrostatic potential in a magnetic fusion device]
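The sketch below illustrates the PIC scatter/gather pattern in 1D with linear weighting. It is not the GTC kernel (GTC deposits gyro-averaged charges on a 3D toroidal grid), but it shows the data-dependent indexing that gives scalar caches trouble and forces vectorizing compilers to handle scatter conflicts. NP, NG, and DX are illustrative, and a periodic domain [0, NG*DX) is assumed.

```c
/* Illustrative 1D particle-in-cell charge deposition (scatter) and field
 * gather with linear weighting. Not the GTC kernel; parameters are
 * illustrative. The key point is the indirect, data-dependent indexing. */
#define NP 100000   /* particles */
#define NG 512      /* grid points */
#define DX 1.0      /* grid spacing */

void deposit_charge(const double *xp, const double *qp, double *rho)
{
    for (int g = 0; g < NG; g++) rho[g] = 0.0;
    for (int p = 0; p < NP; p++) {
        int    i = (int)(xp[p] / DX);      /* cell index from position  */
        double w = xp[p] / DX - i;         /* linear weight within cell */
        rho[i % NG]       += (1.0 - w) * qp[p];   /* indirect (scatter) */
        rho[(i + 1) % NG] += w * qp[p];
    }
}

double gather_field(const double *E, double x)
{
    int    i = (int)(x / DX);
    double w = x / DX - i;
    return (1.0 - w) * E[i % NG] + w * E[(i + 1) % NG];  /* indirect (gather) */
}
```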

GTC: Performance
- A new particle decomposition method efficiently utilizes large numbers of processors (versus the previous limit of 64 on the ES)
- Breakthrough of the Tflop barrier on the ES: 3.7 Tflop/s on 2048 processors
- Opens the possibility of a new set of high phase-space-resolution simulations that have not been possible to date
- X1 suffers from the overhead of scalar code portions
- Scalar architectures suffer from low computational intensity, irregular data access, and register spilling

[Table: Gflop/s per processor and % of peak for GTC on Power3, Itanium2, X1, and ES (SX6*), across several particles-per-cell counts and processor counts]

Cosmology: MADCAP
- Microwave Anisotropy Dataset Computational Analysis Package
- Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background (CMB) radiation
- Anisotropies in the CMB encode the early history of the Universe
- Recasts the problem as dense linear algebra: ScaLAPACK
- Out-of-core calculation: holds approximately 3 of the 50 matrices in memory at a time (see the I/O sketch after this slide)
- Developed by Julian Borrill, LBNL

[Figure: temperature anisotropies in the CMB (Boomerang)]
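To illustrate the out-of-core pattern, the sketch below stages matrix blocks through a small in-memory working set while streaming the rest from disk, using a blocked matrix multiply as a stand-in operation. This is not MADCAP's I/O layer (MADCAP couples ScaLAPACK with its own disk staging and different linear-algebra operations); the block size, file layout, and file names are illustrative. It does show why the method's sustained performance is so sensitive to the I/O subsystem.

```c
/* Illustrative out-of-core pattern: only three blocks are resident at a
 * time, echoing the "3 of the 50 matrices in memory" idea; the rest live
 * on disk. Error handling omitted; the C file is assumed pre-initialized. */
#include <stdio.h>
#include <stdlib.h>

#define NB 1024                 /* block edge length (doubles) */
#define BLK_BYTES (NB * (size_t)NB * sizeof(double))

/* Read block (i,j) of a matrix stored on disk in row-of-blocks order. */
static void read_block(FILE *f, int i, int j, int nblk, double *buf)
{
    long off = ((long)i * nblk + j) * (long)BLK_BYTES;
    fseek(f, off, SEEK_SET);
    fread(buf, sizeof(double), (size_t)NB * NB, f);
}

/* C_ij += A_ik * B_kj for one block triple (naive triple loop). */
static void block_multiply_add(const double *A, const double *B, double *C)
{
    for (int r = 0; r < NB; r++)
        for (int k = 0; k < NB; k++)
            for (int c = 0; c < NB; c++)
                C[r * NB + c] += A[r * NB + k] * B[k * NB + c];
}

void out_of_core_gemm(const char *fa, const char *fb, const char *fc, int nblk)
{
    double *A = malloc(BLK_BYTES), *B = malloc(BLK_BYTES), *C = malloc(BLK_BYTES);
    FILE *fA = fopen(fa, "rb"), *fB = fopen(fb, "rb"), *fC = fopen(fc, "r+b");

    for (int i = 0; i < nblk; i++)
        for (int j = 0; j < nblk; j++) {
            read_block(fC, i, j, nblk, C);              /* stage C_ij in      */
            for (int k = 0; k < nblk; k++) {            /* stream A, B blocks */
                read_block(fA, i, k, nblk, A);
                read_block(fB, k, j, nblk, B);
                block_multiply_add(A, B, C);
            }
            fseek(fC, ((long)i * nblk + j) * (long)BLK_BYTES, SEEK_SET);
            fwrite(C, sizeof(double), (size_t)NB * NB, fC);  /* stage back out */
        }

    fclose(fA); fclose(fB); fclose(fC);
    free(A); free(B); free(C);
}
```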

MADCAP: Performance
- Overall performance can be surprisingly low for a dense linear algebra code
- I/O takes a heavy toll on Phoenix and Columbia: I/O optimization is currently in progress
- The NERSC Power3 shows the best system balance with respect to I/O
- ES lacks high-performance parallel I/O

[Table: Gflop/s per processor and % of peak for MADCAP on Power3, Columbia (Itanium2), X1, and ES (SX6*) at 10K, 20K, and 40K pixels]

Climate: FVCAM
- Atmospheric component of CCSM
- AGCM: consists of physics and a dynamical core (DC)
- The DC approximates the Navier-Stokes equations to describe the dynamics of the atmosphere
- The default approach uses a spectral transform (1D decomposition)
- The finite-volume (FV) approach uses a 2D decomposition in latitude and level, allowing higher concurrency (see the decomposition sketch after this slide)
- Requires remapping between Lagrangian surfaces and the Eulerian reference frame
- Experiments conducted by Michael Wehner; vectorized by Pat Worley, Art Mirin, and Dave Parks
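The toy calculation below contrasts the concurrency available to a 1D latitude-only decomposition with a 2D latitude-by-level decomposition. The mesh extents, minimum latitude band width, and process grid are illustrative assumptions, not FVCAM's actual decomposition code; the point is simply that splitting a second dimension multiplies the usable process count.

```c
/* Illustrative 2D (latitude x level) block decomposition versus a 1D
 * latitude-only split. NLAT, NLEV, MIN_LATS_PER_PROC, and the example
 * process grid are illustrative values. */
#include <stdio.h>

#define NLAT 361               /* latitudes for a ~0.5 degree mesh (assumed) */
#define NLEV 26                /* vertical levels (assumed)                  */
#define MIN_LATS_PER_PROC 3    /* minimum latitude band width (assumed)      */

int main(void)
{
    int max_1d = NLAT / MIN_LATS_PER_PROC;            /* latitude-only split */
    int max_2d = max_1d * NLEV;                       /* latitude x level    */
    printf("max useful processes, 1D decomposition: %d\n", max_1d);
    printf("max useful processes, 2D decomposition: %d\n", max_2d);

    /* Local extent of one rank in a plat x plev process grid. */
    int plat = 60, plev = 13;                         /* example process grid */
    int me   = 123;                                   /* example rank         */
    int my_lat_block = me % plat, my_lev_block = me / plat;
    int lat0 = my_lat_block * NLAT / plat;
    int lat1 = (my_lat_block + 1) * NLAT / plat;
    int lev0 = my_lev_block * NLEV / plev;
    int lev1 = (my_lev_block + 1) * NLEV / plev;
    printf("rank %d owns latitudes [%d,%d) and levels [%d,%d)\n",
           me, lat0, lat1, lev0, lev1);
    return 0;
}
```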

FVCAM: Performance
- The 2D approach allows both architectures to effectively use more than 2X as many processors
- At high concurrencies both platforms achieve a low percentage of peak (about 4%)
- ES suffers from short vector lengths at fixed problem size
- ES can achieve more than 1000 simulated years per wall-clock year (3200 on 896 processors); NERSC cannot exceed 600 regardless of concurrency
- A speedup of 1000x or more is necessary for reasonable turnaround time
- Preliminary results: CAM3.1 experiments are currently underway on ES, X1, Thunder, and Power3
- CAM3.0 results shown for ES and Power3, using the D mesh (0.5ºx0.625º)

Material Science: PARATEC
- PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
- Uses Density Functional Theory (DFT) to calculate the structure and electronic properties of new materials
- DFT calculations are one of the largest consumers of supercomputer cycles in the world
- Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
- Part of the calculation is in real space, the rest in Fourier space
- Uses a specialized 3D FFT to transform the wavefunctions (a transpose-based 3D FFT sketch follows this slide)

[Figure: crystallized glycine - induced current and charge]
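The sketch below shows the basic structure of a 3D FFT built from 1D transforms along each axis, which is the skeleton behind PARATEC's specialized parallel 3D FFT (where interprocessor transposes sit between the axis passes). It is serial, uses a naive O(N^2) 1D DFT so it stays self-contained, and the grid size is illustrative; it is not the PARATEC implementation.

```c
/* Illustrative 3D transform = 1D transforms along z, then y, then x.
 * In the parallel code, data transposes between the passes ensure each
 * pass works on locally contiguous lines. Compile with -lm. */
#include <complex.h>
#include <math.h>

#define N 16   /* grid points per dimension (illustrative) */

/* Naive 1D DFT of a strided line of length N. */
static void dft_1d(double complex *a, int stride)
{
    double complex out[N];
    for (int k = 0; k < N; k++) {
        out[k] = 0;
        for (int n = 0; n < N; n++)
            out[k] += a[n * stride] * cexp(-2.0 * M_PI * I * k * n / N);
    }
    for (int k = 0; k < N; k++) a[k * stride] = out[k];
}

void fft_3d(double complex grid[N][N][N])
{
    for (int i = 0; i < N; i++)                 /* z-direction lines */
        for (int j = 0; j < N; j++)
            dft_1d(&grid[i][j][0], 1);

    for (int i = 0; i < N; i++)                 /* y-direction lines */
        for (int k = 0; k < N; k++)
            dft_1d(&grid[i][0][k], N);

    for (int j = 0; j < N; j++)                 /* x-direction lines */
        for (int k = 0; k < N; k++)
            dft_1d(&grid[0][j][k], N * N);
}
```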

PARATEC: Performance
- All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)
- ES achieves the fastest performance to date: 5.5 Tflop/s on 2048 processors
- The main ES advantage for this code is its fast interconnect
- Allows never-before-possible high-resolution simulations
- X1 shows the lowest percentage of peak
  - Non-vectorizable code is much more expensive on the X1 (32:1)
  - Lower bisection-bandwidth-to-computation ratio (2D torus)

[Table: Gflop/s per processor and % of peak for a 488-atom CdSe quantum dot on Power3, Itanium2, X1, and ES (SX6*), across several processor counts]

Developed by Andrew Canning with Louie and Cohen's groups (UCB, LBNL)

Overview
- Tremendous potential of vector architectures: 4 codes running faster than ever before
- Vector systems allow resolution not possible with scalar architectures (regardless of processor count)
- Opportunity to perform scientific runs at unprecedented scale
- ES shows high raw and much higher sustained performance compared with the X1
  - Limited X1-specific optimization; the optimal programming approach is still unclear (CAF, etc.)
  - Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
- The evaluation codes contain sufficient regularity in computation for high vector performance
  - GTC is an example of code at odds with data parallelism
  - It is much more difficult to evaluate codes poorly suited for vectorization
- Vectors are potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
- Plan to expand the scope of application domains/methods and examine the latest HPC architectures

[Table: % of peak at P=64 on Power3, Power4, Altix, ES, and X1, plus the speedup of ES versus Power3, Power4, Altix, and X1 at the maximum available concurrency, for LBMHD3D, CACTUS, GTC, MADCAP, PARATEC, FVCAM, and the average]

Collaborators
- Rupak Biswas, NASA Ames
- Andrew Canning, LBNL
- Jonathan Carter, LBNL
- Stephane Ethier, PPPL
- Bala Govindasamy, LLNL
- Art Mirin, LLNL
- David Parks, NEC
- John Shalf, LBNL
- David Skinner, LBNL
- Yoshinori Tsunda, JAMSTEC
- Michael Wehner, LBNL
- Patrick Worley, ORNL