1 Performance Tuning of Scientific Applications
David H. Bailey, Lawrence Berkeley National Lab, USA (speaker)
DHB's website:

2 Why performance is important

Highly parallel scientific computing is widely used in science and technology:
- Climate modeling – exploring scenarios for global warming.
- Materials science – photovoltaics, batteries and nanoelectronics.
- Astrophysics – supernova explosions, cosmology, data processing.
- Physics – tests of the standard model and alternatives.
- Biology – modeling biochemical processes, developing new drugs.
- Combustion – testing designs for greater efficiency and less pollution.

However, achieved performance is often poor – typically only 1-5% of peak. Common reasons include:
- Limited parallel concurrency in key loops, resulting in poor load balance.
- Suboptimal structure of loops and blocks.
- Subtle interference effects in data communication channels.

Low performance is unacceptable, not only because of the high purchase cost of state-of-the-art systems, but also because of the increasing cost of providing electrical power for these systems.

3 The Performance Engineering Research Institute (PERI)

- Performing research in the theory, techniques and application of performance tuning for scientific computing.
- Funded by the U.S. Dept. of Energy, Office of Science (SciDAC program).
- Participating members:
  - University of Southern California (lead)
  - Lawrence Berkeley Natl. Lab. (asst. lead)
  - Argonne National Lab.
  - Oak Ridge National Lab.
  - Lawrence Livermore National Lab.
  - Rice University
  - University of California, San Diego
  - University of Maryland
  - University of North Carolina/RENCI
  - University of Tennessee, Knoxville
  - University of Utah
- Principal research thrusts:
  - Performance modeling and analysis.
  - Automatic performance tuning.
  - Application analysis.

4 PERI performance modeling

- Semi-automated performance modeling methodology:
  - Performance trace runs obtain profiles of applications.
  - Performance probes obtain profiles of computer systems.
  - A "convolution" approach combines application and system profiles to produce quantitative predictions of performance (see the sketch below).
- Uses:
  - Permits scientists to understand the bottlenecks in their codes and the future potential for parallel scalability.
  - Permits computing facility managers to plan future requirements and improve the selection process for large systems.
- Recent advances include:
  - Techniques to significantly reduce the volume of trace data required.
  - Techniques to extrapolate models to larger future systems.
  - Extensions of the modeling methods to encompass energy consumption.
  - Applications to both DOD and DOE computational workloads.

Credit: Allan Snavely, UCSD
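At its core, the convolution idea reduces to combining two profiles: operation counts gathered from an application trace and per-operation costs measured on the target machine by probes. The sketch below is a toy stand-in, not PERI's actual tools: the operation categories, counts, and costs are invented numbers, and a real model distinguishes many more categories and accounts for overlap between them.

```c
/* Minimal sketch of the "convolution" idea: combine an application profile
 * (operation counts from a trace run) with a machine profile (per-operation
 * costs from micro-benchmark probes) to estimate run time.  All names and
 * numbers here are hypothetical illustrations. */
#include <stdio.h>

enum { FLOP, L1_HIT, L2_HIT, MEM_ACCESS, MPI_MSG, N_CATEGORIES };

int main(void) {
    /* Application profile: operations of each category observed by tracing. */
    double app_counts[N_CATEGORIES]   = { 4.0e12, 3.0e12, 5.0e11, 8.0e10, 2.0e6 };
    /* Machine profile: measured cost per operation, in seconds. */
    double machine_cost[N_CATEGORIES] = { 2.0e-10, 5.0e-10, 4.0e-9, 8.0e-8, 1.0e-5 };

    /* The "convolution": predicted time is the dot product of the two profiles. */
    double predicted = 0.0;
    for (int c = 0; c < N_CATEGORIES; c++)
        predicted += app_counts[c] * machine_cost[c];

    printf("Predicted run time: %.1f seconds\n", predicted);
    return 0;
}
```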

5 Performance modeling at LBNL

- Erich Strohmaier's ApexMAP: a simple modeling framework based on dataset size, spatial locality and temporal locality.
- Samuel Williams' "roofline" model: compares achieved performance to a "roofline" graph of peak data streaming bandwidth and peak flop/s capacity (see the sketch below).
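The roofline bound itself is simple: attainable performance is the minimum of the machine's peak flop rate and the kernel's arithmetic intensity times the peak memory bandwidth. A small sketch follows, using hypothetical machine numbers rather than any particular system:

```c
/* Minimal sketch of the roofline performance bound.  The peak flop rate and
 * peak DRAM bandwidth below are illustrative placeholders. */
#include <stdio.h>

static double roofline_gflops(double ai_flops_per_byte,
                              double peak_gflops, double peak_gbytes_per_s) {
    /* A kernel is either bandwidth-bound (slanted roof) or compute-bound (flat roof). */
    double bandwidth_bound = ai_flops_per_byte * peak_gbytes_per_s;
    return bandwidth_bound < peak_gflops ? bandwidth_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 85.0, peak_gbs = 20.0;   /* hypothetical node */
    /* Example: a stencil sweep with roughly 0.5 flops per byte of DRAM traffic. */
    printf("Attainable: %.1f GF/s of %.1f GF/s peak\n",
           roofline_gflops(0.5, peak_gflops, peak_gbs), peak_gflops);
    return 0;
}
```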

6 PERI automatic performance tuning

- Background: We have found that most computational scientists are reluctant to learn and use performance tools in day-to-day research work.
- Solution: Extend semi-automatic performance tuning techniques, such as those developed for specialized libraries like FFTW (FFTs) and ATLAS (dense matrix computation), to the more general area of large-scale scientific computing (a minimal sketch of the underlying empirical-search idea follows below).
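The core idea behind libraries such as ATLAS and FFTW is empirical search: generate several semantically equivalent code variants, time each on the target machine, and keep the fastest. The sketch below shows that loop in miniature; the variants and timing harness are illustrative, not taken from any PERI tool.

```c
/* Minimal sketch of empirical autotuning in the style popularized by ATLAS/FFTW:
 * time several code variants on the target machine and keep the fastest.
 * The variants and the crude clock()-based timer are illustrative only. */
#include <stdio.h>
#include <time.h>

#define N 4096
static double a[N], b[N];

static void axpy_plain(double s) { for (int i = 0; i < N; i++) a[i] += s * b[i]; }
static void axpy_unroll4(double s) {
    for (int i = 0; i < N; i += 4) {          /* same computation, unrolled by 4 */
        a[i]   += s * b[i];   a[i+1] += s * b[i+1];
        a[i+2] += s * b[i+2]; a[i+3] += s * b[i+3];
    }
}

int main(void) {
    void (*variant[])(double) = { axpy_plain, axpy_unroll4 };
    const char *name[] = { "plain", "unroll4" };
    int best = 0; double best_t = 1e30;
    for (int v = 0; v < 2; v++) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 10000; rep++) variant[v](1.0001);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%-8s %.3f s\n", name[v], t);
        if (t < best_t) { best_t = t; best = v; }
    }
    printf("selected variant: %s\n", name[best]);
    return 0;
}
```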

7 The PERI autotuning framework

Component tools:
- HPCToolkit (Rice)
- ROSE (LLNL)
- CHiLL (USC/ISI and Utah)
- Orio (Argonne)
- OSKI (LBNL)
- Active Harmony (UMD)
- GCO (UTK)
- PerfTrack (LBNL, SDSC, RENCI)

8 Applications of PERI research

- PERI research tools and expertise have been applied to numerous scientific application codes, in many cases with notable results.
- Even modest performance improvements in widely-used, high-profile codes can save hundreds of thousands of dollars in computer time. Examples:
- S3D (Sandia code to model turbulence):
  - Improved exp routine (later supplanted by an improved exp from Cray).
  - Improved set of compiler settings.
  - Achieved a 12.7% overall performance improvement.
  - S3D runs consume 6,000,000 CPU-hours of computer time per year, so 762,000 CPU-hours are potentially saved each year.
- PFLOTRAN (LANL code to model subsurface reactive flows):
  - Two key PETSc routines (17% of run time) and a third routine (7% of run time) were each accelerated by more than 2X using autotuning.
  - 40X speedup in the initialization phase, and 4X improvement in the I/O stage.
  - Overall 5X speedup on runs with 90,000 or more cores.

9 SMG2000

- SMG2000: a semicoarsening multigrid solver code, used for various applications including modeling of groundwater diffusion.
- PERI researchers integrated several tools, then developed a "smart" search technique to find an optimal tuning strategy among 581 million different choices.
- Achieved a 2.37X performance improvement on one key kernel.
- Achieved a 27% overall performance improvement.

10 Autotuning the central SMG2000 kernel

Outlined code (from the ROSE outliner):

  for (si = 0; si < stencil_size; si++)
    for (kk = 0; kk < hypre__mz; kk++)
      for (jj = 0; jj < hypre__my; jj++)
        for (ii = 0; ii < hypre__mx; ii++)
          rp[((ri + ii) + (jj * hypre__sy3)) + (kk * hypre__sz3)] -=
            ((Ap_0[((ii + (jj * hypre__sy1)) + (kk * hypre__sz1)) + (((A->data_indices)[i])[si])]) *
             (xp_0[((ii + (jj * hypre__sy2)) + (kk * hypre__sz2)) + ((*dxp_s)[si])]));

CHiLL transformation recipe (illustrated by the sketch below):
  permute([2,3,1,4])
  tile(0,4,TI)
  tile(0,3,TJ)
  tile(0,3,TK)
  unroll(0,6,US)
  unroll(0,7,UI)

Constraints on search:
  0 ≤ TI, TJ, TK ≤ 122
  0 ≤ UI ≤ 16
  0 ≤ US ≤ 10
  compilers ∈ {gcc, icc}
  Search space: 122 x 122 x 122 x 16 x 10 x 2 = 581,071,360 points

Credit: Mary Hall, Utah
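The recipe above reorders, strip-mines and unrolls loops without touching the arithmetic. As a rough, hand-written illustration (not CHiLL's generated output, and using a much simpler loop nest than the SMG2000 kernel), tiling and unrolling reshape a loop nest roughly as follows; TI and UI play the same roles as the tile size and unroll factor being searched over:

```c
/* Hand-written sketch of what "tile" and "unroll" do to a simple doubly nested
 * loop; CHiLL emits code of this general shape automatically.  TI and UI are
 * one particular point from the tuning search space. */
#define N 1024
#define TI 64      /* tile size for the j loop */
#define UI 4       /* unroll factor for the innermost loop */

void smooth(double *restrict y, const double *restrict x) {
    for (int jj = 0; jj < N; jj += TI)               /* tile (strip-mined) loop */
        for (int i = 0; i < N; i++)
            for (int j = jj; j < jj + TI; j += UI) { /* inner loop unrolled by UI=4 */
                y[i*N + j]   += x[i*N + j];
                y[i*N + j+1] += x[i*N + j+1];
                y[i*N + j+2] += x[i*N + j+2];
                y[i*N + j+3] += x[i*N + j+3];
            }
}
```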

11 Search for optimal tuning parameters for the SMG2000 kernel

- Parallel heuristic search (using Active Harmony) evaluates 490 total points and converges in 20 steps (a much-simplified sketch of such a search follows below).
- Selected parameters: TI=122, TJ=106, TK=56, UI=8, US=3, compiler=gcc.
- Performance gain on the residual computation: 2.37X.
- Performance gain on the full application: 27.23% improvement.

Credit: Mary Hall, Utah
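For illustration only, the following is a much-simplified stand-in for the kind of heuristic search Active Harmony performs: a greedy coordinate search over the five tuning parameters that evaluates only a few hundred of the 581 million possible configurations. The evaluate() function is a synthetic placeholder for "generate the variant, compile it, and time it on the target machine"; its optimum is seeded near the parameter values reported above purely for demonstration.

```c
/* Much-simplified stand-in for a parallel heuristic autotuning search:
 * greedy coordinate descent over (TI, TJ, TK, UI, US).  evaluate() is a
 * synthetic cost model, not a real timing run. */
#include <stdio.h>

static double evaluate(int ti, int tj, int tk, int ui, int us) {
    double t = 1.0;                                   /* placeholder cost model */
    t += (ti - 122) * (ti - 122) * 1e-5;
    t += (tj - 106) * (tj - 106) * 1e-5;
    t += (tk - 56)  * (tk - 56)  * 1e-5;
    t += (ui - 8) * (ui - 8) * 1e-2 + (us - 3) * (us - 3) * 1e-2;
    return t;
}

int main(void) {
    int p[5]  = { 32, 32, 32, 4, 4 };                 /* TI, TJ, TK, UI, US */
    int lo[5] = { 1, 1, 1, 1, 1 }, hi[5] = { 122, 122, 122, 16, 10 };
    double best = evaluate(p[0], p[1], p[2], p[3], p[4]);
    int evals = 1, improved = 1;

    while (improved) {                                /* one sweep = one "step" */
        improved = 0;
        for (int d = 0; d < 5; d++)
            for (int delta = -8; delta <= 8; delta += 16) {
                int old = p[d];
                p[d] = old + delta;
                if (p[d] < lo[d] || p[d] > hi[d]) { p[d] = old; continue; }
                double t = evaluate(p[0], p[1], p[2], p[3], p[4]);
                evals++;
                if (t < best) { best = t; improved = 1; } else p[d] = old;
            }
    }
    printf("best cost %.4f after %d evaluations: TI=%d TJ=%d TK=%d UI=%d US=%d\n",
           best, evals, p[0], p[1], p[2], p[3], p[4]);
    return 0;
}
```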

12 Autotuning the triangular solve kernel of the Nek5000 turbulence code

Compiler    | Active Harmony (u1,u2) | Exhaustive (u1,u2) | Speedup
pathscale   | (3,11)                 | (3,15)             | 1.93
gnu         | (5,13)                 | (5,7)              | 1.54
pgi         | (5,3)                  | (5,3)              | 1.70
cray        | (15,5)                 | (15,15)            | 1.63

Credit: Jeff Hollingsworth, Maryland

13 Autotuning LBMHD and GTC (LBNL work)

- LBMHD: implements a lattice Boltzmann method for magnetohydrodynamic plasma turbulence simulation.
- GTC: a gyrokinetic toroidal code for plasma turbulence modeling.

Credit: Samuel Williams, LBNL

14 LS3DF (LBNL work)

- LS3DF: the "linearly scaling 3-dimensional fragment" method.
- Developed at LBNL by Lin-Wang Wang and several collaborators.
- Used for electronic structure calculations – numerous applications in materials science and nanoscience.
- Employs a novel divide-and-conquer scheme, including a new approach for patching the fragments together.
- Achieves nearly linear scaling of computational cost versus problem size, compared with n^3 scaling in many other comparable codes.
- Potential for nearly linear scaling in performance versus number of cores.

Challenge: The initial implementation of LS3DF had disappointingly low performance and parallel scalability.

15 2-D domain patching scheme in LS3DF

[Figure: a 2-D fragment (i,j,k), showing the interior area, the buffer area, and the artificial surface passivation; fragment sizes such as 2x1 are indicated.]

Boundary effects are (nearly) cancelled out between the fragments:
  Total = Σ over fragments { F(2x2) - F(2x1) - F(1x2) + F(1x1) }

Credit: Lin-Wang Wang, LBNL

16 LBNL's performance analysis of LS3DF

LBNL researchers (funded through PERI) applied performance monitoring tools to analyze the run-time performance of LS3DF. Key issues uncovered:
- Limited concurrency in a key step, resulting in a significant load imbalance between processors.
  Solution: Modify the code for two-dimensional parallelism.
- Costly file I/O operations were used for data communication between processors.
  Solution: Replace all file I/O operations with MPI send-receive operations (see the sketch below).
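The second fix follows a standard pattern: processes that previously staged their exchange data through the file system instead send it directly with MPI. The following is a minimal sketch of the idea, not the actual LS3DF code; the ranks, buffer sizes, and tags are illustrative.

```c
/* Minimal sketch: neighboring ranks exchange a fragment of data directly with
 * MPI_Sendrecv instead of writing/reading it via shared files.  Buffer sizes,
 * partner choice, and tags are illustrative placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pair even/odd ranks and exchange data in memory rather than on disk. */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    double send_buf[4] = { rank, rank, rank, rank }, recv_buf[4];

    if (partner >= 0 && partner < size)
        MPI_Sendrecv(send_buf, 4, MPI_DOUBLE, partner, 0,
                     recv_buf, 4, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```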

17 Resulting performance of tuned LS3DF

- 135 Tflop/s on 36,864 cores of the Cray XT4 Franklin system at LBNL (40% efficiency).
- 224 Tflop/s on 163,840 cores of the BlueGene/P Intrepid system at Argonne Natl. Lab. (40% efficiency).
- 442 Tflop/s on 147,456 cores of the Cray XT5 Jaguar system at Oak Ridge Natl. Lab. (33% efficiency).

The authors of the LS3DF paper were awarded the 2008 ACM Gordon Bell Prize in a special category for "algorithm innovation."

18 Near-linear parallel scaling for up to 163,840 cores and up to 442 Tflop/s

19 Solar cell application of tuned LS3DF

[Figure panels: ZnTe bottom-of-conduction-band state; highest O-induced state.]

- Single-band material theoretical photovoltaic efficiency is limited to 30%.
- With an intermediate state, the photovoltaic efficiency may increase to 60%.
- Proposed material: ZnTe:O.
  - Is there really a gap?
  - Is there sufficient oscillator strength?
- LS3DF calculation used for a 3500-atom, 3% O alloy [one hour on 17,000 cores of the Franklin system].
- Result: There is a gap, and the O-induced states are highly localized.

Credit: Lin-Wang Wang, LBNL

20 For additional details:

Performance Tuning of Scientific Applications
Editors: Bailey (LBNL), Lucas (USC/ISI), Williams (LBNL); numerous individual authors of various chapters.
Publisher: CRC Computational Science, Jan.

Contents:
1. Introduction
2. Parallel computer architecture
3. Software interfaces to hardware counters
4. Measurement and analysis of parallel program performance using TAU and HPCToolkit
5. Trace-based tools
6. Large-scale numerical simulations on high-end computational platforms
7. Performance modeling: the convolution approach
8. Analytic modeling for memory access patterns based on Apex-MAP
9. The roofline model
10. End-to-end auto-tuning with Active Harmony
11. Languages and compilers for auto-tuning
12. Empirical performance tuning of dense linear algebra software
13. Auto-tuning memory-intensive kernels for multicore
14. Flexible tools supporting a scalable first-principles MD code
15. The community climate system model
16. Tuning an electronic structure code