Cray Optimization and Performance Tools
Harvey Wasserman, Woo-Sun Yang
NERSC User Services Group
Cray XE6 Workshop, February 7-8, 2011
NERSC Oakland Scientific Facility

Outline
Introduction, motivation, some terminology
Using CrayPat
Using Apprentice2
Hands-on lab

Why Analyze Performance?
Improving performance on HPC systems has compelling economic and scientific rationales.
–Dave Bailey: improving the performance of a single application that consumes 5% of the machine's cycles by 20% is worth about $1,500,000 over 10 years
–The scientific benefit is probably much higher
Goal: solve problems faster; solve larger problems
Accurately state computational need
Only that which can be measured can be improved
The challenge is mapping the application to an increasingly complex system architecture
–or set of architectures

Performance Analysis Issues
A difficult process for real codes; many ways of measuring and reporting
Very broad space: not just time at one size
–Strong scaling: fixed total problem size as the processor count grows
–Weak scaling: problem size scaled up with the processor count (fixed work and memory per processor, so ideally fixed execution time)
A variety of pitfalls abound
–Compare parallel performance to the best uniprocessor algorithm, not just the parallel program on 1 processor (unless it is the best)
–Be careful relying on any single number
Amdahl's Law
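Amdahl's Law is only named here; for reference, a standard statement (not from the slide) of speedup, parallel efficiency, and Amdahl's bound for a code with serial fraction f is:

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad
S(p) \le \frac{1}{f + (1-f)/p} \;\longrightarrow\; \frac{1}{f} \quad \text{as } p \to \infty
\]

With f = 0.05, for example, the speedup can never exceed 20 no matter how many processors are used.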

Performance Questions
How can we tell if a program is performing well? Or isn't?
If performance is not "good," how can we pinpoint why and identify the causes?
What can we do about it?

Performance Evaluation as an Iterative Process
[Diagram: a cycle between vendor and user. The vendor improves the machine and sells it; the user buys the machine and improves the code.]
Overall goal: more / better science results

Supercomputer Architecture

Performance Metrics
Primary metric: application time
–but it gives little indication of efficiency
Derived measures:
–rates (e.g., messages per unit time, floating-point operations per second, clocks per instruction), cache utilization
Indirect measures:
–speedup, efficiency, scalability

Performance Metrics
Most basic:
–counts: how many MPI_Send calls?
–duration: how much time in MPI_Send?
–size: what size of message in MPI_Send? (MPI performance as a function of message size)
[Chart: message time T versus message size L.] The usual model is T_msg = t_s + t_w L, where t_s is the startup cost (latency) and t_w is the cost per word; the inverse of the slope, 1/t_w, is the asymptotic bandwidth.
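The t_s and t_w parameters can be estimated with a simple ping-pong test. The following is a minimal sketch (not from the slides); the message sizes, repetition count, and the use of the cc compiler wrapper are illustrative assumptions.

/* pingpong.c: estimate startup cost (t_s) and per-byte cost (t_w) by timing
 * round-trip messages between ranks 0 and 1.
 * Build (e.g.): cc pingpong.c -o pingpong ; run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    char *buf = malloc(1 << 20);                  /* up to 1 MiB messages */

    for (int size = 0; size <= (1 << 20); size = size ? size * 4 : 8) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %12.3e s\n", size, t);
    }
    /* Fitting t = t_s + t_w * size over these points gives t_s as the
     * intercept and t_w as the slope (per byte here, per word on the slide). */
    free(buf);
    MPI_Finalize();
    return 0;
}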

Performance Data Collection
Two dimensions: when data collection is triggered, and how the data are presented (see below).
When data collection is triggered:
–Externally (asynchronous): sampling. The OS interrupts execution at regular intervals and records the location (program counter) and/or other events.
–Internally (synchronous): tracing. Event based; relies on code instrumentation, automatic or manual.
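As a toy contrast of the two trigger styles (a hypothetical sketch, far simpler than what a real tool does), the code below samples asynchronously, via a SIGPROF timer, which phase the program is in, while also timing each phase with explicit, synchronous probes:

/* sample_vs_trace.c: contrast asynchronous sampling (SIGPROF ticks counting
 * the current phase) with synchronous instrumentation (explicit timers).
 * Toy example; phase lengths are arbitrary. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t current_phase = 0;      /* set by the program    */
static volatile sig_atomic_t samples[2] = {0, 0};    /* filled by the sampler */

static void on_tick(int sig) { (void)sig; samples[current_phase]++; }

static double wall(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

static void busy(double seconds) {                   /* burn CPU for a while */
    double t0 = wall();
    while (wall() - t0 < seconds) { /* spin */ }
}

int main(void) {
    /* Asynchronous: a profiling timer fires every 10 ms of CPU time. */
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sigaction(SIGPROF, &sa, NULL);
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    /* Synchronous: explicit probes around each phase. */
    double t0 = wall();
    current_phase = 0; busy(0.3);
    double t1 = wall();
    current_phase = 1; busy(0.6);
    double t2 = wall();

    printf("traced:  phase0 %.2f s, phase1 %.2f s\n", t1 - t0, t2 - t1);
    printf("sampled: phase0 %d ticks, phase1 %d ticks\n",
           (int)samples[0], (int)samples[1]);
    return 0;
}

A real sampler records the program counter rather than a phase flag and attributes each tick to a function or source line; the point here is only that sampling observes the program from outside at timer ticks, while tracing records events the code itself announces.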

Instrumentation
Instrumentation: adding measurement probes to the code to observe its execution.
Different techniques depending on where the instrumentation is added; each has different overheads and levels of accuracy.
[Diagram (Karl Fuerlinger, UCB): probes can be inserted at many levels of the tool chain (source code, preprocessor, compiler, linker/libraries, executable, runtime image, OS/VM), all producing performance data when the program runs.]
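The simplest manual, source-level probe is just a timer around a region of interest. Below is a minimal sketch using MPI_Wtime; CrayPat also provides a region API (PAT_region_begin/PAT_region_end in pat_api.h; see man pat_build or pat_help for the exact interface on your system), and solver_step here is a hypothetical user routine.

/* manual_probe.c: manual source-level instrumentation of one region,
 * reporting the fastest and slowest rank (a quick load-imbalance check). */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical compute region; ranks deliberately do unequal work. */
static void solver_step(int rank) { usleep(1000 * (rank + 1)); }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();        /* probe: start */
    solver_step(rank);
    double dt = MPI_Wtime() - t0;   /* probe: stop  */

    double tmax, tmin;
    MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dt, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("solver_step: min %.4f s, max %.4f s across ranks\n",
               tmin, tmax);

    MPI_Finalize();
    return 0;
}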

Source-Level Instrumentation
The goal is to allow performance measurement without modification of user source code.

Performance Instrumentation
Approach: use a tool to "instrument" the code
1. Transform a binary executable before executing
2. Include "hooks" for important events
3. Run the instrumented executable to capture those events and write out a raw data file
4. Use some tool(s) to interpret the data

Performance Data Collection
How performance data are presented:
–Profile: combines sampled events over time
  Reflects runtime behavior of program entities: functions, loops, basic blocks, user-defined "semantic" entities
  Good for low-overhead performance assessment; helps to expose performance hotspots ("bottleneckology")
–Trace file: sequence of events over time
  Gathers individual time-stamped events (and their arguments)
  Shows when (and where) events took place on a global timeline
  Common for message-passing events (sends/receives)
  Generates a large volume of performance data; generally intrusive
  Becomes very difficult at large processor counts and large numbers of events
  Example in the Apprentice section at the end of the tutorial

Performance Analysis Difficulties
Tool overhead
Data overload
The user knows the code better than the tool
Choice of approaches, choice of tools
CrayPat is an attempt to overcome several of these
–by attempting to include intelligence that identifies problem areas
–however, in general the problems remain

Performance Tools at NERSC
NERSC: IPM (Integrated Performance Monitor)
Vendor tools:
–CrayPat
Community tools (not all fully supported):
–TAU (U. Oregon, via ACTS)
–OpenSpeedShop (DOE/Krell)
–HPCToolkit (Rice U.)
–PAPI (Performance Application Programming Interface)

Profiling: Inclusive vs. Exclusive
Inclusive time for main:
–100 secs: everything spent in main and in the routines it calls
Exclusive time for main:
–10 secs: the 100 secs minus the 90 secs spent in main's callees
–exclusive time is sometimes called "self" time
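A concrete (hypothetical) illustration of the same bookkeeping: driver() below does a little of its own work and spends most of its time in two callees, so its inclusive time is large while its exclusive ("self") time is small.

/* inclusive_vs_exclusive.c: toy timing of driver(), which spends most of its
 * time in kernel_a() and kernel_b(). Phase lengths are arbitrary. */
#include <stdio.h>
#include <time.h>

static double wall(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

static void spin(double seconds) {
    double t0 = wall();
    while (wall() - t0 < seconds) { /* burn CPU */ }
}

static void kernel_a(void) { spin(0.4); }
static void kernel_b(void) { spin(0.5); }

static void driver(void) {
    double t0 = wall(), child = 0.0, t;

    spin(0.1);                              /* driver's own ("self") work */

    t = wall(); kernel_a(); child += wall() - t;
    t = wall(); kernel_b(); child += wall() - t;

    double inclusive = wall() - t0;         /* driver plus everything it calls */
    double exclusive = inclusive - child;   /* "self" time only */
    printf("driver: inclusive %.2f s, exclusive %.2f s\n", inclusive, exclusive);
}

int main(void) { driver(); return 0; }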

USING CRAYPAT (Woo-Sun Yang)

USING CRAY'S APPRENTICE TOOL (Harvey Wasserman)

Using Apprentice
Optional visualization tool for Cray's perftools data
Use it in an X Windows environment
Uses a data file as input (XXX.ap2) that is prepared by pat_report
1. module load perftools
2. ftn -c mpptest.f
3. ftn -o mpptest mpptest.o
4. pat_build -u -g mpi mpptest
5. aprun -n 16 mpptest+pat
6. pat_report mpptest+pat+PID.xf > my_report
7. app2 [--limit_per_pe tags] [XXX.ap2]

Opening Files
Identify files on the command line or via the GUI.

Apprentice Basic View
Can select a new (additional) data file and do a screen dump
Can select other views of the data
Can drag the "calipers" to focus the view on portions of the run
[Screenshot callouts mark which parts of the display are useful and which are not.]

Apprentice Call Tree Report
Horizontal size = cumulative time in the node's children
Vertical size = time in computation
Green nodes: no callees
Stacked bar charts show load-balancing info: yellow = maximum, purple = average, light blue = minimum
The calipers work here too
Right-click to view source

Apprentice Call Tree Report
The red arc identifies the path to the highest detected load imbalance.
The call tree stops there because nodes were filtered out. To see the hidden nodes, right-click on the node attached to the marker and select "unhide all children" or "unhide one level".
Double-click on the marker for more information about the load imbalance.

Apprentice Event Trace Views
Run the code with setenv PAT_RT_SUMMARY 0 so that full event traces are recorded
Caution: can generate enormous data files and take forever

Apprentice Traffic Report
Shows message traces as a function of time
Look for large blocks of barriers held up by a single processor
Zoom is important; also, run just a portion of your simulation
Scroll, zoom, and filter by right-clicking on the trace

Apprentice Traffic Report: Zoomed
Hovering the mouse pops up a window showing the source location.

Tracing Analysis Example

Mosaic View
Shows the interprocessor communication topology with color-coded intensity
Colors show average time (green = low, red = high)
Right-click for more options
Very difficult to interpret by itself; use the CrayPat message statistics with it.

Mosaic View
[Figure: mosaic views for six benchmark codes: SP, CG, LU, MG, FT, and BT.]

NERSC6 Application Benchmark Characteristics
(Benchmark; science area; algorithm space; base-case concurrency; problem description)
CAM; Climate (BER); Navier-Stokes CFD; 56, 240, strong scaling; D grid (~0.5 deg resolution), 240 timesteps
GAMESS; Quantum Chemistry (BES); dense linear algebra; 384, 1024 (same as Ti-09); DFT gradient, MP2 gradient
GTC; Fusion (FES); PIC, finite difference; 512, 2048, weak scaling; 100 particles per cell
IMPACT-T; Accelerator Physics (HEP); PIC, FFT component; 256, 1024, strong scaling; 50 particles per cell
MAESTRO; Astrophysics (HEP); low-Mach hydro, block-structured-grid multiphysics; 512, 2048, weak scaling; 16 32^3 boxes per proc, 10 timesteps
MILC; Lattice Gauge Physics (NP); conjugate gradient, sparse matrix, FFT; 256, 1024, 8192, weak scaling; 8x8x8x9 local grid, ~70,000 iters
PARATEC; Materials Science (BES); DFT, FFT, BLAS3; 256, 1024, strong scaling; 686 atoms, 1372 bands, 20 iters

NERSC6 Benchmarks Communication Topology
[Figure (from IPM): communication-topology maps for MILC, PARATEC, IMPACT-T, CAM, MAESTRO, and GTC.]

Sample of CI & %MPI
*CI is the computational intensity: the ratio of the number of floating-point operations to the number of memory operations.
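In symbols, with a small illustrative example that is not from the slide:

\[
\mathrm{CI} = \frac{\text{floating-point operations}}{\text{memory operations}}
\]

For instance, the DAXPY update \(y_i \leftarrow a\,x_i + y_i\) performs 2 flops per element against 3 memory operations (load \(x_i\), load \(y_i\), store \(y_i\)), giving \(\mathrm{CI} \approx 0.67\).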

For More Information
Using Cray Performance Analysis Tools, S-2376-51
man craypat
man pat_build
man pat_report
man pat_help (a very useful tutorial program)
man app2
man hwpc
man intro_perftools
man papi
man papi_counters

For More Information “Performance Tuning of Scientific Applications,” CRC Press

Exercise
Same code, same problem size, run on the same 24 cores. What is different? Why might one run perform better than the other? What performance characteristics are different?

Exercise
Get the sweep3d code and untar it.
To build: type 'make mpi'
Instrument for MPI and user functions
Get an interactive batch session with 24 cores
Run 3 sweep3d cases on 24 cores, creating Apprentice traffic/mosaic views:
–cp input1 input; aprun -n 24 …
–cp input2 input; aprun -n 24 …
–cp input3 input; aprun -n 24 …
View the results from each run in Apprentice and try to explain what you see.

Performance Metrics
CPU time = N_inst × CPI / clock rate, i.e.
CPU time = (instructions / program) × (cycles / instruction) × (seconds / cycle)
[Diagram: arrows indicate which layers (application, compiler, instruction set architecture, technology) influence each of the three factors.]
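A worked example with illustrative numbers (not from the slide): a program that executes 10^9 instructions with an average CPI of 1.5 on a 2 GHz clock takes

\[
\text{CPU time} = 10^{9}\ \text{instructions} \times 1.5\ \frac{\text{cycles}}{\text{instruction}} \times \frac{1}{2\times 10^{9}}\ \frac{\text{seconds}}{\text{cycle}} = 0.75\ \text{s}.
\]

Reducing the instruction count (better algorithm or compiler), the CPI (better instruction mix, fewer stalls), or the cycle time each lowers the product.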