GPTL: A simple and free general purpose tool for performance analysis and profiling. April 8, 2014. Jim Rosinski, NOAA/ESRL.

Outline
- Motivation and basic usage
- Auto-instrumentation
- Auto-profiling MPI routines
- Summary across threads and tasks
- Induced overhead
- Choice of underlying timing routine
- PAPI interface
- Utility functions
- Future work
NCAR SEA

Motivation
Needed something to simplify, for an arbitrary number of regions to be timed:

struct timeval tp1, tp2;
double delta, time = 0;
for (i = 0; i < 10; i++) {
  gettimeofday (&tp1, 0);
  compute ();
  gettimeofday (&tp2, 0);
  delta = tp2.tv_sec - tp1.tv_sec + 1.e-6*(tp2.tv_usec - tp1.tv_usec);
  time += delta;
}
printf ("compute took %g seconds\n", time);

Solution

#include <gptl.h>
...
ret = GPTLinitialize ();
ret = GPTLstart ("total");
for (i = 0; i < 10; i++) {
  ret = GPTLstart ("compute");
  compute ();
  ret = GPTLstop ("compute");
  ...
}
ret = GPTLstop ("total");
ret = GPTLpr (0);

Results
Output file timing.0 lists each region ("total" and "compute") with its call count (Called) and accumulated wallclock time (Wallclock).

Most of the API

#include <gptl.h>
...
ret = GPTLsetoption (PAPI_FP_OPS, 1);   // Enable a PAPI counter
ret = GPTLsetutr (GPTLnanotime);        // Better wallclock timer
...
ret = GPTLinitialize ();                // Once per process
ret = GPTLstart ("total");              // Start a timer
ret = GPTLstart ("compute");            // Start another timer
compute ();                             // Do work
ret = GPTLstop ("compute");             // Stop a timer
...
ret = GPTLstop ("total");               // Stop a timer
ret = GPTLpr (iam);                     // Print results
ret = GPTLpr_summary (MPI_COMM_WORLD);  // Print results summary
                                        // across threads and tasks

Set options via Fortran namelist
Avoid recoding/recompiling by using the Fortran namelist option:

call gptlprocess_namelist ('my_namelist', unitno, ret)

Example contents of 'my_namelist':

&gptlnl
  utr = 'nanotime'
  eventlist = 'GPTL_CI','PAPI_FP_OPS'
/

Auto-instrumentation
Works with Intel, GNU, Pathscale, PGI, AIX compilers:

# icc -g -finstrument-functions *.c -lgptl
# gfortran -g -finstrument-functions *.f90 -lgptl
# pgcc -g -Minstrument:functions *.c -lgptl

Inserts automatically at function start:
__cyg_profile_func_enter (void *this_fn, void *call_site);
And at function exit:
__cyg_profile_func_exit (void *this_fn, void *call_site);

Auto-instrumentation (cont'd)
GPTL handles these entry points with:

void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
  (void) GPTLstart_instr (this_fn);
}

void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
  (void) GPTLstop_instr (this_fn);
}

Auto-instrumentation (cont'd)
After running the app, convert addresses in the output to function names with:

hex2name.pl [-demangle]

Dynamic call tree from auto-instrumentation
Stats for thread 0 show the dynamic call tree, with columns Called, Wallclock, max, min, and FP_OPS per region; entries marked "*" were reached from more than one parent. Regions profiled in this HPCC run include:

total
HPCC_Init
HPL_pdinfo *
HPL_all_reduce *
HPL_broadcast
HPL_pdlamch *
HPL_fprintf
HPCC_InputFileInit
ReadInts
PTRANS
MaxMem *
iceil_ *
ilcm_ *
param_dump
Cblacs_get
Cblacs_gridmap *
Cblacs_pinfo *
Cblacs_gridinfo

MPI Auto-instrumentation
To enable MPI auto-instrumentation, set this in macros.make:

ENABLE_PMPI=yes

MPI Auto-instrumentation (cont'd)
Stats for thread 0 report Called, Wallclock, max, min, and AVG_MPI_BYTES (average bytes per call) for each intercepted routine, e.g.:

MPI_Init_thru_Finalize
MPI_Send
MPI_Recv
MPI_Ssend
MPI_Issend
MPI_Sendrecv
MPI_Irecv
MPI_Isend
MPI_Wait
MPI_Waitall
MPI_Barrier
MPI_Bcast

Induced Overhead
GPTL estimates its own overhead. Example: overhead of 1 GPTLstart or GPTLstop call = 1.28e-07 seconds, with these components:

Fortran layer:             1.0e-09 =  1.5% of total
Get thread number:         1.7e-08 = 13.3% of total
Generate hash index:       1.9e-08 = 14.8% of total
Find hashtable entry:      1.5e-08 = 11.7% of total
Underlying timing routine: 7.0e-08 = 53.2% of total
Misc start/stop functions: 7.0e-09 =  5.5% of total

Induced Overhead (cont'd)
Stats for thread 0 add two columns, self_OH and parent_OH, estimating the overhead a timer induces on itself and on its parents. The example scales the number of start/stop calls across regions (from roughly 1e1 calls of large work units up to 1e7 calls of tiny ones) while keeping total work fixed; the numeric values are not recoverable here.

Underlying timing routine
Default is gettimeofday(). For Intel architectures, change to a register read, which has better granularity and much lower overhead:
- C or Fortran: GPTLsetutr (GPTLnanotime);
- Fortran: utr = 'nanotime' in namelist &gptlnl
- May cause problems on machines with variable clock rate (e.g. "turbo mode")

PAPI details handled by GPTL
This call:

GPTLsetoption (PAPI_FP_OPS, 1);

Implies:

PAPI_library_init (PAPI_VER_CURRENT);
PAPI_thread_init ((unsigned long (*)(void)) pthread_self);
PAPI_create_eventset (&EventSet[t]);
PAPI_assign_eventset_component (EventSet[t], 0);
PAPI_multiplex_init ();
PAPI_set_multiplex (EventSet[t]);
PAPI_add_event (EventSet[t], PAPI_FP_OPS);
PAPI_start (EventSet[t]);

PAPI multiplexing is handled automatically, and enabled only if needed.

timing.summary file generated by GPTLpr_summary(comm)
Columns are: name, ncalls, nranks, mean_time, std_dev, wallmax (rank), wallmin (rank). Regions in this example include:

Diag, MainLoop, ZeroTendencies, SaveFlux, RHStendencies, Vdtotal, Vdm, vdmfinish, Vdn, Flux, Force, RKdiff, TimeDiff, Sponge, pre_trisol, Trisol, post_trisol, Vdmints, Pstadv

Utility functions
To print current memory usage at any point in your code:

ret = GPTLprint_memusage ("user string");

Produces e.g.:

GPTLprint_memusage: user string size=19.5 MB rss=2.1 MB datastack=1.5 MB

To auto-profile current memory usage (at both function entry and exit points):

ret = GPTLsetoption (GPTLdopr_memusage, 1);

To retrieve wallclock, usr, and sys timestamps in user code:

ret = GPTLstamp (&wallclock, &usr, &sys);

Future Work
- XML output
- Port to GPU
- Dynamic thread allocation for PTHREADS option
- Autoconf?

Source and Documentation
Source:
  git clone
Web-based documentation:
  jmrosinski.github.io/GPTL
Feel free to email me.