GPTL: A simple and free general purpose tool for performance analysis and profiling April 8, 2014 Jim Rosinski NOAA/ESRL.

1 GPTL: A simple and free general purpose tool for performance analysis and profiling April 8, 2014 Jim Rosinski NOAA/ESRL

2 Outline
– Motivation and Basic Usage
– Auto-instrumentation
– Auto-profiling MPI routines
– Summary across threads and tasks
– Induced overhead
– Choice of underlying timing routine
– PAPI interface
– Utility functions
– Future work
NCAR SEA 2

3 Motivation
Needed something to simplify the following pattern, for an arbitrary number of regions to be timed:
    time = 0;
    for (i = 0; i < 10; i++) {
      gettimeofday (&tp1, 0);
      compute ();
      gettimeofday (&tp2, 0);
      delta = tp2.tv_sec - tp1.tv_sec + 1.e-6*(tp2.tv_usec - tp1.tv_usec);
      time += delta;
    }
    printf ("compute took %g seconds\n", time);
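The hand-written pattern on this slide can be sketched as a pair of small helpers. This is only an illustration of the bookkeeping the slide wants to avoid repeating; `elapsed_seconds` and `time_region` are hypothetical names and are not part of GPTL. Note that `tv_usec` is in microseconds, so it is scaled by 1e-6 to get seconds.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: seconds elapsed between two gettimeofday() samples.
 * tv_usec holds microseconds, hence the 1e-6 scale factor. */
double elapsed_seconds(struct timeval start, struct timeval stop)
{
    return (double)(stop.tv_sec - start.tv_sec)
         + 1.e-6 * (double)(stop.tv_usec - start.tv_usec);
}

/* Hypothetical helper: call work() n times and return the total
 * wallclock seconds spent inside those calls. */
double time_region(void (*work)(void), int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        struct timeval tp1, tp2;
        gettimeofday(&tp1, NULL);
        work();
        gettimeofday(&tp2, NULL);
        total += elapsed_seconds(tp1, tp2);
    }
    return total;
}
```

Every additional timed region needs its own accumulator and its own pair of samples, which is the repetition the slide is motivating against.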

4 Solution
    #include...
    ret = GPTLinitialize ();
    ret = GPTLstart ("total");
    for (i = 0; i < 10; i++) {
      ret = GPTLstart ("compute");
      compute ();
      ret = GPTLstop ("compute");
      ...
    }
    ret = GPTLstop ("total");
    ret = GPTLpr (0);

5 Results
Output file timing.0 contains:
                 Called  Wallclock
    total             1      3.983
    compute          10      3.877
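To make the "Called" and "Wallclock" columns concrete, here is a minimal sketch of a named-timer table that accumulates a call count and wallclock time per region. It is a toy written for this transcript under simplifying assumptions (single thread, linear search, fixed table size); every `toy_` name is hypothetical, and nothing here is GPTL's actual implementation.

```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define TOY_MAX_TIMERS 64
#define TOY_MAX_NAME   32

/* Hypothetical toy timer table: illustrates the idea only, not GPTL's code. */
struct toy_timer {
    char   name[TOY_MAX_NAME];
    long   count;    /* completed start/stop pairs */
    double wall;     /* accumulated wallclock seconds */
    double started;  /* timestamp of the pending start */
    int    running;  /* nonzero between start and stop */
};

static struct toy_timer toy_timers[TOY_MAX_TIMERS];
static int toy_ntimers = 0;

static double toy_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

/* Find a timer by name (linear search); create it if asked to. */
static struct toy_timer *toy_find(const char *name, int create)
{
    for (int i = 0; i < toy_ntimers; i++)
        if (strncmp(toy_timers[i].name, name, TOY_MAX_NAME - 1) == 0)
            return &toy_timers[i];
    if (!create || toy_ntimers == TOY_MAX_TIMERS)
        return NULL;
    struct toy_timer *t = &toy_timers[toy_ntimers++];
    strncpy(t->name, name, TOY_MAX_NAME - 1);  /* last byte stays zero */
    return t;
}

int toy_start(const char *name)
{
    struct toy_timer *t = toy_find(name, 1);
    if (t == NULL || t->running)
        return -1;
    t->running = 1;
    t->started = toy_now();
    return 0;
}

int toy_stop(const char *name)
{
    double now = toy_now();  /* sample first so the lookup is not timed */
    struct toy_timer *t = toy_find(name, 0);
    if (t == NULL || !t->running)
        return -1;
    t->wall += now - t->started;
    t->count++;
    t->running = 0;
    return 0;
}

void toy_print(FILE *fp)
{
    fprintf(fp, "%-16s %8s %10s\n", "", "Called", "Wallclock");
    for (int i = 0; i < toy_ntimers; i++)
        fprintf(fp, "%-16s %8ld %10.3f\n", toy_timers[i].name,
                toy_timers[i].count, toy_timers[i].wall);
}
```

Wrapping a loop body in `toy_start("compute")` / `toy_stop("compute")` ten times and then calling `toy_print(stdout)` yields one row per region with a call count of 10, mirroring the layout shown on the slide.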

6 Most of the API
    #include...
    ret = GPTLsetoption (PAPI_FP_OPS, 1);   // Enable a PAPI counter
    ret = GPTLsetutr (GPTLnanotime);        // Better wallclock timer
    ...
    ret = GPTLinitialize ();                // Once per process
    ret = GPTLstart ("total");              // Start a timer
    ret = GPTLstart ("compute");            // Start another timer
    compute ();                             // Do work
    ret = GPTLstop ("compute");             // Stop a timer
    ...
    ret = GPTLstop ("total");               // Stop a timer
    ret = GPTLpr (iam);                     // Print results
    ret = GPTLpr_summary (MPI_COMM_WORLD);  // Print results summary
                                            // across threads and tasks

7 Set options via Fortran namelist
Avoid recoding/recompiling by using Fortran namelist option:
    call gptlprocess_namelist ('my_namelist', unitno, ret)
Example contents of 'my_namelist':
    &gptlnl
     utr = 'nanotime'
     eventlist = 'GPTL_CI','PAPI_FP_OPS'
    /

8 Auto-instrumentation
Works with Intel, GNU, Pathscale, PGI, AIX
    # icc -g -finstrument-functions *.c -lgptl
    # gfortran -g -finstrument-functions *.f90 -lgptl
    # pgcc -g -Minstrument:functions *.c -lgptl
The compiler automatically inserts a call at function start:
    __cyg_profile_func_enter (void *this_fn, void *call_site);
And at function exit:
    __cyg_profile_func_exit (void *this_fn, void *call_site);

9 Auto-instrumentation (cont’d)
GPTL handles these entry points with:
    void __cyg_profile_func_enter (void *this_fn, void *call_site)
    {
      (void) GPTLstart_instr (this_fn);
    }
    void __cyg_profile_func_exit (void *this_fn, void *call_site)
    {
      (void) GPTLstop_instr (this_fn);
    }
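For readers who have not used these hooks, the following is a minimal standalone sketch of a pair of entry/exit handlers that only track call depth and count entries. It does not call GPTL; the bookkeeping variables (`hook_depth`, `hook_enter_count`, `hook_stack`) are hypothetical, and the sketch is not thread-safe. The `no_instrument_function` attribute is the GCC/Clang way to keep the hooks themselves from being instrumented.

```c
#define HOOK_MAX_DEPTH 128

/* Hypothetical bookkeeping for this sketch only (not thread-safe). */
static void *hook_stack[HOOK_MAX_DEPTH];  /* addresses of active functions */
static int   hook_depth = 0;              /* current call depth */
static long  hook_enter_count = 0;        /* total function entries seen */

/* Keep the compiler from instrumenting the hooks themselves, which
 * would otherwise recurse without end. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));

/* Called on entry to every instrumented function. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    hook_enter_count++;
    if (hook_depth < HOOK_MAX_DEPTH)
        hook_stack[hook_depth] = this_fn;  /* remember who is running */
    hook_depth++;
}

/* Called on exit from every instrumented function. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (hook_depth > 0)
        hook_depth--;
}
```

The hooks fire only in code compiled with `-finstrument-functions` (or the equivalent flag for other compilers); without that flag they are ordinary functions that nothing calls. The hooks receive raw function addresses rather than names.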

10 Auto-instrumentation (cont’d)
After running the app, convert addresses to names with:
    hex2name.pl [-demangle]

11 Dynamic call tree from auto-instrumentation
Stats for thread 0:
                            Called  Wallclock       max       min    FP_OPS
      total                      1     64.021    64.021    64.021  3.50e+08
      HPCC_Init                 11      0.157     0.157     0.000     95799
    * HPL_pdinfo               120      0.019     0.018     0.000     96996
    * HPL_all_reduce             7      0.043     0.036     0.000       448
    * HPL_broadcast             21      0.041     0.036     0.000       126
      HPL_pdlamch                2      0.004     0.004     0.000     94248
    * HPL_fprintf              240      0.001     0.000     0.000      1200
      HPCC_InputFileInit        41      0.001     0.001     0.000       194
      ReadInts                   2      0.000     0.000     0.000        12
      PTRANS                    21     22.667    22.667     0.000  4.19e+07
      MaxMem                     5      0.000     0.000     0.000       796
    * iceil_                   132      0.000     0.000     0.000       792
    * ilcm_                     14      0.000     0.000     0.000        84
      param_dump                18      0.000     0.000     0.000        84
      Cblacs_get                 5      0.000     0.000     0.000        30
      Cblacs_gridmap            35      0.005     0.001     0.000       225
    * Cblacs_pinfo               7      0.000     0.000     0.000        40
    * Cblacs_gridinfo           60      0.000     0.000     0.000       260

12 MPI Auto-instrumentation
To enable MPI auto-instrumentation, set this in macros.make:
– ENABLE_PMPI=yes

13 MPI Auto-instrumentation (cont’d)
Stats for thread 0:
                              Called  Wallclock       max       min  AVG_MPI_BYTES
    MPI_Init_thru_Finalize         1   8.70e-04  8.70e-04  8.70e-04              -
    MPI_Send                       1   5.10e-05  5.10e-05  5.10e-05      4.096e+03
    MPI_Recv                       3   2.63e-04  2.32e-04  1.50e-05      4.096e+03
    MPI_Ssend                      1   2.40e-05  2.40e-05  2.40e-05      4.096e+03
    MPI_Issend                     1   1.00e-05  1.00e-05  1.00e-05      4.096e+03
    MPI_Sendrecv                   1   1.80e-05  1.80e-05  1.80e-05      8.192e+03
    MPI_Irecv                      2   1.00e-05  9.00e-06  1.00e-06      4.096e+03
    MPI_Isend                      2   6.00e-06  4.00e-06  2.00e-06      4.096e+03
    MPI_Wait                       2   1.80e-05  1.70e-05  1.00e-06              -
    MPI_Waitall                    2   1.10e-05  1.10e-05  0.00e+00              -
    MPI_Barrier                    1   2.20e-05  2.20e-05  2.20e-05              -
    MPI_Bcast                      1   9.00e-06  9.00e-06  9.00e-06      4.096e+03

14 Induced Overhead
GPTL estimates its own overhead:
    overhead of 1 GPTLstart or GPTLstop call=1.28e-07 seconds
    Components are as follows:
    Fortran layer:             1.0e-09 =  1.5% of total
    Get thread number:         1.7e-08 = 13.3% of total
    Generate hash index:       1.9e-08 = 14.8% of total
    Find hashtable entry:      1.5e-08 = 11.7% of total
    Underlying timing routine: 7.0e-08 = 53.2% of total
    Misc start/stop functions: 7.0e-09 =  5.5% of total
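The largest component in this breakdown is the underlying timing routine (7.0e-08 seconds per call). One rough way to estimate such a per-call cost is to time many back-to-back calls and divide. The sketch below does this for gettimeofday(); `timer_cost_seconds` is a hypothetical name, and this is not necessarily how GPTL derives its own estimate.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: approximate cost in seconds of one gettimeofday()
 * call, estimated by timing n back-to-back calls and dividing by n. */
double timer_cost_seconds(long n)
{
    struct timeval first, last, scratch;

    if (n <= 0)
        return 0.0;

    gettimeofday(&first, NULL);
    for (long i = 0; i < n; i++)
        gettimeofday(&scratch, NULL);  /* the calls being measured */
    gettimeofday(&last, NULL);

    double total = (double)(last.tv_sec - first.tv_sec)
                 + 1.e-6 * (double)(last.tv_usec - first.tv_usec);
    return total / (double)n;
}
```

A large n (for example 1e6) is needed because a single call is far shorter than the microsecond resolution of gettimeofday() itself. The result varies by machine and kernel.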

15 Induced Overhead (cont’d)
Stats for thread 0:
                 Called  Wallclock       max       min  self_OH  parent_OH
    total             1      0.910     0.910     0.910    0.000      0.000
    1x1e7             1      0.022     0.022     0.022    0.000      0.000
    10x1e6           10      0.015  1.55e-03  1.36e-03    0.000      0.000
    100x1e5         100      0.014  1.80e-04  1.11e-04    0.000      0.000
    1000x1e4       1000      0.015  2.01e-05  1.11e-05    0.000      0.000
    1e4x1000      10000      0.015  1.04e-05  1.12e-06    0.000      0.001
    1e5x100      100000      0.015  9.05e-06  1.22e-07    0.001      0.006
    1e6x10      1.0e+06      0.026  8.74e-06  1.67e-08    0.011      0.062
    1e7x1       1.0e+07      0.180  8.74e-06  1.11e-08    0.108      0.618

16 Underlying timing routine
The default is gettimeofday().
On Intel architectures, switch to a register read, which has better granularity and much lower overhead:
– C or Fortran: GPTLsetutr(GPTLnanotime);
– Fortran: utr = 'nanotime' in namelist &gptlnl
– May cause problems on machines with a variable clock rate (e.g. "turbo mode")
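The slide does not say which register is read. On x86 the usual candidate is the time stamp counter, so the sketch below assumes that and uses the `__rdtsc()` compiler intrinsic, with a `clock_gettime` fallback on other architectures. `read_ticks` and `ticks_to_seconds` are hypothetical names, and the tick rate must be supplied by the caller. That dependence on a known, fixed rate is where a variable clock rate can cause trouble.

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Hypothetical helper: read a cheap, fine-grained tick counter.
 * Assumes the x86 time stamp counter; elsewhere falls back to
 * nanoseconds from CLOCK_MONOTONIC. */
uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return (uint64_t)__rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: convert a tick count to seconds. The result is
 * only as good as ticks_per_second, which the caller must know. */
double ticks_to_seconds(uint64_t ticks, double ticks_per_second)
{
    return (double)ticks / ticks_per_second;
}
```

Usage would be to take `read_ticks()` before and after a region and pass the difference to `ticks_to_seconds` with the counter frequency of the machine.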

17 PAPI details handled by GPTL
This call:
    GPTLsetoption (PAPI_FP_OPS, 1);
Implies:
    PAPI_library_init (PAPI_VER_CURRENT);
    PAPI_thread_init ((unsigned long (*)(void)) (pthread_self));
    PAPI_create_eventset (&EventSet[t]);
    PAPI_assign_eventset_component (EventSet[t], 0);
    PAPI_multiplex_init ();
    PAPI_set_multiplex (EventSet[t]);
    PAPI_add_event (EventSet[t], PAPI_FP_OPS);
    PAPI_start (EventSet[t]);
PAPI multiplexing is handled automatically and enabled only if needed.

18 timing.summary file generated by GPTLpr_summary(comm)
    name            ncalls  nranks  mean_time  std_dev  wallmax (rank)  wallmin (rank)
    Diag              1002       2      4.371    3.453    6.812 (   0)    1.929 (   1)
    MainLoop             2       2     53.364    0.007   53.369 (   0)   53.359 (   1)
    ZeroTendencies     200       2      0.086    0.030    0.107 (   0)    0.065 (   1)
    SaveFlux           200       2      0.149    0.048    0.183 (   0)    0.115 (   1)
    RHStendencies      800       2      0.421    0.148    0.526 (   0)    0.317 (   1)
    Vdtotal           1600       2     25.702    1.361   26.665 (   0)   24.740 (   1)
    Vdm                800       2     23.851    1.118   24.642 (   0)   23.060 (   1)
    vdmfinish          800       2      2.794    1.010    3.508 (   0)    2.080 (   1)
    Vdn                800       2      1.848    0.246    2.022 (   0)    1.674 (   1)
    Flux               800       2      4.818    1.135    5.620 (   1)    4.015 (   0)
    Force              800       2      1.901    0.110    1.979 (   1)    1.823 (   0)
    RKdiff             800       2      1.247    0.415    1.540 (   0)    0.953 (   1)
    TimeDiff           800       2      0.736    0.182    0.865 (   0)    0.608 (   1)
    Sponge             800       2      0.364    0.092    0.429 (   0)    0.299 (   1)
    pre_trisol         200       2      0.112    0.027    0.131 (   0)    0.093 (   1)
    Trisol             200       2      0.667    0.078    0.722 (   1)    0.612 (   0)
    post_trisol        200       2      0.082    0.012    0.090 (   0)    0.073 (   1)
    Vdmints            200       2      3.603    0.135    3.699 (   0)    3.508 (   1)
    Pstadv             200       2      0.849    0.044    0.880 (   1)    0.817 (   0)
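The mean_time and std_dev columns in this file are consistent with a plain mean over ranks and a sample (n-1) standard deviation. For Diag, (6.812 + 1.929)/2 = 4.3705, and the square root of ((6.812 - 4.3705)^2 + (1.929 - 4.3705)^2)/(2 - 1) is about 3.453. The sketch below computes those statistics for one region from an array of per-rank times. `summarize_region` and `struct region_summary` are hypothetical names, and it is an assumption that GPTL computes the columns this way internally. The function returns the variance, and std_dev is its square root.

```c
#include <stddef.h>

/* Hypothetical per-region summary, mirroring the columns on the slide. */
struct region_summary {
    double mean;      /* mean time over ranks */
    double variance;  /* sample variance (n-1); std_dev is its square root */
    double wallmax;   /* largest per-rank time */
    double wallmin;   /* smallest per-rank time */
    int    rank_max;  /* rank that reported wallmax */
    int    rank_min;  /* rank that reported wallmin */
};

/* times[r] is the wallclock time of one region on rank r.
 * Returns 0 on success, -1 on bad arguments. */
int summarize_region(const double *times, int nranks,
                     struct region_summary *out)
{
    if (times == NULL || out == NULL || nranks < 1)
        return -1;

    double sum = 0.0;
    out->wallmax = out->wallmin = times[0];
    out->rank_max = out->rank_min = 0;
    for (int r = 0; r < nranks; r++) {
        sum += times[r];
        if (times[r] > out->wallmax) {
            out->wallmax = times[r];
            out->rank_max = r;
        }
        if (times[r] < out->wallmin) {
            out->wallmin = times[r];
            out->rank_min = r;
        }
    }
    out->mean = sum / (double)nranks;

    double ss = 0.0;  /* sum of squared deviations from the mean */
    for (int r = 0; r < nranks; r++) {
        double d = times[r] - out->mean;
        ss += d * d;
    }
    out->variance = (nranks > 1) ? ss / (double)(nranks - 1) : 0.0;
    return 0;
}
```

In an MPI program the `times` array would first have to be collected from all ranks onto one rank; that step is omitted here.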

19 Utility functions
To print current memory usage at any point in your code:
– ret = GPTLprint_memusage ("user string")
Produces e.g.
– GPTLprint_memusage: user string size=19.5 MB rss=2.1 MB datastack=1.5 MB
To auto-profile current memory usage (at both function entry and exit points):
– ret = GPTLsetoption (GPTLdopr_memusage, 1);
Retrieve wallclock, usr, sys timestamps to user code:
– ret = GPTLstamp (&wallclock, &usr, &sys);
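As a rough picture of what a wallclock/usr/sys timestamp involves, the sketch below builds one from the POSIX calls gettimeofday() and getrusage(). `toy_stamp` is a hypothetical stand-in written for this transcript; the real GPTLstamp may use different sources and a different time origin.

```c
#include <stddef.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds as a double. */
static double tv_to_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

/* Hypothetical stand-in for illustration; not GPTLstamp itself.
 * wall: seconds since the epoch
 * usr:  user CPU seconds consumed by this process so far
 * sys:  system CPU seconds consumed by this process so far
 * Returns 0 on success, -1 on failure. */
int toy_stamp(double *wall, double *usr, double *sys)
{
    struct timeval now;
    struct rusage ru;

    if (wall == NULL || usr == NULL || sys == NULL)
        return -1;
    if (gettimeofday(&now, NULL) != 0)
        return -1;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    *wall = tv_to_seconds(now);
    *usr  = tv_to_seconds(ru.ru_utime);
    *sys  = tv_to_seconds(ru.ru_stime);
    return 0;
}
```

Taking two stamps and subtracting gives elapsed wallclock, user, and system time for the code in between.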

20 Future Work
– XML output
– Port to GPU
– Dynamic thread allocation for PTHREADS option
– Autoconf?

21 Source and Documentation
Source: https://github.com/jmrosinski/GPTL
– git clone git@github.com:jmrosinski/GPTL.git
Web-based documentation:
– jmrosinski.github.io/GPTL
Feel free to email me: james.rosinski@noaa.gov

