SE-292 High Performance Computing Profiling and Performance R. Govindarajan

1 SE-292 High Performance Computing: Profiling and Performance. R. Govindarajan, govind@serc

2 Performance Measurement and Tuning: Tools help you measure the performance of a program. To determine program execution time:
    % time a.out
    real 0m0.019s
    user 0m0.014s
    system 0m0.002s
This gives elapsed, user, and system time. Other tools identify the parts of your program that matter most for performance improvement, so that you can concentrate optimization efforts on those parts.

3 Amdahl's Law: Which part of the program should you optimize? Amdahl's Law: speedup is limited by the part of the program that does not benefit from the optimization. In other words, Sp ≤ 1/s, where s is the fraction of execution time that does not benefit. This implies concentrating on the part of the program where the most time is spent.
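
A small numeric sketch of this bound, assuming the usual form of Amdahl's Law in which s is the fraction of execution time that does not benefit and p is the factor by which the rest is sped up. The function name and the sample value of s are illustrative, not from the slides.

```python
def amdahl_speedup(s: float, p: float) -> float:
    """Overall speedup when a fraction s of the run time is untouched
    and the remaining (1 - s) is made p times faster."""
    if not 0.0 < s <= 1.0:
        raise ValueError("s must be in (0, 1]")
    if p <= 0.0:
        raise ValueError("p must be positive")
    return 1.0 / (s + (1.0 - s) / p)


if __name__ == "__main__":
    s = 0.25  # assumed: 25% of the time is in code we cannot speed up
    for p in (2, 10, 100, 1_000_000):
        speedup = amdahl_speedup(s, p)
        print(f"p = {p:>9}: speedup = {speedup:.3f} (bound 1/s = {1 / s:.1f})")
```

However large p becomes, the speedup stays below 1/s, which is why the part where most of the time is spent is the one worth optimizing.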

4 Timing: Timing means measuring the time spent in specific parts of your program. Examples of 'parts': functions, loops, … Recall the different kinds of time that can be measured (real/wall-clock/elapsed vs. virtual/CPU). 1. Decide which time you are interested in measuring, and at what granularity. 2. Find out what mechanisms are available and their granularity of measurement.

5 Timing Mechanisms: gettimeofday: real time in seconds and microseconds since 00:00 1/1/1970 (Q: overflow of a 32-bit seconds value?); getrusage; the times system call; high-resolution timers, for example gethrtime.
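
A minimal sketch of these mechanisms as exposed by Python's standard library on a UNIX-like system: `time.perf_counter` stands in for a high-resolution timer and `resource.getrusage` reports user and system CPU time. The helper names `measure` and `busy_loop` are made up for illustration. The last lines show where a signed 32-bit seconds counter reaches its maximum.

```python
import resource
import time
from datetime import datetime, timezone


def measure(work):
    """Run work() and return (wall, user, system) times in seconds."""
    wall_start = time.perf_counter()  # high-resolution monotonic clock
    cpu_start = resource.getrusage(resource.RUSAGE_SELF)
    work()
    cpu_end = resource.getrusage(resource.RUSAGE_SELF)
    wall_end = time.perf_counter()
    return (wall_end - wall_start,
            cpu_end.ru_utime - cpu_start.ru_utime,
            cpu_end.ru_stime - cpu_start.ru_stime)


def busy_loop(n=200_000):
    """Hypothetical workload: pure computation, so mostly user time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    wall, user, system = measure(busy_loop)
    print(f"real {wall:.6f}s  user {user:.6f}s  system {system:.6f}s")
    # Largest value of a signed 32-bit count of seconds since 1/1/1970.
    last = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)
    print("signed 32-bit seconds overflow after", last.isoformat())
```

Measuring wall-clock and CPU time side by side makes the choice in step 1 of the Timing slide concrete: a program that waits on I/O shows a large real time but little user or system time.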

6 Profiling: Profiler: a tool that helps you identify the 'important' parts of your program, on which to concentrate your optimization efforts. Profile: a breakdown (of execution time) across the different parts of the program. Profiling can be done by adding statements to your program (instrumentation), so that data is gathered during execution, written out, and possibly processed later. Automation: a profiling tool adds those instructions to your program for you.
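
Instrumentation by hand can be sketched as follows. This is an illustration of the idea, not how any particular profiler works: a decorator (a hypothetical `instrumented`) adds the counting and timing statements around each function, and the gathered data is printed at the end. The functions `foo` and `bar` are placeholders.

```python
import functools
import time
from collections import defaultdict

call_counts = defaultdict(int)    # function name -> number of calls
cpu_seconds = defaultdict(float)  # function name -> CPU time, callees included


def instrumented(func):
    """Wrap func so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        start = time.process_time()
        try:
            return func(*args, **kwargs)
        finally:
            cpu_seconds[func.__name__] += time.process_time() - start
    return wrapper


@instrumented
def foo(n):
    return sum(i * i for i in range(n))


@instrumented
def bar(n):
    return sum(foo(1000) for _ in range(n))


if __name__ == "__main__":
    bar(200)
    for name in sorted(call_counts):
        print(f"{name:>5}: {call_counts[name]:>5} calls, "
              f"{cpu_seconds[name]:.6f}s (inclusive)")
```

The counts are exact, while the times include the cost of the added statements themselves, which is one way instrumentation can distort what it measures.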

7 Profiling Mechanisms: Levels of granularity typically supported: function level; statement level; basic block level. A basic block is a sequence of contiguous instructions in a program with a single entry point (the first instruction in the basic block) and a single exit point (the last instruction in the basic block). Two kinds of profile data: execution time and execution counts. We will look at examples of profiling mechanisms at the function and basic block levels.

8 Prof: UNIX Function Level Profiling. Usage:
    % cc -p program.c    / generates instrumented a.out
    % a.out              / execution; instrumentation generates data and mon.out
    % prof               / processing of profile data
The output gives a function-by-function breakdown of execution time, which is useful in identifying which functions to concentrate optimization efforts on.

9 Output:
    %Time  Seconds  CumSecs  #Calls  Name
     56.8     0.50     0.50    1000  _baz
     27.4     0.24     0.74    1000  _bar
     15.9     0.14     0.88     500  _foo
      …
      0.0     0.00     0.88       1  _main
      0.0     0.00     0.88       3  _strcpy

10 Prof: How it Works. The instrumentation does three things: 1. At entry to each function: increment an execution count for that function. 2. At program entry: call the system call profil to gather execution times. 3. At program exit: write the profile data to an output file that can later be processed by prof. profil() is an execution time profiler: it generates an execution time histogram, giving the execution time in each function.

11 Profil: What it does. One of the parameters in the call to profil is a buffer, used as an array of counters initialized to 0. Array elements are associated with contiguous regions of program text. During execution, the PC value is sampled once every clock tick (default: 10 msec), triggered by the timer interrupt, and the corresponding buffer element is incremented. Each element is later associated with a function; a time weight of 10 msec per sample is used to estimate CPU times.
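
The histogram mechanism can be simulated in a few lines. This sketch does not call the real profil; it assumes a made-up text segment, a made-up symbol table, and a hand-written list of sampled PC values, and only shows how counters per region turn into per-function time estimates.

```python
TICK_SECONDS = 0.010  # one sample per clock tick (10 msec, as on the slide)


def build_histogram(pc_samples, text_start, text_size, region_size):
    """Count samples per contiguous region of program text."""
    buckets = [0] * ((text_size + region_size - 1) // region_size)
    for pc in pc_samples:
        offset = pc - text_start
        if 0 <= offset < text_size:  # ignore PCs outside the text segment
            buckets[offset // region_size] += 1
    return buckets


def attribute_to_functions(buckets, text_start, region_size, symbols):
    """symbols: list of (name, start, end) address ranges, end exclusive.
    Each bucket is charged to the function containing its first address."""
    times = {name: 0.0 for name, _, _ in symbols}
    for index, count in enumerate(buckets):
        address = text_start + index * region_size
        for name, start, end in symbols:
            if start <= address < end:
                times[name] += count * TICK_SECONDS
                break
    return times


if __name__ == "__main__":
    # Hypothetical layout: three functions of 0x100 bytes each.
    symbols = [("_foo", 0x1000, 0x1100),
               ("_bar", 0x1100, 0x1200),
               ("_baz", 0x1200, 0x1300)]
    samples = [0x1004, 0x1010, 0x10F8, 0x1100, 0x11FC,
               0x1200, 0x1204, 0x1250, 0x12A0, 0x12FC]
    buckets = build_histogram(samples, 0x1000, 0x300, 0x10)
    for name, seconds in attribute_to_functions(
            buckets, 0x1000, 0x10, symbols).items():
        print(f"{name}: about {seconds:.2f}s")
```

A larger region size shrinks the buffer but lets one counter straddle two functions, so the attribution step becomes less precise.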

12 Using prof. From how it works, we understand that: granularity is at best 10 msec; the generated profile could differ across multiple runs of a program with the same input; and the profile could be completely wrong, since a particular function might just happen to be running each time the timer interrupt occurs. Some usage guidelines: run under light load conditions; run a few times and see whether the results vary a lot. Note that function execution counts are exact, while execution times are estimates.
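
The "completely wrong" case can be reproduced with a deterministic toy model. The schedule below is invented for illustration: a program repeats a 10 msec cycle in which a function called `short` runs for 1 msec and `long` runs for 9 msec. If the sampling tick lines up with the cycle, every sample lands in `short`.

```python
def function_at(schedule, t_ms):
    """Return the name of the function running at time t_ms, given a
    schedule of (name, duration_ms) entries that repeats forever."""
    period = sum(duration for _, duration in schedule)
    position = t_ms % period
    for name, duration in schedule:
        if position < duration:
            return name
        position -= duration
    raise AssertionError("unreachable: position is always inside the period")


def true_profile(schedule, run_ms):
    """Exact time per function, counted millisecond by millisecond."""
    times = {name: 0 for name, _ in schedule}
    for t in range(run_ms):
        times[function_at(schedule, t)] += 1
    return times


def sampled_profile(schedule, run_ms, tick_ms):
    """Estimated time per function: each sample is charged one full tick."""
    times = {name: 0 for name, _ in schedule}
    for t in range(0, run_ms, tick_ms):
        times[function_at(schedule, t)] += tick_ms
    return times


if __name__ == "__main__":
    schedule = [("short", 1), ("long", 9)]  # hypothetical 10 msec cycle
    print("true:          ", true_profile(schedule, 1000))
    print("tick = 10 msec:", sampled_profile(schedule, 1000, 10))
    print("tick = 7 msec: ", sampled_profile(schedule, 1000, 7))
```

With the tick in step with the cycle, `long` receives no samples at all even though it accounts for 90% of the run; a tick that is out of step with the cycle gives an estimate much closer to the truth.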

13 Pixie: Basic Block Level Profiling. Available on MIPS and Alpha machines. Usage:
    % cc program.c       / a.out
    % pixie a.out        / instrumented a.out.pixie
    % a.out.pixie        / profile output file
    % prof               / report on profile data
The output is based on basic block level execution counts. Useful for all kinds of things.

14 What is a Basic Block? Alternative definitions: a section of program that does not cross any conditional branches, loop boundaries, or other transfers of control; a sequence of instructions with a single entry point, a single exit point, and no internal branches; a sequence of program statements that contains no labels and no branches. A basic block can only be executed completely and in sequence.

15 Pixie: How it works. 1. Identification of basic blocks. Q: How can basic blocks be identified? Pixie uses heuristics where necessary. 2. Instrumentation: increment a counter for each basic block; on program entry and exit, initialize data structures and write the profile output file.
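
One common answer to the question above is the textbook "leaders" method, sketched here on a toy instruction list. This is not claimed to be what Pixie does internally; the opcode names and the tuple encoding are invented for the example. A leader is the first instruction, any branch target, or any instruction that follows a branch, and each block runs from one leader up to the next.

```python
BRANCH_OPS = {"jmp", "beq", "bne", "ret"}  # toy set of control transfers


def find_basic_blocks(program):
    """program: list of (opcode, target); target is an instruction index
    for branches with a known destination and None otherwise.
    Returns a list of (start, end) index pairs, end inclusive."""
    if not program:
        return []
    leaders = {0}  # the first instruction always starts a block
    for index, (opcode, target) in enumerate(program):
        if opcode in BRANCH_OPS:
            if target is not None:
                leaders.add(target)       # a branch target starts a block
            if index + 1 < len(program):
                leaders.add(index + 1)    # so does the fall-through successor
    starts = sorted(leaders)
    ends = [start - 1 for start in starts[1:]] + [len(program) - 1]
    return list(zip(starts, ends))


if __name__ == "__main__":
    loop = [
        ("li", None),   # 0: i = 0
        ("li", None),   # 1: total = 0
        ("beq", 6),     # 2: if i == n goto 6
        ("add", None),  # 3: total += i
        ("addi", None), # 4: i += 1
        ("jmp", 2),     # 5: goto 2
        ("ret", None),  # 6: return
    ]
    for start, end in find_basic_blocks(loop):
        print(f"block: instructions {start}..{end}")
```

Indirect jumps, whose targets are not known from the instruction alone, are the kind of case where a binary-level tool has to fall back on heuristics.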

16 How intrusive are these mechanisms? Issue: does the instrumented program behave enough like the original program? If not, the generated profile might misdirect program optimization efforts. Pixie: the instrumented executable can be several times the size of the original; this does not matter, because basic block execution counts remain accurate. Prof: gathers more than just execution counts; the instrumentation is not very large.

17 Performance Tuning Tools. Performance counters are provided in hardware: event-based or sampled counters that measure various events (e.g., CPU cycles, L1 cache misses, TLB misses, loads, instruction count, …). Counters may be accessible at user level or kernel level, through the command line (user level) or through performance tools. For example:
    % perfex executable [arguments]
accesses the MIPS R10000 counters.

18 Other Tools: VTune. Use sampling to gain an accurate representation of your software's actual performance with negligible overhead. Gather CPU snapshots to identify problems such as cache misses; no special builds or instrumentation are required. Use call graph profiling to produce a picture of program flow and quickly identify critical functions and call sequences, gaining a high-level, algorithmic view of program execution.

19 Other Tools: Pin. Uses dynamic instrumentation: does not need source code, recompilation, or post-linking. Programmable instrumentation: provides rich APIs for writing your own instrumentation tools (called Pintools) in C/C++. Instrumentation is done on the executable (binary) and can be attached statically or dynamically. To launch and instrument an application:
    $ pin -t pintool -- application
Components: the instrumentation engine (provided) and the instrumentation tool (write your own or use a provided one).

20 Assignment #2 (contd.): 5. Use any of the performance tuning tools to measure various performance metrics (cache misses, execution time, etc.) and explain the performance of different versions of the matrix multiplication program. (Due: Oct. 14, 2010)
