Understanding Performance Counter Data
Authors: Alonso Bayona, Michael Maxwell, Manuel Nieto, Leonardo Salayandia, Seetharami Seelam
Mentor: Dr. Patricia J. Teller

What are performance counters?
Performance counters are a special set of registers, most commonly found on the processor chip, that count a number of different events occurring inside the computer while an application executes.

Why are performance counters important?
The events counted provide data that can be used to assess the performance of an application executed on a particular processor. Performance analysis helps programmers find bottlenecks in their programs, allowing them to build more efficient applications.

What tools are used in our research?
PAPI: an API (Application Program Interface) used to access the performance counters. It allows the user to dynamically select and display performance counter data, i.e., to count specific events.
Micro-benchmarks: small programs designed to exercise specific computer resources, which in turn generate events recorded by the performance counters. Because of their simplicity, we can, with sufficient knowledge of the target processor architecture, predict event counts; these predictions are used to understand the behavior of the performance counters and the underlying architecture.

RECENT WORK
DoD applications programmers working on the IBM Power3 architecture had questions about performance counter data for that specific architecture. The questions were forwarded to our team by collaborators at UTK (University of Tennessee, Knoxville), and we took on the task of answering any questions not answered by UTK.

LIST OF QUESTIONS
1. Exactly how are floating-point (FP) operations counted? (What is counted?) We have observed that FP loads and stores are not counted, and that FMAs (FP multiply-adds) are counted as one FP op. Also, FP round-to-single is counted as one FP op. Are divides and SQRTs (square roots) included? Are SQRTs counted as one FP op?
2. Kevin London said that PAPI_L1_DCH will return L1 data cache (DC) hits; however, that event is not available on the Power3. We can derive L1 DC hits as "total references to the L1 DC" minus the number of L1 misses (PAPI_L1_DCM). How do we get total L1 references? Obviously, we should include the number of loads (PAPI_LD_CMPL) plus the number of stores (PAPI_ST_CMPL), but do we count prefetches (data fetched speculatively)?
3. Are prefetches already part of the load count? (They probably shouldn't be, since the result goes to the cache but not to a register.) Are prefetches part of the L1 miss count? Apparently there is a counter for prefetch hits (PM_DC_PREF_HIT). Should the hit rate be calculated as PAPI_L1_DCH / (PAPI_LD_INS + PAPI_ST_INS), or as (PAPI_L1_DCH + PM_DC_PREF_HIT) / (PAPI_LD_INS + PAPI_ST_INS + number of prefetches)? If the latter, how do you count the number of prefetches?

[Figure: POWER3 processor die, showing two floating-point units (FPU 0 and FPU 1) with dedicated FMA and SQRT hardware, and the counting logic: if a FLOP is performed by the SQRT software routine, the counter is incremented by 21 or 22; otherwise it is incremented by 1.]

What is being counted in floating-point operations?
Simple math operations (+, -, *, /) each count as 1 FLOP (floating-point operation). A multiply followed by an add, fused into an FMA instruction, is handled by special hardware and counts as only 1 FLOP. A square root operation (sqrt), when handled by a software routine, counts as 21 or 22 FLOPs. The Power3 has special hardware to handle sqrt operations, but the compiler does not always use it. Other operations counted: rounding operations and register moves. Operations not counted: floating-point data loads and stores.
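To make the PAPI workflow and the counting rules above concrete, here is a minimal sketch of measuring floating-point operations around a small kernel. The PAPI calls and the PAPI_FP_OPS preset event are standard PAPI; the kernel, the expected count, and the bare-bones error handling are illustrative assumptions, not one of our micro-benchmarks.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int event_set = PAPI_NULL;
        long long flops = 0;
        volatile double a = 1.5, b = 2.5;  /* volatile keeps the loop from being folded away */
        double c = 0.0;
        int i;

        /* Initialize the library and build an event set holding the
           preset event PAPI_FP_OPS (completed floating-point operations). */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        if (PAPI_create_eventset(&event_set) != PAPI_OK ||
            PAPI_add_event(event_set, PAPI_FP_OPS) != PAPI_OK)
            exit(1);

        PAPI_start(event_set);
        /* If the compiler fuses each multiply-add into one FMA instruction,
           the Power3 counts each iteration as 1 FLOP, so we would expect a
           count near 1000 rather than 2000. The FP loads and stores of a,
           b, and c are not counted at all. */
        for (i = 0; i < 1000; i++)
            c = c + a * b;
        PAPI_stop(event_set, &flops);

        printf("PAPI_FP_OPS = %lld (c = %f)\n", flops, c);
        return 0;
    }

Predicting the count for such a kernel and comparing it with the reported value is the same methodology the micro-benchmarks apply at the instruction level.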
Assembler micro-benchmark example:

    # Set up input parameters.
    lfd  fp1,64(SP)    # load a value into a floating-point register
    lfd  fp2,72(SP)    # load a second value into a floating-point register
    fa   fp3,fp1,fp2   # perform a floating-point add on the two values
    stfd fp3,72(SP)    # store the result of the floating-point operation to memory
    ..                 # more operations

Which floating-point operations contribute to the total floating-point operations completed?
The micro-benchmarks had to be written in assembler because of the difficulty of triggering specific events from a high-level language such as C. A different micro-benchmark was written for each of the operations tested. The micro-benchmarks revealed that the equation previously thought to give the total number of floating-point operations does not hold: PM_FPU0_FMOV_FEST must be added to the equation. (The fres instruction, which gives an estimate of the reciprocal of a floating-point operand, is counted by the PM_FPU0_FMOV_FEST event as well as by the PM_FPU_FEST event; thus, when any kind of estimate instruction is used, the proposed equation counts fres instructions twice.) Division and square root floating-point operations are counted as FMA operations. (STILL UNDER INVESTIGATION)

LIST OF QUESTIONS (CONTINUED)
4. Same question as 3, except for the L2 cache. This question is further complicated by the fact that the L2 cache is, we believe, unified (data and instructions). If this is true, how do instruction prefetches fit into the calculation?
5. What (on earth) is the difference between the events PM_LD_MISS_L1, PM_LD_MISS_EXCEED_L2, and PM_LS_MISS_EXCEED_NO_L2? Also, the latter two events take a "threshold" as an argument; how do you specify this to PAPI?
6. On the POWER3 SP, does the sum PM_FPU_FADD_FMUL + PM_FPU_FCMP + PM_FPU_FDIV + PM_FPU_FEST + PM_FPU_FMA + PM_FPU_FPSCR + PM_FPU_FRSP + PM_FPU_FSQRT equal PM_FPU0_CMPL + PM_FPU1_CMPL?
7. On the POWER3 SP, does PM_IC_HIT + PM_IC_MISS equal PM_INST_CMPL or PM_INST_DISP?
8. More generally, on a speculative processor there will be more instructions dispatched than completed, and at some point some instructions will be cancelled (is this correct?). Are instructions cancelled before or after they touch the cache? This matters when calculating the cache miss rate, since (hopefully) the miss rate is either (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_CMPL + PM_ST_CMPL) or (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_DISP + PM_ST_DISP). Which is it?

How to calculate the cache-miss rate with performance counters on the POWER3?
Available events: load miss at L1, load dispatched, load completed, store miss at L1, store dispatched, store completed. Because of speculative execution, not all dispatched instructions complete.

[Figure: load/store machine pipeline; instructions flow from DISPATCHED through the pipeline to COMPLETED.]

Two estimates are possible:

    Overestimate:  (LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD CMPL + STORE CMPL)
    Underestimate: (LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD DISP + STORE DISP)

The second estimate should be the better approximation (recommended by PCAT).
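The two estimates above are simple ratios of counter values. The helper below is a sketch with a hypothetical struct and hypothetical function names; the fields correspond to the POWER3 events PM_LD_MISS_L1, PM_ST_MISS_L1, PM_LD_CMPL, PM_ST_CMPL, PM_LD_DISP, and PM_ST_DISP.

    struct ls_counts {
        long long ld_miss_l1, st_miss_l1;   /* misses observed at L1   */
        long long ld_cmpl, st_cmpl;         /* completed loads/stores  */
        long long ld_disp, st_disp;         /* dispatched loads/stores */
    };

    /* Overestimate: speculative misses appear in the numerator, but the
       denominator holds only completed instructions. */
    double miss_rate_over(const struct ls_counts *c)
    {
        return (double)(c->ld_miss_l1 + c->st_miss_l1)
             / (double)(c->ld_cmpl + c->st_cmpl);
    }

    /* Underestimate: the denominator also holds dispatched instructions
       that were later cancelled and may never have touched the cache. */
    double miss_rate_under(const struct ls_counts *c)
    {
        return (double)(c->ld_miss_l1 + c->st_miss_l1)
             / (double)(c->ld_disp + c->st_disp);
    }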
L1 Instruction Cache Hits
Many times an in-depth understanding of the architecture being studied is required to correctly analyze performance data. Assuming that the commonly used definition of an event holds on any platform may lead to misinterpretation of the performance data. For instance, on the Power3 the instruction cache hit event is triggered when a block of instructions (up to 8) is fetched from the cache into the instruction buffer, not on every single instruction fetch. By experimentation, while trying to answer question 7, we found that the following relation holds on the Power3 for a sequential program (since each counted fetch brings in at most 8 instructions, the left-hand side bounds the completed-instruction count from above):

    ((PM_IC_HIT - IC_PREF_USED) + PM_IC_MISS) * 8 >= PM_INST_CMPL

Miss rate on L1 and L2 data caches
L1 and L2 data cache miss rates are not easy to estimate because of the prefetching mechanism present in almost all modern processors; prefetching reduces the miss rate for sequential data access. Indirect methods of measuring the L1 and L2 data cache miss rates need to be researched. The complement of the miss rate can be computed as follows:

    L1 hit rate = 100 * (1 - (load misses in L1 + store misses in L1) / (total loads + stores))
    L2 hit rate = 100 * (1 - (load misses in L2 + store misses in L2) / (total L1 misses))

These metrics were obtained from: http://www.sdsc.edu/SciApps/IBM_tools/ibm_toolkit/HPM_2_4_3.html
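The two hit-rate formulas translate directly into code. The following is a sketch with illustrative function and parameter names; each argument corresponds to the event count named in its comment.

    /* L1 hit rate, in percent: the complement of the L1 miss rate. */
    double l1_hit_rate(long long l1_ld_miss,  /* load misses in L1  */
                       long long l1_st_miss,  /* store misses in L1 */
                       long long loads,       /* total loads        */
                       long long stores)      /* total stores       */
    {
        return 100.0 * (1.0 - (double)(l1_ld_miss + l1_st_miss)
                            / (double)(loads + stores));
    }

    /* L2 hit rate, in percent: L2 misses measured against total L1 misses. */
    double l2_hit_rate(long long l2_ld_miss,  /* load misses in L2  */
                       long long l2_st_miss,  /* store misses in L2 */
                       long long l1_misses)   /* total L1 misses    */
    {
        return 100.0 * (1.0 - (double)(l2_ld_miss + l2_st_miss)
                            / (double)l1_misses);
    }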
SPONSORS
Department of Defense (DoD), the MIE (Model Institutions for Excellence) REU (Research Experiences for Undergraduates) Program, and the Dodson Endowment.