
1 Accuracy of Performance Monitoring Hardware
Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia (University of Texas at El Paso) and Shirley Moore (University of Tennessee-Knoxville)

2 PCAT Team
- Dr. Patricia Teller
- Alonso Bayona – Undergraduate
- Alexander Sainz – Undergraduate
- Trevor Morgan – Undergraduate
- Leonardo Salayandia – M.S. Student
- Michael Maxwell – Ph.D. Student

3 Credits (Financial)
- DoD PET Program
- NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
- UTEP Dodson Endowment

4 Motivation
- Facilitate performance-tuning efforts that employ aggregate event counts
- When possible, provide calibration data
- Identify unexpected results and errors
- Clarify misunderstandings of processor functionality

5 Road Map
- Scope of Research
- Methodology
- Results
- Future Work and Conclusions

6 Processors Under Study
- MIPS R10K and R12K: 2 counters, 32 events
- IBM Power3: 8 counters, 100+ events
- Linux/IA-64 (Itanium): 4 counters, 150 events
- Linux/Pentium: 2 counters, 80+ events

7 Events Studied So Far
- Number of load and store instructions executed
- Number of floating-point instructions executed
- Total number of instructions executed (issued/committed)
- Number of L1 I-cache and L1 D-cache misses
- Number of L2 cache misses
- Number of TLB misses
- Number of branch mispredictions
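For readers using PAPI, these event classes correspond to standard PAPI preset names; the mapping below is our gloss, not something stated on the slides.

```c
/* PAPI presets matching the studied event classes (our gloss). */
#include <papi.h>

static const int studied_events[] = {
    PAPI_LD_INS,   /* load instructions completed      */
    PAPI_SR_INS,   /* store instructions completed     */
    PAPI_FP_INS,   /* floating-point instructions      */
    PAPI_TOT_INS,  /* total instructions completed     */
    PAPI_TOT_IIS,  /* total instructions issued        */
    PAPI_L1_ICM,   /* L1 instruction-cache misses      */
    PAPI_L1_DCM,   /* L1 data-cache misses             */
    PAPI_L2_TCM,   /* L2 total cache misses            */
    PAPI_TLB_TL,   /* total TLB misses                 */
    PAPI_BR_MSP,   /* mispredicted branches            */
};
```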

8 PAPI Overhead
- Extra instructions (counter read before and after the workload)
- Processing of counter-overflow interrupts
- Cache pollution
- TLB pollution
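A minimal sketch of the low-level PAPI instrumentation pattern shows where these overheads enter; the workload, array size, and event choice are illustrative, not from the study. The PAPI start/stop calls themselves execute loads and stores and touch cache lines and TLB entries, which is exactly the overhead being quantified.

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    static double a[100000];          /* workload data (illustrative size) */
    int es = PAPI_NULL;
    long long count;
    double sum = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_LD_INS);  /* count load instructions */

    PAPI_start(es);                   /* extra instructions: counter setup */
    for (int i = 0; i < 100000; i++)  /* the workload under measurement */
        sum += a[i];
    PAPI_stop(es, &count);            /* extra instructions: counter read */

    printf("reported loads: %lld (sum = %f)\n", count, sum);
    return 0;
}
```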

9 Methodology
- Validation micro-benchmark
- Configuration micro-benchmark
- Prediction via tool, mathematical model, and/or simulation
- Hardware-reported event-count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
- Comparison/analysis
- Report findings
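The statistics in the collection step are simple; a sketch (ours) of computing the mean and sample standard deviation over the 100 recorded counts:

```c
#include <math.h>

#define RUNS 100

/* Mean and sample standard deviation of the hardware-reported
   counts; counts[] is assumed to be filled by the 100 runs. */
void count_stats(const long long counts[RUNS], double *mean, double *stddev) {
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < RUNS; i++)
        sum += (double)counts[i];
    *mean = sum / RUNS;
    for (int i = 0; i < RUNS; i++) {
        double d = (double)counts[i] - *mean;
        sq += d * d;
    }
    *stddev = sqrt(sq / (RUNS - 1));
}
```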

10 Validation Micro-benchmark
- A simple, usually small, program
- Stresses a portion of the microarchitecture or memory hierarchy
- Its size, simplicity, or execution time facilitates tracing its execution path and/or predicting the number of times an event is generated

11 Validation Micro-benchmark
- Basic types: array, loop, in-line, floating-point
- Scalable w.r.t. granularity, i.e., the number of generated events

12 Example – Loop Validation Micro-benchmark

for (i = 0; i < number_of_loops; i++) {
    /* sequence of 100 instructions with data dependencies
       that prevent compiler reordering or optimization */
}

Used to stress a particular functional unit, e.g., the load/store unit.
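A concrete instance of this pattern might look like the following; the variable names, the use of volatile, and the two-statement body are our illustration of "data dependencies that prevent compiler reorder or optimization":

```c
/* Loop validation micro-benchmark (illustrative): the dependency
   chain through x keeps the compiler from reordering or deleting
   statements, so each iteration executes a predictable number of
   loads and stores. */
volatile double buf[2];

void loop_benchmark(long number_of_loops) {
    double x = 1.0;
    for (long i = 0; i < number_of_loops; i++) {
        buf[0] = x;  x = buf[0] + 1.0;   /* one store, one dependent load */
        buf[1] = x;  x = buf[1] + 1.0;
        /* ...extended to a sequence of ~100 such statements... */
    }
}
```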

13 Configuration Micro-benchmark
- Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
- Examples:
  - Page size used – for TLB miss counts
  - Cache prefetch algorithm
  - Branch-prediction buffer size/organization
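As an illustration of the first example, a page-size probe (our sketch, not the slides' code) walks a large region at a candidate stride and measures data-TLB misses at each stride:

```c
#include <stddef.h>

/* Configuration micro-benchmark sketch: touch one byte per `stride`
   across a large region.  Measuring data-TLB misses (e.g., the
   PAPI_TLB_DM preset) while varying `stride` over candidate page
   sizes (4 KB, 16 KB, ...) exposes the page size the OS actually
   used: once `stride` reaches the true page size, every touch lands
   on a distinct page and the compulsory-miss count equals
   bytes / stride. */
void touch_pages(volatile char *region, size_t bytes, size_t stride) {
    for (size_t off = 0; off < bytes; off += stride)
        region[off]++;
}
```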

14 Some Results

15 Reported Event Counts: Expected, Consistent, and Quantifiable Results
- Overhead related to PAPI and other sources is consistent and quantifiable:

  Reported Event Count – Predicted Event Count = Overhead
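As a worked example with illustrative numbers (ours): if the counter reports 1,000,046 loads for a micro-benchmark predicted to execute exactly 1,000,000 loads, the overhead term is 46 loads; once that term is shown to be constant across runs, it can be subtracted from production measurements as calibration data.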

16 Example 1: Number of Loads – Itanium, Power3, and R12K

17 Example 2: Number of Stores – Itanium, Power3, and R12K

18 Example 2: Number of Stores – Power3 and Itanium

Overhead (reported – predicted counts):

Platform   MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
Loads      46          28           86            N/A
Stores     3           11           29            N/A

19 Example 3: Total Number of Floating-Point Operations – Pentium II, R10K and R12K, and Itanium

Processor         Accurate   Consistent
Pentium II        yes        yes
MIPS R10K, R12K   yes        yes
Itanium           yes        yes

Even when counters overflow. No overhead due to PAPI.

20 Reported Event Counts: Unexpected and Consistent Results – Errors?
- The hardware-reported counts are multiples of the predicted counts:

  Reported Event Count / Multiplier = Predicted Event Count

- Cannot identify overhead for calibration

21 Example: Total Number of Floating-Point Operations – Power3

Processor   Accurate   Consistent
Power3      no         yes

22 Reported Counts: Expected (Not Quantifiable) Results
- Predictions: only possible under special circumstances
- Reported event counts seem reasonable
- But are they useful without knowing more about the algorithm used by the vendor?

23 Example 1: Total Data TLB Misses
- Replacement policy can (unpredictably) affect event counts
- PAPI may (unpredictably) affect event counts
- Other processes may (unpredictably) affect event counts

24 Example 2: L1 D-Cache Misses
The number of misses remains relatively constant as the number of array references increases.

25 Example 2 Enlarged

26 Example 3: L1 D-Cache Misses with Random Access (Foils the Prefetch Scheme Used by Stream Buffers)
[Chart: % error in L1 D-cache misses vs. % of cache filled (0–300%), for Power3, R12K, and Pentium]

27 Example 4: A Mathematical Model Verifying that Execution Time Increases Proportionately with L1 D-Cache Misses

total_number_of_cycles = iterations * exec_cycles_per_iteration
                       + cache_misses * cycles_per_cache_miss
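Plugging illustrative values (ours, not measured) into the model shows the linear dependence:

total_number_of_cycles = 1,000,000 * 4 + 10,000 * 100 = 4,000,000 + 1,000,000 = 5,000,000 cycles

Doubling cache_misses to 20,000 adds exactly another 1,000,000 cycles while the iteration term is unchanged, so execution time grows in direct proportion to the L1 D-cache miss count.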

28 Reported Event Counts: Unexpected but Consistent Results
- Predicted counts and reported counts differ significantly, but in a consistent manner
- Is this an error?
- Are we missing something?

29 Example: Compulsory Data TLB Misses
- % difference per number of references
- Reported counts are consistent
- They vary between platforms
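The prediction behind these comparisons is straightforward (our formulation): on a first-touch sequential walk of an array, each distinct page should miss in the TLB exactly once, so

predicted_compulsory_DTLB_misses = ceil(array_bytes / page_size)

e.g., a 16 MB array on 4 KB pages should incur 4,096 compulsory misses. The page size here is illustrative; as slide 13 notes, the page size actually in use must itself be determined with a configuration micro-benchmark.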

30 Reported Event Counts: Unexpected Results
- Outliers
- Puzzles

31 Example 1: Outliers – L1 D-Cache Misses for Itanium

32 Example 1: Supporting Data – Itanium L1 Data Cache Misses

                            Mean      Standard Deviation
90% of data (1M accesses)   1,290     170
10% of data (1M accesses)   782,891   566,370

33 Example 2: L1 I-Cache Misses and Instructions Retired – Itanium
Both are about 17% higher than expected.

34 Future Work
- Extend events studied – include multiprocessor events
- Extend processors studied – include Power4
- Study sampling on Power4; IBM collaboration re: workload characterization/system-resource usage using sampling

35 Conclusions
- Performance counters provide informative data that can be used for performance tuning
- The expected frequency of an event may determine the usefulness of its counts
- Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
- The usefulness of some event counts – as well as our research – could be enhanced with vendor collaboration
- The usefulness of some event counts is questionable without documentation of the related behavior

36 Should we attach the following warning to some event counts on some platforms?
CAUTION: The values in the performance counters may be greater than you think.

37 And should we attach the PCAT Seal of Approval to others?

38 Invitation to Vendors
Help us understand what's going on, when to attach the "warning," and when to attach the "seal of approval." Application programmers will appreciate your efforts, and so will we!

39 Question to You
On-board performance counters: what do they really tell you? With all the caveats, are they useful nonetheless?


41 Example 1: Total Compulsory Data TLB Misses for R10K
- % difference per number of references
- Predicted values consistently lower than reported
- Small standard deviations
- Greater predictability with increased number of references

42 Example 2: Compulsory Data TLB Misses for Itanium
- % difference per number of references
- Reported counts consistently ~5 times greater than predicted

43 Example 3: Compulsory Data TLB Misses for Power3
- % difference per number of references
- Reported counts consistently ~5/~2 times greater than predicted for small/large counts

44 Example 3: L1 D-Cache Misses with Random Access – Itanium (only at array size = 10x cache size)

45 Example 2: L1 D-Cache Misses
- On some of the processors studied, as the number of accesses increased, the miss rate approached 0
- Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word
- What's going on?
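A sketch (ours; the cache and line sizes are assumed, not from the slides) of the two access patterns being compared:

```c
#include <stddef.h>

#define CACHE_BYTES (32 * 1024)   /* assumed L1 D-cache size (illustrative) */
#define LINE_BYTES  64            /* assumed cache-line size (illustrative) */

/* Walk n doubles at a given stride (in doubles); the running sum
   keeps the loads live.  Compare the reported L1 D-cache miss
   counts for the two calls shown below. */
double walk(volatile double *a, size_t n, size_t stride) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += a[i];
    return sum;
}

/* unit stride:  walk(a, n, 1);
   big stride:   walk(a, n, (2 * CACHE_BYTES + LINE_BYTES) / sizeof(double)); */
```

Similar miss counts from both patterns would be consistent with a stream-buffer prefetcher staging lines ahead of the unit-stride walk and hiding its misses, which is what the random-access experiment on slide 26 was designed to defeat.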

46 Example 2: R10K Floating-Point Division Instructions

Variant A – dependent divisions (1 FP instruction counted):
a = init_value; b = init_value; c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

Variant B – independent divisions (3 FP instructions counted):
a = init_value; b = init_value; c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;
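To reproduce the discrepancy with PAPI, a harness along these lines could wrap each variant (our code, not the study's; PAPI_FP_INS is the floating-point instruction preset, and volatile stands in for the slides' reliance on unoptimized compilation):

```c
#include <stdio.h>
#include <papi.h>

/* volatile keeps the compiler from folding the divisions away. */
volatile double a, b, c, init_value = 3.0;

static long long measure(void (*body)(void)) {
    int es = PAPI_NULL;
    long long count;
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_FP_INS);   /* floating-point instructions */
    PAPI_start(es);
    body();
    PAPI_stop(es, &count);
    PAPI_cleanup_eventset(es);
    PAPI_destroy_eventset(&es);
    return count;
}

static void dependent(void)   { a = b / init_value; b = a / init_value; c = b / init_value; }
static void independent(void) { a = a / init_value; b = b / init_value; c = c / init_value; }

int main(void) {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    a = b = c = init_value;
    printf("dependent:   %lld FP instructions\n", measure(dependent));
    a = b = c = init_value;
    printf("independent: %lld FP instructions\n", measure(independent));
    return 0;
}
```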

47 Example 2: Assembler Code Analysis
- No optimization
- Same instructions
- Different (expected) operands
- Three division instructions in both
- No reason for different FP counts

Generated code, identical for both variants:
l.d   s.d
l.d   s.d
l.d   s.d
l.d   div.d   s.d
l.d   div.d   s.d
l.d   div.d   s.d

