
Accuracy of Performance Monitoring Hardware
Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia, University of Texas at El Paso
Shirley Moore, University of Tennessee-Knoxville

PCAT Team
Dr. Patricia Teller
Alonso Bayona - Undergraduate
Alexander Sainz - Undergraduate
Trevor Morgan - Undergraduate
Leonardo Salayandia - M.S. Student
Michael Maxwell - Ph.D. Student

Credits (Financial)
DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment

Motivation
Facilitate performance-tuning efforts that employ aggregate event counts
When possible, provide calibration data
Identify unexpected results and errors
Clarify misunderstandings of processor functionality

Road Map
Scope of Research
Methodology
Results
Future Work and Conclusions

Processors Under Study
MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events
Linux/IA-64: 4 counters, 150 events
Linux/Pentium: 2 counters, 80+ events

Events Studied So Far
Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions

PAPI Overhead
Extra instructions
Read counter before and after workload
Processing of counter overflow interrupts
Cache pollution
TLB pollution

Methodology
Validation micro-benchmark
Configuration micro-benchmark
Prediction via tool, mathematical model, and/or simulation
Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated; a minimal harness is sketched below)
Comparison/analysis
Report findings
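
The slides do not show the instrumentation itself; the following is a minimal sketch of a PAPI measurement harness in C, counting load instructions around a workload. The workload body is a placeholder, error handling is reduced to exits, and only standard PAPI calls (PAPI_library_init, PAPI_create_eventset, PAPI_add_event, PAPI_start, PAPI_stop) are used.

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static volatile double sink;

static void workload(void)      /* placeholder validation micro-benchmark */
{
    double s = 0.0;
    for (int i = 0; i < 1000000; i++)
        s += i;
    sink = s;
}

int main(void)
{
    int event_set = PAPI_NULL;
    long long count = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_event(event_set, PAPI_LD_INS) != PAPI_OK)
        exit(1);

    PAPI_start(event_set);         /* read counter before the workload */
    workload();
    PAPI_stop(event_set, &count);  /* read counter after the workload */

    /* Run the instrumented binary 100 times (e.g., from a driver script)
       and compute the mean and standard deviation of the printed counts. */
    printf("%lld\n", count);
    return 0;
}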

Validation Micro-benchmark
Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated

Validation Micro-benchmark
Basic types: array, loop, in-line, floating-point
Scalable w.r.t. granularity, i.e., number of generated events

Example – Loop Validation Micro-benchmark

for (i = 0; i < number_of_loops; i++) {
    /* sequence of 100 instructions with data dependencies
       that prevent compiler reordering or optimization */
}

Used to stress a particular functional unit, e.g., the load/store unit (a concrete sketch follows).
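
A sketch of what such a loop might look like in C for the load/store unit. The actual instruction sequence is not shown in the slides, so the body below is illustrative, with volatile used as one way to keep the compiler from eliminating the accesses:

#define NUMBER_OF_LOOPS 1000000L

volatile double a[8];   /* volatile keeps loads/stores from being optimized away */

void stress_load_store_unit(void)
{
    double t = 1.0;
    for (long i = 0; i < NUMBER_OF_LOOPS; i++) {
        /* Each statement depends on the previous one, preventing
           reordering; the real micro-benchmark extends this chain
           to roughly 100 instructions. */
        a[0] = t;  t = a[0] + 1.0;
        a[1] = t;  t = a[1] + 1.0;
        a[2] = t;  t = a[2] + 1.0;
        a[3] = t;  t = a[3] + 1.0;
    }
}

With four loads and four stores per iteration, the predicted counts are 4 * NUMBER_OF_LOOPS each, plus a small fixed loop overhead, which is what makes the reported counts checkable against a prediction.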

Configuration Micro-benchmark
Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
Examples:
Page size used – for TLB miss counts (a probe of this kind is sketched below)
Cache prefetch algorithm
Branch prediction buffer size/organization
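
As an illustration of the page-size example above, a hypothetical probe (not from the slides): touch one byte per assumed page and compare the reported data-TLB miss count against the number of pages touched. If the assumed page size is wrong, the counts diverge from the prediction.

#include <stdio.h>
#include <stdlib.h>

#define ASSUMED_PAGE_SIZE (16 * 1024)   /* assumption under test */
#define NUM_PAGES 4096

int main(void)
{
    volatile char *buf = calloc(NUM_PAGES, ASSUMED_PAGE_SIZE);
    long sum = 0;
    if (buf == NULL)
        return 1;
    for (int p = 0; p < NUM_PAGES; p++)
        sum += buf[(size_t)p * ASSUMED_PAGE_SIZE];  /* one touch per assumed page */
    printf("%ld\n", sum);
    return 0;
}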

Some Results

Reported Event Counts: Expected, Consistent, and Quantifiable Results
Overhead related to PAPI and other sources is consistent and quantifiable
Reported Event Count – Predicted Event Count = Overhead
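
As a worked illustration (the 1,000,000 figure is hypothetical; the overhead value matches the R12K loads entry in the table that follows): if a micro-benchmark is built to execute exactly 1,000,000 loads and the counter reports 1,000,046, the difference of 46 loads is measurement overhead, and subtracting that constant calibrates subsequent load counts.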

Example 1: Number of Loads – Itanium, Power3, and R12K

Example 2: Number of Stores – Itanium, Power3, and R12K

Example 2: Number of Stores – Power3 and Itanium

Platform   MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
Loads      46          28           86            N/A
Stores     3           11           29            N/A

Linux/Pentium: the discrepancy is multiplicative rather than additive.

Example 3: Total Number of Floating-Point Operations – Pentium II, R10K and R12K, and Itanium

Processor         Accurate   Consistent
Pentium II        yes        yes
MIPS R10K, R12K   yes        yes
Itanium           yes        yes

Even when counters overflow. No overhead due to PAPI.

Reported Event Counts: Unexpected and Consistent Results – Errors?
The hardware-reported counts are multiples of the predicted counts
Reported Event Count / Multiplier = Predicted Event Count
Cannot identify overhead for calibration

Example: Total Number of Floating-Point Operations – Power3
Accurate: no; Consistent: yes

Reported Counts: Expected (Not Quantifiable) Results
Predictions: only possible under special circumstances
Reported event counts seem reasonable
But are they useful without knowing more about the algorithm used by the vendor?

Example 1: Total Data TLB Misses
Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts

Example 2: L1 D-Cache Misses
The number of misses remains relatively constant as the number of array references increases

Example 2, enlarged view of the same data

Example 3: L1 D-Cache Misses with Random Access (Foiling the Prefetch Scheme Used by Stream Buffers)
[Chart: L1 D-cache misses as a function of % of cache filled; % error shown for Power3, R12K, and Pentium]
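
The benchmark source is not included in the slides; below is a minimal sketch of the random-access idea in C, assuming a precomputed shuffled visit order so that a sequential stream-buffer prefetcher cannot predict the next line:

#include <stdlib.h>

#define N (1 << 20)        /* array footprint well beyond the L1 D-cache */

static double data[N];
static int order[N];

/* Build a shuffled visit order (Fisher-Yates). */
static void shuffle(void)
{
    for (int i = 0; i < N; i++)
        order[i] = i;
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
}

/* Random-order traversal: with prefetching foiled, L1 D-cache misses
   should approach one per access once the footprint exceeds the cache,
   making the expected miss count straightforward to predict. */
double random_traversal(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += data[order[i]];
    return sum;
}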

Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses

total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
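
With illustrative numbers (not from the slides): 1,000,000 iterations at 4 cycles each plus 50,000 cache misses at 20 cycles each gives 1,000,000 * 4 + 50,000 * 20 = 5,000,000 cycles; doubling the miss count adds another 1,000,000 cycles, so execution time grows in proportion to the L1 D-cache miss count.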

Reported Event Counts: Unexpected but Consistent Results
Predicted counts and reported counts differ significantly but in a consistent manner
Is this an error?
Are we missing something?

Example: Compulsory Data TLB Misses
[Chart: % difference per number of references]
Reported counts are consistent
Vary between platforms

Reported Event Counts: Unexpected Results
Outliers
Puzzles

Example 1: Outliers – L1 D-Cache Misses for Itanium

Example 1: Supporting Data – Itanium L1 Data Cache Misses

Data set                     Mean      Standard Deviation
90% of data (1M accesses)    1,…       …
100% of data (1M accesses)   782,891   566,370

Example 2: L1 I-Cache Misses and Instructions Retired – Itanium
Both about 17% more than expected.

Future Work
Extend events studied – include multiprocessor events
Extend processors studied – include Power4
Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling

Conclusions
Performance counters provide informative data that can be used for performance tuning
The expected frequency of an event may determine the usefulness of its counts
Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
The usefulness of some event counts, as well as our research, could be enhanced by vendor collaboration
The usefulness of some event counts is questionable without documentation of the related behavior

Should we attach the following warning to some event counts on some platforms?
CAUTION: The values in the performance counters may be greater than you think.

And should we attach the PCAT Seal of Approval on others?

Invitation to Vendors
Help us understand what's going on, when to attach the "warning," and when to attach the "seal of approval." Application programmers will appreciate your efforts, and so will we!

Question to You
On-board performance counters: what do they really tell you? With all the caveats, are they useful nonetheless?


Example 1: Total Compulsory Data TLB Misses for R10K
[Chart: % difference per number of references]
Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased number of references

Example 2: Compulsory Data TLB Misses for Itanium
[Chart: % difference per number of references]
Reported counts consistently ~5 times greater than predicted

Example 3: Compulsory Data TLB Misses for Power3
[Chart: % difference per number of references]
Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts

Example 3 (continued): L1 D-Cache Misses with Random Access – Itanium
[Chart: Itanium data shown only at array size = 10x cache size]

Example 2: L1 D-Cache Misses
On some of the processors studied, as the number of accesses increased, the miss rate approached 0
Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word (compare the two walks sketched below)
What's going on?
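
A sketch of the two access patterns being compared, with assumed cache and line sizes (the slides do not give the benchmark source or the platform parameters):

#define CACHE_SIZE (32 * 1024)     /* assumed L1 D-cache size */
#define LINE_SIZE  32              /* assumed cache-line size */
#define STRIDE     (2 * CACHE_SIZE + LINE_SIZE)
#define N          (16 * 1024 * 1024)

static volatile char buf[N];

/* One-word stride: touches every cache line in the array. */
long walk_word_stride(void)
{
    long sum = 0;
    for (long i = 0; i < N; i += sizeof(long))
        sum += buf[i];
    return sum;
}

/* Stride of two cache sizes plus one line: touches far fewer distinct
   lines, yet the slide reports roughly the same L1 D-cache miss count,
   which is the puzzle posed above. */
long walk_large_stride(void)
{
    long sum = 0;
    for (long i = 0; i < N; i += STRIDE)
        sum += buf[i];
    return sum;
}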

Example 2: R10K Floating-Point Division Instructions

/* Version 1: 1 FP instruction counted */
a = init_value;
b = init_value;
c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

/* Version 2: 3 FP instructions counted */
a = init_value;
b = init_value;
c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;

Example 2: Assembler Code Analysis
No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for different FP counts

Both versions compile to the same opcode sequence:

l.d   s.d
l.d   s.d
l.d   s.d
l.d   div.d   s.d
l.d   div.d   s.d
l.d   div.d   s.d