
1 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Kirk W. Cameron Yong Luo {kirk,

2 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Acknowledgements Harvey Wasserman Olaf Lubeck Adolfy Hoisie Federico Bassetti Fabrizio Petrini Pat Fay (Intel) Frank Levine (IBM) James Schwarzmeir (SGI) Brent Henderson (HP) Bill Freeman (Digital/Compaq)

3 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Tutorial Objectives Provide a systematic approach to the use of hardware counters and associated tools Provide examples of hardware counters over a determined problem set for various processors Give details of systematic approach on a subset of processors Provide simple derived formulas and advanced analytical formulas for performance measurement Provide further references for hardware counters

4 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Tutorial Overview Introduction –Background, motivations, trade-offs, problem set, test beds Tool Discussions & Examples –Overview, specific processors, platforms, API & GUI tools Interpretation, Derivations and Empirical Analysis –Non-measured inference, modeling techniques Advanced Topics –instruction-level workload characterization analysis

5 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Introduction What is a performance counter/monitor? Example: The Cray YMP Performance Goals and Objectives Hardware Counter Motivations (trade-offs) Performance Measurement Methodology Problem Set Test Beds Problem Set vs. Test Beds

6 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters What is a performance monitor? Processor Monitors –Typically a group of registers –Special purpose registers keep track of programmable events –Non-intrusive counts result in “accurate” measurement of processor events –Software API’s handle event programming and overflow –Further GUI interfaces are often built on top of API’s to provide higher level analysis

7 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters What is a performance monitor? Typical events counted are… –Instructions, floating point instr, cache misses, etc. –No standard set exists Problems –Provides only hard counts; analysis must be performed by the user or tools –Made specifically for each processor (even processor families have different monitor interfaces) –Vendors are reluctant to support them because they do not contribute to profit

8 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters What is a performance monitor? System Level Monitors –Can be h/w or s/w –Intended to measure system activity –Examples: bus monitor: measures memory traffic, can analyze cache coherency issues in multiprocessor system network monitor: measures network traffic, can analyze web traffic internally and externally –Not covered in this tutorial, but methods apply equally

9 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Example: The Cray YMP Performance Tool: HPM Fairly accurate and reliable Automatic Summary (direct meas + analytic meas) –Utilization efficiency, vector length, flops –Hints to performance improvement Provides good basis for comparison (perhaps goal?) Made with user in mind, fairly easy to use and certainly provides more than today’s versions

10 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Performance Goals and Objectives Quantify performance for target machine –What measurements are necessary and possible? Compare different machines (benchmarking) –Are machines directly comparable? Analyze performance of machines for bottlenecks –How can this be done with only hard counts?

11 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Hardware Counter Motivations Why not use software? –Strength: simple, GUI interface –Weakness: large overhead, intrusive, higher level of abstraction and simplicity –However, this may provide what you need… How about using a simulator? –Strength: control, low-level, accurate –Weakness: not a true representation of real code, limit on size of code, difficult to implement –Again, might be all you need...

12 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Hardware Counter Motivations When should we directly use hardware counters? –Software and simulators not available or not enough –Strength: non-intrusive, instruction level analysis, moderate control, very accurate –Weakness: extra silicon, code not typically reusable, OS kernel support, documentation and support

13 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Performance Measurement Methodology Simple API’s or GUI Tools (use accordingly) 9 Step Process: –Identify goal of measurements accuracy desired? events required? analysis desired? –Identify subjects (codes) entire code vs. kernel (multiplexing vs. inserted calls) is I/O important? how large is code? (overhead concerns) measure subroutines separately?

14 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Performance Measurement Methodology –Identify target machine(s) and counters how accurate/ reliable are counters? any documentation support available? can s/w or simulator do the job? Are they available? is it necessary to know h/w specifics? –Determine counters necessary find minimum set of counters to achieve goal understand counters for machine (find equivalent meas) determine # runs per code and estimate time

15 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Performance Measurement Methodology –Hard code counters (if necessary) measure critical results only insert counters, write routine if necessary find appropriate libraries –Special considerations (examples) distinguish between operations and instructions try not to measure I/O ensure code in steady state (don’t measure startup) –Gather results use Perl scripts, shell scripts, etc. to gather port to spreadsheet

16 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Performance Measurement Methodology –Check for common errors Using correct counters? Counters working properly? (benchmark with lmbench) hard coded correctly? Printed correctly? No special considerations missed? (MADD, etc.) Scripts correct and ported correctly? –Analyze using special techniques

17 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set Should we collect all events all the time? –NO. Why? Not necessary and wasteful So, we must determine a fairly common problem set –Common enough to be present on most processors –Straightforward enough to be comparable across machines What counts should be used? –As many ideas as there are researchers –Safe to say: gather only what YOU need –Attempts such as PerfAPI are being made –We present our version/opinion on what should be measured

18 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set Cycles Graduated Instructions Loads Stores L1 misses L2 misses Graduated fl pt instr* Branches Branch misses TLB misses Icache misses

19 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Test Beds MIPS R10K Intel PPRO, Pentium II Cyrix 6x86MX AMD K-6 IBM 604e Compaq Alpha EV4-6 Sun UltraSPARC IIi HP PA-RISC (PA-8000 V Class)

20 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed

21 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed

22 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed

23 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed EV4 events [+] EV5 events [#] EV6 events [*]

24 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed

25 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed

26 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Problem Set vs. Test Bed

27 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Tool Discussions and Examples Overview Tool Techniques Available Tools –MIPS R10K –IBM 604e –Intel PPRO, P II, PIII –Compaq Alpha EV4-6 –HP PA-RISC (PA 8000 V Class)

28 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Overview API Tools (low level) –Interface to counters to simplify use –Typically controls overflow of counters in software –Typically derived from performance monitor data structures that control specifics of hardware –Some require modifying data structures; others use simple calls to functions that use the underlying data structures –Some provide a command line interface –Usually created for in-house use: no help/support –Advantage: you can always create your own API –Disadvantage: can be complicated

29 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Overview GUI Tools (high level) –Utilize underlying hardware counters for measurements –Often use some multiplexing of monitors (sampling) –May provide some analytical values (MFLOPS, cpi, etc.) –May have GUI interface for ease of use –Often help/support available –Advantage: increased speed, analysis, easier to use –Disadvantage: decreased accuracy and control

30 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Tool Techniques

31 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Tool Techniques

32 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters MIPS R10K Operating Systems –IRIX High Level Tools –Speedshop, Workshop Low Level Tools –perfex, libperfex, prof

33 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters MIPS R10K Example prof output (Prof run at Tue Apr 28 15:50; command line: prof suboptim.ideal.m): total number of cycles, total execution time (s), total number of instructions executed, ratio of cycles / instruction = 1.405, clock rate = 195 MHz, target processor modelled = R10000; per-procedure breakdown [cycles(%), cum %, secs, instrns, calls]: pdot (56.71%), init (43.26%), vsum (31767 cycles, 0.03%), fflush (1069 cycles, 0.00%)

34 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters MIPS R10K Perfex (command line interface) –Multiplex: perfex -a >& –Non-multiplex: perfex -e evnt0 -e evnt1 >& –Env variables: T5_EVENT0, T5_EVENT1 libperfex (for inserted measurement) integer e0, e1 integer*8 c0, c1 call start_counters (e0, e1) call read_counters (e0, c0, e1, c1) call print_counters (e0, c0, e1, c1)
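To make the inserted-measurement path concrete, here is a minimal libperfex sketch built from the calls shown above. The event numbers and the kernel routine are illustrative assumptions (event 0 taken as cycles on counter 0 and event 21 as graduated floating-point instructions on counter 1); verify them against the R10000 event list and the perfex documentation before use.

      program counter_demo
      integer e0, e1
      integer*8 c0, c1
c     Assumed event numbers (verify against the R10000 event list):
c       counter 0, event 0  = cycles
c       counter 1, event 21 = graduated floating-point instructions
      e0 = 0
      e1 = 21
c     Start counting only around the kernel so startup is not measured
      call start_counters (e0, e1)
      call kernel
      call read_counters (e0, c0, e1, c1)
c     Print the raw counts; CPI, MFLOPS, etc. are derived afterwards
      call print_counters (e0, c0, e1, c1)
      end

      subroutine kernel
c     Stand-in for the code section under study
      integer i
      double precision s
      s = 0.0d0
      do 10 i = 1, 1000000
         s = s + dble(i)*1.0d-6
   10 continue
      return
      end

On IRIX this would be compiled with f77 and linked against the perfex library (the exact link flag, e.g. -lperfex, should be checked locally).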

35 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters MIPS R10K (methodology) Goals: accuracy + events + analysis Subjects: ASCI codes, multiplexing & inserted, no I/O, large codes, subroutines ok Target Machine: –Origin 2000, IRIX OS –Counters within 10% accuracy (lmbench) –Can’t use simulator –h/w specifics necessary for analysis Necessary Counters: (see chart) –Approx 8 runs per code due to counter constraints

36 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters MIPS R10K (methodology) Hard code counters? –Use inserted code for small problem sizes –Use multiplexing for large problem sizes Special Considerations –MADD=OFF, O3 opt, 64 bit code, MIPS 4 instr set –Use libperfex when necessary –Certain counters, certain events Results gathered using shell scripts Check for common errors Analyze using methods

37 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters IBM 604e Operating Systems –AIX High Level Tools –xprofiler Low Level Tools –PMAPI (sPM604e) modify data structures to measure events unique threshold mechanism requires hardware switch and OS kernel modification –New PMAPI? New API? –prof, gprof

38 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters IBM 604e (methodology) Goals: accuracy + events + analysis Subjects: ASCI codes, inserted only, no I/O, large codes, subroutines ok Target Machine: –IBM SP2, AIX OS –Counters within 10% (lmbench) –Can’t use simulator –h/w specifics necessary for analysis Necessary Counters: (see chart) –Approx 6 runs per code due to counter constraints

39 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters IBM 604e (methodology) Hard code counters? –Use inserted code for all problem sizes Special Considerations –Threshold mechanism must be enabled –Certain counters, certain events Results gathered using shell scripts Check for common errors Analyze using methods

40 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Intel PPRO, Pentium II & III Operating Systems –Windows, Linux, others High Level Tools –Vtune (Windows 95, 98, NT), etc Low Level Tools –pperf, mpperf, libpperf (Linux)

41 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Intel PPRO, Pentium II & III Vtune Performance Analyzer 4.0 (event based sampling) –Graphics profiles –System-wide monitoring –Hotspot analysis –Byte code accelerator –OS chronologies –Static assembly code analysis –Online help –Of course…$$

42 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Intel PPRO, Pentium II & III

43 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Intel PPRO, Pentium II & III

44 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Intel PPRO, Pentium II & III pperf (command line version on Linux) –Multiplex: mpperf –Non-multiplex: pperf libpperf (for inserted measurement on Linux) –Similar to perfex on MIPS R10K

45 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Compaq Alpha EV4-6 Operating Systems –DEC UNIX, Windows High Level Tools –DCPI, ProfileMe Low Level Tools –uprofile, prof* –May not be necessary

46 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Compaq Alpha EV4-6 Digital Continuous Profiling System (DCPI) –Multiplexing (event sampling) –Limited information about events –In-order processors only –Free* ProfileMe –Multiplexing (instruction sampling) –Uses performance monitors => low overhead –All benefits of performance monitors –Free*

47 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters HP PA-RISC (PA 8000 V Class) Operating Systems –HP-UX High Level Tools –CXperf Low Level Tools –Not openly available –Require additional hardware

48 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters HP PA-RISC (PA 8000 V Class) CXperf –Event based analysis tool –Integrated with compiler –Provides: wallclock, cpu time TLB misses, context switches instructions, cache misses, latency –Advantage: variety of metrics –Disadvantage: more intrusive

49 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Interpretation, derivations, and empirical analysis

50 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Basic Calculations IPC (Instructions Per Cycle) or CPI: MFLOPS:
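The formulas themselves were images on the original slide and did not survive the transcript; in terms of the problem-set counts listed earlier, the usual definitions would be:

\[ \mathrm{CPI} = \frac{\text{Cycles}}{\text{Graduated instructions}}, \qquad \mathrm{IPC} = \frac{1}{\mathrm{CPI}} \]

\[ \mathrm{MFLOPS} = \frac{\text{Graduated fl. pt. instructions}}{\text{Execution time (s)} \times 10^{6}} = \frac{\text{Graduated fl. pt. instructions} \times \text{Clock rate (MHz)}}{\text{Cycles}} \]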

51 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Basic Calculations L1 cache hit rate: L2 cache hit rate: Memory/Flops ratio:
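The hit-rate formulas are likewise missing from the transcript; standard forms built from the problem-set events (treating loads plus stores as the L1 data references) would be:

\[ h_{L1} = 1 - \frac{\text{L1 misses}}{\text{Loads} + \text{Stores}}, \qquad h_{L2} = 1 - \frac{\text{L2 misses}}{\text{L1 misses}} \]

\[ \text{Memory/Flops ratio} = \frac{\text{Loads} + \text{Stores}}{\text{Graduated fl. pt. instructions}} \]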

52 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Basic Calculations Branch rate: Branch mis-pred rate: TLB miss rate:
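Plausible standard definitions for the missing formulas follow; the denominator chosen for the TLB miss rate (per graduated instruction) is an assumption, since some authors normalize per memory reference instead:

\[ \text{Branch rate} = \frac{\text{Branches}}{\text{Graduated instructions}}, \qquad \text{Branch mis-prediction rate} = \frac{\text{Branch misses}}{\text{Branches}} \]

\[ \text{TLB miss rate} = \frac{\text{TLB misses}}{\text{Graduated instructions}} \]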

53 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Advanced Calculations (for reference only) L1-to-L2 bandwidth used (MB/s): L2-to-Memory bandwidth used (MB/s):
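The bandwidth formulas did not survive the transcript; a common reconstruction multiplies miss counts by the relevant cache line size and ignores writebacks. The line sizes are machine-specific assumptions (32-byte L1 and 128-byte L2 lines are typical of the R10K systems discussed here):

\[ BW_{L1 \to L2} \approx \frac{\text{L1 misses} \times \text{L1 line size (bytes)}}{\text{Time (s)} \times 10^{6}} \ \text{MB/s} \]

\[ BW_{L2 \to Mem} \approx \frac{\text{L2 misses} \times \text{L2 line size (bytes)}}{\text{Time (s)} \times 10^{6}} \ \text{MB/s} \]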

54 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Empirical Analysis (Example: MIPS R10000) Motivations: –The contribution of cache misses to the total runtime –The overlap between computing and memory access Problem: CPU stall due to memory access not currently measurable (out of order, non-blocking cache). Solution: empirical inference

55 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Memory Hierarchy Model cpi = cpi_o + cpi_stall Procedure: - Measure cpi, h_2, and h_m (h_3) on several problem sizes - Measure cpi_o (problem fits into L1 cache) - Fit the data to the model to obtain t_2 and t_m (t_3)
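Only the top-level identity survives in the transcript; a stall term consistent with the stated procedure (measure cpi, h_2, h_m and cpi_o, then fit t_2 and t_m) would be the following reconstruction, offered as an assumption rather than a verbatim copy of the slide:

\[ cpi = cpi_{o} + cpi_{stall}, \qquad cpi_{stall} \approx m \left( h_{2}\, t_{2} + h_{m}\, t_{m} \right) \]

where m is memory references per graduated instruction, h_2 and h_m are the fractions of references satisfied by the L2 cache and by memory, and t_2, t_m are the fitted effective latencies; a third level (h_3, t_3) simply adds a term m h_3 t_3.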

56 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Origin 2000 Memory Latencies (block diagram: processor with a 32 KB L1 data cache and a 4 MB L2 cache, local memory, and remote memories reached over the network) Latencies: T_1 = 1 cp (L1), T_2 = 10 cp (L2), T_m = 69 cp (local memory), T_r = 33 cp, T_r1 = 22 cp, T_r2 = 44 cp (remote memory)

57 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters CPU-Memory Overlap Our model uses average or effective latencies, t_i. If there were no overlap, every memory access would “see” the full round-trip latency, T_i. Define a measure of the CPU-memory overlap, m_0, as shown below.
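The defining formulas on this slide are missing from the transcript. One form consistent with the surrounding text, comparing the fitted stall time against the stall time that would be seen with no overlap (full round-trip latencies T_i), is the following assumption:

\[ m_{0} = 1 - \frac{cpi_{stall}}{cpi_{stall}^{\,max}}, \qquad cpi_{stall}^{\,max} = m \left( h_{2}\, T_{2} + h_{m}\, T_{m} \right) \]

so m_0 = 0 means every access pays the full latency (no overlap), while m_0 approaching 1 means most of the latency is hidden behind computation.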

58 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Model Validation Collect data & empirically fit on 1MB L2 Power Challenge (PC) Validate on different machines –2MB L2 PowerChallenge Use SGI R10000 processor simulator

59 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Validation on 2MB L2 SGI PC

60 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Model Shows Improved Memory Performance on O2K Overall cpi is much lower (~2x) on the Origin system and stall time is proportionately lower. cpi_stall bounds the application’s performance gains (e.g., Hydro)

61 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Origin-PC Memory Comparison Identical processors (200 MHz MIPS R10K) Memory Differences Memory Latency - 80cp vs 205cp Outstanding Misses - 4 vs 2 L2 Cache - 4MB vs 2MB Power Challenge P P M M

62 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Separating Architectural Effects Define 2 virtual machines: - Power Challenge with a 4MB cache (PC*) - Origin with the memory latency of a Power Challenge (O*) Separate contributions of three architectural features: cache size, outstanding misses & latency Key ratios: cache effect f_c = cpi_PC / cpi_PC*; outstanding misses f_o = cpi_PC* / cpi_O*; memory latency f_m = cpi_O* / cpi_O; overall F = f_c * f_o * f_m
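Since cpi_PC and cpi_O are measured directly while cpi_PC* and cpi_O* come from the memory model, the three ratios telescope into the overall measured gain, which is a useful consistency check:

\[ F = f_{c}\, f_{o}\, f_{m} = \frac{cpi_{PC}}{cpi_{PC^{*}}} \cdot \frac{cpi_{PC^{*}}}{cpi_{O^{*}}} \cdot \frac{cpi_{O^{*}}}{cpi_{O}} = \frac{cpi_{PC}}{cpi_{O}} \]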

63 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters CPI of the O* Machine The Origin’s DSM architecture allows us to “dial in” a memory latency! Observed: CPI varies linearly with increasing latency. The maximum latency on a 32-PE O2K is roughly equal to that of the PowerChallenge.

64 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters CPI of the PC* Machine Using the memory model we can combine hit rates from the 4MB L2 Origin with the stall times of the Power Challenge.

65 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Observed and Calculated Performance Gain on the O2K *relative to PowerChallenge; single-processor

66 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Success and Limitations Success: –Proved the validity of this kind of model –Explained improvements in Origin2000 performance –Gave performance bounds for memory optimizations Limitations: this model is primarily diagnostic, not predictive

67 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Instruction-level, pipeline-based application characterization analysis

68 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Outline Motivation: –Comparing performance at the instruction level Methodology: –Using performance counters –Abstracting pipeline-queue-related parameters Advantages: no overhead, pinpoints architectural bottlenecks Summary

69 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Motivations Processor Utilization Efficiency Utilization of architectural features Separation and quantification of other effects (data dependency etc)

70 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Methodology Counter-based average abstract parameters related to architectural pipeline queues Analysis Assumptions: –no data dependency, uniform instruction distribution, branches and Icache misses negligible Growth-rate based bottleneck analysis

71 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters General Pipeline Model (CPU Only) (block diagram: the incoming instruction stream feeds a multi-instruction fetch/decode stage with branch prediction and register renaming, which dispatches into three queues: the F queue feeding the FPU, the I queue feeding the ALU, and the M queue feeding the load/store unit)

72 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Abstract Parameters

73 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Formula

74 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Bottleneck Analysis Architectural Constraints: queue length limit, # of functional units, graduation rates, etc Positive Growth Rate: queue likely filled to cause a stall Multiple Positive Growth Rates: threshold of max. instructions in flight
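The slides do not spell out how a growth rate is computed from counter data; one plausible estimate (an assumption, not necessarily the authors' exact formula) is the net rate at which instructions of a class enter their queue minus the rate at which they graduate out of it:

\[ g_{q} \approx \frac{I_{q}^{\,in} - I_{q}^{\,out}}{\text{Cycles}} \]

where, for example, issued and graduated floating-point instruction counts would feed the F-queue estimate; a persistently positive g_q over an interval suggests that queue is filling and stalling dispatch.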

75 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Scientific Examples Test bed: MIPS R10000 Benchmarks: ASCI application benchmark codes General Characteristics: –Branches ≤ 10%, mis-predicted branches ≤ 1% –Icache misses ≤ 0.1% –All abstract parameters converge to constants (steady state)

76 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Multimedia Examples Test bed: MIPS R10000 Multimedia Applications: –RealPlayerTM, MPEG en/decoder, GSM Vocoder, and Rsynth (speech synthesizer) General Characteristics: –Fairly high L1 hit rates (some >99.9%) –Low FLOPS (some zero flops) –Most Icache hit rates > 99.9% (RealPlayer 95%) –Code may behave differently based on input

77 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Architectural Constraints (MIPS R10000) Queue Lengths: 16 entries for all three queues Max. Instructions in Flight: 32 Graduation Rates: 2/cycle for FP, 2/cycle for INT, 1/cycle for Mem Outstanding Misses: 4 on Origin2000, ≤ 2 on PowerChallenge

78 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Example Growth Rates (ASCI Benchmarks)

79 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Outstanding Misses Utilization (ASCI Benchmarks)

80 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Growth Rates (GSM Vocoder) L1 hit rate > 99.9%, Icache hit rate > 99.8% 2.4%, 4.0% branches for encoding, decoding No flops for decoding

81 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Growth Rates (MPEG Video Decoding) L1 hit rate >99.4%, icache hit rate >99.9% Branch rate < 9% MPEG1 CPI depends on data rate and quality

82 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Growth Rates (RealPlayer TM ) L1 hit rate > 98.2% Icache hit rate: audio > 95%, video > 98% Branch rate: audio < 22%, video < 11% CPI: audio > video

83 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Growth Rates (Rsynth) L1 hit rate > 99.7%, Icache hit rate nearly 100% Branch rate < 12.9%

84 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Growth Rates (MPEG Audio) L1 hit rate: encoding > 97%, decoding > 96% Icache hit rate > 99.9% Branch rate: encoding < 8%, decoding < 6.5%

85 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Summary Using abstract parameters collected from counter data to characterize workload Analyzing performance bottlenecks from architectural constraints Estimating the utilization of outstanding misses

86 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Future Trends The Fight for Silicon –Performance vs. engineering –Silicon = $$$, performance monitoring is costly –Yet, trend toward more, smarter counters The Winners (for now) –Alpha: 3 counters, many events, great tools –IBM: 4 counters, many events, not-so-great tools –Intel: 2 counters, many events, great Windows tools, not so great for academic/scientific computing

87 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Future Trends The Losers (for now) –AMD K-6: no counters at all Why this trend? –More complicated processors –More complicated reasons for performance variations –Memory bottleneck still biggest problem –Latency hiding techniques complicate analysis –Low level diagnosis is becoming increasingly important

88 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Future Trends What’s being done to help? –Attempts (as discussed) to provide further analysis using counter output –Analytical & empirical models can answer interesting questions How effective are latency hiding techniques? Would adding some simple events increase the ability to analyze? Where are the bottlenecks of performance? –Need to continue toward predictive models How will minor architectural advances affect performance? Examples: latency decreases, memory size increases, etc.

89 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Further References For an electronic version of the tutorial or tutorial questions, contact: –Kirk W. Cameron –Yong Luo See handout for detailed overall references

90 Cameron & LuoISCA ‘99 May 1, Performance Evaluation Using Hardware Performance Counters Review of Tutorial Objectives Provide a systematic approach to the use of hardware counters and associated tools Provide examples of hardware counters over a determined problem set for various processors Give details of systematic approach on a subset of processors Provide simple derived formulas and advanced analytical formulas for performance measurement Provide further references for hardware counters

91 Cameron & LuoISCA ‘99 Performance Evaluation Using Hardware Performance Counters Lmbench Use (lat_mem_rd)

92 Cameron & LuoISCA ‘99 Performance Evaluation Using Hardware Performance Counters Accuracy of MIPS R10K and IBM 604e

93 Cameron & LuoISCA ‘99 Performance Evaluation Using Hardware Performance Counters Multiplexing Saves time, compromises accuracy Example: –MIPS R10K –Sweep3D code: an Sn transport code –Multiplex for sizes > 50 –Multiplexing not accurate for sizes < 50 Key: reduces the number of runs necessary for measurements Multiplex whenever possible
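For reference, the scaling that multiplexing implies: when perfex -a time-shares the two physical counters across all events, each event is counted for only a fraction of the run and the tool extrapolates to a full-run estimate. A sketch of the standard correction, assuming the program behaves uniformly over time (exactly the assumption that breaks down for the short runs, sizes < 50, above):

\[ \hat{C}_{e} \approx C_{e}^{\,sampled} \times \frac{T_{total}}{T_{e}^{\,scheduled}} \]

where C_e^sampled is the raw count for event e, T_e^scheduled is the time that event was actually resident on a counter, and T_total is the total run time.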

94 Cameron & LuoISCA ‘99 Performance Evaluation Using Hardware Performance Counters Tutorial Schedule 2:00-3:10 Kirk 3:10-3:25 Break 3:25-4:25 Kirk/Yong 4:25-4:40 Break 4:40-5:30 Yong 5:30-6:00 Flex Time

