
1 Assessing and Understanding Performance Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available by Dr Mary Jane Irwin, Penn State University. CS.210 Computer Systems and Architecture and CS.305 Computer Architecture

2 Assessing and Understanding Performance
 Purchasing perspective: given a collection of machines, which has the best performance? The least cost? The best cost/performance?
 Design perspective: faced with design options, which has the best performance improvement? The least cost? The best cost/performance?
 Both require:
   - a basis for comparison
   - a metric for evaluation
 Our goal is to understand what factors in the architecture contribute to overall system performance, and the relative importance (and cost) of these factors.

3 Defining (Speed) Performance
 Normally we are interested in reducing:
   - Response time (aka execution time): the time between the start and the completion of a task. Important to individual users. Thus, to maximize performance, we need to minimize execution time.
   - Throughput: the total amount of work done in a given time. Important to data center managers.
 Decreasing response time almost always improves throughput.
 If X is n times faster than Y, then

   performance_X / performance_Y = execution_time_Y / execution_time_X = n
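As a quick illustration (not from the original slides), here is a minimal Python sketch of the speedup relation above, using made-up execution times:

    def speedup(time_y, time_x):
        """Speedup n of X over Y: performance_X / performance_Y
        = execution_time_Y / execution_time_X."""
        return time_y / time_x

    # Hypothetical times: Y takes 15 s, X takes 10 s -> X is 1.5 times faster.
    print(speedup(15.0, 10.0))  # 1.5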

4 Performance Factors
 We want to distinguish elapsed time from the time spent on our task.
 CPU execution time (CPU time): the time the CPU spends working on a task. It does not include time waiting for I/O or time spent running other programs.

   CPU execution time = # CPU clock cycles x clock cycle time
     or
   CPU execution time = # CPU clock cycles / clock rate

 We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.

5 Review: Machine Clock Rate
 Clock rate (MHz, GHz) is the inverse of the clock cycle time (clock period):

   10 nsec clock cycle  => 100 MHz clock rate
   5 nsec clock cycle   => 200 MHz clock rate
   2 nsec clock cycle   => 500 MHz clock rate
   1 nsec clock cycle   => 1 GHz clock rate
   500 psec clock cycle => 2 GHz clock rate
   250 psec clock cycle => 4 GHz clock rate
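A one-line Python check of the reciprocal relation, run over the example periods listed above:

    def clock_rate_hz(cycle_time_s):
        # Clock rate is the reciprocal of the clock cycle time.
        return 1.0 / cycle_time_s

    for period in (10e-9, 5e-9, 2e-9, 1e-9, 500e-12, 250e-12):
        print(period, "->", clock_rate_hz(period) / 1e6, "MHz")
    # 10 ns -> 100 MHz ... 250 ps -> 4000 MHz (4 GHz)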

6 Clock Cycles per Instruction (CPI)
 Not all instructions take the same amount of time to execute. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.
 Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute.
 CPI is a way to compare two different implementations of the same ISA.

   Instruction class            A   B   C
   CPI for this class           1   2   3

7 Effective CPI
 Computing the overall effective CPI is done by looking at the different types of instructions, their individual cycle counts, and averaging:

   Overall effective CPI = Σ (CPI_i x IC_i), for i = 1 to n

 where IC_i is the count (percentage) of instructions of class i executed, CPI_i is the (average) number of clock cycles per instruction for that instruction class, and n is the number of instruction classes.
 The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs.
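A small Python sketch of this weighted sum; the instruction mix used here anticipates the worked example on slide 11 and is otherwise arbitrary:

    def effective_cpi(mix):
        # Overall effective CPI = sum over classes of IC_i (fraction) * CPI_i.
        # `mix` maps class name -> (fraction of executed instructions, CPI_i).
        return sum(frac * cpi for frac, cpi in mix.values())

    mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}
    print(round(effective_cpi(mix), 2))  # 2.2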

8 THE Performance Equation
 Our basic performance equation is then

   CPU time = Instruction count x CPI x clock cycle time
     or
   CPU time = (Instruction count x CPI) / clock rate

 These equations separate the three key factors that affect performance:
   - We can measure the CPU execution time by running the program.
   - The clock rate is usually given.
   - We can measure overall instruction count using profilers/simulators, without knowing all of the implementation details.
   - CPI varies by instruction type and ISA implementation, for which we must know the implementation details.
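The performance equation as a minimal Python sketch, with hypothetical numbers (1e9 instructions, CPI 2.2, a 2 GHz clock):

    def cpu_time(instruction_count, cpi, clock_rate_hz):
        # CPU time = instruction count x CPI / clock rate
        #          = instruction count x CPI x clock cycle time
        return instruction_count * cpi / clock_rate_hz

    print(cpu_time(1e9, 2.2, 2e9), "seconds")  # 1.1 seconds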

9 Determinants of CPU Performance

                            Instruction count    CPI    Clock cycle
   Algorithm
   Programming language
   Compiler
   ISA
   Processor organization
   Technology

10 Determinants of CPU Performance

                            Instruction count    CPI    Clock cycle
   Algorithm                        X             X
   Programming language             X             X
   Compiler                         X             X
   ISA                              X             X          X
   Processor organization                         X          X
   Technology                                                X

11 A Simple Example
 Q1: How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
 Q2: How does this compare with using branch prediction to shave a cycle off the branch time?
 Q3: What if two ALU instructions could be executed at once?

   Op       Freq   CPI_i   Freq x CPI_i
   ALU      50%    1
   Load     20%    5
   Store    10%    3
   Branch   20%    2
                             Σ =

12 A Simple Example (answers)

   Op       Freq   CPI_i   Freq x CPI_i   Q1    Q2    Q3
   ALU      50%    1           .5         .5    .5    .25
   Load     20%    5          1.0         .4   1.0   1.0
   Store    10%    3           .3         .3    .3    .3
   Branch   20%    2           .4         .4    .2    .4
                         Σ =  2.2        1.6   2.0   1.95

 Q1: A better data cache reduces the average load time to 2 cycles: CPU time_new = 1.6 x IC x CC, so 2.2/1.6 means the machine is 37.5% faster.
 Q2: Branch prediction shaves a cycle off the branch time: CPU time_new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster.
 Q3: Two ALU instructions executed at once: CPU time_new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster.
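The three answers can be reproduced mechanically. A sketch, reusing the effective-CPI helper from slide 7 (names and the percent-faster formatting are my own):

    def effective_cpi(mix):
        return sum(frac * cpi for frac, cpi in mix.values())

    base = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}
    q1 = dict(base, Load=(0.20, 2))    # better data cache: load CPI 5 -> 2
    q2 = dict(base, Branch=(0.20, 1))  # branch prediction: branch CPI 2 -> 1
    q3 = dict(base, ALU=(0.50, 0.5))   # two ALU ops at once: effective ALU CPI 1 -> 0.5

    old = effective_cpi(base)  # 2.2
    for name, mix in (("Q1", q1), ("Q2", q2), ("Q3", q3)):
        new = effective_cpi(mix)
        print(name, f"{new:.2f}", f"{(old / new - 1) * 100:.1f}% faster")
    # Q1 1.60 37.5% faster, Q2 2.00 10.0% faster, Q3 1.95 12.8% faster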

13 Comparing and Summarizing Performance
 The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)).
 How do we summarize the performance of a benchmark set with a single number?
 The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):

   AM = (1/n) x Σ Time_i, for i = 1 to n

 where Time_i is the execution time for the i-th program of a total of n programs in the workload.
 A smaller mean indicates a smaller average execution time, and thus improved performance.
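A minimal sketch of the arithmetic mean over a hypothetical four-program workload:

    def arithmetic_mean(times):
        # AM = (1/n) x sum of Time_i for the n programs in the workload.
        return sum(times) / len(times)

    # Made-up execution times in seconds for four benchmark programs.
    print(arithmetic_mean([10.0, 2.0, 0.5, 7.5]))  # 5.0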

14 SPEC Benchmarks (www.spec.org)

   Integer benchmarks                        FP benchmarks
   gzip     compression                      wupwise   Quantum chromodynamics
   vpr      FPGA place & route               swim      Shallow water model
   gcc      GNU C compiler                   mgrid     Multigrid solver in 3D fields
   mcf      Combinatorial optimization       applu     Parabolic/elliptic PDE
   crafty   Chess program                    mesa      3D graphics library
   parser   Word processing program          galgel    Computational fluid dynamics
   eon      Computer visualization           art       Image recognition (NN)
   perlbmk  perl application                 equake    Seismic wave propagation simulation
   gap      Group theory interpreter         facerec   Facial image recognition
   vortex   Object-oriented database         ammp      Computational chemistry
   bzip2    compression                      lucas     Primality testing
   twolf    Circuit place & route            fma3d     Crash simulation (FEM)
                                             sixtrack  Nuclear physics accelerator
                                             apsi      Pollutant distribution

15 Example SPEC Ratings
 [Chart of example SPEC ratings not reproduced in this transcript.]

16 Other Performance Metrics
 Power consumption is especially important in the embedded market, where battery life (and cooling) matters.
 For power-limited applications, the most important metric is energy efficiency.

17 Other Performance Metrics - (Native) MIPS
 Native MIPS, and what is wrong with it: the dangers of using metrics other than time in performance measurement can be shown by looking at several popular alternatives. One such alternative is MIPS: Millions of Instructions Per Second.

   MIPS = Instruction count / (Execution time x 10^6)

 The following form is sometimes convenient, since the clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time:

   MIPS = Clock rate / (CPI x 10^6)

 Relating MIPS to time:

   Execution time = Instruction count / (MIPS x 10^6)
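The three MIPS formulas above, checked against each other in Python with a made-up but self-consistent machine (2 GHz clock, CPI 2.5, 4e9 instructions executed in 5 s):

    def native_mips(instruction_count, exec_time_s):
        # MIPS = instruction count / (execution time x 10^6)
        return instruction_count / (exec_time_s * 1e6)

    def mips_from_rate(clock_rate_hz, cpi):
        # Equivalent form: MIPS = clock rate / (CPI x 10^6)
        return clock_rate_hz / (cpi * 1e6)

    def exec_time(instruction_count, mips):
        # Execution time = instruction count / (MIPS x 10^6)
        return instruction_count / (mips * 1e6)

    print(native_mips(4e9, 5.0))     # 800.0
    print(mips_from_rate(2e9, 2.5))  # 800.0
    print(exec_time(4e9, 800.0))     # 5.0 seconds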

18 The Problem with MIPS
 The problem with using MIPS as a measure of comparison is threefold:
   - MIPS is dependent on the instruction set, making it difficult to compare the MIPS of computers with different instruction sets;
   - MIPS varies between programs on the same computer; and, most importantly,
   - MIPS can vary inversely with performance!
 The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Machines with the option yield faster-executing programs yet have a lower MIPS rating: each (slow, multi-cycle) floating-point instruction replaces the many fast integer instructions a software floating-point routine would execute.

19 Peak MIPS
 Beware of so-called peak MIPS. This type of rating is obtained by choosing an instruction mix that minimises the CPI, even if that instruction mix is totally impractical. For instance:
   - a program composed entirely of arithmetic and logic operations, with no jumps, branches, or loads/stores!
   - or, as in a famous case, a program comprising only NOPs (No OPerations)!
 In other words: peak MIPS is a level of performance that will never be attained ;-)

20 Relative MIPS
 MIPS can fail to give a true picture of performance in that it does not track execution time. An alternative MIPS rating is relative MIPS, as opposed to native MIPS, derived by using a particular machine as a reference point:

   Relative MIPS = (Time_reference / Time_unrated) x MIPS_reference

 where
   Time_reference = execution time of a program on the reference machine
   Time_unrated   = execution time of the same program on the machine to be rated
   MIPS_reference = agreed-upon MIPS rating of the reference machine
 The advantage of this form of MIPS is small, since the execution time, the program, and the program input must still be known for the rating to carry meaningful information.
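A sketch of the relative-MIPS calculation; the numbers are invented (the classic agreed-upon 1-MIPS reference machine was the VAX-11/780):

    def relative_mips(time_reference, time_unrated, mips_reference):
        # Relative MIPS = (Time_reference / Time_unrated) x MIPS_reference
        return (time_reference / time_unrated) * mips_reference

    # Hypothetical: the 1-MIPS reference runs the program in 12 s,
    # the machine under test in 4 s -> rated at 3 relative MIPS.
    print(relative_mips(12.0, 4.0, 1.0))  # 3.0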

21 Summary: Evaluating ISAs
 Design-time metrics:
   - Can it be implemented? In how long, and at what cost?
   - Can it be programmed? Ease of compilation?
 Static metrics:
   - How many bytes does the program occupy in memory?
 Dynamic metrics:
   - How many instructions are executed? How many bytes does the processor fetch to execute the program?
   - How many clocks are required per instruction?
   - How "lean" a clock is practical?
 Best metric: time to execute the program! Time = Inst. count x CPI x Cycle time, which depends on the instruction set, the processor organization, and compilation techniques.

22 Fallacies and Pitfalls
 Pitfall: expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement.
 Example: suppose a program takes 100 seconds to run, and multiply operations account for 80 seconds of this time. How much do you need to improve the speed of multiplication to make the program run five times faster?
 A: Using Amdahl's Law:

   Execution time after improvement =
     (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

23 Fallacies and Pitfalls (continued)
 ...A: Using Amdahl's Law (and the problem set):

   Execution time after improvement = 80 seconds / n + (100 - 80) seconds

 To make the program run five times faster, the new execution time must be 20 seconds:

   20 seconds = 80 seconds / n + 20 seconds
   0 = 80 seconds / n

 I.e., there is no amount by which we can enhance multiply to achieve a fivefold improvement in execution time!
 Making the common case fast will tend to enhance performance better than optimising the rare case.
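Amdahl's Law as a short Python sketch; sweeping the multiply speedup n shows the run time approaching, but never reaching, the 20-second target:

    def time_after_improvement(total_s, affected_s, improvement):
        # Amdahl's Law: new time = (time affected by improvement / amount
        # of improvement) + time unaffected.
        return affected_s / improvement + (total_s - affected_s)

    # The slide's example: 100 s total, of which 80 s is multiplies.
    for n in (2, 4, 16, 1_000_000):
        print(n, time_after_improvement(100.0, 80.0, n))
    # 2 -> 60.0, 4 -> 40.0, 16 -> 25.0, 1e6 -> 20.00008 (never 20.0)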

24 Fallacies and Pitfalls
 Pitfall: comparing computers using only one or two of the three performance metrics: clock rate, CPI, and instruction count.
 Pitfall: using peak performance to compare machines.
 Fallacy: synthetic benchmarks predict performance.
   - Synthetic benchmarks are small artificial programs that attempt to represent the execution frequency of statements found in a larger set of benchmarks or in real-world programs. Whetstone and Dhrystone are examples.
   - Since these are not natural programs, they can (be used to) distort performance statistics, e.g. by compilers discarding large sections of the code, or by compilers targeting optimisation 'opportunities' specific to a benchmark and hence artificially inflating the performance stats: e.g. a 20% to 30% improvement from a string-copy 'optimisation' for Dhrystone that could not be applied in over 99% of normal programs!

25 Concluding Remarks
 The task a computer designer faces is a complex one: determine what attributes are important for a new machine, then design a machine to maximise performance while staying within cost constraints.
 Performance can be measured as throughput or as response time; which one matters depends on the environment/application and should be borne in mind.
 Amdahl's Law is a valuable tool for determining what performance improvement an architectural enhancement may give.
 Knowing which cases are the most frequent is critical to improving performance. Based on empirical studies of instruction sets, tradeoffs can be made by deciding which instructions are the most important and which cases to try to make fast.
 Computer designs will always be measured by cost and performance, and finding the best balance will always be the art of computer design, just as in any engineering task.

