Performance (2.1)
  The entire point of computer hardware is to “perform”
    Operate correctly
    Implement useful operations
    Do so as fast as possible
  What differences do we see in performance?
    Almost all computers operate correctly (within reason)
    Most computers implement useful operations
      This is a matter of taste...
    Computers all operate at different speeds
  Speed is the most important performance metric
Measuring speed (2.1)
  Which is faster?
    School Bus: 57 MPH, 40 people
    Ferrari: 170 MPH, 2 people
  Raw speed
    Ferrari wins (170 MPH vs. 57 MPH)
  Throughput
    Ferrari: 170 MPH x 2 people = 340 passenger-MPH
    School Bus: 57 MPH x 40 people = 2280 passenger-MPH
    School Bus wins
  Other issues...
    Range, reliability, cost
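The two ways of ranking the vehicles can be reproduced with a few lines of Python. This is only a sketch: the speeds and passenger counts come from the slide, while the dictionary layout and variable names are illustrative choices.

```python
# (speed in MPH, passengers) taken from the slide; the dict layout is illustrative.
vehicles = {"School Bus": (57, 40), "Ferrari": (170, 2)}

# Throughput in passenger-MPH = speed * passengers carried.
throughput = {name: mph * people for name, (mph, people) in vehicles.items()}

# Rank once by raw speed and once by throughput.
fastest = max(vehicles, key=lambda name: vehicles[name][0])
highest_throughput = max(throughput, key=throughput.get)

print("Raw speed winner:", fastest)
print("Throughput winner:", highest_throughput, throughput)
```

The same data gives a different winner depending on the metric, which is the point of the slide.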
Performance of computers (2.2)
  How long does it take to run my favorite program?
  To compare two computers, we compare the execution time of the same program on the two computers
    Lower execution time is better
    Faster one wins
  Batch throughput
  CPU time
  Response time
A little background... (2.2)
  Computer programs are (usually) written in a high-level language (e.g. C)
  The compiler converts this code into machine-language instructions
  The CPU interprets machine-language instructions and executes them
  The performance of a program depends on:
    The number and types of instructions executed
    How fast the CPU can execute those instructions
Tick-tock (2.2)
  Almost all modern computers are based on a clock
    All events are controlled by and synchronized to a regular clock
    Clocks are just regular periodic waveforms
  Cycle time: time for the waveform to repeat itself
    Also known as the clock period
  Frequency: 1/Period
  Example: 10 ns clock cycle --> period = 10^-8 s
    Frequency = 1/(10 ns) = 1/(10^-8 s) = 10^8 cycles/sec (100 MHz)
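The period/frequency relationship is easy to check numerically. The sketch below assumes SI units (seconds and hertz); the function names are illustrative, not from the slides.

```python
def frequency_from_period(period_s: float) -> float:
    """Clock frequency in Hz for a clock period given in seconds."""
    return 1.0 / period_s


def period_from_frequency(frequency_hz: float) -> float:
    """Clock period in seconds for a clock frequency given in Hz."""
    return 1.0 / frequency_hz


# The slide's example: a 10 ns clock cycle.
print(frequency_from_period(10e-9), "cycles/sec")

# The reverse direction, e.g. for a 200 MHz clock.
print(period_from_frequency(200e6), "seconds per cycle")
```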
Execution time (2.3)
  Since the cycle time of a computer is constant, we can express time in terms of CPU cycles
    Time = cycles * cycle time
    Time = cycles / clock frequency
  Performance can be improved by:
    Decreasing the cycle time
      Hardware solution: Use faster technology
    Decreasing the number of cycles for the program
      Software: Write a better program
      Hardware: Re-design CPU
Instruction execution time (2.3)
  Every instruction takes time to execute
    Some instructions may take more or less time than others
    The time for an instruction is expressed in terms of clock cycles
  Instruction  Cycles
  ADD          1
  MULT         4
  CMP          1
  SUB          2
  The time to run a program depends on:
    How many instructions
    What type of instructions
  Example: 30 ADDs and 4 MULTs --> 30 * 1 + 4 * 4 = 46 cycles
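Counting cycles for a given set of instructions can be sketched as follows. The cycle costs are the ones in the slide's table; the `total_cycles` helper is a hypothetical name introduced for illustration.

```python
# Cycles per instruction type, as given in the slide's table.
CYCLES = {"ADD": 1, "MULT": 4, "CMP": 1, "SUB": 2}


def total_cycles(counts: dict) -> int:
    """Sum the cycle cost of every executed instruction."""
    return sum(CYCLES[op] * n for op, n in counts.items())


# The slide's example: 30 ADDs and 4 MULTs.
print(total_cycles({"ADD": 30, "MULT": 4}))
```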
Average CPI (2.3)
  The Cycles-Per-Instruction (CPI) varies depending on what instructions are used
  Take an Average CPI
    Cycles = Number of Instructions * Average CPI
  Average CPI should reflect the mix of instructions in the program
    A large proportion of 4-cycle MULTs should raise the CPI, a large proportion of 1-cycle ADDs should lower it
    The average should be the weighted average
Weighing the average (2.3)
  Example mix of instructions
  Instruction  Cycles  %
  ADD          1       40
  MULT         4       10
  CMP          1       20
  SUB          2       30
  Average CPI = 1 * 40% + 4 * 10% + 1 * 20% + 2 * 30% = 0.4 + 0.4 + 0.2 + 0.6 = 1.6
  Notice: The average CPI depends on the code we’re executing!
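The weighted average can be written as a short function. This is a sketch using the slide's example mix; `average_cpi` and the tuple layout are illustrative choices.

```python
# Instruction type -> (cycles, fraction of executed instructions), from the slide.
MIX = {"ADD": (1, 0.40), "MULT": (4, 0.10), "CMP": (1, 0.20), "SUB": (2, 0.30)}


def average_cpi(mix: dict) -> float:
    """Weighted average of per-instruction cycle counts."""
    total_fraction = sum(fraction for _, fraction in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(cycles * fraction for cycles, fraction in mix.values())


print(average_cpi(MIX))
```

Changing the fractions in `MIX` changes the result, which mirrors the slide's remark that the average CPI depends on the code being executed.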
How long? (2.3)
  Execution time = Cycles * Cycle Time
  Cycles = Average CPI * Instruction Count
  Execution time = Instruction Count * CPI * Cycle Time
  Remember, lower is better
  Reducing any one of the three components reduces execution time, as long as the other two do not increase
    Cycle time - Reduced through technology change, change in CPU design
    CPI - Reduced through better code, better compiler, change in CPU design
    Instruction count - Reduced through better code, better compiler, change in CPU design
Examples (2.3)
  System A: 10s to run a program. Clock period is 20ns.
  System B: Change clock to 10ns, no other changes. How long does it take to run the same program on System B?
    --> Time A = CPI A x Period A x Instructions A = 10s
    --> Time B = CPI A x Period B x Instructions A = ? (Period B = Period A * 0.5)
    --> Time B = CPI A x (Period A * 0.5) x Instructions A = Time A * 0.5 = 5s
  System C: 10s to run a program, 20ns clock, 400,000,000 instr. What is the CPI?
    --> CPI C = Time C / (Period C x Instr C) = 10s / (20 x 10^-9 s x 4 x 10^8) = 1.25
  System D: 400,000,000 instr., 22ns clock and a CPI of 1.10. How long does it take to run the program on system D?
    --> Time D = CPI D x Period D x Instructions D = 1.10 x 22 x 10^-9 s x 4 x 10^8 = 9.68s
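The System B, C and D calculations all use the same relation, Execution time = Instruction Count * CPI * Cycle Time, solved for different unknowns. A minimal sketch follows; the function names are illustrative, and all inputs are in seconds and plain instruction counts.

```python
def execution_time(instructions: float, cpi: float, period_s: float) -> float:
    """Execution time = instruction count * CPI * cycle time."""
    return instructions * cpi * period_s


def cpi_from_time(time_s: float, instructions: float, period_s: float) -> float:
    """Solve the same relation for CPI."""
    return time_s / (instructions * period_s)


# System B: same program and CPI as System A (10 s at 20 ns), clock period halved to 10 ns.
time_b = 10.0 * (10e-9 / 20e-9)

# System C: 10 s, 20 ns clock, 400,000,000 instructions.
cpi_c = cpi_from_time(10.0, 4e8, 20e-9)

# System D: 400,000,000 instructions, 22 ns clock, CPI 1.10.
time_d = execution_time(4e8, 1.10, 22e-9)

print(time_b, cpi_c, time_d)
```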
Examples (2.3)
  Assume an add takes 1 cycle, a mult 4 cycles, and a sub 2 cycles
  Two different compilers produce the following loops for the same code (each loop runs 1,000,000 times):
    A: mult, add, mult, sub
    B: add, add, mult, sub, add, add
  What’s the CPI?
    CPI A = (4 + 1 + 4 + 2)/4 = 2.75
    CPI B = (1 + 1 + 4 + 2 + 1 + 1)/6 = 1.667
  How long does it take to run each program on a 200MHz CPU (clock period = 5ns)?
    Time A = CPI A x Period A x Instructions A = 2.75 x 5ns x 4,000,000 = 0.055s
    Time B = CPI B x Period B x Instructions B = 1.667 x 5ns x 6,000,000 = 0.050s
  B executes more instructions but finishes sooner because its CPI is lower
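The compiler comparison can be checked by counting cycles per loop iteration. The sketch below assumes loop A is mult, add, mult, sub and loop B is add, add, mult, sub, add, add (the orderings implied by the CPI sums on the slide); `loop_stats` is a hypothetical helper.

```python
# Cycle costs stated on the slide.
CYCLES = {"add": 1, "mult": 4, "sub": 2}

# Loop bodies as implied by the slide's CPI sums (an interpretation of the flattened text).
LOOP_A = ["mult", "add", "mult", "sub"]
LOOP_B = ["add", "add", "mult", "sub", "add", "add"]

ITERATIONS = 1_000_000
CLOCK_HZ = 200e6  # 200 MHz, i.e. a 5 ns period


def loop_stats(body):
    """Return (average CPI, total run time in seconds) for a loop body."""
    cycles_per_iteration = sum(CYCLES[op] for op in body)
    cpi = cycles_per_iteration / len(body)
    time_s = cycles_per_iteration * ITERATIONS / CLOCK_HZ
    return cpi, time_s


cpi_a, time_a = loop_stats(LOOP_A)
cpi_b, time_b = loop_stats(LOOP_B)
print(f"A: CPI={cpi_a:.3f}, time={time_a:.4f} s")
print(f"B: CPI={cpi_b:.3f}, time={time_b:.4f} s")
```

Computing time from total cycles rather than from a rounded CPI avoids carrying the 1.667 rounding into the result.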
Performance metrics (2.4)
  I’m concerned with how long it takes to run my program
  Chances are, that number isn’t published with the specs for the computer
  Standardized metrics
    Benchmarks (SPEC, etc.)
    MIPS
    MFLOPS
Benchmarks (2.5)
  Run a suite of benchmark programs, average the performance
  Benchmarks - programs thought to be representative of commonly-used programs
  Advantages
    Actually corresponds to execution time!
    Represents a wider range of programs
  Disadvantages
    Are they running your program?
    Who picks the benchmarks? Be wary if the manufacturer does!
SPEC Benchmarks (2.6)
  SPEC (System Performance Evaluation Cooperative, now the Standard Performance Evaluation Corporation) maintains a set of benchmark suites
  SPEC Web Page (www.spec.org)
  New tests use SPEC CPU2000
    CINT2000 - Performance on integer programs
    CFP2000 - Performance on floating-point programs
    Larger numbers indicate better performance
  Tests prior to 2000 used CPU95
    CPU2000 has only a few years of data
[Figure: SPECint95 Results for Intel Processors. X-axis: Clock Speed (MHz), 100 to 800. Y-axis: SPECint95. Annotation: better cache design (on-chip vs. off-chip). Note: Results depend on cache size, memory system, and motherboard.]
[Figure: SPECfp95 Results for Intel Processors. X-axis: Clock Speed (MHz), 100 to 800. Y-axis: SPECfp95. Note: Results depend on cache size, memory system, and motherboard.]
[Figure: CINT2000 Results for Various Processors. X-axis: Clock Speed (GHz). Y-axis: CINT2000. Athlon part numbers labeled on graph: 1500+, 1600+, 1800+, 2200+, 2400+, 2600+, 2700+, 3200+. Note: Athlon part numbers are not the CPU MHz! Note: Results depend on cache size, memory system, and motherboard.]
[Figure: CFP2000 Results for Various Processors. X-axis: Clock Speed (GHz). Y-axis: CFP2000. Athlon part numbers labeled on graph: 1500+, 1600+, 1800+, 2200+, 2400+, 2600+, 2700+, 3200+. Note: Athlon part numbers are not the CPU MHz! Note: Results depend on cache size, memory system, and motherboard.]
Limited benefits... (2.7)
  Assume we’re running a program that spends 40% of its time accessing memory
  Now, we upgrade the processor from 200 MHz to 800 MHz
  How much faster does the program run?
    We’ve reduced the time for 60% of the program by a factor of 4
    But we haven’t touched the memory access time
    New total = Old * (40% + (60% / 4)) = Old * (40% + 15%) = Old * 55%
    Speedup = 1 / 0.55 = about 1.8. Not even twice as fast!
Amdahl’s Law (2.7)
  New Execution Time = (Execution Time Affected by Improvement / Amount of Improvement) + Unaffected Execution Time
  Practical effect: “Make the common case fast”
  Corollary: “Forget about the rare case”
  Example: 70% of my execution time is spent on integer ADDs, and 6% on floating-point ADDs. Total execution time is 100 seconds.
    What’s the effect of making integer ADDs twice as fast?
      New time = (100 * 0.70) / 2 + (100 * 0.30) = 35 + 30 = 65 seconds
    What’s the effect of making F.P. ADDs twice as fast?
      New time = (100 * 0.06) / 2 + (100 * 0.94) = 3 + 94 = 97 seconds
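Amdahl’s Law as stated here fits in one small function. The sketch below applies it to the integer-ADD and floating-point-ADD examples, and to the 200 MHz to 800 MHz upgrade with 40% memory time; the function and parameter names are illustrative.

```python
def amdahl_new_time(old_time: float, affected_fraction: float, improvement: float) -> float:
    """New time = affected time / improvement + unaffected time."""
    affected = old_time * affected_fraction
    unaffected = old_time * (1.0 - affected_fraction)
    return affected / improvement + unaffected


# Integer ADDs: 70% of 100 s, made twice as fast.
int_add_time = amdahl_new_time(100.0, 0.70, 2.0)

# Floating-point ADDs: 6% of 100 s, made twice as fast.
fp_add_time = amdahl_new_time(100.0, 0.06, 2.0)

# Processor upgrade: 60% of the time is CPU-bound and gets 4x faster.
upgrade_ratio = amdahl_new_time(1.0, 0.60, 4.0)

print(int_add_time, fp_add_time, upgrade_ratio, 1.0 / upgrade_ratio)
```

The overall speedup is the old time divided by the new time, so it is bounded by how large the affected fraction is, no matter how big the improvement factor gets.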
(Native) MIPS (2.4)
  MIPS = Million Instructions Per Second
  MIPS = (instructions / second) * 10^-6
       = ((cycles / second) / CPI) * 10^-6
       = (clock rate / CPI) * 10^-6
  MIPS does not take into account how many instructions must be executed in a program
  Example: Same program, written two ways
    1. 1,000 instructions, CPI 1.2, 1.0 MHz clock
       Execution time = 1.2 ms, MIPS = 1/1.2 = 0.833
    2. 500 instructions, CPI 2.0, 1.0 MHz clock
       Execution time = 1.0 ms, MIPS = 1/2.0 = 0.500
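The mismatch between MIPS and execution time in the two-version example can be reproduced directly. This is a sketch; `native_mips` and `execution_time_s` are illustrative names, and the inputs are the ones given on the slide.

```python
def native_mips(clock_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


def execution_time_s(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz


CLOCK_HZ = 1.0e6  # 1.0 MHz, as on the slide

# (instruction count, CPI) for the two ways of writing the program.
versions = {"version 1": (1000, 1.2), "version 2": (500, 2.0)}

for name, (instructions, cpi) in versions.items():
    mips = native_mips(CLOCK_HZ, cpi)
    time_ms = execution_time_s(instructions, cpi, CLOCK_HZ) * 1e3
    print(f"{name}: {mips:.3f} MIPS, {time_ms:.1f} ms")
```

Version 1 has the higher MIPS rating but the longer run time, because MIPS ignores how many instructions each version needs.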
Avoid MIPS (the metric, not the processor) (2.4)
  MIPS = clock rate / (CPI * 1,000,000)
  Higher MIPS doesn’t always mean better performance
    Highest MIPS corresponds to using the smallest (fastest) instructions to lower CPI
  Peak MIPS is pointless
    Peak MIPS is just the MIPS you get with the smallest instructions
    Usually, CPI is 1.0 for this
    Just re-expressing clock rate in MHz
MFLOPS (2.4)
  Million Floating-point Operations Per Second
  MFLOPS is similar to MIPS
    Measures floating-point operations (mult, divide, add,...)
  Suffers the same problems as MIPS
    Different operations cost different amounts
  Peak MFLOPS is especially bad
Performance Summary
  Execution time is the most important performance metric
  Basic formula for performance:
    Execution time = instructions * cycle time * CPI
  Amdahl’s law describes how making limited improvements affects the bottom line
    Only make improvements in areas that are commonly used
  Standard benchmarks help us to compare performance of various computers
    Beware of overly-simplified comparisons
Pitfalls and Fallacies
  Fallacy: Processors with the same ISA can be compared by clock rate or a single benchmark suite alone
    We don’t know the pipeline structure and memory system
  Fallacy: Peak performance tracks observed performance
    One processor may operate closer to peak performance most of the time than another
  Fallacy: MIPS is an accurate measure of performance
Example
  We wish to consider the performance of two different machines: M1 and M2. The clock frequencies for the two machines are as follows:
                   M1       M2
  Clock Frequency  300 MHz  200 MHz
  Two programs were run on both machines and the following measurements were made:
  Program  Time on M1  Time on M2
  1        6 seconds   4 seconds
  2        8 seconds   10 seconds
  In addition, the following measurements were made:
  Program  No. of Instructions Executed on M1  No. of Instructions Executed on M2
  1        180x10^6                            100x10^6
  1. For each program, which machine is faster and by how much?
  2. Find the clock cycles per instruction (CPI or average CPI) for Program 1 on both machines
  3. On M1, each multiplication instruction takes 20 clock cycles. Suppose 20% of the instructions in Program 1 running on M1 are multiplications. What percentage of the CPU time is spent doing multiplications during the execution of Program 1 on M1?
  4. Find the instruction execution rate (i.e., the number of instructions executed per second) for each machine when running Program 1
  5. Assuming the CPI for each machine is constant, find the instruction count for Program 2 running on each machine using the execution times.
Solution
  1. For Program 1, M2 is faster by 2 sec: it takes (6-4)/6 = 33% less time (speedup 6/4 = 1.5)
     For Program 2, M1 is faster by 2 sec: it takes (10-8)/10 = 20% less time (speedup 10/8 = 1.25)
  2. t_M1P1 = INSTR_M1P1 x CPI_M1P1 x 1/f_M1
     => CPI_M1P1 = (t_M1P1 x f_M1)/INSTR_M1P1 = (6 x 300x10^6)/(180x10^6) = 10
     Likewise CPI_M2P1 = (4 x 200x10^6)/(100x10^6) = 8
  3. INSTR_MULT,M1P1 = 0.2 x 180x10^6 = 36x10^6 instructions
     t_MULT,M1P1 = INSTR_MULT,M1P1 x 20 x 1/(300x10^6) = 720/300 = 2.4 sec
     t_MULT,M1P1 / t_M1P1 = 2.4/6 = 40%
  4. MIPS_M1P1 = (INSTR_M1P1 / t_M1P1) x 10^-6 = 180/6 = 30
     MIPS_M2P1 = (INSTR_M2P1 / t_M2P1) x 10^-6 = 100/4 = 25
  5. t_M1P2 = INSTR_M1P2 x CPI_M1P2 x 1/f_M1, with CPI_M1P2 assumed equal to CPI_M1P1
     => INSTR_M1P2 = (t_M1P2 x f_M1)/CPI_M1P1 = (8 x 300x10^6)/10 = 240x10^6
     INSTR_M2P2 = (t_M2P2 x f_M2)/CPI_M2P1 = (10 x 200x10^6)/8 = 250x10^6
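The solution steps can be re-derived from the measured data with a short script. This is a sketch for checking the arithmetic; the dictionary layout and variable names are illustrative, and step 5 relies on the exercise's assumption that each machine's CPI is the same for both programs.

```python
# Measurements from the example.
FREQ = {"M1": 300e6, "M2": 200e6}                      # clock frequency in Hz
TIME = {("M1", 1): 6.0, ("M2", 1): 4.0,                # execution time in seconds
        ("M1", 2): 8.0, ("M2", 2): 10.0}
INSTR_P1 = {"M1": 180e6, "M2": 100e6}                  # Program 1 instruction counts

# Step 2: CPI = time * frequency / instruction count.
cpi_p1 = {m: TIME[(m, 1)] * FREQ[m] / INSTR_P1[m] for m in FREQ}

# Step 3: share of M1's Program 1 time spent in 20-cycle multiplications (20% of instructions).
mult_time_m1 = 0.20 * INSTR_P1["M1"] * 20 / FREQ["M1"]
mult_share_m1 = mult_time_m1 / TIME[("M1", 1)]

# Step 4: instruction execution rate in MIPS.
mips_p1 = {m: INSTR_P1[m] / TIME[(m, 1)] * 1e-6 for m in FREQ}

# Step 5: Program 2 instruction counts, assuming the Program 1 CPI carries over.
instr_p2 = {m: TIME[(m, 2)] * FREQ[m] / cpi_p1[m] for m in FREQ}

print(cpi_p1, mult_share_m1, mips_p1, instr_p2)
```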
Review Questions
  Is CPI constant for a given processor (does not change from one program to another)?
  Two processors with the same Instruction Set Architecture have the same CPI
    True
    False
  Is MIPS constant for a given processor (does not change from one program to another)?
  Two processors with the same Instruction Set Architecture have the same MIPS
    True
    False
Review Questions
  Which of the following performance metrics is generally easier for the programmer to improve?
    The instruction count
    The average CPI
    The clock frequency
    Peak MIPS
  What would you consider as most important when selecting the fastest processor for a certain application domain?
    The operating clock frequency
    MIPS
    Peak MIPS
    Execution time for relevant benchmarks
  How can you increase a processor’s clock frequency?
    Write a better program
    Use a better compiler
    Implement the processor in a faster VLSI technology
    Use a larger memory
Example
  We wish to consider the performance of two different machines: M1 and M2. The clock frequencies for the two machines are as follows:
                   M1       M2
  Clock Frequency  800 MHz  1000 MHz
  A program was run on both machines and the following measurements were made:
  Time on M1   Time on M2
  2.5 seconds  2 seconds
  In addition, the following measurements were made:
  No. of Instructions Executed on M1  No. of Instructions Executed on M2
  100x10^6                            125x10^6
  Finally, the frequencies with which instructions occur in the program on M1 and M2 are shown in the following table:
  Instruction  M1 %  M2 %
  ADD          40    60
  MULT         10    8
  CMP          20    12
  SUB          30    20
  1. Find the clock cycles per instruction (CPI or average CPI) for the program on both machines
  2. How much faster will the program run on M1 and M2 respectively if we
     a) reduce the execution time of the ADD instruction by 20%, assuming that an ADD instruction requires 5 cycles on both machines
     b) reduce the execution time of the MULT instruction by 20%, assuming a MULT instruction requires 20 cycles on M1 and 25 cycles on M2
     c) Which is better for M1 and which for M2?
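One possible way to set up this exercise is sketched below; the slides give no solution, so this is an illustration rather than the author's answer. It assumes that "reduce the execution time by 20%" means the instruction's cycle count drops by 20% while everything else is unchanged, and it applies Amdahl's Law to the share of total cycles spent in that instruction. All function and variable names are hypothetical.

```python
def average_cpi(time_s: float, clock_hz: float, instructions: float) -> float:
    """CPI = execution time * clock rate / instruction count."""
    return time_s * clock_hz / instructions


def speedup(cpi: float, instr_fraction: float, instr_cycles: float, reduction: float) -> float:
    """Amdahl-style speedup when one instruction type gets `reduction` (e.g. 0.20) cheaper."""
    # Fraction of all cycles spent in this instruction type.
    share = instr_fraction * instr_cycles / cpi
    return 1.0 / (1.0 - share * reduction)


# Question 1: average CPI from the measurements.
cpi_m1 = average_cpi(2.5, 800e6, 100e6)
cpi_m2 = average_cpi(2.0, 1000e6, 125e6)

# Question 2a: ADD takes 5 cycles on both machines (40% of instructions on M1, 60% on M2).
m1_add = speedup(cpi_m1, 0.40, 5, 0.20)
m2_add = speedup(cpi_m2, 0.60, 5, 0.20)

# Question 2b: MULT takes 20 cycles on M1 (10%) and 25 cycles on M2 (8%).
m1_mult = speedup(cpi_m1, 0.10, 20, 0.20)
m2_mult = speedup(cpi_m2, 0.08, 25, 0.20)

print(f"CPI: M1={cpi_m1}, M2={cpi_m2}")
print(f"M1 speedup: ADD {m1_add:.4f}, MULT {m1_mult:.4f}")
print(f"M2 speedup: ADD {m2_add:.4f}, MULT {m2_mult:.4f}")
```

Comparing the ADD and MULT speedups for each machine addresses question 2c under the stated assumptions.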