




1 CS510 Computer Architectures, Lecture 3: Benchmarks and Performance Metrics

2 Measurement Tools
- Benchmarks, traces, mixes
- Cost, delay, area, power estimation
- Simulation at many levels: ISA, RT, gate, circuit
- Queuing theory
- Rules of thumb
- Fundamental laws

3 The Bottom Line: Performance (and Cost)
- Time to run the task (ExTime): execution time, response time, latency
- Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth

    Plane              Speed      Time (DC-Paris)   Passengers   Throughput (pmph)
    Boeing 747         610 mph    6.5 hours         470          286,700
    BAC/Sud Concorde   1350 mph   3.0 hours         132          178,200

4 The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means:

    n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)

5 Performance Terminology
"X is n% faster than Y" means:

    n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)

    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

6 Example
Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

    ExTime(Y) / ExTime(X) = 15 / 10 = 1.5 = Performance(X) / Performance(Y)

    n = 100 x (1.5 - 1.0) / 1.0 = 50%
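The example above can be checked with a few lines of Python (a minimal sketch; the function name `percent_faster` is ours, not from the lecture):

```python
def percent_faster(extime_y, extime_x):
    """Return n such that X is n% faster than Y, given execution times in seconds."""
    # n/100 = ExTime(Y)/ExTime(X) - 1 = Performance(X)/Performance(Y) - 1
    return 100 * (extime_y / extime_x - 1)

n = percent_faster(15, 10)             # Y takes 15 s, X takes 10 s
print(f"X is {n:.0f}% faster than Y")  # prints "X is 50% faster than Y"
```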

7 Programs to Evaluate Processor Performance
- (Toy) benchmarks: 10-100 line programs, e.g. sieve, puzzle, quicksort
- Synthetic benchmarks: attempt to match average frequencies of real workloads, e.g. Whetstone, Dhrystone
- Kernels: time-critical excerpts of real programs, e.g. Livermore loops
- Real programs: e.g. gcc, spice

8 Benchmarking Games
- Differing configurations used to run the same workload on two systems
- Compiler wired to optimize the workload
- Workload arbitrarily picked
- Very small benchmarks used
- Benchmarks manually translated to optimize performance

9 Common Benchmarking Mistakes
- Only average behavior represented in test workload
- Ignoring monitoring overhead
- Not ensuring the same initial conditions
- "Benchmark engineering": particular optimizations, different compilers or preprocessors, runtime libraries

10 SPEC: System Performance Evaluation Cooperative
- First round, 1989: 10 programs yielding a single number
- Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs); reference machine VAX-11/780
- Third round, 1995: single flag setting for all programs; new set of programs ("benchmarks useful for 3 years"); reference machine SPARCstation 10 Model 40

11 SPEC First Round
- One program spent 99% of its time in a single line of code
- A new front-end compiler could improve results dramatically

12 How to Summarize Performance
- Arithmetic mean (weighted arithmetic mean) tracks execution time: (1/n) Σ T_i, or Σ W_i x T_i
- Harmonic mean (weighted harmonic mean) of execution rates (e.g., MFLOPS) also tracks execution time: n / Σ (1/R_i), or 1 / Σ (W_i / R_i)
- Normalized execution time is handy for scaling performance
- But do not take the arithmetic mean of normalized execution times; use the geometric mean (Π R_i)^(1/n), where R_i = 1/T_i
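These means can be sketched in Python (illustrative helper names; note that the harmonic mean of rates n / Σ(1/R_i) equals n divided by total execution time, which is why it tracks time):

```python
from math import prod

def arithmetic_mean(times):
    return sum(times) / len(times)

def harmonic_mean(rates):
    # n / sum(1/R_i); with R_i = 1/T_i this is n / (total execution time)
    return len(rates) / sum(1 / r for r in rates)

def geometric_mean(xs):
    return prod(xs) ** (1 / len(xs))

times = [1.0, 1000.0]            # e.g. P1 and P2 on machine A (next slide)
rates = [1 / t for t in times]   # R_i = 1/T_i
print(arithmetic_mean(times))              # 500.5
print(round(harmonic_mean(rates), 6))      # 0.001998  (= 2 / 1001)
print(geometric_mean([10.0, 0.1]))         # 1.0
```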

13 Comparing and Summarizing Performance
For program P1, A is 10 times faster than B; for program P2, B is 10 times faster than A; and so on. The relative performance of the computers is unclear even with total execution times:

                  Computer A   Computer B   Computer C
    P1 (secs)              1           10           20
    P2 (secs)          1,000          100           20
    Total (secs)       1,001          110           40

14 Summary Measure
Arithmetic mean (good if programs are run equally in the workload):

    AM = (1/n) Σ Execution Time_i

Harmonic mean (when performance is expressed as rates, Rate_i = 1 / Execution Time_i):

    HM = n / Σ (1 / Rate_i)

15 Unequal Job Mix
Weighted execution time:
- Weighted arithmetic mean: Σ Weight_i x Execution Time_i
- Weighted harmonic mean: 1 / Σ (Weight_i / Rate_i)

Relative performance (execution time normalized to a reference machine):
- Arithmetic mean of normalized times: (1/n) Σ Execution Time Ratio_i
- Geometric mean of normalized times

16 Weighted Arithmetic Mean

    WAM(i) = Σ_j W(i)_j x Time_j

                     A         B       C     W(1)    W(2)    W(3)
    P1 (secs)       1.00     10.00   20.00   0.50    0.909   0.999
    P2 (secs)   1,000.00    100.00   20.00   0.50    0.091   0.001

    WAM(1)        500.50     55.00   20.00   (e.g., 1.0 x 0.5 + 1,000 x 0.5 = 500.50)
    WAM(2)         91.91     18.19   20.00
    WAM(3)          2.00     10.09   20.00
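A short script reproduces the table (machine times and weights copied from the slide; the helper name `wam` is ours):

```python
def wam(weights, times):
    # Weighted arithmetic mean: sum over programs j of W_j * Time_j
    return sum(w * t for w, t in zip(weights, times))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
weight_sets = {1: [0.5, 0.5], 2: [0.909, 0.091], 3: [0.999, 0.001]}

for i, w in weight_sets.items():
    row = [round(wam(w, times[m]), 2) for m in ("A", "B", "C")]
    print(f"WAM({i}):", row)
```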

17 Normalized Execution Time
Execution times:

              A         B       C
    P1       1.00     10.00   20.00
    P2   1,000.00    100.00   20.00

Execution time normalized to each machine:

                       Normalized to A     Normalized to B      Normalized to C
                        A     B      C      A     B     C       A      B     C
    P1                 1.0  10.0   20.0    0.1   1.0   2.0     0.05   0.5   1.0
    P2                 1.0   0.1    0.02  10.0   1.0   0.2    50.0    5.0   1.0
    Arithmetic mean    1.0   5.05  10.01   5.05  1.0   1.1    25.03   2.75  1.0
    Geometric mean     1.0   1.0    0.63   1.0   1.0   0.63    1.58   1.58  1.0
    Total time         1.0   0.11   0.04   9.1   1.0   0.36   25.03   2.75  1.0

    Geometric Mean = (Π Execution Time Ratio_i)^(1/n)
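The table's key property, that the geometric mean ranks the machines the same way whatever the reference, can be verified in Python (a sketch; `normalized_gm` is our helper name):

```python
from math import prod

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

def normalized_gm(machine, ref):
    # Geometric mean of execution times normalized to a reference machine
    ratios = [t / r for t, r in zip(times[machine], times[ref])]
    return prod(ratios) ** (1 / len(ratios))

# The ratio of B's mean to A's mean is identical for every reference machine
print(round(normalized_gm("B", "A") / normalized_gm("A", "A"), 6))  # 1.0
print(round(normalized_gm("B", "C") / normalized_gm("A", "C"), 6))  # 1.0
```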

18 Disadvantages of Arithmetic Mean
Performance varies depending on the reference machine:

                       Normalized to A     Normalized to B      Normalized to C
                        A     B      C      A     B     C       A      B     C
    P1                 1.0  10.0   20.0    0.1   1.0   2.0     0.05   0.5   1.0
    P2                 1.0   0.1    0.02  10.0   1.0   0.2    50.0    5.0   1.0
    Arithmetic mean    1.0   5.05  10.01   5.05  1.0   1.1    25.03   2.75  1.0

Normalized to A: B is 5 times slower than A, and C is slowest.
Normalized to B: A is 5 times slower than B.
Normalized to C: C is fastest.

19 The Pros and Cons of Geometric Means
- Independent of the running times of the individual programs
- Independent of the reference machine
- Does not predict execution time: the geometric means say A and B perform the same, which is only true of total execution time when P1 runs 100 times for every occurrence of P2:

    1(P1) x 100 + 1000(P2) x 1 = 10(P1) x 100 + 100(P2) x 1

                       Normalized to A     Normalized to B      Normalized to C
                        A     B      C      A     B     C       A      B     C
    P1                 1.0  10.0   20.0    0.1   1.0   2.0     0.05   0.5   1.0
    P2                 1.0   0.1    0.02  10.0   1.0   0.2    50.0    5.0   1.0
    Geometric mean     1.0   1.0    0.63   1.0   1.0   0.63    1.58   1.58  1.0


22 Amdahl's Law
Speedup due to enhancement E:

    Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

    ExTime(E) = ExTime x ((1 - F) + F/S)

    Speedup(E) = 1 / ((1 - F) + F/S)

23 Amdahl's Law

    ExTime_E = ExTime x ((1 - Fraction_E) + Fraction_E / Speedup_E)

    Speedup = ExTime / ExTime_E = 1 / ((1 - Fraction_E) + Fraction_E / Speedup_E) = 1 / ((1 - F) + F/S)

24 Amdahl's Law
Floating-point instructions are improved to run 2 times faster (100% improvement), but FP accounts for only 10% of execution time:

    Speedup = 1 / ((1 - F) + F/S) = 1 / ((1 - 0.1) + 0.1/2) = 1 / 0.95 = 1.053

A 5.3% improvement.
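Amdahl's Law is one line of Python (a sketch; `amdahl_speedup` is our name, and the second call illustrates the law's ceiling, not a figure from the slide):

```python
def amdahl_speedup(fraction, factor):
    """Speedup when a fraction F of execution time is accelerated by a factor S."""
    return 1 / ((1 - fraction) + fraction / factor)

# FP runs 2x faster but is only 10% of execution time
print(f"{amdahl_speedup(0.10, 2):.3f}")     # prints "1.053"
# Even a near-infinite FP speedup is capped by the untouched 90%
print(f"{amdahl_speedup(0.10, 1e12):.3f}")  # prints "1.111"
```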

25 Corollary (Amdahl): Make the Common Case Fast
- All instructions require an instruction fetch; only a fraction require a data fetch/store. Optimize instruction access over data access.
- Programs exhibit locality (spatial and temporal), and access to small memories is faster. Provide a storage hierarchy (registers, cache, memory, disk/tape) such that the most frequent accesses are to the smallest (closest) memories.

26 Locality of Access
- Spatial locality: there is a high probability that data items whose addresses differ by a small amount will be accessed close together in time.
- Temporal locality: there is a high probability that recently referenced data will be referenced again in the near future.

27 Rule of Thumb
The simple case is usually the most frequent and the easiest to optimize! Do simple, fast things in hardware (faster), and be sure the rest can be handled correctly in software.

28 Metrics of Performance
Each level of the system stack has its natural metric:
- Application: answers per month, operations per second
- Programming language / compiler / ISA: (millions of) instructions per second (MIPS), (millions of) floating-point operations per second (MFLOP/s)
- Datapath, control: megabytes per second
- Function units, transistors, wires, pins: cycles per second (clock rate)

29 Aspects of CPU Performance

    CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

                   Inst Count   CPI    Clock Rate
    Program            X
    Compiler           X        (X)
    Inst. Set          X         X
    Organization                 X         X
    Technology                             X
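The CPU performance equation translates directly into code (a sketch with made-up numbers, not figures from the lecture):

```python
def cpu_time(inst_count, cpi, clock_rate_hz):
    # Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
    return inst_count * cpi / clock_rate_hz

# Hypothetical program: 10^9 instructions, CPI 1.5, 500 MHz clock
print(cpu_time(1e9, 1.5, 500e6))  # 3.0 seconds
```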

30 Marketing Metrics

    MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)

- Machines with different instruction sets? Programs with different instruction mixes (dynamic frequency of instructions)?
- Not correlated with performance

    MFLOP/s = FP Operations / (Time x 10^6)

- Machine dependent; often not where time is spent
- Normalized operation counts: add, sub, compare, mult = 1; divide, sqrt = 4; exp, sin, ... = 8

31 Cycles Per Instruction

    CPU time = Cycle Time x Σ_{i=1..n} (CPI_i x I_i)

    CPI = Σ_{i=1..n} (CPI_i x F_i), where F_i = I_i / Instruction Count  (instruction frequency)

    CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count  (average cycles per instruction)

Invest resources where time is spent!

32 Organizational Trade-offs
Instruction mix, CPI, and cycle time are shaped at different levels of the stack: application, programming language, and compiler shape the instruction mix; the ISA, datapath, and control shape CPI; function units, transistors, wires, and pins shape the cycle time.

33 Example: Calculating CPI
Base machine (Reg/Reg), typical mix:

    Op       Freq   CPI_i   Freq x CPI_i   (% Time)
    ALU      50%    1       0.5            (33%)
    Load     20%    2       0.4            (27%)
    Store    10%    2       0.2            (13%)
    Branch   20%    2       0.4            (27%)
                    CPI  =  1.5
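The CPI calculation above can be reproduced with a short script (mix taken from the slide):

```python
mix = {            # op: (frequency F_i, cycles CPI_i), base machine from the slide
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

cpi = sum(f * c for f, c in mix.values())
print("CPI =", round(cpi, 2))                         # CPI = 1.5
for op, (f, c) in mix.items():
    print(f"{op}: {100 * f * c / cpi:.0f}% of time")  # ALU 33%, Load 27%, ...
```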

34 Example
Base machine (Reg/Reg), typical mix:

    Op       Freq_i   CPI_i
    ALU      50%      1
    Load     20%      2
    Store    10%      2
    Branch   20%      2

Add register/memory operations (R/M): one source operand in memory, one source operand in a register, with a cycle count of 2. Some Load instructions can then be eliminated by using an R/M-type ADD instruction [ADD R1, X], but the branch cycle count increases to 3. What fraction of the loads must be eliminated for this to pay off?

35 Example Solution
Exec Time = Instr Cnt x CPI x Clock

    Op       Freq_i   CPI_i   Freq x CPI_i
    ALU      .50      1       .5
    Load     .20      2       .4
    Store    .10      2       .2
    Branch   .20      2       .4
    Total   1.00              1.5

36 Example Solution
Exec Time = Instr Cnt x CPI x Clock. CPI_new must be normalized to the new instruction count. Let X be the fraction of all original instructions converted to Reg/Mem form (each conversion removes one Load and one ALU op):

    Op        Old Freq_i  CPI_i  Cycles    New Freq_i  CPI_i  New cycles
    ALU       .50         1      .5        .5 - X      1      .5 - X
    Load      .20         2      .4        .2 - X      2      .4 - 2X
    Store     .10         2      .2        .1          2      .2
    Branch    .20         2      .4        .2          3      .6
    Reg/Mem                                X           2      2X
    Total    1.00         1.5              1 - X              1.7 - X

    CPI_new = (1.7 - X) / (1 - X)

37 Example Solution
Exec Time = Instr Cnt x CPI x Clock. Setting old and new execution times equal (same clock):

    Instr Cnt_old x CPI_old x Clock = Instr Cnt_new x CPI_new x Clock
    1.00 x 1.5 = (1 - X) x (1.7 - X) / (1 - X)
    1.5 = 1.7 - X
    X = 0.2

All LOADs must be eliminated for this to be a win!
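The break-even algebra can be double-checked numerically (a sketch; the helper names are ours, and relative execution time is measured against the old machine's 1.00 x 1.5 = 1.5):

```python
def new_cpi(x):
    # x = fraction of all original instructions turned into Reg/Mem ADDs,
    # each removing one Load and one ALU op; branches now take 3 cycles
    cycles = (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2
    return cycles / (1 - x)          # normalize to the new instruction count

def new_time(x):
    # Relative exec time = (new instruction count) x (new CPI), clock unchanged
    return (1 - x) * new_cpi(x)

print(round(new_time(0.1), 2))   # 1.6: worse than the old 1.5
print(round(new_time(0.2), 2))   # 1.5: break-even only when all loads are gone
```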

38 Fallacies and Pitfalls
- Fallacy: MIPS is an accurate measure for comparing performance among computers. In fact, MIPS depends on the instruction set, varies between programs on the same computer, and can even vary inversely with performance.
- Fallacy: MFLOPS is a consistent and useful measure of performance. In fact, MFLOPS depends on both the machine and the program, is not applicable outside floating-point code, and the set of floating-point operations is not consistent across machines.


