
Slide 1: COMP 206: Computer Architecture and Implementation
Montek Singh, Wed., Sep 3, 2003, Lecture 2

Slide 2: Outline
- Quantitative Principles of Computer Design: Amdahl's law (make the common case fast)
- Performance Metrics: MIPS, FLOPS, and all that…
- Examples

Slide 3: Quantitative Principles of Computer Design
- Execution time: also called response time or latency
- Performance: the rate of producing results; also called throughput or bandwidth

Slide 4: Comparison
- "Y is n times larger than X": n = Y / X
- "Y is n% larger than X": n = 100 × (Y − X) / X

Slide 5: Amdahl's Law (1967)
"Validity of the single processor approach to achieving large scale computing capabilities", G. M. Amdahl, AFIPS Conference Proceedings, pp. 483-485, April 1967
- Historical context: Amdahl was demonstrating "the continued validity of the single processor approach and of the weaknesses of the multiple processor approach". The paper contains no mathematical formulation, just arguments and simulation.
- "The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques."
- "A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel performance rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
- Nevertheless, it is of widespread applicability in all kinds of situations.

Slide 6: Amdahl's Law
The average execution rate (performance) is the weighted harmonic mean of the individual rates:

    R_avg = 1 / (Σ_i F_i / R_i)

where F_i is the fraction of results generated at rate R_i. Note: F_i is the fraction of results, not the "fraction of time spent working at this rate".
"Bottleneckology: Evaluating Supercomputers", Jack Worlton, COMPCON 85, pp. 405-406

Slide 7: Example of Amdahl's Law
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS. What is the average performance? What is the bottleneck?

    R_avg = 1 / (0.3/1 + 0.2/10 + 0.5/100) = 1 / 0.325 ≈ 3.08 MFLOPS

Bottleneck: the rate that consumes most of the time. Here the 1 MFLOPS rate accounts for 0.3/0.325 ≈ 92% of the time, so it is the bottleneck.
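A small Python sketch of this calculation (the helper names are mine, not from the slides):

    # Weighted harmonic mean: fractions f_i of results produced at rates r_i
    def average_rate(fractions, rates):
        return 1.0 / sum(f / r for f, r in zip(fractions, rates))

    fractions = [0.30, 0.20, 0.50]
    rates = [1.0, 10.0, 100.0]   # MFLOPS

    print(f"average performance = {average_rate(fractions, rates):.2f} MFLOPS")  # ~3.08

    # The time share of each rate identifies the bottleneck
    times = [f / r for f, r in zip(fractions, rates)]
    total = sum(times)
    for r, t in zip(rates, times):
        print(f"{r:6.1f} MFLOPS consumes {100 * t / total:.0f}% of the time")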

Slide 8: Amdahl's Law (HP3 book, pp. 40-41)

    Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

where Fraction_enhanced is the fraction of execution time affected by the enhancement, and Speedup_enhanced is the speedup of the enhanced portion alone.
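A minimal sketch of this formula in Python (the function name is mine):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Overall speedup when only a fraction of execution time is sped up
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # e.g., speeding up half the execution time by 2x yields only 1.33x overall
    print(amdahl_speedup(0.5, 2.0))   # 1.333...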

Slide 9: Implications of Amdahl's Law
- The performance improvement provided by a feature is limited by how often that feature is used.
- As stated, Amdahl's Law is valid only if the system always works with exactly one of the rates. If a non-blocking cache is used, or there is overlap between CPU and I/O operations, Amdahl's Law as given here is not applicable.
- The bottleneck is the most promising target for improvements: "Make the common case fast". Infrequent events, even if individually expensive, will make little difference to performance.
- Typical use: change only one parameter of the system, and compute the effect of this change. The same program, with the same input data, should run on the machine in both cases.

Slide 10: "Make The Common Case Fast"
- All instructions require an instruction fetch; only a fraction require a data fetch/store. Optimize instruction access over data access.
- Programs exhibit locality:
  - Spatial locality: items with addresses near one another tend to be referenced close together in time
  - Temporal locality: recently accessed items are likely to be accessed in the near future
- Access to small memories is faster. Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: registers, cache, main memory, disk/tape.

Slide 11: "Make The Common Case Fast" (2)
- What is the common case? The rate at which the system spends most of its time: the "bottleneck".
- What does this statement mean precisely?
  - Make the common case faster, rather than making some other case faster?
  - Make the common case faster by a certain amount, rather than making some other case faster by the same amount? By an absolute amount, or a relative amount?
- This principle is merely an informal statement of a frequently correct consequence of Amdahl's Law.

Slide 12: "Make The Common Case Fast" (3a)
A machine produces 20% and 80% of its results at the rates of 1 and 3 MFLOPS, respectively. What is more advantageous: to improve the 1 MFLOPS rate, or to improve the 3 MFLOPS rate?

Generalize the problem: assume the rates are x and y MFLOPS, so R_avg = 1 / (0.2/x + 0.8/y). For the same small absolute change in one rate, compare the sensitivities ∂R_avg/∂x = (0.2/x²) R_avg² and ∂R_avg/∂y = (0.8/y²) R_avg². At (x, y) = (1, 3), we get 0.2/1² = 0.2 > 0.8/3² ≈ 0.089, which indicates that it is better to improve x, the 1 MFLOPS rate, which is not the common case. (By time consumed, the 3 MFLOPS rate is the common case in this example: 0.8/3 ≈ 0.267 > 0.2/1 = 0.2, about 57% of the time.)

Slide 13: "Make The Common Case Fast" (3b)
Now suppose we make the same relative change to one or the other rate, rather than the same absolute change. The sensitivities to a relative change are x · ∂R_avg/∂x = (0.2/x) R_avg² and y · ∂R_avg/∂y = (0.8/y) R_avg². At (x, y) = (1, 3), we get 0.8/3 ≈ 0.267 > 0.2/1 = 0.2, which indicates that it is better to improve y, the 3 MFLOPS rate, which is the common case. A numerical check follows below.

If there are two different execution rates, making the common case faster by the same relative amount is always more advantageous than the alternative. However, this does not necessarily hold for absolute changes of the same magnitude. For three or more rates, further analysis is needed.
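A quick numerical check of both cases in Python (a sketch; the helper name is mine):

    def avg_rate(x, y, fx=0.2, fy=0.8):
        # Weighted harmonic mean for two rates
        return 1.0 / (fx / x + fy / y)

    x, y = 1.0, 3.0
    base = avg_rate(x, y)

    # Same absolute improvement (+0.1 MFLOPS): the 1 MFLOPS rate wins
    print(avg_rate(x + 0.1, y) - base)   # ~0.087, larger gain
    print(avg_rate(x, y + 0.1) - base)   # ~0.040, smaller gain

    # Same relative improvement (x1.1): the 3 MFLOPS rate (common case) wins.
    # Note that for x = 1.0 the +0.1 and x1.1 changes happen to coincide.
    print(avg_rate(x * 1.1, y) - base)   # ~0.087, smaller gain
    print(avg_rate(x, y * 1.1) - base)   # ~0.117, larger gain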

Slide 14: Basics of Performance

    Performance = 1 / Execution time
    CPU time = Instruction count × CPI × Clock cycle time
             = (Instruction count × CPI) / Clock rate

Slide 15: Details of CPI

    CPI = Total clock cycles / Instruction count = Σ_i (IC_i × CPI_i) / Σ_i IC_i

where IC_i is the count of executed instructions of class i and CPI_i is the cycles per instruction for that class.
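As a sketch, the overall CPI and CPU time can be computed like this (the instruction counts and CPIs below are illustrative, not from the slides):

    # Instruction classes: (executed count, CPI per class); made-up values
    classes = {"alu": (450, 1), "load": (200, 2), "store": (100, 2), "branch": (250, 2)}

    ic = sum(n for n, _ in classes.values())
    cycles = sum(n * cpi for n, cpi in classes.values())
    cpi = cycles / ic
    clock_ns = 2.0   # e.g., a 500 MHz clock

    print(f"CPI = {cpi:.2f}")
    print(f"CPU time = {ic * cpi * clock_ns:.0f} ns")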

Slide 16: MIPS

    MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)

- How do we compare machines with different instruction sets?
- How do we compare programs with different instruction mixes? MIPS depends on the dynamic frequency of instructions.
- MIPS is uncorrelated with performance: a marketing metric ("Meaningless Indicator of Processor Speed").

Slide 17: MFLOP/s
- Popular in the supercomputing community
- Floating-point operations are often not where the time is spent
- Not all FP operations are equal, hence "normalized" MFLOP/s
- Can magnify performance differences: a better algorithm (e.g., with better data reuse) can run faster even with a higher FLOP count (DGEQRF vs. DGEQR2 in LAPACK)

Slide 18: Aspects of CPU Performance

    CPU time = Instruction count × CPI × Clock cycle time

Each factor depends on different aspects of the system: the program, compiler, and instruction set largely determine the instruction count; the instruction set and hardware organization determine the CPI; and the organization and underlying technology determine the clock cycle time.

Slide 19: Example 1 (see HP3 pp. 42-45 for more examples)
Which change is more effective on a certain machine: speeding up 10-fold the floating-point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating-point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

Notation:
F_sqrt = fraction of FP sqrt results; R_sqrt = rate of producing FP sqrt results
F_non-sqrt = fraction of non-sqrt results; R_non-sqrt = rate of producing non-sqrt results
F_fp = fraction of FP results; R_fp = rate of producing FP results
F_non-fp = fraction of non-FP results; R_non-fp = rate of producing non-FP results
R_before = average rate of producing results before enhancement
R_after = average rate of producing results after enhancement

Slide 20: Example 1 (Solution using Amdahl's Law)

    Improve FP sqrt only: Speedup_overall = 1 / ((1 − 0.2) + 0.2/10) = 1/0.82 ≈ 1.22
    Improve all FP ops:   Speedup_overall = 1 / ((1 − 0.5) + 0.5/2)  = 1/0.75 ≈ 1.33

Improving all FP operations gives the greater overall speedup.
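A sketch of this comparison in Python, with the helper from slide 8 repeated so the snippet runs standalone (the function name is mine):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    print(amdahl_speedup(0.2, 10.0))   # ~1.22: FP sqrt sped up 10x
    print(amdahl_speedup(0.5, 2.0))    # ~1.33: all FP ops sped up 2x (the better choice)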

Slide 21: Example 2
Which CPU performs better? Why?

Slide 22: Example 2 (Solution)
If the clock cycle time of A were only 1.1× the clock cycle time of B, then CPU B would have about 9% higher performance.

Slide 23: Example 3
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

Slide 24: Example 3 (Solution)
Compare total clock cycles before and after the change: replacing each affected load + ALU pair with one register-memory ALU instruction reduces the instruction count, but the new instructions' CPI of 2 and the higher branch CPI raise the average CPI enough that total clock cycles grow. Since CPU time increases, the change will not improve performance.
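The slide's instruction-mix table is not reproduced above, so the Python sketch below assumes an illustrative mix (43% ALU at CPI 1; 21% loads, 12% stores, and 24% branches at CPI 2); treat these numbers as assumptions, not the slide's data:

    # Assumed mix: (fraction of instructions, CPI) per class
    mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

    cycles_before = sum(f * c for f, c in mix.values())   # IC normalized to 1

    # 25% of ALU ops fuse with their load into a new reg-mem instruction (CPI 2);
    # branch CPI rises from 2 to 3
    fused = 0.25 * mix["alu"][0]
    after = {
        "alu": (mix["alu"][0] - fused, 1),
        "reg_mem": (fused, 2),
        "load": (mix["load"][0] - fused, 2),
        "store": (mix["store"][0], 2),
        "branch": (mix["branch"][0], 3),
    }
    ic_after = sum(f for f, _ in after.values())              # < 1.0
    cycles_after = sum(f * c for f, c in after.values())

    print(f"before: cycles = {cycles_before:.3f}")
    print(f"after:  cycles = {cycles_after:.3f} (IC = {ic_after:.3f})")
    # With an unchanged clock, more total cycles means worse performance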

Slide 25: Example 4
A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?

Slide 26: Example 4 (Solution)
Compare the two cases: discarding ALU operations (the cheapest instructions, with the lowest CPI) removes instructions but raises the average CPI, so the MIPS rating drops even though total execution time falls. Performance increases, but MIPS decreases!
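A sketch in Python, again with an assumed instruction mix (illustrative values, not taken from the slide):

    # Assumed mix: (fraction of instructions, CPI) per class
    mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}
    clock_hz = 500e6

    def mips_and_time(counts):
        ic = sum(f for f, _ in counts.values())              # normalized instruction count
        cpi = sum(f * c for f, c in counts.values()) / ic
        return clock_hz / (cpi * 1e6), ic * cpi / clock_hz   # MIPS, execution time (s)

    opt = dict(mix)
    opt["alu"] = (mix["alu"][0] * 0.5, 1)   # compiler discards 50% of ALU ops

    for name, counts in [("unoptimized", mix), ("optimized", opt)]:
        m, t = mips_and_time(counts)
        print(f"{name}: MIPS = {m:.0f}, relative time = {t:.3e} s")
    # The optimized code runs faster yet reports a lower MIPS rating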

Slide 27: Performance of (Blocking) Caches

With no cache misses:
    CPU time = IC × CPI_execution × Clock cycle time

With cache misses:
    CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Slide 28: Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?

Memory accesses per instruction = 1 (instruction fetch) + 0.4 (data) = 1.4, so misses add 1.4 × 0.02 × 25 = 0.7 to the CPI. The machine with misses has CPI 2.0 + 0.7 = 2.7, so the all-hits machine is 2.7/2.0 = 1.35 times faster. Why? Even a 2% miss rate is costly, because every instruction makes at least one memory access and each miss costs 25 cycles.
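The same arithmetic as a short Python sketch:

    cpi_base = 2.0
    data_refs_per_instr = 0.4       # loads + stores
    miss_rate = 0.02
    miss_penalty = 25               # clock cycles

    accesses_per_instr = 1.0 + data_refs_per_instr    # fetch + data
    stall_cpi = accesses_per_instr * miss_rate * miss_penalty
    cpi_real = cpi_base + stall_cpi

    print(f"CPI with misses = {cpi_real:.2f}")                  # 2.70
    print(f"speedup if all hits = {cpi_real / cpi_base:.2f}")   # 1.35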

Slide 29: Means
For n positive values x_1, …, x_n:

    Arithmetic mean: AM = (x_1 + … + x_n) / n
    Geometric mean:  GM = (x_1 × … × x_n)^(1/n)
    Harmonic mean:   HM = n / (1/x_1 + … + 1/x_n)

Slide 30: Weighted Means
With weights w_i ≥ 0 and Σ_i w_i = 1:

    Weighted arithmetic mean: Σ_i w_i × x_i
    Weighted geometric mean:  Π_i x_i^(w_i)
    Weighted harmonic mean:   1 / Σ_i (w_i / x_i)
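These are straightforward to compute; a sketch in Python (function names are mine):

    import math

    def weighted_am(xs, ws):
        return sum(w * x for x, w in zip(xs, ws))

    def weighted_gm(xs, ws):
        return math.exp(sum(w * math.log(x) for x, w in zip(xs, ws)))

    def weighted_hm(xs, ws):
        return 1.0 / sum(w / x for x, w in zip(xs, ws))

    # The weighted harmonic mean reproduces the slide 7 example (~3.08 MFLOPS)
    xs, ws = [1.0, 10.0, 100.0], [0.3, 0.2, 0.5]
    print(weighted_am(xs, ws), weighted_gm(xs, ws), weighted_hm(xs, ws))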

Slide 31: Relations among Means

    HM ≤ GM ≤ AM

Equality holds if and only if all the elements are identical.

Slide 32: Summarizing Computer Performance
"Characterizing Computer Performance with a Single Number", J. E. Smith, CACM, October 1988, pp. 1202-1206
- The starting point is universally accepted: "The time required to perform a specified amount of computation is the ultimate measure of computer performance."
- How should we summarize (reduce to a single number) the measured execution times (or measured performance values) of several benchmark programs?
- Two required properties:
  - A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks.
  - A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks.

Slide 33: Arithmetic Mean for Times

    AM = (1/n) Σ_i T_i  (weighted: Σ_i w_i × T_i)

Smaller is better for execution times. The arithmetic mean is directly proportional to total (weighted) time, so it is the appropriate summary for times.

Slide 34: Harmonic Mean for Rates

    HM = n / Σ_i (1/R_i)  (weighted: 1 / Σ_i (w_i / R_i))

Larger is better for execution rates. The harmonic mean is inversely proportional to total (weighted) time, so it is the appropriate summary for rates.

Slide 35: Avoid the Geometric Mean
- If benchmark execution times are normalized to some reference machine, and means of normalized execution times are computed, only the geometric mean gives consistent results no matter what the reference machine is (see Figure 1.17 in HP3, p. 38). This has led to the geometric mean being declared the preferred method of summarizing execution time (e.g., by SPEC).
- Smith's comments:
  - "The geometric mean does provide a consistent measure in this context, but it is consistently wrong."
  - "If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first."
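A toy demonstration of Smith's point in Python, with purely hypothetical benchmark times (not from any real machine):

    import math

    # Hypothetical execution times (seconds) of two benchmarks on two machines
    times_a = [2.0, 40.0]
    times_b = [4.0, 20.0]

    def gm(xs):
        xs = list(xs)
        return math.exp(sum(map(math.log, xs)) / len(xs))

    # Normalize each benchmark to machine A, then take geometric means
    norm_a = gm(t / ref for t, ref in zip(times_a, times_a))   # = 1.0 by construction
    norm_b = gm(t / ref for t, ref in zip(times_b, times_a))

    print(f"total time: A = {sum(times_a)} s, B = {sum(times_b)} s")   # B is much faster
    print(f"GM of normalized times: A = {norm_a:.2f}, B = {norm_b:.2f}")
    # The GM calls the machines equal (2x slower on one benchmark, 2x faster
    # on the other), while total time says B wins by a wide margin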

Slide 36: Programs to Evaluate Performance
- (Toy) benchmarks: 10-100 line programs (sieve, puzzle, quicksort)
- Synthetic benchmarks: attempt to match average frequencies of real workloads (Whetstone, Dhrystone)
- Kernels: time-critical excerpts of real programs (Livermore loops)
- Real programs (gcc, compress)
"The principle behind benchmarking is to model a real job mix with a smaller set of representative programs." J. E. Smith

