CS2100 Computer Organisation

Slides:



Advertisements
Similar presentations
CS1104: Computer Organisation School of Computing National University of Singapore.
Advertisements

CS1104: Computer Organisation School of Computing National University of Singapore.
Performance What differences do we see in performance? Almost all computers operate correctly (within reason) Most computers implement useful operations.
CS2100 Computer Organisation Performance (AY2014/2015) Semester 2.
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
1  1998 Morgan Kaufmann Publishers Chapter 2 Performance Text in blue is by N. Guydosh Updated 1/25/04*
Evaluating Performance
100 Performance ENGR 3410 – Computer Architecture Mark L. Chang Fall 2006.
Computer Organization and Architecture 18 th March, 2008.
Chapter 1 CSF 2009 Computer Performance. Defining Performance Which airplane has the best performance? Chapter 1 — Computer Abstractions and Technology.
CSCE 212 Chapter 4: Assessing and Understanding Performance Instructor: Jason D. Bakos.
1 Introduction Rapidly changing field: –vacuum tube -> transistor -> IC -> VLSI (see section 1.4) –doubling every 1.5 years: memory capacity processor.
1 CSE SUNY New Paltz Chapter 2 Performance and Its Measurement.
Performance D. A. Patterson and J. L. Hennessey, Computer Organization & Design: The Hardware Software Interface, Morgan Kauffman, second edition 1998.
Computer ArchitectureFall 2007 © September 17, 2007 Karem Sakallah CS-447– Computer Architecture.
Copyright © 1998 Wanda Kunkle Computer Organization 1 Chapter 2.1 Introduction.
Chapter 4 Assessing and Understanding Performance
Fall 2001CS 4471 Chapter 2: Performance CS 447 Jason Bakos.
Lecture 3: Computer Performance
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
1 ECE3055 Computer Architecture and Operating Systems Lecture 2 Performance Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia.
CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Computer Organization and Design Performance Montek Singh Mon, April 4, 2011 Lecture 13.
1 Computer Performance: Metrics, Measurement, & Evaluation.
1 Embedded Systems Computer Architecture. Embedded Systems2 Memory Hierarchy Registers Cache RAM Disk L2 Cache Speed (faster) Cost (cheaper per-byte)
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
10/19/2015Erkay Savas1 Performance Computer Architecture – CS401 Erkay Savas Sabanci University.
1 CS/EE 362 Hardware Fundamentals Lecture 9 (Chapter 2: Hennessy and Patterson) Winter Quarter 1998 Chris Myers.
1 Acknowledgements Class notes based upon Patterson & Hennessy: Book & Lecture Notes Patterson’s 1997 course notes (U.C. Berkeley CS 152, 1997) Tom Fountain.
Performance.
1 CS/COE0447 Computer Organization & Assembly Language CHAPTER 4 Assessing and Understanding Performance.
Performance – Last Lecture Bottom line performance measure is time Performance A = 1/Execution Time A Comparing Performance N = Performance A / Performance.
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
1  1998 Morgan Kaufmann Publishers How to measure, report, and summarize performance (suorituskyky, tehokkuus)? What factors determine the performance.
TEST 1 – Tuesday March 3 Lectures 1 - 8, Ch 1,2 HW Due Feb 24 –1.4.1 p.60 –1.4.4 p.60 –1.4.6 p.60 –1.5.2 p –1.5.4 p.61 –1.5.5 p.61.
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Performance – Last Lecture Bottom line performance measure is time Performance A = 1/Execution Time A Comparing Performance N = Performance A / Performance.
Chapter 4. Measure, Report, and Summarize Make intelligent choices See through the marketing hype Understanding underlying organizational aspects Why.
Lec2.1 Computer Architecture Chapter 2 The Role of Performance.
L12 – Performance 1 Comp 411 Computer Performance He said, to speed things up we need to squeeze the clock Study
EGRE 426 Computer Organization and Design Chapter 4.
Performance Computer Organization II 1 Computer Science Dept Va Tech January 2009 © McQuain & Ribbens Defining Performance Which airplane has.
COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Yaohang Li.
CSE 340 Computer Architecture Summer 2016 Understanding Performance.
COD Ch. 1 Introduction + The Role of Performance.
June 20, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 1: Performance Evaluation and Benchmarking * Jeremy R. Johnson Wed.
BITS Pilani, Pilani Campus Today’s Agenda Role of Performance.
CPEN Digital System Design Assessing and Understanding CPU Performance © Logic and Computer Design Fundamentals, 4 rd Ed., Mano Prentice Hall © Computer.
Measuring Performance II and Logic Design
Computer Architecture & Operations I
CS161 – Design and Architecture of Computer Systems
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
September 2 Performance Read 3.1 through 3.4 for Tuesday
Defining Performance Which airplane has the best performance?
Computer Architecture & Operations I
Prof. Hsien-Hsin Sean Lee
Morgan Kaufmann Publishers
CSCE 212 Chapter 4: Assessing and Understanding Performance
CS2100 Computer Organisation
Defining Performance Section /14/2018 9:52 PM.
Computer Performance He said, to speed things up we need to squeeze the clock.
CMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
January 25 Did you get mail from Chun-Fa about assignment grades?
Computer Performance Read Chapter 4
Performance.
Chapter 2: Performance CS 447 Jason Bakos Fall 2001 CS 447.
CS161 – Design and Architecture of Computer Systems
Computer Organization and Design Chapter 4
Presentation transcript:

CS2100 Computer Organisation http://www.comp.nus.edu.sg/~cs2100/ Lecture #24 Performance

Lecture #24: Performance CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Lecture #24: Performance Performance Definition Factors Affecting Performance Amdahl’s Law

1. Performance: Two Viewpoints CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Performance: Two Viewpoints “Computer X is faster than computer Y” is an ambiguous statement “Fast” can be interpreted in different ways Fast = Response Time The duration of a program execution is shorter Fast = Throughput More work can be done in the same duration We focus on the first viewpoint in this section

1. Execution Time: Comparison CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Execution Time: Comparison Performance is in “units of things per second” Bigger is better Response time is in “number of seconds” Smaller is better 𝑝𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝑥 = 1 𝑡𝑖𝑚𝑒 𝑥 Speedup n, between x and y is 𝑆𝑝𝑒𝑒𝑑𝑢𝑝= 𝑡𝑖𝑚𝑒 𝑥 𝑡𝑖𝑚𝑒 𝑦 = 𝑝𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝑦 𝑝𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝑥

1. Execution Time: Refining Definition CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Execution Time: Refining Definition There are different measures of execution time in computer performance: Elapsed time (aka wall-clock time) Counts everything (including disk and memory accesses, I/O, etc.) Not so good for comparison purposes CPU time Doesn’t include I/O or time spent running other programs Can be broken up into system time and user time Our focus: User CPU Time Time spent executing the lines of code in the program

1. Execution Time: Clock Cycles CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Execution Time: Clock Cycles Instead of reporting execution time in seconds, we often use clock cycles (basic time unit in machine). Clock Cycle time Cycle time (or cycle period or clock period) Time between two consecutive rising edges, measured in seconds. Clock rate (or clock frequency) = 1 / cycle-time = number-of-cycles/second (unit is Hz; 1 Hz = 1 cycle/second)

1. Execution Time: Version 1.0 CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Execution Time: Version 1.0 Therefore, to improve performance (everything else being equal), you can do the following:  Reduce the number of cycles for a program, or  Reduce the clock cycle time, or equivalently,  Increase the clock rate

Exercise 1: Clock Cycle & Clock Rate CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 1: Clock Cycle & Clock Rate Program P runs in 10 seconds on machine A, which has a 400 MHz clock. Suppose we are trying to build a new machine B that will run this program in 6 seconds. Unfortunately, the increase in clock rate has an averse effect on the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we target at to hit our goal? Answer: Let C be the number of clock cycles required for that program. For A: Time = 10 sec. = C  1/400MHz For B: Time = 6 sec. = (1.2  C)  1/clock_rateB Therefore, clock_rateB = 10  400  1.2/6 MHz = 800 MHz 

1. Clock Cycles: Proportional? CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Clock Cycles: Proportional? Each instruction takes n clock cycles An execution consists of I number of instructions Does it mean we need I  n cycles to finish it?  Number of cycles is proportional to number of instructions? 1st instruction 2nd instruction 3rd instruction 4th 5th 6th ... Clock

1. Clock Cycles: Inst Type Dependent CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Clock Cycles: Inst Type Dependent Different instructions take different amount of time to finish: Clock For example: Multiply instruction may take longer than an Add instruction. Floating-point operations take longer than integer operations. Accessing memory takes more time than accessing registers

1. Execution Time: Introducing CPI CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Execution Time: Introducing CPI A given program will require Some number of instructions (machine instructions)  average Cycle per Instruction (CPI) Some number of cycles  cycle time Some number of seconds We use the average CPI as different instructions take different number of cycles to finish

1. Execution Time: Version 2.0 CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 1. Execution Time: Version 2.0 Average Cycle Per Instruction (CPI) CPI = (CPU time  Clock rate) / Instruction count = Clock cycles / Instruction count CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle 𝐶𝑃𝐼= 𝑘=1 𝑛 𝐶𝑃𝐼 𝑘 × 𝐹 𝑘 where 𝐹 𝑘 = 𝐼 𝑘 Instruction count 𝐼 𝑘 =Instruction frequency

2. Performance: Influencing Factors CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 2. Performance: Influencing Factors Program compiles into Binary Binary executes on Machine Hardware Organization Cycle Time CPI Process Factors Performance aspects influenced by factors Compiler ISA #instructions Average CPI

2. Program  Binary: Factors CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 2. Program  Binary: Factors Compiler: Different compilers may generate different binary codes e.g. gnu vs intel c/c++ compiler Different optimization may generate different binary codes e.g. different optimization level in gnu c compiler Instruction Set Architecture: The same high level statement can be translated differently depending on the ISA e.g. same C program under Intel machine vs Sunfire server

Exercise 2: Impact of Compiler CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 2: Impact of Compiler Given a program P, a compiler can generate 2 different binaries on a target machine. On that machine, there are 3 classes of instructions: Class A, Class B, and Class C, and they require 1, 2, and 3 cycles respectively. First binary has 5 instructions: 2 of A, 1 of B, and 2 of C. Second binary has 6 instructions: 4 of A, 1 of B, and 1 of C. 1. Which code is faster? By how much? 2. What is the (average) CPI for each code? Answer: Let T be the cycle time. Time(code1) = (21 + 12 + 23)  T = 10T Time(code2) = (41 + 12 + 13)  T = 9T Time(code1)/Time(code2) = CPI(code1) = CPI(code2) = 10/9 = 1.1  Code2 is 1.1 times faster (21 + 12 + 23) / (2 + 1 + 2) = 2 (41 + 12 + 13) / (4+ 1 + 1) = 1.5 

2. Execution of Binary Code: Factors CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 2. Execution of Binary Code: Factors Machine More accurately the hardware implementation Determine cycle time and cycle per instruction Cycle Time: Different clock frequencies (e.g. 2GHz vs 3.6GHz) Cycles Per Instruction: Design of internal mechanism (e.g. specific accelerator to improve floating point performance)

Exercise 3: Impact of Machine CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 3: Impact of Machine Suppose we have 2 implementations of the same ISA, and a program is run on these 2 machines. Machine A has a clock cycle time of 10 ns and a CPI of 2.0. Machine B has a clock cycle time of 20 ns and a CPI of 1.2. 1. Which machine is faster for this program? By how much? Answer: Let N be the number of instructions. Machine A: Time = N  2.0  10 ns Machine B: Time = Performance(A)/Performance(B) = Time(B)/Time(A) = N  1.2  20 ns (1.2  20) / (2.0  10) = 1.2 

Exercise 4: All Factors (1/4) CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 4: All Factors (1/4) You are given 2 machine designs M1 and M2 for performance benchmarking. Both M1 and M2 have the same ISA, but different hardware implementations and compilers. Assuming that the clock cycle times for M1 and M2 are the same, performance study gives the following measurements for the 2 designs. Instruction class For M1 For M2 CPI No. of instructions executed A 1 3,000,000,000,000 2 2,700,000,000,000 B 2,000,000,000,000 3 1,800,000,000,000 C D 4 1,000,000,000,000 900,000,000,000

Exercise 4: All Factors (2/4) CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 4: All Factors (2/4) What is the CPI for each machine? Let Y = 1,000,000,000,000 CPI(M1) = (3Y1 + 2Y2 + 2Y3 + Y4) / (3Y + 2Y + 2Y + Y) = 17Y / 8Y = 2.125 CPI(M2) = = [(3Y2 + 2Y3 + 2Y3 + Y2)  0.9] / [(3Y + 2Y + 2Y + Y)  0.9] 20Y / 8Y = 2.5 Which machine is faster? By how much? Let C be cycle time. Time(M1) = 2.125  (8Y  C) Time(M2) = M1 is faster than M2 by 2.5  (8Y  0.9  C) = 2.25  (8Y  C) 2.25 / 2.125 = 1.0588 

Exercise 4: All Factors (3/4) CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 4: All Factors (3/4) To further improve the performance of the machines, a new compiler technique is introduced. The compiler can simply eliminate all class D instructions from the benchmark program without any side effects. (That is, there is no change to the number of class A, B and C instructions executed in the 2 machines.) With this new technique, which machine is faster? By how much? Let Y = 1,000,000,000,000; Let C be cycle time. CPI(M1) = (3Y1 + 2Y2 + 2Y3) / (3Y + 2Y + 2Y) = 13Y / 7Y = 1.857 CPI(M2) = = Time(M1) = 1.857  (7Y  C) Time(M2) = M1 is faster than M2 by [(3Y2 + 2Y3 + 2Y3)  0.9] / [(3Y + 2Y + 2Y)  0.9] 18Y / 7Y = 2.571 2.571  (0.9  7Y  C) = 2.314  (7Y  C) 2.314 / 1.857 = 1.246 

Exercise 4: All Factors (4/4) CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 4: All Factors (4/4) Alternatively, to further improve the performance of the machines, a new hardware technique is introduced. The hardware can simply execute all class D instructions in zero times without any side effects. (There is still execution for class D instructions.) With this new technique, which machine is faster? By how much? Let Y = 1,000,000,000,000; Let C be cycle time. CPI(M1) = (3Y1 + 2Y2 + 2Y3 + Y0) / (3Y + 2Y + 2Y + Y) = 13Y / 8Y = 1.625 CPI(M2) = = Time(M1) = 1.625  (8Y  C) Time(M2) = M1 is faster than M2 by [(3Y2 + 2Y3 + 2Y3 + Y0)  0.9] / [(3Y + 2Y + 2Y + Y)  0.9] 18Y / 8Y = 2.25 2.25  (0.9  8Y  C) = 2.025  (8Y  C) 2.025 / 1.625 = 1.246 

CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Summary: Key Concepts Performance is specific to a particular program on a specific machine Total execution time is a consistent summary of performance For a given architecture, performance increase comes from: Increase in clock rate (without adverse CPI effects) Improvement in processor organization that lowers CPI Compiler enhancement that lowers CPI and/or instruction count Pitfall: Expecting improvement in one aspect of a machine’s performance to affect the total performance.

CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 3. Amdahl’s Law (1/3) Pitfall: Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement. Example: Suppose a program runs in 100 seconds on a machine, with multiply operations responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? 100 (total time) = 80 (for multiply) + UA (unaffected)  UA = 20 100/4 (new total time) = Speedup = 80/Speedup (for multiply) + UA 80/5 = 16 

CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 3. Amdahl’s Law (2/3) Example (continued): How about making it 5 times faster? 100 (total time) = 80 (for multiply) + UA (unaffected) 100/5 (new total time) = Speedup = 80/Speedup (for multiply) + UA 80/0 = ??? (impossible!) There is no way we can enhance multiply to achieve a fivefold increase in performance, if multiply accounts for only 80% of the workload. 

CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance 3. Amdahl’s Law (3/3) Amdahl’s law: Performance is limited to the non-speedup portion of the program Execution time after improvement = Execution time of unaffected part + (Execution time of affected part / Speedup) Corollary of Amdahl’s law: Optimize the common case first!

Exercise 5: Amdahl’s Law CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 5: Amdahl’s Law Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 12 seconds, what will the speedup be if half of the 12 seconds is spent executing floating-point instructions? Time = Speedup = 6 (UA) + 6 (fl-pt) / 5 = 7.2 sec. 12 / 7.2 = 1.67 

Exercise 6: Amdahl’s Law CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance Exercise 6: Amdahl’s Law We are looking for a benchmark to show off the new floating-point unit described in the previous example, and we want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark? Speedup = Time_FI = 3 = 100 / (Time_fl / 5 + 100 – Time_fl) 83.33 sec. 

CS2100 Computer Organisation Aaron Tan, NUS Lecture #24: Performance End of File