ECE 4100/6100 Advanced Computer Architecture Lecture 3 Performance Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

2 Performance
Execution/Response time (Latency)
–Elapsed time between the start and completion of an event
–How long does my job take?
Throughput (Bandwidth)
–Total amount of work done within a given period of time
–How many jobs are done per unit time on a system?
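
A minimal Python sketch of the two definitions, using hypothetical job timings: three 4-second jobs that overlap, each starting 1 second after the previous one. Overlapping does not shorten any job's latency, but it raises throughput to 3 jobs in 6 seconds (running them back to back would give 3 jobs in 12 seconds). The helper name `latency_and_throughput` is illustrative only.

```python
# Minimal sketch: latency vs. throughput from (start, end) timestamps.
# The job timings below are hypothetical.

def latency_and_throughput(jobs):
    """jobs: list of (start_s, end_s) tuples for completed jobs."""
    # Latency: elapsed time from start to completion of each job.
    latencies = [end - start for start, end in jobs]
    # Throughput: jobs completed per second over the whole observation window.
    window = max(end for _, end in jobs) - min(start for start, _ in jobs)
    return latencies, len(jobs) / window


# Three overlapping 4-second jobs, started 1 s apart.
jobs = [(0.0, 4.0), (1.0, 5.0), (2.0, 6.0)]
latencies, throughput = latency_and_throughput(jobs)
print(latencies, throughput)  # [4.0, 4.0, 4.0] 0.5
```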

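3 CPU Performance
Execution Time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
–Instructions / Program: programmer, algorithms, ISA, compilers
–Cycles / Instruction: microarchitecture, system architecture
–Seconds / Cycle: microarchitecture (pipeline depth), circuit design, technology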
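
The three factors can be checked numerically. This is a sketch with made-up inputs (1 billion instructions, CPI 1.5, 2 GHz clock); `execution_time` is a hypothetical helper, not something from the lecture. The second call shows that a 20% lower CPI bought with a 20% slower clock leaves execution time unchanged.

```python
# Sketch of the CPU performance equation:
#   seconds/program = (instructions/program) * (cycles/instruction) * (seconds/cycle)
# All input values below are hypothetical.

def execution_time(instruction_count, cpi, clock_hz):
    """Execution time in seconds; dividing by clock_hz is multiplying by seconds/cycle."""
    return instruction_count * cpi / clock_hz


base = execution_time(1e9, 1.5, 2e9)
# A design with a lower CPI but a slower clock can end up no faster:
alt = execution_time(1e9, 1.2, 1.6e9)
print(base, alt)
```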

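4 Pipelining (slide from Lecture 1)
[Figure: a pipeline stage of combinational logic bounded by flip-flops (F/F); stage delay is measured in FO4 units, where 1 FO4 is the delay of an inverter driving four identical inverters]
Optimal FO4 per pipe stage
–6 to 8 [UT/Compaq, ISCA-29]
–18 (15 + 3 latch) [IBM, MICRO-35]
P4 pipe stage ≈ 16 FO4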

5 Architecture Comparison
Much architecture research simply makes the following assumptions:
Instructions / program is fixed
–Same binary
–Same compiler
–Same benchmark
Seconds per cycle is constant
–Same frequency
–Same pipeline depth
–Typically a bad assumption today
Focus on IPC or CPI
It is more complicated for today's architects!

6 Example: Calculating CPI
Typical mix of instruction types in a program, base machine (Reg / Reg):
Op | Freq | Cycles | CPI(i) | (% Time)
ALU | 50% | 1 | 0.5 | (33%)
Load | 20% | 2 | 0.4 | (27%)
Store | 10% | 2 | 0.2 | (13%)
Branch | 20% | 2 | 0.4 | (27%)
Total CPI = 1.5
Design guideline: make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add a 1% performance improvement on reasonable benchmarks.
Run benchmarks and collect a workload characterization (simulation, machine counters, or sampling)
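
The table arithmetic can be reproduced in a few lines of Python. This sketch only restates the slide's mix; the dictionary layout and the `cpi_breakdown` name are illustrative choices.

```python
# Base machine (reg/reg) instruction mix from the slide: op -> (frequency, cycles)
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

def cpi_breakdown(mix):
    # Each class contributes frequency * cycles to the overall CPI.
    contrib = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    cpi = sum(contrib.values())
    # Fraction of execution time spent in each class.
    share = {op: c / cpi for op, c in contrib.items()}
    return cpi, contrib, share

cpi, contrib, share = cpi_breakdown(mix)
print(f"CPI = {cpi:.2f}")
for op in mix:
    print(f"{op:6s} CPI(i) = {contrib[op]:.1f}  time = {share[op]:.0%}")
# Matches the slide: 1.5 overall; 33% / 27% / 13% / 27% of the time
```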

7 Performance Comparison
For some program running on machine X: $\text{Performance}_X = 1 / \text{Execution time}_X$
"X is n times faster than Y": $\text{Performance}_X / \text{Performance}_Y = n$ = speedup of X over Y
Problem:
–machine A runs a program in 20 seconds
–machine B runs the same program in 25 seconds
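
A sketch that applies the definition to the problem above. Because performance is the reciprocal of execution time, the ratio of performances equals the inverse ratio of times; `speedup` is a hypothetical helper name.

```python
def speedup(time_x, time_y):
    """n such that X is n times faster than Y.

    Perf_X / Perf_Y = (1 / Time_X) / (1 / Time_Y) = Time_Y / Time_X
    """
    return time_y / time_x


n = speedup(20.0, 25.0)  # X = machine A (20 s), Y = machine B (25 s)
print(f"A is {n:.2f} times faster than B")  # A is 1.25 times faster than B
```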

8 Performance Evaluation: Benchmarks
(Real) Programs
–In the form of a collection of programs
–E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
Kernels
–Small key pieces of real programs
–E.g., Livermore Fortran Kernels (LFK), Linpack
Modified (or scripted)
–To focus on some particular aspects (e.g., remove I/O, focus on CPU)
(Toy) Benchmarks
–Produce expected results
Synthetic Benchmarks
–Representative instruction mix
–E.g., Dhrystone, Whetstone
Important for
–Architectural and microarchitectural design trade-offs
–Competitive analysis of real products

9 Performance Summary Measurement
Average of total execution time
–Arithmetic Mean: $\text{AM} = \frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$
–Weighted Arithmetic Mean: $\text{WAM} = \sum_{i=1}^{n} w_i \times \text{Time}_i$, where $\sum_{i=1}^{n} w_i = 1$
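
A small Python sketch of both means; the execution times and weights are hypothetical. The weights stand for each program's share of the workload and are assumed to sum to 1.

```python
def arithmetic_mean(times):
    """Plain average of execution times."""
    return sum(times) / len(times)

def weighted_arithmetic_mean(times, weights):
    """Sum of w_i * Time_i; weights are each program's share of the workload."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))


times = [2.0, 10.0]  # hypothetical execution times in seconds
print(arithmetic_mean(times))                         # 6.0
print(weighted_arithmetic_mean(times, [0.75, 0.25]))  # 4.0
```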

10 Performance Summary Measurement
Harmonic Mean: $\text{HM} = \frac{n}{\sum_{i=1}^{n} 1/\text{Rate}_i}$
–$\text{Rate}_i$ is a function of $1/\text{Time}_i$
–Used to represent the average "rate", such as instructions per cycle (IPC)

11 Why Harmonic Mean?
30 mph for the first 10 miles, 90 mph for the next 10 miles
Average speed? (30 + 90)/2 = 60 mph?? Wrong!
Average speed = total distance / total time = (10 + 10)/(10/30 + 10/90) = 45 mph
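
The harmonic mean gives the same 45 mph as total distance over total time because each speed applies to the same distance (10 miles). The same reasoning holds for IPC: when every benchmark executes the same number of instructions, total instructions divided by total cycles equals the harmonic mean of the per-benchmark IPCs. A minimal sketch:

```python
def harmonic_mean(rates):
    """n / sum(1/rate_i): the right average when each rate covers the same amount of work."""
    return len(rates) / sum(1.0 / r for r in rates)


speeds = [30.0, 90.0]                     # mph, each for 10 miles
naive = sum(speeds) / len(speeds)         # arithmetic mean: 60 mph (wrong here)
total = (10 + 10) / (10 / 30 + 10 / 90)   # total distance / total time
print(naive, harmonic_mean(speeds), total)
```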

12 New Breed of Metrics
Performance / Watt
–Performance achievable at the same cooling capacity
Performance / Joule (Energy)
–Achievable performance over the lifetime of the same energy source (i.e., battery = energy)
–Equivalent to the reciprocal of the energy-delay (ED) product
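
A sketch with two hypothetical designs running the same task shows that the two metrics can rank designs differently. For a fixed task, performance/watt reduces to 1/energy, while performance/joule reduces to 1/(energy × delay). The delays and power figures are invented for illustration, and `metrics` is an illustrative name.

```python
def metrics(delay_s, power_w):
    """Return (performance/watt, performance/joule) for one run of a fixed task."""
    energy_j = power_w * delay_s        # energy consumed by one run
    perf = 1.0 / delay_s                # performance = 1 / execution time
    perf_per_watt = perf / power_w      # = 1 / energy
    perf_per_joule = perf / energy_j    # = 1 / (energy * delay) = 1 / ED
    return perf_per_watt, perf_per_joule


# Two hypothetical designs running the same task
a = metrics(delay_s=2.0, power_w=10.0)  # slow, low power:  E = 20 J, ED = 40 J*s
b = metrics(delay_s=1.0, power_w=30.0)  # fast, high power: E = 30 J, ED = 30 J*s
print(a, b)  # A wins on performance/watt, B wins on performance/joule
```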

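13 Amdahl's Law (Law of Diminishing Returns)
Make the common case faster
$\text{Speedup} = \frac{\text{Perf}_{new}}{\text{Perf}_{old}} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + f/P}$
Performance improvement from using the faster mode is limited by the fraction f of the time during which the faster mode can be applied (P = speedup of the faster mode).
[Figure: T_old split into (1 − f) and f; T_new split into (1 − f) and f / P]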
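
A direct transcription of the formula, evaluated at f = 0.5 for P from 1 to 64. The speedup saturates toward 1 / (1 − f) = 2 no matter how large P gets. `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(f, p):
    """f: fraction of T_old that can use the faster mode; p: speedup of that mode."""
    return 1.0 / ((1.0 - f) + f / p)


for p in (1, 2, 4, 8, 16, 32, 64):
    print(f"P={p:2d}  speedup={amdahl_speedup(0.5, p):.2f}")
# The limit as p grows without bound is 1 / (1 - f) = 2.0 for f = 0.5
```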

14 Amdahl's Law Analogy
Driving from Orlando to Atlanta
–60 miles/hr from Orlando to Macon
–120 miles/hr from Macon to Atlanta
–How much time can you save compared with driving all the way from Orlando to Atlanta at 60 miles/hr?
6 hr 45 min vs. 7 hr 30 min = ~11% speedup
The key is to speed up the biggest portion, i.e., speed up frequently executed blocks
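
The slide gives only the travel times, so this sketch infers the distances from them: 7 hr 30 min at 60 miles/hr implies 450 miles, and 6 hr 45 min then implies 360 miles to Macon and 90 miles from Macon to Atlanta. These inferred distances are assumptions, not real road mileage. The sketch computes the speedup directly and again through Amdahl's Law.

```python
# Distances inferred from the slide's times (assumptions, not map data):
# 7.5 h at 60 mph -> 450 miles; 360/60 + 90/120 = 6.75 h.
TOTAL_MILES = 450.0
ORLANDO_TO_MACON = 360.0
MACON_TO_ATLANTA = TOTAL_MILES - ORLANDO_TO_MACON

t_old = TOTAL_MILES / 60.0                                   # 7.5 h
t_new = ORLANDO_TO_MACON / 60.0 + MACON_TO_ATLANTA / 120.0   # 6.75 h
print(f"speedup = {t_old / t_new:.3f}")                      # speedup = 1.111

# Same answer via Amdahl's Law: the sped-up leg is a fraction f of the original time.
f = (MACON_TO_ATLANTA / 60.0) / t_old   # 1.5 h / 7.5 h = 0.2
p = 120.0 / 60.0                        # that leg runs 2x faster
print(f"amdahl  = {1.0 / ((1.0 - f) + f / p):.3f}")          # amdahl  = 1.111
```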

15 Parallelism vs. Speedup
[Figure: Amdahl's Law speedup as a function of parallelism; x-axis: code portion in faster mode (f), y-axis: speedup; curves for P = 1, 2, 4, 8, 16, 32, 64; annotated speedups 1.11x, 1.33x, 1.97x]

16 Gustafson's Law
Amdahl's Law killed massively parallel processing (MPP); Gustafson came to the rescue
[Figure: on the parallel machine, T_new = Seq + Parallel; the same work on one processor takes T_old = Seq + P × Parallel]
Assume: Seq + Parallel = 1 (= T_new)
$\text{Speedup} = \frac{T_{old}}{T_{new}} = \text{Seq} + p \times (1 - \text{Seq})$, where p = parallel factor
If Seq diminishes with increased problem size, Speedup → p
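
A sketch of the scaled-speedup formula on a hypothetical 64-processor machine (p = 64), showing the speedup approaching p as the serial fraction Seq shrinks. Note that Seq here is measured on the parallel run (T_new = 1), unlike f in Amdahl's Law, which is measured on the original run. `gustafson_speedup` is an illustrative name.

```python
def gustafson_speedup(seq, p):
    """seq: serial fraction of the run time on the parallel machine (T_new = 1);
    p: parallel factor (e.g., number of processors)."""
    return seq + p * (1.0 - seq)


for seq in (0.5, 0.1, 0.01, 0.0):
    print(f"Seq={seq:<4}  speedup on 64 processors = {gustafson_speedup(seq, 64):.2f}")
```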

17 Amdahl versus Gustafson: Who is right?

18 The Principle of Locality
Knuth made the original observation about program locality:
–"… less than 4 percent of a program generally accounts for more than half of its running time."
90/10 rule: a program spends 90% of its execution time in only 10% of the code
Two types of locality
–Temporal locality (locality in time)
–Spatial locality (locality in space)
Memory subsystem design heavily leverages the locality concept for better performance
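
To make the two kinds of locality concrete, here is a toy direct-mapped cache simulator in Python. The 64-byte line and 64-line capacity are arbitrary assumptions, not parameters from the lecture, and real caches add associativity, replacement policies, and prefetching that this sketch ignores.

```python
def hit_rate(addresses, line_size=64, num_lines=64):
    """Toy direct-mapped cache: return the fraction of accesses that hit."""
    tags = [None] * num_lines          # which memory block each cache line holds
    hits = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing this byte address
        index = block % num_lines      # the only cache line this block may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the block, evicting the old one
    return hits / len(addresses)


# Spatial locality: walk 4-byte words sequentially (16 words share a 64-byte line).
sequential = [4 * i for i in range(4096)]
# No locality: touch a new 64-byte line on every access.
strided = [64 * i for i in range(4096)]
# Temporal locality: re-use a 1 KB array ten times.
reused = [4 * i for i in range(256)] * 10

print(hit_rate(sequential))  # 0.9375
print(hit_rate(strided))     # 0.0
print(hit_rate(reused))      # 0.99375
```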

19 Example of Performance Evaluation (I)
Operation | Frequency | Clock cycle count
ALU Ops (reg-reg) | 43% | 1
Loads | 21% | 2
Stores | 12% | 2
Branches | 24% | 2
Assume 25% of the ALU ops directly use a loaded operand that is not used again. We propose adding ALU instructions that have one source operand in memory. These new reg-mem instructions take 2 clock cycles. Also assume that the extended instruction set increases the branch clock cycle count by 1, but has no impact on cycle time. Would this change improve performance?
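
One way to work the question, assuming (as the wording suggests) that each of the 25% of ALU ops that consumes a dead loaded value becomes a single reg-mem instruction and its load disappears. Since cycle time is unchanged, comparing total cycles per original instruction compares execution times. Under this reading the new design needs about 1.7025 cycles versus 1.57, so the change would not improve performance.

```python
# Original machine: instruction frequencies (per original instruction)
alu, load, store, branch = 0.43, 0.21, 0.12, 0.24
old_cycles = alu * 1 + load * 2 + store * 2 + branch * 2   # 1.57, also the old CPI

# Assumption: 25% of ALU ops absorb their load into one 2-cycle reg-mem op,
# and the load it replaced disappears from the instruction stream.
fused = 0.25 * alu                                         # 0.1075
new_cycles = ((alu - fused) * 1      # remaining reg-reg ALU ops
              + fused * 2            # new reg-mem ALU ops
              + (load - fused) * 2   # loads that are still needed
              + store * 2
              + branch * 3)          # branches now take one extra cycle
new_count = 1.0 - fused              # fewer instructions than before

print(f"old CPI = {old_cycles:.4f}, new CPI = {new_cycles / new_count:.4f}")
print(f"time new/old = {new_cycles / old_cycles:.3f}")  # > 1 means slower
```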

22 Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2
Design Option 2: decrease the average CPI of all FP instructions to 2.5
Original CPI = 0.25 × 4.0 + 1.33 × (1 − 0.25) ≈ 2.0
Option 1 CPI = 2.0 − 2% × (20 − 2) = 1.64
Option 2 CPI = 0.25 × 2.5 + 1.33 × (1 − 0.25) ≈ 1.625
Speedup of Option 1 = 2/1.64 ≈ 1.22
Speedup of Option 2 = 2/1.625 ≈ 1.23
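
The same numbers in Python, as a check on the arithmetic. With the CPI of other instructions taken literally as 1.33, the original CPI is 1.9975 rather than exactly 2.0, but the speedups still agree with the rounded values above to two decimals, and Option 2 still comes out marginally ahead.

```python
fp_frac, fp_cpi, other_cpi = 0.25, 4.0, 1.33
sqrt_frac, sqrt_cpi = 0.02, 20.0

cpi_orig = fp_frac * fp_cpi + (1 - fp_frac) * other_cpi   # 1.9975, about 2.0
cpi_opt1 = cpi_orig - sqrt_frac * (sqrt_cpi - 2.0)        # FPSQRT CPI: 20 -> 2
cpi_opt2 = fp_frac * 2.5 + (1 - fp_frac) * other_cpi      # all FP CPI: 4.0 -> 2.5

for name, cpi in (("Option 1", cpi_opt1), ("Option 2", cpi_opt2)):
    print(f"{name}: CPI = {cpi:.3f}, speedup = {cpi_orig / cpi:.2f}")
```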

24 Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz
Original: CPI = 2.0, IPC = 1/2, Inst/Sec = 1/2 × 1.4G = 0.7G inst/s
Option 1: CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64 × 1.2G = 0.73G inst/s
Option 2: CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625 × 1.1G = 0.68G inst/s
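
A sketch of the instruction-rate comparison, reusing the rounded CPIs from the slide. Because all three designs run the same program, the highest instructions per second means the shortest execution time: Option 1 wins even though Option 2 has the lower CPI, which is why assuming a constant cycle time is risky.

```python
# CPIs are the rounded values from the slide; frequencies are in Hz.
designs = {
    "Original": (2.0, 1.4e9),
    "Option 1": (1.64, 1.2e9),
    "Option 2": (1.625, 1.1e9),
}

# Instructions per second = IPC * frequency = frequency / CPI.
rates = {name: freq / cpi for name, (cpi, freq) in designs.items()}

for name, ips in rates.items():
    print(f"{name}: {ips / 1e9:.2f} G inst/s")   # 0.70, 0.73, 0.68
```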