Computer Architecture CSE 3322

Presentation transcript:

Web Site: crystal.uta.edu/~jpatters/cse3322

Send e-mail to Pramod Kumar with the names and e-mail addresses of your four project team members by Mon Sept 15. If not on a team, send your e-mail address to Pramod.

CPI

CPI = Clock Cycles / Instruction Count = (CPU Time × Clock Rate) / Instruction Count
("Average clock cycles per instruction")

CPU Time = Instruction Count × CPI / Clock Rate = Instruction Count × CPI × Clock Cycle Time

$$\text{Average CPI} = \frac{\sum_{i=1}^{n} \text{CPI}(i) \times I(i)}{\text{Instruction Count}} = \sum_{i=1}^{n} \text{CPI}(i) \times F(i)$$

where I(i) is the number of instructions of class i and F(i) = I(i) / Instruction Count is the instruction frequency.

Invest resources where time is spent!
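The CPI relations above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `average_cpi` and `cpu_time` and the sample workload (70% one-cycle, 30% three-cycle instructions on a 100 MHz clock) are assumptions chosen only for illustration.

```python
def average_cpi(mix):
    """Weighted-average CPI from (frequency, cycles) pairs.

    Frequencies F(i) are fractions of the instruction count and should sum to 1.
    """
    return sum(freq * cycles for freq, cycles in mix)


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload (not from the slides): 70% one-cycle and
# 30% three-cycle instructions, one million instructions, 100 MHz clock.
example_mix = [(0.7, 1), (0.3, 3)]
example_cpi = average_cpi(example_mix)
example_seconds = cpu_time(1_000_000, example_cpi, 100e6)

print(f"Average CPI = {example_cpi:.2f}")
print(f"CPU time    = {example_seconds * 1e3:.3f} ms")
```

Passing frequencies rather than raw counts uses the F(i) form of the average; passing counts and dividing by the instruction count would give the same value.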

CPI Example

Suppose we have two implementations of the same instruction set. For some program with instruction count I:

Machine A has a clock cycle time of 10 ns and a CPI of 2.0: CPU Time = I × 2.0 × 10 ns = I × 20 ns
Machine B has a clock cycle time of 20 ns and a CPI of 1.2: CPU Time = I × 1.2 × 20 ns = I × 24 ns

Which machine is faster for this program, and by how much?

A is 24/20 = 1.2 times faster than B.

Note: CPI is smaller for B, yet B is slower, because its clock cycle is twice as long.
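The comparison of Machines A and B can be reproduced as a short calculation. This is only a sketch: the helper name `time_per_instruction_ns` is an assumption, and the instruction count I is left out because it is the same on both machines and cancels in the ratio.

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction in nanoseconds = CPI x clock cycle time."""
    return cpi * cycle_time_ns


# Values from the example: A has CPI 2.0 at 10 ns, B has CPI 1.2 at 20 ns.
time_a = time_per_instruction_ns(2.0, 10)
time_b = time_per_instruction_ns(1.2, 20)

# Both machines run the same instruction count I, so I cancels in the ratio.
speedup_a_over_b = time_b / time_a

print(f"A: {time_a:.1f} ns per instruction")
print(f"B: {time_b:.1f} ns per instruction")
print(f"A is {speedup_a_over_b:.2f} times faster than B")
```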

Example (RISC processor)

Typical mix, base machine (Reg / Reg):

Op       Freq F(i)   Cycles CPI(i)   F(i) × CPI(i)   % Time
ALU      50%         1               0.5             23%
Load     20%         5               1.0             45%
Store    10%         3               0.3             14%
Branch   20%         2               0.4             18%
                                     2.2 = CPI ave

CPU Time(i) / CPU Time = (Instr Cnt(i) × CPI(i) × Clk Cycle Time) / (Inst Cnt × CPI ave × Clk Cycle Time), so % Time = F(i) × CPI(i) / CPI ave.

CPU Time = Inst Cnt × CPI ave × Clk Cycle Time. With instruction count and clock cycle time unchanged, the speedup is the ratio of the average CPIs.

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
Load contributes 0.4 instead of 1.0, so CPI ave = 1.6 and the machine is 2.2/1.6 = 1.375 times faster.

How does this compare with using branch prediction to shave a cycle off the branch time?
Branch takes 1 cycle and contributes 0.2 instead of 0.4, so CPI ave = 2.0 (2.2/2.0 = 1.1 times faster).

What if two ALU instructions could be executed at once?
ALU takes 0.5 cycles on average and contributes 0.25 instead of 0.5, so CPI ave = 1.95 (2.2/1.95 ≈ 1.13 times faster).
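The three what-if questions for the RISC mix can be recomputed with a small script. This is a sketch under the assumption that instruction count and clock cycle time stay fixed, so speedup is the ratio of average CPIs; the names `BASE_MIX`, `average_cpi`, `with_cycles`, and `time_share` are illustrative and not from the slides.

```python
# Instruction mix from the example: op -> (frequency, cycles).
BASE_MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def average_cpi(mix):
    """Average CPI = sum of F(i) * CPI(i) over all instruction classes."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Return a copy of the mix with one class given a new cycle count."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


def time_share(mix, op):
    """Fraction of CPU time spent in one class: F(i) * CPI(i) / average CPI."""
    freq, cycles = mix[op]
    return freq * cycles / average_cpi(mix)


base_cpi = average_cpi(BASE_MIX)
scenarios = [
    ("better data cache (load = 2 cycles)", "Load", 2),
    ("branch prediction (branch = 1 cycle)", "Branch", 1),
    ("two ALU ops at once (ALU = 0.5 cycles)", "ALU", 0.5),
]

print(f"base CPI = {base_cpi:.2f}, load time share = {time_share(BASE_MIX, 'Load'):.0%}")
for label, op, cycles in scenarios:
    new_cpi = average_cpi(with_cycles(BASE_MIX, op, cycles))
    print(f"{label}: CPI = {new_cpi:.2f}, speedup = {base_cpi / new_cpi:.3f}")
```

The load improvement gives the largest gain because loads account for the largest share of execution time, which is the point of "invest resources where time is spent".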

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:

Class A has 1 cycle
Class B has 2 cycles
Class C has 3 cycles

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
Cycles: 2×1 + 1×2 + 2×3 = 10

The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Cycles: 4×1 + 1×2 + 1×3 = 9

Which sequence will be faster? How much?
The second sequence, by 10/9 = 1.11 times.

What is the CPI for each sequence?
First: 10/5 = 2. Second: 9/6 = 1.5.
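The cycle counts and CPIs for the two code sequences can be verified with a short sketch. The dictionary layout and the function names `total_cycles` and `sequence_cpi` are assumptions made for this illustration.

```python
# Cycles per instruction class, as given in the example.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two candidate sequences.
SEQUENCE_1 = {"A": 2, "B": 1, "C": 2}
SEQUENCE_2 = {"A": 4, "B": 1, "C": 1}


def total_cycles(counts):
    """Total clock cycles = sum over classes of count * cycles for that class."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def sequence_cpi(counts):
    """CPI = total clock cycles / total instruction count."""
    return total_cycles(counts) / sum(counts.values())


cycles_1 = total_cycles(SEQUENCE_1)
cycles_2 = total_cycles(SEQUENCE_2)

print(f"sequence 1: {cycles_1} cycles, CPI = {sequence_cpi(SEQUENCE_1):.2f}")
print(f"sequence 2: {cycles_2} cycles, CPI = {sequence_cpi(SEQUENCE_2):.2f}")
print(f"sequence 2 is {cycles_1 / cycles_2:.2f} times faster")
```

The sequence with more instructions wins here because it uses cheaper instructions, so instruction count alone does not predict execution time.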

MIPS

A popular performance metric is MIPS, the number of millions of instructions per second. For a given program,

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}$$

1. Cannot compare if the instruction set is different
2. Highly dependent on the program
3. Can be inversely proportional to performance

MIPS example

Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A has 1 cycle, Class B has 2 cycles, Class C has 3 cycles.

Instruction counts (billions)

Code from     A    B    C    Total
Compiler 1    5    1    1    7
Compiler 2                   12

Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?

Compiler 1: CPU cycles = 5×1 + 1×2 + 1×3 = 10 billion; Exec Time = 10^10 × 10^-8 sec = 100 sec; MIPS = 7 × 10^3 / 100 = 70
Compiler 2: CPU cycles = 15 billion; Exec Time = 150 sec; MIPS = 12 × 10^3 / 150 = 80

Compiler 2 has the higher MIPS rating (80 vs. 70), but the code from Compiler 1 finishes sooner (100 sec vs. 150 sec).
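The MIPS example can be recomputed from the totals stated on the slides. This is a sketch only: it uses the total instruction counts (7 and 12 billion) and total cycle counts (10 and 15 billion) rather than per-class counts, and the names `exec_time_s`, `mips`, and `RUNS` are assumptions for illustration.

```python
CLOCK_RATE_HZ = 100e6  # 100 MHz machine from the example


def exec_time_s(cpu_cycles):
    """Execution time in seconds = CPU clock cycles / clock rate."""
    return cpu_cycles / CLOCK_RATE_HZ


def mips(instruction_count, seconds):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (seconds * 1e6)


# Totals taken from the slides: (instruction count, CPU cycles).
RUNS = {
    "Compiler 1": (7e9, 10e9),
    "Compiler 2": (12e9, 15e9),
}

results = {}
for name, (instructions, cycles) in RUNS.items():
    seconds = exec_time_s(cycles)
    results[name] = (seconds, mips(instructions, seconds))
    print(f"{name}: {seconds:.0f} sec, {results[name][1]:.0f} MIPS")

fastest = min(results, key=lambda name: results[name][0])
highest_mips = max(results, key=lambda name: results[name][1])
print(f"fastest by execution time: {fastest}")
print(f"highest MIPS rating:       {highest_mips}")
```

The two rankings disagree, which illustrates how MIPS can vary inversely with performance.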

Benchmarks

Performance is best determined by running a real application
–Use programs typical of the expected workload
–Or, typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.

Small benchmarks
–nice for architects and designers
–easy to standardize
–can be abused

Benchmarks

SPEC (System Performance Evaluation Cooperative)
–companies have agreed on a set of real programs and inputs
–can still be abused
–valuable indicator of performance (and compiler technology)

SPEC95

Eighteen application benchmarks (with inputs) reflecting a technical computing workload

Eight integer applications
–go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

Ten floating-point intensive applications
–tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5

Must run with standard compiler flags
–eliminates special undocumented incantations that may not even generate working code for real programs

SPEC fp

Amdahl's Law

Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

Improved Time = (100 - 80) + 80/n = 100/4
20 + 80/n = 25
80/n = 5
80 = 5n; n = 16

How about making it 5 times faster?

Improved Time = (100 - 80) + 80/n = 100/5
20 + 80/n = 20
80/n = 0
Impossible!
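The two Amdahl's Law questions can be solved by rearranging the formula for the improvement factor n. This is a minimal sketch: the function names `improved_time` and `required_improvement`, and the choice to return `None` when the target is unreachable, are assumptions for illustration.

```python
def improved_time(total, affected, factor):
    """Amdahl's Law: unaffected time plus affected time divided by the improvement."""
    return (total - affected) + affected / factor


def required_improvement(total, affected, target_speedup):
    """Improvement factor n needed on the affected part to hit a target speedup.

    Returns None when the unaffected time alone already meets or exceeds
    the target time, so no finite improvement is enough.
    """
    target_time = total / target_speedup
    remaining = target_time - (total - affected)
    if remaining <= 0:
        return None
    return affected / remaining


# Values from the example: 100 s total, 80 s spent in multiply.
n_for_4x = required_improvement(100, 80, 4)
n_for_5x = required_improvement(100, 80, 5)

print(f"4 times faster: n = {n_for_4x}")
print(f"5 times faster: n = {n_for_5x} (None means impossible)")
print(f"check: improved time with n = 16 is {improved_time(100, 80, 16)} s")
```

The 20 seconds that multiplication does not touch cap the overall speedup at 100/20 = 5, and that limit is approached only as n grows without bound.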

Remember

Performance is specific to a particular program or set of programs
–Total execution time is a consistent summary of performance

For a given architecture, performance increases come from:
–increases in clock rate (without adverse CPI effects)
–improvements in processor organization that lower CPI
–compiler enhancements that lower CPI and/or instruction count