Lecture 2 Quantifying Performance

CSCE 513 Computer Architecture, Lecture 2: Quantifying Performance
Topics: Speedup, Amdahl's Law, Execution time
Readings: Chapter 1
August 30, 2017

Overview
Last time:
Overview: speed-up; the power wall and ILP wall, and the shift to multicore
Definition of computer architecture (Lecture 1, slides 1-29)
New:
Syllabus and other course pragmatics: website, dates
Figure 1.9 trends: CPUs, memory, network, disk
Why the geometric mean?
Speed-up again; Amdahl's Law

Instruction Set Architecture (ISA)
"Myopic view of computer architecture"
ISAs (appendices A and K): 80x86, ARM, MIPS

MIPS Register Usage Figure 1.4 Ref. CAAQA

MIPS Instructions Fig 1.5 Data Transfers Ref. CAAQA

MIPS Instructions Fig 1.5 Arithmetic/Logical
Bit numbering: the most significant bit is bit 0; the least significant bit is bit 63.
Ref. CAAQA

MIPS Instructions Fig 1.5 Control
Condition codes set by ALU operations
PC-relative branches
Jumps and jump-and-link
Where does the return address go on a function call? The return address register.
Ref. CAAQA

MIPS Instruction Format (RISC) Ref. CAAQA

Fig 1.7 Requirements Challenges for Computer Architects
Level of software compatibility
Operating system requirements
Standards
Ref. CAAQA

Fig 1.10 Performance over last 25-40 years Processors Ref. CAAQA

Fig 1.10 Performance over last 25-40 years Memory Ref. CAAQA

Fig 1.10 Performance over last 25-40 years Networks Disk Ref. CAAQA

Fig 1.10 Performance over last 25-40 years Processors Ref. CAAQA

Quantitative Principles of Design
Take advantage of parallelism
Principle of locality: temporal locality and spatial locality
Focus on the common case
Amdahl's Law
Ref. CAAQA

Taking Advantage of Parallelism
Logic parallelism: carry-lookahead adder
Word parallelism: SIMD
Instruction pipelining: overlap fetch and execute
Multithreading: executing independent instructions at the same time
Speculative execution
Ref. CAAQA

Principle of Locality
Rule of thumb (often likened to Zipf's law, though not quite the same thing): a program spends 90% of its execution time in only 10% of the code. So what do you try to optimize? The 10% that dominates.
Locality of memory references:
Temporal locality: a recently referenced item is likely to be referenced again soon
Spatial locality: items near a recently referenced item are likely to be referenced soon
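
To make this concrete, here is a minimal C sketch (illustrative only; the array name and sizes are hypothetical, not from the slides). Summing a matrix row by row visits memory in layout order, so each cache block fetched is fully used (spatial locality); summing column by column strides across rows and wastes most of each block:

    #include <stdio.h>

    #define N 1024
    static double a[N][N];          /* C stores this array row-major */

    /* Row-major traversal: consecutive accesses touch
       consecutive addresses, so spatial locality is high. */
    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal: each access jumps N doubles
       ahead, so nearly every access misses in the cache. */
    double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_rows(), sum_cols());
        return 0;
    }

Both functions compute the same sum; only the access order, and therefore the cache behavior, differs.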

Execution Time of Enhanced Systems
Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the percentage of the time the enhancement can be used.

Amdahl's Law
Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the percentage of the time the enhancement can be used.
Ref. CAAQA
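
The slide's equation was an image that did not survive the transcript; this is the standard form from CAAQA:

    Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

where Fraction_enhanced is the fraction of original execution time that can use the enhancement, and Speedup_enhanced is the speedup while it is in use.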

Amdahl's with Fractional Use Factor
Example: Suppose we are considering an enhancement to a web server. The enhanced CPU is 10 times faster on computation but the same speed on I/O. Suppose also that 60% of the time is spent waiting on I/O.
Ref. CAAQA
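
Working the numbers through Amdahl's Law (the slide's worked solution was an image; this reconstructs it from the parameters above): the enhancement applies only to the 40% of time spent computing, so

    Fraction_enhanced = 0.4, Speedup_enhanced = 10
    Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56

A CPU that is 10 times faster yields barely a 1.56x faster server, because I/O dominates.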

Amdahl's Law Revisited
Speedup = (execution time without enhancement) / (execution time with enhancement) = T_without / T_with
Notes:
The enhancement will be used only a portion of the time. If it will rarely be used, why bother trying to improve it?
Focus on the improvements that have the highest fraction of use time, denoted Fraction_enhanced.
Note that Fraction_enhanced is always less than 1.
Ref. CAAQA

Amdahl’s with Fractional Use Factor Ref. CAAQA

Graphics Square Root Enhancement (p. 40)
FPSQR is responsible for 20% of execution time in a graphics benchmark.
New design 1: speed up the FPSQR unit by a factor of 10.
New design 2: improve all FP instructions by a factor of 1.6; FP accounts for 50% of execution time.
Ref. CAAQA
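
Applying Amdahl's Law to both designs (the worked figures were an image; these follow directly from the parameters above):

    Speedup_FPSQR = 1 / (0.8 + 0.2/10)  = 1 / 0.82   ≈ 1.22
    Speedup_FP    = 1 / (0.5 + 0.5/1.6) = 1 / 0.8125 ≈ 1.23

Improving all FP instructions wins slightly, despite the smaller per-operation speedup, because it covers a larger fraction of execution time.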

Geometric Means vs Arithmetic Means
Ref. CAAQA
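
The slide's formulas were images; the standard definitions, for n programs, are:

    Arithmetic mean = (1/n) * (Time_1 + Time_2 + ... + Time_n)
    Geometric mean  = (Ratio_1 * Ratio_2 * ... * Ratio_n)^(1/n)

The geometric mean of execution-time ratios gives the same relative ranking no matter which machine is chosen as the reference, which is why SPEC summarizes SPECRatios with it.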

Comparing 2 Computers: SPECRatios
Ref. CAAQA

Performance Measures
Response time (latency): time between start and completion
Throughput (bandwidth): rate, i.e. work done per unit time
Processor speed, e.g. 1 GHz: when does it matter? When does it not?
Ref. CAAQA

Availability Ref. CAAQA
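
The slide's formula was an image; the standard CAAQA definition is:

    Module availability = MTTF / (MTTF + MTTR)

where MTTF is mean time to failure and MTTR is mean time to repair (their sum is MTBF, mean time between failures).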

MTTF Example
Ref. CAAQA
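
The example itself was lost in the transcript; a representative calculation in the style of CAAQA, assuming independent components with exponentially distributed lifetimes: a disk subsystem with 10 disks (1,000,000-hour MTTF each), one ATA controller (500,000 hours), one power supply (200,000 hours), one fan (200,000 hours), and one ATA cable (1,000,000 hours):

    FailureRate_system = 10/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
                       = (10 + 2 + 5 + 5 + 1) / 1,000,000 = 23 / 1,000,000 failures per hour
    MTTF_system = 1 / FailureRate_system = 1,000,000 / 23 ≈ 43,500 hours (about 5 years)

Failure rates of independent components add, so the system MTTF falls far below that of any single component.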

Comparing Performance (fig 1.15)
Comparing two programs executing on three machines (execution times in seconds):

                    Computer A   Computer B   Computer C
    Program P1            1           10           20
    Program P2         1000          100           20
    Total times        1001          110           40

Faster-than relationships:
A is 10 times faster than B on program P1
B is 10 times faster than A on program P2
C is 50 times faster than A on program P2
... 3 * 2 = 6 such comparisons in all (3 choose 2 pairs of computers, times 2 programs)
So what is the relative performance of these machines?
Ref. CAAQA

Fig 1.15: Total Execution Times
Comparing two programs executing on three machines (times in seconds):

                    Computer A   Computer B   Computer C
    Program P1            1           10           20
    Program P2         1000          100           20
    Total times        1001          110           40

So now what is the relative performance of these machines?
B is 1001/110 = 9.1 times as fast as A
Arithmetic mean execution time = (1/n) * (Time_1 + ... + Time_n), i.e. 500.5 for A, 55 for B, 20 for C
Ref. CAAQA

Weighted Execution Times (fig 1.15)

                    Computer A   Computer B   Computer C
    Program P1            1           10           20
    Program P2         1000          100           20
    Total times        1001          110           40

Now assume that we know P1 will run 90% of the time and P2 10% of the time. So now what is the relative performance of these machines?

    time_A = 0.9*1  + 0.1*1000 = 100.9
    time_B = 0.9*10 + 0.1*100  = 19
    time_C = 0.9*20 + 0.1*20   = 20
    Relative performance of A to B = 100.9/19 = 5.31

Ref. CAAQA

Geometric Means
Compare ratios of performance to a standard.
Using A as the standard:
Program P1: B ratio = 10/1 = 10; C ratio = 20/1 = 20
Program P2: B ratio = 100/1000 = 0.1; C ratio = 20/1000 = 0.02
On P1, B is "twice as fast" as C using A as the standard.
Using B as the standard:
Program P1: A ratio = 1/10 = 0.1; C ratio = 20/10 = 2
Program P2: A ratio = 1000/100 = 10; C ratio = 20/100 = 0.2
Now compare the A and B ratios to each other: you get the same 10 and 0.1 either way. So what? The comparison does not depend on the chosen standard.
Ref. CAAQA

Geometric Means (fig 1.17)
Measure performance ratios to a standard machine, normalizing to each of A, B, and C in turn:

                       Normalized to A       Normalized to B       Normalized to C
                       A     B      C        A     B     C         A      B     C
    P1                1.0   10.0   20.0     0.1   1.0   2.0       0.05   0.5   1.0
    P2                1.0    0.1    0.02   10.0   1.0   0.2      50.0    5.0   1.0
    Arithmetic mean   1.0    5.05  10.01    5.05  1.0   1.1      25.03   2.75  1.0
    Geometric mean    1.0    1.0    0.63    1.0   1.0   0.63      1.58   1.58  1.0
    Total time        1.0    0.11   0.04    9.1   1.0   0.36     25.03   2.75  1.0

Only the geometric means rank the machines the same way regardless of which machine is the standard.
Ref. CAAQA

CPU Performance Equation
Almost all computers use a clock running at a fixed rate, e.g. 1 GHz (clock period 1 ns).
Instruction count (IC): the number of instructions the program executes.
CPI = CPUClockCyclesForProgram / InstructionCount
CPUtime = IC * CyclesPerInstruction * ClockCycleTime
Ref. CAAQA
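
A quick plug-in of the equation (the numbers here are hypothetical, chosen only to illustrate): a program executing 2 * 10^9 instructions at CPI = 1.5 on a 1 GHz clock (1 ns cycle) takes

    CPUtime = (2 * 10^9) * 1.5 * 1 ns = 3 seconds

Equivalently, CPUtime = IC * CPI / ClockRate.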

CPU Performance Equation
CPUtime = InstructionCount * CPI * ClockCycleTime
Ref. CAAQA

Fallacies and Pitfalls
Pitfall: falling prey to Amdahl's law.
Pitfall: a single point of failure.
Fallacy: the cost of the processor dominates the cost of the system.
Fallacy: benchmarks remain valid indefinitely.
Fallacy: the rated mean time to failure of disks is 1,200,000 hours, or almost 140 years, so disks practically never fail.
Fallacy: peak performance tracks observed performance.
Pitfall: fault detection can lower availability.
Ref. CAAQA

List of Appendices Ref. CAAQA

Homework Set #1: Due Friday Sept 6 (Dropbox)
1.5
1.8 a-d (change 2015 to 2017 throughout the question)
1.9
1.18

George K. Zipf (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.

1.8 [10/15/15/10/10] <1.4, 1.5> One challenge for architects is that the design created today will require several years of implementation, verification, and testing before appearing on the market. This means that the architect must project what the technology will be like several years in advance. Sometimes, this is difficult to do.
a. [10] <1.4> According to the trend in device scaling observed by Moore's law, the number of transistors on a chip in 2015 should be how many times the number in 2005?
b. [15] <1.5> The increase in clock rates once mirrored this trend. Had clock rates continued to climb at the same rate as in the 1990s, approximately how fast would clock rates be in 2015?
c. [15] <1.5> At the current rate of increase, what are the clock rates now projected to be in 2015?
d. [10] <1.4> What has limited the rate of growth of the clock rate, and what are architects doing with the extra transistors now to increase performance?
Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 2203-2217. Elsevier Science. Kindle Edition.

Zipf's Law
Zipf's law states that, given some corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on: the rank-frequency distribution is an inverse relation.
In the Brown Corpus of American English text, "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences. True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.[4]

Stages of the Classical 5-Stage Pipeline
Instruction fetch (IF):
    IR <- Mem[PC]
    NPC <- PC + 4
Instruction decode / register fetch (ID):
    A <- Regs[rs]
    B <- Regs[rt]
    Imm <- sign-extend of the immediate field of IR
Execution / effective address (EX)
Memory access (MEM)
Write-back (WB)

Clock cycle number (time ) Simple RISC Pipeline Clock cycle number (time ) Instruction Instruction n Instruction n+1 Instruction n+2 Instruction n+3 Instruction n+4 1 2 3 4 5 6 7 8 9 IF ID EX MEM WB

Performance Analysis in a Perfect World
Assume S stages in the pipeline; at each cycle a new instruction is initiated. To execute N instructions takes:
    N cycles to start up the instructions
    (S-1) cycles to flush the pipeline
    TotalTime = N + (S-1)
Example, for S=5 from the previous slide and N=100 instructions:
    Time to execute, non-pipelined = 100 * 5 = 500 cycles
    Time to execute, pipelined = 100 + (5-1) = 104 cycles
    SpeedUp = 500 / 104 ≈ 4.8
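
In general, from the counts above:

    SpeedUp = N*S / (N + S - 1)

which approaches S, the number of stages, as N grows large. So the ideal speedup of a pipeline equals its depth; hazards are what keep real pipelines below it.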

Implementing Pipelines (Supp. Fig C.4)

Pipeline Example with a Problem (like A.5)

    Instruction           1    2    3    4    5    6    7    8    9
    DADD R1, R2, R3       IM   ID   EX   DM   WB
    DSUB R4, R1, R5            IM   ID   EX   DM   WB
    AND  R6, R1, R7                 IM   ID   EX   DM   WB
    OR   R8, R1, R9                      IM   ID   EX   DM   WB
    XOR  R10, R1, R11                         IM   ID   EX   DM   WB

Inserting Pipeline Registers into the Data Path (Fig A.18)

Major Hurdle of Pipelining
Consider executing the code below:

    DADD R1, R2, R3    /* R1 <- R2 + R3   */
    DSUB R4, R1, R5    /* R4 <- R1 - R5   */
    AND  R6, R1, R7    /* R6 <- R1 & R7   */
    OR   R8, R1, R9    /* R8 <- R1 | R9   */
    XOR  R10, R1, R11  /* R10 <- R1 ^ R11 */

Every instruction after the DADD needs the new value of R1.

RISC Pipeline Problems
Clock cycle number (time):

    Instruction           1    2    3    4    5    6    7    8    9
    DADD R1, R2, R3       IM   ID   EX   DM   WB
    DSUB R4, R1, R5            IM   ID   EX   DM   WB
    AND  R6, R1, R7                 IM   ID   EX   DM   WB
    OR   R8, R1, R9                      IM   ID   EX   DM   WB
    XOR  R10, R1, R11                         IM   ID   EX   DM   WB

So what's the problem?

Hazards
Data hazards: a data value computed in one stage is not ready when it is needed in another stage of the pipeline. Simple solution: stall until it is ready, but we can do better (forwarding).
Control or branch hazards.
Structural hazards: arise when resources are not sufficient to completely overlap the instruction sequence, e.g. having only two floating-point add units when three FP adds must execute simultaneously.
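
A minimal sketch of the stall-only solution for the DADD/DSUB pair above (assuming, as in Appendix C, that the register file is written in the first half of WB and read in the second half of ID, with no forwarding hardware): DSUB cannot read R1 in ID until DADD has written it in WB.

    Instruction         1    2    3      4      5    6    7    8
    DADD R1, R2, R3     IM   ID   EX     DM     WB
    DSUB R4, R1, R5          IM   stall  stall  ID   EX   DM   WB

Two stall cycles are lost; forwarding the ALU result directly into the next instruction's EX stage removes them entirely for this pair.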