CS1104 2001/02 Semester II Help Session IIA Performance Measures Colin Tan S15-04-05



Basic Concepts: Instruction Execution Cycles
Processors execute instructions in several steps:
–Instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), write result (WB).
–The previous step must complete before the next step can proceed correctly!
–Coordination between steps relies on a series of “ticks” called “clock cycles” (CC). Clock cycle n is denoted by CCn.
–So in our processor: CC1: IF, CC2: ID, CC3: EX, CC4: MEM, CC5: WB.

Basic Concepts: Instruction Execution Cycles
So each instruction takes a certain number of cycles to execute. If the processor is NOT pipelined, then an instruction may skip some stages and hence take fewer cycles. The average number of cycles required for a particular instruction is called the instruction CPI.
–E.g. ADD may require 2 cycles and SUB may require 3 cycles. The instruction CPI of ADD is therefore 2, and that of SUB is 3.

Basic Concepts: Instruction Frequency
A program (e.g. Microsoft Word) is made up of many instructions, drawn from each of the different types (classes) of instructions.
–The number of instructions in each class is called the “instruction frequency” of that class.
–E.g. there may be 1017 ADDs, 763 MULs, and so on for SUB etc.
–This is often expressed as a percentage or as a fraction of the total instruction count.
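To make the idea of instruction frequency concrete, here is a minimal Python sketch that turns per-class instruction counts into fractions. The ADD and MUL counts are the ones quoted above; the SUB count and the function name `instruction_frequencies` are assumptions made up for illustration only.

```python
def instruction_frequencies(counts):
    """Convert per-class instruction counts into fractions of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# ADD and MUL counts come from the example above; the SUB count is a
# hypothetical value chosen only so that the total is easy to check.
counts = {"ADD": 1017, "MUL": 763, "SUB": 220}
freqs = instruction_frequencies(counts)

for name, f in freqs.items():
    # Show each class both as a fraction and as a percentage.
    print(f"{name}: {f:.4f} ({f:.2%})")
```

By construction the fractions add up to 1.0, which is the property the later CPI calculations rely on.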

Basic Concepts: Average Cycles Per Instruction
The instruction frequency and the number of cycles an instruction requires (instruction CPI) can be used to compute the average Cycles Per Instruction, or simply the CPI, of a particular program.
–Each type of instruction takes a different number of clock cycles.
–A program consists of several different types of instructions.
–The average CPI is the average number of cycles required to execute each instruction, across all types of instructions.

Calculating Average CPI
Find the overall CPI of a program running on a processor with the class CPIs and instruction frequencies shown here:

Type   CPI   Instruction Frequency
Add    3     0.40
Sub    2     0.25
Mul    4     0.15
Div    5     0.20

Calculating Average CPI
–Let’s assume that the total number of instructions is IC. Then there are 0.4IC ADD instructions, 0.25IC SUB instructions, 0.15IC MUL instructions and 0.2IC DIV instructions. The total number of clock cycles used by ADD instructions is 0.4IC x 3, by SUB is 0.25IC x 2, by MUL is 0.15IC x 4, and by DIV is 0.2IC x 5 cycles.
–Hence the total number of clock cycles used by this program is 0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x 5.
–The number of instructions is IC. Hence the average number of cycles per instruction (average CPI) is (0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x 5)/1.0IC. IC cancels, leaving 0.4 x 3 + 0.25 x 2 + 0.15 x 4 + 0.2 x 5 = 1.2 + 0.5 + 0.6 + 1.0 = 3.3. Hence for this program, each instruction requires, on average, 3.3 cycles.
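The weighted sum can also be checked mechanically. The following is a minimal Python sketch using the class CPIs and frequencies from the table; the function name `average_cpi` and the dictionary layout are assumptions for illustration. With the tabulated values the terms are 1.2 + 0.5 + 0.6 + 1.0 = 3.3.

```python
def average_cpi(classes):
    """classes: iterable of (class CPI, instruction frequency) pairs.

    Assumes the frequencies are fractions that sum to 1.0.
    """
    classes = list(classes)
    total_freq = sum(freq for _, freq in classes)
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1.0")
    # Each class contributes (cycles per instruction) x (share of instructions).
    return sum(cpi * freq for cpi, freq in classes)

# Values taken from the table: type -> (CPI, instruction frequency).
mix = {"Add": (3, 0.40), "Sub": (2, 0.25), "Mul": (4, 0.15), "Div": (5, 0.20)}

print(f"average CPI = {average_cpi(mix.values()):.2f}")
```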

Exercise Find the average CPI of the following program:

Exercise Ratio of instructions is shown below: This gives us the following relative frequencies:

Exercise Hence our average CPI is: –0.36 x x x x 12 = 3.56 Thus, on average, each instruction will take 3.56 clock cycles.

Why is this useful?
Each cycle that an instruction takes consumes time. If the clock rate of a CPU is 500 MHz, then each second there will be 500,000,000 cycles (note: 1 MHz is 10^6 cycles per second, NOT 2^20 cycles per second!). Therefore each cycle requires 1/(500 x 10^6) seconds.
–This works out to 2 ns per cycle.
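As a quick check of the arithmetic, here is a small Python sketch that converts a clock rate into a cycle time. The function name `cycle_time_ns` is an assumption for illustration.

```python
def cycle_time_ns(clock_rate_hz):
    """Return the duration of one clock cycle in nanoseconds."""
    # 1 second = 1e9 ns, and the clock rate is cycles per second.
    return 1e9 / clock_rate_hz

MHZ = 10**6  # 1 MHz is 10^6 cycles per second, not 2^20

print(cycle_time_ns(500 * MHZ), "ns per cycle")
```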

But still.. Why is this useful?
If there are IC instructions in a program (called the instruction count of the program), and if the average CPI is C, then the total number of cycles used by this program is IC x C. At 500 MHz each cycle requires 2 ns, so the program will require (IC x C x 2) ns to execute. This is called the execution time of the program, and forms the basis for performance comparison.
–We take a program and run it on machine M1, recording the execution time T_M1, then run the same program on machine M2, recording the execution time T_M2. If T_M1 > T_M2, then machine M2 is faster than M1, and it is faster by a factor of T_M1 / T_M2.
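The execution-time formula and the speedup ratio can be sketched in Python as follows. The instruction count, the CPIs and the clock rates of M1 and M2 below are hypothetical values chosen for illustration, as are the function names.

```python
def execution_time_ns(instruction_count, avg_cpi, clock_rate_hz):
    """Execution time = instruction count x average CPI x cycle time."""
    cycle_ns = 1e9 / clock_rate_hz
    return instruction_count * avg_cpi * cycle_ns

def speedup(t_slower, t_faster):
    """How many times faster the machine with time t_faster is."""
    return t_slower / t_faster

# Hypothetical machines running the same program of 1,000,000 instructions.
ic = 1_000_000
t_m1 = execution_time_ns(ic, avg_cpi=3.0, clock_rate_hz=500e6)  # 2 ns per cycle
t_m2 = execution_time_ns(ic, avg_cpi=2.0, clock_rate_hz=1e9)    # 1 ns per cycle

print(f"T_M1 = {t_m1:.0f} ns, T_M2 = {t_m2:.0f} ns")
print(f"speedup of M2 over M1 = {speedup(t_m1, t_m2):.2f}")
```

Note that both the CPI and the clock rate enter the result: a machine with a faster clock can still lose if its average CPI is high enough.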

Exercise Find i) average CPI, ii) Execution time of the program below for machines M1 and M2, then find the speedup of M2 over M1.

How Caches Affect Performance
Sometimes the instruction/data required is not present in the cache.
–This is a cache miss!
–The cache system needs to go to main memory to remedy the miss. This takes many cycles!
If execution proceeds anyway, the results will be meaningless.
–Either the required instruction is not loaded yet because of the cache miss, or the data is not loaded.
The CPU responds by freezing (stalling) the instruction for many cycles.
–This gives memory time to produce the instruction/data for the cache.
When the cache miss is remedied, the CPU re-reads the cache. Hence cache misses add cycles to the instruction, and thus increase the instruction CPI.

How Caches Affect Performance
The equation given in the lecture notes is:

CPI_memory = Instruction Frequency * L1 miss rate * (L1 miss penalty + L2 miss rate * L2 miss penalty)
           + Data Access Frequency * L1 miss rate * (L1 miss penalty + L2 miss rate * L2 miss penalty)

Note that we do not use the cache hit figures because the basic instruction CPI already factors these in.
–The basic instruction CPI includes reading from the instruction cache assuming a cache hit, or reading from the data cache assuming a cache hit. Hence here we are only concerned with the cycles added because of a cache miss.
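The equation above can be sketched in Python as below. All numeric inputs are illustrative assumptions: in particular the instruction frequency of 1.0 (every instruction is fetched), the data access frequency of 0.3 and the base CPI of 2.0 are hypothetical, and the function names are made up for this sketch.

```python
def miss_cycles_per_access(l1_miss_rate, l1_miss_penalty,
                           l2_miss_rate, l2_miss_penalty):
    """Average extra cycles per memory access caused by cache misses."""
    return l1_miss_rate * (l1_miss_penalty + l2_miss_rate * l2_miss_penalty)

def cpi_memory(instr_freq, data_freq, l1_miss_rate, l1_miss_penalty,
               l2_miss_rate, l2_miss_penalty):
    """Extra CPI from misses on instruction fetches and data accesses."""
    per_access = miss_cycles_per_access(l1_miss_rate, l1_miss_penalty,
                                        l2_miss_rate, l2_miss_penalty)
    # Same per-access cost for instruction fetches and data accesses,
    # as in the equation: the two terms differ only in their frequency.
    return instr_freq * per_access + data_freq * per_access

# Illustrative cache parameters; the frequencies and base CPI are hypothetical.
extra = cpi_memory(instr_freq=1.0, data_freq=0.3,
                   l1_miss_rate=0.05, l1_miss_penalty=12,
                   l2_miss_rate=0.03, l2_miss_penalty=40)
base_cpi = 2.0  # hypothetical CPI assuming every access hits in the cache

print(f"CPI_memory = {extra:.3f}")
print(f"overall CPI = {base_cpi + extra:.3f}")
```

The overall CPI is the base CPI (which assumes hits) plus CPI_memory, matching the point that only miss cycles are added here.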

Exercise Given the following program and machine, assume that L1 miss rate is 0.05, L1 miss penalty is 12 cycles, L2 miss rate is 0.03, L2 miss penalty is 40 cycles, find the average CPI.

One Last Exercise

Moral: Always ensure that the frequencies add up to 1.0 (100%), otherwise you need to normalize the answer by dividing by the total frequency.
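A minimal Python sketch of this normalization follows. The instruction mix is hypothetical, chosen so that the frequencies add up to 1.2 rather than 1.0, and the function name `normalized_cpi` is an assumption.

```python
def normalized_cpi(classes):
    """Average CPI for (class CPI, frequency) pairs whose frequencies
    need not sum to 1.0: divide the weighted sum by the total frequency."""
    classes = list(classes)
    total_freq = sum(freq for _, freq in classes)
    weighted = sum(cpi * freq for cpi, freq in classes)
    return weighted / total_freq

# Hypothetical mix: the frequencies sum to 1.2, so the raw weighted sum
# (3.4) would overstate the average CPI.
mix = [(2, 0.5), (3, 0.4), (4, 0.3)]

print(f"normalized CPI = {normalized_cpi(mix):.3f}")
```

When the frequencies already sum to 1.0, the division changes nothing, so the same function covers both cases.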

Summary
Instructions are timed using a central clock. Each tick of the clock is called a clock cycle, or simply a cycle. Each instruction requires a certain number of cycles on average to execute. This is the instruction CPI. Different instructions within a program have different CPIs; however, we can compute the average CPI across all instructions in a given program. Performance can be measured by running the same program on different machines. If the execution time on M1 is T_M1 and on M2 is T_M2, then the speedup of M1 over M2 is T_M2 / T_M1, and vice versa.

Summary
Cache misses cause the CPI of an instruction, and the overall CPI of a program, to go up.
–The processor needs to freeze the instruction to allow memory to deliver the missing instruction/data to the cache.
Remember to normalize your CPI if the total frequency does not add up to 1.0!

Further Reading Please read Dr. Ankush’s notes as well!