CS520S99 Introduction (C. Edward Chow, 1/16/99)

Page 1: Why study computer architecture?
- To learn the principles for designing processors and systems
- To learn the system configuration trade-offs: what size of caches/memory is enough, what kind of buses to connect system components, what size (speed) of disks to use
- To choose a computer for a set of applications in a project
- To interpret the benchmark figures given by salespersons
- To decide which processor chips to use in a system
- To design the system software (compiler, OS) for a new processor?
- To be the leader of a processor design team?
- To learn several machines' assembly languages?

Page 2: The Basic Structure of a Computer

Page 3: Control and Data Flow in the Processor
The processor is made up of:
- Data operator (Arithmetic and Logic Unit, ALU), D: consumes and combines information into a new meaning
- Control, K: evokes the operations of the other components

Page 4: Control is often distributed

Page 5: Instruction Execution at Register Transfer Level (RTL)
Consider the detailed execution of the instruction "move &100, %d0" (moving the constant 100 into register d0).
- Assume the instruction was loaded into memory location 1000.
- The opcode of the move instruction and the register address d0 are encoded in bytes 1000 and 1001.
- The constant 100 is in bytes 1002 and 1003.

Page 6: RTL Instruction Execution
Mpc is set to 1000, pointing at the instruction in memory.
Step 1: Mmar = Mpc; // put the pc into the mar; prepare to fetch the instruction

Page 7: Update Program Counter
Step 2: Mpc = Mpc + 4; // update the program counter: move the Mpc value to D, D performs +4, move the result back to Mpc

Page 8: Instruction Fetch
Step 3: Mir = Mp[Mmar]; // fetch the instruction: send the Mmar value to Mp; Mp retrieves move|d0|100 and sends it back to Mir
Steps 3 and 2 can be done in parallel.

Page 9: Instruction Decoding
Step 4: Decode the instruction in Mir.

Page 10: RTL Instruction Execution
Step 5: Mgeneral[0] = Mir<16:31>; // execute the move of the constant into the general register named d0
The subscript <16:31> denotes the 16th through 31st bits of Mir, which contain the constant 100.
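Taken together, the five steps form a tiny fetch-decode-execute loop. Below is a minimal C sketch of that sequence for this one instruction; the register names (mpc, mmar, mir, mgeneral) mirror the slides, while the memory image and field encodings are illustrative assumptions, not the actual 68000 encoding.

```c
#include <stdio.h>
#include <stdint.h>

/* Toy RTL-level simulation of "move &100, %d0" (pages 5-10).
   Opcode/field encodings are illustrative assumptions, not real 68000 opcodes. */
#define OP_MOVE 0x01

int main(void) {
    uint8_t mp[2048] = {0};      /* Mp: main memory */
    uint32_t mpc, mmar, mir;     /* Mpc, Mmar, Mir */
    int32_t mgeneral[8] = {0};   /* Mgeneral: d0..d7 */

    /* Instruction at location 1000: opcode and register in bytes 1000-1001,
       constant 100 in bytes 1002-1003 (big-endian here). */
    mp[1000] = OP_MOVE; mp[1001] = 0;      /* register d0 */
    mp[1002] = 0;       mp[1003] = 100;    /* constant 100 */

    mpc = 1000;                            /* Mpc points at the instruction */
    mmar = mpc;                            /* Step 1: Mmar = Mpc */
    mpc = mpc + 4;                         /* Step 2: Mpc = Mpc + 4 (can overlap step 3) */
    mir = (uint32_t)mp[mmar] << 24 | (uint32_t)mp[mmar + 1] << 16
        | (uint32_t)mp[mmar + 2] << 8 | mp[mmar + 3];   /* Step 3: Mir = Mp[Mmar] */

    uint32_t opcode = mir >> 24;           /* Step 4: decode */
    uint32_t reg    = (mir >> 16) & 0xFF;
    if (opcode == OP_MOVE)
        mgeneral[reg] = mir & 0xFFFF;      /* Step 5: Mgeneral[0] = Mir<16:31> */

    printf("d0 = %d, next pc = %u\n", (int)mgeneral[0], (unsigned)mpc);
    return 0;                              /* prints: d0 = 100, next pc = 1004 */
}
```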

Page 11: Computer Architecture
The term "computer architecture" was coined by IBM in 1964 for use with the IBM 360. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the instruction set. They believed that a family of machines of the same architecture should be able to run the same software.
Benefits: with a precisely defined architecture, we can have many compatible implementations; a program written in the instruction set can run on all of the compatible implementations.

Page 12: Architecture & Implementation
- Single architecture, multiple implementations → computer family
- Multiple architectures, single implementation → microcode emulator

Page 13: Computer Architecture Topics
- Instruction Set Architecture
- Pipelining and instruction-level parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP
- Memory hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency, interleaving
- Input/output and storage: disks, WORM, tape; RAID, bus protocols, emerging technologies
- Addressing, protection, exception handling
- VLSI
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 14: Computer Architecture Topics (continued)
- Multiprocessors: shared memory, message passing, data parallelism; processor-memory-switch organization
- Networks and interconnections: topologies, routing, bandwidth, latency, reliability; network interfaces
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 15: CS 520 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
Computer architecture sits at the intersection of technology, programming languages, operating systems, history, and applications; it spans instruction set design (the ISA interface), organization, and hardware, and is driven by measurement & evaluation and by parallelism.
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 16: Functional Requirements Faced by a Computer Designer
Applications:
- General purpose: balanced performance for a range of tasks
- Scientific: high-performance floating point
- Commercial: support for COBOL (decimal arithmetic), database/transaction processing
Level of software compatibility:
- Object code/binary level: no software porting, but more hardware design cost
- Programming language level: avoids the old architecture's burden, but requires software porting

Page 17: Functional Requirements Faced by a Computer Designer (continued)
Operating system requirements:
- Size of address space
- Memory management/protection (e.g., garbage collection vs. real-time scheduling)
- Interrupts/traps
Standards:
- Floating point (IEEE 754)
- I/O bus
- OS
- Networks
- Programming languages

Page 18: Computer Food Chain (then)
PC, workstation, minicomputer, mainframe, mini-supercomputer, supercomputer, massively parallel processors.
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 19: Computer Food Chain (now)
PC, workstation, server, mainframe, supercomputer, mini-supercomputer, massively parallel processors, minicomputer. Now who is eating whom?
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 20: Why Such Change in 10 Years?
Performance:
- Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
- Computer architecture advances improve the low end: RISC, superscalar, RAID, ...
Price, lower costs due to:
- Simpler development: CMOS VLSI means smaller systems, fewer components
- Higher volumes: CMOS VLSI spreads the same development cost over 10,000,000 rather than 10,000 units
- Lower margins by class of computer, due to fewer services
Function:
- Rise of networking/local interconnection technology
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 21: Technology Trends: Microprocessor Capacity
CMOS improvements:
- Die size: 2X every 3 years
- Line width: halves every 7 years
Transistor counts (the "graduation window"):
- Alpha 21264: 15 million
- Alpha 21164: 9.3 million
- PowerPC 620: 6.9 million
- Pentium Pro: 5.5 million
- Sparc Ultra: 5.2 million
Moore's Law.
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 22: Memory Capacity (Single-Chip DRAM)
[Table: DRAM size in Mb and cycle time in ns, by year; capacity grows roughly 4X every 3 years while cycle time shrinks only slowly.]
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 23: Technology Trends (Summary)

        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 24: Processor Performance Trends
[Plot: performance versus year for microprocessors, minicomputers, mainframes, and supercomputers.]
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 25: Processor Performance
[Plot: microprocessor performance over time, growing roughly 1.35X per year before and about 1.54X per year now.]
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 26: Performance Trends (Summary)
- Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
- Improvement in cost-performance is estimated at 70% per year
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Pages 27-30: Computer Engineering Methodology
The methodology is built up across four slides as a loop:
1. Technology trends feed the design process.
2. Evaluate existing systems for bottlenecks (using benchmarks).
3. Simulate new designs and organizations (using workloads).
4. Implement the next-generation system (bounded by implementation complexity), then repeat.
Adapted from Prof. Patterson's CS252S98 viewgraphs. Copyright 1998 UCB.

Page 31: Measurement and Evaluation
Architecture is an iterative process:
- Searching the space of possible designs
- At all levels of computer systems
Creativity produces good, mediocre, and bad ideas; cost/performance analysis is what separates them.
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 32: Measurement Tools
- Benchmarks, traces, mixes
- Hardware: cost, delay, area, power estimation
- Simulation (at many levels): ISA, RT, gate, circuit
- Queuing theory
- Rules of thumb
- Fundamental "laws"/principles

Page 33: Metrics of Computer Architecture
- Space is measured in bits of representation
- Time is measured in bit traffic (memory bandwidth)
Many old frequency and benchmark studies focused on:
- dynamic opcode frequencies (a memory-size concern)
- exponent differences of floating-point operands (precision)
- lengths of decimal numbers in business files (memory size)
Trend: space is not much of a concern; speed/time is everything. Here we focus on the following two performance metrics:
- Response time: the time between the start and finish of an event (also called execution time or latency)
- Throughput: the total amount of work done in a given time (bandwidth: the number of bits or bytes moved per second)

Page 34: Metrics of Performance at Different Levels
- Application: answers per month, operations per second
- Programming language / compiler
- ISA: (millions of) instructions per second, MIPS
- Datapath, control: megabytes per second
- Function units: (millions of) FP operations per second, MFLOP/s
- Transistors, wires, pins: cycles per second (clock rate)
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.

Page 35: Quantitative Principles
"Improve" means increasing performance or, equivalently, decreasing execution time. "X is n% faster than Y" has a precise meaning (see the formulation below).
Quantitative principles:
- Make the common case fast (Amdahl's Law)
- Locality of reference: 90% of execution time is spent in 10% of the code
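Spelled out, using the standard definitions (the slide's own formula image did not survive, so these are the usual textbook formulations):

\[
\text{Performance}_X = \frac{1}{\text{ExecutionTime}_X},
\qquad
\text{"X is } n\% \text{ faster than Y"} \iff
\frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X}
= \frac{\text{Performance}_X}{\text{Performance}_Y}
= 1 + \frac{n}{100}
\]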

Page 36: Amdahl's Law
The law of diminishing returns. With FractionInEnhancedMode measured against the old system (e.g., 0.5) and SpeedupOfEnhancedMode the speedup of the enhanced portion alone:

\[
\text{OverallSpeedup} = \frac{\text{Time}_{\text{old}}}{\text{Time}_{\text{new}}}
= \frac{1}{(1 - \text{FractionInEnhancedMode}) + \dfrac{\text{FractionInEnhancedMode}}{\text{SpeedupOfEnhancedMode}}}
\]
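A minimal C sketch of this formula, handy for checking the examples on the next few pages (the function name and interface are mine, not from the slides):

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction of the old
   execution time is sped up by a given factor. */
double amdahl(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) +
                  fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* Example 1 (page 38): 90% of time in memory access, 10x faster memory. */
    printf("speedup = %.2f\n", amdahl(0.9, 10.0));   /* 5.26 */
    /* Example 2 (page 39): 40% CPU time; even an infinitely fast CPU
       cannot beat 1/(1-0.4) = 1.67, so an overall speedup of 2 is impossible. */
    printf("bound   = %.2f\n", amdahl(0.4, 1e9));    /* ~1.67 */
    return 0;
}
```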

Page 37: Amdahl's Law Result
[Plot: OverallSpeedup versus FractionInEnhancedMode for SpeedupOfEnhancedMode = 2 and for larger speedups; the curves show that the overall speedup is capped by the unenhanced fraction.]

Page 38: Apply Amdahl's Law: Example 1
Example 1: Assume that memory access accounts for 90% of the execution time. What is the speedup from replacing a 100ns memory with a 10ns memory? How much faster is the new system?
Answer: FractionInEnhancedMode = 90% = 0.9; SpeedupOfEnhancedMode = 100ns/10ns = 10.
OverallSpeedup = 1/((1-0.9) + 0.9/10) = 1/0.19 ≈ 5.26, so the new system is 426% faster than the old one.
Is it worthwhile if the high-speed memory costs 10 times more?

Page 39: Apply Amdahl's Law: Example 2
Example 2: Assume that 40% of the time is spent on the CPU task; the rest is spent on I/O. Assume we improve the CPU and keep the I/O speed unchanged.
a) How much faster should the new CPU be to get an overall speedup of 1.5?
b) Is it possible to get an overall speedup of 2? Why?
Solution:
a) 1.5 = 1/((1-0.4) + 0.4/x) gives 0.6 + 0.4/x = 2/3, so x = 6: the new CPU must be 6 times as fast, i.e., 500% faster.
b) The maximum overall speedup that can be achieved is 1/(1-0.4) ≈ 1.67, even with an infinitely fast CPU. Therefore it is not possible to achieve an overall speedup of 2.

Page 40: Apply Amdahl's Law: Example 3
Example: Research on the bottleneck of a 10 Mbps Ethernet network system showed that only 10% of the execution time of a distributed application was spent transmitting messages, while 90% of the time went to application/protocol software execution on the host computers. If we replace Ethernet with 100 Mbps FDDI, which is 900% faster than Ethernet, what will be the speedup from this improvement? What if we instead use 900% faster hosts?

Page 41: Execution Time
The first performance metric, and the best one: measure the time it takes to execute the intended application(s) or the typical workload. The time command can measure an application:
vlsia[93]: time ts u 27.2s 8:16 49% k 6+3io 26pf+0w
Here is an example which shows how the OS and I/O impact the execution time. For program 1:
- Elapsed time = sum(t1:t11) - t6 - t8
- System CPU time = t1 + t3 + t5 + t9 + t11
- CPU time (system + user) = t1 + t3 + t4 + t5 + t9 + t10 + t11
- User CPU time = t4 + t10

Page 42: CPU Time

\[
\text{CPUTime} = \text{IC} \times \text{CPI} \times \text{ClockCycleTime},
\qquad
\text{CPI} = \sum_i \text{CPI}_i \times I_i
\]

where CPI = clock cycles per instruction, I_i is the frequency of instruction type i in the program, IC = instruction count, and ClockCycleTime = 1/ClockRate.
The CPI figure gives insight into different styles of instruction sets and implementations.
Interdependence among instruction count, CPI, and clock rate:
- Clock rate: hardware technology and organization
- CPI: organization and instruction set architecture
- Instruction count: instruction set architecture and compiler technology
We cannot measure the performance of a computer by any single factor above alone.
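A small C sketch of these formulas (function and variable names are mine); it reproduces the 640 ns result computed on pages 45-46:

```c
#include <stdio.h>

/* CPU time from an instruction mix:
   CPUTime = ClockCycleTime * sum_i(CPI_i * Count_i). */
double cpu_time_ns(const int counts[], const int cpi[], int ntypes,
                   double cycle_time_ns) {
    long cycles = 0;
    for (int i = 0; i < ntypes; i++)
        cycles += (long)counts[i] * cpi[i];
    return cycles * cycle_time_ns;
}

int main(void) {
    /* Pages 45-46 example: MIPS R2000 at 25 MHz => 40 ns cycle time.
       Instruction types: lw, subu, mul, div, sw. */
    int counts[] = {5, 2, 1, 1, 1};
    int cpi[]    = {2, 1, 1, 1, 2};
    printf("CPU time = %.0f ns\n", cpu_time_ns(counts, cpi, 5, 40.0)); /* 640 ns */
    return 0;
}
```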

Page 43: Evaluating Instruction Set Design
Example (page 39 of the text): 1/4 of the ALU and load instructions are replaced by a new register-memory (r->m) instruction. Assume the clock cycle time is unchanged. Is this a good idea?

Instruction type   Freq. before   Cycles   Freq. after   Cycles
ALU ops            43%            1        36.1%         1
Loads              21%            2        11.4%         2
Stores             12%            2        13.5%         2
Branches           24%            2        26.9%         3
New r->m           --             --       12.1%         2

Page 44: Evaluating Instruction Set Design (continued)
CPI_old = 0.43*1 + 0.21*2 + 0.12*2 + 0.24*2 = 1.57
CPUTime_old = InstructionCount_old * 1.57 * ClockCycleTime_old
CPI_new = 0.361*1 + 0.114*2 + 0.135*2 + 0.269*3 + 0.121*2 = 1.908
CPUTime_new = (0.893 * InstructionCount_old) * 1.908 * ClockCycleTime_old
            = 1.704 * InstructionCount_old * ClockCycleTime_old
With these assumptions, it is a bad idea to add register-memory instructions: the new design is slower despite executing fewer instructions.
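A short C sketch of this comparison (the helper is mine, not from the text); it expresses relative CPU time as the product of the relative instruction count and the CPI ratio:

```c
#include <stdio.h>

/* Weighted CPI from instruction frequencies (fractions summing to 1). */
double weighted_cpi(const double freq[], const int cycles[], int n) {
    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += freq[i] * cycles[i];
    return cpi;
}

int main(void) {
    /* Pages 43-44: base design vs. design with a register-memory instruction. */
    double f_old[] = {0.43, 0.21, 0.12, 0.24};            /* ALU, load, store, branch */
    int    c_old[] = {1, 2, 2, 2};
    double f_new[] = {0.361, 0.114, 0.135, 0.269, 0.121}; /* + new r->m */
    int    c_new[] = {1, 2, 2, 3, 2};

    double cpi_old = weighted_cpi(f_old, c_old, 4);       /* 1.57 */
    double cpi_new = weighted_cpi(f_new, c_new, 5);       /* 1.908 */
    /* New program executes 0.893x the instructions at the same cycle time. */
    double rel_time = 0.893 * cpi_new / cpi_old;          /* > 1 means slower */
    printf("CPI old %.3f, new %.3f, relative time %.3f\n",
           cpi_old, cpi_new, rel_time);                   /* 1.085: a bad idea */
    return 0;
}
```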

Page 45: Estimate CPU time by (Σ CPI_i * InstructionCount_i) * ClockCycleTime
Program: f = (a-b)/(c-d*e), compiled for a 25 MHz MIPS R2000.
Instructions (op dst, src1, src2):
lw   $14, 20($sp)
lw   $15, 16($sp)
subu $24, $14, $15
lw   $25, 8($sp)
lw   $8, 4($sp)
mul  $9, $25, $8
lw   $10, 12($sp)
subu $11, $10, $9
div  $12, $24, $11
sw   $12, 0($sp)
IC = InstructionCount = 10; CPI = clock cycles per instruction; CPI_i = clock cycles of instruction type i; I_i = number of instructions of type i in the program.
ClockCycleTime = 1/ClockRate = 1/(25*10^6) = 40*10^-9 sec = 40 ns.
CPI_i can be obtained from the processor handbook. Here we assume no cache misses.

Page 46: Estimate CPU time by ClockCycleTime * (Σ CPI_i * InstructionCount_i)

i   Instruction type   Count I_i   CPI_i   CPI_i * I_i
1   lw                 5           2       10
2   subu               2           1       2
3   mul                1           1       1
4   div                1           1       1
5   sw                 1           2       2

Total cycles = 16, so CPU time = 16 * 40 ns = 640 ns.

Page 47: Other Performance Measures
The only reliable measure of performance is the execution time of real programs. Other attempts:
1. MIPS = InstructionCount/(ExecutionTime * 10^6) = ClockRate/(CPI * 10^6). MIPS depends on the instruction set, is hard to compare across machines, and varies between programs on the same computer. Example 1: the impact of using floating-point hardware on MIPS. Example 2: the impact of optimizing-compiler usage on MIPS.
What affects performance? The input, the version of the program, the compiler, the OS, the CPU, the optimization level of the compiler, and the machine configuration: the amount of cache, main memory, and disk, and the speed of the cache, main memory, disks, and bus.

Page 48: Myth of MIPS
Example: the effect of an optimizing compiler on the MIPS number (page 45 of the text). A machine has a 500 MHz clock rate and the following clock cycles per instruction type. For one program, the instruction counts before and after using an optimizing compiler are shown in the table.

Instruction type   IC before optimization   CPI_i   IC after optimization
ALU ops            86                       1       43
Loads              42                       2       42
Stores             24                       2       24
Branches           48                       2       48

CPI_unoptimized = 86/200*1 + 42/200*2 + 24/200*2 + 48/200*2 = 1.57
MIPS_unoptimized = (500*10^6)/(1.57*10^6) = 318.5
CPI_optimized = 43/157*1 + 42/157*2 + 24/157*2 + 48/157*2 = 1.73
MIPS_optimized = (500*10^6)/(1.73*10^6) = 289.0
CPUTime_unoptimized = 200*1.57*(2*10^-9) = 6.28*10^-7 sec
CPUTime_optimized = 157*1.73*(2*10^-9) = 5.43*10^-7 sec
The optimized program runs faster, yet its MIPS rating is lower.
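A compact C check of this myth (helper names are mine): MIPS falls even though execution time improves, because the optimizer removes mostly 1-cycle ALU instructions and thereby raises the average CPI.

```c
#include <stdio.h>

/* Compare the MIPS rating vs. actual CPU time, before and after optimization. */
static void rate(const char *label, const int ic[], const int cpi[], int n,
                 double clock_hz) {
    long instrs = 0, cycles = 0;
    for (int i = 0; i < n; i++) {
        instrs += ic[i];
        cycles += (long)ic[i] * cpi[i];
    }
    double time_s = cycles / clock_hz;
    double mips   = instrs / (time_s * 1e6);
    printf("%s: CPI %.2f, MIPS %.1f, time %.3g s\n",
           label, (double)cycles / instrs, mips, time_s);
}

int main(void) {
    int cpi[]      = {1, 2, 2, 2};       /* ALU, load, store, branch */
    int ic_unopt[] = {86, 42, 24, 48};   /* 200 instructions */
    int ic_opt[]   = {43, 42, 24, 48};   /* 157 instructions */
    rate("unoptimized", ic_unopt, cpi, 4, 500e6);  /* MIPS 318.5, 6.28e-7 s */
    rate("optimized  ", ic_opt,   cpi, 4, 500e6);  /* MIPS ~290,  5.42e-7 s */
    return 0;
}
```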

Page 49: MFLOPS
For scientific computing, MFLOPS is used as a metric:
MFLOPS = NumberOfFloatingPointOperations/(ExecutionTime * 10^6)
It emphasizes operations instead of instructions. Unfortunately, the set of floating-point operations is not consistent across machines, and the rating changes with the mix of integer-floating and floating-floating instructions. The solution is to use a canonical number of floating-point operations for each type of FP operation, e.g., 1 for add, sub, compare, and mul; 4 for fdiv and fsqrt; 8 for arctan, sin, and exp.
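A sketch of normalized MFLOPS using the canonical operation weights just listed (the operation counts and run time below are invented for illustration):

```c
#include <stdio.h>

/* Normalized MFLOPS: weight each FP operation by its canonical cost.
   Weights from the slide: 1 for add/sub/compare/mul, 4 for div/sqrt,
   8 for arctan/sin/exp. The counts below are illustrative assumptions. */
int main(void) {
    long adds = 4000000, muls = 3000000, divs = 500000, sines = 100000;
    double exec_time_s = 2.0;                 /* assumed measured run time */
    double canonical = 1.0*adds + 1.0*muls + 4.0*divs + 8.0*sines;
    printf("normalized MFLOPS = %.2f\n",
           canonical / (exec_time_s * 1e6));  /* (4M+3M+2M+0.8M)/2 = 4.90 */
    return 0;
}
```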

Page 50: Programs to Evaluate Performance
- Real programs: the set of programs to be run forms the workload
- Kernels: key pieces of real programs that isolate features of a machine; Livermore Loops (weighted ops), Linpack
- Toy benchmarks: 10 to 100 lines of code, e.g., quicksort, Sieve, Puzzle
- Synthetic benchmarks: artificially created to match an average execution profile, e.g., Whetstone, Dhrystone
- SPEC (System Performance Evaluation Cooperative) benchmarks: 89, 92, 95
- Perfect Club benchmarks for parallel computations

Page 51: SPEC: System Performance Evaluation Cooperative
- First round, 1989: 10 programs yielding a single number ("SPECmarks")
- Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs). Compiler flags unlimited; March 93 DEC 4000 Model 610 examples:
  spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
  wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
  nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
- Third round, 1995: a new set of programs, SPECint95 (8 integer programs) and SPECfp95 (10 floating-point); "benchmarks useful for 3 years"; a single flag setting for all programs: SPECint_base95, SPECfp_base95

Page 52: Comparison of Machine Performance
Single program: compare execution time.
Collection of n programs:
1. Total execution time
2. Normalize to a reference machine and compute the time ratio of the ith program, TimeRatio_i = Time_i / Time_i(ReferenceMachine), then summarize with

\[
\text{arithmetic mean} = \frac{1}{n}\sum_{i=1}^{n} \text{TimeRatio}_i,
\qquad
\text{geometric mean} = \left(\prod_{i=1}^{n} \text{TimeRatio}_i\right)^{1/n},
\qquad
\text{harmonic mean} = \frac{n}{\sum_{i=1}^{n} 1/\text{TimeRatio}_i}
\]

The geometric mean is consistent: it is independent of the reference machine. The harmonic mean decreases the impact of outliers.

Page 53: Summarize Performance Results
Example: execution of two programs on three machines. Assume Program 1 has 10M floating-point operations and Program 2 has 50M floating-point operations.

                             Computer A        Computer B      Computer C
Program 1 (sec)              1                 10              20
Program 2 (sec)              100               50              20
Total time (sec)             101               60              40
Native MFLOPS, Program 1     10/1 = 10         10/10 = 1       10/20 = 0.5
Native MFLOPS, Program 2     50/100 = 0.5      50/50 = 1       50/20 = 2.5
Arithmetic mean (MFLOPS)     (10+0.5)/2 = 5.25 (1+1)/2 = 1     (0.5+2.5)/2 = 1.5
Geometric mean (MFLOPS)      sqrt(10*0.5) ≈ 2.24  sqrt(1*1) = 1  sqrt(0.5*2.5) ≈ 1.12

By total time, Computer C is fastest; by arithmetic-mean MFLOPS, Computer A looks best.
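A quick C sketch that reproduces the table's summary rows (the per-program times are as reconstructed above):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Page 53: times (sec) for two programs on computers A, B, C;
       Program 1 = 10 MFLOP, Program 2 = 50 MFLOP. */
    const char *name[] = {"A", "B", "C"};
    double t1[] = {1, 10, 20}, t2[] = {100, 50, 20};
    for (int m = 0; m < 3; m++) {
        double mf1 = 10.0 / t1[m];       /* native MFLOPS, program 1 */
        double mf2 = 50.0 / t2[m];       /* native MFLOPS, program 2 */
        printf("Computer %s: total %5.0f s, AM %.2f, GM %.2f MFLOPS\n",
               name[m], t1[m] + t2[m],
               (mf1 + mf2) / 2.0, sqrt(mf1 * mf2));
    }
    return 0;
}
```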

Page 54: Weighted Arithmetic Means
For a set of n programs, where program i takes Time_i on a machine, the "equal-time" weights on that machine are

\[
w_i = \frac{1}{\text{Time}_i \times \sum_{j=1}^{n} \frac{1}{\text{Time}_j}}
\]

Figure 1.12: W(3) and W(2) are the equal-time weights based on machine A and machine B, respectively. This is used in Exercise 1.11.
[Table: times of P1 and P2 on machines a, b, c, with weightings w(1), w(2), w(3) and the weighted arithmetic means AM:W(1), AM:W(2), AM:W(3).]
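A small C sketch of the equal-time weighting (illustrative times only, since the figure's values did not survive transcription):

```c
#include <stdio.h>

/* Equal-time weights: w_i = 1 / (T_i * sum_j(1/T_j)).
   Each weighted term w_i * T_i is then the same for every program. */
int main(void) {
    double t[] = {1.0, 1000.0};   /* illustrative times of P1, P2 on one machine */
    int n = 2;
    double s = 0.0;
    for (int j = 0; j < n; j++) s += 1.0 / t[j];
    double wam = 0.0;
    for (int i = 0; i < n; i++) {
        double w = 1.0 / (t[i] * s);
        wam += w * t[i];          /* weighted arithmetic mean of the times */
        printf("w%d = %.6f\n", i + 1, w);
    }
    printf("weighted AM = %.3f\n", wam);  /* equals n / sum_j(1/T_j) */
    return 0;
}
```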

Page 55: Hints for Homework #1
Exercise 1.7:
1. Whetstone consists of integer operations besides the floating-point operations.
2. When a floating-point processor is not used, all floating-point operations must be emulated by integer operations (e.g., shift, and, add, sub, multiply, div, ...).
3. For different FP coprocessors, we will have the same number of integer ops but a different number of FP ops.
Exercise 1.11:
a. Use the equal-time weighting formula on page 26.
b. DEC3000 execution time (ora) = VAX-11/780 time (ora) / DEC3000 SPECRatio = 7421/165

Page 56: FP Compilation
Results depend on the existence of an FP coprocessor (Exercise 1.7). Whetstone is a benchmark with both integer and floating-point (FP) operations.

Page 57: Compiling a Floating-Point Statement
Here are the generated assembly instructions for a floating-point statement in C on a DEC3100 (with the R2010 floating-point unit), using the command cc -S. Note that since the R2010 implements only simple floating-point add, sub, mult, and div operations, sqrt, exp, and alog are translated into subroutine calls using the jal instruction. The floating-point division is translated into div.d and is executed by the R2010.

# 7  x = sqrt(exp(alog(x)/t1));
s.d     $f4, 48($sp)    # load x to fp register f4
l.d     $f12, 56($sp)   # load t1 to fp register f12
jal     alog            # call subroutine alog
move    $16, $2
mtc1    $16, $f6
cvt.d.w $f8, $f6        # f8 contains alog(x)
l.d     $f10, 48($sp)
div.d   $f12, $f8, $f10
jal     exp
mov.d   $f20, $f0
mov.d   $f12, $f20
jal     sqrt
s.d     $f0, 56($sp)

Page 58: Homework #1
Problems 1.7 and 1.11.
Problem A. The program segment f = (a-b)/(a*b) is compiled into the following MIPS R2000 code.
Instructions (op dst, src1, src2):
lw   $14, 20($sp)   # a is allocated at M[sp+20]
lw   $15, 16($sp)   # b is allocated at M[sp+16]
subu $24, $14, $15
mul  $9, $14, $15
div  $12, $24, $9
sw   $12, 0($sp)    # f is allocated at M[sp+0]

Page 59: Homework #1 (continued)
Assume all the variables are already in the cache (i.e., the processor does not have to go to main memory for data) and that Table 1 contains the clock cycles for each type of instruction when the data is in the cache. What is the execution time (in seconds) of the above segment on an R2000 chip with a 25 MHz clock?
Problem B. Assume CPU operations account for 70% of the time in a system.
a) What is the overall speedup if we improve the CPU speed by 100%?
b) How much faster should the new CPU be in order to get an overall speedup of 1.7?
c) Is it possible to get an overall speedup of 3 by improving only the CPU? Why?