Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Performance Read Section 1.4, Section 1.7 (pp. 48-49) Adapted from Slides by Prof. Mary Jane Irwin, Penn State University And Slides Supplied by the.

Similar presentations


Presentation on theme: "1 Performance Read Section 1.4, Section 1.7 (pp. 48-49) Adapted from Slides by Prof. Mary Jane Irwin, Penn State University And Slides Supplied by the."— Presentation transcript:

1 1 Performance Read Section 1.4, Section 1.7 (pp ) Adapted from Slides by Prof. Mary Jane Irwin, Penn State University And Slides Supplied by the Publisher

2 2 Outline  What is computer architecture? l Classes of Computers  Performance l Performance Metrics l The performance equation l Measuring performance l Improving performance: parallelism, locality, Amdahl's law

3 3 What is Computer Architecture?  Old definition of computer architecture = Instruction set architecture, ISA

4 4 Instruction set architecture, ISA  Organization of Programmable Storage  Data Types & Data Structures: Encodings & Representations  Instruction Formats  Instruction Set  Modes of Addressing and Accessing Data Items and Instructions  How are “Exceptional Conditions” Handled

5 5 ISA vs. Computer Architecture  Architect’s job much more than instruction set design l technical hurdles today more challenging than those in instruction set design

6 6 Definition of Computer Architecture  The science and art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals.  Computer architecture >> ISA

7 7 Classes of Computers  Desktop computers l Designed to deliver good performance to a single user at low cost usually executing 3 rd party software, usually incorporating a graphics display, a keyboard, and a mouse  Servers l Used to run larger programs for multiple, simultaneous users typically accessed only via a network and that places a greater emphasis on dependability and security  Supercomputers l A high performance, high cost class of servers with hundreds to thousands of processors, terabytes (2 40 ≈10 12 ) of memory and petabytes ( 2 50 ≈10 15 ) of storage that are used for high-end scientific and engineering applications  Embedded computers (processors) l A computer inside another device used for running one predetermined application

8 CPE Dr. W. Abu-Sufah, UJ Review: Some Basic Definitions  Kilobyte – 2 10 or 1,024 bytes  Megabyte– 2 20 or 1,048,576 bytes l sometimes “rounded” to 10 6 or 1,000,000 bytes  Gigabyte – 2 30 or 1,073,741,824 bytes l sometimes rounded to 10 9 or 1,000,000,000 bytes  Terabyte – 2 40 or 1,099,511,627,776 bytes l sometimes rounded to or 1,000,000,000,000 bytes  Petabyte – 2 50 or 1024 terabytes l sometimes rounded to or 1,000,000,000,000,000 bytes  Exabyte – 2 60 or 1024 petabytes l Sometimes rounded to or 1,000,000,000,000,000,000 bytes

9 9 Outline  What is computer architecture?  Performance l Performance metrics l The performance equation l Measuring performance l Improving performance: parallelism, locality, Amdahl's law

10 10 Performance Metrics  Purchasing perspective l given a collection of computers, which has the -best performance ? -least cost ? -best cost/performance?  Design perspective l faced with design options, which has the -best performance improvement ? -least cost ? -best cost/performance?  Both require l basis for comparison l metric for evaluation

11 11 Response Time versus Throughput  Response time (execution time): the time between the start and the completion of a task l Important to individual users  Throughput (bandwidth) – the total amount of work done in a given time l Important to computer center managers  Will need different performance metrics as well as a different set of applications to benchmark embedded and desktop computers, which are more focused on response time, versus servers, which are more focused on throughput

12 12 Defining (Speed) Performance  To maximize performance, we need to minimize execution time Consider a computer X performance X = 1 / execution_time X If computer X is n times faster than computer Y, then performance X execution_time Y = = n performance Y execution_time X  Decreasing response time almost always improves throughput

13 13 Relative Performance Example  If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B? We know that A is n times faster than B if performance A execution_time B = = n performance B execution_time A = The performance ratio is So A is 1.5 times faster than B

14 14 Performance Factors  CPU execution time of a program (CPU time): time the CPU spends running the program l Does not include time waiting for Input or output (I/O) operations or running other programs or  Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program or both. or the number of clock cycles required for a program or both.

15 15 Review: Machine Clock Rate  Clock rate (clock cycles per second in MHz or GHz) is inverse of clock cycle time (clock period) CC = 1 / CR one clock period 10 nsec clock cycle => 100 MHz clock rate 5 nsec clock cycle => 200 MHz clock rate 2 nsec clock cycle => 500 MHz clock rate 1 nsec (10 -9 ) clock cycle => 1 GHz (10 9 ) clock rate 500 psec clock cycle => 2 GHz clock rate 250 psec clock cycle => 4 GHz clock rate 200 psec clock cycle => 5 GHz clock rate

16 16 Improving Performance Example  A program runs on computer A with a 2 GHz clock in 10 seconds.  What clock rate must computer B have to run this program in 6 seconds? l Note: Computer B will require 1.2 times as many clock cycles as computer A to run the program.

17 17 Improving Performance Example  A program runs on computer A with a 2 GHz clock in 10 seconds.  What clock rate must computer B have to run this program in 6 seconds? l Note: Computer B will require 1.2 times as many clock cycles as computer A to run the program. CPU time A CPU clock cycles A clock rate A = CPU clock cycles A = 10 sec x 2 x 10 9 cycles/sec = 20 x 10 9 cycles CPU time B 1.2 x 20 x 10 9 cycles clock rate B = clock rate B 1.2 x 20 x 10 9 cycles 6 seconds = = 4 GHz

18 18 Clock Cycles Per Instruction, CPI  Not all instructions take the same amount of time to execute l A program execution time is equals the number of instructions executed multiplied by the average time per instruction  Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute CPI for this instruction class ABC CPI123

19 19 Using the Performance Equation  Computers A and B implement the same ISA.  Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program  Computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program.  Which computer is faster and by how much?

20 20 Using the Performance Equation Both A and B execute the same number of instructions, I, so CPU time A = I x 2.0 x 250 ps = 500 x I ps CPU time B = I x 1.2 x 500 ps = 600 x I ps Clearly, A is faster … by the ratio of execution times performance A execution_time B 600 x I ps = = = 1.2 performance B execution_time A 500 x I ps  Computer A clock cycle time = 250 ps ; CPI A = 2.0  Computer B clock cycle time = 500 ps ; CPI B = 1.2  Which computer is faster and by how much? Clearly, A is faster … by the ratio of execution times

21 Dr. W. Abu-Sufah 21 CPI in More Detail If n different instruction types (classes) are executed by a program Each instruction of type i takes a different number of cycles to execute (CPI i ) Hence, the average # of clocks per instruction, CPI Relative frequency of instruction type i

22 22 Effective (Average) CPI  Hence, computing the overall effective CPI is done by looking at the different types (classes) of instructions executed by the program and their individual cycle counts and then averaging Overall effective CPI =  (CPI i x Freq i ) i = 1 n l Let n be the number of different instruction classes executed by a program l Freq i is the relative frequency {percentage} of class i instructions executed l CPI i is the average number of clock cycles per instruction for instruction class i Then we have:  The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

23 23 THE Performance Equation  Our basic performance equation is then CPU time = Instruction_count x CPI x clock_cycle Instruction_count x CPI clock_rate CPU time = or  These equations separate the three key factors that affect performance: Instruction_coun, CPI, and clock_rate l Can measure the CPU execution time by running the program l The clock rate is usually given l Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details l CPI varies by instruction type and ISA implementation for which we must know the implementation details

24 24 Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction _count CPIclock_cycl e Algorithm Programming language Compiler ISA Core organization Technology

25 25 Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction _count CPIclock_cycle Algorithm Programming language Compiler ISA Core organization Technology X XX XX XX X X X X X

26 26 A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time?  What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1. Load20%5 Store10%3 Branch20%2  =

27 27 A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time?  What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1 Load20%5 Store10%3 Branch20%2  = CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster

28 28 Before We Continue  Lecture notes are posted on the class web-site usually before the lecture.  Review them and try to get some understanding if possible before class.  Read assigned textbook sections  Homework1 has been posted since Friday, February 8. There are 10 problems to solve, several with many parts. l The homework is time consuming and can't be solved in one night. You should have started working on them by now. l Print your solution of the homework and bring it to class next Tuesday, February 19. l Do not cheat. Cheating will be pursued and it defeats the purpose of trying to solve the homework.

29 29 Workloads and Benchmarks  Benchmarks – a set of programs that form a “workload” specifically chosen to measure performance  SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks starting with SPEC89. The latest is SPEC CPU2006 which consists of 12 integer benchmarks (CINT2006) and 17 floating- point benchmarks (CFP2006).  There are also benchmark collections for power workloads (SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), …

30 30 SPEC CINT2006 on Barcelona (CC = 0.4 x 10 9 ) NameICx10 9 CPIExTimeRefTimeSPEC ratio perl21, , bzip22, , gcc1, , mcf ,3459, go1, , hmmer2, , sjeng2, , libquantum1, ,04720, h264avc3, , omnetpp , astar1, , xalancbmk1, ,1436, Geometric Mean11.7 High cache miss rates See Slide notes

31 31 Comparing and Summarizing Performance  How do we summarize the performance for a benchmark set (e.g. SPEC CINT2006 ) with a single number? l First the execution times are normalized giving the “SPEC ratio” (bigger is faster, i.e., SPEC ratio is the inverse of execution time) l The SPEC ratios are then “averaged” using the geometric mean (GM) GM = n  SPEC ratio i i = 1 n

32 32 Comparing and Summarizing Performance (cont.)  Guiding principle in reporting performance measurements is reproducibility  List everything another experimenter would need to duplicate the experiment l version of the operating system l compiler settings l input set used l specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)

33 33 Summary: Evaluating ISAs  Design-time metrics: l Can the ISA be implemented, in how long, at what cost? l Can the ISA be programmed? Ease of compilation?  Static metrics: l How many bytes does the program occupy in memory?  Dynamic (run time) metrics: l How many instructions are executed? How many bytes does the processor fetch to execute the program? l How many clocks are required per instruction? l How small a clock cycle is practical? Best Metric: Time to execute the program! CPI Inst. CountCycle Time depends on the instructions set, the processor organization, and compilation techniques.

34 34 Performance Summary  Performance depends on l Algorithm: affects IC, possibly CPI l Programming language: affects IC, CPI l Compiler: affects IC, CPI l Instruction set architecture: affects IC, CPI, T clock


Download ppt "1 Performance Read Section 1.4, Section 1.7 (pp. 48-49) Adapted from Slides by Prof. Mary Jane Irwin, Penn State University And Slides Supplied by the."

Similar presentations


Ads by Google