
1 Performance
Read Section 1.4, Section 1.7 (pp. 48-49)
Adapted from slides by Prof. Mary Jane Irwin, Penn State University, and slides supplied by the publisher

2 Outline
What is computer architecture?
Classes of computers
Performance
- Performance metrics
- The performance equation
- Measuring performance
- Improving performance: parallelism, locality, Amdahl's Law

3 What is Computer Architecture?
Old definition of computer architecture = Instruction set architecture, ISA

4 Instruction set architecture, ISA
Organization of programmable storage
Data types & data structures: encodings & representations
Instruction formats
Instruction set
Modes of addressing and accessing data items and instructions
How "exceptional conditions" are handled

5 ISA vs. Computer Architecture
The architect's job is much more than instruction set design: the technical hurdles today are more challenging than those in instruction set design.

6 Definition of Computer Architecture
The science and art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals. Computer architecture >> ISA

7 Classes of Computers
Desktop computers: designed to deliver good performance to a single user at low cost, usually executing 3rd-party software and usually incorporating a graphics display, a keyboard, and a mouse
Servers: used to run larger programs for multiple, simultaneous users; typically accessed only via a network, with a greater emphasis on dependability and security
Supercomputers: a high-performance, high-cost class of servers with hundreds to thousands of processors, terabytes (2^40 ≈ 10^12 bytes) of memory, and petabytes (2^50 ≈ 10^15 bytes) of storage, used for high-end scientific and engineering applications
Embedded computers (processors): a computer inside another device used for running one predetermined application

8 Review: Some Basic Definitions
Kilobyte – 2^10 or 1,024 bytes
Megabyte – 2^20 or 1,048,576 bytes; sometimes "rounded" to 10^6 or 1,000,000 bytes
Gigabyte – 2^30 or 1,073,741,824 bytes; sometimes rounded to 10^9 or 1,000,000,000 bytes
Terabyte – 2^40 or 1,099,511,627,776 bytes; sometimes rounded to 10^12 or 1,000,000,000,000 bytes
Petabyte – 2^50 or 1,024 terabytes; sometimes rounded to 10^15 or 1,000,000,000,000,000 bytes
Exabyte – 2^60 or 1,024 petabytes; sometimes rounded to 10^18 or 1,000,000,000,000,000,000 bytes
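To make the binary-versus-decimal gap concrete, here is a minimal Python sketch (not part of the original slides; the prefix list and loop are just illustrative):

```python
# Compare the binary (2^(10k)) and "rounded" decimal (10^(3k)) size of each prefix.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa"]

for k, name in enumerate(PREFIXES, start=1):
    binary = 2 ** (10 * k)     # e.g. kilobyte = 2^10 = 1,024 bytes
    decimal = 10 ** (3 * k)    # e.g. "rounded" kilobyte = 10^3 = 1,000 bytes
    gap = (binary - decimal) / decimal * 100
    print(f"{name}byte: {binary:,} vs {decimal:,} bytes ({gap:.1f}% larger)")
```

Note how the discrepancy grows with the prefix: about 2.4% for a kilobyte but over 15% for an exabyte.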

9 Outline
What is computer architecture?
Performance
- Performance metrics
- The performance equation
- Measuring performance
- Improving performance: parallelism, locality, Amdahl's Law

10 Performance Metrics
Purchasing perspective: given a collection of computers, which has the best performance? The least cost? The best cost/performance? (Or which is smallest/lightest, has the longest battery life, or is most reliable/durable, e.g., in space?)
Design perspective: faced with design options, which has the best performance improvement?
Both require a basis for comparison and a metric for evaluation.

11 Response Time versus Throughput
Response time (execution time): the time between the start and the completion of a task; important to individual users
Throughput (bandwidth): the total amount of work done in a given time; important to computer center managers
We will need different performance metrics, as well as different sets of applications to benchmark, for embedded and desktop computers (which focus more on response time) versus servers (which focus more on throughput).

12 Defining (Speed) Performance
To maximize performance, we need to minimize execution time.
For a computer X: performanceX = 1 / execution_timeX
If computer X is n times faster than computer Y, then
  performanceX / performanceY = execution_timeY / execution_timeX = n
Increasing performance requires decreasing execution time.
Decreasing response time almost always improves throughput.

13 Relative Performance Example
If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?
We know that A is n times faster than B if
  performanceA / performanceB = execution_timeB / execution_timeA = n
The performance ratio is 15 / 10 = 1.5, so A is 1.5 times faster than B.
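The same calculation as a minimal Python sketch (the function name relative_performance is an illustrative choice, not from the slides):

```python
def relative_performance(exec_time_a: float, exec_time_b: float) -> float:
    """Return n such that A is n times faster than B (ratio of execution times)."""
    return exec_time_b / exec_time_a

# Slide example: A runs the program in 10 s, B in 15 s.
n = relative_performance(10.0, 15.0)
print(f"A is {n:.1f} times faster than B")   # A is 1.5 times faster than B
```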

14 Performance Factors
CPU execution time of a program (CPU time): the time the CPU spends running the program.
It does not include time spent waiting for input/output (I/O) operations or time spent running other programs.
We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program, or both.

15 Review: Machine Clock Rate
Clock rate (clock cycles per second, in MHz or GHz) is the inverse of the clock cycle time (clock period), i.e., CC = 1 / CR:
  10 nsec clock cycle => 100 MHz clock rate
  5 nsec clock cycle => 200 MHz clock rate
  2 nsec clock cycle => 500 MHz clock rate
  1 nsec (10^-9 sec) clock cycle => 1 GHz (10^9 Hz) clock rate
  500 psec clock cycle => 2 GHz clock rate
  250 psec clock cycle => 4 GHz clock rate
  200 psec clock cycle => 5 GHz clock rate
A clock cycle is the basic unit of time to execute one operation/pipeline stage/etc.
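A quick sanity check of the CC = 1/CR relationship (a sketch, not from the slides):

```python
def clock_rate_ghz(cycle_time_seconds: float) -> float:
    """Clock rate in GHz given the clock cycle time in seconds (CR = 1 / CC)."""
    return 1.0 / cycle_time_seconds / 1e9

print(clock_rate_ghz(1e-9))     # 1 nsec cycle   -> 1.0 GHz
print(clock_rate_ghz(500e-12))  # 500 psec cycle -> 2.0 GHz
print(clock_rate_ghz(250e-12))  # 250 psec cycle -> 4.0 GHz
```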

16 Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B have to run this program in 6 seconds? Note: computer B will require 1.2 times as many clock cycles as computer A to run the program. (For lecture)

17 Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B have to run this program in 6 seconds? Note: computer B will require 1.2 times as many clock cycles as computer A to run the program. (For lecture)
  CPU timeA = CPU clock cyclesA / clock rateA
  CPU clock cyclesA = 10 sec x 2 x 10^9 cycles/sec = 20 x 10^9 cycles
  CPU timeB = 1.2 x 20 x 10^9 cycles / clock rateB
  6 seconds = 1.2 x 20 x 10^9 cycles / clock rateB
  clock rateB = 1.2 x 20 x 10^9 cycles / 6 seconds = 4 GHz
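The same arithmetic as a short Python sketch (variable names are illustrative, not from the slides):

```python
# Computer A: 10 s on a 2 GHz clock; B needs 1.2x as many cycles but must finish in 6 s.
time_a, rate_a = 10.0, 2e9
cycles_a = time_a * rate_a          # 20 x 10^9 cycles
cycles_b = 1.2 * cycles_a           # 24 x 10^9 cycles
rate_b = cycles_b / 6.0             # required clock rate for B
print(f"clock rate B = {rate_b / 1e9:.1f} GHz")   # 4.0 GHz
```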

18 Clock Cycles Per Instruction, CPI
Not all instructions take the same amount of time to execute.
A program's execution time equals the number of instructions executed multiplied by the average time per instruction.
Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute, e.g., for three instruction classes:
  Instruction class:   A  B  C
  CPI for this class:  1  2  3

19 Using the Performance Equation
Computers A and B implement the same ISA. Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program. Computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program. Which computer is faster and by how much? (For class handout)

20 Using the Performance Equation
Computer A: clock cycle time = 250 ps; CPIA = 2.0
Computer B: clock cycle time = 500 ps; CPIB = 1.2
Which computer is faster and by how much? (For lecture)
Both A and B execute the same number of instructions, I, so
  CPU timeA = I x 2.0 x 250 ps = 500 x I ps
  CPU timeB = I x 1.2 x 500 ps = 600 x I ps
Clearly, A is faster … by the ratio of execution times:
  performanceA / performanceB = execution_timeB / execution_timeA = (600 x I ps) / (500 x I ps) = 1.2
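A sketch of the same comparison in Python (the instruction count cancels in the ratio, so the value chosen below is arbitrary):

```python
# CPU time = IC x CPI x clock cycle time; compare A and B from slide 20.
ic = 1e9                        # arbitrary instruction count, cancels in the ratio
time_a = ic * 2.0 * 250e-12     # 0.5 s per 10^9 instructions
time_b = ic * 1.2 * 500e-12     # 0.6 s per 10^9 instructions
print(f"A is {time_b / time_a:.1f} times faster than B")   # 1.2
```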

21 CPI in More Detail
If n different instruction types (classes) are executed by a program, each instruction of type i takes a different number of cycles to execute (CPIi).
Hence, the average number of clocks per instruction, CPI, is a weighted average that uses the relative frequency of each instruction type i (see the next slide).

22 Effective (Average) CPI
Computing the overall effective CPI is done by looking at the different types (classes) of instructions executed by the program and their individual cycle counts, and then averaging.
Let n be the number of different instruction classes executed by a program,
Freqi be the relative frequency (percentage) of class i instructions executed, and
CPIi be the average number of clock cycles per instruction for instruction class i.
Then:
  Overall effective CPI = Σ (i = 1 to n) (CPIi x Freqi)
The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs.
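A minimal Python sketch of the weighted average (the function name is an illustrative choice; the example mix matches slide 26):

```python
def effective_cpi(mix):
    """Overall effective CPI = sum over all classes of (Freq_i x CPI_i).

    `mix` is a list of (relative_frequency, cpi) pairs whose frequencies sum to 1.
    """
    return sum(freq * cpi for freq, cpi in mix)

# ALU 50% @ 1 cycle, load 20% @ 5, store 10% @ 3, branch 20% @ 2 (as on slide 26)
print(effective_cpi([(0.5, 1), (0.2, 5), (0.1, 3), (0.2, 2)]))   # 2.2
```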

23 THE Performance Equation
Our basic performance equation is then
  CPU time = Instruction_count x CPI x clock_cycle
or
  CPU time = (Instruction_count x CPI) / clock_rate
These equations separate the three key factors that affect performance: Instruction_count, CPI, and clock_rate.
We can measure the CPU execution time by running the program.
The clock rate is usually given.
We can measure the overall instruction count by using profilers/simulators without knowing all of the implementation details.
CPI varies by instruction type and ISA implementation, for which we must know the implementation details.
Note that the instruction count is dynamic – i.e., it's not the number of lines in the code, but THE NUMBER OF INSTRUCTIONS EXECUTED.
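A sketch of the performance equation in Python (the numbers plugged in are illustrative, not from the slides):

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = (Instruction_count x CPI) / clock_rate."""
    return instruction_count * cpi / clock_rate_hz

# e.g. 10^9 executed instructions, effective CPI of 2.2, 2 GHz clock
print(cpu_time_seconds(1e9, 2.2, 2e9))   # 1.1 seconds
```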

24 Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                       Instruction_count   CPI   clock_cycle
Algorithm
Programming language
Compiler
ISA
Core organization
Technology

(For class handout)

25 Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                       Instruction_count   CPI   clock_cycle
Algorithm                      X            X
Programming language           X            X
Compiler                       X            X
ISA                            X            X         X
Core organization                           X         X
Technology                                            X

(For lecture)

26 A Simple Example

Op       Freq   CPIi   Freq x CPIi
ALU      50%     1         .
Load     20%     5         .
Store    10%     3         .
Branch   20%     2         .
                       Σ =

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once?

27 A Simple Example

Op       Freq   CPIi   Freq x CPIi   better cache   branch pred.   2 ALU at once
ALU      50%     1         .5             .5             .5             .25
Load     20%     5        1.0             .4            1.0            1.0
Store    10%     3         .3             .3             .3             .3
Branch   20%     2         .4             .4             .2             .4
                      Σ = 2.2        Σ = 1.6        Σ = 2.0        Σ = 1.95

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once? (For lecture)
  Better data cache: CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
  Branch prediction: CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
  Two ALU at once:   CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
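The three what-if scenarios as a Python sketch (IC and CC cancel when comparing machines, so only the effective CPI ratio matters; variable names are illustrative):

```python
# Effective CPI for the base machine and the three proposed improvements.
base     = 0.5*1   + 0.2*5 + 0.1*3 + 0.2*2   # 2.2
cache    = 0.5*1   + 0.2*2 + 0.1*3 + 0.2*2   # loads take 2 cycles  -> 1.6
branch   = 0.5*1   + 0.2*5 + 0.1*3 + 0.2*1   # branches take 1 cycle -> 2.0
dual_alu = 0.5*0.5 + 0.2*5 + 0.1*3 + 0.2*2   # two ALU ops per cycle -> 1.95

for name, cpi in [("better cache", cache), ("branch prediction", branch), ("2 ALU at once", dual_alu)]:
    print(f"{name}: {(base / cpi - 1) * 100:.1f}% faster")   # 37.5%, 10.0%, 12.8%
```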

28 Before We Continue
Lecture notes are posted on the class website, usually before the lecture. Review them and try to get some understanding, if possible, before class.
Read the assigned textbook sections.
Homework 1 has been posted since Friday, February 8. There are 10 problems to solve, several with many parts. The homework is time consuming and can't be solved in one night; you should have started working on it by now.
Print your solution to the homework and bring it to class next Tuesday, February 19.
Do not cheat. Cheating will be pursued, and it defeats the purpose of trying to solve the homework.

29 Workloads and Benchmarks
Benchmarks – a set of programs that form a "workload" specifically chosen to measure performance.
SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks, starting with SPEC89. The latest is SPEC CPU2006, which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).
There are also benchmark collections for power workloads (SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), …

30 SPEC CINT2006 on Barcelona (CC = 0.4 x 10^-9 sec)

Name            IC x 10^9    CPI    ExTime   RefTime   SPEC ratio
perl                2,118    0.75      637     9,770        15.3
bzip2               2,389    0.85      817     9,650        11.8
gcc                 1,050    1.72      724     8,050        11.1
mcf                   336   10.00    1,345     9,120         6.8
go                  1,658    1.09      721    10,490        14.6
hmmer               2,783    0.80      890     9,330        10.5
sjeng               2,176    0.96      837    12,100        14.5
libquantum          1,623    1.61    1,047    20,720        19.8
h264avc             3,102    0.80      993    22,130        22.3
omnetpp               587    2.94      690     6,250         9.1
astar               1,082    1.79      773     7,020         9.1
xalancbmk           1,058    2.70    1,143     6,900         6.0
Geometric Mean                                               11.7

Perl – interpreted string processing; Bzip2 – block-sorting compression; Gcc – GNU C compiler; Mcf – combinatorial optimization; Go – Go game (AI); Hmmer – search gene sequence; Sjeng – chess game (AI); Libquantum – quantum computer simulation; H264avc – video compression; Omnetpp – discrete event simulation library; Astar – games/path finding; Xalancbmk – XML parsing.
Note: the benchmarks with the largest CPIs (e.g., mcf) suffer from high cache miss rates.
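As a quick consistency check of one row (a sketch, not from the slides; the 0.4 ns clock cycle comes from the slide title):

```python
# perl row: ExTime = IC x CPI x CC and SPECratio = RefTime / ExTime
ic, cpi, cc = 2118e9, 0.75, 0.4e-9
ex_time = ic * cpi * cc
print(f"ExTime     = {ex_time:.0f} s")        # ~635 s (table shows 637; rounding)
print(f"SPEC ratio = {9770 / ex_time:.1f}")   # ~15.4 (table shows 15.3)
```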

31 Comparing and Summarizing Performance
How do we summarize the performance for a benchmark set (e.g., SPEC CINT2006) with a single number?
First the execution times are normalized, giving the "SPEC ratio" (bigger is faster, i.e., the SPEC ratio is the inverse of normalized execution time).
The SPEC ratios are then "averaged" using the geometric mean (GM):
  GM = (Π (i = 1 to n) SPEC ratioi)^(1/n)
i.e., the nth root of the product of the n SPEC ratios.
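A short Python sketch of the geometric mean, applied to the twelve CINT2006 SPEC ratios from the Barcelona table on slide 30 (math.prod needs Python 3.8+):

```python
import math

def geometric_mean(ratios):
    """nth root of the product of n SPEC ratios."""
    return math.prod(ratios) ** (1 / len(ratios))

ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]
print(f"GM = {geometric_mean(ratios):.1f}")   # 11.7, matching the table
```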

32 Comparing and Summarizing Performance (cont.)
The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment, e.g.,
- the version of the operating system
- compiler settings
- the input set used
- the specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)

33 Summary: Evaluating ISAs
Design-time metrics: Can the ISA be implemented, in how long, and at what cost? Can the ISA be programmed? Ease of compilation?
Static metrics: How many bytes does the program occupy in memory?
Dynamic (run-time) metrics: How many instructions are executed? How many bytes does the processor fetch to execute the program? How many clocks are required per instruction? How small a clock cycle is practical?
Best metric: time to execute the program!
  Time = Inst. count x CPI x cycle time,
which depends on the instruction set, the processor organization, and compilation techniques.

34 Performance Summary
Performance depends on:
- Algorithm: affects IC, possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, Tclock

