# Evaluating Performance

Evaluating Performance
When we say one computer has better performance than another, what do we mean? If you run a program on two computers, you'd say the faster one is the one that finishes the job first. If you were managing a computer center, you'd say the faster computer is the one that completes more jobs in a day. The first metric is called the response time or execution time: the time between the start and end of a task. The second metric is called the throughput: the total amount of work done in a given time. For the rest of the lecture we will discuss response time.

Relative Performance
To maximize performance we want to minimize response time, so for a computer X: $P_X = 1/ET_X$ (ET = execution time, P = performance). Computer X has greater performance than computer Y if:

$P_X > P_Y \Rightarrow 1/ET_X > 1/ET_Y \Rightarrow ET_Y > ET_X$

When discussing computers we will say "X is n times faster than Y", which means $P_X/P_Y = n$. If X is n times faster than Y, then the execution time on Y is n times longer than it is on X:

$P_X/P_Y = ET_Y/ET_X = n$

If machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds, how much faster is A than B? $ET_B/ET_A = 15/10 = 1.5$, so A is 1.5 times faster than B. What if machine C runs the program in 13 seconds? $15/13 = 1.15$, so C is faster than B by 15%, and $13/10 = 1.3$, so A is faster than C by 30%.
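The ratios above can be checked with a few lines of Python. This is only an illustrative sketch; the function names `performance` and `times_faster` are made up for this example and do not come from the lecture.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(et_slow_s: float, et_fast_s: float) -> float:
    """Return n such that the fast machine is n times faster than the slow one.

    By definition n = P_fast / P_slow = ET_slow / ET_fast.
    """
    return et_slow_s / et_fast_s


# Execution times from the slide, in seconds.
et_a, et_b, et_c = 10.0, 15.0, 13.0

print(f"A vs B: {times_faster(et_b, et_a):.2f}")  # 15/10
print(f"C vs B: {times_faster(et_b, et_c):.2f}")  # 15/13
print(f"A vs C: {times_faster(et_c, et_a):.2f}")  # 13/10
```

Running it prints 1.50, 1.15 and 1.30, matching the figures worked out by hand.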

Measuring Performance
Time is the measure of computer performance. Program execution time is measured in seconds per program. But which time do we measure? The response time, also called elapsed time or wall-clock time, is the total time to complete a task. This includes disk accesses, memory accesses, I/O, OS overhead, and waiting while other users' programs run: everything. The time the CPU spends working on our program is called CPU execution time, or simply CPU time. CPU time can be divided in two: time spent in the program itself, called user CPU time, and time spent in the OS performing tasks the program requested, called system CPU time.

CPU time = user CPU time + system CPU time

In this chapter we will use the term CPU performance to refer to CPU time.
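As a rough illustration of the difference between elapsed time and CPU time, the sketch below times a small task using Python's standard `time.perf_counter()` (wall clock) and `os.times()` (user and system CPU time of the current process). The helper names `measure` and `busy_then_sleep` are invented for this example, and the exact numbers depend on the machine and on the resolution of the OS timers.

```python
import os
import time


def measure(task):
    """Run task() once and return (elapsed, user_cpu, system_cpu) in seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    task()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return (
        wall_end - wall_start,              # elapsed (wall-clock) time
        cpu_end.user - cpu_start.user,      # user CPU time
        cpu_end.system - cpu_start.system,  # system CPU time
    )


def busy_then_sleep():
    sum(i * i for i in range(200_000))  # computation: counts as user CPU time
    time.sleep(0.05)                    # sleeping: adds elapsed time, almost no CPU time


elapsed, user, system = measure(busy_then_sleep)
print(f"elapsed  : {elapsed:.3f} s")
print(f"user CPU : {user:.3f} s")
print(f"sys CPU  : {system:.3f} s")
print(f"CPU time : {user + system:.3f} s")
```

The elapsed time should come out noticeably larger than the CPU time, because the sleep counts toward the former but not the latter.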

Clock Cycles
All computers are built using a clock that runs at a constant rate and determines when events take place in the hardware. These time intervals are called clock cycles. The clock period is the time it takes to complete one clock cycle (e.g. 2 ns, 3 ns), and the clock rate is the inverse of the clock period (500 MHz, 333 MHz); that is, the clock rate is the number of clock cycles per second. Thus we can write CPU execution time as:

CPU execution time = CPU clock cycles × clock cycle time

The formula makes it clear that to improve performance we must reduce either the number of cycles or the cycle time. But reducing one may increase the other. Computer A runs a program in 4M cycles; computer B runs the same program in 3.5M cycles. Which is faster? We can't know without knowing the clock period of each computer.
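A small sketch of these relations follows. The cycle counts are the ones from the slide; the clock periods tried for computer B (and the 2 ns assumed for A) are hypothetical values chosen only to show that the cycle count alone does not decide the winner. The function names are invented for this example.

```python
NS = 1e-9  # one nanosecond, in seconds


def clock_rate_hz(clock_period_s: float) -> float:
    """Clock rate is the inverse of the clock period."""
    return 1.0 / clock_period_s


def cpu_time_s(clock_cycles: float, clock_period_s: float) -> float:
    """CPU execution time = CPU clock cycles * clock cycle time."""
    return clock_cycles * clock_period_s


def faster_machine(time_a_s: float, time_b_s: float) -> str:
    """Name the machine with the shorter execution time."""
    return "A" if time_a_s < time_b_s else "B"


for period_ns in (2, 3):
    rate_mhz = clock_rate_hz(period_ns * NS) / 1e6
    print(f"{period_ns} ns period -> {rate_mhz:.0f} MHz")

cycles_a, cycles_b = 4e6, 3.5e6
time_a = cpu_time_s(cycles_a, 2.0 * NS)  # assumed period for A

for period_b_ns in (2.0, 2.5):  # hypothetical periods for B
    time_b = cpu_time_s(cycles_b, period_b_ns * NS)
    print(f"B at {period_b_ns} ns: {time_b * 1e3:.2f} ms vs A {time_a * 1e3:.2f} ms "
          f"-> {faster_machine(time_a, time_b)} is faster")
```

With a 2.0 ns period B wins, but with a 2.5 ns period A wins even though B needs fewer cycles.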

Cycles Per Instruction (CPI)
If the clock cycle for computer A is 2 ns and for B 2.2 ns, we can compare the computers:

$ET_A = 4\text{M} \times 2\,\text{ns} = 8\,\text{ms}$

$ET_B = 3.5\text{M} \times 2.2\,\text{ns} = 7.7\,\text{ms}$

$8/7.7 = 1.039$, thus computer B is 4% faster than computer A.

We can write the number of clock cycles as the number of instructions executed times the average number of clock Cycles Per Instruction, in other words: instruction count × CPI. Thus execution time can be expressed as:

Execution time = instruction count × CPI × cycle time

Which computer is faster and by how much: A, which has an instruction count of 10M and a CPI of 3.5, or B, which has an instruction count of 8.8M and a CPI of 4.1? The cycle time of B is 10% longer than the cycle time of A.

$ET_A = 10\text{M} \times 3.5 \times 1.0 = 35\text{M}$ (in units of A's clock cycle)

$ET_B = 8.8\text{M} \times 4.1 \times 1.1 = 39.7\text{M}$ (in units of A's clock cycle)

$39.7/35 = 1.13$, thus A is 13% faster than B.
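The second comparison can be reproduced with a short script. This is a sketch only: the `Machine` class is made up for this example, and cycle time is expressed in units of A's clock cycle, as in the slide.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    instruction_count: float
    cpi: float
    cycle_time: float  # relative units: A's cycle time = 1.0

    def execution_time(self) -> float:
        """Execution time = instruction count * CPI * cycle time."""
        return self.instruction_count * self.cpi * self.cycle_time


a = Machine("A", instruction_count=10e6, cpi=3.5, cycle_time=1.0)
b = Machine("B", instruction_count=8.8e6, cpi=4.1, cycle_time=1.1)

ratio = b.execution_time() / a.execution_time()
print(f"ET_A = {a.execution_time() / 1e6:.1f}M cycles of A")
print(f"ET_B = {b.execution_time() / 1e6:.1f}M cycles of A")
print(f"A is {ratio:.2f} times faster than B")
```

B executes fewer instructions, but its higher CPI and longer cycle time more than cancel that advantage.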

Computing the CPI
As mentioned, the CPI is the number of clock cycles divided by the instruction count. Another way to compute the CPI is to sum the products of the frequency of each instruction type and the number of cycles that instruction type takes:

$\text{CPI} = \sum_i (\text{Freq}_i \times \text{Cycles}_i)$

For instance, suppose that on machine A the instruction types, frequencies and numbers of cycles are:

| Type | Frequency | # of Cycles |
|---|---|---|
| Memory access | 23% | 2.3 (how is this possible?) |
| R-type | 50% | 2 |
| Branches and jumps | 10% | 1 |
| FP operations | 17% | 3 |

The CPI is then:

$0.23 \times 2.3 + 0.50 \times 2 + 0.10 \times 1 + 0.17 \times 3 = 2.14$
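The weighted sum can be computed as in the sketch below, using the instruction mix from the slide. The function name `average_cpi` and the tuple layout are choices made for this example.

```python
# (instruction type, frequency, cycles per instruction), from the slide's example
INSTRUCTION_MIX = [
    ("Memory access", 0.23, 2.3),
    ("R-type", 0.50, 2.0),
    ("Branches and jumps", 0.10, 1.0),
    ("FP operations", 0.17, 3.0),
]


def average_cpi(mix) -> float:
    """CPI = sum over instruction types of frequency * cycles."""
    total_frequency = sum(freq for _, freq, _ in mix)
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError(f"frequencies sum to {total_frequency}, expected 1.0")
    return sum(freq * cycles for _, freq, cycles in mix)


for name, freq, cycles in INSTRUCTION_MIX:
    print(f"{name:<20} {freq:>5.0%} x {cycles:.1f} = {freq * cycles:.3f}")
print(f"CPI = {average_cpi(INSTRUCTION_MIX):.2f}")
```

The per-type contributions are 0.529, 1.000, 0.100 and 0.510, which sum to a CPI of about 2.14.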

Benchmarks
What program do we use to evaluate a computer? If we are a user and want to decide which computer to buy, the answer is simple: run the program(s) you usually run on your computer. This is called a workload. If you are a computer architect the decision is harder: you have to run programs that reflect the needs of all users. These programs are called benchmarks. Hopefully the benchmarks will predict how an average user workload (if there is such a creature) will behave. The most common suite of benchmarks that measure CPU performance is the SPEC CPU95 suite, or SPEC benchmarks. It consists of common integer applications (gcc, perl, lisp, compress, go, …) and FP applications (scientific programs). The higher a computer's SPECmark, the faster it is. SPEC CPU2000 added multimedia applications to reflect the common workloads used today.

Compiler Effects
An optimizing compiler can greatly affect the execution time of a program.

Pentium vs. Pentium Pro
The Pentium Pro (on which the Pentium II and III are based) is roughly 1.5 times faster than the Pentium for integer applications and almost twice as fast for FP applications.

Amdahl's Law
Gene Amdahl of IBM stated that if you enhance one part of the computer, the overall improvement is limited by the fraction of time spent in that part. For instance, if in the CPI example given earlier you manage to improve the FP instructions from 3 cycles to 1 cycle, your $Speedup_{enhanced}$ (SE) is 3/1 = 3. The $Frequency_{enhanced}$ (FE) is the fraction of execution time spent in the improved part of the computer. In our case it is taken to be 0.17. Amdahl's law says that the new execution time is:

$ET_{new} = (FE/SE) \times ET_{old} + (1 - FE) \times ET_{old} = (FE/SE + (1 - FE)) \times ET_{old}$

So for the previous example the overall speedup is:

$ET_{new} = (0.17/3 + 0.83) \times ET_{old} \approx 0.89 \times ET_{old}$, so $ET_{old}/ET_{new} = 1/0.887 \approx 1.13$

Thus the program as a whole is only 13% faster, although FP instructions became 3 times faster. (Strictly, 0.17 is the fraction of instructions that are FP, not the fraction of time; weighting by cycle counts, FP accounts for 0.51/2.14 ≈ 24% of execution time, which gives an overall speedup of 2.14/1.80 ≈ 1.19.) This leads to the statement: Make the common case fast.
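Amdahl's law is easy to experiment with in code. The sketch below uses the slide's values (FE = 0.17, SE = 3); the function name `amdahl_speedup` is invented for this example, and the very large SE in the last call is just a stand-in for "infinitely fast FP".

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = ET_old / ET_new = 1 / (FE/SE + (1 - FE))."""
    new_time_ratio = fraction_enhanced / speedup_enhanced + (1.0 - fraction_enhanced)
    return 1.0 / new_time_ratio


# The slide's example: FP made 3 times faster, with FE taken as 0.17.
print(f"SE = 3    : {amdahl_speedup(0.17, 3):.2f}")

# Even if FP took essentially no time at all, the other 83% is untouched.
print(f"SE -> inf : {amdahl_speedup(0.17, 1e9):.2f}")
```

The first line prints 1.13 and the second about 1.20: no matter how much the FP instructions improve, the overall speedup cannot exceed 1/(1 − FE), which is why the common case is the one worth making fast.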