Performance: Computer Architecture – CS401, Erkay Savas, Sabanci University (5/26/2016)


1 Performance: Computer Architecture – CS401, Erkay Savas, Sabanci University

2 Performance
What is performance?
How to measure performance?
Performance metrics
Performance evaluation
Why does some hardware perform better than other hardware for different programs?
What factors in hardware are related to performance?
How does the machine's instruction set affect performance?

3 Airplane Analogy
Which of these airplanes has the best performance?

Airplane           Passenger capacity   Range (miles)   Speed (m.p.h.)   Passenger throughput (passengers × m.p.h.)
Boeing 777         375                  4630            610              228750
Boeing 747         470                  4150            610              286700
Concorde           132                  4000            1350             178200
Douglas DC-8-50    146                  8720            544              79424
Airbus A3xx        656                  8400            600              393600

4 Computer Performance
Response time (latency)
– How long does it take for my job to run?
– How long does it take to execute a program?
– How long must I wait for a database query?
Throughput
– How many jobs can the machine run at once?
– What is the average execution rate?
– How much work is getting done?
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine, what do we increase?

5 Which Time to Measure?
Elapsed time (wall clock time, response time)
– Counts everything (disk and memory access, I/O, operating system overhead, work on other processes)
– Useful, but not always good for comparison purposes
CPU (execution) time
– The time the CPU spends computing for the user task
– Does not include time spent waiting for I/O or running other programs
– User CPU time: CPU time spent within the program
– System CPU time: CPU time spent in the operating system performing tasks on behalf of the program

6 CPU Time
The Unix time command reflects this breakdown by returning output such as:
90.7u 12.9s 2:39 65%
Interpretation:
– User CPU time is 90.7 s
– System CPU time is 12.9 s
– Elapsed time is 2 min 39 s = 159 s (≠ 90.7 + 12.9 = 103.6 s)
– CPU time is 65% of the total elapsed time (103.6/159)
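The same breakdown can be reproduced from inside a program. The following is a minimal sketch, not the implementation of the time command: it assumes Python's standard os.times() and time.perf_counter(), and the helper measure() and the workload sample_task() are hypothetical names introduced only for illustration.

```python
import os
import time


def measure(task):
    """Run task() and return (user_cpu, system_cpu, elapsed) in seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    task()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    user = cpu_after.user - cpu_before.user
    system = cpu_after.system - cpu_before.system
    return user, system, wall_after - wall_before


def sample_task():
    # Hypothetical workload: some computation plus a sleep standing in for I/O wait.
    sum(i * i for i in range(200_000))
    time.sleep(0.1)


user, system, elapsed = measure(sample_task)
cpu_share = (user + system) / elapsed
print(f"{user:.2f}u {system:.2f}s {elapsed:.2f}s elapsed, CPU share {cpu_share:.0%}")

# The slide's figures: (90.7 + 12.9) CPU seconds out of 2 min 39 s elapsed.
slide_share = (90.7 + 12.9) / (2 * 60 + 39)
print(f"slide: {slide_share:.0%}")
```

Because the sleep contributes to elapsed time but not to CPU time, the measured CPU share of this task stays well below 100%, just as in the slide's example.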

7 A Definition of Performance
For some program running on machine X:
Performance_X = 1 / Execution_time_X
Machine X is said to be "n times faster" than machine Y if
Performance_X / Performance_Y = n
Execution_time_Y / Execution_time_X = n
Example: Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. How much faster is A than B?
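The example follows directly from the definition. A minimal sketch (the function name speedup is an assumption made for illustration):

```python
def speedup(time_x, time_y):
    """How many times faster machine X is than machine Y: n = time_Y / time_X."""
    return time_y / time_x


# Machine A takes 10 s, machine B takes 15 s for the same program.
n = speedup(10.0, 15.0)
print(f"A is {n} times faster than B")
```

With these inputs n = 15/10 = 1.5, so A is 1.5 times faster than B.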

8 Metrics of Performance
"Time to execute a program" is the ultimate metric in determining performance.
However, it is convenient to inspect other metrics as well when we examine the details of a machine.
Computers use a clock that runs at a constant rate and determines when an event takes place in hardware.
These discrete time intervals are called clock cycles (or ticks, clock ticks, clock periods).
Clock rate (frequency) is the inverse of the clock period.

9 Clock Cycles
Clock "ticks" indicate when to start activities.
Instead of reporting execution time in seconds, we often use cycles.
Events often start on the rising edge of the clock.

10 Clock Cycle
Cycle time (CT) = time between ticks = seconds per cycle
Cycle count (CC): the number of clock cycles needed to execute a program
Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 200 MHz clock has a 1/(200 × 10^6) s = ? nanosecond cycle time
A 4 GHz clock has a 1/(4 × 10^9) s = ? nanosecond cycle time
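The two questions can be checked with a one-line conversion. A minimal sketch (cycle_time_ns is a name chosen here for illustration):

```python
def cycle_time_ns(clock_rate_hz):
    """Cycle time in nanoseconds: the inverse of the clock rate, scaled by 10^9."""
    return 1e9 / clock_rate_hz


ct_200mhz = cycle_time_ns(200e6)  # 200 MHz clock
ct_4ghz = cycle_time_ns(4e9)      # 4 GHz clock
print(ct_200mhz, ct_4ghz)
```

A 200 MHz clock therefore has a 5 ns cycle time, and a 4 GHz clock a 0.25 ns cycle time.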

11 CPI
CPI: Clocks Per Instruction
– Number of cycles spent on an instruction on average
– CC = IC × CPI
– Hard to compute
– Useful when comparing the performance of two machines with the same ISA. (Why?)
Example: two machines with the same ISA. For a certain program we have
– Machine A: CPI = 2.0
– Machine B: CPI = 1.2
– Which machine is faster?
– What if machine A uses a 250 ps cycle time and machine B a 500 ps cycle time?
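Since both machines share an ISA, IC is the same on both, so it is enough to compare the average time per instruction, CPI × cycle time. A minimal sketch under that assumption (the function name is hypothetical):

```python
def avg_time_per_instruction_ps(cpi, cycle_time_ps):
    """Average time per instruction in picoseconds: CPI x cycle time."""
    return cpi * cycle_time_ps


time_a = avg_time_per_instruction_ps(2.0, 250)  # machine A
time_b = avg_time_per_instruction_ps(1.2, 500)  # machine B
ratio = time_b / time_a                         # how many times faster A is than B
print(f"A: {time_a:.0f} ps/instr, B: {time_b:.0f} ps/instr, ratio {ratio:.2f}")
```

With equal cycle times the lower CPI of machine B wins, but with the cycle times given, A needs about 500 ps per instruction against about 600 ps for B, so A is about 1.2 times faster.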

12 Improving Performance
So, to improve performance:
1. Increase the clock frequency (i.e., decrease the clock period)
2. Reduce the number of clock cycles per program (IC × CPI)

13 Instruction = Cycle? No!
The number of cycles per instruction depends on the implementation of the instructions in hardware.
The number differs for each processor (even with the same ISA).

14 The Reason
Operations take different numbers of cycles
– Multiplication takes longer than addition
– Floating-point operations take longer than integer operations
– The access time to a register is much shorter than the access time to main memory

15 Simple Formulae for CPU Time
CPU execution time = CPU clock cycles for a program × Clock cycle time (CC × CT)
CPU execution time = CPU clock cycles for a program / Clock rate
We can write
CPU clock cycles for a program = IC × CPI
Then
CPU execution time = (IC × CPI) / Clock rate

16 Example
Computer A has an 800 MHz clock
– It runs our favorite program in 15 s
Our goal
– Design computer B with the same ISA
– It will run the same program in 8 s
We will use a new technology
– It can increase the clock rate
– However, it will also increase CPI by a factor of 1.25
What clock rate should we aim to use?
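One way to work the example is to apply CPU time = (IC × CPI) / Clock rate twice. The sketch below assumes that "increase CPI by 1.25" means CPI is multiplied by 1.25 and that IC is unchanged because the ISA is the same; the variable names are illustrative only.

```python
clock_a = 800e6    # Hz, computer A
time_a = 15.0      # s, program on A
time_b = 8.0       # s, target for B
cpi_factor = 1.25  # assumed: CPI_B = 1.25 x CPI_A, same instruction count

cycles_a = time_a * clock_a       # IC x CPI_A
cycles_b = cpi_factor * cycles_a  # IC x CPI_B
clock_b = cycles_b / time_b       # clock rate needed to finish in 8 s
print(f"Required clock rate: {clock_b / 1e9:.3f} GHz")
```

Under these assumptions computer B needs 15 × 10^9 cycles in 8 s, i.e. a clock rate of about 1.875 GHz.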

17 Performance
Performance is determined by execution time (CPU time).
We also have other indicators
– # of cycles to execute the program
– # of instructions in the program (IC)
– # of cycles per second
– Average # of cycles per instruction (CPI)
– Average # of instructions per second
Common pitfall: thinking one of the above is indicative of performance when it really isn't.

18 Number of Instructions Example
A compiler designer has the following two alternatives to generate a certain piece of code with instructions A (1 cycle), B (2 cycles), and C (3 cycles):
1. 2 × 10^6 of A, 10^6 of B, and 2 × 10^6 of C (IC = 5 × 10^6)
2. 4 × 10^6 of A, 10^6 of B, and 10^6 of C (IC = 6 × 10^6)
– Which code sequence is faster?
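Counting cycles rather than instructions answers the question. A minimal sketch (the dictionary layout and the function summarize are assumptions made for illustration):

```python
# Cycles per instruction class, as given on the slide.
CYCLES = {"A": 1, "B": 2, "C": 3}


def summarize(counts):
    """Return (instruction count, cycle count, average CPI) for an instruction mix."""
    ic = sum(counts.values())
    cc = sum(n * CYCLES[cls] for cls, n in counts.items())
    return ic, cc, cc / ic


seq1 = {"A": 2_000_000, "B": 1_000_000, "C": 2_000_000}
seq2 = {"A": 4_000_000, "B": 1_000_000, "C": 1_000_000}

for name, seq in (("sequence 1", seq1), ("sequence 2", seq2)):
    ic, cc, cpi = summarize(seq)
    print(f"{name}: IC = {ic}, cycles = {cc}, CPI = {cpi}")
```

Sequence 2 executes more instructions (6 × 10^6 against 5 × 10^6) but needs fewer cycles (9 × 10^6 against 10 × 10^6), so on the same clock it is the faster one.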

19 MIPS
Millions of Instructions Per Second
MIPS = IC / (Execution_time × 10^6)
MIPS = IC / (CC × cycle time × 10^6)
MIPS = (IC × clock rate) / (IC × CPI × 10^6)
MIPS = clock rate / (CPI × 10^6)
A faster machine has a higher MIPS rating.
Execution_time = IC / (MIPS × 10^6)

20 A MIPS Example
A computer with a 500 MHz clock
– Three different classes of instructions: A (1 cycle), B (2 cycles), C (3 cycles)
Two compilers are used to produce code for a large piece of software
– Compiler 1: 5 billion A, 1 billion B, and 1 billion C instructions
– Compiler 2: 10 billion A, 1 billion B, and 1 billion C instructions
Which sequence will be faster according to execution time?
Which sequence will be faster according to MIPS?
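Both questions can be answered by computing execution time and MIPS for each compiler's output. A minimal sketch using the slide's numbers (the function evaluate and the data layout are illustrative assumptions):

```python
CLOCK_RATE = 500e6                 # Hz
CYCLES = {"A": 1, "B": 2, "C": 3}  # cycles per instruction class


def evaluate(counts):
    """Return (execution time in s, MIPS) for an instruction mix."""
    ic = sum(counts.values())
    cc = sum(n * CYCLES[cls] for cls, n in counts.items())
    exec_time = cc / CLOCK_RATE
    mips = ic / (exec_time * 1e6)
    return exec_time, mips


compiler1 = {"A": 5_000_000_000, "B": 1_000_000_000, "C": 1_000_000_000}
compiler2 = {"A": 10_000_000_000, "B": 1_000_000_000, "C": 1_000_000_000}

time1, mips1 = evaluate(compiler1)
time2, mips2 = evaluate(compiler2)
print(f"compiler 1: {time1:.0f} s, {mips1:.0f} MIPS")
print(f"compiler 2: {time2:.0f} s, {mips2:.0f} MIPS")
```

Compiler 1's code runs in 20 s at 350 MIPS, while compiler 2's code runs in 30 s at 400 MIPS: the code with the higher MIPS rating is the slower one.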

21 Problems of MIPS
MIPS specifies the instruction execution rate.
MIPS does not take into account the capabilities of the instructions
– Thus, computers with different ISAs cannot be compared meaningfully using MIPS
MIPS is not constant, even on a single machine; it depends on the application.
As we saw in the previous example, MIPS can vary inversely with performance.

22 CPI Example
CPI
– Machine A: CPI = 10/7 = 1.43
– Machine B: CPI = 15/12 = 1.25
CPU time
– CPU time = (IC × CPI) / clock rate
– Let us assume both machines use a 200 MHz clock

23 Overview
A given program will require
1. Some number of instructions
2. Some number of clock cycles
3. Some number of seconds
Vocabulary
– Cycle time: (micro- or nano-) seconds per cycle
– Clock rate (frequency): cycles per second
– CPI: clock cycles per instruction
– MIPS: millions of instructions per second
– MFLOPS: millions of floating-point operations per second

24 Performance
Performance is ultimately determined by execution time.
Is any of the following metrics good for measuring performance by itself? Why?
– # of cycles to execute a program
– # of instructions in a program
– # of cycles per second
– Average # of cycles per instruction
– Average # of instructions per second

25 Question
Assuming two machines have the same ISA, which of the following quantities are identical?
– Clock rate
– CPI
– Execution time
– # of instructions
– MIPS

26 Program Performance

HW or SW component      Affects what?          How?
Algorithm               IC, possibly CPI
Programming language    IC, CPI
Compiler                IC, CPI
ISA                     IC, clock rate, CPI

27 Benchmarks
Programs specifically chosen to measure performance
– Must reflect the typical workload of the user
Benchmark types
– Real applications
– Small benchmarks
– Benchmark suites
– Synthetic benchmarks

28 Real Applications
Workload: the set of programs a typical user runs day in and day out.
Using these real applications as a metric is a direct way of comparing the execution time of the workload on two machines.
Using real applications as a metric has certain restrictions:
– They are usually big
– They take time to port to different machines
– They take considerable time to execute
– It is hard to observe the outcome of a particular improvement technique

29 Comparing & Summarizing Performance

               Computer A   Computer B
Program 1      1 s          100 s
Program 2      1000 s       100 s
Total time     1001 s       200 s

A is 100 times faster than B for program 1.
B is 10 times faster than A for program 2.
For total performance, the arithmetic mean of the execution times is used.

30 Arithmetic Mean
If the programs in the workload do not run equally often, we have to use the weighted arithmetic mean.

                       Weight   Computer A   Computer B
Program 1 (seconds)    10       1            100
Program 2 (seconds)    1        1000         100
Weighted AM            -        ?            ?

Suppose that program 1 runs 10 times as often as program 2. Which machine is faster?
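The weighted arithmetic mean for the table can be computed as follows. This is a minimal sketch that normalizes by the sum of the weights (10 + 1 = 11); the function name weighted_mean is an assumption for illustration.

```python
def weighted_mean(times, weights):
    """Weighted arithmetic mean of execution times, normalized by the total weight."""
    return sum(t * w for t, w in zip(times, weights)) / sum(weights)


weights = [10, 1]  # program 1 runs 10 times as often as program 2
wam_a = weighted_mean([1, 1000], weights)   # computer A
wam_b = weighted_mean([100, 100], weights)  # computer B
print(f"A: {wam_a:.1f} s, B: {wam_b:.1f} s")
```

With these weights computer A averages 1010/11 ≈ 91.8 s against 100 s for computer B, so A comes out ahead even though B has the lower unweighted total (200 s against 1001 s).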

31 Small Benchmarks
Small code segments which are common in many applications
– For example, loops with a certain instruction mix
– for (j = 0; j < 8; j++) S = S + A_j × B_(i-j)
Good for architects and designers
– Since small code segments are easy to compile and simulate, even by hand, designers use these kinds of benchmarks while working on a novel machine
– Can be abused by compiler designers who introduce special-purpose optimizations targeted at a specific benchmark

32 Benchmark Suites
SPEC (Standard Performance Evaluation Corporation)
– Non-profit organization that aims to produce "fair, impartial and meaningful benchmarks for computers"
– Began in 1989 with SPEC89 (CPU intensive)
– Companies agreed on a set of real programs and inputs which they hope best reflect a typical user's workload
– Valuable indicator of performance
– Can still be abused
– Updates are required as the applications and their workloads change over time

33 SPEC Benchmark Sets
CPU performance (SPEC CPU2006)
Graphics (SPECviewperf)
High-performance computing (HPC2002, MPI2007, OMP2001)
Java server applications (jAppServer2004)
– A multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers
Mail systems (MAIL2001, SPECimap2003)
Network file systems (SFS97_R1 (3.0))
Web servers (SPEC WEB99, SPEC WEB99 SSL)
More information: http://www.spec.org/

34 SPECint Integer Benchmarks

Name              Description
400.perlbench     Programming Language
401.bzip2         Compression
403.gcc           C Compiler
429.mcf           Combinatorial Optimization
445.gobmk         Artificial Intelligence
456.hmmer         Search Gene Sequence
458.sjeng         Artificial Intelligence
462.libquantum    Physics / Quantum Computing
464.h264ref       Video Compression
471.omnetpp       Discrete Event Simulation
473.astar         Path-finding Algorithms
483.xalancbmk     XML Processing

35 SPECfp Floating Point Benchmarks

Name        Type
wupwise     Quantum chromodynamics
swim        Shallow water model
mgrid       Multigrid solver in 3D potential field
applu       Parabolic/elliptic partial differential equation
mesa        Three-dimensional graphics library
galgel      Computational fluid dynamics
art         Image recognition using neural nets
equake      Seismic wave propagation simulation
facerec     Image recognition of faces
ammp        Computational chemistry
lucas       Primality testing
fma3d       Crash simulation
sixtrack    High-energy nuclear physics acceleration design
apsi        Meteorology; pollutant distribution

36 SPEC CPU2006 – Summarizing
SPEC ratio: execution time measurements are normalized by dividing the execution time on a reference machine by the measured execution time.
Example:
– Sun Microsystems Fire V20z, which has an AMD Opteron 252 CPU running at 2600 MHz
– The 164.gzip benchmark executes in 90.4 s
– The reference time for this benchmark is 1400 s
– The SPEC ratio for this benchmark is 1400/90.4 × 100 ≈ 1548.7 (a unitless value; 1548 after truncation)
The performance of the different programs in a suite is summarized using the geometric mean of the SPEC ratios.
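The ratio and its summary can be sketched in a few lines. This is an illustration, not SPEC's own tooling: the function names are hypothetical, the factor of 100 follows the slide's example, and the list sample_ratios contains made-up values used only to exercise the geometric mean.

```python
import math


def spec_ratio(reference_time, measured_time, scale=100):
    """Reference time divided by measured time, scaled as in the slide's example."""
    return reference_time / measured_time * scale


def geometric_mean(ratios):
    """n-th root of the product of n ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


gzip_ratio = spec_ratio(1400, 90.4)
print(f"164.gzip ratio: {gzip_ratio:.1f}")

# Hypothetical ratios for three benchmarks, for illustration only.
sample_ratios = [gzip_ratio, 1200.0, 2000.0]
print(f"geometric mean: {geometric_mean(sample_ratios):.1f}")
```

A useful property of the geometric mean is that the ratio of two machines' means equals the mean of their per-benchmark ratios, so the comparison does not depend on which machine is chosen as the reference.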

37 Pentium III & Pentium 4

38 Comparing Pentium III and Pentium 4

Ratio                         Pentium III   Pentium 4
CINT2000/Clock rate in MHz    0.47          0.36
CFP2000/Clock rate in MHz     0.34          0.39

Implementation efficiency?

39 SPEC WEB99

System      Processor            # of disk drives   # of CPUs   # of networks   Clock rate (GHz)   Result
1550/1000   Pentium III          2                  2           2               1                  2765
1650        Pentium III          3                  2           1               1.4                1810
2500        Pentium III          8                  2           4               1.13               3435
2550        Pentium III          1                  2           1               1.26               1454
2650        Pentium 4 Xeon       5                  2           4               3.06               5698
4600        Pentium 4 Xeon       10                 2           4               2.2                4615
6400/700    Pentium III Xeon     5                  4           4               0.7                4200
6600        Pentium 4 Xeon MP    8                  4           8               2                  6700
8450/700    Pentium III Xeon     7                  8           8               0.7                8001

40 Power Consumption Concerns
Performance studied at different levels:
1. Maximum power
2. Intermediate level that conserves battery life
3. Minimum power that maximizes battery life
Intel Mobile Pentium & Pentium M: two available clock rates
1. Maximum
2. Reduced clock rate
Pentium M @ 1.6/0.6 GHz
Pentium 4-M @ 2.4/1.2 GHz
Pentium III-M @ 1.2/0.8 GHz

41 Three Intel Mobile Processors

42 Energy Efficiency

43 Synthetic Benchmarks
Artificial programs constructed to try to match the characteristics of a large set of programs.
Goal: create a single benchmark program in which the execution frequency of instructions simulates the instruction frequency in a large set of benchmarks.
Examples:
– Dhrystone, Whetstone
They are not real programs.
Compiler and hardware optimizations can inflate the improvement far beyond what the same optimization would achieve with real programs.

44 Amdahl’s Law in Computing
Improving one aspect of a machine by a factor of n does not improve the overall performance by the same factor.
Speedup = (Performance after improvement) / (Performance before improvement)
Speedup = (Execution time before improvement) / (Execution time after improvement)
Execution time after improvement = Execution time unaffected + (Execution time affected / n)

45 Amdahl’s Law
Example: Suppose a program runs in 100 s on a machine, with multiplication responsible for 80 s of this time.
How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
Can we improve the performance by a factor of 5?
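Both questions amount to solving "time after = unaffected + affected / n" for n. A minimal sketch (required_improvement is a hypothetical helper; it returns None when no finite n can reach the target):

```python
def required_improvement(total, affected, target_speedup):
    """Solve unaffected + affected / n = total / target_speedup for n.

    Returns None if the target cannot be reached with any finite n.
    """
    unaffected = total - affected
    target_time = total / target_speedup
    remaining = target_time - unaffected  # time budget left for the affected part
    if remaining <= 0:
        return None
    return affected / remaining


n_for_4x = required_improvement(100, 80, 4)
n_for_5x = required_improvement(100, 80, 5)
print(n_for_4x, n_for_5x)
```

Running 4 times faster means a 25 s total, which leaves 5 s for multiplication, so multiplication must become 16 times faster. A factor of 5 would need a 20 s total, but the unaffected 20 s already uses all of it, so no multiplication speedup is enough.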

46 Amdahl’s Law
The performance enhancement possible due to a given improvement is limited by the amount that the improved feature is used.
In the previous example, it makes sense to improve multiplication since it takes 80% of all execution time.
But after a certain amount of improvement, further effort to optimize multiplication yields insignificant gains: the law of diminishing returns.
A corollary of Amdahl’s Law is to make the common case faster.

47 Examples
Suppose we enhance a machine, making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?
We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 90 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
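Both examples apply the same formula, once forwards and once solved for the affected time. A minimal sketch with hypothetical helper names:

```python
def time_after(unaffected, affected, n):
    """Execution time after the affected part is made n times faster."""
    return unaffected + affected / n


def affected_time_needed(total, n, target_speedup):
    """Affected time x such that (total - x) + x / n = total / target_speedup."""
    target_time = total / target_speedup
    return (total - target_time) / (1 - 1 / n)


# Example 1: 10 s total, 5 s of it floating point, FP made 5 times faster.
new_time = time_after(5.0, 5.0, 5)
speedup_1 = 10.0 / new_time
print(f"new time {new_time:.1f} s, speedup {speedup_1:.2f}")

# Example 2: 90 s total, target speedup 3, FP made 5 times faster.
fp_time = affected_time_needed(90.0, 5, 3)
print(f"floating point must account for {fp_time:.1f} s of the 90 s")
```

The first benchmark drops from 10 s to 6 s, a speedup of about 1.67. For the second, floating-point instructions would have to account for about 75 of the 90 seconds, since 15 + 75/5 = 30 s.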

48 Remember
Total execution time is a consistent summary of performance
– Execution time = (IC × CPI) / f
For a given architecture, performance increases come from:
1. Increases in clock rate (without too much adverse effect on CPI)
2. Improvements in processor organization that lower CPI
3. Compiler enhancements that lower CPI and/or IC

