Computer Performance Computer Engineering Department.


1 Computer Performance Computer Engineering Department

2 Case Study
A company wants to redesign its computer M BASE (5 GHz) to beat the competition, using a hardware team and a compiler team.
Instruction class   CPI i   Frequency
A                   2       40%
B                   3       25%
C                   3       25%
D                   5       10%
By optimizing the hardware and changing the clock to 6 GHz:
Instruction class   CPI i   Frequency
A                   2       40%
B                   2       25%
C                   3       25%
D                   4       10%

3 Case Study - continued
The CPI for each machine is
CPI MBASE = 2x0.4 + 3x0.25 + 3x0.25 + 5x0.1 = 2.8 cycles/instr.
CPI MOPT = 2x0.4 + 2x0.25 + 3x0.25 + 4x0.1 = 2.45 cycles/instr.
The MIPS rating for each machine follows from
MIPS = # Instructions / (Execution time x 10^6) = # Instructions / ((# CPU cycles / Clock frequency) x 10^6) = Clock frequency (million cycles/sec) / CPI
MIPS MBASE = 5 x 10^3 / 2.8 = 1,786 MIPS
MIPS MOPT = 6 x 10^3 / 2.45 = 2,449 MIPS
MIPS MOPT / MIPS MBASE = 2,449 / 1,786 = 1.37
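
The CPI and MIPS arithmetic on this slide can be reproduced with a short Python sketch. The dictionary layout and the function names `average_cpi` and `mips` are illustrative choices, not part of the original presentation; the numbers are the instruction mixes and clock rates given in the case study.

```python
# Instruction mixes from the case study: class -> (CPI, frequency).
BASE = {"A": (2, 0.40), "B": (3, 0.25), "C": (3, 0.25), "D": (5, 0.10)}
OPT = {"A": (2, 0.40), "B": (2, 0.25), "C": (3, 0.25), "D": (4, 0.10)}


def average_cpi(mix):
    """Weighted average CPI: sum of CPI_i * frequency_i over all classes."""
    return sum(cpi * freq for cpi, freq in mix.values())


def mips(clock_hz, cpi):
    """MIPS rating: clock rate divided by (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


cpi_base = average_cpi(BASE)
cpi_opt = average_cpi(OPT)
mips_base = mips(5e9, cpi_base)  # M BASE runs at 5 GHz
mips_opt = mips(6e9, cpi_opt)    # M OPT runs at 6 GHz

print(f"CPI  base={cpi_base:.2f}  opt={cpi_opt:.2f}")
print(f"MIPS base={mips_base:.0f}  opt={mips_opt:.0f}")
print(f"MIPS ratio opt/base = {mips_opt / mips_base:.2f}")
```

Because both machines run the same instruction count here, the MIPS ratio is also the speedup of the hardware-optimized machine.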

4 Case Study - continued
The compiler team will leave the architecture unchanged (5 GHz clock), but wants to reduce the number of instructions generated when the high-level code is converted to assembly language.
Instruction class   % of instructions executed vs. base
A                   90%
B                   90%
C                   85%
D                   95%
So the overall ratio of instructions is 0.9x0.4 + 0.9x0.25 + 0.85x0.25 + 0.95x0.1 = 0.8925
The new CPI = (2x0.4x0.9 + 3x0.25x0.9 + 3x0.25x0.85 + 5x0.1x0.95) / 0.8925 = 2.5075 / 0.8925 = 2.81

5 Case Study - continued
The resultant speedup from compiler optimization is
CPU time MBASE = Inst. Count x CPI / Clock frequency = Inst. Count x 2.8 / Clock frequency
CPU time MOPT (compiler-optimized) = Inst. Count x 0.8925 x 2.81 / Clock frequency = Inst. Count x 2.51 / Clock frequency
where 0.8925 = 0.9x0.4 + 0.9x0.25 + 0.85x0.25 + 0.95x0.1 is the instruction-count ratio and 2.81 is the resulting CPI.
So the speedup is CPU time MBASE / CPU time MOPT = 2.8 / 2.51 = 1.12 (or 12% improvement)
If BOTH hardware and software are optimized,
CPI MBOTH = (2x0.4x0.9 + 2x0.25x0.9 + 3x0.25x0.85 + 4x0.1x0.95) / 0.8925 = 2.1875 / 0.8925
So CPI MBOTH = 2.45 cycles/instruction
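
The compiler-only speedup can be checked with a minimal Python sketch. It works in cycles per original (base) instruction, which is the product of the instruction-count ratio and the new CPI, so the two do not have to be computed separately. The names `BASE`, `REMAINING`, and `cycles_per_base_instruction` are illustrative, not from the presentation.

```python
# Base machine: class -> (CPI, frequency in the original program).
BASE = {"A": (2, 0.40), "B": (3, 0.25), "C": (3, 0.25), "D": (5, 0.10)}
# Fraction of each class still executed after compiler optimization.
REMAINING = {"A": 0.90, "B": 0.90, "C": 0.85, "D": 0.95}


def cycles_per_base_instruction(mix, remaining=None):
    """Cycles spent per instruction of the ORIGINAL program.

    With `remaining` given, each class contributes only the fraction of
    instructions the compiler still emits.
    """
    total = 0.0
    for cls, (cpi, freq) in mix.items():
        kept = 1.0 if remaining is None else remaining[cls]
        total += cpi * freq * kept
    return total


base_cycles = cycles_per_base_instruction(BASE)
compiler_cycles = cycles_per_base_instruction(BASE, REMAINING)

# Same clock for both machines, so the speedup is just the cycle ratio.
speedup = base_cycles / compiler_cycles
print(f"cycles/base instr: {base_cycles:.4f} -> {compiler_cycles:.4f}")
print(f"compiler-only speedup = {speedup:.2f}")
```

Since the clock frequency is unchanged, it cancels out of the CPU time ratio and never appears in the calculation.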

6 Case Study - continued
The resultant speedup from optimizing BOTH hardware and software is
CPU time MBASE / CPU time MBOTH = (Clock frequency BOTH x CPI BASE) / (0.8925 x Clock frequency BASE x CPI BOTH) = (6x10^9 x 2.8) / (4.46x10^9 x 2.45) = 1.54, or 54% improvement
where 0.8925 = 0.9x0.4 + 0.9x0.25 + 0.85x0.25 + 0.95x0.1 is the instruction-count ratio and CPI BOTH = 2.1875 / 0.8925 = 2.45.
The improvements take time… and the competition advances too
Optimization method   Time taken   Improvement
Hardware              6 months     37%
Compiler              6 months     12%
Both                  8 months     54%
We know that CPU performance grows 50%/year, or about 3.4%/month (1.5^(1/12) = 1.034)
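
As a rough check of the combined hardware-plus-compiler figure, the sketch below computes CPU time per original instruction for both machines and takes the ratio. The variable names are hypothetical; the CPIs, frequencies, clock rates, and per-class instruction reductions are the ones stated in the case study.

```python
# class -> (base CPI, hardware-optimized CPI, frequency, fraction kept by compiler)
CLASSES = {
    "A": (2, 2, 0.40, 0.90),
    "B": (3, 2, 0.25, 0.90),
    "C": (3, 3, 0.25, 0.85),
    "D": (5, 4, 0.10, 0.95),
}
CLOCK_BASE_HZ = 5e9
CLOCK_BOTH_HZ = 6e9

# Cycles per instruction of the original program on each machine.
cycles_base = sum(cpi_b * f for cpi_b, _, f, _ in CLASSES.values())
cycles_both = sum(cpi_o * f * kept for _, cpi_o, f, kept in CLASSES.values())

# CPU time per original instruction = cycles / clock frequency.
time_base = cycles_base / CLOCK_BASE_HZ
time_both = cycles_both / CLOCK_BOTH_HZ

speedup_both = time_base / time_both
print(f"cycles per base instruction: {cycles_base:.4f} vs {cycles_both:.4f}")
print(f"combined speedup = {speedup_both:.2f}")
```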

7 Case Study - conclusions
At 50%/year, the competition's CPU performance will increase in six months by a factor of (1.5)^(6/12) = 1.22
In eight months it will grow by a factor of (1.5)^(8/12) = 1.31
So optimizing only the compiler (12%) will not be sufficient; either M OPT (37%) or M BOTH (54%) is the way to go!
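
The conclusion can be sketched as a comparison between each option's speedup and the competition's growth over the same development time. This sketch compounds the 50% annual figure directly instead of using a rounded monthly rate; a rounded monthly rate gives the same verdicts. The names `competition_gain` and `OPTIONS` are illustrative.

```python
ANNUAL_GROWTH = 1.5  # competition improves 50% per year


def competition_gain(months):
    """Performance factor the competition reaches after `months` months."""
    return ANNUAL_GROWTH ** (months / 12)


# option -> (development time in months, speedup over M BASE)
OPTIONS = {"Hardware": (6, 1.37), "Compiler": (6, 1.12), "Both": (8, 1.54)}

verdicts = {}
for name, (months, speedup) in OPTIONS.items():
    rival = competition_gain(months)
    verdicts[name] = speedup > rival
    status = "ahead" if verdicts[name] else "behind"
    print(f"{name:8s} {months} months: ours {speedup:.2f}x, "
          f"competition {rival:.2f}x -> {status}")
```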

8 Another way to judge performance - Benchmarks
• These are libraries of programs that designers and consumers run on various computers to compare their performance.
• They emulate a workload similar to the application that the consumer intends to use the computer for, or that the designer wants to optimize for.
• One advantage of benchmarks is reproducibility, so that two or more designs can be compared before a computer hits the market.
• To ensure objectivity, benchmarks are established by an independent committee.

9 Benchmarks - continued
• The independent organization is the Standard Performance Evaluation Corporation (SPEC) http://www.specbench.org/
• They publish benchmark results for CPUs, as well as graphics cards, web servers and other architectures.
• Since this is a fast-changing field, the benchmarks change too (for CPUs we had SPEC CPU95, which was replaced by SPEC CPU2000 and now SPEC CPU2006).
• For web servers they used SPECweb99, now replaced by SPECweb2005.

10 Benchmarks - continued
• Regardless of version and targeted hardware, benchmarks are a collection of programs, not just one. Since each benchmark program (within a given benchmark library) is different, results need to be summarized.
• How is execution time used with benchmarks?
• Example (execution times in seconds)
                             Machine A   Machine B
Benchmark program 1          10          100
Benchmark program 2          1000        100
Benchmark program 3          500         550
Total execution time (sec)   1510        750

11 Benchmarks - continued
• Performance A / Performance B = Exec. Time B / Exec. Time A = 750/1510 = 0.50, or Performance B = 2.01 x Performance A (with B's total being 100 + 100 + 550 = 750 sec).
• Thus Machine B is about twice as fast as A overall, even though Machine A was faster in two of the benchmark programs.
• Thus total execution time is an indicator of performance if each of the benchmark programs is executed once (or an equal number of times).
• Another measure is the arithmetic mean = (1/n) x Sum(Time i), where Time i is the time taken to execute program i and n is the total number of programs in the benchmark.
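
The total-time and arithmetic-mean summaries can be illustrated with the three-program example. In this sketch the totals are recomputed from the per-program times instead of being copied; the variable and function names are illustrative.

```python
# Execution times in seconds for benchmark programs 1..3.
times_a = [10, 1000, 500]
times_b = [100, 100, 550]


def arithmetic_mean(times):
    """Plain average: every program is assumed to run equally often."""
    return sum(times) / len(times)


total_a = sum(times_a)
total_b = sum(times_b)

# Performance is the inverse of execution time, so
# Performance B / Performance A = total time A / total time B.
perf_b_over_a = total_a / total_b

# Programs on which Machine A finishes first.
a_wins = [i + 1 for i, (a, b) in enumerate(zip(times_a, times_b)) if a < b]

print(f"total A = {total_a} s, total B = {total_b} s")
print(f"mean  A = {arithmetic_mean(times_a):.1f} s, "
      f"mean  B = {arithmetic_mean(times_b):.1f} s")
print(f"Performance B / Performance A = {perf_b_over_a:.2f}")
print(f"Machine A is faster on programs {a_wins}")
```

With equal run counts, the ratio of arithmetic means equals the ratio of totals, so both summaries rank the machines the same way.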

12 Benchmarks - continued
• If not all programs in the benchmark are executed the same number of times, then we need to use a weighted arithmetic mean = Sum(W i x Time i) / Sum(W i), where W i is the weight assigned to program i of the benchmark.
• A normalized execution time is the ratio of the time taken to execute a given program on a given computer to the time taken by a “reference” computer to execute the same program.
• A better way to gauge performance is to use the geometric mean of normalized execution times, (a 1 x a 2 x … x a n)^(1/n), where a i = execution time ratio for program i out of n programs.
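
A minimal sketch of the weighted arithmetic mean and the geometric mean of normalized times follows. The weights are hypothetical run counts chosen only for illustration, and Machine A is arbitrarily taken as the reference computer; neither choice comes from the presentation. `math.prod` requires Python 3.8 or later.

```python
import math

# Execution times in seconds for three benchmark programs.
times_a = [10, 1000, 500]   # used here as the reference machine (assumption)
times_b = [100, 100, 550]
weights = [3, 1, 1]         # hypothetical run counts per program


def weighted_mean(times, weights):
    """Weighted arithmetic mean, normalized by the sum of the weights."""
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)


def geometric_mean(ratios):
    """n-th root of the product of n normalized execution times."""
    return math.prod(ratios) ** (1 / len(ratios))


# Normalized execution time of B: time on B divided by time on the reference.
normalized_b = [b / a for a, b in zip(times_a, times_b)]

wm_a = weighted_mean(times_a, weights)
wm_b = weighted_mean(times_b, weights)
gm_b = geometric_mean(normalized_b)

print(f"weighted mean A = {wm_a:.1f} s, B = {wm_b:.1f} s")
print(f"normalized times of B = {normalized_b}")
print(f"geometric mean of B's normalized times = {gm_b:.3f}")
```

One property worth noting: the ratio of two machines' geometric means does not depend on which computer is chosen as the reference, which is not true of an arithmetic mean of normalized times.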

13 Benchmarks - continued
• In SPEC CPU2000 the number of programs grew to 12 integer programs and 14 floating-point programs.
• Additional reading

14 Benchmarks - continued

15 Benchmark Comparison (on SPEC CPU2000)
Comparison of the Pentium III and the Pentium 4
• Both scale linearly with clock rate (aggressive caching reduces the memory penalty).
• The Pentium 4 uses a different pipeline and instructions, which boost floating-point computations.

16 Benchmarks and Energy efficiency
• Reducing power means reducing voltage and/or reducing clock frequency, a technique used in laptops and other mobile applications.
• Processors then have three modes: max clock, adaptive clock, and minimum clock (minimum power).

17 Benchmarks and Energy efficiency
• Energy efficiency = performance / average power consumption (watts).
• The Pentium M (part of Centrino), designed from the start for mobile computing, has superior energy efficiency compared with the Pentium III-M and Pentium 4-M, which are modified versions of the standard processors.
1 GHz to 2.26 GHz depending on voltage

18 Dual-core Architecture
Places two processors on a single chip (e.g., Intel Core Duo). http://www.digital-daily.com/cpu/new_core_conroe/

19 Benchmarks - continued
• A normalized execution time is the ratio of the time taken to execute a given program on a given computer to the time taken by a “reference” computer to execute the same program.
• A better way to gauge performance is to use the geometric mean of normalized execution times, (a 1 x a 2 x … x a n)^(1/n), where a i = execution time ratio for program i out of n programs.

20 Benchmarks - continued
SPEC (Standard Performance Evaluation Corporation) CPU2006 has 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).
The elapsed time in seconds for each of the benchmarks in the CINT2006 or CFP2006 suite is measured, and its ratio to the time on the reference machine (a Sun UltraSPARC II system at 296 MHz) is calculated.
The SPECint_base2006 and SPECfp_base2006 metrics are calculated as the geometric mean of the individual ratios, where each ratio is based on the median execution time from three runs.
SPEC CPU2006 Benchmark Descriptions http://www.spec.org/cpu2006/publications/CPU2006benchmarks.pdf
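
The scoring procedure described on this slide (median of three runs, ratio to the reference machine, geometric mean of the ratios) can be sketched as follows. The benchmark names and all timing values below are made-up placeholders, not real SPEC data, and the helper names are illustrative; this is a sketch of the arithmetic only, not of SPEC's actual tooling.

```python
import math
import statistics

# Hypothetical reference-machine times in seconds (NOT real SPEC values).
REFERENCE_SECONDS = {"prog_a": 800.0, "prog_b": 2700.0, "prog_c": 100.0}

# Hypothetical measured times for three runs of each benchmark.
MEASURED_RUNS = {
    "prog_a": [101.0, 100.0, 99.0],
    "prog_b": [100.0, 98.0, 103.0],
    "prog_c": [100.0, 100.0, 120.0],
}


def benchmark_ratio(reference_seconds, runs):
    """Reference time divided by the median measured time (higher is better)."""
    return reference_seconds / statistics.median(runs)


def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1 / len(values))


ratios = {
    name: benchmark_ratio(REFERENCE_SECONDS[name], runs)
    for name, runs in MEASURED_RUNS.items()
}
score = geometric_mean(list(ratios.values()))

for name, ratio in ratios.items():
    print(f"{name}: ratio = {ratio:.2f}")
print(f"overall score (geometric mean) = {score:.2f}")
```

Using the median of three runs keeps a single slow outlier, such as the 120-second run of `prog_c`, from affecting the reported ratio.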

21 SPEC CPU2006 for Multi-core CPUs
System name (processor, speed)          Cores   Chips   Cores/chip   Threads/core   Base   Peak (optimized compiler)
AMD Opteron 890, 2.8 GHz                2       1       2            1              12.7   13.5
Intel Dual-Core Itanium 2, 1.4 GHz      2       1       2            1              13.6   14.3
Intel Xeon 5160, 3.00 GHz               2       1       2            1              15.3   15.6
Intel Xeon processor X5365, 3.0 GHz     4       1       4            1              18.2   21.2
Results are relative to the reference machine, a 296 MHz UltraSPARC II processor.

22 Multi-core Benchmarks http://www23.tomshardware.com/cpu_2007.html?modelx=33&model1=921&model2=868&chart=424

23 Evaluation Summary
Actual Target Workload
  Pros: representative
  Cons: very specific; non-portable; difficult to run or measure
Full Application Benchmarks
  Pros: portable; widely used; improvements useful in reality
  Cons: less representative
Small “Kernel” Benchmarks
  Pros: easy to run, early in design cycle
  Cons: easy to “fool”
Microbenchmarks
  Pros: identify peak capability and potential bottlenecks
  Cons: “peak” may be a long way from application performance

24 Additional readings
• The Efficeon product sheet at www.transmeta.com/pdfs/brochures/efficeon_tm8600_processor.pdf
• Multi-Core Processor Architecture Explained http://www3.intel.com/cd/ids/developer/asmo-na/eng/211198.htm?page=2&=prn
• Performance Scaling in the Multi-Core Era http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/290740.htm

