




1 COM 249 – Computer Organization and Assembly Language. Chapter 1 (continued): Performance. Based on slides from D. Patterson and www-inst.eecs.berkeley.edu/~cs152/; modified by S. J. Fritz, Spring 2009.

2 Understanding Performance (§1.4 Performance)
Algorithm: determines the number of operations executed (IC and possibly CPI)
Programming language, compiler, and architecture: determine the number of machine instructions executed per operation (IC, CPI)
Instruction set architecture: determines the instructions needed for a function, the cycles needed for each instruction, and the clock rate of the processor (IC, CPI, and clock rate)
Processor and memory system: determine how fast instructions are executed
I/O system (including OS): determines how fast I/O operations are executed

3 Comparison of Airplanes

Airplane          Passenger Capacity   Cruising Range (miles)   Cruising Speed (mph)   Passenger Throughput (passengers × mph)
Boeing 777        375                  4630                     610                    228,750
Boeing 747        470                  4150                     610                    286,700
Concorde          132                  4000                     1350                   178,200
Douglas DC-8-50   146                  8720                     544                    79,424

4 Defining Performance: Which airplane has the best performance?

5 Response Time and Throughput
Response time – how long it takes to do a task.
Throughput – total work done per unit time, e.g., tasks/transactions/… per hour.
How are response time and throughput affected by
–Replacing the processor with a faster version?
–Adding more processors?
We'll focus on response time for now…

6 Response Time and Throughput
Decreasing response time usually improves throughput.
Do these changes increase response time, throughput, or both?
–Replacing the processor with a faster one? Both.
–Adding additional processors? Only throughput, since no one task gets done faster.

7 Relative Performance
Define Performance = 1 / Execution Time.
"X is n times faster than Y" means Performance X / Performance Y = Execution Time Y / Execution Time X = n.
Example: time taken to run a program
–10 s on A, 15 s on B
–Execution Time B / Execution Time A = 15 s / 10 s = 1.5
–So A is 1.5 times faster than B
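The definition on this slide can be sketched in a few lines of Python (an illustration only; the function names are mine, not from the slides):

```python
def performance(exec_time_s: float) -> float:
    """Performance is the reciprocal of execution time."""
    return 1.0 / exec_time_s

def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """'X is n times faster than Y' means exec_time_Y / exec_time_X = n."""
    return exec_time_y / exec_time_x

# The slide's example: the program takes 10 s on A and 15 s on B.
print(times_faster(10.0, 15.0))  # 1.5 -> A is 1.5 times faster than B
```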

8 Measuring Execution Time
Elapsed time
–Total response time, including all aspects: processing, I/O, OS overhead, idle time
–Determines system performance
CPU time
–Time spent processing a given job; discounts I/O time and other jobs' shares
–Comprises user CPU time and system CPU time
–Different programs are affected differently by CPU and system performance

9 CPU Clocking
Operation of digital hardware is governed by a constant-rate clock: data transfer and computation occur within each clock cycle, and state is updated at the cycle boundary.
Clock period: duration of a clock cycle
–e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
Clock frequency (rate): cycles per second
–e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz

10 Program Execution Time (CPU Time)
CPU time = Total CPU clock cycles × Clock cycle time
Total CPU clock cycles = Instruction count × CPI
CPI: clock cycles per instruction
–Average number of clock cycles each instruction takes to execute
CPU designer's choice: trade off the number of instructions against the duration of the clock cycle
–Long cycle: powerful but complex instructions (CISC)
–Short cycle: simple instructions (RISC)
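Composing the two relations above gives CPU time = IC × CPI × clock cycle time. A minimal sketch (the numbers in the example are illustrative, not from the slides):

```python
def cpu_time_s(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instructions × (cycles/instruction) × (seconds/cycle)."""
    return instruction_count * cpi * cycle_time_s

# e.g. 10^9 instructions at an average CPI of 2.0 on a 250 ps clock:
t = cpu_time_s(10**9, 2.0, 250e-12)  # ≈ 0.5 seconds
```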

11 Instruction Count and CPI
Instruction count for a program
–Determined by program, ISA and compiler
Average cycles per instruction
–Determined by CPU hardware
–If different instructions have different CPIs, the average CPI is affected by the instruction mix

12 CPI Example
Computer A: cycle time = 250 ps, CPI = 2.0
Computer B: cycle time = 500 ps, CPI = 1.2
Same ISA. Which is faster, and by how much?
CPU time A = I × 2.0 × 250 ps = I × 500 ps
CPU time B = I × 1.2 × 500 ps = I × 600 ps
A is faster, by CPU time B / CPU time A = 600/500 = 1.2 times.
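The comparison can be checked in code. Since both machines run the same ISA, the instruction count is identical and cancels; only the time per instruction (CPI × cycle time) matters:

```python
# Time per instruction = CPI × clock cycle time.
time_per_instr_a = 2.0 * 250e-12   # 500 ps for computer A
time_per_instr_b = 1.2 * 500e-12   # 600 ps for computer B

speedup = time_per_instr_b / time_per_instr_a  # ≈ 1.2: A is 1.2 times faster
```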

13 Classic CPU Performance Equation
Basic performance equation in terms of IC, CPI and clock cycle time:
CPU time = Instruction count × CPI × Clock cycle time
Or, since clock rate is the inverse of clock cycle time:
CPU time = (Instruction count × CPI) / Clock rate

14 CPI in More Detail
If different instruction classes take different numbers of cycles:
Clock cycles = Σᵢ (CPIᵢ × Instruction countᵢ)
Weighted average CPI:
CPI = Clock cycles / Instruction count = Σᵢ CPIᵢ × (Instruction countᵢ / Instruction count)
where the last factor is the relative frequency of class i.

15 CPI Example
Alternative compiled code sequences using instructions in classes A, B, C:

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1

Sequence 1: IC = 5
–Clock cycles = 2×1 + 1×2 + 2×3 = 10
–Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
–Clock cycles = 4×1 + 1×2 + 1×3 = 9
–Avg. CPI = 9/6 = 1.5
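The slide's arithmetic can be verified with a few lines (class CPIs and instruction counts copied from the example above; the helper names are mine):

```python
cpi_for_class = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}  # total IC = 5
seq2 = {"A": 4, "B": 1, "C": 1}  # total IC = 6

def total_cycles(seq):
    """Sum CPI_i × IC_i over the instruction classes."""
    return sum(cpi_for_class[c] * ic for c, ic in seq.items())

def avg_cpi(seq):
    """Weighted average CPI = total cycles / total instruction count."""
    return total_cycles(seq) / sum(seq.values())

print(total_cycles(seq1), avg_cpi(seq1))  # 10 2.0
print(total_cycles(seq2), avg_cpi(seq2))  # 9 1.5
```

Note that sequence 2 executes more instructions yet takes fewer cycles: instruction count alone does not determine performance.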

16 Basic Components of Performance

Components of Performance              Units of Measure
CPU execution time                     Seconds
Instruction count                      Instructions executed
Clock cycles per instruction (CPI)     Average number of clock cycles per instruction
Clock cycle time                       Seconds per clock cycle

17 Performance Summary – The BIG Picture
Performance depends on
–Algorithm: affects IC, possibly CPI
–Programming language: affects IC, CPI
–Compiler: affects IC, CPI
–Instruction set architecture: affects IC, CPI, Tc

18 Power Trends (§1.5 The Power Wall)
In CMOS IC technology. (Figure: over successive generations, clock rate grew about ×1000 while power grew only about ×30, as supply voltage dropped from 5V to 1V. Why?)

19 Power Wall
Clock rate and power increased together over 25 years and 8 computer generations, then flattened off.
They grew together because they are correlated.
They slowed down because we have run into a practical limit on cooling microprocessors (thermal problems).

20 CMOS and Power
The dominant technology for integrated circuits (ICs) is CMOS (complementary metal oxide semiconductor).
For CMOS, the primary power dissipation is dynamic power – the power consumed during switching:
Power = Capacitive load × Voltage² × Frequency switched
The frequency switched is a function of clock rate; the capacitive load is a function of the number of transistors and the technology.
Lowering the supply voltage from 5V to 1V cut the voltage-squared term by about 25×, which is how power grew only about ×30 while clock rates grew ×1000.
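A rough check of the dynamic-power relation P = C × V² × f (the values below are illustrative stand-ins, not measurements):

```python
def dynamic_power(cap_load: float, voltage: float, frequency: float) -> float:
    """CMOS dynamic power: P = C × V^2 × f (any consistent units)."""
    return cap_load * voltage ** 2 * frequency

# Dropping the supply from 5 V to 1 V cuts the V^2 term by 25x.
# With a 1000x clock-rate increase and unchanged capacitive load:
ratio = dynamic_power(1.0, 1.0, 1000.0) / dynamic_power(1.0, 5.0, 1.0)
print(ratio)  # 40.0 -> same order as the observed ~30x growth
```

The leftover gap between 40× and the observed ~30× reflects the capacitive-load changes this sketch holds constant.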

21 Reducing Power
Suppose a new CPU has
–85% of the capacitive load of the old CPU
–15% voltage and 15% frequency reduction
Then P new / P old = 0.85 × 0.85² × 0.85 = 0.85⁴ ≈ 0.52.
The power wall
–We can't reduce voltage further
–We can't remove more heat
How else can we improve performance? A new design…
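Using the dynamic-power formula P = C × V² × f, the new/old power ratio is the product of the individual factor ratios (a sketch of the slide's exercise):

```python
cap_ratio = 0.85    # 85% of the old capacitive load
volt_ratio = 0.85   # 15% voltage reduction
freq_ratio = 0.85   # 15% frequency reduction

# P_new / P_old = (C ratio) × (V ratio)^2 × (f ratio) = 0.85^4
power_ratio = cap_ratio * volt_ratio ** 2 * freq_ratio
print(round(power_ratio, 2))  # 0.52 -> roughly half the old power
```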

22 Uniprocessor Performance (§1.6 The Sea Change: The Switch to Multiprocessors)
Growth is constrained by power, instruction-level parallelism, and memory latency.

23 Sea Change
"Sea change" is an idiom meaning a profound transformation or big change, taken from Shakespeare's The Tempest, when Ariel sings:
"Full fathom five thy father lies, / Of his bones are coral made, / Those are pearls that were his eyes, / Nothing of him that doth fade, / But doth suffer a sea-change, / Into something rich and strange…"
http://en.wikipedia.org/wiki/Sea_change

24 From Uniprocessor to Multiprocessor
Power limits have forced a change in the design of microprocessors.
Microprocessors now have multiple processors, or "cores," per chip – called multicore (dual core, quad core, etc.).
The plan is to double the number of cores per chip every two years.
Programmers need to rewrite their programs to take advantage of multiple processors.

25 Multiprocessors
Multicore microprocessors
–More than one processor per chip
Require explicitly parallel programming
–Compare with instruction-level parallelism, where hardware executes multiple instructions at once, hidden from the programmer
–Hard to do: programming for performance, load balancing, optimizing communication and synchronization

26 Multicore Microprocessors

Product          AMD Opteron X4 (Barcelona)   Intel Nehalem   IBM Power6   Sun UltraSPARC T2 (Niagara 2)
Cores per chip   4                            4               2            8
Clock rate       2.5 GHz                      ~2.5 GHz        4.7 GHz      1.4 GHz
Power            120 W                        ~100 W          —            94 W

27 Parallelism
Programmers need to switch to explicitly parallel programming.
Pipelining (Chapter 4) is an elegant technique that overlaps the execution of instructions.
Instruction-level parallelism abstracts the parallel nature of the hardware, so the programmer and compiler can think in terms of sequential instruction execution.

28 Parallel Programming
Parallel programs are hard to write:
Parallel programming is by definition performance programming and must be fast (if speed is not needed, write sequentially).
For parallel hardware, the programmer must divide the application so that each processor has the same amount of work to do, with little overhead.
See the newspaper story analogy, p. 43.

29 Real Stuff: Manufacturing an AMD Chip
Manufacture of a chip begins with silicon (found in sand).
Silicon is a semiconductor – it does not conduct electricity well.
Materials are added to the silicon to form:
–Conductors (copper or aluminum)
–Insulators (plastic or glass)
–Materials that conduct or insulate under special conditions (as a switch, or transistor)
A VLSI (very large scale integration) circuit is millions of conductors, insulators, and switches in a small package.

30 Manufacturing ICs (§1.7 Real Stuff: The AMD Opteron X4)
Yield: proportion of working dies per wafer.
http://www.intel.com/museum/onlineexhibits.htm

31 AMD Opteron X2 Wafer
X2: 300 mm wafer, 117 chips, 90 nm technology
X4: 45 nm technology

32 Integrated Circuit Cost
Cost has a nonlinear relation to area and defect rate
–Wafer cost and area are fixed
–Defect rate is determined by the manufacturing process
–Die area is determined by architecture and circuit design

33 SPEC CPU Benchmark
Benchmarks are programs used to measure performance
–Supposedly typical of the actual workload
Standard Performance Evaluation Corp (SPEC)
–Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
–Elapsed time to execute a selection of programs; negligible I/O, so it focuses on CPU performance
–Normalize relative to a reference machine
–Summarize as the geometric mean of performance ratios
–CINT2006 (integer) and CFP2006 (floating-point)
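The normalization and summary steps above can be sketched as follows (the function names are mine, and the ratios in the example are made-up numbers, not SPEC results):

```python
import math

def spec_ratio(ref_time_s: float, exec_time_s: float) -> float:
    """Normalize against the reference machine: larger is faster."""
    return ref_time_s / exec_time_s

def geometric_mean(ratios):
    """Summarize n ratios as the n-th root of their product."""
    return math.prod(ratios) ** (1.0 / len(ratios))

print(geometric_mean([2.0, 8.0]))  # 4.0
```

The geometric mean is used so that the summary is independent of which machine is chosen as the reference.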

34 CINT2006 for Opteron X4 2356

Name         Description                     IC×10⁹   CPI     Tc (ns)   Exec time   Ref time   SPECratio
perl         Interpreted string processing   2,118    0.75    0.40      637         9,777      15.3
bzip2        Block-sorting compression       2,389    0.85    0.40      817         9,650      11.8
gcc          GNU C Compiler                  1,050    1.72    0.40      724         8,050      11.1
mcf          Combinatorial optimization      336      10.00   0.40      1,345       9,120      6.8
go           Go game (AI)                    1,658    1.09    0.40      721         10,490     14.6
hmmer        Search gene sequence            2,783    0.80    0.40      890         9,330      10.5
sjeng        Chess game (AI)                 2,176    0.96    0.40      837         12,100     14.5
libquantum   Quantum computer simulation     1,623    1.61    0.40      1,047       20,720     19.8
h264avc      Video compression               3,102    0.80    0.40      993         22,130     22.3
omnetpp      Discrete event simulation       587      2.94    0.40      690         6,250      9.1
astar        Games/path finding              1,082    1.79    0.40      773         7,020      9.1
xalancbmk    XML parsing                     1,058    2.70    0.40      1,143       6,900      6.0
Geometric mean                                                                                 11.7

High cache miss rates account for the largest CPIs (e.g., mcf).

35 SPEC Power Benchmark
Measures power consumption of a server at different workload levels
–Performance: ssj_ops/sec
–Power: watts (joules/sec)

36 SPECpower_ssj2008 for X4

Target Load %   Performance (ssj_ops/sec)   Average Power (Watts)
100%            231,867                     295
90%             211,282                     286
80%             185,803                     275
70%             163,427                     265
60%             140,160                     256
50%             118,324                     246
40%             92,035                      233
30%             70,500                      222
20%             47,126                      206
10%             23,066                      180
0%              0                           141
Overall sum     1,283,590                   2,605

∑ssj_ops / ∑power = 493

37 Pitfalls and Fallacies (§1.8 Fallacies and Pitfalls)
Fallacies – commonly held misconceptions, usually presented with a counterexample.
Pitfalls – easily made mistakes, often generalizations of principles that are true only in a limited context.
The purpose of these sections is to help you avoid making these mistakes.

38 Amdahl's Law
Amdahl's law governs the speedup of using parallel processors on a problem, versus using only one serial processor.
Before we examine Amdahl's law, we should gain a better understanding of what is meant by speedup.
Speedup is the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many (j) processors):
S = T(1) / T(j)
Efficiency is the speedup divided by the number of processors used.
http://cs.wlu.edu/~whaleyt/classes/parallel/topics/amdahl.html

39 Amdahl's Law
If N is the number of processors, s is the amount of time spent (by a serial processor) on the serial parts of a program, and p is the amount of time spent (by a serial processor) on the parts of the program that can be done in parallel, then Amdahl's law says the speedup is:
Speedup = (s + p) / (s + p/N) = 1 / (s + p/N),
where we have set the total time s + p = 1 for algebraic simplicity.
http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
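The formula translates directly into code (normalized total time s + p = 1; the function name is mine):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup = 1 / (s + p/N), with s + p = 1."""
    s = 1.0 - p  # serial fraction
    return 1.0 / (s + p / n)

# 95% of the work parallelizable:
four = amdahl_speedup(0.95, 4)       # ≈ 3.48 with 4 processors
many = amdahl_speedup(0.95, 10**6)   # ≈ 20: capped at 1/s no matter how many
```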

40 Limitations of Amdahl's Law
If a program needs 20 hours on a single processor, and a 1-hour portion cannot be parallelized while the remaining 19 hours (95%) can, then no matter how many processors we devote to the parallelized execution, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20×.

41 Pitfall: Amdahl's Law
Pitfall: expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement.
Suppose a program runs in 100 seconds on a computer, with multiply operations responsible for 80 seconds of this time. How much must the speed of multiplication improve to make the program run 5 times faster?

42 Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance.
Example: multiply accounts for 80 s out of 100 s. How much improvement in multiply performance is needed to get 5× overall?
–A 5× speedup means 100/5 = 20 s total, but the 20 s of non-multiply time alone already equals that: 20 + 80/n = 20 requires 80/n = 0. It can't be done!
Corollary: make the common case fast.
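Plugging numbers into the 80 s/100 s example shows why the 5× target is unreachable (a sketch; the function name is mine):

```python
def new_total_time(multiply_speedup: float) -> float:
    """100 s total, 80 s of multiplies: only the multiply part shrinks."""
    return 80.0 / multiply_speedup + 20.0

# Even an infinitely fast multiplier leaves the 20 s of other work,
# so the total can never reach the 100/5 = 20 s target.
print(new_total_time(4.0))   # 40.0 s -> only 2.5x overall
print(new_total_time(1e9))   # just above 20.0 s; 5x is unreachable
```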

43 Uses for Amdahl's Law
Estimate performance improvements.
Together with the CPU performance equation, use it to evaluate potential enhancements.
Use the corollary – make the common case fast – to enhance performance; this is easier than optimizing the rare case.
Use it to examine the practical limits on the number of parallel processors.

44 Fallacy: Low Power at Idle
Fallacy: computers at low utilization use little power.
Look back at the X4 power benchmark
–At 100% load: 295 W
–At 50% load: 246 W (83%)
–At 10% load: 180 W (61%)
Google data centers
–Mostly operate at 10%–50% load
–Are at 100% load less than 1% of the time
Consider designing processors to make power proportional to load.

45 Pitfall: MIPS as a Performance Metric
Pitfall: using a subset of the performance equation as a performance metric.
MIPS: Millions of Instructions Per Second
–Doesn't account for differences in ISAs between computers, or differences in complexity between instructions
–CPI varies between programs on a given CPU

46 Performance Measurements
Which computer has the higher MIPS rating? Which computer is faster?

Measurement         Computer A   Computer B
Instruction count   10 billion   8 billion
Clock rate          4 GHz        4 GHz
CPI                 1.0          1.1

MIPS = Clock rate / (CPI × 10⁶):
–A: 4×10⁹ / (1.0×10⁶) = 4000
–B: 4×10⁹ / (1.1×10⁶) ≈ 3636
CPU time = (IC × CPI) / Clock rate:
–A: 10×10⁹ × 1.0 / 4×10⁹ = 2.5 s
–B: 8×10⁹ × 1.1 / 4×10⁹ = 2.2 s
A has the higher MIPS rating, but B is faster.
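The table's arithmetic, checked in a few lines (data from the slide; the helper names are mine):

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI × 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

def cpu_time_s(ic: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = IC × CPI / clock rate."""
    return ic * cpi / clock_rate_hz

mips_a, time_a = mips(4e9, 1.0), cpu_time_s(10e9, 1.0, 4e9)  # 4000 MIPS, 2.5 s
mips_b, time_b = mips(4e9, 1.1), cpu_time_s(8e9, 1.1, 4e9)   # ≈3636 MIPS, ≈2.2 s
# A wins on MIPS (4000 > 3636), but B finishes the program sooner (2.2 s < 2.5 s):
# MIPS ignores instruction count, so it can rank the slower machine first.
```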

47 Concluding Remarks (§1.9 Concluding Remarks)
Cost/performance is improving
–Due to underlying technology development
Hierarchical layers of abstraction
–In both hardware and software
Instruction set architecture
–The hardware/software interface
Execution time: the best performance measure
Seconds/Program = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)
Execution time is the only valid measure of performance.

48 Concluding Remarks
Power is a limiting factor
–Use parallelism to improve performance, via multiple processors
–Exploit the locality of accesses to a memory hierarchy, via caches
The key hardware technology for modern processors is silicon.
Historical perspective: see §1.10 on the CD.


