
1 ELEC 5270/6270 Fall 2007
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal, James J. Danaher Professor
Dept. of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Fall07/course.html
Copyright Agrawal, 2007. ELEC6270 Fall 07, Lecture 14

2 SIA Roadmap for Processors (1999)
Year                     1999   2002   2005   2008   2011   2014
Feature size (nm)         180    130    100     70     50     35
Logic transistors/cm²    6.2M    18M    39M    84M   180M   390M
Clock (GHz)              1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm²)           340    430    520    620    750    900
Power supply (V)          1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)       90    130    160    170    175    183
Source: http://www.semichips.org

3 Power Reduction in Processors
Just about everything is used.
Hardware methods:
  Voltage reduction for dynamic power
  Dual-threshold devices for leakage reduction
  Clock gating, frequency reduction
  Sleep mode
Architecture:
  Instruction set
  Hardware organization
Software methods

4 SPEC CPU2000 Benchmarks
Twelve integer and 14 floating-point programs, CINT2000 and CFP2000.
Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra 5_10 with a 300MHz processor.
CINT2000 and CFP2000 summary measurements are the geometric means of the SPEC ratios.
LINPACK is a numerically intensive floating-point linear system (Ax = b) program used for benchmarking supercomputers.
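The normalization and geometric-mean summary described on this slide can be sketched as follows. The run times are hypothetical, and the scale factor of 100 (so that the reference machine itself scores 100) is an assumption about how the published ratios are reported, not something stated on the slide.

```python
import math

def spec_ratio(reference_seconds, measured_seconds, scale=100.0):
    """Reference run time over measured run time; `scale` is an assumed reporting factor."""
    return scale * reference_seconds / measured_seconds

def geometric_mean(values):
    """n-th root of the product, computed through logarithms to avoid overflow."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference, measured) run times in seconds for three programs.
run_times = [(1400.0, 140.0), (1800.0, 90.0), (1100.0, 220.0)]
ratios = [spec_ratio(ref, meas) for ref, meas in run_times]
summary = geometric_mean(ratios)
print(ratios, summary)
```

A geometric mean is used rather than an arithmetic mean so that the summary does not depend on which machine is chosen as the reference.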

5 Reference CPU s: Sun Ultra 5_10 300MHz Processor

6 CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)
SPECint2000_base = 1341
SPECint2000 = 1389
Source: www.spec.org

7 Two Benchmark Results
Baseline: A uniform configuration not optimized for a specific program:
  Same compiler with same settings and flags used for all benchmarks
  Other restrictions
Peak: Run is optimized for obtaining the peak performance for each benchmark program.

8 CFP2000: 3.6GHz Pentium 4, HT Technology (D925XCV/AA-400 Motherboard)
SPECfp2000_base = 1627
SPECfp2000 = 1630
Source: www.spec.org

9 CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECint2000_base = 579
SPECint2000 = 588
Source: www.spec.org

10 CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECfp2000_base = 648
SPECfp2000 = 659
Source: www.spec.org

11 Energy SPEC Benchmarks
Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:
$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$

12 Energy Efficiency
Efficiency averaged on n benchmark programs:
$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$
where Efficiency_i is the efficiency for program i.
Relative efficiency:
$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
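A small sketch of the energy-efficiency definitions on slides 11 and 12. All run times and energy figures are hypothetical, and the function names are chosen here for illustration only.

```python
import math

def energy_efficiency(execution_seconds, joules):
    """(1 / execution time) divided by the joules consumed, for one program."""
    return (1.0 / execution_seconds) / joules

def mean_efficiency(efficiencies):
    """Geometric mean of the per-program efficiencies."""
    efficiencies = list(efficiencies)
    return math.exp(sum(math.log(e) for e in efficiencies) / len(efficiencies))

# Hypothetical (seconds, joules) pairs for the same two programs on two machines.
machine_runs = [(10.0, 200.0), (20.0, 500.0)]
reference_runs = [(40.0, 400.0), (80.0, 1000.0)]

machine_eff = mean_efficiency(energy_efficiency(t, e) for t, e in machine_runs)
reference_eff = mean_efficiency(energy_efficiency(t, e) for t, e in reference_runs)
relative_efficiency = machine_eff / reference_eff
print(relative_efficiency)
```

With these made-up numbers the machine is 8 times as efficient as the reference on each program, so the relative efficiency also comes out at 8.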

13 SPEC2000 Relative Energy Efficiency
[Figure: relative energy efficiency for three configurations: always max. clock; laptop adaptive clock; min. power, min. clock.]

14 Voltage Scaling
Dynamic: Reduce voltage and frequency during idle or low activity periods.
Static: Clustered voltage scaling
  Logic on non-critical paths given lower voltage.
  47% power reduction with 10% area increase reported.
M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

15 Processor Utilization
Throughput = Operations / second
[Figure: throughput versus time, with labels for maximum throughput, compute-intensive processes, low throughput (background) processes, and system idle.]

16 Examples of Processes
Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
Idle: no computation, no expected output.

17 Effects of Voltage Reduction
Voltage reduction increases delay, decreases throughput:
  Slow reduction in throughput at first
  Rapid reduction in throughput for V_DD ≤ V_th
  Time per operation (TPO) increases
Voltage reduction continues to reduce power consumption:
  Energy per operation (EPO) = Power × TPO

18 Energy per Operation (EPO)
[Figure: Power, TPO, and EPO on a 0.0 to 1.0 scale, plotted against V_DD / V_th from 1 to 5.]
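The relation EPO = Power × TPO can be explored with a rough numerical sketch. The delay model below (an alpha-power-law form with exponent 2), the neglect of leakage, and the normalization are all assumptions made for illustration; they are not taken from the slides and will not reproduce the plotted curves exactly.

```python
def time_per_operation(x, alpha=2.0):
    """Normalized delay for x = V_DD / V_th, assuming delay ~ x / (x - 1)**alpha."""
    if x <= 1.0:
        raise ValueError("this simple model only covers V_DD above V_th")
    return x / (x - 1.0) ** alpha

def power(x, alpha=2.0):
    """Dynamic power ~ V_DD**2 * f, with f taken as 1 / TPO (leakage ignored)."""
    return x ** 2 / time_per_operation(x, alpha)

def energy_per_operation(x, alpha=2.0):
    """EPO = Power x TPO, as defined on slide 17."""
    return power(x, alpha) * time_per_operation(x, alpha)

ratios = [1.5, 2.0, 3.0, 4.0, 5.0]  # V_DD / V_th
for x in ratios:
    print(x, time_per_operation(x), power(x), energy_per_operation(x))
```

Under these assumptions TPO grows sharply as V_DD approaches V_th, while EPO keeps falling with the square of the supply voltage, which is the qualitative behaviour the slides describe.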

19 Dynamic Voltage and Clock
                        Time spent in:
Throughput              Fast mode   Slow mode   Idle mode   Battery life
Always full speed       10%         0%          90%         1 hr
Sometimes full speed    1%          90%         9%          5.3 hrs
Rarely full speed       0.1%        99%         0.9%        9.2 hrs
T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.
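The battery-life column can be approximated with a simple average-power model. The relative power values used here (idle costing as much as fast mode when the clock always runs at full speed, and slow mode costing about one tenth of fast mode) are assumptions picked because they happen to match the table to its printed precision; they are not quoted from Burd and Brodersen.

```python
def battery_life_hours(fast, slow, idle,
                       p_fast=1.0, p_slow=0.1, p_idle=1.0,
                       hours_at_full_power=1.0):
    """Battery life when the fractions of time (fast, slow, idle) sum to 1.

    p_fast, p_slow, p_idle are assumed relative power levels, not source data.
    """
    average_power = fast * p_fast + slow * p_slow + idle * p_idle
    return hours_at_full_power * p_fast / average_power

always = battery_life_hours(0.10, 0.00, 0.90)
sometimes = battery_life_hours(0.01, 0.90, 0.09)
rarely = battery_life_hours(0.001, 0.99, 0.009)
print(round(always, 1), round(sometimes, 1), round(rarely, 1))
```

The point of the table survives the simplification: battery life is dominated by how much of the time the processor can spend in the slow, low-voltage mode.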

20 Problem of Process Variation and Leakage
[Figure: number of chips versus V_th. Chips with lower V_th fall outside the power specification (yield loss due to high leakage); chips with higher V_th fall outside the clock specification (yield loss due to slow speed).]
From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.

21 Pipeline Gating
A pipeline processor uses speculative execution.
Incorrect branch prediction results in pipeline stalls and wasted energy.
Idea: Stop fetching instructions if a branch hazard is expected:
  If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.
Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
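The gating rule on this slide can be illustrated with a toy model. This is a sketch of the rule as worded here, not of the mechanism in the cited paper: real hardware cannot know at fetch time that a prediction is wrong, so the boolean flags stand in for whatever estimate the hardware uses, and resetting the count after each gating period is an assumption.

```python
def gated_fetch_schedule(mispredicted, n_threshold, k_cycles):
    """Return one boolean per cycle: True when fetch is active, False when gated.

    mispredicted: one boolean per fetched branch, True for an incorrect prediction.
    """
    schedule = []
    count = 0  # M, the running count of incorrect predictions
    for wrong in mispredicted:
        schedule.append(True)                    # this branch is fetched
        if wrong:
            count += 1
        if count > n_threshold:                  # M exceeds N
            schedule.extend([False] * k_cycles)  # suspend fetch for k cycles
            count = 0                            # assumption: restart the count
    return schedule

outcomes = [False, True, False, True, True, False]
schedule = gated_fetch_schedule(outcomes, n_threshold=2, k_cycles=3)
print(schedule)
```

Choosing N and k trades energy against performance: a small N gates often and saves more wasted fetches, at the cost of stalling when the predictions would have been correct.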

22 Slack Scheduling
Application: Superscalar, out-of-order execution:
  An instruction is executed as soon as the required data and resources become available.
  A commit unit reorders the results.
Delay the completion of instructions whose result is not immediately needed.
Example of RISC instructions:
  add r0, r1, r2; (A)
  sub r3, r4, r5; (B)
  and r9, x1, r9; (C)
  or r5, r9, r10; (D)
  xor r2, r10, r11; (E)
J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.

23 Slack Scheduling Example
[Figure: execution timelines for instructions A to E. Standard scheduling: ABC D E. Slack scheduling: A BC D E.]
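The slack in the five-instruction example can be computed with a small dependency analysis. The sketch assumes that the last operand of each instruction is the destination register, that x1 is a constant, that every instruction takes one cycle, and that only read-after-write dependencies matter; none of these is stated on the slides.

```python
# (label, opcode, source operands, destination); destination-last order is an assumption.
program = [
    ("A", "add", ("r0", "r1"), "r2"),
    ("B", "sub", ("r3", "r4"), "r5"),
    ("C", "and", ("r9", "x1"), "r9"),
    ("D", "or", ("r5", "r9"), "r10"),
    ("E", "xor", ("r2", "r10"), "r11"),
]

def raw_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer, deps = {}, {}
    for label, _op, sources, dest in program:
        deps[label] = {last_writer[s] for s in sources if s in last_writer}
        last_writer[dest] = label
    return deps

def schedule_slack(program, latency=1):
    """Return {label: (earliest cycle, latest cycle, slack)}."""
    deps = raw_dependencies(program)
    order = [label for label, *_ in program]
    asap = {}
    for label in order:  # earliest start: one latency after the last producer
        asap[label] = max((asap[d] + latency for d in deps[label]), default=0)
    length = max(asap.values())
    alap = {label: length for label in order}
    for label in reversed(order):  # latest start that still feeds every consumer
        for d in deps[label]:
            alap[d] = min(alap[d], alap[label] - latency)
    return {l: (asap[l], alap[l], alap[l] - asap[l]) for l in order}

slack = schedule_slack(program)
print(slack)
```

Under these assumptions A, B, and C can all start in the first cycle, but only A has slack: its result is not read until E, so it is the instruction that could be delayed or sent to a slower, lower-power unit.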

24 Slack Scheduling
[Figure: block diagram with scheduling logic, re-order buffer, slack bit, and low-power execution units.]

25 Clock Distribution
[Figure: clock distribution network driven from the clock source.]

26 Clock Power
$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$
where
  C_L = total load capacitance
  λ = constant fanout at each stage in distribution network
Clock consumes about 40% of total processor power.
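The clock-power series is easy to evaluate numerically. The sketch below implements the sum as written on this slide; the load, voltage, frequency, fanout, and stage count in the example are hypothetical values, not data for any particular processor.

```python
def clock_power(c_load, v_dd, freq, fanout, stages):
    """P_clk = C_L * V_DD**2 * f * sum over n = 0 .. stages-1 of 1 / fanout**n."""
    series = sum(1.0 / fanout ** n for n in range(stages))
    return c_load * v_dd ** 2 * freq * series

# Hypothetical network: 1 nF final load, 1.2 V, 1 GHz, fanout of 4, 4 buffer stages.
p_clk = clock_power(c_load=1e-9, v_dd=1.2, freq=1e9, fanout=4.0, stages=4)
p_load_only = clock_power(c_load=1e-9, v_dd=1.2, freq=1e9, fanout=4.0, stages=1)
print(p_clk, p_load_only)
```

Because the terms shrink geometrically, the final load dominates: for a fanout λ the whole series is bounded by λ / (λ − 1) times the first term, so the buffer stages add at most a third on top of the load power when λ = 4.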

27 Clock Network Examples
                     Alpha 21064      Alpha 21164      Alpha 21264
Technology           0.75μ CMOS       0.5μ CMOS        0.35μ CMOS
Frequency (MHz)      200              300              600
Total capacitance    12.5nF
Clock load           3.25nF           3.75nF
Clock power          40%              40% (20W)
Max. clock skew      200ps (<10%)     90ps
Alpha 21264: Clock gating used. Total power 80 - 110W
D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

28 Power Reduction Example
Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
Reduce voltage to 1.5V, power (5.3x) = 4.9W
Eliminate FP, power (3x) = 1.6W
Scale 0.75→0.35μ, power (2x) = 0.8W
Reduce clock load, power (1.3x) = 0.6W
Reduce frequency 200→160MHz, power (1.25x) = 0.5W
J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
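The chain of reductions on this slide can be checked with a few lines of arithmetic. The starting power and the reduction factors are the ones listed above; the two derived factors assume that dynamic power scales with the square of the supply voltage and linearly with frequency.

```python
steps = [
    ("Reduce voltage 3.45V to 1.5V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75 to 0.35 micron", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 to 160 MHz", 1.25),
]

power_watts = 26.0  # Alpha 21064 at 200MHz and 3.45V
trace = []
for name, factor in steps:
    power_watts /= factor
    trace.append((name, round(power_watts, 1)))

voltage_factor = (3.45 / 1.5) ** 2   # quadratic dependence on supply voltage
frequency_factor = 200.0 / 160.0     # linear dependence on clock frequency

for name, watts in trace:
    print(f"{name}: {watts} W")
print(round(voltage_factor, 2), frequency_factor)
```

The voltage step alone accounts for most of the saving, which follows from the V² term: the voltage ratio squared is about 5.3, the same factor the slide quotes.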

29 Parallel Architecture
[Figure: a single processor with input and output at frequency f, compared with two processors each running at f/2 between an input and an output at f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = CV²f
Two parallel processors: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = 0.396CV²f

30 Pipeline Architecture
[Figure: a single processor between input and output registers at frequency f, compared with two half-processor stages separated by a register, also clocked at f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = CV²f
Two-stage pipeline: Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = 0.432CV²f
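The power figures for the parallel and pipelined versions follow directly from P = CV²f. The short check below uses the capacitance, voltage, and frequency factors from slides 29 and 30; the helper name is chosen here for illustration.

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the baseline C * V**2 * f."""
    return cap_factor * volt_factor ** 2 * freq_factor

parallel = relative_power(cap_factor=2.2, volt_factor=0.6, freq_factor=0.5)
pipeline = relative_power(cap_factor=1.2, volt_factor=0.6, freq_factor=1.0)
print(parallel, pipeline)
```

Both designs get their saving from the same source, the lower supply voltage made possible by relaxing the timing of each unit; they differ in how much extra capacitance they pay for it.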

31 Approximate Trend
               n-parallel proc.   n-stage pipeline proc.
Capacitance    nC                 C
Voltage        V/n                V/n
Frequency      f/n                f
Power          CV²f/n²            CV²f/n²
Chip area      n times            10-20% increase
G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
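The power row of this table can be recovered by substituting the capacitance, voltage, and frequency entries of each column into P = CV²f. The subscripts below are labels introduced here for the two columns.

```latex
\begin{aligned}
P_{\text{parallel}} &= (nC)\left(\frac{V}{n}\right)^{2}\frac{f}{n} = \frac{C V^{2} f}{n^{2}}, \\
P_{\text{pipeline}} &= C\left(\frac{V}{n}\right)^{2} f = \frac{C V^{2} f}{n^{2}}.
\end{aligned}
```

Both organizations reach the same approximate power, so the chip-area row is what separates them.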

32 For More on Microprocessors
T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.

