CSV881: Low-Power Design Multicore Design for Low Power Vishwani D. Agrawal James J. Danaher Professor Dept. of Electrical and Computer Engineering Auburn University, Auburn, AL 36849 vagrawal@eng.auburn.edu http://www.eng.auburn.edu/~vagrawal Copyright Agrawal, 2011 Lecture 13: Multicore Design
Low-Power Datapath Architecture. Lower the supply voltage. This slows down the circuit, so use parallel computing to gain the speed back. The approach works well when the threshold voltage is also lowered. About 60% reduction in power is obtainable. Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (now Springer), 1995.
A Reference Datapath. [Figure: input register, combinational logic, and output register, clocked by CK; the switched capacitance is Cref.] Supply voltage = $V_{ref}$. Total capacitance switched per cycle = $C_{ref}$. Clock frequency = $f$. Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$.
A Parallel Architecture. $N$ = degree of parallelism. Each copy processes every $N$th input and operates at a reduced supply voltage, $V_N \le V_1 = V_{ref}$. [Figure: the input, arriving at rate $f$, feeds $N$ copies (Copy 1 through Copy N), each an input register plus combinational logic clocked at $f/N$; an N-to-1 multiplexer and an output register clocked at $f$ deliver the output; a multiphase clock generator and mux control block is driven by CK.]
Level Converter: L to H. [Figure: level-converter circuit with input Vin_L, supplies VDDL and VDDH, and output Vout_H; callout: transistors with thicker oxide and longer channels.] Source: N. H. E. Weste and D. Harris, CMOS VLSI Design, Third Edition, Section 12.4.3, Addison-Wesley, 2005.
Level Converter: H to L. [Figure: level-converter circuit with input Vin_H, supply VDDL, and output Vout_L; callout: transistors with thicker oxide and longer channels.] Source: N. H. E. Weste and D. Harris, CMOS VLSI Design, Third Edition, Section 12.4.3, Addison-Wesley, 2005.
Control Signals, N = 4. [Figure: timing diagram of the clock CK and the four control phases, Phase 1 through Phase 4.]
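The round-robin behavior implied by the multiphase control signals can be sketched in software. This is a minimal behavioral illustration, not a circuit model and not part of the original lecture: the names `phase_schedule` and `run_parallel` are hypothetical, and latency and clock timing are deliberately ignored.

```python
def phase_schedule(n_copies, n_cycles):
    """For each fast-clock cycle, return the index of the copy that latches the input."""
    return [cycle % n_copies for cycle in range(n_cycles)]


def run_parallel(inputs, n_copies, func):
    """Behavioral model: each copy handles every n-th input, a mux restores order."""
    # Copy c receives inputs c, c + N, c + 2N, ... (one new input every N cycles).
    per_copy = [inputs[c::n_copies] for c in range(n_copies)]
    # Each copy applies the same combinational function to its own stream.
    results = [[func(x) for x in stream] for stream in per_copy]
    # The N-to-1 multiplexer picks the copies in turn, restoring input order.
    return [results[i % n_copies][i // n_copies] for i in range(len(inputs))]


# Illustrative use with N = 4 and a stand-in "combinational logic" function.
schedule = phase_schedule(4, 6)
outputs = run_parallel(list(range(10)), 4, lambda x: x * x)
print(schedule)
print(outputs)
```

Because each copy sees a new input only once every N cycles, it has N times longer to compute, which is what permits the reduced supply voltage.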
Power.
$P_N = P_{proc} + P_{overhead}$
$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 \frac{f}{N} = (C_{inreg} + C_{comb}) V_N^2 f = C_{ref} V_N^2 f$
$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$
$P_N = [1 + \delta(N - 1)]\, C_{ref} V_N^2 f$
$\frac{P_N}{P_1} = [1 + \delta(N - 1)]\, \frac{V_N^2}{V_{ref}^2}$
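The power ratio above is easy to evaluate numerically. The following is a minimal sketch, not part of the original lecture: the function name `power_ratio` is hypothetical, and the sample values (overhead fraction 0.1, reduced supply 2.9 V against a 5 V reference) are assumptions chosen only for illustration.

```python
def power_ratio(n, v_n, v_ref, delta):
    """Return P_N / P_1 = [1 + delta * (N - 1)] * (V_N / V_ref)^2."""
    if n < 1:
        raise ValueError("degree of parallelism must be at least 1")
    overhead_factor = 1.0 + delta * (n - 1)
    return overhead_factor * (v_n / v_ref) ** 2


# Illustrative values only: two copies at 2.9 V against a 5 V reference,
# with an assumed overhead fraction delta = 0.1 per added copy.
ratio = power_ratio(2, 2.9, 5.0, 0.1)
print(f"P_2 / P_1 = {ratio:.3f}")
```

With a single copy at the reference voltage the function returns 1, as the formula requires.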
Voltage vs. Speed. Delay of a gate: $T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L)(V_{ref} - V_t)^2}$, where $I$ is the saturation current, $k$ is a technology parameter, $W/L$ is the width-to-length ratio of the transistor, and $V_t$ is the threshold voltage. [Figure: normalized gate delay $T$ versus supply voltage for 1.2 μm CMOS, with operating points N=1 at $V_{ref}$ = 5 V, N=2 at $V_2$ = 2.9 V, and N=3 at $V_3$.] The achievable voltage reduction diminishes as the supply gets closer to $V_t$.
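The delay model can be inverted to find the reduced supply voltage for a given degree of parallelism: a copy clocked at $f/N$ may be $N$ times slower, so $V_N$ satisfies $T(V_N) = N\,T(V_{ref})$, which is a quadratic in $V_N$. The sketch below assumes exactly that relation; the names `gate_delay` and `reduced_supply` are hypothetical, and the threshold voltage used in the example is an assumption, since the plot does not state one.

```python
import math


def gate_delay(v, vt):
    """Gate delay up to a constant factor: T proportional to V / (V - Vt)^2."""
    return v / (v - vt) ** 2


def reduced_supply(n, v_ref, vt):
    """Supply voltage at which the gate delay is n times the delay at v_ref.

    Solves k*V^2 - (2*k*vt + 1)*V + k*vt^2 = 0 with k = n * T(v_ref) and
    returns the root above vt, which is the physically meaningful one.
    """
    k = n * gate_delay(v_ref, vt)
    # The discriminant (2*k*vt + 1)^2 - 4*k^2*vt^2 simplifies to 1 + 4*k*vt.
    return (2.0 * k * vt + 1.0 + math.sqrt(1.0 + 4.0 * k * vt)) / (2.0 * k)


# Illustrative sweep; vt = 0.8 V is an assumed threshold, v_ref = 5 V as in the slide.
for n in (1, 2, 3, 4):
    print(n, round(reduced_supply(n, 5.0, 0.8), 3))
```

Each added copy buys a smaller voltage step than the one before it, which is the flattening the plot shows near $V_t$.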
Increasing Multiprocessing. [Figure: $P_N/P_1$ versus $N$ (1 to 12) for 1.2 μm CMOS with $V_{ref}$ = 5 V; curves for $V_t$ = 0.8 V, $V_t$ = 0.4 V, and $V_t$ = 0 V (extreme case).]
Extreme Cases: $V_t = 0$. Delay $T \propto 1/V_{ref}$. For $N$ processing elements the allowed delay is $NT$, so $V_N = V_{ref}/N$ and
$\frac{P_N}{P_1} = [1 + \delta(N - 1)]\, \frac{1}{N^2}$,
which falls off as $1/N$ for large $N$, where the overhead term dominates. For negligible overhead, $\delta \to 0$:
$\frac{P_N}{P_1} \approx \frac{1}{N^2}$
For $V_t > 0$, the power reduction is smaller and there is an optimum value of $N$.
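The existence of an optimum $N$ for $V_t > 0$ can be illustrated by combining the gate-delay model $T \propto V/(V - V_t)^2$ with the power ratio $[1 + \delta(N-1)](V_N/V_{ref})^2$. The sketch below is an illustration under stated assumptions, not the lecture's own computation: the function names are hypothetical, and $\delta = 0.1$ and the search limit of 30 copies are assumed values.

```python
import math


def reduced_supply(n, v_ref, vt):
    """Voltage at which gate delay V/(V - vt)^2 is n times its value at v_ref."""
    k = n * v_ref / (v_ref - vt) ** 2
    return (2.0 * k * vt + 1.0 + math.sqrt(1.0 + 4.0 * k * vt)) / (2.0 * k)


def power_ratio(n, v_ref, vt, delta):
    """P_N / P_1 with the supply lowered just enough to keep throughput."""
    v_n = reduced_supply(n, v_ref, vt)
    return (1.0 + delta * (n - 1)) * (v_n / v_ref) ** 2


def best_parallelism(v_ref, vt, delta, n_max=30):
    """Degree of parallelism in 1..n_max that minimizes P_N / P_1."""
    return min(range(1, n_max + 1), key=lambda n: power_ratio(n, v_ref, vt, delta))


# Assumed values: v_ref = 5 V, vt = 0.8 V, overhead fraction delta = 0.1.
n_opt = best_parallelism(5.0, 0.8, 0.1)
print(n_opt, round(power_ratio(n_opt, 5.0, 0.8, 0.1), 3))
```

With `vt = 0` and `delta = 0` the same functions reduce to the $1/N^2$ limit, so the sketch agrees with the extreme case.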
Example: Multiplier Core. Specification: 200 MHz clock; 15 W dissipation at 5 V; low-voltage operation, $V_{DD} \ge 1.5$ V; relative clock rate = $\frac{(V_{DD} - 0.5)^2}{20.25}$. Problem: integrate the multiplier core on an SoC with a power budget of about 5 W for the multiplier.
A Multicore Design. [Figure: the input, arriving at 200 MHz, feeds Multiplier Core 1 through Multiplier Core 5, each with an input register clocked at 40 MHz; a 5-to-1 multiplexer and an output register clocked at 200 MHz deliver the output; a multiphase clock generator and mux control block is driven by the 200 MHz clock CK.] Core clock frequency = 200/N MHz; N should divide 200.
How Many Cores? For $N$ cores, the clock frequency is $200/N$ MHz and the supply voltage is $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts. Assuming 10% overhead per core, power dissipation = $15\,[1 + 0.1(N - 1)]\,\left(\frac{V_{DDN}}{5}\right)^2$ watts.
Design Tradeoffs

Number of cores, N   Clock (MHz)   Core supply VDDN (V)   Total power (W)
1                    200           5.00                   15.0
2                    100           3.68                   8.94
4                    50            2.75                   5.90
5                    40            2.51                   5.29
8                    25            2.10                   4.50
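The tradeoff table follows directly from the multiplier specification (200 MHz and 15 W at 5 V, relative clock rate $(V_{DD} - 0.5)^2/20.25$, 10% overhead per core). The sketch below recomputes it; the constant and function names are hypothetical. Computed values should agree with the table to within a few hundredths, with small differences plausibly due to rounding the voltage before squaring.

```python
import math

F_REF_MHZ = 200.0   # clock of the single-core design
P_REF_W = 15.0      # dissipation of the single-core design at V_REF
V_REF = 5.0         # reference supply voltage, volts
V_OFFSET = 0.5      # offset in the relative clock rate formula, volts
V_MIN = 1.5         # lowest permitted supply voltage, volts
OVERHEAD = 0.1      # overhead fraction per added core


def core_supply(n):
    """Supply voltage at which a core runs at 1/n of the reference clock rate."""
    return V_OFFSET + math.sqrt((V_REF - V_OFFSET) ** 2 / n)


def total_power(n):
    """Total dissipation in watts for n cores, including overhead."""
    return P_REF_W * (1.0 + OVERHEAD * (n - 1)) * (core_supply(n) / V_REF) ** 2


for n in (1, 2, 4, 5, 8):
    v = core_supply(n)
    feasible = v >= V_MIN
    print(f"N={n}  clock={F_REF_MHZ / n:.0f} MHz  VDD={v:.2f} V  "
          f"P={total_power(n):.2f} W  feasible={feasible}")
```

Among the listed options, only the 8-core design falls under the 5 W budget while keeping the supply above 1.5 V.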
Power Reduction in Processors. Just about everything is used. Hardware methods: voltage reduction for dynamic power; dual-threshold devices for leakage reduction; clock gating and frequency reduction; sleep mode. Architecture: instruction set; hardware organization. Software methods.
Parallel Architecture. [Figure: a single processor clocked at f, compared with two parallel processors each clocked at f/2, with input and output at rate f.] Single processor: capacitance = $C$, voltage = $V$, frequency = $f$, power = $CV^2f$. Two parallel processors: capacitance = $2.2C$, voltage = $0.6V$, frequency = $0.5f$, power = $0.396\,CV^2f$.
Pipeline Architecture. [Figure: a single processor between input and output registers clocked at f, compared with two half-processor stages separated by a register, all clocked at f.] Single processor: capacitance = $C$, voltage = $V$, frequency = $f$, power = $CV^2f$. Two-stage pipeline: capacitance = $1.2C$, voltage = $0.6V$, frequency = $f$, power = $0.432\,CV^2f$.
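Both power figures come from scaling $CV^2f$ by the capacitance, voltage, and frequency factors quoted for each architecture. A minimal check, with the hypothetical helper name `relative_power`:

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to C*V^2*f when C, V, and f are each scaled by a factor."""
    return cap_factor * volt_factor ** 2 * freq_factor


# Two parallel processors: 2.2C, 0.6V, 0.5f.
parallel = relative_power(2.2, 0.6, 0.5)
# Two-stage pipeline: 1.2C, 0.6V, f.
pipeline = relative_power(1.2, 0.6, 1.0)

print(f"parallel: {parallel:.3f} CV^2f")
print(f"pipeline: {pipeline:.3f} CV^2f")
```

The parallel version saves slightly more power here, at the price of a much larger capacitance (and hence area) than the pipelined version.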
Approximate Trend

              n-parallel proc.   n-stage pipeline proc.
Capacitance   nC                 C
Voltage       V/n                V/n
Frequency     f/n                f
Power         CV^2f/n^2          CV^2f/n^2
Chip area     n times            10-20% increase

Source: G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Springer, 1998.
Multicore Processors. [Figure: performance on the SPECint2000 and SPECfp2000 benchmarks, multicore versus single core, 2000 to 2008. Source: Computer, May 2005, p. 12.]
Multicore Processors: References.
D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
S. K. Moore, “Winner: Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.
Cell - Cell Broadband Engine Architecture. Nine-processor chip: 192 Gflops. [Photo, left to right: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzuoki, Sony. © IEEE Spectrum, January 2006.]
Cell’s Nine-Processor Chip. [Figure: chip layout, © IEEE Spectrum, January 2006.] Eight identical processors: f = 5.6 GHz (max), 44.8 Gflops.