
Design and Implementation of the POWER5 Microprocessor


Slide 1: Design and Implementation of the POWER5 Microprocessor

J. Clabes (1), J. Friedrich (1), M. Sweet (1), J. DiLullo (1), S. Chu (1), D. Plass (2), J. Dawson (2), P. Muench (2), L. Powell (1), M. Floyd (1), B. Sinharoy (2), M. Lee (1), M. Goulet (1), J. Wagoner (1), N. Schwarz (1), S. Runyon (1), G. Gorman (1), P. Restle (3), R. Kalla (1), J. McGill (1), S. Dodson (1)

(1) IBM System Group, Austin, TX
(2) IBM System Group, Poughkeepsie, NY
(3) IBM Research, Yorktown Heights, NY

Slide 2: Outline
- Project Objective
- Microarchitecture Changes
- Implementation Overview
- Design Enablers
- Integration Challenges
- Timing and Hardware Performance
- Power Efficiency
- Summary

Slide 3: POWER5™ Chip Objectives
Build on POWER4™ base
- Maintain binary and structural compatibility
- Deliver superior performance
- Enhance and extend SMP scalability
- Provide additional server flexibility
- Enhance reliability, availability, serviceability (RAS) attributes
- Deliver power efficient design

Slide 4: Simultaneous Multithreading in POWER5 Chip
- Each chip appears as a 4-way SMP to software
- Processor resources optimized for enhanced SMT performance
- Software controlled thread priority
- Dynamic feedback of runtime behavior to adjust priority
- Dynamic switching between single and multithreaded mode
[Figure: execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL in single-threaded operation; thread 0 active]

Slide 5: Simultaneous Multithreading in POWER5 Chip
- Each chip appears as a 4-way SMP to software
- Processor resources optimized for enhanced SMT performance
- Software controlled thread priority
- Dynamic feedback of runtime behavior to adjust priority
- Dynamic switching between single and multithreaded mode
[Figure: execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL in simultaneous multithreading; thread 0 and thread 1 active]
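The "4-way SMP" figure follows from the core and thread counts given elsewhere in the deck (two SMT cores per chip, two threads shown per core). A minimal sketch of that arithmetic is given here; the function name is hypothetical, and the single-threaded result assumes each core then presents one logical processor, which the slides do not state explicitly.

```python
# Counts taken from the slides: two SMT cores per chip, threads 0 and 1 per core.
CORES_PER_CHIP = 2
THREADS_PER_CORE_SMT = 2


def logical_processors(cores: int, smt_enabled: bool) -> int:
    """Number of logical processors software would see (illustrative helper)."""
    # Assumption: in single-threaded mode each core exposes exactly one thread.
    threads_per_core = THREADS_PER_CORE_SMT if smt_enabled else 1
    return cores * threads_per_core


if __name__ == "__main__":
    print("SMT mode:", logical_processors(CORES_PER_CHIP, smt_enabled=True))
    print("Single-threaded mode:", logical_processors(CORES_PER_CHIP, smt_enabled=False))
```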

Slide 6: Modifications to POWER4 System Structure
[Figure: block diagram of the system structure; labels: P, P, L2, Fab Ctl, L3 Cntrl, Mem Ctl, L3, Memory]
- Reduced L3 latency
- Faster access to memory
- Larger SMPs
- Number of chips cut in half

Slide 7: POWER5 Chip Overview
- Technology: 130 nm lithography, SOI, Cu wiring
- 276M transistors
- 389 mm² die size
- Two 8-way superscalar SMT cores
- Memory subsystem with 1.9 MB L2 cache, L3 directory and memory controller on chip
- Extensive RAS support
- High-speed elastic bus interface

Slide 8: ERAT and D-Cache Array Design Changes
- System performance vs. area trade-off
- ERAT: fully associative, implemented as Sum-Address CAM
- D-cache: 4-way associativity
- Result: 2-3% performance gain with improved wireability at 5% area cost
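The slide does not describe the Sum-Address CAM circuit itself. As a rough illustration of the general idea behind sum-addressed structures, the sketch below shows the well-known carry-free test for whether base + offset equals a stored tag, so a match can be detected without first completing the addition. The function name and bit width are hypothetical; this is not a description of the POWER5 implementation.

```python
def sum_matches_tag(a: int, b: int, k: int, width: int) -> bool:
    """True when (a + b) mod 2**width == k, using only bitwise operations."""
    mask = (1 << width) - 1
    a, b, k = a & mask, b & mask, k & mask
    # Carry-in each bit would need for its sum bit to equal the tag bit.
    required_carry_in = a ^ b ^ k
    # Carry-out each bit would produce if it received that required carry-in.
    produced_carry_out = (a & b) | ((a | b) & ~k)
    # Match when every produced carry equals the next bit's required carry,
    # and bit 0 requires no carry-in (the shift leaves bit 0 clear).
    return required_carry_in == ((produced_carry_out << 1) & mask)


if __name__ == "__main__":
    # Exhaustive comparison against ordinary addition for a small width.
    width = 5
    for a in range(1 << width):
        for b in range(1 << width):
            for k in range(1 << width):
                expected = ((a + b) & ((1 << width) - 1)) == k
                assert sum_matches_tag(a, b, k, width) == expected
    print("carry-free match agrees with a + b == k for all 5-bit inputs")
```

Each bit position is evaluated independently of the others, which is what makes this style of compare attractive in a CAM: no carry has to ripple across the word before the match line resolves.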

Slide 9: L2 and I-Cache Array Design Changes
- SMT drives thread level parallelism
- Improved associativity on L2 cache (10-way) and I-cache (2-way)
- L2 access shifted by ½ cycle, avoiding extensive array redesign
- High speed latch with compare on I-cache access path

Slide 10: 2nd Generation Elastic Interface Design
- EI-II performance improvements
- Runs over 2 GHz in laboratory: head-room on IO frequencies
  - Allows bus frequencies to continue scaling with processor frequency
- Optimizes Vref at T0 by level forwarding
- Maintains guardband via periodic self calibration

Slide 11: Implementation of Engineered Buses and IO Wires
- Pre-planned and custom routed buses
  - ~50K engineered wires at chip level
  - ~2X of POWER4 chip
- Custom buffer insertion process
  - ~250K buffer/inverters
  - 2.5X of POWER4 chip
- Wire and bus characterization
  - Noise tolerance
  - Impact of coupling on delay
  - Inductance analysis

Slide 12: Implementation of Engineered Buses and IO Wires
- Pre-planned and custom routed buses
  - ~50K engineered wires at chip level
  - ~2X of POWER4 chip
- Custom buffer insertion process
  - ~250K buffer/inverters
  - 2.5X of POWER4 chip
- Wire and bus characterization
  - Noise tolerance
  - Impact of coupling on delay
  - Inductance analysis
- IO performance driven routing
  - 5 Ω resistance limit on chip
  - Fully shielded (single ended design)

Slide 13: Dual Clock Distribution

Main Clock Grid (91 buffers)
- 1 full chip buffer
- 1 central chip buffer
- 3 half chip buffers
- 6 quadrant buffers
- 80 sector buffers

Memory Clock Domain (4 buffers)
- 1 central chip buffer
- 3 sector buffers
- asynchronous to main mesh

Clock metrics (the slide lists these twice, differing only in switching power):
- total nominal skew: 18 ps
- local skew: 9 ps
- slew rate from 30 - 70%: 52 - 71 ps
- latency PLL to LCB: 777 ps
- duty cycle control: ±25 ps
- switching power @ 1.08 V and 2 GHz: 10.5 W
- switching power @ 1.08 V and 1.8 GHz: 9.5 W
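The two switching-power figures can be cross-checked against the usual first-order dynamic power model, P ≈ C_eff · V² · f. The sketch below is only a consistency check under that assumed model (activity factor folded into a hypothetical effective capacitance C_eff); the slide itself does not state how the numbers were obtained. It also re-adds the buffer counts listed for each clock domain.

```python
def effective_capacitance(power_w: float, voltage_v: float, freq_hz: float) -> float:
    """C_eff implied by P = C_eff * V^2 * f (assumed first-order model)."""
    return power_w / (voltage_v ** 2 * freq_hz)


# Figures from the slide: both at 1.08 V.
c_at_2ghz = effective_capacitance(10.5, 1.08, 2.0e9)
c_at_1p8ghz = effective_capacitance(9.5, 1.08, 1.8e9)

# If power scaled purely with frequency, 10.5 W at 2 GHz would become:
predicted_at_1p8ghz = 10.5 * (1.8e9 / 2.0e9)

# Buffer counts as listed on the slide.
main_grid_buffers = 1 + 1 + 3 + 6 + 80   # full chip, central, half chip, quadrant, sector
memory_domain_buffers = 1 + 3            # central, sector

if __name__ == "__main__":
    print(f"C_eff from 2.0 GHz point: {c_at_2ghz * 1e9:.2f} nF")
    print(f"C_eff from 1.8 GHz point: {c_at_1p8ghz * 1e9:.2f} nF")
    print(f"Frequency-scaled estimate at 1.8 GHz: {predicted_at_1p8ghz:.2f} W (slide: 9.5 W)")
    print(f"Main grid buffers: {main_grid_buffers}, memory domain buffers: {memory_domain_buffers}")
```

Under this model the two data points imply nearly the same effective capacitance, which is consistent with the quoted powers differing mainly because of frequency.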

Slide 14: Chip Timing and Shmoo Plot
- Timing Closure
  - Sort mode (functional/scan/lbist)
  - Early mode (functional/scan)
- Timing Model Analysis
  - 690K scannable M/S latches
  - 180K non-scan mid-cycle latches
  - 6.75M timing checks
  - TAT 19 hours
[Figure: Shmoo plot, frequency (GHz) vs. voltage (V) at 25 °C, with pass and fail regions]
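For readers unfamiliar with the term, a shmoo plot is a pass/fail grid over two swept operating parameters, here voltage and frequency. The sketch below builds such a grid as text. The pass criterion is a made-up linear model chosen purely for illustration; it is not POWER5 measurement data, and all names and sweep values are hypothetical.

```python
from typing import Callable, List, Sequence


def hypothetical_max_frequency_ghz(voltage_v: float) -> float:
    # Hypothetical linear model for illustration only, not measured data.
    return 2.0 * voltage_v - 0.3


def passes(voltage_v: float, frequency_ghz: float) -> bool:
    """A point passes when the requested frequency is within the modeled limit."""
    return frequency_ghz <= hypothetical_max_frequency_ghz(voltage_v)


def shmoo_rows(voltages: Sequence[float], frequencies: Sequence[float],
               test: Callable[[float, float], bool]) -> List[str]:
    """One text row per frequency (highest first); 'P' = pass, '.' = fail."""
    rows = []
    for f in sorted(frequencies, reverse=True):
        cells = "".join("P" if test(v, f) else "." for v in voltages)
        rows.append(f"{f:4.1f} GHz  {cells}")
    return rows


if __name__ == "__main__":
    voltages = [0.9, 1.0, 1.1, 1.2, 1.3]      # hypothetical sweep, in volts
    frequencies = [1.4, 1.6, 1.8, 2.0, 2.2]   # hypothetical sweep, in GHz
    for row in shmoo_rows(voltages, frequencies, passes):
        print(row)
```

On real hardware the `test` callback would run a functional pattern at each voltage/frequency point instead of evaluating a formula.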

Slide 15: Power Efficient Design Implementation
- DC power mitigation
  - Leverage triple Vt technology
    - Decrease low Vt usage by 90%
    - Increase high Vt usage by 30%
  - Leverage triple Tox technology
    - Thick Tox usage for decoupling capacitors
- AC power mitigation
  - Minimal usage of dynamic circuits
  - Reduce loading on clock mesh
  - Incorporation of dynamic clock gating

Slide 16: Dynamic Clock Gating Implementation
[Figure: clock gating circuit drawn for two cases, cycle-to-cycle clock control (~1/2 cycle path) and cycle-predict clock control (~full cycle path). Labels: scan-only latches, C2 latches, gating logic, global disable, local disable, dynamic stop enable, mesh clock, gated c1 clock, MS latch]
- Approach allows aggressive use of clock gating to conserve power
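The slide names the gating signals but not their exact logic. The behavioral sketch below shows one plausible reading, under these loudly stated assumptions: "dynamic stop enable" requests that the local clock be stopped, and either "global disable" or "local disable" turns the gating feature off so that the mesh clock always passes through. The real circuit may differ, and the function name is hypothetical.

```python
def gated_c1_clock(mesh_clock: bool, dynamic_stop_enable: bool,
                   global_disable: bool, local_disable: bool) -> bool:
    """Behavioral model of a gated clock (assumed semantics, see lead-in)."""
    # Assumption: either disable signal switches clock gating off entirely.
    gating_allowed = not (global_disable or local_disable)
    # Assumption: the clock is stopped only when gating is allowed and requested.
    stop_clock = dynamic_stop_enable and gating_allowed
    return mesh_clock and not stop_clock


if __name__ == "__main__":
    cases = [
        (True, False, False, False),  # no stop request: clock passes
        (True, True, False, False),   # stop requested: clock held low
        (True, True, True, False),    # gating globally disabled: clock passes
        (True, True, False, True),    # gating locally disabled: clock passes
    ]
    for mesh, stop, g_dis, l_dis in cases:
        out = gated_c1_clock(mesh, stop, g_dis, l_dis)
        print(f"mesh={mesh} stop={stop} global_dis={g_dis} local_dis={l_dis} -> c1={out}")
```

This combinational model ignores timing. In hardware the stop control would be latched so that the gated clock cannot glitch, which is presumably why the slide distinguishes ~1/2 cycle and ~full cycle control paths.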

Slide 17: Improved Power Efficiency
- AC power reduction by ≥ 25%
- DC power reduction by ≥ 50%
- Total power reduction by > 33% for numerically intensive workloads
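The total reduction is a weighted average of the AC and DC reductions, weighted by each component's share of the original power. The slide gives only lower bounds and no AC/DC split, so the shares swept below are hypothetical; the sketch merely shows how the three figures relate when the bounds are taken at equality.

```python
def total_reduction(ac_share: float, ac_reduction: float, dc_reduction: float) -> float:
    """Fractional total power reduction for a given AC share of original power."""
    dc_share = 1.0 - ac_share
    return ac_share * ac_reduction + dc_share * dc_reduction


if __name__ == "__main__":
    AC_REDUCTION = 0.25  # slide: at least 25%
    DC_REDUCTION = 0.50  # slide: at least 50%
    # Hypothetical AC shares of the original total power.
    for ac_share in (0.50, 0.60, 0.68, 0.80):
        reduction = total_reduction(ac_share, AC_REDUCTION, DC_REDUCTION)
        print(f"AC share {ac_share:.0%}: total reduction {reduction:.1%}")
```

With these minimum reductions, a total above 33% corresponds to an AC share below roughly 68% of the original power; larger individual reductions would relax that bound.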

Slide 18: Thermal Protection
[Figure: thermal protection behavior, with over-temperature and recovery-temperature thresholds marked]
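Two thresholds, an over-temperature trip point and a lower recovery temperature, suggest a hysteresis scheme. The sketch below models that pattern under stated assumptions: protection engages at the upper threshold and releases only once the temperature falls to the lower one. The threshold values, class name, and the idea that the protective action is "throttling" are hypothetical; the slide specifies none of them.

```python
class ThermalProtection:
    """Two-threshold (hysteresis) protection state machine, illustrative only."""

    def __init__(self, over_temperature_c: float, recovery_temperature_c: float):
        if recovery_temperature_c >= over_temperature_c:
            raise ValueError("recovery temperature must be below over-temperature")
        self.over_temperature_c = over_temperature_c
        self.recovery_temperature_c = recovery_temperature_c
        self.protecting = False

    def update(self, temperature_c: float) -> bool:
        """Feed one temperature sample; return True while protection is active."""
        if not self.protecting and temperature_c >= self.over_temperature_c:
            self.protecting = True    # trip at the over-temperature threshold
        elif self.protecting and temperature_c <= self.recovery_temperature_c:
            self.protecting = False   # release only after cooling to recovery
        return self.protecting


if __name__ == "__main__":
    # Hypothetical thresholds and samples, in degrees Celsius.
    guard = ThermalProtection(over_temperature_c=95.0, recovery_temperature_c=85.0)
    for sample in [80.0, 96.0, 90.0, 86.0, 84.0, 90.0]:
        print(f"{sample:5.1f} C -> protecting={guard.update(sample)}")
```

The gap between the two thresholds keeps the protection from toggling rapidly when the temperature hovers near a single trip point.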

Slide 19: Summary
- First dual core SMT microprocessor
- Extended SMP to 64-way
- Operating in laboratory
- Power dynamically managed with no performance penalty
- Implementation permits future technology scalability from circuit and power perspective
- Innovative approach leveraging technology with system focus for high performance in a power efficient design

