Design and Implementation of the POWER5 Microprocessor


1 Design and Implementation of the POWER5 Microprocessor
J. Clabes¹, J. Friedrich¹, M. Sweet¹, J. DiLullo¹, S. Chu¹, D. Plass², J. Dawson², P. Muench², L. Powell¹, M. Floyd¹, B. Sinharoy², M. Lee¹, M. Goulet¹, J. Wagoner¹, N. Schwarz¹, S. Runyon¹, G. Gorman¹, P. Restle³, R. Kalla¹, J. McGill¹, S. Dodson¹
¹ IBM System Group, Austin, TX
² IBM System Group, Poughkeepsie, NY
³ IBM Research, Yorktown Heights, NY

2 Outline
• Project Objective
• Microarchitecture Changes
• Implementation Overview
• Design Enablers
• Integration Challenges
• Timing and Hardware Performance
• Power Efficiency
• Summary

3 POWER5™ Chip Objectives
Build on POWER4™ base:
• Maintain binary and structural compatibility
• Deliver superior performance
• Enhance and extend SMP scalability
• Provide additional server flexibility
• Enhance reliability, availability, serviceability (RAS) attributes
• Deliver power efficient design

4 Simultaneous Multithreading in POWER5 Chip
• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance
• Software controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single and multithreaded mode
[Figure: execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL in single-threaded operation, thread 0 active]

5 Simultaneous Multithreading in POWER5 Chip
• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance
• Software controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single and multithreaded mode
[Figure: execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL in simultaneous multithreading, thread 0 and thread 1 active]

6 Modifications to POWER4 System Structure
• Reduced L3 latency
• Faster access to memory
• Larger SMPs
• Number of chips cut in half
[Figure: system structure diagram with blocks P, P, L2, L3 Cntrl, Mem Ctl, Fab Ctl, Memory]

7 POWER5 Chip Overview
• Technology: 130 nm lithography, SOI, Cu wiring
• 276M transistors
• 389 mm² die size
• Two 8-way superscalar SMT cores
• Memory subsystem with 1.9 MB L2-Cache, L3 directory and memory controller on chip
• Extensive RAS support
• High-speed elastic bus interface

8 ERAT and D-Cache Array Design Changes
• System performance vs. area trade-off
• ERAT: Fully associative, implemented as Sum-Address CAM
• D-cache: 4-way associativity
• Result: 2-3% performance gain with improved wireability at 5% area cost
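The slide says only that the ERAT is implemented as a sum-address CAM. As a hedged illustration of the general idea behind sum-addressed matching (not a description of IBM's circuit), the sketch below tests whether a base plus an offset equals a stored tag using bitwise logic alone, with no carry chain. The function name `sum_matches` and the 8-bit width are assumptions made for the example.

```python
def sum_matches(a: int, b: int, k: int, width: int) -> bool:
    """Return True when (a + b) mod 2**width == k, using only bitwise logic."""
    mask = (1 << width) - 1
    # Carry-in each bit would need for a_i + b_i + cin_i to produce k_i.
    required_cin = (a ^ b ^ k) & mask
    # Carry-out each bit would generate if it received that required carry-in.
    produced_cout = ((a & b) | ((a | b) & ~k)) & mask
    # The sum matches when every bit's required carry-in equals the carry-out
    # of the bit below it (bit 0 requires a carry-in of 0).
    return required_cin == ((produced_cout << 1) & mask)


if __name__ == "__main__":
    WIDTH = 8  # hypothetical width, chosen only to keep the check small
    for a in range(1 << WIDTH):
        for b in range(1 << WIDTH):
            k = (a + b) & ((1 << WIDTH) - 1)
            assert sum_matches(a, b, k, WIDTH)
            assert not sum_matches(a, b, k ^ 1, WIDTH)
    print("carry-free match agrees with ordinary addition")
```

Because every bit position is checked independently, a hardware version of this test can evaluate all entries in parallel without waiting for an adder, which is the usual motivation for sum-addressed structures.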

9 L2 and I-Cache Array Design Changes
• SMT drives thread level parallelism
• Improved associativity on L2-Cache (10-way) and I-Cache (2-way)
• L2 access shifted by ½ cycle, avoiding extensive array redesign
• High speed latch with compare on I-Cache access path

10 2nd Generation Elastic Interface Design
• EI-II performance improvements
• Runs over 2 GHz in laboratory: headroom on IO frequencies
  – Allows bus frequencies to continue scaling with processor frequency
• Optimizes Vref at T0 by level forwarding
• Maintains guardband via periodic self calibration

11 Implementation of Engineered Buses and IO Wires
• Pre-planned and custom routed buses
  – ~50K engineered wires at chip level
  – ~2X of POWER4 chip
• Custom buffer insertion process
  – ~250K buffers/inverters
  – 2.5X of POWER4 chip
• Wire and bus characterization
  – Noise tolerance
  – Impact of coupling on delay
  – Inductance analysis

12 Implementation of Engineered Buses and IO Wires
• Pre-planned and custom routed buses
  – ~50K engineered wires at chip level
  – ~2X of POWER4 chip
• Custom buffer insertion process
  – ~250K buffers/inverters
  – 2.5X of POWER4 chip
• Wire and bus characterization
  – Noise tolerance
  – Impact of coupling on delay
  – Inductance analysis
• IO performance driven routing
  – 5 Ω resistance limit on chip
  – Fully shielded (single ended design)

13 Dual Clock Distribution
Clock parameters (listed twice on the slide; the two listings differ only in the switching line):
• Total nominal skew: 18 ps
• Local skew: 9 ps
• Slew rate: from […]% […] ps
• Latency, PLL to LCB: 777 ps
• Duty cycle control: ±25 ps
• Switching at 1.08 V and 2 GHz: 10.5 W
• Switching at 1.08 V and 1.8 GHz: 9.5 W
Main Clock Grid (91 buffers):
• 1 full chip buffer
• 1 central chip buffer
• 3 half chip buffers
• 6 quadrant buffers
• 80 sector buffers
Memory Clock Domain (4 buffers):
• 1 central chip buffer
• 3 sector buffers
• Asynchronous to main mesh
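As a quick arithmetic check on the figures above, the following sketch tallies the buffer counts for each clock domain and expresses the 18 ps nominal skew as a fraction of the clock cycle. The dictionary and function names are illustrative only, and treating skew as a fraction of cycle time is a convenience for the example, not a metric quoted on the slide.

```python
# Buffer counts as listed on the slide.
MAIN_GRID_BUFFERS = {
    "full chip": 1,
    "central chip": 1,
    "half chip": 3,
    "quadrant": 6,
    "sector": 80,
}
MEMORY_DOMAIN_BUFFERS = {"central chip": 1, "sector": 3}


def skew_fraction(skew_ps: float, freq_ghz: float) -> float:
    """Return clock skew as a fraction of one clock period."""
    cycle_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps period
    return skew_ps / cycle_ps


if __name__ == "__main__":
    print("main grid buffers:", sum(MAIN_GRID_BUFFERS.values()))
    print("memory domain buffers:", sum(MEMORY_DOMAIN_BUFFERS.values()))
    for freq_ghz in (2.0, 1.8):
        share = skew_fraction(18.0, freq_ghz)
        print(f"18 ps skew at {freq_ghz} GHz is {share:.1%} of the cycle")
```

The tallies reproduce the 91 and 4 buffer totals in the slide headings.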

14 Chip Timing and Shmoo Plot
• Timing Closure
  – Sort mode (functional/scan/lbist)
  – Early mode (functional/scan)
• Timing Model Analysis
  – 690K scannable M/S latches
  – 180K non-scan mid-cycle latches
  – 6.75M timing checks
  – TAT 19 hours
[Figure: Shmoo plot of frequency (GHz) versus voltage (V) at 25 ºC, showing pass and fail regions]

15 Power Efficient Design Implementation
• DC power mitigation
  – Leverage triple Vt technology
    · Decrease low Vt usage by 90%
    · Increase high Vt usage by 30%
  – Leverage triple Tox technology
    · Thick Tox usage for decoupling capacitors
• AC power mitigation
  – Minimal usage of dynamic circuits
  – Reduce loading on clock mesh
  – Incorporation of dynamic clock gating

16 Dynamic Clock Gating Implementation
[Figure: clock-gating diagrams for cycle-to-cycle clock control (~1/2 cycle path) and cycle-predict clock control (~full cycle path). Labels: scan-only latches, C2 latches, gating logic, global disable, local disable, mesh clock, gated c1 clock, dynamic stop, enable, MS latch]
• Approach allows aggressive use of clock gating to conserve power
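The slide names the signals involved in clock gating but not how they combine. The sketch below is a minimal behavioral model under an assumed rule: the mesh clock reaches a latch bank only when `enable` is set and neither disable is asserted. The signal polarity, the omission of the dynamic stop signal, and both function names are assumptions for illustration, not the POWER5 logic.

```python
from typing import Iterable


def gated_clock(mesh_clock: bool, enable: bool,
                global_disable: bool = False,
                local_disable: bool = False) -> bool:
    """Assumed gating rule: pass the clock only when enabled and not disabled."""
    return mesh_clock and enable and not (global_disable or local_disable)


def count_latch_clocks(enables: Iterable[bool],
                       global_disable: bool = False,
                       local_disable: bool = False) -> int:
    """Count the cycles in which the gated clock actually pulses the latches."""
    return sum(
        gated_clock(True, enable, global_disable, local_disable)
        for enable in enables
    )


if __name__ == "__main__":
    # Hypothetical enable pattern: the unit does useful work in 3 of 8 cycles.
    enables = [True, False, False, True, False, False, False, True]
    delivered = count_latch_clocks(enables)
    print(f"clock pulses delivered: {delivered} of {len(enables)}")
```

In this toy model the latch bank is clocked in 3 of 8 cycles, which illustrates why gating idle logic reduces switching (AC) power.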

17 Improved Power Efficiency
• AC power reduction by ≥ 25%
• DC power reduction by ≥ 50%
• Total power reduction by > 33% for a numerically intensive workload
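The three percentages above can be related by a simple weighted average. The sketch below treats the slide's bounds as exact values purely for illustration and asks what AC share of the original power would make a 25% AC cut and a 50% DC cut combine to a 33% total cut. The slide does not state the AC/DC split, so the result is an inference under that assumption, and the function names are hypothetical.

```python
def total_reduction(ac_share: float, ac_cut: float, dc_cut: float) -> float:
    """Total fractional power reduction for a given AC share of original power."""
    return ac_share * ac_cut + (1.0 - ac_share) * dc_cut


def implied_ac_share(total_cut: float, ac_cut: float, dc_cut: float) -> float:
    """Solve total_cut = s * ac_cut + (1 - s) * dc_cut for the AC share s."""
    return (dc_cut - total_cut) / (dc_cut - ac_cut)


if __name__ == "__main__":
    # Bounds from the slide, treated as equalities for this illustration only.
    share = implied_ac_share(total_cut=0.33, ac_cut=0.25, dc_cut=0.50)
    print(f"implied AC share of original power: {share:.0%}")
    print(f"check, total reduction: {total_reduction(share, 0.25, 0.50):.0%}")
```

Under these assumptions the three figures are mutually consistent when roughly two thirds of the original power is AC (switching) power.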

18 Thermal Protection
[Figure labels: over-temperature, recovery-temperature]
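The slide shows only two named thresholds, over-temperature and recovery-temperature. Two thresholds of this kind usually indicate hysteresis: protection engages at the higher threshold and releases only after the temperature falls to the lower one. The sketch below models that behavior; the class name `ThermalGuard` and the 85 °C and 75 °C values are hypothetical and do not come from the presentation.

```python
class ThermalGuard:
    """Two-threshold (hysteresis) thermal protection state machine."""

    def __init__(self, over_temp_c: float, recovery_temp_c: float) -> None:
        if recovery_temp_c >= over_temp_c:
            raise ValueError("recovery temperature must be below over-temperature")
        self.over_temp_c = over_temp_c
        self.recovery_temp_c = recovery_temp_c
        self.protecting = False

    def update(self, temp_c: float) -> bool:
        """Feed one temperature reading; return True while protection is active."""
        if not self.protecting and temp_c >= self.over_temp_c:
            self.protecting = True   # crossed the over-temperature threshold
        elif self.protecting and temp_c <= self.recovery_temp_c:
            self.protecting = False  # cooled down to the recovery threshold
        return self.protecting


if __name__ == "__main__":
    guard = ThermalGuard(over_temp_c=85.0, recovery_temp_c=75.0)  # hypothetical
    readings = [70.0, 86.0, 80.0, 76.0, 74.0, 80.0]
    print([guard.update(t) for t in readings])
    # -> [False, True, True, True, False, False]
```

The gap between the two thresholds keeps the protection mechanism from toggling rapidly when the temperature hovers near a single trip point.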

19 Summary
• First dual core SMT microprocessor
• Extended SMP to 64-way
• Operating in laboratory
• Power dynamically managed with no performance penalty
• Implementation permits future technology scalability from circuit and power perspective
• Innovative approach leveraging technology with system focus for high performance in a power efficient design