Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments Houman Homayoun National Science.

Presentation transcript:

Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments
Houman Homayoun
National Science Foundation Computing Innovation Fellow
Department of Computer Science, University of California San Diego

Copyright © 2010 Houman Homayoun, University of California San Diego

Slide 2: Outline – Multiple Sleep Mode
Brief overview of state-of-the-art superscalar processors
Introducing the idea of multiple sleep mode design
Architectural control of multiple sleep modes
Results
Conclusions

Slide 3: Superscalar Architecture
[Figure: superscalar pipeline block diagram. Labeled stages: Fetch, Decode, Rename, Dispatch, Issue, Execute, Write-Back. Labeled structures: Instruction Queue, Reservation Station, Logical Register File, Physical Register File, ROB, Load Store Queue, F.U.]

Slide 4: On-chip SRAMs+CAMs and Power
On-chip SRAMs+CAMs in high-performance processors are large: Branch Predictor, Reorder Buffer, Instruction Queue, Instruction/Data TLB, Load and Store Queue, L1 Data Cache, L1 Instruction Cache, L2 Cache.
They take more than 60% of the chip budget.
They dissipate a significant portion of power via leakage.
[Figure: Pentium M processor die photo, courtesy of intel.com]

Slide 5: Techniques to Address Leakage in SRAM+CAM
Circuit:
Gated-Vdd, Gated-Vss
Voltage Scaling (DVFS)
ABB-MTCMOS
Forward Body Biasing (FBB), RBB
Sleepy Stack
Sleepy Keeper
Architecture:
Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, read the tag first
Drowsy Cache: keeps cache lines in a low-power state, with data retention
Cache Decay: evicts lines not used for a while, then powers them down
These apply DVS, Gated-Vdd and Gated-Vss to the memory cell; there is a lot of architectural support for doing that.

Slide 6: Sleep Transistor Stacking Effect
Subthreshold current: inverse exponential function of the threshold voltage.
Stacking transistor N with slpN: the source-to-body voltage (VM) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off.
Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability.
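To make the "inverse exponential" dependence concrete, here is a minimal sketch based on the textbook first-order subthreshold model, in which the current scales as exp((Vgs - Vth) / (n * vT)). The slide does not give a formula or device parameters, so the function names and the ideality factor n = 1.5 are assumptions chosen only for illustration.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELECTRON_CHARGE = 1.602176634e-19   # C


def thermal_voltage(temp_kelvin: float = 300.0) -> float:
    """Thermal voltage vT = kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def leakage_ratio(delta_vth: float, n: float = 1.5,
                  temp_kelvin: float = 300.0) -> float:
    """Factor by which subthreshold current changes when the effective
    threshold voltage rises by delta_vth volts (first-order model).

    n is the subthreshold ideality factor; 1.5 is an assumed value,
    not one taken from the slides.
    """
    return math.exp(-delta_vth / (n * thermal_voltage(temp_kelvin)))


if __name__ == "__main__":
    for dv in (0.0, 0.05, 0.1, 0.2):
        print(f"delta Vth = {dv:.2f} V -> leakage x {leakage_ratio(dv):.4f}")
```

Under this model, every fixed increase in threshold voltage multiplies the leakage by the same factor, which is why a modest rise in VM from stacking can cut leakage substantially.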

Slide 7: Wakeup Latency
To benefit the most from the leakage savings of stacking sleep transistors, keep the bias voltage of the NMOS sleep transistor as low as possible (and that of the PMOS as high as possible).
Drawback: impact on the wakeup latency of the circuit (sleep transistor wakeup delay + sleep signal propagation delay).
Control the gate voltage of the sleep transistors.
Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (VM), which means:
a reduction in the circuit wakeup delay overhead
a reduction in leakage power savings

Slide 8: Wakeup Delay vs. Leakage Power Reduction
Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead.
There is a trade-off between the wakeup overhead and the leakage power saving.

Slide 9: Multiple Sleep Modes Specifications
Wakeup delay varies from 1 to more than 10 processor cycles (2.2 GHz).
Large wakeup power overhead for large SRAMs.
Need to find periods of infrequent access.
[Figure: on-chip SRAM multiple sleep mode normalized leakage power savings]

Slide 10: Reducing Leakage in SRAM Peripherals
Maximize the leakage reduction: put the SRAM into ultra low power mode. This adds a few cycles to the SRAM access latency and significantly reduces performance.
Minimize performance degradation: put the SRAM into the basic low power mode. This requires near-zero wakeup overhead but gives no noticeable leakage power reduction.

Slide 11: Motivation for Dynamically Controlling Sleep Mode
Large leakage reduction benefit: ultra and aggressive low power modes.
Low performance impact benefit: basic-lp mode.
Periods of frequent access: basic-lp mode.
Periods of infrequent access: ultra and aggressive low power modes.
Dynamically adjust the sleep power mode.

Slide 12: Architectural Motivations
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued.
When dependent instructions cannot issue, performance is lost.
At the same time, energy is lost as well!
This is an opportunity to save energy.

Slide 13: Multiple Sleep Mode Control Mechanism
An L2 cache miss or multiple DL1 misses trigger power mode transitioning.
The general algorithm may not deliver optimal results for all units.
The algorithm was modified for individual on-chip SRAM-based units to maximize the leakage reduction at NO performance cost.
[Figure: general state machine to control power mode transitions]
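The state-machine figure itself is not part of the transcript, so the following is only a hedged sketch of how a miss-triggered mode selector could look. The four mode names come from the slides. The transition rules, the `issue_stalled` input, and the threshold of three pending DL1 misses (borrowed from Scenario II in the resource adaptation slides) are assumptions, not the author's actual controller.

```python
from enum import Enum


class Mode(Enum):
    BASIC_LP = "basic-lp"
    LP = "lp"              # named in the slides; not used by this simplified selector
    AGGR_LP = "aggr-lp"
    ULTRA_LP = "ultra-lp"


# Assumption: "multiple DL1 misses" is read as three or more pending misses.
DL1_MISS_TRIGGER = 3


def select_mode(l2_miss_pending: bool, pending_dl1_misses: int,
                issue_stalled: bool) -> Mode:
    """Pick a sleep mode for an SRAM unit's peripherals (hypothetical rules).

    - No trigger: stay in basic-lp (near-zero wakeup overhead).
    - Trigger seen but issue not yet stalled ("pre-stall"): aggr-lp.
    - Trigger seen and issue stalled ("stall"): ultra-lp.
    """
    triggered = l2_miss_pending or pending_dl1_misses >= DL1_MISS_TRIGGER
    if not triggered:
        return Mode.BASIC_LP
    return Mode.ULTRA_LP if issue_stalled else Mode.AGGR_LP


if __name__ == "__main__":
    print(select_mode(False, 0, False).value)
    print(select_mode(True, 0, False).value)
    print(select_mode(True, 0, True).value)
```

A per-unit variant, as the slide suggests, would replace these rules with ones tuned to each structure's access pattern.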

Slide 14: Branch Predictor
1 out of every 9 fetched instructions in integer benchmarks, and 1 out of every 63 fetched instructions in floating-point benchmarks, accesses the branch predictor.
Always putting the branch predictor in a deep low power mode (lp, ultra-lp or aggr-lp) and waking it up on access causes noticeable performance degradation for some benchmarks.

Slide 15: Observation: Branch Predictor Access Pattern
Within a benchmark there is significant variation in Instructions Per Branch (IPB).
Once the IPB drops (increases) significantly, it may remain low (high) for a long period of time.
[Figure: distribution of the number of branches per 512-instruction interval (over 1M cycles)]

Slide 16: Branch Predictor Peripherals Leakage Control
The high IPB period can be identified once the first low IPB period is detected.
The number of fetched branches is counted every 512 cycles. Once the number of branches is found to be less than a certain threshold (24 in this work), a high IPB period is identified.
The IPB is then predicted to remain high for the next twenty 512-cycle intervals (10K cycles).
The branch predictor peripherals transition from basic-lp mode to lp mode when a high IPB period is identified.
During pre-stall and stall periods the branch predictor peripherals transition to aggr-lp and ultra-lp mode, respectively.
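The interval-counting rule above can be sketched as a small detector. The 512-cycle interval, the threshold of 24 branches, and the twenty-interval hold come from the slide. The class and method names are hypothetical, and two details are assumptions: the returned mode applies to the following interval, and branch counts observed during the hold window are ignored. The pre-stall and stall transitions are left out.

```python
from dataclasses import dataclass

INTERVAL_CYCLES = 512    # slide: branches are counted every 512 cycles
BRANCH_THRESHOLD = 24    # slide: fewer than 24 branches marks a high IPB period
HOLD_INTERVALS = 20      # slide: IPB predicted high for twenty intervals (~10K cycles)


@dataclass
class IpbDetector:
    """Hypothetical high-IPB period detector for the branch predictor."""
    hold_remaining: int = 0

    def end_of_interval(self, branches_fetched: int) -> str:
        """Call once per 512-cycle interval with that interval's branch count.

        Returns the mode suggested for the NEXT interval: "lp" while a
        high IPB period is predicted, otherwise "basic-lp".
        """
        if self.hold_remaining > 0:
            # Inside a predicted high-IPB window: counts are ignored (assumption).
            self.hold_remaining -= 1
            return "lp"
        if branches_fetched < BRANCH_THRESHOLD:
            # This call already covers the first of the twenty held intervals.
            self.hold_remaining = HOLD_INTERVALS - 1
            return "lp"
        return "basic-lp"


if __name__ == "__main__":
    detector = IpbDetector()
    for count in (40, 10, 100, 100):
        print(count, detector.end_of_interval(count))
```

Holding the prediction for a fixed number of intervals matches the observation that IPB, once high, tends to stay high for a long time, and it avoids toggling the mode on every interval.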

Slide 17: Leakage Power Reduction
Noticeable contribution of the ultra and basic low power modes.

Slide 18: Outline – Resource Adaptation
Why are the IQ, ROB and RF major power dissipators?
Study processor resource utilization during L2/multiple L1 miss service time
Architectural approach to dynamically adjusting the size of resources during cache miss periods for power conservation
Results
Conclusions

Slide 19: Instruction Queue
The Instruction Queue is a CAM-like structure which holds instructions until they can be issued. Its operations:
Set entries for newly dispatched instructions
Read entries to issue instructions to functional units
Wake up instructions waiting in the IQ once a result is ready
Select instructions for issue when the number of instructions available exceeds the processor issue limit (Issue Width)
Main complexity: wakeup logic

Slide 20: Logical View of Instruction Queue
No need to always have such an aggressive wakeup/issue width!
At each cycle, the match lines are pre-charged high to allow the individual bits associated with an instruction tag to be compared with the results broadcast on the taglines.
Upon a mismatch, the corresponding matchline is discharged. Otherwise, the match line stays at Vdd, which indicates a tag match.
At each cycle, up to 4 instructions are broadcast on the taglines, so four sets of one-bit comparators are needed for each one-bit cell.
All four matchlines must be ORed together to detect a match on any of the broadcast tags.
The result of the OR sets the ready bit of the instruction source operand.
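A behavioral sketch of the wakeup step may help here. It models only the logic, not the circuit: each source operand is compared against every broadcast tag (one matchline per tagline), and the OR of the matchlines sets the ready bit. The wakeup width of 4 comes from the slide. The class names, field names and tag values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

WAKEUP_WIDTH = 4  # slide: up to 4 result tags are broadcast per cycle


@dataclass
class Operand:
    tag: int             # physical register tag this operand waits on
    ready: bool = False


@dataclass
class IQEntry:
    sources: List[Operand]


def wakeup(iq: List[IQEntry], taglines: List[int]) -> None:
    """Compare every waiting operand against the tags broadcast this cycle."""
    if len(taglines) > WAKEUP_WIDTH:
        raise ValueError("more broadcasts than the wakeup width allows")
    for entry in iq:
        for operand in entry.sources:
            # One matchline per broadcast tag; a matchline stays high on a match.
            matchlines = [operand.tag == tag for tag in taglines]
            # OR of the matchlines sets (never clears) the ready bit.
            if any(matchlines):
                operand.ready = True


def ready_to_issue(entry: IQEntry) -> bool:
    """An entry can be selected for issue once all its sources are ready."""
    return all(operand.ready for operand in entry.sources)


if __name__ == "__main__":
    iq = [IQEntry([Operand(5), Operand(7, ready=True)]),
          IQEntry([Operand(9), Operand(5)])]
    wakeup(iq, [5, 12])
    print([ready_to_issue(e) for e in iq])
```

The cost the slide points at is visible in the nested loops: every operand is compared against every tagline each cycle, so narrowing the wakeup width directly reduces the number of comparisons.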

Slide 21: ROB and Register File
The ROB and the register file are multi-ported SRAM structures with several functionalities:
Setting entries for up to IW instructions in each cycle
Releasing up to IW entries during the commit stage in a cycle
Flushing entries during branch recovery
[Figures: Dynamic Power; Leakage Power]

Slide 22: Architectural Motivations
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued.
When dependent instructions cannot issue, after a number of cycles the instruction window is full (ROB, Instruction Queue, Store Queue, Register Files).
The processor issue stalls and performance is lost.
At the same time, energy is lost as well!
This is an opportunity to save energy.
Scenario I: L2 cache miss period
Scenario II: three or more pending DL1 cache misses

Slide 23: How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
Significant issue width decrease!
Scenario I: the issue rate drops by more than 80%.
Scenario II: the issue rate drops by 22% for integer benchmarks and 32.6% for floating-point benchmarks.

Slide 24: How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
[Table: per-benchmark results with columns Benchmark, Scenario I, Scenario II. Integer benchmarks: bzip, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, INT average. Floating-point benchmarks: applu, apsi, art, equake, facerec, galgel, lucas, mgrid, swim, wupwise, FP average.]
ROB occupancy grows significantly during scenarios I and II for integer benchmarks: 98% and 61% on average.
The increase in ROB occupancy for floating-point benchmarks is smaller: 30% and 25% on average for scenarios I and II.

Slide 25: How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
IRF occupancy always grows in both scenarios for integer benchmarks.
The same holds for the FRF when running floating-point benchmarks, but only during scenario II.

Slide 26: Proposed Architectural Approach
Adaptive resource resizing during the cache miss period.
Reduce the issue and wakeup width of the processor during L2 miss service time.
Increase the size of the ROB and RF during L2 miss service time or when at least three DL1 misses are pending.
Simple resizing scheme: reduce to half size.
Not necessarily optimized for individual units, but a simple scheme to implement at the circuit level!
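One way to read this scheme is sketched below. The halving rule and the trigger conditions come from the slide. The interpretation that the ROB and RF run at half size outside miss periods and return to full size during them is an assumption, as are the structure sizes and all names in the code.

```python
from dataclasses import dataclass

DL1_MISS_TRIGGER = 3  # slide: at least three pending DL1 misses


@dataclass(frozen=True)
class Resources:
    issue_width: int
    wakeup_width: int
    rob_entries: int
    rf_entries: int


# Hypothetical full-size configuration; the slides do not give sizes.
FULL = Resources(issue_width=4, wakeup_width=4, rob_entries=128, rf_entries=128)


def adapt(full: Resources, l2_miss_pending: bool,
          pending_dl1_misses: int) -> Resources:
    """Return the active resource sizes for the current cycle."""
    scenario_1 = l2_miss_pending
    scenario_2 = pending_dl1_misses >= DL1_MISS_TRIGGER

    # Issue and wakeup width are halved only during L2 miss service time.
    width_divisor = 2 if scenario_1 else 1
    # Assumed reading: ROB and RF are at half size normally and are
    # increased to full size when occupancy is expected to grow.
    storage_divisor = 1 if (scenario_1 or scenario_2) else 2

    return Resources(
        issue_width=full.issue_width // width_divisor,
        wakeup_width=full.wakeup_width // width_divisor,
        rob_entries=full.rob_entries // storage_divisor,
        rf_entries=full.rf_entries // storage_divisor,
    )


if __name__ == "__main__":
    print(adapt(FULL, l2_miss_pending=False, pending_dl1_misses=0))
    print(adapt(FULL, l2_miss_pending=True, pending_dl1_misses=0))
    print(adapt(FULL, l2_miss_pending=False, pending_dl1_misses=3))
```

Using a single divisor of two keeps the control logic trivial, which fits the slide's point that the scheme is chosen for ease of circuit implementation and not tuned per unit.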

Slide 27: Results
Small performance loss (~1%)
15~30% dynamic and leakage power reduction

Slide 28: Conclusions
Introducing the idea of multiple sleep mode design:
Apply multiple sleep modes to on-chip SRAMs
Find periods of low activity for state transition
Introducing the idea of resource adaptation:
Apply resource adaptation to on-chip SRAMs+CAMs
Find periods of low activity for state transition
Applying similar adaptive techniques to other energy-hungry resources in the processor:
Multiple sleep mode functional units