12/03/2001 MICRO’01 Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources

Presentation transcript:

12/03/2001 MICRO’01
Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources*
Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY
34th International Symposium on Microarchitecture (MICRO-34), December 3rd, 2001
*Supported in part by DARPA through the PAC-C program and NSF

Presentation Outline
- Motivation
- Resource usage in superscalar datapaths
- Resource allocation strategy
- Performance results
- Concluding remarks

Motivation
- High-end superscalar CPUs employ a substantial amount of datapath resources
- Consequences:
  - High overall power dissipation
  - Areal energy/power density is at a dangerous level
- Thus: energy dissipation should preferably be controlled through technology-independent techniques

What This Work is All About
- Power-hungry resources are allocated on a “one-size-fits-all” basis
  - Unnecessary dissipation from overcommitted resources
- Examples of resources: Issue Queue, Reorder Buffer, Load/Store Queue, caches, function units
  - Resources considered in this work: IQ, ROB, LSQ
- Main idea: control resource allocation/deallocation dynamically to track the demands of the application
- Goals:
  - Must limit any impact on performance
  - Must allow for easy retrofit into existing datapaths
  - Must have a stable and low-overhead control strategy

Dynamic Resizing of IQ, ROB and LSQ
[Figure: superscalar datapath. Pipeline stages: fetch (F1, F2), decode/dispatch (Dec/RN1, RN2/Dis), instruction dispatch, instruction issue, execute (EX) on function units FU1, FU2, ..., FUm. Storage structures: IQ, ROB, LSQ, ARF (Architectural Register File). Result/status forwarding buses connect them. A legend marks the resized resources.]

Main Issues
- How do we measure/estimate resource needs? Continuous measurement vs. periodic sampling
- What is the control strategy? Centralized vs. distributed
- How is the performance impact limited? Periodic upsizing vs. asynchronous upsizing
- What are the relevant circuit techniques? Overall redesign vs. simple changes

Resource Usage in Superscalar Datapath: Example (fpppp)

Resource Usage in Superscalar Datapath: Example (apsi)

Incremental Resource Allocation/Deallocation
- The ROB, IQ and LSQ are each implemented as a set of independent partitions
- Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through buses
- All partitions have associative addressing logic

Partitioned Organization
[Figure: three partitions (Partition 1, Partition 2, Partition 3). Each contains a precharger array, input/output drivers, a bypass switch array, an associative part and a non-associative part. Labels: bitlines or forwarding lines within a partition, through line, bypass switch.]

Incremental Resource Allocation/Deallocation
- Allocations are increased by adding a free partition
- Deallocations are performed by powering down a partition after its contents have been used up
  - Easy to do for the IQ
  - A little more challenging for the ROB and the LSQ because of their FIFO nature

Sampling and Downsizing Strategies
- Downsizing decisions are taken at the end of an update period
- Update periods have a fixed duration of UP cycles
- Within an update period, multiple samples of the occupancies are taken at regular intervals of SP cycles
[Figure: timeline marking sample periods (SP) within an update period (UP).]
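The slide gives the sampling schedule but not the downsizing rule itself. The following is a minimal sketch, assuming the samples are averaged at the end of the update period and that one partition is removed when the average leaves at least one whole partition unused. The function names, the slack rule and the example trace are illustrative assumptions, not taken from the slides.

```python
def sample_occupancy(trace, sp):
    """Pick one occupancy reading every `sp` cycles from a per-cycle trace."""
    return trace[sp - 1::sp]


def downsize_decision(samples, allocated, partition_size):
    """Return the allocation (in entries) to use for the next update period.

    Assumed rule: drop one partition when the average sampled occupancy
    leaves at least one whole partition unused. At least one partition
    always stays active.
    """
    average = sum(samples) / len(samples)
    slack = allocated - average
    if slack >= partition_size and allocated > partition_size:
        return allocated - partition_size
    return allocated


# One update period with SP=4, UP=16: 16 per-cycle readings, 4 samples.
trace = [9, 10, 11, 10, 12, 12, 11, 12, 9, 8, 9, 9, 10, 11, 12, 11]
samples = sample_occupancy(trace, sp=4)
new_size = downsize_decision(samples, allocated=24, partition_size=8)
print(samples, new_size)
```

Sampling only every SP cycles keeps the monitoring logic idle most of the time, which is the point the control-strategy summary makes about avoiding cycle-by-cycle monitoring.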

A Resizing Example (SP=4, UP=16)
[Figure, shown over several animation frames: actual occupancy versus allocated entries along a time axis marked in SP and UP intervals. The final frame labels samples 1 to 4 and their average (Avg.).]

Upsizing Strategy
- Count the number of cycles when dispatch blocks because the resource is full
- If the counter exceeds OT (Overflow Threshold), add one partition
  - Upsizing is more aggressive than downsizing, which reduces the hit on performance
- Reset the overflow counter to 0 at the beginning of a new UP (Update Period)
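The upsizing rule can be sketched as a small per-cycle counter. This is a sketch under stated assumptions: the class and method names are hypothetical, and clearing the counter right after a partition is added is an assumption the slides do not state.

```python
class UpsizeController:
    """Counts dispatch-blocked cycles and adds a partition past the threshold."""

    def __init__(self, ot, up, max_partitions, active_partitions=1):
        self.ot = ot                        # Overflow Threshold, in cycles
        self.up = up                        # Update Period, in cycles
        self.max_partitions = max_partitions
        self.active_partitions = active_partitions
        self.overflow = 0                   # blocked cycles seen this period
        self.cycle = 0

    def tick(self, dispatch_blocked):
        """Advance one cycle; return True if a partition was added."""
        if self.cycle % self.up == 0:
            self.overflow = 0               # new update period starts
        self.cycle += 1
        if dispatch_blocked:
            self.overflow += 1
        if self.overflow > self.ot and self.active_partitions < self.max_partitions:
            self.active_partitions += 1
            self.overflow = 0               # assumption: restart the count
            return True
        return False


# With OT=4, the fifth blocked cycle in a period triggers upsizing.
ctrl = UpsizeController(ot=4, up=16, max_partitions=4)
added = [ctrl.tick(True) for _ in range(5)]
print(added, ctrl.active_partitions)
```

Because the check runs every cycle, a partition can be added in the middle of an update period, whereas downsizing waits for the period to end. That asymmetry is what makes upsizing the more aggressive of the two.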

A Resizing Example (SP=4, UP=16)
[Figure: actual occupancy versus allocated entries, with samples 1 to 4 and their average (Avg.) labelled.]

A Resizing Example (SP=4, UP=16, OT=4)
[Figure, shown over several animation frames: actual occupancy versus allocated entries along a time axis marked in SP and UP intervals. A counter steps through 1, 2, 3, 4 alongside an OT label.]

Summary of the Control Strategy
- Only three parameters are used for control:
  - OT (Overflow Threshold)
  - UP (Update Period)
  - SP (Sample Period)
- Less than 1% power overhead for control logic
- Advantages:
  - A desired power/performance tradeoff can easily be achieved by adjusting OT and UP
  - Monitoring on a cycle-by-cycle basis is avoided; it is done once every SP cycles

General Considerations for Deallocations
- All information within the partition to be deallocated must be consumed:
  - For the IQ, instructions from the partition must be issued
  - For the ROB, entries within the partition must be committed
  - For the LSQ, entries within the partition must start the D-cache access
- No new instruction should be dispatched to this partition
  - This can cause dispatch to block for a longer duration in the case of the ROB because of its circular nature
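The two deallocation conditions (no new dispatches, all existing contents consumed) can be sketched as a drain-then-power-down state machine. The `Partition` class below is hypothetical. "Consume" stands for issue in the IQ, commit in the ROB, or the start of the D-cache access in the LSQ, and the FIFO ordering constraints of the ROB and LSQ are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    pending: set = field(default_factory=set)  # entries not yet consumed
    accepting: bool = True                     # may dispatch place entries here?
    powered: bool = True

    def dispatch(self, entry_id):
        """Try to place a new entry; refused once deallocation has begun."""
        if not (self.powered and self.accepting):
            return False
        self.pending.add(entry_id)
        return True

    def begin_deallocation(self):
        """Stop accepting new entries; power down as soon as the partition drains."""
        self.accepting = False
        self._power_down_if_drained()

    def consume(self, entry_id):
        """Mark an entry as issued/committed/sent to the D-cache."""
        self.pending.discard(entry_id)
        self._power_down_if_drained()

    def _power_down_if_drained(self):
        if not self.accepting and not self.pending:
            self.powered = False


p = Partition()
p.dispatch(1)
p.dispatch(2)
p.begin_deallocation()        # still powered: two entries remain
refused = not p.dispatch(3)   # new dispatches are turned away
p.consume(1)
p.consume(2)                  # last entry consumed, partition powers down
print(refused, p.powered)
```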

Experimental Setup: the AccuPower Toolkit
[Figure: tool flow. Labels: compiled SPEC benchmarks, datapath specs, microarchitectural simulator, performance stats, transition counts and context information, VLSI layout data, SPICE deck, SPICE, SPICE measures of energy per transition, energy/power estimator, power/energy stats.]

Configuration of the Simulated System
  Machine width      4-way
  Issue Queue        32 entries with 4 partitions
  Reorder Buffer     96 entries with 6 partitions
  Load/Store Queue   32 entries with 4 partitions
Simulated the execution of SPEC2000 benchmarks.
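The resizing granularity follows from this configuration. A short sketch, assuming every partition of a resource has the same size (the slides do not say so explicitly); the dictionary layout is illustrative only.

```python
# (total entries, number of partitions) per resource, from the configuration above.
config = {"IQ": (32, 4), "ROB": (96, 6), "LSQ": (32, 4)}

# Entries per partition, assuming equal-sized partitions.
partition_size = {
    name: entries // partitions for name, (entries, partitions) in config.items()
}

for name, size in partition_size.items():
    print(f"{name}: resized in steps of {size} entries")
```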

Experimental Results: Effect on Performance
[Figure: IPC for each OT setting.]
IPC drop: 0.9%, 4.9%, 19.3%

Experimental Results: Average Active Size (IQ)
[Figure: average active IQ size for each OT setting.]
Savings: 14%, 27%, 51%

Experimental Results: Average Active Size (ROB)
[Figure: average active ROB size for each OT setting.]
Savings: 19%, 34%, 58%

Experimental Results: Average Active Size (LSQ)
[Figure: average active LSQ size for each OT setting.]
Savings: 7%, 20%, 47%

Experimental Results (OT=512, UP=2048, SP=32)

Experimental Results: Power Reduction
[Figure: power (mW) for each OT setting.]
  Power savings   40%    48%    65%
  IPC drop        0.9%   4.9%   19.3%
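Power savings and IPC drop pull in opposite directions on energy, since a slower run dissipates power for longer. The sketch below combines the two rows into a relative energy figure. It assumes a fixed clock frequency, so that run time scales with 1/IPC, and it covers only the power to which the savings figures apply. The `relative_energy` helper is an illustration, not a metric reported in the slides.

```python
def relative_energy(power_savings, ipc_drop):
    """Energy relative to the baseline, for the same instruction count.

    energy = power * time; power scales by (1 - power_savings) and,
    at a fixed clock, time scales by 1 / (1 - ipc_drop).
    """
    return (1.0 - power_savings) / (1.0 - ipc_drop)


# (power savings, IPC drop) pairs from the table above.
points = [(0.40, 0.009), (0.48, 0.049), (0.65, 0.193)]
for savings, drop in points:
    ratio = relative_energy(savings, drop)
    print(f"savings={savings:.0%} drop={drop:.1%} relative energy={ratio:.3f}")
```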

Other Matters
- Dispatch rate modulation on top of resizing does not yield substantial additional power savings and results in higher IPC drops (WCED’01)
- Note that this work also addresses leakage dissipation
- We are in the process of extending this work to add caches, FUs, TLBs, …, and dynamic threshold variation
- Work is in progress on the use of resizing hooks that are exposed to the compiler

Related Work
- Adaptive Issue Queue (Buyuktosunoglu et al., PACS’00):
  - Multi-partitioned issue queue
  - Number of partitions dynamically allocated based on the number of ready flags set in entries within the active partition
  - IPC drop triggers growth
- Resizable Issue Queue (Folegnani and Gonzalez, ISCA’01):
  - FIFO issue queue, multi-partitioned
  - Downsizing based on the number of instructions committed from the “youngest” partition
- Pipeline Balancing (Bahar and Manne, ISCA’01):
  - For multi-clustered datapath organizations
  - Dynamic resizing of the issue queue and dynamic cluster activation
  - IPC monitored to allow clusters/issue queue partitions to be turned off with minimal impact on performance
- Others (IPC monitoring and resource control by the OS, dynamic profiling)

Concluding Remarks
- Significant power savings with minimal impact on performance are achieved by dynamically resizing multiple datapath resources
  - 48% power savings with only a 4.9% IPC drop
- A simple control strategy is used that avoids resource monitoring on a cycle-by-cycle basis
- The basic techniques are orthogonal to other power reduction strategies such as selective bit-slice activation, frequency and voltage scaling, and additional circuit techniques