Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure


Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure
Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev
Department of Computer Science, State University of New York - Binghamton
*currently with Intel Barcelona Research Center

Outline
- Introduction and motivations
- Register Packing:
  - Conservative Packing
  - Speculative Packing
- Results and discussions
- Conclusion

Introduction
- Implications of larger instruction windows
  - Increases register pressure
  - Generally dealt with by using large register files
- Large register files have:
  - Higher access time or require multi-cycle access
  - Higher energy dissipation
- Need to decrease the register file pressure

Motivations
- Many generated results have a lot of leading zeros or ones
  - Fewer bits are needed to represent the value
- Register files are thus not used efficiently

“Narrow” Values
- Prefixes of all 1s can be replaced with a single 1 and the prefixes of all 0s can be replaced with a single 0.
  - → 1 (width = 1)
  - → 0 (width = 1)
  - → 01 (width = 2)
  - → 101 (width = 3)
  - → (width = 8)
- Narrow width operands do not use the full width of a register
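The prefix rule above can be sketched in a few lines of Python. This is only an illustration: the function names `narrow_width` and `compress` and the `reg_bits` parameter are hypothetical, and the 8-bit inputs in the usage note are chosen here to reproduce the widths listed on the slide (the slide's own input bit strings did not survive in the transcript).

```python
def narrow_width(value: int, reg_bits: int = 64) -> int:
    """Minimum two's-complement width of a reg_bits-wide register value."""
    value &= (1 << reg_bits) - 1
    # Reinterpret the raw bit pattern as a signed integer.
    if value >> (reg_bits - 1):
        value -= 1 << reg_bits
    # Drop the redundant prefix of 0s (or 1s) but keep a single sign bit.
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def compress(value: int, reg_bits: int = 64) -> str:
    """Bit string that remains after the redundant prefix is removed."""
    width = narrow_width(value, reg_bits)
    return format(value & ((1 << width) - 1), f"0{width}b")
```

With 8-bit registers, `compress(0xFF, 8)` gives `'1'`, `compress(0x00, 8)` gives `'0'`, `compress(0x01, 8)` gives `'01'`, and `compress(0xFD, 8)` gives `'101'`, matching the widths 1, 1, 2 and 3 above.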

Distribution of Widths

Exploiting Narrow Values
- Packing multiple results into a single physical register improves performance as the effective number of physical registers goes up

Main Challenges
- Value widths are not known until the results are actually produced
- Register allocation made to a result can change if the value turns out to be narrow
- Consumers of the result have to be informed if it is reallocated to a different register based on its width
- If multiple results are packed into a common register, some means must be provided to locate them unambiguously

Detecting Value Widths
- Have to quantize the widths to simplify implementation
  - Chunks of bytes or double bytes
- Width detection logic is embedded into the final stages of an execution unit
- Techniques for detecting widths are well known – Leading Zero Detectors in floating point units
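As a rough software model of quantized width detection, the sketch below counts how many fixed-size chunks ("slots") a result needs. The 16-bit slot size, the 64-bit register width, and the name `slots_for_value` are assumptions made for illustration; the slides only say that widths are quantized to bytes or double bytes.

```python
def slots_for_value(value: int, slot_bits: int = 16, reg_bits: int = 64) -> int:
    """Number of slot_bits-wide chunks needed to hold value without losing its sign."""
    n_slots = reg_bits // slot_bits
    slot_mask = (1 << slot_bits) - 1
    value &= (1 << reg_bits) - 1
    # slots[0] is the least significant chunk.
    slots = [(value >> (i * slot_bits)) & slot_mask for i in range(n_slots)]

    needed = n_slots
    while needed > 1:
        top = slots[needed - 1]
        sign_below = slots[needed - 2] >> (slot_bits - 1)
        # The top chunk is redundant only if it is a pure sign extension
        # of the chunk directly below it.
        if top == (slot_mask if sign_below else 0):
            needed -= 1
        else:
            break
    return needed
```

For example, `slots_for_value(0x7FFF)` needs one slot, while `slots_for_value(0x8000)` needs two, because the extra chunk carries the sign.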

Storing Narrow Values in Registers
- Parts of a result do not need to be stored contiguously.
[Figure: register P7 holding the upper and lower halves of narrow result A and the upper and lower halves of narrow result B]

Addressing Narrow Values
- Use a bit mask to specify partitions holding components of the value along with the register address
- Example: address of A = P7, 1001
[Figure: register P7 with the upper half and lower half of narrow result A]
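A minimal sketch of how a (register, mask) address such as P7, 1001 could be resolved when reading a source operand. It assumes four 16-bit partitions, that the leftmost mask bit names the leftmost partition, that selected partitions are concatenated left to right, and that the result is sign-extended. None of these details are stated on the slides, and `read_packed` is a hypothetical name.

```python
SLOT_BITS = 16  # assumed partition size


def read_packed(slots: list, mask: str) -> int:
    """Gather the partitions selected by mask and sign-extend the result.

    slots[0] is the leftmost partition of the physical register;
    mask has one character per partition, e.g. "1001".
    """
    parts = [slots[i] for i, bit in enumerate(mask) if bit == "1"]
    if not parts:
        raise ValueError("mask selects no partition")
    raw = 0
    for part in parts:
        raw = (raw << SLOT_BITS) | part
    width = SLOT_BITS * len(parts)
    # Sign-extend the gathered value.
    if raw >> (width - 1):
        raw -= 1 << width
    return raw


# Register P7 with result A in the outer partitions and result B in the inner ones.
p7 = [0x0001, 0xAAAA, 0xBBBB, 0x2345]
value_a = read_packed(p7, "1001")
value_b = read_packed(p7, "0110")
```

Here `value_a` is `0x12345`, and `value_b` is negative because the top bit of its upper half is set.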

Register Read Logic

Register Packing Alternatives
- Conservative Packing
  - Assume result to use the full width of a register at allocation time
- Speculative Packing
  - Predict the result width at allocation time and allocate accordingly

Conservative Packing
- Initially allocate a full-width register
- If the result turns out to be narrow:
  - Release the unneeded parts to the free pool
  - If there is a suitable partition: reallocate.

Conservative Packing
- Instruction I is dispatched
[Figure: registers P2 and P5, showing free and allocated partitions]

Conservative Packing
- Instruction I is dispatched: P2 is allocated
[Figure: registers P2 and P5, showing free and allocated partitions]

Conservative Packing
- Instruction I is dispatched: width of result = 2 slots
- P5’s upper half is allocated and P2 is released
[Figure: registers P2 and P5, showing free and allocated partitions]
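The walkthrough above can be modeled roughly as follows. This is a sketch under assumptions the slides do not state: four partitions per register, a first-fit search restricted to partially occupied registers, and a narrow result that stays in the low-order partitions when it is not moved. The class name `PackedRegisterFile` and its methods are hypothetical.

```python
class PackedRegisterFile:
    """Toy model of conservative packing over partitioned physical registers."""

    def __init__(self, n_regs: int, n_slots: int = 4):
        self.n_slots = n_slots
        # free[r][i] is True while partition i of register r is unallocated.
        self.free = [[True] * n_slots for _ in range(n_regs)]

    def allocate_full(self):
        """Dispatch: conservatively claim an entirely free register."""
        for reg, slots in enumerate(self.free):
            if all(slots):
                self.free[reg] = [False] * self.n_slots
                return reg
        return None  # no full-width register available

    def writeback(self, reg: int, slots_needed: int):
        """Result width is now known; return the final (register, mask) tag."""
        if slots_needed == self.n_slots:
            return reg, "1" * self.n_slots
        # Assumed policy: first-fit among other, partially occupied registers.
        for other, slots in enumerate(self.free):
            if other != reg and not all(slots) and sum(slots) >= slots_needed:
                mask = self._claim(other, slots_needed)
                self.free[reg] = [True] * self.n_slots  # release the original
                return other, mask
        # No suitable partition: stay in place and release the unneeded parts.
        unused = self.n_slots - slots_needed
        self.free[reg] = [True] * unused + [False] * slots_needed
        return reg, "0" * unused + "1" * slots_needed

    def _claim(self, reg: int, count: int) -> str:
        bits = []
        for i in range(self.n_slots):
            if self.free[reg][i] and count > 0:
                self.free[reg][i] = False
                count -= 1
                bits.append("1")
            else:
                bits.append("0")
        return "".join(bits)
```

If P0 and P1 are fully occupied and only P5's upper half is free, `allocate_full()` returns register 2, and `writeback(2, 2)` moves the two-slot result to P5 with mask `1100` and frees P2, as in the slides.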

Taking Care of Reassignments
- Two broadcasts are needed
  - First broadcast uses old tag (= originally assigned register id) to inform dependents that the result will be available shortly
  - Second broadcast drives the old tag and the new tag (= newly-assigned register id + “parts” bits)
    - old tag is used to locate dependents
    - new tag picked up by matching entries and used later to read out source value from the register file

Tag Broadcast for Wakeup
[Figure: the producer function unit drives tag P2, 1111 on the tag bus; the consumer entry in the issue queue holds source tags P1, 1001 and P2, 1111]

Tag Rebroadcast Example
[Figure: the producer function unit drives old tag P2, 1111 and new tag P5, 1100; the consumer entry in the issue queue, holding source tags P1, 1001 and P2, 1111, replaces P2, 1111 with P5, 1100]
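The two-broadcast protocol can be illustrated with a small issue-queue model. The data layout (a tag as a `(register, mask)` tuple) and the names `Source`, `IssueQueueEntry`, `broadcast_wakeup`, and `broadcast_retag` are hypothetical; real wakeup logic is associative hardware, not a loop.

```python
from dataclasses import dataclass


@dataclass
class Source:
    tag: tuple          # (register id, parts mask), e.g. ("P2", "1111")
    ready: bool = False


@dataclass
class IssueQueueEntry:
    sources: list


def broadcast_wakeup(queue, old_tag):
    """First broadcast: mark every source waiting on old_tag as ready."""
    for entry in queue:
        for src in entry.sources:
            if src.tag == old_tag:
                src.ready = True


def broadcast_retag(queue, old_tag, new_tag):
    """Second broadcast: matching sources pick up the new tag for the register read."""
    for entry in queue:
        for src in entry.sources:
            if src.tag == old_tag:
                src.tag = new_tag


# Consumer from the example: one source already ready, one waiting on P2.
consumer = IssueQueueEntry([Source(("P1", "1001"), ready=True),
                            Source(("P2", "1111"))])
queue = [consumer]
broadcast_wakeup(queue, ("P2", "1111"))
broadcast_retag(queue, ("P2", "1111"), ("P5", "1100"))
```

After both broadcasts the consumer's second source is ready and points at P5, 1100, so the register read uses the reallocated location.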

IPCs for Conservative Packing

Conservative Packing: Observations
- Extra broadcast is needed for all results that don’t use all of the partitions within a register
- Performance is heavily constrained by the number of broadcast buses
  - 6% for 4 buses
  - 14% for 8 buses
  - -26% for 4 buses assuming an extra cycle delay for width estimation

Speculative Packing
- Predict the width of the result and allocate accordingly
- Width overprediction: two choices here
  - Release unused parts of register – rebroadcast only the parts bits
  - Do not release unused parts – no rebroadcast is needed
- Width underprediction: requires reallocation and an update broadcast
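The three outcomes listed above can be summarized as a small decision function. This is a sketch only: the `Action` enum, the `resolve_prediction` name, and the `release_unused` flag (which selects between the two overprediction choices) are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "no rebroadcast needed"
    REBROADCAST_PARTS = "release unused parts, rebroadcast only the parts bits"
    REALLOCATE = "reallocate and send an update broadcast"


def resolve_prediction(predicted_slots: int, actual_slots: int,
                       release_unused: bool = True) -> Action:
    """Decide what happens at writeback once the real width is known."""
    if actual_slots > predicted_slots:
        # Underprediction: the allocated partitions cannot hold the result.
        return Action.REALLOCATE
    if actual_slots < predicted_slots and release_unused:
        # Overprediction, first choice: give the spare partitions back.
        return Action.REBROADCAST_PARTS
    # Correct prediction, or overprediction with the spare parts kept.
    return Action.NONE
```

Keeping the unused parts on overprediction trades register capacity for fewer tag broadcasts, which matters given how broadcast-limited Conservative Packing is.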

Width Predictor
- Width prediction bits are maintained within the L1 I-Cache
  - Prediction bits do not percolate down the memory hierarchy from L1
- Default prediction is full width
- Prediction bits are updated only on mispredictions
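A minimal software model of such a predictor is sketched below. The slides do not say what the prediction bits encode, so this assumes a last-value scheme (the bits hold the slot count seen at the last misprediction). The dictionary stands in for bits stored alongside L1 I-cache lines, and `WidthPredictor`, `predict`, `update`, and `evict` are hypothetical names.

```python
class WidthPredictor:
    """Per-instruction width prediction, modeled as a table keyed by PC."""

    def __init__(self, full_slots: int = 4):
        self.full_slots = full_slots
        self.bits = {}  # pc -> predicted slot count (assumed last-value scheme)

    def predict(self, pc: int) -> int:
        # Default prediction is full width.
        return self.bits.get(pc, self.full_slots)

    def update(self, pc: int, actual_slots: int) -> bool:
        """Write the prediction bits only on a misprediction; return True if mispredicted."""
        if self.predict(pc) == actual_slots:
            return False
        self.bits[pc] = actual_slots
        return True

    def evict(self, pc: int) -> None:
        # Bits are not written back to lower levels, so they are lost on eviction.
        self.bits.pop(pc, None)
```

An instruction that has never been seen, or whose line was evicted from L1, falls back to the full-width default.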

Width Prediction is Accurate!

Deadlock Avoidance
- If there is a misprediction and there are no free register parts available:
  - Stall writeback and wait
  - This can still cause a deadlock if the instruction is the oldest in the pipeline
- Create an exception and squash all instructions younger than the instruction (including itself)
- Steal a register from a younger instruction and squash all instructions coming after the owner

Comparison of Deadlock Avoidance Schemes

Speedups of Speculative Packing

Performance of Packing

Conclusions
- We proposed and evaluated two register packing schemes
- Because of the high number of tag broadcasts, Conservative Packing suffers in performance
- Speculative Packing results in 15% IPC improvement on the average with 64 fp and 64 int registers (with tag bus sharing)

Thank You!
Oguz Ergin
Department of Computer Science, State University of New York - Binghamton
Intel Barcelona Research Center

64-bit Apps vs. 32-bit Apps
- Can use fewer registers on 32-bit apps running on a 64-bit datapath: this may result in some energy savings
- See similar trends on data widths for 32-bit applications on a 32-bit datapath
- Savings are shown for apps retargeted for 64 bits
- The PISA ISA does have a fair number of 64-bit operands: FP benchmarks, integers holding addresses, etc.

32-bit value widths