ISLPED’03 1 Reducing Reorder Buffer Complexity Through Selective Operand Caching *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk,

Slides:

Advertisements

Similar presentations

09/16/2002 ICCD 2002 A Circuit-Level Implementation of Fast, Energy-Efficient CMOS Comparators for High-Performance Microprocessors* *supported in part.

Advertisements

NC STATE UNIVERSITY 1 Assertion-Based Microarchitecture Design for Improved Fault Tolerance Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg Center for.

Lecture 19: Cache Basics Today’s topics: Out-of-order execution

1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer.

Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.

Reducing Leakage Power in Peripheral Circuits of L2 Caches Houman Homayoun and Alex Veidenbaum Dept. of Computer Science, UC Irvine {hhomayou,

CMSC 611: Advanced Computer Architecture Cache Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

Energy-efficient Instruction Dispatch Buffer Design for Superscalar Processors* Gurhan Kucuk, Kanad Ghose, Dmitry V. Ponomarev Department of Computer Science.

ISLPED 2003 Power Efficient Comparators for Long Arguments in Superscalar Processors *supported in part by DARPA through the PAC-C program and NSF Dmitry.

June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.

1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

Chapter 12 Pipelining Strategies Performance Hazards.

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.

PATMOS 2003 Energy Efficient Register Renaming *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev,

Defining Wakeup Width for Efficient Dynamic Scheduling A. Aggarwal, O. Ergin – Binghamton University M. Franklin – University of Maryland Presented by:

1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

ICCD’03 1 Distributed Reorder Buffer Schemes for Low Power * *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Oguz Ergin, Dmitry.

The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture Facilitate parallel execution Scale well with advancing.

Lecture 8 Shelving in Superscalar Processors (Part 1)

ICS’02 1 Low-Complexity Reorder Buffer Architecture* *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Dmitry Ponomarev, Kanad.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Hybrid-Scheduling: A Compile-Time Approach for Energy–Efficient Superscalar Processors Madhavi Valluri and Lizy John Laboratory for Computer Architecture.

ISLPED’99 International Symposium on Low Power Electronics and Design

Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments Houman Homayoun National Science.

A Centralized Cache Miss Driven Technique to Improve Processor Power Dissipation Houman Homayoun, Avesta Makhzan, Jean-Luc Gaudiot, Alex Veidenbaum University.

Energy-Effective Issue Logic Hasan Hüseyin Yılmaz.

Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.

|Processors designed for low power |Architectural state is correct at basic block granularity rather than instruction granularity 2.

M E M O R Y. Computer Performance It depends in large measure on the interface between processor and memory. CPI (or IPC) is affected CPI = Cycles per.

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

PATMOS’02 Energy-Efficient Design of the Reorder Buffer* *supported in part by DARPA through the PAC-C program and NSF Dmitry Ponomarev, Gurhan Kucuk,

12/03/2001 MICRO’01 Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources* *supported in part.

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.

Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

PipeliningPipelining Computer Architecture (Fall 2006)

Computer Orgnization Rabie A. Ramadan Lecture 9. Cache Mapping Schemes.

CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.

Graduate Seminar Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun April 2005.

Dynamic Associative Caches:

Dynamic Scheduling Why go out of style?

Smruti R. Sarangi IIT Delhi

PowerPC 604 Superscalar Microprocessor

CS203 – Advanced Computer Architecture

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Microprocessor Microarchitecture Dynamic Pipeline

Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Computer Architecture Lecture 3

Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Ka-Ming Keung Swamy D Ponpandi

Lecture 8: Dynamic ILP Topics: out-of-order processors

Reducing Cache Traffic and Energy with Macro Data Load

* From AMD 1996 Publication #18522 Revision E

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Lecture 10: ILP Innovations

Lecture 9: ILP Innovations

Lecture 9: Dynamic ILP Topics: out-of-order processors

Conceptual execution on a processor which exploits ILP

Ka-Ming Keung Swamy D Ponpandi

Sizing Structures Fixed relations Empirical (simulation-based)

ECE 721 Modern Superscalar Microarchitecture

Presentation transcript:

ISLPED’03 1 Reducing Reorder Buffer Complexity Through Selective Operand Caching *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Dmitry Ponomarev, Oguz Ergin, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY International Symposium on Low Power Electronics and Design (ISLPED’03), August 26 th 2003

ISLPED’03 2 Outline Reorder Buffer (ROB) complexities Motivation for the low-complexity ROB Low-complexity ROB (ICS’02) Improving the design using short-lived values Results Concluding remarks

ISLPED’03 3 P6 Style Superscalar Datapath IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB

ISLPED’03 4 ROB Port Requirements for a W-way CPU ROB Writeback W write ports to write results Dispatch/Issue 2W read ports to read the source operands Decode/Dispatch W write ports to setup entries Commit W read ports for instruction commitment

ISLPED’03 5 Where are the Source Values Coming From? IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB 1 2 3

ISLPED’03 6 Where are the Source Values Coming From ? 96-entry ROB, 4-way processor SPEC2K Benchmarks 62%32%6%

ISLPED’03 7 How Efficiently are the Ports Used ? ROB Writeback W write ports To write results Dispatch/Issue 2W read ports to read the source operands Decode/Dispatch W write ports to setup entries Commit W read ports for instruction commitment 6%

ISLPED’03 8 Our Solution: Elimination of Read Ports IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB 1 2 3

ISLPED’03 9 Our Solution: Elimination of Read Ports IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB 1 2 3

ISLPED’03 10 Our Solution: Elimination of Read Ports IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ 1 3 ROB

ISLPED’03 11 Comparison of ROB Bitcells (0.18µ, TSMC) Layout of a 32-ported SRAM bitcell Layout of a 16-ported SRAM bitcell Area Reduction – 71% Shorter bit and wordlines

ISLPED’03 12 Completely Eliminating the Source Read Ports on the ROB The Problem: Issue of instructions that require a value stored in the ROB will stall Solutions: Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING

ISLPED’03 13 Late Forwarding: Use the Normal Forwarding Buses! IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB Result/status forwarding buses:

ISLPED’03 14 IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB Result/status forwarding buses: Late Forwarding: Use the Normal Forwarding Buses!

ISLPED’03 15 Improving Performance Cache recently generated values in a set of RETENTION LATCHES (RL) Retention Latches are SMALL and FAST Only 8 to 16 latches needed in the set Entire set has 1 or 2 read ports

ISLPED’03 16 Datapath with the Retention Latches IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch F2 Fetch Decode/Dispatch D2 D-cache LSQ ROB Architectural Register File

ISLPED’03 17 Datapath with the Retention Latches IQ Function Units Instruction Issue F1D1 FU1 FU2 FUm ARF Result/status forwarding buses EX Instruction dispatch Architectural Register File F2 Fetch Decode/Dispatch D2 D-cache LSQ RETENTION LATCHES ROB

ISLPED’03 18 Retention Latch Management Strategies FIFO 8 entry RL: 42% hit rate 16 entry RL: 55% hit rate LRU 8 entry RL: 56% hit rate 16 entry RL: 62% hit rate Random Replacement Worse performance than FIFO

ISLPED’03 19 Advantages of Using Retention Latches Reduces energy dissipation in the ROB – avoids creating a localized hot spot Reduces associated performance losses Reduces ROB complexity – smaller floor plan, easier validation

ISLPED’03 20 Improving Retention Latch Management PROBLEM: All generated results, irrespective of whether they could be potentially read from the RLs, are written into the latches unconditionally CONSEQUENCE: The array of RLs is not utilized efficiently and performance loss is still noticeable SOLUTION: We identify the values which are never going to be read after the cycle of their generation and avoid writing of these values into the RLs

ISLPED’03 21 Our definition: a value is short-lived if the destination register is renamed by the time of the result generation Identified one cycle before the result writeback LOAD R1, R2, 100 SUB R5, R1, R3 ADD R1, R5, R4 LOAD P31, R2, 100 SUB P32, P31, R3 ADD P33, P32, R4 RENAMER Short-Lived Values

ISLPED’03 22 AVOID WRITING SHORT-LIVED VALUES INTO THE RETENTION LATCHES Reasons: Short-lived values are forwarded directly to all potential consumers in the issue queue No instruction will ever consume a short- lived value from the retention latches Results: Increased RL hit ratios and better overall performance Key Idea: Do not cache short-lived values

ISLPED’ entry ROB, 4-way processor The Good News : 80%+ of the Values are Short-Lived %

ISLPED’03 24 Maintain the bit-vector Renamed Set by the Renamer at the time of renaming Arch. Reg Phys. Reg. Location (0-ROB,1-ARF) LOAD P31, R2, 100 SUB P32, P31, R3 ADD P33, P32, R4 LOAD R1, R2, 100 SUB R5, R1, R3 ADD R1, R5, R Renamed Identifying Short-Lived Values

ISLPED’03 25 Maintain the bit-vector Renamed Set by the Renamer at the time of renaming Arch. Reg Phys. Reg. Location (0-ROB,1-ARF) LOAD P31, R2, 100 SUB P32, P31, R3 ADD P33, P32, R4 LOAD R1, R2, 100 SUB R5, R1, R3 ADD R1, R5, R Renamed Identifying Short-Lived Values

ISLPED’03 26 Renamed bit is checked one cycle before writeback Value produced by LOAD is short-lived because Renamed [31]=1 LOAD P31, R2, 100 SUB P32, P31, R3 ADD P33, P32, R4 LOAD R1, R2, 100 SUB R5, R1, R3 ADD R1, R5, R Renamed Identifying Short-Lived Values

ISLPED’03 27 Hit Ratios to Retention Latches 46%73% Hit Ratios bzip2gapgccgzipmcfparserperltwolfInt Avg.vortexvpr appluapsiartequakemesamgridswimwupwiseFP Avg. Average Hit Ratio:

ISLPED’03 28 Experimental Results: Effect on Performance IPC 1.7%1.7%0.5%1.1% appluapsiartequakemesamgridswimwupwiseFP Avg. bzip2gapgccgzipmcfparserperltwolfInt Avg.vortexvpr Avg. IPC Drop:

ISLPED’03 29 Experimental Results: Effect on ROB Power Energy (pJ) 15.9%13.7%15.0% appluapsiartequakemesamgridswimwupwiseFP Avg. bzip2gapgccgzipmcfparserperltwolfInt Avg.vortexvpr Avg. Savings:

ISLPED’03 30 Conclusions We proposed a mechanism to further improve the performance and reduce the complexity of a processor that uses retention latches and eliminates the ROB source read ports The idea is to avoid caching the short-lived result values in the retention latches Both retention latch hit ratio and the overall performance improved Alternatively, fewer retention latches can be used with the same performance

ISLPED’03 31 THANK YOU ! *supported in part by DARPA through the PAC-C program and NSF LOW POWER RESEARCH GROUP Department of Computer Science State University of New York Binghamton, NY International Symposium on Low Power Electronics and Design (ISLPED’03), August 27 th 2003