ISLPED’03 1 Reducing Reorder Buffer Complexity Through Selective Operand Caching* Gurhan Kucuk, Dmitry Ponomarev, Oguz Ergin, Kanad Ghose. Department of Computer Science, State University of New York, Binghamton, NY. International Symposium on Low Power Electronics and Design (ISLPED’03), August 26th, 2003. *Supported in part by DARPA through the PAC-C program and NSF.
ISLPED’03 2 Outline: Reorder Buffer (ROB) complexities; Motivation for the low-complexity ROB; Low-complexity ROB (ICS’02); Improving the design using short-lived values; Results; Concluding remarks
ISLPED’03 3 P6 Style Superscalar Datapath [Figure: pipeline diagram with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to the function units (FU1, FU2, ..., FUm; EX), ROB, ARF (Architectural Register File), LSQ, D-cache, and the result/status forwarding buses]
ISLPED’03 4 ROB Port Requirements for a W-way CPU. Writeback: W write ports to write results; Dispatch/Issue: 2W read ports to read the source operands; Decode/Dispatch: W write ports to set up entries; Commit: W read ports for instruction commitment
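The port breakdown above can be tallied with a few lines of Python. This is only a sketch of the slide's four categories; the function name `rob_ports` and its `source_read_ports` flag are illustrative assumptions, and the count says nothing about how a real bitcell layout shares or splits ports.

```python
def rob_ports(width: int, source_read_ports: bool = True) -> dict:
    """Tally ROB ports for a W-way machine using the slide's four categories."""
    ports = {
        "writeback_write": width,       # W write ports to write results
        "dispatch_setup_write": width,  # W write ports to set up entries
        "commit_read": width,           # W read ports for instruction commitment
        # 2W read ports for source operands; dropped when they are eliminated
        "source_read": 2 * width if source_read_ports else 0,
    }
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    print(rob_ports(4))                           # all four categories, W = 4
    print(rob_ports(4, source_read_ports=False))  # source read ports removed
```

Under this accounting the source-operand read ports make up 2W of the 5W total, which is why they are the natural target for elimination.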
ISLPED’03 5 Where are the Source Values Coming From? [Figure: the P6-style datapath with three source-operand paths numbered 1, 2, 3]
ISLPED’03 6 Where are the Source Values Coming From? 96-entry ROB, 4-way processor, SPEC2K benchmarks. Chart values: 62%, 32%, 6%
ISLPED’03 7 How Efficiently are the Ports Used? Writeback: W write ports to write results; Dispatch/Issue: 2W read ports to read the source operands; Decode/Dispatch: W write ports to set up entries; Commit: W read ports for instruction commitment. Slide annotation: 6%
ISLPED’03 8 Our Solution: Elimination of Read Ports [Figure: the P6-style datapath with source-operand paths 1, 2, 3]
ISLPED’03 10 Our Solution: Elimination of Read Ports [Figure: the same datapath with only source-operand paths 1 and 3 remaining]
ISLPED’03 11 Comparison of ROB Bitcells (0.18µ, TSMC): layout of a 32-ported SRAM bitcell vs. layout of a 16-ported SRAM bitcell. Area reduction: 71%; shorter bit and wordlines
ISLPED’03 12 Completely Eliminating the Source Read Ports on the ROB. The Problem: the issue of instructions that require a value stored in the ROB will stall. Solution: forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING
ISLPED’03 13 Late Forwarding: Use the Normal Forwarding Buses! [Figure: the P6-style datapath, highlighting the result/status forwarding buses]
ISLPED’03 15 Improving Performance: cache recently generated values in a set of RETENTION LATCHES (RL). Retention latches are SMALL and FAST; only 8 to 16 latches are needed in the set; the entire set has 1 or 2 read ports
ISLPED’03 16 Datapath with the Retention Latches [Figure: the P6-style datapath with ROB, ARF, IQ, function units, LSQ, D-cache, and result/status forwarding buses]
ISLPED’03 17 Datapath with the Retention Latches [Figure: the same datapath with a RETENTION LATCHES block added]
ISLPED’03 18 Retention Latch Management Strategies. FIFO: 8-entry RL, 42% hit rate; 16-entry RL, 55% hit rate. LRU: 8-entry RL, 56% hit rate; 16-entry RL, 62% hit rate. Random replacement: worse performance than FIFO
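The difference between the FIFO and LRU strategies above can be illustrated with a small software model. This is a sketch only: the class `RetentionLatches`, its methods, and the register tags are hypothetical names, and the hardware would use an associative latch array rather than a Python dictionary.

```python
from collections import OrderedDict


class RetentionLatches:
    """Toy model of a small retention-latch set with FIFO or LRU replacement."""

    def __init__(self, size: int, policy: str = "FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()  # tag -> value, eviction candidate first
        self.hits = 0
        self.lookups = 0

    def write(self, tag, value):
        """Cache a newly generated result, evicting one entry if the set is full."""
        if tag in self.entries:
            self.entries[tag] = value
            if self.policy == "LRU":
                self.entries.move_to_end(tag)
            return
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # oldest (FIFO) or least recently used (LRU)
        self.entries[tag] = value

    def read(self, tag):
        """Look up a source operand; returns None on a miss."""
        self.lookups += 1
        if tag not in self.entries:
            return None
        self.hits += 1
        if self.policy == "LRU":
            self.entries.move_to_end(tag)  # a read refreshes recency only under LRU
        return self.entries[tag]


if __name__ == "__main__":
    for policy in ("FIFO", "LRU"):
        rl = RetentionLatches(2, policy)
        rl.write("P1", 10)
        rl.write("P2", 20)
        rl.read("P1")        # touch P1
        rl.write("P3", 30)   # FIFO evicts P1, LRU evicts P2
        print(policy, list(rl.entries))
```

The only behavioral difference is whether a read refreshes an entry's position, which is what lets LRU keep values that are still being consumed.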
ISLPED’03 19 Advantages of Using Retention Latches: reduces energy dissipation in the ROB (avoids creating a localized hot spot); reduces associated performance losses; reduces ROB complexity (smaller floor plan, easier validation)
ISLPED’03 20 Improving Retention Latch Management. PROBLEM: all generated results are written into the latches unconditionally, irrespective of whether they could ever be read from the RLs. CONSEQUENCE: the array of RLs is not utilized efficiently, and the performance loss is still noticeable. SOLUTION: identify the values that will never be read after the cycle in which they are generated, and avoid writing these values into the RLs
ISLPED’03 21 Short-Lived Values. Our definition: a value is short-lived if its destination register has been renamed by the time the result is generated. This is identified one cycle before the result writeback. Example before renaming: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4. After the RENAMER: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
ISLPED’03 22 Key Idea: Do not cache short-lived values. AVOID WRITING SHORT-LIVED VALUES INTO THE RETENTION LATCHES. Reasons: short-lived values are forwarded directly to all potential consumers in the issue queue; no instruction will ever consume a short-lived value from the retention latches. Results: increased RL hit ratios and better overall performance
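The effect of the key idea on latch occupancy can be seen on a toy trace. The sketch below assumes a FIFO latch set and a hand-made event list; the function name `rl_hit_ratio`, the event encoding, and the trace itself are illustrative assumptions, not data from the experiments.

```python
from collections import deque


def rl_hit_ratio(events, size, skip_short_lived):
    """Replay a trace against a FIFO latch set and return the read hit ratio.

    events: tuples ("write", tag, short_lived) or ("read", tag).
    """
    latches = deque(maxlen=size)  # appending to a full deque drops the oldest tag
    hits = reads = 0
    for event in events:
        if event[0] == "write":
            _, tag, short_lived = event
            if skip_short_lived and short_lived:
                continue  # consumers get this value from the forwarding buses only
            latches.append(tag)
        else:
            _, tag = event
            reads += 1
            hits += tag in latches
    return hits / reads if reads else 0.0


if __name__ == "__main__":
    # Hypothetical trace: P2 and P3 are short-lived, P1 is read later.
    trace = [
        ("write", "P1", False),
        ("write", "P2", True),
        ("write", "P3", True),
        ("read", "P1"),
    ]
    print(rl_hit_ratio(trace, size=2, skip_short_lived=False))  # P1 was evicted
    print(rl_hit_ratio(trace, size=2, skip_short_lived=True))   # P1 still cached
```

In this trace the two short-lived writes push the useful value out of a 2-entry set unless they are filtered, which is the mechanism behind the higher hit ratios claimed on the slide.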
ISLPED’03 23 The Good News: 80%+ of the Values are Short-Lived. 96-entry ROB, 4-way processor [Figure: bar chart of the percentage of short-lived values]
ISLPED’03 24 Identifying Short-Lived Values. Maintain the bit-vector Renamed, set by the Renamer at the time of renaming. [Table: rename table with columns Arch. Reg, Phys. Reg., Location (0-ROB, 1-ARF), Renamed] Before renaming: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4. After renaming: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
ISLPED’03 26 Identifying Short-Lived Values. The Renamed bit is checked one cycle before writeback. The value produced by LOAD is short-lived because Renamed[31]=1. Before renaming: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4. After renaming: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
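The Renamed bit-vector bookkeeping from the last slides can be sketched as follows. This is a minimal model under stated assumptions: the class `Renamer`, its method names, and the choice to start physical register numbering at 31 (to mirror the slide's P31 to P33) are illustrative, and the Location column of the rename table is left out.

```python
class Renamer:
    """Toy renamer that tracks the Renamed bit for each physical register."""

    def __init__(self, first_free: int = 31):
        self.mapping = {}   # architectural register -> current physical register
        self.renamed = {}   # physical register -> Renamed bit
        self.next_free = first_free

    def rename_dest(self, arch_reg: str) -> int:
        """Allocate a physical register for a destination and update Renamed."""
        previous = self.mapping.get(arch_reg)
        if previous is not None:
            self.renamed[previous] = True  # the older mapping has been renamed
        phys = self.next_free
        self.next_free += 1
        self.mapping[arch_reg] = phys
        self.renamed[phys] = False
        return phys

    def is_short_lived(self, phys: int) -> bool:
        """Check performed one cycle before the result is written back."""
        return self.renamed.get(phys, False)


if __name__ == "__main__":
    r = Renamer()
    p_load = r.rename_dest("R1")  # LOAD R1, R2, 100 -> P31
    p_sub = r.rename_dest("R5")   # SUB  R5, R1, R3  -> P32
    p_add = r.rename_dest("R1")   # ADD  R1, R5, R4  -> P33, sets Renamed[31]
    print(p_load, r.is_short_lived(p_load))  # 31 True
    print(p_sub, r.is_short_lived(p_sub))    # 32 False
```

If the LOAD completes after the ADD has been renamed, its result is flagged short-lived and would not be written into the retention latches; if it completes earlier, the bit is still clear and the value is cached as usual.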
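ISLPED’03 27 Hit Ratios to Retention Latches [Figure: bar chart of hit ratios per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Average hit ratio: 46%, 73%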
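ISLPED’03 28 Experimental Results: Effect on Performance [Figure: bar chart of IPC per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Avg. IPC drop: 1.7%, 1.7%, 0.5%, 1.1%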
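ISLPED’03 29 Experimental Results: Effect on ROB Power [Figure: bar chart of energy (pJ) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Avg. savings: 15.9%, 13.7%, 15.0%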
ISLPED’03 30 Conclusions. We proposed a mechanism to further improve the performance and reduce the complexity of a processor that uses retention latches and eliminates the ROB source read ports. The idea is to avoid caching short-lived result values in the retention latches. Both the retention latch hit ratio and the overall performance improved. Alternatively, fewer retention latches can be used for the same performance
ISLPED’03 31 THANK YOU! LOW POWER RESEARCH GROUP, Department of Computer Science, State University of New York, Binghamton, NY. International Symposium on Low Power Electronics and Design (ISLPED’03), August 27th, 2003. *Supported in part by DARPA through the PAC-C program and NSF.