1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer.


1 1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

2 2 Outline
– Introduction
– Motivations
– Contributions
  Basic idea: isolate short-lived operands in a small dedicated register file and avoid their writes to the ROB and the ARF
  Resources impacted: ROB, ARF
  Power savings: 21% with a 32-entry additional RF
– Results
– Conclusions
– Future work

3 3 A P6-like Superscalar Datapath
[Figure labels: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), LSQ, D-cache, ROB, Architectural Register File (ARF), result/status forwarding buses]

4 4 Out-of-Order Execution and In-Order Retirement ROB FRD Inst. Queue Ex ARF In-order front end Out-of-order core In-order retirement

5 5 Energy-dissipating Events ROB FRD Inst. Queue Ex ARF In-order front end Out-of-order core In-order retirement Write Read

6 6 The Idea : Isolating Short-Lived Values ROB FRD Inst. Queue Ex ARF Write Read SRF Write short-lived values into a small dedicated RF (SRF) In-order front end Out-of-order core In-order retirement

7 7 Register Renaming
– Used to avoid false data dependencies.
– A new physical register is allocated for EVERY new result
– P6 style: ROB slots serve as physical registers
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, P2, 100; SUB P32, P31, P3; ADD P33, P32, P4

8 8 Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 1 (1), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 5 (1)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4

9 9 Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 31 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 5 (1)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100

10 10 Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 31 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3

11 11 Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 33 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
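The renaming walk-through on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Mapping` class, the `rename` function, the tuple encoding of instructions, and the choice of 31 as the first free ROB slot are all hypothetical and are picked to reproduce the slide example, not taken from the authors' simulator.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    phys: int     # ROB slot number, or the ARF index when in_arf is True
    in_arf: bool  # True corresponds to Location = 1 (ARF) on the slides

def rename(program, first_rob_slot=31, num_arch_regs=6):
    """Rename (op, dest, src, src) tuples P6-style: every result gets a new ROB slot."""
    table = {r: Mapping(r, True) for r in range(num_arch_regs)}
    next_slot = first_rob_slot
    renamed = []
    for op, dest, *srcs in program:
        new_srcs = []
        for s in srcs:
            if isinstance(s, str):          # register operand such as "R1"
                m = table[int(s[1:])]
                new_srcs.append(("R" if m.in_arf else "P") + str(m.phys))
            else:
                new_srcs.append(s)          # immediate operand, left untouched
        # Sources are looked up before the destination mapping is updated.
        table[int(dest[1:])] = Mapping(next_slot, False)
        renamed.append((op, f"P{next_slot}", *new_srcs))
        next_slot += 1
    return renamed, table

program = [("LOAD", "R1", "R2", 100),
           ("SUB", "R5", "R1", "R3"),
           ("ADD", "R1", "R5", "R4")]
renamed, table = rename(program)
for insn in renamed:
    print(insn)
```

After the three instructions are renamed, the table maps R1 to slot 33 and R5 to slot 32, both with Location = 0 (ROB), which matches the final table on slide 11.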

12 12 Short-Lived Values
– Our definition: a value is short-lived if the destination register is renamed by the time of the result generation.
– Identified one cycle before the result writeback
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4 (ADD is the RENAMER of the LOAD's destination R1)

13 13 The Good News: 80%+ of the Values are Short-Lived (96-entry ROB, 4-way processor)
As rename-to-writeback latency increases in future datapaths, the percentage of short-lived values will also go up

14 14 The Idea : Isolating Short-Lived Values ROB FRD Inst. Queue Ex ARF Write Read SRF Write short-lived values into a small dedicated RF (SRF) LOAD R1, R2, 100 SUB R5, R1, R3 ADD R1, R5, R4 In-order front end Out-of-order core In-order retirement

15 15 Why do we need the SRF?
Need to hang on to the short-lived values to:
– Recover from branch mispredictions
– Reconstruct precise state
Example: LOAD R1, R2, 100; BEQ R5, R1, #100; ADD R1, R5, R4

16 16 Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 31 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
Renamed[31] = 1

17 17 Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 33 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
Renamed[31] = 1

18 18 Identifying Short-Lived Values
– Renamed bit is checked one cycle before writeback
– Value produced by LOAD is short-lived because Renamed[31]=1
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
Renamed[31] = 1
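The Renamed bit-vector can be modelled as follows. This is a minimal sketch, assuming a 96-entry ROB (the configuration quoted on slide 13); the class name `RenamedTracker` and its methods are hypothetical, and clearing the bit when a slot is newly allocated is an assumption of the sketch that the slides do not spell out.

```python
ROB_SIZE = 96  # assumed from the 96-entry ROB configuration on slide 13

class RenamedTracker:
    """Hypothetical model of the Renamed bit-vector (one bit per ROB slot)."""

    def __init__(self):
        self.rat = {}                    # arch reg -> ROB slot; absent means ARF
        self.renamed = [0] * ROB_SIZE

    def rename_dest(self, arch_reg, new_slot):
        old_slot = self.rat.get(arch_reg)
        if old_slot is not None:
            # The previous producer of arch_reg now has a renamer.
            self.renamed[old_slot] = 1
        # Assumption: a freshly allocated slot starts with its bit cleared.
        self.renamed[new_slot] = 0
        self.rat[arch_reg] = new_slot

    def is_short_lived(self, slot):
        # Checked one cycle before the result in this slot is written back.
        return self.renamed[slot] == 1

tracker = RenamedTracker()
tracker.rename_dest(1, 31)   # LOAD R1, ... -> P31
tracker.rename_dest(5, 32)   # SUB  R5, ... -> P32
tracker.rename_dest(1, 33)   # ADD  R1, ... -> P33, renames the LOAD's destination
print(tracker.is_short_lived(31))
```

In this run only slot 31 has its bit set, so the LOAD's result is the one classified as short-lived, as on slide 18.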

19 19 – When do we write short-lived values into the SRF? – When and how are the short-lived values removed from the SRF? – What happens on a branch misprediction? – How do we reconstruct a precise state? Managing the SRF: the Issues

20 20 Format of an SRF entry
Fields: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest. Arch. Reg.
– Branch Tag 1: branch identifier for the renamer; used to remove this entry if the renamer gets squashed
– Branch Tag 2: branch identifier for this instruction; used to remove this entry if this instruction gets squashed
– Branch identifier of an instruction = id/tag of the immediately preceding conditional branch

21 21 Writing to the SRF: the Conditions
– An instruction writes a short-lived result value into the SRF if:
  A free entry exists in the SRF
  No SRF entry keyed with the same ROB slot is already established
– Bit-vector Allocated_in_SRF is maintained
– One bit for each ROB entry
– Set at the time of writeback if the value is written into the SRF
– Reset at the time of removing the value from the SRF
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest. reg
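The two write conditions and the Allocated_in_SRF bookkeeping can be illustrated with a small model. It is a sketch only: the `SRFEntry` and `SRF` classes, the use of `None` for an invalid entry, and the linear search for a free entry are assumptions made for readability and say nothing about how the hardware locates a free slot.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SRFEntry:
    rob_idx: int                 # ROB slot of the producing instruction
    data: int
    branch_tag1: Optional[int]   # branch preceding the renamer
    branch_tag2: Optional[int]   # branch preceding this instruction
    dest_arch_reg: int

class SRF:
    """Hypothetical software model; None in a position means Valid = 0."""

    def __init__(self, size, rob_size):
        self.entries: List[Optional[SRFEntry]] = [None] * size
        self.allocated_in_srf = [0] * rob_size   # one bit per ROB entry

    def try_write(self, entry):
        # Condition 2: no entry keyed with the same ROB slot may already exist.
        if self.allocated_in_srf[entry.rob_idx]:
            return False
        # Condition 1: a free entry must exist.
        for i, slot in enumerate(self.entries):
            if slot is None:
                self.entries[i] = entry
                self.allocated_in_srf[entry.rob_idx] = 1   # set at writeback
                return True
        return False   # SRF full: the value follows the normal ROB path

srf = SRF(size=1, rob_size=96)
first = srf.try_write(SRFEntry(31, 1234, None, None, 1))
duplicate = srf.try_write(SRFEntry(31, 5678, None, None, 1))
full = srf.try_write(SRFEntry(32, 42, None, None, 5))
print(first, duplicate, full)
```

With a one-entry SRF, the first write succeeds, the second is rejected because slot 31 is already keyed, and the third is rejected because no free entry remains.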

22 22 Scenarios for Removing the Values from the SRF
Scenario 1: Normal commitment of the renamer
Scenario 2: The renamer gets squashed
Scenario 3: The instruction generating the short-lived value itself gets squashed

23 23 Removing the Values from the SRF: Scenario 1
– Values are removed by the Renamer
– 2-step process:
  Mark the instruction whose value is to be removed from the SRF (done at the time of renaming)
  Remove the marked value from the SRF IF NEED BE (done at the time of commitment)
– When ADD commits, it removes the value written by LOAD
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4 (ADD is the renamer)

24 24 Marking the Values for Removal
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 31 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
ROB: slot 31 = LOAD, slot 32 = SUB, slot 33 = (not yet filled)

25 25 Marking the Values for Removal
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 31 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
ROB: slot 31 = LOAD, slot 32 = SUB, slot 33 = ADD
FS (Flush SRF) field of the ROB: the ADD's FS field holds 31

26 26 Removing the Values (B is the renamer for A)
– FS field of B must match the ROB index field of an SRF entry
– This SRF entry must belong to A
Code: LOAD R1, R2, 100 (A); SUB R5, R1, R3; ADD R1, R5, R4 (B)
ROB: slot 31 = LOAD, slot 32 = SUB, slot 33 = ADD (FS = 31)
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
SRF entry: Valid = 1, ROB idx = 31, Data = result of the load, Dest = 1
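The marking step (recording in the renamer's ROB entry which slot held the previous value of its destination) might look like the sketch below. The `ROBEntry` class and `dispatch` function are hypothetical names, and representing "no previous in-flight producer" as `None` is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ROBEntry:
    op: str
    dest_arch: int
    fs: Optional[int]   # FS (Flush SRF): ROB slot whose SRF value this instruction may remove

def dispatch(rob: Dict[int, ROBEntry], rat: Dict[int, int], slot: int, op: str, dest_arch: int):
    """Allocate a ROB slot and mark the previous producer of dest_arch for removal."""
    previous_slot = rat.get(dest_arch)   # None when the register currently lives in the ARF
    rob[slot] = ROBEntry(op, dest_arch, fs=previous_slot)
    rat[dest_arch] = slot

rob: Dict[int, ROBEntry] = {}
rat: Dict[int, int] = {}
dispatch(rob, rat, 31, "LOAD", 1)
dispatch(rob, rat, 32, "SUB", 5)
dispatch(rob, rat, 33, "ADD", 1)
print(rob[33].fs)
```

Only the ADD ends up with a populated FS field (31), because it is the only instruction in the example that overwrites a register with an in-flight producer.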

27 27 Another Example (LOAD could not write to SRF)
SRF was full!
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 33 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 32 (0)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
Renamed[31] = 1

28 28 Another Example
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 33 (0), 2 → 2 (1), 3 → 3 (1), 4 → 4 (1), 5 → 5 (1)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4; …; MUL R2, R3, R4; DIV R2, R2, R5
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
LOAD and SUB have committed
Renamed[31] = 0

29 29 Another Example
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 33 (0), 2 → 31 (0), 3 → 3 (1), 4 → 4 (1), 5 → 5 (1)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4; …; MUL R2, R3, R4; DIV R2, R2, R5
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4; …; MUL P31, R3, R4
LOAD and SUB have committed
Renamed[31] = 0

30 30 Another Example
Arch. Reg → Phys. Reg. (Location: 0-ROB, 1-ARF): 0 → 0 (1), 1 → 33 (0), 2 → 32 (0), 3 → 3 (1), 4 → 4 (1), 5 → 5 (1)
Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4; …; MUL R2, R3, R4; DIV R2, R2, R5
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4; …; MUL P31, R3, R4; DIV P32, P31, R5
LOAD and SUB have committed
Renamed[31] = 1

31 31 Another Example (A’s ROB slot is assigned for C)
Renamed code: LOAD P31, R2, 100 (A); SUB P32, P31, R3; ADD P33, P32, R4 (B)
ROB: slot 31 = LOAD, slot 32 = SUB, slot 33 = ADD (FS = 31)
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
SRF entry: Valid = 0

32 32 Another Example (A’s ROB slot is assigned for C)
Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4 (B); …; MUL P31, R3, R4 (C); DIV P32, P31, R5 (D)
ROB: slot 31 = MUL, slot 32 = DIV, slot 33 = ADD (FS = 31)
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
SRF entry: Valid = 1, ROB idx = 31, Data = result of the mul, Dest = 2

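33 33 Ensuring that the right values are removed
– Bit-vector Uncommitted_Write is maintained
  One bit for each ROB entry
  Set at the time of establishing the SRF entry
  Reset at the time of commitment
– Instruction B removes the value written by A (allocated to ROB slot i) if:
  Allocated_in_SRF[i]=1, and
  Uncommitted_Write[i]=0
– Why both checks: A always commits before its renamer B. If Uncommitted_Write[i] is still 1 when B commits, the SRF entry keyed with slot i belongs to a newer, not yet committed instruction that reused the slot (C in the previous example), so B must leave it alone.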
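The commit-time check can be exercised on both cases from the examples: the normal case, where A's value sits in the SRF, and the slot-reuse case, where the entry keyed with slot 31 belongs to C. This is a sketch under assumptions: the function name `commit_renamer` is hypothetical and the SRF is modelled as a plain dictionary keyed by ROB index.

```python
def commit_renamer(fs, srf, allocated_in_srf, uncommitted_write):
    """Called when renamer B commits. Returns True if an SRF value was removed.

    fs: B's FS field (ROB slot i of the instruction it renamed), or None.
    srf: dict mapping ROB index -> data, a stand-in for the real SRF.
    """
    if fs is None:
        return False
    if allocated_in_srf[fs] == 1 and uncommitted_write[fs] == 0:
        srf.pop(fs, None)
        allocated_in_srf[fs] = 0   # reset when the value leaves the SRF
        return True
    return False

# Case 1: A (slot 31) wrote to the SRF and has already committed.
srf1 = {31: "load result"}
alloc1 = [0] * 96
uncommitted1 = [0] * 96
alloc1[31] = 1
removed1 = commit_renamer(31, srf1, alloc1, uncommitted1)

# Case 2: slot 31 was reused by C (MUL), which is still uncommitted.
srf2 = {31: "mul result"}
alloc2 = [0] * 96
uncommitted2 = [0] * 96
alloc2[31] = 1
uncommitted2[31] = 1
removed2 = commit_renamer(31, srf2, alloc2, uncommitted2)

print(removed1, removed2)
```

In the second case the entry survives, which is the behaviour the Uncommitted_Write bit is there to guarantee.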

34 34 Avoiding Unnecessary Commitments
– When an instruction allocated to ROB slot i commits and Allocated_in_SRF[i]=1, the data is not copied to the ARF.
[Figure labels: ROB, FRD, Inst. Queue, Ex, ARF, SRF, Dest. reg; Write and Read arrows]

35 35 – Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done – Example: Handling Branch Mispredictions : Scenario 2 32 BRSUBADD 3334 31 ROB SRF 1311load LOAD 31

36 36 – Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done – Example: Handling Branch Mispredictions 32 BR ROB SRF 1311load LOAD 313334

37 37 Handling Branch Mispredictions
– Solution: Tag each entry in the SRF with the id of the branch preceding the renamer (BT1).
  When the renamer is squashed, the value is removed from the SRF and is written to either the ROB or the ARF (based on the value of the Uncommitted_Write bit)
  Multiplex the ports to reduce complexity
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
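A rough model of Scenario 2 is given below. Several points are assumptions of the sketch rather than statements from the slides: the function name `squash_renamers`, the dictionary layout of an SRF entry, passing the set of squashed branch tags explicitly, and the reading that a set Uncommitted_Write bit sends the value back to the producer's ROB slot while a cleared bit sends it to the ARF. The case where the producer itself is squashed (Scenario 3) is deliberately left out.

```python
def squash_renamers(srf, squashed_tags, uncommitted_write, rob_values, arf):
    """Evict SRF entries whose renamer was squashed; return the surviving entries."""
    kept = []
    for entry in srf:
        if entry["bt1"] in squashed_tags:
            if uncommitted_write[entry["rob_idx"]]:
                # Producer still in flight: restore the value to its ROB slot.
                rob_values[entry["rob_idx"]] = entry["data"]
            else:
                # Producer already committed: the value becomes architectural state.
                arf[entry["dest"]] = entry["data"]
        else:
            kept.append(entry)
    return kept

uncommitted_write = [0] * 96
uncommitted_write[31] = 1
srf = [
    {"rob_idx": 31, "data": 111, "bt1": 7, "dest": 1},   # producer uncommitted
    {"rob_idx": 20, "data": 222, "bt1": 7, "dest": 5},   # producer committed
    {"rob_idx": 40, "data": 333, "bt1": 3, "dest": 2},   # renamer not squashed
]
rob_values = {}
arf = {}
srf = squash_renamers(srf, {7}, uncommitted_write, rob_values, arf)
print(rob_values, arf, len(srf))
```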

38 38 – Maintain the array Branch_Tags – One entry for each ROB slot Obtaining Branch Tag BT1 Arch. Reg Phys. Reg. Location (0-ROB,1-ARF) 001 1310 221 331 441 5330 LOAD P31, P2, 100 BEQ P6, P7, 200 SUB P33, P31, P3 ADD P34, P33, P4 LOAD R1, R2, 100 BEQ R6, R7, 200 SUB R5, R1, R3 ADD R1, R5, R4 31 Branch_Tags 7

39 39 – Problem: The instruction whose value was inserted into the SRF can itself be squashed – Example: Handling Branch Mispredictions : Scenario 3 31 LOADSUBADD 3233 31 ROB SRF 1311load BR 30

40 40 – Problem: The instruction whose value was inserted into the SRF can itself be squashed – Example: Handling Branch Mispredictions 313233 ROB SRF 1311load BR 30

41 41 Handling Branch Mispredictions
– Solution: Tag each entry in the SRF with the id of the branch preceding the instruction itself (BT2).
  Simply remove the value from the SRF if such a branch is mispredicted
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
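Scenario 3 is the simpler of the two misprediction cases, since the value is simply dropped. The sketch below assumes a hypothetical `squash_producers` helper, a dictionary layout for SRF entries, and that the set of squashed branch tags is supplied by the caller; clearing Allocated_in_SRF follows the rule that the bit is reset when a value is removed from the SRF.

```python
def squash_producers(srf, squashed_tags, allocated_in_srf):
    """Drop SRF entries whose producing instruction follows a mispredicted branch."""
    kept = []
    for entry in srf:
        if entry["bt2"] in squashed_tags:
            allocated_in_srf[entry["rob_idx"]] = 0   # value leaves the SRF
        else:
            kept.append(entry)
    return kept

allocated_in_srf = [0] * 96
allocated_in_srf[31] = 1
allocated_in_srf[12] = 1
srf = [
    {"rob_idx": 31, "data": 111, "bt2": 30},   # produced after branch 30
    {"rob_idx": 12, "data": 999, "bt2": 4},    # produced after an older branch
]
srf = squash_producers(srf, {30}, allocated_in_srf)
print(srf)
```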

42 42 Supporting Precise Interrupts
– Allow all instructions preceding the faulting instruction to commit
– Squash all instructions following the faulting instruction
– Copy the values of ALL valid SRF entries to the ARF.
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
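The final copy step of the precise-interrupt sequence is small enough to show directly. The sketch assumes the first two steps (committing older instructions, squashing younger ones) have already happened, and the function name `drain_srf_to_arf` and the list-of-dictionaries SRF model (with `None` for an invalid entry) are hypothetical.

```python
def drain_srf_to_arf(srf, arf):
    """Copy every valid SRF entry into the ARF, then invalidate the SRF."""
    for i, entry in enumerate(srf):
        if entry is not None:                 # None models Valid = 0
            arf[entry["dest"]] = entry["data"]
            srf[i] = None

arf = {1: 0, 2: 0, 5: 0}
srf = [{"dest": 1, "data": 77}, None, {"dest": 5, "data": 88}]
drain_srf_to_arf(srf, arf)
print(arf)
```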

43 43 Experimental Setup
[Figure labels: compiled SPEC benchmarks; datapath specs; microarchitectural simulator; performance stats; transition counts and context information; inter-thread buffers; data analyzer / intra-stream analysis; two separate threads; VLSI layout data; SPICE decks; SPICE; SPICE measures of energy per transition; energy/power estimator; power/energy stats]

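44 44 Results: Percentage of Values Written into the SRF
[Figure: y-axis in %; bar labels: 40.5%, 60.1%, 77.5%, 82.3%, 86.7%]

45 45 Results: Average Time Spent by a Value in the SRF
[Figure: y-axis in cycles] Average: 12-15 cycles

46 46 Results: Percentage of Values not copied into the ARF
[Figure: y-axis in %; bar labels: 42.2%, 61.9%, 79.3%, 84.1%, 86.7%]

47 47 Results: Net Energy Reduction
[Figure: y-axis in pJ; energy components: ROB + additional logic, ARF, SRF; labels: 21%, 16%, 9%, 23%]

48 48 Results: Net Energy Reduction
[Figure: y-axis in pJ; energy components: ROB + additional logic, ARF, SRF; labels: 21%, 16%, 9%, 23%]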

49 49 Related Work
– Register Traffic Analysis (Franklin and Sohi, MICRO’92)
  Studied the useful lifetime of register instances
  Delaying the writes until 30 more instructions are dispatched can eliminate 80% of the writes (if perfect knowledge of the last use is available)
  Buffering the 30 most recently generated results avoids 80% of writebacks
– Lozano and Gao (MICRO’95)
  90% of all result values are short-lived (consumed while in the ROB)
  A mechanism to avoid commitment of these values, and also to avoid register allocation for them, is proposed
  ROB slots are exposed to the compiler in the form of symbolic registers
– Lazy Retirement (Savransky, Ronen, Gonzalez, WCED’02)
  Hardware-based scheme to avoid unnecessary commitments
  Copying from the ROB to the ARF is delayed until the ROB slot is reused. In many cases, the register is invalidated by the newer instruction
  An additional rename table is needed
  About 75% of commits are avoided

50 50 Conclusions
– Significant power savings & negligible impact on performance
– Sources of power savings:
  The majority of generated results are written into a small, lightly-ported SRF
  Unnecessary commitments are avoided
  Additional logic/storage needed to do this is simple
– For a 32-entry SRF, more than 77% of writebacks and more than 79% of commitments can be avoided
– This results in energy savings of 21% on the ROB and the ARF

51 51 THANK YOU!
This work was supported in part by DARPA through the PAC-C program and NSF
LOW POWER RESEARCH GROUP, Department of Computer Science, State University of New York, Binghamton, NY 13902-6000, http://www.cs.binghamton.edu/~lowpower
Parallel Architectures and Compilation Techniques (PACT’03), October 1st, 2003

52 52 Complexity of the Solution
– SRF
– Three bit vectors (same size as the ROB): Renamed, Allocated_in_SRF, Uncommitted_Write
– 4-bit array Branch_Tags (same size as the ROB)
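To get a feel for the size of the added bookkeeping state, the tally below counts bits for one configuration. It assumes the 96-entry ROB quoted on slide 13 and reads "4-bit array" as 4 bits per ROB entry; both are assumptions of this sketch, and the SRF storage itself is not counted because the slides do not give its field widths.

```python
ROB_ENTRIES = 96      # assumed from the 96-entry ROB configuration on slide 13
BRANCH_TAG_BITS = 4   # assumed: 4 bits per Branch_Tags entry

bit_vectors = ["Renamed", "Allocated_in_SRF", "Uncommitted_Write"]
bit_vector_bits = len(bit_vectors) * ROB_ENTRIES      # one bit per ROB entry each
branch_tags_bits = BRANCH_TAG_BITS * ROB_ENTRIES
total_bits = bit_vector_bits + branch_tags_bits

print(bit_vector_bits, branch_tags_bits, total_bits)
```

Under these assumptions the extra state outside the SRF comes to a few hundred bits.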

