
1 Distributed Reorder Buffer Schemes for Low Power*
Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
21st International Conference on Computer Design (ICCD’03), October 14th, 2003
*Supported in part by DARPA through the PAC-C program and NSF

2 Outline
– Reorder Buffer (ROB) complexities
– Motivation for the low-complexity ROB
– Low-complexity ROB designs
    Fully Distributed ROB
    Retention Latches (RLs) revisited (ICS’02)
    Combined Scheme
– Results
– Concluding remarks

3 P6-style Superscalar Datapath
[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the issue queue (IQ), instruction issue to function units FU1, FU2, ..., FUm (EX), the ROB, the Architectural Register File (ARF), and the result/status forwarding buses]

4 PPC 620-style Superscalar Datapath
[Figure: the same pipeline (F1, F2, D1, D2, IQ, FU1 ... FUm, ARF, result/status forwarding buses) with an RB block alongside the ROB]

5 ROB Port Requirements for a W-way CPU
– Writeback: W write ports to write results
– Dispatch/Issue: 2W read ports to read the source operands
– Decode/Dispatch: W write ports to set up entries
– Commit: W read ports for instruction commitment
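
The port counts on this slide can be tallied for a concrete machine width. This is a minimal sketch that assumes the four port groups are simply additive; the function name `rob_ports` and the dictionary keys are illustrative and do not come from the talk.

```python
def rob_ports(width: int) -> dict:
    """Tally centralized-ROB ports for a W-way machine, per the slide's list."""
    ports = {
        "writeback_write": width,    # W write ports to write results
        "source_read": 2 * width,    # 2W read ports for source operands
        "dispatch_write": width,     # W write ports to set up entries
        "commit_read": width,        # W read ports for commitment
    }
    # Sum is taken before the "total" key is added, so it counts ports only.
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    print(rob_ports(4))
```

For the 4-way machine simulated later in the talk, the source-operand read ports alone account for 8 of the ports in this tally.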

6 What This Work is All About
– ROB complexity reduction is important for reducing power and improving performance
    ROB dissipates a non-trivial fraction of the total chip power
    ROB accesses stretch over several cycles
– Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance

7 Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
– Area reduction: 71%
– Shorter bitlines and wordlines


9 Reorder Buffer Distribution
[Figure: the datapath of slide 3 with result storage split into ROB Components (ROBCs) ROBC 1, ROBC 2, ..., ROBC m, one per function unit; the ROB holds pointers to entries within the ROBCs]

10 Impact of Distributing the ROB
– Each ROBC is effectively a small Rename Buffer
    Smaller read/write access energy
    Faster access time
– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
    Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
– Fits in naturally with a multi-clustered datapath design


14 Problems with the earlier Multi-banked RF Schemes, and some good news!
– Port conflicts result in performance penalty
    Good news: totally avoid write port conflicts
    Good news: minimize read port conflicts at commitment
    Good news: totally avoid source read port conflicts
– Interconnection network is more complex
    Good news: completely remove source read ports

15 ROBCs Assigned to Each Function Unit
[Figure: a centralized ROB with entries 1 to n, each holding an (FU_id, offset) pointer such as 11, 21, or m1, next to the distributed ROBCs #1 to #m; ROBC #i is attached to FU #i]

16 Good News: Write port conflicts are avoided
[Figure: the same centralized ROB and distributed ROBCs; each FU writes only its own ROBC, through 1 write port]

17-26 Round Robin Scheduling at Dispatch Time
[Animation: four Int ADD ROBCs (#1 to #4) with two entries each, next to the centralized ROB of (FU_id, offset) pointers. ADD, SUB, and AND are dispatched in turn. Each instruction first reserves an entry in the next Int ADD ROBC in round-robin order, and the centralized ROB then records the pointer: ADD gets 11, SUB gets 21, AND gets 31.]
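
The dispatch walk-through above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the class name `DistributedROB`, the operation-class labels, and the policy of blocking dispatch when the round-robin-selected ROBC is full are all assumptions made for the example.

```python
from collections import deque


class DistributedROB:
    """Toy model: per-FU ROBCs plus a centralized FIFO of (FU_id, offset) pointers."""

    def __init__(self, robc_sizes, groups):
        # robc_sizes: {robc_id: number_of_entries}
        # groups: {op_class: [robc_id, ...]} lists the ROBCs that serve each class
        self.free = {rid: deque(range(1, n + 1)) for rid, n in robc_sizes.items()}
        self.groups = groups
        self.next_idx = {cls: 0 for cls in groups}
        self.fifo = deque()  # centralized ROB, program order

    def dispatch(self, op_class):
        """Reserve an entry in the next ROBC of this class; None means dispatch blocks."""
        robcs = self.groups[op_class]
        rid = robcs[self.next_idx[op_class]]
        if not self.free[rid]:
            return None  # assumed policy: block rather than try another ROBC
        offset = self.free[rid].popleft()
        self.next_idx[op_class] = (self.next_idx[op_class] + 1) % len(robcs)
        self.fifo.append((rid, offset))
        return (rid, offset)


if __name__ == "__main__":
    rob = DistributedROB(
        {1: 2, 2: 2, 3: 2, 4: 2, 5: 2},
        {"int_add": [1, 2, 3, 4], "int_muldiv": [5]},
    )
    for name, cls in [("ADD", "int_add"), ("SUB", "int_add"), ("AND", "int_add"),
                      ("MUL", "int_muldiv"), ("DIV", "int_muldiv")]:
        print(name, rob.dispatch(cls))
```

With two entries per ROBC as in the slides, the five instructions receive pointers to ROBC #1, #2, #3, #5, and #5 respectively, and a third multiply or divide would block because ROBC #5 is full.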

27 Good News: Avoiding Read Port Conflicts
[Figure: ADD (11), SUB (21), and AND (31) occupy different ROBCs, so each can be read for commitment through its own ROBC's 1 read port]

28-33 Round Robin Scheduling at Dispatch Time (continued)
[Animation: MUL and DIV are dispatched next. Both go to the single Int MUL/DIV ROBC #5, which has two entries. Each reserves an entry there, and the centralized ROB records pointers 51 (MUL) and 52 (DIV) after ADD (11), SUB (21), and AND.]

34 Read Port Conflicts at Commitment
[Figure: the centralized ROB holds ADD (11), SUB (21), AND, MUL (51), DIV (52); MUL and DIV both sit in Int MUL/DIV ROBC #5, which has 1 read port to commitment]
CONFLICT: if MUL and DIV want to commit in the same cycle, they compete for the single read port of ROBC #5.
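
The commit-time conflict can be illustrated with a small sketch. It assumes in-order commitment that stops at the first instruction whose ROBC read port is already in use that cycle, and it assumes every listed instruction has already completed; the function name `commit_one_cycle` and its parameters are hypothetical.

```python
def commit_one_cycle(fifo, width, read_ports_per_robc=1):
    """Return the (robc_id, offset) entries that can commit this cycle.

    fifo is the centralized ROB in program order (oldest first). Commit is
    in order, so a read-port conflict also stalls every younger instruction.
    """
    used = {}
    committed = []
    for robc_id, offset in fifo[:width]:
        if used.get(robc_id, 0) >= read_ports_per_robc:
            break  # this ROBC's read port is already taken in this cycle
        used[robc_id] = used.get(robc_id, 0) + 1
        committed.append((robc_id, offset))
    return committed


if __name__ == "__main__":
    # MUL and DIV are both in ROBC #5, followed by an ADD in ROBC #1.
    print(commit_one_cycle([(5, 1), (5, 2), (1, 1)], width=4))
```

In the example call only MUL commits in the cycle, because DIV needs the same read port, whereas instructions spread over different ROBCs by round-robin dispatch can commit together.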


37 Distributed ROB Design 1: with source read ports
Ports on each ROBC:
– Writeback: 1 write port to write results
– Dispatch/Issue: 1 read port to read the source operands
– Commit: 1 read port for instruction commitment

38 Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats plus transition counts and context information. VLSI layout data yields a SPICE deck; SPICE produces measures of energy per transition. An energy/power estimator combines both to produce power/energy stats.]

39 Configuration of the Simulated System
Machine width: 4-way
Issue Queue: 32 entries
Reorder Buffer: 96 entries
Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks

40-42 Peak/Average demands on the number of ROBC entries (each cell: peak / avg.)
ROBC type                 SPEC 2000 Integer Avg.   SPEC 2000 FP Avg.   SPEC 2000 Avg.   Entries assigned
Int ADD #1, #2, #3, #4    16.9 / 4.4               14.2 / 4.9          15.7 / 4.6       8 each
Int MUL/DIV               4.1 / 0.1                3.2 / 0.8           3.7 / 0.4        4
FP ADD #1, #2, #3, #4     1.6 / 0.04               3.8 / 0.6           2.6 / 0.3        4 each
FP MUL/DIV                3.8 / 0.04               6.7 / 1.1           5.0 / 0.5        4
Load                      28.6 / 9.3               23.5 / 7.5          26.4 / 8.5       16
Number of entries assigned to each ROBC: 8+8+8+8+4+4+4+4+4+4+16 = 72 entries (the 8_4_4_4_16 configuration)

43-44 Percentage of cycles when dispatch blocks for 8_4_4_4_16
ROBC type                 SPEC 2000 Integer Avg.   SPEC 2000 FP Avg.   SPEC 2000 Avg.
Int ADD #1, #2, #3, #4    0.9                      1.5                 1.2
Int MUL/DIV               0.1                      1.0                 0.5
FP ADD #1, #2, #3, #4     0                        0.1                 0
FP MUL/DIV                0                        0.8                 0.4
Load                      5.2                      1.9                 3.8
Number of entries assigned to each ROBC: 8+8+8+8+4+4+4+4+4+4+16 = 72 entries
Average IPC drop with the 8_4_4_4_16 configuration = 4.8%

45 Reducing performance penalty: 12_6_4_6_20 Configuration
[Table: the dispatch-block percentages for 8_4_4_4_16 from the previous slide, repeated]
Number of entries assigned to each ROBC: 12+12+12+12+6+4+4+4+4+6+20 = 96 entries (the 12_6_4_6_20 configuration)
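
The two configuration names can be checked against their stated totals with a short calculation. This sketch assumes the naming order used in the tables (Int ADD, Int MUL/DIV, FP ADD, FP MUL/DIV, Load) and four copies each of the Int ADD and FP ADD ROBCs; the helper name `total_entries` is illustrative.

```python
def total_entries(config: str, copies=(4, 1, 4, 1, 1)) -> int:
    """Total ROBC entries for a name like '8_4_4_4_16'.

    copies gives how many ROBCs of each type exist, in the order
    Int ADD, Int MUL/DIV, FP ADD, FP MUL/DIV, Load.
    """
    sizes = [int(field) for field in config.split("_")]
    return sum(size * count for size, count in zip(sizes, copies))


if __name__ == "__main__":
    for name in ("8_4_4_4_16", "12_6_4_6_20"):
        print(name, total_entries(name))
```

The larger configuration uses the same total capacity as the 96-entry centralized ROB of the simulated baseline, while the smaller one uses 72 entries.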

46 Performance Results for 12_6_4_6_20 Configuration
[Chart: IPC per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC drop with the 12_6_4_6_20 configuration = 2.4%

47-49 Eliminating All Source Read Ports
Starting from Design 1 (1 write port for writeback, 1 read port for source operands at dispatch/issue, 1 read port for commit), the source-operand read port is removed. Each ROBC is left with:
– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment

50-51 Where are the Source Values Coming From?
[Figure: the P6-style datapath with three numbered source-value paths (1, 2, 3)]
[Chart: breakdown of source-value origins, 62% / 32% / 6%; 96-entry ROB, 4-way processor, SPEC2K benchmarks]

52 How Efficiently are the Ports Used?
– Writeback: W write ports to write results
– Dispatch/Issue: 2W read ports to read the source operands
– Decode/Dispatch: W write ports to set up entries
– Commit: W read ports for instruction commitment
[Callout: 6%]

53-55 Our Solution: Elimination of Read Ports
[Animation: the P6-style datapath with source-value paths 1, 2, and 3; in the final build only paths 1 and 3 remain]

56-58 Distributed Reorder Buffer Scheme: Elimination of Source Read Ports
[Figure: the datapath with ROBC 1, ROBC 2, ..., ROBC m attached to the function units; the ROB holds pointers to entries within the ROBCs]

59 Completely Eliminating the Source Read Ports on the ROBCs
– The problem: issue of instructions that require a value stored in a ROBC will stall
– Solution: forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING
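
Late forwarding can be pictured as a second broadcast of the result at commit time. The sketch below is a simplified model under stated assumptions: tags are (FU_id, offset) pointers, the issue queue is a plain list, and the names `IssueQueueEntry`, `broadcast`, and `commit_with_late_forwarding` are invented for illustration rather than taken from the talk.

```python
class IssueQueueEntry:
    """An instruction waiting in the issue queue for source operands."""

    def __init__(self, name, src_tags):
        self.name = name
        self.waiting = set(src_tags)  # tags of operands not yet captured
        self.values = {}

    def capture(self, tag, value):
        # Latch the value if this entry is waiting on the broadcast tag.
        if tag in self.waiting:
            self.waiting.discard(tag)
            self.values[tag] = value

    @property
    def ready(self):
        return not self.waiting


def broadcast(issue_queue, tag, value):
    """Drive (tag, value) on a forwarding bus; every waiting entry snoops it."""
    for entry in issue_queue:
        entry.capture(tag, value)


def commit_with_late_forwarding(issue_queue, arf, dest_reg, tag, value):
    """At commit, write the ARF and re-broadcast the value on the same buses."""
    arf[dest_reg] = value
    broadcast(issue_queue, tag, value)


if __name__ == "__main__":
    # SUB was dispatched after its producer wrote back, so it missed the
    # normal forwarding and cannot read the ROBC (no source read port).
    iq = [IssueQueueEntry("SUB", [(1, 1)])]
    arf = {}
    commit_with_late_forwarding(iq, arf, "r3", (1, 1), 42)
    print(iq[0].ready, arf)
```

The point of the design choice is that no extra read port is needed: the value is already being read through the commit port, and the existing forwarding buses carry it to any instruction that missed the first broadcast.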

60-61 Late Forwarding: Use the Normal Forwarding Buses!
[Animation: the distributed-ROBC datapath; in the second build a "Late Forwarding" path is added from the ROBCs onto the result/status forwarding buses]

62 Performance Drop of Simplified ROBC Design
[Chart: performance drop % per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Callouts: 37%, 17%]
Average IPC drop: 9.6%

63 IPC Penalty: Source Value Not Accessible within the ROBC
[Timeline: lifetime of a result value. Result generation time and forwarding, then the value resides within a ROBC, then late forwarding/commitment, then the value resides within the ARF]

64 Improving IPC with No Read Ports
– Cache recently generated values in a set of RETENTION LATCHES (RLs)
– Retention Latches are SMALL and FAST
    Only 8 to 16 latches needed in the set
    Entire set has 1 or 2 read ports
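
A set of retention latches behaves like a very small cache of recent results. The sketch below assumes FIFO replacement (the Design 2 slide later names "FIFO RLs") and lookup by result tag; the class name `RetentionLatches` and its methods are hypothetical, and read-port limits are not modeled.

```python
from collections import OrderedDict


class RetentionLatches:
    """Toy FIFO cache of recently generated result values, keyed by tag."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()  # oldest entry first

    def insert(self, tag, value):
        """Record a newly written-back result, evicting the oldest if full."""
        if tag in self.latches:
            del self.latches[tag]  # tag reused: treat it as a fresh entry
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)  # FIFO eviction
        self.latches[tag] = value

    def lookup(self, tag):
        """Return the cached value, or None on a miss."""
        return self.latches.get(tag)


if __name__ == "__main__":
    rl = RetentionLatches(size=2)
    rl.insert((1, 1), 10)
    rl.insert((2, 1), 20)
    rl.insert((3, 1), 30)  # evicts the value tagged (1, 1)
    print(rl.lookup((1, 1)), rl.lookup((3, 1)))
```

On a miss, the consumer in this model would fall back to late forwarding at commit time, which is where the remaining IPC penalty comes from.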

65-66 Adding Retention Latches into the Picture
[Animation: the distributed-ROBC datapath with late forwarding; in the second build a RETENTION LATCHES block is added]

67-68 Distributed ROB Design 2: with Retention Latches
Ports on each ROBC:
– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment
Plus eight 2-ported FIFO RLs

69-71 Performance Results for 12_6_4_6_20 Configuration
[Chart builds: IPC per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC drop with the 12_6_4_6_20 configuration, as reported on the three successive builds: 2.4%, 1.7%, 3.8%

72 Power Results for 12_6_4_6_20 Configuration
[Chart: power savings % per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Power savings: 49%, 47%, 23%

73 Power Results for 12_6_4_6_20 Configuration (compared to baseline case with 64-entry Rename Buffers)
[Chart: power savings % per benchmark, same benchmark set]
Power savings: 39%, 37%, 20%

74 Summary of Results
– Low performance degradation:
    1.7% IPC drop on average (compared to 2-cycle ROB)
    3.8% IPC drop on average (compared to 1-cycle ROB)
– ROB power savings:
    as high as 49% (compared to P6-style datapath: 96-entry ROB)
    as high as 39% (compared to Rename Buffer design: 96-entry ROB, 64-entry RB)

75 Conclusions
– We introduced a conflict-free distributed Reorder Buffer design
– ROB power savings as high as 49% are realized with only a small (1.7%) performance penalty
– ROB complexity is drastically reduced by
    Distributing the ROB into multiple banks
    Reducing the port requirements to no more than 2 ports for each ROB component

76 Thank You


78 Related Work
– Replicated (Kessler, IEEE Micro) and distributed (Canal et al., HPCA’00 and Farkas et al., MICRO’97) RFs in a clustered organization
– Multiple Register Banks (Cruz et al., ISCA’00 and Balasubramonian et al., MICRO’01)
– Multiple Register Banks with additional pipeline stage to avoid complex arbitration logic (Tseng et al., ISCA’03)
– Multiple Register Banks without write port conflicts (Wallace et al., PACT’96)


80 ROB Port Requirements for a W-way CPU
– Writeback: W write ports to write results
– Dispatch/Issue: 2W read ports to read the source operands
– Decode/Dispatch: 1 W-wide write port to set up entries
– Commit: 1 W-wide read port for instruction commitment

81 Reducing ROB Power and Complexity
[Figure: the ROB and its physical registers (Phys. regs.)]

82 Distribution
[Figure: the physical registers of the centralized ROB split into components labeled Int ADD 1 to 4, Int MUL, FP ADD 1 to 4, FP MUL, and LOAD]
Smaller structures: shorter bitlines, lower capacitive loading, etc. LESS POWER DISSIPATION!

83 Dedicate FUs to ROBCs
[Figure: the centralized ROB next to components labeled Int ADD 1 to 4, Int MUL, FP ADD 1 to 4, FP MUL, and LOAD, each dedicated to one FU]
Fewer ports: much smaller structures. LESS POWER DISSIPATION! + LESS COMPLEXITY!

84 Fully Distributed Reorder Buffer Scheme
[Figure: the centralized ROB next to the ROBCs labeled Int ADD 1 to 4, Int MUL, FP ADD 1 to 4, FP MUL, and LOAD]
Fewer ports: much smaller structures. LESS POWER DISSIPATION! + LESS COMPLEXITY!

85-86 Fully Distributed Reorder Buffer Scheme
– Distributed ROB Components (ROBCs) are assigned to each Function Unit
    No write port conflicts at writeback stage, and minimal read port conflicts at commitment: negligible performance penalty
    Each ROBC can be tailored to the needs of its FU: no over-commitment of resources, less complexity
– The FIFO structure that maintains pointers to the ROBCs remains centralized

87-88 Fully Distributed Reorder Buffer Scheme
[Figure: a centralized ROB with entries 1 to n, each holding an (FU_id, offset) pointer such as 11, 21, or m1, next to the distributed ROBCs #1 to #m]


90 Results for the Scheme with Retention Latches
[Chart: power savings % per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Power savings: 23%

