Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University.

1 Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University

2 Outline
- Motivation
- Case study: Cyclone
- Towards a high-performance wakeup-free scheduler
  - A general model
  - Employing a pre-check scheme
  - A segmented issue queue
- Conclusions and future work

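3 Superscalar Issue Queue: Wakeup Logic
[Figure: issue queue entries inst0 ... instN-1, each holding rdyL, opd tagL, opd tagR, rdyR; the result tags tag1 ... tagIW are compared (=) with both operand tags, and the match lines are ORed to set the ready bits.]
$\mathrm{Delay} = T_{tagdrive} + T_{tagmatch} + T_{matchOR}$
$T_{tagdrive} = c_0 + (c_1 + c_2 \times IW) \times N + (c_3 + c_4 \times IW + c_5 \times IW^2) \times N^2$
$T_{tagmatch}, T_{matchOR} = c_0 + c_1 \times IW + c_2 \times IW^2$
(N: number of issue queue entries; IW: issue width)
S. Palacharla et al., ISCA24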
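The delay model above can be evaluated numerically to see how each term grows with the issue queue size N and the issue width IW. The sketch below is illustrative only: the function names are hypothetical, and any coefficient values passed in are placeholders, not the technology-dependent constants from Palacharla et al.

```python
def tag_drive_delay(n, iw, c):
    """T_tagdrive = c0 + (c1 + c2*IW)*N + (c3 + c4*IW + c5*IW^2)*N^2."""
    c0, c1, c2, c3, c4, c5 = c
    return c0 + (c1 + c2 * iw) * n + (c3 + c4 * iw + c5 * iw ** 2) * n ** 2


def quadratic_in_iw(iw, c):
    """Shared form of T_tagmatch and T_matchOR: c0 + c1*IW + c2*IW^2."""
    c0, c1, c2 = c
    return c0 + c1 * iw + c2 * iw ** 2


def wakeup_delay(n, iw, drive_c, match_c, or_c):
    """Total wakeup delay = T_tagdrive + T_tagmatch + T_matchOR.

    drive_c, match_c and or_c are placeholder coefficient tuples
    (assumptions for illustration, not measured values).
    """
    return (tag_drive_delay(n, iw, drive_c)
            + quadratic_in_iw(iw, match_c)
            + quadratic_in_iw(iw, or_c))
```

With any positive coefficients, doubling N grows the tag-drive term roughly fourfold once the quadratic part dominates, which is why the queue size is the main lever on wakeup delay in this model.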

4 Superscalar Issue Queue: Selection Logic
$T_{selection} = c_0 + c_1 \times \log_4 N$
[Figure: the issue queue feeds a tree of arbiter cells, each with inputs req0-req3, outputs grant0-grant3, and req/enb signals toward its parent; subtrees connect from/to other subtrees and meet at a root cell.]
S. Palacharla et al., ISCA24
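The log_4 N term reflects the depth of a tree built from 4-input arbiter cells. A minimal sketch of such a tree follows; the function names are hypothetical, and the fixed lowest-index-first priority inside each cell is an assumption made for the example, since the slide does not state the priority policy.

```python
def arbiter_levels(n, fan_in=4):
    """Number of arbiter-cell levels needed to cover n requesters."""
    levels, span = 0, 1
    while span < n:
        span *= fan_in
        levels += 1
    return levels


def select_one(requests, fan_in=4):
    """Grant one requesting entry through a tree of fan_in-input cells.

    requests is a list of booleans, one per issue queue entry.
    Returns the index of the granted entry, or None if nobody requests.
    Each cell forwards its lowest-index requesting child (assumed policy).
    """
    level = [i if r else None for i, r in enumerate(requests)]
    while len(level) > 1:
        # One tree level: every group of fan_in children feeds one cell.
        level = [
            next((x for x in level[i:i + fan_in] if x is not None), None)
            for i in range(0, len(level), fan_in)
        ]
    return level[0] if level else None
```

For example, `arbiter_levels(64)` is 3, so a 64-entry queue needs three levels of cells on the request path, in line with the logarithmic delay formula.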

5 Challenges in Dynamic Instruction Scheduling
Broadcast-based dynamic scheduler
- Higher complexity
- Power hungry
- A major limiter to clock frequency, given increasing issue queue size, issue width, and wire delay, and fewer logic levels per pipeline stage
Complexity-effective issue
- Speculative wakeup [Stark et al.]
- Dependency-chain-based ordering [Canal/Gonzalez ICS 00/01; Michaud/Seznec HPCA01]
- Segmented issue queue [Raasch et al., ISCA 2002]
- Wakeup-free dynamic scheduler [Ernst et al., ISCA 2003]
  Lower complexity, lower power consumption, better scalability
  - Some performance loss has to be traded for these benefits

6 Our Goals
- Explore the predictability of instruction issue latency
- Identify the performance impediments in wakeup-free architectures
- Design high-performance wakeup-free schedulers

7 Cyclone: Conflict in the Main Queue
[Figure: two panels, "FP benchmarks" and "Int benchmarks"; legend entry: "Order Enforced".]
Enforce ordered placement to avoid conflicts between instructions with different latencies.

8 Possible Structural Problems
- Instruction promotion/forwarding incurs conflicts along the path
- Very limited instruction pool for selection
  - Only entries in column 0 of the main queue can be issued
  - Ready instructions not in column 0 are delayed by conflicts
- A limited number of issue ports is less tolerant of instructions mispredicted as ready; such instructions
  - Waste issue ports
  - Prevent ready instructions from issuing
  - Compete with newly decoded instructions because of replay

9 A General Model: WF-Replay
[Figure: From decoder -> Rename / Pre-schedule (with Timing Table) -> Wakeup-Free Issue Queue (a lat field per entry) -> Selection Logic -> to FUs; the register file ready bits decide "replay?"; an input arrives from FUs.]
How to relax the structural constraints?
- Collapsing issue queue without promotion
- Conventional random selection logic
- Much wider issue width
- An instruction is removed if no replay is needed
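One cycle of a WF-Replay style queue can be sketched as follows. This is a simplified reading of the slide, not the authors' design: entries request selection when their predicted latency reaches 0, selected entries check the register ready bits, and an entry that is not actually ready stays in the queue and requests again. The names `Entry` and `wf_replay_step`, the oldest-first pick among requesters, and the "retry next cycle" replay behavior are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    lat: int      # predicted cycles until the operands are ready
    srcs: tuple   # names of the source registers


def wf_replay_step(queue, ready_regs, issue_width):
    """Advance the queue by one cycle; return (issued, replayed) names."""
    # Entries whose predicted latency has expired request selection.
    requesters = [e for e in queue if e.lat == 0]
    selected = requesters[:issue_width]  # assumed oldest-first pick
    issued, replayed, done = [], [], set()
    for e in selected:
        if all(r in ready_regs for r in e.srcs):
            issued.append(e.name)        # truly ready: leaves the queue
            done.add(id(e))
        else:
            replayed.append(e.name)      # mispredicted: port was wasted
    queue[:] = [e for e in queue if id(e) not in done]
    for e in queue:
        if e.lat > 0:
            e.lat -= 1                   # count down remaining predictions
    return issued, replayed
```

With two issue ports and three entries predicted ready, an entry whose source is not yet available occupies a port and is replayed, while a truly ready entry behind it waits a cycle.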

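10 Instruction Pre-scheduling
[Figure: instructions I0-I3 pass through two stages, Rename/PSCHED 0 and PSCHED 1; the Register Mapping Table and Timing Table feed max and + units, with a dependence check (dep check) driving MUX control and a "reschedule?" input, producing predicted latencies lat0-lat3.]
Adapted from Cyclone, D. Ernst et al., ISCA'03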
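One plausible reading of the max and + units in this figure, following the Cyclone-style scheme it is adapted from, is that an instruction's predicted issue latency is the maximum of its sources' expected ready times, and its destination becomes ready that many cycles plus the execution latency later. The sketch below assumes exactly that; the function name, the tuple layout, and the latency values are hypothetical, and the per-cycle countdown of the timing table is left out.

```python
def preschedule(group, timing_table, exec_latency):
    """Predict issue latencies for a group of renamed instructions.

    group: list of (dest, srcs, op) tuples in program order.
    timing_table: dict mapping a register to expected cycles until ready.
    exec_latency: dict mapping an op to its assumed execution latency.
    Returns (latencies, updated_table) without mutating timing_table.
    """
    table = dict(timing_table)
    latencies = []
    for dest, srcs, op in group:
        # max unit: wait for the slowest source (0 if already ready).
        lat = max((table.get(s, 0) for s in srcs), default=0)
        latencies.append(lat)
        if dest is not None:
            # + unit: the result is ready after issue plus execution.
            table[dest] = lat + exec_latency[op]
    return latencies, table
```

Processing the group in program order stands in for the dependence check and MUX control in hardware: a later instruction in the same group sees the updated entry of an earlier producer.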

11 Latency Triggered Selection
[Figure: the lat field of each Wakeup-Free Issue Queue entry drives a request line into the tree of arbiter cells (req0-req3, grant0-grant3, req/enb) that ends at the root cell.]

12 WF-Replay IPC (F4-I8 vs F4-I4)
[Figure: two panels, "Issue Width: 8" and "Issue Width: 4".]
WF-Replay loses 9.7% performance (IPC) relative to Base when the issue width is reduced to 4 instructions per cycle.

13 Competition at Issue Ports?
[Figure: two panels, "Issue Width: 8" and "Issue Width: 4".]

14 Precheck to Avoid Competition
- Competition at the issue ports may delay instructions that are ready as predicted
- Delayed instructions may in turn compete with the instructions that depend on them
- This causes more instructions to be falsely ready or delayed
- Wider issue ports can avoid unnecessary competition, at the cost of higher complexity
- Solution: prevent falsely ready instructions from being selected by pre-checking the register ready bits

15 WF-Precheck Scheduler
[Figure: From decoder -> Rename / Pre-schedule (with Timing Table) -> Wakeup-Free Issue Queue (lat and ry fields per entry) -> Selection Logic -> Issuing -> to FUs; a Register Ready Bit Register with an input "from Mem." supplies the ready bits.]
- Precheck the register ready bits when the predicted latency reaches 0
- Selection requests are filtered by the 'ry' bit
- Trade replay for pre-check
- Only truly ready instructions are issued
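The filtering step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: entries are plain dicts built by a hypothetical `entry` helper, the pre-check is repeated every cycle until it succeeds, and the pick among requesters is oldest-first.

```python
def entry(name, lat, srcs):
    """Build a queue entry; 'ry' is the pre-checked ready bit."""
    return {"name": name, "lat": lat, "srcs": srcs, "ry": False}


def wf_precheck_step(queue, ready_regs, issue_width):
    """Advance one cycle; return the names of the issued instructions."""
    for e in queue:
        if e["lat"] == 0:
            # Pre-check: consult the register ready bits before requesting.
            e["ry"] = all(r in ready_regs for r in e["srcs"])
    # Only entries with ry set are allowed to request selection.
    requesters = [e for e in queue if e["ry"]]
    granted = requesters[:issue_width]  # assumed oldest-first pick
    queue[:] = [e for e in queue if not any(e is g for g in granted)]
    for e in queue:
        if e["lat"] > 0:
            e["lat"] -= 1
    return [e["name"] for e in granted]
```

In a queue where `b` is predicted ready but its source `r2` is not, the `ry` filter keeps `b` from occupying one of two ports, so the two truly ready entries issue in the same cycle and no replay is needed.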

16 Complexity of Pre-checking
On average, 40.2% of instructions have both source operands ready and 45.4% have one source operand ready at the pre-schedule stage. Pre-check requests number fewer than 2 per cycle.

17 Issue Port Competition (F4-I4)

18 WF-Precheck IPC (F4-I4)

19 Impact of Load Related Predictions

20 How about Selection Logic?
$T_{selection} = c_0 + c_1 \times \log_4 N$
[Figure: the issue queue feeds a tree of arbiter cells, each with inputs req0-req3, outputs grant0-grant3, and req/enb signals toward its parent; subtrees connect from/to other subtrees and meet at a root cell.]
S. Palacharla et al., ISCA24

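21 WF-Segment Issue Queue
[Figure: from decoder -> Rename / Pre-scheduling (with Time Table) -> Dispatch Routing into issue queue segments by predicted latency (0, 1-2, 3-4, >4), with a switchback path; Register Ready Bits (inputs from FUs and Mem.) set the ry bits; Selection Logic drives 4 issue ports to FUs.]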
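The dispatch routing in the figure sorts instructions into segments by predicted latency. A minimal sketch of that bucketing is shown below; the names are hypothetical, and how instructions later move between segments (including the role of the switchback path) is not described on the slide and is not modeled here.

```python
# Upper latency bounds of the first three segments (0, 1-2, 3-4);
# anything larger goes to the last segment (>4), as labeled in the figure.
SEGMENT_BOUNDS = (0, 2, 4)


def segment_for(lat):
    """Return the segment index for a predicted latency."""
    for idx, upper in enumerate(SEGMENT_BOUNDS):
        if lat <= upper:
            return idx
    return len(SEGMENT_BOUNDS)


def dispatch(instrs):
    """Route (name, predicted_latency) pairs into their segments."""
    segments = [[] for _ in range(len(SEGMENT_BOUNDS) + 1)]
    for name, lat in instrs:
        segments[segment_for(lat)].append(name)
    return segments
```

Under this split, only the segment holding latency-0 entries would need to feed the selection tree, which is one way a segmented queue can shrink N in the selection delay term.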

22 WF-Segment Issue Queue
On average, WF-Segment gives up 3% IPC relative to WF-Precheck and 5% relative to Base in exchange for optimized selection logic.

23 Conclusions
- Explored and identified the performance impediments in wakeup-free scheduling
- High-performance wakeup-free dynamic schedulers
  - WF-Replay: eliminates structural constraints
  - WF-Precheck: avoids unnecessary competition at issue ports
  - WF-Segment: optimizes selection logic for high clock speed

24 Future Work
- Routing complexity analysis for the WF-Segment scheduler
- Power analysis for wakeup-free schedulers
- A more sophisticated pre-scheduler

26 Wire Delay Challenges
- Increasing pipeline depth for high performance
- Clock period (in FO4) decreases dramatically
- Cross-chip wire delay will be up to 10 cycles as technology shrinks
M. S. Hrishikesh et al., ISCA29; Stephen W. Keckler et al., ISSCC'03

27 Precheck as a Single Stage

28 Load/Store Dependence Predictor

