
Slide 1: Macro-op Scheduling: Relaxing Scheduling Loop Constraints
Ilhyun Kim, Mikko H. Lipasti
PHARM Team, University of Wisconsin-Madison
MICRO-36, December 4, 2003

Slide 2: It's all about granularity
- Instruction-centric hardware design
  - HW structures are built to match an instruction's specifications
  - Control decisions occur at every instruction boundary
  - Instruction granularity may impose constraints on the hardware design space
- Relaxing the constraints at different processing granularities
  [Figure: processing-granularity spectrum from finer (operand) to coarser (macro-op): the half-price architecture (ISCA '03) works at operand granularity, conventional designs at instruction granularity, and a coarser-granular architecture at macro-op granularity]

Slide 3: Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions & future work

Slide 4: Scheduling loop constraints
- Loops in out-of-order execution
  [Figure: pipeline Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit, with three loops marked: the scheduling loop (wakeup / select), the execution loop (bypass), and the load latency resolution loop]
- Scheduling atomicity: wakeup and select within a single cycle (see the sketch below)
  - Essential for back-to-back execution of dependent instructions
  - Hard to pipeline in conventional designs
- Poor scalability
  - Extractable ILP is a function of window size
  - Complexity increases exponentially as the window size grows
  - Increasing pressure due to deeper pipelining and a slower memory system
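A minimal sketch, under stated assumptions, of why wakeup and select form an atomic loop: for a single-cycle producer's dependent to issue in the very next cycle, the dependent's wakeup and its selection must both fit into one scheduler cycle. The Entry class, wakeup_select_cycle function, and issue width are illustrative, not the paper's hardware.

```python
# Illustrative model of an atomic wakeup/select scheduling loop.
# Assumptions: single-cycle ALU ops, oldest-first select, no structural hazards.
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    srcs: set          # tags this entry still waits on
    issued: bool = False

def wakeup_select_cycle(queue, broadcast_tags, issue_width=4):
    """One scheduler cycle: wakeup (tag match) then select, atomically."""
    # Wakeup: clear source operands matched by this cycle's broadcast tags.
    for e in queue:
        e.srcs -= broadcast_tags
    # Select: pick ready, not-yet-issued entries (oldest first here).
    ready = [e for e in queue if not e.issued and not e.srcs]
    selected = ready[:issue_width]
    for e in selected:
        e.issued = True
    # Tags of selected single-cycle ops feed next cycle's wakeup,
    # so a dependent instruction can issue back-to-back.
    return selected, {e.name for e in selected}

# Example: B depends on A; with an atomic loop B issues the cycle after A.
queue = [Entry("A", set()), Entry("B", {"A"})]
tags = set()
for cycle in range(3):
    issued, tags = wakeup_select_cycle(queue, tags)
    print(cycle, [e.name for e in issued])   # 0 ['A'], 1 ['B'], 2 []
```

Splitting wakeup and select into separate cycles (as the pipelined scheduler evaluated later does) opens a one-cycle bubble between dependent single-cycle instructions.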

Slide 5: Related work
- Scheduling atomicity: speculation and pipelining
  - Grandparent scheduling [Stark], select-free scheduling [Brown]
- Poor scalability: low-complexity scheduling logic
  - FIFO-style windows [Palacharla, H. Kim]
  - Dataflow-based windows [Canal, Michaud, Raasch, ...]
  - Judicious window scaling: segmented windows [Hrishikesh], WIB [Lebeck], ...
- Issue queue entry sharing
  - AMD K7 (MOPs), Intel Pentium M (uop fusion)
- All are still based on instruction-centric scheduler designs
  - A scheduling decision is made at every instruction boundary
  - Atomicity and scalability are overcome only in isolation

Slide 6: Source of the atomicity constraint
- The minimal execution latency of an instruction
  - Many ALU operations have single-cycle latency
  - Scheduling must keep up with execution: 1-cycle instructions need 1-cycle scheduling
  - Multi-cycle operations do not need atomic scheduling
- Relax the constraint by increasing the size of the scheduling unit
  - Combine multiple instructions into a multi-cycle-latency unit
  - Scheduling decisions then occur at multiple-instruction boundaries
  - This attacks both the atomicity and the scalability constraints

Slide 7: Macro-op scheduling overview
[Pipeline diagram: Fetch / Decode / Rename, Queue (MOP formation, issue queue insert), Scheduling (pipelined wakeup and select), RF / EXE / MEM / WB / Commit. MOP detection sits beside the I-cache and feeds MOP pointers back to the front end; dependence and wakeup-order information flow into the scheduler; the payload RAM sequences instructions to the execution units and cache ports. The queue and scheduling stages operate at the coarser MOP granularity, while fetch/decode/rename and everything from register read through commit remain instruction-grained.]

Slide 8: MOP scheduling (2x) example
- Pipelined instruction scheduling of multi-cycle MOPs
  - Still issues the original instructions consecutively
- Larger instruction window
  - Multiple original instructions logically share a single issue queue entry
[Figure: a 16-instruction dependence graph scheduled two ways. Atomic wakeup/select issues it in 9 cycles using 16 queue entries; 2-cycle pipelined scheduling with macro-ops (pairs sharing entries) takes 10 cycles using 9 queue entries.]

Slide 9: Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions & future work

Slide 10: Issues in grouping instructions
- Candidate instructions (a filtering sketch follows this list)
  - Single-cycle instructions: integer ALU, control, and store address-generation operations
  - Multi-cycle instructions (e.g., loads) do not need single-cycle scheduling
- The number of source operands
  - Grouping two dependent instructions yields up to 3 source operands
  - Allow up to 2 source operands (conventional wakeup) or no restriction (wired-OR wakeup)
- MOP size
  - Bigger MOPs may be more beneficial; 2 instructions per MOP in this study
- MOP formation scope
  - Instructions are processed in order before being inserted into the issue queue
  - Candidate instructions must be captured within a reasonable scope
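A minimal sketch, under stated assumptions, of the grouping-candidate filter described above: only single-cycle operations qualify, the fused pair may not exceed the allowed source-operand count, and head and tail must fall within the 8-instruction scope. The Inst fields and the is_mop_candidate_pair helper are illustrative, not the authors' implementation.

```python
# Hypothetical candidate filter for pairing a head and tail instruction into a MOP.
from dataclasses import dataclass

SINGLE_CYCLE_CLASSES = {"int_alu", "control", "store_agen"}  # per the slide
SCOPE = 8            # 8-instruction MOP formation scope
MAX_SRC_OPERANDS = 2 # 2 (conventional wakeup) or 3 (wired-OR wakeup)

@dataclass
class Inst:
    seq: int              # program-order position
    op_class: str         # e.g. "int_alu", "load", ...
    srcs: tuple           # source register ids
    dest: int | None      # destination register id, if any

def is_mop_candidate_pair(head: Inst, tail: Inst) -> bool:
    """Can head/tail be grouped into a 2-instruction MOP?"""
    if head.op_class not in SINGLE_CYCLE_CLASSES:
        return False
    if tail.op_class not in SINGLE_CYCLE_CLASSES:
        return False
    if not (0 < tail.seq - head.seq < SCOPE):      # within the formation scope
        return False
    if head.dest is None or head.dest not in tail.srcs:
        return False                               # tail must depend on head
    # Sources of the fused pair, excluding the internal head->tail edge.
    outside_srcs = set(head.srcs) | (set(tail.srcs) - {head.dest})
    return len(outside_srcs) <= MAX_SRC_OPERANDS

# Example: an add feeds a dependent branch two instructions later.
add = Inst(seq=10, op_class="int_alu", srcs=(2, 3), dest=1)
bez = Inst(seq=12, op_class="control", srcs=(1,), dest=None)
print(is_mop_candidate_pair(add, bez))  # True
```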

Slide 11: Dependence edge distance (instruction count)
- 73% of value-generating candidates (potential MOP heads) have a dependent candidate instruction (potential MOP tail)
- An 8-instruction scope captures many dependent pairs
- There is variability in edge distances across benchmarks (e.g., gap vs. vortex); remember this
- Our configuration: group 2 single-cycle instructions within an 8-instruction scope
[Chart: MOP potential within the 8-instruction scope, as a percentage of total instructions per benchmark: 49.2, 50.9, 27.8, 48.7, 37.4, 56.3, 40.2, 47.5, 42.7, 47.7, 37.6, 44.7]

Slide 12: MOP detection
- Finds groupable instruction pairs
  - Dependence matrix-based detection (detailed in the paper; sketched below)
  - Performance is insensitive to detection latency, since pointers are reused repeatedly; a pessimistic 100-cycle latency loses only 0.22% of IPC
- Generates MOP pointers
  - 4 bits per instruction, stored in the L1 instruction cache ($IL1)
  - A MOP pointer represents a groupable instruction pair
[Pipeline diagram as in Slide 7; MOP detection produces pointers that are stored alongside the I-cache.]
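A hedged sketch of dependence matrix-based pair detection, assuming a matrix of direct register dependences within the 8-instruction scope is scanned for head/tail candidates. The pairing policy (one tail per head, first match wins) and the data layout are my own simplifications, not the paper's exact algorithm.

```python
# Illustrative dependence-matrix MOP detection over an 8-instruction scope.
def detect_mop_pairs(insts, scope=8):
    """insts: list of dicts with 'srcs' (tuple of reg ids), 'dest' (reg id or None),
    'single_cycle' (bool). Returns a list of (head_index, tail_index) pairs."""
    n = len(insts)
    # Dependence matrix: dep[i][j] is True if j directly depends on i (i < j).
    dep = [[False] * n for _ in range(n)]
    for j, tail in enumerate(insts):
        for i in range(max(0, j - scope + 1), j):
            d = insts[i]["dest"]
            if d is not None and d in tail["srcs"]:
                dep[i][j] = True

    pairs, taken = [], set()
    for i, head in enumerate(insts):
        if not head["single_cycle"] or i in taken:
            continue
        for j in range(i + 1, min(n, i + scope)):
            if dep[i][j] and insts[j]["single_cycle"] and j not in taken:
                pairs.append((i, j))       # record a MOP pointer head -> tail
                taken.update((i, j))
                break
    return pairs

# Example: i0 -> i2 and i1 -> i3 become two MOPs; the load (i4) is not a candidate.
prog = [
    {"srcs": (2, 3), "dest": 1, "single_cycle": True},    # add r1 <- r2, r3
    {"srcs": (4, 5), "dest": 6, "single_cycle": True},    # and r6 <- r4, r5
    {"srcs": (1,),   "dest": 7, "single_cycle": True},    # sub r7 <- r1, 1
    {"srcs": (6,),   "dest": None, "single_cycle": True}, # bez r6
    {"srcs": (7,),   "dest": 8, "single_cycle": False},   # lw  r8 <- 0(r7)
]
print(detect_mop_pairs(prog))  # [(0, 2), (1, 3)]
```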

Slide 13: MOP detection: avoiding cycle conditions
- Cycle conditions (groupings whose MOP-level dependences form a loop) lead to deadlocks
  [Figure: example dependence graphs in which grouping two instructions would create a cycle between MOPs]
- Conservative cycle-detection heuristic (sketched below)
  - Precise detection is hard: it requires tracking multiple levels of dependences
  - Assume a cycle whenever both an outgoing and an incoming edge are detected
  - Captures over 90% of MOP opportunities compared to precise detection
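A sketch of one plausible reading of the conservative heuristic: grouping head and tail creates a cycle if some in-between instruction both consumes the head's result and feeds the tail, so the heuristic rejects a pair whenever the head has any outgoing edge and the tail any incoming edge among the in-between instructions. The field names and the exact test are assumptions.

```python
# Hedged sketch of the conservative cycle-avoidance test for a candidate
# pair (head, tail). Precise detection would chase multi-level dependence
# chains; the heuristic simply looks for any outgoing edge from the head and
# any incoming edge into the tail among the instructions in between.

def may_form_cycle(insts, head_idx, tail_idx):
    """Conservative test: True means 'assume a cycle, do not group'."""
    head, tail = insts[head_idx], insts[tail_idx]
    between = insts[head_idx + 1:tail_idx]
    head_has_outgoing = any(
        head["dest"] is not None and head["dest"] in other["srcs"]
        for other in between
    )
    tail_has_incoming = any(
        other["dest"] is not None and other["dest"] in tail["srcs"]
        for other in between
    )
    return head_has_outgoing and tail_has_incoming

# Example: r1 feeds both the in-between sub and the final add, and the sub
# feeds the final add, so grouping instructions 0 and 2 would deadlock.
prog = [
    {"srcs": (2, 3), "dest": 1},   # add r1 <- r2, r3   (candidate head)
    {"srcs": (1,),   "dest": 4},   # sub r4 <- r1, 1    (in between)
    {"srcs": (1, 4), "dest": 5},   # add r5 <- r1, r4   (candidate tail)
]
print(may_form_cycle(prog, 0, 2))  # True: reject this pairing
```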

Slide 14: MOP formation
- Locates MOP pairs using MOP pointers
  - MOP pointers are fetched along with the instructions
- Converts register dependences into MOP dependences (a sketch follows below)
  - Architected register IDs map to MOP IDs
  - Identical to register renaming, except that a single ID is assigned to two groupable instructions
  - This reflects the fact that the two instructions form one scheduling unit
  - The two instructions are later inserted into one issue queue entry
[Pipeline diagram as in Slide 7; MOP formation sits alongside rename, before issue queue insertion.]
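A minimal sketch of the MOP dependence translation, assuming a rename-like table in which each logical destination maps to a MOP ID and both instructions of a grouped pair receive the same ID. Function and variable names are illustrative.

```python
# Hypothetical MOP-ID translation, structured like a register rename table.
from itertools import count

def translate(insts, pairs):
    """insts: list of {'srcs': tuple, 'dest': reg or None};
    pairs: set of (head_idx, tail_idx) from MOP detection.
    Returns per-instruction (source MOP IDs, assigned MOP ID)."""
    head_of = {t: h for h, t in pairs}
    table = {}                      # logical reg -> MOP ID (like a rename table)
    next_mop_id = count(100)        # arbitrary starting ID for the example
    out = []
    for i, inst in enumerate(insts):
        src_ids = tuple(table.get(r) for r in inst["srcs"])
        if i in head_of:            # tail: reuse the head's MOP ID
            mop_id = out[head_of[i]][1]
        else:                       # head of a pair, or an ungrouped instruction
            mop_id = next(next_mop_id)
        if inst["dest"] is not None:
            table[inst["dest"]] = mop_id
        out.append((src_ids, mop_id))
    return out

# Example: I1 and I3 are grouped, so both carry MOP ID 100; I2 and I4 get their own.
prog = [
    {"srcs": (3, 4), "dest": 5},   # I1
    {"srcs": (3, 4), "dest": 6},   # I2
    {"srcs": (5, 6), "dest": 7},   # I3 (grouped with I1)
    {"srcs": (5, 6), "dest": 8},   # I4
]
for srcs, mop_id in translate(prog, {(0, 2)}):
    print(srcs, mop_id)
```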

Slide 15: Scheduling MOPs
- Instructions in a MOP are scheduled as a single unit
  - A MOP is a non-pipelined, 2-cycle operation from the scheduler's perspective
  - It is issued when all of its source operands are ready, and it incurs only one tag broadcast (see the sketch below)
- Wakeup / select timings
  [Figure: timing comparison for a small dependence graph. Atomic scheduling: select 1; wakeup 2, 3 and select 2, 3; wakeup 4 and select 4. Plain 2-cycle scheduling stretches this across cycles n to n+4 because wakeup and select occupy separate cycles. 2-cycle MOP scheduling groups 1 with 3, so selecting MOP(1, 3) followed by a single wakeup and select of 2 and 4 recovers the atomic schedule's pace.]
[Pipeline diagram as in Slide 7.]
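A small model, under stated assumptions, of why a 2-cycle MOP hides the pipelined-scheduler penalty: with wakeup and select in separate cycles, a dependent becomes selectable two cycles after its producer, which wastes a cycle behind single-cycle instructions but exactly matches a 2-cycle MOP. The schedule function, its timing rule, and the unbounded issue width are illustrative.

```python
# Pipelined (2-cycle) scheduling model: select at cycle t, tag broadcast at
# t+1, dependents selectable at t+2. Issue width is not modeled.

def schedule(units):
    """units: list of {'name', 'deps': set of names, 'latency': 1 or 2 (MOP)}.
    Returns {name: issue cycle}."""
    issue_cycle = {}
    cycle = 0
    while len(issue_cycle) < len(units):
        for u in units:
            if u["name"] in issue_cycle:
                continue
            # Ready if every producer was selected at least two cycles ago.
            if all(d in issue_cycle and cycle >= issue_cycle[d] + 2 for d in u["deps"]):
                issue_cycle[u["name"]] = cycle
        cycle += 1
    return issue_cycle

# Chain of 4 dependent single-cycle instructions: one bubble per link.
plain = [{"name": f"i{k}", "deps": {f"i{k-1}"} if k else set(), "latency": 1}
         for k in range(4)]
# Same chain with (i0,i1) and (i2,i3) each fused into a 2-cycle MOP.
mops = [{"name": "m0", "deps": set(), "latency": 2},
        {"name": "m1", "deps": {"m0"}, "latency": 2}]

print(schedule(plain))  # {'i0': 0, 'i1': 2, 'i2': 4, 'i3': 6}
print(schedule(mops))   # {'m0': 0, 'm1': 2}: the four originals execute consecutively
```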

Slide 16: Sequencing instructions
- An issued MOP is converted back into its two original instructions (sketched below)
  - The dual-entry payload RAM sends out the two original instructions
  - The original instructions execute sequentially within the MOP's 2 cycles
- Register values are accessed using physical register IDs
- The ROB commits the original instructions separately and in order
  - MOPs do not affect precise exceptions or branch misprediction recovery
[Pipeline diagram as in Slide 7; the payload RAM sequences original instructions into execution.]
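A hedged sketch of sequencing at issue time: one scheduler grant reads a dual-entry payload RAM and dispatches the two original instructions on consecutive cycles, while the ROB continues to see them as separate instructions. The payload layout and function names are assumptions.

```python
# Illustrative dual-entry payload RAM: each issue queue entry holds up to two
# original instructions that are sequenced over consecutive execution cycles.
payload = {
    7: [{"op": "add", "dest_preg": 12, "src_pregs": (3, 4)},
        {"op": "bez", "dest_preg": None, "src_pregs": (12,)}],
}

def sequence(entry_id, issue_cycle):
    """Expand one issued MOP into per-cycle execution slots."""
    slots = []
    for offset, inst in enumerate(payload[entry_id]):
        # Original instruction `offset` executes in cycle issue_cycle + offset;
        # register values are read with physical register IDs, and the ROB
        # still commits the two instructions separately and in order.
        slots.append((issue_cycle + offset, inst))
    return slots

for cycle, inst in sequence(7, issue_cycle=10):
    print(cycle, inst["op"], inst["src_pregs"])
# 10 add (3, 4)
# 11 bez (12,)
```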

Slide 17: Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions & future work

Slide 18: Machine parameters
- Simplescalar-Alpha-based simulator (parameters transcribed in the sketch below)
  - 4-wide out-of-order core, speculative scheduling with selective replay, 14 pipeline stages
  - Ideally pipelined scheduler, conceptually equivalent to atomic scheduling plus 1 extra stage
  - 128-entry ROB; unrestricted or 32-entry issue queue
  - 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
  - Combined branch prediction; fetch continues until the first taken branch
- MOP scheduling
  - 2-cycle (pipelined) scheduling plus the 2x MOP technique
  - 2 (conventional) or 3 (wired-OR) source operands
  - MOP detection scope: 2 cycles (4-wide x 2 cycles = up to 8 instructions)
- SPEC2000 integer benchmarks with reduced input sets
  - Reference inputs for crafty, eon, and gap (up to 3B instructions)
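For reference, the evaluated configuration transcribed into a plain dictionary; the key names are mine and are not Simplescalar option names.

```python
# Machine parameters from the slide, collected in one place for readability.
machine_config = {
    "pipeline_width": 4,
    "pipeline_stages": 14,
    "scheduling": "speculative, selective replay",
    "rob_entries": 128,
    "issue_queue_entries": ("unrestricted", 32),   # two configurations evaluated
    "alus": 4,
    "memory_ports": 2,
    "il1": "16K (2)",
    "dl1": "16K (2)",
    "l2": "256K (8)",
    "memory": "(100)",
    "branch_prediction": "combined, fetch until first taken branch",
    "mop": {
        "scheduling": "2-cycle pipelined",
        "group_size": 2,
        "source_operands": (2, 3),                 # conventional vs. wired-OR
        "detection_scope_insts": 8,                # 4-wide x 2 cycles
    },
}
print(machine_config["mop"])
```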

Slide 19: Number of grouped instructions
- 28% to 46% of total instructions are grouped
  - A 14% to 23% reduction in the instruction count seen by the scheduler
- Dependent-MOP cases enable consecutive issue of dependent instructions
[Chart: fraction of grouped instructions per benchmark for the 2-source and 3-source configurations.]

Slide 20: MOP scheduling performance (relaxed atomicity constraint only)
- 2-cycle scheduling alone loses up to ~19% of IPC
- MOP scheduling restores the performance
  - It re-enables consecutive issue of dependent instructions
  - 97.2% of atomic scheduling performance on average
- Configuration: unrestricted issue queue, 128-entry ROB

Slide 21: Insight into MOP scheduling
- The performance loss of 2-cycle scheduling correlates with dependence edge distance
  - With short dependence edges (e.g., gap), the instruction window fills with chains of dependent instructions, and a 2-cycle scheduler cannot find enough ready instructions to issue
- MOP scheduling captures exactly these short-distance dependent pairs, which are the important ones
  - Low MOP coverage caused by long dependence edges does not matter, because a 2-cycle scheduler can already find many instructions to issue there (e.g., vortex)
- MOP scheduling therefore complements 2-cycle scheduling, and overall performance becomes less sensitive to code layout

Slide 22: MOP scheduling performance (relaxed atomicity + scalability constraints)
- Benefits from both the relaxed atomicity and the relaxed scalability constraints
- Pipelined, 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
- Configuration: 32-entry issue queue, 128-entry ROB

Slide 23: Conclusions & future work
- Changing the processing granularity can relax the constraints imposed by instruction-centric designs
  - Constraints in the instruction scheduling loop: scheduling atomicity and poor scalability
- Macro-op scheduling relaxes both constraints at a coarser granularity
  - Pipelined, 2-cycle macro-op scheduling can perform comparably to or even better than atomic scheduling
- Potential for narrow-bandwidth microarchitectures
  - Extending the MOP idea to the whole pipeline (dispatch, RF, bypass)
  - e.g., achieving 4-wide machine performance with 2-wide bandwidth

Slide 24: Questions?

Slide 25: Select-free scheduling (Brown et al.) vs. MOP scheduling
- MOP scheduling achieves 4.1% better IPC on average than select-free scheduling with a scoreboard (best case 8.3%)
- Select-free scheduling cannot outperform atomic scheduling
  - It is speculative and requires recovery operations
- MOP scheduling is non-speculative, which brings several advantages
- Configuration: 32-entry issue queue, 128-entry ROB, no extra pipeline stage for MOP formation

Slide 26: MOP detection: MOP pointer generation
- Finding dependent pairs
  - Dependence matrix-based detection (detailed in the MICRO paper)
  - Insensitive to detection latency, since pointers are reused repeatedly; a pessimistic 100-cycle latency loses 0.22% of IPC
  - Similar to instruction preprocessing in trace cache lines
- MOP pointers: 4 bits per instruction (an encoding sketch follows below)
  - Control bit (1 bit): captures up to one control discontinuity
  - Offset bits (3 bits): instruction count from head to tail
  [Example: for the sequence add r1 <- r2, r3; lw r4 <- 0(r3); and r5 <- r4, r2; bez r1, 0xff (taken); sub r6 <- r5, 1, the add carries pointer 0|011 (its tail, the bez, is 3 instructions away) and the and carries 1|010 (its tail, the sub, is 2 away, across one taken-branch discontinuity).]
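A small sketch of the 4-bit pointer format described above (1 control bit plus a 3-bit head-to-tail offset); the bit layout within the nibble and the helper names are assumptions.

```python
# Hypothetical packing of a MOP pointer into 4 bits: 1 control bit + 3 offset bits.

def encode_mop_pointer(offset, crosses_control):
    """Pack a MOP pointer into 4 bits; offset 0 means 'no tail'."""
    assert 0 <= offset <= 7, "3 offset bits limit the head-to-tail distance to 7"
    return (int(crosses_control) << 3) | offset

def decode_mop_pointer(nibble):
    crosses_control = bool(nibble & 0b1000)
    offset = nibble & 0b0111
    return offset, crosses_control

# The two pointers from the slide's example:
add_ptr = encode_mop_pointer(3, crosses_control=False)  # 0|011: tail is the bez, 3 away
and_ptr = encode_mop_pointer(2, crosses_control=True)   # 1|010: tail is the sub, across a taken branch
print(f"{add_ptr:04b}", decode_mop_pointer(add_ptr))     # 0011 (3, False)
print(f"{and_ptr:04b}", decode_mop_pointer(and_ptr))     # 1010 (2, True)
```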

Slide 27: MOP formation: MOP dependence translation
- A single ID is assigned to two MOPable instructions, reflecting the fact that they are grouped into one unit
- The process and the required structure are identical to register renaming
- Register values are still accessed based on the original register IDs
[Figure: side-by-side tables for instructions I1..I4. The register rename table maps each logical destination to a distinct physical register, whereas the MOP translation table maps the logical destinations of a grouped pair to one shared MOP ID.]

Slide 28: Inserting MOPs into the issue queue
- Instructions can be grouped across different dispatch groups (see the sketch below)
[Figure: a MOP head whose tail lies in the following dispatch group is held pending at queue insertion in cycle n; once the next group arrives in cycle n+1, the head and its tail are inserted together into a single issue queue entry by cycle n+2.]
[Pipeline diagram as in Slide 7.]
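A hedged sketch of queue insertion across dispatch groups: a MOP head whose tail has not yet arrived is held pending, then inserted together with the tail into one shared entry. The pending set, fallback policy, and data layout are illustrative assumptions.

```python
# Hypothetical cross-group insertion of MOP pairs into the issue queue.

def insert_groups(dispatch_groups, mop_pairs):
    """dispatch_groups: list of per-cycle lists of instruction ids.
    mop_pairs: dict head_id -> tail_id (from the MOP pointers).
    Returns the issue queue as a list of 1- or 2-instruction entries."""
    tail_to_head = {t: h for h, t in mop_pairs.items()}
    queue, pending = [], set()        # pending MOP heads waiting for their tails
    for group in dispatch_groups:
        for inst in group:
            head = tail_to_head.get(inst)
            if head is not None and head in pending:
                queue.append((head, inst))       # head and tail share one entry
                pending.discard(head)
            elif inst in mop_pairs:
                pending.add(inst)                # hold the head until its tail arrives
            else:
                queue.append((inst,))            # ordinary single-instruction entry
    queue.extend((h,) for h in sorted(pending))  # orphaned heads fall back to single entries
    return queue

groups = [[1, 2, 3, 4], [5, 6, 7, 8]]
pairs = {3: 7, 5: 6}   # 3's tail lies in the next dispatch group; 5 and 6 share a group
print(insert_groups(groups, pairs))
# [(1,), (2,), (4,), (5, 6), (3, 7), (8,)]
```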

Slide 29: Performance considerations
- Independent MOPs
  - Group independent instructions that share the same source dependences
  - No direct performance benefit, but they reduce issue queue contention
- Last-arriving operands in tail instructions
  [Figure: two groupings of the same three instructions with operand arrival times marked; when the tail's other operand arrives late, the grouping unnecessarily delays the head instruction.]
  - Such grouping unnecessarily delays head instructions
  - The MOP detection logic filters out harmful groupings and creates an alternative pair if one exists (a filter sketch follows below)
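A hedged sketch of the last-arriving-operand filter: since a MOP issues only when all of its sources are ready, a tail whose other operand tends to arrive after the head is ready would delay the head, so such pairs are rejected. The timing bookkeeping and threshold are assumptions; the cycle numbers echo the slide's figure but are illustrative.

```python
# Hypothetical filter against harmful groupings caused by late-arriving
# operands of the tail instruction.

def grouping_is_harmful(head_ready_cycle, tail_other_operand_cycles):
    """True if fusing would delay the head past the point where it could have issued."""
    if not tail_other_operand_cycles:
        return False                      # tail only depends on the head: safe
    return max(tail_other_operand_cycles) > head_ready_cycle

# The head could issue at cycle 15; a tail operand arriving at cycle 17 makes
# the pairing harmful, while one arriving at cycle 12 does not.
print(grouping_is_harmful(head_ready_cycle=15, tail_other_operand_cycles=[17]))  # True
print(grouping_is_harmful(head_ready_cycle=15, tail_other_operand_cycles=[12]))  # False
```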
