Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk
Brainiac vs. Speed Demon
- Brainiac: maximize instructions per cycle; more complex, slower clock
- Speed Demon: maximize clock frequency; faster clock, fewer instructions per cycle
Complexity-Effective
- Issue width vs. clock cycle: the goal is to balance both
- The best design is complexity-effective: it allows a complex issue scheme AND a fast clock cycle
What is Complexity? The delay of the critical path through a piece of logic.
Creating a Complexity-Effective Architecture
- Analyze the complexity of each pipeline stage
- Select the component with the most complexity
- Propose a less complex alternative that achieves similar performance
Agenda
- OoO Superscalar Pipeline Overview
  - Register Renaming
  - Wakeup & Select
  - Bypass
- Complexity-Effective Optimizations
  - Dependence-based Scheduler
  - Clustering
Pipeline Overview [diagram: Fetch → Decode/Rename → Issue Window (Wakeup, Select) → Register File → Bypass/Execute → Data Cache]
Fetch: read instructions from the I-cache; predict branches
Pipeline Overview
Decode: parse instructions; shuffle opcode parts to the appropriate ports for rename
Pipeline Overview
Rename: map architectural registers to physical registers; eliminate false dependences; dispatch renamed instructions to the scheduler
Pipeline Overview
Wakeup: instructions check whether they have become ready by comparing register tags broadcast from the Writeback stage
Select: choose from among the ready instructions
Pipeline Overview
Register File Read: read source operands
Pipeline Overview
Bypass and Execute: execute instructions in the functional units; bypass results from outputs back to inputs
Pipeline Overview
Data Cache Access: loads and stores access the data cache
Pipeline Overview
Writeback: write results to the register file; broadcast tags to wake up waiting instructions (the tag broadcast happens two cycles before the results are produced)
Alternate Pipeline [diagram: issue window replaced by reservation stations and a reorder buffer]
Used by Pentium Pro and PowerPC: the re-order buffer (ROB) holds values; Tomasulo-like renaming (operands point to ROB entries)
Sources of Complexity
Complexity scales with issue width and window size.
What about Fetch, Decode, Register File, and Cache?
- Register file and caches: previously studied
- Fetch, decode, and functional units: easy to scale(?)
Register Renaming
Eliminates WAR and WAW hazards:

  Architectural (logical) registers    Physical registers
  Ld  r1, [r2]                         Ld  p5, [p7]
  Add r3, r1, r2                       Add p3, p5, p7
  Sub r1, r2, r4                       Sub p6, p7, p4
  Ld  r2, [r1]                         Ld  p9, [p6]
Register Rename Logic
Logical source & destination registers feed the Register Alias Table (an SRAM array) and the dependence check logic; a MUX produces the physical source & destination registers. Scales with issue width.
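A minimal sketch of the renaming mechanism described above. The class and member names (`Renamer`, `rat`, `free_list`) are my own illustrations, not from the paper; real hardware does all of this in parallel lookup tables rather than sequentially.

```python
# Hypothetical software model of register renaming with a Register Alias
# Table (RAT). Names here are illustrative, not the paper's.

class Renamer:
    def __init__(self, num_logical, num_physical):
        # Initially, logical register i maps to physical register i.
        self.rat = list(range(num_logical))
        self.free_list = list(range(num_logical, num_physical))

    def rename(self, dst, srcs):
        # Sources are looked up BEFORE the destination is remapped, so an
        # instruction reading its own old destination sees the old tag.
        phys_srcs = [self.rat[s] for s in srcs]
        phys_dst = self.free_list.pop(0)  # allocate a fresh physical register
        self.rat[dst] = phys_dst          # breaks WAR/WAW false dependences
        return phys_dst, phys_srcs
```

Because every destination gets a fresh physical register, a later write to the same logical register can no longer clobber a value an earlier reader still needs.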
Register Alias Table
Maps a logical register name to a physical register name via decoder, wordlines, bitlines, and senseamps:
T_rename = T_decode + T_wordline + T_bitline + T_senseamp
Rename Delay: scales linearly with issue width (plot omitted)
Instruction Wakeup
- Broadcast newly available result tags
- Update which operands are ready for each instruction
- Mark an instruction ready when both its left and right operands are ready
Wakeup Logic [diagram: tag drive, tag match, match OR]
Delay = T_tagdrive + T_tagmatch + T_matchOR
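The tag-match-then-OR structure can be sketched as follows; `WindowEntry` and its fields are assumed names for one issue-window slot, and the hardware performs the comparison as a CAM match against every broadcast tag at once.

```python
# Sketch (assumed names) of wakeup for one issue-window entry: compare
# broadcast result tags against both source tags and OR into ready bits.

class WindowEntry:
    def __init__(self, left_tag, right_tag):
        self.left_tag, self.right_tag = left_tag, right_tag
        self.left_ready = self.right_ready = False

    def wakeup(self, broadcast_tags):
        # Tag match: one compare per result broadcast this cycle...
        if self.left_tag in broadcast_tags:
            self.left_ready = True                 # ...OR-ed into ready bit
        if self.right_tag in broadcast_tags:
            self.right_ready = True
        # Instruction is ready when both operands are ready.
        return self.left_ready and self.right_ready
```

The quadratic delay growth comes from this happening for every entry in the window on every broadcast wire, so both the number of comparators and the broadcast wire length grow with window size.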
Wakeup Delay: increases quadratically with window size (plot omitted)
Wakeup Delay Breakdown: wire delay dominates at smaller feature sizes! (plot omitted)
Select
- Consider all ready instructions
- Select one ready instruction for each functional unit
- Uses some selection policy, e.g. oldest-first; the choice of policy has little effect on performance
Selection Logic [diagram: request and grant signals through an arbiter tree]
Delay = c0 + c1 × log4(WINSIZE)
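The log4 term reflects a tree of four-input arbiters. A rough software model of that tree, assuming an oldest-first policy where "oldest" is the lowest window index (my simplification; function and policy names are not from the paper):

```python
# Sketch of tree-structured selection: requests are arbitrated four at a
# time per level, so the tree depth is ceil(log4(WINSIZE)).

def select_oldest(requests):
    """Return the index of the selected ready entry (lowest index wins,
    standing in for 'oldest'), or None if nothing is ready."""
    candidates = list(enumerate(requests))
    while len(candidates) > 1:
        nxt = []
        for g in range(0, len(candidates), 4):
            # Each 4-input arbiter grants its first requesting input.
            group = [c for c in candidates[g:g + 4] if c[1]]
            nxt.append(group[0] if group else (candidates[g][0], False))
        candidates = nxt
    idx, granted = candidates[0]
    return idx if granted else None
```

Each tree level adds one arbiter delay, which is where the c1 × log4(WINSIZE) term in the delay model comes from.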
Selection Delay: logic delay only; wire delay is ignored (plot omitted)
Data Bypass Network
- Forwards results from completing instructions to dependent instructions
- The number of paths depends on pipeline depth and issue width:
  # of paths = 2 × IW² × S, where S is the number of pipe stages after results are produced
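The path-count formula is just arithmetic, but a worked instance makes the quadratic growth concrete (the function name is mine):

```python
# Path count from the slide: each of IW results must reach 2 source-operand
# inputs on each of IW functional units, for each of S post-result stages.
def bypass_paths(issue_width, stages_after_result):
    return 2 * issue_width ** 2 * stages_after_result

# A 4-wide machine bypassing for 2 stages needs 64 paths;
# doubling the width to 8 quadruples that to 256.
```

This is why the paper singles out bypass as a scaling problem: width shows up squared.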
Data Bypass Logic
- FUs broadcast results (potentially multiple results in subsequent cycles)
- The register file reads current operand values
- A MUX selects the correct source
Bypass Delay
T_bypass = 0.5 × R_metal × C_metal × L², where L is the length of the result wires
- Delay is independent of feature size but depends on the specific layout
[table omitted: wire length (λ) and delay (ps) vs. issue width]
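A one-liner instance of the distributed-RC wire model above, with made-up unit resistance and capacitance values just to show the quadratic dependence on length:

```python
# Elmore-style distributed-RC wire delay: the 0.5 factor comes from the
# distributed (rather than lumped) RC model. R/C values here are arbitrary.
def bypass_delay(r_per_len, c_per_len, length):
    return 0.5 * r_per_len * c_per_len * length ** 2

# Doubling the wire length quadruples the delay, regardless of feature size,
# since R_metal and C_metal per unit length stay roughly constant.
```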
Putting it All Together
[table omitted: rename, wakeup+select, and bypass delays (ps) by issue width and window size, at 0.8μm and smaller feature sizes]
Key Sources of Complexity
- Wakeup+select logic: the limiting stage for 5 out of 6 designs; considers all instructions simultaneously
- Bypass paths: wire-dominated delay; lots of long wires
Dependence-based Microarchitecture [diagram: Fetch → Decode/Rename → Steer → FIFOs (Wakeup, Select) → Register File → Bypass → Data Cache]
- Replace the issue window with a few small FIFO queues
- Only schedule from the heads of the FIFOs
- Adds a new Steering stage
Instruction Steering Heuristic
Three possible cases for instruction I:
1. All operands ready and in the register file: I is steered to an empty FIFO
2. I requires an operand produced by I_source in FIFO F_a: if no instruction is behind I_source, I is steered to F_a; otherwise, I is steered to an empty FIFO
3. Two operands, produced by I_left and I_right: apply rule 2 to I_left; if that steers I to an empty FIFO, apply rule 2 to I_right
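The rules above can be sketched in a few lines. This is an approximation of the heuristic, not the paper's exact algorithm: the `Instr` class, `steer` function, and `producer_fifo` map are my assumed names, and rule 3's two-step check is collapsed into a loop over producers.

```python
# Hedged sketch of dependence-based steering. Each FIFO issues only from
# its head, so an instruction may be appended behind a producer only if
# that producer is currently the FIFO's tail.

class Instr:
    def __init__(self, prods=()):
        # In-flight producers of this instruction's unready operands.
        self.unready_producers = list(prods)

def steer(instr, fifos, producer_fifo):
    # Rule 2 (applied per producer, approximating rule 3): slot in
    # directly behind a producer that sits at the tail of its FIFO.
    for prod in instr.unready_producers:
        f = producer_fifo.get(prod)
        if f is not None and fifos[f] and fifos[f][-1] is prod:
            fifos[f].append(instr)
            return f
    # Rule 1 / fallback: any empty FIFO.
    for i, fifo in enumerate(fifos):
        if not fifo:
            fifo.append(instr)
            return i
    return None  # no FIFO available: stall dispatch this cycle
```

The effect is that each FIFO holds a chain of dependent instructions, so the scheduler only ever needs to examine the FIFO heads instead of the whole window.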
Steering Example
  0:  addu  r18, r0, r2
  1:  addiu r2, r0, -1
  2:  beq   r18, r2, L2
  3:  lw    r4, (r28)
  4:  sllv  r2, r18, r20
  5:  xor   r16, r2, r19
  6:  lw    r3, (r28)
  7:  sll   r2, r16, 0x2
  8:  addu  r2, r2, r23
  9:  lw    r2, 0(r2)
  10: sllv  r4, r18, r4
  11: addu  r17, r4, r19
  12: addiu r3, r3, 1
  13: sw    r3, (r28)
  14: beq   r2, r17, L
Performance Comparison: cycle count within 8% for all benchmarks (plot omitted)
But...
T_Execution = (# Instructions / Instructions per Cycle) × Clock Period
The dependence-based microarchitecture is less complex, which allows a faster clock; overall, it achieves better performance.
Improving Bypass Paths
- One-cycle bypass within a cluster
- Two-cycle bypass between clusters
Performance of Clustering (plot omitted)
- IPC is up to 12% lower
- The clock is 25% faster
- Overall: 10%–22% faster, 16% faster on average
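These numbers check out against the execution-time equation T = (#Instructions / IPC) × Clock Period:

```python
# Worked check of the clustering trade-off using the slide's figures.
ipc_ratio = 0.88       # clustered IPC relative to baseline (12% lower)
clock_speedup = 1.25   # clustered clock frequency relative to baseline

# Instruction count is unchanged, so overall speedup = IPC ratio × clock ratio.
overall = ipc_ratio * clock_speedup   # ~1.10: still ~10% faster
```

So even in the worst IPC case, the faster clock more than pays for the lost instructions per cycle, which is the complexity-effective argument in a nutshell.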
Other Alternatives
1. 1 window + 1 cluster
2. 1 window + 2 clusters
3. 2 windows + 2 clusters
   i.   issue window + steering heuristic
   ii.  issue window + random steering
   iii. FIFOs + steering heuristic
Performance of Clustering: steering has the largest impact on performance (plot omitted)
Conclusions
- Wakeup+select and bypass paths are likely to be the limiting pipeline stages; their atomicity makes them critical
- Consider more complexity-effective designs for these stages: sacrifice some IPC for a higher clock frequency