1 Complexity-Effective Superscalar Processors
S. Palacharla, N. P. Jouppi, and J. E. Smith
Presented by: Jason Zebchuk

2 Brainiac vs. Speed Demon
Brainiac: maximize the number of instructions per cycle; more complex logic, slower clock.
Speed Demon: maximize clock frequency; faster clock, fewer instructions per cycle.

3 Complexity-Effective: Issue Width vs. Clock Cycle
The goal is to balance both. The best design is complexity-effective: it allows a complex issue scheme AND a fast clock cycle.

4 What is Complexity? The delay of the critical path through a piece of logic.

5 Creating a Complexity-Effective Architecture
Analyze the complexity of each pipeline stage.
Select the component with the most complexity.
Propose a less-complex alternative that achieves similar performance.

6 Agenda
OoO superscalar pipeline overview
Register renaming
Wakeup & select
Bypass
Complexity-effective optimizations: dependence-based scheduler, clustering

7 Pipeline Overview
[Pipeline diagram: Fetch, Decode, Rename, Issue Window (Wakeup, Select), Register File, Bypass, Data Cache]
Fetch: read instructions from the I-cache; predict branches.

8 Pipeline Overview
Decode: parse instructions and shuffle opcode parts to the appropriate ports for rename.

9 Pipeline Overview
Rename: map architectural registers to physical registers, eliminating false dependences; dispatch renamed instructions to the scheduler.

10 Pipeline Overview
Wakeup: instructions in the issue window check whether they have become ready by comparing register tags broadcast from the Writeback stage.
Select: choose from among the ready instructions.

11 Pipeline Overview
Register File Read: read source operands.

12 Pipeline Overview
Bypass and Execute: execute instructions in the functional units; bypass results from outputs back to inputs.

13 Pipeline Overview
Data Cache Access: loads and stores access the data cache.

14 Pipeline Overview
Writeback: write results to the register file and broadcast result tags to wake up waiting instructions. The tag broadcast happens two cycles before the results are produced.

15 Alternate Pipeline
[Diagram: the issue window is replaced by reservation stations (Wakeup, Select) and a reorder buffer]
Used by the Pentium Pro and PowerPC.
The reorder buffer (ROB) holds values; renaming is Tomasulo-like (operands point to ROB entries).

16 Sources of Complexity
Complexity scales with issue width and window size.

17 What about Fetch, Decode, Register File, Cache?
Register files and caches: previously studied.
Fetch, decode, and the functional units: relatively easy to scale.

18 Register Renaming
Eliminates WAR and WAW hazards by mapping architectural (logical) registers to physical registers.

Architectural (logical) registers:
Ld r1, [r2]
Add r3, r1, r2
Sub r1, r2, r4
Ld r2, [r1]

Physical registers:
Ld p5, [p7]
Add p3, p5, p7
Sub p6, p7, p4
Ld p9, [p6]
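For readers who prefer code, here is a minimal software sketch of the renaming idea above. The free list, dictionary-based Register Alias Table, and allocate-on-first-use handling of unmapped sources are illustrative assumptions, not the hardware described in the paper.

```python
# Minimal sketch of register renaming; RAT and free-list policy are
# illustrative, not the paper's hardware implementation.
def rename(instrs, num_phys=32):
    free = list(range(num_phys))   # pool of free physical registers
    rat = {}                       # Register Alias Table: logical -> physical
    renamed = []
    for dst, srcs in instrs:
        # Sources read the current mapping (allocated on first use here so the
        # sketch is self-contained; real sources are already mapped).
        psrcs = [rat.setdefault(s, free.pop(0)) for s in srcs]
        # The destination always gets a fresh physical register, which is what
        # removes the WAR and WAW hazards shown on the slide.
        pdst = free.pop(0)
        rat[dst] = pdst
        renamed.append((pdst, psrcs))
    return renamed

# The slide's example: Ld r1,[r2]; Add r3,r1,r2; Sub r1,r2,r4; Ld r2,[r1]
code = [("r1", ["r2"]), ("r3", ["r1", "r2"]), ("r1", ["r2", "r4"]), ("r2", ["r1"])]
print(rename(code))   # the two writes to r1/r2 land in different physical registers
```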

19 Register Rename Logic
[Diagram: logical source & destination registers feed the Register Alias Table (an SRAM array) and the dependence check logic; a MUX produces the physical source & destination registers]
Scales with issue width.

20 Register Alias Table
[Diagram: the logical register name is decoded onto wordlines; the physical register name is read out over bitlines and sense amps]
T_rename = T_decode + T_wordline + T_bitline + T_senseamp

21 Rename Delay
[Plot: rename delay vs. issue width] Rename delay scales linearly with issue width.

22 Instruction Wakeup
Broadcast newly available result tags.
Update which operands are ready for each waiting instruction.
Mark an instruction ready when both its left and right operands are ready.
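A minimal software sketch of this wakeup step, assuming a simple per-entry tag comparison; the class and field names are illustrative, not the CAM circuitry analyzed in the paper.

```python
# Sketch of CAM-style wakeup: every window entry compares each broadcast tag
# against both of its operand tags.
class WindowEntry:
    def __init__(self, left_tag, right_tag):
        self.left_tag, self.right_tag = left_tag, right_tag
        self.left_ready = left_tag is None      # None means no pending operand
        self.right_ready = right_tag is None

    def wakeup(self, broadcast_tag):
        if self.left_tag == broadcast_tag:
            self.left_ready = True
        if self.right_tag == broadcast_tag:
            self.right_ready = True
        return self.left_ready and self.right_ready   # instruction is ready

# Example: an instruction waiting on physical registers p5 and p7.
entry = WindowEntry("p5", "p7")
entry.wakeup("p5")            # left operand arrives
print(entry.wakeup("p7"))     # True: both operands ready, request selection
```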

23 Wakeup Logic
[Diagram: tag drive, tag match, and match OR stages]
Delay = T_tagdrive + T_tagmatch + T_matchOR

24 Wakeup Delay
[Plot: wakeup delay vs. window size] Delay increases quadratically with window size.

25 Wakeup Delay Breakdown
[Plot: wakeup delay broken down by component] Wire delay dominates at smaller feature sizes!

26 Select
Consider all instructions that are ready.
Select one ready instruction for each functional unit.
Uses some selection policy (e.g., oldest first); the policy has little effect on performance.

27 Selection Logic
[Diagram: a tree of arbiters; request signals flow up the tree, grant signals flow back down]
Delay = c0 + c1 × log4(WINSIZE)
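A minimal software model of the arbiter tree implied by the log4(WINSIZE) term, assuming a fixed left-to-right priority; it is a sketch of the selection idea, not the paper's circuit.

```python
# Each recursion level mimics one level of 4-input arbiters: one request is
# granted per group and the winners propagate toward the root.
def arbitrate(requests):
    """requests: list of booleans, one per window entry.
    Returns the index of the single granted entry, or None."""
    indices = [i for i, req in enumerate(requests) if req]
    if not indices:
        return None
    if len(requests) <= 4:
        return indices[0]                      # leaf arbiter: grant the first request
    group_size = len(requests) // 4            # split into 4 groups per tree level
    winners = [arbitrate(requests[i:i + group_size])
               for i in range(0, len(requests), group_size)]
    for g, w in enumerate(winners):
        if w is not None:
            return g * group_size + w          # root grants the first requesting group
    return None

# Example: 32-entry window, entries 7 and 20 are ready.
reqs = [False] * 32
reqs[7] = reqs[20] = True
print(arbitrate(reqs))                          # grants entry 7
```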

28 Selection Delay
[Plot: selection delay vs. window size] Only logic delay is shown; wire delay is ignored.

29 Data Bypass Network
Forwards results from completing instructions to dependent instructions.
The number of paths depends on pipeline depth and issue width:
# of paths = 2 × IW² × S, where S is the number of pipe stages after results are produced.
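A quick numerical illustration of the path-count formula above; the example issue widths and the S = 2 value are illustrative choices, not results from the paper.

```python
# Bypass path count from the slide: 2 * IW^2 * S result wires.
def bypass_paths(issue_width, stages_after_result):
    return 2 * issue_width ** 2 * stages_after_result

for iw in (4, 8):
    print(f"{iw}-wide, S=2: {bypass_paths(iw, 2)} paths")   # 64 and 256 paths
```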

30 Data Bypass Logic
The functional units broadcast their results (potentially several results from subsequent cycles).
The register file supplies the current operand values.
A MUX at each functional-unit input selects the correct source.

31 Bypass Delay
T_bypass = 0.5 × R_metal × C_metal × L², where L is the length of the result wires.
The delay is independent of feature size but depends on the specific layout.

Issue Width  Wire Length (λ)  Delay (ps)
4            20500            184.9
8            49000            1056.4
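A quick numerical check of the quadratic wire-delay model using the two data points in the table above; the lumped constant is inferred from the 4-wide row rather than taken from the paper.

```python
# T_bypass = 0.5 * R_metal * C_metal * L^2: delay grows with the square of wire length.
len_4wide, delay_4wide = 20500, 184.9       # wire length in lambda, delay in ps
len_8wide, delay_8wide = 49000, 1056.4

k = delay_4wide / len_4wide ** 2            # effective 0.5 * R_metal * C_metal
predicted_8wide = k * len_8wide ** 2
print(f"predicted 8-wide bypass delay: {predicted_8wide:.1f} ps "
      f"(slide reports {delay_8wide} ps)")   # ~1056 ps, matching the table
```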

32 Putting it All Together

Feature size  Issue Width  Window Size  Rename (ps)  Wakeup+Select (ps)  Bypass (ps)
0.8μm         4            32           1577.9       2903.7              184.9
0.8μm         8            64           1710.5       3369.4              1056.4
0.35μm        4            32           627.2        1248.4              184.9
0.35μm        8            64           726.6        1484.8              1056.4
0.18μm        4            32           351.0        578.0               184.9
0.18μm        8            64           427.9        724.0               1056.4

33 Key Sources of Complexity
Wakeup+Select logic: the limiting stage for 5 out of 6 designs; considers all instructions simultaneously.
Bypass paths: wire-dominated delay; lots of long wires.

34 Dependence-based Microarchitecture
[Pipeline diagram: Fetch, Decode, Rename, Steer, FIFOs (Wakeup, Select), Register File, Bypass, Data Cache]
Replaces the issue window with a few small FIFO queues.
Only the head of each FIFO is considered for scheduling.
Adds a new Steering stage.

35 Instruction Steering Heuristic
Three possible cases for instruction I (a code sketch follows):
1. All operands are ready and in the register file: I is steered to an empty FIFO.
2. I requires an operand produced by I_source in FIFO F_a: if no instruction is behind I_source, I is steered to F_a; otherwise, I is steered to an empty FIFO.
3. Two operands, produced by I_left and I_right: apply rule 2 to I_left; if that steers I to an empty FIFO, apply rule 2 to I_right.
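The sketch below expresses the three cases in code. The FIFO bookkeeping, argument names, and the assumption that an empty FIFO is always available are illustrative simplifications, not the paper's implementation (a real design would stall dispatch when no suitable FIFO exists).

```python
# Minimal sketch of the FIFO steering heuristic for one instruction.
def steer(instr, fifos, producer_fifo, producer_is_tail, ready_in_regfile):
    """Return the index of the FIFO that instr should be appended to.
    fifos:             list of lists of instruction ids (the FIFO queues)
    producer_fifo:     operand -> index of the FIFO holding its producer
    producer_is_tail:  operand -> True if nothing is behind its producer
    ready_in_regfile:  set of operands whose values are already available"""
    def empty_fifo():
        return next(i for i, f in enumerate(fifos) if not f)  # assumes one exists

    pending = [op for op in instr["sources"] if op not in ready_in_regfile]

    if not pending:                                  # case 1: all operands ready
        return empty_fifo()

    def rule2(op):                                   # case 2 for a single operand
        return producer_fifo[op] if producer_is_tail[op] else empty_fifo()

    if len(pending) == 1:
        return rule2(pending[0])

    left, right = pending                            # case 3: two pending operands
    choice = rule2(left)
    if not fifos[choice]:                            # left operand chose an empty FIFO,
        choice = rule2(right)                        # so retry with the right operand
    return choice
```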

36 Steering Example
0: addu r18, r0, r2
1: addiu r2, r0, -1
2: beq r18, r2, L2
3: lw r4, -32768(r28)
4: sllv r2, r18, r20
5: xor r16, r2, r19
6: lw r3, -32676(r28)
7: sll r2, r16, 0x2
8: addu r2, r2, r23
9: lw r2, 0(r2)
10: sllv r4, r18, r4
11: addu r17, r4, r19
12: addiu r3, r3, 1
13: sw r3, -32676(r28)
14: beq r2, r17, L3
[Diagram: the instructions above distributed across the FIFOs by the steering heuristic]

37 Performance Comparison
[Plot: cycle counts of the dependence-based microarchitecture vs. the baseline] Cycle count is within 8% for all benchmarks.

38 But...
T_execution = (# Instructions / Instructions per Cycle) × Clock Period
The dependence-based microarchitecture is less complex and allows a faster clock.
Overall, the dependence-based design has better performance.

39 Improving Bypass Paths
One-cycle bypass within a cluster; two-cycle bypass between clusters.

40 Performance of Clustering
[Plot: performance of the clustered design relative to the baseline]
IPC is up to 12% lower, but the clock is 25% faster.
Overall, 10% - 22% faster; 16% faster on average.
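A worked check of how these two ratios combine, using only the numbers on the slide.

```python
# Clustering trade-off: up to 12% lower IPC combined with a 25% faster clock.
ipc_ratio   = 0.88     # clustered IPC / baseline IPC, worst case on the slide
clock_ratio = 1.25     # clustered clock frequency / baseline clock frequency

speedup = ipc_ratio * clock_ratio
print(f"worst-case speedup: {speedup:.2f}x")   # ~1.10x, matching the 10% lower bound
```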

41 Other Alternatives
1. 1 window + 1 cluster
2. 1 window + 2 clusters
3. 2 windows + 2 clusters:
   i. issue window + steering heuristic
   ii. issue window + random steering
   iii. FIFOs + steering heuristic

42 Performance of Clustering
[Plot: performance of the alternative configurations] Steering has the largest impact on performance.

43 Conclusions
Wakeup+Select and the bypass paths are likely to be the limiting pipeline stages.
Atomicity makes these stages critical.
Consider more complexity-effective designs for these stages: accept some IPC decrease in exchange for a higher clock frequency.

