1 Complexity-Effective Superscalar Processors
S. Palacharla, N. P. Jouppi, and J. E. Smith
Presented by: Jason Zebchuk

2 Brainiac vs. Speed Demon
Brainiac: maximize the number of instructions per cycle; more complex logic, slower clock.
Speed Demon: maximize clock frequency; faster clock, fewer instructions per cycle.

3 Complexity-Effective: Issue Width vs. Clock Cycle
The goal is to balance both. The best design is complexity-effective: it allows a complex issue scheme AND a fast clock cycle.

4 What is Complexity? The delay of the critical path through a piece of logic.

5 Creating a Complexity-Effective Architecture
Analyze the complexity of each pipeline stage.
Select the component with the most complexity.
Propose a less-complex alternative that achieves similar performance.

6 Agenda
OoO superscalar pipeline overview
Register renaming
Wakeup & select
Bypass
Complexity-effective optimizations: dependence-based scheduler, clustering

7 Pipeline Overview
[Pipeline diagram: Fetch, Decode, Rename, Issue Window (Wakeup, Select), Register File, Bypass, Data Cache]
Fetch: read instructions from the I-cache; predict branches.

8 Pipeline Overview
Decode: parse instructions and shuffle opcode parts to the appropriate ports for rename.

9 Pipeline Overview
Rename: map architectural registers to physical registers, eliminating false dependences; dispatch renamed instructions to the scheduler.

10 Pipeline Overview
Wakeup: instructions in the issue window check whether they have become ready by comparing register tags broadcast from the Writeback stage.
Select: choose from among the ready instructions.

11 Pipeline Overview
Register File Read: read source operands.

12 Pipeline Overview
Bypass and Execute: execute instructions in the functional units; bypass results from outputs back to inputs.

13 Pipeline Overview
Data Cache Access: loads and stores access the data cache.

14 Pipeline Overview
Writeback: write results to the register file and broadcast result tags to wake up waiting instructions. The tag broadcast happens two cycles before the results are produced.

15 Alternate Pipeline
[Diagram: the issue window is replaced by reservation stations (Wakeup, Select) and a reorder buffer]
Used by the Pentium Pro and PowerPC.
The reorder buffer (ROB) holds values; renaming is Tomasulo-like (operands point to ROB entries).

16 Sources of Complexity
Complexity scales with issue width and window size.

17 What about Fetch, Decode, Register File, Cache?
Register files and caches: previously studied.
Fetch, decode, and the functional units: relatively easy to scale.

18 Register Renaming
Eliminates WAR and WAW hazards by mapping architectural (logical) registers to physical registers.

Architectural (logical) registers:
Ld r1, [r2]
Add r3, r1, r2
Sub r1, r2, r4
Ld r2, [r1]

Physical registers:
Ld p5, [p7]
Add p3, p5, p7
Sub p6, p7, p4
Ld p9, [p6]
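For readers who prefer code, here is a minimal software sketch of the renaming idea above. The free list, dictionary-based Register Alias Table, and allocate-on-first-use handling of unmapped sources are illustrative assumptions, not the hardware described in the paper.

```python
# Minimal sketch of register renaming; RAT and free-list policy are
# illustrative, not the paper's hardware implementation.
def rename(instrs, num_phys=32):
    free = list(range(num_phys))   # pool of free physical registers
    rat = {}                       # Register Alias Table: logical -> physical
    renamed = []
    for dst, srcs in instrs:
        # Sources read the current mapping (allocated on first use here so the
        # sketch is self-contained; real sources are already mapped).
        psrcs = [rat.setdefault(s, free.pop(0)) for s in srcs]
        # The destination always gets a fresh physical register, which is what
        # removes the WAR and WAW hazards shown on the slide.
        pdst = free.pop(0)
        rat[dst] = pdst
        renamed.append((pdst, psrcs))
    return renamed

# The slide's example: Ld r1,[r2]; Add r3,r1,r2; Sub r1,r2,r4; Ld r2,[r1]
code = [("r1", ["r2"]), ("r3", ["r1", "r2"]), ("r1", ["r2", "r4"]), ("r2", ["r1"])]
print(rename(code))   # the two writes to r1/r2 land in different physical registers
```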

19 Register Rename Logic
[Diagram: logical source & destination registers feed the Register Alias Table (an SRAM array) and the dependence check logic; a MUX produces the physical source & destination registers]
Scales with issue width.

20 Register Alias Table
[Diagram: the logical register name is decoded onto wordlines; the physical register name is read out over bitlines and sense amps]
T_rename = T_decode + T_wordline + T_bitline + T_senseamp

21 Rename Delay
[Plot: rename delay vs. issue width] Rename delay scales linearly with issue width.

22 Instruction Wakeup
Broadcast newly available result tags.
Update which operands are ready for each waiting instruction.
Mark an instruction ready when both its left and right operands are ready.
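A minimal software sketch of this wakeup step, assuming a simple per-entry tag comparison; the class and field names are illustrative, not the CAM circuitry analyzed in the paper.

```python
# Sketch of CAM-style wakeup: every window entry compares each broadcast tag
# against both of its operand tags.
class WindowEntry:
    def __init__(self, left_tag, right_tag):
        self.left_tag, self.right_tag = left_tag, right_tag
        self.left_ready = left_tag is None      # None means no pending operand
        self.right_ready = right_tag is None

    def wakeup(self, broadcast_tag):
        if self.left_tag == broadcast_tag:
            self.left_ready = True
        if self.right_tag == broadcast_tag:
            self.right_ready = True
        return self.left_ready and self.right_ready   # instruction is ready

# Example: an instruction waiting on physical registers p5 and p7.
entry = WindowEntry("p5", "p7")
entry.wakeup("p5")            # left operand arrives
print(entry.wakeup("p7"))     # True: both operands ready, request selection
```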

23 Wakeup Logic
[Diagram: tag drive, tag match, and match OR stages]
Delay = T_tagdrive + T_tagmatch + T_matchOR

24 Wakeup Delay
[Plot: wakeup delay vs. window size] Delay increases quadratically with window size.

25 Wakeup Delay Breakdown
[Plot: wakeup delay broken down by component] Wire delay dominates at smaller feature sizes!

26 Select
Consider all instructions that are ready.
Select one ready instruction for each functional unit.
Uses some selection policy (e.g., oldest first); the policy has little effect on performance.

27 Selection Logic
[Diagram: a tree of arbiters; request signals flow up the tree, grant signals flow back down]
Delay = c0 + c1 × log4(WINSIZE)
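A minimal software model of the arbiter tree implied by the log4(WINSIZE) term, assuming a fixed left-to-right priority; it is a sketch of the selection idea, not the paper's circuit.

```python
# Each recursion level mimics one level of 4-input arbiters: one request is
# granted per group and the winners propagate toward the root.
def arbitrate(requests):
    """requests: list of booleans, one per window entry.
    Returns the index of the single granted entry, or None."""
    indices = [i for i, req in enumerate(requests) if req]
    if not indices:
        return None
    if len(requests) <= 4:
        return indices[0]                      # leaf arbiter: grant the first request
    group_size = len(requests) // 4            # split into 4 groups per tree level
    winners = [arbitrate(requests[i:i + group_size])
               for i in range(0, len(requests), group_size)]
    for g, w in enumerate(winners):
        if w is not None:
            return g * group_size + w          # root grants the first requesting group
    return None

# Example: 32-entry window, entries 7 and 20 are ready.
reqs = [False] * 32
reqs[7] = reqs[20] = True
print(arbitrate(reqs))                          # grants entry 7
```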

28 Selection Delay
[Plot: selection delay vs. window size] Only logic delay is shown; wire delay is ignored.

29 Data Bypass Network
Forwards results from completing instructions to dependent instructions.
The number of paths depends on pipeline depth and issue width:
# of paths = 2 × IW² × S, where S is the number of pipe stages after results are produced.
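A quick numerical illustration of the path-count formula above; the example issue widths and the S = 2 value are illustrative choices, not results from the paper.

```python
# Bypass path count from the slide: 2 * IW^2 * S result wires.
def bypass_paths(issue_width, stages_after_result):
    return 2 * issue_width ** 2 * stages_after_result

for iw in (4, 8):
    print(f"{iw}-wide, S=2: {bypass_paths(iw, 2)} paths")   # 64 and 256 paths
```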

30 Data Bypass Logic
The functional units broadcast their results (potentially several results from subsequent cycles).
The register file supplies the current operand values.
A MUX at each functional-unit input selects the correct source.

31 Bypass Delay
T_bypass = 0.5 × R_metal × C_metal × L², where L is the length of the result wires.
The delay is independent of feature size but depends on the specific layout.

Issue Width  Wire Length (λ)  Delay (ps)
4            20500            184.9
8            49000            1056.4
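A quick numerical check of the quadratic wire-delay model using the two data points in the table above; the lumped constant is inferred from the 4-wide row rather than taken from the paper.

```python
# T_bypass = 0.5 * R_metal * C_metal * L^2: delay grows with the square of wire length.
len_4wide, delay_4wide = 20500, 184.9       # wire length in lambda, delay in ps
len_8wide, delay_8wide = 49000, 1056.4

k = delay_4wide / len_4wide ** 2            # effective 0.5 * R_metal * C_metal
predicted_8wide = k * len_8wide ** 2
print(f"predicted 8-wide bypass delay: {predicted_8wide:.1f} ps "
      f"(slide reports {delay_8wide} ps)")   # ~1056 ps, matching the table
```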

32 Putting it All Together

Feature size  Issue Width  Window Size  Rename (ps)  Wakeup+Select (ps)  Bypass (ps)
0.8μm         4            32           1577.9       2903.7              184.9
0.8μm         8            64           1710.5       3369.4              1056.4
0.35μm        4            32           627.2        1248.4              184.9
0.35μm        8            64           726.6        1484.8              1056.4
0.18μm        4            32           351.0        578.0               184.9
0.18μm        8            64           427.9        724.0               1056.4

33 Key Sources of Complexity
Wakeup+Select logic: the limiting stage for 5 out of 6 designs; considers all instructions simultaneously.
Bypass paths: wire-dominated delay; lots of long wires.

34 Dependence-based Microarchitecture
[Pipeline diagram: Fetch, Decode, Rename, Steer, FIFOs (Wakeup, Select), Register File, Bypass, Data Cache]
Replaces the issue window with a few small FIFO queues.
Only the head of each FIFO is considered for scheduling.
Adds a new Steering stage.

35 Instruction Steering Heuristic
Three possible cases for instruction I (a code sketch follows):
1. All operands are ready and in the register file: I is steered to an empty FIFO.
2. I requires an operand produced by I_source in FIFO F_a: if no instruction is behind I_source, I is steered to F_a; otherwise, I is steered to an empty FIFO.
3. Two operands, produced by I_left and I_right: apply rule 2 to I_left; if that steers I to an empty FIFO, apply rule 2 to I_right.
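The sketch below expresses the three cases in code. The FIFO bookkeeping, argument names, and the assumption that an empty FIFO is always available are illustrative simplifications, not the paper's implementation (a real design would stall dispatch when no suitable FIFO exists).

```python
# Minimal sketch of the FIFO steering heuristic for one instruction.
def steer(instr, fifos, producer_fifo, producer_is_tail, ready_in_regfile):
    """Return the index of the FIFO that instr should be appended to.
    fifos:             list of lists of instruction ids (the FIFO queues)
    producer_fifo:     operand -> index of the FIFO holding its producer
    producer_is_tail:  operand -> True if nothing is behind its producer
    ready_in_regfile:  set of operands whose values are already available"""
    def empty_fifo():
        return next(i for i, f in enumerate(fifos) if not f)  # assumes one exists

    pending = [op for op in instr["sources"] if op not in ready_in_regfile]

    if not pending:                                  # case 1: all operands ready
        return empty_fifo()

    def rule2(op):                                   # case 2 for a single operand
        return producer_fifo[op] if producer_is_tail[op] else empty_fifo()

    if len(pending) == 1:
        return rule2(pending[0])

    left, right = pending                            # case 3: two pending operands
    choice = rule2(left)
    if not fifos[choice]:                            # left operand chose an empty FIFO,
        choice = rule2(right)                        # so retry with the right operand
    return choice
```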

36 Steering Example
0: addu r18, r0, r2
1: addiu r2, r0, -1
2: beq r18, r2, L2
3: lw r4, -32768(r28)
4: sllv r2, r18, r20
5: xor r16, r2, r19
6: lw r3, -32676(r28)
7: sll r2, r16, 0x2
8: addu r2, r2, r23
9: lw r2, 0(r2)
10: sllv r4, r18, r4
11: addu r17, r4, r19
12: addiu r3, r3, 1
13: sw r3, -32676(r28)
14: beq r2, r17, L3
[Diagram: the instructions above distributed across the FIFOs by the steering heuristic]

37 Performance Comparison
[Plot: cycle counts of the dependence-based microarchitecture vs. the baseline] Cycle count is within 8% for all benchmarks.

38 But...
T_execution = (# Instructions / Instructions per Cycle) × Clock Period
The dependence-based microarchitecture is less complex and allows a faster clock.
Overall, the dependence-based design has better performance.

39 Improving Bypass Paths
One-cycle bypass within a cluster; two-cycle bypass between clusters.

40 Performance of Clustering
[Plot: performance of the clustered design relative to the baseline]
IPC is up to 12% lower, but the clock is 25% faster.
Overall, 10% - 22% faster; 16% faster on average.
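A worked check of how these two ratios combine, using only the numbers on the slide.

```python
# Clustering trade-off: up to 12% lower IPC combined with a 25% faster clock.
ipc_ratio   = 0.88     # clustered IPC / baseline IPC, worst case on the slide
clock_ratio = 1.25     # clustered clock frequency / baseline clock frequency

speedup = ipc_ratio * clock_ratio
print(f"worst-case speedup: {speedup:.2f}x")   # ~1.10x, matching the 10% lower bound
```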

41 Other Alternatives
1. 1 window + 1 cluster
2. 1 window + 2 clusters
3. 2 windows + 2 clusters:
   i. issue window + steering heuristic
   ii. issue window + random steering
   iii. FIFOs + steering heuristic

42 Performance of Clustering
[Plot: performance of the alternative configurations] Steering has the largest impact on performance.

43 Conclusions
Wakeup+Select and the bypass paths are likely to be the limiting pipeline stages.
Atomicity makes these stages critical.
Consider more complexity-effective designs for these stages: accept some IPC decrease in exchange for a higher clock frequency.

