Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97.

Similar presentations


Presentation on theme: "CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97."— Presentation transcript:

1 CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97

2 Complexity-Effective Conflict between clock speed and parallelism Goals of the paper:  Characterize complexity as a function of issue width, window size, and feature size  Propose clustered microarchitecture that allows fast clocks with high parallelism

3 Current Trends (circa 1997) More functional units, large in-flight windows Impact on cycle-time critical structures  Register renaming  Instruction wake-up  Instruction selection  Result bypass  Register files  Caches

4 Wire Delay Trends Logic delays scale linearly with feature size Wire delay ~ RC = R m x C m x L 2 R m =  / (width x thickness) C m = 2 x  x  0 x (thickness/width + width/thickness) R m ~ S ; C m ~ S ; L ~ 1/S (gate size scaled by 1/S) Hence, delay across 50K gates is constant in ps and is linear with S in terms of FO4

5 Update on Wire Delays “The Future of Wires”, Ho, Mai, Horowitz, 2001 C m actually decreases with reduced feature widths Hence, wire delay across 50K gates (in FO4) increases only slightly and is not quite linear with S – uses repeaters Wire delays are still a problem (though, not as bad as Palacharla et al. claim) – also note, FO4s/clock is shrinking

6 Update on Wire Delays From “Future of Wires”, Ho, Mai, Horowitz

7 Register Rename Logic Map Table Dependence Check Logic Mux Logical Source Regs Logical Dest Regs Logical Source Reg Physical Source Regs Physical Dest Regs Free Pool

8 Map Table – RAM Phys reg id Num entries = Num logical regs Shadow copies (shift register) 7-bits

9 Map Table – CAM Logical reg id Num entries = Num phys regs Shadow copies 5-bits1-bit validvalid

10 Delay Model cell Wire length = C + 3 x IW Delay = RC = c 0 + c 1 x IW + c 2 x IW 2 Rename delay ~ IW The wire delay component increases as we shrink to 0.18  Problems: They assume that wire delay/ (in ns) remains constant. No window size?

11 Wakeup Logic rdyLrdyRtagRtagL or == tag1tagIW … rdyLrdyRtagRtagL............

12 Wakeup Logic CAM array wire length ~ issue width x winsize Capacitive load ~ winsize Matchline length ~ issue width Issue width has a greater impact on delay as it influences tagdrive and tagmatch (the quadratic components are not very dominant) For smaller features, the wire delays dominate

13 Selection Logic Issue window reqgrant anyreq enable Arbiter cell

14 Selection Logic Multiple FUs are handled by having more stages in series – further increases selection logic delay Delay ~ log(WINSIZE) Wire lengths ~ WINSIZE, but are ignored – hence, delay scales very well with feature size

15 Bypass Delay The number of bypass paths equals 2xIW 2 xS (S is the number of pipeline stages) Wire length ~ IW, hence, delay ~ IW 2 The layout and pipeline depth (capacitive load) also matter

16 Summary of Results Issue Width Window Size Rename Delay (ps) Wakeup + Select (ps) Bypass Delay (ps) 4848 32 64 1577.9 1710.5 2903.7 3369.4 184.9 1056.4 4848 32 64 627.2 726.6 1248.4 1484.8 184.9 1056.4 4848 32 64 351.0 427.9 578.0 724.0 184.9 1056.4 0.8  m technology 0.35  m technology 0.18  m technology

17 Bottlenecks Wakeup+Select and Bypass have the longest delays and represent atomic operations Pipelining will prevent back-to-back operations Increased issue width / window size / wire delays exacerbate the problem (also for the register file and cache)

18 Dependence-Based Microarchitecture r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

19 Dependence-Based Microarchitecture r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

20 Dependence-Based Microarchitecture r4  r3 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

21 Dependence-Based Microarchitecture r5  r4 + r2 r4  r3 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

22 Dependence-Based Microarchitecture r5  r4 + r2 r4  r3 + r2 r3  r1 + r2r6  r4 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

23 Dependence-Based Microarchitecture r5  r4 + r2 r4  r3 + r2 r3  r1 + r2 r7  r6 + r2 r6  r4 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

24 Dependence-Based Microarchitecture r8  r5 + r2 r5  r4 + r2 r4  r3 + r2 r3  r1 + r2 r7  r6 + r2 r6  r4 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

25 Dependence-Based Microarchitecture r8  r5 + r2 r5  r4 + r2 r4  r3 + r2 r3  r1 + r2 r7  r6 + r2 r6  r4 + r2r9  r1 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands

26 Dependence-Based Microarchitecture r8  r5 + r2 r5  r4 + r2 r4  r3 + r2 r3  r1 + r2 r7  r6 + r2 r6  r4 + r2r9  r1 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 0 … FIFOs Rdy Operands r1  r2 

27 Dependence-Based Microarchitecture r8  r5 + r2 r5  r4 + r2 r4  r3 + r2 r7  r6 + r2 r6  r4 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 1 … FIFOs Rdy Operands r3  r9 

28 Dependence-Based Microarchitecture r8  r5 + r2 r5  r4 + r2 r7  r6 + r2 r6  r4 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 1 … FIFOs Rdy Operands r4 

29 Dependence-Based Microarchitecture r8  r5 + r2r7  r6 + r2 r3  r1 + r2 r4  r3 + r2 r5  r4 + r2 r6  r4 + r2 r7  r6 + r2 r8  r5 + r2 r9  r1 + r2 r1 1 r2 1 r3 1 … FIFOs Rdy Operands r5  r6 

30 Pros and Cons Wakeup and select over a subset of issue queue entries (only FIFO heads) Under-utilization as FIFOs do not get filled (causes about 5% IPC loss) – but it is not hard to increase their sizes You still need an operand-rdy table

31 Clustered Microarchitectures

32 Simplifies wakeup+select and bypassing Dependence-based, hence most communication is local Low porting requirements on register file, issue queue IPC loss of 6.3%, but a clock speed improvement

33 Conclusions As issue width and window size increase, the delays of most structures go up dramatically Dominant wire delays exacerbate the problem Hence, to support large widths, build smaller cores that communicate with each other With dependence information, it is possible to minimize communication costs

34 Next Class’ Paper “Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures”, ISCA’00 Do not get bogged down in details & methodology


Download ppt "CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97."

Similar presentations


Ads by Google