CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

1 CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin ISCA ’00

2 Previous Papers. Limits of ILP: it is probably worth doing out-of-order (o-o-o) superscalar. Complexity-Effective: wire delays make the implementations harder and increase latencies. Today’s paper: these latencies severely impact IPC and slow the growth in processor performance.

3 1995-2000

4 Clock speed has improved by 50% every year, through reduction in logic delays and deeper pipelines; this will soon end. IPC has gone up dramatically (the increased complexity was worth it). Will this end too?

5 Wire Scaling. Multiple wire layers; the SIA roadmap predicts dimensions (somewhat aggressive). As transistor widths shrink, wires become thinner and their resistance per unit length goes up (quadratically; see Table 1). Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase). The equations are different, but the end result is similar to Palacharla’s (without repeaters).

6 Wire Scaling

7 With repeaters, the delay of a fixed-length wire does not go up quadratically as we shrink gate width. In going from 250nm → 35nm: 5mm wire delay 170ps → 390ps; delay to cross X gates 170ps → 55ps; SIA clock speed 0.75GHz → 13.5GHz; delay to cross X gates 0.13 cycles → 0.75 cycles. We could increase wire width, but that compromises bandwidth.
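
The cycle counts on slide 7 follow from dividing each delay by the clock period. A minimal sketch of that arithmetic, using only the figures quoted on the slide; the function and dictionary names are illustrative, not from the paper:

```python
def delay_in_cycles(delay_ps: float, clock_ghz: float) -> float:
    """Express a delay given in picoseconds as a fraction of a clock cycle."""
    cycle_ps = 1000.0 / clock_ghz  # a 1 GHz clock has a 1000 ps period
    return delay_ps / cycle_ps


# Figures quoted on the slide for the two technology nodes.
# "cross_x_gates_ps" is the delay of a wire spanning a fixed number of gates,
# "wire_5mm_ps" is the delay of a fixed-length 5mm wire.
NODES = {
    "250nm": {"cross_x_gates_ps": 170.0, "wire_5mm_ps": 170.0, "clock_ghz": 0.75},
    "35nm": {"cross_x_gates_ps": 55.0, "wire_5mm_ps": 390.0, "clock_ghz": 13.5},
}

for name, node in NODES.items():
    scaled = delay_in_cycles(node["cross_x_gates_ps"], node["clock_ghz"])
    fixed = delay_in_cycles(node["wire_5mm_ps"], node["clock_ghz"])
    print(f"{name}: cross X gates = {scaled:.2f} cycles, 5mm wire = {fixed:.2f} cycles")
```

The same wire that spans X gates gets faster in absolute terms (170ps to 55ps) yet costs several times more of a cycle, because the clock period shrinks much faster than the wire delay.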

8 Clock Scaling. Logic delay (the FO4 delay) scales linearly with gate length. Likewise, work per pipeline stage has also been shrinking. The SIA predicts that today’s 16 FO4/stage delay will shrink to 5.6 FO4/stage. A 64-bit add takes 5.5 FO4; hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies.
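
To see how an FO4 budget per stage turns into a clock rate, here is a rough sketch. It assumes the common rule of thumb that one FO4 delay is about 360 ps per micron of gate length; that constant is an assumption and is not stated on the slide, and the function names are illustrative:

```python
# Assumption: FO4 delay (ps) ~= 360 * gate length (microns). Rule of thumb only.
FO4_PS_PER_MICRON = 360.0


def fo4_delay_ps(gate_length_nm: float) -> float:
    """Approximate delay of one fanout-of-4 inverter; linear in gate length."""
    return FO4_PS_PER_MICRON * gate_length_nm / 1000.0


def clock_ghz(gate_length_nm: float, fo4_per_stage: float) -> float:
    """Clock frequency if every pipeline stage holds `fo4_per_stage` FO4 delays."""
    cycle_ps = fo4_per_stage * fo4_delay_ps(gate_length_nm)
    return 1000.0 / cycle_ps


for fo4_per_stage in (16.0, 8.0):
    freq = clock_ghz(35.0, fo4_per_stage)
    print(f"35nm, {fo4_per_stage:g} FO4/stage: {freq:.2f} GHz")
```

Under this assumption, halving the work per stage doubles the clock, which is why the choice between the 16-FO4 and 8-FO4 strategies matters as much as the technology node itself.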

9 Clock Scaling

10 While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease

11 On-Chip Wire Delays. The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations), so structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues). Chip area is steadily increasing, so less than 1% of the chip will be reachable in a cycle, and it will take 30 cycles to go across the chip! Processors are becoming communication-bound.

12 Processor Structure Delays. To model the microarchitecture, they estimate the delays of all wire-limited structures.
Structure                   f_SIA   f_8   f_16
64K 2-port L1               7       5     3
64-entry 10-port regfile    3       2     1
20-entry 8-port issueq      3       2     1
64-entry 8-port ROB         3       2     1
Weakness: bypass delays are not considered.

13 Microarchitecture Scaling. Capacity Scaling: keep access latencies constant in cycles (simpler designs) and scale capacities down to make them fit. Pipeline Scaling: keep capacities constant; latencies go up, hence deeper pipelines. Any other approaches?

14 Microarchitecture Scaling. Capacity Scaling: keep access latencies constant in cycles (simpler designs) and scale capacities down to make them fit. Pipeline Scaling: keep capacities constant; latencies go up, hence deeper pipelines. Replicated Capacity Scaling: a fast core with few resources, but lots of such cores; high IPC if you can localize communication.

15 IPC Comparisons. [Figure: block diagrams comparing Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling. Diagram labels: 20-entry IQ, 40 regs, and four F blocks; 2-cycle wakeup, 2-cycle regread, 2-cycle bypass; 15-entry IQ, 30 regs, and three F blocks.]

16 Methodology

17 Results

18 Every instruction experiences longer latencies. IPCs are much lower for aggressive clocks. Overall performance is still comparable for all approaches.

19 Results. In 17 years, we are seeing only a 7-fold speedup (historically, it should have been 1720-fold): an annual increase of 12.5%. Growth is slow because the increases in pipeline depth and IPC will stagnate.
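
The speedup figures on slide 19 are compound-growth arithmetic. A small sketch for checking them; the function names are illustrative, and the 55% rate below is simply the annual rate that compounds to roughly 1720 over 17 years, not a number stated on the slide:

```python
def total_speedup(annual_rate: float, years: int) -> float:
    """Cumulative speedup after compounding `annual_rate` for `years` years."""
    return (1.0 + annual_rate) ** years


def annual_rate(speedup: float, years: int) -> float:
    """Constant annual growth rate that yields `speedup` after `years` years."""
    return speedup ** (1.0 / years) - 1.0


YEARS = 17
print(f"7x over {YEARS} years    -> {annual_rate(7.0, YEARS):.1%} per year")
print(f"1720x over {YEARS} years -> {annual_rate(1720.0, YEARS):.1%} per year")
print(f"12.5% per year for {YEARS} years -> {total_speedup(0.125, YEARS):.1f}x")
```

A 7-fold speedup works out to a little over 12% per year, in line with the slide's rounded 12.5%, while the historical 1720-fold projection corresponds to about 55% per year.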

20 Questionable Assumptions. Additional transistors are not being used to improve IPC. All instructions pay wire-delay penalties.

21 Conclusions. Large monolithic cores will perform poorly; microarchitectures will have to be partitioned. On-chip caches will be the biggest bottlenecks: 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s. Future proposals should be wire-delay-sensitive.

22 Next Class’ Paper. “Dynamic Code Partitioning for Clustered Architectures”, UPC-Barcelona, 2001. Instruction steering heuristics to balance load and minimize communication.
