CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin ISCA ’00

Previous Papers
Limits of ILP: it is probably worth doing out-of-order (o-o-o) superscalar
Complexity-Effective: wire delays make the implementations harder and increase latencies
Today’s paper: these latencies severely impact IPC and slow the growth in processor performance

1995-2000

Clock speed has improved by 50% every year
  → Reduction in logic delays
  → Deeper pipelines
  → This will soon end
IPC has gone up dramatically (the increased complexity was worth it)
  → Will this end too?

Wire Scaling
Multiple wire layers: the SIA roadmap predicts dimensions (somewhat aggressive)
As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically, see Table 1)
Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase)
The equations are different, but the end result is similar to Palacharla’s (without repeaters)

Wire Scaling

With repeaters, delay of a fixed-length wire does not go up quadratically as we shrink gate-width
In going from 250nm → 35nm:
  → 5mm wire delay: 170ps → 390ps
  → delay to cross X gates: 170ps → 55ps
  → SIA clock speed: 0.75GHz → 13.5GHz
  → delay to cross X gates: 0.13 cycles → 0.75 cycles
We could increase wire width, but that compromises bandwidth
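The cycle counts on this slide follow from dividing a wire delay by the clock period. A minimal sketch of that arithmetic, using only the numbers quoted on the slide (the function name `delay_in_cycles` is made up for illustration):

```python
def delay_in_cycles(delay_ps: float, clock_ghz: float) -> float:
    """Express a delay given in picoseconds as a fraction of one clock cycle."""
    period_ps = 1000.0 / clock_ghz  # a 1 GHz clock has a 1000 ps period
    return delay_ps / period_ps

# Numbers taken from the slide: wire delay in ps and SIA clock speed in GHz.
at_250nm = delay_in_cycles(170, 0.75)
at_35nm = delay_in_cycles(55, 13.5)

print(f"250nm: {at_250nm:.4f} cycles")
print(f"35nm:  {at_35nm:.4f} cycles")
```

This gives roughly 0.13 and 0.74 cycles, which agrees with the slide's 0.13 and 0.75 up to rounding: the wire gets faster in picoseconds but the clock period shrinks much more quickly.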

Clock Scaling
Logic delay (the FO4 delay) scales linearly with gate length
Likewise, work per pipeline stage has also been shrinking
The SIA predicts that today’s 16 FO4/stage delay will shrink to 5.6 FO4/stage
A 64-bit add takes 5.5 FO4, so they examine three scaling strategies: SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative)
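A rough way to see what these FO4 budgets mean for clock frequency is to multiply the FO4 count per stage by the FO4 delay of the technology. The sketch below assumes the common rule of thumb that one FO4 delay is about 360 ps per micron of gate length; that constant is an assumption and does not appear on the slide, and the function names are hypothetical.

```python
FO4_PS_PER_MICRON = 360.0  # ASSUMPTION: rule-of-thumb constant, not given on the slide

def fo4_delay_ps(gate_length_nm: float) -> float:
    """FO4 delay in ps, assumed to scale linearly with gate length."""
    return FO4_PS_PER_MICRON * gate_length_nm / 1000.0

def clock_ghz(gate_length_nm: float, fo4_per_stage: float) -> float:
    """Clock frequency if each pipeline stage holds `fo4_per_stage` FO4 delays."""
    period_ps = fo4_per_stage * fo4_delay_ps(gate_length_nm)
    return 1000.0 / period_ps

# The three scaling strategies named on the slide, evaluated at 35nm.
for label, fo4 in [("SIA (5.6 FO4)", 5.6), ("8 FO4", 8.0), ("16 FO4", 16.0)]:
    print(f"35nm, {label}: {clock_ghz(35, fo4):.1f} GHz")
```

Under this assumption the SIA strategy at 35nm comes out near 14 GHz, in the same range as the 13.5GHz roadmap figure quoted on the Wire Scaling slide, while the 16-FO4 strategy runs at about 5 GHz.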

Clock Scaling

While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease

On-Chip Wire Delays
The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
  → Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
Chip area is steadily increasing
  → Less than 1% of the chip is reachable in a cycle, 30 cycles to go across the chip!
Processors are becoming communication-bound

Processor Structure Delays
To model the microarchitecture, they estimate the delays of all wire-limited structures (in cycles, for each clock strategy)

Structure                    f_SIA   f_8   f_16
64K 2-port L1                  7      5     3
64-entry 10-port regfile       3      2     1
20-entry 8-port issueq         3      2     1
64-entry 8-port ROB            3      2     1

Weakness: bypass delays are not considered

Microarchitecture Scaling
Capacity Scaling: constant access latencies in cycles (simpler designs), scale capacities down to make it fit
Pipeline Scaling: constant capacities, latencies go up, hence, deeper pipelines
Any other approaches?

Microarchitecture Scaling
Capacity Scaling: constant access latencies in cycles (simpler designs), scale capacities down to make it fit
Pipeline Scaling: constant capacities, latencies go up, hence, deeper pipelines
Replicated Capacity Scaling: fast cores with few resources each, but many such cores; high IPC if you can localize communication

IPC Comparisons
[Figure: block diagrams for Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling. Labels in the figure: blocks with 20-IQ, 40 Regs, and four units marked F; 2-cycle wakeup, 2-cycle regread, 2-cycle bypass; blocks with 15-IQ, 30 Regs, and three units marked F.]

Methodology

Results

Every instruction experiences longer latencies
IPCs are much lower for aggressive clocks
Overall performance is still comparable for all approaches
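The reason a lower IPC can still yield comparable performance is that throughput is the product of IPC and clock frequency. The sketch below illustrates this with two HYPOTHETICAL design points; the IPC and clock values are invented for illustration and are not results from the paper.

```python
def bips(ipc: float, clock_ghz: float) -> float:
    """Billions of instructions per second = IPC x clock frequency in GHz."""
    return ipc * clock_ghz

# HYPOTHETICAL design points, chosen only to illustrate the trade-off.
designs = {
    "aggressive clock": {"ipc": 0.5, "clock_ghz": 10.0},
    "conservative clock": {"ipc": 1.0, "clock_ghz": 5.0},
}

for name, d in designs.items():
    print(f"{name}: {bips(d['ipc'], d['clock_ghz']):.1f} BIPS")
```

In this made-up case, halving the IPC while doubling the clock leaves throughput unchanged, which is the kind of cancellation the slide describes.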

Results
In 17 years, we are seeing only a 7-fold speedup (historically, it should have been 1720-fold): an annual increase of 12.5%
Slow growth because pipeline depth and IPC increase will stagnate
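The annual rates behind these totals can be checked with compound-growth arithmetic. A minimal sketch, using only the numbers on the slide (the helper names `annual_rate` and `compound` are made up for illustration):

```python
def annual_rate(speedup: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `speedup` over `years`."""
    return speedup ** (1.0 / years) - 1.0

def compound(rate: float, years: int) -> float:
    """Total speedup after `years` of growth at `rate` per year."""
    return (1.0 + rate) ** years

projected = annual_rate(7, 17)      # the slide's 7-fold speedup
historical = annual_rate(1720, 17)  # the slide's historical expectation

print(f"projected:  {projected * 100:.1f}% per year")
print(f"historical: {historical * 100:.1f}% per year")
print(f"12.5% per year for 17 years: {compound(0.125, 17):.1f}x")
```

A 7-fold speedup over 17 years works out to roughly 12% per year, in line with the slide's 12.5% (which compounds to a little over 7x), while the 1720-fold figure corresponds to about 55% per year.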

Questionable Assumptions
Additional transistors are not being used to improve IPC
All instructions pay wire-delay penalties

Conclusions
Large monolithic cores will perform poorly: microarchitectures will have to be partitioned
On-chip caches will be the biggest bottlenecks: 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
Future proposals should be wire-delay-sensitive

Next Class’ Paper
“Dynamic Code Partitioning for Clustered Architectures”, UPC-Barcelona, 2001
Instruction steering heuristics to balance load and minimize communication
