CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin ISCA ’00

Previous Papers
Limits of ILP: it is probably worth doing out-of-order superscalar.
Complexity-Effective: wire delays make the implementations harder and increase latencies.
Today’s paper: these latencies severely impact IPC and slow the growth in processor performance.

Clock speed has improved by 50% every year
  - Reduction in logic delays
  - Deeper pipelines
  - This will soon end
IPC has gone up dramatically (the increased complexity was worth it)
  - Will this end too?

Wire Scaling
Multiple wire layers: the SIA roadmap predicts dimensions (somewhat aggressive).
As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically; see Table 1).
Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase).
The equations are different, but the end result is similar to Palacharla’s (without repeaters).
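The quadratic growth in wire resistance follows from the cross-section: resistance per unit length is resistivity divided by width times height, so shrinking both dimensions by the same factor raises it by the square of that factor. A minimal sketch of that relationship; the bulk copper resistivity and the 1 µm starting dimensions are illustrative assumptions, not values from the paper's Table 1:

```python
def resistance_per_mm(resistivity_ohm_m, width_m, height_m):
    """Resistance in ohms of a 1 mm wire with a rectangular cross-section."""
    length_m = 1e-3
    return resistivity_ohm_m * length_m / (width_m * height_m)


RHO_CU = 1.7e-8  # ohm*m, approximate bulk copper (assumed for illustration)

# Halve both the width and the height of the wire.
base = resistance_per_mm(RHO_CU, 1.0e-6, 1.0e-6)
half = resistance_per_mm(RHO_CU, 0.5e-6, 0.5e-6)

print(f"{base:.1f} ohm/mm -> {half:.1f} ohm/mm (x{half / base:.1f})")
```

Halving both dimensions quadruples the resistance, which is the quadratic trend the slide refers to.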

Wire Scaling

With repeaters, the delay of a fixed-length wire does not go up quadratically as we shrink gate width.
In going from 250nm → 35nm:
  - 5mm wire delay: 170ps → 390ps
  - delay to cross X gates: 170ps → 55ps
  - SIA clock speed: 0.75GHz → 13.5GHz
  - delay to cross X gates: 0.13 cycles → 0.75 cycles
We could increase wire width, but that compromises bandwidth.
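The cycle counts on this slide can be checked by dividing each delay by the clock period. The sketch below uses only the numbers quoted on the slide; the function name is a hypothetical helper, and the last line is an extra derived figure that the slide does not state:

```python
def delay_in_cycles(delay_ps, clock_ghz):
    """Convert a delay in picoseconds to clock cycles at the given frequency."""
    period_ps = 1000.0 / clock_ghz
    return delay_ps / period_ps


# Delay to cross the same number of gates, at each technology's SIA clock.
gates_250nm = delay_in_cycles(170, 0.75)
gates_35nm = delay_in_cycles(55, 13.5)

# Derived here, not on the slide: a fixed 5mm wire at 35nm.
wire_5mm_35nm = delay_in_cycles(390, 13.5)

print(f"{gates_250nm:.2f} cycles -> {gates_35nm:.2f} cycles")
print(f"5mm wire at 35nm: {wire_5mm_35nm:.1f} cycles")
```

The first two values come out near the slide's 0.13 and 0.75 cycles. Even though the absolute delay across a fixed number of gates drops, the clock period shrinks faster, so the same logical distance costs more cycles.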

Clock Scaling
Logic delay (the FO4 delay) scales linearly with gate length.
Likewise, work per pipeline stage has also been shrinking.
The SIA predicts that today’s 16 FO4/stage delay will shrink to 5.6 FO4/stage.
A 64-bit add takes 5.5 FO4; hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies.
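Because FO4 delay scales linearly with gate length, a clock frequency can be estimated from the number of FO4 delays per stage. The sketch below assumes a common rule of thumb of about 360 ps of FO4 delay per micron of gate length; that coefficient is an assumption, not a number from these slides, so the results only approximate the SIA roadmap frequencies:

```python
FO4_PS_PER_UM = 360.0  # assumed rule of thumb: FO4 delay (ps) per micron of gate length


def clock_ghz(fo4_per_stage, gate_length_um):
    """Estimate clock frequency from pipeline-stage depth measured in FO4 delays."""
    period_ps = fo4_per_stage * FO4_PS_PER_UM * gate_length_um
    return 1000.0 / period_ps


# The three scaling strategies from the slide, evaluated at 35nm.
for name, fo4 in [("SIA (5.6 FO4)", 5.6), ("f8", 8.0), ("f16", 16.0)]:
    print(f"{name}: {clock_ghz(fo4, 0.035):.1f} GHz")
```

Under this assumption the SIA point lands in the same range as the roadmap's 13.5GHz, and the 16-FO4 strategy runs at well under half that clock rate.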

Clock Scaling

While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease

On-Chip Wire Delays
The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
  - Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
Chip area is steadily increasing
  - Less than 1% of the chip is reachable in a cycle; 30 cycles to go across the chip!
Processors are becoming communication-bound

Processor Structure Delays
To model the microarchitecture, they estimate the delays of all wire-limited structures.
Structure (columns: f_SIA, f_8, f_16)
64K 2-port L1
entry 10-port regfile
entry 8-port issueq
entry 8-port ROB: 3, 2, 1
Weakness: bypass delays are not considered.

Microarchitecture Scaling
Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down to make structures fit.
Pipeline Scaling: constant capacities; latencies go up, hence deeper pipelines.
Any other approaches?

Microarchitecture Scaling
Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down to make structures fit.
Pipeline Scaling: constant capacities; latencies go up, hence deeper pipelines.
Replicated Capacity Scaling: fast cores with few resources each, but many such cores; high IPC if you can localize communication.

IPC Comparisons
[Figure: block diagrams comparing Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling. Recoverable labels: blocks with a 20-entry IQ, 40 regs, and four F units; 2-cycle wakeup, 2-cycle regread, 2-cycle bypass; blocks with a 15-entry IQ, 30 regs, and three F units.]

Methodology

Results

Every instruction experiences longer latencies.
IPCs are much lower for aggressive clocks.
Overall performance is still comparable for all approaches.

Results
In 17 years, we are seeing only a 7-fold speedup (historically, it should have been 1720-fold), an annual increase of 12.5%.
Growth is slow because increases in pipeline depth and IPC will stagnate.
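The two speedup figures are compound growth over 17 years. The sketch below checks them; the 55% historical annual rate is an inference that reproduces the 1720 figure, not a number stated on the slide, and the function names are hypothetical:

```python
def total_speedup(rate, years):
    """Cumulative speedup after compounding an annual growth rate."""
    return (1.0 + rate) ** years


def annualized_rate(speedup, years):
    """Annual growth rate that compounds to the given total speedup."""
    return speedup ** (1.0 / years) - 1.0


YEARS = 17

historical = total_speedup(0.55, YEARS)       # assumed 55% per year
projected_rate = annualized_rate(7.0, YEARS)  # rate implied by a 7-fold speedup
from_slide_rate = total_speedup(0.125, YEARS) # speedup implied by 12.5% per year

print(f"55%/year for {YEARS} years: {historical:.0f}x")
print(f"7x over {YEARS} years: {projected_rate:.1%} per year")
print(f"12.5%/year for {YEARS} years: {from_slide_rate:.1f}x")
```

A 7-fold speedup works out to a little over 12% per year, and 12.5% per year compounds to slightly more than 7-fold, so the slide's two numbers agree to within rounding.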

Questionable Assumptions
Additional transistors are not being used to improve IPC.
All instructions pay wire-delay penalties.

Conclusions
Large monolithic cores will perform poorly; microarchitectures will have to be partitioned.
On-chip caches will be the biggest bottlenecks: 3-cycle 0.5KB L1s, cycle 2MB L2s.
Future proposals should be wire-delay-sensitive.

Next Class’s Paper
“Dynamic Code Partitioning for Clustered Architectures”, UPC-Barcelona, 2001
Instruction steering heuristics to balance load and minimize communication.
