CS 7810 Lecture 2: Complexity-Effective Superscalar Processors. S. Palacharla, N.P. Jouppi, J.E. Smith (U. Wisconsin, WRL), ISCA ’97.

Complexity-Effective: There is a conflict between clock speed and parallelism. Goals of the paper: (1) characterize complexity as a function of issue width, window size, and feature size; (2) propose a clustered microarchitecture that allows fast clocks with high parallelism.

Current Trends (circa 1997): More functional units and large in-flight windows. These affect the cycle-time-critical structures: register renaming, instruction wake-up, instruction selection, result bypass, register files, and caches.

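Wire Delay Trends: Logic delays scale linearly with feature size. Wire delay $\sim RC = R_m \times C_m \times L^2$, where $R_m = \rho / (\text{width} \times \text{thickness})$ and $C_m = 2 \times \epsilon \times \epsilon_0 \times (\text{thickness}/\text{width} + \text{width}/\text{thickness})$. With scaling factor $S$: $R_m \sim S$, $C_m \sim S$, and $L \sim 1/S$ (gate size is scaled by $1/S$). Hence, the delay across 50K gates is constant in ps and is linear with $S$ in terms of FO4.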
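The scaling argument on this slide can be checked numerically. The following is a minimal sketch in normalized units (all quantities equal 1 at S = 1); the function names are illustrative and do not come from the paper.

```python
def wire_delay_ps(S: float) -> float:
    """Relative RC delay (absolute time) of a wire spanning a fixed number of gates."""
    r_m = S           # resistance per unit length grows with S
    c_m = S           # capacitance per unit length grows with S (the slide's assumption)
    length = 1.0 / S  # the same number of gates occupies a 1/S shorter span
    return r_m * c_m * length ** 2


def wire_delay_fo4(S: float) -> float:
    """The same delay expressed in FO4 units, where one FO4 shrinks as 1/S."""
    fo4 = 1.0 / S     # logic delay scales linearly with feature size
    return wire_delay_ps(S) / fo4


for S in (1.0, 2.0, 4.0):
    print(f"S={S}: delay (abs) = {wire_delay_ps(S)}, delay (FO4) = {wire_delay_fo4(S)}")
```

The absolute delay stays at 1 for every S, while the FO4-relative delay equals S, which is the "constant in ps, linear in FO4" conclusion.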

Update on Wire Delays: “The Future of Wires”, Ho, Mai, Horowitz, 2001. $C_m$ actually decreases with reduced feature widths. Hence, with repeaters, wire delay across 50K gates (in FO4) increases only slightly and is not quite linear with $S$. Wire delays are still a problem (though not as bad as Palacharla et al. claim). Also note that the number of FO4s per clock is shrinking.

Update on Wire Delays From “Future of Wires”, Ho, Mai, Horowitz

Register Rename Logic (figure). Components: Map Table, Dependence Check Logic, Mux, Free Pool. Inputs: Logical Source Regs, Logical Dest Regs. Outputs: Physical Source Regs, Physical Dest Regs.

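Map Table – RAM: Each entry holds a physical register id (7 bits). Number of entries = number of logical registers. Shadow copies are kept (shift register).

Map Table – CAM: Each entry holds a logical register id (5 bits) and a valid bit (1 bit). Number of entries = number of physical registers. Shadow copies are kept.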
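A rough software sketch of the RAM scheme together with the dependence check and mux from the rename figure follows. The dict-based map table, the list-based free pool, and the function name `rename_group` are assumptions made for illustration; real hardware performs the map-table read and the dependence check in parallel, and shadow copies are not modeled.

```python
def rename_group(map_table, free_pool, group):
    """Rename one group of instructions, given as (dest, (src1, src2)) logical registers.

    map_table: dict from logical register to physical register (the RAM scheme).
    free_pool: list of free physical registers, consumed from the front.
    Returns a list of (physical_dest, [physical_srcs]).
    """
    renamed = []
    local = {}  # destinations renamed earlier in this group (dependence check)
    for dest, srcs in group:
        # Mux: prefer a mapping created earlier in the group over the table's value.
        phys_srcs = [local.get(s, map_table[s]) for s in srcs]
        new_phys = free_pool.pop(0)
        local[dest] = new_phys
        renamed.append((new_phys, phys_srcs))
    map_table.update(local)  # commit the group's new mappings
    return renamed


table = {1: 10, 2: 11, 3: 12}
print(rename_group(table, [20, 21], [(3, (1, 2)), (4, (3, 2))]))
```

The second instruction reads logical register 3, which the first instruction in the same group writes, so it must pick up the new physical register rather than the stale table entry.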

Delay Model: Wire length = $C + 3 \times IW$. Delay = $RC = c_0 + c_1 \times IW + c_2 \times IW^2$. Rename delay ~ IW. The wire delay component increases as we shrink to 0.18 µm. Problems: They assume that wire delay per unit length (in ns) remains constant. The model does not include window size?

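Wakeup Logic (figure). Each window entry holds tagL, tagR, rdyL, and rdyR. The result tags tag1 … tagIW are compared (=) against each operand tag, and the comparator outputs are ORed to set the ready bit.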

Wakeup Logic: Wire length in the CAM array ~ issue width × WINSIZE. Capacitive load ~ WINSIZE. Matchline length ~ issue width. Issue width has a greater impact on delay because it influences both tag drive and tag match (the quadratic components are not very dominant). For smaller feature sizes, the wire delays dominate.
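The tag-match step can be sketched in software as follows. This is only a functional model with hypothetical names (`Entry`, `wakeup`); it counts comparisons to show that comparator work grows as 2 × IW × WINSIZE, but it says nothing about actual circuit delay.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tagL: int
    tagR: int
    rdyL: bool = False
    rdyR: bool = False

    @property
    def ready(self) -> bool:
        return self.rdyL and self.rdyR


def wakeup(window, result_tags):
    """Broadcast the result tags to every entry; return the number of tag comparisons."""
    comparisons = 0
    for entry in window:
        for tag in result_tags:
            comparisons += 2  # one comparator per operand per broadcast tag
            if entry.tagL == tag:
                entry.rdyL = True
            if entry.tagR == tag:
                entry.rdyR = True
    return comparisons


window = [Entry(5, 6, rdyR=True), Entry(7, 5)]
print(wakeup(window, [5]), [e.ready for e in window])
```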

Selection Logic (figure). An arbiter cell sits between the issue window and the rest of the selection tree. Signals: req, grant, anyreq, enable.

Selection Logic: Multiple FUs are handled by placing more selection stages in series, which further increases selection logic delay. Delay ~ log(WINSIZE). Wire lengths ~ WINSIZE but are ignored in the model; hence, delay scales very well with feature size.
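The logarithmic trend can be illustrated by counting the levels of an arbiter tree. The sketch below assumes 4-input arbiter cells (the `fan_in` default is an assumption, as is the function name) and uses integer arithmetic rather than a floating-point logarithm.

```python
def select_tree_levels(winsize: int, fan_in: int = 4) -> int:
    """Number of arbiter levels needed to cover `winsize` window entries."""
    levels = 0
    covered = 1
    while covered < winsize:
        covered *= fan_in  # each extra level multiplies the entries covered
        levels += 1
    return levels


for winsize in (16, 32, 64, 128):
    # Requests travel up the tree and the grant travels back down.
    print(winsize, select_tree_levels(winsize))
```

Quadrupling the window from 16 to 64 entries adds only one level, which is why selection delay grows slowly with window size.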

Bypass Delay: The number of bypass paths equals $2 \times IW^2 \times S$ ($S$ is the number of pipeline stages). Wire length ~ $IW$; hence, delay ~ $IW^2$. The layout and pipeline depth (capacitive load) also matter.
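A quick calculation makes the quadratic growth concrete. The function names and the 4-wide baseline are illustrative choices; the delay figure is a ratio only, not a value in ps.

```python
def bypass_paths(iw: int, stages: int) -> int:
    """Number of bypass paths: 2 x IW^2 x S."""
    return 2 * iw ** 2 * stages


def relative_bypass_delay(iw: int, base_iw: int = 4) -> float:
    """Wire delay relative to a base_iw-wide machine, using delay ~ IW^2."""
    return (iw / base_iw) ** 2


for iw in (4, 8):
    print(iw, bypass_paths(iw, stages=1), relative_bypass_delay(iw))
```

Doubling the issue width from 4 to 8 quadruples both the number of paths (32 to 128 for one stage) and the relative wire delay.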

Summary of Results (table; numeric entries not recoverable). Columns: Issue Width, Window Size, Rename Delay (ps), Wakeup + Select (ps), Bypass Delay (ps). Rows are grouped by technology: … µm, 0.35 µm, 0.18 µm.

Bottlenecks: Wakeup+Select and Bypass have the longest delays and represent atomic operations; pipelining them would prevent back-to-back execution of dependent instructions. Increased issue width, window size, and wire delays exacerbate the problem (also for the register file and cache).

Dependence-Based Microarchitecture (example): Instruction stream in program order: r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2; r6 ← r4 + r2; r7 ← r6 + r2; r8 ← r5 + r2; r9 ← r1 + r2. Rdy Operands table at the start: r1 = 1, r2 = 1, r3 = 0, …

Dependence-Based Microarchitecture (steering into FIFOs): r3 ← r1 + r2 enters the first FIFO, with r4 ← r3 + r2 and then r5 ← r4 + r2 queued behind it. r6 ← r4 + r2 goes to a second FIFO, and r7 ← r6 + r2 queues behind it. r8 ← r5 + r2 joins the first FIFO behind r5. r9 ← r1 + r2 goes to a third FIFO. FIFO contents, head first: {r3, r4, r5, r8}, {r6, r7}, {r9}.

Dependence-Based Microarchitecture (issue from FIFO heads): With r1 and r2 ready, r3 and r9 issue first, and r3 is then marked ready (r3 = 1) in the Rdy Operands table. r4 issues next, followed by r5 and r6 together. In the last frame, r8 and r7 remain at the heads of their FIFOs.
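The FIFO placement in this example can be reproduced with a simple steering rule: an instruction joins the FIFO whose tail produces one of its pending operands, and otherwise starts a new FIFO. The sketch below is a simplified model under stated assumptions: FIFO capacity and the number of FIFOs are unbounded, destination registers are unique, and the name `steer` is hypothetical.

```python
def steer(instructions, ready_regs):
    """Steer (dest, src_a, src_b) instructions into FIFOs; return FIFOs head first."""
    fifos = []
    producer_fifo = {}  # destination register -> index of the FIFO holding its producer
    for dest, *srcs in instructions:
        target = None
        for src in srcs:
            if src in ready_regs:
                continue  # operand already available, no producer to follow
            f = producer_fifo.get(src)
            # Follow the producer only if it is still the tail of its FIFO.
            if f is not None and fifos[f][-1] == src:
                target = f
                break
        if target is None:  # all operands ready, or the producer is not at a tail
            fifos.append([])
            target = len(fifos) - 1
        fifos[target].append(dest)
        producer_fifo[dest] = target
    return fifos


stream = [("r3", "r1", "r2"), ("r4", "r3", "r2"), ("r5", "r4", "r2"),
          ("r6", "r4", "r2"), ("r7", "r6", "r2"), ("r8", "r5", "r2"),
          ("r9", "r1", "r2")]
print(steer(stream, ready_regs={"r1", "r2"}))
```

Because each FIFO holds a dependence chain, only the FIFO heads need to be checked for readiness, which is what shrinks the wakeup and select logic.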

Pros and Cons: Wakeup and select operate over a subset of issue queue entries (only the FIFO heads). FIFOs are under-utilized because they do not get filled (this causes about 5% IPC loss), but it is not hard to increase their sizes. An operand-ready table is still needed.

Clustered Microarchitectures

Clustering simplifies wakeup+select and bypassing. Steering is dependence-based, hence most communication is local. Porting requirements on the register file and issue queue are low. IPC loss of 6.3%, but a clock speed improvement.

Conclusions: As issue width and window size increase, the delays of most structures go up dramatically. Dominant wire delays exacerbate the problem. Hence, to support large widths, build smaller cores that communicate with each other. With dependence information, it is possible to minimize communication costs.

Next Class’ Paper: “Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures”, ISCA ’00. Do not get bogged down in details and methodology.