CS 7810 Lecture 5
The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays
M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar
UT-Austin and Compaq, ISCA ’02

Improvements in Clock Speed
[Chart: as process technology scaled from 1000 nm down to 130 nm, clock frequencies rose from 33 MHz through 66 MHz, 100 MHz, 200 MHz, 450 MHz, and 1 GHz to 2 GHz]

Definitions
Clock period ≥ t_logic + t_latch + t_skew + t_jitter
- t_logic: the actual work being done in one stage
- t_latch: data has to be saved in latch registers at the end of each pipeline stage (1 FO4 = 36 ps at 100 nm)
- t_skew: two parts of the circuit may receive their clocks through different paths, resulting in a slight phase difference (0.3 FO4)
- t_jitter: unpredictable variations in clock arrival time (0.5 FO4)
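A quick numeric sketch of this budget, assuming a latch overhead of 1 FO4 (the slide's parenthetical only defines the FO4 itself) plus the 0.3 FO4 skew and 0.5 FO4 jitter above; function and variable names are illustrative:

```python
FO4_PS = 36.0      # delay of one FO4 inverter at 100 nm, in picoseconds
T_LATCH = 1.0      # latch overhead per stage (FO4) -- assumed value
T_SKEW = 0.3       # clock skew (FO4)
T_JITTER = 0.5     # clock jitter (FO4)

def clock_period_fo4(t_logic: float) -> float:
    """Minimum clock period, in FO4, for a stage doing t_logic FO4 of work."""
    return t_logic + T_LATCH + T_SKEW + T_JITTER

def clock_ghz(t_logic: float) -> float:
    """Clock frequency implied by that period (1 FO4 = 36 ps)."""
    return 1000.0 / (clock_period_fo4(t_logic) * FO4_PS)

for depth in (8, 6, 4):
    print(f"{depth} FO4 of logic -> {clock_period_fo4(depth):.1f} FO4 period, "
          f"{clock_ghz(depth):.2f} GHz")
```

With those assumptions the non-logic overhead totals 1.8 FO4, consistent with the roughly 2 FO4 of overhead used on the deep-pipelining example below.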

Processor Model
- An Alpha-like processor with latencies updated for 100 nm
- Simplification: the study is insensitive to the technology generation
- Note that all structures are perfectly pipelined; this is a “Limit of Pipelining” study

Effect of Deep Pipelining
Raw latencies: add = 16 FO4, branch mispredict = 128 FO4, load from memory = 400 FO4, multiply = 160 FO4. Pipelining overhead = 2 FO4 per stage.

With an 18 FO4 clock period (16 FO4 logic + 2 FO4 overhead): add = 1 cycle (16 + 2 = 18 FO4), mispredict = 8 cycles (8 x 18 = 144 FO4), load = ~400 FO4, mult = 10 cycles (180 FO4). Clock speed: 18 x 36 ps = 648 ps, roughly 1.5 GHz.

With a 10 FO4 clock period (8 FO4 logic + 2 FO4 overhead): add = 2 cycles (20 FO4), mispredict = 16 cycles (16 x 10 = 160 FO4), load = ~400 FO4, mult = 20 cycles (200 FO4). Clock speed: 10 x 36 ps = 360 ps, roughly 2.8 GHz.

(The load latency stays near 400 FO4 in both cases: memory access time is fixed in absolute time and does not shrink with deeper pipelining.)
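The same arithmetic in a short script, assuming each operation is split into stages of logic_per_stage FO4, every stage pays 2 FO4 of overhead, and memory latency stays fixed at about 400 FO4 (all names are illustrative):

```python
import math

FO4_PS = 36.0        # 1 FO4 at 100 nm, in picoseconds
OVERHEAD_FO4 = 2.0   # per-stage latch + skew + jitter overhead

OPS_FO4 = {"add": 16, "mispredict": 128, "multiply": 160}  # raw logic latencies
MEM_FO4 = 400        # memory latency, fixed in absolute time

def show_pipeline(logic_per_stage: int) -> None:
    period = logic_per_stage + OVERHEAD_FO4
    ghz = 1000.0 / (period * FO4_PS)
    print(f"clock period = {period:.0f} FO4 (~{ghz:.2f} GHz)")
    for name, fo4 in OPS_FO4.items():
        cycles = math.ceil(fo4 / logic_per_stage)
        print(f"  {name:10s}: {cycles:2d} cycles = {cycles * period:.0f} FO4")
    print(f"  {'load':10s}: ~{math.ceil(MEM_FO4 / period)} cycles "
          f"(memory stays ~{MEM_FO4} FO4 regardless of depth)")

show_pipeline(16)   # 18 FO4 clock period
show_pipeline(8)    # 10 FO4 clock period
```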

Yet, Performance Increases…
- Deepening a car assembly line means more cars are being worked on at the same time, so a new car rolls out at twice the frequency
- Independent instructions benefit from deep pipelining
- Dependent instructions are slowed down
- The latter effect dominates when the pipelining overhead is a large fraction of the clock period (see the sketch below)
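A toy model of that trade-off, under the simplifying assumptions that independent instructions complete one per cycle once the pipeline is full while dependent instructions each wait for the producer's full latency; the 16 FO4 operation and 2 FO4 overhead come from the earlier slide, everything else is illustrative:

```python
import math

FO4_PS = 36.0
OVERHEAD_FO4 = 2.0
OP_LOGIC_FO4 = 16          # raw latency of one ALU operation, in FO4

def runtime_ns(n_ops: int, logic_per_stage: int, dependent: bool) -> float:
    period_ns = (logic_per_stage + OVERHEAD_FO4) * FO4_PS / 1000.0
    cycles_per_op = math.ceil(OP_LOGIC_FO4 / logic_per_stage)
    if dependent:
        # every op waits for its producer's full latency
        cycles = n_ops * cycles_per_op
    else:
        # perfectly overlapped: one op finishes per cycle once the pipe is full
        cycles = n_ops + cycles_per_op - 1
    return cycles * period_ns

for depth in (16, 8, 4, 2):
    indep = runtime_ns(1000, depth, dependent=False)
    dep = runtime_ns(1000, depth, dependent=True)
    print(f"{depth:2d} FO4 logic/stage: independent {indep:6.0f} ns, "
          f"dependent {dep:6.0f} ns")
```

As the logic per stage shrinks, the independent stream keeps getting faster, while the fully dependent chain slows down because each operation pays more accumulated per-stage overhead.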

Example Latencies
[Table: access latency in cycles of the L1 data cache, issue queue, and integer add, as a function of logic delay per stage (2 FO4 up to 8 FO4)]

In-Order Processors
- With no overhead, when t_logic shrinks from 8 FO4 to 4 FO4, performance could go up by 100% (as in the car assembly line), but it only goes up by 18%
- With overhead, maximum performance is seen at 6 FO4 for all three benchmark classes
- For the Cray, the optimal logic depth was 10.9 FO4 (integer) and 5.4 FO4 (vector)
- Degree of parallelism: vector > integer programs today > integer programs back then (no caches!)

Out-of-Order Processors
- Optimal logic delay per stage: 6 FO4 for integer, 5 FO4 for non-vector FP, 4 FO4 for vector FP
- These results are insensitive to overhead costs and microarchitecture optimizations
- P.S. The effect of out-of-order execution on performance:
  Non-vector FP: 0.5 → 1.0
  Integer: 0.8 → 1.8
  Vector FP: 0.9 → 3.5

Out-of-Order Processors

Increased Pipeline Depth
Reasons for the IPC decrease:
- Longer ALU latencies (not quantified)
- Longer load latency (~25%, for a 6-cycle increase)
- Longer branch misprediction cost (~10%)
- Longer wakeup + select (~55%)

Pipelining Wakeup
- It takes a long time to broadcast tags across the entire issue queue
- Hence, wake the first eight instructions in the first cycle, the next eight in the second cycle, and so on (a behavioral sketch follows below)
- This works well if most ready instructions are in the first stage: a 10-stage wakeup pipeline worsens performance by only 11%
- Will this change the optimal logic depth?
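A behavioral sketch of pipelined wakeup, assuming the queue is split into eight-entry segments and a tag broadcast reaches segment k in cycle k; the Entry class, segment size, and example tags are illustrative, not the paper's circuit:

```python
import math

SEGMENT_SIZE = 8   # entries reached per broadcast cycle, oldest segment first

class Entry:
    def __init__(self, dest, sources):
        self.dest = dest              # tag this instruction will produce
        self.sources = set(sources)   # tags it is still waiting for

def pipelined_wakeup(queue, completed_tags):
    """Model one completion event: the tag broadcast reaches segment k
    (entries k*8 .. k*8+7) in cycle k, so instructions in later segments
    wake up in later cycles.  Returns the tags woken in each cycle."""
    n_segments = math.ceil(len(queue) / SEGMENT_SIZE)
    woken_by_cycle = []
    for k in range(n_segments):
        segment = queue[k * SEGMENT_SIZE:(k + 1) * SEGMENT_SIZE]
        woken = []
        for entry in segment:
            entry.sources -= completed_tags
            if not entry.sources:
                woken.append(entry.dest)
        woken_by_cycle.append(woken)
    return woken_by_cycle

# Example: a 24-entry queue where tags r1 and r2 complete this cycle.
queue = [Entry(f"t{i}", {"r1"} if i % 3 == 0 else {"r2", f"t{i - 1}"})
         for i in range(24)]
print(pipelined_wakeup(queue, {"r1", "r2"}))
```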

Instruction Select
- Instructions in stage 1 go through only one arbiter
- Instructions in stages 2-4 first pass a pre-select arbiter and then a second arbiter (see the sketch below)
- Does well if most ready instructions are in stage 1 (only a 4% loss)
[Diagram: issue-queue stages 1 through 4 feeding 16-input arbiters, 8 wide]
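A rough functional sketch of one way such a two-level select could behave, assuming an 8-wide issue, oldest-first arbitration, and that stage 1 has priority over the pre-selected instructions from stages 2-4; these details are assumptions, not taken from the slide:

```python
def arbitrate(candidates, n):
    """Grant up to n requesters; here simply oldest (lowest id) first."""
    return sorted(candidates)[:n]

def two_level_select(ready_by_stage, issue_width=8, arbiter_inputs=16):
    # Stage-1 instructions face a single 16-input arbiter.
    stage1_winners = arbitrate(ready_by_stage[0], min(issue_width, arbiter_inputs))
    # Stage 2-4 instructions first pass a pre-select arbiter ...
    later = [i for stage in ready_by_stage[1:] for i in stage]
    pre_winners = arbitrate(later, arbiter_inputs)
    # ... and then compete (a second arbitration) for the slots stage 1 left unused.
    slots_left = issue_width - len(stage1_winners)
    return stage1_winners + arbitrate(pre_winners, slots_left)

# Ready instruction ids per issue-queue stage (stage 1 first).
print(two_level_select([[3, 7], [12, 20], [33], [40, 41]]))
```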

IssueQ Compaction
- Both techniques work well only if instructions move up to occupy empty slots (compaction)
- Compaction wastes energy and increases complexity
- It also raises correctness problems: what if an instruction misses a tag broadcast while it is in transit between entries?

Conclusions
- Logic per stage will only shrink by a factor of two, which limits clock-speed improvements in the future
- Pipelining wakeup + select has the biggest impact on IPC

Related Work
- Hartstein and Puzak (IBM): most programs have an optimal pipeline depth between 13 and 30 stages, corresponding to per-stage FO4 delays of 4-8
- Sprangle and Carmean (Intel): the optimal pipeline depth is 50-60 stages, corresponding to per-stage FO4 delays of 4-5
