CS 7960-4 Lecture 14 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals.

Slides:

Advertisements

Similar presentations

Lecture 19: Cache Basics Today’s topics: Out-of-order execution

Advertisements

CS 7810 Lecture 22 Processor Case Studies, The Microarchitecture of the Pentium 4 Processor G. Hinton et al. Intel Technology Journal Q1, 2001.

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

Lecture: Pipelining Basics

Alpha Microarchitecture Onur/Aditya 11/6/2001.

CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995.

CS Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers N.P. Jouppi Proceedings.

1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations.

CS 7810 Lecture 10 Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors O. Mutlu, J. Stark, C. Wilkerson, Y.N.

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.

CS 7810 Lecture 5 The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler,

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

Defining Wakeup Width for Efficient Dynamic Scheduling A. Aggarwal, O. Ergin – Binghamton University M. Franklin – University of Maryland Presented by:

Review of CS 203A Laxmi Narayan Bhuyan Lecture2.

CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

CS Lecture 24 Exceeding the Dataflow Limit via Value Prediction M.H. Lipasti, J.P. Shen Proceedings of MICRO-29 December 1996.

Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

1 Lecture 12: ILP Innovations and SMT Today: ILP innovations, SMT, cache basics (Sections 3.5 and supplementary notes)

CS 7810 Lecture 11 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals.

1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )

1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.

1 Lecture 10: ILP Innovations Today: handling memory dependences with the LSQ and innovations for each pipeline stage (Section 3.5)

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)

CS 7810 Lecture 13 Pipeline Gating: Speculation Control For Energy Reduction S. Manne, A. Klauser, D. Grunwald Proceedings of ISCA-25 June 1998.

1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )

1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )

CS 7810 Lecture 21 Threaded Multiple Path Execution S. Wallace, B. Calder, D. Tullsen Proceedings of ISCA-25 June 1998.

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power ISCA workshops Sign up for class presentations.

Register Cache System not for Latency Reduction Purpose Ryota Shioya, Kazuo Horio, Masahiro Goshima, and Shuichi Sakai The University of Tokyo 1.

CS Lecture 4 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

CS Lecture 2 Limits of Instruction-Level Parallelism David W. Wall WRL Research Report 93/6 Also appears in ASPLOS’91.

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power Sign up for class presentations.

1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.

Lecture: Out-of-order Processors

Lecture: Pipelining Basics

Lecture: Out-of-order Processors

Lecture: Pipelining Basics

Lecture: SMT, Cache Hierarchies

Lecture 6: Advanced Pipelines

Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 10: Out-of-order Processors

Lecture 11: Out-of-order Processors

Lecture: Out-of-order Processors

Lecture 19: Branches, OOO Today’s topics: Instruction scheduling

Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Lecture: SMT, Cache Hierarchies

Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Lecture 18: Pipelining Today’s topics:

Lecture: Out-of-order Processors

Lecture 8: Dynamic ILP Topics: out-of-order processors

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Lecture 19: Branches, OOO Today’s topics: Instruction scheduling

Lecture 20: OOO, Memory Hierarchy

Lecture: SMT, Cache Hierarchies

Lecture 20: OOO, Memory Hierarchy

Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Out-of-Order Execution Structures Optimizations

Lecture 10: ILP Innovations

Lecture 9: ILP Innovations

Lecture 9: Dynamic ILP Topics: out-of-order processors

Lecture: Pipelining Basics

Sizing Structures Fixed relations Empirical (simulation-based)

Presentation transcript:

CS Lecture 14 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals Proceedings of MICRO-32 November 1999

Register File Design Considerations Number of ports = 3 x issue width Number of entries = window size + logical-regs Multiple threads  more registers (more power) Wire delays, clock speeds  multiple cycle access Pipelining a RAM structure is hard

Register Allocation FetchRenameIssueCompleteWake-up assign pr7 cycle 4cycle 15 write pr7 cycle 30 Commit read pr7 cycle 50 release pr7 cycle 80 useful time – 20 cycno result – 26 cycno activity – 30 cyc

Two-Level Register File Base regfileTwo-level regfile

Virtual-Physical Registers lr3  vr7 vr7  vr7   vr7 Register map table Virtual map table

Virtual-Physical Registers lr3  vr7 vr7  vr7   vr7 Register map table Instruction issues Virtual map table

Virtual-Physical Registers lr3  vr7, pr9 vr7  pr9  vr7 (pr9) Register map table Instruction completes Is assigned pr9 vr7, pr9Virtual map table

Virtual-Physical Registers lr3  vr7, pr9 vr7  pr9  vr7 (pr9)  pr9 Register map table Virtual map table

Lack of Registers Finishes, has no register, keeps re-executing Has physical register Has no physical register In-flight window

Lack of Registers Finishes, has no register, keeps re-executing Has physical register Has no physical register In-flight window cycle tcycle t+1 commits gets reg

Deadlock Finishes, has no register, keeps re-executing Has physical register Has no physical register In-flight window Who will generate a register for this instr? Solution: Reserve a register for the oldest instruction

Sequential Execution Has physical register Has no physical register In-flight window Oldest instr has reserved register

Sequential Execution Has physical register Has no physical register In-flight window instr commits, releases another reg, that is then reserved for the new oldest instr

Sequential Execution Has physical register Has no physical register In-flight window instr commits, releases another reg, that is then reserved for the new oldest instr Behaves like an in-order processor

Reserving All Registers Has physical register Has no physical register Allows quick progress, but almost behaves like a conventional processor

Register Stealing Has physical register Has no physical register In-flight window Instr finishes; steals register from the youngest finished instr No reservation of regs The younger instrs may have to execute twice Note the pre-execution effect

Implementation Finished instructions have to remain in issueq in case they have to re-execute Issued dependents of the victim instruction need not re-execute The VP tag of the victim has to be broadcast so that unissued dependents can reset the ready bit Can benefit from an instruction reuse buffer? Pre-execution without explicitly attempting it

Results Improves the base case by 5% (Int programs) and 24% (FP programs) FP programs have more ILP, better branch prediction, and are more limited by cache misses Re-executions: 10% (int) 58% (fp) Steals: 5% (int) 12% (fp) For the same IPC, VP registers employ 25% fewer registers

Next Week’s Paper “Pipeline Gating: Speculation Control for Energy Reduction”, S. Manne, A. Klauser, D. Grunwald, Proceedings of ISCA-25, June 1998

Harmonic and Arithmetic Means HM of IPC = N / (1/IPC a + 1/ IPC b + 1/ IPC c ) = N / (CPI a + CPI b + CPI c ) = 1 / AM of CPI Weight each benchmark as if they all execute one instruction If you want to assume each benchmark executes for the same time, HM of CPI or AM of IPC is appropriate

Title Bullet