Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

Presentation transcript:

Reducing the Complexity of the Register File in Dynamic Superscalar Processors
Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi
In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 237-248, MICRO

Nathir Rawashdeh, University of Massachusetts, Amherst
Low Power Architecture, Professor Moritz

Note: This presentation is, to a large extent, a reproduction of slides created by the School of Electrical Engineering at Korea University. I have altered them and added new slides to better suit my audience. Nathir Rawashdeh (3 November 2003)

Contents
- Motivation
- Reduce register file size
  - Two-Level Register File (1st Technique)
- Reduce port complexity
  - Banked Organization (2nd Technique)
- Evaluation
  - Two-Level Register File Evaluation
  - Banked Register File Evaluation
- Combining the Two Techniques

Motivation
- Modern high-performance processors use an out-of-order superscalar core to dynamically extract instruction-level parallelism (ILP) from running applications.
- The core examines a large window of in-flight instructions to find and issue multiple ready, independent instructions every cycle.
- A larger instruction window:
  - Achieves better ILP.
  - Requires a larger register file, issue queue, and reorder buffer.
- A large multi-ported register file can potentially compromise clock cycle time in future wire-limited technologies.
- Two methods are suggested in this paper:
  - A two-level register file organization to reduce register file size requirements.
  - A banked organization to reduce port complexity.

Motivation: Conventional Register File Organization
- Logical registers (lr) are renamed to physical registers (pr).
- At 1 and 2: lr5 is renamed to pr18.
- The branch at 3 is predicted not taken, so pr18 must be kept in case of a misprediction.
- lr5 at 5 must be allocated a new register, pr27.
- pr18 can only be released to the free list after 5 commits. lr5 at 5 is then remapped to pr27.
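The renaming behaviour on this slide can be illustrated with a minimal sketch. The class `RenameMap`, its method names, and the register counts are hypothetical and chosen only for illustration; the sketch shows why the previous physical register of a logical register stays allocated until the next writer commits.

```python
from collections import deque


class RenameMap:
    """Toy rename map with a free list (hypothetical names, not from the paper)."""

    def __init__(self, num_logical, num_physical):
        # Assumption: logical register i initially maps to physical register i.
        self.mapping = {lr: lr for lr in range(num_logical)}
        self.free_list = deque(range(num_logical, num_physical))

    def rename_dest(self, lr):
        """Allocate a new physical register for a writer of `lr`.

        Returns (new_pr, old_pr). old_pr is NOT freed here: older readers or
        a misprediction recovery may still need its value.
        """
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        old_pr = self.mapping[lr]
        new_pr = self.free_list.popleft()
        self.mapping[lr] = new_pr
        return new_pr, old_pr

    def commit(self, old_pr):
        """The writer has committed, so the previous mapping can be released."""
        self.free_list.append(old_pr)


rm = RenameMap(num_logical=8, num_physical=12)
first_pr, _ = rm.rename_dest(5)          # first writer of lr5
second_pr, previous = rm.rename_dest(5)  # next writer of lr5
held = previous not in rm.free_list      # still allocated before commit
rm.commit(previous)                      # released only at commit
```

The longer a value waits for the next writer to commit, the longer its register is occupied, which is the pressure the two-level organization targets.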

Two-Level Register File (1st Technique)
- Level One (L1) register file: holds register values that have potential readers.
- Level Two (L2) register file: keeps the other register values, which are waiting to be released after their instructions commit.
- Effects:
  - Reduced register file access time, because only a smaller portion (L1) of the register file is on the critical path.
  - More energy is needed to copy register contents between L1 and L2.

Two-Level Register File: Microarchitectural Changes
- Assumption: 8-way issue processor.
- During rename, logical registers are mapped only to L1 physical registers; L2 registers are hidden from the rename process.

Two-Level Register File: Usage Table
- Monitors the usage statistics for each L1 physical register.
- Information maintained per register:
  - Pending consumer counter: keeps track of the number of pending consumers of the value.
    - Increment: during rename, each instruction that sources the register increments the counter.
    - Decrement: the same instruction decrements the counter when it issues, or when it is squashed after a mispredict.
  - Overwrite bit (single bit): set when the physical register is no longer the latest mapping for its logical register (the lr's mapping has changed to a different pr).
  - "Result-written" bit: indicates whether a result has been written into the physical register.
  - Sequence number 1: the sequence number of the branch immediately following the instruction that writes to this physical register.
  - Sequence number 2: the sequence number of the branch immediately preceding the next instruction that writes to the same logical register.
- Sequence number counter size: log2(ROB size).
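The bookkeeping above can be sketched as a small record per L1 register. The class and method names are hypothetical, and the eligibility rule in `can_move_to_l2` is an assumption pieced together from the slides (a value with a written result, no pending consumers, and a newer mapping is only needed again on a misprediction or exception); it is not quoted from the paper.

```python
from dataclasses import dataclass


@dataclass
class UsageEntry:
    """Per-L1-register usage state (hypothetical field names)."""

    pending_consumers: int = 0
    overwritten: bool = False      # a newer mapping exists for the logical reg
    result_written: bool = False   # the producer has written its result

    def on_rename_source(self):
        # An instruction that sources this register was renamed.
        self.pending_consumers += 1

    def on_issue_or_squash(self):
        # The same instruction issued, or was squashed after a mispredict.
        if self.pending_consumers == 0:
            raise ValueError("counter underflow: no pending consumers")
        self.pending_consumers -= 1

    def can_move_to_l2(self):
        # Assumed rule: written, no waiting readers, no longer latest mapping.
        return (self.result_written
                and self.overwritten
                and self.pending_consumers == 0)


entry = UsageEntry()
entry.on_rename_source()        # one consumer renamed
entry.result_written = True     # producer finished
entry.overwritten = True        # logical register was remapped
before = entry.can_move_to_l2() # consumer still pending
entry.on_issue_or_squash()      # consumer issued
after = entry.can_move_to_l2()
```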

Two-Level Register File: Single L2 ID Valid Bit
- Added to each ROB entry.
- Indicates that the destination register ID in that entry corresponds to an L2 register.

Two-Level Register File: Copy List
- Keeps track of L1-L2 copies for recovery from a branch mispredict.
- Information maintained for each L2 entry:
  - The L1 physical register name that had earlier contained the value.
  - The sequence number of the branch immediately following the instruction that writes to this physical register.
  - The sequence number of the branch immediately preceding the next instruction that writes to the same logical register.
- The two stored branch sequence numbers indicate the live period of a physical register value: the period during which instructions sourcing this value are dispatched.
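One way to read the live-period idea is sketched below. The three-way policy in `action_on_mispredict` is an assumption derived from the definitions on this slide, not a statement of the paper's exact recovery logic, and the sketch assumes sequence numbers that grow monotonically (a real log2(ROB size)-bit counter wraps around, which is ignored here). All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CopyListEntry:
    """State kept for a value that was copied from L1 to L2."""

    l1_reg: int                   # L1 register that earlier held the value
    seq_after_writer: int         # branch right after the writing instruction
    seq_before_next_writer: int   # branch right before the next writer


def action_on_mispredict(entry, branch_seq):
    """Decide what a mispredicted branch implies for this L2 value.

    Assumed policy:
    - branch older than the writer: the writer itself is squashed;
    - branch inside the live period: the next writer is squashed, so the
      value becomes the current one again and must return to L1;
    - branch younger than the next writer: the value stays dead in L2.
    """
    if branch_seq < entry.seq_after_writer:
        return "discard"
    if branch_seq <= entry.seq_before_next_writer:
        return "restore_to_l1"
    return "keep_in_l2"


entry = CopyListEntry(l1_reg=18, seq_after_writer=3, seq_before_next_writer=4)
outcomes = [action_on_mispredict(entry, seq) for seq in (2, 3, 4, 5)]
```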

Minimally-Ported Banked Register File (2nd Technique)
- Motivation: the large number of register file ports in a wide-issue processor
  - increases complexity, and with it power consumption;
  - increases register file access time, which will limit clock speed in future wire-limited technologies.
- The number of ports required on average is much lower than the actual port count, which is sized for the worst case. Reasons:
  - Many operands are read off the bypass network, not from the register file.
  - Many instructions have only a single register operand.
  - A number of instructions produce results that are not written to the register file (branches, stores, and the effective address computation of a load or store).

Minimally-Ported Banked Register File

Evaluation
- Metrics used to evaluate the two-level and banked register file organizations:
  - IPC: instructions per cycle.
  - IPS: instructions per second = IPC / access time.
- Assuming register file access time is the bottleneck, IPS is a better measure than IPC.
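A short sketch of the IPS metric follows. The function name `ips` and every number in the example are hypothetical; they are chosen only to show how a design with slightly lower IPC can still win on IPS when its register file is faster.

```python
def ips(ipc, access_time_ns):
    """Instructions per nanosecond, assuming the register file access time
    sets the clock period (the assumption stated on the slide)."""
    if access_time_ns <= 0:
        raise ValueError("access time must be positive")
    return ipc / access_time_ns


# Hypothetical designs: B gives up a little IPC for a faster register file.
design_a = ips(ipc=1.70, access_time_ns=1.00)
design_b = ips(ipc=1.67, access_time_ns=0.85)
b_is_faster = design_b > design_a
```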

Two-Level Register File Evaluation
[Figure: IPC (single vs. two-level reg. file); annotated value: 1.63]
- Gap between the two lines: the addition of L2 frees up more L1 registers.
- The two-level organization has IPC = 1.67 with just 80 L1 registers (and 80 L2).
- The single-level organization requires as many as 140 registers to attain an IPC of [...]
- Out of 140 physical registers, only about 80 are active at any given time. The remaining 60 don't have any consumers unless there is a misprediction or exception, and they can be moved away to the L2.

Two-Level Register File Evaluation
[Figure: IPS (single vs. two-level reg. file), maximum marked]
- For the single-level register file, IPS peaks for a 100-entry register file.
- For the two-level register file, the peak IPS is seen for a 60-entry L1.
- The optimal IPS with the two-level organization is 17% better than the optimal IPS with a single-level register file (better access time with the two-level design).

Two-Level Register File Evaluation
- IPS on individual applications.
- The 100-L1 has the longest access time, but its IPS is not always worse than that of the 60-L1. In those cases, the 100-L1's IPC outweighs the access time penalty.
- The two-level organization achieves the best IPS because it maintains a low access time and an IPC comparable (within 1%) to the single-level 100-L1 design.

Banked Register File Evaluation
- Register file with N banks, each with a single read port and a single write port.
- Base case: "Single bank, 4rd, 4wr" is within 2% of the 24-ported case.
- Third bar: penalty from read port conflicts, 1% IPC degradation.
- Fourth bar: additional penalty from write port conflicts, 5% IPC degradation.
- Port contention is worst for applications with high ILP.
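To make the notion of a port conflict concrete, here is a minimal sketch that counts how many same-cycle accesses exceed a bank's port budget. The register-to-bank mapping (`reg % num_banks`) and the function name are assumptions made for illustration; the slides do not say how registers are assigned to banks, and the sketch does not merge repeated reads of the same register.

```python
from collections import Counter


def port_conflicts(regs, num_banks, ports_per_bank=1):
    """Count accesses that cannot be served this cycle.

    regs: register numbers accessed through one port type (all reads or
    all writes) in a single cycle.
    Assumption: register r lives in bank r % num_banks.
    """
    per_bank = Counter(r % num_banks for r in regs)
    return sum(max(0, n - ports_per_bank) for n in per_bank.values())


reads = [0, 4, 8, 1]                         # hypothetical same-cycle reads
with_4_banks = port_conflicts(reads, num_banks=4)
with_8_banks = port_conflicts(reads, num_banks=8)
```

With these inputs, three of the reads map to bank 0 when there are 4 banks, while with 8 banks only two of them collide, so more banks spread the same accesses over more ports.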

Banked Register File Evaluation
- Reducing conflicts: move from 4 to 8 banks.
- With 8 banks there is almost no IPC degradation due to read/write port conflicts (compared to 4 banks in the previous figure).
- There is still a 2% IPC loss over the 24-ported design.

Combining the Two Techniques

Summary of Various Organizations
- The two-level organization has slightly lower IPC than the single-level one, but 17% better IPS due to shorter L1 access times. It pays an energy penalty for copying between L1 and L2.
- The banked (single port per bank) register file has a shorter access time (by more than a factor of 2) and needs 18 times less energy than a conventional organization.
- The choice of technique depends on the design goals.