Complexity-Effective Issue Queue Design Under Load-Hit Speculation
Tali Moreshet and R. Iris Bahar
Brown University, Division of Engineering


Brown University, WCED 2002

Motivation
- Pipelines are getting deeper
  - Higher clock frequencies
  - Increased architectural complexity
- Speculatively issued instructions are particularly sensitive to pipeline depth
  - Branch prediction
  - Load hit prediction

Pipeline
[Figure: pipeline block diagram. Stages: Fetch, Decode, Issue, Execute. Blocks: Instruction Cache, Register Rename Unit, Issue Queue, Register File, Functional Units, Data Cache. Annotations: load resolution loop, forwarding.]

Load Hit Prediction
- Issue instructions dependent on a load as soon as possible
  - Assume the load hits in the DL1
- BUT: load hit status is known only after dependent instructions may issue

Example
[Figure, shown in three animation steps: cycle-by-cycle issue/execute timing for LOAD, MULT, SUB, and ADD, with the speculative window marked.]

What Happens On a Load Miss?
- Re-issue instructions in the speculative window after a load miss
- Keep post-issue instructions in the issue queue long enough to ensure re-issuing will not be necessary

Complexity-Effective Load Hit Speculation
As pipeline depth increases:
- Retain performance benefit
- Consider complexity of re-issue and prediction policies
- Consider impact on issue queue design

Re-Issue Policies
4 different load hit speculation policies:
1) No load hit speculation
2) Perfect load hit speculation
3) Replay only instructions dependent on the load that missed
4) Replay all instructions in the speculative window
Load hit/miss predictor to limit re-issuing
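The difference between policies 3 and 4 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Instr` record, the function names, and in particular the dependencies among MULT, SUB, and ADD, which the transcript does not preserve. It is not the authors' simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str          # instruction label, also used as the name of its result
    srcs: tuple = ()   # labels of the instructions whose results it reads


def dependents_of(load_name, window):
    """Labels of window instructions that depend on the load, directly or transitively."""
    tainted = {load_name}
    replay = []
    for ins in window:  # window is assumed to be in program order
        if any(src in tainted for src in ins.srcs):
            tainted.add(ins.name)
            replay.append(ins.name)
    return replay


def replay_set(policy, load_name, window):
    """Instructions to re-issue after the load misses, under policy 3 or policy 4."""
    if policy == "dependents_only":   # policy 3
        return dependents_of(load_name, window)
    if policy == "whole_window":      # policy 4
        return [ins.name for ins in window]
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical speculative window: MULT reads the load, ADD reads MULT, SUB is independent.
WINDOW = [
    Instr("MULT", srcs=("LOAD",)),
    Instr("SUB"),
    Instr("ADD", srcs=("MULT",)),
]
```

Under these assumed dependencies, policy 3 replays MULT and ADD, while policy 4 also replays the independent SUB. Policy 3 re-issues less work but needs dependence tracking, whereas policy 4 only needs to know which instructions issued inside the window.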

Performance Impact
[Figure]

Impact on Issue Queue Occupancy
[Two figures]

Impact on Issue Queue Occupancy
As pipeline depth increases:
- Issue queue gets cluttered with post-issue instructions (average 55%)
- Limits the available ILP
- Inefficient use of complexity in instruction bid/grant arbitration logic
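To get a feel for what the 55% figure means, the following Python sketch computes how many entries remain for instructions that have not yet issued. It reads the figure as a share of queue entries, and the 64-entry queue size is a hypothetical value chosen only for the arithmetic; the transcript does not state the baseline size.

```python
def entries_left_for_unissued(iq_entries, post_issue_percent):
    """Whole entries still available to not-yet-issued instructions (rounded down)."""
    return iq_entries * (100 - post_issue_percent) // 100


# Hypothetical 64-entry queue with the slide's 55% average post-issue occupancy.
EXAMPLE = entries_left_for_unissued(64, 55)
```

In this reading, fewer than half of the entries take part in finding new instructions to issue, even though the arbitration logic is still sized for the whole queue.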

The Bid / Grant Loop
[Figure: issue queue with M entries. Entries bid (req) for an issue slot; N-wide prioritize-and-select logic chooses among them; grants are broadcast back to the queue.]
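One pass around the bid/grant loop can be sketched in Python as follows. The oldest-first priority rule, the `Entry` fields, and the function names are assumptions for illustration; the slide does not say which priority scheme the select logic uses.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tag: str
    age: int              # smaller value = older instruction (assumed priority key)
    ready: bool           # all source operands available
    issued: bool = False  # post-issue entries stay in the queue but never bid


def bid_grant(entries, issue_width):
    """One select cycle: ready, unissued entries bid; at most issue_width are granted."""
    bidders = [e for e in entries if e.ready and not e.issued]
    bidders.sort(key=lambda e: e.age)   # assumed oldest-first prioritization
    granted = bidders[:issue_width]
    for e in granted:
        e.issued = True                 # the grant is broadcast back to the entry
    return [e.tag for e in granted]


def example_queue():
    """Hypothetical queue contents: LOAD has already issued, ADD is not ready."""
    return [
        Entry("LOAD", 0, ready=True, issued=True),
        Entry("MULT", 1, ready=True),
        Entry("SUB", 2, ready=True),
        Entry("ADD", 3, ready=False),
        Entry("XOR", 4, ready=True),
    ]
```

With an issue width of 2, the first cycle on `example_queue()` grants MULT and SUB, and the next cycle grants XOR. The already-issued LOAD entry still occupies a slot that the select logic must examine, which is the inefficiency the following slides address.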

Issue Queue Utilization Problem
- Complexity of bid/grant arbitration logic increases with size of the IQ
- IQ consists largely of post-issue instructions
  - Limiting the available ILP that a large IQ is supposed to provide
- Not a complexity-effective design

IQ Design Options
- Increase the IQ size
  - Improve performance (increase available ILP)
  - Increase complexity
- Simplify arbitration logic (use slower circuitry)
  - Reduce complexity
  - Hurt performance
- Reduce IQ size
  - Reduce complexity
  - Hurt performance

Double Latency of Issue Queue
[Figure]

Smaller IQ (48 Entry)
[Figure]

Complexity-Effective Issue Queue
Goal
- Reduce complexity
- Do not degrade performance
Solution: The Dual Issue Queue
- Move post-issue instructions from main queue to separate replay queue
- Increase available ILP
- Reduce size of main IQ

Dual Issue Queue
[Figure: block diagram. Blocks: Register Rename Unit (fed from the Fetch unit), Main Issue Queue (MIQ), Replay Issue Queue (RIQ), Register File, Functional Units, Data Cache. Signal: Replay_req.]
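The idea of moving post-issue instructions out of the main queue can be sketched in Python as below. The class, its method names, and the capacities are hypothetical. The transcript names the MIQ, the RIQ, and a Replay_req signal but does not describe how a replayed instruction travels back to the functional units, so the sketch only reports which RIQ entries would raise a replay request.

```python
class DualIssueQueue:
    """Toy model: MIQ holds instructions waiting to issue, RIQ holds post-issue ones."""

    def __init__(self, miq_capacity, riq_capacity):
        self.miq_capacity = miq_capacity
        self.riq_capacity = riq_capacity
        self.miq = []   # not yet issued; only these take part in bid/grant
        self.riq = []   # issued, kept until the load's hit/miss status is known

    def dispatch(self, tag):
        """Insert a renamed instruction into the MIQ; False if the MIQ is full."""
        if len(self.miq) >= self.miq_capacity:
            return False
        self.miq.append(tag)
        return True

    def issue(self, tag):
        """Issue from the MIQ and park the instruction in the RIQ; False if the RIQ is full."""
        if len(self.riq) >= self.riq_capacity:
            return False
        self.miq.remove(tag)
        self.riq.append(tag)
        return True

    def resolve(self, tags, load_hit):
        """On a hit, release the given RIQ entries; on a miss, return those needing replay."""
        if load_hit:
            self.riq = [t for t in self.riq if t not in tags]
            return []
        return [t for t in self.riq if t in tags]   # would assert Replay_req
```

For example, with a 2-entry MIQ, dispatching MULT and ADD fills it and a third dispatch fails. Issuing MULT moves it to the RIQ and immediately frees an MIQ slot, so the main queue can stay small while speculative instructions are still retained for a possible replay.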

Dual Issue Queue Performance
[Figure]

Conclusion
- Load hit speculation is critical for high performance in deeper pipelines
- Larger percentage of post-issue instructions in issue queue
- Complexity-effective issue queue scheme addresses utilization problem
- For deepest pipelines, overall performance improves while reducing complexity of IQ