Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University.


2 Outline. Motivation; case study: Cyclone; towards a high-performance wakeup-free scheduler (a general model, employing a pre-check scheme, a segmented issue queue); conclusions and future work.

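3 Superscalar Issue Queue: Wakeup Logic. [Figure: issue queue entries inst0 to instN-1, each holding rdyL, opd tagL, opd tagR, rdyR; result tags tag1 to tagIW are compared (=) against the operand tags and the match signals are ORed.] Wakeup logic delay $= T_{tagdrive} + T_{tagmatch} + T_{matchOR}$, where $T_{tagdrive} = c_0 + (c_1 + c_2 \times IW) \times N + (c_3 + c_4 \times IW + c_5 \times IW^2) \times N^2$ and $T_{tagmatch}, T_{matchOR} = c_0 + c_1 \times IW + c_2 \times IW^2$ (IW: issue width; N: number of issue queue entries). S. Palacharla et al., ISCA24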
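
The delay model on this slide can be evaluated directly once coefficients are chosen. The sketch below is only an illustration: the names `WakeupCoeffs`, `t_tagdrive`, `t_quadratic_iw`, and `wakeup_delay` are hypothetical, and the coefficients are left to the caller because the slide does not give their values.

```python
from dataclasses import dataclass


@dataclass
class WakeupCoeffs:
    """Technology-dependent constants; values must be supplied by the caller."""
    drive: tuple      # (c0, c1, c2, c3, c4, c5) for T_tagdrive
    match: tuple      # (c0, c1, c2) for T_tagmatch
    match_or: tuple   # (c0, c1, c2) for T_matchOR


def t_tagdrive(c, iw, n):
    """T_tagdrive = c0 + (c1 + c2*IW)*N + (c3 + c4*IW + c5*IW^2)*N^2."""
    c0, c1, c2, c3, c4, c5 = c
    return c0 + (c1 + c2 * iw) * n + (c3 + c4 * iw + c5 * iw ** 2) * n ** 2


def t_quadratic_iw(c, iw):
    """Shared form of T_tagmatch and T_matchOR: c0 + c1*IW + c2*IW^2."""
    c0, c1, c2 = c
    return c0 + c1 * iw + c2 * iw ** 2


def wakeup_delay(coeffs, iw, n):
    """Total wakeup delay for issue width iw and n issue queue entries."""
    return (t_tagdrive(coeffs.drive, iw, n)
            + t_quadratic_iw(coeffs.match, iw)
            + t_quadratic_iw(coeffs.match_or, iw))
```

With any positive coefficients, the tag-drive term grows quadratically in N, which is why the wakeup path is treated as a scaling problem in the slides that follow.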

4 Superscalar Issue Queue: Selection Logic. $T_{selection} = c_0 + c_1 \times \log_4 N$. [Figure: the issue queue feeds a tree of 4-input arbiter cells; each cell has req0 to req3 and grant0 to grant3 lines plus req/enb lines to its parent; subtrees connect from/to other subtrees up to the root cell.] S. Palacharla et al., ISCA24
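
The $\log_4 N$ term reflects the depth of a tree built from 4-input arbiter cells. The following sketch shows that relationship and a single-grant walk through such a tree. The function names are hypothetical, and the fixed lowest-index priority inside each cell is an assumption made for illustration, not something stated on the slide.

```python
import math


def tree_levels(n, fan_in=4):
    """Number of arbiter levels needed to cover n issue queue entries."""
    levels, span = 0, 1
    while span < n:
        span *= fan_in
        levels += 1
    return levels


def selection_delay(n, c0, c1):
    """T_selection = c0 + c1 * log4(N); c0 and c1 are caller-supplied."""
    return c0 + c1 * math.log(n, 4)


def select_one(requests, fan_in=4):
    """Grant one requesting entry: requests go up the tree, the grant comes down.

    Assumption: each cell grants its lowest-index requesting child.
    Returns the granted entry index, or None if nothing requests.
    """
    level = [bool(r) for r in requests]
    if not any(level):
        return None
    # Upward pass: a cell raises its request if any of its children request.
    levels = [level]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([any(below[i:i + fan_in])
                       for i in range(0, len(below), fan_in)])
    # Downward pass: from the root cell, follow the first requesting child.
    index = 0
    for below in reversed(levels[:-1]):
        start = index * fan_in
        index = start + below[start:start + fan_in].index(True)
    return index
```

For example, a 64-entry queue needs three levels of 4-input cells, so each grant crosses the tree three times up and three times down.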

5 Challenges in Dynamic Instruction Scheduling. Broadcast-based dynamic scheduler: higher complexity; power hungry; a major limiter to clock frequency, given increasing issue queue size, issue width, and wire delay, and fewer logic levels per pipeline stage. Complexity-effective issue: speculative wakeup [Stark et al.]; dependency-chain-based ordering [Canal/Gonzalez ICS 00/01; Michaud/Seznec HPCA01]; segmented issue queue [Raasch et al., ISCA 2002]; wakeup-free dynamic scheduler [Ernst et al., ISCA 2003]. Benefits: lower complexity, lower power consumption, better scalability. Cost: some performance loss has to be traded.

6 Our Goals. Explore the predictability of instruction issue latency; identify the performance impediments in wakeup-free architectures; design high-performance wakeup-free schedulers.

7 Cyclone: Conflict in the Main Queue. [Figure panels: FP benchmarks; Int benchmarks; series: Order Enforced.] Enforce ordered placement to avoid conflict between instructions with different latencies.

8 Possible Structural Problems. Instruction promotion/forwarding incurs conflicts along the path. Very limited instruction pool for selection: only entries in column 0 of the main queue can be issued, and ready instructions that are not in column 0 are delayed by conflicts. A limited number of issue ports leaves less tolerance for instructions mispredicted as ready: they waste issue ports, prevent ready instructions from issuing, and compete with newly decoded instructions because of replay.

9 A General Model: WF-Replay. How to relax the structural constraints? Collapsing issue queue without promotion; conventional random selection logic; much wider issue width given; an instruction is removed if no replay is needed. [Figure blocks: from decoder, Rename, Pre-schedule, Timing Table, Wakeup-Free Issue Queue (a lat field per entry), Selection Logic, to FUs, from FUs, register file ready bits, replay?]

10 Instruction Pre-scheduling. [Figure: instructions I0 to I3 pass through the Rename/PSCHED 0 and PSCHED 1 stages; blocks shown are the Register Mapping Table, the Timing Table, dep check, MUX control, max and + units, and a reschedule? signal; outputs are lat0 to lat3.] Adapted from Cyclone, D. Ernst et al., ISCA'03
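
The max and + units in the pre-scheduling figure suggest how a predicted issue latency is formed: take the latest predicted ready time among the source operands, then add the execution latency to obtain the ready time of the destination. A minimal sketch under that reading follows. The function name, the tuple layout, and the `exec_latency` table are hypothetical, and the per-cycle countdown of timing table entries is omitted.

```python
def preschedule(group, timing_table, exec_latency):
    """Predict issue latencies for one decode group.

    group:        list of (dest, srcs, opcode) tuples in program order
    timing_table: dict mapping a register to the predicted number of
                  cycles until its value is ready (missing means ready now)
    exec_latency: dict mapping an opcode to its execution latency
    Returns the predicted issue latency of each instruction.
    """
    latencies = []
    for dest, srcs, opcode in group:
        # "max": an instruction can issue once its slowest source is ready.
        lat = max((timing_table.get(s, 0) for s in srcs), default=0)
        latencies.append(lat)
        if dest is not None:
            # "+": the destination becomes ready after the op executes.
            # Updating the table in program order also covers dependences
            # inside the group (the role of dep check / MUX control).
            timing_table[dest] = lat + exec_latency[opcode]
    return latencies
```

For a group where I1 consumes the result of I0, the table update made for I0 is visible when I1 is processed, so I1 receives the later latency without a separate dependence pass.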

11 Latency Triggered Selection. [Figure: each Wakeup-Free Issue Queue entry carries a lat field and feeds the selection tree of 4-input arbiter cells (req0 to req3, grant0 to grant3, req/enb lines per cell) up to the root cell.]

12 WF-Replay IPC (F4-I8 vs F4-I4). [Figure panels: Issue Width: 8; Issue Width: 4.] WF-Replay loses 9.7% performance (IPC) relative to Base when the issue width is reduced to 4 instructions per cycle.

13 Competition at Issue Ports? [Figure panels: Issue Width: 8; Issue Width: 4.]

14 Precheck to Avoid Competition. Competition at the issue ports may delay instructions that are predicted ready. Delayed instructions may in turn compete with the instructions that depend on them, causing more instructions to be falsely marked ready or to be delayed. Wider issue ports can avoid unnecessary competition at the cost of higher complexity. Solution: prevent falsely ready instructions from being selected by pre-checking the register ready bits.

15 WF-Precheck Scheduler. [Figure blocks: from decoder, Rename, Pre-schedule, Timing Table, Wakeup-Free Issue Queue (lat and ry fields per entry), Issuing, Selection Logic, to FUs, Register Ready Bit Register, from Mem.] Precheck the register ready bits when the predicted latency reaches 0. Selection requests are filtered by the 'ry' bit. Trade replay for pre-check: only truly ready instructions are issued.
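
The difference between replaying and pre-checking can be illustrated with a toy selection cycle. This is a sketch under simplifying assumptions, not the authors' design: `Entry`, `operands_ready`, and `select_cycle` are hypothetical names, and selection simply grants the first requesters in queue order.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    lat: int       # predicted cycles until the instruction can issue
    srcs: tuple    # source register names


def operands_ready(entry, ready_bits):
    """True if every source register's ready bit is set."""
    return all(ready_bits.get(s, False) for s in entry.srcs)


def select_cycle(queue, ready_bits, issue_width, precheck):
    """Run one selection cycle; return (issued names, wasted ports).

    Entries request selection once their predicted latency reaches 0.
    With precheck=True, requests are first filtered by the ready bits
    (the 'ry' bit). With precheck=False, a granted entry whose operands
    are not ready wastes its port and stays queued for replay.
    """
    requesting = [e for e in queue if e.lat == 0]
    if precheck:
        requesting = [e for e in requesting if operands_ready(e, ready_bits)]
    granted = requesting[:issue_width]
    issued = [e for e in granted if operands_ready(e, ready_bits)]
    wasted = len(granted) - len(issued)
    for e in issued:
        queue.remove(e)   # removed only when no replay is needed
    for e in queue:
        if e.lat > 0:
            e.lat -= 1    # count down the predicted latency
    return [e.name for e in issued], wasted
```

With three entries predicted ready, two issue ports, and the first entry's operand actually not ready, the replay policy issues one instruction and wastes one port, while the pre-check policy issues the two truly ready instructions.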

16 Complexity of Pre-checking. On average, 40.2% of instructions have both source operands ready and 45.4% have one source operand ready at the pre-schedule stage. Fewer than 2 pre-check requests are made per cycle.

17 Issue Port Competition (F4-I4)

18 WF-Precheck IPC (F4-I4)

19 Impact of Load Related Predictions

20 How about Selection Logic? $T_{selection} = c_0 + c_1 \times \log_4 N$. [Figure: the issue queue feeds a tree of 4-input arbiter cells; each cell has req0 to req3 and grant0 to grant3 lines plus req/enb lines to its parent; subtrees connect from/to other subtrees up to the root cell.] S. Palacharla et al., ISCA24

21 WF-Segment Issue Queue. [Figure blocks: from decoder, Rename / Pre-scheduling, Time Table, Dispatch Routing (>4), issue queue entries with ry bits, Selection Logic, 4 issue ports to FUs, Register Ready Bits from FUs and Mem., Switchback path.]

22 WF-Segment Issue Queue. On average, WF-Segment gives up 3% IPC relative to WF-Precheck and 5% relative to Base in exchange for optimized selection logic.

23 Conclusions. Explored and identified the performance impediments in wakeup-free scheduling. High-performance wakeup-free dynamic schedulers: WF-Replay eliminates structural constraints; WF-Precheck avoids unnecessary competition at issue ports; WF-Segment optimizes selection logic for high clock speed.

24 Future Work. Routing complexity analysis for the WF-Segment scheduler; power analysis for wakeup-free schedulers; a more sophisticated pre-scheduler.


26 Wire Delay Challenges. Pipeline depth is increasing for high performance; the clock period (in FO4) decreases dramatically; cross-chip wire delay will be up to 10 cycles as technology shrinks. M. S. Hrishikesh et al., ISCA29; Stephen W. Keckler et al., ISSCC'03

27 Precheck as A Single Stage

28 Load/Store Dependence Predictor