CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez UPC-Barcelona IJPP ’01

Bottlenecks
- Recap from “Complexity-Effective Superscalars”
- Wakeup+Select and Bypass have the longest delays and represent atomic operations
- Pipelining either of them prevents dependent instructions from executing back-to-back
- Increased issue width / window size / wire delays exacerbate the problem (also for the register file and cache)

Dependence-Based Microarchitecture
Example instruction stream (program order):
  r3 ← r1 + r2
  r4 ← r3 + r2
  r5 ← r4 + r2
  r6 ← r4 + r2
  r7 ← r6 + r2
  r8 ← r5 + r2
  r9 ← r1 + r2
Ready-operand table (Rdy) at the start: r1 = 1, r2 = 1, r3 = 0, …
Steering into FIFOs, one instruction per animation step (each FIFO listed head first):
  FIFO 1: r3, r4, r5, r8
  FIFO 2: r6, r7
  FIFO 3: r9
Issue from the FIFO heads in successive steps:
  Step 1: r3 and r9 (operands r1 and r2 are ready); r3 is then marked ready (r3 = 1)
  Step 2: r4
  Step 3: r5 and r6
  Remaining at the heads: r8 and r7
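The steering shown in the frames above can be reproduced with a short sketch. This is a simplified reading of the FIFO steering rule, not code from the paper or the lecture: an instruction joins a FIFO only if the producer of one of its pending operands is currently at that FIFO's tail; otherwise it opens a new FIFO. The function name `steer_to_fifos`, the unlimited number of FIFOs, and the unlimited FIFO depth are all assumptions made for illustration.

```python
def steer_to_fifos(instrs, ready):
    """Steer (dest, sources) instructions into FIFOs, in program order.

    Hypothetical sketch: assumes unlimited FIFOs of unlimited depth.
    Returns a list of FIFOs, each a list of destination registers, head first.
    """
    fifos = []          # fifos[i] holds destination registers, head first
    producer_fifo = {}  # destination register -> index of the FIFO holding it
    for dest, srcs in instrs:
        target = None
        for src in srcs:
            if src in ready:
                continue  # operand already available, no chain to follow
            idx = producer_fifo.get(src)
            # Join the producer's FIFO only if nothing is queued behind it.
            if idx is not None and fifos[idx][-1] == src:
                target = idx
                break
        if target is None:
            fifos.append([])  # all operands ready, or producer is not at a tail
            target = len(fifos) - 1
        fifos[target].append(dest)
        producer_fifo[dest] = target
    return fifos


program = [
    ("r3", ("r1", "r2")), ("r4", ("r3", "r2")), ("r5", ("r4", "r2")),
    ("r6", ("r4", "r2")), ("r7", ("r6", "r2")), ("r8", ("r5", "r2")),
    ("r9", ("r1", "r2")),
]
fifos = steer_to_fifos(program, ready={"r1", "r2"})
print(fifos)
```

On the slide's instruction stream this yields the same three FIFOs as the frames: r6 opens a second FIFO because r5 is already queued behind its producer r4.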

Pros and Cons
- Wakeup and select operate over a subset of issue queue entries (only the FIFO heads)
- FIFOs are under-utilized because they do not get filled (causes about 5% IPC loss), but it is not hard to increase their sizes
- You still need an operand-ready table
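To illustrate why select only needs to look at the FIFO heads, here is a minimal sketch of head-only issue. It assumes single-cycle latency for every instruction and no limit on issue width, which the slides do not state; `issue_from_heads` is a hypothetical name.

```python
from collections import deque


def issue_from_heads(fifos, ready):
    """Issue instructions cycle by cycle, examining only FIFO heads.

    Hypothetical sketch: every instruction has one-cycle latency and the
    issue width is unbounded. Each FIFO entry is (dest, sources).
    Returns a list of cycles, each a list of issued destination registers.
    """
    fifos = [deque(f) for f in fifos]
    ready = set(ready)
    schedule = []
    while any(fifos):
        # A head issues only if all of its sources are in the ready table.
        issued = [f.popleft() for f in fifos
                  if f and all(src in ready for src in f[0][1])]
        if not issued:
            raise ValueError("no FIFO head is ready; sources never produced")
        # Results become visible to dependents in the next cycle.
        ready.update(dest for dest, _ in issued)
        schedule.append([dest for dest, _ in issued])
    return schedule


fifo1 = [("r3", ("r1", "r2")), ("r4", ("r3", "r2")),
         ("r5", ("r4", "r2")), ("r8", ("r5", "r2"))]
fifo2 = [("r6", ("r4", "r2")), ("r7", ("r6", "r2"))]
fifo3 = [("r9", ("r1", "r2"))]
schedule = issue_from_heads([fifo1, fifo2, fifo3], ready={"r1", "r2"})
print(schedule)
```

Under these assumptions the issue order matches the slide frames: r3 and r9 first, then r4, then r5 and r6, with r8 and r7 last.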

Clustered Microarchitectures

- Simplifies wakeup+select and bypassing
- Dependence-based, hence most communication is local
- Low porting requirements on register file, issue queue
- IPC loss of 6.3%, but a clock speed improvement

Clustered Microarchitectures
Two primary motivations:
- It is hard to design 8-way machines in future technologies
- The FP cluster is idle most of the time
Advantages:
- Few entries, few ports → low delays → fast clocks, simple pipelines
- Not every instruction is penalized for wire delays
- Potential for large windows and high ILP
- Design and verification costs do not scale up (?)

Dependences
- During rename, steer dependent instructions to the same cluster
- However, we do not know about converging chains (workarounds exist: traces, compilers)
- If the assigned cluster is full, do we stall or go elsewhere? This is not clarified in the paper
Example:
  r1 ← r2 + r3   cl-1
  r4 ← r1 + r2   cl-1
  r5 ← r6 + r7   cl-2
  r8 ← r5 + r1   ?

Load Imbalance
- All instructions in one cluster → zero communication, but zero utilization of other resources
- Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots
- The number of ready instructions in each cluster should be equal; however, instruction readiness happens long after instruction steering

Load Imbalance Metrics
Metrics:
- Instructions in each cluster
- Unissued instructions that could have issued elsewhere (note the latency between steer and issue)
The second metric does not help much
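As a rough illustration of the two metrics, the sketch below computes them for a single cycle. The function names and the per-cluster issue width of 4 are assumptions made for this example; the slides do not give the exact definitions used in the paper.

```python
def count_imbalance(instrs_per_cluster):
    """Metric 1 (sketch): spread between the fullest and emptiest cluster."""
    return max(instrs_per_cluster) - min(instrs_per_cluster)


def wasted_issue_slots(ready_per_cluster, issue_width):
    """Metric 2 (sketch): ready instructions left unissued in one cluster
    that could have used a free issue slot in another cluster this cycle."""
    unissued = sum(max(0, r - issue_width) for r in ready_per_cluster)
    free_slots = sum(max(0, issue_width - r) for r in ready_per_cluster)
    return min(unissued, free_slots)


# Slide example: six ready instructions in cl-1 and two in cl-2.
# The issue width of 4 per cluster is an assumption for illustration.
print(count_imbalance([6, 2]))                     # spread of 4 instructions
print(wasted_issue_slots([6, 2], issue_width=4))   # 2 slots wasted
print(wasted_issue_slots([4, 4], issue_width=4))   # 0 when balanced
```

With the same eight ready instructions split evenly, no issue slot is wasted, which is the point of the slide's example.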

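Instruction Assignment
[Figure: a shared register-rename and instruction-steer stage feeding two clusters, each with its own issue queue (IQ), register file, and functional units (F)]
Architectural instructions:
  r1 ← r2 + r3
  r4 ← r1 + r2
  r5 ← r6 + r7
  r8 ← r1 + r5
Renamed instructions:
  p21 ← p2 + p3
  p22 ← p21 + p2
  p42 ← p21
  p41 ← p56 + p57
  p43 ← p42 + p41
- 40 registers in each cluster
- r1 is mapped to both p21 and p42; this influences steering and instruction commit
- On average, only 8 registers are replicated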
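The renaming example on this slide can be sketched as follows. This is an illustrative model, not the paper's mechanism: the function `rename`, the per-cluster free lists, and the initial mappings are assumptions chosen so that the output matches the slide's physical register numbers. Freeing of old physical registers at commit is omitted.

```python
def rename(instrs, steering, reg_map, free):
    """Rename two-source instructions for a clustered machine (sketch).

    reg_map: logical register -> {cluster: physical register number}
    free:    per-cluster list of free physical register numbers
    Inserts a copy when a source has no mapping in the target cluster.
    Hypothetical sketch: freeing old mappings at commit is not modeled.
    """
    out = []
    for (dest, srcs), cluster in zip(instrs, steering):
        phys = []
        for src in srcs:
            if cluster not in reg_map[src]:
                # Source lives only in another cluster: replicate it here.
                remote = next(iter(reg_map[src].values()))
                local = free[cluster].pop(0)
                out.append((cluster, f"p{local} <- p{remote}"))
                reg_map[src][cluster] = local
            phys.append(reg_map[src][cluster])
        d = free[cluster].pop(0)
        reg_map[dest] = {cluster: d}
        out.append((cluster, f"p{d} <- p{phys[0]} + p{phys[1]}"))
    return out


# Assumed starting state: cluster 0 owns p1..p40, cluster 1 owns p41..p80.
reg_map = {"r2": {0: 2}, "r3": {0: 3}, "r6": {1: 56}, "r7": {1: 57}}
free = [list(range(21, 41)), list(range(41, 56))]
instrs = [("r1", ("r2", "r3")), ("r4", ("r1", "r2")),
          ("r5", ("r6", "r7")), ("r8", ("r1", "r5"))]
renamed = rename(instrs, steering=[0, 0, 1, 1], reg_map=reg_map, free=free)
for cluster, text in renamed:
    print(f"cluster {cluster}: {text}")
```

After the last instruction, r1 has a mapping in both clusters (p21 and p42), which is the replication the slide refers to.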

Assignment by the Compiler
- Requires ISA modification
- Less accurate notion of load
- Depends on good branch prediction, memory dependence prediction, cache miss prediction, contention modeling, etc.
- Dynamic mechanisms can add pipeline stages

Steering Heuristics
- Simple Register Mapping Based Steering (Simple-RMBS): if communication cannot be avoided, pick a random cluster
- Balanced-RMBS: if communication cannot be avoided, pick the less-loaded cluster
- Advanced-RMBS: if there is significant imbalance, pick the less-loaded cluster, else use Balanced-RMBS
- Modulo steering: assignment alternates between clusters
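The four heuristics can be compared side by side in a small sketch. It follows only the one-line descriptions on this slide; the function names, the imbalance threshold of 4, the load measure (instruction count per cluster), and the tie-breaking among clusters that avoid communication are assumptions, not details from the paper.

```python
import random


def comm_free_clusters(srcs, reg_map, num_clusters):
    """Clusters in which every source register already has a mapping."""
    return [c for c in range(num_clusters)
            if all(c in reg_map.get(s, ()) for s in srcs)]


def steer(heuristic, srcs, reg_map, load, threshold=4, rng=random):
    """Pick a cluster for one instruction (hypothetical sketch).

    reg_map: logical register -> set of clusters holding a copy
    load:    list with the current instruction count of each cluster
    """
    n = len(load)
    # If no cluster avoids communication, every cluster is a candidate.
    candidates = comm_free_clusters(srcs, reg_map, n) or list(range(n))
    if heuristic == "simple":
        return rng.choice(candidates)
    if heuristic == "advanced" and max(load) - min(load) > threshold:
        # Significant imbalance: ignore register mappings entirely.
        return min(range(n), key=lambda c: load[c])
    if heuristic in ("balanced", "advanced"):
        return min(candidates, key=lambda c: load[c])
    raise ValueError(f"unknown heuristic: {heuristic}")


def modulo_steer(instr_index, num_clusters):
    """Modulo steering: alternate clusters regardless of dependences."""
    return instr_index % num_clusters


reg_map = {"r1": {0}, "r5": {1}}
# r8 <- r5 + r1: sources live in different clusters, so communication
# cannot be avoided and Balanced-RMBS falls back to the less-loaded cluster.
print(steer("balanced", ["r5", "r1"], reg_map, load=[6, 2]))
# Advanced-RMBS overrides the register mapping when imbalance is large.
print(steer("advanced", ["r1"], reg_map, load=[10, 1]))
```

The second call shows the trade-off the Advanced variant makes: it accepts a communication for r1 in order to relieve the overloaded cluster.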

Results
- Modulo steering: too much communication
- Balanced and Simple RMBS do well (27% and 22% better than the base), with fewer than 3 communications per 100 instructions (a single bus is enough); assuming zero communication cost isolates the effect of workload imbalance
- Advanced RMBS performs 35% better than the base
- The maximum possible improvement (UB model) is 44%

Other Results
- Scheduling constraints limit improvements for FP programs
- The compiler can do better than what Fig. 10 indicates
- The Palacharla algorithm does not do as well: no load considerations and few FIFOs → more communication

Optimizations
- Information on converging chains (slices)
- First-fit and Mod-N
- Identify critical source operands
- Interconnect-sensitive steering
- Stalls in dispatch

Future Trends
- Increased wire delays and more transistors → each cluster is smaller → more clusters → latency across clusters is higher
- Load imbalance and communication become worse; the best heuristic/threshold will depend on the assumed model/latency
- Data cache access time increases

Dynamic Cluster Allocation
- At some point, using more clusters can increase communication costs and worsen performance
- More clusters → larger windows/FUs → more ILP → more communication penalties
- The steering heuristic should take the degree of ILP into account (ISCA ’03)

Other Recent Papers
- Hierarchical interconnect designs: Aggarwal and Franklin
- Distributed data caches: UPC
- Power-efficiency of clustered designs: Zyuban and Kogge
- TRIPS processor: UT-Austin (compiler mapping)

Important Problems
[Figure: L1D and L2 caches with multiple front-ends (FE)]
- Cluster allocation to threads
- Design of interconnects
- Latency tolerance
- Exploiting heterogeneity
- 3D design
- Power efficiency and temperature
- Branch fan-out

Next Week’s Paper “The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays”, UT-Austin/Compaq, ISCA’02
