1
CS 7810 Lecture 4: Overview of Steering Algorithms
Based on "Dynamic Code Partitioning for Clustered Architectures," R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP '01
2
Bottlenecks
Recap from "Complexity-Effective Superscalars":
Wakeup+select and bypass have the longest delays and are atomic operations
Pipelining them would prevent back-to-back execution of dependent instructions
Increased issue width, window size, and wire delays exacerbate the problem (also for the register file and cache)
3
Dependence-Based Microarchitecture (animated example spanning slides 3–14)
Instructions are steered at dispatch into FIFOs so that each dependence chain stays in a single FIFO; only the instructions at the FIFO heads participate in wakeup and select. An operand-ready table (initially r1 = ready, r2 = ready, r3 = not ready, …) tracks which source registers are available.
Example instruction sequence:
r3 ← r1 + r2
r4 ← r3 + r2
r5 ← r4 + r2
r6 ← r4 + r2
r7 ← r6 + r2
r8 ← r5 + r2
r9 ← r1 + r2
The animation steers r3, r4, r5, r8 into one FIFO (each depends on the instruction at that FIFO's tail), r6 and r7 into a second FIFO, and r9 into a third; instructions then issue from the FIFO heads as their operands become ready.
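A minimal sketch of this steering policy, assuming a fixed number of FIFOs and ignoring issue timing and FIFO capacity; the data structures and the `steer` helper are illustrative, not Palacharla's implementation:

```python
# Dependence-based FIFO steering: keep each dependence chain in one FIFO
# so that only FIFO heads need wakeup+select. Illustrative sketch.
from collections import deque

NUM_FIFOS = 4
fifos = [deque() for _ in range(NUM_FIFOS)]
last_writer = {}   # register -> index of the FIFO holding its producer

def steer(dst, srcs):
    """Place instruction `dst <- op(srcs)` into a FIFO, or stall."""
    # Prefer a FIFO whose tail instruction produces one of our sources.
    for src in srcs:
        f = last_writer.get(src)
        if f is not None and fifos[f] and fifos[f][-1][0] == src:
            fifos[f].append((dst, srcs))
            last_writer[dst] = f
            return True
    # Otherwise take an empty FIFO.
    for f, q in enumerate(fifos):
        if not q:
            q.append((dst, srcs))
            last_writer[dst] = f
            return True
    return False   # no suitable FIFO: dispatch stalls

# The slide's example: r3, r4, r5, r8 end up in one FIFO,
# r6 and r7 in a second, r9 in a third.
for dst, srcs in [("r3", ["r1", "r2"]), ("r4", ["r3", "r2"]),
                  ("r5", ["r4", "r2"]), ("r6", ["r4", "r2"]),
                  ("r7", ["r6", "r2"]), ("r8", ["r5", "r2"]),
                  ("r9", ["r1", "r2"])]:
    steer(dst, srcs)

heads = [q[0] for q in fifos if q]   # only these see wakeup+select
```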
15
Pros and Cons
Wakeup and select operate over only a subset of issue queue entries (the FIFO heads)
Under-utilization because the FIFOs do not fill up (about 5% IPC loss) – but it is not hard to increase their sizes
An operand-ready table is still needed
16
Clustered Microarchitectures
17
Simplifies wakeup+select and bypassing
Dependence-based steering, hence most communication is local
Low porting requirements on the register file and issue queue
IPC loss of 6.3%, but a clock speed improvement
18
Clustered Microarchitectures
Two primary motivations: it is hard to design 8-way machines in future technologies, and the FP cluster is idle most of the time
Advantages:
Few entries and few ports → low delays → fast clocks and simple pipelines
Not every instruction is penalized by wire delays
Potential for large windows and high ILP
Design and verification costs do not scale up (?)
19
Dependences
During rename, steer dependent instructions to the same cluster
However, we do not know about converging chains (workarounds exist – traces/compilers)
If the assigned cluster is full, do we stall or go elsewhere? – not clarified in the paper
Example:
r1 ← r2 + r3   cl-1
r4 ← r1 + r2   cl-1
r5 ← r6 + r7   cl-2
r8 ← r5 + r1   ?
20
Load Imbalance
All instructions in one cluster → zero communication, but zero utilization of the other cluster's resources
Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots
Ideally, the number of ready instructions in each cluster is equal – however, instructions become ready long after they are steered
21
Load Imbalance Metrics
Metrics:
Number of instructions in each cluster
Unissued instructions that could have issued in the other cluster (note the latency between steer and issue)
The second metric does not help much
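To make the two metrics concrete, here is a small illustrative sketch; the `ClusterState` fields and the way "could have issued elsewhere" is approximated are assumptions, not the paper's exact definitions:

```python
# Illustrative computation of the two imbalance metrics.
from dataclasses import dataclass

@dataclass
class ClusterState:
    in_window: int     # instructions currently in this cluster's window
    ready: int         # of those, how many are ready this cycle
    issued: int        # how many actually issued this cycle
    issue_width: int   # per-cluster issue slots

def occupancy_imbalance(clusters):
    """Metric 1: spread in the number of instructions per cluster."""
    counts = [c.in_window for c in clusters]
    return max(counts) - min(counts)

def lost_issue_slots(clusters):
    """Metric 2: ready-but-unissued instructions in one cluster while
    some other cluster left issue slots idle in the same cycle."""
    stranded = sum(max(c.ready - c.issued, 0) for c in clusters)
    idle = sum(max(c.issue_width - c.issued, 0) for c in clusters)
    return min(stranded, idle)
```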
22
Instruction Assignment
[Figure: register rename & instruction steer stage feeding two clusters, each with its own issue queue, register file, and functional units]
Example (40 registers in each cluster):
r1 ← r2 + r3        p21 ← p2 + p3      (cl-1)
r4 ← r1 + r2        p22 ← p21 + p2     (cl-1)
                    p42 ← p21          (copy of r1 into cl-2)
r5 ← r6 + r7        p41 ← p56 + p57    (cl-2)
r8 ← r1 + r5        p43 ← p42 + p41    (cl-2)
r1 is mapped to both p21 and p42 – this will influence steering and instruction commit – on average, only 8 registers are replicated
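A rough sketch of how per-cluster renaming with register replication might work; the cluster count, free-list sizes, and the copy-insertion helper are illustrative assumptions, not the paper's mechanism:

```python
# Illustrative per-cluster renaming with register replication.
NUM_CLUSTERS = 2
rename_map = {}    # arch reg -> {cluster: phys reg}
free_lists = [iter(range(100 * c, 100 * c + 40)) for c in range(NUM_CLUSTERS)]

def read_src(reg, cluster, copies):
    """Return reg's physical register in `cluster`, replicating it
    (via an inserted copy) if it currently lives only in another cluster."""
    homes = rename_map.setdefault(reg, {})
    if cluster in homes:
        return homes[cluster]
    local = next(free_lists[cluster])
    if homes:                                               # value exists elsewhere:
        copies.append((local, next(iter(homes.values()))))  # copy it over
    homes[cluster] = local                                  # (live-ins just get mapped)
    return local

def rename(dst, srcs, cluster):
    """Rename one instruction steered to `cluster`; return any copies
    that must be injected plus the renamed instruction."""
    copies = []
    src_phys = [read_src(s, cluster, copies) for s in srcs]
    dst_phys = next(free_lists[cluster])
    rename_map[dst] = {cluster: dst_phys}    # the new value lives here only
    return copies, (dst_phys, src_phys)

# e.g. after rename("r1", ["r2", "r3"], 0), a later rename("r8", ["r1", "r5"], 1)
# returns a copy that replicates r1 into cluster 1, as in the slide's example.
```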
23
Assignment by the Compiler
Requires ISA modification
Has a less accurate notion of load
Depends on good branch prediction, memory dependence prediction, cache miss prediction, contention modeling, etc.
On the other hand, dynamic mechanisms can add pipeline stages
24
Steering Heuristics
Simple Register Mapping Based Steering (Simple-RMBS): if communication cannot be avoided, pick a random cluster
Balanced-RMBS: if communication cannot be avoided, pick the less-loaded cluster
Advanced-RMBS: if there is significant imbalance, pick the less-loaded cluster; otherwise use Balanced-RMBS
Modulo steering: assignment alternates between clusters
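A minimal sketch of these heuristics, assuming two clusters, a `home_cluster` map giving the cluster that holds each source operand, and a per-cluster load count; the imbalance threshold is an arbitrary illustrative value:

```python
# Sketch of the RMBS family of steering heuristics.
import random

NUM_CLUSTERS = 2
IMBALANCE_THRESHOLD = 4    # assumed value, not from the paper

def source_clusters(srcs, home_cluster):
    return {home_cluster[s] for s in srcs if s in home_cluster}

def least_loaded(load):
    return min(range(NUM_CLUSTERS), key=lambda c: load[c])

def simple_rmbs(srcs, home_cluster, load):
    homes = source_clusters(srcs, home_cluster)
    if len(homes) == 1:                      # all sources in one cluster:
        return homes.pop()                   # no communication needed
    return random.randrange(NUM_CLUSTERS)    # otherwise pick at random

def balanced_rmbs(srcs, home_cluster, load):
    homes = source_clusters(srcs, home_cluster)
    if len(homes) == 1:
        return homes.pop()
    return least_loaded(load)                # otherwise the less-loaded one

def advanced_rmbs(srcs, home_cluster, load):
    if max(load) - min(load) > IMBALANCE_THRESHOLD:
        return least_loaded(load)            # rebalance when badly skewed
    return balanced_rmbs(srcs, home_cluster, load)

def modulo_steer(instruction_index):
    return instruction_index % NUM_CLUSTERS  # alternate blindly
```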
25
Results
Modulo steering: too much communication
Balanced-RMBS and Simple-RMBS do well (27% and 22% better than the base) – fewer than 3 communications per 100 instructions (a single bus is enough) – assuming zero communication cost isolates the effect of workload imbalance
Advanced-RMBS performs 35% better than the base
The maximum possible improvement (upper-bound model) is 44%
26
Other Results
Scheduling constraints limit improvements for FP programs
The compiler can do better than what Fig. 10 indicates
The Palacharla algorithm doesn't do as well – no load considerations and few FIFOs → more communication
27
Optimizations
Information on converging chains (slices)
First-fit and Mod-N assignment
Identifying critical source operands
Interconnect-sensitive steering
Stalls in dispatch
28
Future Trends
Increased wire delays and more transistors → each cluster is smaller, there are more clusters, and latency across clusters is higher
Load imbalance and communication become worse – the best heuristic/threshold will depend on the assumed model/latency
Data cache access time increases
29
Dynamic Cluster Allocation
At some point, using more clusters can increase communication costs and worsen performance
More clusters → larger windows and more FUs → more ILP, but also more communication penalties
The steering heuristic should take the degree of ILP into account (ISCA '03)
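A very rough sketch of the underlying idea (not the ISCA '03 mechanism): enable only as many clusters as the observed ILP can use, so low-ILP phases avoid needless communication. The ILP measure and parameters are assumptions:

```python
# Pick the number of active clusters from a running ILP estimate
# (average ready instructions per cycle). Illustrative only.
import math

def clusters_to_enable(avg_ready_per_cycle, per_cluster_issue_width=2,
                       max_clusters=4):
    needed = math.ceil(avg_ready_per_cycle / per_cluster_issue_width)
    return max(1, min(max_clusters, needed))

# e.g. a phase averaging 3 ready instructions/cycle uses 2 clusters,
# keeping dependence chains close and the interconnect mostly idle.
```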
30
Other Recent Papers
Hierarchical interconnect designs – Aggarwal and Franklin
Distributed data caches – UPC
Power efficiency of clustered designs – Zyuban and Kogge
TRIPS processor – UT-Austin (compiler mapping)
31
Important Problems
[Figure: multiple clusters, each with its own front end, sharing an L1 data cache and L2]
Cluster allocation to threads
Design of interconnects
Latency tolerance
Exploiting heterogeneity
3D design
Power efficiency and temperature
Branch fan-out
32
Next Week's Paper
"The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays," UT-Austin/Compaq, ISCA '02