
Presentation transcript:

Lecture 7: Register Renaming

Data hazards between two instructions A and B (register-value diagrams omitted)
Read-After-Write
A: R1 = R2 + R3
B: R4 = R1 * R4
Write-After-Read
A: R1 = R3 / R4
B: R3 = R2 * R4
Write-After-Write
A: R1 = R2 + R3
B: R1 = R3 * R4

Register Data Dependencies (this lecture)
– Output dependence (WAW), also δo
– Anti-dependence (WAR), δa
– True dependence (RAW), δt
– Why is RAR not a dependency?
Memory Data Dependencies (later lecture)
Control Dependencies (earlier lectures)
Structural Dependencies
– Instruction must wait until some “structure” is available
– Ex: Divider, ROB entry, Branch color/tag, etc.

WAR dependencies are from reusing registers
A: R1 = R3 / R4
B: R3 = R2 * R4
Giving B a different destination register removes the dependency:
A: R1 = R3 / R4
B: R5 = R2 * R4
With no dependencies, reordering still produces the correct results
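A minimal sketch of the WAR example above, assuming illustrative initial register values (R1=5, R2=-2, R3=9, R4=3, R5=0) and a hypothetical helper `run` that executes instructions in the order given. It shows that swapping A and B changes A's result when B overwrites R3, and that the swap is harmless once B writes R5 instead.

```python
def run(program, regs):
    """Execute (dest, op, src1, src2) tuples in the given order on a copy of regs."""
    regs = dict(regs)
    ops = {"/": lambda a, b: a / b, "*": lambda a, b: a * b}
    for dest, op, s1, s2 in program:
        regs[dest] = ops[op](regs[s1], regs[s2])
    return regs

# Illustrative initial values (an assumption for this sketch).
init = {"R1": 5, "R2": -2, "R3": 9, "R4": 3, "R5": 0}

A = ("R1", "/", "R3", "R4")          # A: R1 = R3 / R4
B = ("R3", "*", "R2", "R4")          # B: R3 = R2 * R4  (WAR on R3)
B_renamed = ("R5", "*", "R2", "R4")  # B: R5 = R2 * R4  (no WAR)

in_order = run([A, B], init)
swapped = run([B, A], init)
war_breaks = in_order["R1"] != swapped["R1"]   # reordering changed A's result

renamed_in_order = run([A, B_renamed], init)
renamed_swapped = run([B_renamed, A], init)
rename_safe = renamed_in_order == renamed_swapped  # same final state either way
```

With the original code, A reads the overwritten R3 when it runs second; with the renamed destination, both orders leave the registers in the same state.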

WAW dependencies are also from reusing registers
A: R1 = R2 + R3
B: R1 = R3 * R4
Same solution works:
A: R5 = R2 + R3
B: R1 = R3 * R4

Finite number of registers
– At some point, you’re forced to overwrite somewhere
– Most RISC: 32 registers, x86: only 8, x86-64: 16
Loops, Code Reuse
– If you write a value to R1 in a loop body, then R1 will be reused every iteration ⇒ induces many false dep’s
– Loop unrolling can help a little
  – Will run out of registers at some point anyway
  – Trade off with code bloat
– Short function calls can result in similar register reuse
  – Inlining can help a little

Add more registers to the ISA? BAD!!!
– Changing the ISA can break binary compatibility
  – x86-64 mostly doesn’t break compatibility, but it’s a hack
– All code must be recompiled
– Does not address register overwriting due to code reuse from loops and function calls
– Not a scalable solution

Processor has more registers than specified by the ISA
– temporarily map ISA registers (“logical” or “architected” registers) to the physical registers to avoid overwrites
Components:
– mapping mechanism
– physical registers
  – allocated vs. free registers
  – allocation/deallocation mechanism
– state maintenance (commit, mispredictions, etc.)

Architected Registers: R0 to R7
Physical Registers: T0 to Tn-1
Original Code          Renamed Code
R2 = R1+R3             T1 = R1+R3
R4 = R2 - R6           R4 = T1 - R6
…                      …
R2 = R7 / R5           T20 = R7 / R5
BEQ R2, #1             BEQ T20, #1
…                      …
R2 = R4 * R1           T7 = R4 * R1
R6 = Load [R2]         R6 = Load [T7]
The reuse of R2 in the original code creates WAW and WAR dependencies; in the renamed code: No False Dependencies!

Renaming one instruction: Dest = Src1 op Src2
– Mapping mechanism looks up the sources: Src1 ⇒ Tag S1, Src2 ⇒ Tag S2
– A new Tag D is taken from the unmapped physical registers and the mapping is updated: Dest ⇒ Tag D
– Renamed instruction: Tag D = Tag S1 op Tag S2
Repeat for each instruction

Lookup Table (RAT)
– One entry per architected register
– Entry stores physical location of most recent version of the logical register
– Most recent version may be in the physical register file (PRF) or in the architected register file (ARF)

RAT initially has no mappings for R0 to R7; Free PRegs: T13, T14, T9, T7
R1 = R2 + R3   renamed to   T13 = R2 + R3     (Free PRegs: T14, T9, T7)
R5 = R4 – R1   renamed to   T14 = R4 – T13    (Free PRegs: T9, T7)
R1 = R1 * R5   renamed to   T9 = T13 * T14    (Free PRegs: T7)
R2 = R5 / R1   renamed to   T7 = T14 / T9     (Free PRegs: none)
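The one-instruction-at-a-time renaming above can be sketched roughly as follows. This is an illustration, not the lecture's implementation: the function name `rename`, the string instruction format, and the use of a Python dict for the RAT are all assumptions of this sketch. Registers with no RAT entry are left as architected names, meaning the value still lives in the ARF.

```python
import re

def rename(instructions, free_pregs):
    """Rename 'Rd = Rs1 op Rs2' strings one at a time, in program order."""
    rat = {}                  # architected register -> physical tag
    free = list(free_pregs)   # free physical registers, consumed front to back
    renamed = []
    for inst in instructions:
        dest, expr = [part.strip() for part in inst.split("=", 1)]
        # Sources first: replace each Rn with its current mapping, if any.
        expr = re.sub(r"R\d+", lambda m: rat.get(m.group(0), m.group(0)), expr)
        tag = free.pop(0)     # allocate a destination tag from the free pool
        rat[dest] = tag       # later readers of dest now see the new tag
        renamed.append(f"{tag} = {expr}")
    return renamed, rat, free

code = ["R1 = R2 + R3", "R5 = R4 - R1", "R1 = R1 * R5", "R2 = R5 / R1"]
renamed, rat, free = rename(code, ["T13", "T14", "T9", "T7"])
```

Looking up the sources before updating the destination mapping matters for an instruction such as `R1 = R1 * R5`, which must read the old version of R1.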

Renaming a group of instructions in the same cycle (source tags are read from the RAT, destination tags come from the free register pool)
R1 = R2 + R3      RAT: T16, T23     dest: T10
R4 = R5 – R7      RAT: T39, T7      dest: T31
R3 = R0 / R2      RAT: T14, T16     dest: T19
R5 = Ld 12[R6]    RAT: T5, X        dest: T6     (Don’t rename immediates)
For N-wide superscalar: 2N RAT read-ports, N RAT write-ports

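Same group, but the third instruction now reads R1
R1 = R2 + R3      RAT: T16, T23     dest: T10
R4 = R5 – R7      RAT: T39, T7      dest: T31
R3 = R0 / R1      RAT: T14, T16     dest: T19
R5 = Ld 12[R6]    RAT: T5, X        dest: T6
– The tag read from the RAT for R1 (T16) is the wrong version of R1
– Should be using the version of R1 written by the first instruction in the group (T10)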

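Group with several intra-group dependencies (destination tags T10, T31, T19, T6 from free register pool)
R1 = R2 + R1      RAT lookup: T16, T34
R2 = R1 – R2      RAT lookup: T34, T16     Result of sequential renaming: T10, T16
R1 = R2 / R1      RAT lookup: T16, T34     Result of sequential renaming: T31, T10
R1 = R2 >> R1                              Result of sequential renaming: T31, T19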

Intra-Group Dependency Checker
– Inputs: Src L, Src R and Dest of Inst 0 to Inst 3, the RAT lookups, and the new tags from the free register pool
– Outputs: T 0L to T 3L and T 0R to T 3R
– No check is needed for Inst 0, since 1st inst in a group has no earlier insts to be dependent on
– Similarly, src 1L and src 1R cannot be dependent on dst 1, dst 2 or dst 3

Dependency check: each source is compared (=) against the destinations of all earlier instructions in the group
– src 0L, src 0R: no comparisons
– src 1L, src 1R: compared with dst 0
– src 2L, src 2R: compared with dst 0, dst 1
– src 3L, src 3R: compared with dst 0, dst 1, dst 2
– On a match, the earlier instruction's new destination tag overrides the RAT lookup (R) to produce the final tag (T)
N-wide rename has O(N) gate delay?
Total number of comparisons: $2 \times \frac{n(n-1)}{2} = n^2 - n = O(n^2)$
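As a quick check of the comparison count, the following sketch enumerates the comparators directly. The function name `comparator_count` is a hypothetical helper for illustration; it assumes two source operands per instruction, as on the slide.

```python
def comparator_count(n):
    """Count source-vs-earlier-destination comparators for an n-wide group."""
    count = 0
    for i in range(n):            # instruction i within the group
        for _src in ("L", "R"):   # two source operands per instruction
            count += i            # compared against dst 0 .. dst i-1
    return count

counts = {n: comparator_count(n) for n in (1, 2, 4, 8)}
```

The enumeration agrees with the closed form $n^2 - n$ for every width tried here.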

Wide groups: the comparisons for one source (e.g., src 7R against dst 0 through dst 6) are done in parallel and their results combined
Gate delay reduced down to $O(\log_2 N)$

RAT update for the group
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1
Only the mapping for R1 from the last instruction should be written into the RAT
Condition: use mapping if instruction is last writer to the register
– use dst 0 if dst 0 != dst 1, dst 2 and dst 3
– use dst 1 if dst 1 != dst 2 and dst 3
– use dst 2 if dst 2 != dst 3
– always use dst 3
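The group rename, the intra-group dependency check, and the last-writer RAT update can be put together in one sketch. Everything here is illustrative: `rename_group` and the tuple format are hypothetical, and the initial RAT (R1 ⇒ T34, R2 ⇒ T16) is an assumption chosen to be consistent with the tags that appear in the four-instruction example.

```python
def rename_group(group, rat, free_pregs):
    """Rename a group of (dest, src_l, src_r) tuples as if in one cycle."""
    n = len(group)
    new_tags = free_pregs[:n]   # one fresh tag per instruction
    # Step 1: every source reads the RAT as it was before the group.
    looked_up = [(rat[sl], rat[sr]) for _, sl, sr in group]
    renamed = []
    for i, (_dest, sl, sr) in enumerate(group):
        tl, tr = looked_up[i]
        # Step 2: compare against earlier destinations in the group;
        # a later match overrides an earlier one (youngest writer wins).
        for j in range(i):
            if group[j][0] == sl:
                tl = new_tags[j]
            if group[j][0] == sr:
                tr = new_tags[j]
        renamed.append((new_tags[i], tl, tr))
    # Step 3: only the last writer of each register updates the RAT.
    new_rat = dict(rat)
    for i, (dest, _, _) in enumerate(group):
        if all(group[k][0] != dest for k in range(i + 1, n)):
            new_rat[dest] = new_tags[i]
    return renamed, new_rat, free_pregs[n:]

group = [("R1", "R2", "R1"),   # R1 = R2 + R1
         ("R2", "R1", "R2"),   # R2 = R1 - R2
         ("R1", "R2", "R1"),   # R1 = R2 / R1
         ("R1", "R2", "R1")]   # R1 = R2 >> R1
start_rat = {"R1": "T34", "R2": "T16"}   # assumed initial mappings
renamed, new_rat, free = rename_group(group, start_rat, ["T10", "T31", "T19", "T6"])
```

In hardware the inner loop corresponds to parallel comparators and a priority select rather than sequential steps; the loop is only a convenient way to express the same result.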

Commit with an ARF (figure: R3 in the ARF and RAT, T42 in the PRF, Free Pool)
– Architected register file contains the committed/non-speculative processor state
– When an instruction commits, it updates the ARF with the new value
– The ARF now contains the correct value; update the RAT
– T42 is no longer needed, return to the physical register free pool

Commit when a newer writer exists (figure: R3 in the ARF and RAT, T42 and T17 in the PRF, Free Pool)
– Update ARF as usual
– Deallocate physical register
– Don’t touch the RAT! (Someone else is the most recent writer to R3)
At some point in the future, the newer writer of R3 commits
– Deallocate physical register
– This instruction was the most recent writer, now update the RAT
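A rough sketch of the two commit cases for a design with a separate ARF. The function `commit`, the dict-based structures, and the stored values are assumptions for illustration; a RAT entry of `None` stands for "the current value lives in the ARF". The sketch also assumes T42 is the older writer of R3 and T17 the newer one.

```python
def commit(inst, arf, prf, rat, free_pool):
    """Commit one (arch_reg, phys_tag) pair in an ARF + PRF design."""
    arch_reg, tag = inst
    arf[arch_reg] = prf[tag]       # update the committed state
    if rat.get(arch_reg) == tag:   # this instruction is still the newest writer
        rat[arch_reg] = None       # None: the RAT now points at the ARF
    free_pool.append(tag)          # physical register is no longer needed

arf = {"R3": 0}
prf = {"T42": 7, "T17": 11}        # hypothetical result values
rat = {"R3": "T17"}                # a newer writer of R3 has been renamed
free_pool = []

commit(("R3", "T42"), arf, prf, rat, free_pool)   # older writer commits
after_first = (arf["R3"], rat["R3"], list(free_pool))
commit(("R3", "T17"), arf, prf, rat, free_pool)   # newest writer commits
after_second = (arf["R3"], rat["R3"], list(free_pool))
```

The first commit leaves the RAT alone because T17 is still the most recent version of R3; only the second commit redirects the RAT to the ARF.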

Unified with the ROB
– Each ROB entry holds an instruction (inst) and its result (data), so the ROB also serves as the PRF
– Instructions are kept in program order between ROB_head (oldest) and ROB_tail

Free registers = all entries from ROB_tail to ROB_head – 1
Instructions allocated into ROB in-order, so physical registers also allocated in same order, starting at ROB_tail
– dst i = T [ROB_tail]
– dst i+1 = T [ (ROB_tail + 1) % ROB_size ]
– dst i+2 = T [ (ROB_tail + 2) % ROB_size ]
– …
– dst i+N-1 = T [ (ROB_tail + N-1) % ROB_size ]

No need to explicitly manage free pool
– just increment ROB_tail as physical registers are allocated, increment ROB_head as registers are deallocated
Inefficiency: allocate registers to all instructions
– Branches, stores (and some other insts) don’t need physical registers
Asymmetric datapath: sometimes read values from ARF, sometimes from the PRF
– requires both structures to be heavily ported
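A minimal sketch of ROB-based allocation, assuming the convention stated above (allocation advances ROB_tail, commit advances ROB_head). The class name `RobPrf`, the explicit `count` field, and the tag naming are hypothetical choices for this illustration.

```python
class RobPrf:
    """Circular ROB whose entries double as physical registers."""

    def __init__(self, size):
        self.size = size
        self.head = 0    # oldest in-flight instruction
        self.tail = 0    # next free entry
        self.count = 0   # entries currently allocated

    def allocate(self, n):
        """Allocate n consecutive entries; return their tags, or None if full."""
        if self.count + n > self.size:
            return None  # not enough free entries: rename must stall
        tags = [f"T{(self.tail + k) % self.size}" for k in range(n)]
        self.tail = (self.tail + n) % self.size
        self.count += n
        return tags

    def commit(self, n=1):
        """Retire up to n of the oldest entries, freeing their registers."""
        n = min(n, self.count)
        self.head = (self.head + n) % self.size
        self.count -= n
        return n

rob = RobPrf(4)
first = rob.allocate(3)      # takes entries 0, 1, 2
blocked = rob.allocate(2)    # only one entry left
retired = rob.commit(2)      # frees entries 0 and 1
second = rob.allocate(3)     # wraps around the end of the buffer
```

Note that every instruction consumes an entry here, including branches and stores, which is the inefficiency the slide points out.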

Combine both ARF and PRF into a single register file
– Before, ARF and PRF could be the same hardware structure, but they have distinct name spaces
  – e.g., ARF (R0-R7) mapped to T0-T7 and PRF mapped to T8-T99
– For a unified RF, the committed R0 could be mapped anywhere (T0-T99)
Need some way to track the “committed” state

Two RATs, each with one entry per architected register (R0 to R7): Speculative RAT and Committed RAT
– The committed RAT, along with the registers it points at, implements the logical equivalent of the ARF
– The speculative RAT tracks the locations of the most recent version of each architected register
– Both RATs may point to the same physical location (R0, R5): the most recent writer has also committed

Example with a unified Register File T0 to T23
Initially both RATs map R0 to R7 onto T0 to T7; Free Pool: T8 to T23
A: R1 = R2 + R4    renamed to    T8 = T2 + T4
B: R4 = R2 – R7    renamed to    T9 = T2 – T7
C: R2 = R1 * R4    renamed to    T10 = T8 * T9
D: R1 = R1 + #1    renamed to    T11 = T8 + #1
When A and B commit, the previously committed versions of R1 and R4 (T1 and T4) return to the free pool
E: R7 = R4 / R1    renamed to    T1 = T9 / T11
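The speculative/committed RAT pair can be sketched as below. This is an illustration under stated assumptions: the class `UnifiedRenamer` and its methods are hypothetical, and the free pool is modeled as a FIFO queue, so instruction E receives T12 here rather than the recycled T1 shown in the slide's example.

```python
from collections import deque

class UnifiedRenamer:
    """Unified register file tracked by a speculative and a committed RAT."""

    def __init__(self, num_arch, num_phys):
        # Assumed initial state: Ri is committed in Ti, the rest are free.
        self.spec_rat = {f"R{i}": f"T{i}" for i in range(num_arch)}
        self.commit_rat = dict(self.spec_rat)
        self.free = deque(f"T{i}" for i in range(num_arch, num_phys))
        self.rob = deque()   # (arch_dest, new_tag) in program order

    def rename(self, dest, srcs):
        # Immediates such as "#1" have no RAT entry and pass through unchanged.
        src_tags = [self.spec_rat.get(s, s) for s in srcs]
        tag = self.free.popleft()
        self.spec_rat[dest] = tag
        self.rob.append((dest, tag))
        return tag, src_tags

    def commit(self):
        dest, tag = self.rob.popleft()
        stale = self.commit_rat[dest]   # previously committed version
        self.commit_rat[dest] = tag
        self.free.append(stale)         # overwriter committed: free the old one
        return stale

u = UnifiedRenamer(8, 24)
a = u.rename("R1", ["R2", "R4"])   # A: R1 = R2 + R4
b = u.rename("R4", ["R2", "R7"])   # B: R4 = R2 - R7
c = u.rename("R2", ["R1", "R4"])   # C: R2 = R1 * R4
d = u.rename("R1", ["R1", "#1"])   # D: R1 = R1 + #1
freed = [u.commit(), u.commit()]   # A and B commit
e = u.rename("R7", ["R4", "R1"])   # E: R7 = R4 / R1
```

A register is released only when the instruction that overwrites its architected register commits, which is why committing A frees T1 rather than T8.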

Previous example showed a stack data structure (LIFO)
– Free pool contents: T9 T1 T34 T25 T23 T17 T8, with the top-of-stack (TOS) feeding the 4-wide Rename
– In the same cycle, 3 regs are allocated to Rename while T28 and T13 arrive from commit
Stack HW is complex due to need to simultaneously read and write the top-of-stack

A queue structure (FIFO) is easier to implement
– independent reading/writing of head and tail
– Example: pool T9 T1 T34 T25 T23 T17 T8; 3 regs allocated at Pool Head, 2 regs (T13, T28) deallocated at Pool Tail
Corner case still exists when pool is empty
– Either stall rename for one cycle or need more complex HW to bypass dealloc’d registers to the renamer

State around a branch (br)
– ARF state corresponds to state prior to oldest non-committed instruction
– As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
– On a branch misprediction, wrong-path instructions are flushed from the machine
– The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state (?!?)

Recovery by stalling until the ARF is correct
– Correct path instructions (foo) arrive from fetch; can’t rename because RAT is wrong
– Allow all instructions to execute and commit; ARF corresponds to last committed instruction
– ARF now corresponds to the state right before the next instruction to be renamed (foo)
– Reset RAT so that all mappings refer to the ARF
– Resume renaming the new correct-path instructions from fetch
Pros: Very simple to implement
Cons: Performance loss due to stalls

At each branch, make a copy of the RAT (register mapping at the time of the branch) as a Checkpoint (figure shows the ARF, the RAT, its Checkpoint, and the Free Pool)
On a misprediction:
1. flush wrong-path instructions
2. deallocate RAT checkpoints
3. recover RAT from checkpoint
4. resume renaming (foo)

No need to stall front-end (?)
– need to “flash copy” RAT both for making checkpoints and recovering
– need some way to “hunt down” wrong-path checkpoints for deallocation
  – can “walk” the ROB, but this may take more than one cycle which may introduce stalls; still faster than stall-and-drain
More hardware
– need one checkpoint per branch
– what if the code has nothing but branches?
  – worst case needs one checkpoint per ROB entry
  – can assign one checkpoint per branch color
– stall front-end when out of branch colors/checkpoints
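A minimal sketch of checkpoint-based recovery with a bounded number of checkpoints. The class `CheckpointedRat`, its method names, and the branch identifiers are hypothetical; the "flash copy" is modeled as a plain dict copy, and checkpoints are kept oldest first so that everything younger than a mispredicted branch can be dropped in one step.

```python
class CheckpointedRat:
    """RAT with per-branch snapshots (a software stand-in for flash copies)."""

    def __init__(self, rat, max_checkpoints):
        self.rat = dict(rat)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []   # (branch_id, RAT copy), oldest first

    def rename_branch(self, branch_id):
        """Snapshot the RAT at a branch; False means the front-end must stall."""
        if len(self.checkpoints) == self.max_checkpoints:
            return False
        self.checkpoints.append((branch_id, dict(self.rat)))
        return True

    def write(self, arch_reg, tag):
        self.rat[arch_reg] = tag

    def mispredict(self, branch_id):
        """Restore the RAT saved at branch_id; drop it and all younger checkpoints."""
        idx = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        self.rat = dict(self.checkpoints[idx][1])
        del self.checkpoints[idx:]

    def resolve_correct(self, branch_id):
        """A correctly predicted branch releases its checkpoint."""
        self.checkpoints = [(b, r) for b, r in self.checkpoints if b != branch_id]

c = CheckpointedRat({"R1": "T23"}, max_checkpoints=2)
ok0 = c.rename_branch("br0")
c.write("R1", "T19")              # wrong-path rename after br0
ok1 = c.rename_branch("br1")
c.write("R1", "T5")
ok2 = c.rename_branch("br2")      # out of checkpoints
c.mispredict("br0")
```

This sketch only restores the RAT; returning the wrong-path physical registers to the free pool is a separate step, as the deallocation discussion in the lecture notes.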

Each register-writing ROB entry tracks two physical registers
1. Its allocated destination register
2. The previous physical register mapping for its architected register
Example
– R1 mapped to T23
– Rename new instruction X, which overwrites R1
  – R1 now mapped to T19
  – X also records an “undo mapping” of T23
– Recovery: walk ROB backwards applying the undo mappings
Lower overhead: don’t need full copies of the RAT
Slower?: need to walk the ROB
Flexibility: can recover to any instruction; not just branches
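The undo-mapping scheme can be sketched as follows. The helper names `rename_with_undo` and `recover`, the dict-based ROB entries, and the extra registers (R2, T4, T30, T31) are assumptions added for illustration; only the R1, T23 and T19 values come from the example above.

```python
def rename_with_undo(rat, rob, arch_reg, new_tag):
    """Rename a destination and log the mapping it replaces."""
    rob.append({"dest": arch_reg, "tag": new_tag, "undo": rat[arch_reg]})
    rat[arch_reg] = new_tag

def recover(rat, rob, keep, free_pool):
    """Squash every ROB entry at index >= keep, youngest first."""
    while len(rob) > keep:
        entry = rob.pop()                    # walk the ROB backwards
        rat[entry["dest"]] = entry["undo"]   # restore the previous mapping
        free_pool.append(entry["tag"])       # wrong-path register is free again

rat = {"R1": "T23", "R2": "T4"}
rob, free_pool = [], []
rename_with_undo(rat, rob, "R1", "T19")   # instruction X from the example
rename_with_undo(rat, rob, "R2", "T30")   # hypothetical wrong-path instruction
rename_with_undo(rat, rob, "R1", "T31")   # hypothetical wrong-path instruction
recover(rat, rob, keep=1, free_pool=free_pool)   # squash everything after X
```

Walking youngest-first matters: R1 is first restored from T31 to T19, and would go on to T23 only if X itself were squashed.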

For ROB-based PRF, deallocation is simple:
– ROB_tail reset to point right after the mispredicted branch
For unified RF, allocated registers may be anywhere in the register file
– Some sort of ROB walk still required to deallocate the wrong-path PRegs; do at same time with checkpoint deallocation

RAT: Highly ported SRAM
– 3N ports: 2N read, 1N write
– 1 entry per architected register: includes int, FP, MMX/SSE, lo/hi (MIPS), control registers, FP status, predicate registers (IA64), flags (x86), etc.
– Each entry is $\lceil \log_2 |PRF| \rceil$ bits wide, plus 1 valid bit when RF not unified (!valid ⇒ register is in the ARF)
– Typical N = 3, 4; |ARF| = |PRF| = 100±
– Small in bytes, but 9-12 ports
– SRAM latency typically quadratic w.r.t. #ports
Dep Check Logic
– Almost full pairwise dependency checks: $O(N^2)$ comparisons
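The entry-width and port-count arithmetic can be worked through with a small helper. The function `rat_cost` and the example configuration (32 architected registers, 128 physical registers, 4-wide rename, non-unified register file) are hypothetical values chosen for illustration, not figures from the lecture.

```python
def rat_cost(num_arch, num_phys, width, unified):
    """Estimate RAT entry width, total storage bits, and port count."""
    tag_bits = (num_phys - 1).bit_length()           # ceil(log2(num_phys))
    entry_bits = tag_bits + (0 if unified else 1)    # valid bit if a separate ARF exists
    return {
        "entry_bits": entry_bits,
        "total_bits": num_arch * entry_bits,
        "ports": 3 * width,                          # 2N read + N write
    }

example = rat_cost(num_arch=32, num_phys=128, width=4, unified=False)
```

For this configuration the table holds only a few hundred bits, so the cost is dominated by the 12 ports rather than by capacity.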

SRAM lookup easily pipelined
Dependency check is just combinatorial logic; easily pipelined
– Rename split into two stages, REN1 and REN2: group ABCD is followed one cycle later by group EFGH
What if there’s a dependency between groups?
– ABCD haven’t updated the RAT when EFGH reads the RAT

Similar to intra-group dependency checking, now must perform inter-group dependency checking (REN1/REN2 with groups ABCD and EFGH)
– Register mappings if no dependencies
– Overrides if dependency exists between ABCD
– Overrides if dependency exists between ABCD and EFGH

Clock period with pipelined rename (timing diagram omitted)
– Original renaming: 1ns/cycle, 1GHz
– Original renaming split across stages: 0.5ns/cycle, 2GHz
– Split further: 0.32ns/cycle, 3.14GHz
– Each stage also carries overhead due to pipelined rename

More stages
– higher branch mispredict penalty
– a lot more implementation complexity
  – dep check with previous group, prev-prev group, etc.
  – pipeline control logic, latching overhead
  – more circuits (↑ area, ↑ power), more design effort
Higher frequency
– more performance if pipeline not overly exposed
  – need sufficiently high branch prediction accuracy
– power goes up even more ($P = \frac{1}{2} C V^2 f$)
  – This is on top of the extra power for the extra circuits
  – Extra logic effectively increases the C term

How big should the physical register file be?
– ROB-based: PRF entries == ROB entries
– Unified: ???
Should have one register per instruction
– How to count instructions?
– Every instruction from rename to retire
  – instructions in fetch/decode stages haven’t been renamed, and therefore don’t need physical registers
– Not every instruction needs a register (branches, stores)
How many instructions does this add up to?
– N × Stages(Rename to Dispatch) + ROB_size
– Less those expected to not need destinations
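The sizing estimate can be turned into a small calculation. The function `unified_prf_size`, the parameter `no_dest_fraction`, and the example numbers (4-wide, 2 stages from rename to dispatch, 128-entry ROB, 25% of instructions without a destination) are all hypothetical; the sketch follows the formula above and nothing more.

```python
import math

def unified_prf_size(width, rename_to_dispatch_stages, rob_size, no_dest_fraction=0.0):
    """Estimate physical registers needed for in-flight instructions."""
    # N x Stages(Rename to Dispatch) + ROB_size
    in_flight = width * rename_to_dispatch_stages + rob_size
    # Less those expected to not need destinations (branches, stores, ...).
    return math.ceil(in_flight * (1.0 - no_dest_fraction))

upper_bound = unified_prf_size(4, 2, 128)
discounted = unified_prf_size(4, 2, 128, no_dest_fraction=0.25)
```

Discounting by an expected fraction is a bet on the instruction mix: a burst of register-writing instructions can still exhaust the pool and stall rename.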

Physical register lifetime across the pipeline (IF, ID, REN, Disp, RS, ROB, Commit)
1. No register allocated
2. Register allocated, but contents are bogus
3. Register contains valid data
– This is the only time a physical storage location is really needed
– Actually, only needed until last consumer reads the value
4. Overwriter commits; register has stale value; deallocate
PRF needs to be large enough for all instructions in Region 2, but none of the registers will contain anything useful!