Out-of-Order Execution & Register Renaming Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Asanovic/Devadas Spring.


Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:
              SPECint92   SPECfp92
  ALU            39 %       13 %
  FPU Add                   20 %
  FPU Mult                  13 %
  load           26 %       23 %
  store           9 %        9 %
  branch         16 %        8 %
  other          10 %       12 %
SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor
What is the average run length between branches?
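The question above can be answered directly from the branch rows of the instruction mix: if a fraction f of dynamic instructions are branches, a branch occurs on average once every 1/f instructions. A minimal sketch in Python follows; the names `branch_fraction` and `average_run_length` are illustrative and not from the lecture.

```python
# Branch fractions taken from the SPEC92 instruction mix above.
branch_fraction = {"SPECint92": 0.16, "SPECfp92": 0.08}


def average_run_length(fraction):
    """Average number of dynamic instructions per branch (branch included)."""
    return 1.0 / fraction


run_lengths = {name: average_run_length(f) for name, f in branch_fraction.items()}
for name, length in run_lengths.items():
    print(f"{name}: about {length:.1f} instructions per branch")
```

So integer code sees a branch roughly every 6 instructions and floating-point code roughly every 12.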

Reducing Control Transfer Penalties
Software solutions:
– loop unrolling: increases the run length
– instruction scheduling: compute the branch condition as early as possible (limited)
Hardware solutions:
– delay slots: replace pipeline bubbles with useful work (requires software cooperation)
– branch prediction & speculative execution of instructions beyond the branch

Effect of Control Transfer on Pipelined Execution
Control transfer instructions require insertion of bubbles in the pipeline. The number of bubbles depends on the number of cycles it takes to determine the next instruction address and to fetch the next instruction.

DLX Branches and Jumps
Must know (or guess) both the target address and whether the branch is taken to execute a branch/jump.
  Instruction   Taken known?       Target known?
  BEQZ/BNEZ     After Reg. Fetch   After Inst. Fetch
  J             Always Taken       After Inst. Fetch
  JR            Always Taken       After Reg. Fetch

Branch penalty
[Figure: pipeline from PC through I-cache, Fetch Buffer, Issue Buffer, Func. Units, and Result Buffer to Arch. State, spanning the Fetch, Decode, Execute, and Commit stages, with the points "Next fetch started" and "Branch executed" marked]

Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000):
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional units
  R  Register File Read
  E  Integer Execute
  Remainder of execute pipeline (+another 6 stages)
Figure annotations: Branch Target Address Known; Branch Direction & Jump Register Target Known

Branch Prediction
Motivation: branch penalties limit the performance of deeply pipelined processors. Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly.
Required hardware support:
– Prediction structures: branch history tables, branch target buffers, etc.
– Mispredict recovery mechanisms:
  – In-order machines: kill instructions following the branch in the pipeline
  – Out-of-order machines: shadow registers and memory buffers for each speculated branch

Static Branch Prediction
(Encode prediction as part of branch instruction)
Probability a branch is taken (~60-70% overall): backward 90%, forward 50%
– Can predict all taken, or backwards taken/forward not-taken (use offset sign bit)
– ISA can attach additional semantics to branches about preferred direction, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
– ISA can allow arbitrary choice of statically predicted direction (HP PA-RISC, Intel IA-64)

Dynamic Branch Prediction
Learning based on past behavior:
– Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at the next execution
– Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution)

Branch Prediction Bits
Assume 2 BP bits per instruction. Change the prediction after two consecutive mistakes!
BP state: (predict take/¬take) × (last prediction right/wrong)
[Figure: state diagram with four states (¬take wrong, ¬take right, take right, take wrong) and transitions labeled taken/¬taken]
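The two-bit scheme above can be sketched as a small state machine that stores exactly the two bits named on the slide: the current prediction and whether the last prediction was right. The class and function names are illustrative. One detail is an assumption, since the diagram's arrows are not recoverable from the transcript: after the prediction flips, the state stays in "last prediction wrong", which matches the usual saturating-counter behavior.

```python
class TwoBitPredictor:
    """State = (predict taken?, was the last prediction right?)."""

    def __init__(self, predict_taken=False):
        self.predict_taken = predict_taken
        self.last_right = True

    def predict(self):
        return self.predict_taken

    def update(self, taken):
        if taken == self.predict_taken:
            self.last_right = True           # correct: remember it
        elif self.last_right:
            self.last_right = False          # first mistake: keep the prediction
        else:
            # Second consecutive mistake: flip the prediction.
            # Assumption: we remain in the "last prediction wrong" state.
            self.predict_taken = not self.predict_taken


def count_mispredictions(predictor, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop-closing branch: taken 9 times, then falls through, executed twice.
loop_branch = ([True] * 9 + [False]) * 2
mispredictions = count_mispredictions(TwoBitPredictor(), loop_branch)
print(mispredictions, "mispredictions out of", len(loop_branch))
```

After the two warm-up mistakes, the only misses are the loop exits: a single not-taken outcome does not flip the prediction, so the next loop entry is still predicted correctly.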

Branch History Table
K-entry BHT, 2 bits/entry, ~80-90% correct predictions
[Figure: k bits of the fetch PC index a 2^k-entry BHT with 2 bits/entry; the instruction fetched from the I-cache (opcode, offset) determines "Branch?" and the target PC, while the BHT supplies taken/¬taken]
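A minimal sketch of a branch history table indexed by k bits of the fetch PC. The encoding of each 2-bit entry as a saturating counter, the 4-byte instruction size, and the choice of the low-order PC bits as the index are assumptions made for illustration; the slide only specifies the table size and 2 bits per entry.

```python
class BranchHistoryTable:
    """2^k entries of 2-bit saturating counters, indexed by PC bits, no tags."""

    def __init__(self, k):
        self.k = k
        self.counters = [1] * (1 << k)  # 0..3; start at "weakly not taken"

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low PC bits.
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable(k=4)
bht.update(0x40, taken=True)
print(bht.predict(0x40))  # the branch at 0x40 is now predicted taken
print(bht.predict(0x80))  # 0x80 maps to the same entry and shares its state
```

Because entries carry no tag, two branches whose PCs share the same index bits alias to one counter, which is one reason a larger table predicts better.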

Exploiting Spatial Correlation (Yeh and Patt, 1992)
    if (x[i] < 7) then y += 1;
    if (x[i] < 5) then c -= 4;
If the first condition is false, the second condition is also false.
History bit: H records the direction of the last branch executed by the processor.
Two sets of BHT bits (BHT0 & BHT1) per branch instruction:
– H = 0 (not taken) ⇒ consult BHT0
– H = 1 (taken) ⇒ consult BHT1

Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct).
[Figure: k bits of the fetch PC index the BHT sets; a 2-bit global branch history shift register, which shifts in the Taken/¬Taken result of each branch, selects the set that produces the Taken/¬Taken prediction]
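A minimal sketch of a two-level predictor: a global history shift register selects one of several counter tables, and PC bits index within the selected table. It is exercised on the correlated pair of `x[i]` tests from the spatial-correlation slide. Several details are assumptions for illustration: the 2-bit saturating counters, the hypothetical branch addresses `B1` and `B2`, and the branch polarity (each branch is taken when its condition is false, that is, it jumps over the `then` body).

```python
class GlobalHistoryPredictor:
    """One table of 2-bit counters per global-history pattern."""

    def __init__(self, k, history_bits=2):
        self.k = k
        self.history_bits = history_bits
        self.history = 0  # shift register of recent branch outcomes
        self.tables = [[1] * (1 << k) for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)  # assumes 4-byte instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table, i = self.tables[self.history], self._index(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


B1, B2 = 0x100, 0x104  # hypothetical addresses of the two branches


def train(predictor, xs):
    for x in xs:
        predictor.update(B1, x >= 7)  # taken = skip "y += 1"
        predictor.update(B2, x >= 5)  # taken = skip "c -= 4"
    return predictor


predictor = train(GlobalHistoryPredictor(k=4), [3, 9, 6, 8, 2, 9] * 20)
predictor.update(B1, True)    # first branch taken, i.e. x[i] >= 7
print(predictor.predict(B2))  # prediction for the second branch
```

Whenever the most recent history bit says the first branch was taken, the counters consulted for the second branch have only ever seen taken outcomes, so the correlation is captured without any knowledge of `x[i]` itself.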

Limitations of BHTs
Cannot redirect the fetch stream until after the branch instruction is fetched and decoded, and the target address determined.
UltraSPARC-III fetch pipeline:
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional units
  R  Register File Read
  E  Integer Execute
  Remainder of execute pipeline (+another 6 stages)
Figure annotations: Correctly predicted taken branch penalty; Jump Register penalty

Branch Target Buffer (BTB)
[Figure: k bits of the PC index a k-entry direct-mapped BTB (can also be associative), accessed alongside the I-cache; each entry holds a valid bit, the entry PC, and the predicted target PC; outputs are match, valid, and target]
– Keep both the branch PC and target PC in the BTB
– PC+4 is fetched if match fails
– Only taken branches and jumps held in BTB
– Next PC determined before branch fetched and decoded
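The BTB lookup described above can be sketched as follows. The class name, the power-of-two table size indexed by low-order PC bits, the 4-byte instruction size, and the policy of evicting an entry when its branch resolves not taken are assumptions for illustration; the slide only states that taken branches and jumps are held.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry is (branch_pc, target_pc) or None if invalid."""

    def __init__(self, k):
        self.k = k
        self.entries = [None] * (1 << k)

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)  # assumes 4-byte instructions

    def next_pc(self, pc):
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:  # valid and PC matches
            return entry[1]                        # predicted target
        return pc + 4                              # match fails: fall through

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)         # hold taken branches only
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None                 # assumed eviction policy


btb = BranchTargetBuffer(k=4)
btb.update(0x40, taken=True, target=0x100)
print(hex(btb.next_pc(0x40)))  # hit: redirect to the stored target
print(hex(btb.next_pc(0x80)))  # same index, different PC: falls through
```

Unlike the BHT, the stored branch PC acts as a tag, so an aliasing instruction does not inherit another branch's target.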

Combining BTB and BHT
– BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
– BHT can hold many more entries and is more accurate
– BTB/BHT only updated after branch resolves in E stage
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional units
  R  Register File Read
  E  Integer Execute
Figure annotations: BTB; BHT. The BHT, in a later pipeline stage, corrects when the BTB misses a predicted taken branch.

Uses of Jump Register (JR)
– Switch statements (jump to address of matching case)
– Dynamic function call (jump to run-time function address)
– Subroutine returns (jump to return address)
How well does BTB work for each of these cases?

BTB Performance
– Switch statements (jump to address of matching case): BTB works well if the same case is used repeatedly
– Dynamic function call (jump to run-time function address): BTB works well if the same function is usually called (e.g., in C++ programming, when objects have the same type in a virtual function call)
– Subroutine returns (jump to return address): BTB works well if the return is usually to the same place
  ⇒ Often one function is called from many different call sites!

Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
    fa() { fb(); }
    fb() { fc(); }
    fc() { fd(); }
– Push return address when function call executed
– Pop return address when subroutine return decoded
[Figure: stack of k entries (typically k=8-16) holding &fd(), &fc(), &fb()]
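A minimal sketch of the return stack as a bounded LIFO. The names are illustrative, and two behaviors are assumptions: the return address is taken to be the call PC plus 4 (fixed 4-byte instructions, no delay slot), and on overflow the oldest entry is silently dropped.

```python
from collections import deque


class ReturnAddressStack:
    """Small fixed-size stack of predicted return addresses."""

    def __init__(self, k=8):
        # deque(maxlen=k) discards the oldest entry when a push overflows.
        self.stack = deque(maxlen=k)

    def on_call(self, call_pc):
        self.stack.append(call_pc + 4)  # assumed return address

    def on_return(self):
        # None means "no prediction"; fall back to another predictor.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=2)
for call_pc in (0x100, 0x200, 0x300):  # fa calls fb, fb calls fc, fc calls fd
    ras.on_call(call_pc)
print(hex(ras.on_return()))  # return from fd
print(hex(ras.on_return()))  # return from fc
print(ras.on_return())       # the entry for fb was lost to overflow
```

The stack predicts correctly however many call sites a function has, which is exactly the case where a BTB entry keyed by the return instruction's PC does poorly.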

Speculating Both Directions
An alternative to branch prediction is to execute both directions of a branch speculatively:
– resource requirement is proportional to the number of concurrent speculative executions
– only half the resources engage in useful work when both directions of a branch are executed speculatively
– branch prediction takes fewer resources than speculative execution of both paths
With accurate branch prediction, it is more cost-effective to dedicate all resources to the predicted direction.

Mispredict Recovery
In-order execution machines:
– Assume no instruction issued after branch can write-back before branch resolves
– Kill all instructions in pipeline behind mispredicted branch
Out-of-order execution?
– Multiple instructions following branch in program order can complete before branch resolves

In-Order Commit for Precise Exceptions
– Instructions fetched and decoded into instruction reorder buffer in-order
– Execution is out-of-order (⇒ out-of-order completion)
– Commit (write-back to architectural state, regfile+memory) is in-order
Temporary storage in ROB holds results before commit.
[Figure: Fetch → Decode → Reorder Buffer → Commit, with Execute attached to the reorder buffer; fetch, decode, and commit are in-order, execution is out-of-order; on "Exception?" at commit, kill signals are sent back through the pipeline and the handler PC is injected at fetch]

ROB for Precise Exceptions
– add fields in the instruction template
– commit instructions to reg file and memory in program order ⇒ buffers can be maintained circularly
– on exception, clear reorder buffer by resetting ptr1=ptr2 (stores must wait for commit before updating memory)
[Figure: circular buffer with ptr2 (next to commit) and ptr1 (next available); entry fields: Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause]

Branch Misprediction Recovery
On mispredict:
– Roll back "next available" pointer to just after branch
– Reset use bits
– Flush mis-speculated instructions from pipelines
– Restart fetch on correct branch path
[Figure: reorder buffer with entry fields Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause; ptr2 (next to commit); ptr1 (next available) rolled back to just after the branch]
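The circular reorder buffer with its two pointers, in-order commit, and rollback on mispredict can be sketched as below. This is a simplified model under stated assumptions: the class and method names are illustrative, each entry keeps only an opcode and a done flag instead of the full field list, and `head`/`tail` stand for ptr2 (next to commit) and ptr1 (next available).

```python
class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0   # ptr2: next to commit
        self.tail = 0   # ptr1: next available
        self.count = 0

    def allocate(self, op):
        """Insert in program order; returns the entry index, or None if full."""
        if self.count == self.size:
            return None  # decode must stall
        idx = self.tail
        self.entries[idx] = {"op": op, "done": False}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return idx

    def complete(self, idx):
        self.entries[idx]["done"] = True  # may happen out of order

    def commit(self):
        """Retire finished instructions from the head, strictly in order."""
        committed = []
        while self.count and self.entries[self.head]["done"]:
            committed.append(self.entries[self.head]["op"])
            self.entries[self.head] = None
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return committed

    def rollback_after(self, branch_idx):
        """Mispredict: move 'next available' back to just after the branch."""
        new_tail = (branch_idx + 1) % self.size
        while self.tail != new_tail:
            self.tail = (self.tail - 1) % self.size
            self.entries[self.tail] = None  # reset the use bit
            self.count -= 1


rob = ReorderBuffer(4)
add, beq, mul, sub = (rob.allocate(op) for op in ("add", "beq", "mul", "sub"))
rob.complete(mul)           # finishes early, but cannot commit yet
print(rob.commit())         # nothing retires: "add" is still executing
rob.rollback_after(beq)     # beq was mispredicted: discard mul and sub
rob.complete(add)
rob.complete(beq)
print(rob.commit())         # add and beq retire in program order
```

Because nothing younger than the branch ever reached architectural state, rolling back the pointer is enough to discard the wrong-path work.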

Branch Misprediction in Pipeline
– Can have multiple unresolved branches in ROB
– Can resolve branches out-of-order
[Figure: PC → Fetch → Decode → Reorder Buffer → Commit, with Execute attached to the reorder buffer; Branch Prediction feeds the PC; when a branch completes execution ("Complete"), a mispredict sends kill signals through the pipeline and restarts fetch on the correct path]

Killing Speculative Instructions
Each instruction is tagged with a single "speculative" bit, carried throughout all pipelines.
Decode stage stalls if a second branch is encountered before the first speculative branch resolves.
When the speculative branch resolves:
– Prediction incorrect, kill all instructions tagged as speculative
– Prediction correct, clear speculative bit on all instructions
To allow speculation past multiple branches, add multiple bits per instruction indicating on which outstanding branches it is speculative
– speculation bits reclaimed when corresponding branch resolves and is oldest speculative branch in machine (manage allocation of speculation tag bits circularly)

Recovering Renaming Table
Take a snapshot of the register rename table at each predicted branch; recover the earlier snapshot if the branch is mispredicted.
[Figure: Rename Table with Rename Snapshots, Register File, Reorder buffer, functional units (FU, Load Unit, Store Unit), and Commit]
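Snapshot-based recovery of the rename table can be sketched as follows. This is a deliberately simplified model with illustrative names: physical registers are handed out from a counter rather than a free list, register reclamation is omitted, and snapshots belonging to branches younger than the mispredicted one are not cleaned up.

```python
class RenameTable:
    def __init__(self, num_arch_regs):
        # Initially architectural register i maps to physical register i.
        self.mapping = list(range(num_arch_regs))
        self.next_free = num_arch_regs  # simplification: no free list
        self.snapshots = {}

    def rename_dest(self, arch_reg):
        """Give a destination register a fresh physical register."""
        phys = self.next_free
        self.next_free += 1
        self.mapping[arch_reg] = phys
        return phys

    def snapshot(self, branch_id):
        """Copy the whole table when a branch is predicted."""
        self.snapshots[branch_id] = list(self.mapping)

    def resolve(self, branch_id, mispredicted):
        """Restore the copy on a mispredict; either way the snapshot is freed."""
        saved = self.snapshots.pop(branch_id)
        if mispredicted:
            self.mapping = saved


table = RenameTable(num_arch_regs=4)
table.rename_dest(1)             # instruction before the branch writes r1
table.snapshot("b1")             # branch predicted here
table.rename_dest(1)             # wrong-path instructions write r1 and r2
table.rename_dest(2)
table.resolve("b1", mispredicted=True)
print(table.mapping)             # mappings made after the branch are undone
```

Restoring a whole-table copy makes recovery a single step, at the cost of one table's worth of storage per unresolved branch.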