Electrical and Computer Engineering

Slides:



Advertisements
Similar presentations
Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Advertisements

Krste Asanovic Electrical Engineering and Computer Sciences
1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
2/28/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 8 Instruction-Level Parallelism – Part 1 Benjamin Lee Electrical and Computer Engineering Duke.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP II Steve Ko Computer Sciences and Engineering University at Buffalo.
February 28, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction.
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP III Steve Ko Computer Sciences and Engineering University at Buffalo.
March 11, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 14 - Advanced Superscalars Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP I Steve Ko Computer Sciences and Engineering University at Buffalo.
CS 152 Computer Architecture and Engineering Lecture 14 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
March 4, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )
CS 252 Graduate Computer Architecture Lecture 5: Instruction-Level Parallelism (Part 2) Krste Asanovic Electrical Engineering and Computer Sciences University.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
March 2, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
Arvind and Joel Emer Computer Science and Artificial Intelligence Laboratory M.I.T. Branch Prediction.
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 7 CS252 Graduate Computer Architecture Spring 2014 Lecture 7: Branch Prediction and Load-Store Queues.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 9 Instruction-Level Parallelism – Part 2 Benjamin Lee Electrical and Computer Engineering Duke.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.
CS 152 Computer Architecture and Engineering Lecture 15 - Out-of-Order Memory, Complex Superscalars Review Krste Asanovic Electrical Engineering and Computer.
© Krste Asanovic, 2015CS252, Fall 2015, Lecture 7 CS252 Graduate Computer Architecture Spring 2014 Lecture 7: Advanced Out-of-Order Superscalar Designs.
Out-of-Order Execution & Register Renaming Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Asanovic/Devadas Spring.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
OOO Pipelines - II Smruti R. Sarangi IIT Delhi 1.
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
March 1, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.
CS203 – Advanced Computer Architecture ILP and Speculation.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Instruction-Level Parallelism and Its Dynamic Exploitation
Lecture: Out-of-order Processors
CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction John Wawrzynek Electrical Engineering.
/ Computer Architecture and Design
Smruti R. Sarangi IIT Delhi
PowerPC 604 Superscalar Microprocessor
CS252 Graduate Computer Architecture Spring 2014 Lecture 8: Advanced Out-of-Order Superscalar Designs Part-II Krste Asanovic
Dr. George Michelogiannakis EECS, University of California at Berkeley
Lecture: Out-of-order Processors
Lecture 6: Advanced Pipelines
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Smruti R. Sarangi IIT Delhi
Electrical and Computer Engineering
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste Asanovic Electrical Engineering.
Krste Asanovic Electrical Engineering and Computer Sciences
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Adapted from the slides of Prof
Krste Asanovic Electrical Engineering and Computer Sciences
Krste Asanovic Electrical Engineering and Computer Sciences
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Adapted from the slides of Prof
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 11 – Out-of-Order Execution Krste Asanovic Electrical Engineering.
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 – Branch Prediction and Advanced Out-of-Order Superscalars.
Lecture 9: Dynamic ILP Topics: out-of-order processors
Presentation transcript:

Electrical and Computer Engineering ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 10 Instruction-Level Parallelism – Part 3 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

ECE552 Administrivia 27 September – Homework #2 Due Assignment on web page. Teams of 2-3. Submit soft copies to Sakai. Use Piazza for questions 2 October – Class Discussion Roughly one reading per class. Do not wait until the day before! Srinivasan et al. “Optimizing pipelines for power and performance” Mahlke et al. “A comparison of full and partial predicated execution support for ILP processors” Palacharla et al. “Complexity-effective superscalar processors” Yeh et al. “Two-level adaptive training branch prediction” ECE 552 / CPS 550

ECE552 Administrivia 4 October – Midterm Exam 75 minutes, in-class Closed book, closed notes exam Performance metrics – performance, power, yield Technology – trends that changed architectural design History – Instruction sets (accumulator, stack, index, general-purpose) CISC – microprogramming, writing microprogram fragments Pipelining – Performance, hazards and ways to resolve them Instruction-level Parallelism – mechanisms to dynamically detect data dependences and to manage instruction flow (Scoreboard, Tomasulo, Physical Register File) Speculative Execution – exception handling, branch prediction Readings – High-level questions, not details ECE 552 / CPS 550

Tomasulo’s Implementation Load Unit FU Store < t, result > Ins# use exec op p1 src1 p2 src2 t1 t2 . tn Renaming Table & Register File Reorder Buffer Decode stage allocates instruction template (i.e., tag t) and stores tag in register file. When instruction completes, tag is de-allocated. ECE 552 / CPS 550

Tomasulo’s Structures Reorder Buffer (ROB) -- buffers in-flight instructions in program order -- supports in-order commit, precise exceptions -- e.g., instruction#, use, exec, op Reservation Stations -- tracks renamed source operands -- if operands ready, contains value (e.g., v1) -- if operands pending, contains tag (e.g., t1) -- may be combined with ROB (e.g., our example) -- may be distributed across functional units Renaming Table -- if write committed, points to register file (e.g., F1) -- if write pending, points to ROB entry (e.g., t1) -- Rename registers (e.g., F1) with ROB tags (e.g., t1) Register File -- contains architected state, committed values Common Data Bus (CDB) -- functional units broadcast computed values -- broadcast includes <tag, result> ECE 552 / CPS 550

Tomasulo’s Pipeline Fetch Dispatch -- Decode instruction -- Stall if structural hazard in ROB -- Allocate ROB entry and rename using ROB tags -- Read source operands when they are ready Execute -- Issue instruction when all operands ready -- Instructions may issue out-of-order Complete -- Stall if structural hazard on the common data bus -- Broadcast tag and completed result -- Mark ROB entry as complete -- Instructions may complete out-of-order Retire -- Stall if oldest instruction (head of ROB) not complete -- Handle any interrupts -- Write-back value for oldest instruction to register file or mem -- Free ROB entry -- Instructions retire in-order ECE 552 / CPS 550

Physical Register File Tomasulo Performance Limitations -- Too much data movement on common data bus -- Multi-input multiplexors, long buses impact clock frequency Alternative Approach to Register Renaming -- Eliminate architectural register file (e.g., R0-R31, F1-F8) -- Add larger physical register file, which holds all values (e.g., P0-Pn, n>>32) -- Modify rename table to map architected registers to physical registers -- Add free list to manage unallocated physical registers -- Reorder buffer tracks ready operands, supports in-order retire, supports free list management (but does not rename) ECE 552 / CPS 550

Lifetime of Physical Registers -- Architected registers are those defined by the instruction set architecture -- Register renaming can be implemented in two ways -- Rename with buffer tags -- insert speculatively computed values into ROB -- Rename with physical registers – hold both committed and speculative values With Architected Registers With Physical Registers 1. ld R1, (R3) ld P1, (Px) 2. add R3, R1, 4 add P2, P1, 4 3. sub R6, R7, R9 sub P3, Py, Pz 4. add R3, R3, R6 add P4, P2, P3 5. ld R6, (R1) ld P5, (P1) 6. add R6, R6, R3 add P6, P5, P4 7. st R6, (R1) st P6, (P1) 8. ld R6, (R11) ld P7, (Pw) -- Every instruction’s destination register R renamed to physical register P -- When do we reuse physical register? When next write of same architected register commits. Example: Reuse P2 when instruction 4 commits ECE 552 / CPS 550

Physical Register Management Rename Table -- Maps architected registers (e.g., R*) to physical registers (e.g., P*) -- Rename table identifies physical register P* that contains the value of R* -- Example: MIPS has 32 architected registers so rename table has 32 entries. -- Microarchitecture might have N >> 32 physical registers. Physical Registers -- Contain N>>32 registers that can hold committed data. -- Committed data is present when “p” flag is set. Free List -- List of physical registers available for renaming. -- Stall pipeline if there are insufficient physical registers Reorder Buffer -- Issue logic checks buffer to determine if operand values present -- Tracks sequence of renamings for the same architected register ECE 552 / CPS 550

Physical Register Management -- After the fetch stage, instruction enters decode stage. -- Decode stage (1) extracts architected registers, (2) renames to physical registers, and (3) inserts instruction into reorder buffer. Every instruction’s destination register is renamed! Eliminates WAW/WAR hazards. Renaming for instruction “op Rd, R1, R2” requires the following steps: Lookup source registers (R1,R2) in rename table. Insert corresponding physical register (P<y>,P<z>) into ROB. ii. If values for P<y>, P<z> are present, set “p” flag in ROB. iii. Lookup destination register (Rd) in rename table. Suppose Rd already renamed to P<w>. Because we rename Rd in step iii, this is the last instruction for which P<w> is valid. Denote P<w> as last physical register (LPRd) in ROB. Required for managing free list. iv. Rename Rd to P<x>, which is next available register from free list. Denote as current physical register (PRd) in ROB. Issue logic sends instruction to execution units when both source registers present ECE 552 / CPS 550

Renaming R1 to P0 Rename Table Physical Regs Free List 1. ld R1, 0(R3) Pn P1 P2 P3 P4 Physical Regs p <R1> P8 Free List P0 P1 P3 P2 P4 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB PR1/2: src physical regs p1/2: set when physical reg values are present Rd: dest architected reg LPRd: last dest physical reg PRd: new dest physical reg X ld p P7 R1 P0 P8 ECE 552 / CPS 550

Renaming R3 to P1 Rename Table Physical Regs Free List 1. ld R1, 0(R3) Pn P1 P2 P3 P4 Physical Regs p <R1> P8 Free List P0 P1 P3 P2 P4 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 P1 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB PR1/2: src physical regs p1/2: set when physical reg values are present Rd: dest architected reg LPRd: last dest physical reg PRd: new dest physical reg x ld p P7 R1 P0 P8 x add P0 R3 P1 P7 ECE 552 / CPS 550

Renaming R6 to P3 Rename Table Physical Regs Free List 1. ld R1, 0(R3) Pn P1 P2 P3 P4 Physical Regs p <R1> P8 Free List P0 P1 P3 P2 P4 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 P1 P3 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB PR1/2: src physical regs p1/2: set when physical reg values are present Rd: dest architected reg LPRd: last dest physical reg PRd: new dest physical reg x ld p P7 R1 P0 P8 x add P0 R3 P1 P7 x sub p P6 p P5 R6 P3 P5 ECE 552 / CPS 550

Renaming R3 to P2 Rename Table Physical Regs Free List 1. ld R1, 0(R3) Pn P1 P2 P3 P4 Physical Regs p <R1> P8 Free List P0 P1 P3 P2 P4 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 P1 P2 P3 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB PR1/2: src physical regs p1/2: set when physical reg values are present Rd: dest architected reg LPRd: last dest physical reg PRd: new dest physical reg x ld p P7 R1 P0 P8 x add P0 R3 P1 P7 x sub p P6 p P5 R6 P3 P5 x add P1 P3 R3 P2 P1 ECE 552 / CPS 550

Renaming R6 to P4 Rename Table Physical Regs Free List 1. ld R1, 0(R3) Pn P1 P2 P3 P4 Physical Regs p <R1> P8 Free List P0 P1 P3 P2 P4 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 P1 P2 P3 P4 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB PR1/2: src physical regs p1/2: set when physical reg values are present Rd: dest architected reg LPRd: last dest physical reg PRd: new dest physical reg x ld p P7 R1 P0 P8 x add P0 R3 P1 P7 x sub p P6 p P5 R6 P3 P5 x add P1 P3 R3 P2 P1 x ld P0 R6 P4 P3 ECE 552 / CPS 550

Physical Register Management Rename Table <R6> P5 <R7> P6 <R3> P7 P0 Pn P1 P2 P3 P4 Physical Regs p <R1> P8 Free List P0 P1 P3 P2 P4 p <R1> 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 P1 P2 P8 P3 P4 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB Execute & Commit x ld p P7 R1 P0 x ld p P7 R1 P0 x P8 x add P0 R3 P1 P7 x sub p P6 p P5 R6 P3 P5 x add P1 P3 R3 P2 P1 x ld P0 R6 P4 P3 ECE 552 / CPS 550

Physical Register Management Rename Table <R6> P5 <R7> P6 <R3> P7 P0 Pn P1 P2 P3 P4 Physical Regs p P8 Free List P0 P1 P3 P2 P4 p <R1> p <R3> 1. ld R1, 0(R3) 2. add R3, R1, 4 3. sub R6, R7, R6 4. add R3, R3, R6 5. ld R6, 0(R1) P0 P1 P2 P8 P7 P3 P4 op p1 PR1 p2 PR2 ex use Rd PRd LPRd ROB x x ld p P7 R1 P0 P8 Execute & Commit x add P0 R3 P1 x add P0 R3 P1 x P7 x sub p P6 p P5 R6 P3 P5 x add P1 P3 R3 P2 P1 x ld P0 R6 P4 P3 ECE 552 / CPS 550

Active Instruction Window in ROB (older instructions) … ld r1, (r3) add r3, r1, r2 sub r6, r7, r9 add r3, r3, r6 ld r6, (r1) add r6, r6, r3 st r6, (r1) Commit Fetch Cycle (t + 1) Execute ld r1, (r3) add r3, r1, r2 sub r6, r7, r9 add r3, r3, r6 ld r6, (r1) add r6, r6, r3 st r6, (r1) (newer instructions) Cycle (t) ECE 552 / CPS 550

Superscalar Register Renaming -- During decode, instruction is allocated new physical register for dest -- Instruction’s source registers renamed to physical register with newest value -- Execution unit only sees physical register numbers. -- Does this work? Inst 1 Inst 2 Op Src1 Src2 Dest Op Src1 Src2 Dest Update Mapping Rename Table Read Addresses Register Free List Write Ports Read Data Op PSrc1 PSrc2 PDest Op PSrc1 PSrc2 PDest ECE 552 / CPS 550

Superscalar Register Renaming Inst 1 Inst 2 Op Src1 Src2 Dest Op Src1 Src2 Dest Update Mapping Rename Table Read Addresses Register Free List Write Ports =? =? Read Data Must check for RAW hazards between instructions issuing in same cycle. If RAW hazard, pass Inst1’s Pdest to Inst2’s PSrc1 or PSrc2. Op PSrc1 PSrc2 PDest Op PSrc1 PSrc2 PDest ECE 552 / CPS 550

When can we execute the load? Memory Dependencies st r1, j(r2) ld r3, k(r4) When can we execute the load? ECE 552 / CPS 550

In-Order Memory Queue Execute all loads and stores in program order Load and store cannot leave ROB and commit architected state until all previous loads and stores have completed execution Can still execute loads speculatively and out-of-order with respect to other instructions. ECE 552 / CPS 550

Out-of-order Loads Conservative out-of-order load execution st r1, j(r2) ld r3, k(r4) -- Split execution of store instruction into two phases -- Address calculation and data write -- Can execute load before store if addresses known and j(r2) != k(r4) -- Each load address compared with addresses of previous uncommitted stores -- Don’t execute load if any previous store address not known ECE 552 / CPS 550

Speculative Loads st r1, j(r2) ld r3, k(r4) -- Guess that j(r4) != k(r2) -- Execute load before store address is known -- Need to hold all completed but uncommitted load/store addresses in program order -- Later, if we find r4 == r2, squash load and all following instructions -- Large penalty for inaccurate address speculation ECE 552 / CPS 550

Speculative Stores Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data ECE 552 / CPS 550

Speculative Store Buffer Load Address L1 Data Cache Tag Data S V Tags Data Tag Data S V Tag Data S V Tag Data S V Tag Data S V Tag Data S V Store Commit Path Load Data -- On store execute: mark entry valid (V) and speculative (S), save data and tag of instruction -- On store commit: clear speculative bit and eventually move data to cache -- On store abort: clear valid bit ECE 552 / CPS 550

Speculative Store Buffer Load Address L1 Data Cache Tag Data S V Tags Data Tag Data S V Tag Data S V Tag Data S V Tag Data S V Tag Data S V Store Commit Path Load Data -- If data in both store buffer and cache, which should we use? -- Speculative store buffer -- If same address in store buffer twice, which should we use? -- Youngest store that is older than load ECE 552 / CPS 550

Speculative Datapath kill PC Fetch Decode & Rename Reorder Buffer Update predictors Branch Prediction Branch Resolution kill PC Fetch Decode & Rename Reorder Buffer Commit Reg. File Branch Unit ALU MEM Store Buffer D$ Execute ECE 552 / CPS 550

Branch Prediction Motivation Hardware Support -- Branch penalties limit performance of deeply pipelined processors -- Modern branch predictors have high accuracy (>95%) and can significantly reduce branch penalties Hardware Support -- Prediction structures: branch history tables, branch target buffer, etc. -- Mispredict recovery mechanisms: -- Separate instruction execution and instruction commit -- Kill instructions following branch in pipeline -- Restore architectural state to correct path of execution ECE 552 / CPS 550

Static Branch Prediction JZ JZ backward 90% forward 50% On average, probability a branch is taken is 60-70%. But branch direction is a good predictor. ISA can attach preferred direction semantics to branches (e.g., Motorola MC8810, bne0 prefers taken, beq0 prefers not taken). ISA can allow choice of statically predicted direction (e.g., Intel IA-64). Can be 80% accurate. ECE 552 / CPS 550

Dynamic Branch Prediction Learn from past behavior Temporal Correlation -- The way a branch resolves may be a good predictor of the way it will resolve at the next execution Spatial Correlation -- Several branches may resolve in a highly correlated manner (preferred path of execution in the application) ECE 552 / CPS 550

2-bit Branch Predictor Use two-bit saturating counter. Changes prediction after two consecutive mistakes. ECE 552 / CPS 550

Branch History Table (BHT) Fetch PC k BHT Index 2k-entry BHT, 2 bits/entry Taken/¬Taken? I-Cache Opcode offset Instruction Branch? Target PC + BHT is an array of 2-bit branch predictors, indexed by branch PC 4K-entry branch history table, 80-90% accurate ECE 552 / CPS 550

Two-Level Branch Prediction Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct) k Fetch PC 2-bit global branch history shift register Shift in Taken/¬Taken results of each branch Taken/¬Taken? ECE 552 / CPS 550

Branch Target Buffer (BTB) predicted BPb target Branch Target Buffer (2k entries) IMEM k PC target BP BHT only predicts branch direction (taken, not taken). Cannot redirect instruction flow until after branch target determined. Store target with branch predictions. During fetch – if (BP == taken) then nPC=target, else nPC=PC+4 Later – update BHT, BTB ECE 552 / CPS 550

Branch Target Buffer (BTB) – v2 I-Cache PC k Valid valid Entry PC = match predicted target target PC Keep both branch PC and target PC in the BTB If match fails, PC+4 is fetched Only taken branches and jumps held in BTB ECE 552 / CPS 550

Mispredict Recovery In-order execution No instruction following branch can commit before branch resolves Kill all instructions in pipeline behind mis-predicted branch Out-of-order execution Multiple instructions following branch can complete before one branch resolves ECE 552 / CPS 550

In-order Commit In-order Out-of-order In-order Commit Fetch Decode Reorder Buffer Kill Kill Kill Exception? Execute Inject handler PC -- Instructions fetched, decoded in-order (entering the reorder buffer -- ROB) -- Instructions executed out-of-order -- Instructions commit in-order (write back to architectural state) -- Temporary storage needed in ROB to hold results before commit ECE 552 / CPS 550

Branch Misprediction in Pipeline Inject correct PC Branch Prediction Branch Resolution Kill Kill Kill Commit PC Fetch Decode Reorder Buffer Complete Execute -- Can have multiple unresolved branches in reorder buffer -- ROB -- Can resolve branches out-of-order by killing all instructions in ROB that follow a mispredicted branch ECE 552 / CPS 550

Mispredict Recovery t v Rename Table t v Rename Snapshots Register File t v t v r1 r2 Ins# use exec op p1 src1 p2 src2 pd dest data t1 t2 . tn Ptr2 next to commit rollback next available Ptr1 next available Reorder Buffer Commit Load Unit Store Unit FU FU FU < t, result > Take snapshot of register rename table at each predicted branch, recover earlier snapshot if branch mispredicted ECE 552 / CPS 550

Acknowledgements These slides contain material developed and copyright by Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) Alvin Lebeck (Duke) David Patterson (UCB) Daniel Sorin (Duke) ECE 552 / CPS 550