EECS 470 Lecture 7 Branches: Address prediction and recovery (And interrupt recovery too.)


Warning: Crazy times coming
HW2 due on Tuesday 2/3
  – Will do group formation and project materials on Tuesday too.
P3 is due on Sunday 2/8
  – It’s a lot of work (20 hours?)
Proposal is due on Tuesday (2/10)
  – It’s not a lot of work (1 hour?) to do the write-up, but you’ll need to meet with your group and discuss things. Don’t worry too much about getting this right. You’ll be allowed to change (we’ll meet the following Friday). Just a line in the sand.
HW3 is due on Thursday 2/12
  – It’s a fair bit of work (3 hours?)
  – Answers will be posted right after class.
20 minute group meetings on Friday 2/13 (rather than in lab)
Midterm is on Thursday 2/12 in the evening (6-8pm)
  – Exam Q&A on the weekend before (time/location TBA)
  – Q&A in class 2/12
  – Best way to study is look at old exams (posted on-line!)

Last time: Covered branch predictors
  – Direction
  – Address

General speculation
Control speculation
  – “I think this branch will go to address 90004”
Data speculation
  – “I’ll guess the result of the load will be zero”
Memory conflict speculation
  – “I don’t think this load conflicts with any preceding store.”
Error speculation
  – “I don’t think there were any errors in this calculation”

Speculation in general
Need to be 100% sure on final correctness!
  – So need a recovery mechanism
  – Must make forward progress!
Want to speed up overall performance
  – So recovery cost should be low or expected rate of occurrence should be low.
  – There can be a real trade-off on accuracy, cost of recovery, and speedup when correct.
Should keep the worst case in mind…

Precise Interrupts and branches via the Reorder Buffer
[Figure: pipeline with stages IF, ID, Alloc, Sched, EX, MEM, then CT through the ROB into the ARF; stages are labeled in-order or any order.]
Alloc
  – Allocate result storage at Tail
Sched
  – Get inputs (ROB T-to-H then ARF)
  – Wait until all inputs ready
WB
  – Write results/fault to ROB
  – Indicate result is ready
CT
  – Wait until Head is done
  – If fault, initiate handler
  – Else, write results to ARF
  – Deallocate entry from ROB
Reorder Buffer (ROB)
  – Circular queue of spec state, with Head and Tail pointers
  – May contain multiple definitions of same register
  – Entry fields: PC, Dst regID, Dst value, Except?
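The allocate, write-back, and commit steps above can be sketched as a small Python model. This is a minimal sketch, not the course's implementation: the names `RobEntry`, `ReorderBuffer`, `alloc`, `writeback`, and `commit` are hypothetical, a `deque` stands in for the circular queue with Head and Tail pointers, and the PCs and the initial values of f1 and r4 are made up for illustration. The register values for f2, f3, r2, and r3 follow the lecture's example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    pc: int
    dst: str                      # destination register ID
    value: Optional[int] = None   # result, valid once done is True
    fault: bool = False           # the "Except?" field
    done: bool = False            # result or fault has been written back


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()    # left end is the head (oldest), right end the tail

    def alloc(self, pc, dst):
        """Allocate an entry at the tail; returns None if the ROB is full."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(pc, dst)
        self.entries.append(entry)
        return entry

    def writeback(self, entry, value=None, fault=False):
        """Record a result or a fault; this may happen in any order."""
        entry.value = value
        entry.fault = fault
        entry.done = True

    def commit(self, arf):
        """Try to retire the head entry, in program order."""
        if not self.entries or not self.entries[0].done:
            return "stall"
        head = self.entries[0]
        if head.fault:
            self.entries.clear()  # discard all speculative state
            return "fault"        # the handler would start here
        arf[head.dst] = head.value
        self.entries.popleft()
        return "retired"


# f1 and r4 initial values and all PCs are assumptions for illustration.
arf = {"f1": 0, "f2": 6, "f3": 6, "r2": 6, "r3": 5, "r4": 0}
rob = ReorderBuffer(size=4)
div = rob.alloc(pc=0, dst="f1")   # f1 = f2 / f3
add = rob.alloc(pc=4, dst="r3")   # r3 = r2 + r3
sub = rob.alloc(pc=8, dst="r4")   # r4 = r3 - r2

rob.writeback(add, value=11)      # younger instructions finish first
rob.writeback(sub, value=5)
first = rob.commit(arf)           # head (f1) is not done yet
rob.writeback(div, fault=True)    # the divide reports a fault
second = rob.commit(arf)          # fault reaches the head
```

In this sketch the results for r3 and r4 never reach the ARF, because the faulting divide is older than both and the ROB is emptied when it reaches the head.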

Reorder Buffer Example
Code sequence:
  f1 = f2 / f3
  r3 = r2 + r3
  r4 = r3 - r2
Initial conditions:
  - reorder buffer empty
  - f2 = f3 = r2 = 6
  - r3 = 5
ROB contents over time (H = head, T = tail; one snapshot per line, oldest entry first):
  – regID: f1, result: ?, Except: ?
  – regID: f1, result: ?, Except: ? | regID: r3, result: ?, Except: ?
  – regID: f1, result: ?, Except: ? | regID: r3, result: 11, Except: n | regID: r4, result: ?, Except: ?
In each snapshot the figure also shows a slot outside the head/tail range holding regID: r8, result: 2, Except: n.

Reorder Buffer Example
Code sequence:
  f1 = f2 / f3
  r3 = r2 + r3
  r4 = r3 - r2
Initial conditions:
  - reorder buffer empty
  - f2 = f3 = r2 = 6
  - r3 = 5
ROB contents over time (H = head, T = tail; one snapshot per line, oldest entry first):
  – regID: f1, result: ?, Except: ? | regID: r3, result: 11, Except: n | regID: r4, result: 5, Except: n
  – regID: f1, result: ?, Except: y | regID: r3, result: 11, Except: n | regID: r4, result: 5, Except: n
  – regID: f1, result: ?, Except: y | regID: r3, result: 11, Except: n | regID: r4, result: 5, Except: n
The figure also shows a slot outside the head/tail range holding regID: r8, result: 2, Except: n.

Reorder Buffer Example
Code sequence:
  f1 = f2 / f3
  r3 = r2 + r3
  r4 = r3 - r2
Initial conditions:
  - reorder buffer empty
  - f2 = f3 = r2 = 6
  - r3 = 5
ROB contents over time (H = head, T = tail):
  – empty (H and T at the same slot)
  – first inst of fault handler

There is more complexity here
Rename table needs to be cleared
  – Everything is in the ARF
  – Really do need to finish everything which was before the faulting instruction in program order.
What about branches?
  – Would need to drain everything before the branch. Why not just squash everything that follows it?

And while we’re at it… Does the ROB replace the RS? –Is this a good thing? Bad thing?

ROB
  – ROB is an in-order queue where instructions are placed.
  – Instructions complete (retire) in-order
  – Instructions still execute out-of-order
  – Still use RS
      Instructions are issued to RS and ROB at the same time
      Rename is to ROB entry, not RS.
      When execution is done, the instruction leaves the RS
  – Only when all instructions before it in program order are done does the instruction retire.

Adding a Reorder Buffer

Tomasulo Data Structures (Timing Free Example, “P6 scheme”)
Map Table: columns Reg, Tag; rows r0, r1, r2, r3, r4
Reservation Stations (RS): columns T, FU, busy, op, RoB, T1, T2, V1, V2
CDB: columns T, V
ARF: columns Reg, V; rows r0, r1, r2, r3, r4
Reorder Buffer (RoB): columns RoB Number, Dest. Reg., Value
Instructions:
  r0=r1*r2
  r1=r2*r3
  Branch if r1=0
  r0=r1+r1
  r2=r2+1
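As a rough illustration of how dispatch would fill these structures when rename is to the RoB entry, the sketch below walks the five instructions through a map table. It is a simplified model under stated assumptions: the function name `dispatch` and the tuple encoding of instructions are hypothetical, RoB numbers start at 0, and no instruction completes or retires while the others are being dispatched. A tag of `None` means the operand is read from the ARF.

```python
def dispatch(program, num_regs=5):
    """Rename each instruction to a RoB entry, in program order."""
    map_table = {f"r{i}": None for i in range(num_regs)}  # None: value is in the ARF
    rob = []   # rob[n] holds the destination register of RoB entry n
    rs = []    # (RoB number, source tags) for each dispatched instruction
    for dst, srcs in program:
        tags = [map_table[s] for s in srcs]  # look up sources before renaming dst
        rob_num = len(rob)
        rob.append(dst)
        rs.append((rob_num, tags))
        if dst is not None:                  # branches have no destination register
            map_table[dst] = rob_num
    return map_table, rob, rs


program = [
    ("r0", ["r1", "r2"]),   # r0 = r1 * r2
    ("r1", ["r2", "r3"]),   # r1 = r2 * r3
    (None, ["r1"]),         # branch if r1 == 0
    ("r0", ["r1", "r1"]),   # r0 = r1 + r1
    ("r2", ["r2"]),         # r2 = r2 + 1
]

map_table, rob, rs = dispatch(program)
```

Here the branch and the second write to r0 both wait on RoB entry 1 (the producer of r1), and the map table ends up pointing r0 at the later of its two in-flight definitions.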

Review Questions
Could we make this work without the RS?
  – If so, why do we do that?
Why is it important to retire in order?
Why must branches wait until retirement before they announce their mispredict?
  – Any other ways to do this?

More review questions
1. What is the purpose of the RoB?
2. Why do we have both a RoB and a RS?
  – Yes, that was pretty much on the last page…
3. Misprediction
  a) When do we resolve a misprediction?
  b) What happens to the main structures (RS, RoB, ARF, Rename Table) when we mispredict?
4. What is the whole purpose of OoO execution?

And yet more review questions!
1. What is the purpose of the RoB?
2. Why do we have both a RoB and a RS?
3. Misprediction
  a) When do we resolve a misprediction?
  b) What happens to the main structures (RS, RoB, ARF, Rename Table) when we mispredict?
4. What is the whole purpose of OoO execution?

When an instruction is dispatched how does it impact each major structure? Rename table? ARF? RoB? RS?

When an instruction completes execution how does it impact each major structure? Rename table? ARF? RoB? RS?

When an instruction retires how does it impact each major structure? Rename table? ARF? RoB? RS?

Topic change
Why on earth are we doing this?
  – Why do we think it helps?
Homework 2 problems 5 and 6 made the argument.
  – Only need to obey true data dependencies. Huge speedup potential.

Optimizing CPU Performance
Golden Rule: $t_{CPU} = N_{inst} \times CPI \times t_{CLK}$
Given this, what are our options
  – Reduce the number of instructions executed
  – Reduce the cycles to execute an instruction
  – Reduce the clock period
Our first focus: Reducing CPI
  – Approach: Instruction Level Parallelism (ILP)
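To make the Golden Rule concrete, the sketch below evaluates it for one made-up workload. The function name `cpu_time` and all three input numbers are assumptions chosen only for illustration; they do not come from the lecture.

```python
def cpu_time(n_inst, cpi, t_clk):
    """Golden Rule: execution time = instructions * cycles/instruction * seconds/cycle."""
    return n_inst * cpi * t_clk


# Hypothetical workload: 1 billion instructions on a 2 GHz clock (0.5 ns period).
baseline = cpu_time(n_inst=1e9, cpi=1.5, t_clk=0.5e-9)   # about 0.75 s
with_ilp = cpu_time(n_inst=1e9, cpi=1.0, t_clk=0.5e-9)   # about 0.50 s
speedup = baseline / with_ilp                            # about 1.5x
```

With instruction count and clock period held fixed, the speedup is just the ratio of the two CPI values, which is why reducing CPI is a natural first target.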

Why ILP?
[Figure: side-by-side comparison, contents not recoverable]
Requirements
  – Parallelism
  – Large window
  – Limited control deps
  – Eliminate “false” deps
  – Find run-time deps

How Much ILP is There? (Chapter 3.10)

How Large Must the “Window” Be?

ALU Operation GOOD, Branch BAD
Expected Number of Branches Between Mispredicts
$E(X) \approx 1/(1-p)$
E.g., $p = 95\%$, $E(X) \approx 20$ brs, 100-ish insts
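The estimate above is easy to check numerically. In this sketch the function name `branches_between_mispredicts` is hypothetical, and the figure of 5 instructions per branch is an assumption inferred from the slide's "20 brs, 100-ish insts"; real code varies.

```python
def branches_between_mispredicts(p):
    """E(X) ~ 1 / (1 - p), where p is the per-branch prediction accuracy."""
    return 1.0 / (1.0 - p)


INSTS_PER_BRANCH = 5  # assumed ratio, implied by "20 brs, 100-ish insts"

branches = branches_between_mispredicts(0.95)   # about 20 branches
window = branches * INSTS_PER_BRANCH            # about 100 instructions

# Accuracy matters a lot near the top: 99% gives about 100 branches.
branches_99 = branches_between_mispredicts(0.99)
```

Because the expectation grows as 1/(1-p), each step in accuracy near 100% extends the useful speculation window far more than the same step at lower accuracy.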

How Accurate are Branch Predictors?

Impact of Physical Storage Limitations
Each instruction “in flight” must have storage for its result
  – Really worse than this because of misspeculation…

Registers GOOD, Memory BAD
Benefits of registers
  – Well described deps
  – Fast access
  – Finite resource
Memory loses these benefits for flexibility
  *p = …
  *q = …
  … = *p ?
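The pointer sequence above can be mimicked with a list standing in for memory and integer indices standing in for the pointers p and q. This is only an illustrative sketch: the function name `store_store_load` and the stored values 1 and 2 are assumptions, since the slide elides the actual values.

```python
def store_store_load(mem, p, q):
    """Model of: *p = 1; *q = 2; return *p."""
    mem[p] = 1      # *p = ...
    mem[q] = 2      # *q = ...
    return mem[p]   # ... = *p ?


no_alias = store_store_load([0, 0], p=0, q=1)   # p and q differ: load sees 1
alias = store_store_load([0, 0], p=0, q=0)      # p and q match: load sees 2
```

The same three operations give different results depending on whether the two addresses are equal, and that is known only once the addresses are computed at run time. Register names, by contrast, expose the dependence directly in the instruction.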

“Bottom Line” for an Ambitious Design