CS152 Computer Architecture and Engineering Lecture 18 Dynamic Scheduling (Cont), Speculation, and ILP.

Why issue in-order?
- In-order issue permits us to analyze the data flow of the program: we know which results flow to which subsequent instructions.
- If we issued out-of-order, we would confuse RAW and WAR hazards!
- This idea works perfectly well “in principle” with multiple instructions issued per clock:
  - Need to multi-port the “rename table” and be able to rename a sequence of instructions together.
  - Need to be able to issue to multiple reservation stations in a single cycle.
  - Need 2x read ports and x write ports in the register file.
- However, even with these enhancements, in-order issue can be a serious bottleneck when issuing multiple instructions.
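To make the role of the rename table concrete, here is a minimal Python sketch of in-order renaming. The function name `rename_in_order`, the tag names `T1`, `T2`, ... and the four-instruction program are illustrative assumptions, not part of the lecture; the point is only that walking instructions in program order lets each source pick up the tag of its most recent producer.

```python
def rename_in_order(instrs):
    """Rename destinations in program order (hypothetical helper).

    instrs: list of (op, dest, src1, src2) tuples using architectural names.
    Returns the same instructions with destinations replaced by fresh tags
    and sources replaced by the tag of their newest producer, if any.
    """
    table = {}      # architectural register -> tag of its newest producer
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(instrs, start=1):
        # Sources are looked up BEFORE the destination entry is overwritten,
        # so an instruction never reads its own result by mistake.
        s1 = table.get(src1, src1)
        s2 = table.get(src2, src2)
        tag = f"T{n}"
        table[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed


program = [
    ("ADDD", "F10", "F4", "F0"),   # reads the original F0
    ("DIVD", "F2", "F10", "F6"),   # RAW on F10: must see T1
    ("ADDD", "F0", "F4", "F6"),    # WAR on F0 with the first ADDD
    ("MULTD", "F8", "F0", "F2"),   # must see the NEW F0 (T3) and T2
]
renamed = rename_in_order(program)
for instr in renamed:
    print(instr)
```

After renaming, the third instruction writes `T3` rather than `F0`, so it can no longer clobber the value the first instruction still needs, while the true dependences (`T1`, `T2`, `T3`) are preserved.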

Now what about exceptions???
- Out-of-order commit really messes up our chance to get precise exceptions!
  - Register file contains results from later instructions while earlier ones have not completed yet.
  - What if we need to cause an exception on one of those early instructions?
- Need to “rollback” register file to a consistent state:
  - Recall: “precise” interrupt means that there is some PC such that all instructions before it have committed results and none after it have committed results.
- Technique for precise exceptions: in-order completion or commit
  - Must commit instruction results in same order as issue

HW support for precise interrupts
- Need HW buffer for results of uncommitted instructions: reorder buffer
  - 3 fields: instr, destination, value
  - Reorder buffer can be operand source => more registers, like RS
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - Once the instruction commits, its result is put into the register
  - Instructions commit
  - As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder]

Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch or interrupt flushes reorder buffer (sometimes called “graduation”)
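The issue, write-result, and commit steps can be sketched as a small Python model. This is a toy under stated assumptions: the class `ReorderBuffer`, its method names, and the result values `1.0`, `3.0`, `9.0` are made up for illustration, the execute step and reservation stations are left out, and a mispredicted branch is simply flagged by the caller.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: entries enter and leave in issue order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest entry at the left (head)
        self.regs = {}           # committed register state only

    def issue(self, instr, dest):
        """Step 1: allocate a slot, or return None to signal a stall."""
        if len(self.entries) >= self.size:
            return None
        entry = {"instr": instr, "dest": dest, "value": None,
                 "done": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Step 3: record the result; this may happen in any order."""
        entry["value"] = value
        entry["done"] = True
        entry["mispredicted"] = mispredicted

    def commit(self):
        """Step 4: retire finished instructions from the head only."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            retired.append(head["instr"])
            if head["mispredicted"]:
                self.entries.clear()   # flush all younger, speculative work
                break
            if head["dest"] is not None:
                self.regs[head["dest"]] = head["value"]
        return retired


rob = ReorderBuffer(size=4)
ld = rob.issue("LD F0,10(R2)", "F0")
add = rob.issue("ADDD F10,F4,F0", "F10")
rob.write_result(add, 3.0)            # younger instruction finishes first
blocked = rob.commit()                # head (LD) not done: nothing retires
rob.write_result(ld, 1.0)
in_order = rob.commit()               # both retire, oldest first

br = rob.issue("BNE F2,<...>", None)
spec = rob.issue("LD F4,0(R3)", "F4")
rob.write_result(spec, 9.0)           # speculative result waits in the buffer
rob.write_result(br, mispredicted=True)
after_flush = rob.commit()            # branch retires, younger load is dropped
print(blocked, in_order, after_flush, rob.regs)
```

Because registers are written only at commit, the speculative load after the mispredicted branch never reaches `rob.regs`; undoing it is just a matter of clearing the buffer.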

Tomasulo With Reorder buffer: (sequence of eight diagram slides)
[Figure: FP Op Queue feeding a Reorder Buffer (ROB1 oldest to ROB7 newest, with Dest, value, instruction, and Done? columns), Registers, Reservation Stations with Dest tags for the FP adders and FP multipliers, and buffers to and from memory. Annotations: “Resolve RAW memory conflict? (address in memory buffers)” and “Integer unit executes in parallel”.]

Instructions entering the reorder buffer, oldest first:
ROB1: LD F0,10(R2) (dest F0)
ROB2: ADDD F10,F4,F0 (dest F10)
ROB3: DIVD F2,F10,F6 (dest F2)
ROB4: BNE F2,<…> (no dest)
ROB5: LD F4,0(R3) (dest F4)
ROB6: ADDD F0,F4,F6 (dest F0)
ROB7: ST 0(R3),F4 (no dest)

Slide by slide:
1. LD F0,10(R2) issues to ROB1 (Done? N); memory buffer entry “1 10+R2”.
2. ADDD F10,F4,F0 issues to ROB2; reservation station entry “2 ADDD R(F4),ROB1”.
3. DIVD F2,F10,F6 issues to ROB3; reservation station entry “3 DIVD ROB2,R(F6)”.
4. BNE F2,<…>, LD F4,0(R3), and ADDD F0,F4,F6 issue; new reservation station entry “6 ADDD ROB5, R(F6)”; new memory buffer entry shown as “6 0+R3”.
5. ST 0(R3),F4 issues to ROB7; its value field holds the tag ROB5.
6. The load of F4 returns M[10]: the tag ROB5 is replaced by M[10] in the reorder buffer and in the waiting reservation station (“6 ADDD M[10],R(F6)”), and the 0+R3 memory buffer entry is gone.
7. ADDD F0,F4,F6 is marked Ex (executing) and no longer occupies a reservation station; BNE F2,<…> is still N.
8. Same state with the callout “What about memory hazards???”; LD F0,10(R2), ADDD F10,F4,F0, and DIVD F2,F10,F6 are all still N.

Memory Disambiguation: Handling RAW Hazards in memory
- Question: Given a load that follows a store in program order, are the two related? (Alternatively: is there a RAW hazard between the store and the load?)
  E.g.: st 0(R2),R5
        ld R6,0(R3)
- Can we go ahead and start the load early?
  - Store address could be delayed for a long time by some calculation that leads to R2 (divide?).
  - We might want to issue/begin execution of both operations in the same cycle.
- Two techniques:
  - No Speculation: we are not allowed to start the load until we know for sure that address 0(R2) ≠ 0(R3)
  - Speculation: we might guess at whether or not they are dependent (called “dependence speculation”) and use the reorder buffer to fix up if we are wrong.

Hardware Support for Memory Disambiguation
- Need buffer to keep track of all outstanding stores to memory, in program order.
  - Keep track of address (when it becomes available) and value (when it becomes available)
  - FIFO ordering: will retire stores from this buffer in program order
- When issuing a load, record current head of store queue (know which stores are ahead of you).
- When have address for load, check store queue:
  - If any store prior to load is waiting for its address, stall load.
  - If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
    - store value available => return value
    - store value not available => return ROB number of source
  - Otherwise, send out request to memory
- Actual stores commit in order, so no worry about WAR/WAW hazards through memory.
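The store-queue check described above can be sketched in Python as follows. The function `resolve_load`, the dictionary layout of a store entry, and the returned action names are assumptions made for this illustration. The sketch also assumes that when several older stores match the load address, the youngest one supplies the value, which the slide does not spell out.

```python
def resolve_load(load_addr, older_stores):
    """Decide what a load should do once its address is known.

    older_stores: stores ahead of the load, oldest first. Each is a dict
    with "addr" (int or None if not yet computed), "value" (None if not
    yet available), and "rob" (the store's reorder buffer number).
    Returns an (action, payload) pair.
    """
    # Non-speculative rule: any unknown store address stalls the load.
    for store in older_stores:
        if store["addr"] is None:
            return ("stall", None)

    # Associative lookup, youngest matching store first (assumption).
    for store in reversed(older_stores):
        if store["addr"] == load_addr:
            if store["value"] is not None:
                return ("forward", store["value"])
            return ("wait_on_rob", store["rob"])

    # No older store touches this address: go to memory.
    return ("memory", load_addr)


stores = [
    {"addr": 0x100, "value": 7.5, "rob": 1},
    {"addr": 0x10A, "value": None, "rob": 3},
]
print(resolve_load(0x100, stores))   # value forwarded from the store queue
print(resolve_load(0x10A, stores))   # must wait for ROB entry 3
print(resolve_load(0x200, stores))   # no conflict, request goes to memory
print(resolve_load(0x100, stores + [{"addr": None, "value": 1.0, "rob": 5}]))
```

The last call shows the conservative case: a single store with an unknown address is enough to hold the load back, which is exactly the cost that dependence speculation tries to avoid.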

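Memory Disambiguation:
[Figure: same datapath as the reorder buffer slides (FP Op Queue, Reorder Buffer, Registers, Reservation Stations, FP adders, FP multipliers, buffers to and from memory), with the annotations “Resolve RAW memory conflict? (address in memory buffers)” and “Integer unit executes in parallel”.]
Reorder buffer contents, oldest first, with the Done? flag:
- ST 0(R3), F4 (value <val 1>): Y
- LD F0,32(R2): N
- ST 10(R3), F5: N
- LD F4, 10(R3): N
Memory buffer entries as shown: “2 32+R2” and “4 ROB3”. The second entry appears to be the load of 10(R3) waiting on the earlier store to 10(R3) by its reorder buffer number.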

What about FETCH? Independent “Fetch” unit
[Figure: Instruction Fetch with Branch Prediction sends a Stream of Instructions To Execute to the Out-Of-Order Execution Unit, which returns Correctness Feedback On Branch Results.]
- Instruction fetch decoupled from execution
- Often issue logic (+ rename) included with Fetch

Branches must be resolved quickly for loop overlap!
- In our loop-unrolling example, we relied on the fact that branches were under control of the “fast” integer unit in order to get overlap!
  Loop: LD    F0 0 R1
        MULTD F4 F0 F2
        SD    F4 0 R1
        SUBI  R1 R1 #8
        BNEZ  R1 Loop
- What happens if the branch depends on the result of multd??
  - We completely lose all of our advantages!
  - Need to be able to “predict” branch outcome.
  - If we were to predict that the branch was taken, this would be right most of the time.
- Problem much worse for superscalar machines!

Prediction: Branches, Dependencies, Data
- Prediction has become essential to getting good performance from scalar instruction streams.
- We will discuss predicting branches. However, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions:
  - At what point does computation become a probabilistic operation + verification?
  - We are pretty close with control hazards already…
- Why does prediction work?
  - Underlying algorithm has regularities.
  - Data that is being operated on has regularities.
  - Instruction sequence has redundancies that are artifacts of the way that humans/compilers think about problems.
- Prediction => Compressible information streams?

Dynamic Branch Prediction
- Prediction could be “Static” (at compile time) or “Dynamic” (at runtime)
  - For our example, if we were to statically predict “taken”, we would only be wrong once each pass through loop
- Is dynamic branch prediction better than static branch prediction?
  - Seems to be. Still some debate to this effect
  - Today, lots of hardware being devoted to dynamic branch predictors.

Simple dynamic prediction: Branch Target Buffer (BTB)
- Address of branch indexes the table to get prediction AND branch address (if taken)
  - Must check for branch match now, since can’t use wrong branch address
  - Grab predicted PC from table since it may take several cycles to compute
- Update predicted PC when branch is actually resolved
- Return instruction addresses predicted with stack
[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC column; on a match the table supplies the Predicted PC and a taken/untaken prediction.]
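A minimal Python sketch of a direct-mapped branch target buffer is given below. The class `BranchTargetBuffer`, the 16-entry size, the 4-byte instruction width, and the policy of removing an entry when its branch turns out not taken are all assumptions for illustration; the slide only fixes the idea of an indexed table with a branch-PC match check.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each slot holds (branch_pc, predicted_pc)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.table = [None] * num_entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits carry no info.
        return (pc >> 2) % self.num_entries

    def predict(self, pc):
        """Return the predicted next PC for the instruction being fetched."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:   # the "=?" branch-PC match
            return slot[1]                       # predict taken, use stored target
        return pc + 4                            # no match: fall through

    def update(self, pc, taken, target):
        """Called when the branch is actually resolved."""
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None               # stop predicting taken


btb = BranchTargetBuffer(num_entries=16)
first_guess = btb.predict(0x40)          # cold table: fall through
btb.update(0x40, taken=True, target=0x10)
second_guess = btb.predict(0x40)         # now predicts the stored target
alias_guess = btb.predict(0x80)          # same slot, different branch PC
print(hex(first_guess), hex(second_guess), hex(alias_guess))
```

The `alias_guess` lookup lands in the same slot as `0x40` but fails the match check, which is why the table stores the branch PC and not only the target.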

Dynamic Branch Prediction
- Performance = ƒ(accuracy, cost of misprediction)
  - Misprediction => Flush Reorder Buffer
- Branch History Table: lower bits of PC address index a table of 1-bit values
  - Says whether or not branch taken last time
  - No address check
- Problem: in a loop, a 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  - End of loop case, when it exits instead of looping as before
  - First time through loop on next time through code, when it predicts exit instead of looping
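The two-mispredictions-per-loop behavior is easy to check with a short simulation. The Python sketch below models a single 1-bit entry for one loop branch; the function name and the trace of nine taken outcomes followed by one exit are assumptions chosen to match the slide's average of 9 iterations.

```python
def mispredictions_1bit(outcomes, last_outcome=False):
    """Count mispredictions of a 1-bit predictor for one branch.

    The predictor simply guesses that the branch repeats what it did
    last time. outcomes is a list of booleans (True = taken).
    """
    misses = 0
    for taken in outcomes:
        if taken != last_outcome:
            misses += 1
        last_outcome = taken     # remember only the most recent outcome
    return misses


one_pass = [True] * 9 + [False]      # loop back 9 times, then exit
trace = one_pass * 3                 # the loop is entered three times
total_misses = mispredictions_1bit(trace)
print(total_misses, "mispredictions over", len(trace), "branches")
```

Each pass through the loop costs one miss at the exit and one more on re-entry, so three passes give six mispredictions out of thirty branches.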

Dynamic Branch Prediction
- Solution: 2-bit scheme where the prediction changes only if it mispredicts twice (Figure 4.13, p. 264)
- Red: stop, not taken
- Green: go, taken
- Adds hysteresis to decision making process
[Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; edges labeled T (taken) and NT (not taken).]
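As a rough illustration of the hysteresis, the Python sketch below uses a 2-bit saturating counter, one common way to realize a four-state scheme of this kind. It is an assumption that this matches the exact state diagram in the cited figure; the function name and the loop trace (nine taken outcomes, then one exit) are also made up for the example.

```python
def mispredictions_2bit(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter for one branch.

    counter ranges over 0..3; values 2 and 3 predict taken, 0 and 1
    predict not taken. outcomes is a list of booleans (True = taken).
    """
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


one_pass = [True] * 9 + [False]      # loop back 9 times, then exit
trace = one_pass * 3
warm_misses = mispredictions_2bit(trace, counter=3)   # starts strongly taken
cold_misses = mispredictions_2bit(trace, counter=0)   # starts strongly not taken
print(warm_misses, cold_misses)
```

With a warmed-up counter the loop exit costs one miss per pass and the re-entry is predicted correctly, because a single not-taken outcome only moves the counter from 3 to 2.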

BHT Accuracy
- Mispredict because either:
  - Wrong guess for that branch
  - Got branch history of wrong branch when indexing the table
- With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- 4096 entries about as good as an infinite table (in Alpha 21164)

Correlating Branches
- Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch
- Two possibilities; current branch depends on:
  - Last m most recently executed branches anywhere in program. Produces a “GA” (for “global address”) in the Yeh and Patt classification (e.g. GAg)
  - Last m most recent outcomes of same branch. Produces a “PA” (for “per address”) in same classification (e.g. PAg)
- Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
  - A single history table shared by all branches (appends a “g” at end), indexed by history value.
  - Address is used along with history to select table entry (appends a “p” at end of classification)
  - If only portion of address used, often appends an “s” to indicate “set-indexed” tables (e.g. GAs)

Correlating Branches
- For instance, consider global history, set-indexed BHT. That gives us a GAs history table.
- (2,2) GAs predictor
  - First 2 means that we keep two bits of history
  - Second 2 means that we have 2-bit counters in each slot
  - Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction
  - Note that the original two-bit counter solution would be a (0,2) GAs predictor
  - Note also that aliasing is possible here...
[Figure: the branch address and a 2-bit global branch history register select one of the 2-bits-per-branch predictors; each slot is a 2-bit counter that supplies the prediction.]
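The (m,2) idea can be sketched in Python as a table of 2-bit counters selected by a few address bits and the global history. The class `CorrelatingPredictor`, the 4 index bits, the initial counter value of 1, and the alternating test branch are assumptions for illustration only. Setting `history_bits=0` gives the plain two-bit counter table, matching the (0,2) remark above.

```python
class CorrelatingPredictor:
    """(m,2) predictor sketch: m global history bits, 2-bit counters."""

    def __init__(self, history_bits=2, index_bits=4):
        self.history_bits = history_bits
        self.index_bits = index_bits
        self.history = 0     # outcomes of the last m branches, newest in bit 0
        # One row per address index, one counter per history pattern.
        self.counters = [[1] * (1 << history_bits)
                         for _ in range(1 << index_bits)]

    def _row(self, pc):
        # Only a portion of the address is used, so aliasing is possible.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict(self, pc):
        return self.counters[self._row(pc)][self.history] >= 2

    def update(self, pc, taken):
        row = self.counters[self._row(pc)]
        count = row[self.history]
        # Update only the counter that made this prediction.
        row[self.history] = min(count + 1, 3) if taken else max(count - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_misses(predictor, pc, outcomes):
    """Run one branch through a predictor and count mispredictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 10     # taken, not taken, taken, ...
misses_2_2 = count_misses(CorrelatingPredictor(history_bits=2), 0x40, alternating)
misses_0_2 = count_misses(CorrelatingPredictor(history_bits=0), 0x40, alternating)
print(misses_2_2, misses_0_2)
```

On this alternating branch the history bits let the predictor learn the pattern after two misses, while the history-free counter keeps flipping between its two middle states and misses every time.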

Accuracy of Different Schemes
[Figure: frequency of mispredictions (axis from 0% to 18%) for three schemes: 4096 Entries 2-bit BHT, Unlimited Entries 2-bit BHT, and 1024 Entries (2,2) BHT.]

HW support for More ILP
- Avoid branch prediction by turning branches into conditionally executed instructions:
  if (x) then A = B op C else NOP
  - If false, then neither store result nor cause exception
  - Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  - EPIC: 64 1-bit condition fields selected so conditional execution
- Drawbacks to conditional instructions
  - Still takes a clock even if “annulled”
  - Stall if condition evaluated late
  - Complex conditions reduce effectiveness; condition becomes known late in pipeline

Limits to Multi-Issue Machines
- Inherent limitations of ILP
  - 1 branch in 5: How to keep a 5-way superscalar busy?
  - Latencies of units: many operations must be scheduled
  - Need about Pipeline Depth x No. Functional Units of independent instructions to keep fully busy
- Increase ports to Register File
  - VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
- Increase ports to memory
- Current state of the art: many hardware structures (such as issue/rename logic) have delay proportional to the square of the number of instructions issued per cycle

Limits to ILP
- Conflicting studies of amount
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  - Intel MMX
  - Motorola AltiVec
  - Supersparc Multimedia ops, etc.

Limits to ILP
Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal
Also: 1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

Upper Limit to ILP: Ideal Machine
[Figure: IPC per benchmark. Integer: 18 - 60; FP: 75 - 150.]

More Realistic HW: Branch Impact
Change from infinite window to examine to 2000, and maximum issue of 64 instructions per clock cycle.
[Figure: IPC per benchmark for each branch scheme: Perfect; Pick Cor. or BHT; BHT (512); Profile; No prediction. Integer: 6 - 12; FP: 15 - 45.]

More Realistic HW: Register Impact (rename regs)
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction.
[Figure: IPC per benchmark for each number of rename registers: Infinite, 256, 128, 64, 32, None. Integer: 5 - 15; FP: 11 - 45.]

More Realistic HW: Alias Impact
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers.
[Figure: IPC per benchmark for each alias analysis model: Perfect; Global/Stack perf, heap conflicts; Inspec. Assem.; None. Integer: 4 - 9; FP: 4 - 45 (Fortran, no heap).]

Realistic HW for ‘9X: Window Impact
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window.
[Figure: IPC per benchmark for each window size: Infinite, 256, 128, 64, 32, 16, 8, 4. Integer: 6 - 12; FP: 8 - 45.]

Brainiac vs. Speed Demon (1993)
8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

Summary #1/2
- Reservation stations: renaming to larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards of Scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer unit gets ahead, beyond branches)
- Helps cache misses as well
- 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Summary #2/2
- Dynamic hardware schemes can unroll loops dynamically in hardware
  - Dependent on renaming mechanism to remove WAR and WAW hazards
- Reorder Buffer:
  - Provides generic mechanism for “undoing” computation
  - Instructions placed into Reorder buffer in issue order
  - Instructions exit in same order, providing in-order commit
  - Trick: Don’t want to be canceling computation too often!
- Branch prediction very important to good performance
  - Depends on ability to cancel computation (Reorder Buffer)
- Superscalar and VLIW: CPI < 1 (IPC > 1)
  - Dynamic issue vs. Static issue
  - More instructions issue at same time => larger hazard penalty
  - Limitation is often number of instructions that you can successfully fetch and decode per cycle => “Flynn barrier”