CS510 Computer Architectures, Lecture 11: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP

Trace Scheduling
Parallelism across IF branches vs. LOOP branches
– Trace scheduling works when the behavior of the branches is fairly predictable at compile time
Two steps:
– Trace Selection: find a likely sequence of basic blocks (a trace) that forms a (statically predicted) long sequence of straight-line code
– Trace Compaction: squeeze the trace into a few VLIW instructions; bookkeeping code is needed in case the prediction is wrong

Trace Scheduling
[Flow graph: block X computes A[i] = A[i] + B[i] and tests A[i] == 0; the True path assigns B[i], the False path assigns C[i]. Select the trace through the True branch if the True branch is taken more frequently.]
Trace compaction by speculation:
– Move the code associated with B and C into VLIW word(s) before the branch
– This may cause exceptions when executed; speculation should not introduce any new exception*
* See the kinds of exceptions on page 179
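A C-level sketch of the loop behind this flow graph may make the two steps concrete. It is only an illustration: the assigned values (0) and the fix-up code are placeholders, since the slide elides them, and a real trace-scheduling compiler performs the transformation on the instruction schedule rather than on source code.

  /* Original loop from the flow graph above (assigned values are placeholders). */
  #define N 1000
  int A[N], B[N], C[N];

  void original(void) {
      for (int i = 0; i < N; i++) {
          A[i] = A[i] + B[i];
          if (A[i] == 0)       /* statically predicted taken: the frequent path */
              B[i] = 0;        /* on the selected trace */
          else
              C[i] = 0;        /* off-trace */
      }
  }

  /* After trace selection and compaction, the frequent path runs as straight-line
     code: the on-trace work is hoisted above the test, and bookkeeping code
     repairs state when the static prediction turns out to be wrong. */
  void trace_scheduled(void) {
      for (int i = 0; i < N; i++) {
          int a = A[i] + B[i];
          int b_old = B[i];    /* bookkeeping: remember the value we may clobber */
          A[i] = a;
          B[i] = 0;            /* speculatively executed before the branch */
          if (a != 0) {        /* off-trace exit */
              B[i] = b_old;    /* undo the speculative store */
              C[i] = 0;        /* execute the else block */
          }
      }
  }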

HW Support for More ILP: Conditional Instructions
Avoid branch prediction by turning branches into conditionally executed instructions:
  if (x) then A = B op C else NOP
– If the condition is false, the instruction neither stores its result nor causes an exception*
– The expanded ISAs of Alpha, MIPS, and SPARC have a conditional move; PA-RISC can annul any following instruction
Drawbacks of conditional instructions:
– Still takes a clock cycle even if "annulled"
– Stalls if the condition is evaluated late
– Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
* See the kinds of exceptions on page 179
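As a rough illustration of if-conversion, the C sketch below shows a branch rewritten as a select that a compiler may lower to a conditional move (e.g., MIPS MOVZ/MOVN or an Alpha CMOV); whether it actually does so depends on the compiler and target, so treat the second version only as the shape such code takes.

  /* Branchy version: the assignment happens only on the taken path. */
  int branchy(int x, int a, int b, int c) {
      if (x)
          a = b + c;
      return a;            /* the "NOP" case: a is left unchanged */
  }

  /* If-converted version: the work is done unconditionally and a select picks
     the result.  The hoisted work must not introduce a new exception (an integer
     add cannot fault; a hoisted load would need a conditional/speculative form). */
  int branch_free(int x, int a, int b, int c) {
      int t = b + c;       /* executed regardless of the condition */
      return x ? t : a;    /* candidate for a conditional-move instruction */
  }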

HW Support for More ILP: Conditional Instructions
Example: a branch around a move can be replaced by a conditional move:
  BNZ   R1, L
  MOV   R2, R3        =>    CMOVZ R2, R3, R1
L:
LWC (conditional load) must have no effect if the condition is not satisfied: it can neither write its result nor cause any exception.
Two-issue superscalar: each cycle pairs one memory reference with one ALU (or branch) operation.

Without LWC:
  First instruction slot    Second instruction slot
  LW   R1,40(R2)            ADD  R3,R4,R5
                            ADD  R6,R3,R7
                            BEQZ R10,L
  LW   R8,20(R10)
  LW   R9,0(R8)
The memory slot is wasted in the second and third cycles, and the last two loads are data dependent.

With LWC:
  First instruction slot    Second instruction slot
  LW   R1,40(R2)            ADD  R3,R4,R5
  LWC  R8,20(R10),R10       ADD  R6,R3,R7
                            BEQZ R10,L
  LW   R9,0(R8)
The load of R8 executes only when [R10] != 0; i.e., LWC behaves like LW unless its third operand is 0, in which case it has no effect.

HW Support for More ILP: Speculation
Speculation: allow an instruction that depends on a branch (predicted taken) to issue, with no consequences (including exceptions) if the branch is not actually taken ("HW undo")
– Allows execution of an instruction before the processor knows that the instruction should execute (i.e., avoids the control-dependence stall)
Often combined with dynamic scheduling (Tomasulo)
Separate speculative bypassing of results from real bypassing of results:
– When an instruction is no longer speculative, write its results (instruction commit)
– Execute out of order, but commit in order

Compiler Speculation with HW Support: (1) HW-SW Cooperation for Speculation
HW undo for misprediction:
– simply handle all resumable exceptions when the exception occurs
– simply return an undefined value for any exception that would cause termination

Source:
  if (A == 0) A = B; else A = A + 4;

Compiled code*:
  LW    R1, 0(R3)     ; load A
  BNEZ  R1, L1        ; test A
  LW    R1, 0(R2)     ; if clause
  J     L2            ; skip else
L1: ADD   R1, R1, 4   ; else clause
L2: SW    0(R3), R1   ; store A

The compiled code using compiler-based speculation:
  LW    R1, 0(R3)     ; load A
  LW    R14, 0(R2)    ; speculative load of B
  BEQZ  R1, L3        ; other branch of the if
  ADD   R14, R1, 4    ; the else clause
L3: SW    0(R3), R14  ; nonspeculative store

* Assume the then clause is almost always executed.
Register renaming: the speculative version needs an extra register (R14).

Compiler Speculation with HW Support: (2) Speculation with Poison Bits
Speculation with poison bits:
– allows compiler speculation with less change to the exception behavior
– a poison bit is added to every register
– another bit is added to every instruction to indicate whether the instruction is speculative

  LW    R1, 0(R3)     ; load A
  LW*   R14, 0(R2)    ; speculative load of B
  BEQZ  R1, L3        ; other branch of the if
  ADD   R14, R1, 4    ; the else clause
L3: SW    0(R3), R14  ; nonspeculative store

If the speculative LW* generates a terminating exception, the poison bit of R14 is set. When the nonspeculative SW instruction executes, it raises an exception if the poison bit for R14 is on.
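The sketch below models the poison-bit rule in plain C rather than in hardware; the register structure and the NULL-pointer stand-in for a terminating exception are illustrative assumptions, not part of any real ISA.

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* One architectural register with its poison bit. */
  typedef struct { long value; bool poison; } reg_t;

  /* Speculative load (LW* above): a terminating fault does not trap here,
     it only poisons the destination register. */
  void speculative_load(reg_t *dst, const long *addr) {
      if (addr == NULL) {              /* stand-in for a terminating exception */
          dst->poison = true;          /* defer the exception */
      } else {
          dst->value  = *addr;
          dst->poison = false;
      }
  }

  /* Nonspeculative store (SW above): the deferred exception surfaces only when
     a nonspeculative instruction actually uses the poisoned register. */
  void nonspeculative_store(long *addr, const reg_t *src) {
      if (src->poison) {
          fprintf(stderr, "terminating exception: source register is poisoned\n");
          exit(1);
      }
      *addr = src->value;
  }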

Compiler Speculation with HW Support
Main disadvantages of the two previous schemes:
– the need to introduce copies to deal with register renaming
– the possibility of exhausting the registers
Speculative Instructions with Renaming (Boosting):
– flag the instructions that are moved past branches as speculative
– provide renaming and buffering in the HW

Compiler Speculation with HW Support: (3) Speculative Instructions with Renaming
– The extra register is no longer necessary
– The result of the boosted instruction is not written into R1 until after the branch
– Other boosted instructions could use the result of the boosted load

  LW    R1, 0(R3)     ; load A
  LW+   R1, 0(R2)     ; boosted load of B
  BEQZ  R1, L3        ; other branch of the if
  ADD   R1, R1, 4     ; the else clause
L3: SW    0(R3), R1   ; nonspeculative store

If the branch to L3 is taken, the boosted result is written to R1; if it is not taken, the boosted result is never written to R1 and the ADD uses the value of A originally loaded into R1.

Hardware-Based Speculation
Combines:
– dynamic branch prediction
– speculation, to allow the execution of instructions before the control dependences are resolved
– dynamic scheduling, to deal with the scheduling of different combinations of basic blocks
Advantages:
– dynamic runtime disambiguation of memory addresses
– hardware-based branch prediction
– a completely precise exception model
– no compensation or bookkeeping code required
– no different code sequences required to achieve good performance on different implementations

HW-Based Speculation
Need a HW buffer for the results of uncommitted instructions: the reorder buffer
– The reorder buffer can be an operand source
– Once an operand commits, its result is found in the register file
– 3 fields per entry: instruction type, destination, value
– Results are tagged with the reorder buffer number instead of the reservation station number
– Instructions commit in order
– As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Diagram: FP Op Queue feeding reservation stations and the FP adder, with results going to the reorder buffer and then to the FP registers; load data arrives from memory.]
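A minimal C sketch of a reorder-buffer entry follows, assuming the three fields named above plus a ready flag and a speculative flag that a simple model also needs; the field names and the 64-entry size are illustrative.

  /* Instruction classes a reorder-buffer entry must distinguish. */
  enum instr_type { ROB_BRANCH, ROB_STORE, ROB_REG_OP };

  struct rob_entry {
      enum instr_type type;   /* instruction type */
      int  dest;              /* destination register number (or store address tag) */
      long value;             /* result value, valid once execution completes */
      int  ready;             /* result present: may commit when it reaches the head */
      int  speculative;       /* still depends on an unresolved branch */
  };

  /* Circular buffer: entries are allocated at the tail in program order and
     committed from the head in program order, which is what makes it easy to
     squash everything younger than a mispredicted branch. */
  struct reorder_buffer {
      struct rob_entry entry[64];
      int head, tail, count;
  };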

Steps of the Speculative Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send its operands and reorder buffer number to the reservation station
2. Execution: operate on the operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the missing result
3. Write result: finish execution (WB)
   Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available
4. Commit: update the register with the reorder buffer result
   When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer
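Continuing the reorder-buffer sketch above, the commit step (step 4) might look as follows; the regfile array and the commented store path are placeholders, and only the head entry is ever allowed to update architectural state.

  /* Commit at most one instruction per call, strictly from the head of the ROB. */
  void commit_one(struct reorder_buffer *rob, long regfile[32]) {
      if (rob->count == 0)
          return;                            /* nothing in flight */
      struct rob_entry *e = &rob->entry[rob->head];
      if (!e->ready)
          return;                            /* head not finished: nothing commits */

      if (e->type == ROB_REG_OP)
          regfile[e->dest] = e->value;       /* in-order architectural update */
      else if (e->type == ROB_STORE)
          ;                                  /* a store would reach memory only here */

      rob->head = (rob->head + 1) % 64;      /* free the head entry */
      rob->count--;
  }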

Limits to ILP
Conflicting studies of the amount of parallelism available appeared in the late 1980s and early 1990s, with different assumptions about:
– benchmarks (vectorized Fortran FP vs. integer C programs)
– hardware sophistication
– compiler sophistication

Limits to ILP
HW model for ultimate issue performance (MIPS compilers):
1. Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted
   => a machine with perfect speculation and an unbounded buffer of instructions available
4. Memory-address alias analysis: all addresses are known, and a store can be moved before a load provided the addresses are not equal
All instructions have 1-cycle latency.

Upper Limit to ILP
[Chart: instruction issues per cycle under the ideal model for the integer programs gcc, espresso, and li, and the floating-point programs fpppp, doduc, and tomcatv.]

Limitations on Window Size and Maximum Issue Count
Window: the set of instructions examined for simultaneous execution
– Determining whether n instructions have any register dependences among them requires on the order of n^2 - n comparisons:
  2000 instructions -> almost 4 million comparisons
  50 instructions -> about 2,450 comparisons
– Current technology: window sizes of 4 to 32, which requires about 900 comparisons
Multiple issue lengthens the clock cycle:
– such processors typically have clock cycles that are 1.5 to 3 times longer
– but typically have CPIs that are 2 to 3 times lower
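The quoted comparison counts follow directly from the n^2 - n estimate; the short program below, written only as a sanity check, reproduces them.

  #include <stdio.h>

  int main(void) {
      long sizes[] = {32, 50, 2000};
      for (int i = 0; i < 3; i++) {
          long n = sizes[i];
          printf("window %4ld -> %ld comparisons\n", n, n * n - n);
      }
      return 0;
  }
  /* Output:
     window   32 -> 992 comparisons      (~900, as quoted for a 4-32 window)
     window   50 -> 2450 comparisons
     window 2000 -> 3998000 comparisons  (almost 4 million) */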

Speculative ExecutionCS510 Computer ArchitecturesLecture Window Size Impact

More Realistic HW: Branch Impact
(window of 2000 instructions and maximum issue of 64 instructions per clock cycle)

Selective History Predictor
[Diagram: a tournament predictor. The branch address indexes an 8K x 2-bit selector that chooses between a non-correlating predictor (8192 x 2 bits) and a correlating predictor (2048 x 4 x 2 bits) indexed with a 2-bit global taken/not-taken history. The 2-bit selector counter decides whether the correlator's or the non-correlator's prediction is used.]
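A software sketch of this selective (tournament) predictor is given below. The table sizes follow the slide; the indexing, the saturating-counter thresholds, and the update policy are filled in with common choices and should be read as assumptions, not as the exact hardware.

  #include <stdint.h>
  #include <stdbool.h>

  static uint8_t selector[8192];       /* 8K x 2-bit selector */
  static uint8_t local_pred[8192];     /* 8192 x 2-bit non-correlating predictor */
  static uint8_t global_pred[2048][4]; /* 2048 x 4 x 2-bit correlating predictor */
  static uint8_t ghist;                /* 2-bit global taken/not-taken history */

  static bool taken(uint8_t c) { return c >= 2; }  /* 2-bit saturating counter */
  static void train(uint8_t *c, bool t) {
      if (t) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
  }

  bool predict(uint32_t pc) {
      bool local = taken(local_pred[pc % 8192]);
      bool corr  = taken(global_pred[pc % 2048][ghist & 3]);
      return taken(selector[pc % 8192]) ? corr : local;  /* high selector -> trust correlator */
  }

  void update(uint32_t pc, bool outcome) {
      uint32_t li = pc % 8192, gi = pc % 2048;
      bool local = taken(local_pred[li]);
      bool corr  = taken(global_pred[gi][ghist & 3]);

      /* Move the selector toward whichever component predictor was right. */
      if (corr == outcome && local != outcome) train(&selector[li], true);
      if (local == outcome && corr != outcome) train(&selector[li], false);

      /* Train both component predictors, then shift the outcome into the history. */
      train(&local_pred[li], outcome);
      train(&global_pred[gi][ghist & 3], outcome);
      ghist = (uint8_t)(((ghist << 1) | outcome) & 3);
  }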

More Realistic HW: Register Impact
(2000-instruction window, 64-instruction issue, 8K 2-level prediction)

More Realistic HW: Alias Impact
(2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers)
[Chart: instruction issues per cycle for gcc, espresso, li, fpppp, doduc, and tomcatv under four alias-analysis models: Perfect; Global/stack perfect (ongoing research); Inspection (what most commercial compilers do); None (all memory accesses are assumed to conflict).]

Realistic HW for the '90s: Window Impact
Realistic '90s HW: perfect memory disambiguation (HW), 1K selective predictor, 16-entry return predictor, 64 renaming registers, maximum issue equal to the window size

Fallacies and Pitfalls
Fallacy: processors with lower CPIs will always be faster.
– Sophisticated pipelines typically have slower clock rates than processors with simple pipelines
– Example:
  IBM Power-2 (low CPI): two FP and two load-store units, 71.5 MHz clock (slower clock rate)
  DEC Alpha 21064 (high CPI): dual issue with one load-store and one FP unit, 200 MHz (faster clock rate)
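A quick worked comparison, using made-up CPIs chosen only to show the arithmetic, makes the point: execution time is instruction count x CPI / clock rate. If the low-CPI machine achieved a CPI of 1.2 at 71.5 MHz and the high-CPI machine a CPI of 1.8 at 200 MHz, their times for the same instruction count would be proportional to 1.2/71.5 ≈ 0.0168 and 1.8/200 = 0.009, so the machine with the 50% higher CPI would still be about 1.9 times faster. How the real Power-2 vs. Alpha comparison comes out depends on the benchmark, as the next slide illustrates.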

Brainiac vs. Speed Demon
6-scalar IBM at 71.5 MHz (5-stage pipe) vs. 2-scalar Alpha at 200 MHz (7-stage pipe)

Recent High-Performance Processors
[Table, partially recoverable from the slide. Columns: processor; year shipped in initial systems; clock rate (MHz); issue structure; scheduling; issue capability (maximum / load-store / integer ALU / FP / branch); SPEC ratings (measured or estimated). Cells lost in transcription are shown as "-".]

  Processor        Year   MHz  Issue    Scheduling  Max/LS/ALU/FP/Br  SPECint / SPECfp
  IBM Power-2      -      -    Dynamic  Static      -                 -    / 270
  Intel Pentium    1994   66   Dynamic  Static      -                 -    / 65
  DEC Alpha        -      -    Static   Static      -                 -    / -
  Sun UltraSPARC   -      -    Dynamic  Static      -                 -    / 305
  Intel P6         -      -    Dynamic  Dynamic     3/1/2/1/1         >200 / -
  PowerPC          -      -    Dynamic  Dynamic     -                 -    / -
  MIPS R10000      -      -    Dynamic  Dynamic     -                 -    / -
  HP               -      -    Dynamic  Static      4/2/2/2/1         >360 / >550