Hardware-Based Speculation

Exploiting More ILP
- Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP
- One way to overcome control dependencies is with speculation
  - Make a guess and execute the program as if the guess is correct
  - Need mechanisms to handle the case when the speculation is incorrect
- Can do some speculation in the compiler
  - We saw this previously with reordered / duplicated instructions around branches

Hardware-Based Speculation
- Extends dynamic scheduling by combining three key ideas:
  1. Dynamic branch prediction
  2. Speculation, to allow the execution of instructions before control dependencies are resolved
  3. Dynamic scheduling, to deal with scheduling different combinations of basic blocks
     - What we saw earlier was within a basic block
- Modern processors started using speculation around the time of the PowerPC 603 and Intel Pentium II; they extend Tomasulo's approach to support speculation

Speculating with Tomasulo
- Separate execution from completion
  - Allow instructions to execute speculatively, but do not let them update registers or memory until they are no longer speculative
- Instruction commit
  - Once an instruction is no longer speculative, it is allowed to make register and memory updates
- Allow instructions to execute and complete out of order, but force them to commit in order
- Add a hardware buffer, called the reorder buffer (ROB), with registers to hold the result of an instruction between completion and commit
  - Acts as a FIFO queue, ordered by issue

Original Tomasulo Architecture

Tomasulo and Reorder Buffer
- The ROB sits between execution and the register file
- It is a source of operands
- In this case it is integrated with the store buffer
- Reservation stations use the ROB slot number as a tag
- Instructions commit at the head of the ROB FIFO queue
  - Easy to undo speculated instructions on mispredicted branches or on exceptions
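Because the ROB is a source of operands and its slot numbers serve as tags, an operand lookup at issue time has three possible outcomes. The sketch below is only an illustration under assumed names: `RobSlot`, `read_operand`, and the `reg_status` map (register name to pending ROB slot) are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass
class RobSlot:
    """Hypothetical, minimal view of one ROB slot for operand lookup."""
    ready: bool = False            # result has been written into the ROB
    value: Optional[float] = None  # result, valid only when ready is True


def read_operand(
    reg: str,
    reg_file: Dict[str, float],
    reg_status: Dict[str, int],
    rob: Dict[int, RobSlot],
) -> Tuple[str, Union[float, int]]:
    """Return ("value", v) if the operand is available, else ("tag", slot)."""
    tag = reg_status.get(reg)  # ROB slot that will produce reg, if any
    if tag is None:
        # No in-flight producer: the committed register file is current.
        return ("value", reg_file[reg])
    slot = rob[tag]
    if slot.ready:
        # Completed but not yet committed: read the value from the ROB.
        return ("value", slot.value)
    # Still executing: the reservation station waits for this tag on the CDB.
    return ("tag", tag)


reg_file = {"F0": 0.0, "F4": 1.5}
reg_status = {"F0": 1}   # assumption: F0 will be produced by ROB slot 1
rob = {1: RobSlot()}

from_regfile = read_operand("F4", reg_file, reg_status, rob)
pending = read_operand("F0", reg_file, reg_status, rob)
rob[1] = RobSlot(ready=True, value=7.0)
from_rob = read_operand("F0", reg_file, reg_status, rob)
```

Here `from_regfile` comes from the register file, `pending` is the tag of ROB slot 1, and `from_rob` is the completed but uncommitted value held in the ROB.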

ROB Data Structure
- Instruction type field
  - Indicates whether the instruction is a branch, a store, or a register operation
- Destination field
  - Register number for loads and ALU operations, or memory address for stores
- Value field
  - Holds the value of the instruction result until the instruction commits
- Ready field
  - Indicates whether the instruction has completed execution and the value is ready
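The four fields map naturally onto a small record type. This is a minimal sketch, assuming Python names (`InstrType`, `RobEntry`) and a `deque` as the FIFO; none of these names come from the slides.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"  # loads and ALU operations


@dataclass
class RobEntry:
    instr_type: InstrType
    # Register name for loads/ALU ops, memory address for stores,
    # None for branches (which have no destination).
    dest: Optional[Union[str, int]]
    value: Optional[float] = None  # result, held until commit
    ready: bool = False            # True once execution has completed


# The ROB is a FIFO in issue order: append at the tail, commit from the head.
rob = deque()
rob.append(RobEntry(InstrType.REGISTER, "F0"))
rob.append(RobEntry(InstrType.STORE, 0x100))

# The head instruction finishes executing and writes its result.
rob[0].value = 3.0
rob[0].ready = True

head_can_commit = rob[0].ready
store_can_commit = rob[1].ready
```

Only the head entry is examined for commit, so the store behind it waits even if it were to finish first.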

Instruction Execution
1. Issue: get an instruction from the instruction queue
   - If a reservation station and a ROB slot are both free (no structural hazard), issue the instruction to the reservation station and the ROB, and send the operands to the reservation station if they are available in the register file or the ROB
   - The allocated ROB slot number is sent to the reservation station to use as a tag when placing data on the common data bus (CDB)
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not, watch the CDB for the missing result
3. Write result: finish execution (WB)
   - Write the result on the CDB, using the tag, to all waiting units and to the ROB; mark the reservation station available
4. Commit: update the register or memory with the ROB result
   - When an instruction reaches the head of the ROB and its result is present, update the register with the result (or store to memory) and remove the instruction from the ROB
   - If an incorrectly predicted branch reaches the head of the ROB, flush the ROB and restart at the correct successor of the branch
(In the original slides, blue text marked the changes from basic Tomasulo.)
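The commit rule can be sketched with a toy model. This is a simplification under stated assumptions: reservation stations, the CDB broadcast, and refetch after a flush are omitted, and the class and method names (`SpeculativeCore`, `issue`, `write_result`, `commit`) are hypothetical.

```python
from collections import deque


class SpeculativeCore:
    """Toy model: out-of-order completion, in-order commit, flush on mispredict."""

    def __init__(self, regs):
        self.regs = dict(regs)  # architectural register file
        self.memory = {}        # architectural memory
        self.rob = deque()      # FIFO in issue order
        self.next_tag = 1

    def issue(self, kind, dest=None):
        """Allocate a ROB slot; kind is 'reg', 'store', or 'branch'."""
        tag = self.next_tag
        self.next_tag += 1
        self.rob.append({"tag": tag, "kind": kind, "dest": dest,
                         "value": None, "ready": False, "mispredicted": False})
        return tag

    def write_result(self, tag, value=None, mispredicted=False):
        """Record a finished result in the ROB (no architectural update yet)."""
        for entry in self.rob:
            if entry["tag"] == tag:
                entry.update(value=value, ready=True, mispredicted=mispredicted)
                return

    def commit(self):
        """Commit the head entry if it is ready; return its tag, else None."""
        if not self.rob or not self.rob[0]["ready"]:
            return None
        entry = self.rob.popleft()
        if entry["kind"] == "branch":
            if entry["mispredicted"]:
                self.rob.clear()  # discard all younger, speculative work
        elif entry["kind"] == "store":
            self.memory[entry["dest"]] = entry["value"]
        else:
            self.regs[entry["dest"]] = entry["value"]
        return entry["tag"]


core = SpeculativeCore({"F0": 0.0, "F4": 0.0})
t1 = core.issue("reg", "F0")   # e.g. a load
t2 = core.issue("branch")
t3 = core.issue("reg", "F4")   # issued speculatively after the branch

core.write_result(t3, 9.0)     # completes first, out of order
first_commit = core.commit()   # head (t1) is not ready, so nothing commits

core.write_result(t1, 2.0)
core.write_result(t2, mispredicted=True)
core.commit()                  # t1 commits: F0 is updated
core.commit()                  # mispredicted branch at head: t3 is flushed
```

After the run, `F0` holds 2.0 while `F4` is unchanged, because the speculative result for `t3` never left the ROB.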

Tomasulo With ROB (worked example, 16 snapshots)
Figure labels: FP Op Queue; Reorder Buffer (Dest, Value, Instruction, State; oldest to newest); Registers; Reservation Stations for the FP adders and FP multipliers (Dest tag, operation, operands); load buffers (Dest tag, address) From Memory; To Memory.
In the tables below, entries are listed oldest first. R(Fn) denotes a value read from register Fn, and ROBn denotes a result still pending from that ROB entry. The state letters are not defined in the slides; they appear to follow the four steps (I = issued, E = executing, W = result written, C = committed).

Tomasulo With ROB (snapshot 1)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     I
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 2)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 3)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
FP adder reservation stations: ROB2: ADDD R(F4), ROB1
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 4)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
FP adder reservation stations: ROB2: ADDD R(F4), ROB1
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 5)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     I
ROB5   F4    -      LD F4,0(R3)      E
ROB6   F0    -      ADDD F0,F4,F6    I
FP adder reservation stations: ROB2: ADDD R(F4), ROB1; ROB6: ADDD ROB5, R(F6)
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Load buffers: ROB1: 10+R2; ROB5: 0+R3

Tomasulo With ROB (snapshot 6)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     I
ROB5   F4    -      LD F4,0(R3)      E
ROB6   F0    -      ADDD F0,F4,F6    I
ROB7   [R3]  ROB5   ST F4,0(R3)      I
FP adder reservation stations: ROB2: ADDD R(F4), ROB1; ROB6: ADDD ROB5, R(F6)
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Load buffers: ROB1: 10+R2; ROB5: 0+R3

Tomasulo With ROB (snapshot 7)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     I
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    -      ADDD F0,F4,F6    I
ROB7   [R3]  V1     ST F4,0(R3)      W
FP adder reservation stations: ROB2: ADDD R(F4), ROB1; ROB6: ADDD V1, R(F6)
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 8)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     I
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    -      ADDD F0,F4,F6    E
ROB7   [R3]  V1     ST F4,0(R3)      W
FP adder reservation stations: ROB2: ADDD R(F4), ROB1
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 9)
Entry  Dest  Value  Instruction      State
ROB1   F0    -      LD F0,10(R2)     E
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     I
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
FP adder reservation stations: ROB2: ADDD R(F4), ROB1
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Load buffers: ROB1: 10+R2

Tomasulo With ROB (snapshot 10)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     W
ROB2   F10   -      ADDD F10,F4,F0   I
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     I
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
FP adder reservation stations: ROB2: ADDD R(F4), V3
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)

Tomasulo With ROB (snapshot 11)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     C
ROB2   F10   -      ADDD F10,F4,F0   E
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     E
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
FP multiplier reservation stations: ROB3: MULD ROB2, R(F6)
Registers: F0=V3

Tomasulo With ROB (snapshot 12)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     C
ROB2   F10   V4     ADDD F10,F4,F0   W
ROB3   F2    -      MULD F2,F10,F6   I
ROB4   --    -      BNE F0, 0, L     W
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
FP multiplier reservation stations: ROB3: MULD V4, R(F6)
Registers: F0=V3

Tomasulo With ROB (snapshot 13)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     C
ROB2   F10   V4     ADDD F10,F4,F0   C
ROB3   F2    -      MULD F2,F10,F6   E
ROB4   --    -      BNE F0, 0, L     W
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
Registers: F0=V3, F10=V4

Tomasulo With ROB (snapshot 14)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     C
ROB2   F10   V4     ADDD F10,F4,F0   C
ROB3   F2    V5     MULD F2,F10,F6   W
ROB4   --    -      BNE F0, 0, L     W
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
Registers: F0=V3, F10=V4

Tomasulo With ROB (snapshot 15)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     C
ROB2   F10   V4     ADDD F10,F4,F0   C
ROB3   F2    V5     MULD F2,F10,F6   C
ROB4   --    -      BNE F0, 0, L     W
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
Registers: F0=V3, F10=V4, F2=V5

Tomasulo With ROB (snapshot 16)
Entry  Dest  Value  Instruction      State
ROB1   F0    V3     LD F0,10(R2)     C
ROB2   F10   V4     ADDD F10,F4,F0   C
ROB3   F2    V5     MULD F2,F10,F6   C
ROB4   --    -      BNE F0, 0, L     C
ROB5   F4    V1     LD F4,0(R3)      W
ROB6   F0    V2     ADDD F0,F4,F6    W
ROB7   [R3]  V1     ST F4,0(R3)      W
Registers: F0=V3, F10=V4, F2=V5

Avoiding Memory Hazards
- A store only updates memory when it reaches the head of the ROB
  - Otherwise WAW and WAR hazards through memory are possible
  - By waiting until the store reaches the head, memory is updated in order, and no earlier loads or stores can still be pending
- If a load accesses a memory location written by an earlier store, it cannot perform the memory access until the store has written the data
  - Prevents a RAW hazard through memory
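The load rule can be sketched as a scan over older ROB entries. Since a store leaves the ROB only when it writes memory, any older store still in the ROB with a matching address blocks the load. The function name, the dictionary layout, and the conservative handling of a store whose address is not yet known are assumptions of this sketch, not something the slides specify.

```python
def load_may_access_memory(load_index, load_addr, rob):
    """Check whether the load at rob[load_index] may read memory now.

    rob is a list of entries, oldest first. Each entry is a dict with a
    'kind' key and an 'addr' key (None when no address is known yet).
    """
    for entry in rob[:load_index]:        # only entries older than the load
        if entry["kind"] != "store":
            continue
        if entry["addr"] is None:
            return False                  # assumption: unknown address blocks
        if entry["addr"] == load_addr:
            return False                  # RAW through memory: wait for store
    return True


rob = [
    {"kind": "store", "addr": 0x100},     # oldest, not yet committed
    {"kind": "reg", "addr": None},
    {"kind": "load", "addr": 0x100},      # same address as the older store
    {"kind": "load", "addr": 0x200},      # different address
]

same_address_load = load_may_access_memory(2, 0x100, rob)
other_address_load = load_may_access_memory(3, 0x200, rob)
```

The first load must wait for the store to commit, while the second has no conflict and may proceed.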

Reorder Buffer Implementation
- In practice
  - Try to recover as early as possible after a branch is mispredicted, rather than waiting until the branch reaches the head of the ROB
  - Performance in speculative processors is more sensitive to branch prediction accuracy
    - Higher cost of misprediction
- Exceptions
  - Do not recognize an exception until the instruction is ready to commit
  - Could try to handle exceptions as they arise, once earlier branches are resolved, but this is more challenging
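Early recovery amounts to squashing only the entries younger than the mispredicted branch as soon as it resolves, leaving older entries to commit normally. The sketch below is an illustration under assumptions: the ROB is reduced to a `deque` of tags in issue order, and `squash_younger` is a hypothetical helper.

```python
from collections import deque


def squash_younger(rob, branch_tag):
    """Remove every entry issued after branch_tag; return the squashed tags.

    rob is a deque of tags in issue order (oldest at the left). If the
    branch is no longer in the ROB, nothing is squashed.
    """
    if branch_tag not in rob:
        return []
    squashed = []
    while rob[-1] != branch_tag:
        squashed.append(rob.pop())  # drop from the newest end
    return squashed


# Seven in-flight instructions with a branch in slot 4.
rob = deque([1, 2, 3, 4, 5, 6, 7])
squashed = squash_younger(rob, 4)
remaining = list(rob)
```

Entries 5 to 7 are discarded immediately, while entries 1 to 4 stay in the ROB and commit in order.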