Tomasulo’s Approach and Hardware Based Speculation


ENGS 116 Lecture 10
Vincent H. Berk, October 22nd
Reading for Today: 3.1 – 3.7

Hardware Schemes for ILP (ENGS 116 Lecture 8)
Why in hardware at run time?
- Works when dependences are not known at compile time
- Simplifies the compiler
- Allows code compiled for one machine to run well on another
Key idea: allow instructions behind a stall to proceed
        DIVD  F0, F2, F4
        ADDD  F10, F0, F8
        SUBD  F12, F8, F14
- Enables out-of-order execution, which implies out-of-order completion
- ID stage checks for both structural hazards and data dependences
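The key idea above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every earlier instruction is treated as still pending, and only RAW dependences are checked (WAR and WAW are ignored). The `Instr` tuple and the function `independent_of_pending` are hypothetical names, not part of the slides.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def independent_of_pending(program: List[Instr], index: int) -> bool:
    """True if program[index] reads no register written by an earlier instruction.

    Assumption: all earlier instructions are still in flight, so any
    RAW dependence on them forces a wait.
    """
    later = program[index]
    return all(
        earlier.dest not in (later.src1, later.src2)
        for earlier in program[:index]
    )


program = [
    Instr("DIVD", "F0", "F2", "F4"),    # long-latency divide
    Instr("ADDD", "F10", "F0", "F8"),   # needs F0 from the divide
    Instr("SUBD", "F12", "F8", "F14"),  # needs nothing from the divide
]

addd_can_go = independent_of_pending(program, 1)
subd_can_go = independent_of_pending(program, 2)
```

Here `ADDD` has to wait for `DIVD`, while `SUBD` is free to proceed past the stalled `ADDD`.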

Hardware Schemes for ILP
Out-of-order execution divides the ID stage in two:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards remain, then read operands

Tomasulo’s Algorithm
- For the IBM 360/91, about 3 years after the CDC 6600
- Goal: high performance without special compilers
Differences between the IBM 360 and CDC 6600 ISAs:
- IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
- IBM has 4 FP registers vs. 8 in the CDC 6600
Differences between Tomasulo’s algorithm and the scoreboard:
- Control and buffers (called “reservation stations”) are distributed with the functional units vs. centralized in the scoreboard
- Registers in instructions are replaced by pointers to reservation station buffers
- HW renaming of registers avoids WAR and WAW hazards
- Common data bus (CDB) broadcasts results to functional units
- Loads and stores are treated as functional units as well
Later machines: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP operation queue. If a reservation station is free, issue the instruction and send the operands (renames registers).
2. Execution: operate on operands (EX). When the operands are ready, execute; if not ready, watch the common data bus for the result.
3. Write result: finish execution (WB). Write on the common data bus to all awaiting units; mark the reservation station available.
Common data bus: data + source (“come from” bus)

Tomasulo Organization
[Figure: block diagram with the labels From Instruction Unit, From Memory, FP Op Queue, FP Registers, Load Buffers, Store Buffers, Operand Bus, Operation Bus, To Memory, Reservation Stations (FP Add, FP Mul), FP Adders, FP Multipliers, Common data bus (CDB)]
- Resolve RAW memory conflict? (address in memory buffers)
- Integer unit executes in parallel

Reservation Station Components
- Op: operation to perform in the unit (e.g., + or –)
- Qj, Qk: reservation stations producing the source registers
- Vj, Vk: values of the source operands
- Rj, Rk: flags indicating when Vj, Vk are ready
- Busy: indicates that the reservation station and its FU are busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
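The issue and write-result stages, together with the reservation station fields, can be sketched as a small Python model. It is a simplified illustration, not the 360/91 design: there is no timing, execution itself is not modelled (the result value is passed to `broadcast` by hand), and the Rj/Rk ready flags are represented by Qj/Qk being `None`. The class and method names (`Tomasulo`, `issue`, `broadcast`) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of operand j, once known
    vk: Optional[float] = None   # value of operand k, once known
    qj: Optional[str] = None     # station that will produce operand j
    qk: Optional[str] = None     # station that will produce operand k

    def ready(self) -> bool:
        """Both operands are available, so execution may start."""
        return self.busy and self.qj is None and self.qk is None


class Tomasulo:
    def __init__(self, station_names, regs):
        self.stations = {n: ReservationStation(n) for n in station_names}
        self.regs = dict(regs)
        self.reg_status: Dict[str, str] = {}  # register -> producing station

    def _read(self, reg):
        """Return (value, None) if the register is valid, else (None, tag)."""
        if reg in self.reg_status:
            return None, self.reg_status[reg]
        return self.regs[reg], None

    def issue(self, station, op, dest, src1, src2) -> bool:
        rs = self.stations[station]
        if rs.busy:
            return False                      # structural hazard: stall
        rs.busy, rs.op = True, op
        rs.vj, rs.qj = self._read(src1)       # read sources before renaming dest
        rs.vk, rs.qk = self._read(src2)
        self.reg_status[dest] = station       # rename: dest now points at station
        return True

    def broadcast(self, station, value) -> None:
        """Write result: put (value, source station) on the common data bus."""
        for rs in self.stations.values():
            if rs.qj == station:
                rs.vj, rs.qj = value, None
            if rs.qk == station:
                rs.vk, rs.qk = value, None
        for reg, tag in list(self.reg_status.items()):
            if tag == station:                # only the latest writer updates it
                self.regs[reg] = value
                del self.reg_status[reg]
        self.stations[station] = ReservationStation(station)  # free the station


m = Tomasulo(
    ["Mult1", "Add1", "Add2"],
    {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F8": 1.0,
     "F10": 0.0, "F12": 0.0, "F14": 5.0},
)
m.issue("Mult1", "DIVD", "F0", "F2", "F4")
m.issue("Add1", "ADDD", "F10", "F0", "F8")    # waits on Mult1 for F0
m.issue("Add2", "SUBD", "F12", "F8", "F14")   # both operands available

add1_ready_before = m.stations["Add1"].ready()
add2_ready_before = m.stations["Add2"].ready()
m.broadcast("Mult1", 2.0)                     # DIVD result: 6.0 / 3.0
add1_ready_after = m.stations["Add1"].ready()
```

Because each waiting station records the producing station's name rather than a register name, a later write to `F0` by another instruction would not disturb `Add1`, which is how renaming removes WAR and WAW hazards in this model.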

Tomasulo Example Cycle 1

Tomasulo Example Cycle 2

Tomasulo Example Cycle 3
- Register names are renamed in the reservation stations
- Load1 completing: who is waiting for Load1?

Tomasulo Example Cycle 4
- Load2 completing: who is waiting for it?

Tomasulo Example Cycle 5

Tomasulo Example Cycle 6

Tomasulo Summary
Reservation stations: renaming to a larger set of registers + buffering of source operands
- Prevents the registers from becoming a bottleneck
- Avoids the WAR and WAW hazards of the scoreboard
- Allows loop unrolling in HW
Not limited to basic blocks (integer units get ahead, beyond branches)
Lasting contributions:
- Dynamic scheduling
- Register renaming
- Load/store disambiguation
360/91 descendants are Pentium III, PowerPC 604, MIPS R10000, HP-PA 8000, Alpha 21264

Hardware-Based Speculation
- Instead of just fetching and decoding instructions, also execute them based on the branch prediction.
- Execute instructions out of order as soon as their operands are available.
- Hold back instruction commit until the branch is decided.
- Re-order instructions after execution and commit them in order:
  - reorder buffer, or ROB
  - register file not updated until commit
- Do not raise exceptions until the instruction is committed.
- The ROB holds and provides operands until commit.

Tomasulo with Speculation
- Issue: requires an empty reservation station and an empty ROB slot. Send operands to the reservation station from the register file or from the ROB. This stage is often referred to as dispatch.
- Execute: monitor the CDB for operands, check RAW hazards. When both operands are available, execute.
- Write result: when the result is available, write it on the CDB through to the ROB and any waiting reservation stations. Stores write to the value field in the ROB.
- Commit: three cases:
  - Normal commit: write registers, committing in order
  - Store: update memory
  - Incorrect branch: flush the ROB and reservation stations, and restart execution at the correct PC
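The three commit cases can be sketched as follows. This is a minimal illustration under stated assumptions: the ROB is a plain queue, reservation stations are not modelled, and the `RobEntry` fields (`kind`, `done`, `mispredicted`, `correct_pc`) are hypothetical names chosen for this example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    kind: str                           # "alu", "store" or "branch"
    dest: Optional[str] = None          # register name, or address for a store
    value: Optional[float] = None
    done: bool = False                  # result has reached the ROB
    mispredicted: bool = False          # only meaningful for branches
    correct_pc: Optional[int] = None    # where to restart after a flush


def commit(rob, regs, memory):
    """Commit finished entries from the head of the ROB, in program order.

    Returns the PC to restart at if a mispredicted branch was reached,
    otherwise None.
    """
    while rob and rob[0].done:
        entry = rob.popleft()
        if entry.kind == "alu":
            regs[entry.dest] = entry.value      # normal commit
        elif entry.kind == "store":
            memory[entry.dest] = entry.value    # memory is updated only here
        elif entry.kind == "branch" and entry.mispredicted:
            rob.clear()                         # discard all speculative work
            return entry.correct_pc
    return None


regs = {"F0": 0.0, "F2": 0.0}
memory = {}
rob = deque([
    RobEntry("alu", "F0", 1.5, done=True),
    RobEntry("branch", done=True, mispredicted=True, correct_pc=40),
    RobEntry("alu", "F2", 9.0, done=True),        # speculative, must not commit
    RobEntry("store", "0x100", 9.0, done=True),   # speculative, must not commit
])
restart_pc = commit(rob, regs, memory)
```

Only `F0` is updated; the two entries after the mispredicted branch are flushed before they can touch the register file or memory.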

Problems with speculation
Multi-issue machines:
- Must be able to commit multiple instructions from the ROB
- More registers, more renaming
How much speculation:
- How many branches deep?
- What to do on a cache miss? TLB miss?
- Cache interference due to incorrect branch prediction

Figure 3.41: Number of registers available for renaming.

Figure 3.45: Window size: the number of instructions the issue unit may look ahead and schedule from.

ILP: Software Approaches 2 (ENGS 116 Lecture 11)

HW Support for More ILP
Avoid branch prediction by turning branches into conditionally executed instructions:
        if (X) then A = B op C else NOP
- If the condition is false, neither store the result nor cause an exception
- Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction
- IA-64: 64 1-bit condition fields, selected so that any instruction can be executed conditionally
Drawbacks of conditional instructions:
- Still takes a clock even if “annulled”
- Stall if the condition is evaluated late
- Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
[Figure: instruction A = B op C guarded by condition X]

Software Pipelining
- Observation: if the iterations of a loop are independent, more ILP can be obtained by taking instructions from different iterations
- Software pipelining reorganizes a loop so that each iteration is made from instructions chosen from different iterations of the original loop
[Figure labels: Iteration 1, 2, 3, 4; Software-pipelined iteration]

SW Pipelining Example

Before: unrolled 3 times (11 instructions in the loop)
    LOOP: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          SUBI  R1, R1, #24
          BNEZ  R1, LOOP

After: software pipelined (5 instructions in the loop)
          LD    F0, 0(R1)
          ADDD  F4, F0, F2
          LD    F0, -8(R1)
    LOOP: SD    0(R1), F4       ; stores M[i]
          ADDD  F4, F0, F2      ; adds to M[i-1]
          LD    F0, -16(R1)     ; loads M[i-2]
          SUBI  R1, R1, #8
          BNEZ  R1, LOOP
          SD    0(R1), F4
          ADDD  F4, F0, F2
          SD    -8(R1), F4

Pipeline overlap inside the software-pipelined loop:
    SD    IF  ID  EX  Mem WB             (reads F4)
    ADD       IF  ID  EX  Mem WB         (reads F0, writes F4)
    LD            IF  ID  EX  Mem WB     (writes F0)
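The transformation can be checked with a small Python model that mirrors the prologue, loop body, and epilogue of the software-pipelined version. It is a sketch under assumptions: the array is indexed by element rather than by byte offset, `f0` and `f4` stand in for the registers F0 and F4, and the function names are invented for this example. The pipelined form needs at least three elements, so shorter inputs fall back to the plain loop.

```python
def add_scalar_plain(x, c):
    """Original loop: M[i] = M[i] + c, walking from the last element down."""
    x = list(x)
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + c
    return x


def add_scalar_pipelined(x, c):
    """Software-pipelined form: each loop trip stores M[i], adds for M[i-1],
    and loads M[i-2]."""
    if len(x) < 3:
        return add_scalar_plain(x, c)   # not enough iterations to overlap
    x = list(x)
    i = len(x) - 1

    # Prologue: start the first two iterations.
    f0 = x[i]            # LD   F0, 0(R1)
    f4 = f0 + c          # ADDD F4, F0, F2
    f0 = x[i - 1]        # LD   F0, -8(R1)

    # Loop body: one instruction from each of three original iterations.
    while i >= 2:
        x[i] = f4        # SD   0(R1), F4     stores M[i]
        f4 = f0 + c      # ADDD F4, F0, F2    adds to M[i-1]
        f0 = x[i - 2]    # LD   F0, -16(R1)   loads M[i-2]
        i -= 1           # SUBI R1, R1, #8

    # Epilogue: drain the two iterations still in flight.
    x[i] = f4            # SD   0(R1), F4
    f4 = f0 + c          # ADDD F4, F0, F2
    x[i - 1] = f4        # SD   -8(R1), F4
    return x


data = [1.0, 2.0, 3.0, 4.0, 5.0]
plain_result = add_scalar_plain(data, 10.0)
pipelined_result = add_scalar_pipelined(data, 10.0)
```

Both functions return the same list, which is the sense in which the reorganized loop is equivalent to the original one.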

SW Pipelining Example
Software pipelining compared with loop unrolling:
- Symbolic loop unrolling
- Smaller code space
- Overhead paid only once vs. on each iteration in loop unrolling
- 100 iterations = 25 loops with 4 unrolled iterations each
[Figure: number of overlapped operations over time for (a) software pipelining and (b) loop unrolling]

Trace Scheduling
Focus on the critical path (trace selection):
- Compiler has to decide what the critical path (the trace) is
- The most likely basic blocks are put in the trace
- Loops are unrolled in the trace
Now speed it up (trace compaction):
- Focus on limiting instruction count
- Branches are seen as jumps into or out of the trace
Problems:
- Significant overhead for parts that are not in the trace
- Unclear whether it is feasible in practice

Superblocks
Similar to trace scheduling, but:
- Single entrance, multiple exits
Tail duplication:
- Handles cases that exited the superblock
- Residual loop handling
- Could in itself be a superblock
Problem: code size
Worth the hassle?

Conditional instructions
- An instruction that is executed depending on one of its arguments
- The instruction is executed, but its results are not always written
- Should only be used for very small sequences; otherwise use a normal branch
With a branch:
          BNEZ  R1, L
          ADDU  R2, R3, R0
    L:
With a conditional move:
          CMOVZ R2, R3, R1
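A short Python sketch of why the two sequences above are interchangeable. It assumes `R0` is hard-wired to zero and that `CMOVZ R2, R3, R1` copies `R3` into `R2` only when `R1` is zero; the function names are made up for the example.

```python
def with_branch(r1, r2, r3):
    """BNEZ R1, L / ADDU R2, R3, R0 / L:"""
    if r1 == 0:          # branch not taken, so the ADDU executes
        r2 = r3 + 0      # R0 is assumed to always read as zero
    return r2


def with_cmovz(r1, r2, r3):
    """CMOVZ R2, R3, R1"""
    candidate = r3       # the move is always executed ...
    return candidate if r1 == 0 else r2   # ... but written only if R1 == 0


moved = with_cmovz(0, 5, 9)       # R1 == 0: R2 receives R3
not_moved = with_cmovz(1, 5, 9)   # R1 != 0: R2 keeps its old value
```

The conditional move has no control dependence, so there is nothing for the branch predictor to get wrong.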

Speculation
Compiler moves instructions before a branch if:
- Data flow is not affected (optionally with the use of renaming)
- Exception behavior is preserved
- Load/store address conflicts are avoided (no renaming for memory locations)
Preserving exception behavior:
- Mechanism to indicate that an instruction is speculative
- Poison bit: raise the exception when the value is used
- Using conditional instructions
Requires:
- In-order instruction commit
- Register renaming
- Writeback at commit
- Forwarding
- Raising exceptions at commit

Speculation
Source: if (A==0) A=B; else A=A+4;

Original code:
          LD    R1, 0(R3)       ; load A
          BNEZ  R1, L1          ; test A
          LD    R1, 0(R2)       ; then
          J     L2              ; skip else
    L1:   DADDI R1, R1, #4      ; else
    L2:   SD    R1, 0(R3)       ; store A

With the load of B speculated:
          LD    R1, 0(R3)       ; load A
          LD    R14, 0(R2)      ; load B (speculative)
          BEQZ  R1, L3          ; branch if
          DADDI R14, R1, #4     ; else
    L3:   SD    R14, 0(R3)      ; store A
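The two instruction sequences can be compared with a direct Python transcription. This sketch assumes the speculative load of B cannot fault, which is the exception-behavior condition the compiler has to guarantee; the function names are illustrative, and each function returns the value that the final `SD` would store to A.

```python
def original(a, b):
    r1 = a                # LD    R1, 0(R3)    load A
    if r1 != 0:           # BNEZ  R1, L1       test A
        r1 = r1 + 4       # L1: DADDI R1, R1, #4   else
    else:
        r1 = b            # LD    R1, 0(R2)    then
    return r1             # L2: SD R1, 0(R3)   store A


def speculated(a, b):
    r1 = a                # LD    R1, 0(R3)    load A
    r14 = b               # LD    R14, 0(R2)   load B before the branch
    if r1 != 0:           # BEQZ  R1, L3 falls through when A != 0
        r14 = r1 + 4      # DADDI R14, R1, #4  else
    return r14            # L3: SD R14, 0(R3)  store A


cases = [(a, b) for a in (0, 1, -4, 7) for b in (3, 0)]
results_match = all(original(a, b) == speculated(a, b) for a, b in cases)
```

Using a fresh register (R14) for the speculative load is what keeps the data flow intact: the old value of A in R1 is still available for the else path.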