CSC 4250 Computer Architectures
October 17, 2006
Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation

MIPS FP Unit using Tomasulo’s Algorithm

MIPS Processor with Scoreboard

Three Steps in Execution for Tomasulo’s Algorithm
1. Issue: if no structural hazards
2. Execute: if both operands are available
3. Write result: on the common data bus (CDB), and from there into the reservation stations waiting for the result

Recall the Four Steps in Execution for the Scoreboard
1. Issue: if no structural or WAW hazards
2. Read operands: if no RAW hazards
3. Execute: if both operands are received
4. Write result: if no WAR hazards
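The step conditions listed above can be written as simple predicates, which makes the difference between the two schemes easier to see. This is only a sketch: the function and parameter names are hypothetical, and each predicate collapses the hardware's bookkeeping into a boolean or a count.

```python
# Hypothetical predicates for the step conditions; not a hardware description.

def tomasulo_can_issue(free_stations: int) -> bool:
    # Issue needs only a free reservation station (no structural hazard).
    return free_stations > 0

def tomasulo_can_execute(qj, qk) -> bool:
    # Execute once neither operand is still waiting on a producer tag.
    return qj is None and qk is None

def scoreboard_can_issue(unit_busy: bool, dest_write_pending: bool) -> bool:
    # Issue needs a free functional unit and no earlier pending write
    # to the same destination register (no WAW hazard).
    return not unit_busy and not dest_write_pending

def scoreboard_can_read_operands(src1_ready: bool, src2_ready: bool) -> bool:
    # Operands are read only when both are ready (no RAW hazard).
    return src1_ready and src2_ready

def scoreboard_can_write(earlier_read_pending: bool) -> bool:
    # The write waits while an earlier instruction still has to read
    # the old value of the destination register (WAR hazard).
    return not earlier_read_pending
```

Note that the Tomasulo side has no WAW or WAR check at all: those conditions appear only in the scoreboard predicates.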

How Hazards Are Handled
- Structural hazards: reservation stations allow more instructions to be issued
- RAW hazards: an instruction is executed only when its operands are available
- WAR and WAW hazards: register renaming eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that the out-of-order write does not affect any instruction that depends on an earlier value of an operand
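The effect of renaming can be illustrated in software. The sketch below gives every destination a fresh name and points later sources at the most recent name. The names T0, T1, ... are hypothetical; in Tomasulo's scheme the reservation-station tags play this role, and the renaming happens in hardware at issue time rather than as a pass over the code.

```python
from itertools import count

def rename(instrs):
    """Rename destinations of (op, dest, src1, src2) tuples to fresh names."""
    fresh = (f"T{i}" for i in count())   # hypothetical temporary names
    current = {}                         # architectural register -> latest name
    renamed = []
    for op, dest, src1, src2 in instrs:
        # Sources are looked up before the destination is renamed, so an
        # instruction that reads and writes the same register sees the old value.
        r1 = current.get(src1, src1)
        r2 = current.get(src2, src2)
        new_dest = next(fresh)
        current[dest] = new_dest
        renamed.append((op, new_dest, r1, r2))
    return renamed

code = [
    ("MUL.D", "F0", "F2", "F4"),
    ("SUB.D", "F8", "F2", "F6"),
    ("DIV.D", "F10", "F0", "F6"),   # reads F6
    ("ADD.D", "F6", "F8", "F2"),    # writes F6: WAR hazard with DIV.D
]
renamed = rename(code)
for instr in renamed:
    print(instr)
```

After renaming, DIV.D still reads the original F6 while ADD.D writes to a fresh name, so ADD.D may complete first without disturbing DIV.D.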

Tags
- A tag is a 4-bit quantity that denotes one of the five reservation stations or one of the six load buffers
- Tag fields are found in the reservation stations, the register file, and the store buffers
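A quick check shows why four bits are enough. The sketch below numbers the eleven possible producers and assumes, as an illustration only, that code 0 is reserved to mean "no pending producer". The buffer names Load3 through Load6 and the particular numbering are assumptions, not the encoding of any real machine.

```python
# Five reservation stations and six load buffers can produce results.
STATIONS = ["Add1", "Add2", "Add3", "Mult1", "Mult2"]
LOAD_BUFFERS = [f"Load{i}" for i in range(1, 7)]

NO_TAG = 0  # assumption: 0 means the value is already available
TAG = {name: code for code, name in enumerate(STATIONS + LOAD_BUFFERS, start=1)}

# Number of bits needed to hold the largest code.
bits = max(TAG.values()).bit_length()
print(len(TAG), "producers need", bits, "bits")
```

Eleven producers plus the reserved value give twelve codes, which do not fit in three bits (eight codes) but do fit in four (sixteen codes).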

Example
L.D   F6,34(R2)
L.D   F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Three Tables (the 1st table is not part of the hardware; the 2nd and 3rd tables are distributed)
1. Instruction status: indicates which of the three steps each instruction has reached
2. Reservation stations: Busy, Op, Vj, Vk, Qj, Qk, A (V = value; Q = reservation station that will produce the value)
3. Register status: indicates which reservation station will write each register
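One way to picture the second and third tables is as a small data structure plus an issue routine that fills it in. The sketch below is a simplified model, not the hardware: the class and function names are hypothetical, operand values are kept as strings such as "Reg[F4]", and only the issue step is modeled. It issues the first three instructions of the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # operand values
    vk: Optional[str] = None
    qj: Optional[str] = None   # stations that will produce the operands
    qk: Optional[str] = None
    a: Optional[str] = None    # address information for loads

NAMES = ["Load1", "Load2", "Add1", "Add2", "Add3", "Mult1", "Mult2"]
rs = {name: Station() for name in NAMES}
qi = {}  # register status: register -> station that will write it

def free_station(kind):
    """Return the first idle station of this kind, or None (structural hazard)."""
    for name in NAMES:
        if name.startswith(kind) and not rs[name].busy:
            return name
    return None

def read_operand(reg):
    """Return (value, tag): a tag if a producer is pending, else the value."""
    if reg in qi:
        return None, qi[reg]
    return f"Reg[{reg}]", None

def issue_load(dest, address):
    name = free_station("Load")
    if name is None:
        return None
    rs[name] = Station(busy=True, op="Load", a=address)
    qi[dest] = name
    return name

def issue_fp(op, kind, dest, src_j, src_k):
    name = free_station(kind)
    if name is None:
        return None
    # Read the sources before claiming the destination register.
    vj, qj = read_operand(src_j)
    vk, qk = read_operand(src_k)
    rs[name] = Station(busy=True, op=op, vj=vj, vk=vk, qj=qj, qk=qk)
    qi[dest] = name
    return name

issue_load("F6", "34+Reg[R2]")             # L.D   F6,34(R2)
issue_load("F2", "45+Reg[R3]")             # L.D   F2,45(R3)
issue_fp("Mult", "Mult", "F0", "F2", "F4") # MUL.D F0,F2,F4
```

After these three calls, Mult1 holds Reg[F4] in Vk and waits on Load2 in Qj, and the register status maps F0, F2, and F6 to Mult1, Load2, and Load1.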

Figure 0.0

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √
L.D F2,45(R3)     √      √
MUL.D F0,F2,F4    √
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   34+Reg[R2]
Load2  Yes   Load                                                   45+Reg[R3]
Add1   No
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F4]          Load2
Mult2  No

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1  Load2         Load1

Figure 0.1

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √
L.D F2,45(R3)     √      √
MUL.D F0,F2,F4    √
SUB.D F8,F2,F6    √
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   34+Reg[R2]
Load2  Yes   Load                                                   45+Reg[R3]
Add1   Yes   Sub                                      Load2  Load1
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F4]          Load2
Mult2  No

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1  Load2         Load1  Add1

Figure 0.2 (suppose L.D is slow)

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √
L.D F2,45(R3)     √      √
MUL.D F0,F2,F4    √
SUB.D F8,F2,F6    √
DIV.D F10,F0,F6   √
ADD.D F6,F8,F2

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   34+Reg[R2]
Load2  Yes   Load                                                   45+Reg[R3]
Add1   Yes   Sub                                      Load2  Load1
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F4]          Load2
Mult2  Yes   Div                                      Mult1  Load1

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1  Load2         Load1  Add1   Mult2

Figure 0.3 (suppose L.D is slow)

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √
L.D F2,45(R3)     √      √
MUL.D F0,F2,F4    √
SUB.D F8,F2,F6    √
DIV.D F10,F0,F6   √
ADD.D F6,F8,F2    √

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   34+Reg[R2]
Load2  Yes   Load                                                   45+Reg[R3]
Add1   Yes   Sub                                      Load2  Load1
Add2   Yes   Add                                      Add1   Load2
Add3   No
Mult1  Yes   Mult                    Reg[F4]          Load2
Mult2  Yes   Div                                      Mult1  Load1

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1  Load2         Add2   Add1   Mult2

Figure 3.3

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √        √
L.D F2,45(R3)     √      √
MUL.D F0,F2,F4    √
SUB.D F8,F2,F6    √
DIV.D F10,F0,F6   √
ADD.D F6,F8,F2    √

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  No
Load2  Yes   Load                                                   45+Reg[R3]
Add1   Yes   Sub                     Mem[34+Reg[R2]]  Load2
Add2   Yes   Add                                      Add1   Load2
Add3   No
Mult1  Yes   Mult                    Reg[F4]          Load2
Mult2  Yes   Div                     Mem[34+Reg[R2]]  Mult1

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1  Load2         Add2   Add1   Mult2
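The change between the last two figures is the write-result step of the first load: its tag and value go out on the CDB, and every station waiting on that tag captures the value. The sketch below models just that broadcast. It starts from the state in which all six instructions have issued and no load has completed; the dictionary layout and function names are hypothetical, and values are kept as strings.

```python
# State with all six instructions issued and no load completed yet.
stations = {
    "Add1":  {"op": "Sub",  "vj": None, "vk": None,      "qj": "Load2", "qk": "Load1"},
    "Add2":  {"op": "Add",  "vj": None, "vk": None,      "qj": "Add1",  "qk": "Load2"},
    "Mult1": {"op": "Mult", "vj": None, "vk": "Reg[F4]", "qj": "Load2", "qk": None},
    "Mult2": {"op": "Div",  "vj": None, "vk": None,      "qj": "Mult1", "qk": "Load1"},
}
qi = {"F0": "Mult1", "F2": "Load2", "F6": "Add2", "F8": "Add1", "F10": "Mult2"}
regs = {}  # registers written so far

def broadcast(tag, value):
    """Write result: deliver (tag, value) to every waiting consumer."""
    for s in stations.values():
        if s["qj"] == tag:
            s["vj"], s["qj"] = value, None
        if s["qk"] == tag:
            s["vk"], s["qk"] = value, None
    # A register is updated only if this tag is still its latest producer.
    for reg, producer in list(qi.items()):
        if producer == tag:
            regs[reg] = value
            del qi[reg]

def ready(name):
    """A station may start executing once it waits on no tags."""
    s = stations[name]
    return s["qj"] is None and s["qk"] is None

broadcast("Load1", "Mem[34+Reg[R2]]")   # first load writes its result
```

Add1 and Mult2 now hold the loaded value in Vk but still wait on Load2 and Mult1. The register F6 is not written, because ADD.D has already claimed it through Add2; this is how the WAW hazard on F6 disappears.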

Figure 0.4 (2nd load just completes)

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √        √
L.D F2,45(R3)     √      √        √
MUL.D F0,F2,F4    √      √
SUB.D F8,F2,F6    √      √
DIV.D F10,F0,F6   √
ADD.D F6,F8,F2    √

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  No
Load2  No
Add1   Yes   Sub    Mem[45+Reg[R3]]  Mem[34+Reg[R2]]
Add2   Yes   Add                     Mem[45+Reg[R3]]  Add1
Add3   No
Mult1  Yes   Mult   Mem[45+Reg[R3]]  Reg[F4]
Mult2  Yes   Div                     Mem[34+Reg[R2]]  Mult1

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1                Add2   Add1   Mult2

Figure 3.4

Instruction status
Instruction       Issue  Execute  Write Result
L.D F6,34(R2)     √      √        √
L.D F2,45(R3)     √      √        √
MUL.D F0,F2,F4    √      √
SUB.D F8,F2,F6    √      √        √
DIV.D F10,F0,F6   √
ADD.D F6,F8,F2    √      √        √

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  No
Load2  No
Add1   No
Add2   No
Add3   No
Mult1  Yes   Mult   Mem[45+Reg[R3]]  Reg[F4]
Mult2  Yes   Div                     Mem[34+Reg[R2]]  Mult1

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Mult1                                     Mult2

Loop-Based Example
Loop: L.D    F0,0(R1)
      MUL.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,Loop

Figure 0.5. One active iteration of loop

Instruction status
Instruction       Iteration  Issue  Execute  Write Result
L.D F0,0(R1)      1          √      √
MUL.D F4,F0,F2    1          √
S.D F4,0(R1)      1          √
L.D F0,0(R1)      2
MUL.D F4,F0,F2    2
S.D F4,0(R1)      2

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   Reg[R1]
Load2  No
Add1   No
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F2]          Load1
Mult2  No
Store1 Yes   Store                                           Mult1  Reg[R1]
Store2 No

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Load1         Mult1

Figure 0.6. One+ active iteration of loop

Instruction status
Instruction       Iteration  Issue  Execute  Write Result
L.D F0,0(R1)      1          √      √
MUL.D F4,F0,F2    1          √
S.D F4,0(R1)      1          √
L.D F0,0(R1)      2          √
MUL.D F4,F0,F2    2
S.D F4,0(R1)      2

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   Reg[R1]
Load2  Yes   Load                                                   Reg[R1]-8
Add1   No
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F2]          Load1
Mult2  No
Store1 Yes   Store                                           Mult1  Reg[R1]
Store2 No

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Load2         Mult1

Figure 0.7. One++ active iteration of loop

Instruction status
Instruction       Iteration  Issue  Execute  Write Result
L.D F0,0(R1)      1          √      √
MUL.D F4,F0,F2    1          √
S.D F4,0(R1)      1          √
L.D F0,0(R1)      2          √      √
MUL.D F4,F0,F2    2          √
S.D F4,0(R1)      2

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   Reg[R1]
Load2  Yes   Load                                                   Reg[R1]-8
Add1   No
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F2]          Load1
Mult2  Yes   Mult                    Reg[F2]          Load2
Store1 Yes   Store                                           Mult1  Reg[R1]
Store2 No

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Load2         Mult2

Figure 3.6. Two active iterations of loop

Instruction status
Instruction       Iteration  Issue  Execute  Write Result
L.D F0,0(R1)      1          √      √
MUL.D F4,F0,F2    1          √
S.D F4,0(R1)      1          √
L.D F0,0(R1)      2          √      √
MUL.D F4,F0,F2    2          √
S.D F4,0(R1)      2          √

Reservation stations
Name   Busy  Op     Vj               Vk               Qj     Qk     A
Load1  Yes   Load                                                   Reg[R1]
Load2  Yes   Load                                                   Reg[R1]-8
Add1   No
Add2   No
Add3   No
Mult1  Yes   Mult                    Reg[F2]          Load1
Mult2  Yes   Mult                    Reg[F2]          Load2
Store1 Yes   Store                                           Mult1  Reg[R1]
Store2 Yes   Store                                           Mult2  Reg[R1]-8

Register status
Field   F0     F2     F4     F6     F8     F10    F12  …  F30
Qi      Load2         Mult2
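The loop figures show the hardware effectively unrolling the loop: each iteration gets its own stations, so the second iteration's F0 and F4 are distinct from the first iteration's. The sketch below issues two iterations and tracks only which station each instruction waits on. It assumes the branch is predicted taken and that the integer unit has already produced the decremented address, and the dictionary layout and function names are hypothetical.

```python
# Free stations of each kind, in allocation order.
free = {
    "Load":  ["Load1", "Load2"],
    "Mult":  ["Mult1", "Mult2"],
    "Store": ["Store1", "Store2"],
}
stations = {}
qi = {}  # register status: register -> station that will write it

def source(reg):
    """Producer tag if one is pending, otherwise the register value."""
    return qi.get(reg, f"Reg[{reg}]")

def issue_iteration(offset):
    """Issue L.D, MUL.D, and S.D for one iteration; offset is '' or '-8'."""
    address = f"Reg[R1]{offset}"

    load = free["Load"].pop(0)                # L.D F0,0(R1)
    stations[load] = {"op": "Load", "a": address}
    qi["F0"] = load

    mult = free["Mult"].pop(0)                # MUL.D F4,F0,F2
    stations[mult] = {"op": "Mult", "j": source("F0"), "k": source("F2")}
    qi["F4"] = mult

    store = free["Store"].pop(0)              # S.D F4,0(R1)
    stations[store] = {"op": "Store", "value": source("F4"), "a": address}

issue_iteration("")     # iteration 1
issue_iteration("-8")   # iteration 2, address already decremented
```

Mult1 still waits on Load1 and Store1 on Mult1, while Mult2 waits on Load2 and Store2 on Mult2, so the two iterations can proceed independently even though they name the same architectural registers.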

IBM 360/91
Great ideas:
- Data tagging
- Register renaming
- Dynamic detection of memory hazards
- Generalized forwarding
These ideas are now broadly used in microprocessors
Was the 360/91 successful commercially?

IBM 360/85 (1968)
First commercial computer with a cache. Compared with the 360/91:
- Slower clock (80 ns versus 60 ns cycle time)
- Less memory interleaving (4 versus 16)
- Slower main memory (1.04 μs versus 0.75 μs)
- Cheaper in price
Which machine was faster on applications?