CSC 4250 Computer Architectures, October 20, 2006. Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation


One More Example on Tomasulo’s Algorithm
    L.D   F0,0(R0)
    ADD.D F0,F0,F2
    MUL.D F0,F0,F4
    ADD.D F0,F0,F2
    MUL.D F0,F0,F4
    S.D   F0,0(R0)
    ADD.D F0,F4,F2

IBM 360 Assembly Language
Only two operands. Advantage? Disadvantage?
Example:
    L.D   F0,0(R0)
    ADD.D F0,F2
    MUL.D F0,F4
    ADD.D F0,F2
    MUL.D F0,F4
    S.D   F0,0(R0)
    …

Figure 0.1

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √
ADD.D F0,F0,F2
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D   F0,0(R0)
ADD.D F0,F4,F2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    No
Add2    No
Add3    No
Mult1   No
Mult2   No
Store1  No

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Load1

Figure 0.2

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D   F0,0(R0)
ADD.D F0,F4,F2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    No
Add3    No
Mult1   No
Mult2   No
Store1  No

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Add1

Figure 0.3

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D   F0,0(R0)
ADD.D F0,F4,F2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    No
Add3    No
Mult1   Yes   Mult            Reg[F4]  Add1
Mult2   No
Store1  No

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Mult1

Figure 0.4

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4
S.D   F0,0(R0)
ADD.D F0,F4,F2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    Yes   Add             Reg[F2]  Mult1
Add3    No
Mult1   Yes   Mult            Reg[F4]  Add1
Mult2   No
Store1  No

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Add2

Figure 0.5

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D   F0,0(R0)
ADD.D F0,F4,F2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    Yes   Add             Reg[F2]  Mult1
Add3    No
Mult1   Yes   Mult            Reg[F4]  Add1
Mult2   Yes   Mult            Reg[F4]  Add2
Store1  No

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Mult2

Figure 0.6

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D   F0,0(R0)    √
ADD.D F0,F4,F2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    Yes   Add             Reg[F2]  Mult1
Add3    No
Mult1   Yes   Mult            Reg[F4]  Add1
Mult2   Yes   Mult            Reg[F4]  Add2
Store1  Yes   Store                           Mult2  0+Reg[R0]

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Mult2

Figure 0.7

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D   F0,0(R0)    √
ADD.D F0,F4,F2    √

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    Yes   Add             Reg[F2]  Mult1
Add3    Yes   Add    Reg[F4]  Reg[F2]
Mult1   Yes   Mult            Reg[F4]  Add1
Mult2   Yes   Mult            Reg[F4]  Add2
Store1  Yes   Store                           Mult2  0+Reg[R0]

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Add3

Figure 0.8

Instruction       Issue  Execute  Write Result
L.D   F0,0(R0)    √      √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
ADD.D F0,F0,F2    √
MUL.D F0,F0,F4    √
S.D   F0,0(R0)    √
ADD.D F0,F4,F2    √      √        √

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   0+Reg[R0]
Add1    Yes   Add             Reg[F2]  Load1
Add2    Yes   Add             Reg[F2]  Mult1
Add3    No
Mult1   Yes   Mult            Reg[F4]  Add1
Mult2   Yes   Mult            Reg[F4]  Add2
Store1  Yes   Store                           Mult2  0+Reg[R0]

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi
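The issue-stage bookkeeping traced in Figures 0.1 to 0.7 can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the names `POOLS`, `PROGRAM`, and `issue_all` are hypothetical, the station pools are simply read off the tables, the store's pending value is assumed to be tracked in Vk/Qk, and the sketch assumes no instruction writes its result while the seven instructions issue (as in the figures, where L.D is still executing).

```python
# Station names follow the figures; pool sizes are read off the same tables.
POOLS = {
    "L.D": ["Load1"],
    "ADD.D": ["Add1", "Add2", "Add3"],
    "MUL.D": ["Mult1", "Mult2"],
    "S.D": ["Store1"],
}

# (opcode, destination, source j, source k, address field)
PROGRAM = [
    ("L.D",   "F0", None, None, "0+Reg[R0]"),
    ("ADD.D", "F0", "F0", "F2", None),
    ("MUL.D", "F0", "F0", "F4", None),
    ("ADD.D", "F0", "F0", "F2", None),
    ("MUL.D", "F0", "F0", "F4", None),
    ("S.D",   None, None, "F0", "0+Reg[R0]"),  # stored value kept in Vk/Qk (assumed)
    ("ADD.D", "F0", "F4", "F2", None),
]


def issue_all(program):
    """Issue every instruction in order; nothing is written back meanwhile."""
    stations = {}  # station name -> its fields
    qi = {}        # register -> station that will produce its value
    for op, dest, src_j, src_k, addr in program:
        # Raises StopIteration if the pool is full (structural stalls are not modeled).
        name = next(s for s in POOLS[op] if s not in stations)
        entry = {"Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None, "A": addr}
        for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
            if src is None:
                continue
            if src in qi:
                entry[q_field] = qi[src]        # operand pending: record producer's tag
            else:
                entry[v_field] = f"Reg[{src}]"  # operand ready: copy the value
        if dest is not None:
            qi[dest] = name                     # rename: later readers wait on this station
        stations[name] = entry
    return stations, qi


stations, qi = issue_all(PROGRAM)
for name, entry in stations.items():
    print(name, entry)
print("Qi:", qi)
```

Under these assumptions the final state matches Figure 0.7: every write to F0 gets its own station tag, so the last ADD.D (Add3) has both operands ready and need not wait for the chain of earlier instructions.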

Modified Loop-Based Example
Loop: L.D    F0,0(R1)
      MUL.D  F0,F0,F2
      ADD.D  F0,F0,F4
      S.D    F0,0(R1)
      DADDIU R1,R1,#−8
      BNE    R1,R2,Loop

Figure 0.1. One active iteration of loop

Instruction       Iteration  Issue  Execute  Write Result
L.D   F0,0(R1)    1          √      √
MUL.D F0,F0,F2    1          √
ADD.D F0,F0,F4    1          √
S.D   F0,0(R1)    1          √
L.D   F0,0(R1)    2
MUL.D F0,F0,F2    2
ADD.D F0,F0,F4    2
S.D   F0,0(R1)    2

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   Reg[R1]
Load2   No
Add1    Yes   Add             Reg[F4]  Mult1
Add2    No
Mult1   Yes   Mult            Reg[F2]  Load1
Mult2   No
Store1  Yes   Store                           Add1   Reg[R1]
Store2  No

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Add1

Figure 0.2. Two active iterations of loop

Instruction       Iteration  Issue  Execute  Write Result
L.D   F0,0(R1)    1          √      √
MUL.D F0,F0,F2    1          √
ADD.D F0,F0,F4    1          √
S.D   F0,0(R1)    1          √
L.D   F0,0(R1)    2          √      √
MUL.D F0,F0,F2    2          √
ADD.D F0,F0,F4    2          √
S.D   F0,0(R1)    2          √

Name    Busy  Op     Vj       Vk       Qj     Qk     A
Load1   Yes   Load                                   Reg[R1]
Load2   Yes   Load                                   Reg[R1]-8
Add1    Yes   Add             Reg[F4]  Mult1
Add2    Yes   Add             Reg[F4]  Mult2
Mult1   Yes   Mult            Reg[F2]  Load1
Mult2   Yes   Mult            Reg[F2]  Load2
Store1  Yes   Store                           Add1   Reg[R1]
Store2  Yes   Store                           Add2   Reg[R1]-8

        F0      F2  F4  F6  F8  F10  F12  …  F30
Qi      Add2
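One way to see why the figure stops at two active iterations is to count stations: each iteration holds one load buffer, one multiplier station, one adder station, and one store buffer, and the tables list two of each. The sketch below is illustrative only. `STATIONS`, `LOOP_BODY`, and `issue_iterations` are hypothetical names, and it assumes DADDIU and BNE run on the integer unit without occupying any of these stations and that no station is freed while issuing.

```python
from collections import Counter

# Station counts read off the loop figures (Load1-2, Mult1-2, Add1-2, Store1-2).
STATIONS = {"Load": 2, "Mult": 2, "Add": 2, "Store": 2}

# Station kind held by each FP/memory instruction of one loop iteration.
# Assumption: DADDIU and BNE use the integer unit and hold none of these.
LOOP_BODY = ["Load", "Mult", "Add", "Store"]


def issue_iterations(stations, body):
    """Issue whole iterations in order until some station kind runs out.

    Returns one list of station names per issued iteration.
    """
    if not body:
        return []
    need = Counter(body)   # stations of each kind required per iteration
    used = Counter()       # stations of each kind handed out so far
    issued = []
    while all(used[kind] + n <= stations[kind] for kind, n in need.items()):
        names = []
        for kind in body:
            used[kind] += 1
            names.append(f"{kind}{used[kind]}")
        issued.append(names)
    return issued


for i, names in enumerate(issue_iterations(STATIONS, LOOP_BODY), start=1):
    print(f"iteration {i}: {names}")
```

With the counts above, two iterations issue and the third would stall until iteration 1 releases a station. Because each iteration's instructions hold different stations, the second iteration's F0 is tracked by a different tag from the first, which is how the hardware unrolls the loop dynamically.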


Dynamic Branch Prediction
Static branch prediction is covered in Appendix A.
Branch prediction buffer: a small memory indexed by the low-order bits of the branch instruction's address. Each entry holds one bit that says whether the branch was recently taken or not.
The buffer has no tags, so the prediction bit may have been placed there by another branch that shares the same low-order address bits.

Figure. A Branch Prediction Buffer
Use the 4 low-order bits of the branch's word address to choose a row.
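The indexing rule can be sketched as a tiny 1-bit buffer in Python. This is a hedged illustration: the class name `OneBitBuffer`, the 4-byte instruction size used to turn a byte address into a word address, the initial "not taken" state, and the example addresses are all assumptions, not details from the slides.

```python
class OneBitBuffer:
    """Untagged 1-bit branch prediction buffer (illustrative sketch)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # One prediction bit per row; False = predict not taken (assumed start).
        self.bits = [False] * (1 << index_bits)

    def index(self, pc):
        # Byte address -> word address (assumes 4-byte instructions),
        # then keep only the low-order index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.bits[self.index(pc)]

    def update(self, pc, taken):
        self.bits[self.index(pc)] = taken


buf = OneBitBuffer()
buf.update(0x1000, True)            # a branch at 0x1000 is taken
print(buf.index(0x1000), buf.index(0x1040), buf.index(0x1004))
print(buf.predict(0x1040))          # different branch, same row
```

Addresses 0x1000 and 0x1040 differ only above the four index bits, so they select the same row and the second branch inherits the first one's prediction. That is the aliasing the slide refers to when it says the bit may have been placed there by another instruction.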

Nested Loops
Loop1: L.D    F2,1600(R1)
       DADDIU R2,R0,#80
Loop2: L.D    F0,1000(R2)
       ADD.D  F0,F0,F2
       S.D    F0,1000(R2)
       DADDIU R2,R2,#−8
       BNEZ   R2,Loop2
       DADDIU R1,R1,#−8
       BNEZ   R1,Loop1
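To see how a 1-bit predictor behaves on the inner-loop branch `BNEZ R2,Loop2`, the following Python sketch replays its outcomes. It is illustrative only: the function names are hypothetical, the predictor is assumed to start at "not taken" with no aliasing from other branches, and the number of outer-loop passes (3) is an arbitrary choice because the initial value of R1 is not given.

```python
def inner_branch_outcomes(r2_start=80, step=8):
    """Outcomes of `BNEZ R2,Loop2` for one pass over the inner loop.

    Assumes r2_start is a positive multiple of step, as in the slide (80, 8).
    """
    outcomes = []
    r2 = r2_start
    while True:
        r2 -= step              # DADDIU R2,R2,#-8
        taken = r2 != 0         # BNEZ R2,Loop2
        outcomes.append(taken)
        if not taken:
            return outcomes


def one_bit_mispredictions(outcomes, bit=False):
    """Count misses for a 1-bit predictor; returns (misses, final bit)."""
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
        bit = taken             # remember only the most recent outcome
    return misses, bit


OUTER_PASSES = 3                # hypothetical; depends on the initial R1
bit = False
total = 0
for _ in range(OUTER_PASSES):
    misses, bit = one_bit_mispredictions(inner_branch_outcomes(), bit)
    total += misses
print(total, "mispredictions in", OUTER_PASSES * 10, "executions")
```

Each pass over the inner loop executes the branch 10 times (taken 9 times, then not taken). Under the stated assumptions the 1-bit scheme misses twice per pass: once on the loop exit, and once more on re-entry because the exit flipped the bit.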

Figure 3.7. States in 2-bit Prediction Scheme
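A 2-bit scheme can be sketched as a small state machine in Python. The state diagram itself is not reproduced in this transcript, so the sketch below uses the common saturating-counter variant, which may differ in some transitions from the figure. The function name, the initial counter value of 0, and the three-pass outcome pattern (9 taken, then 1 not taken, as produced by the inner loop of the Nested Loops slide) are assumptions made for illustration.

```python
def two_bit_mispredictions(outcomes, counter=0):
    """2-bit saturating counter: states 0..3, predict taken when counter >= 2.

    Returns (misses, final counter value).
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses, counter


ONE_PASS = [True] * 9 + [False]   # inner-loop branch: taken 9 times, then exit

counter = 0
per_pass = []
for _ in range(3):                # hypothetical number of outer-loop passes
    misses, counter = two_bit_mispredictions(ONE_PASS, counter)
    per_pass.append(misses)
print(per_pass)
```

After the warm-up pass, the counter sits in a "predict taken" state when the loop is re-entered, so only the loop exit is mispredicted. A prediction must be wrong twice in a row before it changes, which is the point of using two bits instead of one.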

Figure 3.8. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer for SPEC89 Benchmarks

Figure 3.9. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer versus an infinite 2-bit Prediction Buffer for SPEC89