1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 3, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Introduction)

2 Instruction-Level Parallelism
 Relevant Book Reading (HP3):
  Dynamic Scheduling (in hardware): Appendix A & Chapter 3
  Compiler Scheduling (in software): Chapter 4

3 Hardware Schemes for ILP
 Why do it in hardware at run time?
  Works when dependences can't be known at compile time
  Simpler compiler
  Code for one machine runs well on another machine
 Key idea: Allow instructions behind a stall to proceed
    DIV.D  F0, F2, F4
    ADD.D  F10, F0, F8
    SUB.D  F12, F8, F14
  Enables out-of-order execution
  Implies out-of-order completion
  ID stage checks for both structural and data dependences

4 Dynamic Scheduling
    DIV.D  F0, F2, F4
    ADD.D  F10, F0, F8
    SUB.D  F12, F8, F14
  (7-cycle divider, 4-cycle adder)
 Instructions are issued in order (leftmost I)
 Execution can begin out of order (leftmost E)
 Execution can terminate out of order (W)
 What is I?

5 Explanation of I
 To be able to execute the SUB.D instruction:
  A functional unit must be available
   Adder is free in this example
  There should be no data hazards preventing early execution
   None in this example
  We must be able to recognize the two previous conditions
   Must examine several instructions before deciding what to execute
   I represents the instruction window (or issue window) in which this examination happens
 If every instruction starts execution in order, then I is superfluous
 Otherwise:
  Instructions enter the issue window in order
  Several instructions may be in the issue window at any instant
  Execution can begin out of order
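A minimal sketch of the issue-window idea, written by me for illustration rather than taken from the slides: each cycle, any instruction already in the window whose source registers are ready and whose functional unit is free may begin execution, even if an older instruction is still waiting. The conventions (one issue per cycle in program order, execution may begin the cycle after issue, 7-cycle divide, 4-cycle add/sub) are assumptions chosen to match the slide-4 example; exact cycle numbers depend on those choices.

LATENCY = {"DIV.D": 7, "ADD.D": 4, "SUB.D": 4}
UNIT    = {"DIV.D": "div", "ADD.D": "add", "SUB.D": "add"}

window = [                      # (opcode, dest, src1, src2) -- the slide-4 example
    ("DIV.D", "F0",  "F2", "F4"),
    ("ADD.D", "F10", "F0", "F8"),
    ("SUB.D", "F12", "F8", "F14"),
]

ready_at = {}                            # register -> cycle its pending value arrives
busy_until = {"div": 0, "add": 0}        # functional unit -> last cycle it is busy
started = {}

cycle = 1
while len(started) < len(window) and cycle < 50:
    for idx, (op, dst, s1, s2) in enumerate(window):
        in_window = cycle >= idx + 2          # issued at cycle idx+1, may execute next cycle
        if idx in started or not in_window:
            continue
        operands_ok = all(ready_at.get(r, 0) <= cycle for r in (s1, s2))
        unit_ok = busy_until[UNIT[op]] < cycle
        if operands_ok and unit_ok:
            started[idx] = cycle
            finish = cycle + LATENCY[op]
            busy_until[UNIT[op]] = finish - 1
            ready_at[dst] = finish            # result usable once execution completes
            print(f"cycle {cycle:2}: {op:5} begins execution, result ready at cycle {finish}")
    cycle += 1

Running this prints DIV.D first, then SUB.D (out of order, since the adder is free and its operands are ready), and ADD.D only once F0 becomes available.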

6 Out-of-order Execution and Renaming
    DIV.D  F0, F2, F4
    ADD.D  F10, F0, F8
    SUB.D  F10, F8, F14
 WAW hazard on register F10: prevents out-of-order execution on a machine like the CDC 6600
 If the processor were capable of register renaming:
  the WAW hazard would be eliminated
  SUB.D could execute early as before
  example: IBM 360/91
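A small illustrative sketch of register renaming (my own example, not from the slides; the physical register names P0, P1, ... are hypothetical): every architectural destination is mapped to a fresh physical register and sources read the current mapping, so the two writes to F10 no longer conflict and the WAW hazard disappears.

from itertools import count

phys = count()                       # supply of fresh physical register names
rename_map = {}                      # architectural register -> current physical name

def rename(op, dst, src1, src2):
    s1 = rename_map.get(src1, src1)  # sources read the *current* mapping
    s2 = rename_map.get(src2, src2)
    d = f"P{next(phys)}"             # destination gets a brand-new physical register
    rename_map[dst] = d
    return (op, d, s1, s2)

program = [
    ("DIV.D", "F0",  "F2", "F4"),
    ("ADD.D", "F10", "F0", "F8"),
    ("SUB.D", "F10", "F8", "F14"),   # writes F10 again -> WAW with ADD.D
]

for instr in program:
    print(rename(*instr))
# ('DIV.D', 'P0', 'F2', 'F4')
# ('ADD.D', 'P1', 'P0', 'F8')
# ('SUB.D', 'P2', 'F8', 'F14')   <- now independent of ADD.D; can execute early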

7 Memory Consistency
 Memory consistency refers to the order of main memory accesses as compared to the order seen in sequential (unpipelined) execution
  Strong memory consistency: all memory accesses are made in strict program order
  Weak memory consistency: memory accesses may be made out of order, provided that no dependences are violated
 Weak memory consistency is more desirable
  leads to increased performance
 In what follows, ignore register hazards
 Q: When can two memory accesses be reordered?

8 Four Possibilities for Load/Store Motion
  Load-Load:   LW R1, (R2)   LW R3, (R4)
  Load-Store:  LW R1, (R2)   SW (R3), R4
  Store-Store: SW (R1), R2   SW (R3), R4
  Store-Load:  SW (R1), R2   LW R3, (R4)
 Load-Load can always be interchanged (if no volatiles)
 Load-Store and Store-Store are never interchanged
 Store-Load is the only promising program transformation
  Load is done earlier than planned, which can only help
  Store is done later than planned, which should cause no harm
 Two variants of the transformation
  If the load is independent of the store, we have load bypassing
  If the load is dependent on the store through memory (e.g., (R1) == (R4)), we have load forwarding
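A hedged sketch of the rule on this slide (my own illustration, comparing symbolic addresses that are assumed known): given two adjacent memory operations in program order, decide whether they may be interchanged, and for a Store-Load pair whether the reordering is bypassing or forwarding.

def may_reorder(first, second):
    kind1, addr1 = first             # ("load" or "store", symbolic address)
    kind2, addr2 = second
    if kind1 == "load" and kind2 == "load":
        return "yes: load-load can always be interchanged (absent volatiles)"
    if kind2 == "store":
        return "no: load-store and store-store are never interchanged"
    # remaining case: store followed by load
    if addr1 != addr2:
        return "yes: load bypassing (the load is independent of the store)"
    return "yes: load forwarding (the load takes its value directly from the store)"

print(may_reorder(("store", "0(R1)"), ("load", "0(R4)")))   # different symbolic addresses -> bypassing
print(may_reorder(("store", "0(R1)"), ("load", "0(R1)")))   # same address -> forwarding
print(may_reorder(("load",  "0(R2)"), ("store", "0(R3)")))  # load-store: never interchanged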

9 More on Load Bypassing and Forwarding
 Either transformation can be performed at compile time if the memory addresses are known, or at run time if the necessary hardware capabilities are available
 Compiler performs load bypassing in the loop unrolling example (next lecture)
 In general, if the compiler is not sure, it should not do the transformation
 Hardware is never “not sure”
  Load bypassing: the independent load L2 is simply moved ahead of the store S1
  Load forwarding:
   S1: M[R1] ← R2
   L2: R4 ← M[R1]
  becomes
   R4 ← R2
   M[R1] ← R2
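An illustrative-only sketch of the load-forwarding rewrite shown above, assuming the store's address and data register are both known: the dependent load is satisfied directly from the store's data (a register copy), and the store itself still goes to memory. The tuple encoding and the MOVE pseudo-op are my own notation.

def forward(store, load):
    """store = ('SW', addr, data_reg); load = ('LW', dest_reg, addr)."""
    _, s_addr, s_data = store
    _, l_dest, l_addr = load
    if s_addr == l_addr:
        # load forwarding: dest <- store data; the load's memory access is eliminated
        return [("MOVE", l_dest, s_data), store]
    # otherwise plain load bypassing: the independent load may be hoisted above the store
    return [load, store]

print(forward(("SW", "0(R1)", "R2"), ("LW", "R4", "0(R1)")))
# [('MOVE', 'R4', 'R2'), ('SW', '0(R1)', 'R2')]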

10 Load Bypassing in Hardware
 Requires two separate queues for LOADs and STOREs
 Every LOAD has to be checked against every STORE waiting in the store queue to determine whether there is a hazard on a memory location
  assume the processor knows the original program order of all these memory instructions
 In general, LOAD has priority over STORE
 For the selected LOAD instruction, if there exists a STORE instruction in the store queue such that
  the LOAD is behind the STORE (in program order), and
  their memory addresses are the same,
 then the LOAD cannot be sent to memory, and must wait to execute until after the store has executed
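A minimal sketch of the check described above (the queue contents and sequence numbers are hypothetical): before a LOAD is sent to memory, scan the store queue for any older STORE, in program order, to the same address; if one exists, the load must wait.

def load_may_issue(load, store_queue):
    """load = (seq_no, address); store_queue = list of (seq_no, address)."""
    load_seq, load_addr = load
    for store_seq, store_addr in store_queue:
        if store_seq < load_seq and store_addr == load_addr:
            return False            # hazard: an earlier store to the same location is pending
    return True                     # safe for the load to bypass the waiting stores

# Hypothetical queue contents; sequence numbers record original program order.
load_queue  = [(5, 0x1000), (6, 0x2000)]
store_queue = [(3, 0x2000)]

print(load_may_issue(load_queue[0], store_queue))   # True  -> load 5 bypasses store 3
print(load_may_issue(load_queue[1], store_queue))   # False -> load 6 must wait for store 3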

11 Example of Load Bypassing
 Memory access takes four cycles
 Actions at various points in time:
  End of cycle 1: LQ = [(0)]; SQ = [ ]; execute first load
  End of cycle 5: LQ = [(3), (4)]; SQ = [(1)]; execute first load
  End of cycle 9: LQ = [(4), (5)]; SQ = [(1), (7)]; execute first load
  End of cycle 13: LQ = [(5)]; SQ = [(1), (7)]; load yields to store
 We are assuming that no LOADs or STOREs issue between instructions 7 and 22

12 History of Dynamic Scheduling
 First implemented on the CDC 6600 and IBM 360/91 in the 1960s
  Fetched and decoded a single instruction at a time, in program order
  Between decoding and execution, instructions stayed in an issue window where hazards were detected and resolved
  Some hazards resolved before issuing (e.g., availability of FU)
  Most data hazards resolved only after issuing
  Hazard resolution done with a sophisticated hardware scheme
   For each instruction in the issue window, separately track, monitor, enforce, and eventually resolve all hazards affecting the instruction before it can begin execution
  Result: instructions started execution when they were ready to execute, rather than starting in program order