Instruction Level Parallelism (ILP)


Instruction Level Parallelism (ILP)
- Pipelining is one form of concurrency.
- With several pipelines, there is an opportunity to execute several (pipelined) instructions concurrently.
- With more sophisticated control units, several instructions can be issued in the same cycle.
- If the order of execution is determined by the compiler: static scheduling (superscalar processors).
- If the order of execution is determined at run time: dynamic scheduling (out-of-order execution processors).

ILP improves (reduces) CPI
- Recall, for a single pipeline: CPI = 1 (ideal CPI) + CPI contributed by stalls.
- Allowing n instruction issues per cycle: CPI_n = 1/n (ideal CPI) + CPI contributed by stalls (a worked example follows below).
- ILP will increase structural hazards.
- ILP makes reducing the other stalls even more important: a "bubble" now costs more than the loss of a single instruction issue.
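For concreteness, a small worked example (the issue width n = 4 and the 0.35 stall contribution are assumed numbers, not from the slides):

    CPI_4 = 1/4 + 0.35 = 0.60, i.e., about 1.67 instructions per cycle
    rather than the ideal 4 -- the stall term, not the issue width,
    quickly dominates as n grows.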

Where can we optimize? (control)
- The unit of code in which one can find ILP is the basic block: a contiguous block of instructions with a single entry point and a single exit point (see the sketch below).
- Basic blocks are small (branches occur about 30% of the time).
- ILP can be increased (i.e., the CPI due to control stalls decreased) by:
  - Reducing the number of branches: loop unrolling (compiler)
  - Branch prediction: static (compiler) or dynamic (hardware)
  - Code movement based on trace scheduling (compiler) and/or other methods (not covered in this course)
  - Predication (compiler and hardware): see later
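A quick sketch of where basic-block boundaries fall; the function and its block labels are illustrative, not from the slides:

    /* Each region between branch points is one basic block:
       single entry at the top, single exit at the bottom.        */
    int abs_sum(const int *a, int n) {
        int s = 0;                      /* B0: function entry      */
        for (int i = 0; i < n; i++) {   /* B1: loop test (branch)  */
            if (a[i] < 0)               /* B2: compare ends block  */
                s -= a[i];              /* B3: taken path          */
            else
                s += a[i];              /* B4: fall-through path   */
        }
        return s;                       /* B5: exit block          */
    }

Even in this short function no basic block holds more than a couple of statements, which is why the techniques above aim to create longer straight-line regions.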

Where can we optimize? (data)
- The CPI contributed by data hazards can be decreased by:
  - Compiler optimizations: load scheduling, dependence analysis, software pipelining, trace scheduling (a load-scheduling sketch follows below)
  - Hardware (run-time) techniques: forwarding (we know all about that!) and register renaming
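A minimal load-scheduling sketch in C (the names, and the assumption of a one-cycle load-use delay, are illustrative):

    /* Unscheduled: the add consumes the load result immediately,
       forcing a load-use stall on a simple pipeline.              */
    int unscheduled(const int *p, int a, int b, int *z) {
        int x = *p;       /* load                                  */
        int y = x + 1;    /* stalls waiting for x                  */
        *z = a * b;       /* independent work arrives too late     */
        return y;
    }

    /* Scheduled: the compiler hoists the independent multiply
       into the load-use gap, hiding the latency.                  */
    int scheduled(const int *p, int a, int b, int *z) {
        int x = *p;       /* load                                  */
        *z = a * b;       /* fills the delay                       */
        int y = x + 1;    /* x is now available; no stall          */
        return y;
    }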

Compiler optimizations (a sample)
- The processor (number of pipelines, latencies of the various units, number of registers) is exposed to the compiler.
- Loop unrolling (see next slide) to increase the potential for ILP.
- Avoiding dependences (data, name, control) at the instruction level to increase ILP:
  - Data dependence translates into RAW hazards: load scheduling and code reorganization
  - Name dependence (handled by register allocation): anti-dependence (WAR) and output dependence (WAW)
  - Control dependence = branches: predicated execution

Loop unrolling
- Replicate the body of the loop and correspondingly reduce the number of iterations (cf. software pipelining); a sketch follows below.
- Pros:
  - Decreases loop overhead (branches, counter updates)
  - Allows better scheduling: longer basic blocks, hence better opportunities to find ILP and to hide the latency of "long" operations
- Cons:
  - Increases register pressure
  - Increases code length (I-cache occupancy)
  - Requires a prologue or epilogue (at the beginning or end of the loop)
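A minimal unrolling sketch in C; the factor of 4 and the function names are illustrative choices, not from the slides:

    /* Original loop: one branch and one counter update per element. */
    void scale_add(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Unrolled by 4: one branch per four elements, and a longer
       basic block that the scheduler can reorder freely.  The
       trailing loop is the epilogue for n not divisible by 4.     */
    void scale_add_unrolled(float *y, const float *x, float a, int n) {
        int i = 0;
        for (; i + 3 < n; i += 4) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)       /* epilogue: 0-3 leftover elements */
            y[i] += a * x[i];
    }

Note the costs listed above: the four independent multiply-adds want four temporaries live at once (register pressure), and the body is several times larger (I-cache occupancy).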

Data dependencies: RAW
- Instruction j is dependent on instruction i preceding it in sequential program order if the result of i is a source operand of j (illustrated below).
- Transitivity: if instruction j depends on k and k depends on i, then j also depends on i.
- Dependence is a program property; hazards (RAW in this case) and their (partial) removal are a property of the pipeline organization.
- Code optimization goals:
  - Maintain the dependence but avoid the hazard (e.g., by scheduling)
  - Eliminate the dependence by modifying the code (e.g., by register allocation)
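A small C illustration of a RAW chain and its transitivity (the variable names are made up):

    int raw_chain(int a, int b) {
        int x = a + b;    /* instruction i: produces x              */
        int y = x * 2;    /* instruction k: reads x (RAW on i)      */
        int z = y - a;    /* instruction j: reads y (RAW on k), and
                             transitively dependent on i through k  */
        return z;
    }

A scheduler must keep i before k before j; all it can do is move unrelated instructions between them to hide latency, or let register allocation restructure the code so that a dependence disappears.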

Name dependencies: WAR and WAW
- Anti-dependence (WAR):
  - Instruction i: R1 <- R2 op R3 (long-latency op)
  - Instruction j: R2 <- R4 op R5 (short-latency op)
  - This becomes a WAR hazard if instruction j finishes first.
- Output dependence (WAW), with the same instruction i:
  - Instruction j: R1 <- R4 op R5 (short-latency op)
  - This becomes a WAW hazard if instruction j finishes first.
- In both cases there is no real dependence, only a "naming" problem: it would not happen if we had enough registers (see the renaming sketch below).
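A sketch of how renaming removes the naming problem, with C variables standing in for registers (the code is illustrative):

    /* Before: the add overwrites b, which the long-latency divide
       still reads -- an anti-dependence (WAR) on b.                */
    int before(int a, int b, int c, int d) {
        int t = a / b;    /* long-latency op reads b                */
        b = c + d;        /* short op writes b: WAR with the divide */
        return t + b;
    }

    /* After renaming: the add writes a fresh name, b2.  The two
       statements are now independent and may be reordered or
       overlapped freely.                                           */
    int after(int a, int b, int c, int d) {
        int t  = a / b;
        int b2 = c + d;   /* new "register" eliminates the WAR      */
        return t + b2;
    }

This is, in spirit, what hardware register renaming does at run time with a larger pool of physical registers.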

Control dependencies
- Branches restrict the scheduling of instructions.
- Speculation (i.e., executing an instruction that might not be needed) must be:
  - Safe: it can raise no additional exception
  - Legal: the end result must be the same as without speculation
- Speculation can be implemented by:
  - The compiler (code motion)
  - The hardware (branch prediction)
  - Both (branch prediction, conditional operations); a predication sketch follows below
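A minimal if-conversion (predication) sketch in C; a compiler would typically lower the second form to a conditional move rather than a branch (the functions themselves are illustrative):

    /* Branching form: the assignment to m is control-dependent
       on the outcome of the test.                                 */
    int max_branch(int a, int b) {
        int m = a;
        if (b > a)
            m = b;        /* executed only on one path             */
        return m;
    }

    /* Predicated form: both values are computed and the condition
       merely selects one -- the control dependence has become a
       data dependence, and no branch needs predicting.            */
    int max_predicated(int a, int b) {
        return (b > a) ? b : a;   /* candidate for a conditional move */
    }

The transformation is safe in the slide's sense: evaluating both operands unconditionally cannot raise an additional exception here.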

Static vs. dynamic scheduling
- Assumptions (for now): one instruction issued per cycle; several pipelines sharing a common IF and ID.
- Static scheduling (order optimized by the compiler): when there is a stall (hazard), no further instructions are issued (examples: Alpha 21064 and 21164).
- Dynamic scheduling (enforced by the hardware): instructions following the one that stalls can "issue" if they have no structural hazard or dependence. ("Issue" might take on different meanings, as we'll see.)

Dynamic scheduling
- Implies the possibility of:
  - Out-of-order issue (and hence out-of-order execution)
  - Out-of-order completion (also possible, though less frequent, with static scheduling)
  - Imprecise exceptions
- Example (with different pipes for add/subtract and divide):
  - R1 = R2 / R3 (long latency)
  - R2 = R1 + R5 (stalls because of the RAW dependence on R1)
  - R6 = R7 - R8 (can be issued before the add, and executed and completed before the other two)

Conceptual execution on a processor with ILP
- Instruction fetch and branch prediction: corresponds to IF in the simple pipeline (we have seen that).
- Instruction decode, dependence check, dispatch, issue: corresponds (with many variations) to ID. Although instructions are issued (i.e., assigned to functional units), they might not execute right away; they may have to wait for the instructions they depend on to forward their results.
- Instruction execution: corresponds to EX and/or MEM (with various latencies).
- Instruction commit: corresponds to WB, but is more complex because of speculation and out-of-order completion.