Loop-Level Parallelism


Loop-Level Parallelism

Analysis is done at the source level, looking for dependences across iterations.

No loop-carried dependence (iterations are independent):

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Loop-carried dependence (each iteration uses values produced by earlier iterations):

    for (i = 1; i <= 100; i = i + 1) {
        x[i+1] = x[i] + z[i];    /* loop-carried dependence */
        y[i+1] = y[i] + x[i+1];
    }
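Because the first loop has no loop-carried dependence, its iterations can run in any order. A minimal C sketch (names from the slide; assuming the trip count stays a multiple of 4, as it is here) unrolls it four ways to make that independence explicit; no comparable rewrite is legal for the loop-carried loop:

    /* Four-way unrolled version of: for (i = 1000; i > 0; i--) x[i] += s; */
    void add_scalar_unrolled(double x[1001], double s) {
        int i;
        for (i = 1000; i > 0; i -= 4) {
            x[i]     = x[i]     + s;    /* these four statements touch    */
            x[i - 1] = x[i - 1] + s;    /* disjoint elements, so they can */
            x[i - 2] = x[i - 2] + s;    /* be scheduled or executed in    */
            x[i - 3] = x[i - 3] + s;    /* any order                      */
        }
    }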

Loop-Carried Dependences

    for (i = 1; i <= 100; i = i + 1) {
        x[i]   = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

The dependences here are non-circular, so the loop-carried dependence can be removed by peeling the first statement out of the first iteration and the second statement out of the last:

    x[1] = x[1] + y[1];
    for (i = 1; i <= 99; i = i + 1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];

Compiler Support for ILP

Dependence analysis. Finding dependences is important for:
- good scheduling of code
- determining loop-level parallelism
- eliminating name dependences

Its complexity is low for scalar variable references but high for pointers and array references.

Two further techniques, covered below: software pipelining and trace scheduling.

Loop-Level Parallelism

The primary focus of dependence analysis is to determine all dependences and to find cycles among them.

No loop-carried dependence:

    for (i = 1; i <= 100; i = i + 1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

Loop-carried, recurrent, circular dependence (cannot be removed):

    for (i = 1; i <= 100; i = i + 1) {
        x[i+1] = x[i] + z[i];    /* loop-carried, recurrent, circular */
    }

Loop-carried but non-circular dependence (the earlier example), removable by the transformation already shown:

    for (i = 1; i <= 100; i = i + 1) {
        x[i]   = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

    x[1] = x[1] + y[1];
    for (i = 1; i <= 99; i = i + 1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];

Dependence Analysis Algorithms

Assume array indexes are affine, i.e., of the form a*i + b.

GCD test: for a write with index a*i + b and a read with index c*i + d, if a loop-carried dependence exists, then GCD(c, a) must divide (d - b).

Example: x[8*i] = x[4*i + 2] + 3. Here GCD(8, 4) = 4 does not divide (2 - 0) = 2, so no loop-carried dependence exists.

Caveats: exact dependence analysis is NP-complete in general, and a, b, c, and d may not be known at compile time.
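The test is easy to mechanize. Below is a minimal C sketch (the function names gcd and may_depend are illustrative, not from the slides) applied to the example above; note that the test is conservative, since it ignores loop bounds:

    #include <stdio.h>
    #include <stdlib.h>

    /* Greatest common divisor via Euclid's algorithm. */
    static long gcd(long x, long y) {
        x = labs(x); y = labs(y);
        while (y != 0) { long r = x % y; x = y; y = r; }
        return x;
    }

    /* GCD test for a write x[a*i + b] and a read x[c*i + d]:
       a loop-carried dependence can exist only if
       gcd(a, c) divides (d - b). */
    static int may_depend(long a, long b, long c, long d) {
        long g = gcd(a, c);
        return g != 0 && (d - b) % g == 0;
    }

    int main(void) {
        /* x[8*i] = x[4*i + 2] + 3:
           gcd(8, 4) = 4 does not divide 2 - 0 = 2, so no dependence. */
        puts(may_depend(8, 0, 4, 2) ? "may depend" : "independent");
        return 0;
    }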

Example

Original loop. Statement 1 reads X[I], which statement 2 overwrites (an antidependence); Y[I] is written by statements 1 and 4 and read in between (an output dependence and an antidependence):

    for (I = 1; I <= 100; I = I + 1) {
        Y[I] = X[I] / c;
        X[I] = X[I] + c;
        Z[I] = Y[I] + c;
        Y[I] = c - Y[I];
    }

Renaming Y's first definition to T and X's new value to X1 removes the name dependences, leaving only true dependences:

    for (I = 1; I <= 100; I = I + 1) {
        T[I]  = X[I] / c;
        X1[I] = X[I] + c;
        Z[I]  = T[I] + c;
        Y[I]  = c - T[I];
    }

Software Pipelining

[Figure: iterations 0 through 3 of the original loop overlapped in time; a software-pipelined iteration takes one instruction from each of several consecutive original iterations, with start-up code before the loop and finish-up code after it.]

Example

Body of one iteration (shown for iterations i, i+1, i+2 in the figure):

    LD   F0, 0(R1)     ; load x[i]
    ADDD F4, F0, F2    ; x[i] + s
    SD   0(R1), F4     ; store x[i]

Original loop:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

Software-pipelined loop (steady state; start-up and finish-up code omitted):

    Loop: SD   16(R1), F4    ; store for iteration i
          ADDD F4, F0, F2    ; add for iteration i+1
          LD   F0, 0(R1)     ; load for iteration i+2
          SUBI R1, R1, #8
          BNEZ R1, Loop

The SD uses offset 16 because R1 has been decremented by 8 twice since the corresponding LD executed, two iterations earlier.
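The same rotation can be written at the source level. A minimal C sketch (the variable names t_add and t_load are illustrative; assumes n >= 0): each trip through the steady-state loop stores iteration i's result, performs iteration i+1's add, and loads for iteration i+2:

    /* Software-pipelined version of: for (i = 0; i < n; i++) x[i] += s; */
    void sw_pipeline(double *x, int n, double s) {
        double t_load, t_add;
        int i;
        if (n < 2) {                /* too short to pipeline */
            if (n == 1) x[0] += s;
            return;
        }
        /* start-up: fill the software pipeline */
        t_add  = x[0] + s;          /* add for iteration 0  */
        t_load = x[1];              /* load for iteration 1 */
        /* steady state: one store, one add, one load per trip */
        for (i = 0; i < n - 2; i++) {
            x[i]   = t_add;         /* store for iteration i  */
            t_add  = t_load + s;    /* add for iteration i+1  */
            t_load = x[i + 2];      /* load for iteration i+2 */
        }
        /* finish-up: drain the pipeline */
        x[n - 2] = t_add;
        x[n - 1] = t_load + s;
    }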

Trace Scheduling

Find ILP across conditional branches. A two-step process:

1. Trace selection: find a trace (a likely sequence of basic blocks). Use loop unrolling to generate long traces and static branch prediction to pick the likely direction of other conditional branches.
2. Trace compaction: squeeze the trace into a small number of wide instructions while preserving data and control dependences.

Trace Selection

Source fragment:

    A[I] = A[I] + B[I];
    if (A[I] == 0)
        B[I] = ...;
    else
        X;
    C[I] = ...;

The likely path (A[I] == 0) is selected as the trace:

    LW   R4, 0(R1)      ; load A[I]
    LW   R5, 0(R2)      ; load B[I]
    ADD  R4, R4, R5
    SW   0(R1), R4      ; A[I] = A[I] + B[I]
    BNEZ R4, else       ; off-trace path (F) goes to X
    ...
    SW   0(R2), ...     ; B[I] = ...
    J    join
    else: ...           ; X
    join: ...
    SW   0(R3), ...     ; C[I] = ...

[Figure: flow graph with the trace highlighted through the A[I] = A[I] + B[I] block, the A[I] == 0? decision (T toward B[I] = ..., F toward X), and the join block containing C[I] = ....]

Summary of Compiler Techniques

All of these techniques try to avoid dependence stalls:
- Loop unrolling reduces loop overhead.
- Software pipelining reduces dependence stalls within a single loop body.
- Trace scheduling reduces the impact of other branches.

Compilers use a mix of all three, and all three depend on branch-prediction accuracy.

Hardware-Based ILP Techniques

The limitation of static techniques is the ability to predict branch behavior at compile time. Hardware-based schemes:
- Conditional (predicated) instructions: extend the ISA.
- Speculation:
  - Static: hardware support for compiler speculation.
  - Dynamic: use branch prediction to guide the speculation process.

Predicated Instructions

Condition evaluation is part of the instruction's execution: if the condition is true, the instruction executes normally; if false, it is replaced by a no-op.

Conditional-move example, for if (A == 0) S = T; with A in R1, S in R2, T in R3.

Branch version:

          BNEZ R1, L
          MOV  R2, R3
    L:    ...

Conditional-move version:

          CMOVZ R2, R3, R1    ; R2 = R3 if R1 == 0
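Written branchlessly, the same update becomes a data dependence instead of a control dependence. A minimal C sketch (the function name select_if_zero is illustrative):

    /* Branch form:
           if (a == 0) s = t;
       compiles to a test, a conditional branch, and a move.
       The branchless form below is a pure data dependence that a
       compiler can lower to a single conditional move (CMOVZ),
       with no branch to mispredict. */
    int select_if_zero(int a, int s, int t) {
        return (a == 0) ? t : s;
    }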

Limitations of Predicated Instructions

- Annulled conditional instructions still take execution time; they are most useful when the condition can be evaluated early.
- Data dependences may not allow the conditional instruction to be separated from the branch.
- The control flow may involve more than a simple alternative sequence, requiring an instruction to move across multiple branches.
- Conditional instructions may be more expensive than their unconditional counterparts.

Compiler Speculation

Conditional instructions support only limited speculative computation. If the compiler can predict branches accurately, it can use speculation to:
- improve scheduling (eliminate stalls)
- increase the issue rate (IPC)

The challenge is maintaining exception behavior, distinguishing resumable from terminating exceptions. Methods for aggressive speculation:
- hardware-software cooperation
- poison bits
- renaming

Hardware-Software Cooperation

The hardware and operating system cooperate to handle exceptions from speculative instructions: resumable exceptions are processed as usual, while terminating exceptions return an undefined value. Correct programs never fail (but what about incorrect ones?).

Original code, computing if (A == 0) A = B; else A = A + 4; with A at 0(R3) and B at 0(R2):

          LW   R1, 0(R3)     ; load A
          BNEZ R1, L1
          LW   R1, 0(R2)     ; load B
          J    L2
    L1:   ADDI R1, R1, #4
    L2:   SW   0(R3), R1

With the load of B speculatively hoisted above the branch (into R14, so R1 is preserved):

          LW   R1, 0(R3)     ; load A
          LW   R14, 0(R2)    ; speculative load of B
          BEQZ R1, L3
          ADDI R14, R1, #4
    L3:   SW   0(R3), R14

Stores cannot be speculated; only register destinations can be renamed.

Poison Bits

Poison bits require less change to the exception behavior: incorrect programs still cause exceptions when speculation is used.

There is one poison bit per register and one per instruction:
- A destination register's poison bit is set when a speculative instruction produces a terminating exception.
- An instruction's bit marks it as speculative.
- The exception is raised only when a non-speculative instruction reads a poisoned register.

There are no memory poison bits, so stores are not speculative.
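A minimal C sketch of the poison-bit idea (all names are hypothetical; a real implementation is a hardware bit per register): a speculative load that would trap poisons its destination instead, and the exception fires only if a non-speculative reader consumes the value:

    #include <stdlib.h>

    typedef struct {
        long value;
        int  poison;   /* set instead of trapping, for speculative ops */
    } reg_t;

    /* Speculative load: defer a terminating exception by poisoning
       the destination register instead of faulting. */
    reg_t speculative_load(const long *addr) {
        reg_t r = { 0, 0 };
        if (addr == NULL) {      /* would have trapped */
            r.poison = 1;
            return r;
        }
        r.value = *addr;
        return r;
    }

    /* Non-speculative use: the deferred exception is raised here,
       and only if the speculated value is actually consumed. */
    long nonspeculative_use(reg_t r) {
        if (r.poison)
            abort();   /* raise the deferred terminating exception */
        return r.value;
    }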

Renaming

Renaming and buffering are done in hardware (as in Tomasulo's algorithm). Boosted instructions (instructions moved above a controlling branch):
- execute before the controlling branch
- commit or abort once the branch is resolved
- other boosted instructions can use their temporary results

Example: the load of B is boosted above the branch; hardware keeps its result in a temporary until the branch resolves, so the BEQZ still tests the original value of R1 (the load of A):

          LW   R1, 0(R3)     ; load A
          LW   R1, 0(R2)     ; boosted load of B (result buffered, renamed)
          BEQZ R1, L3        ; tests the original R1
          ADDI R1, R1, #4
    L3:   SW   0(R3), R1

Dynamic Speculation

Dataflow execution: dynamic branch prediction, speculation, and dynamic scheduling combined.

Advantages:
- memory references can be disambiguated at run time
- hardware-based branch prediction is better than static prediction
- a precise exception model, even for speculated instructions
- no need for compensation or bookkeeping code
- portability across hardware platforms

Disadvantage: complex hardware.

Implementation of Dynamic Speculation

Extend Tomasulo's algorithm: execute and bypass results out of order, but commit in order.

The reorder buffer:
- provides additional virtual registers (which can serve as operands)
- stores speculative results (before commit)
- integrates the function of the store and load buffers
- entry fields: instruction type, destination, and value
- makes it easy to undo speculated instructions on mispredicted branches or exceptions
- yields precise exceptions
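A minimal C sketch of a reorder-buffer entry and in-order commit, using the fields named above (instruction type, destination, value) plus the usual busy/ready bookkeeping; all names and the buffer size are illustrative:

    #include <stdbool.h>

    typedef enum { ROB_ALU, ROB_LOAD, ROB_STORE, ROB_BRANCH } rob_type_t;

    typedef struct {
        bool       busy;    /* entry is allocated                     */
        bool       ready;   /* result available, safe to commit       */
        rob_type_t type;    /* instruction type                       */
        int        dest;    /* destination register (or store target) */
        long       value;   /* speculative result, applied at commit  */
    } rob_entry_t;

    #define ROB_SIZE 32
    static rob_entry_t rob[ROB_SIZE];
    static int rob_head = 0;   /* oldest entry: commit proceeds from here */

    /* Commit the oldest instruction if its result is ready. Results
       were computed (and bypassed) out of order, but architectural
       state is updated strictly in order, which is what makes it easy
       to undo speculation and keep exceptions precise. */
    bool rob_commit(long regfile[]) {
        rob_entry_t *e = &rob[rob_head];
        if (!e->busy || !e->ready)
            return false;                   /* commit stalls, in order */
        if (e->type == ROB_ALU || e->type == ROB_LOAD)
            regfile[e->dest] = e->value;    /* update architectural state */
        e->busy = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
        return true;
    }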