
Carnegie Mellon
Lecture 8: Software Pipelining
I. Introduction
II. Problem Formulation
III. Algorithm
Reading: Chapter 10.5 – 10.6
M. Lam, CS243: Software Pipelining

Carnegie Mellon
I. Example of DoAll Loops
Machine (per clock): 1 read, 1 write, and 1 fully pipelined but 2-cycle ALU, with a hardware loop op and an auto-incrementing addressing mode.
Source code:
  for i = 1 to n
    A[i] = B[i];
Code for one iteration:
  1. LD R5,0(R1++)
  2. ST 0(R3++),R5
No parallelism within the basic block.

Carnegie Mellon
Unroll
Unroll by 2:
  1. LD[i]
  2. LD[i+1]  ST[i]
  3.          ST[i+1]
  2 iterations in 3 cycles
Unroll by 4 (harder to unroll by 3):
  1. LD[i]
  2. LD[i+1]  ST[i]
  3. LD[i+2]  ST[i+1]
  4. LD[i+3]  ST[i+2]
  5.          ST[i+3]
  4 iterations in 5 cycles
In general: U iterations in 2+U cycles.

Carnegie Mellon
Better Way
  LD[1]
  loop N-1 times:
    LD[i+1]  ST[i]
  ST[N]
N iterations in 2 + (N-1) cycles:
– the performance of unrolling N times
– the code size of unrolling twice

Carnegie Mellon
Software Pipelining
TIME
  1. LD[0]
  2. LD[1]  ST[0]
  3. LD[2]  ST[1]
  4. LD[3]  ST[2]
Every initiation interval (in this case 1 cycle):
– iteration i enters the pipeline (i.e., its first instruction starts), and
– iteration i-1 (in general, some iteration i-x) leaves the pipeline (i.e., its last instruction finishes).

Carnegie Mellon
Unrolling
Let’s say you have a VLIW machine with two multiply units:
  for i = 1 to n
    A[i] = B[i] * C[i];

Carnegie Mellon
More Complicated Example
Source code:
  for i = 1 to n
    D[i] = A[i] * B[i] + c;
Code for one iteration:
  1. LD R5,0(R1++)
  2. LD R6,0(R2++)
  3. MUL R7,R5,R6
  4. ADD R8,R7,R4
  5. ST 0(R3++),R8

Carnegie Mellon
Software Pipelined Code
   1. LD
   2. LD
   3. MUL  LD
   4.      LD
   5. MUL  LD
   6. ADD  LD
   7. MUL  LD
   8. ST  ADD  LD
   9. MUL  LD
  10. ST  ADD  LD
  11. MUL
  12. ST  ADD
      ST  ADD
           ST
Unlike unrolling, software pipelining can give an optimal result.
Locally compacted code may not be globally optimal.
DOALL: can fill arbitrarily long pipelines with infinitely many iterations (assuming infinite registers).
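The overlapped schedule above can be produced mechanically from the relative schedule of a single iteration. Below is a minimal Python sketch of that idea; the function name is mine, and the cycle offsets are simply read off the table above (II = 2), not part of the lecture's algorithm.

from collections import defaultdict

def pipeline(relative_schedule, ii, n_iters):
    """Overlap n_iters copies of one iteration's relative schedule,
    starting a new iteration every ii cycles."""
    table = defaultdict(list)  # cycle -> list of (op, iteration)
    for it in range(n_iters):
        start = it * ii
        for offset, op in relative_schedule:
            table[start + offset].append((op, it))
    return [table[c] for c in range(max(table) + 1)]

# Relative schedule of one iteration of D[i] = A[i]*B[i] + c
# (cycle offsets taken from the schedule shown above, II = 2).
one_iter = [(0, "LD"), (1, "LD"), (2, "MUL"), (5, "ADD"), (7, "ST")]

for cycle, ops in enumerate(pipeline(one_iter, ii=2, n_iters=5), start=1):
    print(cycle, [op for op, _ in ops])

Running this reproduces the fill, steady-state, and drain phases shown on the slide: cycles 1–7 fill the pipeline, cycles 8–10 are the steady state (ST, ADD, LD on even cycles; MUL, LD on odd cycles), and the remaining cycles drain it.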

Carnegie Mellon
Example of DoAcross Loop
Loop:
  Sum = Sum + A[i];
  B[i] = A[i] * c;
Code for one iteration:
  1. LD
  2. MUL
  3. ADD
  4. ST
Software Pipelined Code (II = 2):
  1. LD
  2. MUL
  3. ADD  LD
  4. ST   MUL
  5.      ADD
  6.      ST
Doacross loops:
– recurrences can be parallelized
– harder to fully utilize hardware with large degrees of parallelism

Carnegie Mellon
II. Problem Formulation
Goals:
– maximize throughput
– small code size
Find:
– an identical relative schedule S(n) for every iteration, and
– a constant initiation interval (T)
such that the initiation interval is minimized.
Complexity: NP-complete in general.
Example (T = 2):
  S  0: LD
     1: MUL
     2: ADD  LD
     3: ST   MUL
              ADD
              ST

Carnegie Mellon
Resources Bound Initiation Interval
Example: resource usage of 1 iteration is LD, LD, MUL, ADD, ST; the machine can execute 1 LD, 1 ST, and 2 ALU ops per clock. Lower bound on the initiation interval?
For each resource i:
– number of units required by one iteration: n_i
– number of units in the system: R_i
Lower bound due to resource constraints: max_i ⌈n_i / R_i⌉
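A small Python sketch of this bound, using the resource classes from the example above; the helper name and the grouping of MUL and ADD under a single ALU class are my assumptions.

import math

def resource_mii(op_resources, machine_units):
    """Lower bound on the initiation interval from resource usage alone:
    max over resources of ceil(uses per iteration / units available)."""
    uses = {}
    for res in op_resources:          # one entry per op in one iteration
        uses[res] = uses.get(res, 0) + 1
    return max(math.ceil(uses[r] / machine_units[r]) for r in uses)

# One iteration uses LD, LD, MUL, ADD, ST; machine: 1 LD, 1 ST, 2 ALU per clock.
ops = ["LD", "LD", "ALU", "ALU", "ST"]        # MUL and ADD both use the ALU
units = {"LD": 1, "ST": 1, "ALU": 2}
print(resource_mii(ops, units))               # -> 2 (the load unit is the bottleneck)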

Carnegie Mellon
Scheduling Constraints: Resource
RT: resource reservation table for a single iteration
RT_s: modulo resource reservation table
  RT_s[i] = Σ_{t | t mod T = i} RT[t]
(Figure: per-cycle reservation tables for iterations 1–4 overlapped with T = 2, and the resulting steady-state modulo reservation table; resources LD, ALU, ST.)
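A hedged Python sketch of building and checking the modulo reservation table. The per-cycle reservation rows below are assumed to come from the D[i] = A[i]*B[i] + c schedule with T = 2, and the helper names are mine.

def modulo_reservation_table(rt, T):
    """Fold a single iteration's reservation table rt (one row per cycle,
    mapping resource -> units used) into T rows: row i sums all cycles t
    with t mod T == i."""
    folded = [dict() for _ in range(T)]
    for t, row in enumerate(rt):
        slot = folded[t % T]
        for res, n in row.items():
            slot[res] = slot.get(res, 0) + n
    return folded

def fits(folded, machine_units):
    """Resource-feasible iff no folded row over-subscribes a resource."""
    return all(n <= machine_units[res]
               for row in folded for res, n in row.items())

# One iteration of D[i] = A[i]*B[i] + c at offsets 0,1,2,5,7 (T = 2).
rt = [{"LD": 1}, {"LD": 1}, {"ALU": 1}, {}, {}, {"ALU": 1}, {}, {"ST": 1}]
folded = modulo_reservation_table(rt, T=2)
print(folded)   # row 0: {LD:1, ALU:1}; row 1: {LD:1, ALU:1, ST:1}
print(fits(folded, {"LD": 1, "ST": 1, "ALU": 1}))   # -> True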

Carnegie Mellon
Scheduling Constraints: Precedence
  for (i = 0; i < n; i++) { *(p++) = *(q++) + c; }
Minimum initiation interval?
S(n): schedule of n with respect to the beginning of the schedule.
Label each edge n1 → n2 with δ = iteration difference and d = delay; the constraint is
  δ × T + S(n2) − S(n1) ≥ d

Carnegie Mellon
Scheduling Constraints: Precedence
  for (i = 2; i < n; i++) { A[i] = A[i-2] + 1; }
Minimum initiation interval?
S(n): schedule of n with respect to the beginning of the schedule.
Label each edge n1 → n2 with δ = iteration difference and d = delay; the constraint is
  δ × T + S(n2) − S(n1) ≥ d

Carnegie Mellon
Minimum Initiation Interval
Over all cycles c in the dependence graph:
  T ≥ max_c ⌈ CycleLength(c) / IterationDifference(c) ⌉
where CycleLength(c) is the sum of the delays d along c and IterationDifference(c) is the sum of the iteration differences δ along c.
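As an illustration, this recurrence bound can be computed by brute-force cycle enumeration on a small dependence graph. The sketch below uses the A[i] = A[i-2] + 1 loop from the previous slide; the 1-cycle edge delays are assumed for illustration, not taken from the lecture.

import math
from itertools import permutations

def recurrence_mii(nodes, edges):
    """Recurrence-constrained lower bound on T:
    max over dependence cycles of ceil(total delay / total iteration difference).
    edges: list of (src, dst, delay d, iteration difference delta).
    Brute-force cycle enumeration; fine for tiny dependence graphs."""
    best = 1
    for k in range(1, len(nodes) + 1):
        for order in permutations(nodes, k):
            d_sum = delta_sum = 0
            ok = True
            for i, src in enumerate(order):
                dst = order[(i + 1) % k]
                e = next(((d, dl) for s, t, d, dl in edges
                          if s == src and t == dst), None)
                if e is None:
                    ok = False
                    break
                d_sum += e[0]
                delta_sum += e[1]
            if ok and delta_sum > 0:
                best = max(best, math.ceil(d_sum / delta_sum))
    return best

# A[i] = A[i-2] + 1: load -> add -> store, plus a loop-carried edge
# store -> load with delta = 2.  Delays are assumed to be 1 cycle each.
nodes = ["ld", "add", "st"]
edges = [("ld", "add", 1, 0), ("add", "st", 1, 0), ("st", "ld", 1, 2)]
print(recurrence_mii(nodes, edges))   # -> 2  (cycle length 3, iteration difference 2)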

Carnegie Mellon
III. Example: An Acyclic Graph

Carnegie Mellon
Algorithm for Acyclic Graphs
Find the lower bound on the initiation interval, T0, from resource constraints.
For T = T0, T0+1, ..., until all nodes are scheduled:
  For each node n in topological order:
    s0 = earliest cycle at which n can be scheduled
    For each s = s0, s0+1, ..., s0+T-1:
      if NodeScheduled(n, s) break;
    if n cannot be scheduled, break;
NodeScheduled(n, s):
– check the resources of n at s in the modulo resource reservation table
Can always meet the lower bound if:
– every operation uses only 1 resource, and
– there are no cyclic dependences in the loop.
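A compact Python sketch of this acyclic case. Function and parameter names are mine; the cap on how far T is searched and the toy example at the end are assumptions for illustration only.

import math

def modulo_schedule_acyclic(nodes, edges, uses, units):
    """Modulo-schedule an acyclic dependence graph, a sketch of the algorithm above.
    nodes: in topological order.  edges: (src, dst, delay).
    uses[n]: resource used by n.  units[r]: copies of resource r per cycle."""
    # Resource-constrained lower bound on T.
    need = {}
    for n in nodes:
        need[uses[n]] = need.get(uses[n], 0) + 1
    T0 = max(math.ceil(need[r] / units[r]) for r in need)

    for T in range(T0, T0 + 64):                 # give up eventually
        mrt = [dict() for _ in range(T)]         # modulo reservation table
        sched = {}
        ok = True
        for n in nodes:
            # Earliest start allowed by already-scheduled predecessors.
            s0 = max((sched[s] + d for s, t, d in edges if t == n and s in sched),
                     default=0)
            for s in range(s0, s0 + T):          # only T slots are distinct mod T
                slot, r = mrt[s % T], uses[n]
                if slot.get(r, 0) < units[r]:
                    slot[r] = slot.get(r, 0) + 1
                    sched[n] = s
                    break
            else:
                ok = False
                break
        if ok:
            return T, sched
    return None

# Tiny example: two loads feed a multiply, which feeds a store (1-cycle delays assumed).
nodes = ["ld1", "ld2", "mul", "st"]
edges = [("ld1", "mul", 1), ("ld2", "mul", 1), ("mul", "st", 1)]
T, sched = modulo_schedule_acyclic(
    nodes, edges,
    uses={"ld1": "LD", "ld2": "LD", "mul": "ALU", "st": "ST"},
    units={"LD": 1, "ST": 1, "ALU": 1})
print(T, sched)   # -> 2 {'ld1': 0, 'ld2': 1, 'mul': 2, 'st': 3}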

Carnegie Mellon
Cyclic Graphs
No such thing as “topological order”.
Example: edges b → c and c → b, with constraints
  S(c) − S(b) ≥ 1
  T + S(b) − S(c) ≥ 2
Scheduling b constrains c and vice versa:
  S(b) + 1 ≤ S(c) ≤ S(b) − 2 + T
  S(c) − T + 2 ≤ S(b) ≤ S(c) − 1

Carnegie Mellon
Strongly Connected Components
A strongly connected component (SCC):
– a set of nodes such that every node can reach every other node
Every node constrains all others from above and below:
– find the longest paths between every pair of nodes
– as each node is scheduled, find lower and upper bounds on all other nodes in the SCC
SCCs are hard to schedule:
– critical cycle: no slack
– backtrack starting with the first node in the SCC; increasing T increases slack
Edges between SCCs are acyclic:
– in an acyclic graph, every node is a separate SCC

Carnegie Mellon
Algorithm Design
Find the lower bound on the initiation interval, T0, from resource constraints and precedence constraints.
For T = T0, T0+1, ..., until all nodes are scheduled:
  E* = longest path between each pair of nodes
  For each SCC c in topological order:
    s0 = earliest cycle at which c can be scheduled
    For each s = s0, s0+1, ..., s0+T-1:
      if SCCScheduled(c, s) break;
    if c cannot be scheduled, return false;
  return true;

Carnegie Mellon
Scheduling a Strongly Connected Component (SCC)
SCCScheduled(c, s):
  Schedule the first node at s; return false if this fails.
  For each remaining node n in c:
    sl = lower bound on n based on E*
    su = upper bound on n based on E*
    For each s = sl, sl+1, ..., min(sl+T-1, su):
      if NodeScheduled(n, s) break;
    if n cannot be scheduled, return false;
  return true;
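A hedged Python sketch of SCCScheduled together with NodeScheduled. It assumes the longest-path matrix E* has already been computed for the candidate T (e.g., by Floyd–Warshall over edge weights d − δ·T, with −inf meaning no path), and that, as on the previous slide, the caller tries SCCs in topological order and bumps T when this returns false. All names are mine.

NEG_INF = float("-inf")

def scc_scheduled(scc, start, T, estar, sched, mrt, uses, units):
    """Place the first node of the SCC at `start`, then place every other node
    within the window implied by the longest-path matrix E*
    (estar[a][b] = integer longest path from a to b, or -inf if no path)."""
    if not node_scheduled(scc[0], start, mrt, uses, units, sched):
        return False
    for n in scc[1:]:
        # Bounds on S(n) from already-placed nodes in the same SCC.
        lo = max(sched[m] + estar[m][n] for m in scc
                 if m in sched and estar[m][n] > NEG_INF)
        hi = min(sched[m] - estar[n][m] for m in scc
                 if m in sched and estar[n][m] > NEG_INF)
        for s in range(lo, min(lo + T, hi + 1)):
            if node_scheduled(n, s, mrt, uses, units, sched):
                break
        else:
            return False
    return True

def node_scheduled(n, s, mrt, uses, units, sched):
    """Commit n to cycle s if its resource is free in row s mod T of the
    modulo reservation table."""
    row, r = mrt[s % len(mrt)], uses[n]
    if row.get(r, 0) < units[r]:
        row[r] = row.get(r, 0) + 1
        sched[n] = s
        return True
    return False

One design point this sketch glosses over: a real implementation undoes the tentative reservations of a failed SCC before retrying at the next start cycle (or the next T), which is the backtracking the earlier slide refers to.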

Carnegie Mellon
Anti-dependences on Registers
The traditional algorithm ignores them because it can post-unroll:
  a1 = ld[i];
  a2 = a1 + a1;
  Store[i] = a2;
  a1 = ld[i+1];
  a2 = a1 + a1;
  Store[i+1] = a2;

Carnegie Mellon
Anti-dependences on Registers
The traditional algorithm ignores them because it can post-unroll (or rely on hardware support):
  a1 = ld[i];
  a2 = a1 + a1;
  Store[i] = a2;
  a3 = ld[i+1];
  a4 = a3 + a3;
  Store[i+1] = a4;
Modulo variable expansion: unroll factor u = max_r ⌈lifetime_r / T⌉
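A one-line sketch of the unroll factor for modulo variable expansion; the register lifetimes below are hypothetical, chosen only to show a value that outlives one kernel copy.

import math

def modulo_variable_expansion(lifetimes, T):
    """Unroll factor for modulo variable expansion: a value that lives for
    `lifetime` cycles needs ceil(lifetime / T) rotating copies, so the kernel
    is unrolled by the maximum of that count over all values."""
    return max(math.ceil(l / T) for l in lifetimes.values())

# Hypothetical lifetimes (in cycles) for values in a T = 2 kernel.
lifetimes = {"R5": 2, "R6": 1, "R7": 3, "R8": 2}
print(modulo_variable_expansion(lifetimes, T=2))   # -> 2 (R7 spans two kernel copies)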

Carnegie Mellon
Anti-dependences on Registers
The code in every unrolled iteration is identical – not ideal.
We unroll in two parts of the algorithm.
Instead, we can run the SWP algorithm for different unrolling factors. For each unroll factor, we pre-rename the registers but do not ignore anti-dependences:
– better potential results
– but we might not find them

Carnegie Mellon
Register Allocation and SWP
SWP schedules use lots of registers:
– different schedules may use different numbers of registers
Use more back-tracking than the algorithm described above:
– if allocation fails, try to schedule again using a different heuristic

Carnegie Mellon SWP versus No SWP

Carnegie Mellon
Conclusions
Numerical code:
– software pipelining is useful for machines with a lot of pipelining and for numeric code with real instruction-level parallelism
– compact code
– limits to parallelism: dependences, critical resources