CMPUT 680 - Compiler Design and Optimization, Winter 2006
Topic E: Software Pipelining
José Nelson Amaral

Reading List
- Tiger book: chapter 20
- Other papers such as: GovindAltmanGao97, RutenbergAtAl97

Software Pipeline
Software pipelining is a technique that reduces the execution time of important loops by interleaving operations from many iterations to optimize the use of resources.
[Figure: successive iterations of the loop body (ldf, fadds, stf, sub, cmp, bg) overlapped along the time axis]

Software Pipeline
What limits the speed of a loop?
- Data dependencies: recurrence initiation interval (rec_mii)
- Processor resources: resource initiation interval (res_mii)
- Memory accesses: memory initiation interval (mem_mii)
[Figure: overlapped iterations of the loop body (ldf, fadds, stf, sub, cmp, bg) along the time axis, with the initiation interval marked]

Problem Formulation (I)
Given a weighted dependence graph, derive a schedule that is “time-optimal” under a machine model M.
Def: A schedule S of a loop L is time-optimal if, among all “legal” schedules of L, no schedule is faster than S.
Note: There may be more than one time-optimal schedule.

Example: The Inner Product

    Q = 0.0
    DO k = 1, N
      Q = Q + Z(k)*X(k)
    ENDDO

The same loop in Dynamic Single Assignment form:

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}
      v_k ← load x_{k-1}
      w_k ← u_k * v_k
      q_k ← q_{k-1} + w_k
      z_k ← z_{k-1} + 4
      x_k ← x_{k-1} + 4
    END DO

(Dehnert, J. and Towle, R. A., “Compiling for the Cydra 5”)
Dynamic Single Assignment (DSA): uses expanded virtual registers (EVRs). An EVR is an infinite, linearly ordered set of virtual registers. A program in DSA form has no anti-dependences and no output dependences.
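
As a rough illustration of the DSA form above, the following Python sketch emulates each expanded virtual register with a growing list, so that every element is written exactly once. The function name `inner_product_dsa` is hypothetical, and list indices stand in for byte addresses (a stride of 1 element rather than 4 bytes); both are assumptions made for this example.

```python
def inner_product_dsa(Z, X):
    """Inner product written in DSA style: each EVR element is assigned once."""
    n = len(Z)
    z = [0]      # z_0: index of Z(1) (assumption: indices replace addresses)
    x = [0]      # x_0: index of X(1)
    q = [0.0]    # q_0
    for k in range(1, n + 1):
        u_k = Z[z[k - 1]]          # u_k <- load z_{k-1}
        v_k = X[x[k - 1]]          # v_k <- load x_{k-1}
        w_k = u_k * v_k            # w_k <- u_k * v_k
        q.append(q[k - 1] + w_k)   # q_k <- q_{k-1} + w_k
        z.append(z[k - 1] + 1)     # z_k <- z_{k-1} + stride
        x.append(x[k - 1] + 1)     # x_k <- x_{k-1} + stride
    return q[n]
```

Because no element of `q`, `z`, or `x` is ever overwritten, the only dependences left are true (flow) dependences, which is the property the slide attributes to DSA.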

Machine Model and Resource Constraints
Which unit does each operation in the loop use?

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}       MEM
      v_k ← load x_{k-1}       MEM
      w_k ← u_k * v_k          FMULT
      q_k ← q_{k-1} + w_k      FADD
      z_k ← z_{k-1} + 4        ADDR
      x_k ← x_{k-1} + 4        ADDR
    END DO

Machine Model

    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2

Without instruction-level parallelism, how long does the loop take to execute?
(6 + 6 + 2 + 2 + 1 + 1)*N = 18*N
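
Resource Minimum Initiation Interval (ResMII)
Each processor resource defines a minimum initiation interval for the execution of the loop. For instance, in the machine model of the previous example, a loop that requires the computation of 6 addresses has ResMII(ADDR) = 6*1/2 = 3.
The Resource Minimum Initiation Interval of a loop is given by:
$$\mathrm{ResMII} = \max_{r} \left\lceil \frac{n_r \cdot c_r}{u_r} \right\rceil$$
where $n_r$ is the number of operations in the loop body that use resource $r$, $c_r$ is the number of cycles each of them occupies a unit, and $u_r$ is the number of units of that resource.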
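
The per-resource bound can be sketched in a few lines of Python. The function name `res_mii` and its dictionary arguments are hypothetical, and the sketch assumes that an operation occupies its unit for one cycle unless `busy_cycles` says otherwise.

```python
from math import ceil

def res_mii(op_count, unit_count, busy_cycles=None):
    """Largest per-resource lower bound on the initiation interval.

    op_count[r]    -- operations per iteration that use resource r
    unit_count[r]  -- identical units of resource r in the machine
    busy_cycles[r] -- cycles one operation occupies a unit (default 1)
    """
    busy_cycles = busy_cycles or {}
    return max(
        ceil(op_count[r] * busy_cycles.get(r, 1) / unit_count[r])
        for r in op_count
    )

# Six address computations on two address units: 6*1/2 = 3.
addr_bound = res_mii({"ADDR": 6}, {"ADDR": 2})

# Inner-product loop: two loads, two address adds, one multiply, one add.
inner_product_bound = res_mii(
    {"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1},
    {"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1},
)
```

Here `addr_bound` is 3 and `inner_product_bound` is 1, matching the values quoted in the slides.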

ResMII

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}       MEM
      v_k ← load x_{k-1}       MEM
      w_k ← u_k * v_k          FMULT
      q_k ← q_{k-1} + w_k      FADD
      z_k ← z_{k-1} + 4        ADDR
      x_k ← x_{k-1} + 4        ADDR
    END DO

Machine Model

    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2

There are enough units to schedule all the instructions of the loop in the same cycle. Therefore ResMII = 1.
Can we execute the loop in N+C cycles (C = a small constant)?

Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      (a) u_k ← load z_{k-1}
      (b) v_k ← load x_{k-1}
      (c) w_k ← u_k * v_k
      (d) q_k ← q_{k-1} + w_k
      (e) z_k ← z_{k-1} + 4
      (f) x_k ← x_{k-1} + 4
    END DO

[Figure: the dependence graph over operations a-f drawn for iterations k=1, k=2, and k=3, followed by the folded graph with an edge labeled (1)]

Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                      Unit (Lat.)
      (a) u_k ← load z_{k-1}         MEM (6)
      (b) v_k ← load x_{k-1}         MEM (6)
      (c) w_k ← u_k * v_k            FMULT (2)
      (d) q_k ← q_{k-1} + w_k        FADD (2)
      (e) z_k ← z_{k-1} + 4          ADDR (1)
      (f) x_k ← x_{k-1} + 4          ADDR (1)
    END DO

[Figure: dependence graph over operations a-f with (dist,lat) edge labels (1,2) and (1,1)]
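
Recurrence Minimum Initiation Interval (RecMII)
[Figure: dependence graph over operations a-f with (dist,lat) edge labels (1,2) and (1,1)]
The recurrence minimum initiation interval (rec_mii) is given by:
$$\mathrm{rec\_mii} = \max_{c} \left\lceil \frac{\sum_{e \in c} \mathrm{lat}(e)}{\sum_{e \in c} \mathrm{dist}(e)} \right\rceil$$
where the maximum is taken over all cycles $c$ of the dependence graph.
Quiz: What is the rec_mii for the example?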
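
A minimal sketch of this computation, assuming the dependence graph is small enough to enumerate its simple cycles directly. The function name `rec_mii` and the edge list below are assumptions: the edges are read off the inner-product loop (a self-edge on d for the accumulation, self-edges on e and f for the address updates), not copied from the figure.

```python
from math import ceil

def rec_mii(edges):
    """Max over simple cycles of ceil(sum of latencies / sum of distances).

    edges -- list of (src, dst, dist, lat) tuples; returns 0 if acyclic.
    Assumes every cycle has a positive total distance (a legal loop).
    """
    succ = {}
    for src, dst, dist, lat in edges:
        succ.setdefault(src, []).append((dst, dist, lat))
    best = 0

    def walk(start, node, dist_sum, lat_sum, seen):
        nonlocal best
        for nxt, dist, lat in succ.get(node, []):
            if nxt == start:
                # Closed a cycle back to its smallest node.
                best = max(best, ceil((lat_sum + lat) / (dist_sum + dist)))
            elif nxt not in seen and nxt > start:
                # Only visit nodes larger than start so each cycle is found once.
                walk(start, nxt, dist_sum + dist, lat_sum + lat, seen | {nxt})

    for start in succ:
        walk(start, start, 0, 0, {start})
    return best

inner_product_edges = [
    ("a", "c", 0, 6), ("b", "c", 0, 6), ("c", "d", 0, 2),  # within an iteration
    ("d", "d", 1, 2),                                      # q_k needs q_{k-1}
    ("e", "e", 1, 1), ("f", "f", 1, 1),                    # address updates
    ("e", "a", 1, 1), ("f", "b", 1, 1),                    # loads use old address
]
quiz_answer = rec_mii(inner_product_edges)
```

With these assumed edges the accumulation on `d` is the critical recurrence, so `quiz_answer` is 2.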

Minimum Initiation Interval
The Minimum Initiation Interval (MII) for a loop is constrained by both resources and recurrences. Therefore, it is given by:
$$\mathrm{MII} = \max(\mathrm{ResMII}, \mathrm{RecMII})$$
In our example we have MII = max(1,2) = 2. Therefore the best that we can do without transforming the loop is to execute it in 2*N+C cycles.
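
To put the bound in perspective, the short sketch below compares the sequential cost of 18*N cycles with the pipelined cost of MII*N+C cycles. The helper names are hypothetical, and the values N = 100 and C = 9 are chosen only for illustration.

```python
def mii(res_mii, rec_mii):
    """Lower bound on the initiation interval."""
    return max(res_mii, rec_mii)

def sequential_cycles(n, latencies):
    """Cost when each operation waits for the previous one to finish."""
    return sum(latencies) * n

def pipelined_cycles(n, ii, overhead):
    """Cost of a software-pipelined loop: one iteration every ii cycles
    plus a constant prologue/epilogue overhead (assumed value)."""
    return ii * n + overhead

n = 100                                                     # illustrative trip count
seq = sequential_cycles(n, [6, 6, 2, 2, 1, 1])              # 18*N
pipe = pipelined_cycles(n, mii(res_mii=1, rec_mii=2), overhead=9)
```

For these illustrative values `seq` is 1800 cycles and `pipe` is 209 cycles.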

Modulo Schedule
In modulo scheduling, we:
(1) start with the first instruction;
(2) schedule as many instructions as we can in every cycle, limited only by the resources available and by the dependences.
When a pattern emerges, we adopt the pattern as our modulo schedule. Instructions before this pattern form the loop prologue. Instructions after this pattern form the loop epilogue.

Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                      Lat.
      (a) u_k ← load z_{k-1}         (6)
      (b) v_k ← load x_{k-1}         (6)
      (c) w_k ← u_k * v_k            (2)
      (d) q_k ← q_{k-1} + w_k        (2)
      (e) z_k ← z_{k-1} + 4          (1)
      (f) x_k ← x_{k-1} + 4          (1)
    END DO

Why an eager scheduler fails in our example
[Figure: schedule produced by an eager scheduler, plotted as iterations against cycles 0-23, showing operations b1-b18, c1-c17, and d1-d8]

Why an eager scheduler fails in our example
[Figure: schedule plotted as iterations against cycles 0-23, showing operations b1-b12, c1-c9, and d1-d8]
Therefore we can do it in 2*N+9 cycles.

Collision vectors
Given the reservation tables for two operations A and B, the set of forbidden intervals, i.e., the issue distances at which A and B cannot both be issued without a resource conflict, is called the collision vector for the reservation tables.
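
The definition can be made concrete with a small sketch. It assumes a reservation table is represented as a dictionary from resource name to the set of cycle offsets (relative to issue) in which the operation occupies that resource; the function name `forbidden_intervals` and the two example tables are hypothetical.

```python
def forbidden_intervals(table_a, table_b):
    """Issue distances d >= 0 at which issuing B d cycles after A
    makes both operations use the same resource in the same cycle."""
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for ta in cycles_a:
            for tb in table_b.get(resource, ()):
                # A uses the resource at ta, B at d + tb: conflict when d = ta - tb.
                if ta - tb >= 0:
                    forbidden.add(ta - tb)
    return forbidden

# Assumed tables: a non-pipelined add holding the adder for two cycles,
# and a non-pipelined multiply holding the multiplier for three cycles.
add = {"adder": {0, 1}}
mult = {"mult": {0, 1, 2}}

add_after_add = forbidden_intervals(add, add)
mult_after_add = forbidden_intervals(add, mult)
```

Under these assumptions two adds cannot be issued 0 or 1 cycles apart, while an add and a multiply never collide because they share no resource.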

A Simplistic Modulo Scheduling Algorithm
1. Compute MII as discussed.
2. Use a modified list scheduling algorithm to generate a modulo schedule.
The scheduling algorithm must obey the following restriction: if an operation P is scheduled at time t, the resources it uses are unavailable at every time t ± k*II, for any k ≥ 0.
The Modulo Reservation Table has II rows, representing the cycles of the initiation interval, and as many columns as the resources that it needs to keep track of.
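
A minimal sketch of such a table, assuming one unit per resource and a usage list of `(resource, offset)` pairs for each operation. The class name `ModuloReservationTable` and its methods are hypothetical and not taken from any particular compiler.

```python
class ModuloReservationTable:
    """II rows by one column per resource; entry (t mod II, r) records
    which operation holds resource r in that slot of the kernel."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.rows = [{r: None for r in resources} for _ in range(ii)]

    def _slots(self, t, usage):
        # Map each (resource, offset) to its row in the table.
        return [((t + off) % self.ii, res) for res, off in usage]

    def conflicts(self, t, usage):
        slots = self._slots(t, usage)
        if len(set(slots)) < len(slots):
            return True  # the operation would collide with itself
        return any(self.rows[row][res] is not None for row, res in slots)

    def place(self, op, t, usage):
        """Reserve the slots for op issued at time t; False if any is taken."""
        if self.conflicts(t, usage):
            return False
        for row, res in self._slots(t, usage):
            self.rows[row][res] = op
        return True

# Assumed example: II = 6 and a non-pipelined two-cycle add.
mrt = ModuloReservationTable(6, ["adder", "mult", "bus"])
two_cycle_add = [("adder", 0), ("adder", 1)]
placed = [
    mrt.place("A1", 0, two_cycle_add),   # rows 0 and 1
    mrt.place("A3", 2, two_cycle_add),   # rows 2 and 3
    mrt.place("A4", 10, two_cycle_add),  # 10 mod 6 = 4: rows 4 and 5
    mrt.place("A5", 7, two_cycle_add),   # rows 1 and 2 are already taken
]
```

The modulo on the row index is what enforces the restriction: a reservation made at time t blocks the same slot at every t ± k*II.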

Heuristic Method for Modulo Scheduling
Why might a simple variant of list scheduling not work?
Problem: Generate a modulo schedule of a loop by scheduling instructions until a pattern emerges.

Counter Example I: List Scheduling May Fail
[Figure: dependence graph over A, B, C, D with (dist,lat) edge labels (0,4), (0,2), (1,2)]
There is only one cycle in the dependence graph, so RecMII is the sum of the latencies around that cycle divided by the sum of its distances, which gives RecMII = 4.
Therefore, in a machine with infinite resources, we must be able to schedule the loop with an initiation interval of 4 cycles.

Counter Example I: List Scheduling May Fail
[Figure: dependence graph over A, B, C, D with (dist,lat) edge labels (0,4), (0,2), (1,2), next to a partial schedule in which A, C, and D are placed and B has no slot]
List scheduling: a greedy algorithm that schedules each operation at its earliest possible time.
B must be scheduled after the A of the current iteration and before the C of the next iteration. We are deadlocked!

Counter Example I: List Scheduling May Fail
[Figure: dependence graph over A, B, C, D with (dist,lat) edge labels (0,4), (0,2), (1,2), next to a schedule D(0) A(0) C(0), A(1) B(0) C(1), ..., D(N) B(N) divided into prologue, kernel, and epilogue]
The solution is to create a kernel with operations from different iterations, and use a prologue and an epilogue.

Counter Example II: List Scheduling May Fail
[Figure: dependence graph over A1, C2, A3, A4, M5, M6 with (dist,lat) edge labels (0,2), (0,1), (0,2), (0,3)]
A1, A3, and A4 are non-pipelined adds that take two cycles at the adder.
M5 and M6 are non-pipelined multiply operations that take three cycles each on the multiplier.
C2 is a copy operation that uses the bus for one cycle.
What is the ResMII for these operations in a machine that has one adder, one multiplier, and one bus?
ResMII(Adder) = 6 (three adds at two cycles each)
ResMII(Multiplier) = 6 (two multiplies at three cycles each)
ResMII(Bus) = 1
ResMII = 6

Counter Example II: List Scheduling May Fail
[Figure: dependence graph over A1, C2, A3, A4, M5, M6, next to a partial reservation table with columns Adder, Mult, Bus holding A1, A3, M6, and C2, and no free slot for A4]
We cannot schedule A4 and achieve an II equal to MII = ResMII = 6!

Counter Example II: List Scheduling May Fail
[Figure: dependence graph over A1, C2, A3, A4, M5, M6, next to a complete reservation table with columns Adder, Mult, Bus holding A1, A3, A4, M5, M6, and C2]
Although it seems counter-intuitive, we obtain a modulo schedule with II = MII = 6 if we initially schedule both M6 and A3 one cycle later than the earliest possible time for these operations.

Complex Reservation Tables
Consider three independent operations, A1, M2, and MA3, with the reservation tables shown below.
[Figure: reservation tables for A1, M2, and MA3 over the resources Add, Mult, and Bus, labeled (0,2), (0,3), (0,4)]
What is the MII for a loop formed by these three operations?
ResMII(Add) = 2
ResMII(Mult) = 2
ResMII(Bus) = 2
ResMII = 2

Is the MII = 2 Feasible?
[Figure: reservation tables for A1, M2, and MA3 over Add, Mult, and Bus, next to a reservation table with columns Adder, Mult, Bus in which A1 and M2 are placed]
Deadlocked. Cannot allocate MA3.
Even though MII = max(ResMII, RecMII) = 2, an II of 2 is not feasible!

Does Increasing the II to 3 Help?
[Figure: reservation tables for A1, M2, and MA3 over Add, Mult, and Bus, next to a reservation table with columns Adder, Mult, Bus in which A1, M2, and MA3 are all placed]
We find a modulo schedule with II = 3!

Interaction Between Recurrence Constraints and Resource Constraints
[Figure: dependence graph over A1, A2, A3, A4 with (dist,lat) edge labels (0,2), (2,2), (0,2), and the reservation table of an add over Add, Mult, Bus]
What is the RecMII for this loop?
RecMII = 8/2 = 4 (total latency around the cycle divided by total distance)
What is the ResMII for the loop?
ResMII(Add) = 4
ResMII(Mult) = 0
ResMII(Bus) = 4
ResMII = 4
Therefore MII = max(ResMII, RecMII) = 4

Is the MII = 4 Feasible?
[Figure: dependence graph over A1, A2, A3, A4 with (dist,lat) edge labels (0,2), (2,2), (0,2), next to a reservation table with columns Adder, Mult, Bus in which A1 is placed]
In order to finish A4 in time to produce the result for two iterations later, A3 must be scheduled at time 4. But 4 modulo 4 = 0, which conflicts with A1. Therefore there is no feasible schedule with II = 4.

Scheduling Strategy
An exhaustive search will eventually reveal that the calculated MII is not feasible, but it might take too long.
In practice, we compute the MII and spend a pre-allocated budget of time trying to find a schedule with II = MII. If we don’t find one, we increase the II.
In some commercial compilers, the search for the smallest feasible II first doubles the II at each step until a feasible one is found, and then conducts a linear search between the last infeasible II and the feasible one.
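
The doubling-then-linear search described above could look roughly like the sketch below. The function `find_ii`, the `max_ii` cutoff, and the `try_schedule(ii)` callback (which stands for "run the budgeted scheduler at this II and report success") are all hypothetical names introduced for illustration.

```python
def find_ii(mii, try_schedule, max_ii=1024):
    """Search for a small feasible II, starting at the lower bound mii.

    try_schedule(ii) -- assumed callback: True if a modulo schedule was
                        found for this II within the time budget.
    Returns the II found, or None if nothing up to max_ii works.
    """
    ii = mii
    last_infeasible = mii - 1
    # Phase 1: double the II until some schedule is found.
    while not try_schedule(ii):
        last_infeasible = ii
        ii *= 2
        if ii > max_ii:
            return None
    # Phase 2: linear search between the last failure and the first success.
    for candidate in range(last_infeasible + 1, ii):
        if try_schedule(candidate):
            return candidate
    return ii

# Stand-in scheduler for a loop whose MII of 2 is infeasible but 3 works.
found = find_ii(2, lambda ii: ii >= 3)
```

With this stand-in predicate the search tries 2 (fails), 4 (succeeds), then 3 (succeeds), so `found` is 3. A real `try_schedule` would give up when its budget runs out, so a returned II is feasible but not guaranteed to be the smallest one.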

Previous Approaches
- Approach I (Operational):
  - “Emulate” the loop execution under the machine model until a “pattern” eventually occurs [AikenNic88, EbciogluNic89, GaoEtAl91]
- Approach II (Periodic scheduling):
  - Formulate the scheduling problem as a periodic scheduling problem and find an optimal solution [Lam88, RauEtAl81, GovindAltmanGao94]