CR18: Advanced Compilers L05: Scheduling for Locality Tomofumi Yuki 1.


1 CR18: Advanced Compilers L05: Scheduling for Locality Tomofumi Yuki

2 Recap
Last time: we saw scheduling techniques; max. parallelism != best performance.
This time: how can we do better?

3 Pluto Strategy
We want: only 1D parallelism; coarse-grained (outer) parallelism; good data locality.
We want tiling: wave-front parallelism is guaranteed; each tile can be executed atomically; good for sequential performance.

4 Intuition of Pluto Algorithm
Skew and tile.
[Figure: two iteration-space plots with axes i and j]

5 Tiling Hyper-Planes
Another name for a 1D schedule θ.
A set of θs defines a tiling.
The set defines the transform (i,j -> i+j,i), which corresponds to the skew in the previous slide.
[Figure: axes i and j, with hyper-planes θ1 = i+j and θ2 = i]
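A minimal sketch of how the two hyper-planes act together as a change of coordinates. The function names and the 3x3 sample domain are illustrative assumptions, not part of the lecture.

```python
# Each hyper-plane is a row of coefficients (c_i, c_j); stacked, they form the transform.
THETA_1 = (1, 1)  # theta1 = i + j
THETA_2 = (1, 0)  # theta2 = i


def apply_hyperplane(theta, point):
    """Evaluate one 1D schedule theta at the iteration point (i, j)."""
    return theta[0] * point[0] + theta[1] * point[1]


def transform(point, hyperplanes=(THETA_1, THETA_2)):
    """Map (i, j) to its new coordinates (theta1, theta2) = (i + j, i)."""
    return tuple(apply_hyperplane(h, point) for h in hyperplanes)


# Assumed sample domain: 0 <= i, j < 3
domain = [(i, j) for i in range(3) for j in range(3)]
skewed = [transform(p) for p in domain]
print(skewed[:4])  # [(0, 0), (1, 0), (2, 0), (1, 1)]
```

Because the two rows are linearly independent, distinct iterations keep distinct coordinates after the transform.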

6 Legality of Tiling
Each tiling hyper-plane must satisfy:
What is the difference from the causality condition? Note this is about an affine transform, not a schedule.
Must be weakly satisfied for each dimension!
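For reference, the tiling legality condition is commonly written as follows in the Pluto literature. The notation is an assumption here: s is an instance of the source statement S_i, t is an instance of the target statement S_j, and P_e is the set of instance pairs related by dependence e.

```latex
\theta_{S_j}(\vec{t}) - \theta_{S_i}(\vec{s}) \ge 0,
\qquad \forall \langle \vec{s}, \vec{t} \rangle \in P_e,\ \forall e \in E
```

The causality condition for a 1D schedule is usually stated with a strict increase (a difference of at least 1), whereas here a difference of 0 is allowed, which is what "weakly satisfied" refers to.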

7 What does the condition mean? 1. Fully Permutable
Recall that the θs define the transform: all statements are mapped to a common d-D space. Let i1, ..., in be the new indices.
Weakly satisfied in all dimensions
=> i1 ≥ i'1, ..., in ≥ i'n for all dependences
=> a reformulation of the fully permutable condition
=> works for scheduling imperfect loop nests
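A small sketch of the check this slide describes: after the transform, every dependence must have a non-negative component in every new dimension. The helper names and the sample dependence (i,j -> i+1,j-1) are assumptions chosen for illustration, and the hyper-planes are taken without constant terms.

```python
def transformed_distance(hyperplanes, source, target):
    """Distance of one dependence in the new coordinates, one entry per hyper-plane."""
    return tuple(
        sum(c * (t - s) for c, s, t in zip(h, source, target))
        for h in hyperplanes
    )


def is_fully_permutable(hyperplanes, dependences):
    """True when every dependence is weakly satisfied in every new dimension."""
    return all(
        component >= 0
        for source, target in dependences
        for component in transformed_distance(hyperplanes, source, target)
    )


# Assumed example: the dependence (i, j) -> (i+1, j-1), sampled at (2, 2).
deps = [((2, 2), (3, 1))]
identity = [(1, 0), (0, 1)]  # the original (i, j) loops
skewed = [(1, 1), (1, 0)]    # the transform (i+j, i)

print(is_fully_permutable(identity, deps))  # False: distance is (1, -1)
print(is_fully_permutable(skewed, deps))    # True: distance is (0, 1)
```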

8 What does the condition mean? 2. All statements are fused
Somewhat implied by full permutability.
What are the possible dependences from S1 to S2? From S2 to S1?
Exception: when S1 does not use the value of S2.
  for i
    for j
      S1
    for j
      S2

9 Selecting Tiling Hyper-planes
Which is better?
[Figure: two plots with axes i and j]

10 Cost Functions in Pluto
Formulated as:
What does this capture?
dep: (i,j -> i+1,j-1)
  δ1 = (i+1+j-1) - (i+j) = 0
  δ2 = (i+1) - (i) = 1
[Figure: axes i and j, with hyper-planes θ1 = i+j and θ2 = i]

11 Cost Functions in Pluto
Formulated as:
What does this capture?
dep: (i,j -> i+1,j-1)
  δ1 = (i+1+j-1) - (i+j) = 0
  δ2 = (i+1-(j-1)) - (i-j) = 2
[Figure: axes i and j, with hyper-planes θ1 = i+j and θ2 = i-j]
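The δ values on these two slides can be reproduced mechanically as θ(target) - θ(source). The sketch below assumes hyper-planes without constant terms; the helper name `delta` and the sample point are illustrative.

```python
def delta(theta, source, target):
    """Dependence distance along one hyper-plane: theta(target) - theta(source)."""
    return sum(c * (t - s) for c, s, t in zip(theta, source, target))


# dep: (i, j) -> (i+1, j-1). It is uniform, so any sample point gives the same delta.
source, target = (4, 7), (5, 6)

candidates = {
    "theta1 = i+j": (1, 1),
    "theta2 = i": (1, 0),
    "theta2 = i-j": (1, -1),
}
deltas = {name: delta(theta, source, target) for name, theta in candidates.items()}
print(deltas)  # {'theta1 = i+j': 0, 'theta2 = i': 1, 'theta2 = i-j': 2}
```

Both pairs are legal for this dependence; they differ only in the distance along the second hyper-plane (1 versus 2).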

12 Reuse Distance
When the θ corresponds to a sequential loop.
Two dependences: (i,j -> i+1,j) and (i,0 -> i,j) : j>0. What are the δs?
δ represents the number of iterations in the loop (corresponding to θ) until reuse via dependence e.
[Figure: axes i and j, with hyper-planes θ1 = i and θ2 = j]
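One way to answer the question on this slide is to enumerate both dependences over a small domain and look at the range of δ along each hyper-plane. The domain bound N and the helper names are assumptions for illustration.

```python
def delta(theta, source, target):
    """Dependence distance along one hyper-plane: theta(target) - theta(source)."""
    return sum(c * (t - s) for c, s, t in zip(theta, source, target))


def delta_range(theta, dependence):
    """(min, max) of delta over all instances of a dependence."""
    values = [delta(theta, s, t) for s, t in dependence]
    return min(values), max(values)


N = 6  # assumed domain: 0 <= i, j < N
theta1, theta2 = (1, 0), (0, 1)  # theta1 = i, theta2 = j

dep_a = [((i, j), (i + 1, j)) for i in range(N - 1) for j in range(N)]  # (i,j -> i+1,j)
dep_b = [((i, 0), (i, j)) for i in range(N) for j in range(1, N)]       # (i,0 -> i,j), j>0

print(delta_range(theta1, dep_a), delta_range(theta2, dep_a))  # (1, 1) (0, 0)
print(delta_range(theta1, dep_b), delta_range(theta2, dep_b))  # (0, 0) (1, 5)
```

The first dependence has a constant distance along both hyper-planes. The second has distance 0 along θ1 but a distance along θ2 that grows with j, up to N-1.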

13 Communication Volume
When the θ corresponds to a parallel loop.
Let si, sj be the tile sizes.
Horizontal dependence: sj values to the horizontal neighbor.
Vertical dependence: si values to N/sj tiles.
Constant is better; 0 is even better!
[Figure: axes i and j]

14 Iterative Search
We need d hyper-planes for a d-D space.
Note that we are not looking for parallelism: parallelism comes with the tile wave-fronts.
Approach:
  1. find one θ for each statement
  2. constrain the space to be linearly independent of the θs already found
  3. repeat
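The loop structure of the approach can be sketched with a brute-force search standing in for the ILP that Pluto actually solves. Everything here is an assumption made for illustration: a single statement in a 2D space, uniform dependence distances, small non-negative coefficients, and a cost that minimizes the largest δ with the coefficient sum as a tie-break.

```python
from itertools import product


def delta(theta, distance):
    """Distance of a uniform dependence along hyper-plane theta."""
    return sum(c * d for c, d in zip(theta, distance))


def independent(theta, found):
    """2D-only check: theta is non-zero and not parallel to any hyper-plane found so far."""
    if all(c == 0 for c in theta):
        return False
    return all(theta[0] * f[1] - theta[1] * f[0] != 0 for f in found)


def find_hyperplanes(distances, coeff_range=range(0, 3)):
    """Find two hyper-planes, one at a time, by exhaustive search (not an ILP)."""
    found = []
    while len(found) < 2:
        legal = [
            theta
            for theta in product(coeff_range, repeat=2)
            if independent(theta, found)
            and all(delta(theta, d) >= 0 for d in distances)  # weakly satisfied
        ]
        # Cost: largest dependence distance first, then small coefficients.
        best = min(
            legal,
            key=lambda th: (max(delta(th, d) for d in distances), sum(th), th),
        )
        found.append(best)
    return found


# The dependence (i, j) -> (i+1, j-1) has distance (1, -1).
print(find_hyperplanes([(1, -1)]))  # [(1, 1), (1, 0)]
```

For this dependence the search returns θ1 = i+j followed by θ2 = i. A real implementation would replace the enumeration with an ILP over the coefficients and would handle several statements and non-uniform dependences.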

15 Tilable Band
Band of loops/schedules: a consecutive sequence of dimensions.
Tilable band: a band that satisfies the legality condition for a common set of dependences.
PLuTo tiles the outermost tilable band.

16 So, which is better?
What are the θs and δs? What is the order?
[Figure: two plots with axes i and j]

17 Solving with ILP
Farkas Lemma again (we had enough of Farkas last time).
There is a problem when the constraint is:

18 The “Practical” Choice
Given the schedule prototype:
Constrain the coefficients to:
What does this mean?
Relaxed recently by a paper on PLUTO+ and

19 Example 1: Jacobi 1D
One example implementation:
  for t = 0 .. T
    for i = 1 .. N-1
      S1: B[i] = foo(A[i], A[i-1], A[i+1]);
    for i = 1 .. N-1
      S2: A[i] = foo(B[i], B[i-1], B[i+1]);
but it is rather contrived due to limitations in polyhedral compilers.
The dependences are simple:
  for t = 0 .. T
    for i = 1 .. N-1
      S1: A[t,i] = foo(A[t-1,i], A[t-1,i-1], A[t-1,i+1]);

20 Example 1: Jacobi 1D
Prototype: θ_S1(t,i) = a1*t + a2*i + a0
Dependences:
  S1[t,i] -> S1[t+1,i]
  S1[t,i] -> S1[t+1,i+1]
  S1[t,i] -> S1[t+1,i-1]
Distances:
  δ1 = θ_S1(t+1,i) - θ_S1(t,i) = a1*(t+1) + a2*i + a0 - (a1*t + a2*i + a0) = a1
  δ2 = θ_S1(t+1,i+1) - θ_S1(t,i) = a1*(t+1) + a2*(i+1) + a0 - (a1*t + a2*i + a0) = a1 + a2
  δ3 = θ_S1(t+1,i-1) - θ_S1(t,i) = a1*(t+1) + a2*(i-1) + a0 - (a1*t + a2*i + a0) = a1 - a2

21 Example 1: Jacobi 1D
Prototype: θ_S1(t,i) = a1*t + a2*i + a0
Dependences:
  S1[t,i] -> S1[t+1,i]
  S1[t,i] -> S1[t+1,i+1]
  S1[t,i] -> S1[t+1,i-1]
Distances:
  δ1 = θ_S1(t+1,i) - θ_S1(t,i) = a1*(t+1) + a2*i + a0 - (a1*t + a2*i + a0) = a1
  δ2 = θ_S1(t+1,i+1) - θ_S1(t,i) = a1*(t+1) + a2*(i+1) + a0 - (a1*t + a2*i + a0) = a1 + a2
  δ3 = θ_S1(t+1,i-1) - θ_S1(t,i) = a1*(t+1) + a2*(i-1) + a0 - (a1*t + a2*i + a0) = a1 - a2
Each new hyper-plane must be linearly independent with the previous.
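The three distances depend only on a1 and a2, so candidate hyper-planes can be screened by plugging in coefficients. The candidate list below is an arbitrary illustration; the helper names are assumptions.

```python
# (dt, di) for the three Jacobi 1D dependences: (t+1, i), (t+1, i+1), (t+1, i-1)
JACOBI_DISTANCES = [(1, 0), (1, 1), (1, -1)]


def deltas(a1, a2):
    """(delta1, delta2, delta3) = (a1, a1 + a2, a1 - a2) for theta = a1*t + a2*i + a0."""
    return tuple(a1 * dt + a2 * di for dt, di in JACOBI_DISTANCES)


def is_legal(a1, a2):
    """A tiling hyper-plane is legal when every delta is non-negative."""
    return all(d >= 0 for d in deltas(a1, a2))


for a1, a2 in [(1, 0), (1, 1), (0, 1), (1, 2)]:
    print((a1, a2), deltas(a1, a2), is_legal(a1, a2))
# (1, 0) (1, 1, 1) True
# (1, 1) (1, 2, 0) True
# (0, 1) (0, 1, -1) False
# (1, 2) (1, 3, -1) False
```

The two legal candidates, θ = t and θ = t+i, are linearly independent of each other.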

22 Example 1: Jacobi 1D
We have a set of hyper-planes: θ_S1(t,i) = (t, t+i)
[Figure: two plots with axes t and i]
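To see why the skew matters, the sketch below executes the single-assignment Jacobi 1D in square tiles, either in the skewed coordinates (t, t+i) or in the original (t, i), and records whether any value would be read before it is produced. The body of `foo`, the problem sizes, the tile size, and all helper names are assumptions for illustration.

```python
def foo(a, b, c):
    # Stand-in for the unspecified foo; integer arithmetic keeps comparisons exact.
    return (a + 2 * b + 3 * c) % 1009


def reference(T, N, init):
    """Untiled execution in the original (t, i) order."""
    A = [list(init) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for i in range(1, N):
            A[t][i] = foo(A[t - 1][i], A[t - 1][i - 1], A[t - 1][i + 1])
    return A


def run_tiled(T, N, init, size, skew):
    """Execute tile by tile; tiles and points inside a tile run in lexicographic order."""
    A = [list(init) for _ in range(T + 1)]
    # Row t = 0 and the boundary columns are inputs, so they count as already produced.
    done = {(0, i) for i in range(N + 1)}
    done |= {(t, b) for t in range(T + 1) for b in (0, N)}
    legal = True
    for tile_p in range(T // size + 1):
        for tile_q in range((T + N) // size + 1):
            for p in range(tile_p * size, (tile_p + 1) * size):
                for q in range(tile_q * size, (tile_q + 1) * size):
                    # Invert the transform: (p, q) = (t, t + i) when skewed, else (t, i).
                    t, i = p, (q - p if skew else q)
                    if not (1 <= t <= T and 1 <= i <= N - 1):
                        continue
                    if any((t - 1, i + d) not in done for d in (-1, 0, 1)):
                        legal = False  # a value would be read before it is produced
                    A[t][i] = foo(A[t - 1][i], A[t - 1][i - 1], A[t - 1][i + 1])
                    done.add((t, i))
    return A, legal


T, N = 6, 10
init = [(7 * i + 3) % 11 for i in range(N + 1)]
expected = reference(T, N, init)
skewed_result, skewed_legal = run_tiled(T, N, init, size=4, skew=True)
plain_result, plain_legal = run_tiled(T, N, init, size=4, skew=False)
print(skewed_legal, skewed_result == expected)  # True True
print(plain_legal)                              # False
```

In the skewed space the three dependences become (1,0), (1,1) and (1,2), all non-negative, so each tile can run atomically. Without the skew, the dependence on A[t-1,i+1] points into a tile that has not run yet.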

23 Example 2: 2mm
Simplified a bit:
  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S1: C[i,j] += A[i,k] * B[k,j];
  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S2: E[i,j] += C[i,k] * D[k,j];
Dependences:
  S1[i,j,k] -> S1[i,j,k+1]
  S2[i,j,k] -> S2[i,j,k+1]
  S1[i,j,N] -> S2[i',j',k']: i=i' and j=k'

24 Example 2: 2mm (dim 1)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Easy ones:
  S1[i,j,k] -> S1[i,j,k+1]  gives a3 = 0
  S2[x,y,z] -> S2[x,y,z+1]  gives b3 = 0
Interesting case is the inter-statement dep.:
  S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  or S2[i,j,k] -> S1[i,k,N], that is, S2[x,y,z] -> S1[x,z,N]

25 Example 2: 2mm (dim 1)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Interesting case is the inter-statement dep.:
  S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  or S2[x,y,z] -> S1[x,z,N]
Its distance is θ_S2(x,y,z) - θ_S1(x,z,N):
  (b1*x + b2*y + b3*z + b0) - (a1*x + a2*z + a3*N + a0)
  => (b1-a1)*x + b2*y + (b3-a2)*z + b0 - a3*N - a0
  => (b1-a1)*x + b2*y - a2*z, using a3 = b3 = 0 (the constant term b0 - a0 is left out)

26 Example 2: 2mm (dim 1)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Minimize: (b1-a1)*x + b2*y - a2*z
subject to: a1+a2+b1+b2 ≠ 0, a,b ≥ 0 (plus weakly satisfied)
We get:
  θ_S1(i,j,k) = i
  θ_S2(x,y,z) = x

27 Example 2: 2mm (dim 2)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Minimize: (b1-a1)*x + b2*y - a2*z
subject to: (b1-a1)*x + b2*y + (b3-a2)*z + b0 - a3*N - a0 ≥ 0 (weakly satisfied), and linearly independent with the previous
We get:
  θ_S1(i,j,k) = j
  θ_S2(x,y,z) = z
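The first two dimensions can be checked by enumerating the inter-statement dependence for a small N and collecting θ_S2(target) - θ_S1(source). The value of N and the helper names are assumptions for illustration.

```python
N = 3  # assumed small bound; loops run over 0 .. N inclusive

# S1[i,j,N] -> S2[x,y,z] with i = x and j = z, stored as (source, target) pairs
dependence = [
    ((x, z, N), (x, y, z))
    for x in range(N + 1)
    for y in range(N + 1)
    for z in range(N + 1)
]


def distances(theta_s1, theta_s2):
    """Set of all values theta_S2(target) - theta_S1(source) over the dependence."""
    return {theta_s2(*tgt) - theta_s1(*src) for src, tgt in dependence}


dim1 = distances(lambda i, j, k: i, lambda x, y, z: x)  # theta_S1 = i, theta_S2 = x
dim2 = distances(lambda i, j, k: j, lambda x, y, z: z)  # theta_S1 = j, theta_S2 = z
print(dim1, dim2)  # {0} {0}
```

Both dimensions give distance 0 for every instance, so the dependence is weakly satisfied and the cost is as small as it can be.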

28 Example 2: 2mm (dim 3)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Minimize: (b1-a1)*x + b2*y - a2*z
subject to: (b1-a1)*x + b2*y + (b3-a2)*z + b0 - a3*N - a0 ≥ 0 (weakly satisfied), and linearly independent with the previous
Dependence: S1[i,j,N] -> S2[x,y,z]: i=x and j=z
Does θ_S1 = k and θ_S2 = y work? (a3 = 1, b2 = 1, rest 0)

29 Example 2: 2mm (dim 3)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Minimize: (b1-a1)*x + b2*y - a2*z
subject to: (b1-a1)*x + b2*y + (b3-a2)*z + b0 - a3*N - a0 ≥ 0 (weakly satisfied), and linearly independent with the previous
Dependence: S1[i,j,N] -> S2[x,y,z]: i=x and j=z, or S2[x,y,z] -> S1[x,z,N]
Does θ_S1 = k and θ_S2 = y work? (a3 = 1, b2 = 1, rest 0)

30 Example 2: 2mm (dim 3)
Prototype:
  θ_S1(i,j,k) = a1*i + a2*j + a3*k + a0
  θ_S2(x,y,z) = b1*x + b2*y + b3*z + b0
Minimize: (b1-a1)*x + b2*y - a2*z
subject to: (b1-a1)*x + b2*y + (b3-a2)*z + b0 - a3*N - a0 ≥ 0 (weakly satisfied), and linearly independent with the previous
Dependence: S1[i,j,N] -> S2[x,y,z]: i=x and j=z, or S2[x,y,z] -> S1[x,z,N]
We have to split here: θ_S1 = 0 and θ_S2 = 1
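A quick enumeration shows why the candidate θ_S1 = k, θ_S2 = y fails and why the scalar split works. The value of N and the helper names are assumptions for illustration.

```python
N = 3  # assumed small bound; loops run over 0 .. N inclusive

# S1[i,j,N] -> S2[x,y,z] with i = x and j = z, stored as (source, target) pairs
dependence = [
    ((x, z, N), (x, y, z))
    for x in range(N + 1)
    for y in range(N + 1)
    for z in range(N + 1)
]


def distances(theta_s1, theta_s2):
    """Set of all values theta_S2(target) - theta_S1(source) over the dependence."""
    return {theta_s2(*tgt) - theta_s1(*src) for src, tgt in dependence}


# Candidate from the slide: a3 = 1, b2 = 1, rest 0. The distance is y - N.
candidate = distances(lambda i, j, k: k, lambda x, y, z: y)
# Split: constant schedules that only order the two statements.
split = distances(lambda i, j, k: 0, lambda x, y, z: 1)

print(sorted(candidate))  # [-3, -2, -1, 0]
print(split)              # {1}
```

The candidate yields negative distances whenever y < N, so it violates the legality condition. The split places every instance of S1 before every instance of S2 within the enclosing dimensions, which satisfies the dependence with distance 1.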

31 Example 2: 2mm (dim 4)
Proceed to the 4th dimension, because the 3rd dimension is only for statement ordering.
Now solve the problem independently for each statement:
  Case S1: linearly independent with [i] and [j]
  Case S2: linearly independent with [x] and [z]
Dependences:
  S1[i,j,k] -> S1[i,j,k+1]
  S2[x,y,z] -> S2[x,y,z+1]
We get [k] and [y].

32 Example 2: 2mm
Finally, we have a set of hyper-planes:
  θ_S1(i,j,k) = (i,j,0,k)
  θ_S2(i,j,k) = (i,k,1,j)
Tilable band: the first two dimensions.
Original:
  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S1: C[i,j] += A[i,k] * B[k,j];
  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S2: E[i,j] += C[i,k] * D[k,j];
Transformed:
  for i = 0 .. N
    for j = 0 .. N {
      for k = 0 .. N
        S1: C[i,j] += A[i,k] * B[k,j];
      for k = 0 .. N
        S2: E[i,k] += C[i,j] * D[j,k];
    }
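The equivalence of the two loop nests can be checked directly by running both on the same inputs. This is a plain Python transcription of the two versions; the matrix size, the input values, and the function names are assumptions for illustration.

```python
def zeros(n):
    return [[0] * n for _ in range(n)]


def original_2mm(A, B, D, n):
    """Two separate loop nests: all of S1, then all of S2."""
    C, E = zeros(n), zeros(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # S1
    for i in range(n):
        for j in range(n):
            for k in range(n):
                E[i][j] += C[i][k] * D[k][j]  # S2
    return E


def fused_2mm(A, B, D, n):
    """Fused on the two outer loops; S2 runs with its j and k loops interchanged."""
    C, E = zeros(n), zeros(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # S1 at (i, j, 0, k)
            for k in range(n):
                E[i][k] += C[i][j] * D[j][k]  # S2 at (i, j, 1, k) after renaming
    return E


n = 4
A = [[(3 * r + c) % 5 for c in range(n)] for r in range(n)]
B = [[(r + 2 * c) % 7 for c in range(n)] for r in range(n)]
D = [[(5 * r + c) % 3 for c in range(n)] for r in range(n)]
print(original_2mm(A, B, D, n) == fused_2mm(A, B, D, n))  # True
```

In the fused version, C[i,j] is complete when the first inner loop ends and is consumed immediately by the second, which is the locality benefit of the transform.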

33 Example 2: 2mm
Output of Pluto

34 Summary of Pluto
Paper in 2008; huge impact: 350+ citations already.
Works very well as the default strategy.
But, it is far from perfect!

