
CR18: Advanced Compilers L07: Beyond Loop Transformations Tomofumi Yuki.


1 CR18: Advanced Compilers L07: Beyond Loop Transformations Tomofumi Yuki

2 Today's Agenda
Look at some of the "extra-ordinary" transformations:
  Reductions and Scans: parallelizing "sequential" programs
  Simplifying Reductions: reducing asymptotic complexity
  Sparse Computations: what to do with indirect arrays?

3 Reductions in Alpha
A reduction is a many-to-one mapping from body to answer: bodies mapped to the same point of the answer space are combined with the operator.
  reduce(+, (i->), A[i]);       summation
  reduce(+, (i,j->i), A[i,j]);  row-wise sum

4 Reductions in Alpha
Some reductions do not directly correspond to a Σ, e.g., summation along the anti-diagonals:
  reduce(+, (i,j->i+j), A[i,j]);

5 Scans
Can you parallelize this code?
  for i = 1..N
    x[i] = x[i-1] + a[i];
Associativity allows for some parallelism:
  for i = 1..N/2
    x[i] = x[i-1] + a[i];
  for i = (N/2)+1..N
    x'[i] = x'[i-1] + a[i];
  for i = (N/2)+1..N
    x[i] = x'[i] + x[N/2];
The first two loops are independent.
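The loop split above can be sketched in runnable form. This is an illustrative Python sketch (the function name and list-based style are mine, not from the slides): each half is scanned independently, then the second half is patched with the first half's total.

```python
def prefix_sum_split(a):
    """Prefix sum computed as two independent half-scans plus a patch,
    mirroring the loop split on the slide (hypothetical helper)."""
    n = len(a)
    h = n // 2
    left, right = [], []
    # Phase 1: scan each half independently (these two loops could
    # run in parallel, like the first two loops on the slide).
    acc = 0
    for v in a[:h]:
        acc += v
        left.append(acc)       # x[1..N/2]
    acc = 0
    for v in a[h:]:
        acc += v
        right.append(acc)      # x'[(N/2)+1..N]
    # Phase 2: add x[N/2] (the left half's total) to the right half.
    total = left[-1] if left else 0
    return left + [v + total for v in right]

assert prefix_sum_split([1, 2, 3, 4]) == [1, 3, 6, 10]
```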

6 Scans and Reductions
Reduction is a special case of scan; the difference is that a scan saves the intermediate values.
  reduction:  for (i = 0..N) x += A[i];
  scan:       for (i = 0..N) x[i] = x[i-1] + A[i];
In equations:
  x[i] = A[0]          : i=0
  x[i] = x[i-1] + A[i] : i>0

7 Reduction Detection
This is one place where you need Alpha, or expressions for each operation.
  Redon-Feautrier [1995]: polyhedral
  Sato-Iwasaki [2011]: non-polyhedral, SVU-form
  Zou-Rajopadhye [2012]: polyhedral + SVU-form
All approaches are based on pattern matching.

8 Pattern Matching
Given an Alpha program, look for
  x[i] = x[i-1] + ...
Of course, you can add more patterns to match:
  x[i] = b[i]*x[i-1] + a[i]
  x[i,j] = x[i,j-1] + ... : j>0
         = x[i-1,N] + ... : j=0
This is good enough for many cases.

9 State Vector Update Form
Another view of the recurrence x[i] = b[i]*x[i-1] + a[i] is as a matrix-vector multiply:
  [ x[i] ]   [ b[i]  a[i] ] [ x[i-1] ]
  [  1   ] = [  0     1   ] [   1    ]

10 State Vector Update Form
It can encode recurrences of different types, e.g., a mutual recurrence:
  x[i] = x[i-1] + y[i-1] + a[i]
  y[i] = x[i-1] + y[i-1] + b[i]

11 State Vector Update Form
It can also encode a higher-order recurrence:
  x[i] = w0 + w1*x[i-1] + w2*x[i-2]

12 Scans as Matrix Multiply
Can you parallelize this loop?
  for i = 1..N
    x[i] = x[i-1] + a[i];
Rewrite each iteration as a matrix A[i] acting on a state vector X. Then:
  X[i] = A[i] X[i-1]
and the chain of matrix products can be regrouped by associativity.
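As a hedged sketch of the X[i] = A[i] X[i-1] view (the names step_matrix, matmul2, and combine are mine): each step x[i] = x[i-1] + a[i] becomes the 2x2 matrix [[1, a_i], [0, 1]] acting on the state (x, 1), and since matrix multiplication is associative, the product can be regrouped, which is exactly the parallelization opportunity.

```python
from functools import reduce

def step_matrix(ai):
    # x[i] = x[i-1] + a[i] written as [[1, a_i], [0, 1]] acting on (x, 1).
    return [[1, ai], [0, 1]]

def matmul2(A, B):
    # Plain 2x2 matrix product.
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def combine(mats):
    # Product A_k ... A_1 of the step matrices (applied left of the state).
    return reduce(lambda M, A: matmul2(A, M), mats, [[1, 0], [0, 1]])

a = [1, 2, 3, 4]
mats = [step_matrix(v) for v in a]

# Sequential association vs. a split association (the parallel version):
seq = combine(mats)
par = matmul2(combine(mats[2:]), combine(mats[:2]))
assert seq == par                # associativity makes the split legal
assert seq[0][1] == sum(a)       # applied to (x0=0, 1), this is the total
```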

13 Standard Parallelization
[Figure: the input is split into blocks; each block is reduced to its sum, the block sums are scanned, and a final pass of per-block scans produces the full scan. Slides from Yun Zou.]

14 Parallelization Optimization
[Figure: an optimized variant in which the first block is scanned directly, so only the remaining blocks need the reduction phase. Slides from Yun Zou.]

15 Experimental Validation
[Figures: speedup for Convolution; speedup for other examples. Slides from Yun Zou.]

16 Notes on Scan Parallelization
Scan parallelization allows parallelization by "breaking" dependences, even though all instances are ordered:
  for (i = 0..N) x += ...
BUT you should not use this parallelism unless you have to! (The parallel version performs extra work.)


18 Simplifying Reductions [Gupta & Rajopadhye 2006]
Automatic reduction in asymptotic complexity; detection of prefix computations.
Example (0 ≤ i ≤ N): y[i] = Σ (k = 0..i) x[k]. How many loops do you need?

19 Prefix Sum
Naive implementation:
  for i=0..N
    for k=0..i
      y[i] += x[k];
Better implementation:
  y[0] = x[0];
  for i=1..N
    y[i] = y[i-1] + x[i];
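The two versions can be cross-checked against each other in Python (a sketch; the function names are mine):

```python
def prefix_naive(x):
    # O(n^2): every y[i] recomputes its whole prefix, as in the naive loop.
    n = len(x)
    y = [0] * n
    for i in range(n):
        for k in range(i + 1):
            y[i] += x[k]
    return y

def prefix_reuse(x):
    # O(n): y[i] reuses y[i-1]; the essence of simplifying the reduction.
    y = [x[0]]
    for i in range(1, len(x)):
        y.append(y[i - 1] + x[i])
    return y

data = [3, 1, 4, 1, 5]
assert prefix_naive(data) == prefix_reuse(data) == [3, 4, 8, 9, 14]
```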

20-21 Polyhedral View
[Figure, two steps: reuse among instances of the reduction, shown in the (i,k) iteration domain.]

22-23 Geometric View
[Figure, two steps: translating the reduction domain exposes a reuse domain and an addition domain.]

24-26 Less Obvious Example
Parallelogram (Segment Sum), width M.
[Figure, three steps: translating the parallelogram domain yields an addition domain, a reuse domain, and a subtract domain.]
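For the segment sum, the subtract domain corresponds to using the operator's inverse: each window sum is obtained from the previous one by adding one element and subtracting another. A hedged Python sketch (names and fixed window width M are mine):

```python
def segment_sums_naive(x, M):
    # O(n*M): sum every length-M window from scratch.
    return [sum(x[i - M + 1:i + 1]) for i in range(M - 1, len(x))]

def segment_sums_reuse(x, M):
    # O(n): translate the previous window's sum; the overlap (reuse
    # domain) is kept, one element is added (addition domain), and one
    # is subtracted (subtract domain, which requires an inverse for +).
    y = [sum(x[:M])]
    for i in range(M, len(x)):
        y.append(y[-1] + x[i] - x[i - M])
    return y

x = [2, 3, 1, 2, 1, 3]
assert segment_sums_naive(x, 3) == segment_sums_reuse(x, 3) == [6, 6, 4, 6]
```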

27 Maximum Segment Sum
Can you find a more efficient implementation?
  m = -inf;
  for (i=0..N)
    for (j=i..N) {
      sum = 0;
      for (k=i..j)
        sum += A[k];
      m = max(sum, m);
    }

28 (worked out on the board)

29 Simplified MSS
Down to O(n):
  m = X[0] = A[0];
  for (j=1..N) {
    X[j] = max(A[j], A[j]+X[j-1]);
    m = max(m, X[j]);
  }
compared to the original:
  m = -inf;
  for (i=0..N)
    for (j=i..N) {
      sum = 0;
      for (k=i..j)
        sum += A[k];
      m = max(sum, m);
    }
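The cubic and linear versions can be cross-checked in Python (a sketch; X tracks the best sum of a segment ending at the current position, as on the slide):

```python
def mss_cubic(A):
    # Direct definition: maximum over all segments, O(n^3).
    m = float("-inf")
    n = len(A)
    for i in range(n):
        for j in range(i, n):
            m = max(m, sum(A[i:j + 1]))
    return m

def mss_linear(A):
    # Simplified form: X is the best segment sum ending at j.
    m = X = A[0]
    for a in A[1:]:
        X = max(a, a + X)
        m = max(m, X)
    return m

A = [2, -3, 4, -1, 2, -5, 3]
assert mss_cubic(A) == mss_linear(A) == 5   # segment [4, -1, 2]
```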

30 Optimality of SR
An algorithm exists to reach the optimal complexity attainable with SR. There are many options, e.g., reduction decomposition:
  reduce(+, (i,j->), ...)
  = reduce(+, (i->), reduce(+, (i,j->i), ...))
  = reduce(+, (j->), reduce(+, (i,j->j), ...))

31 Optimality of SR
Distribution:
  reduce(+, (i,j->), E1 + E2)
  = reduce(+, (i,j->), E1) + reduce(+, (i,j->), E2)
Decomposition (factoring out a sub-expression that is constant per answer point):
  reduce(+, (i,j->i+j), a[i+j]*x[i,j])
  = a[i+j] * reduce(+, (i,j->i+j), x[i,j])
Dynamic programming is used to search through all possible transformations of the reductions.

32 MSS in Program Derivation
The original version appeared in the 70s; the linear algorithm appeared in "Programming Pearls", 1984. MSS has been used as an example for program derivation.
Program derivation: if you transform a program with a set of rules, the result is guaranteed to preserve semantics.
  This proves the linear version correct.
  The linear version is also reached by the optimality algorithm.

33 UNAfold Case Study
UNAfold is an RNA folding algorithm frequently used in bioinformatics. It has O(n^4) complexity; an O(n^3) algorithm was proposed but never implemented, because it was too complicated to do by hand. It turns out to be an instance of SR.

34 UNAfold
[Figure.]

35-36 UNAfold Results
[Figures.]

37 Notes on SR
The author is now CEO of a start-up that is still alive after 5+ years, with enough profit to sustain a small company. They are contractors who optimize client code using a combination of manual and tool-assisted work; instances of SR are found more often than you would expect.


39 Sparse Computations
Sparse matrices: when the number of non-zeros is very small, using a dense matrix is wasteful. Sparse computations are very slow, and SpMV is the dominant computation:
  Sparse Matrix-Vector multiplication: y = Ax

40 Sparse Computations
Usually the goal is to solve Ax = b, where A is sparse and too big to solve directly, so iterative methods are used:
  Jacobi / Gauss-Seidel
  Krylov subspace based: Conjugate Gradient

41 Krylov Subspace
The space spanned by b, Ab, A^2 b, ..., A^k b. The approximate solution for x lies in this subspace, so you want to compute A^k b by successive SpMV. (This actually applies to dense matrices too.)
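Building the Krylov vectors never forms A^k explicitly; it only applies the matrix-vector product k times. A minimal dense sketch (matvec stands in for an SpMV kernel; the names are mine):

```python
def matvec(A, v):
    # Dense matrix-vector product; in practice this would be an SpMV.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def krylov_vectors(A, b, k):
    # b, Ab, A^2 b, ..., A^k b via k successive matrix-vector products.
    vecs = [b]
    for _ in range(k):
        vecs.append(matvec(A, vecs[-1]))
    return vecs

A = [[2, 0], [0, 3]]
b = [1, 1]
assert krylov_vectors(A, b, 2) == [[1, 1], [2, 3], [4, 9]]
```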

42 Sparse Matrix Vector Multiply
The code differs depending on the format. Compressed Sparse Row (CSR):
  for (i=0..N)                      // traversing rows
    for (j=start[i]..start[i+1]-1)  // range of a row
      y[i] += val[j] * x[col[j]];   // looking up the column
Example compressed matrix:
  start = [0,1,3,3,4]
  val   = [1,2,3,4]
  col   = [0,0,1,2]
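The CSR loop and the example arrays from the slide can be run directly (a sketch; with these arrays the matrix has 4 rows, row 2 being empty):

```python
def spmv_csr(start, val, col, x):
    # CSR SpMV: row i's nonzeros are val[start[i]:start[i+1]],
    # with their column indices in col over the same range.
    n = len(start) - 1
    y = [0] * n
    for i in range(n):
        for j in range(start[i], start[i + 1]):
            y[i] += val[j] * x[col[j]]
    return y

# The slide's example arrays:
start, val, col = [0, 1, 3, 3, 4], [1, 2, 3, 4], [0, 0, 1, 2]
assert spmv_csr(start, val, col, [1, 1, 1]) == [1, 5, 0, 4]
```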

43 Sparse Matrix Representations
  Full (another domain)
  CSR
  Blocked CSR
  Compressed Sparse Columns
  Compressed Diagonal
  Coordinate
  ...
Each is specialized for particular shapes of sparse matrices.

44 So what do we do?
If we try static analyses, we can still recover parallelism:
  for (i=0..N)
    for (j=start[i]..start[i+1]-1)
      y[i] += val[j] * x[col[j]];
becomes
  forall (i=0..N)
    for (j=start[i]..start[i+1]-1)
      y[i] += val[j] * x[col[j]];
But that isn't the real problem.

45 Data Reordering
Consider writes with indirection; you might have bad locality behavior:
  for (i=0..N)
    A[B[i]] = ...
[Figure: iterations i = 0..6 mapped through B[i] to scattered elements of A (A0..A7).]

46 Data Reordering
You could change how the data is stored:
  for (i=0..N)
    A'[B'[i]] = ...
[Figure: A reordered into A' (A0, A7, A5, A3, A4, A6, A1, A2) so that consecutive iterations touch nearby elements.]

47 Inspector/Executor Strategy
Run-time reordering: inspect the data dependences at run time, then execute with reordered data/iterations.
Original:
  for (i=0..N)
    A[B[i]] = ...
Inspector:
  for (i=0..N) inspect(B[i]);
  for (i=0..N) sigma[i] = ...
  for (i=0..N) {
    A'[sigma[i]] = A[i];
    B'[i] = sigma[B[i]];
  }
Executor:
  for (i=0..N)
    A'[B'[i]] = ...
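A toy inspector/executor in Python (the first-touch ordering heuristic and all names are mine, purely illustrative): the inspector builds a permutation sigma from the observed index array, the data and indices are remapped, and the executor runs the original loop shape on the remapped versions.

```python
def inspector(B, n_data):
    # Build sigma so data elements are stored in first-touch order.
    sigma = [None] * n_data
    nxt = 0
    for b in B:
        if sigma[b] is None:
            sigma[b] = nxt
            nxt += 1
    for d in range(n_data):     # untouched elements go to the end
        if sigma[d] is None:
            sigma[d] = nxt
            nxt += 1
    return sigma

def reorder(A, B, sigma):
    # A'[sigma[i]] = A[i];  B'[i] = sigma[B[i]]  (as on the slide).
    Ap = [None] * len(A)
    for d, a in enumerate(A):
        Ap[sigma[d]] = a
    Bp = [sigma[b] for b in B]
    return Ap, Bp

def executor(Ap, Bp):
    # The original loop shape, now touching A' in a friendlier order.
    for i, b in enumerate(Bp):
        Ap[b] = i               # stands in for "A'[B'[i]] = ..."
    return Ap

A, B = [10, 11, 12, 13], [2, 0, 2, 3]
sigma = inspector(B, len(A))
Ap, Bp = reorder(A, B, sigma)
executor(Ap, Bp)
# Un-permuting A' reproduces the original loop's effect on A.
orig = A[:]
for i, b in enumerate(B):
    orig[b] = i
assert [Ap[sigma[d]] for d in range(len(A))] == orig
```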

48 Full Sparse Tiling
Tiling for sparse computations. Goal: temporal locality with inspector/executor. Approach (oversimplified):
  inspect to build the dependence graph
  partition and schedule the graph
  execute the partitions

49 Iterative Solvers
Iterative solver for Ax = b with, e.g., a banded matrix; each update reads a few earlier values:
  x4 <- (x0, x8)
  x5 <- (x1, x9)
  x6 <- (x2, x10)
  x7 <- (x3, x11)
  x8 <- (x4, x12)
  ...

50 Iterative Solvers
Refine the approximation of x each iteration. Which x's an update uses depends on the matrix, but the pattern is common across time steps.

51 Graph Partitioning
Find some seed partitioning. Criteria: (roughly) equal sizes, and fewer edges crossing partitions.

52 Graph Partitioning
Then grow the partitions. Some tiles become larger due to dependences, which also need to be satisfied.

53 A Lot in the Inspector
Partitioning has many different complications:
  load balancing
  parallelism across tiles
  overhead of the partitioning algorithm (multiple passes over the data are needed to amortize it)
Things get complicated when the structure changes over time (e.g., molecules moving around).

54 Sparse Polyhedral Framework
An extension of the polyhedral model [Strout et al.]: add uninterpreted function symbols. Goal: express run-time reordering transformations and generate inspector/executor code. For example, the access x[col[j]] in
  for (i=0..N)
    for (j=start[i]..start[i+1]-1)
      y[i] += val[j] * x[col[j]];
is described by the relation
  [N] -> {[i,j] -> x[v] : v = col(j)}

