
CR18: Advanced Compilers L07: Beyond Loop Transformations Tomofumi Yuki.


1 CR18: Advanced Compilers L07: Beyond Loop Transformations Tomofumi Yuki

2 Today's Agenda
Look at some of the "extra-ordinary" transformations:
  Reductions and Scans: parallelizing "sequential" programs
  Simplifying Reductions: reducing asymptotic complexity
  Sparse Computations: what to do with indirect arrays?

3 Reductions in Alpha
A reduction is a many-to-one mapping from body to answer: bodies mapped to the same point of the answer space are combined with the operator.
  reduce(+, (i->), A[i]);       summation
  reduce(+, (i,j->i), A[i,j]);  row-wise sum

4 Reductions in Alpha
Some reductions do not directly correspond to a Σ, e.g., summation along the anti-diagonals:
  reduce(+, (i,j->i+j), A[i,j]);

5 Scans
Can you parallelize this code?
  for i = 1..N
    x[i] = x[i-1] + a[i];
Associativity allows for some parallelism:
  for i = 1..N/2
    x[i] = x[i-1] + a[i];
  for i = (N/2)+1..N
    x'[i] = x'[i-1] + a[i];
  for i = (N/2)+1..N
    x[i] = x'[i] + x[N/2];
The first two loops are independent.
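The loop split above can be sketched in runnable form. This is an illustrative Python sketch (the function name and list-based style are mine, not from the slides): each half is scanned independently, then the second half is patched with the first half's total.

```python
def prefix_sum_split(a):
    """Prefix sum computed as two independent half-scans plus a patch,
    mirroring the loop split on the slide (hypothetical helper)."""
    n = len(a)
    h = n // 2
    left, right = [], []
    # Phase 1: scan each half independently (these two loops could
    # run in parallel, like the first two loops on the slide).
    acc = 0
    for v in a[:h]:
        acc += v
        left.append(acc)       # x[1..N/2]
    acc = 0
    for v in a[h:]:
        acc += v
        right.append(acc)      # x'[(N/2)+1..N]
    # Phase 2: add x[N/2] (the left half's total) to the right half.
    total = left[-1] if left else 0
    return left + [v + total for v in right]

assert prefix_sum_split([1, 2, 3, 4]) == [1, 3, 6, 10]
```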

6 Scans and Reductions
Reduction is a special case of scan; the difference is that a scan saves the intermediate values.
  reduction:  for (i = 0..N) x += A[i];
  scan:       for (i = 0..N) x[i] = x[i-1] + A[i];
In equations:
  x[i] = A[0]          : i=0
  x[i] = x[i-1] + A[i] : i>0

7 Reduction Detection
This is one place where you need Alpha, or expressions for each operation.
  Redon-Feautrier [1995]: polyhedral
  Sato-Iwasaki [2011]: non-polyhedral, SVU-form
  Zou-Rajopadhye [2012]: polyhedral + SVU-form
All approaches are based on pattern matching.

8 Pattern Matching
Given an Alpha program, look for
  x[i] = x[i-1] + ...
Of course, you can add more patterns to match:
  x[i] = b[i]*x[i-1] + a[i]
  x[i,j] = x[i,j-1] + ... : j>0
         = x[i-1,N] + ... : j=0
This is good enough for many cases.

9 State Vector Update Form
Another view of the recurrence x[i] = b[i]*x[i-1] + a[i] is as a matrix-vector multiply:
  [ x[i] ]   [ b[i]  a[i] ] [ x[i-1] ]
  [  1   ] = [  0     1   ] [   1    ]

10 State Vector Update Form
It can encode recurrences of different types, e.g., a mutual recurrence:
  x[i] = x[i-1] + y[i-1] + a[i]
  y[i] = x[i-1] + y[i-1] + b[i]

11 State Vector Update Form
It can also encode a higher-order recurrence:
  x[i] = w0 + w1*x[i-1] + w2*x[i-2]

12 Scans as Matrix Multiply
Can you parallelize this loop?
  for i = 1..N
    x[i] = x[i-1] + a[i];
Rewrite each iteration as a matrix A[i] acting on a state vector X. Then:
  X[i] = A[i] X[i-1]
and the chain of matrix products can be regrouped by associativity.
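As a hedged sketch of the X[i] = A[i] X[i-1] view (the names step_matrix, matmul2, and combine are mine): each step x[i] = x[i-1] + a[i] becomes the 2x2 matrix [[1, a_i], [0, 1]] acting on the state (x, 1), and since matrix multiplication is associative, the product can be regrouped, which is exactly the parallelization opportunity.

```python
from functools import reduce

def step_matrix(ai):
    # x[i] = x[i-1] + a[i] written as [[1, a_i], [0, 1]] acting on (x, 1).
    return [[1, ai], [0, 1]]

def matmul2(A, B):
    # Plain 2x2 matrix product.
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def combine(mats):
    # Product A_k ... A_1 of the step matrices (applied left of the state).
    return reduce(lambda M, A: matmul2(A, M), mats, [[1, 0], [0, 1]])

a = [1, 2, 3, 4]
mats = [step_matrix(v) for v in a]

# Sequential association vs. a split association (the parallel version):
seq = combine(mats)
par = matmul2(combine(mats[2:]), combine(mats[:2]))
assert seq == par                # associativity makes the split legal
assert seq[0][1] == sum(a)       # applied to (x0=0, 1), this is the total
```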

13 Standard Parallelization
[Figure: the input is split into blocks; each block is reduced to its sum, the block sums are scanned, and a final pass of per-block scans produces the full scan. Slides from Yun Zou.]

14 Parallelization Optimization
[Figure: an optimized variant in which the first block is scanned directly, so only the remaining blocks need the reduction phase. Slides from Yun Zou.]

15 Experimental Validation
[Figures: speedup for Convolution; speedup for other examples. Slides from Yun Zou.]

16 Notes on Scan Parallelization
Scan parallelization allows parallelization by "breaking" dependences, even though all instances are ordered:
  for (i = 0..N) x += ...
BUT you should not use this parallelism unless you have to! (The parallel version performs extra work.)


18 Simplifying Reductions [Gupta & Rajopadhye 2006]
Automatic reduction in asymptotic complexity; detection of prefix computations.
Example (0 ≤ i ≤ N): y[i] = Σ (k = 0..i) x[k]. How many loops do you need?

19 Prefix Sum
Naive implementation:
  for i=0..N
    for k=0..i
      y[i] += x[k];
Better implementation:
  y[0] = x[0];
  for i=1..N
    y[i] = y[i-1] + x[i];
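The two versions can be cross-checked against each other in Python (a sketch; the function names are mine):

```python
def prefix_naive(x):
    # O(n^2): every y[i] recomputes its whole prefix, as in the naive loop.
    n = len(x)
    y = [0] * n
    for i in range(n):
        for k in range(i + 1):
            y[i] += x[k]
    return y

def prefix_reuse(x):
    # O(n): y[i] reuses y[i-1]; the essence of simplifying the reduction.
    y = [x[0]]
    for i in range(1, len(x)):
        y.append(y[i - 1] + x[i])
    return y

data = [3, 1, 4, 1, 5]
assert prefix_naive(data) == prefix_reuse(data) == [3, 4, 8, 9, 14]
```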

20-21 Polyhedral View
[Figure, two steps: reuse among instances of the reduction, shown in the (i,k) iteration domain.]

22-23 Geometric View
[Figure, two steps: translating the reduction domain exposes a reuse domain and an addition domain.]

24-26 Less Obvious Example
Parallelogram (Segment Sum), width M.
[Figure, three steps: translating the parallelogram domain yields an addition domain, a reuse domain, and a subtract domain.]
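For the segment sum, the subtract domain corresponds to using the operator's inverse: each window sum is obtained from the previous one by adding one element and subtracting another. A hedged Python sketch (names and fixed window width M are mine):

```python
def segment_sums_naive(x, M):
    # O(n*M): sum every length-M window from scratch.
    return [sum(x[i - M + 1:i + 1]) for i in range(M - 1, len(x))]

def segment_sums_reuse(x, M):
    # O(n): translate the previous window's sum; the overlap (reuse
    # domain) is kept, one element is added (addition domain), and one
    # is subtracted (subtract domain, which requires an inverse for +).
    y = [sum(x[:M])]
    for i in range(M, len(x)):
        y.append(y[-1] + x[i] - x[i - M])
    return y

x = [2, 3, 1, 2, 1, 3]
assert segment_sums_naive(x, 3) == segment_sums_reuse(x, 3) == [6, 6, 4, 6]
```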

27 Maximum Segment Sum
Can you find a more efficient implementation?
  m = -inf;
  for (i=0..N)
    for (j=i..N) {
      sum = 0;
      for (k=i..j)
        sum += A[k];
      m = max(sum, m);
    }

28 (worked out on the board)

29 Simplified MSS
Down to O(n):
  m = X[0] = A[0];
  for (j=1..N) {
    X[j] = max(A[j], A[j]+X[j-1]);
    m = max(m, X[j]);
  }
compared to the original:
  m = -inf;
  for (i=0..N)
    for (j=i..N) {
      sum = 0;
      for (k=i..j)
        sum += A[k];
      m = max(sum, m);
    }
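The cubic and linear versions can be cross-checked in Python (a sketch; X tracks the best sum of a segment ending at the current position, as on the slide):

```python
def mss_cubic(A):
    # Direct definition: maximum over all segments, O(n^3).
    m = float("-inf")
    n = len(A)
    for i in range(n):
        for j in range(i, n):
            m = max(m, sum(A[i:j + 1]))
    return m

def mss_linear(A):
    # Simplified form: X is the best segment sum ending at j.
    m = X = A[0]
    for a in A[1:]:
        X = max(a, a + X)
        m = max(m, X)
    return m

A = [2, -3, 4, -1, 2, -5, 3]
assert mss_cubic(A) == mss_linear(A) == 5   # segment [4, -1, 2]
```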

30 Optimality of SR
An algorithm exists to reach the optimal complexity attainable with SR. There are many options, e.g., reduction decomposition:
  reduce(+, (i,j->), ...)
  = reduce(+, (i->), reduce(+, (i,j->i), ...))
  = reduce(+, (j->), reduce(+, (i,j->j), ...))

31 Optimality of SR
Distribution:
  reduce(+, (i,j->), E1 + E2)
  = reduce(+, (i,j->), E1) + reduce(+, (i,j->), E2)
Decomposition (factoring out a sub-expression that is constant per answer point):
  reduce(+, (i,j->i+j), a[i+j]*x[i,j])
  = a[i+j] * reduce(+, (i,j->i+j), x[i,j])
Dynamic programming is used to search through all possible transformations of the reductions.

32 MSS in Program Derivation
The original version appeared in the 70s; the linear algorithm appeared in "Programming Pearls", 1984. MSS has been used as an example for program derivation.
Program derivation: if you transform a program with a set of rules, the result is guaranteed to preserve semantics.
  This proves the linear version correct.
  The linear version is also reached by the optimality algorithm.

33 UNAfold Case Study
UNAfold is an RNA folding algorithm frequently used in bioinformatics. It has O(n^4) complexity; an O(n^3) algorithm was proposed but never implemented, because it was too complicated to do by hand. It turns out to be an instance of SR.

34 UNAfold
[Figure.]

35-36 UNAfold Results
[Figures.]

37 Notes on SR
The author is now CEO of a start-up that is still alive after 5+ years, with enough profit to sustain a small company. They are contractors who optimize client code using a combination of manual and tool-assisted work; instances of SR are found more often than you would expect.


39 Sparse Computations
Sparse matrices: when the number of non-zeros is very small, using a dense matrix is wasteful. Sparse computations are very slow, and SpMV is the dominant computation:
  Sparse Matrix-Vector multiplication: y = Ax

40 Sparse Computations
Usually the goal is to solve Ax = b, where A is sparse and too big to solve directly, so iterative methods are used:
  Jacobi / Gauss-Seidel
  Krylov subspace based: Conjugate Gradient

41 Krylov Subspace
The space spanned by b, Ab, A^2 b, ..., A^k b. The approximate solution for x lies in this subspace, so you want to compute A^k b by successive SpMV. (This actually applies to dense matrices too.)
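Building the Krylov vectors never forms A^k explicitly; it only applies the matrix-vector product k times. A minimal dense sketch (matvec stands in for an SpMV kernel; the names are mine):

```python
def matvec(A, v):
    # Dense matrix-vector product; in practice this would be an SpMV.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def krylov_vectors(A, b, k):
    # b, Ab, A^2 b, ..., A^k b via k successive matrix-vector products.
    vecs = [b]
    for _ in range(k):
        vecs.append(matvec(A, vecs[-1]))
    return vecs

A = [[2, 0], [0, 3]]
b = [1, 1]
assert krylov_vectors(A, b, 2) == [[1, 1], [2, 3], [4, 9]]
```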

42 Sparse Matrix Vector Multiply
The code differs depending on the format. Compressed Sparse Row (CSR):
  for (i=0..N)                      // traversing rows
    for (j=start[i]..start[i+1]-1)  // range of a row
      y[i] += val[j] * x[col[j]];   // looking up the column
Example compressed matrix:
  start = [0,1,3,3,4]
  val   = [1,2,3,4]
  col   = [0,0,1,2]
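The CSR loop and the example arrays from the slide can be run directly (a sketch; with these arrays the matrix has 4 rows, row 2 being empty):

```python
def spmv_csr(start, val, col, x):
    # CSR SpMV: row i's nonzeros are val[start[i]:start[i+1]],
    # with their column indices in col over the same range.
    n = len(start) - 1
    y = [0] * n
    for i in range(n):
        for j in range(start[i], start[i + 1]):
            y[i] += val[j] * x[col[j]]
    return y

# The slide's example arrays:
start, val, col = [0, 1, 3, 3, 4], [1, 2, 3, 4], [0, 0, 1, 2]
assert spmv_csr(start, val, col, [1, 1, 1]) == [1, 5, 0, 4]
```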

43 Sparse Matrix Representations
  Full (another domain)
  CSR
  Blocked CSR
  Compressed Sparse Columns
  Compressed Diagonal
  Coordinate
  ...
Each is specialized for particular shapes of sparse matrices.

44 So what do we do?
If we try static analyses, we can still recover parallelism:
  for (i=0..N)
    for (j=start[i]..start[i+1]-1)
      y[i] += val[j] * x[col[j]];
becomes
  forall (i=0..N)
    for (j=start[i]..start[i+1]-1)
      y[i] += val[j] * x[col[j]];
But that isn't the real problem.

45 Data Reordering
Consider writes with indirection; you might have bad locality behavior:
  for (i=0..N)
    A[B[i]] = ...
[Figure: iterations i = 0..6 mapped through B[i] to scattered elements of A (A0..A7).]

46 Data Reordering
You could change how the data is stored:
  for (i=0..N)
    A'[B'[i]] = ...
[Figure: A reordered into A' (A0, A7, A5, A3, A4, A6, A1, A2) so that consecutive iterations touch nearby elements.]

47 Inspector/Executor Strategy
Run-time reordering: inspect the data dependences at run time, then execute with reordered data/iterations.
Original:
  for (i=0..N)
    A[B[i]] = ...
Inspector:
  for (i=0..N) inspect(B[i]);
  for (i=0..N) sigma[i] = ...
  for (i=0..N) {
    A'[sigma[i]] = A[i];
    B'[i] = sigma[B[i]];
  }
Executor:
  for (i=0..N)
    A'[B'[i]] = ...
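A toy inspector/executor in Python (the first-touch ordering heuristic and all names are mine, purely illustrative): the inspector builds a permutation sigma from the observed index array, the data and indices are remapped, and the executor runs the original loop shape on the remapped versions.

```python
def inspector(B, n_data):
    # Build sigma so data elements are stored in first-touch order.
    sigma = [None] * n_data
    nxt = 0
    for b in B:
        if sigma[b] is None:
            sigma[b] = nxt
            nxt += 1
    for d in range(n_data):     # untouched elements go to the end
        if sigma[d] is None:
            sigma[d] = nxt
            nxt += 1
    return sigma

def reorder(A, B, sigma):
    # A'[sigma[i]] = A[i];  B'[i] = sigma[B[i]]  (as on the slide).
    Ap = [None] * len(A)
    for d, a in enumerate(A):
        Ap[sigma[d]] = a
    Bp = [sigma[b] for b in B]
    return Ap, Bp

def executor(Ap, Bp):
    # The original loop shape, now touching A' in a friendlier order.
    for i, b in enumerate(Bp):
        Ap[b] = i               # stands in for "A'[B'[i]] = ..."
    return Ap

A, B = [10, 11, 12, 13], [2, 0, 2, 3]
sigma = inspector(B, len(A))
Ap, Bp = reorder(A, B, sigma)
executor(Ap, Bp)
# Un-permuting A' reproduces the original loop's effect on A.
orig = A[:]
for i, b in enumerate(B):
    orig[b] = i
assert [Ap[sigma[d]] for d in range(len(A))] == orig
```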

48 Full Sparse Tiling
Tiling for sparse computations. Goal: temporal locality with inspector/executor. Approach (oversimplified):
  inspect to build the dependence graph
  partition and schedule the graph
  execute the partitions

49 Iterative Solvers
Iterative solver for Ax = b with, e.g., a banded matrix; each update reads a few earlier values:
  x4 <- (x0, x8)
  x5 <- (x1, x9)
  x6 <- (x2, x10)
  x7 <- (x3, x11)
  x8 <- (x4, x12)
  ...

50 Iterative Solvers
Refine the approximation of x each iteration. Which x's an update uses depends on the matrix, but the pattern is common across time steps.

51 Graph Partitioning
Find some seed partitioning. Criteria: (roughly) equal sizes, and fewer edges crossing partitions.

52 Graph Partitioning
Then grow the partitions. Some tiles become larger due to dependences, which also need to be satisfied.

53 A Lot in the Inspector
Partitioning has many different complications:
  load balancing
  parallelism across tiles
  overhead of the partitioning algorithm (multiple passes over the data are needed to amortize it)
Things get complicated when the structure changes over time (e.g., molecules moving around).

54 Sparse Polyhedral Framework
An extension of the polyhedral model [Strout et al.]: add uninterpreted function symbols. Goal: express run-time reordering transformations and generate inspector/executor code. For example, the access x[col[j]] in
  for (i=0..N)
    for (j=start[i]..start[i+1]-1)
      y[i] += val[j] * x[col[j]];
is described by the relation
  [N] -> {[i,j] -> x[v] : v = col(j)}

