
2 Advanced Analysis in SUIF2 and Future Work
Monica Lam, Stanford University
http://suif.stanford.edu/

3 Interprocedural, High-Level Transforms for Locality and Parallelism
- Program transformation for computational kernels
  - a new technique based on affine partitioning
- Interprocedural analysis framework: to maximize code reuse
  - flow sensitivity; context sensitivity
- Interprocedural program analysis
  - pointer alias analysis (Steensgaard's algorithm)
  - scalar/scalar dependence, privatization, reduction recognition
- Parallel code generation
  - define new IR nodes for parallel code

4 Loop Transforms: Cholesky Factorization Example

      DO 1 J = 0, N
         I0 = MAX ( -M, -J )
         DO 2 I = I0, -1
            DO 3 JJ = I0 - I, -1
               DO 3 L = 0, NMAT
    3             A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
            DO 2 L = 0, NMAT
    2          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
         DO 4 L = 0, NMAT
    4       EPSS(L) = EPS * A(L,0,J)
         DO 5 JJ = I0, -1
            DO 5 L = 0, NMAT
    5          A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
         DO 1 L = 0, NMAT
    1       A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
         DO 7 K = 0, N
            DO 8 L = 0, NMAT
    8          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 7 JJ = 1, MIN (M, N-K)
               DO 7 L = 0, NMAT
    7             B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
         DO 6 K = N, 0, -1
            DO 9 L = 0, NMAT
    9          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 6 JJ = 1, MIN (M, K)
               DO 6 L = 0, NMAT
    6             B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

5 Results for Optimizing Perfect Nests
[Chart: speedup on a Digital TurboLaser with 8 300 MHz 21164 processors]

6 Optimizing Arbitrary Loop Nesting Using Affine Partitions
[Same Cholesky code as slide 4, annotated: the diagram overlays the affine partitions, labeling the A, B, and EPSS arrays with the L dimension along which each is partitioned]

7 Results with Affine Partitioning + Blocking

8 New Transform Theory
- Domain: arbitrary loop nesting, each instruction optimized separately
- Unifies: permutation, skewing, reversal, fusion, fission, statement reordering
- Supports blocking across all loop nests
- Optimal: maximum degree of parallelism with minimum degree of synchronization
- Minimizes communication by aligning the computation and pipelining
- More powerful & simpler software engineering
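Each of the classical transforms above is a special case of an affine mapping on iteration vectors; a few standard instances (notation mine, not from the slides):

```latex
\text{permutation: } (i, j) \mapsto (j, i) \qquad
\text{reversal: } i \mapsto -i \qquad
\text{skewing: } (i, j) \mapsto (i,\; j + c\,i)
```

Fusion, fission, and statement reordering come for free because each statement receives its own mapping, rather than one mapping per loop nest.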

9 A Simple Example

    FOR i = 1 TO n DO
       FOR j = 1 TO n DO
          A[i,j] = A[i,j] + B[i-1,j];    (S1)
          B[i,j] = A[i,j-1] * B[i,j];    (S2)

[Diagram: the (i, j) iteration space showing the dependences between S1 and S2]

10 Best Parallelization Scheme
SPMD code: let p be the processor's ID number.

    if (1-n <= p <= n) then
       if (1 <= p) then
          B[p,1] = A[p,0] * B[p,1];                        (S2)
       for i1 = max(1,1+p) to min(n,n-1+p) do
          A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];          (S1)
          B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];        (S2)
       if (p <= 0) then
          A[n+p,n] = A[n+p,n] + B[n+p-1,n];                (S1)

The solution can be expressed as affine partitions:
    S1: execute iteration (i, j) on processor i-j.
    S2: execute iteration (i, j) on processor i-j+1.
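A small NumPy check (mine, not from the slides) that replays both versions and confirms they agree. Processor order is irrelevant in the simulation precisely because the partition is communication-free:

```python
import numpy as np

def sequential(A, B, n):
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i, j] += B[i - 1, j]               # S1
            B[i, j] = A[i, j - 1] * B[i, j]      # S2

def spmd(A, B, n):
    # Each processor p runs independently: no processor reads a value
    # another processor writes, so any execution order is legal.
    for p in range(1 - n, n + 1):
        if p >= 1:
            B[p, 1] = A[p, 0] * B[p, 1]                            # S2
        for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
            A[i1, i1 - p] += B[i1 - 1, i1 - p]                     # S1
            B[i1, i1 - p + 1] = A[i1, i1 - p] * B[i1, i1 - p + 1]  # S2
        if p <= 0:
            A[n + p, n] += B[n + p - 1, n]                         # S1

n = 6
rng = np.random.default_rng(0)
A0, B0 = rng.random((n + 1, n + 1)), rng.random((n + 1, n + 1))
A1, B1 = A0.copy(), B0.copy()
A2, B2 = A0.copy(), B0.copy()
sequential(A1, B1, n)
spmd(A2, B2, n)
assert np.allclose(A1, A2) and np.allclose(B1, B2)
print("SPMD schedule matches sequential execution")
```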

11 Maximum Parallelism & No Communication
Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

    ∀ i_j, i_k :  B_j i_j ≥ 0  ∧  B_k i_k ≥ 0  ∧  F_xj(i_j) = F_xk(i_k)
                  ⇒  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.

[Diagram: two iterations i_1, i_2 whose accesses F_1(i_1), F_2(i_2) touch the same array element are mapped by C_1(i_1), C_2(i_2) to the same processor ID]
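To make the formulation concrete, here is the system for the slide-9 example worked by hand (the derivation is mine, not from the slides). Write C_1(i,j) = ai + bj + c and C_2(i,j) = di + ej + f:

```latex
% Array A: S_2 reads A[i,j-1], which S_1 writes, so sharing means
% (i_1, j_1) = (i_2, j_2 - 1); substitute i_2 = i_1, j_2 = j_1 + 1:
a i_1 + b j_1 + c = d i_1 + e (j_1 + 1) + f
  \;\Rightarrow\; a = d,\quad b = e,\quad c = e + f
% Array B: S_1 reads B[i-1,j], which S_2 writes, so
% (i_1 - 1, j_1) = (i_2, j_2):
a i_1 + b j_1 + c = d (i_1 - 1) + e j_1 + f
  \;\Rightarrow\; a = d,\quad b = e,\quad c = f - d
```

Together these force b = -a; choosing a = 1 and f = 1 gives C_1(i,j) = i - j and C_2(i,j) = i - j + 1, exactly the partition on slide 10.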

12 Algorithm
Start from the partition constraints:

    ∀ i_j, i_k :  B_j i_j ≥ 0  ∧  B_k i_k ≥ 0  ∧  F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations
  - use the affine form of the Farkas lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers λ
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers and obtain a system of linear equations A C = 0
- Find solutions using linear algebra techniques
  - the null space of the matrix A is a solution for C with maximum rank (see the sketch below)
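The final step is easy to show in isolation. A minimal sympy sketch; the matrix A below is illustrative only, not derived from a real program:

```python
from sympy import Matrix

# Once Farkas + Fourier-Motzkin have reduced the constraints to A C = 0,
# a maximum-rank solution packs a basis of A's null space into C.
A = Matrix([[1, -1, 0,  0],
            [0,  0, 1, -1]])

basis = A.nullspace()               # exact rational arithmetic
C = Matrix.hstack(*basis).T         # one basis vector per row of C
print(C.rank(), C)                  # rank(C) = dim null(A) = 2
```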

13 Pipelining: Alternating Direction Integration Example

Requires transposing data:
    DO J = 1 to N    (parallel)
       DO I = 1 to N
          A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N
       DO I = 1 to N    (parallel)
          A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:
    DO J = 1 to N    (parallel)
       DO I = 1 to N
          A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N    (pipelined)
       DO I = 1 to N
          A(I,J) = g(A(I,J), A(I,J-1))

14 Finding the Maximum Degree of Pipelining
Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

    ∀ i_j, i_k :  B_j i_j ≥ 0  ∧  B_k i_k ≥ 0  ∧  i_j ≺ i_k  ∧  F_xj(i_j) = F_xk(i_k)
                  ⇒  T_j(i_j) ⪯ T_k(i_k)   (lexicographically)

with the objective of maximizing the rank of T_j.

[Diagram: two iterations i_1, i_2 whose accesses F_1(i_1), F_2(i_2) touch the same array element are mapped by T_1(i_1), T_2(i_2) to ordered time stages]

15 Key Insight
- Choice in the time mapping => (pipelined) parallelism
- Degrees of parallelism = rank(T) - 1

16 Putting It All Together
- Find maximum outer-loop parallelism with minimum synchronization (see the sketch below)
  - divide into strongly connected components
  - apply the processor mapping algorithm (no communication) to the program
  - if no parallelism is found:
    - apply the time mapping algorithm to find pipelining
    - if no pipelining is found (the outer loop stays sequential), repeat the process on the inner loops
- Minimize communication
  - use a greedy method to order communicating pairs
  - try to find communication-free, or neighborhood-only, communication by solving similar equations
- Aggregate computations over consecutive data to improve spatial locality
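A structural sketch of this slide's strategy. Every helper here is a hypothetical placeholder (not SUIF2 API); the two rank tests stand in for the mapping algorithms of slides 11 and 14:

```python
def parallelize(nest, proc_rank, time_rank):
    for scc in nest["sccs"]:
        if proc_rank(scc) > 0:                 # communication-free parallelism
            scc["schedule"] = "parallel"
        elif time_rank(scc) > 1:               # degrees of pipelining = rank(T) - 1
            scc["schedule"] = "pipelined"
        else:                                  # outer loop stays sequential:
            scc["schedule"] = "sequential"
            for inner in scc.get("inner", []): # repeat the process inside
                parallelize(inner, proc_rank, time_rank)

# toy use: one SCC that only admits pipelining
nest = {"sccs": [{"inner": []}]}
parallelize(nest, proc_rank=lambda s: 0, time_rank=lambda s: 2)
print(nest["sccs"][0]["schedule"])   # -> pipelined
```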

17 Current Status
- Completed:
  - mathematics package
    - integrated Omega: Pugh's Presburger arithmetic
    - linear algebra package: Farkas lemma, Gaussian elimination, finding null spaces
  - can find communication-free partitions
- In progress:
  - rest of affine partitioning
  - code generation

18 Interprocedural Analysis
Two major design choices in program analysis:
- Across procedures
  - no interprocedural analysis
  - interprocedural: context-insensitive
  - interprocedural: context-sensitive
- Within a procedure
  - flow-insensitive
  - flow-sensitive: interval/region based
  - flow-sensitive: iterative over the flow graph
  - flow-sensitive: SSA based

19 Efficient Context-Sensitive Analysis
- Bottom-up
  - a region/interval: a procedure or a loop
  - an edge: a call, or the code in an inner scope
  - summarize each region (with a transfer function)
  - find strongly connected components (SCCs)
  - traverse the SCCs bottom-up
  - iterate to a fixed point for recursive functions
- Top-down
  - propagate values top-down
  - iterate to a fixed point for recursive functions
[Diagram: a call graph with call and inner-loop edges; a recursive cycle forms an SCC]
A runnable sketch of the bottom-up phase follows.
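To make the bottom-up scheme concrete, here is a self-contained sketch for one specific client analysis: MOD summaries (which globals each procedure may write, directly or through calls). The toy call graph and the choice of analysis are mine; the slides specify only the traversal strategy:

```python
# procedure -> (locally written globals, callees); f and g are mutually
# recursive, so they form one SCC and need fixed-point iteration.
procs = {
    "main": ({"a"}, {"f"}),
    "f":    ({"b"}, {"g"}),
    "g":    ({"c"}, {"f", "h"}),
    "h":    ({"d"}, set()),
}

def sccs(graph):
    """Tarjan's algorithm; emits SCCs callee-first (reverse topological)."""
    index, low, stack, on_stack, out = {}, {}, [], set(), []
    counter = [0]
    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph[v][1]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v roots an SCC: pop it off
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            out.append(comp)
    for v in graph:
        if v not in index:
            visit(v)
    return out

mod = {p: set() for p in procs}
for comp in sccs(procs):                # bottom-up over the SCC DAG
    changed = True
    while changed:                      # fixed point within each SCC
        changed = False
        for p in comp:
            new = set(procs[p][0]).union(*[mod[c] for c in procs[p][1]])
            if new != mod[p]:
                mod[p], changed = new, True

print(mod)  # main: {a,b,c,d}, f: {b,c,d}, g: {b,c,d}, h: {d}
```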

20 Interprocedural Framework Architecture
[Diagram of the framework's layers:
 - primitive handlers (e.g. array summaries, pointer aliases)
 - compound handlers (procedure calls and returns; regions & statements; basic blocks)
 - driver (bottom-up, top-down, and linear traversals)
 - data structures (call graph, SCCs, regions, control flow graphs)]

21 Interprocedural Framework Architecture
- Interprocedural analysis data structures
  - e.g. call graphs, SSA form, regions or intervals
- Handlers: orthogonal sets of handlers for different groups of constructs
  - primitive: the user specifies the analysis-specific semantics of primitives
  - compound: handles compound statements and calls
    - the user chooses between handlers of different strengths, e.g. no interprocedural analysis versus context-sensitive, or flow-insensitive versus flow-sensitive (CFG-based)
  - all the handlers are registered in a visitor
- Driver
  - invoked by a user's request for information (demand driven)
  - builds prepass data structures
  - invokes the right set of handlers in the right order (e.g. a bottom-up traversal of the call graph)
A hypothetical sketch of this registration scheme follows.
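A hypothetical Python rendering of the handler/visitor/driver split (SUIF2 itself is C++; none of these names are its real API). An analysis supplies the primitive handlers; the framework supplies the traversal:

```python
class Visitor:
    def __init__(self):
        self.handlers = {}

    def register(self, node_kind, fn):
        self.handlers[node_kind] = fn        # one handler per construct

    def visit(self, node, state):
        return self.handlers[node["kind"]](node, state, self)

def handle_assign(node, state, v):           # primitive handler:
    state.add(node["target"])                # e.g. record modified vars
    return state

def handle_block(node, state, v):            # compound handler: linear
    for child in node["body"]:               # traversal of a region
        state = v.visit(child, state)
    return state

v = Visitor()
v.register("assign", handle_assign)
v.register("block", handle_block)

prog = {"kind": "block", "body": [
    {"kind": "assign", "target": "x"},
    {"kind": "assign", "target": "y"},
]}
print(v.visit(prog, set()))   # -> {'x', 'y'}
```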

22 Pointer Alias Analysis
- Steensgaard's pointer alias analysis (completed; sketched below)
  - flow-insensitive and context-insensitive, type-inference based analysis
  - very efficient: near linear-time analysis
  - very inaccurate
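A minimal sketch (mine) of the core of Steensgaard's analysis: each abstract location has at most one points-to target, and assignments unify targets with union-find, which is what makes it near linear time and imprecise:

```python
class Node:
    def __init__(self, name):
        self.name, self.parent, self.pointee = name, self, None

def find(n):
    while n.parent is not n:
        n.parent = n.parent.parent            # path halving
        n = n.parent
    return n

def union(a, b):
    ra, rb = find(a), find(b)
    if ra is rb:
        return ra
    rb.parent = ra
    if ra.pointee and rb.pointee:
        union(ra.pointee, rb.pointee)         # unify targets recursively
    ra.pointee = ra.pointee or rb.pointee
    return ra

def pointee(n):
    r = find(n)
    if r.pointee is None:
        r.pointee = Node(f"*{r.name}")        # fresh abstract location
    return find(r.pointee)

v = {x: Node(x) for x in "pqab"}
union(pointee(v["p"]), v["a"])    # p = &a
union(pointee(v["q"]), v["b"])    # q = &b
union(v["p"], v["q"])             # p = q  -> merges {a} and {b}
print(find(pointee(v["p"])) is find(v["a"]) is find(v["b"]))  # True
```

The final line shows the imprecision: after `p = q`, the analysis concludes p may point to b and q may point to a, even though neither ever does.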

23 Parallelization Analysis
- Scalar analysis
  - mod/ref, reduction recognition: bottom-up, flow-insensitive
  - liveness for privatization: bottom-up and top-down, flow-sensitive
- Region-based array analysis
  - may-write, must-write, read, upwards-exposed read: bottom-up
  - array liveness for privatization: bottom-up and top-down
  - uses our interprocedural framework + Omega
- Symbolic analysis
  - find linear relationships between scalar variables to improve array analysis

24 Parallel Code Generation
- Loop bound generation
  - uses Omega, based on the affine mappings
- Outlining and cloning primitives
- Special IR nodes to represent parallelization primitives (a hypothetical sketch follows)
  - allow a succinct and high-level description of parallelization decisions
  - for communication to and from users
  - reduction and private variables and primitives
  - synchronization and parallelization primitives
[Diagram: SUIF + parallel IR lowered to plain SUIF]
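A hypothetical sketch of what such a parallel IR node might carry (invented for illustration; not SUIF2's actual node definitions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParallelLoop:
    index: str                       # loop variable (processor dimension)
    lower: str                       # bounds as expressions, e.g. from
    upper: str                       # Omega-generated affine mappings
    private: List[str] = field(default_factory=list)    # privatized vars
    reductions: List[str] = field(default_factory=list) # reduction vars
    body: List[object] = field(default_factory=list)    # nested statements

# the i-j partition of slide 10, as one annotated node
loop = ParallelLoop(index="p", lower="1-n", upper="n",
                    private=["i1"], body=["S1", "S2"])
print(loop)
```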

25 Status
- Completed
  - call graphs, SCCs
  - Steensgaard's pointer alias analysis
  - integration of a garbage collector with SUIF
- In progress
  - interprocedural analysis framework
  - array summaries
  - scalar dependence analysis
  - parallel code generation
- To be done
  - scalar symbolic analysis

26 Future Work: Basic Compiler Research
A flexible and integrated platform for new optimizations:
- combinations of pointer, OO, and parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications
- interaction of garbage collection and exception handling with back-end optimizations
- embedded compilers with application-specific additions at the source-language and architectural levels

27 As a Useful Compiler for High-Performance Computers
- Basic ingredients of a state-of-the-art parallelizing compiler
- Requires experimentation, tuning, refinement
  - first implementation of affine partitioning
  - interprocedural parallelization requires many analyses working together
- Missing functions
  - automatic data distribution
  - user interaction needed for parallelizing large code regions
    - SUIF Explorer: a prototype interactive parallelizer in SUIF1
    - requires tools: algorithms to guide performance tuning, program slices, visualization tools
- New techniques
  - extend affine mapping to sparse codes (with permutation index arrays)
- Fortran 90 front end
- Debugging support

28 New-Generation Productivity Tool
- Apply high-level program analysis to increase programmers' productivity
- Many existing analyses
  - high-level, interprocedural side-effect analysis with pointers and arrays
- New analyses
  - flow- and context-sensitive pointer alias analysis
  - interprocedural control-path-based analysis
- Examples of tools
  - find bugs in programs
  - prove or disprove user invariants
  - generate test cases
  - interactive demand-driven analysis to aid in program debugging
  - can also be applied to Verilog/VHDL to improve hardware verification

29 Finally...
The system has to be actively maintained and supported to be useful.

30 The End

