
2 Advanced Analysis in SUIF2 and Future Work
Monica Lam, Stanford University
http://suif.stanford.edu/

3 Interprocedural, High-Level Transforms for Locality and Parallelism
- Program transformation for computational kernels
  - a new technique based on affine partitioning
- Interprocedural analysis framework: to maximize code reuse
  - flow sensitivity; context sensitivity
- Interprocedural program analysis
  - pointer alias analysis (Steensgaard's algorithm)
  - scalar/scalar dependence, privatization, reduction recognition
- Parallel code generation
  - define new IR nodes for parallel code

4 Loop Transforms: Cholesky Factorization Example

      DO 1 J = 0, N
         I0 = MAX ( -M, -J )
         DO 2 I = I0, -1
            DO 3 JJ = I0 - I, -1
               DO 3 L = 0, NMAT
    3             A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
            DO 2 L = 0, NMAT
    2          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
         DO 4 L = 0, NMAT
    4       EPSS(L) = EPS * A(L,0,J)
         DO 5 JJ = I0, -1
            DO 5 L = 0, NMAT
    5          A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
         DO 1 L = 0, NMAT
    1       A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
         DO 7 K = 0, N
            DO 8 L = 0, NMAT
    8          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 7 JJ = 1, MIN (M, N-K)
               DO 7 L = 0, NMAT
    7             B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
         DO 6 K = N, 0, -1
            DO 9 L = 0, NMAT
    9          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 6 JJ = 1, MIN (M, K)
               DO 6 L = 0, NMAT
    6             B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

5 Results for Optimizing Perfect Nests
[Chart: speedup on a Digital TurboLaser with 8 300 MHz 21164 processors]

6 Optimizing Arbitrary Loop Nesting Using Affine Partitions
[Same Cholesky code as slide 4, annotated: the diagram overlays the affine partitions, labeling the A, B, and EPSS arrays with the L dimension along which each is partitioned]

7 Results with Affine Partitioning + Blocking

8 New Transform Theory
- Domain: arbitrary loop nesting, each instruction optimized separately
- Unifies: permutation, skewing, reversal, fusion, fission, statement reordering
- Supports blocking across all loop nests
- Optimal: maximum degree of parallelism with minimum degree of synchronization
- Minimizes communication by aligning the computation and pipelining
- More powerful & simpler software engineering
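Each of the classical transforms above is a special case of an affine mapping on iteration vectors; a few standard instances (notation mine, not from the slides):

```latex
\text{permutation: } (i, j) \mapsto (j, i) \qquad
\text{reversal: } i \mapsto -i \qquad
\text{skewing: } (i, j) \mapsto (i,\; j + c\,i)
```

Fusion, fission, and statement reordering come for free because each statement receives its own mapping, rather than one mapping per loop nest.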

9 A Simple Example

    FOR i = 1 TO n DO
       FOR j = 1 TO n DO
          A[i,j] = A[i,j] + B[i-1,j];    (S1)
          B[i,j] = A[i,j-1] * B[i,j];    (S2)

[Diagram: the (i, j) iteration space showing the dependences between S1 and S2]

10 Best Parallelization Scheme
SPMD code: let p be the processor's ID number.

    if (1-n <= p <= n) then
       if (1 <= p) then
          B[p,1] = A[p,0] * B[p,1];                        (S2)
       for i1 = max(1,1+p) to min(n,n-1+p) do
          A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];          (S1)
          B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];        (S2)
       if (p <= 0) then
          A[n+p,n] = A[n+p,n] + B[n+p-1,n];                (S1)

The solution can be expressed as affine partitions:
    S1: execute iteration (i, j) on processor i-j.
    S2: execute iteration (i, j) on processor i-j+1.
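A small NumPy check (mine, not from the slides) that replays both versions and confirms they agree. Processor order is irrelevant in the simulation precisely because the partition is communication-free:

```python
import numpy as np

def sequential(A, B, n):
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i, j] += B[i - 1, j]               # S1
            B[i, j] = A[i, j - 1] * B[i, j]      # S2

def spmd(A, B, n):
    # Each processor p runs independently: no processor reads a value
    # another processor writes, so any execution order is legal.
    for p in range(1 - n, n + 1):
        if p >= 1:
            B[p, 1] = A[p, 0] * B[p, 1]                            # S2
        for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
            A[i1, i1 - p] += B[i1 - 1, i1 - p]                     # S1
            B[i1, i1 - p + 1] = A[i1, i1 - p] * B[i1, i1 - p + 1]  # S2
        if p <= 0:
            A[n + p, n] += B[n + p - 1, n]                         # S1

n = 6
rng = np.random.default_rng(0)
A0, B0 = rng.random((n + 1, n + 1)), rng.random((n + 1, n + 1))
A1, B1 = A0.copy(), B0.copy()
A2, B2 = A0.copy(), B0.copy()
sequential(A1, B1, n)
spmd(A2, B2, n)
assert np.allclose(A1, A2) and np.allclose(B1, B2)
print("SPMD schedule matches sequential execution")
```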

11 Maximum Parallelism & No Communication
Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

    ∀ i_j, i_k :  B_j i_j ≥ 0  ∧  B_k i_k ≥ 0  ∧  F_xj(i_j) = F_xk(i_k)
                  ⇒  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.

[Diagram: two iterations i_1, i_2 whose accesses F_1(i_1), F_2(i_2) touch the same array element are mapped by C_1(i_1), C_2(i_2) to the same processor ID]
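To make the formulation concrete, here is the system for the slide-9 example worked by hand (the derivation is mine, not from the slides). Write C_1(i,j) = ai + bj + c and C_2(i,j) = di + ej + f:

```latex
% Array A: S_2 reads A[i,j-1], which S_1 writes, so sharing means
% (i_1, j_1) = (i_2, j_2 - 1); substitute i_2 = i_1, j_2 = j_1 + 1:
a i_1 + b j_1 + c = d i_1 + e (j_1 + 1) + f
  \;\Rightarrow\; a = d,\quad b = e,\quad c = e + f
% Array B: S_1 reads B[i-1,j], which S_2 writes, so
% (i_1 - 1, j_1) = (i_2, j_2):
a i_1 + b j_1 + c = d (i_1 - 1) + e j_1 + f
  \;\Rightarrow\; a = d,\quad b = e,\quad c = f - d
```

Together these force b = -a; choosing a = 1 and f = 1 gives C_1(i,j) = i - j and C_2(i,j) = i - j + 1, exactly the partition on slide 10.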

12 Algorithm
Start from the partition constraints:

    ∀ i_j, i_k :  B_j i_j ≥ 0  ∧  B_k i_k ≥ 0  ∧  F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations
  - use the affine form of the Farkas lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers λ
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers and obtain a system of linear equations A C = 0
- Find solutions using linear algebra techniques
  - the null space of the matrix A is a solution for C with maximum rank (see the sketch below)
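The final step is easy to show in isolation. A minimal sympy sketch; the matrix A below is illustrative only, not derived from a real program:

```python
from sympy import Matrix

# Once Farkas + Fourier-Motzkin have reduced the constraints to A C = 0,
# a maximum-rank solution packs a basis of A's null space into C.
A = Matrix([[1, -1, 0,  0],
            [0,  0, 1, -1]])

basis = A.nullspace()               # exact rational arithmetic
C = Matrix.hstack(*basis).T         # one basis vector per row of C
print(C.rank(), C)                  # rank(C) = dim null(A) = 2
```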

13 Pipelining: Alternating Direction Integration Example

Requires transposing data:
    DO J = 1 to N    (parallel)
       DO I = 1 to N
          A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N
       DO I = 1 to N    (parallel)
          A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:
    DO J = 1 to N    (parallel)
       DO I = 1 to N
          A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N    (pipelined)
       DO I = 1 to N
          A(I,J) = g(A(I,J), A(I,J-1))

14 Finding the Maximum Degree of Pipelining
Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

    ∀ i_j, i_k :  B_j i_j ≥ 0  ∧  B_k i_k ≥ 0  ∧  i_j ≺ i_k  ∧  F_xj(i_j) = F_xk(i_k)
                  ⇒  T_j(i_j) ⪯ T_k(i_k)   (lexicographically)

with the objective of maximizing the rank of T_j.

[Diagram: two iterations i_1, i_2 whose accesses F_1(i_1), F_2(i_2) touch the same array element are mapped by T_1(i_1), T_2(i_2) to ordered time stages]

15 Key Insight
- Choice in the time mapping => (pipelined) parallelism
- Degrees of parallelism = rank(T) - 1

16 Putting It All Together
- Find maximum outer-loop parallelism with minimum synchronization (see the sketch below)
  - divide into strongly connected components
  - apply the processor mapping algorithm (no communication) to the program
  - if no parallelism is found:
    - apply the time mapping algorithm to find pipelining
    - if no pipelining is found (the outer loop stays sequential), repeat the process on the inner loops
- Minimize communication
  - use a greedy method to order communicating pairs
  - try to find communication-free, or neighborhood-only, communication by solving similar equations
- Aggregate computations over consecutive data to improve spatial locality
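A structural sketch of this slide's strategy. Every helper here is a hypothetical placeholder (not SUIF2 API); the two rank tests stand in for the mapping algorithms of slides 11 and 14:

```python
def parallelize(nest, proc_rank, time_rank):
    for scc in nest["sccs"]:
        if proc_rank(scc) > 0:                 # communication-free parallelism
            scc["schedule"] = "parallel"
        elif time_rank(scc) > 1:               # degrees of pipelining = rank(T) - 1
            scc["schedule"] = "pipelined"
        else:                                  # outer loop stays sequential:
            scc["schedule"] = "sequential"
            for inner in scc.get("inner", []): # repeat the process inside
                parallelize(inner, proc_rank, time_rank)

# toy use: one SCC that only admits pipelining
nest = {"sccs": [{"inner": []}]}
parallelize(nest, proc_rank=lambda s: 0, time_rank=lambda s: 2)
print(nest["sccs"][0]["schedule"])   # -> pipelined
```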

17 Current Status
- Completed:
  - mathematics package
    - integrated Omega: Pugh's Presburger arithmetic
    - linear algebra package: Farkas lemma, Gaussian elimination, finding null spaces
  - can find communication-free partitions
- In progress:
  - rest of affine partitioning
  - code generation

18 Interprocedural Analysis
Two major design choices in program analysis:
- Across procedures
  - no interprocedural analysis
  - interprocedural: context-insensitive
  - interprocedural: context-sensitive
- Within a procedure
  - flow-insensitive
  - flow-sensitive: interval/region based
  - flow-sensitive: iterative over the flow graph
  - flow-sensitive: SSA based

19 Efficient Context-Sensitive Analysis
- Bottom-up
  - a region/interval: a procedure or a loop
  - an edge: a call, or the code in an inner scope
  - summarize each region (with a transfer function)
  - find strongly connected components (SCCs)
  - traverse the SCCs bottom-up
  - iterate to a fixed point for recursive functions
- Top-down
  - propagate values top-down
  - iterate to a fixed point for recursive functions
[Diagram: a call graph with call and inner-loop edges; a recursive cycle forms an SCC]
A runnable sketch of the bottom-up phase follows.
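To make the bottom-up scheme concrete, here is a self-contained sketch for one specific client analysis: MOD summaries (which globals each procedure may write, directly or through calls). The toy call graph and the choice of analysis are mine; the slides specify only the traversal strategy:

```python
# procedure -> (locally written globals, callees); f and g are mutually
# recursive, so they form one SCC and need fixed-point iteration.
procs = {
    "main": ({"a"}, {"f"}),
    "f":    ({"b"}, {"g"}),
    "g":    ({"c"}, {"f", "h"}),
    "h":    ({"d"}, set()),
}

def sccs(graph):
    """Tarjan's algorithm; emits SCCs callee-first (reverse topological)."""
    index, low, stack, on_stack, out = {}, {}, [], set(), []
    counter = [0]
    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph[v][1]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v roots an SCC: pop it off
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            out.append(comp)
    for v in graph:
        if v not in index:
            visit(v)
    return out

mod = {p: set() for p in procs}
for comp in sccs(procs):                # bottom-up over the SCC DAG
    changed = True
    while changed:                      # fixed point within each SCC
        changed = False
        for p in comp:
            new = set(procs[p][0]).union(*[mod[c] for c in procs[p][1]])
            if new != mod[p]:
                mod[p], changed = new, True

print(mod)  # main: {a,b,c,d}, f: {b,c,d}, g: {b,c,d}, h: {d}
```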

20 Interprocedural Framework Architecture
[Diagram of the framework's layers:
 - primitive handlers (e.g. array summaries, pointer aliases)
 - compound handlers (procedure calls and returns; regions & statements; basic blocks)
 - driver (bottom-up, top-down, and linear traversals)
 - data structures (call graph, SCCs, regions, control flow graphs)]

21 Interprocedural Framework Architecture
- Interprocedural analysis data structures
  - e.g. call graphs, SSA form, regions or intervals
- Handlers: orthogonal sets of handlers for different groups of constructs
  - primitive: the user specifies the analysis-specific semantics of primitives
  - compound: handles compound statements and calls
    - the user chooses between handlers of different strengths, e.g. no interprocedural analysis versus context-sensitive, or flow-insensitive versus flow-sensitive (CFG-based)
  - all the handlers are registered in a visitor
- Driver
  - invoked by a user's request for information (demand driven)
  - builds prepass data structures
  - invokes the right set of handlers in the right order (e.g. a bottom-up traversal of the call graph)
A hypothetical sketch of this registration scheme follows.
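A hypothetical Python rendering of the handler/visitor/driver split (SUIF2 itself is C++; none of these names are its real API). An analysis supplies the primitive handlers; the framework supplies the traversal:

```python
class Visitor:
    def __init__(self):
        self.handlers = {}

    def register(self, node_kind, fn):
        self.handlers[node_kind] = fn        # one handler per construct

    def visit(self, node, state):
        return self.handlers[node["kind"]](node, state, self)

def handle_assign(node, state, v):           # primitive handler:
    state.add(node["target"])                # e.g. record modified vars
    return state

def handle_block(node, state, v):            # compound handler: linear
    for child in node["body"]:               # traversal of a region
        state = v.visit(child, state)
    return state

v = Visitor()
v.register("assign", handle_assign)
v.register("block", handle_block)

prog = {"kind": "block", "body": [
    {"kind": "assign", "target": "x"},
    {"kind": "assign", "target": "y"},
]}
print(v.visit(prog, set()))   # -> {'x', 'y'}
```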

22 Pointer Alias Analysis
- Steensgaard's pointer alias analysis (completed; sketched below)
  - flow-insensitive and context-insensitive, type-inference based analysis
  - very efficient: near linear-time analysis
  - very inaccurate
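A minimal sketch (mine) of the core of Steensgaard's analysis: each abstract location has at most one points-to target, and assignments unify targets with union-find, which is what makes it near linear time and imprecise:

```python
class Node:
    def __init__(self, name):
        self.name, self.parent, self.pointee = name, self, None

def find(n):
    while n.parent is not n:
        n.parent = n.parent.parent            # path halving
        n = n.parent
    return n

def union(a, b):
    ra, rb = find(a), find(b)
    if ra is rb:
        return ra
    rb.parent = ra
    if ra.pointee and rb.pointee:
        union(ra.pointee, rb.pointee)         # unify targets recursively
    ra.pointee = ra.pointee or rb.pointee
    return ra

def pointee(n):
    r = find(n)
    if r.pointee is None:
        r.pointee = Node(f"*{r.name}")        # fresh abstract location
    return find(r.pointee)

v = {x: Node(x) for x in "pqab"}
union(pointee(v["p"]), v["a"])    # p = &a
union(pointee(v["q"]), v["b"])    # q = &b
union(v["p"], v["q"])             # p = q  -> merges {a} and {b}
print(find(pointee(v["p"])) is find(v["a"]) is find(v["b"]))  # True
```

The final line shows the imprecision: after `p = q`, the analysis concludes p may point to b and q may point to a, even though neither ever does.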

23 Parallelization Analysis
- Scalar analysis
  - mod/ref, reduction recognition: bottom-up, flow-insensitive
  - liveness for privatization: bottom-up and top-down, flow-sensitive
- Region-based array analysis
  - may-write, must-write, read, upwards-exposed read: bottom-up
  - array liveness for privatization: bottom-up and top-down
  - uses our interprocedural framework + Omega
- Symbolic analysis
  - find linear relationships between scalar variables to improve array analysis

24 Parallel Code Generation
- Loop bound generation
  - uses Omega, based on the affine mappings
- Outlining and cloning primitives
- Special IR nodes to represent parallelization primitives (a hypothetical sketch follows)
  - allow a succinct and high-level description of parallelization decisions
  - for communication to and from users
  - reduction and private variables and primitives
  - synchronization and parallelization primitives
[Diagram: SUIF + parallel IR lowered to plain SUIF]
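A hypothetical sketch of what such a parallel IR node might carry (invented for illustration; not SUIF2's actual node definitions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParallelLoop:
    index: str                       # loop variable (processor dimension)
    lower: str                       # bounds as expressions, e.g. from
    upper: str                       # Omega-generated affine mappings
    private: List[str] = field(default_factory=list)    # privatized vars
    reductions: List[str] = field(default_factory=list) # reduction vars
    body: List[object] = field(default_factory=list)    # nested statements

# the i-j partition of slide 10, as one annotated node
loop = ParallelLoop(index="p", lower="1-n", upper="n",
                    private=["i1"], body=["S1", "S2"])
print(loop)
```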

25 Status
- Completed
  - call graphs, SCCs
  - Steensgaard's pointer alias analysis
  - integration of a garbage collector with SUIF
- In progress
  - interprocedural analysis framework
  - array summaries
  - scalar dependence analysis
  - parallel code generation
- To be done
  - scalar symbolic analysis

26 Future Work: Basic Compiler Research
A flexible and integrated platform for new optimizations:
- combinations of pointer, OO, and parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications
- interaction of garbage collection and exception handling with back-end optimizations
- embedded compilers with application-specific additions at the source-language and architectural levels

27 As a Useful Compiler for High-Performance Computers
- Basic ingredients of a state-of-the-art parallelizing compiler
- Requires experimentation, tuning, refinement
  - first implementation of affine partitioning
  - interprocedural parallelization requires many analyses working together
- Missing functions
  - automatic data distribution
  - user interaction needed for parallelizing large code regions
    - SUIF Explorer: a prototype interactive parallelizer in SUIF1
    - requires tools: algorithms to guide performance tuning, program slices, visualization tools
- New techniques
  - extend affine mapping to sparse codes (with permutation index arrays)
- Fortran 90 front end
- Debugging support

28 New-Generation Productivity Tool
- Apply high-level program analysis to increase programmers' productivity
- Many existing analyses
  - high-level, interprocedural side-effect analysis with pointers and arrays
- New analyses
  - flow- and context-sensitive pointer alias analysis
  - interprocedural control-path-based analysis
- Examples of tools
  - find bugs in programs
  - prove or disprove user invariants
  - generate test cases
  - interactive demand-driven analysis to aid in program debugging
  - can also be applied to Verilog/VHDL to improve hardware verification

29 Finally...
The system has to be actively maintained and supported to be useful.

30 The End

