CR18: Advanced Compilers L06: Code Generation Tomofumi Yuki 1.


1 CR18: Advanced Compilers L06: Code Generation Tomofumi Yuki

2 Code Generation Completing the transformation loop. Problem: how to generate code to scan a polyhedron? a union of polyhedra? how to generate tiled code? how to generate parametrically tiled code?

3 Evolution of Code Gen Ancourt & Irigoin 1991: single polyhedron scanning. LooPo (Griebl & Lengauer 1996): 1st step to unions of polyhedra; scan bounding box + guards. Omega Code Gen 1995: generate inefficient code (convex hull + guards), then try to remove inefficiencies.

4 Evolution of Code Gen LoopGen (Quilleré-Rajopadhye-Wilde 2000): efficiently scanning unions of polyhedra. CLooG (Bastoul 2004): improvements to the QRW algorithm; robust and well-maintained implementation. AST Generation (Grosser 2015): polyhedral AST generation is more than scanning polyhedra; scanning alone is not enough!

5 Scanning a Polyhedron Scanning Polyhedra with DO Loops [1991]. Problem: generate bounds on loops; the outermost loop may use only constants and params, inner loops may also use surrounding iterators. Approach: Fourier-Motzkin elimination, projecting out variables.

6 Single Polyhedron Example What is the loop nest for the lexicographic scan of { [i,j] : i≤N, j≥0, i-j≥0 }?

for i = 0..N
  for j = 0..i
    S;

7 Single Polyhedron Example What is the loop nest for the permuted case, with j as the outer loop?

for j = 0..N
  for i = j..N
    S;

8 Scanning Unions of Polyhedra Consider scanning two statements. Naïve approach: bounding box.

S1: [N]->{ S1[i]->[i] : 0≤i<N }
S2: [N]->{ S2[i]->[i+5] : 0≤i<N }

After the change of basis:

S1: [N]->{ [i] : 0≤i≤N }
S2: [N]->{ [i] : 5≤i≤N+5 }

for (i=0..N+5)
  if (0<=i && i<=N) S1;
  if (5<=i && i<=N+5) S2;

9 Slightly Better than BBox Make the domains disjoint. But this is also problematic: code size can quickly grow.

S1: [N]->{ S1[i]->[i] : 0≤i<N }
S2: [N]->{ S2[i]->[i+5] : 0≤i<N }

for (i=0..4)     S1;
for (i=5..N)     { S1; S2; }
for (i=N+1..N+5) S2;

10 QRW Algorithm Key: recursive splitting. Given a set of n-D domains to scan, start at d=1 and context=parameters.
1. Restrict the domains to the context
2. Project the domains onto the outer d dimensions
3. Make the projections disjoint
4. Recurse for each disjoint projection, with d=d+1 and context=a piece of the projection
5. Sort the resulting loops

11 Example Scan the following domains (S1, S2 in the i-j plane). d=1, context=universe. Step 1: Projection. Step 2: Separation. Step 3: Recurse. Step 4: Sort.


13 Example Scan the following domains. d=1, context=universe. After separation, the outer dimension splits:

for (i=0..1)...
for (i=2..6)...

14 Example Scan the following domains. d=2, context=0≤i≤2. Recursing into the first piece:

for (i=0..1)
  for (j=0..4)
    S1;

15 Example Scan the following domains. d=2, context=2≤i≤6. Recursing into the second piece:

for (i=2..6)...

16 Example Scan the following domains. d=2, context=2≤i≤6. The recursion produces inner loops L2 L1 L4 L3 (not yet ordered):

for (i=2..6)...

17 Example Scan the following domains. d=2, context=2≤i≤6. Step 4 sorts the inner loops (from L2 L1 L4 L3) into execution order:

for (i=2..6)
  L2 L1 L3 L4

18 CLooG: Chunky Loop Generator A few problems in the QRW algorithm: high complexity; code size is not controlled. CLooG uses pattern matching to avoid costly polyhedral operations during separation, and may stop the recursion at some depth and generate loops with guards to reduce code size.

19 Tiled Code Generation Tiling with fixed size: we did this earlier. Tiling with parametric size: problem: non-affine!

20 Tiling Review What does the tiled code look like, with tile size ts?

for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S;

Tile loops stepping by ts:

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S

Or with normalized tile loops:

for (ti=0; ti<=floor(N/ts); ti++)
  for (tj=0; tj<=floor(N/ts); tj++)
    for (i=ti*ts; i<min(N+1,(ti+1)*ts); i++)
      for (j=tj*ts; j<min(N+1,(tj+1)*ts); j++)
        S

21 Two Approaches Use fixed-size tiling: if the tile size is a constant, the code stays affine; a pragmatic choice by many tools. Use non-polyhedral code generation: much better for tuning tile sizes; makes sense for semi-automatic tools.

22 Difficulties in Tiled Code Gen This is still a very simplified view. In practice, we tile after transformation (skewing, etc.). Let's see the tiled iteration space with tvis.

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S

23 Full Tiles, Inset / Outset Partial tiles have a lot of control overhead. Challenges for parametric tiled code gen: make sure to scan the outset, but also separate the inset; use efficient point loops for the inset. All without polyhedral analysis.

24 Point Loops for Full/Partial Tile Full-tile point loop:

for (i=ti; i<ti+si; i++)
  for (j=tj; j<tj+sj; j++)
    for (k=tk; k<tk+sk; k++)
      ...

Partial/empty-tile point loop:

for (i=max(ti,...); i<min(ti+si,...); i++)
  for (j=max(tj,...); j<min(tj+sj,...); j++)
    for (k=max(tk,...); k<min(tk+sk,...); k++)
      if (...)
        ...

25 Progression of Parametric Tiling Perfectly nested, single loop: TLoG [Renganarayana et al. 2007]. Multiple levels of tiling: HiTLoG [Renganarayana et al. 2007], PrimeTile [Hartono 2009]. Parallelizing the tiles: DynTile [Hartono 2010], D-Tiling [Kim 2011].

26 Computing the Outset We start with some domain and expand in each dimension by (symbolic) tile size - 1, except for upper bounds.

{[i,j]: 0≤i≤10 and i≤j≤i+10}

becomes

{[i,j]: -(ts-1)≤i≤10 and -(ts-1)+i≤j and -(ts-1)+j≤i+10}

27 Computing the Inset We start with some domain and shrink in each dimension by (symbolic) tile size - 1, except for lower bounds.

{[i,j]: 0≤i≤10 and i≤j≤i+10}

becomes

{[i,j]: 0≤i≤10-(ts-1) and i≤j-(ts-1) and j≤i+10-(ts-1)}

28 Syntactic Manipulation We cannot use polyhedral code generators, so back to modifying the AST. Modify the loop bounds to get loops that visit the outset; insert guards that switch between point loops. Up to here is HiTLoG/PrimeTile.

29 Problem: Parallelization After tiling, there is parallelism. However, it requires skewing of tiles, and we need non-polyhedral skewing. The key equation: where d: number of tiled dimensions, ti: tile origins, ts: tile sizes.

30 D-Tiling The equation enables skewing of tiles: if any one of the time step or the tile origins is unknown, it can be computed from the others. Generated code (tix is the (d-1)-th tile origin):

for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
      for (tix=tixLB:tixUB) {
        tid = f(time, ti1, …, tix); //compute tile ti1,ti2,…,tix,tid
      }

31 Distributed Memory Parallelization Problems implicitly handled by shared memory now need explicit treatment. Communication: which processors need to send/receive? which data to send/receive? how to manage communication buffers? Data partitioning: how do you allocate memory across nodes?

32 MPI Code Generator Distributed memory parallelization: tiling based, parameterized tile sizes, C+MPI implementation. Uniform dependences as the key enabler: many affine dependences can be uniformized. Shared memory performance carries over to distributed memory: scales as well as PLuTo, but to multiple nodes.

33 Related Work (Polyhedral) Polyhedral approaches: initial idea [Amarasinghe1993], analysis for fixed sized tiling [Claßen2006], further optimization [Bondhugula2011]. "Brute force" polyhedral analysis for handling communication: no hope of handling parametric tile sizes, but can handle arbitrarily affine programs.

34 Outline Introduction “Uniform-ness” of Affine Programs Uniformization Uniform-ness of PolyBench MPI Code Generation Tiling Uniform-ness simplifies everything Comparison against PLuTo with PolyBench Conclusions and Future Work

35 Affine vs Uniform Affine dependences: f = Ax+b. Examples: (i,j->j,i), (i,j->i,i), (i->0). Uniform dependences: f = Ix+b. Examples: (i,j->i-1,j), (i->i-1).

36 Uniformization The affine dependence (i->0) becomes the uniform dependence (i->i-1).

37 Uniformization Uniformization is a classic technique, "solved" in the 1980s but "forgotten" in the multi-core era. Any affine dependence can be uniformized by adding a dimension [Roychowdhury1988]. Nullspace pipelining: a simple technique for uniformization; many dependences are uniformized by it.

38 Uniformization and Tiling Uniformization does not influence tilability.

39 PolyBench [Pouchet2010] Collection of 30 polyhedral kernels, proposed by Pouchet as a benchmark for polyhedral compilation. Goal: a benchmark small enough that individual results are reported; no averages. Kernels from: data mining; linear algebra kernels, solvers; dynamic programming; stencil computations.

40 Uniform-ness of PolyBench 5 of them are "incorrect" and are excluded. Embedding: match dimensions of statements. Phase detection: separate the program into phases; the output of a phase is used as input to the next.

Stage                  | Fully uniform programs
Uniform at start       | 8/25 (32%)
After embedding        | 13/25 (52%)
After pipelining       | 21/25 (84%)
After phase detection  | 24/25 (96%)

41 Outline Introduction Uniform-ness of Affine Programs Uniformization Uniform-ness of PolyBench MPI Code Generation Tiling Uniform-ness simplifies everything Comparison against PLuTo with PolyBench Conclusions and Future Work

42 Basic Strategy: Tiling We focus on tilable programs.

43 Dependences in Tilable Space All in the non-positive direction.

44 Wave-front Parallelization All tiles with the same color can run in parallel.

45 Assumptions Uniform in at least one of the dimensions; the uniform dimension is made outermost. Tilable space is fully permutable. One-dimensional processor allocation. Large enough tile sizes: dependences do not span multiple tiles. Then, communication is extremely simplified.

46 Processor Allocation The outermost tile loop is distributed. (figure: columns of tiles along i1 assigned to processors P0, P1, P2, P3)

47 Values to be Communicated Faces of the tiles (may be thicker than 1). (figure: tile faces exchanged between P0..P3 in the i1-i2 plane)

48 Naïve Placement of Send and Receive Codes The receiver is the consumer tile of the values. (figure: send (S) and receive (R) placements on the tile grid)

49 Problems in Naïve Placement The receiver is in the next wave-front time. (figure: wave-fronts t=0..3, with sends and receives crossing fronts)

50 Problems in Naïve Placement The receiver is in the next wave-front time, so the number of communications "in flight" equals the amount of parallelism. MPI_Send will deadlock: it may not return control if the system buffer is full. Asynchronous communication is required, and you must manage your own buffers; the required number of buffers equals the amount of parallelism, i.e., the number of virtual processors.

51 Proposed Placement of Send and Receive Codes The receiver is one tile below the consumer. (figure: receive placements shifted one tile down on the tile grid)

52 Placement within a Tile Naïve placement: receive -> compute -> send. Proposed placement: issue asynchronous receive (MPI_Irecv), compute, issue asynchronous send (MPI_Isend), then wait for values to arrive. This overlaps computation and communication, with only two buffers (receive and send) per physical processor.

53 Evaluation Compare performance with PLuTo (shared memory version with the same strategy). Cray: 24 cores per node, up to 96 cores. Goal: similar scaling as PLuTo. Tile sizes are searched with educated guesses. PolyBench: 7 kernels are too small; 3 cannot be tiled or have limited parallelism; 9 cannot be used due to PLuTo/PolyBench issues.

54 Performance Results Linear extrapolation from the speed-up at 24 cores. Broadcast cost: at most 2.5 seconds.

