Parametric Tiling of Affine Loop Nests* Sanket Tavarageri 1 Albert Hartono 1 Muthu Baskaran 1 Louis-Noel Pouchet 1 J. “Ram” Ramanujam 2 P. “Saday” Sadayappan.

Parametric Tiling of Affine Loop Nests*

Sanket Tavarageri 1, Albert Hartono 1, Muthu Baskaran 1, Louis-Noel Pouchet 1, J. “Ram” Ramanujam 2, P. “Saday” Sadayappan 1
1 Ohio State University
2 Louisiana State University
*Supported by US NSF

Loop Tiling

• A key loop transformation for:
  ◦ Efficient coarse-grained parallel execution
  ◦ Data locality optimization

Original loops:

    for (i=1; i<=7; i++)
      for (j=1; j<=6; j++)
        S(i,j);

Tiled loops:

    /* Inter-tile loops */
    for (it=1; it<=7; it+=Ti)
      for (jt=1; jt<=6; jt+=Tj)
        /* Intra-tile loops */
        for (i=it; i<=min(7,it+Ti-1); i++)
          for (j=jt; j<=min(6,jt+Tj-1); j++)
            S(i,j);
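
A minimal C sketch of the tiled 7 x 6 nest, assuming inclusive upper bounds in the intra-tile loops so that the last point of each tile is executed. The function name `tiled_visits_each_point_once` and the `visits` array are illustrative and not part of the slides; the increment stands in for S(i,j).

```c
#include <assert.h>
#include <string.h>

#define NI 7   /* extent of loop i in the slide's example */
#define NJ 6   /* extent of loop j in the slide's example */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Runs the tiled nest with tile sizes Ti x Tj and returns 1 if every
   point (i,j) of the 7x6 space was visited exactly once, 0 otherwise. */
int tiled_visits_each_point_once(int Ti, int Tj)
{
    int visits[NI + 1][NJ + 1];

    assert(Ti > 0 && Tj > 0);
    memset(visits, 0, sizeof visits);

    /* Inter-tile loops step by the tile size. */
    for (int it = 1; it <= NI; it += Ti)
        for (int jt = 1; jt <= NJ; jt += Tj)
            /* Intra-tile loops scan one tile, clipped at the domain edge. */
            for (int i = it; i <= min_int(NI, it + Ti - 1); i++)
                for (int j = jt; j <= min_int(NJ, jt + Tj - 1); j++)
                    visits[i][j]++;   /* stand-in for S(i,j) */

    for (int i = 1; i <= NI; i++)
        for (int j = 1; j <= NJ; j++)
            if (visits[i][j] != 1)
                return 0;
    return 1;
}
```

Calling it with tile sizes that do not divide the loop extents (for example 3 x 4) exercises the partial tiles at the boundary.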

Rectangular Tileability

• Legality of rectangular tiling:
  ◦ Atomic execution of each tile
  ◦ No cyclic dependence between tiles
  ◦ Data dependences lexicographically positive in all space dimensions
• Unimodular transformations (e.g., skewing) are used as a pre-processing step to make rectangular tiling valid

[Figure: skewing maps the (i, j) iteration space to (i', j')]
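
The effect of skewing on dependences can be sketched in C. The skewing matrix used here (i' = i, j' = i + j) is an assumption chosen for illustration, since the matrix on the slide is not recoverable from the transcript, and the legality test is simplified to "every component of every transformed dependence vector is non-negative". All function names are hypothetical.

```c
#include <assert.h>

/* Applies the 2x2 integer matrix T to dependence vector d: out = T * d. */
void transform_dependence(int T[2][2], int d[2], int out[2])
{
    out[0] = T[0][0] * d[0] + T[0][1] * d[1];
    out[1] = T[1][0] * d[0] + T[1][1] * d[1];
}

/* Simplified rectangular-tileability test for one dependence vector. */
int all_components_nonnegative(int d[2])
{
    return d[0] >= 0 && d[1] >= 0;
}

/* Returns 1 if skewing j by i (assumed matrix [[1,0],[1,1]]) makes every
   listed dependence non-negative in both dimensions, 0 otherwise. */
int skew_makes_tileable(int deps[][2], int ndeps)
{
    int skew[2][2] = { {1, 0}, {1, 1} };

    assert(ndeps >= 0);
    for (int k = 0; k < ndeps; k++) {
        int out[2];
        transform_dependence(skew, deps[k], out);
        if (!all_components_nonnegative(out))
            return 0;
    }
    return 1;
}
```

For example, the dependence (1, -1) blocks rectangular tiling in the original space but becomes (1, 0) after this skew.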

Parametric Tiling

Original loops:

    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        S(i,j);

Tile loop i with tile size Ti and loop j with tile size Tj:

    for (it=1; it<=N; it+=Ti)
      for (jt=1; jt<=N; jt+=Tj)
        for (i=it; i<=min(N,it+Ti-1); i++)
          for (j=jt; j<=min(N,jt+Tj-1); j++)
            S(i,j);

• Performance of tiled code can vary greatly with the choice of tile sizes
  → Model-driven and/or empirical search for best tile sizes
• Parametric tile sizes
  ◦ Not fixed at compile time
  ◦ Runtime parameters
  ◦ Valuable for:
    • Auto-tuning systems
    • Generalized “ATLAS”
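
Because the tile sizes are ordinary runtime arguments, an empirical search can call the same compiled kernel with different sizes. The sketch below is only an illustration of that idea: `tiled_kernel` models S(i,j) as accumulating i*j, and `pick_fastest_tile` is a hypothetical, deliberately naive tuner that times square tile sizes with the standard `clock()` function.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Parametrically tiled N x N nest; Ti and Tj are runtime parameters.
   S(i,j) is modeled as adding i*j to a running sum. */
long tiled_kernel(int N, int Ti, int Tj)
{
    long sum = 0;

    assert(Ti > 0 && Tj > 0);
    for (int it = 1; it <= N; it += Ti)
        for (int jt = 1; jt <= N; jt += Tj)
            for (int i = it; i <= min_int(N, it + Ti - 1); i++)
                for (int j = jt; j <= min_int(N, jt + Tj - 1); j++)
                    sum += (long)i * j;
    return sum;
}

/* Times the kernel once per candidate tile size (used for both Ti and Tj)
   and returns the index of the fastest candidate. No recompilation is
   needed between candidates. */
size_t pick_fastest_tile(int N, const int *candidates, size_t count)
{
    size_t best = 0;
    clock_t best_time = 0;

    assert(count > 0);
    for (size_t c = 0; c < count; c++) {
        clock_t start = clock();
        volatile long sink = tiled_kernel(N, candidates[c], candidates[c]);
        clock_t elapsed = clock() - start;
        (void)sink;   /* keep the call from being optimized away */
        if (c == 0 || elapsed < best_time) {
            best = c;
            best_time = elapsed;
        }
    }
    return best;
}
```

A real auto-tuner would repeat each measurement and search Ti and Tj independently; this sketch only shows why runtime tile sizes make such a search cheap.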

Approaches to Loop Tiling

• TLOG and HiTLOG
  ◦ Handle only perfectly nested loops
  ◦ Tile sizes can be runtime parameters
  ◦ Do not address parallelism
• Pluto
  ◦ Handles imperfectly nested loops
  ◦ Tile sizes must be fixed at compile time
  ◦ Addresses parallelism
• PrimeTile
  ◦ Handles imperfectly nested loops
  ◦ Tile sizes can be runtime parameters
  ◦ Does not address parallelism

DynTile and PTile (this work): systems with all positive features of existing tiling tools:
  ◦ Handle imperfectly nested loops
  ◦ Tile sizes can be runtime parameters
  ◦ Address parallelism
  ◦ Support multilevel tiling

Tiled Code Generation with Polyhedral Model

Original loop:

    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        S(i,j);

Statement domain: 1 ≤ i ≤ N, 1 ≤ j ≤ N
Affine schedule: i' = i, j' = j
Assume: rectangular tiling is valid. Tile sizes = 32 x 32
Tiling constraints: 0 ≤ i-32∙it ≤ 31, 0 ≤ j-32∙jt ≤ 31, with it' = it, jt' = jt

Tiled loop:

    for (it=0; it<=floord(N,32); it++)
      for (jt=0; jt<=floord(N,32); jt++)
        for (i=max(1,32*it); i<=min(N,32*it+31); i++)
          for (j=max(1,32*jt); j<=min(N,32*jt+31); j++)
            S(i,j);

[Figure: N x N iteration space in the (i, j) plane bounded by i ≥ 1, i ≤ N, j ≥ 1, j ≤ N]

Constraint of the polyhedral model: inequalities of the loop bounds must be linear in terms of loop iterators and problem sizes
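
The generated code relies on `floord`, an integer floor division that must also be correct for negative numerators (C's `/` truncates toward zero). A minimal sketch follows; the helper definitions and `count_tiled_points` are written here for illustration rather than copied from any tool's output.

```c
#include <assert.h>

/* floor(n / d) for d > 0, correct for negative n. */
int floord(int n, int d)
{
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}

/* ceil(n / d) for d > 0, expressed through floord. */
int ceild(int n, int d)
{
    assert(d > 0);
    return -floord(-n, d);
}

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Counts the points executed by the 32x32 tiled loop nest; a correct
   tiling executes exactly N*N points. */
long count_tiled_points(int N)
{
    long count = 0;

    for (int it = 0; it <= floord(N, 32); it++)
        for (int jt = 0; jt <= floord(N, 32); jt++)
            for (int i = max_int(1, 32 * it); i <= min_int(N, 32 * it + 31); i++)
                for (int j = max_int(1, 32 * jt); j <= min_int(N, 32 * jt + 31); j++)
                    count++;   /* stand-in for S(i,j) */
    return count;
}
```

The same two helpers cover the ⌈ ⌉ and ⌊ ⌋ bounds that appear in the wavefront slides later in the deck.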

PrimeTile: Approach to Sequential Parametric Tiling

Recursive level-by-level generation of tiling loops by non-polyhedral AST processing

Input loops:

    for (i=lbi; i<=ubi; i++)
      for (j=lbj(i); j<=ubj(i); j++)
        S(i,j);

Output pseudocode:

    /* Full tiles (loop i) */
    for it {
      [compute lbv]
      [compute ubv]
      if (lbv<ubv) {
        /* Full tiles */
        [prolog j]
        [full tiles j]
        [epilog j]
      } else {
        /* No full tiles */
        [untiled j]
      }
    }
    /* Partial tile (loop i) */
    [epilog i]

[Figure: (i, j) iteration space split into full tiles and partial tiles]

PrimeTile: Multi-Level Tiling

• Essential for:
  ◦ Exploiting data locality in deep multi-level memory hierarchies
• Approach:
  ◦ Boundary tiles can be recursively tiled using smaller tile sizes

[Figure: (i, j) iteration space shown with 1, 2, and 3 levels of tiling]

DynTile: Parametric Tiling (Multi Statement Domains)

Input loop nest:

    for i
      S1(i)
      for j2=l2,u2
        S2(i,j2)
      for j3
        S3(i,j3)

Pre-processing to embed all statements in a common space; S1 is wrapped in a one-trip loop:

    for i
      for j1=l2-1,l2-1      /* one-trip loop */
        S1(i,j1)
      for j2=l2,u2
        S2(i,j2)
      for j3
        S3(i,j3)

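DynTile: Parametric Tiling (Multiple Statement Domains)

Embedded loop nest:

    for i
      for j1
        S1(i,j1)
      for j2
        S2(i,j2)
      for j3
        S3(i,j3)

Tiled code over the convex hull of the statement domains:

    /* Inter-tile loops */
    for it {
      for jt {
        /* Intra-tile loops */
      }
    }

[Figure: domains of S1, S2, S3 in the (i, j) plane enclosed by their convex hull]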

DynTile: Wave-front Parallelism

[Figure: tiled (i, j) space and its skewed image (i', j'), with diagonal wavefronts k, k+1, k+2, k+3, k+4]

• After sequential tiling:
  1. If no loop-carried dependences exist, then each tiling loop is directly parallelizable
  2. If none of the tiling loops is parallel, then wave-front parallelization is always possible (all points in the same wavefront are independent of each other)

DynTile: Inspector Code for Dynamic Scheduled Parallel Execution

Sequential tiled code:

    /* Inter-tile loops */
    for it {
      for jt {
        /* Intra-tile loops (treated as a black box) */
      }
    }

Inspector steps:
Step 1: Count #wavefronts and #tiles in each wavefront
Step 2: Allocate bins to store wavefronts
Step 3: Fill the bins with their corresponding tiles
Step 4: Execute in parallel all tiles in each bin

Parallel code:

    for each bin w {
      #pragma omp parallel for
      for each tile in w {
        /* Intra-tile loops (treated as a black box) */
      }
    }

[Figure: tile iteration space over (it, jt) partitioned into wavefronts w1 to w5]
Example: #wavefronts = 5; w1 has 2 tiles, w2 has 3 tiles, w3 has 4 tiles, w4 has 3 tiles, w5 has 4 tiles
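
The four inspector steps can be sketched in C for a simplified case. The sketch assumes a rectangular `num_it` x `num_jt` tile space and the wavefront number w = it + jt; the slide's tile space is not rectangular, and the `Tile` type, `run_wavefronts`, and the `executed` array are hypothetical names. The OpenMP pragma is ignored when the file is compiled without OpenMP support.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int it, jt; } Tile;

/* Bins tiles by wavefront, then runs each bin's tiles in parallel.
   bin_sizes must hold num_it + num_jt - 1 ints; executed must hold
   num_it * num_jt ints and receives the wavefront of each tile.
   Returns the number of wavefronts, or -1 on allocation failure. */
int run_wavefronts(int num_it, int num_jt, int *bin_sizes, int *executed)
{
    assert(num_it > 0 && num_jt > 0);
    int num_waves = num_it + num_jt - 1;

    /* Step 1: count the tiles in each wavefront. */
    for (int w = 0; w < num_waves; w++)
        bin_sizes[w] = 0;
    for (int it = 0; it < num_it; it++)
        for (int jt = 0; jt < num_jt; jt++)
            bin_sizes[it + jt]++;

    /* Step 2: allocate one bin per wavefront. */
    Tile **bins = (Tile **)calloc((size_t)num_waves, sizeof *bins);
    int *fill = (int *)calloc((size_t)num_waves, sizeof *fill);
    int ok = (bins != NULL && fill != NULL);
    for (int w = 0; ok && w < num_waves; w++) {
        bins[w] = (Tile *)malloc((size_t)bin_sizes[w] * sizeof **bins);
        ok = (bins[w] != NULL);
    }

    if (ok) {
        /* Step 3: fill each bin with its tiles. */
        for (int it = 0; it < num_it; it++)
            for (int jt = 0; jt < num_jt; jt++) {
                int w = it + jt;
                bins[w][fill[w]].it = it;
                bins[w][fill[w]].jt = jt;
                fill[w]++;
            }

        /* Step 4: wavefronts run in order; tiles within one run in parallel. */
        for (int w = 0; w < num_waves; w++) {
            #pragma omp parallel for
            for (int t = 0; t < bin_sizes[w]; t++) {
                Tile tile = bins[w][t];
                /* Stand-in for the black-box intra-tile loops. */
                executed[tile.it * num_jt + tile.jt] = w;
            }
        }
    }

    if (bins != NULL)
        for (int w = 0; w < num_waves; w++)
            free(bins[w]);
    free(bins);
    free(fill);
    return ok ? num_waves : -1;
}
```

For a real tiled nest the two counting and filling passes would reuse the generated inter-tile loops, so that only non-empty tiles are binned.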

DynTile: Implementation

[Figure: tool-flow diagram of DynTile]

Components: Clan, Pluto, Modified CLooG, Parser + AST Generator, Convex Hull Generator (using ISL), Tiling Transformer, Inspector Code Specifier, Code Generator

Data passed between stages, in the order labeled on the diagram:
  1. Sequence of loop nests (pre-processed)
  2. Statement polyhedra + affine transforms (for rectangular tileability)
  3. Tileable loop code (with preserved embedding information)
  4. Loop ASTs
  5. Statement polyhedra
  6. Convex-hull loop AST
  7. Tiled loop ASTs
  8. Parallel tiled loop ASTs
  9. Parallel tiled loop code

PTile: Loop Generation

• Representation of statement domains
  ◦ Set of affine inequalities S: $B v + P p + c \ge 0$
    • $v_1, v_2, \ldots, v_n$ are loop variables ($v_1$ outermost and $v_n$ innermost)
    • $p_1, p_2, \ldots, p_k$ are program parameters
    • Bounds of $v_r$, $1 \le r \le n$:
      $\max\big(f_1(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c), \ldots, f_t(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c)\big) \le v_r \le \min\big(g_1(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c), \ldots, g_s(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c)\big)$
    • Bounds depend only on outer loop variables and parameters (row echelon form)

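Loop Generation (cont.)

$$
\left[\; B \;\middle|\; P \;\middle|\; C \;\right]
\begin{pmatrix} v \\ p \\ 1 \end{pmatrix} \ge 0,
\qquad
B = \begin{pmatrix}
B_{11} & 0 & \cdots & 0 \\
B_{21} & B_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
B_{n1} & B_{n2} & \cdots & B_{nn}
\end{pmatrix},
\quad
P = \begin{pmatrix}
P_{11} & P_{12} & \cdots & P_{1k} \\
P_{21} & P_{22} & \cdots & P_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
P_{n1} & P_{n2} & \cdots & P_{nk}
\end{pmatrix},
\quad
C = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}
$$

with $v = (v_1, \ldots, v_n)^T$ and $p = (p_1, \ldots, p_k)^T$.

Row echelon form: suitable for generating loop code to scan the iteration points represented by the system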

Parametric Sequential Tiling

• Tiling transformation
  ◦ Express each variable $v_j$ in terms of inter-tile (tile) coordinates $t_j$, intra-tile coordinates $u_j$ and tile sizes $s_j$
    • $v_j = s_j \cdot t_j + u_j$ and $0 \le u_j \le s_j - 1$
    • S':

$$
\begin{pmatrix}
B.s & B & P & 0 & C \\
0 & I & 0 & 0 & 0 \\
0 & -I & 0 & I & -1
\end{pmatrix}
\begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0
\qquad (I \text{: identity matrix})
$$

    • S' is equivalent to S
    • Not in row echelon form for t, but in row echelon form for u

Parametric Sequential Tiling (cont.)

  ◦ To derive a system in row echelon form for all variables
    • Create a system $S_T$ with only tile variables, program parameters and tile sizes (also parameters)
    • Relaxed projection to eliminate intra-tile variables $u_j$
    • In $S_T$, each term $B_{ij} \cdot u_j$ is replaced by

$$
B_{ij} \cdot u_j \;\to\;
\begin{cases}
0 & \text{if } B_{ij} \le 0 \\
B_{ij} \cdot (s_j - 1) & \text{if } B_{ij} > 0
\end{cases}
$$

    • All solutions to S' also satisfy $S_T$
    • $S_T$:

$$
\left[\; B.s \;\middle|\; P \;\middle|\; B^{+} \;\middle|\; C' \;\right]
\begin{pmatrix} t \\ p \\ s \\ 1 \end{pmatrix} \ge 0
$$

    • $B.s$ has the same nonzero structure as $B$ => row echelon form for t, where s is a diagonal matrix of parametric tile sizes
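
As a worked instance of the relaxed projection, take the single constraint k ≥ i from the triangular loop nest used later in the slides, with tile sizes Ti and Tk. The intra-tile names `u_i` and `u_k` are introduced here for illustration. The positive coefficient on `u_k` is replaced by its maximum Tk - 1, and the negative coefficient on `u_i` is replaced by 0:

```latex
\begin{align*}
k - i &\ge 0 \\
(T_k \cdot kt + u_k) - (T_i \cdot it + u_i) &\ge 0
  && \text{substitute } v_j = s_j \cdot t_j + u_j \\
T_k \cdot kt + (T_k - 1) - T_i \cdot it &\ge 0
  && \text{relax: } +u_k \to T_k - 1,\; -u_i \to 0 \\
kt &\ge \frac{T_i \cdot it - T_k + 1}{T_k}
\end{align*}
```

The last line agrees with the lower bound ⌈(it*Ti-Tk+1)/Tk⌉ of the kt loop in the sequential tiled code shown on the RSFME slide.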

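Parametric Sequential Tiling (cont.)

$S_T$: in row echelon form for t; used to generate tile loops

$$
\left[\; B.s \;\middle|\; P \;\middle|\; B^{+} \;\middle|\; C' \;\right]
\begin{pmatrix} t \\ p \\ s \\ 1 \end{pmatrix} \ge 0
$$

S': in row echelon form for u; used to generate intra-tile loops

$$
\begin{pmatrix}
B.s & B & P & 0 & C \\
0 & I & 0 & 0 & 0 \\
0 & -I & 0 & I & -1
\end{pmatrix}
\begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0
$$

$S_T | S'$: in row echelon form for t and u; used to generate tile loops and intra-tile loops

$$
\begin{pmatrix}
B.s & B & P & 0 & C \\
0 & I & 0 & 0 & 0 \\
0 & -I & 0 & I & -1 \\
B.s & 0 & P & B^{+} & C'
\end{pmatrix}
\begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0
$$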

Parallel Non-parameterized Tiling

Original loops:

    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        for (k=i; k<=N; k++)
          S(i,j,k);

Sequential tiled loops (tiling with 8x8x8 tile sizes):

    for (it=⌈-6/8⌉; it<=⌊N/8⌋; it++)
      for (jt=⌈-6/8⌉; jt<=⌊N/8⌋; jt++)
        for (kt=⌈(it*8-7)/8⌉; kt<=⌊N/8⌋; kt++)
          // intra-tile loops i,j,k

Original loop constraints:

    Lower-bound constraints        Upper-bound constraints
    (1a): -6/8 <= it               (1b): it <= N/8
    (2a): -6/8 <= jt               (2b): jt <= N/8
    (3a): (it*8-7)/8 <= kt         (3b): kt <= N/8

Introduce new wavefront constraints (for loop kt), with w = it+jt+kt:

    (4a): w-it-jt <= kt            (4b): kt <= w-it-jt

Use Fourier Motzkin Elimination to derive new wavefront constraints (for loops w, it, jt):

    (5a): Combine (4a) and (3b):  w-it-jt <= N/8               =>  (8*w-8*it-N)/8 <= jt
    (5b): Combine (4b) and (3a):  (it*8-7)/8 <= w-it-jt        =>  jt <= (8*w-16*it+7)/8
    (6a): Combine (5a) and (2b):  (8*w-8*it-N)/8 <= N/8        =>  (4*w-N)/4 <= it
    (6b): Combine (5b) and (2a):  -6/8 <= (8*w-16*it+7)/8      =>  it <= (8*w+7)/16
    (7a): Combine (6b) and (1a):  -6/8 <= (8*w+13)/16          =>  -7/8 <= w
    (7b): Combine (6a) and (1b):  (4*w-N)/4 <= N/8             =>  w <= 3*N/8

Parallel Non-parameterized Tiling (cont.)

    /* Parallel tiled loops */
    for (w=⌈-7/8⌉; w<=⌊3*N/8⌋; w++)                                   /* sequential */
      for (it=max(⌈-6/8⌉,⌈(4*w-N)/4⌉);
           it<=min(⌊N/8⌋,⌊(8*w+7)/16⌋); it++)                          /* parallel */
        for (jt=max(⌈-6/8⌉,⌈(8*w-8*it-N)/8⌉);
             jt<=min(⌊N/8⌋,⌊(8*w-16*it+7)/8⌋); jt++)                   /* parallel */
          for (kt=max(⌈(it*8-7)/8⌉,w-it-jt);
               kt<=min(⌊N/8⌋,w-it-jt); kt++)                           /* one-trip-count */
            // intra-tile loops i,j,k

• This works when tile sizes are fixed
• When tile sizes are parametric, Fourier Motzkin Elimination becomes problematic
  ◦ The sign of a coefficient in the combined inequalities can be indeterminate, so it is impossible to determine whether the new inequality is a lower-bound or an upper-bound inequality
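
One way to sanity-check the wavefront loop bounds is to count the tiles they enumerate and compare against the sequential tiled loops. The sketch below does this for the 8x8x8 example; the helper and function names are illustrative, and the intra-tile loops are replaced by a counter.

```c
#include <assert.h>

/* floor(n/d) and ceil(n/d) for d > 0, correct for negative n. */
static int floord(int n, int d)
{
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}
static int ceild(int n, int d) { return -floord(-n, d); }
static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Number of tiles visited by the sequential tiled loops (8x8x8 tiles). */
long count_sequential_tiles(int N)
{
    long count = 0;

    for (int it = ceild(-6, 8); it <= floord(N, 8); it++)
        for (int jt = ceild(-6, 8); jt <= floord(N, 8); jt++)
            for (int kt = ceild(it * 8 - 7, 8); kt <= floord(N, 8); kt++)
                count++;
    return count;
}

/* Number of tiles visited by the wavefront (parallel) tiled loops.
   The kt loop runs at most once because kt is fixed to w - it - jt. */
long count_wavefront_tiles(int N)
{
    long count = 0;

    for (int w = ceild(-7, 8); w <= floord(3 * N, 8); w++)
        for (int it = max_int(ceild(-6, 8), ceild(4 * w - N, 4));
             it <= min_int(floord(N, 8), floord(8 * w + 7, 16)); it++)
            for (int jt = max_int(ceild(-6, 8), ceild(8 * w - 8 * it - N, 8));
                 jt <= min_int(floord(N, 8), floord(8 * w - 16 * it + 7, 8)); jt++)
                for (int kt = max_int(ceild(it * 8 - 7, 8), w - it - jt);
                     kt <= min_int(floord(N, 8), w - it - jt); kt++)
                    count++;
    return count;
}
```

Equal counts for a given N indicate that the wavefront nest visits the same number of tiles as the sequential nest; it does not by itself check the execution order.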

Parallel Parametric Tiling

1. Introduce an outermost wavefront loop
2. Optimize the innermost iterator using wavefront inequalities
   $w - t_1 - \ldots - t_{n-1} \le t_n \le w - t_1 - \ldots - t_{n-1}$

    /* Parallel tiled loops */
    for (w=wmin; w<=wmax; w++)                                   /* sequential */
      for (it=lbit; it<=ubit; it++)                              /* parallel */
        for (jt=lbjt; jt<=ubjt; jt++)                            /* parallel */
          for (kt=max(lbkt,w-it-jt); kt<=min(ubkt,w-it-jt); kt++)  /* one-trip-count */
            // intra-tile loops i,j,k

Static Determination of Lowest and Highest Wavefront Numbers

• The outermost tiling loop enumerates the wavefront numbers from lowest (wmin) to highest (wmax)
• The values of wmin and wmax can be determined at compile time using ILP solvers such as PIP/PipLib
• Similarly, parametric bound values of each tiling loop variable ($t_j^{min}$ and $t_j^{max}$ for 1 ≤ j ≤ n) can also be computed using an ILP solver

Inputs to the ILP solver:
  ◦ Original point loops (affine inequalities)
  ◦ Global parameter values (affine inequalities)

Outputs of the ILP solver:
  ◦ Lexicographic minimum point in each loop level, e.g., 1, 1
  ◦ Lexicographic maximal point in each loop level, e.g., 200, 2*N

Derived wavefront numbers:
  ◦ Lowest wavefront number, e.g., wmin = ⌊1/Ti⌋ + ⌊1/Tj⌋
  ◦ Highest wavefront number, e.g., wmax = ⌊200/Ti⌋ + ⌊(2*N)/Tj⌋
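
Once the extreme points are known, the wavefront range follows from floor divisions by the tile sizes, as in the slide's example formulas. In this sketch the extreme points are passed in as plain arguments; in the slides they come from an ILP solver, which is not modeled here. The `WavefrontRange` type and `wavefront_range` function are hypothetical names.

```c
#include <assert.h>

/* floor(n / d) for d > 0, correct for negative n. */
static int floord(int n, int d)
{
    assert(d > 0);
    return (n >= 0) ? n / d : -((-n + d - 1) / d);
}

typedef struct { int w_min, w_max; } WavefrontRange;

/* Lowest and highest wavefront numbers for a 2-D nest, given the minimum
   point (i_min, j_min), the maximum point (i_max, j_max), and the runtime
   tile sizes Ti and Tj. */
WavefrontRange wavefront_range(int i_min, int j_min, int i_max, int j_max,
                               int Ti, int Tj)
{
    WavefrontRange r;

    r.w_min = floord(i_min, Ti) + floord(j_min, Tj);
    r.w_max = floord(i_max, Ti) + floord(j_max, Tj);
    return r;
}
```

With the slide's example points (1, 1) and (200, 2*N), N = 100 and 32 x 32 tiles, this gives w_min = 0 and w_max = 6 + 6 = 12.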

Parallel Parametric Tiling

1. Introduce an outermost wavefront loop
   • Utilize an ILP solver to derive wmin and wmax
2. Optimize the innermost iterator using wavefront inequalities
   $w - t_1 - \ldots - t_{n-1} \le t_n \le w - t_1 - \ldots - t_{n-1}$

    /* Parallel tiled loops */
    for (w=wmin; w<=wmax; w++)                                   /* sequential */
      for (it=lbit; it<=ubit; it++)                              /* parallel */
        for (jt=lbjt; jt<=ubjt; jt++)                            /* parallel */
          for (kt=max(lbkt,w-it-jt); kt<=min(ubkt,w-it-jt); kt++)  /* one-trip-count */
            // intra-tile loops i,j,k

Correct code, but may visit many empty tiles

Parallel Parametric Tiling (cont.)

3. Optimize using bounded wavefront inequalities
   • Utilize an ILP solver to derive parametric bound values $t_j^{min}$, $t_j^{max}$ for 1 ≤ j ≤ n

    /* Parallel tiled loops */
    for (w=wmin; w<=wmax; w++)                                              /* sequential */
      for (it=max(lbit,w-jtmax-ktmax); it<=min(ubit,w-jtmin-ktmin); it++)   /* parallel */
        for (jt=max(lbjt,w-it-ktmax); jt<=min(ubjt,w-it-ktmin); jt++)       /* parallel */
          for (kt=max(lbkt,w-it-jt); kt<=min(ubkt,w-it-jt); kt++)           /* one-trip-count */
            // intra-tile loops i,j,k

Tighter loop bounds, but may still visit empty tiles

Parallel Parametric Tiling (cont.)

4. Optimize using Relaxed Symbolic Fourier Motzkin Elimination (RSFME)

Original loops:

    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        for (k=i; k<=N; k++)
          S(i,j,k);

Sequential tiled loops:

    for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
      for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊N/Tj⌋; jt++)
        for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
          // intra-tile loops i,j,k

Constraints:

    Lower-bound constraints          Upper-bound constraints
    (1a): wmin <= w                  (1b): w <= wmax
    (2a): (1-Ti+1)/Ti <= it          (2b): it <= N/Ti
    (3a): (1-Tj+1)/Tj <= jt          (3b): jt <= N/Tj
    (4a): (it*Ti-Tk+1)/Tk <= kt      (4b): kt <= N/Tk
    (5a): w-it-jt <= kt              (5b): kt <= w-it-jt

Derived constraints:

    (6a): Combine (5a) and (4b):
          w-it-jt <= N/Tk
          w-it-N/Tk <= jt
          (w*Tk-it*Tk-N)/Tk <= jt
    (6b): Combine (5b) and (4a):
          (it*Ti-Tk+1)/Tk <= w-it-jt
          jt <= w-it-it*Ti/Tk+1-1/Tk
          jt <= (w*Tk-it*Tk-it*Ti+Tk-1)/Tk
    (7a): Combine (6a) and (3b):
          w-it-N/Tk <= N/Tj
          w-N/Tj-N/Tk <= it
          (w*Tj*Tk-N*Tk-N*Tj)/(Tj*Tk) <= it
    (7b): Combine (6b) and (3a):
          2/Tj-1 <= w-it-it*Ti/Tk+1-1/Tk
          it+it*Ti/Tk <= w+2-2/Tj-1/Tk
          it <= (w*Tj*Tk^2+2*Tj*Tk^2-Tj*Tk-2*Tk^2) / (Ti*Tj*Tk+Tj*Tk^2)

No ambiguous signs encountered

Very tight loop bounds, with negligible overhead of scanning empty tiles

Ambiguous Sign Resolution

• Resolving an ambiguous sign in RSFME
• Relaxation step
  ◦ Replace the tile loop variables with their parametric bounded values ($t_j^{min}$ and $t_j^{max}$)

Original loops:

    for (i=1; i<=N; i++)
      for (j=1; j<=N-i; j++)
        for (k=i; k<=N; k++)
          S(i,j,k);

Sequential tiled loops:

    for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
      for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊(N-it*Ti)/Tj⌋; jt++)
        for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
          // intra-tile loops i,j,k

Constraints:

    Lower-bound constraints          Upper-bound constraints
    (1a): wmin <= w                  (1b): w <= wmax
    (2a): (1-Ti+1)/Ti <= it          (2b): it <= N/Ti
    (3a): (1-Tj+1)/Tj <= jt          (3b): jt <= (N-it*Ti)/Tj
    (4a): (it*Ti-Tk+1)/Tk <= kt      (4b): kt <= N/Tk
    (5a): w-it-jt <= kt              (5b): kt <= w-it-jt

Derived constraints:

    (6a): Combine (5a) and (4b):
          w-it-jt <= N/Tk
          w-it-N/Tk <= jt
          (w*Tk-it*Tk-N)/Tk <= jt
    (6b): Combine (5b) and (4a):
          (it*Ti-Tk+1)/Tk <= w-it-jt
          jt <= w-it-it*Ti/Tk+1-1/Tk
          jt <= (w*Tk-it*Tk-it*Ti+Tk-1)/Tk
    (7a): Combine (6a) and (3b):
          w-it-N/Tk <= N/Tj-it*Ti/Tj
          w-N/Tj-N/Tk <= it*(1-Ti/Tj)        <- ambiguous sign encountered
    (7b): Combine (6b) and (3a):
          2/Tj-1 <= w-it-it*Ti/Tk+1-1/Tk
          it+it*Ti/Tk <= w+2-2/Tj-1/Tk
          it <= (w*Tj*Tk^2+2*Tj*Tk^2-Tj*Tk-2*Tk^2) / (Ti*Tj*Tk+Tj*Tk^2)

Use itmin and itmax to resolve the sign ambiguity in (7a), starting from w-N/Tj-N/Tk <= it-it*Ti/Tj:

    (7a.1): w-N/Tj-N/Tk+itmin*Ti/Tj <= it
            (w*Tj*Tk-N*Tj-N*Tk+itmin*Ti*Tk)/(Tj*Tk) <= it
    (7a.2): it*Ti/Tj <= itmax-w+N/Tj+N/Tk
            it <= (itmax*Tj*Tk-w*Tj*Tk+N*Tj+N*Tk)/(Ti*Tk)
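
The ambiguity and the safety of the relaxation can both be checked numerically. In this sketch, which is an illustration and not part of PTile, `coefficient_sign` shows that the sign of the coefficient (1 - Ti/Tj) depends on runtime tile sizes, and `relaxation_is_safe` brute-forces a range of integer (w, it) pairs to confirm that whenever (7a) holds, the relaxed bounds (7a.1) and (7a.2) hold too. All inequalities are multiplied through by Tj*Tk to stay in integer arithmetic.

```c
#include <assert.h>

/* Sign of the coefficient (1 - Ti/Tj) of `it` in (7a): +1, 0, or -1.
   It cannot be known at compile time when Ti and Tj are parameters. */
int coefficient_sign(long Ti, long Tj)
{
    assert(Ti > 0 && Tj > 0);
    return (Tj > Ti) - (Tj < Ti);
}

/* Returns 1 if, for every integer it in [it_min, it_max] and every w in
   [w_lo, w_hi] that satisfies (7a), both (7a.1) and (7a.2) also hold. */
int relaxation_is_safe(long N, long Ti, long Tj, long Tk,
                       long it_min, long it_max, long w_lo, long w_hi)
{
    assert(Ti > 0 && Tj > 0 && Tk > 0 && it_min <= it_max);

    for (long w = w_lo; w <= w_hi; w++) {
        /* (w - N/Tj - N/Tk) scaled by Tj*Tk */
        long lhs = w * Tj * Tk - N * Tk - N * Tj;

        for (long it = it_min; it <= it_max; it++) {
            /* (7a):   w - N/Tj - N/Tk <= it - it*Ti/Tj */
            int holds_7a = lhs <= it * Tj * Tk - it * Ti * Tk;
            /* (7a.1): w - N/Tj - N/Tk + it_min*Ti/Tj <= it */
            int holds_7a_1 = lhs + it_min * Ti * Tk <= it * Tj * Tk;
            /* (7a.2): it*Ti/Tj <= it_max - w + N/Tj + N/Tk */
            int holds_7a_2 = it * Ti * Tk <= it_max * Tj * Tk - lhs;

            if (holds_7a && !(holds_7a_1 && holds_7a_2))
                return 0;   /* relaxation would cut off a valid point */
        }
    }
    return 1;
}
```

The relaxed bounds may admit extra (w, it) pairs, which is why they can cause some empty tiles to be scanned, but they never exclude a pair that satisfies the original inequality.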

PTile: Prototype Implementation

[Figure: tool-flow diagram of PTile]

Components: Clan, Pluto, Modified CLooG, Parser + AST Generator, Convex Hull Generator (using ISL), Tiling Transformer, Wavefront Parallelizer + RSFME, Code Generator

Data passed between stages, in the order labeled on the diagram:
  1. Sequence of loop nests (pre-processed)
  2. Statement polyhedra + affine transforms (for rectangular tileability)
  3. Tileable loop code (with preserved embedding information)
  4. Loop ASTs
  5. Statement polyhedra
  6. Convex-hull loop AST
  7. Sequential tiled loop ASTs
  8. Parallel tiled loop ASTs
  9. Parallel tiled loop code

PTile, DynTile, PrimeTile: Experiments

• Main comparison:
  ◦ PTile, DynTile and PrimeTile
• AMD Opteron 2380:
  ◦ Dual-socket quad-core AMD Opteron 2380 processors running at 2.6 GHz with KB L1 cache, 2 MB of L2 cache
• Compilers:
  ◦ GCC version
  ◦ ICC version 11.0
• Experiments:
  ◦ With and without vectorization
  ◦ For parallel runs, used OpenMP
• Benchmarks: 2-D FDTD, Cholesky, DTRMM, LU

Results - 1

• PTile:
  ◦ The RSFME relaxation step was never needed in these and other benchmarks that we have tested
• Control overhead:
  ◦ PrimeTile has simple loop bounds but larger code size
  ◦ PTile and DynTile generate more complex loop bounds
  ◦ For 2D-FDTD, there is a 20% to 40% difference in execution time due to control overhead; for the other benchmarks, no significant difference

Results - 2

                          Sequential                      Parallel
    Bench     Compiler    PrimeTile  DynTile   PTile      DynTile   PTile
    2d-fdtd   gcc-novec   43.84s     49.19s    56.78s     9.32s     10.98s
    2d-fdtd   gcc-vec     43.82s     49.22s    56.85s     9.37s     10.98s
    2d-fdtd   icc-novec   40.27s     48.12s    54.29s     13.30s    12.96s
    2d-fdtd   icc-vec     40.52s     49.61s    54.63s     13.03s    13.18s
    cholesky  gcc-novec   6.13s      10.50s    13.43s     1.91s     2.81s
    cholesky  gcc-vec     6.08s      10.46s    13.45s     1.89s     2.82s
    cholesky  icc-novec   5.63s      5.86s     8.19s      1.21s     2.40s
    cholesky  icc-vec     5.36s      5.74s     8.22s      1.27s     2.61s
    dtrmm     gcc-novec   9.29s      14.34s    18.99s     2.55s     4.50s
    dtrmm     gcc-vec     9.25s      14.57s    18.99s     2.54s     3.69s
    dtrmm     icc-novec   9.84s      9.19s     13.27s     2.17s     3.22s
    dtrmm     icc-vec     9.91s      9.12s     13.44s     2.33s     3.27s
    lu        gcc-novec   8.30s      9.15s     10.98s     2.56s     2.94s
    lu        gcc-vec     8.29s      9.15s     10.98s     2.98s     2.43s
    lu        icc-novec   6.30s      5.63s     7.49s      6.18s     1.60s
    lu        icc-vec     6.36s      5.58s     6.52s      6.36s     1.62s

Results - 3

• Sequential:
  ◦ PrimeTile performs best; DynTile is close
  ◦ gcc has more trouble optimizing code from DynTile than code from PrimeTile (difference between icc and gcc)
  ◦ PTile is slower because the order of execution of tiles impacts locality
• Parallel:
  ◦ DynTile performs better than PTile (except for LU; we need to understand this better)
  ◦ All tiles in a wavefront are executed in parallel with DynTile, whereas the OpenMP parallel pragma works only with the outermost tiled parallel loop in PTile
• Vectorization:
  ◦ The complexity of loop bounds in the generated code appears to make it difficult for the compiler to vectorize

Summary

• Developed DynTile and PTile, two parametric tiling systems with the following features:
  ◦ Handle imperfectly nested loops
  ◦ Allow tile sizes to be runtime parameters
  ◦ Address parallelism
  ◦ Support multi-level tiling
• Ongoing: a much more extensive set of experiments to understand and improve the efficiency of the approaches for generation of parallel parametrically tiled code