CR18: Advanced Compilers L06: Code Generation Tomofumi Yuki


Code Generation

Completing the transformation loop. Problem:
- how to generate code to scan a polyhedron? a union of polyhedra?
- how to generate tiled code?
- how to generate parametrically tiled code?

Evolution of Code Gen

- Ancourt & Irigoin 1991: single polyhedron scanning
- LooPo (Griebl & Lengauer): 1st step to unions of polyhedra; scan bounding box + guards
- Omega Code Gen 1995: generate inefficient code (convex hull + guards), then try to remove inefficiencies

Evolution of Code Gen

- LoopGen (Quilleré-Rajopadhye-Wilde 2000): efficiently scanning unions of polyhedra
- CLooG (Bastoul 2004): improvements to the QRW algorithm; robust and well-maintained implementation
- AST Generation (Grosser 2015): polyhedral AST generation is more than scanning polyhedra; scanning is not enough!

Scanning a Polyhedron

Scanning Polyhedra with DO Loops [1991]. Problem: generate bounds on loops
- outermost loop: constants and params only
- inner loops: + surrounding iterators
Approach: Fourier-Motzkin elimination, projecting out variables.
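The projection step can be sketched with a tiny Fourier-Motzkin eliminator in Python (an illustration only, not the implementation used by the tools above; the constraint encoding and the name fm_eliminate are ours):

```python
def fm_eliminate(constraints, var):
    """Eliminate `var` by pairing its lower bounds with its upper bounds.
    Each constraint is (coeffs, const) meaning sum(coeff*v) + const >= 0."""
    lower = [c for c in constraints if c[0].get(var, 0) > 0]
    upper = [c for c in constraints if c[0].get(var, 0) < 0]
    rest  = [c for c in constraints if c[0].get(var, 0) == 0]
    out = list(rest)
    for lc, lk in lower:
        for uc, uk in upper:
            a, b = lc[var], -uc[var]          # positive multipliers
            coeffs = {v: b * lc.get(v, 0) + a * uc.get(v, 0)
                      for v in set(lc) | set(uc) if v != var}
            out.append((coeffs, b * lk + a * uk))
    return out

# the triangle 0 <= j <= i <= N: eliminate j to get the bounds on i
tri = [({'j': 1}, 0),             # j >= 0
       ({'i': 1, 'j': -1}, 0),    # i - j >= 0
       ({'N': 1, 'i': -1}, 0)]    # N - i >= 0
proj = fm_eliminate(tri, 'j')
assert ({'i': 1}, 0) in proj              # i >= 0
assert ({'N': 1, 'i': -1}, 0) in proj     # i <= N
```

Eliminating the innermost variable first yields the bounds of the outer loop, exactly as the slide describes.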

Single Polyhedron Example

What is the loop nest for the lexicographic scan?
Constraints: i≤N, j≥0, i-j≥0

for i = 0.. N
  for j = 0.. i
    S;

Single Polyhedron Example

What is the loop nest for the permuted case, with j as the outer loop?

for j = 0.. N
  for i = j.. N
    S;
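For a fixed N, a quick check confirms that both nests scan exactly the same integer points, just in a different order:

```python
N = 6
lex_ij = [(i, j) for i in range(N + 1) for j in range(i + 1)]     # i outer
lex_ji = [(i, j) for j in range(N + 1) for i in range(j, N + 1)]  # j outer

assert set(lex_ij) == set(lex_ji)     # same polyhedron, different scan order
assert all(0 <= j <= i <= N for i, j in lex_ij)   # the constraints hold
```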

Scanning Unions of Polyhedra

Consider scanning two statements. Naïve approach: bounding box.

S1: [N]->{ S1[i]->[i] : 0≤i<N }
S2: [N]->{ S2[i]->[i+5] : 0≤i<N }

After the change of basis (CoB):

S1: [N]->{ [i] : 0≤i≤N }
S2: [N]->{ [i] : 5≤i≤N+5 }

for (i=0.. N+5)
  if (0<=i && i<=N) S1;
  if (5<=i && i<=N+5) S2;
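For a fixed N, the bounding-box strategy above can be replayed directly (using the closed bounds of the generated loop):

```python
N = 8
visited_S1, visited_S2 = [], []
for i in range(0, N + 5 + 1):        # bounding box [0, N+5]
    if 0 <= i <= N:
        visited_S1.append(i)         # guard for S1
    if 5 <= i <= N + 5:
        visited_S2.append(i)         # guard for S2

assert visited_S1 == list(range(0, N + 1))       # S1's domain, in order
assert visited_S2 == list(range(5, N + 5 + 1))   # S2's domain, in order
```

The guards make the code correct, but every iteration of the box tests both conditions, which is the inefficiency the later approaches remove.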

Slightly Better than BBox

Make the domains disjoint. But this is also problematic: code size can quickly grow.

S1: [N]->{ S1[i]->[i] : 0≤i<N }
S2: [N]->{ S2[i]->[i] : 0≤i<M }

for (i=0..4) S1;
for (i=5..N) { S1; S2; }
for (i=N+1..N+5) S2;

QRW Algorithm

Key: recursive splitting. Given a set of n-D domains to scan, start at d=1 and context=parameters:
1. Restrict the domains to the context
2. Project the domains onto the outer d dimensions
3. Make the projections disjoint
4. Recurse for each disjoint projection (d=d+1, context=a piece of the projection)
5. Sort the resulting loops
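For rectangular domains, the recursion can be sketched in a few lines of Python (a simplification: the real algorithm separates arbitrary unions of polyhedra, not just boxes; separate and qrw are our names):

```python
def separate(segs):
    """Split labeled 1-D intervals into disjoint, sorted pieces.
    segs: list of (name, lo, hi) with inclusive bounds."""
    cuts = sorted({x for _, lo, hi in segs for x in (lo, hi + 1)})
    pieces = []
    for a, b in zip(cuts, cuts[1:]):
        names = {n for n, lo, hi in segs if lo <= a and b - 1 <= hi}
        if names:
            pieces.append((a, b - 1, names))   # already in sorted order
    return pieces

def qrw(domains, d=0, indent=""):
    """Generate loop text for rectangular domains {name: [(lo, hi), ...]}."""
    ndim = len(next(iter(domains.values())))
    var = "ijk"[d] if d < 3 else "x%d" % d
    lines = []
    for lo, hi, names in separate([(n, dom[d][0], dom[d][1])
                                   for n, dom in domains.items()]):
        lines.append("%sfor (%s=%d..%d)" % (indent, var, lo, hi))
        if d + 1 == ndim:
            lines += ["%s  %s;" % (indent, n) for n in sorted(names)]
        else:
            lines += qrw({n: domains[n] for n in names}, d + 1, indent + "  ")
    return lines

# two 2-D domains: S1 active for i in [0,6], S2 only for i in [2,6]
code = qrw({'S1': [(0, 6), (0, 4)], 'S2': [(2, 6), (5, 8)]})
assert code[0] == "for (i=0..1)"       # the piece where only S1 is active
assert "for (i=2..6)" in code          # the piece where both are active
```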

Example

Scan the following two domains, S1 and S2 (d=1, context=universe).

Step 1 (Projection): project both domains onto the outer dimension i.
Step 2 (Separation): split the projections into disjoint pieces, one where only S1 is active and one where both S1 and S2 are active:

for (i=0..1) ...
for (i=2..6) ...

Step 3 (Recurse): with d=2 and context=0≤i≤1, only S1 remains:

for (i=0..1)
  for (j=0..4)
    S1;

With d=2 and context=2≤i≤6, separation yields the inner loops L1, L2, L3, L4.
Step 4 (Sort): order the loops at each level:

for (i=2..6)
  L2
  L1
  L3
  L4

CLooG: Chunky Loop Generator

A few problems in the QRW algorithm:
- high complexity
- code size is not controlled
CLooG uses:
- pattern matching to avoid costly polyhedral operations during separation
- optionally stopping the recursion at some depth and generating loops with guards, to reduce code size

Tiled Code Generation

Tiling with fixed size: we did this earlier.
Tiling with parametric size: problem: non-affine!

Tiling Review

What does the tiled code look like?

for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S;

With tile size ts:

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S;

or, equivalently:

for (ti=0; ti<=floor(N,ts); ti++)
  for (tj=0; tj<=floor(N,ts); tj++)
    for (i=ti*ts; i<min(N+1,(ti+1)*ts); i++)
      for (j=tj*ts; j<min(N+1,(tj+1)*ts); j++)
        S;
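For fixed N and ts, it is easy to confirm that the tiled nest visits each original iteration exactly once:

```python
N, ts = 10, 4
orig = {(i, j) for i in range(N + 1) for j in range(N + 1)}

tiled = []
for ti in range(0, N + 1, ts):
    for tj in range(0, N + 1, ts):
        for i in range(ti, min(N + 1, ti + ts)):     # min() clips edge tiles
            for j in range(tj, min(N + 1, tj + ts)):
                tiled.append((i, j))

assert set(tiled) == orig          # same set of iterations
assert len(tiled) == len(orig)     # each visited exactly once
```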

Two Approaches

Use fixed-size tiling:
- if the tile size is a constant, the code stays affine
- pragmatic choice made by many tools
Use non-polyhedral code generation:
- much better for tuning tile sizes
- makes sense for semi-automatic tools

Difficulties in Tiled Code Gen

This is still a very simplified view. In practice, we tile after transformation (skewing, etc.). Let's see the tiled iteration space with tvis.

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S;

Full Tiles, Inset / Outset

Partial tiles have a lot of control overhead. Challenges for parametric tiled code gen:
- make sure to scan the outset
- but also separate the inset
- use efficient point loops for the inset
All without polyhedral analysis.

Point Loops for Full/Partial Tiles

Full tile point loop:

for (i=ti; i<ti+si; i++)
  for (j=tj; j<tj+sj; j++)
    for (k=tk; k<tk+sk; k++)
      ...

Partial/empty tile point loop:

for (i=max(ti,...); i<min(ti+si,...); i++)
  for (j=max(tj,...); j<min(tj+sj,...); j++)
    for (k=max(tk,...); k<min(tk+sk,...); k++)
      if (...)
        ...
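The difference can be checked numerically: a tile whose origin lies far enough inside the domain needs no guards, while a boundary tile does (a toy check; the helper names are ours):

```python
N, ts = 10, 3
def inside(i, j):
    return 0 <= i <= N and 0 <= j <= N

ti, tj = 3, 6                        # a full tile: [3,5] x [6,8]
full = [(i, j) for i in range(ti, ti + ts) for j in range(tj, tj + ts)]
assert all(inside(i, j) for i, j in full)     # guard-free loop is safe

ti, tj = 9, 9                        # a partial tile at the boundary
part = [(i, j)
        for i in range(max(ti, 0), min(ti + ts, N + 1))
        for j in range(max(tj, 0), min(tj + ts, N + 1))]
assert all(inside(i, j) for i, j in part)     # guards clip to the domain
assert len(part) < ts * ts           # partial: fewer points than a full tile
```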

Progression of Parametric Tiling

- Perfectly nested, single loop: TLoG [Renganarayana et al. 2007]
- Multiple levels of tiling: HiTLoG [Renganarayana et al. 2007], PrimeTile [Hartono 2009]
- Parallelizing the tiles: DynTile [Hartono 2010], D-Tiling [Kim 2011]

Computing the Outset

We start with some domain and expand it in each dimension by (symbolic) tile size - 1, except for upper bounds.

{[i,j]: 0≤i≤10 and i≤j≤i+10}

becomes

{[i,j]: -(ts-1)≤i≤10 and -(ts-1)+i≤j and -(ts-1)+j≤i+10}

Computing the Inset

We start with some domain and shrink it in each dimension by (symbolic) tile size - 1, except for lower bounds.

{[i,j]: 0≤i≤10 and i≤j≤i+10}

becomes

{[i,j]: 0≤i≤10-(ts-1) and i≤j-(ts-1) and j≤i+10-(ts-1)}
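These two sets can be sanity-checked for a concrete tile size: a tile anchored in the inset is fully contained in the domain, and any tile touching the domain has its origin in the outset (a toy check over a bounded grid; tile origins here are arbitrary points, not only multiples of ts):

```python
ts = 4
def dom(i, j):    return 0 <= i <= 10 and i <= j <= i + 10
def outset(i, j): return (-(ts-1) <= i <= 10 and
                          i - (ts-1) <= j and j - (ts-1) <= i + 10)
def inset(i, j):  return (0 <= i <= 10 - (ts-1) and
                          i <= j - (ts-1) and j <= i + 10 - (ts-1))

for i in range(-2 * ts, 30):
    for j in range(-2 * ts, 30):
        tile = [(i + a, j + b) for a in range(ts) for b in range(ts)]
        if inset(i, j):                       # origin in the inset: full tile
            assert all(dom(p, q) for p, q in tile)
        if any(dom(p, q) for p, q in tile):   # non-empty tile: origin in outset
            assert outset(i, j)
```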

Syntactic Manipulation

We cannot use polyhedral code generators, so we go back to modifying the AST:
- modify the loop bounds to get loops that visit the outset
- add guards to switch between the point loops
Up to here is HiTLoG/PrimeTile.

Problem: Parallelization

After tiling, there is parallelism. However, it requires skewing of tiles, and we need non-polyhedral skewing. The key equation:

time = ti1/ts1 + ti2/ts2 + … + tid/tsd

where
- d: number of tiled dimensions
- ti: tile origins
- ts: tile sizes

D-Tiling

The equation enables skewing of tiles: if the time or any one of the tile origins is unknown, it can be computed from the others. Generated code (tix is the d-1th tile origin):

for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
    for (tix=tixLB:tixUB) {
      tid = f(time, ti1, …, tix);
      //compute tile ti1,ti2,…,tix,tid
    }
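Assuming the wavefront equation above (time equals the sum of the tile coordinates, i.e., tile origins divided by tile sizes), any single unknown origin can be recovered from the time and the remaining origins (toy values, aligned tile origins):

```python
def wavefront_time(origins, sizes):
    # tile coordinate = tile origin / tile size (exact for aligned origins)
    return sum(o // s for o, s in zip(origins, sizes))

sizes   = (4, 8, 2)
origins = (12, 16, 6)                 # tile coordinates (3, 2, 3)
time = wavefront_time(origins, sizes)
assert time == 8

# recover the last tile origin (tid) from the time and the other origins,
# playing the role of f(time, ti1, ..., tix) in the generated code
tid = sizes[-1] * (time - wavefront_time(origins[:-1], sizes[:-1]))
assert tid == origins[-1]
```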

Distributed Memory Parallelization

Problems implicitly handled by shared memory now need explicit treatment.
Communication:
- Which processors need to send/receive?
- Which data to send/receive?
- How to manage communication buffers?
Data partitioning:
- How do you allocate memory across nodes?

MPI Code Generator

Distributed memory parallelization:
- tiling based
- parameterized tile sizes
- C+MPI implementation
Uniform dependences as the key enabler:
- many affine dependences can be uniformized
- shared memory performance carries over to distributed memory
- scales as well as PLuTo, but to multiple nodes

Related Work (Polyhedral)

Polyhedral approaches:
- initial idea [Amarasinghe 1993]
- analysis for fixed-size tiling [Claßen 2006]
- further optimization [Bondhugula 2011]
"Brute force" polyhedral analysis for handling communication:
- no hope of handling parametric tile sizes
- can handle arbitrary affine programs

Outline

- Introduction
- "Uniform-ness" of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
- Comparison against PLuTo with PolyBench
- Conclusions and Future Work

Affine vs Uniform

Affine dependences: f = Ax+b
Examples: (i,j->j,i), (i,j->i,i), (i->0)
Uniform dependences: f = Ix+b
Examples: (i,j->i-1,j), (i->i-1)
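The distinction is mechanical: a dependence f(x) = Ax + b is uniform exactly when A is the identity, so only a constant offset b remains (the helper name is ours):

```python
def is_uniform(A):
    """f(x) = Ax + b is uniform iff A is the identity matrix."""
    n = len(A)
    return (all(len(row) == n for row in A) and
            all(A[r][c] == (1 if r == c else 0)
                for r in range(n) for c in range(n)))

assert not is_uniform([[0, 1], [1, 0]])   # (i,j -> j,i): transpose, affine only
assert not is_uniform([[1, 0], [1, 0]])   # (i,j -> i,i): affine only
assert is_uniform([[1, 0], [0, 1]])       # (i,j -> i-1,j): A = I, b = (-1, 0)
assert not is_uniform([[0]])              # (i -> 0): A = [[0]], affine only
```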

Uniformization

The affine dependence (i->0) becomes the uniform dependence (i->i-1) after uniformization.

Uniformization

Uniformization is a classic technique:
- "solved" in the 1980s
- has been "forgotten" in the multi-core era
Any affine dependence can be uniformized by adding a dimension [Roychowdhury 1988].
Nullspace pipelining:
- simple technique for uniformization
- many dependences are uniformized
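Pipelining the broadcast dependence (i->0) into the chain (i->i-1) preserves the values read, as a toy check shows:

```python
N = 8
x0 = 42
# affine dependence (i -> 0): every iteration reads the same cell
direct = [x0 for i in range(N)]

# after pipelining: the value travels through the chain c[i] = c[i-1],
# a uniform dependence (i -> i-1)
c = [0] * N
c[0] = x0
for i in range(1, N):
    c[i] = c[i - 1]

assert c == direct    # every iteration still sees the broadcast value
```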

Uniformization and Tiling

Uniformization does not influence tilability.

PolyBench [Pouchet 2010]

Collection of 30 polyhedral kernels, proposed by Pouchet as a benchmark for polyhedral compilation. Goal: a benchmark small enough that individual results are reported; no averages.
Kernels from:
- data mining
- linear algebra (kernels, solvers)
- dynamic programming
- stencil computations

Uniform-ness of PolyBench

5 of the kernels are "incorrect" and are excluded.
Embedding: match dimensions of statements.
Phase detection: separate the program into phases; the output of a phase is used as input to the other.

Stage                  | Fully uniform programs
Uniform at start       | 8/25 (32%)
After embedding        | 13/25 (52%)
After pipelining       | 21/25 (84%)
After phase detection  | 24/25 (96%)

Outline

- Introduction
- Uniform-ness of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
- Comparison against PLuTo with PolyBench
- Conclusions and Future Work

Basic Strategy: Tiling

We focus on tilable programs.

Dependences in Tilable Space

All dependences are in the non-positive direction.

Wave-front Parallelization

All tiles with the same color, i.e., on the same wave-front, can run in parallel.
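The wave-front property can be checked on a grid of tiles: with dependences in the non-positive direction, every producer of a tile lies on a strictly earlier anti-diagonal (a toy check):

```python
ntiles = 4
tiles = {(ti, tj) for ti in range(ntiles) for tj in range(ntiles)}
deps = [(-1, 0), (0, -1)]     # tile-level dependences, non-positive direction

for ti, tj in tiles:
    for di, dj in deps:
        src = (ti + di, tj + dj)
        if src in tiles:
            # the producer is on an earlier wave-front (smaller ti+tj),
            # so tiles sharing a ti+tj value never depend on each other
            assert src[0] + src[1] < ti + tj
```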

Assumptions

- Uniform in at least one of the dimensions; the uniform dimension is made outermost
- Tilable space is fully permutable
- One-dimensional processor allocation
- Large enough tile sizes: dependences do not span multiple tiles
Then, communication is extremely simplified.

Processor Allocation

The outermost tile loop is distributed across the processors (P0, P1, P2, P3).

Values to be Communicated

Faces of the tiles (may be thicker than 1).

Naïve Placement of Send and Receive Codes

The receiver is the consumer tile of the values.

Problems in Naïve Placement

The receiver is in the next wave-front time step.

Problems in Naïve Placement

The receiver is in the next wave-front time step, so the number of communications "in-flight" equals the amount of parallelism.
MPI_Send will deadlock:
- it may not return control if the system buffer is full
- asynchronous communication is required
- you must manage your own buffers; the required buffer count equals the amount of parallelism, i.e., the number of virtual processors

Proposed Placement of Send and Receive Codes

The receiver is one tile below the consumer.

Placement within a Tile

Naïve placement: Receive -> Compute -> Send
Proposed placement:
1. Issue asynchronous receive (MPI_Irecv)
2. Compute
3. Issue asynchronous send (MPI_Isend)
4. Wait for values to arrive
Overlap of computation and communication; only two buffers per physical processor.
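The two-buffer claim can be illustrated with a toy simulation: while the compute step consumes one buffer, the next message lands in the other (a sketch of the schedule, not MPI code):

```python
# messages that will arrive at this processor, one per tile, in order
incoming = list(range(10))
buf = [None, None]                   # the two communication buffers
cur = 0
consumed = []

buf[cur] = incoming[0]               # initial (blocking) receive
for step in range(1, len(incoming)):
    nxt = 1 - cur
    buf[nxt] = incoming[step]        # MPI_Irecv into the spare buffer...
    consumed.append(buf[cur])        # ...overlapped with compute on the other
    cur = nxt                        # MPI_Wait, then swap buffers
consumed.append(buf[cur])            # compute on the final tile

assert consumed == incoming          # nothing lost or reordered with 2 buffers
```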

Evaluation

Compare performance with PLuTo (the shared memory version with the same strategy).
Cray: 24 cores per node, up to 96 cores. Goal: similar scaling as PLuTo. Tile sizes are searched with educated guesses.
PolyBench:
- 7 kernels are too small
- 3 cannot be tiled or have limited parallelism
- 9 cannot be used due to PLuTo/PolyBench issues

Performance Results

Linear extrapolation from the speed-up at 24 cores; broadcast cost is at most 2.5 seconds.