Affine Partitioning for Parallelism & Locality
Amy Lim, Stanford University

Useful Transforms for Parallelism & Locality

INTERCHANGE       FOR i                      FOR j
                    FOR j              =>      FOR i
                      A[i,j] =                   A[i,j] =

REVERSAL          FOR i = 1 TO n       =>  FOR i = n DOWNTO 1
                    A[i] =                   A[i] =

SKEWING           FOR i = 1 TO n           FOR i = 1 TO n
                    FOR j = 1 TO n     =>    FOR k = i+1 TO i+n
                      A[i,j] =                 A[i,k-i] =

FUSION/FISSION    FOR i = 1 TO n           FOR i = 1 TO n
                    A[i] =             <=>   A[i] =
                  FOR i = 1 TO n             B[i] =
                    B[i] =

REINDEXING        FOR i = 1 TO n           A[1] = B[0]
                    A[i] = B[i-1]      =>  FOR i = 1 TO n-1
                    C[i] = A[i+1]            A[i+1] = B[i]
                                             C[i] = A[i+1]
                                           C[n] = A[n+1]

Traditional approach: decide, one transform at a time, whether it is legal & desirable to apply it.
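For instance, the skewing row can be checked mechanically. A minimal Python sketch (my illustration, not from the slides) confirming that the skewed loop visits exactly the same iterations:

  n = 4
  original = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
  # Skewed form from the slide: FOR i = 1 TO n, FOR k = i+1 TO i+n, body A[i, k-i]
  skewed = [(i, k - i) for i in range(1, n + 1) for k in range(i + 1, i + n + 1)]
  assert original == skewed  # same iterations, visited under the new index k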

Question: How to combine the transformations?

Affine mappings [Lim & Lam, POPL '97; ICS '99]
- Domain: arbitrary loop nesting, affine loop indices; each instruction is optimized separately.
- Unifies: permutation, skewing, reversal, fusion, fission, and statement reordering.
- Supports blocking across all (non-perfectly nested) loops.
- Optimal: maximum degree of parallelism & minimum degree of synchronization.
- Minimizes communication by aligning the computation and by pipelining.
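Each of these transforms is an affine map on the iteration vector, which is what lets one formalism subsume them all. Standard examples (my illustration, not reproduced from the slide):

  interchange:  (i, j) -> (j, i)        permutation matrix [[0 1], [1 0]]
  reversal:     i -> n + 1 - i          linear part -1, plus a bound shift
  skewing:      (i, j) -> (i, i + j)    matrix [[1 0], [1 1]]; the slide's k = i + j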

Loop Transforms: Cholesky factorization example

DO 1 J = 0, N
  I0 = MAX(-M, -J)
  DO 2 I = I0, -1
    DO 3 JJ = I0 - I, -1
      DO 3 L = 0, NMAT
3       A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
    DO 2 L = 0, NMAT
2     A(L,I,J) = A(L,I,J) * A(L,0,I+J)
  DO 4 L = 0, NMAT
4   EPSS(L) = EPS * A(L,0,J)
  DO 5 JJ = I0, -1
    DO 5 L = 0, NMAT
5     A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
  DO 1 L = 0, NMAT
1   A(L,0,J) = 1. / SQRT(ABS(EPSS(L) + A(L,0,J)))

DO 6 I = 0, NRHS
  DO 7 K = 0, N
    DO 8 L = 0, NMAT
8     B(I,L,K) = B(I,L,K) * A(L,0,K)
    DO 7 JJ = 1, MIN(M, N-K)
      DO 7 L = 0, NMAT
7       B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
  DO 6 K = N, 0, -1
    DO 9 L = 0, NMAT
9     B(I,L,K) = B(I,L,K) * A(L,0,K)
    DO 6 JJ = 1, MIN(M, K)
      DO 6 L = 0, NMAT
6       B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

Results for Optimizing Perfect Nests

[Figure: speedup on a Digital TurboLaser with 8 300 MHz processors.]

Optimizing Arbitrary Loop Nesting Using Affine Partitions

(This slide repeats the Cholesky code above, with the arrays A, B, and EPSS each annotated with the loop index L: the affine partitions distribute each array along its L dimension.)

Results with Affine Partitioning + Blocking

A Simple Example

FOR i = 1 TO n DO
  FOR j = 1 TO n DO
    A[i,j] = A[i,j] + B[i-1,j];    (S1)
    B[i,j] = A[i,j-1] * B[i,j];    (S2)

[Figure: the (i, j) iteration space, showing the instances of S1 and S2.]

Best Parallelization Scheme

SPMD code (let p be the processor's ID number):

if (1-n <= p <= n) then
  if (1 <= p) then
    B[p,1] = A[p,0] * B[p,1];                      (S2)
  for i1 = max(1, 1+p) to min(n, n-1+p) do
    A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];        (S1)
    B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];      (S2)
  if (p <= 0) then
    A[n+p,n] = A[n+p,n] + B[n+p-1,n];              (S1)

The solution can be expressed as affine partitions:
- S1: execute iteration (i, j) on processor i - j.
- S2: execute iteration (i, j) on processor i - j + 1.
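As a sanity check, the SPMD code can be replayed one processor ID at a time and compared against the original nest; because the partition is communication-free, any processor order yields the same result. A minimal Python sketch (mine, not from the slides):

  import random

  n = 6
  A0 = [[random.random() for _ in range(n + 2)] for _ in range(n + 2)]
  B0 = [[random.random() for _ in range(n + 2)] for _ in range(n + 2)]

  # Sequential reference: the original loop nest.
  A = [row[:] for row in A0]; B = [row[:] for row in B0]
  for i in range(1, n + 1):
      for j in range(1, n + 1):
          A[i][j] = A[i][j] + B[i - 1][j]          # S1
          B[i][j] = A[i][j - 1] * B[i][j]          # S2

  # SPMD version, replayed over all processor IDs p on one machine.
  A2 = [row[:] for row in A0]; B2 = [row[:] for row in B0]
  for p in range(1 - n, n + 1):
      if 1 <= p:
          B2[p][1] = A2[p][0] * B2[p][1]                             # S2
      for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
          A2[i1][i1 - p] = A2[i1][i1 - p] + B2[i1 - 1][i1 - p]       # S1
          B2[i1][i1 - p + 1] = A2[i1][i1 - p] * B2[i1][i1 - p + 1]   # S2
      if p <= 0:
          A2[n + p][n] = A2[n + p][n] + B2[n + p - 1][n]             # S1

  assert A == A2 and B == B2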

Maximum Parallelism & No Communication

Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor:

  ∀ i_j, i_k :  B_j i_j ≥ 0, B_k i_k ≥ 0, F_xj(i_j) = F_xk(i_k)
      ⟹  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.

[Figure: the iteration spaces of two loops, mapped into the array by the accesses F_1(i_1), F_2(i_2) and onto processor IDs by the partitions C_1(i_1), C_2(i_2).]
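Instantiating this constraint on the simple example (a worked check of mine; the result matches the partitions on the "Best Parallelization Scheme" slide):

  S2 at (i, j) reads A[i, j-1], written by S1 at (i, j-1), so:  C_1(i, j-1) = C_2(i, j)
  S1 at (i, j) reads B[i-1, j], written by S2 at (i-1, j), so:  C_2(i-1, j) = C_1(i, j)

Both equations hold for C_1(i, j) = i - j and C_2(i, j) = i - j + 1, each of rank 1: one dimension of communication-free parallelism.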

Algorithm

Starting from the partition constraint

  ∀ i_j, i_k :  B_j i_j ≥ 0, B_k i_k ≥ 0, F_xj(i_j) = F_xk(i_k)  ⟹  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations:
  - use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in C and λ;
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers λ, obtaining systems of linear equations A C = 0.
- Find solutions using linear algebra techniques:
  - the null space of the matrix A is a solution for C with maximum rank (see the sketch below).
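The final null-space step takes only a few lines of linear algebra. A minimal Python sketch (my illustration, not the authors' implementation), using the constraint system of the simple example with unknowns C = (a1, b1, c1, a2, b2, c2) for C1(i,j) = a1*i + b1*j + c1 and C2(i,j) = a2*i + b2*j + c2:

  import sympy as sp

  # Rows encode, coefficient by coefficient, the two dependence constraints
  # C1(i, j-1) = C2(i, j) and C2(i-1, j) = C1(i, j) for all i, j.
  A = sp.Matrix([
      [1,  0,  0, -1,  0,  0],   # a1 - a2 = 0
      [0,  1,  0,  0, -1,  0],   # b1 - b2 = 0
      [0, -1,  1,  0,  0, -1],   # c1 - b1 - c2 = 0
      [0,  0, -1, -1,  0,  1],   # c2 - a2 - c1 = 0
  ])
  for v in A.nullspace():        # basis of the solution space for C
      print(v.T)
  # Up to sign and a common constant shift, the null space yields
  # C1(i,j) = i - j and C2(i,j) = i - j + 1, as on the earlier slide.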

Pipelining: Alternating Direction Integration Example

Requires transposing data:

  DO J = 1 TO N (parallel)
    DO I = 1 TO N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 TO N
    DO I = 1 TO N (parallel)
      A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:

  DO J = 1 TO N (parallel)
    DO I = 1 TO N
      A(I,J) = f(A(I,J), A(I-1,J))
  DO J = 1 TO N (pipelined)
    DO I = 1 TO N
      A(I,J) = g(A(I,J), A(I,J-1))

Finding the Maximum Degree of Pipelining

Let F_xj be an access to array x in statement j,
    i_j be an iteration index for statement j,
    B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage:

  ∀ i_j, i_k :  B_j i_j ≥ 0, B_k i_k ≥ 0, (i_j ≺ i_k) ∧ (F_xj(i_j) = F_xk(i_k))
      ⟹  T_j(i_j) ≼ T_k(i_k)   (lexicographically)

with the objective of maximizing the rank of T_j.

[Figure: the iteration spaces of two loops, mapped into the array by the accesses F_1(i_1), F_2(i_2) and onto time stages by the mappings T_1(i_1), T_2(i_2).]
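A worked instance (mine) for the pipelined nest of the ADI example: within the second nest, the dependence of A(I,J) = g(A(I,J), A(I,J-1)) runs from iteration (J-1, I) to (J, I), so the identity mapping T(J, I) = (J, I) satisfies the constraint, since (J-1, I) ≺ (J, I) lexicographically. T has rank 2, and by the key insight on the next slide this leaves rank(T) - 1 = 1 degree of pipelined parallelism: processors laid out along I proceed in a wavefront, exchanging only boundary values of A.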

Key Insight

- Choice in the time mapping ⇒ (pipelined) parallelism.
- Degrees of parallelism = rank(T) - 1.

Putting it All Together

- Find maximum outer-loop parallelism with minimum synchronization:
  - Divide the program into strongly connected components.
  - Apply the processor-mapping algorithm (no communication) to the program.
  - If no parallelism is found, apply the time-mapping algorithm to find pipelining.
  - If no pipelining is found either (the outer loop stays sequential), repeat the process on the inner loops.
- Minimize communication:
  - Use a greedy method to order communicating pairs.
  - Try to find communication-free, or neighborhood-only, communication by solving similar equations.
- Aggregate computations on consecutive data to improve spatial locality.

Use of Affine Partitioning in Locality Optimization

- Promotes array contraction:
  - finds independent threads and shortens the live ranges of variables.
- Supports blocking of imperfectly nested loops:
  - finds the largest fully permutable loop nest via affine partitioning;
  - a fully permutable loop nest is blockable (see the sketch below).
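To illustrate the last point: in a fully permutable nest any reordering of the loops is legal, and tiling is just such a reordering. A minimal Python sketch (my example, not from the slides), shown on matrix multiplication, whose i/j/k nest is fully permutable:

  n, T = 32, 8       # problem size and tile size (T divides n here)
  A = [[float(i + j) for j in range(n)] for i in range(n)]
  B = [[float(i - j) for j in range(n)] for i in range(n)]
  C1 = [[0.0] * n for _ in range(n)]
  C2 = [[0.0] * n for _ in range(n)]

  # Original nest.
  for i in range(n):
      for j in range(n):
          for k in range(n):
              C1[i][j] += A[i][k] * B[k][j]

  # Blocked (tiled) nest: the same iterations, grouped into T x T x T tiles
  # so each tile's data can stay in cache; legal because any reordering of
  # a fully permutable nest preserves the dependences.
  for ii in range(0, n, T):
      for jj in range(0, n, T):
          for kk in range(0, n, T):
              for i in range(ii, ii + T):
                  for j in range(jj, jj + T):
                      for k in range(kk, kk + T):
                          C2[i][j] += A[i][k] * B[k][j]

  assert C1 == C2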