Optimizing Compilers for Modern Architectures Coarse-Grain Parallelism Chapter 6 of Allen and Kennedy

Optimizing Compilers for Modern Architectures Review
SMP machines have multiple processors, all accessing a central memory. The processors are independent and can run separate processes. Starting processes and synchronizing between them is expensive.

Optimizing Compilers for Modern Architectures Synchronization
A basic synchronization element is the barrier. A barrier in a program forces all processes to reach a certain point before execution continues. Bus contention can cause slowdowns.
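A minimal sketch of barrier behavior, assuming OpenMP-style Fortran directives (the directive syntax is an assumption here; the slides themselves use a generic PARALLEL DO notation):

      !$OMP PARALLEL
      !$OMP DO
            DO I = 1, N
               A(I) = B(I) + C(I)
            ENDDO
      !$OMP END DO               ! implicit barrier: no thread continues until
                                 ! every iteration of the loop above is finished
      !$OMP DO
            DO I = 1, N
               D(I) = A(N+1-I) * 2.0   ! reads elements of A computed by other
            ENDDO                      ! threads, so the barrier above is essential
      !$OMP END DO
      !$OMP END PARALLEL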

Optimizing Compilers for Modern Architectures Single Loops
The analog of scalar expansion is privatization: a temporary can be given a separate private instance for each iteration.

      DO I = 1,N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO

      PARALLEL DO I = 1,N
      PRIVATE t
S1       t = A(I)
S2       A(I) = B(I)
S3       B(I) = t
      ENDDO

Optimizing Compilers for Modern Architectures Privatization
Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x. Privatizability can be formulated as a data-flow problem. NOTE: We can also do this in SSA form, declaring a variable x private if its SSA form does not contain a phi function at the loop entry.
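An illustrative pair of loops (not from the slides; the array and variable names are made up):

      DO I = 1, N                      ! T is privatizable: every use of T is
S1       T = A(I) + B(I)               ! preceded by a definition of T in the
S2       C(I) = T * T                  ! same iteration
      ENDDO

      DO I = 1, N                      ! T is NOT privatizable: when the IF is
S1       IF (A(I) .GT. 0.0) T = A(I)   ! false, the use in S2 sees the value of T
S2       C(I) = T                      ! from a previous iteration
      ENDDO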

Optimizing Compilers for Modern Architectures Loop Distribution
Loop distribution eliminates carried dependences within a loop, so it often creates opportunities for outer-loop parallelism (see the sketch below). However, extra barriers are needed to keep the resulting dependent loops from executing out of order, and this overhead may outweigh the savings from parallelism. Attempt other transformations before this one.
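A minimal sketch of distribution (illustrative; not taken from the slides):

      DO I = 2, N                 ! S2 reads X(I-1), so the dependence
S1       X(I) = Y(I) + 1.0        ! S1 -> S2 is carried by the loop
S2       Z(I) = X(I-1) * W(I)
      ENDDO

      PARALLEL DO I = 2, N        ! after distribution, each loop carries
S1       X(I) = Y(I) + 1.0        ! no dependence and can run in parallel,
      END PARALLEL DO             ! with a barrier between the two loops
      PARALLEL DO I = 2, N
S2       Z(I) = X(I-1) * W(I)
      END PARALLEL DO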

Optimizing Compilers for Modern Architectures Alignment
Many carried dependences are due to array alignment issues. If we can align all the references, the dependences disappear and parallelism becomes possible.

      DO I = 2,N
         A(I) = B(I)+C(I)
         D(I) = A(I-1)*2.0
      ENDDO

      DO I = 1,N
         IF (I .GT. 1) A(I) = B(I)+C(I)
         IF (I .LT. N) D(I+1) = A(I)*2.0
      ENDDO

Optimizing Compilers for Modern Architectures Loop Fusion
Loop distribution was a method for separating the parallel parts of a loop, and our approach was to find the maximal loop distribution. The maximal distribution, however, often produces parallelizable components that are too small for efficient parallelization. Two obvious solutions:
—Strip mine large loops to create larger granularity (see the sketch below).
—Perform maximal distribution, and fuse the parallelizable loops back together.
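A strip-mining sketch (illustrative, not from the slides; the strip size of 64 is an arbitrary choice):

      PARALLEL DO IS = 1, N, 64          ! each strip of 64 iterations becomes
         DO I = IS, MIN(IS+63, N)        ! one unit of parallel work, giving
            A(I) = B(I) + C(I)           ! coarser granularity than one
         ENDDO                           ! iteration per unit
      END PARALLEL DO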

Optimizing Compilers for Modern Architectures Fusion Safety
Definition: A loop-independent dependence between statements S1 and S2, in loops L1 and L2 respectively, is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the reverse direction.

      DO I = 1,N
S1       A(I) = B(I)+C
      ENDDO
      DO I = 1,N
S2       D(I) = A(I+1)+E
      ENDDO

      DO I = 1,N
S1       A(I) = B(I)+C
S2       D(I) = A(I+1)+E
      ENDDO

Optimizing Compilers for Modern Architectures Fusion Safety
We also shouldn't fuse loops if fusing them would violate the ordering of the dependence graph. Ordering Constraint: Two loops cannot be validly fused if there exists a path of loop-independent dependences between them that contains a loop or statement not being fused with them. For example, fusing L1 with L3 violates the ordering constraint: the fused loop {L1,L3} would have to occur both before and after the node L2 (see the sketch below).
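A concrete instance of the ordering constraint (illustrative; the loops below are not from the slides):

      DO I = 1, N               ! L1: parallel
         A(I) = B(I) + 1.0
      ENDDO
      DO I = 1, N               ! L2: sequential recurrence on C, not fused
         C(I) = A(I) + C(I-1)
      ENDDO
      DO I = 1, N               ! L3: parallel
         D(I) = C(I) * 2.0
      ENDDO

L1 must run before L2 (L2 reads A) and L2 must run before L3 (L3 reads C), so a fused {L1,L3} loop would have to execute both before and after L2, which is impossible.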

Optimizing Compilers for Modern Architectures Fusion Profitability
Parallel loops should generally not be merged with sequential loops. Definition: An edge between two statements, in loops L1 and L2 respectively, is said to be parallelism-inhibiting if, after merging L1 and L2, the dependence is carried by the combined loop.

      DO I = 1,N
S1       A(I+1) = B(I) + C
      ENDDO
      DO I = 1,N
S2       D(I) = A(I) + E
      ENDDO

      DO I = 1,N
S1       A(I+1) = B(I) + C
S2       D(I) = A(I) + E
      ENDDO

Optimizing Compilers for Modern Architectures Loop Interchange
Loop interchange moves dependence-free loops to the outermost level.
Theorem — In a perfect loop nest, a particular loop can be parallelized at the outermost level if and only if its column of the direction matrix for that nest contains only "=" entries.
(Vectorization, by contrast, moves loops to the innermost level.)

Optimizing Compilers for Modern Architectures Loop Interchange
      DO I = 1, N
         DO J = 1, N
            A(I+1, J) = A(I, J) + B(I, J)
         ENDDO
      ENDDO
OK for vectorization: the inner J loop carries no dependence. Problematic for parallelization: the dependence from A(I+1, J) to A(I, J) has direction vector (<, =), so it is carried by the outer I loop.

Optimizing Compilers for Modern Architectures Loop Interchange
      PARALLEL DO J = 1, N
         DO I = 1, N
            A(I+1, J) = A(I, J) + B(I, J)
         ENDDO
      END PARALLEL DO

Optimizing Compilers for Modern Architectures Loop Interchange
Working with the direction matrix (a worked example follows below):
—Move each loop whose column contains only "=" entries into the outermost position, parallelize it, and remove its column from the matrix.
—Move the loop with the most "<" entries into the next outermost position, sequentialize it, and eliminate its column along with any rows representing dependences it carries.
—Repeat from the first step.
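Applied to the nest two slides back (an illustration, not part of the original slides): the single dependence from A(I+1, J) to A(I, J) has direction vector (<, =), so the direction matrix has one row:

      I  J
      <  =

The J column contains only "=" entries, so the J loop is moved outermost and parallelized and its column is removed; the remaining I column contains "<", so the I loop is sequentialized in the next position. This yields exactly the PARALLEL DO J / DO I nest shown above.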

Optimizing Compilers for Modern Architectures Loop Reversal
      DO I = 2, N+1
         DO J = 2, M+1
            DO K = 1, L
               A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
            ENDDO
         ENDDO
      ENDDO
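The reasoning behind the next slide, spelled out (this explanation is not on the original slide): the two references A(I, J-1, K+1) and A(I-1, J, K+1) give dependences with direction vectors (=, <, >) and (<, =, >) over (I, J, K), so the direction matrix is

      I  J  K
      =  <  >
      <  =  >

No column contains only "=", so no loop can be parallelized at the outermost level as written. Reversing the K loop (running it from L down to 1) turns the ">" entries in the K column into "<"; the K loop can then be moved outermost and sequentialized, carrying both dependences, which removes both rows from the matrix and leaves the I and J loops free to run in parallel, as the next slide shows.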

Optimizing Compilers for Modern Architectures Loop Reversal
      DO K = L, 1, -1
         PARALLEL DO I = 2, N+1
            PARALLEL DO J = 2, M+1
               A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
            END PARALLEL DO
         END PARALLEL DO
      ENDDO
Loop reversal increases the range of options available to the loop selection heuristics.

Optimizing Compilers for Modern Architectures Pipeline Parallelism
Pipeline parallelism uses the DOACROSS construct, a Fortran extension. It is useful where full parallelization is not available, but it has high synchronization costs.

      DO I = 2, N-1
         DO J = 2, N-1
            A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
         ENDDO
      ENDDO

Optimizing Compilers for Modern Architectures Pipeline Parallelism
      DOACROSS I = 2, N-1
         POST (EV(1))
         DO J = 2, N-1
            WAIT (EV(J-1))
            A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
            POST (EV(J))
         ENDDO
      ENDDO

Optimizing Compilers for Modern Architectures Pipeline Parallelism (figure slide; the illustration is not reproduced in the transcript)

Optimizing Compilers for Modern Architectures Pipeline Parallelism
      DOACROSS I = 2, N-1
         POST (EV(1))
         K = 0
         DO J = 2, N-1, 2
            K = K+1
            WAIT (EV(K))
            DO j = J, MIN(J+1, N-1)
               A(I, j) = .25 * (A(I-1, j) + A(I, j-1) + A(I+1, j) + A(I, j+1))
            ENDDO
            POST (EV(K+1))
         ENDDO
      ENDDO

Optimizing Compilers for Modern Architectures Pipeline Parallelism (figure slide; the illustration is not reproduced in the transcript)