Optimizing Compilers for Modern Architectures: Coarse-Grain Parallelism (Chapter 6 of Allen and Kennedy)


Introduction
Previously, our transformations targeted vector and superscalar architectures. In this lecture, we turn to transformations for symmetric multiprocessor (SMP) machines. The difference between the two sets of transformations is largely one of granularity.

Review
SMP machines have multiple processors, all accessing a central shared memory. The processors are independent and can run separate processes. Starting processes and synchronizing between processes is expensive.

Synchronization
A basic synchronization element is the barrier. A barrier in a program forces all processes to reach a certain point before execution continues. Bus contention can also cause slowdowns.

Single Loops
The analog of scalar expansion is privatization: the temporary is given a separate private copy for each iteration.

      DO I = 1,N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO

becomes

      PARALLEL DO I = 1,N
      PRIVATE t
S1       t = A(I)
S2       A(I) = B(I)
S3       B(I) = t
      ENDDO

Privatization
Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x. Privatizability can be stated as a data-flow problem: x must not be upwards exposed at the loop entry. Equivalently, we can declare x private if its SSA graph contains no phi function for x at the loop entry.
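As a hypothetical illustration (this fragment is not from the text), T below is privatizable: every path from the loop entry to its use in S3 passes through a definition in S1 or S2, so each iteration can work on its own private copy.

      DO I = 1,N
         IF (A(I) .GT. 0.0) THEN
S1          T = A(I)
         ELSE
S2          T = -A(I)
         END IF
S3       B(I) = T * 2.0
      ENDDO

If the use in S3 instead appeared before the IF, T would be upwards exposed at the loop entry (its SSA form would have a phi function for T there), and T would not be privatizable.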

Array Privatization
We also need to privatize array variables. For iteration J, the upwards-exposed uses are those in the loop body that are not covered by definitions from earlier iterations.

      DO I = 1,100
S0       T(1) = X
L1       DO J = 2,N
S1          T(J) = T(J-1) + B(I,J)
S2          A(I,J) = T(J)
         ENDDO
      ENDDO

For this fragment, T(1) is the only upwards-exposed reference in the inner loop, and it is defined by S0 earlier in the same iteration of I, so T can be privatized for the I loop.

Array Privatization
Using this analysis, we get the following code:

      PARALLEL DO I = 1,100
      PRIVATE t
S0       t(1) = X
L1       DO J = 2,N
S1          t(J) = t(J-1) + B(I,J)
S2          A(I,J) = t(J)
         ENDDO
      ENDDO

Loop Distribution
Loop distribution eliminates carried dependences and consequently often creates opportunities for outer-loop parallelism. However, we must add barriers to keep the dependent loops from executing out of order, so the synchronization overhead may outweigh the parallel savings. Attempt other transformations before resorting to this one.
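As a minimal sketch (using the loop that also appears on the next slide), distributing the two statements into separate loops leaves each loop free of carried dependences; a barrier between them preserves the original ordering:

      DO I = 2,N
S1       A(I) = B(I) + C(I)
S2       D(I) = A(I-1) * 2.0
      ENDDO

becomes

      PARALLEL DO I = 2,N
S1       A(I) = B(I) + C(I)
      ENDDO
      ! barrier: all iterations above finish before any iteration below reads A
      PARALLEL DO I = 2,N
S2       D(I) = A(I-1) * 2.0
      ENDDO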

Alignment
Many carried dependences are due to array alignment issues. If we can align all of the references, the dependences go away and parallelism becomes possible.

      DO I = 2,N
         A(I) = B(I) + C(I)
         D(I) = A(I-1) * 2.0
      ENDDO

becomes

      DO I = 1,N
         IF (I .GT. 1) A(I) = B(I) + C(I)
         IF (I .LT. N) D(I+1) = A(I) * 2.0
      ENDDO

Alignment
There are other ways to align the loop. One uses a modular index to bring the definition and use of each A element into the same iteration:

      DO I = 2,N
         J = MOD(I+N-4, N-1) + 2
         A(J) = B(J) + C(J)
         D(I) = A(I-1) * 2.0
      ENDDO

Another peels the boundary statements out of the loop:

      D(2) = A(1) * 2.0
      DO I = 2,N-1
         A(I) = B(I) + C(I)
         D(I+1) = A(I) * 2.0
      ENDDO
      A(N) = B(N) + C(N)

Alignment
If an array is involved in a recurrence, alignment is not possible. If two dependences between the same pair of statements have different dependence distances, alignment alone does not work. We can fix the second case by replicating code:

      DO I = 1,N
         A(I+1) = B(I) + C
         X(I) = A(I+1) + A(I)
      ENDDO

becomes

      DO I = 1,N
         A(I+1) = B(I) + C
         IF (I .EQ. 1) THEN
            t = A(I)
         ELSE
            t = B(I-1) + C    ! replicated statement
         END IF
         X(I) = A(I+1) + t
      ENDDO

Alignment
Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependences in a single loop containing no recurrence, and in which the distance of each dependence is a constant independent of the loop index. We can establish this constructively. Let G = (V, E, δ) be a weighted graph in which each vertex v ∈ V is a statement and δ(v1, v2) is the dependence distance between v1 and v2. Let o: V → Z give the offset assigned to each vertex. G is said to be carry-free if o(v1) + δ(v1, v2) = o(v2) for every edge (v1, v2) ∈ E.
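For example, in the loop used earlier (S1: A(I) = B(I) + C(I), S2: D(I) = A(I-1) * 2.0) the single dependence edge has δ(S1, S2) = 1. Choosing offsets o(S1) = 0 and o(S2) = 1 gives o(S1) + δ(S1, S2) = 0 + 1 = 1 = o(S2), so the graph is carry-free; the aligned loop shown earlier realizes exactly these offsets by executing S2 one iteration later than S1.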

Alignment Procedure

procedure Align(V, E, δ, o)
   while V is not empty
      remove an element v from V
      for each (w,v) ∈ E
         if w ∈ V
            W ← W ∪ {w}
            o(w) ← o(v) - δ(w,v)
         else if o(w) != o(v) - δ(w,v)
            create vertex w'
            replace (w,v) with (w',v)
            replicate all edges into w onto w'
            W ← W ∪ {w'}
            o(w') ← o(v) - δ(w,v)
      for each (v,w) ∈ E
         if w ∈ V
            W ← W ∪ {w}
            o(w) ← o(v) + δ(v,w)
         else if o(w) != o(v) + δ(v,w)
            create vertex v'
            replace (v,w) with (v',w)
            replicate edges into v onto v'
            W ← W ∪ {v'}
            o(v') ← o(w) - δ(v,w)
end Align

Loop Fusion
Loop distribution was a method for separating the parallel parts of a loop. Our solution attempted to find the maximal loop distribution, but the maximal distribution often produces parallelizable components that are too small for efficient parallelization. Two obvious solutions:
— Strip mine large loops to create larger granularity (see the sketch below).
— Perform maximal distribution, then fuse the parallelizable loops back together.
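A minimal strip-mining sketch (hypothetical fragment, not from the text; the block size of 64 and the loop body are illustrative): each strip becomes one coarser unit of parallel work, with a sequential loop over the strip inside.

      PARALLEL DO IB = 1, N, 64
         DO I = IB, MIN(IB+63, N)
            A(I) = B(I) + C(I)
         ENDDO
      ENDDO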

Fusion Safety
Definition: A loop-independent dependence between statements S1 and S2, in loops L1 and L2 respectively, is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction.

      DO I = 1,N
S1       A(I) = B(I) + C
      ENDDO
      DO I = 1,N
S2       D(I) = A(I+1) + E
      ENDDO

Fused (invalid):

      DO I = 1,N
S1       A(I) = B(I) + C
S2       D(I) = A(I+1) + E
      ENDDO

After fusion, S2 in iteration I reads A(I+1) before S1 writes it in iteration I+1, so the original loop-independent dependence is reversed.

Fusion Safety
We must not fuse loops if fusing them would violate the ordering of the dependence graph. Ordering Constraint: two loops cannot be validly fused if there exists a path of loop-independent dependences between them that contains a loop or statement not being fused with them. For example, with loop-independent dependences L1 → L2 → L3, fusing L1 with L3 violates the ordering constraint: the fused node {L1,L3} would have to occur both before and after L2.
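A hypothetical three-loop fragment (not from the text) that exhibits this situation: L1 and L3 are parallel, L2 is sequential, and the dependences run L1 → L2 (on A), L2 → L3 (on S), and L1 → L3 (on A), so L1 and L3 cannot be fused around L2.

      DO I = 1,N              ! L1: parallel
         A(I) = B(I) + 1.0
      ENDDO
      S = 0.0
      DO I = 1,N              ! L2: sequential (sum recurrence on S)
         S = S + A(I)
      ENDDO
      DO I = 1,N              ! L3: parallel
         D(I) = A(I) * S
      ENDDO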

Fusion Profitability
Parallel loops should generally not be merged with sequential loops. Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if, after merging L1 and L2, the dependence is carried by the combined loop.

      DO I = 1,N
S1       A(I+1) = B(I) + C
      ENDDO
      DO I = 1,N
S2       D(I) = A(I) + E
      ENDDO

Fused:

      DO I = 1,N
S1       A(I+1) = B(I) + C
S2       D(I) = A(I) + E
      ENDDO

Each original loop is parallel, but in the fused loop S2 reads A(I) written by S1 in the previous iteration, so the combined loop carries the dependence and must run sequentially.

Typed Fusion
We start by classifying loops into two types: parallel and sequential. We then gather all edges that inhibit efficient fusion and call them bad edges. Given a loop dependence graph (V,E), we want to obtain a graph (V',E') by merging vertices of V subject to the following constraints:
— Bad Edge Constraint: vertices joined by a bad edge are not fused.
— Ordering Constraint: vertices joined by a path containing a non-parallel vertex are not fused.

Typed Fusion Procedure

procedure TypedFusion(G, T, type, B, t0)
   initialize all variables to zero
   set count[n] to the in-degree of each node n
   initialize W with all nodes of in-degree zero
   while W is not empty
      remove an element n with type t from W
      if t = t0 then
         if maxBadPrev[n] = 0 then p ← fused
         else p ← next[maxBadPrev[n]]
         if p != 0 then
            x ← node[p]
            num[n] ← num[x]
            update_successors(n, t)
            fuse x and n and call the result n
         else
            create_new_fused_node(n)
            update_successors(n, t)
      else
         create_new_node(n)
         update_successors(n, t)
end TypedFusion

Typed Fusion Example (figure): the original loop graph; the graph annotated with (maxBadPrev, p) and num; the graph after fusing parallel loops; the graph after fusing sequential loops.

Cohort Fusion
Given an outer loop containing some number of inner loops, we want to run some of the inner loops in parallel. We can do this as follows:
— Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop}.
— Put a barrier at the end of each identified cohort.
— Run TypedFusion again to fuse the parallel loops in each cohort.
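A hypothetical sketch of the result (array names and bodies are illustrative, not from the text): within each iteration of the outer loop, the fused parallel inner loops form cohorts separated by barriers.

      DO I = 1, M
         PARALLEL DO J = 1, N       ! cohort 1: two fused parallel loops
            A(I,J) = B(I,J) + 1.0
            C(I,J) = A(I,J) * 2.0
         ENDDO
         ! barrier: every A(I,J) above is complete before the next cohort reads it
         PARALLEL DO J = 1, N       ! cohort 2: reads A with a reversed index,
            D(I,J) = A(I,N-J+1)     ! so fusing it into cohort 1 would reverse the dependence
         ENDDO
      ENDDO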